I only skimmed this thread. Nobody seems to have mentioned Lucene's own faceting, which merits looking into.
Otis
Solr & ElasticSearch Support
http://sematext.com/

On Oct 22, 2013 1:56 AM, "Colton McInroy" <[email protected]> wrote:

> Thanks,
> Colton McInroy
>
>  * Director of Security Engineering
>
> Phone
> (Toll Free)
> _US_ (888)-818-1344 Press 2
> _UK_ 0-800-635-0551 Press 2
>
> My Extension 101
> 24/7 Support [email protected] <mailto:[email protected]>
> Email [email protected] <mailto:[email protected]>
> Website http://www.dosarrest.com
>
> On 10/21/2013 5:40 PM, Aaron McCurry wrote:
>
>> On Mon, Oct 21, 2013 at 4:45 PM, Colton McInroy <[email protected]> wrote:
>>
>>> Do you have any suggestions on how I should deal with needing this type of information in the meantime?
>>>
>>> Typically what I used facet data for was to generate graph data. Instead of having to go through every match, group it by time, count them up manually, etc., I would get facet data for timestamps. For instance, I would create a query which says "field1:value", grab the facets for the Date field, and use the facet counts to plot a graph of timestamp vs. matches.
>>>
>>> I was thinking of just going through all of the matches for now, which, although probably not nearly as efficient as using Lucene-type facets, would do the trick temporarily until proper facets are implemented.
>>
>> Agreed. Is the date field only a date? Or does it contain timestamps as well? What is the range of the dates? Days? Weeks? Months? Years? All of the above?
>
> To the second... YYYYMMDDHHmmss
>
>> The reason I ask is basically that if you are looking at, let's say, a month's worth of data and the time scope on the date field is days, then that's only 30-31 facets that you will have to add manually to the query. Obviously, as the time scope and range grow, this will get a little too messy to want to deal with on the client side.
>> Also, you can use the terms call to get the current terms in a field, so if you want to traverse the indexed values, that can give you that info.
>
> It depends upon the timescale being queried. If the timescale is the past hour, then it would be by minute; if it's over a month, then it would be by hour. For Lucene, I just get the facets and post-process them by shrinking the timestamp value down to the level I want. For example, if I wanted to view hourly counts, I would loop through all of the facet results, condensing them down to hour values. Post-processing the facet results from Lucene facets was by far a LOT quicker than going through all of the actual results, which I am betting is probably the case with Blur as well. With Lucene, facets were what I used the most when presenting information in GUIs, because facet counts are what make the most sense to people viewing the data.
>
>> Just trying to help get you what you need right now.
>>
>>> Currently the Blur site lists facets as being something that works here...
>>>
>>> http://incubator.apache.org/blur/how_it_works.html
>>>
>>> But as this thread kind of pointed out, facets in the sense that faceted classification describes do not exist right now within Apache Blur. So someone may want to change that page to say that facets are currently on the todo list or something.
>>>
>>> http://en.wikipedia.org/wiki/Faceted_classification
>>>
>>> A great example I use to show people what facets are is the following site...
>>>
>>> http://www.fasttech.com/category/1499/consumer-electronics
>>>
>>> On the left side, it is easy to see a breakdown of all the different fields/values associated with the current search query. My intention is to display facet data for all (or the important ones anyway) of the fields associated with the current query, along with a line graph showing the count of all matching rows for each time interval. Then the query can be refined further by querying a specific time range or field.
>>>
>>> Is a proper facet implementation something that has a somewhat high priority and will hopefully be at least partially implemented within the next couple of weeks/months? Or should I just work on processing all the results myself for now? Also, I notice the default number of query matches is only 10, and I see no way to specify unlimited. Can I specify -1 for unlimited or something like that, or do I need to specify a really large number that will always be higher than the number of actual results I am expecting... like Long.MAX_VALUE or something?
>>
>> I agree it is a priority; my top priority is getting 0.2.1 out the door. But if we can decide on the API changes that need to be made in the facet API, we can begin on it in 0.3.0 at any point. Once 0.2.1 is complete I will be turning my focus to 0.3.0. I hope to call for a vote for 0.2.1 in the next week.
>>
>> OK, so for queries you can page through the results. However, the facet counts reflect the entire answer. You can't ask for all the results back at once due to memory constraints within the system. But you can set the start and fetch (which is the number to fetch) in the BlurQuery object.
>>
>> http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_BlurQuery
>
> Hmm... yeah, when going through, say, 100,000,000+ rows to generate a graph, it is no doubt going to take a long time re-querying in 1,000-result intervals 100,000+ times. If that's for only 5 minutes of data, it's a huge amount of processing just to see general statistics of the data you have in front of you.
>
> This is where facets became vital for me. I understand that right now "facets" in Blur are not really facets; they are instead additional queries which get run. I'm not really sure why it was implemented that way, but the Lucene documentation (http://lucene.apache.org/core/4_3_0/) links to wiki pages about faceted search as well as a user guide explaining what facets are, and the implementation in Blur does not match what everything else defines facets as.
>
> I'm not sure who implemented facets in the current manner or how that came about, but it does not match any definition of facets I have found. I find it to be a conflict if Blur advertises them but does not really have them. Since there is no real documentation about facets, other than the feature list saying they are there, it took me a while to discover this. For me in particular, this is vital. What use is indexing massive amounts of information if you do not have very good visibility of it?
>
> As I have mentioned, my use is for storing logged events. Let's say you have events for sshd being stored in a table with the fields Date, LoginMethod, IP, User, Server, and Success, and you have a LOT of servers being monitored which have a lot of user login activity. In Lucene I would do a single query against any of those fields, or perhaps just start with matching all records.
> Along with that query, I would get the facets for those fields, using Date to display a time graph of activity for the rows. I would then display the top 5-10 facets for each field, along with a subquery that does just a facet query to display another time graph of the Date facets. With this you can instantly see 10 login failures within 100,000 successes, how many times each user has logged in and what methods were used, etc. This is a simple example, but expand that out to all kinds of other information and it's a night-and-day difference in visibility of data.
>
> When trying to view data of any kind in an effective manner, graphing always helps, but processing every matching row is obviously inefficient. I believe some of the other systems out there, such as Splunk, do that, but when I did my own work, I found that too slow and inefficient. Sure, it works fine when viewing a small amount of data, but when we are talking about big data, which is what Blur is designed for and what I am working with, it's just too much overhead. Using facets on date values to produce time graphs of entries is almost instant, no matter how many rows/records match.
>
> In Splunk or other search systems, I would see events populated over time in a graph along with the first page of data. The time graph continues to fill over time, showing a timeline of data. Depending on your data, this can take a seriously long time. This is no doubt doing what you're suggesting: processing the data one page at a time and sending it to the browser to parse into data stores that display graphs.
>
> With facet results, I was able to display the historical timelines in the same amount of time it took to do a single query along with the facet data. From what I have seen so far, nothing matches Lucene indexes combined with facet indexes, which is what got me so excited about Blur.
> I myself was literally in the design phase of writing my own implementation of a distributed Lucene index system when I decided to stop and check what was out there before reinventing the wheel. When I came across the Blur project, I found the feature list and looked at two things primarily which got me started working with the project. Those two things were "Fast data ingestion" and "Facets". So far, data seems to be getting ingested pretty quickly in my virtual box tests, which is good. I am going to be scaling up soon once the new hardware requisition is finished. The lack of facets, though, is currently stopping me from moving forward on some of the code development which requires them, which is why I am so interested in their implementation. With looping through records, it could take minutes to get proper visibility of data, whereas with facets it takes only a couple of seconds, if that.
>
> While waiting, I am probably going to make that IP field type definition I mentioned earlier, and possibly some additional ones. Most of the code for that seems to make sense, but I'll need to load it up in something other than a text editor to really get an appreciation for it. If some of what needs to be done for facets can be explained, I'll perhaps see if I can dedicate some company time to it.
>
>>> Thanks,
>>> Colton McInroy
>>>
>>> On 10/18/2013 8:40 AM, Colton McInroy wrote:
>>>
>>>> Hello Aaron,
>>>>
>>>> Yes, that's basically what I was thinking of for the facet results. The current implementation doesn't really make any sense if you're coming from Lucene.
>>>> For simplicity and uniformity, I think it should be somewhat like it is with Lucene, adapted to the way Blur is built. I could kind of see something like this...
>>>>
>>>>     public static void queryBlur(String queryString, String table) {
>>>>         Iface client = BlurClient.getClient(mainConfig.getString("controllers"));
>>>>         Query query = new Query();
>>>>         query.setQuery(queryString);
>>>>
>>>>         Selector selector = new Selector();
>>>>
>>>>         // Fetch all the columns in the "event" and "msg" families.
>>>>         selector.addToColumnFamiliesToFetch("event");
>>>>         selector.addToColumnFamiliesToFetch("msg");
>>>>
>>>>         BlurQuery blurQuery = new BlurQuery();
>>>>         int matches = 10;
>>>>         List<Facet> facets = Arrays.asList(new Facet("field1", matches), new Facet("field2", matches));
>>>>         blurQuery.setFacets(facets);
>>>>         blurQuery.setFetch(50);
>>>>         blurQuery.setQuery(query);
>>>>         blurQuery.setSelector(selector);
>>>>
>>>>         try {
>>>>             BlurResults results = client.query(table, blurQuery);
>>>>             // getFacetResults() is the proposed method; it does not exist in Blur yet.
>>>>             for (Facet facet : results.getFacetResults()) {
>>>>                 System.out.println(facet.name + " " + facet.value);
>>>>             }
>>>>         } catch (BlurException e) {
>>>>             // TODO Auto-generated catch block
>>>>             e.printStackTrace();
>>>>         } catch (TException e) {
>>>>             // TODO Auto-generated catch block
>>>>             e.printStackTrace();
>>>>         }
>>>>     }
>>>>
>>>> This is just a brief modification of what I am doing now. Basically, I envision a method called getFacetResults which returns List<Facet>, with each Facet object containing a "name" and a "value", which would be the column name and facet count respectively. I'm just throwing this out there for now. This is a different way of implementing facets than Lucene in terms of how the code is accessed, but it would provide the same results.
>>>>
>>>> It could also be done something like this...
>>>>
>>>>     List<Facet> facets = Arrays.asList(new Facet("field1"), new Facet("field2"));
>>>>     blurQuery.setFacets(facets, matches);
>>>>
>>>> It depends on whether the number of matches should be per facet or per query, although I see the merits in being able to specify the matches for each field.
>>>>
>>>> Thanks,
>>>> Colton McInroy
>>>>
>>>> On 10/18/2013 5:20 AM, Aaron McCurry wrote:
>>>>
>>>>> I have an issue in Jira to document facets in 0.2.1. It has not been worked on yet, but I hope I can get to it soon. It looks like you figured out what is there.
>>>>>
>>>>> We will likely improve facets in 0.3.0, so the API will have to change a bit. The biggest change we will need to make is for the scenario that you bring up. Facets in the current implementation are simply other queries, which can range from a single term to a complex query. I'm assuming that you would like to specify a field name and get something like a map of terms to counts for the given facet?
>>>>>
>>>>> The field facetCounts holds the counts for each of the facets in the input list from the query, so the count list corresponds one-for-one to the facet list in the query. I realize this is less than ideal, and we are going to be improving it soon.
>>>>>
>>>>> If you have some suggestions on how you would want the facet API to operate, new features, or anything else for that matter, just write up your thoughts on this thread and we can incorporate them into the task.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Aaron
>>>>>
>>>>> On Fri, Oct 18, 2013 at 6:43 AM, Colton McInroy <[email protected]> wrote:
>>>>>
>>>>>> OK, so I created this method...
>>>>>>
>>>>>>     public static BlurResults queryBlur(String queryString, String table) {
>>>>>>         Iface client = BlurClient.getClient(mainConfig.getString("controllers"));
>>>>>>         Query query = new Query();
>>>>>>         query.setQuery(queryString);
>>>>>>
>>>>>>         Selector selector = new Selector();
>>>>>>
>>>>>>         // Fetch all the columns in the "event" and "msg" families.
>>>>>>         selector.addToColumnFamiliesToFetch("event");
>>>>>>         selector.addToColumnFamiliesToFetch("msg");
>>>>>>
>>>>>>         BlurQuery blurQuery = new BlurQuery();
>>>>>>         List<Facet> facets = Arrays.asList(new Facet(queryString, Long.MAX_VALUE));
>>>>>>         blurQuery.setFacets(facets);
>>>>>>         blurQuery.setFetch(50);
>>>>>>         blurQuery.setQuery(query);
>>>>>>         blurQuery.setSelector(selector);
>>>>>>
>>>>>>         try {
>>>>>>             BlurResults results = client.query(table, blurQuery);
>>>>>>             return results;
>>>>>>         } catch (BlurException e) {
>>>>>>             // TODO Auto-generated catch block
>>>>>>             e.printStackTrace();
>>>>>>         } catch (TException e) {
>>>>>>             // TODO Auto-generated catch block
>>>>>>             e.printStackTrace();
>>>>>>         }
>>>>>>         return null;
>>>>>>     }
>>>>>>
>>>>>> From reading through the source code, I was able to find out that you specify facets as a list, but this is fairly confusing to me coming from Lucene.
>>>>>>
>>>>>> In Lucene, when getting facet data, I specify the facet fields I am interested in, and the facet results show me a top-X list of values within that field. Whereas with Blur, it appears that a facet is another query which gives only a number as a result. When I tried to obtain the facet data I am used to with Lucene, the only thing I could find was...
>>>>>>
>>>>>>     System.out.println("Facet Results: " + results.getFacetCountsSize());
>>>>>>     System.out.println(JSONArray.toJSONString(results.getFacetCounts()));
>>>>>>
>>>>>> Could you please elaborate on this?
>>>>>>
>>>>>> Thanks,
>>>>>> Colton McInroy
>>>>>>
>>>>>> On 10/18/2013 3:07 AM, Colton McInroy wrote:
>>>>>>
>>>>>>> I think I wrote this too soon; I believe I just found out how to do it. I'll test it out and supply some example code, if correct, to help others.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Colton McInroy
>>>>>>>
>>>>>>> On 10/18/2013 2:58 AM, Colton McInroy wrote:
>>>>>>>
>>>>>>>> Hey Aaron,
>>>>>>>>
>>>>>>>> You mentioned a while ago that Blur handles facets as well and that you would provide an example. Unless I have missed that email, I haven't seen an example yet. Could you provide one? I just took a quick look myself and could not figure it out. I see there is an example, FacetQueryTest.java, in blur-query, but that appears to be basically just a copy of the Lucene file.
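
A note for readers of the archive: the client-side workaround discussed in this thread (adding one facet query per time interval, then shrinking second-resolution timestamps down to the graph interval) can be sketched in plain Java. This is only an illustration under stated assumptions, not Blur's or Lucene's API. The class FacetTimeBuckets, its two methods, and the field name event.date are made up for this example, and the range syntax field:[A TO B] is the classic Lucene query-parser form, which may or may not suit how a given Blur table's date column is typed.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Blur or Lucene.
class FacetTimeBuckets {

    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    /**
     * Condenses per-timestamp counts into coarser buckets by keeping only the
     * first prefixLength characters of each YYYYMMDDHHmmss key
     * (12 = minute, 10 = hour, 8 = day) and summing the counts per prefix.
     */
    static Map<String, Long> condense(Map<String, Long> counts, int prefixLength) {
        Map<String, Long> buckets = new TreeMap<>(); // sorted keys = chronological order
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            String key = e.getKey();
            String bucket = key.length() > prefixLength ? key.substring(0, prefixLength) : key;
            buckets.merge(bucket, e.getValue(), Long::sum);
        }
        return buckets;
    }

    /**
     * Builds one range query string per day for `days` consecutive days
     * starting at `start`. The query syntax is an assumption (see above).
     */
    static List<String> dailyRangeQueries(String field, LocalDate start, int days) {
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < days; i++) {
            String day = start.plusDays(i).format(DAY);
            queries.add(field + ":[" + day + "000000 TO " + day + "235959]");
        }
        return queries;
    }

    public static void main(String[] args) {
        // Made-up sample counts keyed by second-resolution timestamps.
        Map<String, Long> perSecond = new TreeMap<>();
        perSecond.put("20131021174001", 3L);
        perSecond.put("20131021174530", 2L);
        perSecond.put("20131021180000", 7L);
        System.out.println(condense(perSecond, 10));
        System.out.println(dailyRangeQueries("event.date", LocalDate.of(2013, 10, 1), 3));
    }
}
```

Calling condense(counts, 10) collapses the keys to YYYYMMDDHH for hourly graphs, and a prefix length of 12 gives per-minute buckets. Each string from dailyRangeQueries could be used as the query string of one Blur Facet, so that the returned facetCounts line up one-for-one with the days, as Aaron describes for the current implementation.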
