I am willing to look at anything. My biggest concern is to get the API right first, which requires knowing what features are really desired. That's why I suggested a Jira issue to discuss the feature set.
Aaron

Sent from my iPhone

On Oct 25, 2013, at 3:20 PM, Otis Gospodnetic <[email protected]> wrote:

> I only skimmed this thread. Nobody seems to have mentioned Lucene's own faceting, which merits looking into.
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
>
> On Oct 22, 2013 1:56 AM, "Colton McInroy" <[email protected]> wrote:
>
>> Thanks,
>> Colton McInroy
>>
>> * Director of Security Engineering
>>
>> Phone
>> (Toll Free)
>> _US_ (888)-818-1344 Press 2
>> _UK_ 0-800-635-0551 Press 2
>>
>> My Extension 101
>> 24/7 Support [email protected] <mailto:[email protected]>
>> Email [email protected] <mailto:[email protected]>
>> Website http://www.dosarrest.com
>>
>> On 10/21/2013 5:40 PM, Aaron McCurry wrote:
>>
>>> On Mon, Oct 21, 2013 at 4:45 PM, Colton McInroy <[email protected]> wrote:
>>>
>>>> Do you have any suggestions on how I should deal with needing this type of information in the meantime?
>>>>
>>>> Typically what I used facet data for was to generate graph data. Instead of having to go through every match, group it by time, count them up manually, etc., I would get facet data for timestamps. For instance, I create a query which says "field1:value". I would then grab the facets for the Date field and use the facet counts to plot a graph of timestamp/matches.
>>>>
>>>> I was thinking of just going through all of the matches for now, which, although probably not nearly as efficient as using Lucene-type facets, would do the trick temporarily until proper facets are implemented.
>>>
>>> Agreed. Is the date field only a date? Or does it contain timestamps as well? What is the range of the dates? Days? Weeks? Months? Years? All of the above?
>>
>> To the second... YYYYMMDDHHmmss
>>
>>> The reason I ask is basically, if you are looking at, let's say, a month's worth and you have a time scope on the date field of days, then that's only 30-31 facets that you will have to add manually to the query. Obviously as the time scope and range grow this will get a little too messy to want to deal with on the client side. Also you can use the terms call to get the current terms in a field, so if you want to traverse the indexed values that can give you that info.
>>
>> Depends upon the timescale being queried. If the timescale is the past hour, then it would be by minute; if it's over a month, then it would be by hour. For Lucene, I just get the facets and post-process them by shrinking the timestamp value down to the level I want. For example, if I wanted to view hourly counts, I would loop through all of the facet results condensing them down to minute values. Post-processing the facet results from Lucene facets was by far a LOT quicker than going through all of the actual results, which I am betting is probably the case with Blur as well. With Lucene, facets were what I used the most when trying to present information to GUI interfaces, because it makes the most sense for the people viewing it.
>>
>>> Just trying to help get you what you need right now.
>>>
>>>> Currently the Blur site lists facets as being something that works here...
>>>>
>>>> http://incubator.apache.org/blur/how_it_works.html
>>>>
>>>> But as this thread kinda pointed out, facets the way faceted classification describes them do not exist right now within Apache Blur. So someone may want to change that to say that it is currently on the to-do list or something.
>>>>
>>>> http://en.wikipedia.org/wiki/Faceted_classification
>>>>
>>>> A great example I use to show people what facets are is the following site...
>>>>
>>>> http://www.fasttech.com/category/1499/consumer-electronics
>>>>
>>>> On the left side, it is easy to see a breakdown of all the different fields/values associated with the current search query. My intention is to display facet data for all (or the important ones anyway) of the fields associated with the current query, along with a line graph showing the count of all matching rows for each time interval. Then the query can be refined further by querying a specific time range or field.
>>>>
>>>> Is a proper facet implementation something that has a somewhat high priority and will hopefully be at least partially implemented within the next couple of weeks/months? Or should I just work on processing all the results myself for now? Also, I notice the default number of query matches is only 10, and I see no way to specify unlimited. Can I specify -1 for unlimited or something like that, or do I need to specify a really large number that will always be higher than the number of actual results I am expecting... like Long.MAX_VALUE or something?
>>>
>>> I agree it is a priority, but my top priority is getting 0.2.1 out the door. If we can decide on the changes that need to be made to the facet API, we can begin on it in 0.3.0 at any point. Once 0.2.1 is complete I will be turning my focus to 0.3.0. I hope to call for a vote for 0.2.1 in the next week.
>>>
>>> Ok, so for queries you can page through the results. However, the facet counts reflect the entire answer. You can't ask for all the results back at once due to memory constraints within the system. But you can set the start and fetch (which is the number to fetch) in the BlurQuery object.
>>>
>>> http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_BlurQuery
>>
>> Hmm... yea, when going through say 100,000,000+ rows to generate a graph, it is no doubt going to take a long time re-querying in 1,000-result intervals 100,000+ times. If that's for only 5 minutes of data, it's a huge amount of processing to see general statistics of the data you have in front of you.
>>
>> This is where facets became vital for me. I understand that right now "facets" in Blur are not really facets; they are instead additional queries which get run. Not really sure why it was implemented that way, but when you read the Lucene documentation (http://lucene.apache.org/core/4_3_0/) it links to wiki pages about faceted searches as well as a user guide explaining what facets are. The implementation in Blur does not match what everything else defines facets as.
>>
>> I'm not sure how facets came to be implemented in the current manner, or by whom, but it does not make sense at all or comply with any of the definitions of facets I have found. I find this to be a conflict, if Blur advertises them but does not really have them. Since there is no documentation about facets really, other than it saying it's in the feature list, it took me a while to discover this. For me in particular, this is vital. What use is indexing massive amounts of information if you do not have very good visibility of it?
>>
>> As I have mentioned, my use is for storing logged events. Let's say you have events for sshd being stored in a table along with the fields Date, LoginMethod, IP, User, Server, and Success, and you have a LOT of servers being monitored which have a lot of user login activity. In Lucene I would do a single query against any of those fields, or perhaps just start with matching all records. Along with that query, I would get the facets for those fields, using Date to display a time graph of activity for the rows. I would then display the top 5-10 facets for each field along with a subquery that does just a facet query to display another time graph of the Date facets. With this you can instantly see 10 login failures within 100,000 successes, how many times each user has logged in and what methods were used, etc. This is a simple example, but expand that out to all kinds of other information and it's night and day for visibility of data.
>>
>> When trying to view data of any kind in an effective manner, graphing always helps, but to process every matching row is obviously inefficient. I believe some of the other systems out there such as Splunk do that, but when I did my own work, I found that too slow and inefficient. Sure, it works fine when viewing a small amount of data, but when we are talking about big data, which is what Blur is designed for, and what I am working with, it's just too much overhead. Using facets on date values to produce time graphs of entries is pretty much instant, no matter how many rows/records you produce.
>>
>> In Splunk or other search systems, I would see events populated over time in a graph along with the first page of data. The time graph continues to fill over time, showing a timeline of data. Depending on your data, this can take a seriously long time. This is no doubt doing what you're suggesting with the processing of data one page at a time, sending it to the browser to parse into data stores that display graphs.
>>
>> With facet results, I was able to display the historical timelines in the same amount of time it took to do a single query along with the facet data. There just is no match, from what I have seen so far, for Lucene indexes along with facet indexes, which is what got me so excited about Blur. I myself literally was in the design phase of writing my own implementation of a distributed Lucene index system when I decided to stop and check what was out there before re-inventing the wheel. When I came across the Blur project, I found the feature list and looked at two things primarily which got me into starting to work with the project. Those two things were "Fast data ingestion" and "Facets". So far, data seems to be getting in pretty quickly in my virtual box tests, which is good. I am going to be scaling up soon once the new hardware requisition is finished. Facets, though, are currently stopping me from moving forward on some of the code development which requires them, which is why I am so interested in their implementation. With looping through records, it could take minutes to get proper visibility of data, whereas with facets only a couple of seconds, if that.
>>
>> While waiting, I am probably going to make that IP field type definition I mentioned earlier, and possibly some additional ones. Most of the code for that seems to make sense, but I'll need to load it up in something other than a text editor to really get an appreciation for it. If some of what needs to be done for facets can be explained, I'll perhaps see if I can dedicate some company time to it.
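The post-processing step described earlier in this message (taking per-timestamp facet counts in YYYYMMDDHHmmss form and shrinking them down to a coarser level) can be sketched in plain Java. This is only an illustration under stated assumptions: the class name FacetRollup, the method rollUp, and the idea that facet counts arrive as a map of term to count are all hypothetical, and nothing here is Blur or Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Blur or Lucene: collapses facet counts
// keyed by YYYYMMDDHHmmss terms into coarser time buckets.
final class FacetRollup {
    // Number of leading characters of a YYYYMMDDHHmmss term kept per granularity.
    static final int DAY = 8;
    static final int HOUR = 10;
    static final int MINUTE = 12;

    // Sums the counts of all terms that share the same prefix.
    // A TreeMap keeps the buckets in chronological order, because
    // fixed-width timestamps sort lexicographically.
    static Map<String, Long> rollUp(Map<String, Long> counts, int prefixLength) {
        Map<String, Long> buckets = new TreeMap<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            String term = entry.getKey();
            if (term.length() < prefixLength) {
                continue; // ignore terms too short to be bucketed
            }
            buckets.merge(term.substring(0, prefixLength), entry.getValue(), Long::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Made-up sample counts, one entry per distinct timestamp term.
        Map<String, Long> perSecond = new LinkedHashMap<>();
        perSecond.put("20131021164501", 3L);
        perSecond.put("20131021164530", 2L);
        perSecond.put("20131021164602", 7L);
        perSecond.put("20131021170000", 1L);

        System.out.println("by minute: " + rollUp(perSecond, MINUTE));
        System.out.println("by hour:   " + rollUp(perSecond, HOUR));
    }
}
```

The work here is proportional to the number of distinct timestamp terms rather than the number of matching rows, which is the reason given above for preferring facet post-processing over paging through every result.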
>>>> Thanks,
>>>> Colton McInroy
>>>>
>>>> On 10/18/2013 8:40 AM, Colton McInroy wrote:
>>>>
>>>>> Hello Aaron,
>>>>>
>>>>> Yes, that's basically what I was thinking of for the facet results. The current implementation doesn't really make any sense if you're coming from Lucene. For simplicity and uniformity, I think it should be somewhat like it is with Lucene, with adaptation to the way Blur is built. I could kinda see something like this...
>>>>>
>>>>> public static void queryBlur(String queryString, String table) {
>>>>>     Iface client = BlurClient.getClient(mainConfig.getString("controllers"));
>>>>>     Query query = new Query();
>>>>>     query.setQuery(queryString);
>>>>>
>>>>>     Selector selector = new Selector();
>>>>>
>>>>>     // Fetch all the columns in the "event" and "msg" families.
>>>>>     selector.addToColumnFamiliesToFetch("event");
>>>>>     selector.addToColumnFamiliesToFetch("msg");
>>>>>
>>>>>     BlurQuery blurQuery = new BlurQuery();
>>>>>     int matches = 10;
>>>>>     // Proposed: a Facet names a field and the number of top values wanted.
>>>>>     List<Facet> facets = Arrays.asList(new Facet("field1", matches),
>>>>>         new Facet("field2", matches));
>>>>>     blurQuery.setFacets(facets);
>>>>>     blurQuery.setFetch(50);
>>>>>     blurQuery.setQuery(query);
>>>>>     blurQuery.setSelector(selector);
>>>>>
>>>>>     try {
>>>>>         BlurResults results = client.query(table, blurQuery);
>>>>>         // Proposed: getFacetResults() does not exist yet.
>>>>>         for (Facet facet : results.getFacetResults()) {
>>>>>             System.out.println(facet.name + " " + facet.value);
>>>>>         }
>>>>>     } catch (BlurException e) {
>>>>>         e.printStackTrace();
>>>>>     } catch (TException e) {
>>>>>         e.printStackTrace();
>>>>>     }
>>>>> }
>>>>>
>>>>> Just a brief modification from what I am doing now. Basically I just envision a method called getFacetResults which returns List<Facet>, with each Facet object containing a "name" and a "value" which would be the column name and facet count respectively. I'm just throwing this out there for now. This is a different way of implementing the facets than Lucene in terms of how the code is accessed, but it would provide the same results.
>>>>>
>>>>> It could also be done something like this...
>>>>>
>>>>> List<Facet> facets = Arrays.asList(new Facet("field1"), new Facet("field2"));
>>>>> blurQuery.setFacets(facets, matches);
>>>>>
>>>>> Depends if the number of matches should be per facet or per query, although I see the merits in being able to specify the matches for each field.
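To make the requested result shape concrete (a field name in, the top terms and their counts out, with a per-field limit like the "matches" value above), here is a minimal sketch in plain Java. Everything in it is hypothetical: TermFacetSketch, topTerms, and the in-memory row representation are made up for illustration and are not Blur or Lucene API. It also counts by scanning every row, which is exactly the cost an index-backed facet implementation is meant to avoid, so it shows only the semantics, not how it should be built.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "field name -> top terms with counts".
final class TermFacetSketch {

    // Counts the values of one field across all rows and returns the
    // `limit` most frequent ones, highest count first (ties broken by term).
    static Map<String, Long> topTerms(List<Map<String, String>> rows, String field, int limit) {
        Map<String, Long> counts = new HashMap<>();
        for (Map<String, String> row : rows) {
            String value = row.get(field);
            if (value != null) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int size = Math.max(0, Math.min(limit, sorted.size()));
        Map<String, Long> top = new LinkedHashMap<>(); // preserves ranking order
        for (Map.Entry<String, Long> entry : sorted.subList(0, size)) {
            top.put(entry.getKey(), entry.getValue());
        }
        return top;
    }

    // Builds one made-up sshd login event with two of the fields from the thread.
    static Map<String, String> row(String user, String success) {
        Map<String, String> r = new HashMap<>();
        r.put("User", user);
        r.put("Success", success);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row("root", "false"));
        rows.add(row("root", "false"));
        rows.add(row("root", "true"));
        rows.add(row("alice", "true"));
        rows.add(row("alice", "true"));
        rows.add(row("bob", "true"));

        System.out.println("User:    " + topTerms(rows, "User", 2));
        System.out.println("Success: " + topTerms(rows, "Success", 5));
    }
}
```

Passing the limit per call corresponds to the per-facet option discussed above; a per-query limit would simply reuse one value for every field.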
>>>>>
>>>>> Thanks,
>>>>> Colton McInroy
>>>>>
>>>>> On 10/18/2013 5:20 AM, Aaron McCurry wrote:
>>>>>
>>>>>> I have an issue in Jira to document facets in 0.2.1. It's not been worked yet, but I hope I can get to it soon. It looks like you figured out what is there.
>>>>>>
>>>>>> We will likely improve facets in 0.3.0, so the API will have to change a bit. The biggest change we will need to make is for the scenario that you bring up. Facets in the current implementation are simply other queries that can range from a single term to a complex query. I'm assuming that you would like to specify a field name and get something like a map of terms to counts for the given facet?
>>>>>>
>>>>>> The field facetCounts holds the counts for each of the facets in the input list from the query. So the count list corresponds one for one to the facet list in the Query. I realize this is less than ideal, and we are going to be improving it soon.
>>>>>>
>>>>>> If you have some suggestions on how you would want the facet API to operate, new features, or anything else for that matter, just write up your thoughts on this thread and we can incorporate them into the task.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Aaron
>>>>>>
>>>>>> On Fri, Oct 18, 2013 at 6:43 AM
