Did you create the Jira issue for this? I didn't see a notification for it sent to the mailing list.

Now that 0.2.1 is out, is a proper facet implementation going to get worked on? This is extremely important for our implementation of Blur. I have been reading through the code, and I see that a LOT of changes are going to have to occur for it to be functional. I'm sure I could convince our company to donate monetarily if it would help speed things up, and I am also able to spend my own time helping with the code changes.

Thanks,
Colton McInroy

 * Director of Security Engineering

        
Phone
(Toll Free)     
_US_    (888)-818-1344 Press 2
_UK_    0-800-635-0551 Press 2

My Extension    101
24/7 Support    [email protected] <mailto:[email protected]>
Email   [email protected] <mailto:[email protected]>
Website         http://www.dosarrest.com

On 10/25/2013 1:26 PM, Aaron McCurry wrote:
Colton,

Yes, I think that is exactly what you are describing.  I will create the
initial Jira issue and either copy the content you have created or link to
it, and we will continue discussing implementation there.  Thanks!

Aaron


On Fri, Oct 25, 2013 at 3:39 PM, Colton McInroy <[email protected]> wrote:

Umm... isn't that what I did? I mentioned it a few times, supplied a link
to the lucene documentation, etc.


Thanks,
Colton McInroy


On 10/25/2013 12:20 PM, Otis Gospodnetic wrote:

I only skimmed this thread. Nobody seems to have mentioned Lucene's own
faceting, which merits looking into.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Oct 22, 2013 1:56 AM, "Colton McInroy" <[email protected]> wrote:

  Thanks,
Colton McInroy


On 10/21/2013 5:40 PM, Aaron McCurry wrote:

On Mon, Oct 21, 2013 at 4:45 PM, Colton McInroy <[email protected]> wrote:

Do you have any suggestions on how I should deal with needing this type
of information in the meantime?

Typically, what I used facet data for was to generate graph data. Instead
of having to go through every match, group it by time, count them up
manually, etc., I would get facet data for timestamps. For instance, I
would create a query such as "field1:value", grab the facets for the Date
field, and use the facet counts to plot a graph of timestamp against
matches.

I was thinking of just going through all of the matches for now, which,
although probably not nearly as efficient as using Lucene-style facets,
would do the trick temporarily until proper facets are implemented.

Agreed. Is the date field only a date?  Or does it contain timestamps as
well?  What is the range of the dates?  Days?  Weeks?  Months?  Years?
All of the above?

  To the second... YYYYMMDDHHmmss
The reason I ask is basically this: if you are looking at, let's say, a
month's worth of data and you have a time scope on the date field of days,
then that's only 30-31 facets that you will have to add manually to the
query.  Obviously, as the time scope and range grow this will get a little
too messy to want to deal with on the client side.  Also, you can use the
terms call to get the current terms in a field, so if you want to traverse
the indexed values, that can give you that info.
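
A minimal sketch of the "one facet per day" workaround, assuming the date
column holds yyyyMMddHHmmss strings and that a trailing-wildcard query such
as event.date:20131001* matches every row for that day. The field name
event.date and the wildcard syntax are assumptions for illustration, not
something confirmed in this thread. Each generated string would then be
passed as one facet query.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

final class DailyFacetQueries {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    /**
     * Builds one prefix query per day in [first, last].
     * Assumption: the field stores yyyyMMddHHmmss strings, so a day prefix
     * followed by '*' selects that whole day.
     */
    static List<String> forRange(String field, LocalDate first, LocalDate last) {
        List<String> queries = new ArrayList<>();
        for (LocalDate day = first; !day.isAfter(last); day = day.plusDays(1)) {
            queries.add(field + ":" + day.format(DAY) + "*");
        }
        return queries;
    }

    public static void main(String[] args) {
        // "event.date" is a hypothetical family.column name.
        List<String> queries = forRange("event.date",
                LocalDate.of(2013, 10, 1), LocalDate.of(2013, 10, 31));
        System.out.println(queries.size() + " facet queries, first = " + queries.get(0));
    }
}
```

For October 2013 this yields 31 query strings, which matches the "30-31
facets" estimate above.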

Depends upon the timescale being queried. If the timescale is the past
hour, then it would be by minute; if it's over a month, then it would be by
hour. For Lucene, I just get the facets and post-process them by shrinking
the timestamp value down to the level I want. For example, if I wanted to
view hourly counts, I would loop through all of the facet results,
condensing them down to hour values. Post-processing the facet results
from Lucene facets was by far a LOT quicker than going through all of the
actual results, which I am betting is probably the case with Blur as well.
With Lucene, facets were what I used the most when presenting information
in GUI interfaces, because that form makes the most sense to the people
viewing it.
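
The post-processing described above (shrinking second-resolution timestamp
facets down to a coarser level) could look roughly like the following. It
is only a sketch: the input map of timestamp to count is a stand-in for
whatever the facet API returns, and the class and method names are
hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class FacetRollup {
    /**
     * Sums per-timestamp counts into coarser buckets by keeping only the
     * first prefixLength characters of each yyyyMMddHHmmss key
     * (10 keeps the hour, 12 keeps the minute).
     */
    static Map<String, Long> rollUp(Map<String, Long> countsByTimestamp, int prefixLength) {
        Map<String, Long> buckets = new TreeMap<>(); // sorted keys give chronological order
        for (Map.Entry<String, Long> e : countsByTimestamp.entrySet()) {
            String key = e.getKey();
            String bucket = key.length() > prefixLength ? key.substring(0, prefixLength) : key;
            buckets.merge(bucket, e.getValue(), Long::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Made-up sample counts, for illustration only.
        Map<String, Long> counts = new TreeMap<>();
        counts.put("20131021174001", 3L);
        counts.put("20131021174059", 2L);
        counts.put("20131021180500", 7L);
        System.out.println(rollUp(counts, 10)); // hourly buckets
    }
}
```

The work here is proportional to the number of distinct facet values, not
to the number of matching rows, which is why it is so much cheaper than
walking the result set.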

Just trying to help get you what you need right now.

Currently the Blur site lists facets as being something that works here...

http://incubator.apache.org/blur/how_it_works.html
But as this thread kind of pointed out, facets as faceted classification
describes them do not exist right now within Apache Blur. So someone may
want to change that page to say that they are currently on the to-do list
or something.

http://en.wikipedia.org/wiki/Faceted_classification
A great example I use to show people what facets are is the following
site...

http://www.fasttech.com/category/1499/consumer-electronics
On the left side, it is easy to see a breakdown of all the different
fields/values associated with the current search query. My intention is to
display facet data for all (or the important ones, anyway) of the fields
associated with the current query, along with a line graph showing the
count of all matching rows for each time interval. Then the query can be
refined further by querying a specific time range or field.

Is a proper facet implementation something that has a somewhat high
priority and will hopefully be at least partially implemented within the
next couple of weeks/months? Or should I just work on processing all the
results myself for now? Also, I notice the default number of query matches
is only 10, and I see no way to specify unlimited. Can I specify -1 for
unlimited or something like that, or do I need to specify a really large
number that will always be higher than the number of actual results I am
expecting... like Long.MAX_VALUE or something?

I agree it is a priority, but my top priority is getting 0.2.1 out the
door.  If we can decide on the API changes that need to be made in the
facet API, we can begin on it in 0.3.0 at any point.  Once 0.2.1 is
complete I will be turning my focus to 0.3.0; I hope to call for a vote on
0.2.1 in the next week.

OK, so for queries you can page through the results.  However, the facet
counts reflect the entire answer.  You can't ask for all the results back
at once due to memory constraints within the system.  But you can set the
start and fetch (which is the number to fetch) in the BlurQuery object.

http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_BlurQuery
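
To make the paging idea concrete, here is a small sketch that only
computes the (start, fetch) pair for each request; it does not call Blur.
The Window type and the windows helper are hypothetical, and feeding each
pair into the start and fetch fields of BlurQuery is the assumed usage.

```java
import java.util.ArrayList;
import java.util.List;

final class PageWindows {
    /** One (start, fetch) pair per request. */
    static final class Window {
        final long start;
        final int fetch;

        Window(long start, int fetch) {
            this.start = start;
            this.fetch = fetch;
        }
    }

    /** Splits totalHits into consecutive windows of at most pageSize rows. */
    static List<Window> windows(long totalHits, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<Window> result = new ArrayList<>();
        for (long start = 0; start < totalHits; start += pageSize) {
            int fetch = (int) Math.min(pageSize, totalHits - start);
            result.add(new Window(start, fetch));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Window w : windows(120, 50)) {
            // Assumed usage: copy these into BlurQuery's start and fetch fields.
            System.out.println("start=" + w.start + " fetch=" + w.fetch);
        }
    }
}
```

The number of windows is the number of round trips, so the cost of paging
grows linearly with the total hit count.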
Hmm... yeah, when going through, say, 100,000,000+ rows to generate a
graph, it is no doubt going to take a long time, re-querying in
1,000-result intervals 100,000+ times. If that's for only 5 minutes of
data, it's a huge amount of processing just to see general statistics of
the data you have in front of you.

This is where facets became vital for me. I understand that right now
"facets" in Blur are not really facets; they are instead additional queries
which get run. I'm not really sure why it was implemented that way. The
Lucene documentation (http://lucene.apache.org/core/4_3_0/) links to wiki
pages about faceted search as well as a user guide explaining what facets
are, and the implementation in Blur does not match what everything else
defines facets as.

I'm not sure how facets came to be implemented in the current manner, but
it does not make sense at all, nor does it comply with any of the
definitions of facets I have found. I find it to be a conflict if Blur
advertises them but does not really have them. Since there is no real
documentation about facets, other than the feature list saying they exist,
it took me a while to discover this. For me in particular, this is vital.
What use is indexing massive amounts of information if you do not have
very good visibility into it?

As I have mentioned, my use is for storing logged events. Let's say you
have events for sshd being stored in a table with the fields Date,
LoginMethod, IP, User, Server, and Success, and you have a LOT of servers
being monitored which have a lot of user login activity. In Lucene I would
do a single query against any of those fields, or perhaps just start with
matching all records. Along with that query, I would get the facets for
those fields, using Date to display a time graph of activity for the rows.
I would then display the top 5-10 facets for each field, along with a
subquery that does just a facet query to display another time graph of the
Date facets. With this you can instantly see 10 login failures within
100,000 successes, how many times each user has logged in and what methods
were used, etc. This is a simple example, but expand that out to all kinds
of other information and it's a night-and-day difference in visibility of
data.

When trying to view data of any kind in an effective manner, graphing
always helps, but processing every matching row is obviously inefficient.
I believe some of the other systems out there, such as Splunk, do that,
but when I did my own work, I found that too slow and inefficient. Sure,
it works fine when viewing a small amount of data, but when we are talking
about big data, which is what Blur is designed for and what I am working
with, it's just too much overhead. Using facets on date values to produce
time graphs of entries is almost instant, no matter how many rows/records
match.

In Splunk or other search systems, I would see events populated over time
in a graph along with the first page of data. The time graph continues to
fill in, showing a timeline of the data. Depending on your data, this can
take a seriously long time. This is no doubt doing what you're suggesting:
processing the data one page at a time and sending it to the browser to
parse into data stores that drive the graphs.
With facet results, I was able to display the historical timelines in the
same amount of time it took to do a single query along with the facet
data.
From what I have seen so far, there is just no match for Lucene indexes
combined with facet indexes, which is what got me so excited about Blur. I
was literally in the design phase of writing my own implementation of a
distributed Lucene index system when I decided to stop and check what was
out there before reinventing the wheel. When I came across the Blur
project, I found the feature list and looked primarily at two things,
which got me started working with the project. Those two things were "Fast
data ingestion" and "Facets". So far, data seems to be getting ingested
pretty quickly in my virtual box tests, which is good. I am going to be
scaling up soon, once the new hardware requisition is finished. The lack
of facets, though, is currently stopping me from moving forward on some of
the code development which requires them, which is why I am so interested
in their implementation. With looping through records, it could take
minutes to get proper visibility of data, whereas with facets it takes
only a couple of seconds, if that.

While waiting, I am probably going to make that IP field type definition I
mentioned earlier, and possibly some additional ones. Most of the code for
that seems to make sense, but I'll need to load it up in something other
than a text editor to really get an appreciation for it. If some of what
needs to be done for facets can be explained, I'll perhaps see if I can
dedicate some company time to it.

  Thanks,
Colton McInroy


On 10/18/2013 8:40 AM, Colton McInroy wrote:

   Hello Aaron,

Yes, that's basically what I was thinking of for the facet results. The
current implementation doesn't really make any sense if you're coming from
Lucene. For simplicity and uniformity, I think it should be somewhat like
it is with Lucene, with adaptation to the way Blur is built. I could kind
of see something like this...

    public static void queryBlur(String queryString, String table) {
        Iface client = BlurClient.getClient(mainConfig.getString("controllers"));
        Query query = new Query();
        query.setQuery(queryString);

        Selector selector = new Selector();

        // Fetch all the columns in the "event" and "msg" column families.
        selector.addToColumnFamiliesToFetch("event");
        selector.addToColumnFamiliesToFetch("msg");

        BlurQuery blurQuery = new BlurQuery();
        int matches = 10;
        // Proposed API: a Facet names a field and how many matches to return.
        List<Facet> facets = Arrays.asList(new Facet("field1", matches),
                new Facet("field2", matches));
        blurQuery.setFacets(facets);
        blurQuery.setFetch(50);
        blurQuery.setQuery(query);
        blurQuery.setSelector(selector);

        try {
            BlurResults results = client.query(table, blurQuery);
            // Proposed API: getFacetResults() is the method envisioned here.
            for (Facet facet : results.getFacetResults()) {
                System.out.println(facet.name + " " + facet.value);
            }
        } catch (BlurException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (TException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

Just a brief modification from what I am doing now. Basically, I envision
a method called getFacetResults which returns a List<Facet>, with each
Facet object containing a "name" and a "value", which would be the column
name and facet count respectively. I'm just throwing this out there for
now. This is a different way of implementing facets than Lucene's in terms
of how the code is accessed, but it would provide the same results.

       It could also be done something like this...

       List<Facet> facets = Arrays.asList(new Facet("field1"), new
Facet("field2"));
       blurQuery.setFacets(facets, matches);

It depends on whether the number of matches should be per facet or per
query, although I see the merits in being able to specify the matches for
each field.
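
As an illustration of what a per-field "matches" limit would mean for the
result, the sketch below takes a map of term to count for one field and
keeps only the top N terms. This is not Blur code: the class name, the
input map, and the tie-breaking rule are all assumptions made for the
example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TopFacetValues {
    /** Returns the n most frequent terms, highest count first; ties break alphabetically. */
    static Map<String, Long> topN(Map<String, Long> termCounts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(termCounts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue()); // larger counts first
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.max(0, Math.min(n, entries.size()));
        Map<String, Long> top = new LinkedHashMap<>(); // preserves the ranked order
        for (Map.Entry<String, Long> e : entries.subList(0, limit)) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up login counts per user, for illustration only.
        Map<String, Long> userCounts = new LinkedHashMap<>();
        userCounts.put("alice", 7L);
        userCounts.put("root", 100L);
        userCounts.put("carol", 1L);
        userCounts.put("bob", 7L);
        System.out.println(topN(userCounts, 3));
    }
}
```

With a per-facet limit, each field could be given its own N; with a
per-query limit, the same N would be applied to every field.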

Thanks,
Colton McInroy


On 10/18/2013 5:20 AM, Aaron McCurry wrote:

I have an issue in Jira to document facets in 0.2.1. It has not been
worked yet, but I hope I can get to it soon.  It looks like you figured
out what is there.

We will likely improve facets in 0.3.0, so the API will have to change a
bit.  The biggest change we will need to make is for the scenario that you
bring up.  Facets in the current implementation are simply other queries
that can range from a single term to a complex query. I'm assuming that
you would like to specify a field name and get something like a map of
terms to counts for the given facet?

The field facetCounts holds the counts for each of the facets in the input
list from the query.  So the count list corresponds one for one to the
facet list in the query.  I realize this is less than ideal, and we are
going to be improving it soon.
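
Since the counts come back as a bare list in the same order as the facets
that were sent, a client has to pair them up by index. A minimal sketch,
using plain strings and longs in place of the Blur types (the class name
and the sample queries are hypothetical):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FacetCountPairing {
    /** Pairs the i-th facet query with the i-th count, preserving request order. */
    static Map<String, Long> pair(List<String> facetQueries, List<Long> facetCounts) {
        if (facetQueries.size() != facetCounts.size()) {
            throw new IllegalArgumentException("expected one count per facet query");
        }
        Map<String, Long> countsByQuery = new LinkedHashMap<>();
        for (int i = 0; i < facetQueries.size(); i++) {
            countsByQuery.put(facetQueries.get(i), facetCounts.get(i));
        }
        return countsByQuery;
    }

    public static void main(String[] args) {
        // Hypothetical facet queries and counts; real values would come from the
        // request's facet list and the response's facetCounts field.
        List<String> queries = Arrays.asList("event.success:true", "event.success:false");
        List<Long> counts = Arrays.asList(100000L, 10L);
        System.out.println(pair(queries, counts));
    }
}
```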

If you have some suggestions on how you would want the facet API to
operate, new features, or anything else for that matter, just write up
your thoughts on this thread and we can incorporate them into the task.

Thanks!

Aaron



On Fri, Oct 18, 2013 at 6:43 AM, Colton McInroy <[email protected]> wrote:

Ok, so I created this method...
    public static BlurResults queryBlur(String queryString, String table) {
        Iface client = BlurClient.getClient(mainConfig.getString("controllers"));
        Query query = new Query();
        query.setQuery(queryString);

        Selector selector = new Selector();

        // Fetch all the columns in the "event" and "msg" column families.
        selector.addToColumnFamiliesToFetch("event");
        selector.addToColumnFamiliesToFetch("msg");

        BlurQuery blurQuery = new BlurQuery();
        List<Facet> facets = Arrays.asList(new Facet(queryString, Long.MAX_VALUE));
        blurQuery.setFacets(facets);
        blurQuery.setFetch(50);
        blurQuery.setQuery(query);
        blurQuery.setSelector(selector);

        try {
            BlurResults results = client.query(table, blurQuery);
            return results;
        } catch (BlurException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (TException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return null;
    }

From reading through the source code, I was able to find out that you
specify facets as a list, but this is fairly confusing to me coming from
Lucene.

In Lucene, when getting facet data, I specify the facet fields I am
interested in, and the facet results show me a top X list of values within
each field. Whereas with Blur, it appears that a facet is another query
which gives only a number as a result. When I tried to obtain the facet
data I am used to with Lucene, the only thing I could find was...

System.out.println("Facet Results: " + results.getFacetCountsSize());
System.out.println(JSONArray.toJSONString(results.getFacetCounts()));


Could you please elaborate on this?


Thanks,
Colton McInroy


On 10/18/2013 3:07 AM, Colton McInroy wrote:

I think I wrote this too soon; I believe I just found out how to do it.

I'll test it out and supply some example code if correct to help others.

Thanks,
Colton McInroy


On 10/18/2013 2:58 AM, Colton McInroy wrote:

    Hey Aaron,

You mentioned a while ago that Blur handles facets as well and that you
would provide an example. Unless I have missed that email, I haven't seen
an example yet. Could you provide one? I just took a quick look myself and
could not figure it out. I see there is an example, FacetQueryTest.java,
in blur-query, but that appears to be basically just a copy of the Lucene
file.





