I would like to start a thread on the topic of the future of Blur's query
language.  Currently the "simpleQuery" is just a normal Lucene based
syntax with a little magic to figure out the joins (via the
SuperQuery) that the user probably intended.  Of course this guesswork
gets it wrong sometimes.  Let me explain with an example:

Given the query with superOn:

+cf1.field1:value1 +cf1.field2:value2

The current implementation will ASSUME that you want to find where
"cf1.field1" contains "value1" and where "cf1.field2" contains
"value2" in the same Record because the column family is the same.
i.e. NO JOIN across records

But perhaps the user really does want a join, meaning that the user
wants to find any Row that contains one or more Records that have a
field "cf1.field1" that contains "value1" and one or more Records in
the same Row (but not necessarily in the same Record) that contains a
field "cf1.field2" that contains "value2".  i.e. JOIN
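
To make the two interpretations concrete, here is a rough sketch in
Python.  It is only a toy in-memory model, not Blur's API: a Row is
assumed to be a list of Records, a Record a dict from "family.column"
to a string value, and the function names are made up for illustration.

```python
# Toy in-memory model (an assumption for illustration, not Blur's API):
# a Row is a list of Records, and a Record is a dict that maps
# "family.column" to a string value.

def record_matches(record, field, value):
    """True if the Record has the field and its value contains `value`."""
    return value in record.get(field, "")

def same_record_match(row, clauses):
    """NO JOIN: a single Record must satisfy every clause."""
    return any(
        all(record_matches(rec, f, v) for f, v in clauses)
        for rec in row
    )

def join_match(row, clauses):
    """JOIN: each clause may be satisfied by a different Record in the Row."""
    return all(
        any(record_matches(rec, f, v) for rec in row)
        for f, v in clauses
    )

clauses = [("cf1.field1", "value1"), ("cf1.field2", "value2")]

# The two values live in different Records of the same Row.
row = [
    {"cf1.field1": "value1"},
    {"cf1.field2": "value2"},
]

print(same_record_match(row, clauses))  # False
print(join_match(row, clauses))         # True
```

The same Row is rejected under the first reading and accepted under the
second, which is exactly the ambiguity the parser has to guess at.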

Given the current implementation, the only way to force the JOIN is
to do something like:

+(+cf1.field1:value1 nocf.nofield:somevalue) +(+cf1.field2:value2
nocf.nofield:somevalue)

This will trick the parser into creating 2 separate join query
(SuperQuery) objects and performing the JOIN.


THIS IS UGLY.

Here are the current criteria for a query language:
- The ability to support any Lucene query type (Boolean, Term, Fuzzy,
Span, etc.)
- User-defined query types should be supported, i.e. it must be extensible
- The query language should be compatible with any programming
language so that the current thrift RPC can continue to be utilized

Here are options that I have been thinking about:

Option 1:
Somehow extend the current Lucene Query syntax to support these "new"
features.  The biggest issue I have with this is that we would be
creating yet another query language that users would have to learn.
Also I think that allowing users to extend the query language by
adding their own types would require a rewrite of Lucene's
query parser.  So even starting with the Lucene query
language would be a lot of work.

Option 2:
Some limited version of SQL or SQL like syntax, basically supporting
normal SQL with limited join support (probably only natural joins).
This would be nice, because most users understand SQL.  But because
Blur cannot support all the various operations that SQL can provide,
this will probably be frustrating to users.  And they will need to
learn what Blur SQL will provide and any special Blur-only syntax.  So
this would again be like inventing another query language.

Option 3:
CQL (http://en.wikipedia.org/wiki/Contextual_Query_Language) not to be
confused with Cassandra Query Language.  Currently I like this option
the best, because it has built-in extensibility as well as the normal
options needed for a search engine: Boolean, fuzzy, wildcard, etc.

I would really like to get others' opinions here, and any other options.  Thanks!

Aaron
