I would like to start a thread on the future of Blur's query language. Currently the "simpleQuery" is just normal Lucene-based syntax with a little magic to figure out the joins (via the SuperQuery) that the user probably intended. Of course this guesswork gets it wrong sometimes. Let me explain with an example:

Given the query with superOn:

    +cf1.field1:value1 +cf1.field2:value2

The current implementation will ASSUME that you want to find where "cf1.field1" contains "value1" and where "cf1.field2" contains "value2" in the same Record, because the column family is the same, i.e. NO JOIN across Records.

But perhaps the user really does want a join, meaning that the user wants to find any Row that contains one or more Records that have a field "cf1.field1" that contains "value1" and one or more Records in the same Row (but not necessarily the same Record) that have a field "cf1.field2" that contains "value2", i.e. JOIN.

Given the current implementation, the only way to force the JOIN is to do something like:

    +(+cf1.field1:value1 nocf.nofield:somevalue) +(+cf1.field2:value2 nocf.nofield:somevalue)

This will trick the parser into creating 2 separate join query (SuperQuery) objects and performing the JOIN. THIS IS UGLY.

Here are the current criteria for a query language:

- The ability to support any Lucene query type (Boolean, Term, Fuzzy, Span, etc.)
- User-defined query types should be supported, i.e. the language should be extensible
- The query language should be compatible with any programming language, so that the current Thrift RPC can continue to be used

Here are the options that I have been thinking about:

Option 1: Somehow extend the current Lucene query syntax to support these "new" features. The biggest issue I have with this is that we would be creating yet another query language that users would have to learn. Also, I think that allowing users to extend the query language by adding their own types would require a rewrite of the Lucene query parser. So even starting with the Lucene query language would be a lot of work.

Option 2: Some limited version of SQL or an SQL-like syntax, basically supporting normal SQL with limited join support (probably only natural joins). This would be nice, because most users understand SQL. But because Blur cannot support all the various operations that SQL provides, this will probably be frustrating to users. They would also need to learn what Blur SQL provides and any special Blur-only syntax. So this would again be like inventing another query language.

Option 3: CQL (http://en.wikipedia.org/wiki/Contextual_Query_Language), not to be confused with Cassandra Query Language. Currently I like this option the best, because it has built-in extensibility as well as the normal features needed for a search engine: Boolean, fuzzy, wildcard, etc.

I would really like to get others' opinions here, and any other options.

Thanks!

Aaron
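
P.S. To make the two interpretations of the example query concrete, here is a minimal Python sketch. It is only an illustration: modeling a Row as a list of dicts, the function names, and treating "contains" as whitespace-token membership are all assumptions made for this example and are not Blur's data model or API.

```python
# Illustrative model only: a Row is a list of Records, and each Record maps
# "family.column" names to text. This is NOT Blur's actual API.

def record_matches(record, terms):
    """True if this single Record contains every (field, value) term."""
    return all(value in record.get(field, "").split() for field, value in terms)

def match_same_record(row, terms):
    """NO JOIN: a single Record in the Row must satisfy all terms."""
    return any(record_matches(record, terms) for record in row)

def match_join(row, terms):
    """JOIN: each term may be satisfied by a different Record in the Row."""
    return all(
        any(record_matches(record, [term]) for record in row)
        for term in terms
    )

terms = [("cf1.field1", "value1"), ("cf1.field2", "value2")]

# row_a holds both values in one Record; row_b splits them across two Records.
row_a = [{"cf1.field1": "value1", "cf1.field2": "value2"}]
row_b = [{"cf1.field1": "value1"}, {"cf1.field2": "value2"}]

for name, row in (("row_a", row_a), ("row_b", row_b)):
    print(name, match_same_record(row, terms), match_join(row, terms))
```

In this sketch both interpretations accept row_a, but only the JOIN interpretation accepts row_b. That difference is exactly what the current syntax gives the user no clean way to express.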
