There is some (unfinished) code in the current repo for CQL, a SQL-like Cassandra Query Language that is super simple and (AFAIK) limited to single-node queries.

I suspect there are bigger questions to tackle before we get to query languages in the sense we're talking about:

1. Data model: Cassandra's values are byte arrays. Any proposal for a language needs to figure out precisely what data model you're planning to support (your examples include numbers, dates, and strings).
2. Secondary indexes
3. Query runtime (queries that run on a single node, multiple nodes, query optimizer?)

I've never understood the value of a parallel-programming abstraction (map-reduce) for a single-node database (CouchDB) ... and I certainly don't think we're ready to build a map-reduce view engine *in* Cassandra right now.

IMHO, there are a bunch of interesting issues we will need to solve before we can seriously talk about a query language.

On Mon, Jun 22, 2009 at 11:12 AM, Alexander Staubo <[email protected]> wrote:
> Has anyone given thought to how an SQL-like query language could be
> integrated into Cassandra?
>
> I'm thinking of something which would let you evaluate a limited set
> of relational select operators. For example:
>
> * first_name = 'Bob'
> * age > 32
> * created_at between '2009-08' and '2009-09'
> * employer_id in (34543, 13177, 9338)
>
> First, is such functionality desired within the framework of
> Cassandra, or do people prefer to keep this functionality in a
> completely separate server component? There are pros and cons to keep
> queries inside Cassandra. I could enumerate them, but I would like to
> hear other people's thoughts first.
>
> An alternative to a text-based query syntax would be to borrow
> CouchDB's concept of views [1]. In CouchDB, views are pre-defined
> indexes which are populated by filtering data through a pair of
> map/reduce functions, which are usually written in JavaScript. Views
> are somewhat limited in expressiveness and flexibility, and do not
> address all possible use cases, but they are very efficient to compute
> and store, and are a fairly elegant system.
>
> Some challenges come to mind:
>
> Cassandra's distributed nature means that a node's queryable indexes
> can/should only reference data in that same node's partition, and that
> a query might have to be executed on multiple nodes. For performance,
> the query processing needs to be parallelized and pipelined.
>
> Could a query planner/optimizer be able to reduce the number of nodes
> required to satisfy a query by looking at the distribution of node
> values across nodes? For example, if the column "first_name" value
> "Foo" only occurs on node A, there's no need to involve node B. But
> such knowledge requires the maintenance of statistics on each node
> that cover all known peers, and the statistics must be kept up to date
> to avoid glaring consistency issues.
>
> Given the nature of Cassandra's column families it's not immediately
> obvious to me how to best address columns in such a language.
>
> [1] http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
>
> A.
>
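
To make the data-model point (item 1 above) a bit more concrete, here is a rough Python sketch of why a predicate like `age > 32` is not well defined over raw byte arrays until an encoding is agreed on. Nothing here is Cassandra code; `encode_int` and the sample ages are made up for illustration, and the fixed-width big-endian encoding is just one possible choice.

```python
import struct

def encode_int(n: int) -> bytes:
    """Hypothetical typed encoding: unsigned 64-bit big-endian.

    For non-negative integers, byte-wise order matches numeric order.
    """
    return struct.pack(">Q", n)

ages = [9, 32, 100]

# Untyped: each number stored as its decimal text, compared byte by byte.
as_text = sorted(str(a).encode("ascii") for a in ages)
text_order = [int(b.decode("ascii")) for b in as_text]

# Typed: fixed-width big-endian integers, compared byte by byte.
as_typed = sorted(encode_int(a) for a in ages)
typed_order = [struct.unpack(">Q", b)[0] for b in as_typed]

print(text_order)   # byte order of the decimal text
print(typed_order)  # byte order of the fixed-width encoding
```

With the text encoding, the byte-wise sort puts 100 before 32 before 9, so a range predicate evaluated on bytes would return the wrong rows. The same question comes up for dates and strings (collation, time zones), which is why the data model has to be settled before the language.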

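
On the query-runtime question (item 3, and the node-pruning idea in the quoted message), a toy sketch of what "only involve the nodes that can match" might look like. Everything here is hypothetical: `NODES` is an in-memory stand-in for two partitions, and `build_stats` assumes a perfectly fresh, global view of value distribution, which is exactly the part that is hard to maintain in a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for two nodes, each holding its own partition.
NODES = {
    "A": [{"key": "u1", "first_name": "Foo"}, {"key": "u2", "first_name": "Bob"}],
    "B": [{"key": "u3", "first_name": "Bob"}],
}

def build_stats(nodes, column):
    """Map each value of `column` to the set of nodes that hold it."""
    stats = {}
    for name, rows in nodes.items():
        for row in rows:
            stats.setdefault(row[column], set()).add(name)
    return stats

def scan_node(name, column, value):
    """Local scan: a node only ever looks at its own partition."""
    return [row["key"] for row in NODES[name] if row.get(column) == value]

def query(column, value, stats):
    """Contact only the nodes the statistics say can match, in parallel."""
    targets = sorted(stats.get(value, set()))
    if not targets:
        return [], []
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        parts = pool.map(lambda n: scan_node(n, column, value), targets)
        keys = sorted(k for part in parts for k in part)
    return targets, keys

stats = build_stats(NODES, "first_name")
foo_nodes, foo_keys = query("first_name", "Foo", stats)
bob_nodes, bob_keys = query("first_name", "Bob", stats)
print(foo_nodes, foo_keys)
print(bob_nodes, bob_keys)
```

The query for "Foo" touches only node A, while "Bob" fans out to both. If `stats` goes stale (say "Foo" is later written to node B), the pruned query silently misses rows, which is the consistency problem raised in the quoted message.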