[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes

Todd Nine (JIRA) Tue, 30 Aug 2011 14:37:38 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094116#comment-13094116
 ]


Todd Nine commented on CASSANDRA-2915:
--------------------------------------

I agree that order by could be a performance killer for large data sets.  For 
large data sets I think users should de-normalize and create their own 
secondary index for efficient querying.  However, for small data sets, which 
seem to be very common in web systems (in ours they account for about 80% of 
the data a user sees), order by semantics are very important.  Most of the 
data our users see comes from very small result sets, < 100 rows.  I think 
explicitly prohibiting these features limits the user too much.  Shouldn't 
they be supported, leaving it up to the user to decide which approach to take 
in implementing an index for their data?

> Lucene based Secondary Indexes
> ------------------------------
>
>                 Key: CASSANDRA-2915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2915
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: T Jake Luciani
>            Assignee: Jason Rutherglen
>              Labels: secondary_index
>
> Secondary indexes (of type KEYS) suffer from a number of limitations in their 
> current form:
>    - Multiple IndexClauses only work when there is a subset of rows under the 
> highest clause
>    - One new column family is created per index; this means 10 new CFs for 10 
> secondary indexes
> This ticket will use the Lucene library to implement secondary indexes as one 
> index per CF, and utilize the Lucene query engine to handle multiple index 
> clauses. Also, by using Lucene we get a highly optimized file format.
> There are a few parallels we can draw between Cassandra and Lucene.
> Lucene indexes segments in memory then flushes them to disk, so we can sync 
> our memtable flushes to Lucene flushes. Lucene also has optimize(), which 
> correlates to our compaction process, so these can be sync'd as well.
> We will also need to correlate column validators to Lucene tokenizers, so the 
> data can be stored properly. The big win is that once this is done we can 
> perform complex queries within a column, like wildcard searches.
> The downside of this approach is we will need to read before write since 
> documents in Lucene are written as complete documents. For random workloads 
> with lots of indexed columns this means we need to read the document from 
> the index, update it and write it back.
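
The read-before-write cost described in the issue can be sketched roughly as
below. This is a toy illustration only: ToyDocumentIndex, update_indexed_column
and the sample keys and values are hypothetical names made up for this sketch,
and nothing here is the Lucene or Cassandra API. It only assumes what the
description states, namely that the index stores each document as a whole.

```python
class ToyDocumentIndex:
    """Hypothetical stand-in for an index that stores whole documents.

    Not the Lucene API; it only models "documents are written as complete
    documents" and counts how many reads and writes take place.
    """

    def __init__(self):
        self._docs = {}
        self.reads = 0
        self.writes = 0

    def get(self, row_key):
        # Return a copy of the stored document (empty if none exists yet).
        self.reads += 1
        return dict(self._docs.get(row_key, {}))

    def put(self, row_key, document):
        # Replace the whole stored document for this row key.
        self.writes += 1
        self._docs[row_key] = dict(document)


def update_indexed_column(index, row_key, column, value):
    """Change one indexed column: read the document, modify it, write it back."""
    document = index.get(row_key)   # the read before the write
    document[column] = value        # apply the single-column change
    index.put(row_key, document)    # rewrite the complete document
    return document


index = ToyDocumentIndex()
update_indexed_column(index, "user:1", "name", "todd")
update_indexed_column(index, "user:1", "state", "active")
```

In this sketch every single-column update costs one index read plus one full
document write, which is the overhead the description warns about for random
workloads with many indexed columns.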

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
