[jira] [Issue Comment Edited] (CASSANDRA-2915) Lucene based Secondary Indexes

Todd Nine (JIRA) Sun, 28 Aug 2011 21:31:19 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092606#comment-13092606
 ]


Todd Nine edited comment on CASSANDRA-2915 at 8/29/11 4:30 AM:
---------------------------------------------------------------

I don't necessarily think there is a 1 to 1 relationship between a column and a 
Lucene document field.  In our case we need to index fields in more than one 
manner.  For instance, we index the email, first name and last name columns of 
our users as straight (lowercased) strings.  However, we also want to tokenize 
the email, first name and last name columns to allow our customer support 
people to perform partial name matching.  I think a 1 to N mapping from column 
to document field is required to allow this sort of functionality.
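
To make the 1 to N idea concrete, here is a minimal sketch in Python. It does 
not use the Lucene API. The field names (such as `email_exact` and 
`email_tokens`), the `FIELD_MAPPING` table and both analyzer functions are 
hypothetical and only illustrate how one column could feed several document 
fields.

```python
import re


def exact_lowercase(value):
    """Index the whole value as a single lowercased term."""
    return [value.lower()]


def tokenize(value):
    """Lowercase the value and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", value.lower()) if t]


# Hypothetical mapping: one column name -> N (field name, analyzer) pairs.
FIELD_MAPPING = {
    "email": [("email_exact", exact_lowercase), ("email_tokens", tokenize)],
    "first_name": [("first_name_exact", exact_lowercase),
                   ("first_name_tokens", tokenize)],
    "last_name": [("last_name_exact", exact_lowercase),
                  ("last_name_tokens", tokenize)],
}


def build_document(row):
    """Turn a row (column -> value) into a document (field -> terms)."""
    doc = {}
    for column, value in row.items():
        # Columns without a mapping are simply not indexed in this sketch.
        for field_name, analyzer in FIELD_MAPPING.get(column, []):
            doc[field_name] = analyzer(value)
    return doc
```

With a mapping like this, an exact lookup would go against the `*_exact` 
fields, while support staff searching by part of a name would query the 
`*_tokens` fields.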

As for column expiration, is there a system event that we can hook into to 
just force a document reindex when a column expires, rather than adding an 
additional field that will need to be sorted on?

As per Jason's previous post, I think supporting ORDER BY, GROUP BY, COUNT, 
LIKE, etc. is a must.  Most users have become accustomed to this functionality 
from RDBMSs.  If these operations cause performance problems, I think this 
should be documented so that users have enough information to determine whether 
they can rely on the Lucene index or should build their own index directly.


Has anyone looked at existing code in ElasticSearch to avoid some of the 
pitfalls they have already experienced in building something similar?

http://www.elasticsearch.org/


Lastly, this is a huge feature for the hector-jpa plugin.  What can I do to 
help?



> Lucene based Secondary Indexes
> ------------------------------
>
>                 Key: CASSANDRA-2915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2915
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: T Jake Luciani
>            Assignee: Jason Rutherglen
>              Labels: secondary_index
>
> Secondary indexes (of type KEYS) suffer from a number of limitations in their 
> current form:
>    - Multiple IndexClauses only work when there is a subset of rows under the 
> highest clause
>    - One new column family is created per index; this means 10 new CFs for 10 
> secondary indexes
> This ticket will use the Lucene library to implement secondary indexes as one 
> index per CF, and utilize the Lucene query engine to handle multiple index 
> clauses. Also, by using Lucene we get a highly optimized file format.
> There are a few parallels we can draw between Cassandra and Lucene.
> Lucene indexes segments in memory then flushes them to disk so we can sync 
> our memtable flushes to Lucene flushes. Lucene also has optimize() which 
> correlates to our compaction process, so these can be sync'd as well.
> We will also need to correlate column validators to Lucene tokenizers, so the 
> data can be stored properly. The big win is that once this is done we can 
> perform complex queries within a column, like wildcard searches.
> The downside of this approach is we will need to read before write since 
> documents in Lucene are written as complete documents. For random workloads 
> with lots of indexed columns this means we need to read the document from 
> the index, update it and write it back.
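
The read-before-write cost described above can be sketched as follows. This is 
a toy illustration in Python, not Lucene or Cassandra code: the 
`WholeDocumentIndex` class, its counters and `update_column` are hypothetical 
names, assuming only that the index stores and replaces complete documents.

```python
class WholeDocumentIndex:
    """Toy stand-in for an index that only stores complete documents."""

    def __init__(self):
        self._docs = {}
        self.reads = 0
        self.writes = 0

    def get(self, row_key):
        """Return a copy of the stored document (empty if none exists)."""
        self.reads += 1
        return dict(self._docs.get(row_key, {}))

    def put(self, row_key, document):
        """Replace the whole document for this row key."""
        self.writes += 1
        self._docs[row_key] = dict(document)


def update_column(index, row_key, column, value):
    """Changing one column still costs one read plus one full write."""
    document = index.get(row_key)   # read the current document
    document[column] = value        # update a single field
    index.put(row_key, document)    # write the complete document back
```

In this sketch every single-column update triggers one read and one full 
rewrite, which is why random workloads with many indexed columns are the 
expensive case.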

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
