Re: Secondary indexing and 0.6/0.7 integration with Datanucleus

aaron morton Wed, 16 Jun 2010 05:38:15 -0700

I've not read up on the secondary indexes, but I am doing something similar. I 
got some inspiration from the Lucandra project. You will probably need to make 
multiple calls to Cassandra for each clause of your query.


The design I used had two CFs. The rough idea was: in the TermDocIndex the key 
is the term (e.g. lastName=Smith) and the column names are the keys of the 
objects / documents the term appears in, e.g. key1. The DocTermIndex uses the 
object/doc id as the key and has a column for each term the document contains 
(e.g. "lastName=Smith"). I also maintained some stats on how many 
objects/documents had each term (using redis; I will perhaps move to Cassandra 
counters in 0.7).
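
A minimal in-memory sketch of the two column families described above, with 
plain Python dicts standing in for Cassandra. The names index_document, 
term_doc_index, doc_term_index and term_counts are illustrative assumptions, 
not part of any Cassandra API, and the sample rows are borrowed from the 
Person example quoted further down the thread.

```python
from collections import defaultdict

# In-memory stand-ins for the two column families (row key -> {column name: value}).
term_doc_index = defaultdict(dict)   # term -> {doc key: b""}
doc_term_index = defaultdict(dict)   # doc key -> {term: b""}
term_counts = defaultdict(int)       # term -> number of documents containing it


def index_document(doc_key, fields):
    """Write one document into both indexes and update the per-term stats."""
    for name, value in fields.items():
        term = f"{name}={value}"
        # Count each (term, document) pair only once.
        if doc_key not in term_doc_index[term]:
            term_counts[term] += 1
        term_doc_index[term][doc_key] = b""
        doc_term_index[doc_key][term] = b""


index_document("key1", {"firstName": "John", "lastName": "Smith"})
index_document("key2", {"firstName": "Jane", "lastName": "Smith"})
index_document("key3", {"firstName": "Jane", "lastName": "Doe"})
```

The column values are left empty because only the column names carry 
information in either index.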

The query process then becomes:
1.      Determine the most selective term in the query using the stats.
2.      Do a get_slice to get the first X (1000 perhaps) columns from the 
TermDocIndex using the term as the key.
3.      Use the column names from step 2 as the keys in a multiget_slice against 
the DocTermIndex, listing the remaining terms as the column names you want to 
get back.
4.      From the result of step 3, filter out all keys that returned fewer 
columns than we asked for.
5.      Repeat from 3 if needed.
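
The steps above can be sketched against in-memory dicts. This is only an 
illustration under stated assumptions: the get_slice and multiget_slice 
functions below are simplified local stand-ins that borrow the Thrift call 
names, not the real client API, and the sketch reads step 5 as "fetch the next 
page of candidate keys", which is one possible interpretation.

```python
# In-memory stand-ins for the two column families and the term stats.
term_doc_index = {
    "lastName=Smith": {"key1": b"", "key2": b""},
    "lastName=Doe": {"key3": b""},
    "firstName=John": {"key1": b""},
    "firstName=Jane": {"key2": b"", "key3": b""},
}
doc_term_index = {
    "key1": {"firstName=John": b"", "lastName=Smith": b""},
    "key2": {"firstName=Jane": b"", "lastName=Smith": b""},
    "key3": {"firstName=Jane": b"", "lastName=Doe": b""},
}
term_counts = {term: len(cols) for term, cols in term_doc_index.items()}


def get_slice(cf, key, start, count):
    """Return up to `count` column names of row `key`, strictly after `start`."""
    names = sorted(n for n in cf.get(key, {}) if n > start)
    return names[:count]


def multiget_slice(cf, keys, column_names):
    """For each key, return the requested columns that exist in that row."""
    return {k: [c for c in column_names if c in cf.get(k, {})] for k in keys}


def query(terms, page_size=1000):
    # Step 1: pick the most selective term (fewest matching documents).
    first = min(terms, key=lambda t: term_counts.get(t, 0))
    rest = [t for t in terms if t != first]
    matches, start = [], ""
    while True:
        # Step 2: read one bounded page of candidate keys.
        candidates = get_slice(term_doc_index, first, start, page_size)
        if not candidates:
            return matches
        if rest:
            # Step 3: ask each candidate row for the remaining terms.
            found = multiget_slice(doc_term_index, candidates, rest)
            # Step 4: keep only keys that returned every requested column.
            matches += [k for k in candidates if len(found[k]) == len(rest)]
        else:
            matches += candidates
        # Step 5 (as interpreted here): continue from the last key seen.
        start = candidates[-1]
```

With this sample data, query(["lastName=Smith", "firstName=Jane"]) returns 
["key2"].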

I was hoping the limit in step 2 would bound the queries into the cluster, and 
the multiget in step 3 would be better at distributing most of the work 
around the cluster, e.g. rather than reading 1000 columns from, say, 3 keys, it 
reads 3 columns from 1000 keys.

Aaron


On 16 Jun 2010, at 16:57, Todd Nine wrote:

> No problem,
>  I didn't want to implement my own solution if an existing one could
> easily be applied.  Since I'll be creating CFs that represent secondary
> indexes, I'll need to perform range scans over the keys of those
> secondary index CFs.  The column names within the CFs are the row keys
> of the primary table.  Is there a way I can get the intersection of all
> of the column names from multiple range scans over different column
> families in one result set?  Otherwise I'll need to make multiple trips
> and create the intersection myself in my plugin.  Here is an example of
> what I'm trying to do.
> 
> CF: Person
> 
> key1: {
>   firstName: John
>   lastName: Smith
>   email: smi...@foo.com
> }
> 
> key2: {
>  firstName: Jane
>  lastName: Smith
>  email: smi...@foo.com
> }
> 
> key3: {
>  firstName: Jane
>  lastName: Doe
>  email: smi...@foo.com
> }
> 
> 
> My secondary index tables would be the following
> 
> CF: Person_LastName
> 
> Smith:{
>  key1: 0x00
>  key2: 0x00
> }
> 
> Doe: {
>  key3:0x00
> }
> 
> CF: Person_Email
>  smi...@foo.com:{
>    key1:0x00
>    key2:0x00 
>    key3:0x00
> }
> 
> If my input is something similar to lastName == 'Smith' && email ==
> "smi...@foo.com", I would read all columns from key "Smith" in CF
> Person_LastName, and all columns from key "smi...@foo.com" in CF
> Person_Email.  The intersection of the two sets is key1 and key2, and
> I would like to have Cassandra return only those rows.
> 
> Thanks,
> Todd
> 
> 
> 
> 
> 
> On Tue, 2010-06-15 at 23:38 -0500, Jonathan Ellis wrote:
> 
>> No chance that 749 can be backported to 0.6, sorry.
>> 
>> On Tue, Jun 15, 2010 at 10:35 PM, Todd Nine <t...@spidertracks.co.nz> wrote:
>> 
>>> Let's try that again.....
>>> 
>>> This is the intended issue.
>>> 
>>> https://issues.apache.org/jira/browse/CASSANDRA-749
>>> 
>>> thanks,
>>> Todd
>>> 
>>> 
>>> 
>>>  On Tue, 2010-06-15 at 20:02 -0500, Jonathan Ellis wrote:
>>> 
>>> What issue were you trying to link? :)
>>> 
>>> On Tue, Jun 15, 2010 at 6:56 PM, Todd Nine <t...@spidertracks.co.nz> wrote:
>>>> Hi all,
>>>> I'm implementing a Datanucleus plugin for Cassandra.  I'm finished
>>>> with the basic functionality, and everything seems to work pretty well.
>>>> Now my issue is performing secondary indexing on fields within my data.
>>>> I have outlined some of the issues I'm facing in this post.
>>>> 
>>>> http://www.datanucleus.org/servlet/forum/viewthread_thread,6087_lastpage,yes#32610
>>>> 
>>>> Essentially, for each operand the user specifies, I will need to make a
>>>> trip to Cassandra, load the key columns, then perform an intersection
>>>> with the result from my previous read.  Eventually at the end of all the
>>>> intersections, I will have a list of keys I will then load.  This
>>>> obviously requires several trips to Cassandra, where from my
>>>> understanding of secondary indexing, I would only need to make one trip
>>>> for multiple operands over a column family.    I've read over this
>>>> issue.
>>>> 
>>>> http://issues.apache.org/jira/browse/CASSANDRA-32610
>>>> 
>>>> And it seems to solve a lot of my woes.  Is it possible/recommended to
>>>> patch the current code base of 0.6.2 to perform this functionality?
>>>> 
>>>> Thanks,
>>>> Todd
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>>
