On Thu, Nov 26, 2009 at 10:34:26PM +0100, gabriele renzi wrote:
> On Thu, Nov 26, 2009 at 7:12 PM, Anthony Molinaro
> <antho...@alumni.caltech.edu> wrote:
> 
> 
> > Unless you are using order preserving partitioning which might or might not
> > be what you want, you won't be able to do a full scan.  Instead you should
> > probably have two column families, one keyed by primary, one by secondary,
> > each with a column for the other, then you can do your operations.  It
> > uses more space, but disk is cheap so probably not a big deal.
> 
> yes, we thought so, using the second column family to only keep a list
> of the keys in the former without the data.
> 
> > If you
> > have to model a many-to-many relationship you can use super columns.
> 
> For now we are only storing a single attribute data, so we used normal
> columns instead of super columns, so in the end our schema is
> PrimaryCF
> { 'primary' => {'secondary'=>'data_0'} }
> SecondaryCF
> { 'secondary' => {'primary'=>''} }
> 
> I believe that using a SuperColumn in PrimaryCF would be necessary
> only when using more than one attribute, or are there other
> implications I'm not seeing?

Well, it sort of depends on how you plan to query things.  If you really
only have one secondary for each primary, you can use fixed column names
like 'secondary_id' and 'data', and then fetch either column directly by
name.  Otherwise, if you use the secondary id itself as the column name,
as you have above, you need to use get_slice since you won't know the
name in advance.  So you could model this like

PrimaryCF
  { '<primary_id>' => { 'secondary_id' => "<secondary_id>",
                        'data'         => "<data>"
                      }
  }
SecondaryCF
  { '<secondary_id>' => { 'primary_id' => "<primary_id>" } }

Then you can use get_column to get data.  However, it seems like you might
have multiple columns for a single primary id so your scheme is probably
fine.
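
To make the difference concrete, here is a rough sketch that uses plain
Python dicts as stand-ins for the two column families.  The helpers
get_column and get_slice are named after the calls mentioned above but
are only hypothetical dict lookups, not a real client API, and the ids
and values are made up.

```python
# Plain dicts standing in for column families: row key -> {column name -> value}.
# Hypothetical helpers only; this is not a Cassandra client API.

def get_column(cf, key, column_name):
    """Fetch one column whose name is known in advance."""
    return cf[key][column_name]

def get_slice(cf, key):
    """Fetch every column in a row, sorted by column name."""
    return sorted(cf.get(key, {}).items())

# Layout 1: fixed column names, one secondary per primary.
primary_fixed = {"p1": {"secondary_id": "s1", "data": "data_0"}}
secondary_fixed = {"s1": {"primary_id": "p1"}}

# Layout 2: the secondary id is itself the column name, so a primary
# can hold many secondaries, but the names are not known up front.
primary_by_name = {"p1": {"s1": "data_0", "s2": "data_1"}}

print(get_column(primary_fixed, "p1", "data"))  # data_0
print(get_slice(primary_by_name, "p1"))         # [('s1', 'data_0'), ('s2', 'data_1')]
```

With the first layout a single named lookup is enough; with the second
you have to read the whole row (or a range of it) to discover the names.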

As for the implications of Super Columns, I tend to think of it as
an extra hashing layer.  So you either have

  { 'key_0' => { 'column_name_0' => 'column_value_0' } }
or
  { 'key_0' => { 'super_column_name_0' => { 'column_name_0' =>
                                            'column_value_0' } } }

Then you can query columns with either
  'key_0', 'column_name_0'
or
  'key_0', 'super_column_name_0', 'column_name_0'
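
The same idea as a small sketch, again with nested Python dicts as a
stand-in rather than any real client.  The lookup helper is hypothetical
and just walks one dict level per path element, so the super column case
is simply one element longer.

```python
# Nested dicts as a stand-in; names mirror the placeholders above.
standard_cf = {"key_0": {"column_name_0": "column_value_0"}}
super_cf = {"key_0": {"super_column_name_0": {"column_name_0": "column_value_0"}}}

def lookup(cf, *path):
    """Walk one dict level per path element (hypothetical helper)."""
    node = cf
    for part in path:
        node = node[part]
    return node

print(lookup(standard_cf, "key_0", "column_name_0"))
print(lookup(super_cf, "key_0", "super_column_name_0", "column_name_0"))
```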

> As for the secondary, I don't like the idea of storing a dummy value
> (new byte[0]) when I only need the name, is that a smell that I should
> be using something else?

I don't think there is anything else other than an empty entry you
can use.  I tend to use "1" for those fields.
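
As a sketch of how that looks in practice (plain Python dicts again, with
made-up ids and hypothetical helper names), the secondary row only cares
about which column names exist, and the value is a throwaway.  It also
shows the insert-into-both and delete-by-secondary steps from the quoted
text just below.

```python
PLACEHOLDER = "1"  # could just as well be an empty value

primary_cf = {}    # primary id -> {secondary id -> data}
secondary_cf = {}  # secondary id -> {primary id -> PLACEHOLDER}

def insert(primary_id, secondary_id, data):
    """Write the data row and the reverse-index row together."""
    primary_cf.setdefault(primary_id, {})[secondary_id] = data
    secondary_cf.setdefault(secondary_id, {})[primary_id] = PLACEHOLDER

def delete_secondary(secondary_id):
    """Remove a secondary id everywhere; returns the primaries touched."""
    # Reading the whole secondary row plays the role of the slice query.
    primaries = sorted(secondary_cf.pop(secondary_id, {}))
    for primary_id in primaries:
        row = primary_cf.get(primary_id, {})
        row.pop(secondary_id, None)
        if not row:
            # Assumption for this sketch: a row with no columns is dropped.
            primary_cf.pop(primary_id, None)
    return primaries

insert("p1", "s1", "data_0")
insert("p2", "s1", "data_1")
insert("p2", "s2", "data_2")
print(delete_secondary("s1"))  # ['p1', 'p2']
print(primary_cf)              # {'p2': {'s2': 'data_2'}}
```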

> > You do your inserts into both, and for deletes you do a get_slice for the
> > secondary id, which will give you all primary ids which contain the
> > secondary id.  Then you can delete everything.
> 
> yes, we actually did it a bit "smarter" by querying first, and keeping
> a list of only the diff between the first and second insert. Thanks a
> lot for your answer, it's been very useful.

No problems, good luck,

-Anthony

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <antho...@alumni.caltech.edu>
