Re: secondary index support in Cassandra

2009-03-26 Thread Sandeep Tata
The compaction optimization that Prashant mentioned is likely to solve
many of the problems that Jun brings up.

We were thinking of tackling this problem ... I've opened a ticket in
JIRA (https://issues.apache.org/jira/browse/CASSANDRA-16)

Avinash, Prashant -- If you guys are already working on it, feel free
to assign it to yourself. Otherwise, we'll sketch out a plan and send
it out; if the community agrees on the idea, we can start hacking
away.

Sandeep



Re: secondary index support in Cassandra

2009-03-25 Thread Jun Rao

Some comments inlined below.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

jun...@almaden.ibm.com


Avinash Lakshman avinash.laksh...@gmail.com wrote on 03/24/2009 10:08:45
PM:

 Comments inline.

 On Tue, Mar 24, 2009 at 6:53 PM, Jun Rao jun...@almaden.ibm.com wrote:

 
  Prashant,
 
  Thanks for the comments. They are quite useful. Let me try to address
  some of the points that you made.
 
  1. It is true that in our current implementation, we can glue the
  changes on both the data and the index in one batch_update() call. This
  way, the data and the index will be maintained synchronously. However,
  maintaining the index on the server is likely more efficient since
  there is less communication overhead. You seem to agree with this.


 [Avinash] You can update multiple column families for a single key in one
 mutation.
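
[Editor's sketch] The point that data-CF and index-CF changes can travel in
one batched mutation can be illustrated with a minimal in-memory model. This
is not Cassandra's actual Thrift API; `batch_update`, the store layout, and
the column names here are all hypothetical, purely to show the idea of
applying both updates for one key as a single unit:

```python
# Hypothetical sketch of a batched mutation: all (cf, column, value)
# changes for one row key -- data CF and index CF alike -- are applied
# together, so the two cannot be observed out of sync on one node.

def batch_update(store, key, mutations):
    """Apply every (cf, column, value) mutation for `key` as one unit."""
    row = store.setdefault(key, {})
    for cf, column, value in mutations:
        row.setdefault(cf, {})[column] = value

store = {}
batch_update(store, "group1", [
    ("entities", "e42:color", "red"),       # data column
    ("index", "color:red:Y:7:e42", ""),     # matching index entry
])
```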

 
 
  2. Cassandra currently doesn't acquire a row lock for row accesses.
  However, the implication is that a reader may see partial updates of a
  row. For example, suppose that a writer updates two columns in
  different CFs. Then, it is possible for a concurrent reader to see the
  update on one column, but not the other one. For some applications,
  row-level consistency could be important. It's probably for this reason
  that, in HBase, a region server acquires a row lock for every read and
  write.


 [Avinash] Updates to a single row within a machine are atomic, which
 means what you are stating will not happen. Writes and reads will be
 serialized at the Memtable.

This problem doesn't show up in Cassandra today because there is no method
that can read columns from different CFs in a row. If there were such a
method, it would be hard to enforce that a reader always sees a complete
update (updating multiple CFs) without some sort of row locks.
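
[Editor's sketch] The torn read described above can be made concrete with a
deterministic interleaving in plain Python (no Cassandra involved). A
generator stands in for a writer whose two column-family writes are separate
steps; a read between the steps observes a partial update:

```python
# Illustration of a torn read across column families when reads are not
# serialized against a multi-CF write. The yield marks the point where a
# concurrent reader may run, after cf1 is written but before cf2 is.

row = {"cf1": {"x": "old"}, "cf2": {"y": "old"}}

def write_both(row):
    row["cf1"]["x"] = "new"
    yield  # a concurrent reader can run here, mid-update
    row["cf2"]["y"] = "new"

w = write_both(row)
next(w)  # the writer has applied only the cf1 write so far
snapshot = (row["cf1"]["x"], row["cf2"]["y"])
# snapshot is ("new", "old"): a partial update, visible to the reader
```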


 
 
  3. For our current application, the size of all entities in a group is
  not too large and likely fits within the capacity of a single node.
  However, for other applications, being able to scale a group to more
  than a node could be useful. Storing a group within a single row will
  prevent scaling out the group.

 [Avinash] I guess the question is how many entities do you envision in a
 group? What do you mean by fitting into one node?


A large group may not fit in memory, but should fit on a commodity disk.
The compaction optimization Prashant mentioned will definitely make our
current approach more feasible.

However, in general, I am a bit concerned about putting too much stuff
within a row. A row is a unit that has finite capacity and a user shouldn't
expect to put an infinite number of columns within a row. I actually like
the current assumption in Cassandra that a row has to fit in memory since
it simplifies the implementation. On the other hand, a table can have
arbitrary capacity (one just needs to provision enough nodes in the cluster)
and it can have as many rows as you want.

 
 
  Jun
  IBM Almaden Research Center
  K55/B1, 650 Harry Road, San Jose, CA  95120-6099
 
  jun...@almaden.ibm.com
 
 
  Prashant Malik pma...@gmail.com wrote on 03/24/2009 11:34:51 AM:
 
   Some questions inline
  
   On Tue, Mar 24, 2009 at 10:21 AM, Jun Rao jun...@almaden.ibm.com
  wrote:
  
   
   
    We have an application that has groups and entities. A group has many
    entities and an entity has a bunch of (attribute, value) pairs. A
    common access pattern is to select some number of entities within a
    group with attribute X equal to x, ordered by attribute Y. For
    efficiency, we want to build a secondary index for each group and
    collocate a group and its secondary index on the same node. Our
    current approach is to map a group to a row in Cassandra and each
    entity to a column in a column family (CF). Within the same row, we
    use a separate CF (ordered by name) to implement a secondary index,
    say on attributes X and Y. In this family, each column name has the
    form X:x:Y:y:entityID. We extended the get_slice() function so that
    it can get a slice of columns starting from a given column. The
    extended function uses the column-level index to locate the starting
    column quickly. (We'd be happy to contribute this extension back to
    Cassandra if people find this useful.) Using the extended
    get_slice(), we were able to access the entities through the
    simulated secondary index.
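
[Editor's sketch] The scheme above -- sorted index-CF column names of the
form X:x:Y:y:entityID, plus a get_slice() that starts from a given column --
can be modeled with a sorted list and a binary search. The helper and the
sample column names are illustrative, not the actual extension; note the
scheme relies on lexicographic order, so numeric attribute values would need
fixed-width encoding in practice:

```python
import bisect

def get_slice_from(columns, start, count):
    """Return up to `count` column names >= `start` (columns is sorted)."""
    i = bisect.bisect_left(columns, start)
    return columns[i:i + count]

# Simulated index CF: columns named X:x:Y:y:entityID, kept sorted by name.
index_cf = sorted([
    "color:red:price:10:e1",
    "color:red:price:25:e7",
    "color:red:price:30:e2",
    "size:large:price:05:e9",
])

# Entities with color == red, ordered by price: seek to the "color:red:"
# prefix, then take a bounded slice.
hits = get_slice_from(index_cf, "color:red:", 2)
# hits == ["color:red:price:10:e1", "color:red:price:25:e7"]
```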
   
    We see a couple of problems with the current approach. First, our
    application has to maintain the index. This is inefficient and could
    leave the index inconsistent when a failure occurs. Second, mapping
    each entity to a column may not be a good idea. Often, there is some
    sort of locking for each row access. Putting many entities per row
    limits concurrency. Today, in Cassandra, a full row is deserialized
    into memory during compaction. This limits the number of entities
    that can be put in a single row. Also, intuitively, an 

Re: secondary index support in Cassandra

2009-03-24 Thread Brian McCallister
On Tue, Mar 24, 2009 at 9:39 PM, Jonathan Ellis jbel...@gmail.com wrote:
 On Tue, Mar 24, 2009 at 8:33 PM, Jun Rao jun...@almaden.ibm.com wrote:
 As for queries vs. low-level api, we can make both available to the
 application developer.

 Just do both is almost always a cop-out. :)

Define a query AST and then don't care how the AST is generated :-)

-Brian
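
[Editor's sketch] One way to read Brian's suggestion: define a small query
AST, and let both a query language and the low-level API build the same
tree, so the server only ever evaluates one representation. The node shapes
below are entirely hypothetical, just to make the idea concrete:

```python
# Hypothetical query AST: both a parser and a programmatic API could
# produce these nodes; the executor doesn't care which one did.
from dataclasses import dataclass

@dataclass
class Eq:
    attr: str
    value: str

@dataclass
class Select:
    group: str       # row key / group to query
    where: Eq        # predicate on an indexed attribute
    order_by: str    # attribute giving the result order
    limit: int       # number of entities to return

# "entities in group g1 with X == x, ordered by Y, first 10"
q = Select(group="g1", where=Eq("X", "x"), order_by="Y", limit=10)
```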