Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The following page has been changed by JonathanEllis:
http://wiki.apache.org/cassandra/CassandraLimitations

New page:
= Limitations =

From easiest to fix to hardest:

 * Cassandra's compaction code currently deserializes an entire row (per
columnfamily) at a time, so all the data for a given columnfamily/key pair
must fit in memory.  Fixing this is relatively easy: columns are stored in
sorted order on disk, so there is no inherent reason to deserialize a row at
a time other than that it is simpler with the current encapsulation of
functionality.  (A streaming-merge sketch appears after this list.)
 * Cassandra does not currently fsync the commitlog before acking a write.
Most of the time this is Good Enough when you are writing to multiple
replicas, since the odds are slim that all replicas die before the data
actually hits the disk, but the truly paranoid will want real
fsync-before-ack.  Adding the fsync itself would only be a few lines (in
CommitLog, naturally), but we want to do it without killing performance, so
what we really want is an Executor that fsyncs after writing each batch of
commitlog entries and then asynchronously notifies the waiting write threads.
(A group-commit sketch follows this list.)
 * Cassandra has two levels of indexes: key and column.  But in super
columnfamilies there is a third level of subcolumns; these are not indexed,
and any request for a subcolumn deserializes _all_ the subcolumns in that
supercolumn.  So you want to avoid a data model that requires large numbers
of subcolumns per supercolumn.  This can be fixed; the core classes involved
are SuperColumn and SequenceFile.  (See the subcolumn lookup sketch after
this list.)
 * Cassandra's public API is based on Thrift, which offers no streaming
abilities -- any value written or fetched has to fit in memory.  This is
inherent to Thrift's design, and I don't see it changing.  So (as with
traditional RDBMSes) you are better off storing large blobs in the
filesystem and keeping only a machine:path pointer in Cassandra, rather than
storing the blobs themselves.  (A sketch of that pattern follows this list.)
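
Since each SSTable already stores a row's columns in sorted order, compaction
could in principle merge them column-by-column instead of materializing whole
rows.  The following is only a rough sketch of such a streaming merge; Column
and ColumnWriter are illustrative stand-ins, not the real Cassandra classes:

{{{
import java.util.*;

// Hypothetical streaming merge for compaction: the inputs are already sorted
// by column name, so only one "head" column per input SSTable is in memory.
public class StreamingRowMerge {
    interface Column { String name(); long timestamp(); }
    interface ColumnWriter { void append(Column c); }

    static class Head {
        Column column; final Iterator<Column> source;
        Head(Column column, Iterator<Column> source) { this.column = column; this.source = source; }
    }

    public static void merge(List<Iterator<Column>> inputs, ColumnWriter out) {
        PriorityQueue<Head> pq =
            new PriorityQueue<>(Comparator.comparing((Head h) -> h.column.name()));
        for (Iterator<Column> it : inputs)
            if (it.hasNext()) pq.add(new Head(it.next(), it));

        while (!pq.isEmpty()) {
            Head h = pq.poll();
            Column winner = h.column;
            // Reconcile the same column seen in other inputs: newest timestamp wins.
            while (!pq.isEmpty() && pq.peek().column.name().equals(winner.name())) {
                Head dup = pq.poll();
                if (dup.column.timestamp() > winner.timestamp()) winner = dup.column;
                advance(dup, pq);
            }
            out.append(winner);   // written straight to the new SSTable, never a whole row
            advance(h, pq);
        }
    }

    private static void advance(Head h, PriorityQueue<Head> pq) {
        if (h.source.hasNext()) { h.column = h.source.next(); pq.add(h); }
    }
}
}}}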
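
One way to get fsync-before-ack without paying an fsync per write is group
commit: a dedicated syncer (playing the role of the Executor) writes a batch
of entries, forces the file once, and only then completes the acks the write
threads are waiting on.  A rough sketch of the idea, not the actual CommitLog
code:

{{{
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Rough "group commit" sketch: write threads queue entries and wait on a
// future; one syncer thread writes a batch, forces the channel once, and only
// then completes the acks for that batch.
public class BatchedCommitLog implements AutoCloseable {
    private static class Entry {
        final byte[] data;
        final CompletableFuture<Void> acked = new CompletableFuture<>();
        Entry(byte[] data) { this.data = data; }
    }

    private final BlockingQueue<Entry> queue = new LinkedBlockingQueue<>();
    private final FileChannel channel;
    private final Thread syncer;
    private volatile boolean running = true;

    public BatchedCommitLog(Path logFile) throws IOException {
        channel = FileChannel.open(logFile, StandardOpenOption.CREATE,
                                   StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        syncer = new Thread(this::syncLoop, "commitlog-syncer");
        syncer.start();
    }

    /** Called by write threads; the future completes only after the entry is forced to disk. */
    public CompletableFuture<Void> add(byte[] serializedMutation) {
        Entry e = new Entry(serializedMutation);
        queue.add(e);
        return e.acked;
    }

    private void syncLoop() {
        List<Entry> batch = new ArrayList<>();
        while (running || !queue.isEmpty()) {
            batch.clear();
            try {
                Entry first = queue.poll(100, TimeUnit.MILLISECONDS);
                if (first == null) continue;
                batch.add(first);
                queue.drainTo(batch);                      // pick up everything else pending
                for (Entry e : batch) {
                    ByteBuffer buf = ByteBuffer.wrap(e.data);
                    while (buf.hasRemaining()) channel.write(buf);
                }
                channel.force(false);                      // one fsync covers the whole batch
                for (Entry e : batch) e.acked.complete(null);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return;
            } catch (IOException ioe) {
                for (Entry e : batch) e.acked.completeExceptionally(ioe);
            }
        }
    }

    @Override
    public void close() throws Exception {
        running = false;
        syncer.join();
        channel.close();
    }
}
}}}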
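
To make the subcolumn cost concrete: without a per-subcolumn index, serving a
single subcolumn means deserializing every subcolumn in the supercolumn
first.  The sketch below uses a simplified (count, name, length, value)
layout, not Cassandra's real on-disk format:

{{{
import java.io.DataInputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: because subcolumns are not indexed, the whole supercolumn
// is deserialized into memory and then the one requested subcolumn is picked
// out of it.
public class SubcolumnLookup {
    public static byte[] getSubcolumn(DataInputStream superColumnData, String wanted)
            throws IOException {
        Map<String, byte[]> all = new LinkedHashMap<>();
        int count = superColumnData.readInt();
        for (int i = 0; i < count; i++) {           // every subcolumn is read, wanted or not
            String name = superColumnData.readUTF();
            byte[] value = new byte[superColumnData.readInt()];
            superColumnData.readFully(value);
            all.put(name, value);
        }
        return all.get(wanted);
    }
}
}}}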
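
The blob-pointer pattern looks roughly like this; ColumnStore is a
hypothetical stand-in for whatever client call you use to insert a column,
not a specific Thrift method:

{{{
import java.io.IOException;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the "pointer instead of blob" pattern: the large value lives on a
// filesystem readers can reach, and Cassandra stores only the small
// machine:path string.
public class BlobPointer {
    interface ColumnStore { void store(String key, String columnName, byte[] value); }

    public static String writeBlob(ColumnStore columns, String rowKey, String columnName,
                                   byte[] blob, Path blobDir) throws IOException {
        // Write the blob itself outside Cassandra...
        Files.createDirectories(blobDir);
        Path file = blobDir.resolve(rowKey + "-" + columnName + ".bin");
        Files.write(file, blob);

        // ...and store only a machine:path pointer, which easily fits in memory.
        String pointer = InetAddress.getLocalHost().getHostName() + ":" + file.toAbsolutePath();
        columns.store(rowKey, columnName, pointer.getBytes(StandardCharsets.UTF_8));
        return pointer;
    }
}
}}}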
