Hi,

  I’m using Cassandra to store NLP data. The dataset is not that large (about 
1TB), but I need to iterate over it quite frequently, updating the full dataset 
(every record, but not necessarily every column).

  I’ve run into two problems (I’m using the latest Cassandra):

  1. I was trying to copy data from one Cassandra cluster to another via the 
Python driver; however, the driver confused the two instances.
  2. While trying to update the full dataset with a simple transformation 
(again via the Python driver), both single-node and clustered Cassandra run out 
of memory no matter what settings I try, even if I put a lot of sleeps into the 
mix. However, simpler transformations (updating just one column, especially 
when there is a lot of processing overhead) work just fine.
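
To make the two cases concrete, here is a minimal sketch of the general pattern
(not my exact code). It assumes the DataStax Python driver (cassandra-driver);
the keyspace "nlp", the table "docs", its columns, the contact points, and the
function names are all hypothetical. Each cluster gets its own Cluster object
and session, reads are paged through the session's default_fetch_size, and
writes go through execute_async with a cap on how many are in flight at once.

```python
def update_all(read_session, write_session, transform,
               page_size=500, max_in_flight=64):
    """Stream rows from read_session, write transformed values via write_session.

    Returns the number of rows written. Table and column names are hypothetical.
    """
    # The driver fetches result pages of this size lazily while we iterate.
    read_session.default_fetch_size = page_size
    update = write_session.prepare("UPDATE docs SET tokens = ? WHERE id = ?")

    pending = []   # futures for writes that have not been confirmed yet
    written = 0
    for row in read_session.execute("SELECT id, text FROM docs"):
        params = (transform(row.text), row.id)
        pending.append(write_session.execute_async(update, params))
        if len(pending) >= max_in_flight:
            # Wait for the whole batch before issuing more writes, so the
            # client never has more than max_in_flight requests outstanding.
            for future in pending:
                future.result()
            written += len(pending)
            pending = []

    # Drain whatever is left after the last full batch.
    for future in pending:
        future.result()
    written += len(pending)
    return written


def main():
    # Third-party import, kept local so the function above has no dependency.
    from cassandra.cluster import Cluster

    source = Cluster(["10.0.0.1"])   # hypothetical source cluster address
    target = Cluster(["10.0.1.1"])   # hypothetical target cluster address
    try:
        count = update_all(source.connect("nlp"), target.connect("nlp"),
                           str.split)
        print("rows written:", count)
    finally:
        source.shutdown()
        target.shutdown()
```

Passing the same session for both arguments would give the in-place update
from #2; passing sessions from two separate Cluster objects gives the copy
from #1. Calling main() needs two reachable clusters. The page size and
in-flight cap are guesses, not tuned values.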

I’m really concerned about #2, since we’re moving all heavy processing to a 
Spark cluster and will expand it, so I expect much heavier traffic to/from 
Cassandra. Any hints, war stories, etc. would be much appreciated!

Thank you,
Pavel Velikhov
