Hi Carlos,

  I tried on a single node and on a 4-node cluster. On the 4-node cluster I set
up the tables with a replication factor of 2.
I usually iterate over a subset, but right now it can be about 40% of the
dataset. Some of my column values can be quite big… I remember that when I was
exporting to CSV I had to raise the default maximum CSV column length.
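
For reference, assuming it was Python's csv module doing the export, the fix
looks roughly like this:

    import csv

    # The csv module rejects any field longer than its limit
    # (default 131072 characters), so raise it before reading
    # rows that contain very large column values.
    csv.field_size_limit(10 * 1024 * 1024)  # e.g. 10 MB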

If I just update, there are no problems; it's reading and updating together
that kills everything (could it have something to do with the driver?)
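
For concreteness, the read-and-update loop is roughly the following (a
simplified sketch with hypothetical keyspace/table/column names; transform()
stands in for the real per-record processing). Setting fetch_size on the
statement is supposed to keep the driver from buffering the whole result set:

    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['127.0.0.1'])     # hypothetical contact point
    session = cluster.connect('nlp')     # hypothetical keyspace

    def transform(text):
        # stand-in for the real per-record transformation
        return text.upper()

    # fetch_size caps how many rows the driver buffers per page, so the
    # full table is never held in client memory at once.
    select = SimpleStatement('SELECT id, body FROM documents', fetch_size=500)
    update = session.prepare('UPDATE documents SET body = ? WHERE id = ?')

    for row in session.execute(select):  # pages are fetched transparently
        session.execute(update, (transform(row.body), row.id))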

I’m using the 2.0.8 release right now.

I was also trying to tweak memory sizes. If I give Cassandra too much memory
(>8 or >16 GB), it dies much faster because GC can't keep up. But in the
single-node case it consistently dies on a specific row…
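
Would bounding the number of in-flight async writes help more than sleeps? A
sketch of what I mean (reusing the session and update statement from the loop
above):

    from threading import Semaphore

    in_flight = Semaphore(32)   # allow at most 32 concurrent updates

    def release(_):
        in_flight.release()

    def write_async(body, key):
        in_flight.acquire()                     # blocks once 32 are pending
        future = session.execute_async(update, (body, key))
        future.add_callbacks(release, release)  # free a slot, success or error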

Is this enough info to point me somewhere?

Thank you,
Pavel

> On Feb 11, 2015, at 1:48 PM, Carlos Rolo <r...@pythian.com> wrote:
> 
> Hello Pavel,
> 
> What is the size of the cluster (# of nodes)? And do you need to iterate over
> the full 1TB every time you do the update? Or just parts of it?
> 
> IMO the information is too limited to make any kind of assessment of the
> problem you are having.
> 
> I can suggest trying a 2.0.x (or 2.1.1) release to see if you get the same
> problem.
> 
> Regards,
> 
> Carlos Juzarte Rolo
> Cassandra Consultant
>  
> Pythian - Love your data
> 
> rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
> Tel: 1649
> www.pythian.com
> On Wed, Feb 11, 2015 at 11:22 AM, Pavel Velikhov <pavel.velik...@gmail.com> wrote:
> Hi,
> 
>   I’m using Cassandra to store NLP data. The dataset is not that huge (about
> 1TB), but I need to iterate over it quite frequently, updating the full
> dataset (each record, but not necessarily every column).
> 
>   I’ve run into two problems (I’m using the latest Cassandra):
> 
>   1. I was trying to copy from one Cassandra cluster to another via the Python
> driver; however, the driver confused the two instances.
>   2. While trying to update the full dataset with a simple transformation
> (again via the Python driver), both single-node and clustered Cassandra run
> out of memory no matter what settings I try, even if I put a lot of sleeps
> into the mix. However, simpler transformations (updating just one column,
> especially when there is a lot of processing overhead) work just fine.
> 
> I’m really concerned about #2, since we’re moving all the heavy processing to
> a Spark cluster and will expand it, and I would expect much heavier traffic
> to/from Cassandra. Any hints, war stories, etc. are very much appreciated!
> 
> Thank you,
> Pavel Velikhov
> 
> 