Hi Carlos,

I tried on a single node and on a 4-node cluster. On the 4-node cluster I set up the tables with replication factor = 2. I usually iterate over a subset, but right now it can be about 40% of the data. Some of my column values can be quite big… I remember when exporting to CSV I had to raise the default CSV max column length.
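For reference, the CSV column-length tweak I mentioned is just raising Python's `csv` field size limit before parsing the export. A minimal sketch (the 200 KB value is made up to stand in for one of my large columns):

```python
import csv
import io

# The csv module refuses fields longer than 131072 bytes by default;
# large text columns from a Cassandra export can easily exceed that.
big_value = "x" * 200_000

# Raise the cap well above our largest column before parsing.
csv.field_size_limit(10**8)

rows = list(csv.reader(io.StringIO(big_value + ",second\n")))
print(len(rows[0]))  # 2 columns parsed without a csv.Error
```

Without the `field_size_limit` call, parsing that first field raises `csv.Error: field larger than field limit`.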
If I just update, there are no problems; it's reading and updating together that kills everything (could it have something to do with the driver?). I'm using the 2.0.8 release right now. I was trying to tweak memory sizes: if I give Cassandra too much memory (>8 or >16 GB) it dies much faster because GC can't keep up. But in the single-instance case it consistently dies on a specific row… Is this enough info to point me somewhere?

Thank you,
Pavel

> On Feb 11, 2015, at 1:48 PM, Carlos Rolo <r...@pythian.com> wrote:
>
> Hello Pavel,
>
> What is the size of the cluster (# of nodes)? And do you need to iterate over
> the full 1TB every time you do the update, or just parts of it?
>
> IMO the information is too short to make any kind of assessment of the
> problem you are having.
>
> I can suggest trying a 2.0.x (or 2.1.1) release to see if you get the same
> problem.
>
> Regards,
>
> Carlos Juzarte Rolo
> Cassandra Consultant
>
> Pythian - Love your data
>
> rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
> Tel: 1649
> www.pythian.com
>
> On Wed, Feb 11, 2015 at 11:22 AM, Pavel Velikhov <pavel.velik...@gmail.com> wrote:
>
> Hi,
>
> I'm using Cassandra to store NLP data. The dataset is not that huge (about
> 1TB), but I need to iterate over it quite frequently, updating the full
> dataset (each record, but not necessarily each column).
>
> I've run into two problems (I'm using the latest Cassandra):
>
> 1. I was trying to copy from one Cassandra cluster to another via the Python
> driver, but the driver confused the two instances.
> 2. While trying to update the full dataset with a simple transformation
> (again via the Python driver), both single-node and clustered Cassandra run
> out of memory no matter what settings I try, even if I put a lot of sleeps
> into the mix.
> However, simpler transformations (updating just one column, especially
> when there is a lot of processing overhead) work just fine.
>
> I'm really concerned about #2, since we're moving all heavy processing to a
> Spark cluster and will expand it, and I would expect much heavier traffic
> to/from Cassandra. Any hints, war stories, etc. very much appreciated!
>
> Thank you,
> Pavel Velikhov
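Regarding #2: the pattern that usually keeps memory bounded for a full read-and-update pass is to page through the table with a small fetch size and flush updates in small batches, rather than materializing the whole result set or one giant batch. A rough sketch of the shape of that loop, with the driver calls replaced by a hypothetical in-memory stub so only the control flow is shown (the table, page size, and batch size are made-up values, not recommendations):

```python
from typing import Iterator, List, Tuple

PAGE_SIZE = 500    # small fetch size keeps each page's footprint bounded
BATCH_SIZE = 50    # flush updates in small groups, not one huge batch

def paged_rows(total: int, page_size: int) -> Iterator[List[Tuple[int, str]]]:
    """Stand-in for a driver streaming rows page by page
    (e.g. iterating a SELECT issued with a small fetch_size)."""
    page: List[Tuple[int, str]] = []
    for key in range(total):
        page.append((key, f"value-{key}"))
        if len(page) == page_size:
            yield page
            page = []
    if page:
        yield page

def transform(value: str) -> str:
    """The per-row transformation (here just a marker suffix)."""
    return value + "/processed"

def run_update(total_rows: int) -> int:
    """Read-and-update loop: never holds more than one page plus one
    pending batch of updates in memory at a time."""
    written = 0
    pending: List[Tuple[int, str]] = []
    for page in paged_rows(total_rows, PAGE_SIZE):
        for key, value in page:
            pending.append((key, transform(value)))
            if len(pending) == BATCH_SIZE:
                written += len(pending)   # stand-in for executing the batch
                pending.clear()
    written += len(pending)               # flush the final partial batch
    pending.clear()
    return written

print(run_update(1234))  # 1234 rows updated
```

With the real driver, `paged_rows` would come from iterating a statement executed with a small `fetch_size`, and the batch flush would be a group of UPDATE executions. The point is that the client's memory stays proportional to PAGE_SIZE + BATCH_SIZE instead of the table size; whether that also relieves pressure on the server side for your workload is something I can't say from here.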