Thanks Stefania for the informative answer. This blog was pretty useful as well: http://www.datastax.com/dev/blog/how-we-optimized-cassandra-cqlsh-copy-from . I'll upgrade to 3.0.5, test with C extensions enabled, and report back on this thread.
On Sat, Apr 23, 2016 at 8:54 AM, Stefania Alborghetti <stefania.alborghe...@datastax.com> wrote:

> Hi Bhuvan
>
> Support for large datasets in COPY FROM was added by CASSANDRA-11053
> <https://issues.apache.org/jira/browse/CASSANDRA-11053>, which is
> available in 2.1.14, 2.2.6, 3.0.5, and 3.5. Your scenario is valid with
> this patch applied.
>
> The 3.0.x and 3.x releases are already available, whilst the other two
> releases are due in the next few days. You only need to install an
> up-to-date release on the machine where COPY FROM is running.
>
> You may find the setup instructions in this blog
> <http://www.datastax.com/dev/blog/six-parameters-affecting-cqlsh-copy-from-performance>
> interesting. Specifically, for large datasets, I would highly recommend
> installing the Python driver with C extensions, as it will speed things up
> considerably. Again, this is only possible with the 11053 patch. Please
> ignore the suggestion to also compile the cqlsh copy module itself with
> C extensions (Cython), as you may hit CASSANDRA-11574
> <https://issues.apache.org/jira/browse/CASSANDRA-11574> in the 3.0.5 and
> 3.5 releases.
>
> Before CASSANDRA-11053, the parent process was a bottleneck. This is
> explained further in this blog
> <http://www.datastax.com/dev/blog/how-we-optimized-cassandra-cqlsh-copy-from>,
> second paragraph of the "worker processes" section. As a workaround, if
> you are unable to upgrade, you may try reducing the INGESTRATE and
> introducing a few extra worker processes via NUMPROCESSES. Also, the
> parent process is overloaded and is therefore not able to report progress
> correctly, so if the progress report is frozen, it doesn't mean the COPY
> operation is not making progress.
>
> Do let us know if you still have problems, as this is new functionality.
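For reference, the workaround Stefania describes could look something like the following in cqlsh. The keyspace, table, file path, and option values here are placeholders, not a recommendation; defaults for INGESTRATE and NUMPROCESSES vary by version, so treat the numbers as purely illustrative:

```sql
-- Illustrative sketch: throttle ingest and add worker processes
-- (myks.mytable and /data/big.csv are hypothetical)
COPY myks.mytable FROM '/data/big.csv'
WITH HEADER = true
 AND INGESTRATE = 20000
 AND NUMPROCESSES = 12;
```

Lowering INGESTRATE reduces pressure on the (pre-11053) parent process, while extra worker processes keep throughput reasonable.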
>
> With best regards,
> Stefania
>
> On Sat, Apr 23, 2016 at 6:34 AM, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to copy a 20 GB CSV file into a fresh 3-node Cassandra
>> cluster with 32 GB memory each, sufficient disk, RF=1, and durable
>> writes disabled. The machine I'm feeding from is external to the
>> cluster, shares a 1 Gbps line, and has 16 GB RAM. (We have chosen this
>> setup to possibly reduce CPU and I/O usage.)
>>
>> I'm trying to use the COPY command to feed in data. It kicks off well,
>> launches a set of processes, and does about 50,000 rows per second. But
>> I can see that the parent process keeps accumulating memory almost
>> equal to the size of the data processed, and after a point the
>> processes just hang. The parent process was consuming 95% of system
>> memory when it had processed around 60% of the data.
>>
>> I had earlier tried to feed in data from multiple files (less than 4 GB
>> each) and it was working as expected.
>>
>> Is this a valid scenario?
>>
>> Regards,
>> Bhuvan
>
> --
>
> Stefania Alborghetti
>
> Apache Cassandra Software Engineer
>
> +852 6114 9265 | stefania.alborghe...@datastax.com