Thanks, Stefania, for the informative answer. The next blog post was pretty
useful as well:
http://www.datastax.com/dev/blog/how-we-optimized-cassandra-cqlsh-copy-from
I'll upgrade to 3.0.5, test with C extensions enabled, and report back on
this thread.

On Sat, Apr 23, 2016 at 8:54 AM, Stefania Alborghetti <
stefania.alborghe...@datastax.com> wrote:

> Hi Bhuvan
>
> Support for large datasets in COPY FROM was added by CASSANDRA-11053
> <https://issues.apache.org/jira/browse/CASSANDRA-11053>, which is
> available in 2.1.14, 2.2.6, 3.0.5 and 3.5. Your scenario is valid with this
> patch applied.
>
> The 3.0.x and 3.x releases are already available, whilst the other two
> releases are due in the next few days. You only need to install an
> up-to-date release on the machine where COPY FROM is running.
>
> You may find the setup instructions in this blog
> <http://www.datastax.com/dev/blog/six-parameters-affecting-cqlsh-copy-from-performance>
> interesting. Specifically, for large datasets, I would highly recommend
> installing the Python driver with C extensions, as it will speed things up
> considerably. Again, this is only possible with the 11053 patch. Please
> ignore the suggestion to also compile the cqlsh copy module itself with C
> extensions (Cython), as you may hit CASSANDRA-11574
> <https://issues.apache.org/jira/browse/CASSANDRA-11574> in the 3.0.5 and
> 3.5 releases.
>
> Before CASSANDRA-11053, the parent process was a bottleneck. This is
> explained further in this blog
> <http://www.datastax.com/dev/blog/how-we-optimized-cassandra-cqlsh-copy-from>,
> second paragraph in the "worker processes" section. As a workaround, if you
> are unable to upgrade, you may try reducing the INGESTRATE and introducing
> a few extra worker processes via NUMPROCESSES. Also, because the parent
> process is overloaded, it cannot report progress correctly, so a frozen
> progress report doesn't necessarily mean that the COPY operation has
> stopped making progress.
>
> Do let us know if you still have problems, as this is new functionality.
>
> With best regards,
> Stefania
>
>
> On Sat, Apr 23, 2016 at 6:34 AM, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to copy a 20 GB CSV file into a fresh 3-node Cassandra cluster
>> with 32 GB of memory per node, sufficient disk, RF=1 and durable writes
>> disabled. The machine I'm feeding the data from is external to the
>> cluster, shares a 1GBps line and has 16 GB RAM. (We chose this setup to
>> possibly reduce CPU and IO usage.)
>>
>> I'm trying to use the COPY command to feed in the data. It kicks off well,
>> launches a set of processes and does about 50,000 rows per second. But I
>> can see that the parent process keeps accumulating memory, almost equal to
>> the size of the data processed, and after a point the processes just hang.
>> The parent process was consuming 95% of system memory when it had
>> processed around 60% of the data.
>>
>> I had earlier tried to feed in data from multiple files (less than 4 GB
>> each) and it was working as expected.
>>
>> Is it a valid scenario?
>>
>> Regards,
>> Bhuvan
>>
>
>
>
> --
> Stefania Alborghetti
>
> Apache Cassandra Software Engineer
>
> |+852 6114 9265| stefania.alborghe...@datastax.com

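For anyone who cannot upgrade, the workaround quoted above (lowering
INGESTRATE and adding worker processes via NUMPROCESSES) might be scripted
roughly as follows. This is only a sketch: the keyspace, table, file path,
host and both option values are hypothetical placeholders, the helper names
are made up for illustration, and suitable values depend on your hardware.

```python
import shlex


def build_copy_command(keyspace, table, csv_path, ingest_rate, num_processes):
    """Return a cqlsh COPY FROM statement with explicit throttling options.

    INGESTRATE and NUMPROCESSES are the two options named in the thread;
    the values passed in are the caller's guess, not recommended settings.
    """
    if ingest_rate <= 0 or num_processes <= 0:
        raise ValueError("ingest_rate and num_processes must be positive")
    # Assumption: the path contains no single quotes, so it can be embedded
    # in the statement without any escaping.
    if "'" in csv_path:
        raise ValueError("csv_path must not contain single quotes")
    return (
        f"COPY {keyspace}.{table} FROM '{csv_path}' "
        f"WITH INGESTRATE={ingest_rate} AND NUMPROCESSES={num_processes};"
    )


def build_cqlsh_argv(host, statement):
    """Return an argv list that runs one statement through `cqlsh -e`."""
    return ["cqlsh", host, "-e", statement]


if __name__ == "__main__":
    # Hypothetical names and values, for illustration only.
    stmt = build_copy_command("my_ks", "my_table", "/data/big.csv", 50000, 8)
    argv = build_cqlsh_argv("127.0.0.1", stmt)
    # Print the shell command instead of running it, so it can be reviewed.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The script only prints the command, so it can be checked before it is run on
the machine that holds the CSV file.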