[ https://issues.apache.org/jira/browse/CASSANDRA-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xiangdong Huang updated CASSANDRA-13446:
----------------------------------------
Description:
I want to use CQLSSTableWriter to load large amounts of data as SSTables;
however, the CPU cost is high and the speed is poor.
```
// SCHEMA, INSERT_STMT, and j are defined elsewhere in the attached test program.
CQLSSTableWriter writer = CQLSSTableWriter.builder()
        .inDirectory(new File("output" + j))
        .forTable(SCHEMA)
        // FIXME: with 64 everything is fine; with 128 or larger, boom!
        .withBufferSizeInMB(Integer.parseInt(System.getProperty("buffer_size_in_mb", "256")))
        .using(INSERT_STMT)
        .withPartitioner(new Murmur3Partitioner())
        .build();
```
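For context, here is a minimal, self-contained sketch of the shape of such a loader (my reconstruction, not the attached program: the class name, schema, insert statement, and generated rows are illustrative placeholders):
```
import java.io.File;
import java.io.IOException;

import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class BulkWriteSketch
{
    // Hypothetical schema and insert statement, standing in for the real ones.
    static final String SCHEMA =
        "CREATE TABLE demo.sensor (id text, time bigint, value double, "
      + "PRIMARY KEY (id, time))";
    static final String INSERT_STMT =
        "INSERT INTO demo.sensor (id, time, value) VALUES (?, ?, ?)";

    public static void main(String[] args) throws IOException
    {
        File dir = new File("output0");
        dir.mkdirs(); // the output directory must exist before build()

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
            .inDirectory(dir)
            .forTable(SCHEMA)
            .withBufferSizeInMB(64) // 64 behaves; larger values trigger the reported hang
            .using(INSERT_STMT)
            .withPartitioner(new Murmur3Partitioner())
            .build();

        // Rows accumulate in an in-memory buffer; when the buffer threshold is
        // reached, the writer flushes it to a new SSTable in the output directory.
        for (long t = 0; t < 1_000_000L; t++)
        {
            writer.addRow("sensor-1", t, Math.random());
        }

        writer.close(); // flushes the final partial buffer
    }
}
```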
If `buffer_size_in_mb` is 64 or less on my PC, everything is fine: CPU
utilization is about 60% and memory usage is about 3GB (why 3GB? Luckily, I can
bear that...). The process writes SSTables one by one, each about 24MB (I think
this is because SSTables compress the data).
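Note that the snippet reads the buffer size from the `buffer_size_in_mb` system property (defaulting to 256), so the working and the failing configurations can be reproduced with the same binary simply by launching the JVM with `-Dbuffer_size_in_mb=64` (fine, ~24MB SSTables appear one by one) versus `-Dbuffer_size_in_mb=128` (hangs at the first flush, as described below).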
However, if `buffer_size_in_mb` is larger, e.g. 128 on my PC, CPU utilization
is about 70% and memory is still about 3GB. Once CQLSSTableWriter has received
128MB of data, it begins to flush the data as an SSTable, and that is when
things go wrong:
CQLSSTableWriter.addRow() becomes very slow, and NO SSTABLE IS WRITTEN. Windows
Task Manager shows 0.0 MB/s disk I/O for the process, and no file appears in
the output folder (occasionally a zero-KB _mc-1-big-Data.db_ and a zero-KB
_mc-1-big-Index.db_ show up, and a transaction log file comes and goes).
Meanwhile the process consumes 99% CPU, and memory grows slightly beyond 3GB.
A long time later, the process crashes with a "GC overhead..." error, and there
is still no SSTable file built.
Profiling with JProfiler 10 shows that CQLSSTableWriter.addRow() takes about
99% of the CPU, while the other threads (Thread-1 and ScheduledTasks) are
waiting.
I have no idea how to optimize this, because Cassandra's SSTable writing
process is quite complex.
The important point is that a 64MB buffer is too small for production
environments: it produces many 24MB SSTables, whereas we want one large SSTable
that can hold all the data from the batch load.
Now I wonder whether Spark and MapReduce work well with Cassandra, because a
glance at their source code shows that they also use CQLSSTableWriter to save
output data.
The Cassandra version is 3.10. The DataStax driver (for type codecs) is 3.2.0.
The attachments are my test program and the CSV data.
The complete test program is available at:
https://bitbucket.org/jixuan1989/csv2sstable
Update:
I found a similar issue on Stack Overflow, but with no good solution:
http://stackoverflow.com/questions/28506947/loading-large-row-data-into-cassandra-using-java-and-cqlsstablewriter
> CQLSSTableWriter takes 100% CPU when the buffer_size_in_mb is larger than 64MB
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-13446
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13446
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Environment: Windows 10, 8GB memory, i7 CPU
> Reporter: xiangdong Huang
> Attachments: csv2sstable.java, pom.xml, test.csv
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)