xiangdong Huang created CASSANDRA-13446:
-------------------------------------------

             Summary: CQLSSTableWriter takes 100% CPU when the buffer_size_in_mb is larger than 64MB
                 Key: CASSANDRA-13446
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13446
             Project: Cassandra
          Issue Type: Bug
          Components: Tools
         Environment: Windows 10, 8GB memory, i7 CPU
            Reporter: xiangdong Huang
         Attachments: csv2sstable.java, pom.xml, test.csv

I want to use CQLSSTableWriter to bulk-load large amounts of data as SSTables, 
but both the CPU cost and the throughput are poor.
```java
CQLSSTableWriter writer = CQLSSTableWriter.builder()
        .inDirectory(new File("output" + j))
        .forTable(SCHEMA)
        // FIXME: if the size is 64 it is OK; if it is 128 or larger, boom!
        .withBufferSizeInMB(Integer.parseInt(System.getProperty("buffer_size_in_mb", "256")))
        .using(INSERT_STMT)
        .withPartitioner(new Murmur3Partitioner())
        .build();
```
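For context, `SCHEMA` and `INSERT_STMT` in the snippet above are string constants. The real definitions are in the attached csv2sstable.java; a hypothetical pair (keyspace, table, and column names are assumptions for illustration only) might look like:

```java
// Hypothetical schema/insert constants for CQLSSTableWriter; the actual ones
// used in this report are defined in the attached csv2sstable.java.
public class WriterConfig {
    // Assumed keyspace and table names, not the ones from the attachment.
    static final String SCHEMA =
            "CREATE TABLE ks.csv_data (id bigint PRIMARY KEY, payload text)";
    static final String INSERT_STMT =
            "INSERT INTO ks.csv_data (id, payload) VALUES (?, ?)";

    public static void main(String[] args) {
        // Buffer size is read the same way as in the snippet above,
        // defaulting to 256 MB when the system property is unset.
        int bufferMB = Integer.parseInt(System.getProperty("buffer_size_in_mb", "256"));
        System.out.println("buffer_size_in_mb = " + bufferMB);
    }
}
```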
If `buffer_size_in_mb` is 64 MB or less on my PC, everything is fine: CPU 
utilization is about 60% and memory use is about 3 GB (why 3 GB? Luckily, I 
can live with that...). The process writes SSTables one by one, about 24 MB 
each (I think because the SSTable compresses the data).

However, if `buffer_size_in_mb` is larger, e.g. 128 MB on my PC, CPU 
utilization rises to about 70% and memory use stays around 3 GB.
When CQLSSTableWriter has received 128 MB of data, it begins to flush the data 
as an SSTable. That is when the trouble starts:
CQLSSTableWriter.addRow() becomes very slow, and NO SSTABLE IS WRITTEN. 
The Windows task manager shows 0.0 MB/s disk I/O for the process. No file 
appears in the output folder (sometimes a _zero-KB mc-1-big-Data.db_ 
and a _zero-KB mc-1-big-Index.db_ appear, and some transaction log files come 
and go...). At this point the process consumes 99% CPU, and memory grows 
a little beyond 3 GB.
A long time later, the process crashes with a "GC overhead..." error, and 
still no SSTable file has been built.

When I use JProfiler 10 to check what is using so much CPU, it reports that 
CQLSSTableWriter.addRow() accounts for about 99% of the CPU time.

I have no idea how to optimize this, because Cassandra's SSTable writing 
process is so complex...

The important point is that a 64 MB buffer is too small for production 
environments: it produces many 24 MB SSTables, whereas for a batch load we 
want one large SSTable that can hold all the data.
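To make the pain concrete, here is a back-of-the-envelope sketch using only the numbers reported above (a 64 MB buffer flushing to a ~24 MB on-disk SSTable); these figures are observations from this report, not Cassandra internals:

```java
// Estimate how many small SSTables a capped 64 MB buffer produces for a
// batch load, using the sizes observed in this report (assumptions, not
// Cassandra internals).
public class SstableCountEstimate {
    static final long BUFFER_MB = 64;           // in-memory buffer per flush
    static final long SSTABLE_ON_DISK_MB = 24;  // observed compressed size

    // One flush (and thus one SSTable) per BUFFER_MB of raw input data.
    static long estimateSstables(long inputMB) {
        return (inputMB + BUFFER_MB - 1) / BUFFER_MB; // ceiling division
    }

    public static void main(String[] args) {
        // A hypothetical 100 GB batch load: 1600 flushes of ~24 MB each.
        long files = estimateSstables(100 * 1024);
        System.out.println(files + " SSTables, ~"
                + files * SSTABLE_ON_DISK_MB + " MB on disk");
    }
}
```

With a larger working buffer, the same load would collapse into far fewer, larger SSTables, which is exactly what the batch-load use case wants.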

Now I wonder whether Spark and MapReduce work well with Cassandra, because a 
glance at their source code shows that they also use CQLSSTableWriter to save 
output data...

The Cassandra version is 3.10. The DataStax driver (used for TypeCodec) is 3.2.0.

The attachments are my test program and the CSV data. 
A complete test program is available at: 
https://bitbucket.org/jixuan1989/csv2sstable

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
