[ 
https://issues.apache.org/jira/browse/CASSANDRA-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua McKenzie updated CASSANDRA-13446:
----------------------------------------
    Issue Type: Improvement  (was: Bug)

> CQLSSTableWriter takes 100% CPU when the buffer_size_in_mb is larger than 
> 64MB 
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13446
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13446
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>         Environment: Windows 10, 8GB memory, i7 CPU
>            Reporter: xiangdong Huang
>         Attachments: csv2sstable.java, pom.xml, test.csv
>
>
> I want to use CQLSSTableWriter to load large amounts of data as SSTables; 
> however, the CPU cost and the throughput are not good.
> ```
> CQLSSTableWriter writer = CQLSSTableWriter.builder()
>         .inDirectory(new File("output" + j))
>         .forTable(SCHEMA)
>         // FIXME!! if the size is 64 it is ok; if it is 128 or larger, boom!!
>         .withBufferSizeInMB(Integer.parseInt(System.getProperty("buffer_size_in_mb", "256")))
>         .using(INSERT_STMT)
>         .withPartitioner(new Murmur3Partitioner())
>         .build();
> ```
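> For reference, this is roughly how I drive the writer (a simplified sketch; the real 
> parsing is in the attached csv2sstable.java, and the column count/types here are 
> placeholders, not the exact schema behind SCHEMA/INSERT_STMT):
> ```
> // sketch continues from the builder above; needs java.io.BufferedReader/FileReader imports
> try (BufferedReader reader = new BufferedReader(new FileReader("test.csv"))) {
>     String line;
>     while ((line = reader.readLine()) != null) {
>         String[] f = line.split(",");
>         // positional values must match the bind order of INSERT_STMT;
>         // the three columns here are placeholders, not my real schema
>         writer.addRow(f[0], Long.parseLong(f[1]), Double.parseDouble(f[2]));
>     }
> }
> writer.close(); // flushes the remaining buffered rows and finalizes the SSTable(s)
> ```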
> If the `buffer_size_in_mb` is 64MB or less on my PC, everything is OK: the 
> CPU utilization is about 60% and the memory is about 3GB (why 3GB? Luckily, I 
> can bear that...). The process writes SSTables of about 24MB each, one by one 
> (I think they are smaller than the buffer because SSTables compress the data).
> However, if the `buffer_size_in_mb` is larger, e.g., 128MB on my PC, the 
> CPU utilization is about 70% and the memory is still about 3GB.
> When the CQLSSTableWriter has received 128MB of data, it begins to flush the data as an 
> SSTable. At this point the bad thing happens:
> CQLSSTableWriter.addRow() becomes very slow, and NO SSTABLE IS WRITTEN. 
> Windows Task Manager shows the disk I/O for this process is 0.0 MB/s. No 
> file appears in the output folder (sometimes a _zero-KB 
> mc-1-big-Data.db_ and a _zero-KB mc-1-big-Index.db_ appear, and a 
> transaction log file comes and goes...). At this point, the process 
> uses 99% CPU, and the memory is a little larger than 3GB.
> A long time later, the process crashes with a "GC overhead..." error, and 
> still no SSTable file has been built.
> When I use JProfiler 10 to check what is using so much CPU, it shows that 
> CQLSSTableWriter.addRow() takes about 99% of the CPU, while the other threads 
> (Thread-1 and ScheduledTasks) are waiting.
> I have no idea how to optimize the process, because Cassandra's SSTable writing 
> process is so complex...
> The important thing is that a 64MB buffer size is too small for production 
> environments: it creates many 24MB SSTables, whereas we want a large SSTable 
> that can hold all the data of the batch load process. 
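> A rough back-of-the-envelope estimate of the fragmentation this causes (the 100GB 
> batch size here is a hypothetical figure; the ~24MB per flush is the size I observe above):
> ```
> long batchBytes  = 100L * 1024 * 1024 * 1024;  // hypothetical 100 GB of raw rows
> long bufferBytes = 64L * 1024 * 1024;          // buffer_size_in_mb = 64
> long flushes     = batchBytes / bufferBytes;   // = 1600 flushes
> // each flush produces one SSTable of roughly 24 MB (observed above),
> // so a single 100 GB batch ends up as ~1600 small SSTables
> System.out.println(flushes + " SSTables of ~24 MB each");
> ```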
> Now I wonder whether Spark and MapReduce work well with Cassandra, because 
> when I glanced at their source code, I noticed that they also use 
> CQLSSTableWriter to save output data.
> The Cassandra version is 3.10. The DataStax driver (for TypeCodec) is 3.2.0.
> The attachments are my test program and the CSV data. 
> A complete test program can be found at: 
> https://bitbucket.org/jixuan1989/csv2sstable
> Update:
> I found a similar issue on Stack Overflow, but without a good solution:
> http://stackoverflow.com/questions/28506947/loading-large-row-data-into-cassandra-using-java-and-cqlsstablewriter



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
