[
https://issues.apache.org/jira/browse/CASSANDRA-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joshua McKenzie updated CASSANDRA-13446:
----------------------------------------
Issue Type: Improvement (was: Bug)
> CQLSSTableWriter takes 100% CPU when the buffer_size_in_mb is larger than
> 64MB
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-13446
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13446
> Project: Cassandra
> Issue Type: Improvement
> Components: Tools
> Environment: Windows 10, 8GB memory, i7 CPU
> Reporter: xiangdong Huang
> Attachments: csv2sstable.java, pom.xml, test.csv
>
>
> I want to use CQLSSTableWriter to load large amounts of data as SSTables,
> but the CPU cost is high and the throughput is poor.
> ```
> CQLSSTableWriter writer = CQLSSTableWriter.builder()
>         .inDirectory(new File("output" + j))
>         .forTable(SCHEMA)
>         // FIXME!! if the size is 64 it is ok; if it is 128 or larger, boom!!
>         .withBufferSizeInMB(Integer.parseInt(System.getProperty("buffer_size_in_mb", "256")))
>         .using(INSERT_STMT)
>         .withPartitioner(new Murmur3Partitioner())
>         .build();
> ```
> If `buffer_size_in_mb` is 64MB or smaller on my PC, everything is fine: CPU
> utilization is about 60% and memory usage is about 3GB (why 3GB? Luckily, I
> can live with that...). The process writes SSTables of about 24MB each, one
> by one (I think the smaller size is because SSTables compress the data).
> However, if `buffer_size_in_mb` is larger, e.g. 128MB on my PC, CPU
> utilization is about 70% and memory usage is still about 3GB.
> Once CQLSSTableWriter has received 128MB of data, it begins to flush it as an
> SSTable, and that is when the trouble starts:
> CQLSSTableWriter.addRow() becomes very slow, and NO SSTABLE IS WRITTEN.
> Windows Task Manager shows 0.0 MB/s of disk I/O for the process. No file
> appears in the output folder (sometimes a _zero-KB
> mc-1-big-Data.db_ and a _zero-KB mc-1-big-Index.db_ show up, and a
> transaction log file appears and disappears...). At this point the process
> uses 99% CPU and memory grows slightly above 3GB...
> A long time later, the process crashes with "GC overhead...", and there is
> still no SSTable file built.
> When I use JProfiler 10 to check what is consuming the CPU, it shows
> CQLSSTableWriter.addRow() taking about 99% of the CPU, while the other
> threads (Thread-1 and ScheduledTasks) are just waiting...
> I have no idea how to optimize this, because Cassandra's SSTable writing
> process is so complex...
> The important point is that a 64MB buffer is too small for production
> environments: it produces many 24MB SSTables, whereas we want one large
> SSTable that holds all of the data from the batch load.
> Now I wonder whether Spark and MapReduce work well with Cassandra, because
> after a glance at their source code, I notice that they also use
> CQLSSTableWriter to save output data...
> The Cassandra version is 3.10. The DataStax driver (used for type codecs) is 3.2.0.
> The attachments are my test program and the CSV data.
> A complete test program can be found at:
> https://bitbucket.org/jixuan1989/csv2sstable
> Update:
> I found a similar issue on Stack Overflow, but with no good solution:
>
> http://stackoverflow.com/questions/28506947/loading-large-row-data-into-cassandra-using-java-and-cqlsstablewriter
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)