[ https://issues.apache.org/jira/browse/CASSANDRA-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xiangdong Huang updated CASSANDRA-13446:
----------------------------------------
Description:
I want to use CQLSSTableWriter to load large amounts of data as SSTables;
however, the CPU cost is high and the speed is poor.
```
// SCHEMA, INSERT_STMT, and j are defined elsewhere in the attached test program.
CQLSSTableWriter writer = CQLSSTableWriter.builder()
        .inDirectory(new File("output" + j))
        .forTable(SCHEMA)
        // FIXME: with 64 everything is fine; with 128 or larger, boom!
        .withBufferSizeInMB(Integer.parseInt(System.getProperty("buffer_size_in_mb", "256")))
        .using(INSERT_STMT)
        .withPartitioner(new Murmur3Partitioner())
        .build();
```
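For context, here is a minimal, self-contained sketch of the shape of such a loader (my reconstruction, not the attached program: the class name, schema, insert statement, and generated rows are illustrative placeholders):
```
import java.io.File;
import java.io.IOException;

import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class BulkWriteSketch
{
    // Hypothetical schema and insert statement, standing in for the real ones.
    static final String SCHEMA =
        "CREATE TABLE demo.sensor (id text, time bigint, value double, "
      + "PRIMARY KEY (id, time))";
    static final String INSERT_STMT =
        "INSERT INTO demo.sensor (id, time, value) VALUES (?, ?, ?)";

    public static void main(String[] args) throws IOException
    {
        File dir = new File("output0");
        dir.mkdirs(); // the output directory must exist before build()

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
            .inDirectory(dir)
            .forTable(SCHEMA)
            .withBufferSizeInMB(64) // 64 behaves; larger values trigger the reported hang
            .using(INSERT_STMT)
            .withPartitioner(new Murmur3Partitioner())
            .build();

        // Rows accumulate in an in-memory buffer; when the buffer threshold is
        // reached, the writer flushes it to a new SSTable in the output directory.
        for (long t = 0; t < 1_000_000L; t++)
        {
            writer.addRow("sensor-1", t, Math.random());
        }

        writer.close(); // flushes the final partial buffer
    }
}
```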
If `buffer_size_in_mb` is 64 or less on my PC, everything is fine: CPU
utilization is about 60% and memory usage is about 3GB (why 3GB? Luckily, I can
bear that...). The process writes SSTables one by one, each about 24MB (I think
this is because SSTables compress the data).
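Note that the snippet reads the buffer size from the `buffer_size_in_mb` system property (defaulting to 256), so the working and the failing configurations can be reproduced with the same binary simply by launching the JVM with `-Dbuffer_size_in_mb=64` (fine, ~24MB SSTables appear one by one) versus `-Dbuffer_size_in_mb=128` (hangs at the first flush, as described below).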
However, if `buffer_size_in_mb` is larger, e.g. 128 on my PC, CPU utilization
is about 70% and memory is still about 3GB. Once CQLSSTableWriter has received
128MB of data, it begins to flush the data as an SSTable, and that is when
things go wrong:
CQLSSTableWriter.addRow() becomes very slow, and NO SSTABLE IS WRITTEN. Windows
Task Manager shows 0.0 MB/s disk I/O for the process, and no file appears in
the output folder (occasionally a zero-KB _mc-1-big-Data.db_ and a zero-KB
_mc-1-big-Index.db_ show up, and a transaction log file comes and goes).
Meanwhile the process consumes 99% CPU, and memory grows slightly beyond 3GB.
A long time later, the process crashes with a "GC overhead..." error, and there
is still no SSTable file built.
Profiling with JProfiler 10 shows that CQLSSTableWriter.addRow() takes about
99% of the CPU, while the other threads (Thread-1 and ScheduledTasks) are
waiting.
I have no idea how to optimize this, because Cassandra's SSTable writing
process is quite complex.
The important point is that a 64MB buffer is too small for production
environments: it produces many 24MB SSTables, whereas we want one large SSTable
that can hold all the data from the batch load.
Now I wonder whether Spark and MapReduce work well with Cassandra, because a
glance at their source code shows that they also use CQLSSTableWriter to save
output data.
The Cassandra version is 3.10. The DataStax driver (for type codecs) is 3.2.0.
The attachments are my test program and the CSV data.
The complete test program is available at:
https://bitbucket.org/jixuan1989/csv2sstable
Update:
I found a similar issue on Stack Overflow, but with no good solution:
http://stackoverflow.com/questions/28506947/loading-large-row-data-into-cassandra-using-java-and-cqlsstablewriter
> CQLSSTableWriter takes 100% CPU when the buffer_size_in_mb is larger than 64MB
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-13446
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13446
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Environment: Windows 10, 8GB memory, i7 CPU
> Reporter: xiangdong Huang
> Attachments: csv2sstable.java, pom.xml, test.csv
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)