[
https://issues.apache.org/jira/browse/CASSANDRA-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096867#comment-13096867
]
Benoit Perroud commented on CASSANDRA-3122:
-------------------------------------------
Digging further into SSTableSimpleUnsortedWriter, I found another issue:
every time newRow is called, serializedSize iterates through all the columns to
compute the size.
In my use case, I have lines with hourly values (data:h0|h1|h2|...|h23). For
every line, I use the date of the day concatenated with the hour as the row key
("dateoftheday|hour"), and the hourly value combined with the data (using a
composite) as the column name ([value,data]=null). Concretely, my data looks
like:
abc:1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2
bcd:3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4
and then for every line I call
writer.newRow("20110804|0"), writer.addColumn(Composite(1, "abc"), empty_array),
writer.newRow("20110804|1"), writer.addColumn(Composite(2, "abc"), empty_array),
writer.newRow("20110804|3"), writer.addColumn(Composite(1, "abc"), empty_array),
writer.newRow("20110804|4"), writer.addColumn(Composite(2, "abc"), empty_array),
...
So writer.newRow() is called 24 times for every line.
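To make the access pattern concrete, here is a minimal sketch of how one input
line expands into 24 (row key, column name) pairs. It does not use the
Cassandra API: HourlyLineExpander, Cell and expand are hypothetical names, the
composite column name is modelled as a plain "value:data" string, and hours are
assumed to run consecutively from 0 to 23.

```java
import java.util.ArrayList;
import java.util.List;

public class HourlyLineExpander {

    /** One (row key, column name) pair; the column value is always empty. */
    public static final class Cell {
        public final String rowKey;      // "dateoftheday|hour"
        public final String columnName;  // stand-in for Composite(value, data)

        Cell(String rowKey, String columnName) {
            this.rowKey = rowKey;
            this.columnName = columnName;
        }
    }

    /** Expands a line such as "abc:1|2|...|2" into one cell per hourly value. */
    public static List<Cell> expand(String date, String line) {
        int colon = line.indexOf(':');
        String data = line.substring(0, colon);
        String[] hourly = line.substring(colon + 1).split("\\|");

        List<Cell> cells = new ArrayList<>(hourly.length);
        for (int hour = 0; hour < hourly.length; hour++) {
            // Each cell corresponds to one newRow(...) + addColumn(...) pair.
            cells.add(new Cell(date + "|" + hour, hourly[hour] + ":" + data));
        }
        return cells;
    }

    public static void main(String[] args) {
        for (Cell c : expand("20110804", "abc:1|2|1|2")) {
            System.out.println(c.rowKey + " -> " + c.columnName);
        }
    }
}
```

Because every input line touches all 24 row keys, the same rows are revisited
once per line, which is why the cost of each newRow call matters so much here.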
One solution could be to have a local class "CachedSizeColumnFamily"
extending ColumnFamily that increments the serialized size on every
addColumn and returns it directly when serializedSize() is called.
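A rough sketch of that idea, under simplifying assumptions: SimpleColumnFamily
below is a hypothetical stand-in and not Cassandra's ColumnFamily, column names
are plain strings, and the "serialized size" is a toy measure (name length plus
value length). Only the caching pattern is the point.

```java
import java.util.Map;
import java.util.TreeMap;

public class CachedSizeSketch {

    /** Hypothetical, simplified column family: column name -> value. */
    public static class SimpleColumnFamily {
        protected final Map<String, byte[]> columns = new TreeMap<>();

        public void addColumn(String name, byte[] value) {
            columns.put(name, value);
        }

        /** Walks every column on each call, so the cost grows with the row. */
        public long serializedSize() {
            long size = 0;
            for (Map.Entry<String, byte[]> e : columns.entrySet()) {
                size += columnSize(e.getKey(), e.getValue());
            }
            return size;
        }

        /** Toy size measure, assumed for illustration only. */
        static long columnSize(String name, byte[] value) {
            return name.length() + value.length;
        }
    }

    /** Keeps a running total so serializedSize() is a constant-time read. */
    public static class CachedSizeColumnFamily extends SimpleColumnFamily {
        private long cachedSize = 0;

        @Override
        public void addColumn(String name, byte[] value) {
            byte[] previous = columns.put(name, value);
            if (previous != null) {
                // Overwriting a column: remove the old contribution first.
                cachedSize -= columnSize(name, previous);
            }
            cachedSize += columnSize(name, value);
        }

        @Override
        public long serializedSize() {
            return cachedSize;
        }
    }
}
```

The subclass has to account for overwritten columns, otherwise the cached value
drifts away from what a full iteration would report.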
On the same topic, even though ConcurrentSkipListMap claims good performance
(which is true in multi-threaded environments), I got much better results
using a TreeMap in ColumnFamily (which also avoids the putIfAbsent call on
the ConcurrentSkipListMap). In bulk loading, SSTableSimpleUnsortedWriter is
single-threaded anyway, so there is no need for a more complex and slower data
structure like ConcurrentSkipListMap. An improvement would be to use a
"single-threaded" ColumnFamily for bulk loading. This could be part of another
Jira.
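As an illustration only, the two map choices can be compared side by side. The
class and method names below are hypothetical and the maps hold plain strings
rather than Cassandra columns. With a single writer thread, a plain
get-then-put on a TreeMap gives the same result as the atomic putIfAbsent on a
ConcurrentSkipListMap, without the concurrency machinery.

```java
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class SingleThreadedColumnsSketch {

    /** Thread-safe insert: atomic, returns the existing value or null. */
    public static String addConcurrent(ConcurrentSkipListMap<String, String> columns,
                                       String name, String value) {
        return columns.putIfAbsent(name, value);
    }

    /**
     * Single-threaded insert with the same contract. This is only safe when
     * exactly one thread touches the map, as assumed for bulk loading.
     */
    public static String addSingleThreaded(TreeMap<String, String> columns,
                                           String name, String value) {
        String existing = columns.get(name);
        if (existing == null) {
            columns.put(name, value);
        }
        return existing;
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<String, String> concurrent = new ConcurrentSkipListMap<>();
        TreeMap<String, String> single = new TreeMap<>();
        for (String name : new String[] {"2:abc", "1:abc", "2:abc"}) {
            addConcurrent(concurrent, name, "v");
            addSingleThreaded(single, name, "v");
        }
        // Both maps keep the columns sorted by name.
        System.out.println(concurrent.keySet());
        System.out.println(single.keySet());
    }
}
```

Both structures keep columns sorted, so swapping one for the other should not
change the order in which columns are written out; any speed difference would
need to be measured on real data.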
> SSTableSimpleUnsortedWriter take long time when inserting big rows
> ------------------------------------------------------------------
>
> Key: CASSANDRA-3122
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3122
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.8.3
> Reporter: Benoit Perroud
> Assignee: Sylvain Lebresne
> Priority: Minor
> Fix For: 0.8.5
>
> Attachments: 3122.patch, SSTableSimpleUnsortedWriter-v2.patch,
> SSTableSimpleUnsortedWriter.patch
>
>
> In SSTableSimpleUnsortedWriter, when dealing with rows having a lot of
> columns, if we call newRow several times (to flush data as soon as possible),
> the time taken by the newRow() call increases non-linearly. This is because
> each time newRow is called, we merge the ever-growing existing CF with the
> new one.