[ 
https://issues.apache.org/jira/browse/CASSANDRA-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096867#comment-13096867
 ] 

Benoit Perroud commented on CASSANDRA-3122:
-------------------------------------------

Digging further into SSTableSimpleUnsortedWriter, I found another issue:

every time newRow is called, serializedSize iterates over all the columns to 
compute the size.

In my use case, each input line holds hourly values (data:h0|h1|h2|...|h23). 
For every line, I use the date of the day concatenated with the hour as the 
row key ("dateoftheday|hour"), and a composite of the hourly value and the 
data as the column name, with an empty column value ([value,data]=null). More 
concretely, my data looks like:
abc:1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2
bcd:3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4

and then for every line I call 

writer.newRow("20110804|0"), writer.addColumn(Composite(1, "abc"), empty_array), 
writer.newRow("20110804|1"), writer.addColumn(Composite(2, "abc"), empty_array), 
writer.newRow("20110804|2"), writer.addColumn(Composite(1, "abc"), empty_array), 
writer.newRow("20110804|3"), writer.addColumn(Composite(2, "abc"), empty_array), 
...

So writer.newRow() is called 24 times for every input line.

One solution could be a local class "CachedSizeColumnFamily" extending 
ColumnFamily that increments the serialized size on every addColumn and 
returns it directly when serializedSize() is called.
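
A minimal sketch of that cached-size idea follows. It is an illustration 
only: CachedSizeSketch, columnSize and recomputedSize are hypothetical names, 
the class is a stand-in rather than Cassandra's ColumnFamily, and the 
per-column size (name bytes plus value bytes) is an assumed simplification of 
the real serialized format.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a column family; NOT Cassandra's ColumnFamily.
class CachedSizeSketch {
    private final TreeMap<String, byte[]> columns = new TreeMap<>();
    private long cachedSize = 0;

    // Assumed cost model: name bytes + value bytes (real serialization adds overhead).
    private static long columnSize(String name, byte[] value) {
        return name.getBytes(StandardCharsets.UTF_8).length + value.length;
    }

    // Keep the running total up to date on every insert or overwrite.
    void addColumn(String name, byte[] value) {
        byte[] previous = columns.put(name, value);
        if (previous != null) {
            cachedSize -= columnSize(name, previous);
        }
        cachedSize += columnSize(name, value);
    }

    // O(1): returns the running total instead of walking the columns.
    long serializedSize() {
        return cachedSize;
    }

    // O(n) reference computation, kept only to check the cached value.
    long recomputedSize() {
        long total = 0;
        for (Map.Entry<String, byte[]> e : columns.entrySet()) {
            total += columnSize(e.getKey(), e.getValue());
        }
        return total;
    }

    public static void main(String[] args) {
        CachedSizeSketch cf = new CachedSizeSketch();
        cf.addColumn("1:abc", new byte[0]);
        cf.addColumn("2:abc", new byte[0]);
        System.out.println(cf.serializedSize() == cf.recomputedSize());
    }
}
```

With the running total, the size check at each newRow no longer depends on 
how many columns the row already holds.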

On the same topic, even though ConcurrentSkipListMap claims good performance 
(which holds in multi-threaded environments), I got much better results using 
a TreeMap in ColumnFamily (which also avoids the putIfAbsent call on the 
ConcurrentSkipListMap). In bulk loading, SSTableSimpleUnsortedWriter is 
single-threaded anyway, so there is no need for a more complex and slower 
data structure like ConcurrentSkipListMap. An improvement would be to use a 
"single-threaded" ColumnFamily for bulk loading. This could be part of 
another Jira.
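
To make the contrast concrete, here is a rough sketch of the two insert 
styles. All names (SingleThreadedColumnsSketch, addConcurrent, 
addSingleThreaded) are hypothetical, columns are reduced to a name mapped to 
a timestamp, and "the higher timestamp wins" is an assumed reconciliation 
rule chosen for illustration; this is not the actual ColumnFamily code.

```java
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical illustration; column = name -> timestamp, higher timestamp wins.
class SingleThreadedColumnsSketch {

    // Thread-safe insert: putIfAbsent, then a compare-and-replace retry loop.
    static void addConcurrent(ConcurrentSkipListMap<String, Long> columns,
                              String name, long timestamp) {
        for (;;) {
            Long existing = columns.putIfAbsent(name, timestamp);
            if (existing == null || existing >= timestamp) {
                return; // inserted, or the existing column already wins
            }
            if (columns.replace(name, existing, timestamp)) {
                return; // replaced the older column atomically
            }
            // Another thread changed the entry in between; retry.
        }
    }

    // Single-threaded insert: a plain lookup followed by a plain put.
    static void addSingleThreaded(TreeMap<String, Long> columns,
                                  String name, long timestamp) {
        Long existing = columns.get(name);
        if (existing == null || existing < timestamp) {
            columns.put(name, timestamp);
        }
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<String, Long> concurrent = new ConcurrentSkipListMap<>();
        TreeMap<String, Long> single = new TreeMap<>();
        addConcurrent(concurrent, "1:abc", 5L);
        addSingleThreaded(single, "1:abc", 5L);
        System.out.println(single.equals(concurrent));
    }
}
```

Both variants end up with the same sorted contents; the single-threaded one 
simply skips the atomic retry logic, which is only safe when a single thread 
writes to the map, as in the bulk-loading case described above.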



> SSTableSimpleUnsortedWriter take long time when inserting big rows
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-3122
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3122
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.3
>            Reporter: Benoit Perroud
>            Assignee: Sylvain Lebresne
>            Priority: Minor
>             Fix For: 0.8.5
>
>         Attachments: 3122.patch, SSTableSimpleUnsortedWriter-v2.patch, 
> SSTableSimpleUnsortedWriter.patch
>
>
> In SSTableSimpleUnsortedWriter, when dealing with rows having a lot of 
> columns, if we call newRow several times (to flush data as soon as possible), 
> the time taken by the newRow() call increases non-linearly. This is because 
> each time newRow is called, the new CF is merged into the existing, 
> ever-growing CF.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
