SSTableSimpleUnsortedWriter take long time when inserting big rows

Benoit Perroud Fri, 02 Sep 2011 01:30:10 -0700

Hi All,

I started using SSTableSimpleUnsortedWriter to load data. My data
has only a few rows, but a lot of columns in each row.


I call SSTableSimpleUnsortedWriter.newRow after every 10,000 columns inserted.

However, the time taken to insert columns grows as the column family
(CF) grows. The problem is that every time newRow is called, all the
columns of the previous CF are added to the new CF.
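
To give a rough idea of the cost, here is a small counting model. It is only
a sketch: the class and method names are made up for illustration, it does not
use any Cassandra classes, and it assumes every batch carries distinct column
names for the same row key. It compares the number of columns copied when the
whole previous CF is re-added on each newRow call with the number copied when
only the smaller CF is moved.

```java
// Illustrative cost model only; it counts copied columns and does not use Cassandra classes.
class NewRowCopyCost {
    // Columns copied when every newRow call adds the whole previous CF to the new CF.
    static long copyPreviousIntoNew(long batches, long batchSize) {
        long copied = 0;
        long previousSize = 0;
        for (long b = 0; b < batches; b++) {
            copied += previousSize;     // the whole previous CF is re-added
            previousSize += batchSize;  // the merged CF now holds one more batch
        }
        return copied;
    }

    // Columns copied when the smaller CF is always added to the larger one.
    static long copySmallerIntoLarger(long batches, long batchSize) {
        long copied = 0;
        long previousSize = 0;
        for (long b = 0; b < batches; b++) {
            copied += Math.min(previousSize, batchSize); // only the smaller side moves
            previousSize += batchSize;
        }
        return copied;
    }
}
```

Under these assumptions, 100 batches of 10,000 columns copy 49,500,000 columns
in the first case and 990,000 in the second, so the first grows quadratically
with the number of batches and the second linearly.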

Attached is a small patch that checks which of the two CFs is smaller and
adds the smaller CF to the larger one.
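
The idea, independent of the Cassandra classes, can be sketched as follows.
This is a minimal stand-in, not the actual patch: a TreeMap plays the role of
the ColumnFamily, the class and method names are hypothetical, and "the newer
value wins" is an assumption standing in for Cassandra's own column
reconciliation.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the writer's per-key buffer; not Cassandra's API.
class SmallIntoLargeMerge {
    // Number of entries copied by the last merge, kept only to illustrate the cost.
    static int copied = 0;

    // Merges the older columns (previous) and the newer columns (current) of one
    // row key and returns the map that should stay in the buffer for that key.
    static TreeMap<String, String> merge(TreeMap<String, String> previous,
                                         TreeMap<String, String> current) {
        if (previous == null) {
            copied = 0;
            return current;
        }
        if (current.size() < previous.size()) {
            // Add the smaller, newer map to the larger one; newer values overwrite.
            copied = current.size();
            previous.putAll(current);
            return previous;
        }
        // Add the smaller (or equal-sized), older map to the newer one without
        // overwriting columns that the newer map already holds.
        copied = previous.size();
        for (Map.Entry<String, String> e : previous.entrySet()) {
            current.putIfAbsent(e.getKey(), e.getValue());
        }
        return current;
    }
}
```

The caller has to store the returned map back under the row key, which is what
the keys.put(key, previous) line in the patch does when previous is the larger
CF.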

Should I open a bug for that?

Thanks in advance,

Benoit

Index: src/java/org/apache/cassandra/io/sstable/SSTableSimpleUnsortedWriter.java
===================================================================
--- src/java/org/apache/cassandra/io/sstable/SSTableSimpleUnsortedWriter.java	(revision 1164377)
+++ src/java/org/apache/cassandra/io/sstable/SSTableSimpleUnsortedWriter.java	(working copy)
@@ -73,9 +73,17 @@
 
         // Note that if the row was existing already, our size estimation will be slightly off
         // since we'll be counting the key multiple times.
-        if (previous != null)
-            columnFamily.addAll(previous);
-
+        if (previous != null) {
+            // Add the smallest CF to the other one
+            if (columnFamily.getSortedColumns().size() < previous.getSortedColumns().size()) {
+                previous.addAll(columnFamily);
+                // Re-add the previous CF to the map because it has been overwritten
+                keys.put(key, previous);
+            } else {
+                columnFamily.addAll(previous);
+            }
+        }
+        
         if (currentSize > bufferSize)
             sync();
     }
