[
https://issues.apache.org/jira/browse/CASSANDRA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082800#comment-13082800
]
Stu Hood commented on CASSANDRA-3003:
-------------------------------------
bq. I really think it is not very hard to do 'inline'. We really just want to
deserialize, cleanup, reserialize. It should be super easy to add some
"CounterCleanedRow" that does that.
I'm probably missing something, but isn't the problem that this can't be done
without two passes for rows that are too large to fit in memory? And you can't
perform two passes without buffering the data somewhere? I suggested moving the
cleanup step out of streaming because then the row could be echoed to disk
without modification.
bq. It would also be less efficient, because until we have compacted the
streamed sstable, each read will have to call the cleanup over and over
This is true, but compaction is fairly likely to run soon after a big batch
of streamed files arrives, since those files will push us past the compaction
thresholds.
> Trunk single-pass streaming doesn't handle large row correctly
> --------------------------------------------------------------
>
> Key: CASSANDRA-3003
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3003
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Sylvain Lebresne
> Assignee: Yuki Morishita
> Priority: Critical
> Labels: streaming
>
> For normal column families, trunk streaming always buffers the whole row into
> memory. It uses
> {noformat}
> ColumnFamily.serializer().deserializeColumns(in, cf, true, true);
> {noformat}
> on the input bytes.
> We must avoid this for rows that don't fit in the inMemoryLimit.
> Note that for regular column families, for a given row, there is actually no
> need to even recreate the bloom filter or the column index, nor to deserialize
> the columns. It is enough to read the key and row size to feed the index
> writer, and then simply dump the rest on disk directly (sketched below). This
> would make streaming more efficient, avoid a lot of object creation and avoid
> the pitfall of big rows.
> Counter column families are unfortunately trickier, because each column needs
> to be deserialized (to mark it as 'fromRemote'). However, we don't need to
> do the double pass of LazilyCompactedRow for that. We can simply use an
> SSTableIdentityIterator and deserialize/reserialize the input as it comes
> (also sketched below).
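A minimal sketch of the pass-through idea for regular column families, assuming
a simplified row framing (a length-prefixed key, a long row size, then the raw
row bytes); the framing and the RawRowEcho/echoRow names are illustrative, not
the actual SSTable format or any Cassandra API:
{noformat}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public final class RawRowEcho
{
    // Copies one serialized row from in to out without deserializing columns.
    static void echoRow(DataInputStream in, DataOutputStream out) throws IOException
    {
        // Key with a 2-byte length prefix (simplified framing, not the real format).
        int keyLength = in.readUnsignedShort();
        byte[] key = new byte[keyLength];
        in.readFully(key);

        long rowSize = in.readLong();

        // This is the point where the key and row size would feed the index writer.
        out.writeShort(keyLength);
        out.write(key);
        out.writeLong(rowSize);

        // Echo the remaining row bytes verbatim: no column objects are created,
        // so a row larger than memory streams through a fixed-size buffer.
        byte[] buffer = new byte[64 * 1024];
        long remaining = rowSize;
        while (remaining > 0)
        {
            int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (read < 0)
                throw new IOException("unexpected end of stream");
            out.write(buffer, 0, read);
            remaining -= read;
        }
    }
}
{noformat}
Because only the key and size are parsed, the row never has to fit in memory
and no bloom filter or column index is rebuilt during streaming.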
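And a corresponding sketch of the single-pass counter cleanup: each column is
deserialized, transformed, and reserialized immediately, so no double pass and
no whole-row buffering is needed. The column framing is again simplified, and
markFromRemote is a placeholder for the real per-column counter-context
transformation:
{noformat}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public final class CounterRowCleanup
{
    // Deserializes, cleans, and reserializes counter columns one at a time,
    // so the whole row is never buffered in memory.
    static void cleanRow(DataInputStream in, DataOutputStream out, int columnCount)
            throws IOException
    {
        for (int i = 0; i < columnCount; i++)
        {
            // Simplified column framing: name length, name, timestamp, value.
            int nameLength = in.readUnsignedShort();
            byte[] name = new byte[nameLength];
            in.readFully(name);
            long timestamp = in.readLong();
            long value = in.readLong();

            out.writeShort(nameLength);
            out.write(name);
            out.writeLong(timestamp);
            // markFromRemote stands in for the real counter-context cleanup.
            out.writeLong(markFromRemote(value));
        }
    }

    private static long markFromRemote(long value)
    {
        return value; // placeholder: real counter contexts carry per-shard metadata
    }
}
{noformat}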