[ https://issues.apache.org/jira/browse/CASSANDRA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082198#comment-13082198 ]

Stu Hood commented on CASSANDRA-3003:
-------------------------------------

Oof... I don't know how I missed this one in review: very, very sorry 
Yuki/Sylvain.

Perhaps we can use this as an opportunity to switch to using only 
PrecompactedRow (for narrow rows which might go to cache) or EchoedRow (for 
wide rows, which go directly to disk)?
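For illustration only, the narrow/wide split could be keyed off the serialized row size, roughly like this (the class, enum, and threshold names here are made up for the sketch, not Cassandra's actual API):

```java
// Hypothetical sketch: choose a row-merge strategy by serialized size,
// mirroring the PrecompactedRow / EchoedRow split described above.
public class RowStrategyChooser {
    enum Strategy { PRECOMPACTED, ECHOED }

    static Strategy choose(long serializedRowSize, long inMemoryLimit) {
        // Narrow rows are materialized in memory (and may populate the row
        // cache); wide rows are echoed straight to disk without
        // deserializing their columns.
        return serializedRowSize <= inMemoryLimit ? Strategy.PRECOMPACTED
                                                  : Strategy.ECHOED;
    }
}
```

The appeal of this split is that the expensive in-memory path is only ever taken for rows that are small enough to be cacheable anyway.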

In order to use EchoedRow, we'd have to move where we do CounterContext 
cleanup: I've suggested in the past that it could be done at read time if we 
added "fromRemote" as a field in the metadata of an SSTable. Every 
SSTable*Iterator would be affected, because they'd need to respect the 
fromRemote field.
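To make the idea concrete, a read-time cleanup keyed off a per-SSTable "fromRemote" flag might look like the sketch below. Everything here is illustrative: the metadata class, the flag-byte layout, and clearDeltas() are stand-ins, not Cassandra's CounterContext API.

```java
// Hypothetical sketch of read-time counter-context cleanup driven by a
// per-SSTable "fromRemote" metadata flag.
public class ReadTimeCleanup {
    static class SSTableMetadata {
        final boolean fromRemote;
        SSTableMetadata(boolean fromRemote) { this.fromRemote = fromRemote; }
    }

    // Stand-in for CounterContext delta cleanup: in this sketch the first
    // byte of the context marks locally-owned deltas, so we zero it.
    static byte[] clearDeltas(byte[] context) {
        byte[] cleaned = context.clone();
        cleaned[0] = 0;
        return cleaned;
    }

    static byte[] readColumnValue(byte[] raw, SSTableMetadata meta) {
        // Only sstables received from a remote node need cleanup; locally
        // written sstables are returned untouched.
        return meta.fromRemote ? clearDeltas(raw) : raw;
    }
}
```

Every SSTable*Iterator would apply the equivalent of readColumnValue(), which is exactly why the flag has to live in the sstable metadata rather than in the stream session.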

Alternatively, we could revert 2920 and 2677 (which I would hate: this has been 
a huge cleanup).

> Trunk single-pass streaming doesn't handle large row correctly
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-3003
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3003
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sylvain Lebresne
>            Assignee: Yuki Morishita
>            Priority: Critical
>              Labels: streaming
>
> For normal column families, trunk streaming always buffers the whole row into 
> memory. It uses
> {noformat}
>   ColumnFamily.serializer().deserializeColumns(in, cf, true, true);
> {noformat}
> on the input bytes.
> We must avoid this for rows that don't fit in the inMemoryLimit.
> Note that for regular column families, for a given row, there is actually no 
> need to even recreate the bloom filter or the column index, nor to 
> deserialize the columns. It is enough to read the key and row size to feed 
> the index writer, and then simply dump the rest to disk directly. This would 
> make streaming more efficient, avoid a lot of object creation, and avoid the 
> pitfall of big rows.
> Counter column families are unfortunately trickier, because each column needs 
> to be deserialized (to mark it as 'fromRemote'). However, we don't need the 
> double pass of LazilyCompactedRow for that. We can simply use a 
> SSTableIdentityIterator and deserialize/reserialize the input as it comes.
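A rough sketch of both single-pass ideas, under assumed (not Cassandra's real) wire and row formats: a pass-through copy for regular rows, and a column-by-column rewrite for counter rows. The key/length framing, column layout, and fromRemote marker bit are all inventions for the example.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical single-pass streaming sketches.
public class PassThroughStream {
    // Regular column family: read only the key and row size (enough to feed
    // the index writer), then copy the raw row bytes without deserializing
    // columns or rebuilding the bloom filter / column index.
    static void streamRow(DataInputStream in, DataOutputStream out) throws IOException {
        String key = in.readUTF();
        long rowSize = in.readLong();
        out.writeUTF(key);
        out.writeLong(rowSize);

        byte[] buf = new byte[8192];
        long remaining = rowSize;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0)
                throw new EOFException("row body truncated");
            out.write(buf, 0, n);
            remaining -= n;
        }
    }

    // Counter column family: each column value must be touched, but still in
    // a single pass, deserializing and reserializing columns as they arrive.
    static void streamCounterRow(DataInputStream in, DataOutputStream out) throws IOException {
        String key = in.readUTF();
        int columnCount = in.readInt();
        out.writeUTF(key);
        out.writeInt(columnCount);

        for (int i = 0; i < columnCount; i++) {
            int len = in.readInt();
            byte[] value = new byte[len];
            in.readFully(value);
            value[0] |= 0x1; // hypothetical fromRemote marker bit
            out.writeInt(len);
            out.write(value);
        }
    }
}
```

Either way the row is never held in memory as a whole, which is the point: memory use stays bounded by the copy buffer (or a single column), not by the row size.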

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
