[ https://issues.apache.org/jira/browse/CASSANDRA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuki Morishita updated CASSANDRA-3003:
--------------------------------------

    Attachment: mylyn-context.zip

> Trunk single-pass streaming doesn't handle large row correctly
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-3003
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3003
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0
>            Reporter: Sylvain Lebresne
>            Assignee: Yuki Morishita
>            Priority: Critical
>              Labels: streaming
>             Fix For: 1.0
>
>         Attachments: 3003-v1.txt, mylyn-context.zip
>
>
> For normal column families, trunk streaming always buffers the whole row into 
> memory. It uses
> {noformat}
>   ColumnFamily.serializer().deserializeColumns(in, cf, true, true);
> {noformat}
> on the input bytes.
> We must avoid this for rows that don't fit in the inMemoryLimit.
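> As a rough sketch of why that hurts (simplified stand-in types, not the real 
> ColumnFamily/ColumnSerializer API): every column of the incoming row is 
> materialized on the heap before anything is written out, so memory use grows 
> with the row size instead of staying bounded by the stream buffer.
> {noformat}
> import java.io.DataInput;
> import java.io.DataOutput;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> 
> // Sketch only: readColumn/writeColumn stand in for the real column (de)serializer.
> class BufferedRowTransfer
> {
>     static void transferRow(DataInput in, DataOutput out, int columnCount) throws IOException
>     {
>         List<byte[]> columns = new ArrayList<byte[]>();
>         for (int i = 0; i < columnCount; i++)
>             columns.add(readColumn(in));        // whole row accumulates on the heap
>         for (byte[] column : columns)
>             writeColumn(out, column);           // only then is it written to the new sstable
>     }
> 
>     static byte[] readColumn(DataInput in) throws IOException
>     {
>         byte[] column = new byte[in.readInt()]; // hypothetical length-prefixed encoding
>         in.readFully(column);
>         return column;
>     }
> 
>     static void writeColumn(DataOutput out, byte[] column) throws IOException
>     {
>         out.writeInt(column.length);
>         out.write(column);
>     }
> }
> {noformat}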
> Note that for regular column families, for a given row, there is actually no 
> need to even recreate the bloom filter or the column index, nor to deserialize 
> the columns. It is enough to read the key and row size to feed the index 
> writer, and then simply dump the rest to disk directly. This would make 
> streaming more efficient, avoid a lot of object creation and avoid the 
> pitfall of big rows.
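> A minimal sketch of that idea, using plain java.io types; the IndexWriter 
> interface and the row header layout below are hypothetical placeholders, not 
> the actual SSTableWriter/streaming classes. Only the key and row length are 
> read, the index writer is fed from those, and the rest of the row is copied 
> to disk in fixed-size chunks so memory stays bounded regardless of row size.
> {noformat}
> import java.io.DataInput;
> import java.io.DataOutput;
> import java.io.IOException;
> 
> // Sketch only: a stand-in for whatever records (key, position) index entries.
> interface IndexWriter
> {
>     void append(byte[] key, long position) throws IOException;
> }
> 
> class StreamingRowCopy
> {
>     static void copyRow(DataInput in, DataOutput out, IndexWriter indexWriter, long positionInData)
>             throws IOException
>     {
>         // Read just enough of the row header to index it (hypothetical layout).
>         byte[] key = new byte[in.readShort()];
>         in.readFully(key);
>         long rowSize = in.readLong();
> 
>         indexWriter.append(key, positionInData);
> 
>         // Echo the header, then copy the row body in bounded chunks without
>         // deserializing columns or rebuilding the bloom filter / column index.
>         out.writeShort(key.length);
>         out.write(key);
>         out.writeLong(rowSize);
> 
>         byte[] chunk = new byte[64 * 1024];
>         long remaining = rowSize;
>         while (remaining > 0)
>         {
>             int toRead = (int) Math.min(chunk.length, remaining);
>             in.readFully(chunk, 0, toRead);
>             out.write(chunk, 0, toRead);
>             remaining -= toRead;
>         }
>     }
> }
> {noformat}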
> Counter column families are unfortunately trickier, because each column needs 
> to be deserialized (to mark it as 'fromRemote'). However, we don't need to 
> do the double pass of LazilyCompactedRow for that. We can simply use an 
> SSTableIdentityIterator and deserialize/reserialize the input as it comes.
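> A sketch of that single-pass counter handling, under the same caveat that 
> readColumn/markFromRemote/writeColumn are hypothetical placeholders for the 
> real counter column (de)serialization: each column is deserialized, cleaned 
> up, and immediately reserialized, so at most one column is held in memory at 
> a time and no second pass over the row is needed.
> {noformat}
> import java.io.DataInput;
> import java.io.DataOutput;
> import java.io.IOException;
> 
> // Sketch only: stand-ins for the real counter column handling.
> class CounterRowTransfer
> {
>     static void transferRow(DataInput in, DataOutput out, int columnCount) throws IOException
>     {
>         for (int i = 0; i < columnCount; i++)
>         {
>             byte[] column = readColumn(in);     // one column at a time, no double pass
>             writeColumn(out, markFromRemote(column));
>         }
>     }
> 
>     static byte[] readColumn(DataInput in) throws IOException
>     {
>         byte[] column = new byte[in.readInt()]; // hypothetical length-prefixed encoding
>         in.readFully(column);
>         return column;
>     }
> 
>     static byte[] markFromRemote(byte[] column)
>     {
>         return column;                          // placeholder for the 'fromRemote' cleanup
>     }
> 
>     static void writeColumn(DataOutput out, byte[] column) throws IOException
>     {
>         out.writeInt(column.length);
>         out.write(column);
>     }
> }
> {noformat}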

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
