[ https://issues.apache.org/jira/browse/CASSANDRA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083123#comment-13083123 ]
Sylvain Lebresne commented on CASSANDRA-3003:
---------------------------------------------
bq. Can we pad it somehow?
It's doable. Basically a context is an array of shards, with a header that is a
(variable-length) list of which of those shards are deltas. When we clean up
the deltas, we basically remove the header. We could have a specific cleanup
for streaming that just sets all the header entries to -1. But we probably want
to do that only for the cleanup during streaming, and have compaction clean
those up afterwards; otherwise it is ugly. I don't know how much easier it is
than cleaning during reads, though it avoids having to add new info to the
sstable metadata.
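For illustration, a minimal sketch of that "set all the header entries to -1"
cleanup, assuming a hypothetical layout (a short header length followed by that
many short delta indices, then the shards); the real counter context encoding
differs:
{noformat}
import java.nio.ByteBuffer;

// Hypothetical layout only: [short headerLen][headerLen shorts: delta indices][shards...]
public class ContextCleanupSketch
{
    // Streaming cleanup: keep the header in place (context length unchanged)
    // but overwrite every delta index with -1 so readers treat no shard as a
    // delta; compaction can strip the -1 entries later.
    static void clearDeltaFlags(ByteBuffer context)
    {
        int headerLen = context.getShort(context.position());
        for (int i = 0; i < headerLen; i++)
            context.putShort(context.position() + 2 + 2 * i, (short) -1);
    }

    public static void main(String[] args)
    {
        // header claims shards 0 and 2 are deltas; shard payload omitted
        ByteBuffer ctx = ByteBuffer.allocate(6);
        ctx.putShort((short) 2).putShort((short) 0).putShort((short) 2);
        ctx.flip();
        clearDeltaFlags(ctx);
        System.out.println(ctx.getShort(2) + " " + ctx.getShort(4)); // prints: -1 -1
    }
}
{noformat}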
> Trunk single-pass streaming doesn't handle large rows correctly
> ---------------------------------------------------------------
>
> Key: CASSANDRA-3003
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3003
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Sylvain Lebresne
> Assignee: Yuki Morishita
> Priority: Critical
> Labels: streaming
>
> For normal column families, trunk streaming always buffers the whole row into
> memory. It uses
> {noformat}
> ColumnFamily.serializer().deserializeColumns(in, cf, true, true);
> {noformat}
> on the input bytes.
> We must avoid this for rows that don't fit in the inMemoryLimit.
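> To make the problem concrete, a simplified sketch (invented names and wire
> format, not the actual trunk code) of what a deserializeColumns-style loop
> does; every column is materialized before anything is written out, so memory
> use grows with the size of the row:
> {noformat}
> import java.io.DataInput;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> class BufferingDeserializeSketch
> {
>     static List<byte[]> deserializeColumns(DataInput in) throws IOException
>     {
>         int count = in.readInt();
>         List<byte[]> columns = new ArrayList<byte[]>(count);
>         for (int i = 0; i < count; i++)
>         {
>             byte[] column = new byte[in.readInt()];
>             in.readFully(column);   // entire column held in memory...
>             columns.add(column);    // ...and kept for the whole row
>         }
>         return columns;
>     }
> }
> {noformat}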
> Note that for regular column families, for a given row, there is actually no
> need to even recreate the bloom filter or column index, nor to deserialize
> the columns. It is enough to read the key and row size to feed the index
> writer, and then simply dump the rest to disk directly. This would make
> streaming more efficient, avoid a lot of object creation, and avoid the
> pitfall of big rows.
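> A hedged sketch of this pass-through idea (the key-length and row-size
> encodings, and the appendToIndex hook, are invented for illustration):
> {noformat}
> import java.io.DataInputStream;
> import java.io.DataOutputStream;
> import java.io.EOFException;
> import java.io.IOException;
>
> class PassThroughStreamSketch
> {
>     // Read just the key and row size (enough to feed the index writer),
>     // then copy the remaining row bytes straight to disk: no column
>     // deserialization, no bloom filter or column index rebuild.
>     static void streamRow(DataInputStream in, DataOutputStream out) throws IOException
>     {
>         int keyLength = in.readShort() & 0xFFFF;   // assumed 2-byte key length
>         byte[] key = new byte[keyLength];
>         in.readFully(key);
>         long rowSize = in.readLong();              // assumed 8-byte row size
>
>         out.writeShort(keyLength);
>         out.write(key);
>         out.writeLong(rowSize);
>         appendToIndex(key, out.size());            // hypothetical index-writer hook
>
>         byte[] buf = new byte[4096];               // at most one buffer in memory
>         long remaining = rowSize;
>         while (remaining > 0)
>         {
>             int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
>             if (n < 0) throw new EOFException();
>             out.write(buf, 0, n);
>             remaining -= n;
>         }
>     }
>
>     static void appendToIndex(byte[] key, long position) { /* hypothetical */ }
> }
> {noformat}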
> Counter column families are unfortunately trickier, because each column needs
> to be deserialized (to mark it as 'fromRemote'). However, we don't need to do
> the double pass of LazilyCompactedRow for that. We can simply use an
> SSTableIdentityIterator and deserialize/reserialize the input as it comes.
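> A similarly hedged sketch of that single-pass counter path (the per-column
> wire format and the 'fromRemote' flag bit are invented for illustration):
> {noformat}
> import java.io.DataInputStream;
> import java.io.DataOutputStream;
> import java.io.IOException;
>
> class CounterStreamSketch
> {
>     // In the spirit of SSTableIdentityIterator: deserialize each column as
>     // it arrives, mark it as coming from a remote node, and reserialize it
>     // immediately; no LazilyCompactedRow-style double pass, and only one
>     // column is in memory at a time.
>     static void streamCounterRow(DataInputStream in, DataOutputStream out) throws IOException
>     {
>         int columnCount = in.readInt();
>         out.writeInt(columnCount);
>         for (int i = 0; i < columnCount; i++)
>         {
>             int length = in.readInt();
>             byte[] column = new byte[length];   // one column in memory at a time
>             in.readFully(column);
>             column[0] |= 0x01;                  // hypothetical 'fromRemote' flag bit
>             out.writeInt(length);
>             out.write(column);
>         }
>     }
> }
> {noformat}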