[ https://issues.apache.org/jira/browse/CASSANDRA-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054229#comment-13054229 ]

Stu Hood edited comment on CASSANDRA-2319 at 6/24/11 3:46 AM:
--------------------------------------------------------------

Attaching a version of this patch that has been rebased atop 674. Together they 
exhibit very good performance.

Regarding the key cache: in this latest revision the SSTableReader will only 
cache narrow rows (based on the number of blocks they contain). IMO, this is a 
reasonable temporary middle ground, since wide rows can be eliminated using 
only the index, and narrow rows continue to enjoy the benefits of the key 
cache. The long-term goal would still be to cache the block headers for the 
file individually.

EDIT: The fundamental change in this revision is that, in order to gain random 
access to a portion of a row, an SSTableReader.getPositions method now returns 
a collection of row headers, which are essentially equivalent to the row index 
in the existing format. This information is enough to eliminate a data file 
without seeking into it.
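
To illustrate why a collection of row headers is enough to rule out a data file, here is a minimal sketch. It is not the code from the patch: `BlockHeader`, `blocks_for_slice` and `can_eliminate` are hypothetical names, and column names are modeled as plain integers rather than Cassandra's comparator-ordered byte arrays.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockHeader:
    """Hypothetical stand-in for one row header (one row index entry)."""
    first_name: int  # smallest column name stored in the block
    last_name: int   # largest column name stored in the block
    offset: int      # position of the block in the data file


def blocks_for_slice(headers: List[BlockHeader], start: int, end: int) -> List[BlockHeader]:
    """Return the blocks whose column range overlaps the slice [start, end]."""
    return [h for h in headers if h.first_name <= end and h.last_name >= start]


def can_eliminate(headers: List[BlockHeader], start: int, end: int) -> bool:
    """True if no block overlaps the slice, so the data file need not be read."""
    return not blocks_for_slice(headers, start, end)
```

With headers covering column names 0-99 and 100-199, a slice over 500-600 overlaps no block, so the file can be skipped using the headers alone; a slice over 150-160 yields the single block to seek to.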

> Promote row index
> -----------------
>
>                 Key: CASSANDRA-2319
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2319
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>              Labels: compression, index, timeseries
>             Fix For: 1.0
>
>         Attachments: 2319-v1.tgz, 2319-v2.tgz, promotion.pdf, version-f.txt, 
> version-g-lzf.txt, version-g.txt
>
>
> The row index contains entries for configurably sized blocks of a wide row. 
> For a row of appreciable size, the row index ends up directing the third seek 
> (1. index, 2. row index, 3. content) to a position near the first column of a 
> scan.
> Since the row index is always used for wide rows, and since it contains 
> information that tells us whether or not the 3rd seek is necessary (the 
> column range or name we are trying to slice may not exist in a given 
> sstable), promoting the row index into the sstable index would allow us to 
> drop the maximum number of seeks for wide rows back to 2, and, more 
> importantly, would allow sstables to be eliminated using only the index.
> An example use case that benefits greatly from this change is time series data 
> in wide rows, where data is appended to the beginning or end of the row. Our 
> existing compaction strategy gets lucky and clusters the oldest data in the 
> oldest sstables: for queries to recently appended data, we would be able to 
> eliminate wide rows using only the sstable index, rather than needing to seek 
> into the data file to determine that it isn't interesting. For narrow rows, 
> this change would have no effect, as they will not reach the threshold for 
> indexing anyway.
> A first-cut design for this change would look very similar to the file format 
> design proposed on #674 
> (http://wiki.apache.org/cassandra/FileFormatDesignDoc): row keys clustered, 
> column names clustered, and offsets clustered and delta encoded.
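
As a rough illustration of what "offsets clustered and delta encoded" could mean, the sketch below applies a generic delta encoding to a sorted list of block offsets. It is not the layout from the design doc; `delta_encode` and `delta_decode` are hypothetical names used only for this example.

```python
from typing import List


def delta_encode(offsets: List[int]) -> List[int]:
    """Store each offset as the difference from the previous one (the first is kept as-is)."""
    deltas = []
    previous = 0
    for offset in offsets:
        deltas.append(offset - previous)
        previous = offset
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the absolute offsets by keeping a running sum of the deltas."""
    offsets = []
    running = 0
    for delta in deltas:
        running += delta
        offsets.append(running)
    return offsets
```

Because neighbouring blocks sit close together in the file, the deltas are small numbers, which a variable-length integer encoding or a general-purpose compressor can store more compactly than the absolute offsets.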

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
