[
https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839395#action_12839395
]
David Strauss commented on CASSANDRA-674:
-----------------------------------------
This is a good opportunity to improve get_count() performance. Currently, it is
O(n) at call-time, where n is the number of columns being counted. I discussed
the issue with Stu on IRC, and he mentioned how a "mini-merge" happens at
call-time for the SSTables storing data for a column making it difficult to
maintain counts.
Instead of counting all columns, we could maintain and use column counts in the
oldest SSTable and "repair" the relevant counts at get_count() call-time with
the changes found in the newer SSTables. That would allow calls to get_count()
to run in O(m) time, where m is the number of columns being counted in *all but
the oldest SSTable*. (Granted, m can approach n on high write volume, but m can
never exceed n.)
For stable data, this would bring get_count() to near constant-time with
performance gradually degrading depending on the number of non-oldest SSTables.
(Note: I'm probably missing a multiplier in my big-O notation for looking up
columns in older SSTables to detect intersections.)
> New SSTable Format
> ------------------
>
> Key: CASSANDRA-674
> URL: https://issues.apache.org/jira/browse/CASSANDRA-674
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Stu Hood
> Assignee: Stu Hood
> Fix For: 0.7
>
> Attachments: 674-v1.diff, perf-674-v1.txt,
> perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>
> Various tickets exist due to limitations in the SSTable file format,
> including #16, #47 and #328. Attached is a proposed design/implementation of
> a new file format for SSTables that addresses a few of these limitations. The
> implementation has a bunch of issues/fixmes, which I'll describe in the
> comments.
> The file format is described in the javadoc for the o.a.c.io.SSTableWriter
> class, but briefly:
> * Blocks are opaque (except for their header) so that they can be
> compressed. The index file contains an entry for the first key in every
> Block. Blocks contain Slices.
> * Slices are series of columns with the same parents and (deletion)
> metadata. They can be used to represent ColumnFamilies or SuperColumns (or a
> slice of columns at any other depth). A single CF can be split across
> multiple Slices, which can be split across multiple blocks.
> * Neither Slices nor Blocks have a fixed size or maximum length, but they
> each have target lengths which can be stretched and broken by very large
> columns.
> The most interesting concepts from this patch are:
> * Block compression is possible (currently using GZIP, which has one bug
> mentioned in the comments),
> * Compaction involves merging intersecting Slices from input SSTables. Since
> large rows will be broken down into multiple slices, only the portions of
> rows that intersect between tables need to be
> deserialized/merged/held-in-memory,
> * Indexes for individual rows are gone, since the global index allows random
> access to the middle of column families that span Blocks, and Slices allow
> batches of columns to be skipped within a Block.
> * Bloom filters for individual rows are gone, and the global filter contains
> ColumnKeys instead, meaning that a query for a column that doesn't exist in a
> row that does will often not need to seek to the row.
> * Metadata (deletion/gc time) and ColumnKeys (key, colname1, colname2...)
> for columns are defined recursively, so deeply nested slices are possible,
> * Slices representing a single parent (CF, SC, etc) can have different
> Metadata, meaning that a tombstone Slice from d-f could sit between Slices
> containing columns a-c and g-h. This allows for eventually consistent range
> deletes of columns.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.