[
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922659#action_12922659
]
Kelvin Kakugawa commented on CASSANDRA-1608:
--------------------------------------------
An alternative strategy that dovetails w/ the above proposal (fixed-size SSTs
and keeping track of SST co-access stats):
Lazily coordinate timestamps across the cluster by maintaining increasing
timestamps across SSTs on each replica.
If we can enforce increasing timestamps across SSTs, then lookups will be
cheaper. Instead of doing a slice (under the hood) for each lookup, we only
need to read from the newest SST where the key+column was written, based on
BFs (FPs notwithstanding). Slices will not be improved by this scheme.
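A rough sketch of the lookup path this would enable (illustrative names only,
not actual Cassandra classes; it assumes the SSTs are held newest-first and
their timestamp ranges do not overlap):
{code:java}
import java.util.List;

public class MonotonicLookup {

    /** Hypothetical view of an SST; method names are illustrative only. */
    interface SSTable {
        boolean bloomFilterMightContain(String key); // may return false positives
        byte[] read(String key, String column);      // null if key+column not present
    }

    /**
     * Assumes newestFirst is ordered newest-to-oldest and that timestamps do not
     * overlap across SSTs (the invariant proposed above). The newest SST that
     * actually contains the key+column wins; older SSTs never need to be read,
     * modulo BF false positives, which each cost only one wasted read.
     */
    static byte[] lookup(List<SSTable> newestFirst, String key, String column) {
        for (SSTable sst : newestFirst) {
            if (!sst.bloomFilterMightContain(key))
                continue;                      // definite miss, skip the disk read
            byte[] value = sst.read(key, column);
            if (value != null)
                return value;                  // newest write for this key+column
            // else: BF false positive, fall through to the next (older) SST
        }
        return null;
    }
}
{code}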
A way to implement this would be to:
1) on a CL.ONE write, always write to a lead replica, and
2) on a CL.QUORUM write, a replica will reject a write w/ a timestamp less than
the highest timestamp in that CF's last SST (may need to be MT); the client will
then need to re-submit the write w/ the appropriate timestamp offset (sketched
below).
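e.g., the replica-side check for 2) might look something like this
(WriteRejectedException and the other names are invented for illustration, not
existing Cassandra APIs):
{code:java}
import java.util.concurrent.atomic.AtomicLong;

public class MonotonicWriteGuard {

    /** Highest timestamp in this CF's most recently flushed SST. */
    private final AtomicLong lastSSTableMaxTimestamp = new AtomicLong(Long.MIN_VALUE);

    static class WriteRejectedException extends Exception {
        final long requiredMinTimestamp;
        WriteRejectedException(long requiredMinTimestamp) {
            super("timestamp too low; re-submit w/ ts > " + requiredMinTimestamp);
            this.requiredMinTimestamp = requiredMinTimestamp;
        }
    }

    /** Called on the CL.QUORUM write path before applying a mutation. */
    void checkTimestamp(long mutationTimestamp) throws WriteRejectedException {
        long floor = lastSSTableMaxTimestamp.get();
        if (mutationTimestamp <= floor)
            throw new WriteRejectedException(floor); // client re-submits w/ an offset
    }

    /** Called after a flush so the floor tracks the newest SST. */
    void onFlush(long flushedMaxTimestamp) {
        long current;
        do {
            current = lastSSTableMaxTimestamp.get();
            if (flushedMaxTimestamp <= current)
                return; // a newer flush already raised the floor
        } while (!lastSSTableMaxTimestamp.compareAndSet(current, flushedMaxTimestamp));
    }
}
{code}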
One consideration is AES and other SST streaming operations across replicas.
e.g., streamed SSTs will need to be lined up by the min/max timestamps of the
SSTs; if a streamed SST overlaps, then we may need to either:
1) re-partition/compact the overlapping SSTs, or
2) lazily compact the overlapping SSTs and absorb the lookup cost in the
meantime.
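The overlap check itself is cheap if each SST records its min/max timestamp in
its metadata (TimestampRange is a made-up helper for illustration, not an
existing class):
{code:java}
import java.util.List;

public class StreamedSSTableCheck {

    /** Hypothetical per-SST timestamp range carried in the SST metadata. */
    static class TimestampRange {
        final long min, max;
        TimestampRange(long min, long max) { this.min = min; this.max = max; }

        boolean overlaps(TimestampRange other) {
            return this.min <= other.max && other.min <= this.max;
        }
    }

    /** True if a streamed SST's timestamps overlap any local SST's. */
    static boolean overlapsExisting(TimestampRange incoming, List<TimestampRange> local) {
        for (TimestampRange r : local) {
            if (incoming.overlaps(r))
                return true; // invariant broken: re-partition now, or compact lazily later
        }
        return false;
    }
}
{code}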
The benefits of loosely coordinated timestamps are:
1) lookups will be measurably improved,
2) it dovetails nicely w/ fixed-size SSTs, and
3) SST co-access stats can be coarse-grained (per SST) instead of fine-grained
row-level stats.
The above proposal would align our data model more closely w/ BigTable and
HBase; i.e., lookups won't be penalized anymore.
> Redesigned Compaction
> ---------------------
>
> Key: CASSANDRA-1608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Chris Goffinet
> Fix For: 0.7.1
>
>
> After seeing the I/O issues in CASSANDRA-1470, I've been doing some more
> thinking on this subject that I wanted to lay out.
> I propose we redo the concept of how compaction works in Cassandra. At the
> moment, compaction is kicked off based on a write access pattern, not a read
> access pattern. In most cases, you want the opposite. You want to be able to
> track how well each SSTable is performing in the system. If we were to keep
> in-memory statistics for each SSTable and prioritize them based on access
> frequency and bloom filter hit/miss ratios, we could intelligently group the
> SSTables that are being read most often and schedule them for compaction. We
> could also schedule lower-priority maintenance on SSTables that are not often
> accessed.
> I also propose we limit each SSTable to a fixed size, which gives us the
> ability to better utilize our bloom filters in a predictable manner. At the
> moment, after a certain size, the bloom filters become less reliable. This
> would also allow us to group the most-accessed data. Currently an SSTable can
> grow to a point where large portions of its data might not actually be
> accessed very often.
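For reference, a rough sketch of the read-driven prioritization described above
(SSTableStats and the scoring function are invented for illustration, not an
existing implementation):
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class ReadDrivenCompactionPriority {

    /** In-memory read statistics kept per SSTable (hypothetical). */
    static class SSTableStats {
        final String name;
        final AtomicLong reads = new AtomicLong();
        final AtomicLong bloomFalsePositives = new AtomicLong();

        SSTableStats(String name) { this.name = name; }

        double falsePositiveRatio() {
            long r = reads.get();
            return r == 0 ? 0.0 : (double) bloomFalsePositives.get() / r;
        }

        /** Hotter SSTables with less reliable bloom filters score higher. */
        double score() {
            return reads.get() * (1.0 + falsePositiveRatio());
        }
    }

    /** Picks the highest-scoring SSTables as the next compaction candidates. */
    static List<SSTableStats> candidates(List<SSTableStats> all, int limit) {
        List<SSTableStats> sorted = new ArrayList<SSTableStats>(all);
        Collections.sort(sorted, new Comparator<SSTableStats>() {
            public int compare(SSTableStats a, SSTableStats b) {
                return Double.compare(b.score(), a.score()); // descending by score
            }
        });
        return sorted.subList(0, Math.min(limit, sorted.size()));
    }
}
{code}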