[
https://issues.apache.org/jira/browse/CASSANDRA-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123668#comment-14123668
]
Jonathan Ellis commented on CASSANDRA-7890:
-------------------------------------------
bq. I'm curious about the historical choice to order data on disk by token and
not key.
Because data is ordered by token, adding new nodes means you stream contiguous ranges.
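A toy sketch (not Cassandra code; the hash and key names are illustrative stand-ins) of why token ordering makes bootstrap streaming contiguous: a joining node owns a token range, and rows sorted on disk by token for that range form one contiguous slice, whereas the same rows are scattered in natural key order.

```python
import hashlib

def token(key: str) -> int:
    """Stand-in for Murmur3: map a key to a 64-bit token."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = [f"user{i}" for i in range(10)]

# On-disk order when sorted by token (Cassandra's current choice):
by_token = sorted(keys, key=token)

# A joining node claims tokens in [lo, hi); collect the rows it would stream.
lo, hi = sorted((token("user3"), token("user7")))
streamed = [k for k in by_token if lo <= token(k) < hi]

# In token-sorted order those rows are one contiguous slice:
start = by_token.index(streamed[0])
assert streamed == by_token[start:start + len(streamed)]
```

With natural-key ordering on disk, `streamed` would instead map to scattered file offsets, which is the bootstrap/repair cost the ticket later concedes.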
> LCS and time series data
> ------------------------
>
> Key: CASSANDRA-7890
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7890
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Dan Hendry
> Fix For: 3.0
>
>
> Consider the following very typical schema for bucketed time series data:
> {noformat}
> CREATE TABLE user_timeline (
>     ts_bucket bigint,
>     username varchar,
>     ts timeuuid,
>     data blob,
>     PRIMARY KEY ((ts_bucket, username), ts)
> );
> {noformat}
> If you have a single Cassandra node (or a cluster where RF = N) and use the
> ByteOrderedPartitioner, LCS becomes *ridiculously*, *obscenely* efficient.
> Under a typical workload where data is inserted in order, compaction IO could
> be reduced to *near zero*, since sstable ranges don't overlap (given a trivial
> change to LCS so sstables with no overlap are not rewritten when being
> promoted into the next level). Better yet, we don't _require_ ordered data
> insertion. Even if insertion order is completely random, you still get
> standard LCS performance characteristics, which are usually acceptable
> (although I believe there are a few degenerate compaction cases which are not
> handled in the current implementation). A quick benchmark using vanilla
> Cassandra 2.0.10 (i.e. no rewrite optimization) shows a *77% reduction in
> compaction IO* when switching from the Murmur3Partitioner to the
> ByteOrderedPartitioner.
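> A minimal sketch of the "trivial change" described above (not Cassandra's
> actual LCS code; the range values are hypothetical): an sstable whose key
> range overlaps nothing in the next level could be promoted as-is instead of
> being rewritten.

```python
def overlaps(a, b):
    """True if inclusive key ranges a=(lo, hi) and b=(lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

# With time-bucketed keys inserted in order under an order-preserving
# partitioner, newly flushed sstables cover fresh, disjoint key ranges:
level1 = [("2014-08-01", "2014-08-31")]
flushed = [("2014-09-01", "2014-09-07"), ("2014-09-08", "2014-09-14")]

promoted_without_rewrite = []
for sst in flushed:
    if not any(overlaps(sst, existing) for existing in level1):
        level1.append(sst)                    # promote as-is: no compaction IO
        promoted_without_rewrite.append(sst)
```

> Under random insertion order the flushed ranges would overlap existing
> levels and fall back to a normal LCS merge, which is the "standard LCS
> performance" case above.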
> The obvious problem is, of course, that using an order-preserving partitioner
> is a Very Bad idea when N > RF. Using an OPP for time series data ordered by
> time is utter lunacy.
> It seems to me that one solution is to split apart the roles of the
> partitioner so that data distribution across the cluster and data ordering on
> disk can be controlled independently. Ideally, on-disk ordering could be set
> per CF. I'm curious about the historical choice to order data on disk by token
> and not key. Randomized (hashed key ordered) distribution across the cluster
> is obviously a good idea, but natural key ordering on disk seems like it would
> have a number of advantages:
> * Better read performance and file system page cache efficiency for any
> workload which accesses certain ranges of row keys more frequently than others
> (this applies to _many_ use cases beyond time series data).
> * I can't think of a realistic workload where CRUD operations would be
> noticeably less performant when using natural instead of hash ordering.
> * Better compression ratios (although probably only for skinny rows).
> * Range based truncation becomes feasible.
> * Ordered range scans might be feasible to implement even with random cluster
> distribution.
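> To illustrate the range-based-truncation point, a hypothetical sketch
> assuming key-ordered sstables that record their (min_key, max_key): any file
> whose whole key range lies inside the truncated range can simply be dropped,
> with no per-row tombstones. The bucket names are made up for illustration.

```python
# Each sstable summarized as its (min_key, max_key) pair, in key order.
sstables = [("2014-01", "2014-03"), ("2014-04", "2014-06"), ("2014-07", "2014-09")]

def truncate_range(tables, lo, hi):
    """Drop every sstable whose entire key range lies within [lo, hi]."""
    return [t for t in tables if not (lo <= t[0] and t[1] <= hi)]

remaining = truncate_range(sstables, "2014-01", "2014-06")
```

> Under hash ordering the deleted keys span every sstable, so nothing can be
> dropped wholesale; partially covered files would still need tombstones.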
> The only things I can think of which could suffer when using different
> cluster and disk ordering are bootstrap and repair. Although I have no
> evidence, the massive potential performance gains certainly still seem to be
> worth it.
> Thoughts? This approach seems to be fundamentally different from other
> tickets related to improving time series data (CASSANDRA-6602,
> CASSANDRA-5561), which focus only on new or modified compaction strategies.
> By changing the data sort order, existing compaction strategies can be made
> significantly more efficient without imposing new, restrictive, and
> use-case-specific limitations on the user.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)