[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537417#comment-16537417 ]

Benedict commented on CASSANDRA-7282:
-------------------------------------

Also, as a random aside: I have no idea where I got the idea that this was called the Van Emde Boas layout. I'm sure the internet misled me one day, and I never verified the nomenclature, but apparently it's actually the Eytzinger layout, and we should probably avoid propagating this error.

> Faster Memtable map
> -------------------
>
>          Key: CASSANDRA-7282
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
>      Project: Cassandra
>   Issue Type: Improvement
>     Reporter: Benedict
>     Assignee: Michael Burman
>     Priority: Major
>       Labels: performance
>      Fix For: 4.x
>  Attachments: jasobrown-sample-run.txt, profile.yaml, reads.svg, run1.svg, writes.svg
>
> Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts.
> I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
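For reference, a minimal sketch of the Eytzinger (BFS-order) layout being discussed. This is illustration only, not Cassandra code; the class and method names below are invented for the example. The sorted input is rearranged so that the implicit tree's root sits at index 1 and the children of index i sit at 2i and 2i+1, keeping the hot top levels of the search packed together in memory.

```java
public class EytzingerSearch
{
    // eytzinger[1..n] holds the sorted input rearranged in BFS order of an
    // implicit balanced binary tree: children of index i live at 2i and 2i+1.
    private final long[] eytzinger;

    public EytzingerSearch(long[] sorted)
    {
        eytzinger = new long[sorted.length + 1];
        build(sorted, 1, new int[]{ 0 });
    }

    private void build(long[] sorted, int i, int[] next)
    {
        if (i >= eytzinger.length)
            return;
        build(sorted, 2 * i, next);          // left subtree = smaller keys
        eytzinger[i] = sorted[next[0]++];    // in-order visit assigns sorted keys
        build(sorted, 2 * i + 1, next);      // right subtree = larger keys
    }

    // Returns the smallest element >= key, or Long.MAX_VALUE if none exists.
    public long ceiling(long key)
    {
        int i = 1;
        long best = Long.MAX_VALUE;
        while (i < eytzinger.length)
        {
            if (eytzinger[i] >= key)
            {
                best = eytzinger[i];
                i = 2 * i;       // try to find a smaller candidate on the left
            }
            else
            {
                i = 2 * i + 1;   // everything here is too small; go right
            }
        }
        return best;
    }
}
```

The search touches one contiguous array with no pointer chasing, which is the cache benefit the comments above attribute to this layout.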
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537398#comment-16537398 ]

Benedict commented on CASSANDRA-7282:
-------------------------------------

bq. but without Value Types that's a pretty long shot with Java

Well, you can do n-ary (i.e. B-Tree) equivalents of the Van Emde Boas layout; or you could do a simpler binary Van Emde Boas initial search (sized to about a cache fetch unit) to select the correct range of a sorted array to search, which would probably yield similar results for our domain sizes. But to be honest I only really put the binary Van Emde Boas approach in here to save time, as it was faster than digging too closely into the differences in the benchmark between the two more basic approaches.

bq. I'll start the other ticket by moving the range detection to an upper level by creating a new Memtable per range

Great, thanks.

bq. I'll need to find a good way from getting that event from bootstrap & streaming

Thinking about this a bit more, it seems there are four options:
# Locate all the places the token ring changes, and insert some code there
# Subscribe to changes in the token ring, and flush if we introduce a new range (or expand an existing range)
# Detect an out-of-range message, and synchronise our ranges with the ring
# Detect an out-of-range message, and simply route it to an out-of-range bucket

The thing is, there's nothing right now preventing us from receiving records for a range not assigned to us, so we will need an out-of-range bucket. But at the same time we will have to update ourselves, consciously, when the ring updates: if we're moving up out of the memtable to impose this, we cannot rely on flush to resynchronise ourselves. So we will probably need an out-of-range bucket anyway - that IMO should be served by a CSLM until we implement NBHOM degrading to a skip list under skew. But we also want to know when our owned ranges change - and this can happen in a variety of circumstances.

Probably the most robust thing to do is to subscribe to changes in TokenMetadata. That isn't currently possible, but it shouldn't be onerous to add - though I hate adding to an already ugly class.

bq. Also, any table not using ranges (such as system, maybe some new virtual ones?) don't use any.

Those tables that don't use any ranges all use the LocalStrategy for replication, which right now looks like it would yield a separate range for every token in the cluster, all mapping to localhost. We should probably confirm this is the case, and override the method to just return a single range owned by localhost. We should probably take this discussion to another ticket, though.
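A hedged sketch of what option 2 above might look like: a single subscription point for ring changes, so the memtable machinery can react (e.g. flush) when the owned ranges grow. None of these types exist in Cassandra as written; every name below is invented purely to show the shape of the idea.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArrayList;

public class TokenRingNotifier
{
    public interface RingChangeListener
    {
        // Called with the previous and new sets of owned range start tokens.
        void onOwnedRangesChanged(Set<Long> before, Set<Long> after);
    }

    private final List<RingChangeListener> listeners = new CopyOnWriteArrayList<>();
    private volatile Set<Long> ownedRanges;

    public TokenRingNotifier(Set<Long> initialRanges)
    {
        ownedRanges = initialRanges;
    }

    public void subscribe(RingChangeListener listener)
    {
        listeners.add(listener);
    }

    // Invoked wherever the ring is mutated: the "locate all the places the
    // token ring changes" part of option 1 collapses into this one choke point.
    public void updateOwnedRanges(Set<Long> newRanges)
    {
        Set<Long> before = ownedRanges;
        ownedRanges = newRanges;
        for (RingChangeListener l : listeners)
            l.onOwnedRangesChanged(before, newRanges);
    }
}
```

A subscriber would compare the two sets and flush (or resynchronise) only when a new range appears or an existing one expands, which is exactly the trigger condition option 2 describes.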
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537046#comment-16537046 ]

Michael Burman commented on CASSANDRA-7282:
-------------------------------------------

Yeah, VanEmdeBoas is fine. In theory there are a few better layouts (according to "Array layouts for comparison-based searching"), including B-Tree in an array, but without Value Types that's a pretty long shot with Java. As said, though, this is perhaps not the best place to spend more time, considering how small a share of the overall execution time it takes and how small the expected improvements are. So I'll change the implementation to veb.

I'll start the other ticket by moving the range detection to an upper level by creating a new Memtable per range, and use the array for finding the correct Memtable (but still make the NonBlockingHashMap aware of the range it processes, as it needs that for the hash normalization). I'll need to find a good way of getting that event from bootstrap & streaming (did I forget any other event that could create more ranges - or do you have another idea of where best to get it?) back to CFS, if we want it before any actual writes happen. Also, any table not using ranges (such as system, maybe some new virtual ones?) doesn't use any.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535478#comment-16535478 ]

Benedict commented on CASSANDRA-7282:
-------------------------------------

I've posted a slightly modified gist [here|https://gist.github.com/belliottsmith/b559b6a09882009fb229e1485dcb7ac4].

This version ensures that the tests for each algorithm run identical operations (by using a consistent seed across forks/runs), randomises a larger number of search keys, and introduces a new polluteCache parameter that reduces cache occupancy by generating many test states and walking through them linearly for each invocation. I also fixed a couple of minor bugs - or at least inconsistencies in logic.

It also introduces two new variants: a simpler (IMO) binary search variant, using some old code of mine from [here|https://github.com/belliottsmith/jjoost/blob/master/jjoost-base/src/org/jjoost/util/order/IntOrder.java], and a Van Emde Boas array layout, which should exhibit the best cache behaviour by combining the possible benefits of a tree (the highest-level nodes retaining cache occupancy across operations) with those of the array (a much smaller footprint). In reality there are still better variants, but I expect the project is a long way from benefiting from those here.
||Benchmark||(polluteCache)||(vnodes)||Mode||Cnt||Score ± Error||Units||
|BSTBench.multiArrayBinarySearchFloor|1|4|thrpt|16|12065.122 ± 401.339|ops/ms|
|BSTBench.multiArrayBinarySearchFloor|1|64|thrpt|16|6292.591 ± 135.290|ops/ms|
|BSTBench.multiArrayBinarySearchFloor|1|512|thrpt|16|4702.458 ± 54.880|ops/ms|
|BSTBench.multiArrayBinarySearchFloor|4096|4|thrpt|16|4691.372 ± 151.983|ops/ms|
|BSTBench.multiArrayBinarySearchFloor|4096|64|thrpt|16|4218.482 ± 145.470|ops/ms|
|BSTBench.multiArrayBinarySearchFloor|4096|512|thrpt|16|3737.212 ± 110.495|ops/ms|
|BSTBench.multiArrayBinarySeek|1|4|thrpt|16|12657.483 ± 266.605|ops/ms|
|BSTBench.multiArrayBinarySeek|1|64|thrpt|16|7002.695 ± 105.505|ops/ms|
|BSTBench.multiArrayBinarySeek|1|512|thrpt|16|4339.367 ± 235.567|ops/ms|
|BSTBench.multiArrayBinarySeek|4096|4|thrpt|16|4517.014 ± 94.628|ops/ms|
|BSTBench.multiArrayBinarySeek|4096|64|thrpt|16|3984.565 ± 173.088|ops/ms|
|BSTBench.multiArrayBinarySeek|4096|512|thrpt|16|3682.319 ± 125.062|ops/ms|
|BSTBench.multiArrayVanEmdeBoasLayout|1|4|thrpt|16|15872.360 ± 390.277|ops/ms|
|BSTBench.multiArrayVanEmdeBoasLayout|1|64|thrpt|16|7285.622 ± 100.053|ops/ms|
|BSTBench.multiArrayVanEmdeBoasLayout|1|512|thrpt|16|4568.883 ± 603.805|ops/ms|
|BSTBench.multiArrayVanEmdeBoasLayout|4096|4|thrpt|16|5217.134 ± 12.719|ops/ms|
|BSTBench.multiArrayVanEmdeBoasLayout|4096|64|thrpt|16|4978.465 ± 55.890|ops/ms|
|BSTBench.multiArrayVanEmdeBoasLayout|4096|512|thrpt|16|4361.356 ± 240.606|ops/ms|
|BSTBench.treeSeek|1|4|thrpt|16|14766.468 ± 134.707|ops/ms|
|BSTBench.treeSeek|1|64|thrpt|16|6400.449 ± 42.273|ops/ms|
|BSTBench.treeSeek|1|512|thrpt|16|4116.524 ± 102.472|ops/ms|
|BSTBench.treeSeek|4096|4|thrpt|16|5132.769 ± 183.395|ops/ms|
|BSTBench.treeSeek|4096|64|thrpt|16|4003.251 ± 69.857|ops/ms|
|BSTBench.treeSeek|4096|512|thrpt|16|3458.094 ± 17.092|ops/ms|
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534146#comment-16534146 ]

Benedict commented on CASSANDRA-7282:
-------------------------------------

Thanks. I'm surprised, and will dig into this a little more myself.

However, we have to be careful with these microbenchmarks. The whole state fits in L1 cache here, but there's a good chance it will not be hot in L1 in live operation, given how poor our general cache usage is; and staying hot is certainly less likely with the tree approach (given it is much larger in memory). So if the scores are similar (which they are), I would still prefer the array variant. Still, it should be outright faster, so I'll take a look and see if I can figure out why you're seeing this behaviour.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534129#comment-16534129 ]

Michael Burman commented on CASSANDRA-7282:
-------------------------------------------

I updated the original gist.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534127#comment-16534127 ]

Benedict commented on CASSANDRA-7282:
-------------------------------------

Do you want to post a link to the code before hitting the sack?
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534123#comment-16534123 ]

Michael Burman commented on CASSANDRA-7282:
-------------------------------------------

Updated the code with, hopefully, what you meant.. (need to get some sleep). I see worse performance, which I believe is because of the added complexity of the minimum processing. I'll need to investigate more tomorrow.

{code}
Benchmark                        (vnodes)   Mode  Cnt       Score      Error   Units
BSTBench.multiArrayBinarySeek           4  thrpt   16  106134.448 ± 3211.072  ops/ms
BSTBench.multiArrayBinarySeek          16  thrpt   16   21704.218 ±  788.881  ops/ms
BSTBench.multiArrayBinarySeek          64  thrpt   16    4347.090 ±  106.703  ops/ms
BSTBench.multiArrayBinarySeek         256  thrpt   16     667.122 ±  210.476  ops/ms
BSTBench.multiArrayBinarySeek         512  thrpt   16     192.777 ±   29.690  ops/ms
BSTBench.multiArrayLinearSearch         4  thrpt   16  172888.358 ± 7778.132  ops/ms
BSTBench.multiArrayLinearSearch        16  thrpt   16   16703.622 ±  565.180  ops/ms
BSTBench.multiArrayLinearSearch        64  thrpt   16    1096.093 ±   33.050  ops/ms
BSTBench.multiArrayLinearSearch       256  thrpt   16      84.511 ±    2.776  ops/ms
BSTBench.multiArrayLinearSearch       512  thrpt   16      22.130 ±    0.714  ops/ms
{code}
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533884#comment-16533884 ]

Benedict commented on CASSANDRA-7282:
-------------------------------------

Great, thanks. The mismatch in expectations is probably because this is not the optimal way to implement a binary search variant. Ideally it would have two (or three) backing arrays, as opposed to a backing array of nodes: one primitive array of one side of the bounds (e.g. the lower bounds) to search on; another of keys; and another primitive array of the opposing bounds to filter results (or, if the bounds are encoded in the keys already, that could be skipped). Do you mind adding that variant to the comparison?
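A sketch of the structure-of-arrays variant described above, under the assumption of sorted, non-overlapping [lower, upper] ranges; all names are invented for illustration, and the "keys" array is folded into the values here for brevity. The point is that the binary search itself touches only one tightly packed primitive array.

```java
public class RangeIndex
{
    private final long[] lowerBounds; // sorted; the only array the search touches
    private final long[] upperBounds; // parallel; used solely to filter the hit
    private final Object[] values;    // parallel; e.g. the per-range structure

    public RangeIndex(long[] lowerBounds, long[] upperBounds, Object[] values)
    {
        this.lowerBounds = lowerBounds;
        this.upperBounds = upperBounds;
        this.values = values;
    }

    // Returns the value whose [lower, upper] range contains token, or null.
    public Object lookup(long token)
    {
        int lo = 0, hi = lowerBounds.length - 1, floor = -1;
        while (lo <= hi)
        {
            int mid = (lo + hi) >>> 1;
            if (lowerBounds[mid] <= token) { floor = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        // Only now touch the second array, once, to confirm the token is in range.
        return (floor >= 0 && token <= upperBounds[floor]) ? values[floor] : null;
    }
}
```

Compared with an array of node objects, every probe here is a primitive load from one contiguous `long[]`, with no pointer chase and no object header overhead per element.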
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533829#comment-16533829 ]

Michael Burman commented on CASSANDRA-7282:
-------------------------------------------

Here you go: https://gist.github.com/burmanm/7fb14589d5c0d11ae79c018bd7849653
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533770#comment-16533770 ]

Benedict commented on CASSANDRA-7282:
-------------------------------------

Could you share the code you used for this benchmark? The result is counterintuitive.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533582#comment-16533582 ]

Michael Burman commented on CASSANDRA-7282:
-------------------------------------------

Out-of-range tokens were also important for system tables, as they do not follow the normal ranges. I agree that this type of information should live outside the Memtable rather than inside it; however, I designed the patch to be a "plug-in" without touching other parts too much. There's certainly going to be more overhead with one memtable per range, but that might not be an issue depending on the number of vnodes per node - which is part of the reason for your next question.

The tree is certainly overkill, but I made it because it behaves more predictably across the numbers of vnodes currently in use:

{code:java}
Benchmark              (vnodes)   Mode  Cnt       Score      Error   Units
BSTBench.binarySeek           4  thrpt   16  122117.853 ± 6815.672  ops/ms
BSTBench.binarySeek          16  thrpt   16   22711.007 ±  691.020  ops/ms
BSTBench.binarySeek          64  thrpt   16    4547.383 ±  180.749  ops/ms
BSTBench.binarySeek         256  thrpt   16     892.406 ±   42.791  ops/ms
BSTBench.binarySeek         512  thrpt   16     224.545 ±    6.469  ops/ms
BSTBench.linearSearch         4  thrpt   16  214901.639 ± 3264.386  ops/ms
BSTBench.linearSearch        16  thrpt   16   19594.770 ±  718.201  ops/ms
BSTBench.linearSearch        64  thrpt   16    1302.657 ±   43.841  ops/ms
BSTBench.linearSearch       256  thrpt   16     105.317 ±    2.950  ops/ms
BSTBench.linearSearch       512  thrpt   16      25.932 ±    3.488  ops/ms
BSTBench.treeSeek             4  thrpt   16  162053.915 ± 4228.035  ops/ms
BSTBench.treeSeek            16  thrpt   16   31650.599 ±  789.500  ops/ms
BSTBench.treeSeek            64  thrpt   16    5901.952 ±  137.575  ops/ms
BSTBench.treeSeek           256  thrpt   16    1055.737 ±   22.786  ops/ms
BSTBench.treeSeek           512  thrpt   16     208.169 ±    4.617  ops/ms
{code}

(The above test creates vnodes-many token ranges and then tries to find each one in a random order.) The amount of code for binary search vs tree search is almost equivalent, and between 4 and 256 vnodes the tree is a fast method (with linear search being the quickest for a very small number of token ranges), so I thought of it as the best "compromise".
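For contrast, the linear-search counterpart being benchmarked is tiny, which is part of why it wins at very small vnode counts: a handful of sorted longs fits in one or two cache lines and the access pattern is perfectly sequential. Sketch only; the names are invented and this is not the code from the gist.

```java
public class LinearFloor
{
    // Returns the index of the largest start <= token, or -1 if none.
    // A plain forward scan: no branches that mispredict in a pattern, and the
    // whole array is prefetched sequentially - ideal for very small inputs,
    // but O(n), which is why it collapses at 256-512 vnodes in the table above.
    public static int floor(long[] sortedStarts, long token)
    {
        int i = 0;
        while (i < sortedStarts.length && sortedStarts[i] <= token)
            i++;
        return i - 1;
    }
}
```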
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533153#comment-16533153 ]

Benedict commented on CASSANDRA-7282:
-------------------------------------

I actually wonder if we should make this ticket depend on a new ticket to separate all memtables by token range. By doing this at a slightly higher level, we can flush the data for each token range on an independent schedule. This might permit more even use of our disks, and reduce overall compaction work by permitting us to flush larger sstables on average. It's also an easier level at which to integrate decisions about token range changes, and it avoids the unsightly slicing (on flush) of already-sliced data.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533127#comment-16533127 ] Benedict commented on CASSANDRA-7282: - Hi Michael, I'll try to review this properly, as I find time. A couple of things leap out to me though: # While we certainly need to handle out of range tokens, we should try harder to minimise their occurrence, by switching memtable on a token range change. Ideally we would switch before this new token range actually takes effect in the case of an expansion; whether or not it makes sense to defer a token range removal probably needs some thought. # A range tree is probably overkill. Why not binary search over an array? > Faster Memtable map > --- > > Key: CASSANDRA-7282 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict >Assignee: Michael Burman >Priority: Major > Labels: performance > Fix For: 4.x > > Attachments: jasobrown-sample-run.txt, profile.yaml, reads.svg, > run1.svg, writes.svg > > > Currently we maintain a ConcurrentSkipLastMap of DecoratedKey -> Partition in > our memtables. Maintaining this is an O(lg(n)) operation; since the vast > majority of users use a hash partitioner, it occurs to me we could maintain a > hybrid ordered list / hash map. The list would impose the normal order on the > collection, but a hash index would live alongside as part of the same data > structure, simply mapping into the list and permitting O(1) lookups and > inserts. > I've chosen to implement this initial version as a linked-list node per item, > but we can optimise this in future by storing fatter nodes that permit a > cache-line's worth of hashes to be checked at once, further reducing the > constant factor costs for lookups. 
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527305#comment-16527305 ] Michael Burman commented on CASSANDRA-7282: --- I've updated this work with the following changes to get this process restarted:
* Ported to the newer storage format and rebased onto current master
* Removed the visibility optimization from the resize operation, as I could make it NPE with a race condition when large resizes happen in short succession
** I tested different resize algorithms, such as tracking the read and write path chains and resizing when certain criteria are met. This is the strategy used by the Linux kernel's RCU, but I did not manage to make any notable performance gains with that method, so I kept the original resize algorithm.
* Each node token range is now backed by a separate index.
** Reduces contention on index updates, resizes, and the size parameter update (which was previously a contention point for all threads).
** Applies a normalization to token hashes per range, to improve the distribution of hashes and thereby the coverage of the index
*** Center the hash range around 0 first and then apply a scaling multiplier (the proportional size of the token range compared to the available hash range [Long.MIN_VALUE, Long.MAX_VALUE])
** Allows different growth sizes for each token range
** For system tables, and for cases where a node might see writes that were not known during initialization, an overflow index with range [Long.MIN_VALUE, Long.MAX_VALUE] is used (works like the original patch)
** For the lookup to the correct index I used a balanced BST. If there's a better way, I'm all ears. Linear search would be as fast if the number of vnodes is small, but this scales to larger numbers of vnodes as well.
I've tried running different workloads against it and also tried different hashing methods, including the 32-bit hash from MurmurHash to trigger a bad hash distribution, but I couldn't find any scenario with performance degradation compared to CSLM. With a hash attack (which Murmur is vulnerable to) this can of course be done, but that would break Cassandra's node distribution as well. Link to the branch: [https://github.com/burmanm/cassandra/tree/7282-trunk-2]
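The range normalization described above (center the range around zero, then scale by its proportion of the full hash domain) might look something like this. All names here are hypothetical, and the double-based arithmetic is a simplification that loses low-order precision for very large ranges.

```java
// Illustrative sketch of normalizing token hashes from a single vnode
// range onto the full [Long.MIN_VALUE, Long.MAX_VALUE] domain, so that a
// per-range hash index sees a roughly uniform distribution of top bits.
// Simplified: uses double arithmetic, so low-order bits are approximate.
public class RangeNormalizer {
    private final long left;  // lower bound of the owned range
    private final long right; // upper bound of the owned range

    public RangeNormalizer(long left, long right) {
        this.left = left;
        this.right = right;
    }

    public long normalize(long token) {
        double span = (double) right - (double) left;             // range width
        double centered = (double) token - ((double) left + span / 2); // center on 0
        double scale = 2.0 * Long.MAX_VALUE / span;               // stretch to full domain
        return (long) (centered * scale);
    }
}
```

The effect is that a token in the middle of the owned range maps near 0, and tokens near the range edges map near the extremes of the long domain, restoring entropy to the top bits the index keys on.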
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14385740#comment-14385740 ] Benedict commented on CASSANDRA-7282: - bq. +1 for massive understatement. Thanks. I spent a day working on just that, so glad it panned out :) I'm currently leaning towards postponing this ticket until 3.1, since some careful consideration is needed to ensure a uniform distribution of hash keys within the map, especially without vnodes on large clusters. It's possible we could only enable this optimisation on nodes that can predict their distribution will be fair. In either case I think it may be helpful to consider the ticket in relation to CASSANDRA-7032 and CASSANDRA-6696, by e.g. having a separate hash table for each vnode range. Depending on 3.0 release timeline, the incorporation of these tickets, and on the progression of my other commitments, I may still aim to deliver this in 3.0, but just alerting that at the moment my view is this is uncertain and on balance less than likely.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363708#comment-14363708 ] Joshua McKenzie commented on CASSANDRA-7282: From the comments: bq. We also use a somewhat different explanation here for behaviour, which is perhaps easier to comprehend. +1 for massive understatement. :) The binary reversal magic for determining which bucket to hash into is very well explained in the comments now. That's the subtlest part of all this to wrap my head around, and your explanation is spot on. bq. I'd prefer to see it used more widely in the codebase though. I don't disagree, but when in Rome... that, and a piece of code as dense as NBHOM can use all the familiarity we can get in here. So my last few minor outstanding concerns: # hashCode conformance bq. The best option is probably to make all Token implement a HashComparable interface, and throw UnsupportedOperationException if they don't really implement it, but that is also pretty ugly. Not *too* ugly in my opinion - a little boilerplate, sure, but not ugly. The distinction of whether a hashCode implementation conforms to the requirements of NBHOM is subtle enough that I believe it warrants codifying. # The i value / i(...) combination is strange. For instance, the following: {noformat} int i = i(hash, index); {noformat} took me by surprise. It's relatively clear what you're doing when you read the surrounding code - it just could use different naming, imo. # maybeResize and the load factor of 66%. Given that it's a lazy, async "move the buckets, not the items" approach, I can see how it's not going to make-or-break at 50% vs. 75%. I'd like *some* kind of data on this though; I'd be content with back-of-the-napkin theorization here - just something more than "seems pretty reasonable". :) On the whole - looking very good.
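The "binary reversal magic" mentioned above is the split-ordered-list idea: if entries are kept sorted by their bit-reversed hash, the entries of any bucket form one contiguous run, and doubling the table splits each bucket in place instead of rehashing. A toy demonstration, independent of Cassandra's actual code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy demonstration of bit-reversed ("split-order") hashing: sorting
// hashes by Long.reverse(hash), compared as unsigned values, yields the
// classic bit-reversal permutation, so the entries of bucket
// (hash & (size - 1)) stay contiguous at every power-of-two table size.
public class SplitOrderDemo {
    public static Long[] splitOrdered(long[] hashes) {
        Long[] out = new Long[hashes.length];
        for (int i = 0; i < hashes.length; i++) out[i] = hashes[i];
        out = Arrays.copyOf(out, out.length);
        Arrays.sort(out, (a, b) -> Long.compareUnsigned(Long.reverse(a), Long.reverse(b)));
        return out;
    }

    /** True iff entries sharing a bucket are contiguous in the ordering. */
    public static boolean bucketsContiguous(Long[] ordered, int tableSize) {
        Set<Integer> seen = new HashSet<>();
        int prev = -1;
        for (long h : ordered) {
            int b = (int) (h & (tableSize - 1));
            if (b != prev) {
                if (!seen.add(b)) return false; // bucket reappeared later on
                prev = b;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Long[] o = splitOrdered(new long[]{0, 1, 2, 3, 4, 5, 6, 7});
        // Bit-reversal permutation: 0, 4, 2, 6, 1, 5, 3, 7
        System.out.println(Arrays.toString(o));
        System.out.println(bucketsContiguous(o, 2) && bucketsContiguous(o, 4));
    }
}
```

Because low hash bits become high bits of the sort key, bucket b at size s splits cleanly into buckets b and b+s at size 2s, which is what lets the resize move buckets rather than items.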
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322942#comment-14322942 ] Branimir Lambov commented on CASSANDRA-7282: bq. To add some more data to this discussion, I decided quickly to isolate just the CSLM and NBHOM for comparison. This benchmark uses Longs as keys, whose hashCode() does not satisfy the requirements of the NBHOM. The results aren't materially different when this is fixed (NBHOM is still dramatically faster). However, if even the author of the code can make this mistake so easily I don't think reusing {{hashCode()}} for the NBHOM ordering key is acceptable.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323080#comment-14323080 ] Benedict commented on CASSANDRA-7282: - bq. This benchmark uses Longs as keys Thanks for spotting that bq. However, if even the author of the code can make this mistake so easily It's worth noting that I was recovering from brain surgery performed a couple of weeks prior to posting this benchmark, so it's probably not _quite_ as indicative of the problem as it might appear. I agree it would be preferable to use a different method, though. The problem is doing so neatly and without penalty. The best option is probably to make all Token implement a HashComparable interface, and throw UnsupportedOperationException if they don't really implement it, but that is also pretty ugly. I'm pretty agnostic to the solution to this particular problem, so I'll defer to strong opinions. A potentially more damning criticism of that benchmark is that it assumes a uniform address space for the hashes. As the number of nodes in a cluster grows, the entropy in the top bits decreases, and so the performance of this map could degrade. Besides the aforementioned improvement to bound worst case behaviour at O(lg(N)) we could also normalise the top order bits across the owned ranges. Possibly there are some other strategies, but I need to think on that.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315157#comment-14315157 ] Benedict commented on CASSANDRA-7282: - I've pushed an updated branch [here|https://github.com/belliottsmith/cassandra/tree/7282-withcomments] that I hope addresses all of your concerns. bq. LongNonBlockingHashOrderedMapTest times out on default settings for ant long-test on my box The test is really designed to be run on a machine with at least 8 physical cores, and it may take an excessively long time on fewer. In general it's hard to make a test like this terminate in a predictable time horizon, and the longer it runs the better (ideally we'd just leave it running for days). I've tweaked the settings slightly to make it run better on lower core counts while still performing some useful work. bq. this.index aliasing in predecessor appears redundant This is necessary to avoid fetching a different version of the index each time, and incurring the memory access costs with each reference bq. Would it be worth making maybeResize threshold configurable? What's our reasoning for targeting 66%? 66% seems pretty reasonable (defaults are usually somewhere between 50% and 75% for hash tables). We don't generally expose this kind of internal feature to users, but I am not totally opposed to doing so. I doubt a value much different from this is likely to help much, though; a value of 50% could mean the index becomes more stale, so we have more scanning forward to do on lookups, whereas a value above 100% means we start seeing significant worst-case collisions. bq. in putIfAbsent / maybeResize / resize: Just to confirm - are we relying on the executor to drop duplicate resize calls on the floor? If so, it might be worth documenting. It doesn't matter if it does or does not; it only ever resizes upwards.
We also shouldn't ever have a duplicate resize call, unless we haven't executed by the time the table's contents have doubled. bq. Why do we bit shift on generation of index node instead of just using integer value? This is a standard practice for me: shifts in multiples of 10 are powers of ~1000 (i.e. 1024). I find it much neater than lots of magic constants when you get above 1024 (e.g. 1048576 or 1073741824), or snippets of 1024L * 1024L * 1024L. 1 << 30 is much clearer (representing 1024^3). However its utility is pretty marginal when working with small quantities, so I've switched it. I'd prefer to see it used more widely in the codebase, though. I've spent a while trying to comment the algorithm very clearly at the top of the class, with diagrams. Please let me know if it's a success, and if there's anything more I can do to clarify it.
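The shift convention above is just that every 10 bits of left shift multiplies by 1024:

```java
// Shifts in multiples of 10 are powers of 1024, avoiding long literals.
public class ShiftDemo {
    public static void main(String[] args) {
        System.out.println(1 << 10);  // 1024
        System.out.println(1 << 20);  // 1048576, i.e. 1024^2
        System.out.println(1 << 30);  // 1073741824, i.e. 1024^3
        System.out.println(1L << 40); // 1099511627776, i.e. 1024^4 (needs a long)
    }
}
```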
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276213#comment-14276213 ] Jonathan Ellis commented on CASSANDRA-7282: --- bq. It may be we want to harden the data structure against determined attackers that produce a lot of colliding tokens with different partition keys I'm happy to leave this as a non-goal for now.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273063#comment-14273063 ] Benedict commented on CASSANDRA-7282: - Just a note for the future (I haven't yet addressed your comments): It may be we want to harden the data structure against determined attackers that produce a lot of colliding tokens with different partition keys. It's not a terribly complex modification to make this a skip-list, instead of a linked-list (since it is add-only, maintaining a skip-list is just applying the current linked-list logic N times, and the majority of complexity here is for the hash index). The hash index would remain an _optimisation_ for expected constant time access into a logarithmic structure, instead of a linear one, giving us best- and expected- case constant time, worst case logarithmic. The skip-list index levels could perhaps be dropped as the hash index extends to overlap them, although it may be cheaper to encode them all in the nodes themselves. This should probably be a follow up ticket, but I wanted to log the thought as it occurred to me.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149929#comment-14149929 ] Joshua McKenzie commented on CASSANDRA-7282: (nits interleaved with the rest) General: * Could use a comment on LongToken.hashCode() explaining the reasoning for the shift / truncation. While it makes sense in the context of this patch, there's little context within the class itself to explain why we're just lopping off half the bits. * Consider more descriptive variable names rather than abbreviations in AtomicReferenceArrayUpdater (rather than V exp, V upd) * AtomicReferenceArrayUpdater: [could use some comments|https://github.com/belliottsmith/cassandra/blob/7282-fastmemtablemap/src/java/org/apache/cassandra/utils/concurrent/AtomicReferenceArrayUpdater.java#L35-41] clarifying the index calculation logic NonBlockingHashOrderedMap.java: * I think a diagram showing how the container works ([see CSLM|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/concurrent/ConcurrentSkipListMap.java]) would be quite helpful * Document why INDEX_SHIFT == 18. Currently just a magic number * Why do we bit shift on generation of the index node instead of just using the integer value? {code}private volatile Node<K, V>[][] index = new Node[1][1 << 10];{code} * It might help to reference the Shalev whitepaper's title as a reference for some more formal reading on a similar data structure to what we're using here * this.index aliasing in predecessor appears redundant * [predecessor() is still very dense|https://github.com/belliottsmith/cassandra/blob/7282-fastmemtablemap/src/java/org/apache/cassandra/concurrent/NonBlockingHashOrderedMap.java#L182-203]. Not a bad thing if necessary, but I'm wondering if there aren't opportunities to abstract out some of the guts of what's going on in there to segregate the algorithm from the implementation. * The comments on indexHash and firstHashOfIndex are quite helpful.
* As non-blocking algorithms are still somewhat new, it might be worth documenting why we're using [lazy updates|http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6275329] for future readers of the code. Might not. * Would it be worth making maybeResize threshold configurable? What's our reasoning for targeting 66%? * [maybeResize|https://github.com/belliottsmith/cassandra/blob/7282-fastmemtablemap/src/java/org/apache/cassandra/concurrent/NonBlockingHashOrderedMap.java#L268-274] logic could use some more clarity on documentation. * [Signal-to-noise ratio in resize() seems off|https://github.com/belliottsmith/cassandra/blob/7282-fastmemtablemap/src/java/org/apache/cassandra/concurrent/NonBlockingHashOrderedMap.java#L293-302]. Might be worth factoring out all the INDEX_SHIFTING and Math.max/min to get implementation details out of the immediate visibility of the algorithm's scope, assuming we can do that without impacting performance. * in putIfAbsent / maybeResize / resize: Just to confirm - are we relying on the executor to drop duplicate resize calls on the floor? If so, it might be worth documenting. LongNonBlockingHashOrderedMapTest.java * Reinforces the feeling that the heavy bitwise operations could serve to be abstracted out or more heavily commented to help reduce complexity. For example, [this|https://github.com/belliottsmith/cassandra/blob/7282-fastmemtablemap/test/long/org/apache/cassandra/concurrent/LongNonBlockingHashOrderedMapTest.java#L213-216] is opaque. While it's parsable and appears correct, some trivial commenting could help smooth out the reading / maintaining process * LongNonBlockingHashOrderedMapTest times out on default settings for ant long-test on my box * And still times out at 10x timeout threshold. :) This may be a Windows thing - do they run to completion within the time on your setup? 
In summary - the code looks good but in my opinion could use some more documentation and possibly some abstracting out of lower-level implementation details. The logic of the container itself is definitely simpler than the CSLM and seems more tailored to our needs.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147705#comment-14147705 ] Jason Brown commented on CASSANDRA-7282: I ran Benedict's 7282 patch vs. the commit prior in his branch, for several use cases. TL;DR the patch was a clear winner for the specific use case he called out, and marginally better, if not the same, for a 'typical' use case. For the first use case, I used Benedict's attached profile.yaml, although I changed the RF from 1 to 2. I then used this stress command to execute (basically the same as Benedict's above) {code}./bin/cassandra-stress user profile=~/jasobrown/b7282.yaml ops\(insert=5,read=5\) n=2000 -pop seq=1..10M read-lookback=extreme\(1..1M,2\) -rate threads=200 -mode cql3 native prepared -node node_addresses{code} I've attached the results of running the above command three times successively on both patched and unpatched code; the results can be summarized: - 15% improvement in throughput - 10% improvement at the 95%/99%-iles - 50% improvement at the 99.9%-ile - ~40% less garbage created / 40% less time in GC Note that over the life of the stress run, memtables were flushing and sstables compacting, so not all reads were coming directly from the memtable. One thing I perhaps could have tried (and would have liked to) was switching the CL from ONE to LOCAL_QUORUM, although I think stress would have applied that to both reads and writes for the above command, whereas I would have wanted to affect either reads or writes. I also ran stress against a more 'standard' use case of mine (I'm still developing it alongside the new stress), and results were about even between patched and unpatched, although there may have been a very slight advantage toward the patched version. So, I think in the case where you are reading your most recent writes, and the data in the memtable is small (not wide), there is a performance gain in this patch.
In a more typical case, it wasn't necessarily proven that the patch boosts performance (but then the patch wasn't attempting to help the general case, anyway).
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140093#comment-14140093 ] Benedict commented on CASSANDRA-7282: - We should test both. This change is going to apply not only now, but in the future. If we reject it, it won't be revisited for a long time. Not only are VM improvements like Shenandoah and G1GC going to mean we support users with huge heaps that can support gigantic memtables, but this data structure is easily moved completely offheap once the objects it refers to are. Lots of users run with only single disks and large memory capacities too, and with a single disk an mct of 0.5 is the norm. So even an mct of 0.99 is pretty underwhelming for helping to bench these cases. An mct of 0.11 is completely inadequate. However we should test both if that's what you run with, to see how it fares in different climates.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140118#comment-14140118 ] Benedict commented on CASSANDRA-7282: - We should be certain to test this with offheap_objects, btw, since this also reduces heap use and increases the number of records we can store. We should also not generally be benchmarking use cases with clustering columns (or very few cql rows and very few data columns, i.e. low cell counts), since they are less likely to exhibit the change, and we have plenty of users with partitions with low cell counts.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139755#comment-14139755 ] Jason Brown commented on CASSANDRA-7282: Looks like it's not something on my end that's causing weirdness: [~benedict]'s recommendation of setting memtable_cleanup_threshold = 0.99 causes all (three) of the nodes in my cluster to more or less lock up doing GC during startup. I've got a large enough memory space on the machine (128GB), and typically use 8G for the jvm, so I'll play with some values for memtable_cleanup_threshold to get it running reasonably.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139943#comment-14139943 ] Benedict commented on CASSANDRA-7282: - There's no good reason for that unless your commit log directory isn't empty... Could you confirm you're clearing your cluster state between runs...?
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139955#comment-14139955 ] Jason Brown commented on CASSANDRA-7282: Yes, I clear the cluster every time (rm -rf $data_dir).
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139962#comment-14139962 ] Benedict commented on CASSANDRA-7282: - I hope $data_dir is a shared dir with all the state, not just the data :) If so, mct should have zero bearing on anything at start up. It only plays a part in deciding when to flush, and at start up there should never be a reason to flush due to memtable occupancy... Something funny is going on.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140010#comment-14140010 ] Jason Brown commented on CASSANDRA-7282: bq. I hope $data_dir is a shared dir with all the state, not just the data Yeah, yeah, I got some things right today :) I should note that when I set mct back to 0.111, the cluster launches properly - admittedly, I didn't actually do anything with the cluster after I saw it successfully came and stayed up, but it's still alive six hours later. I have not investigated further. I agree that mct should be relatively harmless on new cluster launch, but let's back up for a moment - would we ever set mct that high in a real world cluster? I understand we're trying to avoid flushing (and losing the opportunity to actually test what the patch does), but if those kind of mct settings are so artificial, shouldn't we be testing with more realistic, or, at least, less crazy-town, values?
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136884#comment-14136884 ] Benedict commented on CASSANDRA-7282: - To add some more data to this discussion, I decided quickly to isolate just the CSLM and NBHOM for comparison. Code available [here|https://github.com/belliottsmith/bench]. Using JMH for writes to an append-only collection is tricky, since any increased throughput rate is naturally self-throttling and incurs greater GC and algorithmic overhead, so these numbers are likely to be understating the yield, but the NBHOM is anywhere from 1.75-2.75x the throughput of CSLM, depending on whether the workload is read-heavy or write-heavy. It is likely the real yield for writes is in between these two numbers.
{noformat}
b.b.c.HashOrderedCollections.test  4000  0.9   CSLM  thrpt  10  3593.788  196.459  ops/ms
b.b.c.HashOrderedCollections.test  4000  0.9  NBHOM  thrpt  10  9299.991  179.408  ops/ms
b.b.c.HashOrderedCollections.test  4000  0.5   CSLM  thrpt  10  2402.021   79.321  ops/ms
b.b.c.HashOrderedCollections.test  4000  0.5  NBHOM  thrpt  10  5648.160  294.727  ops/ms
b.b.c.HashOrderedCollections.test  4000  0.1   CSLM  thrpt  10  2089.597   71.228  ops/ms
b.b.c.HashOrderedCollections.test  4000  0.1  NBHOM  thrpt  10  3665.048  138.384  ops/ms
b.b.c.HashOrderedCollections.test  4000    0   CSLM  thrpt  10  2008.595   46.308  ops/ms
b.b.c.HashOrderedCollections.test  4000    0  NBHOM  thrpt  10  3515.328  214.304  ops/ms
{noformat}
So, as we reduce other overheads in the application and increase memtable capacity through further offheap improvements, any yield we are seeing application side can only go up.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136908#comment-14136908 ] Benedict commented on CASSANDRA-7282: - Some more numbers, with a warmup dataset to populate the map so that variability due to throughput rate is reduced. These numbers show the NBHOM consistently around 3x+ faster, although it introduces per-run variability due to GC.
{noformat}
Benchmark                          (readWriteRatio)  (type)  (warmup)   Mode  Samples     Score  Score error  Units
b.b.c.HashOrderedCollections.test               0.9    CSLM       500  thrpt        5  1392.273     2918.717  ops/ms
b.b.c.HashOrderedCollections.test               0.9   NBHOM       500  thrpt        5  5088.408     8964.885  ops/ms
b.b.c.HashOrderedCollections.test               0.5    CSLM       500  thrpt        5  1128.637     2589.679  ops/ms
b.b.c.HashOrderedCollections.test               0.5   NBHOM       500  thrpt        5  3406.299     5606.085  ops/ms
b.b.c.HashOrderedCollections.test               0.1    CSLM       500  thrpt        5   924.642     1802.045  ops/ms
b.b.c.HashOrderedCollections.test               0.1   NBHOM       500  thrpt        5  3311.107      999.896  ops/ms
b.b.c.HashOrderedCollections.test                 0    CSLM       500  thrpt        5   939.757     1776.641  ops/ms
b.b.c.HashOrderedCollections.test                 0   NBHOM       500  thrpt        5  2781.503     4723.844  ops/ms
{noformat}
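The benchmark parameters above (a readWriteRatio controlling the operation mix, a warmup count pre-populating the map) can be mimicked without the JMH harness. The sketch below is hypothetical and single-threaded, exercises only the stock ConcurrentSkipListMap side of the comparison, and its numbers would be purely illustrative; the real benchmark is the JMH code in the linked repository.

```java
import java.util.Random;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of the workload shape benchmarked above (not the actual
// bench code). Each operation is assumed to be a read with probability
// readWriteRatio, and an insert otherwise.
class MixedWorkloadSketch {
    static long run(double readWriteRatio, int ops, int warmup, long seed) {
        ConcurrentSkipListMap<Long, Long> map = new ConcurrentSkipListMap<>();
        Random rnd = new Random(seed);
        for (int i = 0; i < warmup; i++)
            map.put(rnd.nextLong(), 0L);     // pre-populate, as in the warmup runs
        long sink = 0;                        // keeps the read path observable
        for (int i = 0; i < ops; i++) {
            long k = rnd.nextLong();
            if (rnd.nextDouble() < readWriteRatio)
                sink += map.ceilingKey(k) == null ? 0 : 1;   // ordered read
            else
                map.put(k, k);                               // insert
        }
        return sink + map.size();
    }
}
```

A real comparison would wrap this loop in JMH (`@Benchmark`, `@Param` for the ratio) to get the throughput and error columns shown in the tables; timing it with `System.nanoTime` alone would suffer exactly the warmup and variability effects discussed above.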
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137603#comment-14137603 ] Jason Brown commented on CASSANDRA-7282: Food for thought: I've reviewed the branch Benedict has linked, and while I'll run the stress later today, I'd like to add some thoughts on the code itself. To perhaps state the obvious from reading the patch, the field Memtable.rows is now an InsertOnlyOrderedMap instead of the previous ConcurrentSkipListMap. If the Partitioner sorts data by hash code (sortsByHashCode()), Memtable will use the simple IOOM.Adapter implementation; else Memtable uses the NonBlockingHashOrderedMap. I think we can agree NBHOM is considerably more code and more involved than IIOM.Adapter (wait 'til the end, please, Benedict :) ). Currently, only the Murmur3Partitioner is written to sort by hash code. I would venture to guess the vast majority of new clusters created >= cassandra 1.2 use the Murmur3Partitioner, and, further, the vast majority of clusters created before 1.2 used RandomPartitioner. Thus, not many clusters use the ordered partitioners; we've also steered folks away from those unless they really know what they are doing (and then we still try to dissuade them). If we take the assumption that most clusters' partitioner is either Murmur3 or Random, and weigh it against the current patch, the majority of the code (and, what some have argued, the complexity) is to support the RandomPartitioner and the ordered partitioners. If we can make RandomPartitioner able to sort by hash code, and thus able to take advantage of IIOM.Adapter's functionality, then the NBHOM-related functionality would only be necessary for ordered partitioners (let's leave LocalPartitioner, used for indices, out for a moment).
It seems to me that we could pass on the NBHOM-related functionality, as it would only be beneficial for a small subset of use cases; not zero use cases, mind you, but perhaps enough to take a pass on it for this round. I think some commenters' concerns about the perceived complexity of the patch would then be assuaged, and we can deliver an improvement in performance for the majority of uses. [~benedict], please let me know if I've misunderstood something fundamental about the patch that would invalidate my arguments; the conclusion/recommendation is, of course, open for discussion.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137609#comment-14137609 ] Benedict commented on CASSANDRA-7282: - I think you may have misunderstood, or I have misunderstood what you're saying. I assume by IIOM you mean IOOM. But the IOOM.Adapter is not a feature, it's a support for old functionality. RandomPartitioner is currently supported by IOOM.Adapter backed by CSLM, because I was too lazy at the time to update RP's Tokens. NBHOM is essentially the entirety of the patch, _not_ a legacy support mechanism. It is an efficient means for supporting Murmur3 (and optionally Random) partitioners.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137638#comment-14137638 ] Jason Brown commented on CASSANDRA-7282: damnit, there's a ! operator there {code}if (!cfs.partitioner.sortsByHashCode()) {code} oh, well, I tried :)
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138519#comment-14138519 ] Jason Brown commented on CASSANDRA-7282: Just an update: running into some weird problems getting trunk to run on my servers - I think some settings I have in the cassandra.in.sh script conflict with 3.0. Will figure it out in the morning when I'm not falling asleep.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136370#comment-14136370 ] Jonathan Ellis commented on CASSANDRA-7282: --- bq. it does factor in to the decision as to whether the benefits outweigh the costs associated with moving the complexity from the CSLM into our context as well as flushing out any subtle bugs and/or downstream unintended side-effects that come along with the change. I do agree in principle that we have a complexity budget and we need to be wise about choosing where to spend it. I look forward to seeing Jason's numbers.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136745#comment-14136745 ] Benedict commented on CASSANDRA-7282: - I'm not sure which statement I made gave the impression I disagree with this. I do not. My argumentation has all been around which facts are most pertinent to this decision, and how they should be interpreted.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133122#comment-14133122 ] Benedict commented on CASSANDRA-7282: - bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an ordering key or ordering prefix? Creating an interface would be problematic, since we need to have our map key be a shared type for both CSLM keys and NBHOM keys. So I'm going to stick with the current situation. If you meant from a purely documentation point of view, it is absolutely essential that the value is a _hash_, otherwise performance will be O(n^2), so whilst it may be worth clarifying, it is essential we call it a hashCode(). To elaborate on this in documentation, I've included the following extra comment:
{quote}
This data structure essentially only works for keys that are first sorted by some hash value (and may then be sorted within those hashes arbitrarily), where a 32-bit prefix of the hash we sort by is returned by hashCode()
{quote}
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133164#comment-14133164 ] Benedict commented on CASSANDRA-7282: - Uploaded a version with updated comments [here|https://github.com/belliottsmith/cassandra/tree/7282-fastmemtablemap]
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132586#comment-14132586 ] Benedict commented on CASSANDRA-7282: - Perhaps this is a good opportunity, for our second set of data points, to try taking the latest 'realistic workload' features of cassandra-stress for a whirl.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132885#comment-14132885 ] Branimir Lambov commented on CASSANDRA-7282: I have a question about the implementation: You state in the preface of the NBHOM that the following condition must be satisfied for two keys k1, k2: k1 < k2 => k1.hashCode() < k2.hashCode(). I can see that if the condition is satisfied the structure will work very well, but this condition is much too strong. On one hand, I don't believe the inequality on the right side of the condition should be strict, because this forbids all collisions in the hash function in a totally ordered key set. More importantly, even with <= on the right, I don't think the condition is satisfiable for any usable hash function, and in particular, I don't see how the Murmur3 partitioner tokens satisfy it. Could you elaborate on why this condition is satisfied by the proposed code?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
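The hybrid structure the ticket description sketches (ordered list plus hash index, with bucket order matching list order because the keys are hash-ordered) can be illustrated with the following single-threaded Java sketch. The real proposal is a lock-free concurrent map; all class and method names here are invented for illustration only.

```java
// Illustrative, single-threaded sketch of the ticket's hybrid structure: an
// ordered linked list of entries plus a hash index mapping a key's hash prefix
// straight into its neighbourhood in the list. Because a hash partitioner
// orders keys by their hash, each index bucket covers a contiguous run of the
// list, so lookups and inserts touch O(1) expected nodes for uniform keys.
import java.util.ArrayList;
import java.util.List;

final class HashOrderedList {
    static final class Node {
        final long token;   // already hash-ordered, as with Murmur3 tokens
        String value;
        Node next;
        Node(long token, String value) { this.token = token; this.value = value; }
    }

    private final int bucketBits;
    private final Node[] buckets;                       // first node of each bucket's run
    private final Node sentinel = new Node(Long.MIN_VALUE, null);

    HashOrderedList(int bucketBits) {
        this.bucketBits = bucketBits;
        this.buckets = new Node[1 << bucketBits];
    }

    // Flip the sign bit so unsigned prefix order matches signed long order,
    // then take the top bits: bucket order == list order.
    private int bucket(long token) {
        return (int) ((token ^ Long.MIN_VALUE) >>> (64 - bucketBits));
    }

    void put(long token, String value) {
        int b = bucket(token);
        Node start;                                      // closest known node <= token
        if (buckets[b] != null && buckets[b].token <= token) {
            start = buckets[b];
        } else {
            start = sentinel;                            // else the tail of the nearest
            for (int i = b - 1; i >= 0; i--)             // earlier non-empty bucket
                if (buckets[i] != null) { start = buckets[i]; break; }
        }
        while (start.next != null && start.next.token < token) start = start.next;
        if (start.token == token) { start.value = value; return; }
        if (start.next != null && start.next.token == token) { start.next.value = value; return; }
        Node node = new Node(token, value);
        node.next = start.next;
        start.next = node;
        if (buckets[b] == null || token < buckets[b].token) buckets[b] = node;
    }

    String get(long token) {
        int b = bucket(token);                           // O(1) hop into the list
        for (Node n = buckets[b]; n != null && n.token <= token; n = n.next)
            if (n.token == token) return n.value;
        return null;
    }

    long[] tokensInOrder() {                             // the list gives ordered iteration
        List<Long> out = new ArrayList<>();
        for (Node n = sentinel.next; n != null; n = n.next) out.add(n.token);
        return out.stream().mapToLong(Long::longValue).toArray();
    }
}
```

Lookups hop straight to the right run via the bucket array, while range scans simply walk the list; this is the sense in which the hash index "lives alongside" the ordered collection.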
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132887#comment-14132887 ] Benedict commented on CASSANDRA-7282: - For the Murmur3Partitioner, the Token (which is the key for this map) is sorted by hash, so all we require is that the hashCode() method returns a prefix of this hash.
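A minimal sketch of this "hashCode is a prefix of the hash" point (class and method names invented): if the key ordering is the Murmur3 token order and the token is held as a signed 64-bit value, the top 32 bits taken with an arithmetic shift compare as a signed int in the same (non-strict) order as the full token.

```java
// Sketch: an order-preserving 32-bit prefix of a signed 64-bit token.
// For any tokens t1 < t2 it holds that prefix(t1) <= prefix(t2).
final class TokenPrefix {
    static int prefix(long token) {
        // arithmetic shift keeps the sign bit, so signed int order on the
        // prefix agrees with signed long order on the token (non-strictly)
        return (int) (token >> 32);
    }
}
```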
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132901#comment-14132901 ] Joshua McKenzie commented on CASSANDRA-7282: {quote} but if the effect is not seen, or is lost in the noise, on the specific hardware / use case we're benchmarking that doesn't isolate, my only point is that this doesn't IMO reflect badly on the change. {quote} It doesn't reflect badly on the change, per se, but it does factor in to the decision as to whether the benefits outweigh the costs associated with moving the complexity from the CSLM into our context, as well as flushing out any subtle bugs and/or downstream unintended side-effects that come along with the change.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132906#comment-14132906 ] Benedict commented on CASSANDRA-7282: - As already stated, I disagree. When making these decisions, the important factors are always: 1) what workloads / portions of workloads are affected; and 2) do we consider these common enough (or likely to become common enough) to warrant inclusion? Since we are very constrained on our hardware and workload generation, picking a realistic workload that we can perform at best helps us rule out regressions if it is not compatible with exhibiting the change. The important question is simply: do we consider it likely it will impact other workloads we are not capable of benchmarking, given the known information we have from isolating its effect in a manner we _can_.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132913#comment-14132913 ] Branimir Lambov commented on CASSANDRA-7282: This makes sense. Thanks for clarifying; the fact that the list is ordered by the Token escaped me. Since this is the case, should we restrict this method to Murmur tokens only? For anything that sorts naturally we could just grab the first four bytes. I wouldn't even call this a hashCode to avoid the confusion; perhaps an ordering key or ordering prefix? I would still change the condition in the preface to use non-strict inequality, though, because cropping tokens to 32 bits will introduce collisions.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132924#comment-14132924 ] Joshua McKenzie commented on CASSANDRA-7282: {quote} does factor in to the decision {quote} {quote} 1) what workloads / portions of workloads are affected; {quote} That's what I was referring to. To be clear: I'm not saying we shouldn't use the container or that I'm against it in any way. I'm just re-iterating that we need to be clear as to which use-cases and configurations we expect this to help with (which you've done) and weigh that relative to the added risk of untested code and the added complexity associated with the change.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132953#comment-14132953 ] Benedict commented on CASSANDRA-7282: - bq. Since this is the case, should we restrict this method to Murmur tokens only? Well, if I can spend the time ensuring safety we should be able to use this for RandomPartitioner also. bq. I would still change the condition in the preface to use non-strict inequality, though, because cropping tokens to 32 bits will introduce collisions. There will be collisions with or without truncation. The fact that there are collisions doesn't affect the constraint I've imposed upon the data; we assume nothing about the dataset when two hashCode()s are equal, and simply resort to the underlying token comparator, so I think the constraint is sufficient. A non-strict inequality is surely more broken, however? It would hold true for k1:1, k2:2, hash(k1):2, hash(k2):2. We could strengthen it to a bi-directional implication, i.e. k1 <= k2 <=> k1.hashCode() <= k2.hashCode(). bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an ordering key or ordering prefix? Perhaps. This was the simplest approach, and since it *is* a hash key used to index a hash table it seems suitable to use hashCode(), and impose the extra constraint contextually. I'm fairly neutral though; we certainly could introduce a new interface. I'll see how it looks.
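The tie-break Benedict describes (compare by hash prefix first, and only on a prefix collision resort to the full token comparator) can be sketched as a Comparator; all names here are invented for illustration.

```java
import java.util.Comparator;

// Sketch of the collision tie-break: order by the 32-bit hash prefix, and only
// when the prefixes collide fall back to the full token comparison. A collision
// therefore costs one extra compare rather than breaking the ordering.
final class PrefixThenTokenComparator implements Comparator<Long> {
    static int prefix(long token) { return (int) (token >> 32); }

    @Override
    public int compare(Long a, Long b) {
        int byPrefix = Integer.compare(prefix(a), prefix(b));
        return byPrefix != 0 ? byPrefix : Long.compare(a, b); // full comparison on collision
    }
}
```

Because the prefix is order-preserving, sorting by this comparator produces exactly the natural token order.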
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132983#comment-14132983 ] Branimir Lambov commented on CASSANDRA-7282: {quote}I think the constraint is sufficient.{quote} The constraint is sufficient, but it is too strong and isn't satisfied by the current version of the code. Assuming that the key ordering is total (since any ordering defined by a Comparable must be total), the problem with the current k1 < k2 -> hash(k1) < hash(k2) is that it implies k1 != k2 -> hash(k1) != hash(k2), and thus hash(k1) == hash(k2) -> k1 == k2, because k1 != k2 means that either k1 < k2 or k2 < k1. This means that the condition will only be satisfied by a hash without collisions. A bidirectional <= has the same problem, since hash(k1) == hash(k2) <-> hash(k1) <= hash(k2) && hash(k2) <= hash(k1) <-> k1 <= k2 && k2 <= k1 <-> k1 == k2. It is actually a stronger statement than the original condition, hence it is not a surprise that it doesn't solve the problem. We need something weaker. To make it weaker we either strengthen the premise or weaken the conclusion. We can do the latter by making the comparison non-strict: k1 < k2 -> hash(k1) <= hash(k2). For this one, k1 != k2 implies only hash(k1) <= hash(k2) || hash(k2) <= hash(k1), which is trivially true. Looking at the code, this is actually the condition you use, because you choose predecessors based on hash order rather than direct comparison. {quote}It would hold true for k1:1, k2:2, hash(k1):2, hash(k2):2.{quote} I don't see anything wrong with this; it is just a hash collision.
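A concrete instance of this argument (values invented): with a 32-bit prefix of a 64-bit token, two distinct tokens sharing the prefix are exactly a hash collision, so the strict form k1 < k2 -> hash(k1) < hash(k2) cannot hold, while the non-strict form k1 < k2 -> hash(k1) <= hash(k2) does.

```java
// Two distinct tokens with the same 32-bit prefix: the strict condition is
// violated by the collision, the non-strict condition still holds.
final class ConditionCheck {
    static int hash(long token) { return (int) (token >> 32); } // 32-bit prefix

    static boolean strictHolds(long k1, long k2)    { return !(k1 < k2) || hash(k1) <  hash(k2); }
    static boolean nonStrictHolds(long k1, long k2) { return !(k1 < k2) || hash(k1) <= hash(k2); }
}
```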
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132991#comment-14132991 ] Benedict commented on CASSANDRA-7282: - Hmm. Yes. I shouldn't get into these discussions at this time of night. Agreed.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132137#comment-14132137 ] Jonathan Ellis commented on CASSANDRA-7282: --- [~jasobrown] could you try this on your cluster?
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132212#comment-14132212 ] Jason Brown commented on CASSANDRA-7282: [~jbellis] Sure, will try it out early next week. [~benedict] How are you running stress? Any special use case?
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132494#comment-14132494 ] Benedict commented on CASSANDRA-7282: - Yes, we want to restrict the workload to memtables only, and we want to make them big to generate enough numbers on insert to avoid noise. So from a fresh cluster, I start it with memtable_cleanup_threshold: 0.99 and memtable_allocation_type: offheap_objects, and I run a stress test with only _one column_ per partition, and make that column size 1 (although any size is fine if it's offheap). I'm just trying to make the memtable as large as possible, which with my 4GB laptop is difficult. I then set memtable_heap_space_in_mb to 1024 (feel free to make it much bigger), and then insert around 5M items. If you stick to one column and ensure your offheap space is sufficiently large for the data you insert into it, then 5M items per node per GB of on-heap space is achievable (my math tells me around 9M should be possible, but I overshot slightly and decided to be conservative). I then follow up immediately with a read run hitting a random selection of those PKs.
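The setup above amounts to roughly the following cassandra.yaml fragment. The values come straight from the comment; treat the exact key names as an assumption, since they vary between Cassandra versions. The stress run then inserts ~5M single-column partitions and follows up with reads over a random selection of those PKs.

```yaml
# cassandra.yaml fragment matching the comment above (key names assumed)
memtable_cleanup_threshold: 0.99          # keep data in the memtable; avoid flushes
memtable_allocation_type: offheap_objects # cells off-heap, map stays on-heap
memtable_heap_space_in_mb: 1024           # "feel free to make it much bigger"
```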
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132504#comment-14132504 ] Aleksey Yeschenko commented on CASSANDRA-7282: -- bq. Yes, we want to restrict the workload to memtables only I understand why, but doesn't it sort of defeat the purpose of benchmarking in an actual real-world situation?
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132508#comment-14132508 ] Benedict commented on CASSANDRA-7282: - No. You benchmark to isolate performance improvements. These numbers demonstrate a benefit _to any portion of a workload that fits these characteristics_. Different use cases will exhibit different ratios of this effect; some considerable, some not so much. Certainly the read performance enhancements will be more generally applicable, but being able to fill your memtables faster and with less garbage is also not a bad thing, even if you end up bottlenecking on disk. Especially since server characteristics are changing rapidly, so that disk bottlenecks are disappearing. As optimisation work goes deeper, isolating the specific portion of work that is affected is pretty essential to clearly delineating the benefit.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132511#comment-14132511 ] Aleksey Yeschenko commented on CASSANDRA-7282: -- True, but the new, more complex code replacing the old code tested by years of production stays there for all the use cases. So maybe you want to benchmark both the niche and the general use case as well - to make a decision whether or not a particular change is justified. (No idea if it is or isn't in this case, but I would love to see some real-world data on this.)
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132523#comment-14132523 ] Benedict commented on CASSANDRA-7282: - The new code is not significantly more complex. In fact, it's arguably less complex; the only difference is that some of that complexity is not derived from an external source (CSLM is significantly more complex than the NBHOM). However, it is a pretty simple piece of code. The point of isolating is that we cannot possibly test all possible real-world use cases. DSE, for instance, delivers Memory-Only SSTables, which depend on memtables, so all workloads hitting these would be affected fully by this change. All other workloads will be hit to different degrees. In general we have to make a judgement call about how common such workloads are, vs how significant the effect is for those workloads. However, trying to define a success metric based on a narrow definition of 'realistic workloads' that do not isolate the behaviour doesn't buy us anything in my book. Our hardware and workload definitions are not sufficiently universal. If we see a 15% bump for accesses to memtables, that is IMO worth incorporating, esp. as the bump will increase over time. As we move memtables fully offheap we can support larger and larger memtables, and since the algorithmic property of this change is that its cost does not increase as the memtable grows, whereas the CSLM's grows logarithmically, this change also has the likelihood of paying further future dividends.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132530#comment-14132530 ] Jason Brown commented on CASSANDRA-7282: +1 to [~iamaleksey]'s sentiment. Both the general and specific use cases must be tested, and if the general case has a regression, we should not optimize for the specific case. I'm sure we all agree about that. Let's do some testing over the next week and observe what happens.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132532#comment-14132532 ] Aleksey Yeschenko commented on CASSANDRA-7282: -- For the record, I'm not saying that we shouldn't do it - just that having the data for different use cases, and not just one specific one, would be nice, and ultimately necessary to make that judgement call. I'm expecting that it would probably be worth incorporating. But there definitely is *some* value in heavily battle-tested code when you work on a database.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132536#comment-14132536 ] Benedict commented on CASSANDRA-7282: - bq. But there definitely is some value in heavily battle-tested code when you work on a database.
Sure, agreed :) I'm not disputing there's value in _seeing extra data_ (there always is), but if the effect is not seen, or is lost in the noise, on the specific hardware / use case we're benchmarking - one that _doesn't_ isolate the change - my only point is that this doesn't IMO reflect badly on the change.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103866#comment-14103866 ] Joshua McKenzie commented on CASSANDRA-7282: Some interesting results on the [99.9th|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.7282.defaults.json&metric=99.9th_latency&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=688.16&ymin=0&ymax=1600] and [max|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.7282.defaults.json&metric=max_latency&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=688.16&ymin=0&ymax=3000] write latency. While the aggregated values look better on the new non-blocking hash-ordered map, there look to be more outliers. Any ideas on why that might be, [~benedict]? 95th and 99th show pretty [marked improvement|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.7282.defaults.json&metric=99th_latency&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=688.16&ymin=0&ymax=140], so the differential looks to be at the extremes only. Either way - I'll be very interested to see the results of the test w/ memtable and row size changes.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104014#comment-14104014 ] Benedict commented on CASSANDRA-7282: - A few possibilities:
1) very few data points, so it's difficult to draw much of a conclusion
2) the 99.9th %ile border is arbitrary with so few data points, as evidenced by the better max latency
3) the largest latencies are ordinarily encountered during either CL or memtable flush; since we're writing faster, it's possible we are simply putting the disks under increased pressure and hitting higher latencies more often, bringing the 99.9th %ile closer to max

Ryan - I'd suggest running with offheap_objects, raising the commit log size limit to 16Gb, a memtable_cleanup_threshold of 0.99, setting your on-heap memtable capacity to 4Gb and your offheap to 2Gb, and running the test with just one column of around 32 bytes. Then use 48M items and perform your normal run of insert then read; we *shouldn't* see any flush result from that, if I did my quick bit of math correctly.
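The suggested settings map onto cassandra.yaml roughly as follows - a sketch using the 2.1-era option names, combining this comment's values with the on-heap/off-heap capacities Benedict spells out a few messages later; verify the option names against your version before relying on them:

```yaml
# Sketch of the benchmark configuration suggested above (2.1-era cassandra.yaml)
memtable_allocation_type: offheap_objects
commitlog_total_space_in_mb: 16384     # raise the commit log size limit to 16Gb
memtable_cleanup_threshold: 0.99       # defer flush for as long as possible
memtable_heap_space_in_mb: 4096        # ~4Gb on-heap memtable capacity
memtable_offheap_space_in_mb: 2048     # ~2Gb off-heap memtable capacity
```

The intent of these values is to keep the whole 48M-item working set resident in the memtable for the duration of the run, so no flush perturbs the measurements.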
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104307#comment-14104307 ] Ryan McGuire commented on CASSANDRA-7282: - Here's a trial that incorporates some of the above suggestions: http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.7282.default_mct.json
* 48M 32-byte rows
* commitlog_total_space_in_mb: 16384
* memtable_allocation_type: offheap_objects
I'm not familiar with the settings that "heap memtable capacity to 4Gb, your offheap to 2Gb" refers to. I tried adding a memtable_cleanup_threshold of 0.99, but the write failed on both runs when it started GC, so I'll try scaling that back some more (the above results just use the default instead).
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104607#comment-14104607 ] Ryan McGuire commented on CASSANDRA-7282: - memtable_cleanup_threshold: 0.5 is the highest I've been able to go without the write failing mid-operation. http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.7282.mct_05.json
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104924#comment-14104924 ] Benedict commented on CASSANDRA-7282: -
memtable_heap_space_in_mb: 4096
memtable_offheap_space_in_mb: 2048
-Xmx8G
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104970#comment-14104970 ] Ryan McGuire commented on CASSANDRA-7282: - Thanks Benedict, I tried it with those new settings, still using a memtable_cleanup_threshold of 0.5: http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.7282.3.json Getting a bit more interesting now; not sure what's up with that first read and then the drop. I'll try scaling up memtable_cleanup_threshold again.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104988#comment-14104988 ] Ryan McGuire commented on CASSANDRA-7282: - memtable_cleanup_threshold: 0.99 is working with these new settings: http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.7282.4.json
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105022#comment-14105022 ] Benedict commented on CASSANDRA-7282: - Ryan provided me with logs, and we can see a flush trigger exactly as the 0.5 mct write test finishes, explaining the dip. However, the temporary dip below normal is a bit odd and I'd like to explain it better - my best guess is the key cache being populated; there is a slower ramp-up for the normal run here as well, which would be consistent with that (since it flushed earlier). We could try a run with the key cache disabled, to confirm.

I'd also like to be able to explain the different GC characteristics - we GC less often with the new code, but have one _very_ long ParNew pause, fairly consistently. Currently I don't have a great explanation for this, except possibly the very large array allocations we need to do. Possibly switching to an array-of-arrays would be superior, keeping each of the arrays to 1M entries or so.

For the mct 0.99 run, what's most surprising is that reads are _not_ particularly faster, which is not what I found running locally, and which also conflicts with the non-trivial increase in write performance. Back to the drawing board to try and figure out exactly what's happening.
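The array-of-arrays suggestion can be sketched as follows - a hypothetical, simplified illustration (not the patch's actual layout) in which one large logical array is split into fixed-size 1M-entry chunks, so no single allocation is big enough to distort ParNew pause times:

```java
/**
 * Sketch of the array-of-arrays idea above: a large logical long[] presented
 * as fixed-size chunks, so the largest single allocation is one chunk
 * (1M longs = 8MB) plus a small spine, rather than one huge array.
 */
class ChunkedArray {
    private static final int CHUNK_SHIFT = 20;             // 1M entries per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_SHIFT;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final long[][] chunks;

    ChunkedArray(long capacity) {
        int n = (int) ((capacity + CHUNK_SIZE - 1) >>> CHUNK_SHIFT);
        chunks = new long[n][];                            // spine only; chunks allocated lazily
    }

    void set(long i, long v) {
        int c = (int) (i >>> CHUNK_SHIFT);
        if (chunks[c] == null)
            chunks[c] = new long[CHUNK_SIZE];              // bounded-size allocation
        chunks[c][(int) (i & CHUNK_MASK)] = v;
    }

    long get(long i) {
        long[] c = chunks[(int) (i >>> CHUNK_SHIFT)];
        return c == null ? 0L : c[(int) (i & CHUNK_MASK)];
    }
}
```

Indexing costs one extra shift/mask and one extra pointer hop versus a flat array; the payoff is that young-gen collections never have to copy or scan a single multi-hundred-megabyte object.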
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102900#comment-14102900 ] Joshua McKenzie commented on CASSANDRA-7282: [~enigmacurry] Any chance you got some performance numbers for the branch Benedict linked above? [~benedict] Do you have any particular use-cases you think best showcase the benefits of the code changes here?
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103245#comment-14103245 ] Benedict commented on CASSANDRA-7282: - The only workload we'll easily measure it in at the moment is in-memory reads. So: increase total memtable space substantially, insert small records up to an amount that fits entirely in the memtable while filling as much of it as possible, and then perform aggressive random reads. It will also have a similar level of impact on _insert_ times; however, since those are ultimately disk bound, we are unlikely to be able to measure it as easily.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103252#comment-14103252 ] Ryan McGuire commented on CASSANDRA-7282: - This fell off my radar; I've started a basic test comparing trunk to this. However, I notice this branch has diverged quite a bit - is that still the best comparison, or did you want to merge in changes?
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103259#comment-14103259 ] Benedict commented on CASSANDRA-7282: - Can you compare to the SHA it was based on? If not, I'll rebase.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103324#comment-14103324 ] Ryan McGuire commented on CASSANDRA-7282: - Here's a test run with default settings. We fared just slightly better than stock. http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.7282.defaults.json Tomorrow I'll run a test incorporating [~benedict]'s suggestions regarding memtable and row size adjustments.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089019#comment-14089019 ] Benedict commented on CASSANDRA-7282: - Just pushed a minor update removing an extraneous comment and making the resize threshold triggering uniform and clearer. Agree it would be good to get some performance numbers on this, but I'm not sure which magical facilities you're referring to? New stress isn't likely to stress this bit out any more interestingly than old stress, and I don't think we have a working magical performance service yet...
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089071#comment-14089071 ] Aleksey Yeschenko commented on CASSANDRA-7282: -- bq. I'm not sure which magical facilities you're referring to
Just Ryan's stress setup (:
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087964#comment-14087964 ] Jonathan Ellis commented on CASSANDRA-7282: --- [~JoshuaMcKenzie] to review
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088631#comment-14088631 ] Aleksey Yeschenko commented on CASSANDRA-7282: -- Can we perf-test it first, using our new magical facilities, just to be sure?
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019074#comment-14019074 ] Benedict commented on CASSANDRA-7282: - First version available [here|https://github.com/belliottsmith/cassandra/tree/7282-fastmemtablemap]. I need to perform some more rigorous performance tests, but I see a 7-10% bump on my box for a naive stress test hitting memtable reads.

> Faster Memtable map
> -------------------
>
>                 Key: CASSANDRA-7282
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>             Fix For: 3.0
>
> Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid skip list / hash map. The skip list would impose the order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the skip list and permitting O(1) lookups. It should be possible to define the hash map to also permit O(1) inserts. Our decorated keys are in fact perfectly designed for this scheme. At the same time, we can potentially improve the data locality in the skip list by baking the initial 64 token bits directly into the structure, and storing multiple values per skip list entry to improve cache performance, bringing down memory and constant factor cpu overheads.
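The "multiple values per entry" idea in the description above can be sketched like this - a hypothetical illustration (field names invented) of a fat node whose 32-bit hash fragments sit in one contiguous array, so a lookup scans roughly a cache-line's worth of hashes before dereferencing any key object:

```java
/**
 * Sketch of a "fat node" holding several entries: the 32-bit hash fragments
 * are stored contiguously, so a lookup compares up to FANOUT hashes with no
 * pointer chasing, and only touches a key object on a hash match.
 */
class FatNode {
    static final int FANOUT = 16;             // 16 x 4-byte hashes = one 64-byte cache line

    final int[] hashes = new int[FANOUT];     // scanned first, contiguous in memory
    final Object[] keys = new Object[FANOUT];
    final Object[] values = new Object[FANOUT];
    int size;

    void add(Object key, Object value) {      // overflow into a sibling node omitted
        hashes[size] = key.hashCode();
        keys[size] = key;
        values[size] = value;
        size++;
    }

    Object find(Object key) {
        int h = key.hashCode();
        for (int i = 0; i < size; i++)
            if (hashes[i] == h && key.equals(keys[i]))  // key dereferenced only on hash match
                return values[i];
        return null;
    }
}
```

For a miss, the common case is FANOUT cheap int comparisons against one cache line, instead of FANOUT dependent loads through key pointers - which is the constant-factor saving the description is after.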
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019075#comment-14019075 ] Benedict commented on CASSANDRA-7282: - [~enigmacurry] how easy would it be to kick off a quick comparison of this with stock? Do we have any hardware that's unoccupied?
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019082#comment-14019082 ] Benedict commented on CASSANDRA-7282: - Hmm. Scratch that, I've broken something.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019081#comment-14019081 ] Ryan McGuire commented on CASSANDRA-7282: - Probably not till next week.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019086#comment-14019086 ] Benedict commented on CASSANDRA-7282: - Next week is fine.
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019210#comment-14019210 ] Benedict commented on CASSANDRA-7282: - There was a minor bug with compatibility mode for non-murmur3 partitioners. Unfortunately it looks like the unit tests and dtests all specify ByteOrderedPartitioner, making this difficult to test thoroughly across all database features. It works fine with stress, but I will need to write some thorough bespoke unit tests.
> Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in
> our memtables. Maintaining this is an O(lg(n)) operation; since the vast
> majority of users use a hash partitioner, it occurs to me we could maintain a
> hybrid ordered list / hash map. The list would impose the normal order on the
> collection, but a hash index would live alongside as part of the same data
> structure, simply mapping into the list and permitting O(1) lookups and
> inserts.
> I've chosen to implement this initial version as a linked-list node per item,
> but we can optimise this in future by storing fatter nodes that permit a
> cache-line's worth of hashes to be checked at once, further reducing the
> constant factor costs for lookups.
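The "fatter nodes" optimisation mentioned in the updated description can be sketched as below: instead of one key per node, each node holds a small contiguous array of precomputed hashes, so a lookup scans roughly one cache line of ints before it has to touch (and potentially cache-miss on) any key object. This is a toy, single-threaded illustration of the layout idea with invented names, not code from the branch.

```java
// Toy sketch of a "fat" list node: up to 8 entries per node, with the
// hashes stored contiguously so the common-case rejection (hash mismatch)
// never dereferences a key object.
final class FatNode<K, V> {
    static final int WIDTH = 8;            // 8 ints ~ half a 64-byte cache line
    final int[] hashes = new int[WIDTH];   // scanned first, contiguously
    final Object[] keys = new Object[WIDTH];
    final Object[] values = new Object[WIDTH];
    int size;
    FatNode<K, V> next;                    // nodes chain into an ordered list

    @SuppressWarnings("unchecked")
    V find(K key, int hash) {
        for (int i = 0; i < size; i++)
            if (hashes[i] == hash && key.equals(keys[i]))  // equals only on hash hit
                return (V) values[i];
        return null;
    }

    boolean add(K key, int hash, V value) {
        if (size == WIDTH)
            return false;                  // full: caller links a new node
        hashes[size] = hash;
        keys[size] = key;
        values[size] = value;
        size++;
        return true;
    }
}
```

Compared with a node per item, this reduces pointer chasing and memory overhead at the cost of a small per-node scan, which is cheap because the hashes are adjacent in memory.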