[jira] [Comment Edited] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244465#comment-15244465
 ] 

Benedict edited comment on CASSANDRA-11452 at 4/16/16 11:29 PM:


One more variant for you: instead of random admission, with a similar (or 
slightly higher) rate, walk an LRU order iterator one step and use the next 
key's frequency.  After each step, reset the iterator with a 1% chance.

Basically it's the same as random admission but without its blindness.  It could 
have a bound on frequency, but that could be very low, perhaps just 3, to ignore 
rejections that are no doubt legitimate.

I'm not suggesting you go and do it; I just wanted to note it for posterity as I 
think it's approximately optimal.  The only risk is if your finger gets referenced 
and bumped to MRU, which could be guarded against.
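
A minimal Java sketch of this variant, purely to illustrate the shape of it: 
{{lruIterator()}} and {{frequency()}} are assumed hooks into the eviction policy 
and the frequency sketch (not part of any existing API), and the bound of 3 is 
interpreted here as rejecting very cold candidates outright.

{code:java}
import java.util.Iterator;
import java.util.concurrent.ThreadLocalRandom;

final class LruFingerAdmittor<K> {
  private static final double RESET_CHANCE = 0.01; // ~1% chance to restart the walk
  private Iterator<K> finger;                       // walks the cache in LRU order

  boolean admit(K candidate) {
    int candidateFreq = frequency(candidate);
    if (candidateFreq <= 3) {
      return false;             // very low bound: such rejections are no doubt legitimate
    }
    if (finger == null || !finger.hasNext()
        || ThreadLocalRandom.current().nextDouble() < RESET_CHANCE) {
      finger = lruIterator();   // reset the iterator (~1% chance, or when exhausted)
      if (!finger.hasNext()) {
        return true;            // empty cache: nothing to displace
      }
    }
    K guard = finger.next();    // step one key in LRU order and use it as the guard
    return candidateFreq > frequency(guard);
  }

  // Assumed hooks into the policy and the frequency sketch.
  private Iterator<K> lruIterator() { throw new UnsupportedOperationException("policy hook"); }
  private int frequency(K key) { throw new UnsupportedOperationException("sketch hook"); }
}
{code}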

bq. Thanks a lot for all your help on this =)

My pleasure - this is my idea of fun.

Thanks for putting together an implementation of W-TinyLFU and the trace 
simulators!



was (Author: benedict):
One more variant for you: instead of random admission, with a similar (or 
slightly higher) rate, walk an LRU order iterator one step and use the next 
key's frequency.  After each step, reset the iterator with a 1% chance.

Basically it's the same as random admission but without its blindness.  It could 
have a bound on frequency, but that could be very low, perhaps just 3, to ignore 
rejections that are no doubt legitimate.

I'm not suggesting you go and do it; I just wanted to note it for posterity as I 
think it's approximately optimal.  The only risk is if your finger gets referenced 
and bumped to MRU, which could be guarded against.

bq. Thanks a lot for all your help on this =)

My pleasure - this is my idea of fun.


> Cache implementation using LIRS eviction for in-process page cache
> --
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Branimir Lambov
>Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid 
> having to explicitly mark compaction accesses as non-cacheable, we need a 
> cache implementation that uses an eviction algorithm that can better handle 
> non-recurring accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244465#comment-15244465
 ] 

Benedict commented on CASSANDRA-11452:
--

One more variant for you: instead of random admission, with a similar (or 
slightly higher) rate, walk an LRU order iterator one step and use the next 
key's frequency.  After each step, reset the iterator with a 1% chance.

Basically it's the same as random admission but without its blindness.  It could 
have a bound on frequency, but that could be very low, perhaps just 3, to ignore 
rejections that are no doubt legitimate.

I'm not suggesting you go and do it; I just wanted to note it for posterity as I 
think it's approximately optimal.  The only risk is if your finger gets referenced 
and bumped to MRU, which could be guarded against.

bq. Thanks a lot for all your help on this =)

My pleasure - this is my idea of fun.


> Cache implementation using LIRS eviction for in-process page cache
> --
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Branimir Lambov
>Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid 
> having to explicitly mark compaction accesses as non-cacheable, we need a 
> cache implementation that uses an eviction algorithm that can better handle 
> non-recurring accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Ben Manes (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244463#comment-15244463
 ] 

Ben Manes commented on CASSANDRA-11452:
---

I checked in our fix and I'll release after we hear back from Roy. I'll forward 
that along here and might revisit the fix based on his feedback. When the 
release is out I'll update my pull request and then notify everyone on the 
task so we can look at moving it forward.

Thanks a lot for all your help on this =)

> Cache implementation using LIRS eviction for in-process page cache
> --
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Branimir Lambov
>Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid 
> having to explicitly mark compaction accesses as non-cacheable, we need a 
> cache implementation that uses an eviction algorithm that can better handle 
> non-recurring accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244411#comment-15244411
 ] 

Benedict edited comment on CASSANDRA-11452 at 4/16/16 9:31 PM:
---

bq. Sadly Ehcache3 did this for their randomly sampled LRU

Ah, but they were _evicting_ a random sample: here we only want to randomly 
admit, and since we're protecting against a biased sample that sets too high a threshold, 
and we're looking for (thereabouts) the _lowest_ frequency in the map, we 
shouldn't ever negatively impact performance, only possibly be slightly slow to 
respond to these attacks.  With an Iterator we'd eventually visit the entire 
map too, so we'd be absolutely guaranteed to be robust (without higher costs of 
"sampling" - I should clarify that for the Iterator and ideal direct CHM bucket 
"sampling" I simply meant picking a single random guard key).

That doesn't take away from what I said though: we're almost certainly 
achieving as much robustness as we need, and the algorithm you sketched there 
looks fine to me.


was (Author: benedict):
bq. Sadly Ehcache3 did this for their randomly sampled LRU

Ah, but they were _evicting_ a random sample: here we only want to randomly 
admit, and since we're protecting against a biased sample that sets too high a threshold, 
and we're looking for (thereabouts) the _lowest_ frequency in the map, we 
shouldn't ever negatively impact performance, only possibly be slightly slow to 
respond to these attacks.  With an Iterator we'd eventually visit the entire 
map too, so we'd be absolutely guaranteed to be robust.

That doesn't take away from what I said though: we're almost certainly 
achieving as much robustness as we need, and the algorithm you sketched there 
looks fine to me.

> Cache implementation using LIRS eviction for in-process page cache
> --
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Branimir Lambov
>Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid 
> having to explicitly mark compaction accesses as non-cacheable, we need a 
> cache implementation that uses an eviction algorithm that can better handle 
> non-recurring accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244411#comment-15244411
 ] 

Benedict commented on CASSANDRA-11452:
--

bq. Sadly Ehcache3 did this for their randomly sampled LRU

Ah, but they were _evicting_ a random sample: here we only want to randomly 
admit, and since we're protecting against a biased sample that sets too high a threshold, 
and we're looking for (thereabouts) the _lowest_ frequency in the map, we 
shouldn't ever negatively impact performance, only possibly be slightly slow to 
respond to these attacks.  With an Iterator we'd eventually visit the entire 
map too, so we'd be absolutely guaranteed to be robust.

That doesn't take away from what I said though: we're almost certainly 
achieving as much robustness as we need, and the algorithm you sketched there 
looks fine to me.

> Cache implementation using LIRS eviction for in-process page cache
> --
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Branimir Lambov
>Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid 
> having to explicitly mark compaction accesses as non-cacheable, we need a 
> cache implementation that uses an eviction algorithm that can better handle 
> non-recurring accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Ben Manes (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244406#comment-15244406
 ] 

Ben Manes commented on CASSANDRA-11452:
---

It definitely would be nice to be able to reduce the per-entry cost, have 
access to the hash, and avoid lambdas by inlining into the code. I kept hoping Doug 
would take a stab at it and see what ideas he'd use.

bq. It's a shame we don't have access to the CHM to do the sampling, as that 
would make it robust to scans since all the members of the LRU would have high 
frequencies.

Sadly Ehcache3 did this for their randomly sampled LRU. CHM has a weak hashing 
function because it can degrade collision chains into red-black trees, but they 
weakened it further for speed. This results in a -20% hit rate relative to LRU 
by taking an MRU-heavy sample, surprisingly even in large caches. They also made 
it very slow, taking minutes instead of a few seconds. I'm now very wary of that 
idea because it can be done so horribly if handled naively.

I think for now I'm most comfortable using the following. I think it's robust 
enough, low cost, and should be hard to exploit (especially for an external 
actor). If we discover it is not strong enough, we have a plethora of options 
now. :-)

{code:java}
boolean admit(K candidateKey, K victimKey) {
  int victimFreq = frequencySketch().frequency(victimKey);
  int candidateFreq = frequencySketch().frequency(candidateKey);
  if (candidateFreq > victimFreq) {
    // The candidate is hotter than the victim: admit it.
    return true;
  } else if (candidateFreq <= 5) {
    // Cold candidate: a legitimate TinyLFU rejection.
    return false;
  }
  // Warm candidate that still lost to the victim: admit ~1 in 128 at random,
  // so an attacker cannot keep a legitimately hot entry out indefinitely.
  int random = ThreadLocalRandom.current().nextInt();
  return ((random & 127) == 0);
}
{code}

> Cache implementation using LIRS eviction for in-process page cache
> --
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Branimir Lambov
>Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid 
> having to explicitly mark compaction accesses as non-cacheable, we need a 
> cache implementation that uses an eviction algorithm that can better handle 
> non-recurring accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244395#comment-15244395
 ] 

Benedict edited comment on CASSANDRA-11452 at 4/16/16 9:02 PM:
---

Or, much more simply: maintain a CHM Iterator (i.e. in hash, or approximately 
random, order) in a member variable.  For, say, 5% of rejections above 
frequency 4 and 0.1% of all rejections, step it forwards and use the next value 
as the guard.

I think this is probably the simplest and most robust to all of the above 
problems.
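
A minimal sketch of that member-variable iterator, assuming a {{frequency()}} 
hook into the sketch; the cache type and the exact percentages simply mirror the 
description above rather than any existing Caffeine or Cassandra code.

{code:java}
import java.util.Iterator;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

final class ChmIteratorGuard<K, V> {
  private final ConcurrentHashMap<K, V> cache;
  private Iterator<K> guardIter;   // member variable, advanced lazily in hash order

  ChmIteratorGuard(ConcurrentHashMap<K, V> cache) {
    this.cache = cache;
  }

  /** Called when TinyLFU has just rejected a candidate; may overrule the rejection. */
  boolean admitAnyway(int candidateFreq) {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    boolean resample = (candidateFreq > 4 && rnd.nextInt(20) == 0)  // ~5% of rejections above frequency 4
                    || rnd.nextInt(1000) == 0;                      // ~0.1% of all rejections
    if (!resample) {
      return false;
    }
    if (guardIter == null || !guardIter.hasNext()) {
      guardIter = cache.keySet().iterator();   // hash order is approximately random
      if (!guardIter.hasNext()) {
        return true;                           // empty cache: nothing to protect
      }
    }
    K guard = guardIter.next();                // step forward one entry
    return candidateFreq > frequency(guard);   // use its frequency as the guard
  }

  private int frequency(K key) { throw new UnsupportedOperationException("sketch hook"); }
}
{code}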


was (Author: benedict):
Or, much more simply: maintain a CHM Iterator (i.e. in hash, or approximately 
random, order) in a member variable.  For 5% of _rejections above frequency 4_ 
step it forwards and use the next value as the guard.

I think this is probably the simplest and most robust to all of the above 
problems.

> Cache implementation using LIRS eviction for in-process page cache
> --
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Branimir Lambov
>Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid 
> having to explicitly mark compaction accesses as non-cacheable, we need a 
> cache implementation that uses an eviction algorithm that can better handle 
> non-recurring accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244395#comment-15244395
 ] 

Benedict commented on CASSANDRA-11452:
--

Or, much more simply: maintain a CHM Iterator (i.e. in hash, or approximately 
random, order) in a member variable.  For 5% of _rejections above frequency 4_ 
step it forwards and use the next value as the guard.

I think this is probably the simplest and most robust to all of the above 
problems.

> Cache implementation using LIRS eviction for in-process page cache
> --
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Branimir Lambov
>Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid 
> having to explicitly mark compaction accesses as non-cacheable, we need a 
> cache implementation that uses an eviction algorithm that can better handle 
> non-recurring accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244390#comment-15244390
 ] 

Benedict commented on CASSANDRA-11452:
--

bq.  though a rough calculation indicates it isn't a huge savings

Assuming a basic CLHM, with CompressedOops on a 64-bit VM (Cassandra's 
defaults) I calculate overhead inflation of around 22% - I reckon 72 bytes are 
needed vs a possible 56 (once the 12 byte overheads are removed, and alignment 
accounted for).  You'd also be able to avoid recalculating the hash for the 
sketches since it's memoized in CHM.  Admittedly I don't 100% vouch for the 
accuracy of those calculations as I'm doing it from memory.

I absolutely am not suggesting your calculation of cost/benefit is wrong 
though, or that I would even have arrived at a different conclusion.  Certainly 
the user key/value sizes further amortize that overhead inflation, and for many 
workloads the distinction is barely perceptible.

bq. What do you think about combining the approach

I assume you mean the inversion of that guard.  It's a shame we don't have 
access to the CHM to do the sampling, as that would make it robust to scans 
since all the members of the LRU would have high frequencies.  My only slight 
concern is that we may have to wait tens of thousands of rejections to cycle out 
the collision, which is quite slow to respond; by raising the chance, though, we 
would harm scans.  A couple of other options:

# Randomly sample the frequency of, say, 1% of the items we admit (on 
admission, storing the last 16 or so), and on demand compute the lower quartile 
(sketched below)
# On demand, sample a random short run of the sketch when we encounter this 
situation, and compute some percentile (which needs some thought)

Then either for 1% of admissions, or when your current guard is triggered, 
compute this statistic for the guard.  For absolute security, for say 0.01% of 
candidates, admit without any check.
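
A rough sketch of option 1, using the 1% rate and 16-entry buffer from above; 
the class and method names are invented for illustration, and until the buffer 
fills, its zero slots simply make the guard more permissive.

{code:java}
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

final class SampledAdmissionGuard {
  private final int[] sample = new int[16];  // frequencies of recently sampled admissions
  private int writes;

  /** Record the frequency of ~1% of admitted items. */
  void onAdmit(int admittedFreq) {
    if (ThreadLocalRandom.current().nextInt(100) == 0) {
      sample[writes++ & (sample.length - 1)] = admittedFreq;  // ring buffer (size is a power of two)
    }
  }

  /** Lower-quartile frequency of the sampled admissions, computed on demand. */
  int guardFrequency() {
    int[] copy = Arrays.copyOf(sample, sample.length);
    Arrays.sort(copy);
    return copy[copy.length / 4];
  }
}
{code}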

That all said, I expect for Cassandra's purposes many of the proposed solutions 
so far will be sufficient, and I certainly wouldn't have any problem with the 
solution you propose.

> Cache implementation using LIRS eviction for in-process page cache
> --
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Branimir Lambov
>Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid 
> having to explicitly mark compaction accesses as non-cacheable, we need a 
> cache implementation that uses an eviction algorithm that can better handle 
> non-recurring accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Ben Manes (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244357#comment-15244357
 ] 

Ben Manes edited comment on CASSANDRA-11452 at 4/16/16 7:46 PM:


CLHM was always a decorator, but in 1.4 it embedded the CHMv8 backport. We did 
that to help improve performance for very large caches, like Cassandra's were, 
since JDK8 took a long time. That's probably what you're remembering.

I agree that reducing per-entry overhead is attractive, though a [rough 
calculation|https://github.com/ben-manes/caffeine/wiki/Memory-overhead] 
indicates it isn't a huge savings. My view is that it is a premature 
optimization and best left to the end after the implementation has matured, to 
re-evaluate if the impact is worth attempting a direct rewrite. Otherwise it 
adds greatly to the complexity budget from the get go and leading to less time 
focused on the unique problems of the domain (API, features, efficiency). For 
example there is more space savings by using TinyLFU over LIRS's ghost entries, 
but evaluating took effort that I might have been to overwhelmed to expend. It 
would also be interesting to see if pairing with [Apache 
Mnemonic|https://github.com/apache/incubator-mnemonic] could reduce the GC 
overhead by having off-heap without the serialization penalty.

bq. Just to clarify those numbers are for small workloads?

Yep.

bq. ...it would still leave the gate open for an attacker to reduce the 
efficacy of the cache for items that have only moderate reuse likelihood.

Since the frequency is reduced by half every sample period, my assumption was 
that this attack would be very difficult. Gil's response was to instead detect 
if TinyLFU had a large number of consecutive rejections, e.g. 80 (assuming 1:20 
is admitted on average). That worked quite well, except on ARC's database trace 
(ds1), where it had a negative impact. It makes sense that scans (db, analytics) 
will have a high rejection rate. What do you think about combining the 
approach, e.g. {{(candidateFreq <= 3) || (++unadmittedItems < 80)}}, as a guard 
prior to performing a 1% random admittance?
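
To make the combination concrete, a sketch of how that guard might wrap the 
check; the class name and the decision to reset {{unadmittedItems}} only on an 
actual admission are assumptions.

{code:java}
import java.util.concurrent.ThreadLocalRandom;

final class CombinedGuardAdmittor {
  private int unadmittedItems;   // consecutive TinyLFU rejections

  boolean admit(int candidateFreq, int victimFreq) {
    if (candidateFreq > victimFreq) {
      unadmittedItems = 0;       // a normal TinyLFU admission resets the streak
      return true;
    }
    if ((candidateFreq <= 3) || (++unadmittedItems < 80)) {
      return false;              // guarded: reject as TinyLFU normally would
    }
    // Long rejection streak and a warm-ish candidate: ~1% random admittance.
    boolean admit = ThreadLocalRandom.current().nextInt(100) == 0;
    if (admit) {
      unadmittedItems = 0;
    }
    return admit;
  }
}
{code}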


was (Author: ben.manes):
CLHM was always a decorator, but in 1.4 it embedded the CHMv8 backport. We did 
that to help improve performance for very large caches, like Cassandra's were, 
since JDK8 took a long time. That's probably what you're remembering.

I agree that reducing per-entry overhead is attractive, though a [rough 
calculation|https://github.com/ben-manes/caffeine/wiki/Memory-overhead] 
indicates it isn't a huge savings. My view is that it is a premature 
optimization and best left to the end after the implementation has matured, to 
re-evaluate if the impact is worth attempting a direct rewrite. Otherwise it 
adds greatly to the complexity budget from the get-go, leading to less time 
focused on the unique problems of the domain (API, features, efficiency). For 
example, there are greater space savings from using TinyLFU over LIRS's ghost 
entries, but evaluating that took effort I might have been too overwhelmed to 
expend. It 
would also be interesting to see if pairing with [Apache 
Mnemonic|https://github.com/apache/incubator-mnemonic] could reduce the GC 
overhead by having off-heap without the serialization penalty.

bq. Just to clarify those numbers are for small workloads?

Yep.

bq. ...it would still leave the gate open for an attacker to reduce the efficacy 
of the cache for items that have only moderate reuse likelihood.

Since the frequency is reduced by half every sample period, my assumption was 
that this attack would be very difficult. Gil's response was to instead detect 
if TinyLFU had a large number of consecutive rejections, e.g. 80 (assuming 1:20 
is admitted on average). That worked quite well, except on ARC's database trace 
(ds1), where it had a negative impact. It makes sense that scans (db, analytics) 
will have a high rejection rate. What do you think about combining the 
approach, e.g. {{(candidateFreq <= 3) || (++unadmittedItems < 80)}}, as a guard 
prior to performing a 1% random admittance?

> Cache implementation using LIRS eviction for in-process page cache
> --
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Branimir Lambov
>Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid 
> having to explicitly mark compaction accesses as non-cacheable, we need a 
> cache implementation that uses an eviction algorithm that can better handle 
> non-recurring accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Ben Manes (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244357#comment-15244357
 ] 

Ben Manes commented on CASSANDRA-11452:
---

CLHM was always a decorator, but in 1.4 it embedded the CHMv8 backport. We did 
that to help improve performance for very large caches, like Cassandra's were, 
since JDK8 took a long time. That's probably what you're remembering.

I agree that reducing per-entry overhead is attractive, though a [rough 
calculation|https://github.com/ben-manes/caffeine/wiki/Memory-overhead] 
indicates it isn't a huge savings. My view is that it is a premature 
optimization and best left to the end after the implementation has matured, to 
re-evaluate if the impact is worth attempting a direct rewrite. Otherwise it 
adds greatly to the complexity budget from the get-go, leading to less time 
focused on the unique problems of the domain (API, features, efficiency). For 
example, there are greater space savings from using TinyLFU over LIRS's ghost 
entries, but evaluating that took effort I might have been too overwhelmed to 
expend. It 
would also be interesting to see if pairing with [Apache 
Mnemonic|https://github.com/apache/incubator-mnemonic] could reduce the GC 
overhead by having off-heap without the serialization penalty.

bq. Just to clarify those numbers are for small workloads?

Yep.

bq. ...it would still leave the gate open for an attacker to reduce the efficacy 
of the cache for items that have only moderate reuse likelihood.

Since the frequency is reduced by half every sample period, my assumption was 
that this attack would be very difficult. Gil's response was to instead detect 
if TinyLFU had a large number of consecutive rejections, e.g. 80 (assuming 1:20 
is admitted on average). That worked quite well, except on ARC's database trace 
(ds1), where it had a negative impact. It makes sense that scans (db, analytics) 
will have a high rejection rate. What do you think about combining the 
approach, e.g. {{(candidateFreq <= 3) || (++unadmittedItems < 80)}}, as a guard 
prior to performing a 1% random admittance?

> Cache implementation using LIRS eviction for in-process page cache
> --
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Branimir Lambov
>Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid 
> having to explicitly mark compaction accesses as non-cacheable, we need a 
> cache implementation that uses an eviction algorithm that can better handle 
> non-recurring accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-11586) Avoid Silent Insert or Update Failure In Clusters With Time Skew

2016-04-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244325#comment-15244325
 ] 

DOAN DuyHai edited comment on CASSANDRA-11586 at 4/16/16 5:53 PM:
--

bq. The client receives no error when the last statement is executed even 
though the request was dropped.

 To be able to detect that the existing cell (in memtable *AND* on disk) has a 
timestamp greater than the timestamp of the current mutation, Cassandra will 
necessarily need to read data before validating the mutation.

 That's what we call *read-before-write* and this is an anti-pattern because it 
just kills the write throughput.

 I think this JIRA is not an issue because synchronizing system time is out of 
Cassandra's scope. That is a system administration responsibility.



was (Author: doanduyhai):
bq. The client receives no error when the last statement is executed even 
though the request was dropped.

 To be able to detect that the existing cell (in memtable **AND** on disk) has 
a timestamp greater than the timestamp of the current mutation, Cassandra will 
necessarily need to read data before validating the mutation.

 That's what we call **read-before-write** and this is an anti-pattern because 
it just kills the write throughput.

 I think this JIRA is not an issue because synchronizing system time is out of 
Cassandra's scope. That is a system administration responsibility.


> Avoid Silent Insert or Update Failure In Clusters With Time Skew
> 
>
> Key: CASSANDRA-11586
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11586
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core, CQL
>Reporter: Mukil Kesavan
>
> It isn't uncommon to have a cluster of Cassandra servers with clock skew 
> ranging from a few milliseconds to seconds or even minutes even with NTP 
> configured on them. We use the coordinator's timestamp for all insert/update 
> requests. Currently, an update to an already existing row with an older 
> timestamp (because the request coordinator's clock is lagging behind) results 
> in a successful response to the client even though the update was dropped. 
> Here's a sample sequence of requests:
> * Consider 3 Cassandra servers with times, T+10, T+5 and T respectively
> * INSERT INTO TABLE1 (id, data) VALUES (1, "one"); is coordinated by server 1 
> with timestamp (T+10)
> * UPDATE TABLE1 SET data='One' where id=1; is coordinated by server 3 with 
> timestamp T
> The client receives no error when the last statement is executed even though 
> the request was dropped.
> It will be really helpful if we could return an error or response to the 
> client indicating that the request was dropped. This gives the client an 
> option to handle this situation gracefully. If this is useful, I can work on 
> a patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11586) Avoid Silent Insert or Update Failure In Clusters With Time Skew

2016-04-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244325#comment-15244325
 ] 

DOAN DuyHai commented on CASSANDRA-11586:
-

bq. The client receives no error when the last statement is executed even 
though the request was dropped.

 To be able to detect that the existing cell (in memtable **AND** on disk) has 
a timestamp greater than the timestamp of the current mutation, Cassandra will 
necessarily need to read data before validating the mutation.

 That's what we call **read-before-write** and this is an anti-pattern because 
it just kills the write throughput.

 I think this JIRA is not an issue because synchronizing system time is out of 
Cassandra's scope. That is a system administration responsibility.


> Avoid Silent Insert or Update Failure In Clusters With Time Skew
> 
>
> Key: CASSANDRA-11586
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11586
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core, CQL
>Reporter: Mukil Kesavan
>
> It isn't uncommon to have a cluster of Cassandra servers with clock skew 
> ranging from a few milliseconds to seconds or even minutes even with NTP 
> configured on them. We use the coordinator's timestamp for all insert/update 
> requests. Currently, an update to an already existing row with an older 
> timestamp (because the request coordinator's clock is lagging behind) results 
> in a successful response to the client even though the update was dropped. 
> Here's a sample sequence of requests:
> * Consider 3 Cassandra servers with times, T+10, T+5 and T respectively
> * INSERT INTO TABLE1 (id, data) VALUES (1, "one"); is coordinated by server 1 
> with timestamp (T+10)
> * UPDATE TABLE1 SET data='One' where id=1; is coordinated by server 3 with 
> timestamp T
> The client receives no error when the last statement is executed even though 
> the request was dropped.
> It will be really helpful if we could return an error or response to the 
> client indicating that the request was dropped. This gives the client an 
> option to handle this situation gracefully. If this is useful, I can work on 
> a patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11588) Unify CollectionType#{getValuesType|getElementsType}

2016-04-16 Thread Alex Petrov (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-11588:

Description: 
Currently, {{ListType}} and {{SetType}} have {{getElementsType}}, while 
{{MapType}} has {{getValuesType}} (along with {{getKeysType}}).

It might make sense to unify these types and make all three types have 
{{getValuesType}}, which would simply mean renaming for {{Set}} and {{List}} 
and pulling the method up to {{CollectionType}}. This would make 
[#7826|https://issues.apache.org/jira/browse/CASSANDRA-7826] a bit simpler.



  was:
Currently, {{ListType}} and {{SetType}} have {{getElementsType}}, while 
{{MapType}} has {{getValuesType}} (along with {{getKeysType}}.

It might make sense to unify these types and make all three types have 
{{getValuesType}}, which would simply mean renaming for {{Set}} and {{List}} 
and pulling the method up to {{CollectionType}}. This would make 
[#7826|https://issues.apache.org/jira/browse/CASSANDRA-7826] a bit simpler.




> Unify CollectionType#{getValuesType|getElementsType}
> 
>
> Key: CASSANDRA-11588
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11588
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Alex Petrov
>Assignee: Alex Petrov
> Fix For: 3.x
>
>
> Currently, {{ListType}} and {{SetType}} have {{getElementsType}}, while 
> {{MapType}} has {{getValuesType}} (along with {{getKeysType}}).
> It might make sense to unify these types and make all three types have 
> {{getValuesType}}, which would simply mean renaming for {{Set}} and {{List}} 
> and pulling the method up to {{CollectionType}}. This would make 
> [#7826|https://issues.apache.org/jira/browse/CASSANDRA-7826] a bit simpler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11588) Unify CollectionType#{getValuesType|getElementsType}

2016-04-16 Thread Alex Petrov (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-11588:

Priority: Major  (was: Minor)

> Unify CollectionType#{getValuesType|getElementsType}
> 
>
> Key: CASSANDRA-11588
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11588
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Alex Petrov
>Assignee: Alex Petrov
> Fix For: 3.x
>
>
> Currently, {{ListType}} and {{SetType}} have {{getElementsType}}, while 
> {{MapType}} has {{getValuesType}} (along with {{getKeysType}}.
> It might make sense to unify these types and make all three types have 
> {{getValuesType}}, which would simply mean renaming for {{Set}} and {{List}} 
> and pulling the method up to {{CollectionType}}. This would make 
> [#7826|https://issues.apache.org/jira/browse/CASSANDRA-7826] a bit simpler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11588) Unify CollectionType#{getValuesType|getElementsType}

2016-04-16 Thread Alex Petrov (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-11588:

Description: 
Currently, {{ListType}} and {{SetType}} have {{getElementsType}}, while 
{{MapType}} has {{getValuesType}} (along with {{getKeysType}}.

It might make sense to unify these types and make all three types have 
{{getValuesType}}, which would simply mean renaming for {{Set}} and {{List}} 
and pulling the method up to {{CollectionType}}. This would make 
[#7826|https://issues.apache.org/jira/browse/CASSANDRA-7826] a bit simpler.



  was:
Currently, {{ListType}} and {{SetType}} have {{getElementsType}}, while 
{{MapType}} has {{getValuesType}} (along with {{getKeysType}}.

It might make sense to unify these types and make all three types have 
{{getValuesType}}, which would simply mean renaming for {{Set}} and {{List}} 
and pulling the method up to {{CollectionType}}. 




> Unify CollectionType#{getValuesType|getElementsType}
> 
>
> Key: CASSANDRA-11588
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11588
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Minor
> Fix For: 3.x
>
>
> Currently, {{ListType}} and {{SetType}} have {{getElementsType}}, while 
> {{MapType}} has {{getValuesType}} (along with {{getKeysType}}.
> It might make sense to unify these types and make all three types have 
> {{getValuesType}}, which would simply mean renaming for {{Set}} and {{List}} 
> and pulling the method up to {{CollectionType}}. This would make 
> [#7826|https://issues.apache.org/jira/browse/CASSANDRA-7826] a bit simpler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-11588) Unify CollectionType#{getValuesType|getElementsType}

2016-04-16 Thread Alex Petrov (JIRA)
Alex Petrov created CASSANDRA-11588:
---

 Summary: Unify CollectionType#{getValuesType|getElementsType}
 Key: CASSANDRA-11588
 URL: https://issues.apache.org/jira/browse/CASSANDRA-11588
 Project: Cassandra
  Issue Type: Improvement
Reporter: Alex Petrov
Assignee: Alex Petrov
Priority: Minor
 Fix For: 3.x


Currently, {{ListType}} and {{SetType}} have {{getElementsType}}, while 
{{MapType}} has {{getValuesType}} (along with {{getKeysType}}.

It might make sense to unify these types and make all three types have 
{{getValuesType}}, which would simply mean renaming for {{Set}} and {{List}} 
and pulling the method up to {{CollectionType}}. 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9555) Don't let offline tools run while cassandra is running

2016-04-16 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244215#comment-15244215
 ] 

Robert Stupp commented on CASSANDRA-9555:
-

Hm, you're right.

> Don't let offline tools run while cassandra is running
> --
>
> Key: CASSANDRA-9555
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9555
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Marcus Eriksson
>Assignee: Robert Stupp
>Priority: Minor
> Fix For: 3.x
>
>
> We should not let offline tools that modify sstables run while Cassandra is 
> running. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9555) Don't let offline tools run while cassandra is running

2016-04-16 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244174#comment-15244174
 ] 

Jeremiah Jordan commented on CASSANDRA-9555:


What are you doing to read the config settings?  You need to implement 
CASSANDRA-9054 to be able to safely use things from DatabaseDescriptor.

> Don't let offline tools run while cassandra is running
> --
>
> Key: CASSANDRA-9555
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9555
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Marcus Eriksson
>Assignee: Robert Stupp
>Priority: Minor
> Fix For: 3.x
>
>
> We should not let offline tools that modify sstables run while Cassandra is 
> running. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244140#comment-15244140
 ] 

Benedict edited comment on CASSANDRA-11452 at 4/16/16 11:19 AM:


bq. The hash table trick isn't applicable since I didn't fork it for Caffeine 
or CLHM. 

Ah; last time I looked (many years ago) your CLHM was a fork I'm sure; I hadn't 
realised you'd reverted to stock.  I'm not personally certain the trade-off is 
so sharp, since the per-item overheads are costlier without forking as there is 
no way to specialise the entries of a CHM, so we pay object overheads multiple 
times.  For small items this is likely costly.  But no huge matter, and of 
course there are significant downsides of either merging changes from upstream 
or managing your own implementation of a concurrent hash map.

bq. For larger workloads (database, search, oltp) its equivalent, as we'd 
expect. (multi1: -4%, multi3: +3%, gli: -2.5%, cs: -2%)

Just to clarify those numbers are for small workloads?

I'd personally say that a small net loss over small caches is not something to 
worry about, and am personally against too many tuning knobs (-which any such 
value would probably need to be-).  However if we really cared we could do one 
of two things:

# Make the admission based on min(victim, guard); i.e., only apply the guard 
logic if the victim does not already permit entry.  Assuming the problem in 
small traces is occasionally hitting a _high_ frequency value in the guard 
where the victim was not (which is the only thing that makes sense - the 
alternative would be just blind unintended random coincidence in the trace, 
which I don't think we can really account for), then this should have the same 
effect as the threshold for triggering the random walk.
# we could gradate the introduction of the behaviour based on cache size.  For 
instance, we could divide the walk distance by the ratio of cache-size bits to 
integer bits, i.e. if the cache contains 64K elements we would divide by 2, but 
if it contains 512 elements we would divide by 4.  This would mean that ~94% 
would check the victim, and ~5% the next, and all but a tiny fraction the 
third.  This would also have the effect of _increasing_ protection as caches 
got much larger and collisions more likely.

edit: On reconsideration I retract my statement about tuning knobs. It seems 
that 5 is probably a _near_ universal lower bound on _disproportionately_ hot 
keys.  Although it would still leave the gate open for an attacker to reduce 
the efficacy of the cache for items that have only moderate reuse likelihood.  
With a large enough cache, sabotage keys could be re-referenced just before they 
become victims (or a fleet of them could be used so that only one is in cache 
at any one time), keeping their frequency suppressed to 4 but permitting the 
rest of the cache to become cold.  I think this would still narrow the surface 
sufficiently for us, but perhaps one of the other approaches will work as well 
without leaving that vector open.
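
A sketch of option 1 above (admission against min(victim, guard)); the 
{{guardFrequency()}} call is an assumed stand-in for whichever guard mechanism 
ends up being used.

{code:java}
final class MinGuardAdmittor {
  boolean admit(int candidateFreq, int victimFreq) {
    if (candidateFreq > victimFreq) {
      return true;                   // the victim alone already permits entry
    }
    // Only consult the guard when the victim rejected, so the overall test is
    // candidateFreq > min(victimFreq, guardFrequency()).
    return candidateFreq > guardFrequency();
  }

  private int guardFrequency() { throw new UnsupportedOperationException("guard hook"); }
}
{code}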


was (Author: benedict):
bq. The hash table trick isn't applicable since I didn't fork it for Caffeine 
or CLHM. 

Ah; last time I looked (many years ago) your CLHM was a fork I'm sure; I hadn't 
realised you'd reverted to stock.  I'm not personally certain the trade-off is 
so sharp, since the per-item overheads are costlier without forking as there is 
no way to specialise the entries of a CHM, so we pay object overheads multiple 
times.  For small items this is likely costly.  But no huge matter, and of 
course there are significant downsides of either merging changes from upstream 
or managing your own implementation of a concurrent hash map.

bq. For larger workloads (database, search, oltp) its equivalent, as we'd 
expect. (multi1: -4%, multi3: +3%, gli: -2.5%, cs: -2%)

Just to clarify those numbers are for small workloads?

I'd personally say that a small net loss over small caches is not something to 
worry about, and am personally against too many tuning knobs (which any such 
value would probably need to be).  However if we really cared we could do one 
of two things:

# Make the admission based on min(victim, guard); i.e., only apply the guard 
logic if the victim does not already permit entry.  Assuming the problem in 
small traces is occasionally hitting a _high_ frequency value in the guard 
where the victim was not (which is the only thing that makes sense - the 
alternative would be just blind unintended random coincidence in the trace, 
which I don't think we can really account for), then this should have the same 
effect as the threshold for triggering the random walk.
# we could gradate the introduction of the behaviour based on cache size.  For 
instance, we could divide the walk distance by the ratio of cache-size bits to 
integer bits, i.e. if the cache contains 64K elements we would divide by 2, but 
if it contains 512 elements we would divide by 4.  This would mean that ~94% 
would check the victim, and ~5% the next, and all but a tiny fraction the 
third.  This would also have the effect of slightly increasing protection as 
caches got much larger and collisions more likely.

[jira] [Commented] (CASSANDRA-9555) Don't let offline tools run while cassandra is running

2016-04-16 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244143#comment-15244143
 ] 

Robert Stupp commented on CASSANDRA-9555:
-

I've put something together for these offline tools:
* sstableloader / BulkLoader
* sstableexpiredblockers / SSTableExpiredBlockers
* sstabledump / SSTableExport
* sstablelevelreset / SSTableLevelResetter
* sstablemetadata / SSTableMetadataViewer
* sstableofflinerelevel / SSTableOfflineRelevel
* sstablerepairedset / SSTableRepairedAtSetter
* sstablescrub / StandaloneScrubber
* sstablesplit / StandaloneSplitter
* sstableutil / StandaloneSSTableUtil
* sstableupgrade / StandaloneUpgrader
* sstableverify / StandaloneVerifier

These tools take a new option {{--run-even-if-cassandra-is-running}}. If a tool 
already had a similar option, I've reused that. If a tool has a corresponding 
nodetool command, it is mentioned in the error message.

The check is performed against the storage port, storage SSL port, RPC port, 
native port and native SSL port, using the configuration in {{cassandra.yaml}}, 
so it works even if some of the ports are temporarily disabled. Detecting 
whether the JMX port is open is a bit difficult, since the JMX configuration is 
not available in the offline tools and including {{cassandra-env.sh}} is 
probably not a good idea.

Most of the tools modify sstables, which will cause errors in Cassandra like 
snapshots no longer working, repair failing, and more.
Other tools just read data, but they are only designed to work correctly as long 
as the set of referenced sstables does not change (the same situation, just the 
other way around), and a running Cassandra daemon will change it.

Such a message looks like this:
{code}
$ bin/sstablescrub system_schema tables
Storage/Gossip port at localhost/127.0.0.1:7000 is accepting connections, 
assuming Cassandra is running.
It is strongly recommended to NOT run sstablescrub while Cassandra is running!

Consider using 'nodetool scrub'.
$
{code}
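
For illustration, a minimal sketch of that kind of port probe; the real patch 
reads the ports from the {{cassandra.yaml}} configuration, and the timeout and 
method names here are assumptions.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class RunningInstanceCheck {
  /** Returns true if something is accepting connections on host:port. */
  static boolean isPortAcceptingConnections(String host, int port) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), 1000);  // 1 second timeout
      return true;
    } catch (IOException e) {
      return false;  // nothing listening (or unreachable): assume Cassandra is not running
    }
  }

  static void abortIfCassandraRunning(String host, int... ports) {
    for (int port : ports) {
      if (isPortAcceptingConnections(host, port)) {
        System.err.printf("Port at %s:%d is accepting connections, assuming Cassandra is running.%n",
                          host, port);
        System.exit(1);
      }
    }
  }
}
{code}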


> Don't let offline tools run while cassandra is running
> --
>
> Key: CASSANDRA-9555
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9555
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Marcus Eriksson
>Assignee: Robert Stupp
>Priority: Minor
> Fix For: 3.x
>
>
> We should not let offline tools that modify sstables run while Cassandra is 
> running. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11452) Cache implementation using LIRS eviction for in-process page cache

2016-04-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244140#comment-15244140
 ] 

Benedict commented on CASSANDRA-11452:
--

bq. The hash table trick isn't applicable since I didn't fork it for Caffeine 
or CLHM. 

Ah; last time I looked (many years ago) your CLHM was a fork I'm sure; I hadn't 
realised you'd reverted to stock.  I'm not personally certain the trade-off is 
so sharp, since the per-item overheads are costlier without forking as there is 
no way to specialise the entries of a CHM, so we pay object overheads multiple 
times.  For small items this is likely costly.  But no huge matter, and of 
course there are significant downsides of either merging changes from upstream 
or managing your own implementation of a concurrent hash map.

bq. For larger workloads (database, search, oltp) its equivalent, as we'd 
expect. (multi1: -4%, multi3: +3%, gli: -2.5%, cs: -2%)

Just to clarify those numbers are for small workloads?

I'd personally say that a small net loss over small caches is not something to 
worry about, and am personally against too many tuning knobs (which any such 
value would probably need to be).  However if we really cared we could do one 
of two things:

# Make the admission based on min(victim, guard); i.e., only apply the guard 
logic if the victim does not already permit entry.  Assuming the problem in 
small traces is occasionally hitting a _high_ frequency value in the guard 
where the victim was not (which is the only thing that makes sense - the 
alternative would be just blind unintended random coincidence in the trace, 
which I don't think we can really account for), then this should have the same 
effect as the threshold for triggering the random walk.
# we could gradate the introduction of the behaviour based on cache size.  For 
instance, we could divide the walk distance by the ratio of cache-size bits to 
integer bits, i.e. if the cache contains 64K elements we would divide by 2, but 
if it contains 512 elements we would divide by 4.  This would mean that ~94% 
would check the victim, and ~5% the next, and all but a tiny fraction the 
third.  This would also have the effect of slightly increasing protection as 
caches got much larger and collisions more likely.



> Cache implementation using LIRS eviction for in-process page cache
> --
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Branimir Lambov
>Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid 
> having to explicitly mark compaction accesses as non-cacheable, we need a 
> cache implementation that uses an eviction algorithm that can better handle 
> non-recurring accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11574) COPY FROM command in cqlsh throws error

2016-04-16 Thread Mahafuzur Rahman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244129#comment-15244129
 ] 

Mahafuzur Rahman commented on CASSANDRA-11574:
--

Looks like the problem is in this line:

https://github.com/apache/cassandra/blob/trunk/pylib/cqlshlib/copyutil.py#L321

Here get_num_processes is being called with a keyword argument. Looks like this 
should be a simple fix, though I'm not very sure. As I'm currently stuck with 
this problem, I wanted to know if changing this file in my installed Cassandra 
would solve the problem, or do I need to do some compilation steps to reload the 
changed Python source? I've installed Cassandra from the package installation 
(using apt-get).

> COPY FROM command in cqlsh throws error
> ---
>
> Key: CASSANDRA-11574
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11574
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
> Environment: Operating System: Ubuntu Server 14.04
> JDK: Oracle JDK 8 update 77
> Python: 2.7.6
>Reporter: Mahafuzur Rahman
> Fix For: 3.0.6
>
>
> Any COPY FROM command in cqlsh is throwing the following error:
> "get_num_processes() takes no keyword arguments"
> Example command: 
> COPY inboxdata 
> (to_user_id,to_user_network,created_time,attachments,from_user_id,from_user_name,from_user_network,id,message,to_user_name,updated_time)
>  FROM 'inbox.csv';
> Similar commands worked perfectly in the previous versions such as 3.0.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format

2016-04-16 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244100#comment-15244100
 ] 

Robert Stupp commented on CASSANDRA-11206:
--

bq. have ColumnIndex but it's been refactored into RowIndexWriter

Yea - it doesn't look the same any more. So I went ahead and moved it into BTW 
since it's the only class from which it's being used. Could move that to 
{{o.a.c.io.sstable.format.big}}, where BTW is.

bq. BTW.addIndexBlock() the indexOffsets\[0\] is always 0

Put some comments in the code for that.

bq. explain in RowIndexEntry.create why you are returning each of the types

Put some comments in the code for that.

bq. don't need indexOffsets once you reach column_index_cache_size_in_kb

It's needed for both cases (shallow and non-shallow RIEs). Put a comment in the 
code for that.

Also ran some cstar tests to compare versions with and without the metrics, 
with {{column_index_cache_size_in_kb}} set to 0kB and 2kB, on taylor and blade_11_b:
[2kB on 
taylor|http://cstar.datastax.com/tests/id/b4c3dd12-033e-11e6-8db8-0256e416528f] 
[2kB on 
blade_11_b|http://cstar.datastax.com/tests/id/a9c828be-033e-11e6-8db8-0256e416528f]
 [0kB on 
taylor|http://cstar.datastax.com/tests/id/621f0886-034b-11e6-8db8-0256e416528f] 
[0kB on 
blade_11_b|http://cstar.datastax.com/tests/id/6f010ad6-034b-11e6-8db8-0256e416528f]

Commits pushed and CI triggered.

> Support large partitions on the 3.0 sstable format
> --
>
> Key: CASSANDRA-11206
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Robert Stupp
> Fix For: 3.x
>
> Attachments: 11206-gc.png, trunk-gc.png
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within 
> each partition of every 64KB (by default) range of rows.  To find a row, we 
> binary search this sample, then scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, 
> we deserialize the entire set of IndexInfo, which both creates a lot of GC 
> overhead (as noted in CASSANDRA-9754) but is also non-negligible i/o activity 
> (relative to reading a single 64KB row range) as partitions get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform 
> the IndexInfo bsearch while only deserializing IndexInfo that we need to 
> compare against, i.e. log(N) deserializations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)