[
https://issues.apache.org/jira/browse/CASSANDRA-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255723#comment-15255723
]
Ben Manes commented on CASSANDRA-11452:
---------------------------------------
Chatted with Roy, and below is his analysis. I tried simulating his gratuitous
access idea, which had a positive impact on multi3, a negative impact on DS1
(database), and a negligible impact on all other traces. It seems he dislikes
the probability approach in the code above, while I dislike the additional
space required by (6).
Longer term, I'd like to devise a way to adaptively size the admission window.
I think that, if done right, the strategy could help alleviate or resolve the
attack. My current idea is to use a bloom filter that is reset after a small
sample period and captures the candidates that were rejected by TinyLFU. If
multiple rejections for a key are observed within a sample, then the candidate
is admitted and the window size is increased. If no repeat rejections are
observed within a larger sample (e.g. 3x), then the window is decreased. This
would follow some ARC-like equation with a min/max for the window. Admitting
the candidate would resolve this attack, which would then only result in making
the window oversized for that workload.
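A rough sketch of that idea, just to make it concrete. The BitSet stand-in for
the bloom filter, the step sizes, and the sample bookkeeping are all
placeholders here, not anything from Caffeine:
{code:java}
import java.util.BitSet;

final class AdaptiveWindow {
  private final BitSet rejections = new BitSet(1 << 16); // stand-in for a real bloom filter
  private final int minWindow, maxWindow, sampleSize;
  private int windowSize;
  private int sampleCount;
  private boolean sawRepeatRejection;

  AdaptiveWindow(int initial, int min, int max, int sampleSize) {
    this.windowSize = initial;
    this.minWindow = min;
    this.maxWindow = max;
    this.sampleSize = sampleSize;
  }

  /** Called when TinyLFU rejects a candidate; returns true to admit it anyway. */
  boolean onReject(Object candidate) {
    int slot = (candidate.hashCode() & 0x7fffffff) % (1 << 16);
    boolean repeated = rejections.get(slot);
    rejections.set(slot);
    if (++sampleCount >= sampleSize) { // reset the filter each sample period
      rejections.clear();
      sampleCount = 0;
    }
    if (repeated) { // rejected twice within one sample: admit and grow the window
      sawRepeatRejection = true;
      windowSize = Math.min(maxWindow, windowSize + 1);
      return true;
    }
    return false;
  }

  /** Called at the end of a larger sample (e.g. 3x); shrinks the window if no repeats were seen. */
  void onLargeSampleEnd() {
    if (!sawRepeatRejection) {
      windowSize = Math.max(minWindow, windowSize - 1);
    }
    sawRepeatRejection = false;
  }
}
{code}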
----
Here is my summary.
Assumptions:
1) An oblivious adversary that cannot see which items are in the cache and does
not control which items get into the cache; its only way of impacting what's in
the cache is by issuing requests for items itself.
2) Caffeine uses the hash code of an object as its identifier for the frequency
histogram management, so if two items have the same hash code, they will access
the same locations in the CMS/CBF, since the respective hash functions are
computed on the objects' hash codes.
Attack: The attacker can create two items O1 and O2 whose hash codes are the
same. It issues frequent requests for both objects until both are admitted to
the cache. Once this happens, the attacker stops issuing requests for O2 but
continues generating enough requests for O1 that it remains one of the most
frequent items. As O2 is no longer being requested, it is demoted to the bottom
of the LRU/SLRU eviction mechanism, so from some point on it is always the
chosen cache victim. However, due to the hash code collision and the high
frequency count of O1, TinyLFU will never allow O2 to be evicted, and at that
point the cache is completely nullified.
Notice that the attacker needs to construct two objects whose hash codes
collide, or at least whose CMS hash values all collide. In the other direction,
it is not enough to find two values whose CMS hash values collide – one should
create objects whose hash codes match these.
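To illustrate how cheap constructing such a collision is in Java (String is
just the simplest case; any class with a weak hashCode works the same way):
{code:java}
// String.hashCode() is the polynomial h = 31*h + c, so "Aa" and "BB"
// collide (both hash to 2112), and the collision composes: concatenating
// colliding blocks yields arbitrarily many distinct colliding keys.
String o1 = "Aa";
String o2 = "BB";
assert o1.hashCode() == o2.hashCode(); // 2112 == 2112
assert "AaAa".hashCode() == "BBBB".hashCode();
{code}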
Possible solutions:
1. A domain-specific solution: Use another identifier; e.g., in a YouTube
cache, one can use the video identifier, which Google ensures is unique.
2. Java enhancement solution: One of the main problems with Java hash codes is
that they are only 32 bits long. It may be possible to enhance them to 64 bits,
but that is not realistic to expect.
3. Jitter: Instead of always replacing the least recently used item, pick one
of the lowest k. The problem with this solution is that, in order not to hurt
performance, k should be kept small. However, in this case, the attacker can
nullify 1/k of the cache with no additional effort, or the entire cache with k
times the effort.
4. Random eviction: As the name suggests, use random eviction instead of LRU.
The problem is that in some workloads, this would hurt performance.
5. Circular eviction: Each time, evict the next position in the LRU queue
rather than the last one. This has the same impact as random eviction but is
faster to implement, and the same performance concern applies.
6. Ignore the filter on cache collisions: If there is a hash-code collision on
the cache victim, we ignore TinyLFU. The impact on performance is probably
marginal, since the chance of an unintentional collision is small, and it would
definitely not show up in the traces. But how do we detect the collision? Do we
maintain a non-counting TinyTable that stores the entire hash codes? That adds
space. On the other hand, the cache probably maintains a map of all items in
any case – so we can build on this, in which case it simply costs another
lookup. I would not worry about false positives here, since in a naïve
(non-adversarial) trace the chances are very small, and the chances that both
items are heavy hitters is negligible, so in the worst case we will suffer one
extra miss over the entire trace.
7. Count limit: If TinyLFU rejects the replacement for k consecutive attempts,
ignore TinyLFU. But what should k be? This also incurs a small performance
impact.
8. Probabilistic bypass: Ignore TinyLFU with some small probability. Same
concerns as above; also, the attacker could repeat the trick each time his item
gets evicted, hurting performance.
9. Gratuitous access: If the cache victim wins too many times in the LFU filter
test, we make a gratuitous access to this item. This way, it will take some
time until it reaches the bottom of the LRU stack again. The important thing is
that we do not evict it, so we do not pollute the cache with tail items unless
the next item in line is also a tail item. It also requires only a single
counter, so the space and time overheads are minimal.
Personally, I do not like any solution that hurts performance in the normal
case just to solve this hypothetical attack.
Therefore, I vote for the "Ignore the filter on cache collisions" option, and
even this, if possible, should be a parameter and not the default. I also like
the fact that it tackles the problem at its source rather than with an indirect
hack. Maybe, even here, perform the check only after the victim has survived
eviction multiple times, to reduce the performance impact even further.
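A minimal sketch of what option (6) could look like, assuming a small secondary
multiset for detecting resident collisions; the names, the FrequencySketch
interface, and the tracker are all illustrative, not Caffeine's actual
structures:
{code:java}
import java.util.HashMap;
import java.util.Map;

interface FrequencySketch {
  int frequency(Object key); // estimated access count from the CMS/CBF
}

final class CollisionAwareAdmittor {
  // hashCode -> number of resident keys sharing it
  private final Map<Integer, Integer> residents = new HashMap<>();
  private final FrequencySketch sketch;

  CollisionAwareAdmittor(FrequencySketch sketch) {
    this.sketch = sketch;
  }

  void onInsert(Object key) {
    residents.merge(key.hashCode(), 1, Integer::sum);
  }

  void onRemove(Object key) {
    residents.merge(key.hashCode(), -1, (a, b) -> a + b == 0 ? null : a + b);
  }

  /** Returns true if the candidate should replace the victim. */
  boolean admit(Object candidate, Object victim) {
    // If another resident entry shares the victim's hash code, the sketch
    // cannot tell the two apart, so its frequency estimate for the victim
    // is untrustworthy; fall back to plain LRU and evict the victim.
    if (residents.getOrDefault(victim.hashCode(), 0) > 1) {
      return true;
    }
    return sketch.frequency(candidate) > sketch.frequency(victim);
  }
}
{code}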
I also find the gratuitous access idea interesting, but we need to check its
impact on performance.
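The single counter for option (9) could look roughly like this; the threshold
and names are placeholders:
{code:java}
final class GratuitousAccessPolicy {
  private static final int MAX_WINS = 8; // arbitrary placeholder threshold
  private Object lastVictim;
  private int wins; // the single counter: consecutive filter wins by one victim

  /** Called when TinyLFU rejects the candidate in favor of the victim. */
  boolean shouldTouchVictim(Object victim) {
    if (victim.equals(lastVictim)) {
      if (++wins >= MAX_WINS) {
        wins = 0;
        return true; // caller makes a gratuitous access, moving the victim up the LRU stack
      }
    } else {
      lastVictim = victim;
      wins = 1;
    }
    return false;
  }
}
{code}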
Of course, in applications of TinyLFU in which item IDs are unique (e.g.,
YouTube), use this ID and not a hash code, and there is no problem.
Cheers,
Roy
> Cache implementation using LIRS eviction for in-process page cache
> ------------------------------------------------------------------
>
> Key: CASSANDRA-11452
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11452
> Project: Cassandra
> Issue Type: Improvement
> Components: Local Write-Read Paths
> Reporter: Branimir Lambov
> Assignee: Branimir Lambov
>
> Following up from CASSANDRA-5863, to make best use of caching and to avoid
> having to explicitly mark compaction accesses as non-cacheable, we need a
> cache implementation that uses an eviction algorithm that can better handle
> non-recurring accesses.