[ 
https://issues.apache.org/jira/browse/CASSANDRA-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341781#comment-14341781
 ] 

Benedict commented on CASSANDRA-8860:
-------------------------------------

bq. If createOverlapChain(a, overlapMap) return {a,b,c}, createOverlapChain(b, 
overlapMap) must also return {a,b,c} because "overlapping" is a bidirectional 
relationship. So if we have already get {a,b,c} by SStableReader a, we needn't 
check b and c at all.

This is unfortunately not true. _a_ may overlap _b_ and _c_, but _b_ may only 
overlap _a_, and not _c_, for instance. I knocked together a patch on the plane 
with an average time complexity closer to O(N) but with worst case still 
O(N^2), and achieved by filtering out sets of readers that are wholly contained 
in other sets of overlapping readers. This may not be desirable, though, as it 
may be that by considering a smaller group of readers we see a larger yield 
from rewriting.

I think the best (medium term) solution to this complexity is to take ownership 
of the hyperloglog implementation and to build a structure that can yield 
overlaps efficiently from a multitude of hyperloglogs. If each hyperloglog 
bucket maintained an ordered map from bucket cardinality to the set of objects 
with that cardinality, an efficient search could be performed for overlap 
candidates. Within each bucket we could find the largest overlapping 
cardinalities, then take the intersection of these across all buckets, leaving 
the top-ranked partition overlaps, and applying correction for the amount of 
clustering column overlap between the candidates. We could then post-process to 
include any files that are almost wholly overlapping with this set of best 
overlaps. This is just a quick off-the-cuff sketch of a design, and no doubt 
there are better and more developed approaches along these lines.

> Too many java.util.HashMap$Entry objects in heap
> ------------------------------------------------
>
>                 Key: CASSANDRA-8860
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8860
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 2.1.3, jdk 1.7u51
>            Reporter: Phil Yang
>            Assignee: Phil Yang
>             Fix For: 2.1.4
>
>         Attachments: 8860-v2.txt, 8860.txt, cassandra-env.sh, cassandra.yaml, 
> jmap.txt, jstack.txt, jstat-afterv1.txt, jstat-afterv2.txt, jstat-before.txt
>
>
> While I upgrading my cluster to 2.1.3, I find some nodes (not all) may have 
> GC issue after the node restarting successfully. Old gen grows very fast and 
> most of the space can not be recycled after setting its status to normal 
> immediately. The qps of both reading and writing are very low and there is no 
> heavy compaction.
> Jmap result seems strange that there are too many java.util.HashMap$Entry 
> objects in heap, where in my experience the "[B" is usually the No1.
> If I downgrade it to 2.1.1, this issue will not appear.
> I uploaded conf files and jstack/jmap outputs. I'll upload heap dump if 
> someone need it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to