[ https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193082#comment-15193082 ]
Stefania commented on CASSANDRA-11206:
--------------------------------------
bq. IndexInfo is also used from
{{UnfilteredRowIteratorWithLowerBound#getPartitionIndexLowerBound}}
(CASSANDRA-8180) - not sure whether it's worth to deserialize the index for
this functionality, *as it is currently restricted to the entries that are
present in the key cache*. I tend to remove this access.
If I am not mistaken, when the sstable iterator is created, the partition should
be added to the key cache if not already present. Please have a look at
BigTableReader {{iterator()}} and {{getPosition()}} to confirm. The reason we
need the index info is that the lower bounds in the sstable metadata do not
work for tombstones. The index info is the only lower bound we have for
tombstones. If it's removed, then the optimization of CASSANDRA-8180 no longer
works in the presence of tombstones (whether this is acceptable is up for
discussion).
Can't we add the partition bounds to the offset map?
For completeness, I should add that we don't necessarily need a lower bound for
the partition; a lower bound for the entire sstable would do if that is easier.
However, it should work for tombstones, that is, it should be an instance of
{{ClusteringPrefix}} rather than an array of {{ByteBuffer}} as it is currently
stored in the sstable metadata.
> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>
> Key: CASSANDRA-11206
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jonathan Ellis
> Assignee: Robert Stupp
> Fix For: 3.x
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within
> each partition of every 64KB (by default) range of rows. To find a row, we
> binary search this sample, then scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss,
> we deserialize the entire set of IndexInfo, which both creates a lot of GC
> overhead (as noted in CASSANDRA-9754) but is also non-negligible i/o activity
> (relative to reading a single 64KB row range) as partitions get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform
> the IndexInfo bsearch while only deserializing IndexInfo that we need to
> compare against, i.e. log(N) deserializations.
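
To illustrate the offset-map idea from the description above, here is a minimal sketch of a binary search that deserializes only the entries it compares against. It is not Cassandra's actual code: the class {{LazyIndexSearch}}, the simplified {{Entry}} type (a last clustering name stored as a string plus a block offset) and the serialization layout are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified model of an index with an offset map.
// Entries are kept serialized; offsets[i] says where entry i starts.
final class LazyIndexSearch {
    // Simplified stand-in for an IndexInfo entry (assumption, not the real class).
    static final class Entry {
        final String lastName;   // last clustering name covered by the block
        final long blockOffset;  // offset of the block within the partition
        Entry(String lastName, long blockOffset) {
            this.lastName = lastName;
            this.blockOffset = blockOffset;
        }
    }

    private final ByteBuffer data;  // all entries, serialized back to back
    private final int[] offsets;    // the "offset map": start of each entry in data
    int deserializations = 0;       // counts how many entries were materialized

    // lastNames must be sorted ascending; both arrays must have the same length.
    LazyIndexSearch(String[] lastNames, long[] blockOffsets) {
        int size = 0;
        byte[][] encoded = new byte[lastNames.length][];
        for (int i = 0; i < lastNames.length; i++) {
            encoded[i] = lastNames[i].getBytes(StandardCharsets.UTF_8);
            size += 4 + encoded[i].length + 8; // length prefix + name + offset
        }
        data = ByteBuffer.allocate(size);
        offsets = new int[lastNames.length];
        for (int i = 0; i < lastNames.length; i++) {
            offsets[i] = data.position();
            data.putInt(encoded[i].length).put(encoded[i]).putLong(blockOffsets[i]);
        }
    }

    // Deserialize exactly one entry, located through the offset map.
    private Entry read(int i) {
        deserializations++;
        ByteBuffer b = data.duplicate();
        b.position(offsets[i]);
        byte[] name = new byte[b.getInt()];
        b.get(name);
        return new Entry(new String(name, StandardCharsets.UTF_8), b.getLong());
    }

    // Index of the first block whose lastName >= target, or -1 if there is none.
    int findBlock(String target) {
        int lo = 0, hi = offsets.length - 1, result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (read(mid).lastName.compareTo(target) >= 0) {
                result = mid;   // candidate; keep looking to the left
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        LazyIndexSearch idx = new LazyIndexSearch(
            new String[] {"c", "f", "k", "p"},
            new long[] {0L, 65536L, 131072L, 196608L});
        System.out.println("block=" + idx.findBlock("g")
            + " deserialized=" + idx.deserializations);
    }
}
```

In this sketch the number of entries materialized per lookup grows with log(N) rather than N, which is the property the description attributes to the offset map; the real on-disk format and comparator are considerably more involved.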