[
https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139735#comment-15139735
]
Michael Kjellman commented on CASSANDRA-9754:
---------------------------------------------
[~jkrupan] ~2GB is the maximum partition size I'd recommend at the moment, based on experience.
The current implementation will create an IndexInfo entry for every 64kb worth of
data (by default - and I highly doubt anyone actually changes this default). Each
IndexInfo object contains the offset into the sstable where the partition/row
starts, the length to read, and the name. These IndexInfo objects are placed
into a list and binary searched over to find the name closest to the query.
Then, we go to that offset in the sstable and start reading the actual data.
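To make that concrete, here's a rough sketch of the shape of that lookup. The class and field names are illustrative only, not the exact Cassandra classes:
{code:java}
// Rough sketch of the current on-heap approach (names are illustrative,
// not the real Cassandra classes/fields).
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;

class IndexInfoSketch implements Comparable<IndexInfoSketch> {
    final ByteBuffer firstName; // clustering name the 64kb block starts at
    final long offset;          // offset into the sstable data file
    final long width;           // number of bytes to read for this block

    IndexInfoSketch(ByteBuffer firstName, long offset, long width) {
        this.firstName = firstName;
        this.offset = offset;
        this.width = width;
    }

    @Override
    public int compareTo(IndexInfoSketch other) {
        return firstName.compareTo(other.firstName);
    }
}

class IndexedReadSketch {
    // Binary search the in-memory list for the block whose first name is the
    // closest one <= the queried name, then seek to that offset in the data file.
    static IndexInfoSketch findBlock(List<IndexInfoSketch> entries, ByteBuffer queriedName) {
        int idx = Collections.binarySearch(entries, new IndexInfoSketch(queriedName, -1, -1));
        if (idx < 0)
            idx = Math.max(0, -idx - 2); // insertion point - 1: last block starting before the name
        return entries.get(idx);
    }
}
{code}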
The issue here that makes things so bad with large partitions is that when doing
an indexed read across a given partition, the entire list of IndexInfo objects is
currently just serialized one after another into the index file on disk. To use
it we have to read the entire thing off disk, deserialize every IndexInfo
object, place it into a list, and then binary search across it. This creates a
ton of small objects very quickly that are likely to be promoted and thus
create a lot of GC pressure.
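Roughly, that read pattern looks like the following (again just a sketch, reusing the IndexInfoSketch type from above rather than the actual deserialization code): every IndexInfo is materialized on heap before the binary search can even start.
{code:java}
import java.io.DataInput;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class IndexEntryDeserializerSketch {
    static List<IndexInfoSketch> readAll(DataInput in, int entryCount) throws IOException {
        List<IndexInfoSketch> entries = new ArrayList<>(entryCount);
        for (int i = 0; i < entryCount; i++) {
            // one small object (plus its name buffer) per 64kb of partition data
            byte[] name = new byte[in.readUnsignedShort()];
            in.readFully(name);
            long offset = in.readLong();
            long width = in.readLong();
            entries.add(new IndexInfoSketch(ByteBuffer.wrap(name), offset, width));
        }
        return entries; // a >300k element list for very large partitions
    }
}
{code}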
If you take the average size of each column you have in a row, you can figure
out how many index entry objects will be created (one for every 64k of your data
in that partition). I've found that once the IndexInfo array contains > 300k
objects, things get bad.
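For a rough sense of scale with the default 64kb block size: a 6.4GB partition works out to about 6.4GB / 64kb ≈ 100k IndexInfo objects, so the > 300k range where things start to fall over corresponds to a partition on the order of 19-20GB.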
The implementation I'm *almost* done with has the same big O complexity
(O(log(n))) as the current implementation, but instead the index is backed by
page cache aligned mmap'ed segments (B+ tree-ish, with an overflow page
implementation similar to that of SQLite). This means we can now walk the
IndexEntry objects and only bring onto the heap the 4k chunks that are involved
in the binary search for the correct entry itself.
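To illustrate the core idea only (this is not the B+ tree / overflow page implementation from the patch, just a minimal sketch of binary searching fixed-size, page-aligned entries through an mmap'ed region, with an invented entry layout):
{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class MmapIndexSketch {
    static final int ENTRY_SIZE = 32; // hypothetical fixed entry size: key prefix + offset + width + padding

    // Binary search directly against the mmap'ed index; only the ~4k pages the
    // probes actually touch get faulted in, nothing is copied into a big on-heap list.
    static long findOffset(String indexPath, long entryCount, long searchKey) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(indexPath, "r");
             FileChannel channel = raf.getChannel()) {
            // single-segment sketch; a real index would map multiple segments
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, entryCount * ENTRY_SIZE);

            long lo = 0, hi = entryCount - 1, result = 0;
            while (lo <= hi) {
                long mid = (lo + hi) >>> 1;
                int base = (int) (mid * ENTRY_SIZE);  // each probe reads from at most one page
                long key = map.getLong(base);         // hypothetical fixed-width key prefix
                if (key <= searchKey) {
                    result = map.getLong(base + 8);   // offset into the sstable data file
                    lo = mid + 1;
                } else {
                    hi = mid - 1;
                }
            }
            return result;
        }
    }
}
{code}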
The tree itself is finished and heavily tested. I've also already abstracted
out the index implementation in Cassandra so that the current implementation
and the new one I'll be proposing and contributing here can be dropped in
easily, without special-casing the code all over the place to check the SSTable
descriptor for what index implementation was used. All the unit tests and
d-tests pass after my abstraction work. The final thing I'm almost done with is
refactoring my Page Cache Aligned/Aware File Writer to be SegmentedFile aware
(and making sure all the math works when the offset into the actual file
differs depending on the segment, etc).
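For illustration only, that abstraction could look something like the sketch below; the interface name, the factory, and the descriptor flag are invented here, the actual patch defines its own interfaces:
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;

interface PartitionIndexReader {
    // data-file offset of the index block covering the given clustering name
    long blockOffsetFor(ByteBuffer clusteringName) throws IOException;
}

final class IndexReaderFactory {
    // 'usesMmapIndex' stands in for whatever flag/version the sstable descriptor records
    static PartitionIndexReader open(boolean usesMmapIndex, String indexPath) throws IOException {
        return usesMmapIndex
             ? new MmapBackedIndexReader(indexPath)   // new page-cache-aligned index
             : new OnHeapIndexReader(indexPath);      // legacy serialized-list index
    }
}

// Minimal stubs standing in for the two implementations discussed above.
class MmapBackedIndexReader implements PartitionIndexReader {
    MmapBackedIndexReader(String path) {}
    public long blockOffsetFor(ByteBuffer name) { return 0L; }
}

class OnHeapIndexReader implements PartitionIndexReader {
    OnHeapIndexReader(String path) {}
    public long blockOffsetFor(ByteBuffer name) { return 0L; }
}
{code}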
> Make index info heap friendly for large CQL partitions
> ------------------------------------------------------
>
> Key: CASSANDRA-9754
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9754
> Project: Cassandra
> Issue Type: Improvement
> Reporter: sankalp kohli
> Assignee: Michael Kjellman
> Priority: Minor
>
> Looking at a heap dump of a 2.0 cluster, I found that the majority of the objects
> are IndexInfo and its ByteBuffers. This is especially bad in endpoints with
> large CQL partitions. If a CQL partition is, say, 6.4GB, it will have 100K
> IndexInfo objects and 200K ByteBuffers. This will create a lot of churn for
> GC. Can this be improved by not creating so many objects?