[https://issues.apache.org/jira/browse/CASSANDRA-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877181#comment-17877181]

Dmitry Konstantinov edited comment on CASSANDRA-17298 at 8/27/24 9:02 PM:
--------------------------------------------------------------------------

Hi, I have run into this issue (a large discrepancy between the actual and the expected heap size) in this test several times while committing other changes, so I decided to volunteer :) and spent some time analyzing it in more detail.
Please find the results of the analysis below.

There are several aspects which contribute to the issue:
1) The actual memtable size is measured using ObjectSizes.measureDeep(memtable). A memtable object holds several references that reach beyond the Memtable scope:
{code:java}
(org.apache.cassandra.db.Memtable) memtable
  (org.apache.cassandra.db.ColumnFamilyStore) cfs
    (org.apache.cassandra.db.Keyspace) keyspace <-------- has references to all tables in the same keyspace
  (org.apache.cassandra.utils.memory.SlabAllocator) allocator
    (org.apache.cassandra.utils.memory.MemtableAllocator$SubAllocator) onHeap
    ---- the boundary between memtable-specific allocator objects and global memory pool objects ----
      (org.apache.cassandra.utils.memory.MemtablePool$SubPool) parent
        (org.apache.cassandra.utils.memory.SlabPool) this$0
          (org.apache.cassandra.utils.memory.MemtableCleanerThread) cleaner
            (java.lang.Thread) thread
              (java.lang.ThreadGroup) group
                (java.lang.Thread[]) threads // references to almost all Cassandra threads, then to thread-locals, etc.
                (java.lang.ThreadGroup) parent
{code}
So, when we measure the heap size of a memtable object, through the Memtable.cfs and SubAllocator.parent references we also count unrelated and dynamically changing amounts of memory. When we run a test several times to fight flakiness without cleanup, the impact grows even further: we implicitly observe memory allocations left over from previous runs.
To avoid this, we can ask the jamm Meter to ignore the corresponding objects by putting the @Unmetered annotation on the related fields.
It also makes sense to drop the table after the test to clean up memory.
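For illustration, a minimal self-contained sketch of the effect (the classes and field names here are hypothetical; jamm's MemoryMeter and @Unmetered are the real API, and FALLBACK_UNSAFE guessing avoids the need for the instrumentation agent):
{code:java}
import org.github.jamm.MemoryMeter;
import org.github.jamm.Unmetered;

public class UnmeteredDemo
{
    static class GlobalPool { final long[] big = new long[1024]; } // ~8KiB of "unrelated" memory

    static class Metered {            final GlobalPool pool = new GlobalPool(); }
    static class Skipped { @Unmetered final GlobalPool pool = new GlobalPool(); }

    public static void main(String[] args)
    {
        MemoryMeter meter = new MemoryMeter().withGuessing(MemoryMeter.Guess.FALLBACK_UNSAFE);
        // The first value includes the array reachable via 'pool';
        // the second skips it because the field is annotated with @Unmetered.
        System.out.println(meter.measureDeep(new Metered()));
        System.out.println(meter.measureDeep(new Skipped()));
    }
}
{code}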

2) The heap_buffers memtable allocation type uses 1MiB shared byte[] slabs to store partition keys, clustering keys and cell values.
The free part of such a slab is not counted as owned by the MemtableAllocator logic, but it is measured by ObjectSizes.measureDeep(memtable) because a reference path exists from Memtable via the allocator to the slab region.
There are two possible ways to deal with it:
a) do not measure the Memtable.allocator object at all
b) add a slab size to the allowed difference in the assertion comparing the MemtableAllocator-reported value with the ObjectSizes.measureDeep(memtable) value
My own preference is a) because:
 - MemtableAllocator does not consider the slab free space as used, and this test actually checks the correctness of the MemtableAllocator tracking
 - allowing such a 1MiB difference decreases the accuracy of the assertion and may hide other issues
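To make the slab effect concrete, here is a simplified, hypothetical slab allocator (not Cassandra's SlabAllocator) showing why the two numbers diverge by up to a slab:
{code:java}
import java.nio.ByteBuffer;

// Simplified sketch: the allocator accounts only for the bytes handed out,
// while a deep measurement of the allocator sees the entire 1MiB backing array.
public class SlabSketch
{
    private final byte[] slab = new byte[1 << 20]; // 1MiB shared slab
    private int offset = 0;                        // bytes handed out so far

    public ByteBuffer allocate(int size)
    {
        ByteBuffer region = ByteBuffer.wrap(slab, offset, size).slice();
        offset += size;
        return region;
    }

    // what MemtableAllocator-style tracking would report as "owned"
    public long ownedBytes() { return offset; }
    // ObjectSizes.measureDeep(this) would report ~1MiB regardless of 'offset',
    // so the free tail of the slab inflates the measured value.
}
{code}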

NOTE:
4.0.x ObjectSizes uses jamm-0.3.2 with the omitSharedBufferOverhead option --> the shared byte[] slab is NOT counted by the Meter in ObjectSizes.
4.1.x ObjectSizes uses jamm-0.3.2 WITHOUT the omitSharedBufferOverhead option --> the shared byte[] slab IS counted by the Meter in ObjectSizes.

The related change for ObjectSizes in the 4.1 branch ([https://github.com/apache/cassandra/commits/cassandra-4.1/src/java/org/apache/cassandra/utils/ObjectSizes.java]):
[https://github.com/apache/cassandra/commit/0d8126dd143898588f4efcdc40b8e2bb10597185#diff-1122d7d3efe9721af7244d373e66378f7e90cb05fd65859a52e8a3ea58a7c8f9]
omitSharedBufferOverhead() was removed.

So, to avoid measuring the whole slab array on the 4.1.x branch, a separate Meter needs to be configured for MemtableSizeTest.
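A sketch of such a dedicated Meter against the jamm 0.3.2 API (mirroring the debug Meter shown in the next section; 'memtable' stands for the memtable under test):
{code:java}
// Unlike ObjectSizes on 4.1.x, this Meter keeps the omitSharedBufferOverhead
// option, so only the used part of a shared slab is counted:
MemoryMeter testMeter = new MemoryMeter().omitSharedBufferOverhead()
                                         .withGuessing(MemoryMeter.Guess.FALLBACK_UNSAFE)
                                         .ignoreKnownSingletons();
long measuredHeap = testMeter.measureDeep(memtable);
{code}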

3) The most interesting part: the precision of the memtable heap tracking. I applied the following approach to find the missing parts.
To see the actual memory structure, I dumped a few partitions for each write type (insert / delete partition / delete row) as a tree, using a MemoryMeter with debug enabled:
{code:java}
MemoryMeter meter = new MemoryMeter().omitSharedBufferOverhead()
                                     .withGuessing(MemoryMeter.Guess.FALLBACK_UNSAFE)
                                     .ignoreKnownSingletons()
                                     .enableDebug();
{code}
While it does not print the aggregated memory usage accurately (the printout does not reflect memory used from the shared buffers), it lets us see the object graph and the sizes of the objects themselves:
{code:java}
root [org.apache.cassandra.db.partitions.AtomicBTreePartition] 632 bytes (32 bytes)
  |
  +--ref [org.apache.cassandra.db.partitions.AbstractBTreePartition$Holder] 504 bytes (32 bytes)
  |  |
  |  +--columns [org.apache.cassandra.db.RegularAndStaticColumns] 48 bytes (24 bytes)
  |  |  |
  |  |  +--statics [org.apache.cassandra.db.Columns] 24 bytes (24 bytes)
  |  |
  |  +--deletionInfo [org.apache.cassandra.db.MutableDeletionInfo] 24 bytes (24 bytes)
  |  |
  |  +--tree [java.lang.Object[]] 200 bytes (24 bytes)
  |  |  |
  |  |  +--0 [org.apache.cassandra.db.rows.BTreeRow] 176 bytes (32 bytes)
  |  |    |
  |  |    +--clustering [org.apache.cassandra.db.BufferClustering] 96 bytes (24 bytes)
  |  |    |  |
  |  |    |  +--values [java.nio.ByteBuffer[]] 72 bytes (24 bytes)
  |  |    |    |
  |  |    |    +--0 [java.nio.HeapByteBuffer] 48 bytes (48 bytes)
  |  |    |
  |  |    +--deletion [org.apache.cassandra.db.rows.Row$Deletion] 48 bytes (24 bytes)
  |  |      |
  |  |      +--time [org.apache.cassandra.db.DeletionTime] 24 bytes (24 bytes)
  |  |
  |  +--staticRow [org.apache.cassandra.db.rows.BTreeRow] 168 bytes (32 bytes)
  |  |  |
  |  |  +--clustering [org.apache.cassandra.db.Clustering$1] 40 bytes (24 bytes)
  |  |  |  |
  |  |  |  +--values [java.nio.ByteBuffer[]] 16 bytes (16 bytes)
  |  |  |
  |  |  +--primaryKeyLivenessInfo [org.apache.cassandra.db.LivenessInfo] 24 bytes (24 bytes)
  |  |  |
  |  |  +--deletion [org.apache.cassandra.db.rows.Row$Deletion] 48 bytes (24 bytes)
  |  |  |  |
  |  |  |  +--time [org.apache.cassandra.db.DeletionTime] 24 bytes (24 bytes)
  |  |  |
  |  |  +--btree [java.lang.Object[]] 24 bytes (24 bytes)
  |  |
  |  +--stats [org.apache.cassandra.db.rows.EncodingStats] 32 bytes (32 bytes)
  |
  +--partitionKey [org.apache.cassandra.db.BufferDecoratedKey] 96 bytes (24 bytes)
    |
    +--key [java.nio.HeapByteBuffer] 48 bytes (48 bytes)
    |
    +--token [org.apache.cassandra.dht.Murmur3Partitioner$LongToken] 24 bytes (24 bytes)
{code}
Based on this input and source-code analysis, we can build a visual representation of the partitions and manually count the amount of memory used.

[^structure_example.svg]

As a second step, I temporarily added tracing to the allocator tracking logic, to see which values are counted by the MemtableAllocator during a put operation, and where, like this:
{code:java}
// inside class org.apache.cassandra.utils.memory.MemtableAllocator.SubAllocator:

public static final ThreadLocal<Boolean> tracing = ThreadLocal.withInitial(() -> Boolean.FALSE);
....

private void allocated(long size)
{
    parent.allocated(size);
    reportChange(size); // <==== added
    ownsUpdater.addAndGet(this, size);

    if (state == LifeCycle.DISCARDING)
    {
        if (logger.isTraceEnabled())
            logger.trace("Allocated {} bytes whilst discarding", size);
        updateReclaiming();
    }
}

public void reportChange(long delta)
{
    if (tracing.get())
    {
        System.out.println("ALLOCATOR: " + delta);
        new Exception().printStackTrace(System.out);
    }
}
{code}
"tracing" thread-local is used to print such information only for an analyzed 
memtable
Similar tracing was added temporary to 
org.apache.cassandra.db.partitions.AtomicBTreePartition.RowUpdater as well.
The output looks like:
{code:java}
ALLOCATOR: 8
java.lang.Exception
   at org.apache.cassandra.utils.memory.MemtableAllocator$SubAllocator.reportChange(MemtableAllocator.java:226)
   at org.apache.cassandra.utils.memory.MemtableAllocator$SubAllocator.acquired(MemtableAllocator.java:238)
   at org.apache.cassandra.utils.memory.MemtableAllocator$SubAllocator.allocate(MemtableAllocator.java:179)
   at org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:89)
   at org.apache.cassandra.utils.memory.MemtableBufferAllocator$1.allocate(MemtableBufferAllocator.java:40)
   at org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:77)
   at org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:63)
   at org.apache.cassandra.db.Clustering.clone(Clustering.java:53)
   at org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:52)
   at org.apache.cassandra.db.rows.BTreeRow.clone(BTreeRow.java:505)
   at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.insert(AtomicBTreePartition.java:376)
   at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.insert(AtomicBTreePartition.java:352)
   at org.apache.cassandra.utils.btree.BTree.transformLeaf(BTree.java:1308)
   at org.apache.cassandra.utils.btree.BTree.transform(BTree.java:1255)
   at org.apache.cassandra.utils.btree.BTree.update(BTree.java:353)
   at org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDeltaInternal(AtomicBTreePartition.java:147)
   at org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:191)
   at org.apache.cassandra.db.Memtable.put(Memtable.java:327)
{code}
As a result, for a single put operation we can observe how the memory tracking works, with stack traces pointing to the places where it happens.
I used this information to put marks on the object graph picture created in the previous step.
This makes the discrepancies visible, i.e. which objects are not counted correctly.

[^analyzed_objects.svg]

So, the following discrepancies were found:
1) We do not count the ByteBuffer in BufferDecoratedKey.key. It is supposed to be covered by the Memtable.estimateRowOverhead logic, but estimateRowOverhead puts ByteBufferUtil.EMPTY_BYTE_BUFFER (size = 0) into this field. Cloning the BufferDecoratedKey does not help: we still get the same object back from org.apache.cassandra.utils.memory.ByteBufferCloner#clone(V, org.apache.cassandra.db.marshal.ValueAccessor<V>).
So, when we measure the ConcurrentSkipListMap deep size in estimateRowOverhead, the same ByteBufferUtil.EMPTY_BYTE_BUFFER object is used everywhere and is counted only once. To avoid this, we should allocate a buffer of non-zero size in Memtable.estimateRowOverhead.
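A minimal sketch of the counted-once effect (a hypothetical stand-in map, not the Memtable code):
{code:java}
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentSkipListMap;

import org.github.jamm.MemoryMeter;

public class SharedBufferDemo
{
    public static void main(String[] args)
    {
        MemoryMeter meter = new MemoryMeter().withGuessing(MemoryMeter.Guess.FALLBACK_UNSAFE);

        ByteBuffer shared = ByteBuffer.allocate(0); // plays the role of EMPTY_BYTE_BUFFER
        ConcurrentSkipListMap<Integer, ByteBuffer> sharedKeys = new ConcurrentSkipListMap<>();
        ConcurrentSkipListMap<Integer, ByteBuffer> distinctKeys = new ConcurrentSkipListMap<>();
        for (int i = 0; i < 100; i++)
        {
            sharedKeys.put(i, shared);                   // one object, counted once by measureDeep
            distinctKeys.put(i, ByteBuffer.allocate(1)); // a distinct buffer per entry
        }
        // The per-entry overhead derived from 'sharedKeys' underestimates the real
        // per-row key cost; 'distinctKeys' gives a realistic per-row estimate.
        System.out.println(meter.measureDeep(sharedKeys) / 100);
        System.out.println(meter.measureDeep(distinctKeys) / 100);
    }
}
{code}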

2) Columns (under RegularAndStaticColumns) stores ColumnMetadata in a btree structure (Object[]), but the cost of this structure is not counted.
To fix this, BTree.sizeOfStructureOnHeap(columns) should be added to org.apache.cassandra.db.Columns#unsharedHeapSize.
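A sketch of the fix (assuming Columns keeps its BTree in an Object[] field named columns, as in the Cassandra source, and that EMPTY_SIZE is the measured size of an empty Columns shell):
{code:java}
// in org.apache.cassandra.db.Columns
public long unsharedHeapSize()
{
    if (this == NONE)
        return 0;
    // previously only the Columns shell was counted; the Object[] btree
    // nodes referencing the shared ColumnMetadata objects were missed
    return EMPTY_SIZE + BTree.sizeOfStructureOnHeap(columns);
}
{code}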

3) The container for cells, BTreeRow.btree, is EMPTY_LEAF when we delete a row. EMPTY_LEAF is a shared singleton, but org.apache.cassandra.utils.btree.BTree#sizeOfStructureOnHeap, which is used in org.apache.cassandra.db.rows.BTreeRow#unsharedHeapSizeExcludingData to measure the size of this structure, does not check for EMPTY_LEAF and incorrectly reports a non-zero size for this Object[] value.
To fix this, a check for EMPTY_LEAF should be added to BTree#sizeOfStructureOnHeap.
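A sketch of the corrected method (the body is the existing sizeOfStructureOnHeap, quoted in full at the end of this comment, plus the EMPTY_LEAF check):
{code:java}
public static long sizeOfStructureOnHeap(Object[] tree)
{
    // the shared EMPTY_LEAF singleton is not owned by any one row,
    // so it must contribute zero to the unshared heap size
    if (tree == EMPTY_LEAF)
        return 0;
    long size = ObjectSizes.sizeOfArray(tree);
    if (isLeaf(tree))
        return size;
    for (int i = getChildStart(tree); i < getChildEnd(tree); i++)
        size += sizeOfStructureOnHeap((Object[]) tree[i]);
    return size;
}
{code}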

By combining these changes, I got to about 1 byte of difference per partition between the measured and the allocator-reported values.
Once I had found the exact places in the code, I compared them with the 5.0 code and found the following commit: [https://github.com/apache/cassandra/commit/49e0c61107005b1a83799f7f1e6c0a855d159c29], done as part of the CASSANDRA-17240 story. So it looks like I independently came to the same conclusions as [~blambov] :).

I have aggregated the changes into PRs for 4.0 and 4.1:
 * [https://github.com/apache/cassandra/pull/3503]
 * [https://github.com/apache/cassandra/pull/3504] 

An open question: there are two very similar methods used in the code, BTree.sizeOnHeapOf vs BTree.sizeOfStructureOnHeap.
The only difference is whether sizeOfArray(sizeMap(tree)) is measured. I wonder why we need sizeOfStructureOnHeap at all, especially taking into account that we use sizeOnHeapOf to measure the Holder.btree memory usage (in Object[] tree = BTree.update(current.tree, update.holder().tree, update.metadata().comparator, updater)). So, in one case we use sizeOfStructureOnHeap and in the other sizeOnHeapOf for the same kind of logic:
{code:java}
    public static long sizeOfStructureOnHeap(Object[] tree)
    {
        long size = ObjectSizes.sizeOfArray(tree);
        if (isLeaf(tree))
            return size;
        for (int i = getChildStart(tree); i < getChildEnd(tree); i++)
            size += sizeOfStructureOnHeap((Object[]) tree[i]);
        return size;
    }

    public static long sizeOnHeapOf(Object[] tree)
    {
        if (isEmpty(tree))
            return 0;
        long size = ObjectSizes.sizeOfArray(tree);
        if (isLeaf(tree))
            return size;
        for (int i = childOffset(tree); i < childEndOffset(tree); i++)
            size += sizeOnHeapOf((Object[]) tree[i]);
        size += ObjectSizes.sizeOfArray(sizeMap(tree)); // may overcount, since we share size maps
        return size;
    }
{code}


> Test Failure: org.apache.cassandra.cql3.MemtableSizeTest
> --------------------------------------------------------
>
>                 Key: CASSANDRA-17298
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17298
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/unit
>            Reporter: Josh McKenzie
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x
>
>         Attachments: analyzed_objects.svg, structure_example.svg
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://ci-cassandra.apache.org/job/Cassandra-4.0/313/testReport/org.apache.cassandra.cql3/MemtableSizeTest/testTruncationReleasesLogSpace_2/]
>  Failed 4 times in the last 30 runs. Flakiness: 27%, Stability: 86%
> Error Message
> Expected heap usage close to 49.930MiB, got 41.542MiB.
> {code}
> Stacktrace
> junit.framework.AssertionFailedError: Expected heap usage close to 49.930MiB, 
> got 41.542MiB.
>       at 
> org.apache.cassandra.cql3.MemtableSizeTest.testSize(MemtableSizeTest.java:130)
>       at org.apache.cassandra.Util.runCatchingAssertionError(Util.java:644)
>       at org.apache.cassandra.Util.flakyTest(Util.java:669)
>       at 
> org.apache.cassandra.cql3.MemtableSizeTest.testTruncationReleasesLogSpace(MemtableSizeTest.java:61)
>       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  {code}
> *UPDATE:* It was discovered that unit tests were running with 
> memtable_allocation_type: offheap_objects when we ship C* with heap_buffers.
> So we changed that in CASSANDRA-19326, now we test with 
> memtable_allocation_type: heap_buffers. As a result, this test now fails all 
> the time on 4.0 and 4.1. 


