[
https://issues.apache.org/jira/browse/CASSANDRA-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877181#comment-17877181
]
Dmitry Konstantinov edited comment on CASSANDRA-17298 at 8/27/24 9:02 PM:
--------------------------------------------------------------------------
Hi, I have faced this issue (a big discrepancy between actual and expected heap
size) with the test several times while committing other changes, so I decided
to volunteer :) and have spent some time analyzing it in more detail.
Please find the results of the analysis below.
There are several aspects which contribute to the issue:
1) The actual memtable size is measured using
ObjectSizes.measureDeep(memtable). A memtable object has several references
which go beyond the Memtable scope:
{code:java}
(org.apache.cassandra.db.Memtable) memtable
  (org.apache.cassandra.db.ColumnFamilyStore) cfs
    (org.apache.cassandra.db.Keyspace) keyspace <-------- has references to all tables in the same keyspace
  (org.apache.cassandra.utils.memory.SlabAllocator) allocator
    (org.apache.cassandra.utils.memory.MemtableAllocator$SubAllocator) onHeap
      ---- the boundary between memtable specific allocator objects and global memory pool objects ----
      (org.apache.cassandra.utils.memory.MemtablePool$SubPool) parent
        (org.apache.cassandra.utils.memory.SlabPool) this$0
          (org.apache.cassandra.utils.memory.MemtableCleanerThread) cleaner
            (java.lang.Thread) thread
              (java.lang.ThreadGroup) group
                (java.lang.Thread[]) threads // references to almost all Cassandra threads, then to thread-locals, etc
                (java.lang.ThreadGroup) parent
{code}
So, when we measure the heap size of a memtable object through the
Memtable.cfs and SubAllocator.parent references, we also count unrelated and
dynamically changing amounts of memory. When we run the test several times to
fight flakiness, without cleanup the impact grows further - we implicitly
observe memory allocations left over from the previous runs, so it also makes
sense to drop the table after the test to clean up memory.
To avoid the measurement issue itself, we can ask the jamm Meter to ignore the
corresponding objects by putting the @Unmetered annotation on the related
fields.
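For illustration, a minimal sketch of the annotation approach (only the
Memtable.cfs field is shown here; the actual patch selects the full set of
fields to annotate):
{code:java}
import org.github.jamm.Unmetered;

public class Memtable
{
    // @Unmetered makes jamm's MemoryMeter stop traversal here, so
    // measureDeep(memtable) no longer walks into the Keyspace and,
    // through it, into every other table's memory.
    @Unmetered
    private final ColumnFamilyStore cfs;

    // ... the rest of the class stays unchanged
}
{code}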
2) The heap_buffers memtable allocation type uses 1 MiB shared byte array
slabs to store partition keys, clustering keys and cell values.
The free part of such a slab is not counted as owned by the MemtableAllocator
logic, but it is measured by ObjectSizes.measureDeep(memtable) because a
reference path exists from the Memtable via the allocator to the slab.
There are two possible ways to deal with it:
a) do not measure the Memtable.allocator object at all
b) add a slab size to the allowed difference asserted between the
MemtableAllocator-reported value and the ObjectSizes.measureDeep(memtable) value
My own preference is a) because:
- MemtableAllocator does not consider the slab free space as used, and the
correctness of MemtableAllocator tracking is exactly what this test checks
- adding such a 1 MiB allowance decreases the accuracy of the assertion and may
hide other issues
NOTE:
4.0.x ObjectSizes uses jamm-0.3.2 with the omitSharedBufferOverhead option -->
the shared byte[] slab is NOT counted by the Meter in ObjectSizes.
4.1.x ObjectSizes uses jamm-0.3.2 withOUT the omitSharedBufferOverhead option
--> the shared byte[] slab IS counted by the Meter in ObjectSizes.
The related change in the 4.1 branch for ObjectSizes
([https://github.com/apache/cassandra/commits/cassandra-4.1/src/java/org/apache/cassandra/utils/ObjectSizes.java]):
[https://github.com/apache/cassandra/commit/0d8126dd143898588f4efcdc40b8e2bb10597185#diff-1122d7d3efe9721af7244d373e66378f7e90cb05fd65859a52e8a3ea58a7c8f9]
omitSharedBufferOverhead() is removed.
So, to avoid measuring the whole slab array on the 4.1.x branch, a separate
Meter needs to be configured for MemtableSizeTest.
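Such a test-local meter could look like the following sketch; it reuses the
same jamm 0.3.2 builder calls shown in the next section, only the wiring into
MemtableSizeTest is assumed:
{code:java}
// In MemtableSizeTest (sketch): a dedicated meter so the unused tail of
// shared byte[] slabs is not counted, while the production ObjectSizes
// meter stays untouched.
private static final MemoryMeter TEST_METER = new MemoryMeter()
        .omitSharedBufferOverhead()
        .withGuessing(MemoryMeter.Guess.FALLBACK_UNSAFE)
        .ignoreKnownSingletons();

// ... then, in the assertion:
long actualHeap = TEST_METER.measureDeep(memtable); // instead of ObjectSizes.measureDeep(memtable)
{code}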
3) The most interesting part - the precision of memtable heap tracking. I have
applied the following approach to find the missing parts.
To see the actual memory structure, I have dumped a few partitions for each
write type (insert / delete partition / delete row) as a tree using MemoryMeter
with debug enabled:
{code:java}
MemoryMeter meter = new MemoryMeter().omitSharedBufferOverhead()
                                     .withGuessing(MemoryMeter.Guess.FALLBACK_UNSAFE)
                                     .ignoreKnownSingletons()
                                     .enableDebug();
{code}
While it does not print aggregated memory usage accurately (the printed totals
do not reflect memory used from the shared buffers), it allows us to see the
object graph and the sizes of the objects themselves:
{code:java}
root [org.apache.cassandra.db.partitions.AtomicBTreePartition] 632 bytes (32 bytes)
|
+--ref [org.apache.cassandra.db.partitions.AbstractBTreePartition$Holder] 504 bytes (32 bytes)
| |
| +--columns [org.apache.cassandra.db.RegularAndStaticColumns] 48 bytes (24 bytes)
| | |
| | +--statics [org.apache.cassandra.db.Columns] 24 bytes (24 bytes)
| |
| +--deletionInfo [org.apache.cassandra.db.MutableDeletionInfo] 24 bytes (24 bytes)
| |
| +--tree [java.lang.Object[]] 200 bytes (24 bytes)
| | |
| | +--0 [org.apache.cassandra.db.rows.BTreeRow] 176 bytes (32 bytes)
| | |
| | +--clustering [org.apache.cassandra.db.BufferClustering] 96 bytes (24 bytes)
| | | |
| | | +--values [java.nio.ByteBuffer[]] 72 bytes (24 bytes)
| | | |
| | | +--0 [java.nio.HeapByteBuffer] 48 bytes (48 bytes)
| | |
| | +--deletion [org.apache.cassandra.db.rows.Row$Deletion] 48 bytes (24 bytes)
| | |
| | +--time [org.apache.cassandra.db.DeletionTime] 24 bytes (24 bytes)
| |
| +--staticRow [org.apache.cassandra.db.rows.BTreeRow] 168 bytes (32 bytes)
| | |
| | +--clustering [org.apache.cassandra.db.Clustering$1] 40 bytes (24 bytes)
| | | |
| | | +--values [java.nio.ByteBuffer[]] 16 bytes (16 bytes)
| | |
| | +--primaryKeyLivenessInfo [org.apache.cassandra.db.LivenessInfo] 24 bytes (24 bytes)
| | |
| | +--deletion [org.apache.cassandra.db.rows.Row$Deletion] 48 bytes (24 bytes)
| | | |
| | | +--time [org.apache.cassandra.db.DeletionTime] 24 bytes (24 bytes)
| | |
| | +--btree [java.lang.Object[]] 24 bytes (24 bytes)
| |
| +--stats [org.apache.cassandra.db.rows.EncodingStats] 32 bytes (32 bytes)
|
+--partitionKey [org.apache.cassandra.db.BufferDecoratedKey] 96 bytes (24 bytes)
   |
   +--key [java.nio.HeapByteBuffer] 48 bytes (48 bytes)
   |
   +--token [org.apache.cassandra.dht.Murmur3Partitioner$LongToken] 24 bytes (24 bytes)
{code}
Based on this input and source code analysis, we can build a visual
representation of the partitions and manually count the amount of used memory.
[^structure_example.svg]
As the second step, I temporarily added tracing to the allocator tracking
logic, to see which values are counted by MemtableAllocator during a put
operation and where, like:
{code:java}
// in class org.apache.cassandra.utils.memory.MemtableAllocator.SubAllocator:

public static final ThreadLocal<Boolean> tracing = ThreadLocal.withInitial(() -> Boolean.FALSE);
....
private void allocated(long size)
{
    parent.allocated(size);
    reportChange(size); // <==== added
    ownsUpdater.addAndGet(this, size);
    if (state == LifeCycle.DISCARDING)
    {
        if (logger.isTraceEnabled())
            logger.trace("Allocated {} bytes whilst discarding", size);
        updateReclaiming();
    }
}

public void reportChange(long delta)
{
    if (tracing.get())
    {
        System.out.println("ALLOCATOR: " + delta);
        new Exception().printStackTrace(System.out);
    }
}
{code}
"tracing" thread-local is used to print such information only for an analyzed
memtable
Similar tracing was added temporary to
org.apache.cassandra.db.partitions.AtomicBTreePartition.RowUpdater as well.
The output looks like:
{code:java}
ALLOCATOR: 8
java.lang.Exception
    at org.apache.cassandra.utils.memory.MemtableAllocator$SubAllocator.reportChange(MemtableAllocator.java:226)
    at org.apache.cassandra.utils.memory.MemtableAllocator$SubAllocator.acquired(MemtableAllocator.java:238)
    at org.apache.cassandra.utils.memory.MemtableAllocator$SubAllocator.allocate(MemtableAllocator.java:179)
    at org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:89)
    at org.apache.cassandra.utils.memory.MemtableBufferAllocator$1.allocate(MemtableBufferAllocator.java:40)
    at org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:77)
    at org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:63)
    at org.apache.cassandra.db.Clustering.clone(Clustering.java:53)
    at org.apache.cassandra.utils.memory.ByteBufferCloner.clone(ByteBufferCloner.java:52)
    at org.apache.cassandra.db.rows.BTreeRow.clone(BTreeRow.java:505)
    at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.insert(AtomicBTreePartition.java:376)
    at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.insert(AtomicBTreePartition.java:352)
    at org.apache.cassandra.utils.btree.BTree.transformLeaf(BTree.java:1308)
    at org.apache.cassandra.utils.btree.BTree.transform(BTree.java:1255)
    at org.apache.cassandra.utils.btree.BTree.update(BTree.java:353)
    at org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDeltaInternal(AtomicBTreePartition.java:147)
    at org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:191)
    at org.apache.cassandra.db.Memtable.put(Memtable.java:327)
{code}
As a result, for a single put operation we can observe how memory tracking
works, with stack traces to the points where it happens.
I have used this information to put marks on the object graph picture created
in the previous step.
This makes the discrepancies visible - which objects are not counted correctly.
[^analyzed_objects.svg]
(*g) So, the following discrepancies were found:
1) We do not count the ByteBuffer BufferDecoratedKey.key. It is supposed to be
counted by the Memtable.estimateRowOverhead logic, but estimateRowOverhead
puts ByteBufferUtil.EMPTY_BYTE_BUFFER, with size = 0, into the field. Cloning
the BufferDecoratedKey does not help: we still get the same object back from
org.apache.cassandra.utils.memory.ByteBufferCloner#clone(V,
org.apache.cassandra.db.marshal.ValueAccessor<V>).
So, when we measure the ConcurrentSkipListMap deep size in estimateRowOverhead,
the same ByteBufferUtil.EMPTY_BYTE_BUFFER object is used everywhere and is
counted only once. To avoid this issue, we should allocate a buffer of non-zero
size in Memtable.estimateRowOverhead.
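The effect is easy to reproduce outside Cassandra with jamm alone; a small
self-contained illustration (class and variable names here are mine for the
demo, not from the patch):
{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import org.github.jamm.MemoryMeter;

public class SharedBufferDemo
{
    public static void main(String[] args)
    {
        MemoryMeter meter = new MemoryMeter().withGuessing(MemoryMeter.Guess.FALLBACK_UNSAFE);

        ByteBuffer shared = ByteBuffer.allocate(0); // analogue of ByteBufferUtil.EMPTY_BYTE_BUFFER
        List<ByteBuffer> sharedKeys = new ArrayList<>();
        List<ByteBuffer> distinctKeys = new ArrayList<>();
        for (int i = 0; i < 100; i++)
        {
            sharedKeys.add(shared);                   // one object referenced 100 times: counted once
            distinctKeys.add(ByteBuffer.allocate(1)); // 100 distinct buffers: counted 100 times
        }

        // The first number is far smaller: exactly the per-row key overhead
        // that estimateRowOverhead misses when it samples with an empty shared buffer.
        System.out.println(meter.measureDeep(sharedKeys));
        System.out.println(meter.measureDeep(distinctKeys));
    }
}
{code}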
2) Columns (under RegularAndStaticColumns) stores ColumnMetadata in a btree
structure (Object[]), but the cost of this structure is not counted.
To fix the issue, BTree.sizeOfStructureOnHeap(columns) should be added to
org.apache.cassandra.db.Columns#unsharedHeapSize.
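A hedged sketch of the shape of this fix (EMPTY_SIZE and the columns field
name are assumptions about the real class; the pull requests below are
authoritative):
{code:java}
// org.apache.cassandra.db.Columns (sketch): count the BTree node arrays
// that hold the ColumnMetadata references, not only the Columns object itself.
public long unsharedHeapSize()
{
    // EMPTY_SIZE: shallow size of an empty Columns instance (assumed name)
    return EMPTY_SIZE + BTree.sizeOfStructureOnHeap(columns);
}
{code}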
3) The container for cells, BTreeRow.btree, is EMPTY_LEAF when we delete a row.
EMPTY_LEAF is a shared singleton, but
org.apache.cassandra.utils.btree.BTree#sizeOfStructureOnHeap, used in
org.apache.cassandra.db.rows.BTreeRow#unsharedHeapSizeExcludingData
to measure the size of this structure, does not check for EMPTY_LEAF and
incorrectly reports a non-zero size for this Object[] value.
To fix this issue, a check for the EMPTY_LEAF value should be added to
BTree#sizeOfStructureOnHeap.
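Applied to the sizeOfStructureOnHeap variant quoted at the end of this
comment, the guard could look like this (a sketch; the committed change may
differ in details):
{code:java}
public static long sizeOfStructureOnHeap(Object[] tree)
{
    // EMPTY_LEAF is a shared singleton: it must contribute zero bytes
    // per row instead of being counted once for every deleted row.
    if (tree == EMPTY_LEAF)
        return 0;
    long size = ObjectSizes.sizeOfArray(tree);
    if (isLeaf(tree))
        return size;
    for (int i = getChildStart(tree); i < getChildEnd(tree); i++)
        size += sizeOfStructureOnHeap((Object[]) tree[i]);
    return size;
}
{code}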
By combining these changes, I have got about 1 byte of difference per
partition between the measured and allocator-reported values.
When I found the exact places in the code, I compared them with the 5.0 code
and found the following commit -
[https://github.com/apache/cassandra/commit/49e0c61107005b1a83799f7f1e6c0a855d159c29]
done as part of the CASSANDRA-17240 story. So, it looks like I independently
came to the same conclusions as [~blambov] :).
I have aggregated the changes into pull requests for 4.0 and 4.1:
* [https://github.com/apache/cassandra/pull/3503]
* [https://github.com/apache/cassandra/pull/3504]
(?) An open question: there are two very similar methods used in the code:
BTree.sizeOnHeapOf vs BTree.sizeOfStructureOnHeap.
The difference is only in the measuring of sizeOfArray(sizeMap(tree)). I wonder
why we need sizeOfStructureOnHeap, especially taking into account that we use
sizeOnHeapOf to measure the Holder.btree memory usage (in Object[] tree =
BTree.update(current.tree, update.holder().tree, update.metadata().comparator,
updater)). So, in one case we use sizeOfStructureOnHeap and in another
sizeOnHeapOf for the same kind of logic:
{code:java}
public static long sizeOfStructureOnHeap(Object[] tree)
{
    long size = ObjectSizes.sizeOfArray(tree);
    if (isLeaf(tree))
        return size;
    for (int i = getChildStart(tree); i < getChildEnd(tree); i++)
        size += sizeOfStructureOnHeap((Object[]) tree[i]);
    return size;
}

public static long sizeOnHeapOf(Object[] tree)
{
    if (isEmpty(tree))
        return 0;
    long size = ObjectSizes.sizeOfArray(tree);
    if (isLeaf(tree))
        return size;
    for (int i = childOffset(tree); i < childEndOffset(tree); i++)
        size += sizeOnHeapOf((Object[]) tree[i]);
    size += ObjectSizes.sizeOfArray(sizeMap(tree)); // may overcount, since we share size maps
    return size;
}
{code}
> Test Failure: org.apache.cassandra.cql3.MemtableSizeTest
> --------------------------------------------------------
>
> Key: CASSANDRA-17298
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17298
> Project: Cassandra
> Issue Type: Bug
> Components: Test/unit
> Reporter: Josh McKenzie
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 4.0.x, 4.1.x
>
> Attachments: analyzed_objects.svg, structure_example.svg
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> [https://ci-cassandra.apache.org/job/Cassandra-4.0/313/testReport/org.apache.cassandra.cql3/MemtableSizeTest/testTruncationReleasesLogSpace_2/]
> Failed 4 times in the last 30 runs. Flakiness: 27%, Stability: 86%
> Error Message
> Expected heap usage close to 49.930MiB, got 41.542MiB.
> {code}
> Stacktrace
> junit.framework.AssertionFailedError: Expected heap usage close to 49.930MiB, got 41.542MiB.
>     at org.apache.cassandra.cql3.MemtableSizeTest.testSize(MemtableSizeTest.java:130)
>     at org.apache.cassandra.Util.runCatchingAssertionError(Util.java:644)
>     at org.apache.cassandra.Util.flakyTest(Util.java:669)
>     at org.apache.cassandra.cql3.MemtableSizeTest.testTruncationReleasesLogSpace(MemtableSizeTest.java:61)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> *UPDATE:* It was discovered that unit tests were running with
> memtable_allocation_type: offheap_objects when we ship C* with heap_buffers.
> So we changed that in CASSANDRA-19326, now we test with
> memtable_allocation_type: heap_buffers. As a result, this test now fails all
> the time on 4.0 and 4.1.