[
https://issues.apache.org/jira/browse/CASSANDRA-18919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079938#comment-18079938
]
Ling Mao commented on CASSANDRA-18919:
--------------------------------------
h5. Root Cause Analysis
The SSTable DIGEST file stores a CRC32 checksum as decimal text (e.g.,
"2287252" or "3065344519"). It is written by ChecksumWriter.writeFullChecksum()
using String.valueOf(fullChecksum.getValue()).getBytes(UTF_8).
In VerifyTest, the test reads this file using RandomAccessReader.readLong(),
which interprets the next 8 raw bytes as a binary long. CRC32 values range from
0 to 4,294,967,295 (1–10 decimal digits). When the CRC32 value happens to be
less than 10,000,000 (fewer than 8 decimal digits), the DIGEST file contains
fewer than 8 bytes, and readLong() throws EOFException.
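The mismatch can be shown without Cassandra. The following is a minimal sketch, assuming java.io.DataInputStream as a stand-in for RandomAccessReader (both read a long as 8 big-endian bytes); the class and method names are hypothetical and not part of the Cassandra code base.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; DataInputStream stands in for RandomAccessReader.
final class DigestReadDemo
{
    /** Returns true if reading the bytes as one binary long runs past the end of the data. */
    static boolean readLongHitsEof(byte[] digestBytes)
    {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(digestBytes)))
        {
            in.readLong(); // consumes 8 raw bytes, regardless of what they mean
            return false;
        }
        catch (EOFException e)
        {
            return true;
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    /** Parses the bytes the way the digest was written: as decimal text. */
    static long parseAsText(byte[] digestBytes)
    {
        return Long.parseLong(new String(digestBytes, StandardCharsets.UTF_8).trim());
    }

    public static void main(String[] args)
    {
        byte[] shortDigest = "2287252".getBytes(StandardCharsets.UTF_8);    // 7 bytes
        byte[] longDigest = "3065344519".getBytes(StandardCharsets.UTF_8);  // 10 bytes

        System.out.println(readLongHitsEof(shortDigest)); // true: fewer than 8 bytes available
        System.out.println(readLongHitsEof(longDigest));  // false: 8 bytes available
        System.out.println(parseAsText(shortDigest));     // 2287252
    }
}
```

With 8 or more bytes the binary read does not fail, but the long it returns is built from ASCII digit bytes and is unrelated to the checksum, which is why the problem only surfaces for short digests.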
This is flaky because the CRC32 depends on the exact binary content of the
compacted SSTable, which in turn depends on timestamps assigned by
FBUtilities.timestampMicros() at the moment apply() is called. Different runs
produce different timestamps, different encoded deltas, and thus different
CRC32 values. Occasionally (with probability of roughly
10,000,000 / 4,294,967,296 ≈ 0.23%, assuming the values are uniformly
distributed) the value has fewer than 8 digits.
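The estimate can be checked with a few lines of arithmetic. This is only a sketch: it assumes CRC32 values are uniformly distributed over all 2^32 possibilities, and the class name is hypothetical.

```java
import java.util.Locale;

// Hypothetical helper; assumes CRC32 values are uniformly distributed.
final class ShortDigestOdds
{
    /** Fraction of CRC32 values whose decimal form has fewer than the given number of digits. */
    static double fractionShorterThan(int digits)
    {
        double crc32Values = 4294967296.0; // 2^32 possible values, 0 .. 4,294,967,295
        double shortValues = 1;
        for (int i = 1; i < digits; i++)
            shortValues *= 10;             // 10^(digits - 1) values have fewer digits
        return Math.min(shortValues, crc32Values) / crc32Values;
    }

    public static void main(String[] args)
    {
        // Fewer than 8 digits means the value is in 0 .. 9,999,999.
        System.out.printf(Locale.ROOT, "%.2f%%%n", 100 * fractionShorterThan(8)); // prints 0.23%
    }
}
```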
The production code (DataIntegrityMetadata.FileDigestValidator) already reads
the DIGEST correctly using Long.parseLong(digestReader.readLine()). The bug was
exclusively in the test code.
h5. How to reproduce this flaky test
# In testMutateRepair, replace the call fillCF(cfs, 2); with
fillCFForTestMutateRepair(cfs, 2); (defined below) to fill the data.
# fillCFForTestMutateRepair uses specific timestamps that deterministically
produce a short (7-digit) CRC32, so it also serves as a regression test for
this exact scenario.
{code:java}
protected void fillCFForTestMutateRepair(ColumnFamilyStore cfs, int partitionsPerSSTable)
{
    // CASSANDRA-18919: The DIGEST file stores a CRC32 value as decimal text (e.g. "2287252").
    // The test reads it with readLong() which expects 8 binary bytes. When CRC32 < 10,000,000
    // (fewer than 8 decimal digits), the file has < 8 bytes -> EOFException.
    //
    // To reproduce: use different timestamps per partition. With the same timestamp, delta
    // encoding makes data file content independent of the absolute timestamp value, always
    // producing the same CRC32. Different timestamps between partitions change the encoded
    // deltas and thus the CRC32.
    //
    // Found by brute-force: partition "0" at 1704067200000000 (2024-01-01T00:00:00 UTC micros)
    // and partition "1" at +410 micros produces DIGEST "2287252" (7 bytes) after compaction.
    long baseTimestamp = 1704067200000000L;
    // Only two timestamps are defined, so partitionsPerSSTable must be at most 2.
    long[] timestamps = { baseTimestamp, baseTimestamp + 410L };
    for (int i = 0; i < partitionsPerSSTable; i++)
    {
        UpdateBuilder.create(cfs.metadata(), String.valueOf(i))
                     .withTimestamp(timestamps[i])
                     .newRow("c1").add("val", "1")
                     .newRow("c2").add("val", "2")
                     .apply();
    }
    Util.flush(cfs);
}
{code}
h5. Fix
Replaced all 4 occurrences of file.readLong() in VerifyTest.java with
Long.parseLong(file.readLine()), matching the correct approach used in
production code.
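The text-based read can be illustrated with a standalone round trip. This is a sketch under stated assumptions, not the actual patch: it uses java.nio.file and BufferedReader in place of Cassandra's ChecksumWriter and RandomAccessReader, and all class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Hypothetical stand-in for the digest write/read pair described in this comment.
final class DigestRoundTrip
{
    /** Writes the checksum as decimal text, mirroring String.valueOf(value).getBytes(UTF_8). */
    static void writeDigest(Path file, long checksum) throws IOException
    {
        Files.write(file, String.valueOf(checksum).getBytes(StandardCharsets.UTF_8));
    }

    /** Reads the checksum back as text, mirroring Long.parseLong(reader.readLine()). */
    static long readDigest(Path file) throws IOException
    {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8))
        {
            return Long.parseLong(reader.readLine());
        }
    }

    /** Writes the checksum to a temporary file and reads it back. */
    static long roundTrip(long checksum)
    {
        try
        {
            Path file = Files.createTempFile("digest-demo", ".crc32");
            try
            {
                writeDigest(file, checksum);
                return readDigest(file);
            }
            finally
            {
                Files.deleteIfExists(file);
            }
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        CRC32 crc = new CRC32();
        crc.update("example sstable bytes".getBytes(StandardCharsets.UTF_8));
        long checksum = crc.getValue();
        System.out.println(roundTrip(checksum) == checksum);
        System.out.println(roundTrip(2287252L)); // the 7-digit value from this ticket
    }
}
```

Because the value is parsed as text, the number of digits no longer matters: a 1-digit and a 10-digit checksum are read the same way.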
> Test failure:
> org.apache.cassandra.io.sstable.VerifyTest.testMutateRepair-.jdk11
> --------------------------------------------------------------------------------
>
> Key: CASSANDRA-18919
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18919
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Test/unit
> Reporter: Ekaterina Dimitrova
> Priority: Normal
> Fix For: 4.1.x, 5.0.x, 6.x
>
>
>
> {code:java}
> org.apache.cassandra.io.sstable.VerifyTest.testMutateRepair-.jdk11 (from org.apache.cassandra.io.sstable.VerifyTest-.jdk11)
> Failing for the past 1 build (Since #60 ) Took 0.42 sec. Failed 1 times in the last 16 runs. Flakiness: 6%, Stability: 93%
> Stacktrace
> java.io.EOFException
>     at org.apache.cassandra.io.util.RebufferingInputStream.readByte(RebufferingInputStream.java:180)
>     at org.apache.cassandra.io.util.RebufferingInputStream.readPrimitiveSlowly(RebufferingInputStream.java:142)
>     at org.apache.cassandra.io.util.RebufferingInputStream.readLong(RebufferingInputStream.java:231)
>     at org.apache.cassandra.io.sstable.VerifyTest.testMutateRepair(VerifyTest.java:538)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> Seen here:
> https://ci-cassandra.apache.org/job/Cassandra-5.0/60/testReport/org.apache.cassandra.io.sstable/VerifyTest/testMutateRepair__jdk11/
>