[ 
https://issues.apache.org/jira/browse/CASSANDRA-18919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079938#comment-18079938
 ] 

Ling Mao commented on CASSANDRA-18919:
--------------------------------------

h5. Root Cause Analysis 

The SSTable DIGEST file stores a CRC32 checksum as decimal text (e.g., 
"2287252" or "3065344519"). It is written by ChecksumWriter.writeFullChecksum() 
using String.valueOf(fullChecksum.getValue()).getBytes(UTF_8).

In VerifyTest, the test reads this file using RandomAccessReader.readLong(), 
which interprets the next 8 raw bytes as a binary long. CRC32 values range from 
0 to 4,294,967,295 (1–10 decimal digits). When the CRC32 value happens to be 
less than 10,000,000 (fewer than 8 decimal digits), the DIGEST file contains 
fewer than 8 bytes, and readLong() throws EOFException.
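
The mismatch can be illustrated outside Cassandra. The following is a minimal sketch, assuming java.io.DataInputStream as a stand-in for RandomAccessReader (both read 8 raw bytes for readLong()); the class and method names are hypothetical and not part of the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; DataInputStream stands in for Cassandra's RandomAccessReader.
final class DigestReadLongDemo
{
    /** Encodes a checksum as decimal text, as the DIGEST format is described above. */
    static byte[] encodeAsDecimalText(long checksum)
    {
        return String.valueOf(checksum).getBytes(StandardCharsets.UTF_8);
    }

    /** Returns true if reading 8 raw bytes as a binary long runs past the end of the input. */
    static boolean readLongHitsEof(byte[] digestBytes)
    {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(digestBytes)))
        {
            in.readLong(); // needs 8 bytes; the result is the ASCII digits reinterpreted, not the checksum
            return false;
        }
        catch (EOFException e)
        {
            return true;
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        for (long checksum : new long[]{ 2287252L, 3065344519L })
        {
            byte[] bytes = encodeAsDecimalText(checksum);
            System.out.println(checksum + ": " + bytes.length + " bytes, EOF on readLong = " + readLongHitsEof(bytes));
        }
    }
}
```

Note that for the 10-digit value readLong() does not fail, but the long it returns is built from the first 8 ASCII digit bytes and is unrelated to the checksum.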

This is flaky because the CRC32 depends on the exact binary content of the 
compacted SSTable, which in turn depends on timestamps assigned by 
FBUtilities.timestampMicros() at the moment apply() is called. Different runs 
produce different timestamps, different encoded deltas, and thus different 
CRC32 values. Occasionally (probability of about 10,000,000 / 4,294,967,296 = 
0.23%, assuming the CRC32 values are uniformly distributed) the value has 
fewer than 8 digits.
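
The arithmetic behind that estimate can be checked with a few lines. This is only a sketch: it assumes CRC32 values are uniformly distributed over all 2^32 possibilities, and the class and method names are hypothetical.

```java
// Hypothetical helper; assumes CRC32 values are uniformly distributed.
final class ShortDigestProbability
{
    /** Number of distinct CRC32 values: 0 through 4,294,967,295. */
    static final long CRC32_VALUES = 1L << 32;

    /** Fraction of CRC32 values whose decimal text has fewer than minDigits digits. */
    static double fractionWithFewerDigitsThan(int minDigits)
    {
        long threshold = 1; // becomes 10^(minDigits - 1), the smallest value with minDigits digits
        for (int i = 1; i < minDigits; i++)
            threshold *= 10;
        return (double) Math.min(threshold, CRC32_VALUES) / CRC32_VALUES;
    }

    public static void main(String[] args)
    {
        double p = fractionWithFewerDigitsThan(8);
        System.out.printf("P(fewer than 8 digits) = %.4f%%%n", p * 100);
    }
}
```

Under that assumption the short-digest case occurs in roughly 1 out of 430 runs.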

The production code (DataIntegrityMetadata.FileDigestValidator) already reads 
the DIGEST correctly using Long.parseLong(digestReader.readLine()). The bug was 
exclusively in the test code.
h5. How to reproduce this flaky test

# In testMutateRepair, replace fillCF(cfs, 2); with 
fillCFForTestMutateRepair(cfs, 2) to fill the data.
# fillCFForTestMutateRepair uses fixed timestamps that deterministically 
produce a short (7-digit) CRC32, so it also serves as a regression test for 
this exact scenario.

 
{code:java}
protected void fillCFForTestMutateRepair(ColumnFamilyStore cfs, int partitionsPerSSTable)
{
    // CASSANDRA-18919: The DIGEST file stores a CRC32 value as decimal text (e.g. "2287252").
    // The test reads it with readLong(), which expects 8 binary bytes. When CRC32 < 10,000,000
    // (fewer than 8 decimal digits), the file has < 8 bytes -> EOFException.
    //
    // To reproduce: use different timestamps per partition. With the same timestamp, delta
    // encoding makes data file content independent of the absolute timestamp value, always
    // producing the same CRC32. Different timestamps between partitions change the encoded
    // deltas and thus the CRC32.
    //
    // Found by brute-force: partition "0" at 1704067200000000 (2024-01-01T00:00:00 UTC micros)
    // and partition "1" at +410 micros produces DIGEST "2287252" (7 bytes) after compaction.
    long baseTimestamp = 1704067200000000L;
    long[] timestamps = { baseTimestamp, baseTimestamp + 410L };
    for (int i = 0; i < partitionsPerSSTable; i++)
    {
        UpdateBuilder.create(cfs.metadata(), String.valueOf(i))
                     .withTimestamp(timestamps[i])
                     .newRow("c1").add("val", "1")
                     .newRow("c2").add("val", "2")
                     .apply();
    }
    Util.flush(cfs);
}{code}
 
h5. Fix
Replaced all 4 occurrences of file.readLong() in VerifyTest.java with 
Long.parseLong(file.readLine()), matching the correct approach used in 
production code.
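
The text-based read can be sketched in isolation as follows. This is an illustration rather than the actual patch: java.io.BufferedReader stands in for the Cassandra reader, and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; BufferedReader stands in for the reader used in VerifyTest.
final class DigestParseDemo
{
    /** Parses a DIGEST file's content as decimal text, regardless of how many digits it has. */
    static long parseDigest(String digestFileContent)
    {
        try (BufferedReader reader = new BufferedReader(new StringReader(digestFileContent)))
        {
            return Long.parseLong(reader.readLine());
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(parseDigest("2287252"));    // 7-digit digest that broke readLong()
        System.out.println(parseDigest("3065344519")); // 10-digit digest
    }
}
```

Because the value is parsed from a line of text, the number of digits no longer matters.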

> Test failure: 
> org.apache.cassandra.io.sstable.VerifyTest.testMutateRepair-.jdk11
> --------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18919
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18919
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Test/unit
>            Reporter: Ekaterina Dimitrova
>            Priority: Normal
>             Fix For: 4.1.x, 5.0.x, 6.x
>
>
>  
> {code:java}
> org.apache.cassandra.io.sstable.VerifyTest.testMutateRepair-.jdk11 (from 
> org.apache.cassandra.io.sstable.VerifyTest-.jdk11)
> Failing for the past 1 build (Since #60 ) Took 0.42 sec.      Failed 1 times 
> in the last 16 runs. Flakiness: 6%, Stability: 93% Stacktrace
> java.io.EOFException at 
> org.apache.cassandra.io.util.RebufferingInputStream.readByte(RebufferingInputStream.java:180)
>  at 
> org.apache.cassandra.io.util.RebufferingInputStream.readPrimitiveSlowly(RebufferingInputStream.java:142)
>  at 
> org.apache.cassandra.io.util.RebufferingInputStream.readLong(RebufferingInputStream.java:231)
>  at 
> org.apache.cassandra.io.sstable.VerifyTest.testMutateRepair(VerifyTest.java:538)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> Seen here: 
> https://ci-cassandra.apache.org/job/Cassandra-5.0/60/testReport/org.apache.cassandra.io.sstable/VerifyTest/testMutateRepair__jdk11/
>  


