B. Micheal Okutubo created SPARK-50151:
------------------------------------------

             Summary: RocksDB Hardening: Fix new file mapping version 
advancement and ineffective file reuse
                 Key: SPARK-50151
                 URL: https://issues.apache.org/jira/browse/SPARK-50151
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 4.0.0
            Reporter: B. Micheal Okutubo


There are two bugs in the new approach to RocksDB SST file mapping that was 
recently added in this PR: [https://github.com/apache/spark/pull/47875]
 # The file mapping version does not advance properly; it only advances when 
we reopen RocksDB. This makes file reuse ineffective: files that should have 
been reused are not, which leads to a lot of unnecessary file upload/download. 
In the worst case it makes us behave as if file reuse were disabled.
 # File reuse is ineffective when creating a checkpoint. We currently do not 
reuse a file when creating a checkpoint if it was added in the current version, 
i.e. if you do load(v1) -> save(v2), the SST files loaded in v1 are not reused 
in v2 and are uploaded again.
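
The load(v1) -> save(v2) case in the second item can be illustrated with a small 
toy model. This is only a sketch, not Spark's implementation: the function 
files_to_upload, the mapping layout, the SST file names and the inclusive flag are 
all hypothetical, and the strict-versus-inclusive version comparison is just one 
assumed way such behavior could arise.

```python
# Toy model only. Every name here is hypothetical and none of this is Spark code.

def files_to_upload(mapping, loaded_version, local_files, inclusive):
    """Return the local SST files that a simulated save would upload.

    mapping: local SST file name -> version at which the file was recorded
             as already present in the checkpoint location.
    inclusive: if True, files recorded at the loaded version are reusable;
               if False, only files recorded at an earlier version are.
    """
    uploads = []
    for name in local_files:
        mapped_version = mapping.get(name)
        if mapped_version is None:
            uploads.append(name)  # never uploaded before, must be uploaded
        elif mapped_version < loaded_version:
            continue              # recorded at an earlier version: reuse
        elif inclusive and mapped_version == loaded_version:
            continue              # recorded at the loaded version: reuse
        else:
            uploads.append(name)  # treated as not reusable, uploaded again
    return uploads


# Simulated load(v1): two SST files come from the checkpoint, recorded at v1.
mapping = {"001.sst": 1, "002.sst": 1}
# Before the simulated save(v2), one new SST file exists locally.
local_files = ["001.sst", "002.sst", "003.sst"]

# Behavior resembling the reported bug: files loaded in v1 are uploaded again.
print(files_to_upload(mapping, 1, local_files, inclusive=False))
# Intended behavior: only the new file is uploaded.
print(files_to_upload(mapping, 1, local_files, inclusive=True))
```

In this toy model the strict comparison uploads all three files, while the 
inclusive one uploads only 003.sst, which is the difference in upload volume 
described above.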

These two bugs were not caught, even though we have tests meant to catch them, 
because there was also a bug in the test.

NOTE: these bugs do not cause corruption; they are performance bugs. They only 
make file reuse ineffective, so we end up behaving as if file reuse were not 
enabled, i.e. a lot of file upload/download.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
