B. Micheal Okutubo created SPARK-50151:
------------------------------------------
Summary: RocksDB Hardening: Fix new file mapping version
advancement and ineffective file reuse
Key: SPARK-50151
URL: https://issues.apache.org/jira/browse/SPARK-50151
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: B. Micheal Okutubo
There are two bugs in the new approach to RocksDB SST file mapping that was
recently added in this PR: [https://github.com/apache/spark/pull/47875]
# The file mapping version does not advance properly: it only advances when we
reopen RocksDB. As a result, files that should have been reused are not reused,
which leads to a lot of unnecessary file upload/download. In the worst case, it
behaves as if file reuse were disabled.
# Ineffective file reuse when creating a checkpoint. We currently do not reuse
a file when creating a checkpoint if it was added in the current version, i.e.
if you do load(v1) -> save(v2), the SST files loaded in v1 will not be reused
in v2, and we will upload them again.
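To make the second bug concrete, here is a minimal toy model of the load(v1) -> save(v2) case. It is only a sketch and not Spark's implementation: the function `files_to_upload`, the `mapping` dictionary, the `reuse_loaded_version` flag, and the file names are all hypothetical and exist only for illustration.

```python
# Toy model only: every name here is hypothetical, not Spark's API.
def files_to_upload(mapping, loaded_version, current_files, reuse_loaded_version=True):
    """Return the local SST files that need a fresh upload.

    mapping: local SST file name -> version at which its remote copy was recorded.
    reuse_loaded_version=False mimics the reported bug, where files recorded
    in the version that was just loaded are not treated as reusable.
    """
    uploads = []
    for name in current_files:
        recorded = mapping.get(name)
        if recorded is None:
            # Never uploaded before, so it must be uploaded now.
            uploads.append(name)
        elif recorded == loaded_version and not reuse_loaded_version:
            # Buggy path: a remote copy exists, but it is uploaded again.
            uploads.append(name)
    return uploads


# load(v1): two SST files are already present remotely.
mapping = {"001.sst": 1, "002.sst": 1}
# save(v2): one new SST file was created since the load.
current = ["001.sst", "002.sst", "003.sst"]

buggy = files_to_upload(mapping, 1, current, reuse_loaded_version=False)
fixed = files_to_upload(mapping, 1, current)
print(buggy)
print(fixed)
```

In this toy model the buggy path uploads all three files, while the intended behavior uploads only the file that is new since the load.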
These two bugs were not caught even though we have tests to catch them, because
there was also a bug in the test.
NOTE: these bugs do not cause corruption; they are performance bugs. They just
make file reuse ineffective, so we end up behaving as if file reuse were not
enabled, i.e. a lot of file upload/download.