[
https://issues.apache.org/jira/browse/SPARK-50151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-50151:
-----------------------------------
Labels: pull-request-available (was: )
> RocksDB Hardening: Fix new file mapping version advancement and ineffective
> file reuse bug
> ------------------------------------------------------------------------------------------
>
> Key: SPARK-50151
> URL: https://issues.apache.org/jira/browse/SPARK-50151
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: B. Micheal Okutubo
> Priority: Major
> Labels: pull-request-available
>
> There are two bugs in the new approach to RocksDB SST file mapping that was
> recently added in this PR: [https://github.com/apache/spark/pull/47875]
> # The file mapping version does not advance properly: it advances only when
> we reopen RocksDB. As a result, files that should have been reused are not,
> which leads to a lot of unnecessary file upload/download and, in the worst
> case, makes us behave as if file reuse were disabled.
> # Ineffective file reuse when creating a checkpoint. We currently do not
> reuse a file when creating a checkpoint if it was added in the current
> version, i.e. if you do load(v1) -> save(v2), the SST files loaded in v1 will
> not be reused in v2, and we will upload them again.
> These two bugs were not caught, even though we have tests meant to catch
> them, because there was also a bug in the test.
> NOTE: these bugs will not cause corruption; they are performance bugs.
> They just make file reuse ineffective, so we end up working as if file reuse
> were not enabled, i.e. with a lot of file upload/download.
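
The load(v1) -> save(v2) case from the second bug can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyFileMapping`, its fields, and the `reuse_loaded` flag are hypothetical names invented for illustration and do not correspond to Spark's actual classes or to the fix in the PR. It only shows why refusing to reuse files recorded in the currently loaded version causes every loaded SST file to be uploaded again.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileMapping:
    """Toy model of SST file reuse; hypothetical, not Spark's implementation."""

    # Local SST file name -> version in which the mapping recorded it.
    recorded_in: dict = field(default_factory=dict)
    loaded_version: int = 0

    def load(self, version, sst_files):
        """Record every SST file that came with the loaded version."""
        self.loaded_version = version
        for name in sst_files:
            self.recorded_in[name] = version

    def save(self, version, sst_files, reuse_loaded=True):
        """Return the files that must be uploaded for this checkpoint."""
        uploads = []
        for name in sst_files:
            seen = self.recorded_in.get(name)
            # A file is reusable if it was recorded before the loaded version,
            # or in the loaded version itself when reuse_loaded is set.
            reusable = seen is not None and (
                seen < self.loaded_version
                or (reuse_loaded and seen == self.loaded_version)
            )
            if not reusable:
                uploads.append(name)
                self.recorded_in[name] = version
        return uploads


def uploads_for_save_v2(reuse_loaded):
    mapping = ToyFileMapping()
    mapping.load(1, ["a.sst", "b.sst"])
    # Version 2 keeps both loaded files and adds one new file.
    return mapping.save(2, ["a.sst", "b.sst", "c.sst"], reuse_loaded=reuse_loaded)


buggy = uploads_for_save_v2(reuse_loaded=False)
fixed = uploads_for_save_v2(reuse_loaded=True)
print(buggy)  # all three files are uploaded again
print(fixed)  # only the new file is uploaded
```

With `reuse_loaded=False` the toy model mirrors the behaviour described above, where files loaded in v1 are uploaded again on save(v2); with `reuse_loaded=True` only the file that is new in v2 is uploaded.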
--
This message was sent by Atlassian Jira
(v8.20.10#820010)