[ 
https://issues.apache.org/jira/browse/SPARK-50151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

B. Micheal Okutubo updated SPARK-50151:
---------------------------------------
    Summary: RocksDB Hardening: Fix new file mapping version advancement and 
ineffective file reuse bug  (was: RocksDB Hardening: Fix new file mapping 
version advancement and ineffective file reuse)

> RocksDB Hardening: Fix new file mapping version advancement and ineffective 
> file reuse bug
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-50151
>                 URL: https://issues.apache.org/jira/browse/SPARK-50151
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 4.0.0
>            Reporter: B. Micheal Okutubo
>            Priority: Major
>
> There are two bugs in the new approach to RocksDB SST file mapping that was 
> recently added in this PR: [https://github.com/apache/spark/pull/47875]
>  # The file mapping version is not advancing properly: it advances only when 
> we reopen RocksDB. As a result, files that should have been reused are not 
> reused, which leads to a lot of unnecessary file upload/download and, in the 
> worst case, makes us behave as if file reuse were disabled.
>  # Ineffective file reuse when creating a checkpoint. When creating a 
> checkpoint, we currently do not reuse a file if it was added in the current 
> version, i.e. if you do load(v1) -> save(v2), the SST files loaded in v1 will 
> not be reused in v2, and we will upload them again.
> Neither bug was caught, even though we have tests meant to catch them, 
> because there was also a bug in the test.
> NOTE: these bugs do not cause corruption; they are performance bugs. They 
> only make file reuse ineffective, so we end up behaving as if file reuse 
> were not enabled, i.e. a lot of file upload/download.
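
The load(v1) -> save(v2) case in the second item can be illustrated with a toy
model. This is only a sketch under stated assumptions, not Spark's actual
implementation: the class ToyFileMapping, its methods, and the flag
reuse_files_from_loaded_version are all hypothetical names invented for
illustration, and the model tracks file names only.

```python
class ToyFileMapping:
    """Toy model of SST file reuse across checkpoint versions (not Spark's code)."""

    def __init__(self, reuse_files_from_loaded_version):
        # Hypothetical switch: False mimics the second bug described above,
        # True mimics the intended behavior.
        self.reuse_files_from_loaded_version = reuse_files_from_loaded_version
        self.uploaded = {}          # local file name -> version that uploaded it
        self.loaded_version = None

    def load(self, version, files):
        """Pretend to download `files` for `version`; they already exist remotely."""
        self.loaded_version = version
        self.uploaded = {name: version for name in files}

    def save(self, version, local_files):
        """Return the list of files that would be uploaded for `version`."""
        to_upload = []
        for name in local_files:
            known_version = self.uploaded.get(name)
            reusable = known_version is not None and (
                self.reuse_files_from_loaded_version
                or known_version != self.loaded_version
            )
            if not reusable:
                to_upload.append(name)
                self.uploaded[name] = version
        return to_upload


# Buggy behavior: files loaded in v1 are uploaded again when saving v2.
buggy = ToyFileMapping(reuse_files_from_loaded_version=False)
buggy.load(1, ["a.sst", "b.sst"])
buggy_uploads = buggy.save(2, ["a.sst", "b.sst", "c.sst"])

# Intended behavior: only the file that is new in v2 is uploaded.
fixed = ToyFileMapping(reuse_files_from_loaded_version=True)
fixed.load(1, ["a.sst", "b.sst"])
fixed_uploads = fixed.save(2, ["a.sst", "b.sst", "c.sst"])

print(buggy_uploads)
print(fixed_uploads)
```

In this sketch the buggy mapping uploads all three files on save(2), which is
the "acts like file reuse is disabled" outcome, while the other mapping uploads
only c.sst.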



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
