[ 
https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998935#comment-16998935
 ] 

Jungtaek Lim commented on SPARK-30294:
--------------------------------------

Working on the fix. I may first propose a solution that also opens the door 
to optimizing the read-only state store, and fall back to a workaround if the 
community is not happy with that solution.

> Read-only state store unnecessarily creates and deletes the temp file for 
> delta file every batch
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30294
>                 URL: https://issues.apache.org/jira/browse/SPARK-30294
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Minor
>
> [https://github.com/apache/spark/blob/d38f8167483d4d79e8360f24a8c0bffd51460659/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L143-L155]
> {code:java}
>     /** Abort all the updates made on this store. This store will not be usable any more. */
>     override def abort(): Unit = {
>       // This if statement is to ensure that files are deleted only if there are changes to the
>       // StateStore. We have two StateStores for each task, one which is used only for reading, and
>       // the other used for read+write. We don't want the read-only to delete state files.
>       if (state == UPDATING) {
>         state = ABORTED
>         cancelDeltaFile(compressedStream, deltaFileStream)
>       } else {
>         state = ABORTED
>       }
>       logInfo(s"Aborted version $newVersion for $this")
>     }
> {code}
> Despite the comment, the read-only state store also goes through the same 
> preparation for writing: it creates the temporary file, initializes output 
> streams for the file, closes these output streams, and deletes the temporary 
> file. This is unnecessary, and it is confusing because, according to the log 
> messages, two different instances seem to write to the same delta file.
>  
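
For illustration only, one way to avoid the unnecessary create-and-delete 
cycle is to open the temporary delta file lazily, on the first write. The 
sketch below is a standalone toy, not Spark code and not necessarily the fix 
being prepared for this issue: LazyDeltaStore, put, and hasPendingWrites are 
hypothetical names, and it uses the local filesystem via java.nio instead of 
Spark's checkpoint file manager.

```scala
import java.io.DataOutputStream
import java.nio.file.{Files, Path}

// Hypothetical stand-in for a versioned store; NOT Spark's HDFSBackedStateStore.
final class LazyDeltaStore(dir: Path, version: Long) {
  private var tempFile: Option[Path] = None
  private var out: Option[DataOutputStream] = None
  private var aborted = false

  /** True once a write has forced the temp delta file into existence. */
  def hasPendingWrites: Boolean = out.isDefined

  // Create the temp file and its output stream only when the first write arrives.
  private def stream(): DataOutputStream = out.getOrElse {
    val file = Files.createTempFile(dir, s"$version.delta.", ".tmp")
    val s = new DataOutputStream(Files.newOutputStream(file))
    tempFile = Some(file)
    out = Some(s)
    s
  }

  def put(key: String, value: String): Unit = {
    require(!aborted, "store was aborted")
    val s = stream()
    s.writeUTF(key)
    s.writeUTF(value)
  }

  /** Abort: there is something to clean up only if a write created the temp file. */
  def abort(): Unit = {
    out.foreach(_.close())
    tempFile.foreach(f => Files.deleteIfExists(f))
    out = None
    tempFile = None
    aborted = true
  }
}
```

With this shape, a store that is only read from never touches the 
filesystem, so abort() has nothing to close or delete, and the log no longer 
suggests that two instances are writing to the same delta file.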


