HeartSaVioR opened a new pull request #26935: [SPARK-30294][SS] Explicitly 
defines read-only StateStore and optimize for HDFSBackedStateStore
URL: https://github.com/apache/spark/pull/26935
 
 
   ### What changes were proposed in this pull request?
   
   Spark has a notion of 'read-only' and 'read+write' state stores, but it is 
defined only implicitly. Spark doesn't prevent writes to a 'read-only' state 
store; it just assumes that a read-only stateful operator will not modify the 
store. Because the distinction is not explicit, every state store instance has 
to be implemented as 'read+write' even if it is only used as 'read-only', which 
sometimes causes confusion.
   
   For example, abort() in HDFSBackedStateStore - 
https://github.com/apache/spark/blob/d38f8167483d4d79e8360f24a8c0bffd51460659/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L143-L155
   
   ```
   /** Abort all the updates made on this store. This store will not be usable any more. */
   override def abort(): Unit = {
     // This if statement is to ensure that files are deleted only if there are changes to the
     // StateStore. We have two StateStores for each task, one which is used only for reading, and
     // the other used for read+write. We don't want the read-only to delete state files.
     if (state == UPDATING) {
       state = ABORTED
       cancelDeltaFile(compressedStream, deltaFileStream)
     } else {
       state = ABORTED
     }
     logInfo(s"Aborted version $newVersion for $this")
   }
   ```
   
   The comment reads as if the `if` statement behaves differently for 
'read-only' and 'read+write' stores, but that's not true: both state stores 
have their state initialized to UPDATING (no difference). So a 'read-only' 
state store also creates the temporary file, initializes output streams to 
write to it, closes the output streams, and finally deletes the temporary file. 
These unnecessary operations are performed per batch/partition.
   
   This patch explicitly defines a 'read-only' StateStore and enables a state 
store provider to create a 'read-only' StateStore instance when requested. The 
relevant code paths are modified, and a 'read-only' StateStore implementation 
for HDFSBackedStateStore is introduced. The new implementation gets rid of the 
unnecessary operations explained above.
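   
   As a rough illustration of the difference, the sketch below contrasts a 
store that eagerly opens a temporary delta file with one that never touches 
the file system. It is a simplified model, not the code in this patch: 
`ReadWriteStoreSketch`, `ReadOnlyStoreSketch`, and their constructor 
parameters are hypothetical names, and a local directory stands in for HDFS.
   
   ```scala
   import java.nio.file.{Files, Path}
   import scala.collection.mutable

   // Hypothetical, simplified model; NOT Spark's HDFSBackedStateStore.
   class ReadWriteStoreSketch(snapshot: Map[String, String], dir: Path) {
     // Created eagerly, even if the caller never writes anything.
     private val tempFile: Path = Files.createTempFile(dir, "delta-", ".tmp")
     private val out = Files.newOutputStream(tempFile)
     private val updates = mutable.Map.empty[String, String]

     def get(key: String): Option[String] =
       updates.get(key).orElse(snapshot.get(key))

     def put(key: String, value: String): Unit = {
       updates(key) = value
       out.write(s"$key=$value\n".getBytes("UTF-8"))
     }

     // Closes the stream and deletes the temporary file.
     def abort(): Unit = {
       out.close()
       Files.deleteIfExists(tempFile)
     }
   }

   // Hypothetical read-only counterpart: no temporary file, no output
   // stream, and no state machine to track.
   class ReadOnlyStoreSketch(snapshot: Map[String, String]) {
     def get(key: String): Option[String] = snapshot.get(key)
     def abort(): Unit = () // nothing to clean up
   }
   ```
   
   In this model a reader that uses `ReadWriteStoreSketch` pays for one file 
creation and one deletion per instance, while `ReadOnlyStoreSketch` pays for 
neither.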
   
   From a backward-compatibility point of view, the only change on the public 
API side is `StateStoreProvider`. The trait has to change to allow requesting a 
'read-only' StateStore; this patch adds a default implementation that takes the 
'read+write' StateStore and wraps it in a 'write-protected' StateStore 
instance, so custom providers don't need to change their code. Providers that 
can optimize for read-only workloads can override the default to do so.
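   
   The idea of a default implementation that wraps the 'read+write' store can 
be sketched as follows. This is a minimal model under stated assumptions, not 
the patch itself: `SimpleStateStore`, `SimpleStateStoreProvider`, 
`WriteProtectedStore`, `InMemoryStore`, and the method name `getReadOnlyStore` 
are all hypothetical, and the real `StateStore` trait has a larger interface.
   
   ```scala
   import scala.collection.mutable

   // Simplified stand-in for a state store; NOT Spark's StateStore trait.
   trait SimpleStateStore {
     def get(key: String): Option[String]
     def put(key: String, value: String): Unit
     def remove(key: String): Unit
   }

   // Hypothetical wrapper: forwards reads to the delegate, rejects writes.
   class WriteProtectedStore(delegate: SimpleStateStore) extends SimpleStateStore {
     override def get(key: String): Option[String] = delegate.get(key)
     override def put(key: String, value: String): Unit =
       throw new UnsupportedOperationException("put on a read-only state store")
     override def remove(key: String): Unit =
       throw new UnsupportedOperationException("remove on a read-only state store")
   }

   // Simplified stand-in for a provider; the method name is an assumption.
   trait SimpleStateStoreProvider {
     def getStore(version: Long): SimpleStateStore

     // Default: reuse the read+write store behind a write-protected wrapper,
     // so existing providers keep working without any code change.
     def getReadOnlyStore(version: Long): SimpleStateStore =
       new WriteProtectedStore(getStore(version))
   }

   // Tiny in-memory store, used only to exercise the sketch.
   class InMemoryStore(initial: Map[String, String]) extends SimpleStateStore {
     private val data = mutable.Map.empty[String, String] ++= initial
     override def get(key: String): Option[String] = data.get(key)
     override def put(key: String, value: String): Unit = { data(key) = value }
     override def remove(key: String): Unit = { data.remove(key); () }
   }
   ```
   
   A provider that only implements `getStore` inherits `getReadOnlyStore` for 
free, while a provider that can serve reads more cheaply overrides it.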
   
   Please note that this patch makes ReadOnlyStateStore extend StateStore and 
be referred to as StateStore, because StateStore is used in so many places that 
it's not easy to support both traits if we differentiate them.
   
   ### Why are the changes needed?
   
   The new API makes it possible to optimize a read-only state store instance 
separately from a read+write one. HDFSBackedStateStoreProvider is modified to 
provide a read-only version of the state store that deals with neither the 
temporary file nor the state machine.
   
   ### Does this PR introduce any user-facing change?
   
   Clearly "no" for most end users, and also "no" for custom state store 
providers, since it doesn't touch the trait `StateStore` and provides a default 
implementation for the method added to the trait `StateStoreProvider`.
   
   ### How was this patch tested?
   
   Modified unit tests. Existing unit tests ensure the change doesn't break anything.
