Eric Marnadi created SPARK-55795:
------------------------------------

             Summary: Add automatic V1 to V2 offset log upgrade for streaming 
queries with named sources
                 Key: SPARK-55795
                 URL: https://issues.apache.org/jira/browse/SPARK-55795
             Project: Spark
          Issue Type: Task
          Components: Structured Streaming
    Affects Versions: 4.2.0
            Reporter: Eric Marnadi


Introduce an automatic offset log upgrade mechanism that allows streaming 
queries to migrate from V1 (positional) offset tracking to V2 (named) offset 
tracking when users add {{.name()}} to their streaming sources.

Currently, when users want to migrate from V1 (index-based) to V2 (name-based) 
offset tracking, they must:
 # Delete their checkpoint directory (losing all state)
 # Start fresh

This is problematic because:
 * {*}State loss{*}: All stateful operators (aggregations, joins, 
deduplication) lose their state
 * {*}Data reprocessing{*}: The query must reprocess all historical data from 
the beginning
 * {*}Downtime{*}: Requires stopping the query and careful coordination

With this change, users can safely migrate existing V1 offset logs to V2 format 
by:
 # Adding {{.name()}} to all streaming sources
 # Setting {{spark.sql.streaming.offsetLog.formatVersion=2}}
 # Setting {{spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true}}
 # Restarting the query
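
The two settings in the steps above could be supplied at restart, for example as {{--conf}} flags to {{spark-submit}}. The following is a minimal Python sketch that builds those flags. The helper name {{upgrade_conf_args}} is hypothetical; only the two configuration keys and their values come from this issue.

```python
# Hypothetical helper: builds spark-submit --conf flags for the V1 -> V2 upgrade.
# Only the two configuration keys and values are taken from this issue;
# the function name and structure are illustrative assumptions.
UPGRADE_CONFS = {
    "spark.sql.streaming.offsetLog.formatVersion": "2",
    "spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled": "true",
}


def upgrade_conf_args(extra_confs=None):
    """Return a flat list of spark-submit arguments enabling the upgrade."""
    confs = dict(UPGRADE_CONFS)
    # Caller-supplied settings are appended after (or override) the defaults.
    confs.update(extra_confs or {})
    args = []
    for key, value in confs.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(upgrade_conf_args()))
```

Both settings would need to be in effect for the restart that performs the upgrade, after {{.name()}} has been added to every streaming source in the query.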

The upgrade preserves all state and offset positions, enabling a seamless 
transition to the more flexible V2 format, which supports source evolution 
(adding/removing sources by name).
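
To illustrate what moving from positional to named tracking means, the sketch below converts a list of offsets indexed by source position into a map keyed by source name. It is a conceptual illustration only, not Spark's implementation or on-disk format: the function {{upgrade_offsets}}, the source names, and the offset strings are all hypothetical.

```python
def upgrade_offsets(positional_offsets, source_names):
    """Map V1-style positional offsets onto V2-style named offsets.

    positional_offsets: list where index i holds the offset of the i-th source.
    source_names: names assigned to the sources, in the same order.
    """
    if len(positional_offsets) != len(source_names):
        raise ValueError("every source needs exactly one name")
    if len(set(source_names)) != len(source_names):
        raise ValueError("source names must be unique")
    # Offsets are carried over unchanged; only the key changes from index to name.
    return dict(zip(source_names, positional_offsets))


if __name__ == "__main__":
    # Made-up offsets for two sources, in the order they appear in the query.
    v1 = ['{"topic-a":{"0":42}}', '{"topic-b":{"0":7}}']
    v2 = upgrade_offsets(v1, ["orders", "payments"])
    print(v2)
```

Once offsets are keyed by name, removing or adding one source no longer shifts the positions of the others, which is what makes source evolution possible.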



