HeartSaVioR opened a new pull request, #37041:
URL: https://github.com/apache/spark/pull/37041

   ### What changes were proposed in this pull request?
   
   This PR proposes to fix the incorrect value schema in streaming 
deduplication. The operator stores an empty row with a single null column 
(using NullType), but the value schema is declared as all input columns, which 
leads to incorrect behavior in the state store schema compatibility checker.
   
   This PR proposes to set the value schema to 
`StructType(Array(StructField("__dummy__", NullType)))` so that it matches the 
empty row. With this change, streaming queries that create their checkpoint 
after this fix pass the schema compatibility check even when the set of input 
columns changes later.
   
   To avoid breaking existing streaming queries whose checkpoints already 
contain the incorrect value schema, this PR also proposes to disable the value 
schema check for streaming deduplication. An option to disable the value check 
already exists for format validation (we have two different checkers for the 
state store), but it has been missing for the state store schema compatibility 
check. To avoid adding another config, this PR reuses the existing config that 
format validation uses.
   
   ### Why are the changes needed?
   
   This is a bug fix. Suppose the streaming query below:
   
   ```
   // df has the columns `a`, `b`, `c`
   val df = spark.readStream.format("...").load()
   // Deduplicate on column `a` only
   val query = df.dropDuplicates("a").writeStream.format("...").start()
   ```
   
   While the query is running, df can produce a different set of columns (e.g. 
`a`, `b`, `c`, `d`) from the same source due to schema evolution. Since we only 
deduplicate rows on column `a`, the schema change should not matter for 
streaming deduplication, but before this fix the state store schema checker 
throws an error saying "value schema is not compatible".
   
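   As a rough illustration of why the old value schema trips the checker, here 
is a toy model in plain Scala. `Field`, `Schema`, `compatible`, and the other 
names below are hypothetical stand-ins, not Spark's `StructType` or its actual 
state store schema checker, and the "schemas must be identical" rule is a 
simplifying assumption:
   
   ```scala
   object SchemaCheckSketch {
     // Hypothetical, simplified types (not Spark's StructField/StructType).
     final case class Field(name: String, dataType: String)
     type Schema = Seq[Field]

     // Simplifying assumption: schemas are compatible only if they are identical.
     def compatible(stored: Schema, current: Schema): Boolean = stored == current

     // Key schema: only the deduplication columns.
     def keySchema(input: Schema, dedupCols: Set[String]): Schema =
       input.filter(f => dedupCols.contains(f.name))

     // Value schema as declared before the fix: every input column.
     def valueSchemaBeforeFix(input: Schema): Schema = input

     // Value schema after the fix: a single dummy null-typed column,
     // independent of the input columns.
     def valueSchemaAfterFix(input: Schema): Schema =
       Seq(Field("__dummy__", "null"))

     def main(args: Array[String]): Unit = {
       val abc = Seq(Field("a", "string"), Field("b", "int"), Field("c", "int"))
       val abcd = abc :+ Field("d", "int") // the source later adds column `d`

       // The key schema is unaffected by the new column.
       println(compatible(keySchema(abc, Set("a")), keySchema(abcd, Set("a")))) // true
       // The old value schema changes with the input, so the check fails.
       println(compatible(valueSchemaBeforeFix(abc), valueSchemaBeforeFix(abcd))) // false
       // The dummy value schema never changes, so the check passes.
       println(compatible(valueSchemaAfterFix(abc), valueSchemaAfterFix(abcd))) // true
     }
   }
   ```
   
   In this model only the pre-fix value schema depends on columns that are 
not part of the deduplication key, which is why adding column `d` makes the 
stored and current value schemas disagree.
   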
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is a bug fix that end users would not notice unless they had 
encountered the bug.
   
   ### How was this patch tested?
   
   New tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
