HeartSaVioR opened a new pull request #26162: [SPARK-29438][SS] Use partition 
ID of source for state store in stream-stream join
URL: https://github.com/apache/spark/pull/26162
 
 
   ### What changes were proposed in this pull request?
   
   Credit to @uncleGen for discovering the problem and providing a simple 
reproducer as a UT. The new UT in this patch is borrowed from #26156, and I'm 
retaining a commit from #26156 (minus the parts unnecessary for this patch) to 
give proper credit.
   
   This patch fixes an issue where the partition ID could be mis-assigned when 
the query contains a UNION and a stream-stream join is placed on its right 
side. We assume partition IDs range over `(0 ~ number of shuffle partitions - 
1)`, but when a stream-stream join is on the right side of a UNION, the 
partition IDs of its tasks range over `(number of partitions in left side ~ 
number of partitions in left side + number of shuffle partitions - 1)`.
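
   The shift in the range can be checked with a few lines of arithmetic. The 
following is a minimal sketch in plain Python, not Spark code: the function 
names and the example partition counts are hypothetical, and it only assumes 
that a union numbers the left side's partitions first and the right side's 
after them, as described above.

```python
def expected_range(num_shuffle_partitions):
    """Partition IDs a stateful operator assumes: 0 ~ N - 1."""
    return range(0, num_shuffle_partitions)


def task_range_on_right_of_union(left_partitions, num_shuffle_partitions):
    """Task partition IDs seen by a join on the right side of a UNION.

    Assumption: the union assigns IDs to the left side's partitions first,
    so the right side starts at `left_partitions`.
    """
    start = left_partitions
    return range(start, start + num_shuffle_partitions)


# Hypothetical example: 2 partitions on the left side, 3 shuffle partitions.
expected = list(expected_range(3))
observed = list(task_range_on_right_of_union(2, 3))
```

   With these example numbers, only one observed ID still falls inside the 
assumed range; the others are shifted past it by the size of the left side.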
   
   The root cause of the bug is that stream-stream join picks the partition ID 
from TaskContext, which is not the same as the partition ID of the source when 
a union is used. Fortunately, the right partition ID is available from the 
source in StateStoreAwareZipPartitionsRDD, and this patch leverages that 
partition ID.
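
   To illustrate why the choice of ID matters, here is a toy model in plain 
Python. It is a sketch under stated assumptions, not the actual patch: 
`ToyTask`, its fields, and `state_key_is_valid` are hypothetical stand-ins, 
and the model only captures the idea that state is addressed by a partition ID 
expected to lie in `0 ~ number of shuffle partitions - 1`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTask:
    # Stand-in for the ID a task would read from its task context.
    task_partition_id: int
    # Stand-in for the ID of the join's own input (source) partition.
    source_partition_id: int


def state_key_is_valid(partition_id, num_shuffle_partitions):
    """True if the ID lies in the range the state store layout assumes."""
    return 0 <= partition_id < num_shuffle_partitions


def join_tasks_under_union(left_partitions, num_shuffle_partitions):
    """Tasks of a join on the right side of a union (toy model).

    Assumption: task IDs are offset by the number of left-side partitions,
    while source partition IDs still start at 0.
    """
    return [
        ToyTask(task_partition_id=left_partitions + i, source_partition_id=i)
        for i in range(num_shuffle_partitions)
    ]


# Hypothetical example: 2 partitions on the left side, 3 shuffle partitions.
tasks = join_tasks_under_union(left_partitions=2, num_shuffle_partitions=3)
using_task_id = [state_key_is_valid(t.task_partition_id, 3) for t in tasks]
using_source_id = [state_key_is_valid(t.source_partition_id, 3) for t in tasks]
```

   In this model, keys derived from the task ID mostly fall outside the 
assumed range, while keys derived from the source partition ID always stay 
inside it, regardless of how many partitions precede the join in the union.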
   
   ### Why are the changes needed?
   
   This patch restores the assumption about the partition ID range that 
stateful operators rely on, and fixes the issue reported in JIRA issue 
SPARK-29438.
   
   ### Does this PR introduce any user-facing change?
   
   Yes, if their query uses UNION and a stream-stream join is placed on the 
right side. They may encounter a problem reading state from the checkpoint and 
may need to discard the checkpoint to continue.
   
   ### How was this patch tested?
   
   Added a UT which fails on the current master branch and passes with this 
patch.
