bersprockets commented on PR #41712:
URL: https://github.com/apache/spark/pull/41712#issuecomment-1646662544

   @steven-aerts 
   
   I took a closer look at
   
   
https://github.com/apache/spark/blob/24960d84a9dac17728822f3e783335f221c49da3/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/JoinCodegenSupport.scala#L82
   
   This map operation is the culprit. The function passed to `map` relies on 
mutable state that is manipulated outside of the function (`ctx.currentVars` 
and `ctx.INPUT_ROW`). If the map operation runs immediately, the state of 
those fields is appropriate for the function. If the map operation runs later 
(or runs immediately for the first element and completes sometime later), 
then the state of those fields may not be correct for proper operation of the 
function, and in fact is not correct for your example cases.
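   For illustration only (this is not Spark code; all names here are made up), 
a minimal standalone sketch of the pitfall: with a `Stream`, the mapped 
function captures mutable state at the time each element is *forced*, not at 
the time `map` is called, so elements forced after the state changes see the 
new value.

```scala
object LazyMapPitfall {
  // Stands in for mutable codegen state such as ctx.INPUT_ROW / ctx.currentVars.
  var inputRow: String = "row_for_join1"

  def main(args: Array[String]): Unit = {
    val output = Stream("c1", "c2", "c3")

    // Stream.map forces only the head now; the tail is deferred.
    val lazyBound = output.map(col => s"$col -> $inputRow")

    // State mutates before the deferred elements are forced, mirroring a
    // second join updating ctx before the first map has fully run.
    inputRow = "row_for_join2"
    // The head was bound with the old state; the deferred tail observes
    // the new one, producing inconsistent bindings.
    println(lazyBound.toList)

    // Converting to an IndexedSeq first makes the map run eagerly, so
    // every element is bound against the state that is current right now.
    inputRow = "row_for_join1"
    val eagerBound = output.toIndexedSeq.map(col => s"$col -> $inputRow")
    inputRow = "row_for_join2"
    println(eagerBound.toList)
  }
}
```

   This is the same reason `plan.output.toIndexedSeq` avoids the problem: the 
eager collection pins the closure's view of the mutable fields at the point 
where the map runs.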
   
   I confirmed this by adding debug statements to the function indicating 
whether those two fields in `ctx` are as expected. In the `Stream` case, they 
are not as expected in the second join when processing the last element of 
`plan.output`. In the `IndexedSeq` case, the fields are always as expected.
   
   Personally, I would fix it here by changing the line to 
`plan.output.toIndexedSeq.zipWithIndex.map... etc.`. That would guarantee that 
all paths to this code (not just those coming from the Dataset APIs) would be 
handled properly. However, per [this 
comment](https://github.com/apache/spark/pull/35635#discussion_r836062422), a 
committer thought it better to fix this sort of issue in the public APIs. 
Either works for me.
   
   By the way, you have a conflict in JoinSuite, so you need to rebase.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
