Github user davidnavas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15084#discussion_r79159837
  
    --- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
 ---
    @@ -954,12 +954,12 @@ private class SortMergeFullOuterJoinScanner(
         }
     
         if (leftMatches.size <= leftMatched.capacity) {
    -      leftMatched.clear()
    +      leftMatched.clearUntil(leftMatches.size)
    --- End diff --
    
    I was worried about off-by-one errors myself, which is why I added the 
tests, so I understand the concern.
    The simplest argument is that only leftMatches.size bits are ever used (set 
or checked), otherwise the re-allocation side of the if would have needed to be 
larger.  The default bitset is allocated with "1" as its capacity, so at pretty 
much anytime the join will have to either clear that one bit, or allocate more. 
 leftMatches.size is the capacity used to re-allocate, so that's all that need 
to be cleared  Same for rightMatches.
    You can verify that assumption by seeing how this is used in 
scanNextInBuffered() -- leftMatched.get(leftIndex) where leftIndex (on that 
line) is always less than leftMatches.size.  And you can see that leftMatches 
is only updated in that same findMatchingRows() method that clears the bitset.  
Also, leftMatches is private, so no funny business in subclasses.  As far as 
guaranteeing correctness, you can arrange for the code to misbehave if either 
the leftKeyGenerator or the leftIter objects passed into the constructor calls 
advanceNext(), but such code would currently eventually throw an out-of-bounds 
exception in the cases where the bitset is not yet large enough.
    
    Let me know if that argument isn't coherent, it's still a wee early over 
here :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to