Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1977#issuecomment-91347431
  
    I spent a bit of time fuzz-testing this code to try to reach 100% coverage 
of the changes in this patch.  While doing so, I think I uncovered a bug:
    
    ```
    ../Spark/python/pyspark/shuffle.py:383: in _external_items
        for v in self._merged_items(i):
    ../Spark/python/pyspark/shuffle.py:826: in <genexpr>
        return ((k, vs) for k, vs in GroupByKey(sorted_items))
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    
    self = <pyspark.shuffle.GroupByKey object at 0x1048d0990>
    
        def next(self):
    >       key, value = self.next_item if self.next_item else next(self.iterator)
    E       TypeError: list object is not an iterator
    
    ../Spark/python/pyspark/shuffle.py:669: TypeError
    ```
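
    The `TypeError` itself is just what the builtin `next()` raises when it is
handed a plain list rather than an iterator; a quick standalone check (not part
of the patch):

    ```
    items = [(1, 'a'), (2, 'b')]
    # next(items)              # TypeError: list object is not an iterator
    print(next(iter(items)))   # prints (1, 'a') once wrapped in iter()
    ```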
    
    It looks like the `GroupByKey` object expects to be instantiated with an
iterator, but in `GroupBy._merge_sorted_items` we end up calling it with the
output of `ExternalSorter.sorted`.  That method has a branch (line 517) where
it returns `current_chunk`, which is a list rather than an iterator.
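
    A sketch of one possible fix (an illustrative, hypothetical stand-in, not
the real `pyspark.shuffle` classes or necessarily what this patch should do):
normalize whatever the sorter hands back with `iter()`, either at the call site
in `_merge_sorted_items` or inside `GroupByKey` itself:

    ```
    # Hypothetical, simplified stand-in for GroupByKey -- just enough to show
    # where an iter() normalization could go.
    class GroupByKeySketch(object):
        def __init__(self, iterator):
            # iter() accepts both lists and iterators, so the current_chunk
            # (list) branch of ExternalSorter.sorted would no longer break here.
            self.iterator = iter(iterator)
            self.next_item = None

        def next(self):
            # The line from the traceback above; with a plain list in
            # self.iterator this is where the TypeError is raised.
            key, value = self.next_item if self.next_item else next(self.iterator)
            return key, [value]

    sorted_items = [(1, 'a'), (2, 'b')]           # a list, like current_chunk
    print(GroupByKeySketch(sorted_items).next())  # (1, ['a']) instead of TypeError
    ```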

