Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2712#issuecomment-58761280
  
    Actually, I don't think this is a bug. The behavior you're seeing could be
an instance of [SPARK-1018](https://issues.apache.org/jira/browse/SPARK-1018),
where calling `take()` or `collect()` on a non-transformed HadoopRDD returns
the same element several times because the same `Writable` object is re-used
for every record.
    
    There's a note about this in the `sequenceFile()` Java/Scaladoc (added by
https://github.com/apache/spark/commit/7101017803a70f3267381498594c0e8c604f932c):
    
    ```scala
      /** Get an RDD for a Hadoop SequenceFile with given key and value types.
        *
        * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
        * record, directly caching the returned RDD will create many references to the same object.
        * If you plan to directly cache Hadoop writable objects, you should
        * first copy them using a `map` function.
        */
    ```
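
    To make that workaround concrete, here's a minimal sketch (the path and
the key/value types are hypothetical) of copying values out of the re-used
Writables with a `map` before caching, as the doc comment suggests:

    ```scala
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("writable-copy-example").setMaster("local[*]"))

    // Every element of the raw RDD references the Writables that the
    // RecordReader mutates in place, so calling cache() on it would store
    // many aliases of the same objects.
    val raw = sc.sequenceFile("hdfs:///path/to/data",
      classOf[IntWritable], classOf[Text])

    // Copy each record into immutable Scala values before caching:
    val safe = raw.map { case (k, v) => (k.get(), v.toString) }.cache()

    safe.take(5).foreach(println)
    ```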

