Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2712#issuecomment-58761280

Actually, I don't think that this is a bug. Instead, I think the behavior you're seeing could be an instance of [SPARK-1018](https://issues.apache.org/jira/browse/SPARK-1018), where calling `take()` or `collect()` on a non-transformed HadoopRDD returns the same element several times because the same `Writable` object is re-used.

There's actually a note about this in the `sequenceFile()` Java/Scaladoc (added by https://github.com/apache/spark/commit/7101017803a70f3267381498594c0e8c604f932c):

```scala
/** Get an RDD for a Hadoop SequenceFile with given key and value types.
  *
  * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
  * record, directly caching the returned RDD will create many references to the same object.
  * If you plan to directly cache Hadoop writable objects, you should first copy them using
  * a `map` function.
  */
```
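To illustrate the workaround the Scaladoc describes, here is a minimal sketch. It assumes a hypothetical SequenceFile of `(Text, IntWritable)` pairs at a made-up path; the point is only to show copying each record out of the re-used `Writable` instances before caching or collecting:

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object SequenceFileCopyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SequenceFileCopyExample"))

    // Hypothetical input path; assumes a SequenceFile of (Text, IntWritable) records.
    val raw = sc.sequenceFile("hdfs:///path/to/data", classOf[Text], classOf[IntWritable])

    // Risky: Hadoop's RecordReader re-uses the same Text/IntWritable instances for
    // every record, so caching or collecting `raw` directly can yield many
    // references to the same (last-read) object:
    //   val cached = raw.cache()

    // Safe: map each record into fresh immutable values first, then cache.
    val copied = raw.map { case (k, v) => (k.toString, v.get) }.cache()

    copied.collect().foreach(println)
    sc.stop()
  }
}
```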