[
https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13980747#comment-13980747
]
Arun Ramakrishnan commented on SPARK-993:
-----------------------------------------
How does one reproduce this issue?
I tried a few things in the Spark shell locally:
{noformat}
import java.io.File
import com.google.common.io.Files
import org.apache.hadoop.io._
val tempDir = Files.createTempDir()
val outputDir = new File(tempDir, "output").getAbsolutePath
val num = 100
val nums = sc.makeRDD(1 to num).map(x => ("a" * x, x))
nums.saveAsSequenceFile(outputDir)
val output = sc.sequenceFile[String,Int](outputDir)
assert(output.collect().toSet.size == num)
val t = sc.sequenceFile(outputDir, classOf[Text], classOf[IntWritable])
assert(t.map { case (k, v) => (k.toString, v.get) }.collect().toSet.size == num)
{noformat}
But the asserts seem to be fine.
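One possible reason the asserts above pass is that the map copies each record (via toString and get) before collect() materializes the array, so the reused Writables are never observed directly. If I understand the report correctly, the reuse would only become visible when collect() returns the Writable instances themselves. Here is a minimal sketch of that aliasing effect in plain Scala, where Box is a hypothetical stand-in for a reused mutable Writable (no Spark or Hadoop needed):
{noformat}
// Box stands in for a single reused Writable instance.
class Box(var v: Int)

val reused = new Box(0)

// Simulates a reader that mutates and hands back the SAME object:
// every slot of the collected array aliases one Box, so all values
// read back as the last value written.
val collected = (1 to 5).map { i => reused.v = i; reused }.toArray
assert(collected.map(_.v).toSet == Set(5))

// Copying before collecting (as the map in the shell session above
// does) preserves the distinct values.
val copied = (1 to 5).map { i => reused.v = i; new Box(reused.v) }.toArray
assert(copied.map(_.v).toSet == (1 to 5).toSet)
{noformat}
So, if this reading is right, a reproduction would need to call collect() on the raw (Text, IntWritable) pairs without copying them first.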
> Don't reuse Writable objects in SequenceFile by default
> -------------------------------------------------------
>
> Key: SPARK-993
> URL: https://issues.apache.org/jira/browse/SPARK-993
> Project: Spark
> Issue Type: Improvement
> Reporter: Matei Zaharia
> Labels: Starter
>
> Right now we reuse them as an optimization, which leads to weird results when
> you call collect() on a file with distinct items. We should instead make that
> behavior optional through a flag.
--
This message was sent by Atlassian JIRA
(v6.2#6252)