[ https://issues.apache.org/jira/browse/SPARK-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13980747#comment-13980747 ]

Arun Ramakrishnan commented on SPARK-993:
-----------------------------------------

How does one reproduce this issue?

I tried a few things in the Spark shell locally:

{noformat}
    import java.io.File
    import com.google.common.io.Files
    import org.apache.hadoop.io._

    val tempDir = Files.createTempDir()
    val outputDir = new File(tempDir, "output").getAbsolutePath
    val num = 100
    val nums = sc.makeRDD(1 to num).map(x => ("a" * x, x))
    nums.saveAsSequenceFile(outputDir)

    // Read back with automatic Writable-to-Scala conversion
    val output = sc.sequenceFile[String, Int](outputDir)
    assert(output.collect().toSet.size == num)

    // Read back as raw Writables, converting each record before collect()
    val t = sc.sequenceFile(outputDir, classOf[Text], classOf[IntWritable])
    assert(t.map { case (k, v) => (k.toString, v.get) }.collect().toSet.size == num)
{noformat}

But both asserts pass.
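Perhaps the reuse only shows up when the raw Writable objects themselves are collected, before any conversion to Scala types — both reads above convert each record (via the implicit converters or the explicit map) before collect(). A sketch of what I would try next (untested; assumes the reused instances survive into the array returned by collect() in local mode):

```scala
import org.apache.hadoop.io.{IntWritable, Text}

// Collect the raw Writables with no conversion step in between.
val raw = sc.sequenceFile(outputDir, classOf[Text], classOf[IntWritable]).collect()

// If Spark reuses the same Text instance for every record, many entries in
// `raw` may be references to one object, so the number of distinct object
// identities could be far smaller than `num`.
val distinctKeyObjects = raw.map { case (k, _) => System.identityHashCode(k) }.toSet.size
println(s"distinct key objects: $distinctKeyObjects (expected $num if no reuse)")
```

If that prints a number much smaller than num, it would confirm the reuse is observable from user code.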

> Don't reuse Writable objects in SequenceFile by default
> -------------------------------------------------------
>
>                 Key: SPARK-993
>                 URL: https://issues.apache.org/jira/browse/SPARK-993
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Matei Zaharia
>              Labels: Starter
>
> Right now we reuse them as an optimization, which leads to weird results when 
> you call collect() on a file with distinct items. We should instead make that 
> behavior optional through a flag.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
