[
https://issues.apache.org/jira/browse/SPARK-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-1960.
------------------------------
Resolution: Not a Problem
An "empty" {{SequenceFile}} will still contain some header info. For example
when I write an empty one (configured to contain {{LongWritable}}) I get
roughly:
{code}
SEQ^F!org.apache.hadoop.io.LongWritable!org.apache.hadoop.io.LongWritable^A^@*org.apache.hadoop.io.compress.DefaultCodec^@^@^@^@ï<9c>p<84>º74K=æÅ3!<92>^A^F
{code}
So a zero-byte file is not a valid {{SequenceFile}}: even an empty one carries
this header. I don't think this is a bug; an error is the correct behavior.
Reopen if I've misunderstood.
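To illustrate the point, here is a minimal sketch of a pre-check that looks only at the first bytes of a local file. {{SeqHeaderCheck}} and {{looksLikeSequenceFile}} are hypothetical names, not part of Spark or Hadoop. The only fact it relies on is the one visible in the dump above: the file starts with the bytes {{SEQ}} followed by a version byte. It does not validate the rest of the header.

```scala
import java.io.DataInputStream
import java.nio.file.{Files, Path}

// Hypothetical helper, not a Spark or Hadoop API.
object SeqHeaderCheck {
  // A SequenceFile begins with the ASCII bytes "SEQ", then one version byte.
  private val Magic: Array[Byte] = "SEQ".getBytes("US-ASCII")

  /** Returns true if the file holds at least magic + version byte and starts with "SEQ". */
  def looksLikeSequenceFile(path: Path): Boolean = {
    if (Files.size(path) < Magic.length + 1) {
      // Zero-byte (or otherwise truncated) files cannot contain the header.
      false
    } else {
      val in = new DataInputStream(Files.newInputStream(path))
      try {
        val head = new Array[Byte](Magic.length)
        in.readFully(head) // safe: the size check above guarantees enough bytes
        java.util.Arrays.equals(head, Magic)
      } finally {
        in.close()
      }
    }
  }
}
```

Files for which this returns false, zero-byte ones included, could be left out of the path passed to {{sc.sequenceFile}}. This sketch only handles local paths; for HDFS the same idea would have to go through the Hadoop {{FileSystem}} API.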
> EOFException when file size 0 exists when use sc.sequenceFile[K,V]("path")
> --------------------------------------------------------------------------
>
> Key: SPARK-1960
> URL: https://issues.apache.org/jira/browse/SPARK-1960
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Reporter: Eunsu Yun
>
> sc.sequenceFile[K,V] throws java.io.EOFException if the input includes a file
> whose size is 0.
> I also tested sc.textFile() under the same condition, and it does not throw
> EOFException.
> val text = sc.sequenceFile[Long, String]("data-gz/*.dat.gz")
> val result = text.filter(filterValid)
> result.saveAsTextFile("data-out/")
> ------------------
> java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readFully(DataInputStream.java:169)
> at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1845)
> at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1759)
> at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773)
> at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:156)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> ..............