[
https://issues.apache.org/jira/browse/PARQUET-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16327684#comment-16327684
]
venkata yerubandi commented on PARQUET-1176:
--------------------------------------------
Any suggestions from the core team?
> Occasional corruption of parquet files, parquet writer might not be calling
> ParquetFileWriter->end()
> -----------------------------------------------------------------------------------------------------
>
> Key: PARQUET-1176
> URL: https://issues.apache.org/jira/browse/PARQUET-1176
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.6.0, 1.7.0
> Reporter: venkata yerubandi
> Priority: Major
>
> We have a high-volume streaming service which works most of the time, but of
> late we have been observing that some of the Parquet files written out by the
> write flow are getting corrupted. This manifests in our read flow as the
> following exception.
> Writer version - 1.6.0, Reader version - 1.7.0
> Caused by: java.lang.RuntimeException: hdfs://Ingest/ingest/jobs/2017-11-30/00-05/part4139 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-28, -126, 1, 1]
>     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:422)
>     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>     at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
>     at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>     at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
>     at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
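> For reference, the expected tail [80, 65, 82, 49] is just "PAR1" in ASCII, so a
> suspect part file can be flagged by checking its last 4 bytes. Below is a minimal
> sketch using the Hadoop FileSystem API (the class name is illustrative, not part
> of Parquet):
> ```java
> import java.nio.charset.StandardCharsets;
> import java.util.Arrays;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Flags files whose last 4 bytes are not the Parquet tail magic "PAR1".
> // A missing tail magic means the footer was never written.
> public class TailMagicCheck {
>     private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
>
>     public static boolean hasParquetTailMagic(FileSystem fs, Path file) throws Exception {
>         long len = fs.getFileStatus(file).getLen();
>         if (len < MAGIC.length) {
>             return false; // too short to even contain the magic
>         }
>         byte[] tail = new byte[MAGIC.length];
>         try (FSDataInputStream in = fs.open(file)) {
>             in.readFully(len - MAGIC.length, tail); // positioned read of the last 4 bytes
>         }
>         return Arrays.equals(tail, MAGIC);
>     }
>
>     public static void main(String[] args) throws Exception {
>         Path file = new Path(args[0]); // e.g. one of the suspect hdfs:// part files
>         FileSystem fs = file.getFileSystem(new Configuration());
>         System.out.println(file + " -> "
>                 + (hasParquetTailMagic(fs, file) ? "tail magic OK" : "missing PAR1 tail magic"));
>     }
> }
> ```
> A file that fails this check has no readable footer, which matches the theory
> that ParquetFileWriter.end() never ran for it.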
> After looking at the code, I can see that the possible causes are:
> 1] the footer not being serialized because the writer's end() is never called
> (see the sketch after this list), but we are not seeing any exceptions on the
> writer.
> 2] data size - does data size have an impact? There will be cases where row
> group sizes are huge, since this is per-user activity data.
> We are using the default Parquet block size and HDFS block size. Other than
> upgrading to the latest version and re-testing, what options do we have to
> debug an issue like this?
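> As far as we can tell from the code, ParquetWriter.close() is what ends up
> calling ParquetFileWriter.end() and writing the footer plus the trailing "PAR1"
> magic, so the writer has to be closed on every path, including failures. A
> minimal sketch of that pattern with the Avro object model (the schema, record,
> and paths are illustrative; in 1.6.0 the classes live under the old parquet.*
> packages):
> ```java
> import org.apache.avro.Schema;
> import org.apache.avro.SchemaBuilder;
> import org.apache.avro.generic.GenericData;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.avro.AvroParquetWriter;
> import org.apache.parquet.hadoop.ParquetWriter;
>
> // try-with-resources guarantees close(), which writes the footer and the
> // trailing "PAR1" magic; an abandoned writer leaves a file with no footer.
> public class SafeWriteExample {
>     public static void main(String[] args) throws Exception {
>         Schema schema = SchemaBuilder.record("Activity").fields()
>                 .requiredString("userId")
>                 .requiredLong("ts")
>                 .endRecord();
>
>         Path out = new Path(args[0]); // e.g. an hdfs:// destination path
>
>         try (ParquetWriter<GenericRecord> writer =
>                      new AvroParquetWriter<GenericRecord>(out, schema)) {
>             GenericRecord rec = new GenericData.Record(schema);
>             rec.put("userId", "u-123");
>             rec.put("ts", System.currentTimeMillis());
>             writer.write(rec);
>         }
>     }
> }
> ```
> Even with close() in place, a process that dies mid-write still leaves a
> truncated file, so this only covers the silently-abandoned-writer case; the
> tail-magic check above at least tells us which part files to quarantine.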
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)