[ https://issues.apache.org/jira/browse/PARQUET-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16327684#comment-16327684 ]

venkata yerubandi commented on PARQUET-1176:
--------------------------------------------

Any suggestions from the core team?

> Occasional corruption of parquet files, parquet writer might not be calling ParquetFileWriter->end()
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1176
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1176
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.6.0, 1.7.0
>            Reporter: venkata yerubandi
>            Priority: Major
>
> We have a high-volume streaming service that works most of the time, but lately 
> we have been observing that some of the parquet files written out by the write 
> flow are getting corrupted. This manifests in our read flow with the following 
> exception.
> Writer version: 1.6.0, reader version: 1.7.0
> Caused by: java.lang.RuntimeException: hdfs://Ingest/ingest/jobs/2017-11-30/00-05/part4139 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-28, -126, 1, 1]
>     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:422)
>     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>     at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
>     at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>     at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
>     at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
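> The values in the error are the clue: [80, 65, 82, 49] is ASCII for "PAR1", the 4-byte magic that ParquetFileWriter.end() appends after the footer, so these files simply end without a footer. Below is a minimal, hedged sketch for checking a suspect file's tail up front; the local-path handling is illustrative only, for HDFS you would read the last bytes through the Hadoop FileSystem API instead:
> ```java
> import java.io.IOException;
> import java.io.RandomAccessFile;
> import java.util.Arrays;
> 
> // Diagnostic sketch (assumed local/mounted path, not the original HDFS job):
> // a valid Parquet file ends with the 4-byte magic "PAR1" = [80, 65, 82, 49].
> public class CheckParquetTail {
>     public static void main(String[] args) throws IOException {
>         try (RandomAccessFile f = new RandomAccessFile(args[0], "r")) {
>             byte[] tail = new byte[4];
>             f.seek(f.length() - 4);   // read only the last four bytes
>             f.readFully(tail);
>             byte[] expected = {'P', 'A', 'R', '1'};
>             System.out.println(Arrays.equals(tail, expected)
>                     ? "tail magic OK"
>                     : "no PAR1 magic, footer missing: " + Arrays.toString(tail));
>         }
>     }
> }
> ```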
> After looking at the code, I can see that the possible causes are:
> 1] The footer not being serialized because the writer's end() is never called; however, we are not seeing any exceptions on the writer side (see the sketch at the end of this description).
> 2] Data size - does data size have an impact? There will be cases where row group sizes are huge, since this is per-user activity data.
> We are using the default parquet block size and HDFS block size. Other than upgrading to the latest version and re-testing, what options do we have to debug an issue like this?
>  
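> Regarding cause 1], here is a minimal sketch of what "end() gets called" looks like at the application level. This is not the original streaming job: the record type and the already-constructed writer are assumptions for illustration, and the package names follow the 1.7.x org.apache.parquet line (the 1.6.0 writer uses the older parquet.hadoop packages). The point is that ParquetWriter.close() is what serializes the footer and the trailing "PAR1" magic (it calls ParquetFileWriter.end() internally), so it has to run even when the task fails or is interrupted:
> ```java
> import java.io.IOException;
> 
> import org.apache.avro.generic.GenericRecord;
> import org.apache.parquet.hadoop.ParquetWriter;
> 
> // Sketch only: the writer construction and record source are assumed, not taken
> // from the reported job. If close() never runs, the file ends mid-row-group and
> // readers fail with exactly the "expected magic number at tail" error above.
> public class SafeParquetWrite {
>     static void drain(ParquetWriter<GenericRecord> writer, Iterable<GenericRecord> records) throws IOException {
>         try {
>             for (GenericRecord r : records) {
>                 writer.write(r);
>             }
>         } finally {
>             writer.close(); // writes the footer and the PAR1 tail magic
>         }
>     }
> }
> ```
> If the streaming framework owns the writer, the equivalent is making sure its shutdown or flush hook closes the ParquetWriter on every code path, including error paths.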


