[ https://issues.apache.org/jira/browse/PARQUET-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

venkata yerubandi updated PARQUET-1176:
---------------------------------------
    Description: 
We have a high-volume streaming service that works most of the time, but lately
we have been observing that some of the Parquet files written by our write flow
are getting corrupted. This manifests in our read flow as the following
exception.

Writer version: 1.6.0, Reader version: 1.7.0
```
Caused by: java.lang.RuntimeException: hdfs://Ingest/ingest/jobs/2017-11-30/00-05/part4139 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-28, -126, 1, 1]
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:422)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
    at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
    at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
    at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
    at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
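
The expected tail bytes [80, 65, 82, 49] are the ASCII string "PAR1", the 4-byte magic that ParquetFileWriter.end() writes after the footer; the bytes actually found look like page data or a partial footer, which suggests the file was truncated before end() ran. To triage which files are affected, a quick check is to read the last four bytes of each file directly. A minimal sketch using the Hadoop FileSystem API (the class name and argument handling are illustrative, not from the report):

```
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TailMagicCheck {
    // "PAR1" in ASCII == [80, 65, 82, 49], the 4 bytes readFooter() expects at the tail
    private static final byte[] MAGIC = {'P', 'A', 'R', '1'};

    static boolean hasTailMagic(FileSystem fs, Path file) throws IOException {
        FileStatus status = fs.getFileStatus(file);
        if (status.getLen() < MAGIC.length) {
            return false; // too short to hold even the magic
        }
        byte[] tail = new byte[MAGIC.length];
        FSDataInputStream in = fs.open(file);
        try {
            // positioned read of the last 4 bytes of the file
            in.readFully(status.getLen() - MAGIC.length, tail);
        } finally {
            in.close();
        }
        return Arrays.equals(tail, MAGIC);
    }

    public static void main(String[] args) throws Exception {
        Path file = new Path(args[0]);
        FileSystem fs = file.getFileSystem(new Configuration());
        System.out.println(file + " -> " + (hasTailMagic(fs, file) ? "OK" : "BAD TAIL MAGIC"));
    }
}
```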

After looking at the code, I can see that the possible causes are:
1] The footer is not being serialized because end() is never called on the writer,
but we are not seeing any exceptions on the writer (see the sketch after this list).
2] Data size: does data size have an impact? There will be cases where row group
sizes are huge, since this is per-user activity data.
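
On 1]: in parquet-mr the footer and trailing magic are written only when ParquetWriter.close() runs, which delegates to ParquetFileWriter.end(). If a writer task dies before close() completes (executor loss, OOM kill, speculative task abort), the writer logs no exception yet leaves a truncated file behind. A minimal sketch of a write path that at least guarantees close() is attempted, assuming parquet-avro's AvroParquetWriter (package names differ between 1.6 and 1.7, and the method shown is hypothetical, not our actual write flow):

```
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class SafeWriteFlow {
    // Hypothetical write path: the footer and trailing "PAR1" magic are only
    // produced by close(), which calls ParquetFileWriter.end(). If the JVM
    // dies before close() returns, the file has no footer and readers fail
    // with exactly the "expected magic number at tail" error above.
    public static void write(Path out, Schema schema,
                             Iterable<GenericRecord> records) throws IOException {
        ParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(out, schema);
        try {
            for (GenericRecord record : records) {
                writer.write(record);
            }
        } finally {
            writer.close(); // flushes buffered row groups, then writes footer + magic
        }
    }
}
```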

We are using the default Parquet block size and HDFS block size. Other than
upgrading to the latest version and re-testing, what options do we have to
debug an issue like this?
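
One option short of an upgrade is to sweep each job's output directory, try to parse every footer offline, and correlate the files that fail with the writer job's task attempts for that time window. A minimal sketch, assuming parquet-hadoop 1.7.0's ParquetFileReader.readFooter(Configuration, Path); the class name is illustrative:

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;

public class FooterSweep {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path dir = new Path(args[0]); // e.g. an hourly job output directory
        FileSystem fs = dir.getFileSystem(conf);
        for (FileStatus status : fs.listStatus(dir)) {
            if (!status.isFile()) {
                continue;
            }
            try {
                // Throws the same RuntimeException as the read flow if the
                // trailing magic or the footer itself is missing or corrupt.
                ParquetFileReader.readFooter(conf, status.getPath());
                System.out.println("OK      " + status.getPath());
            } catch (Exception e) {
                System.out.println("CORRUPT " + status.getPath() + " : " + e.getMessage());
            }
        }
    }
}
```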
 




> Occasional corruption of parquet files, parquet writer might not be calling
> ParquetFileWriter->end()
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1176
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1176
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.6.0, 1.7.0
>            Reporter: venkata yerubandi
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
