[ https://issues.apache.org/jira/browse/PARQUET-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684828#comment-17684828 ]

ASF GitHub Bot commented on PARQUET-831:
----------------------------------------

jianchun commented on code in PR #1022:
URL: https://github.com/apache/parquet-mr/pull/1022#discussion_r1097699497


##########
parquet-column/src/test/java/org/apache/parquet/column/impl/TestColumnWriterV1.java:
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column.impl;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.ParquetProperties;
+import org.apache.parquet.column.page.PageWriter;
+import org.apache.parquet.schema.PrimitiveType;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.Test;
+
+import static org.mockito.Mockito.any;
+import static org.mockito.Mockito.anyInt;
+import static org.mockito.Mockito.atLeastOnce;
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.verify;
+
+public class TestColumnWriterV1 {
+
+  @Test
+  public void testEstimateNextSizeCheckOnZeroUsedMem() throws Exception {

Review Comment:
   The new test case is not applicable on the master branch, as the code 
has since been significantly refactored. After the refactor, ColumnWriterV1 
is no longer responsible for slicing and flushing pages itself; other 
components call its "writePage" method instead.
   
   In 0.10 and prior, ColumnWriterV1 is responsible for slicing pages, 
which is where this bug resides and what this test targets.
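   For context, a minimal sketch of the failure mode the test targets 
(hypothetical names and values, not the actual ColumnWriterV1 code): the 
writer estimates how many more values fit before the next page-size check 
by dividing by the memory used so far, so zero used memory defers the 
check almost indefinitely:

    // Hypothetical illustration only; the variable names and the 1 MiB
    // target are assumptions, not the real parquet-mr fields.
    public class NextSizeCheckDemo {
      public static void main(String[] args) {
        long targetPageSize = 1024 * 1024; // assumed page size target
        int valueCount = 100;              // values written so far
        long usedMem = 0;                  // zero bytes buffered so far
        // Float division by zero yields Infinity, and (int) Infinity is
        // Integer.MAX_VALUE, so the next size check is deferred ~2^31
        // values and the page can grow far past the target before it is
        // ever flushed.
        int nextCheck = (int) ((float) valueCount * targetPageSize / usedMem);
        System.out.println(nextCheck);     // prints 2147483647
      }
    }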





> Corrupt Parquet Files
> ---------------------
>
>                 Key: PARQUET-831
>                 URL: https://issues.apache.org/jira/browse/PARQUET-831
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.7.0
>         Environment: HDP-2.5.3.0 Spark-2.0.2
>            Reporter: Steve Severance
>            Priority: Major
>
> I am getting corrupt parquet files as the result of a Spark job. The write 
> job completes with no errors, but when I read the data back I get the 
> following error:
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://MYPATH/part-r-00004-b5c93a19-2f75-4c04-b798-de9cb463f02f.gz.parquet
> at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NegativeArraySizeException
> at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:755)
> at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>  
> The job that generates this data partitions and sorts the data in a 
> particular way to achieve better compression. If I don't partition and sort, 
> I have not been able to reproduce this behavior. It also only occurs on 
> roughly 25% of the data. Most of the time, simply rerunning the write job 
> would make the read error go away, but I have now run across cases where it 
> did not. I am happy to give what data I can, or to work with someone to run 
> this down.
> I know this is a sub-optimal report, but I have not been able to randomly 
> generate data that reproduces this issue. The data that trips this bug is 
> typically in files of 5GB+ after compression.
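
The NegativeArraySizeException in the trace above is consistent with the 
page-slicing bug discussed in the review comment: if a write-side page 
grows past 2GB, the reader's chunk-length arithmetic overflows a signed 
int. A minimal sketch of that overflow (assumed values, not the actual 
ParquetFileReader code):

    public class NegativeArraySizeDemo {
      public static void main(String[] args) {
        int length = Integer.MAX_VALUE;   // combined chunk bytes at the int ceiling
        length += 1024;                   // signed overflow: length is now negative
        byte[] chunks = new byte[length]; // throws java.lang.NegativeArraySizeException
      }
    }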


