[
https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Teng Huo updated HUDI-4880:
---------------------------
Description:
h2. Env
Hudi version : 0.11.1 (but I believe this issue still exists in the current
version)
Flink version : 1.13
Pipeline type: MOR, online compaction
h2. TLDR
The marker mechanism for cleaning up corrupted parquet files is no longer
effective in Flink MOR online compaction, due to this PR:
[https://github.com/apache/hudi/pull/5611]
h2. Issue description
Recently, we hit an issue where corrupted parquet files were left in a Hudi
table, which made the table unreadable and caused the compaction task to fail
repeatedly.
For example, a Spark application complained that this parquet file was too small:
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 66 in
stage 12.0 failed 4 times, most recent failure: Lost task 66.3 in stage 12.0
(TID 156) (executor 6): java.lang.RuntimeException:
hdfs://.../00000012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet
is not a Parquet file (too small length: 0)
at
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:514)
{code}
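For context, a valid parquet file must at least hold the 4-byte "PAR1" magic header, a footer, a 4-byte footer length, and the trailing magic, so a zero-length file fails the size sanity check before the footer is even parsed. A minimal sketch of that idea (illustrative only, not the actual ParquetFileReader code):
{code:java}
import java.nio.charset.StandardCharsets;

public class ParquetLengthCheck {
    // "PAR1" magic appears at both the start and the end of a parquet file.
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // A file shorter than magic + 4-byte footer length + trailing magic
    // cannot be a parquet file. This mirrors the "too small length" error
    // above; the real reader's check differs in detail.
    static boolean isPlausibleParquet(long fileLength) {
        return fileLength >= MAGIC.length + 4 + MAGIC.length;
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleParquet(0));    // the 0-byte corrupted file -> false
        System.out.println(isPlausibleParquet(1024)); // a plausible file -> true
    }
}
{code}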
h2. Root cause
After troubleshooting, I believe we found the root cause of this issue.
Initially, this Flink MOR pipeline failed for some reason, leaving a bunch of
unfinished parquet files in the Hudi table. That by itself is acceptable for
Hudi, because such files can be cleaned up later via the "Marker" mechanism in
the method "finalizeWrite", which calls "reconcileAgainstMarkers". That method
finds the files which appear in the marker folder but not in the commit
metadata, marks them as corrupted files, and deletes them.
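The reconciliation step can be thought of as a set difference between the files recorded by markers and the files recorded in the commit metadata. A hedged sketch of that idea (the method name and signature below are illustrative, not the actual Hudi API):
{code:java}
import java.util.HashSet;
import java.util.Set;

public class ReconcileSketch {
    // Illustrative only: files named by CREATE markers that never made it
    // into the commit metadata are treated as corrupted and deleted.
    static Set<String> invalidFiles(Set<String> markerPaths, Set<String> committedPaths) {
        Set<String> invalid = new HashSet<>(markerPaths);
        invalid.removeAll(committedPaths);
        return invalid;
    }

    public static void main(String[] args) {
        Set<String> markers = Set.of("a.parquet", "b.parquet");
        Set<String> committed = Set.of("a.parquet");
        // b.parquet was started but never committed, so it should be deleted.
        System.out.println(invalidFiles(markers, committed)); // [b.parquet]
    }
}
{code}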
However, I found that this part of the code did not work as expected: the
corrupted parquet file
"00000012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet" was
not deleted in "20220919020324533.commit".
Then we found [a piece of
code|https://github.com/apache/hudi/blob/13eb892081fc4ddd5e1592ef8698831972012666/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionPlanOperator.java#L134]
that deletes the marker folder at the beginning of every batch of compaction.
This defeats the corrupted-file cleanup mechanism, because all marker files
created before the current batch have already been deleted by the time the
reconciliation runs.
The HDFS audit logs also show that the marker folder
"hdfs://.../.hoodie/.temp/20220919020324533" was deleted multiple times within
a single Flink application, which confirms the current behavior of
"CompactionPlanOperator": it deletes the marker folder on every batch.
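The failure sequence can be simulated as follows: once the marker folder is wiped before "finalizeWrite" runs, the set difference described above is empty, so nothing is ever identified as corrupted. Again a hedged sketch of the behavior, not Hudi code:
{code:java}
import java.util.HashSet;
import java.util.Set;

public class CompactionBatchSketch {
    // Illustrative simulation: the files reconciliation can still clean up,
    // depending on whether the marker folder was wiped first.
    static Set<String> cleanableAfterBatch(Set<String> markers,
                                           Set<String> committed,
                                           boolean wipeMarkersFirst) {
        Set<String> visible = new HashSet<>(markers);
        if (wipeMarkersFirst) {
            // CompactionPlanOperator deleting .hoodie/.temp/<instant>
            // at the start of the next batch.
            visible.clear();
        }
        visible.removeAll(committed);
        return visible;
    }

    public static void main(String[] args) {
        Set<String> markers = Set.of("uncommitted.parquet"); // illustrative name
        Set<String> committed = Set.of();
        // Expected behavior: the uncommitted file is found and deleted.
        System.out.println(cleanableAfterBatch(markers, committed, false)); // [uncommitted.parquet]
        // Observed behavior: the wipe hides it, so nothing is cleaned.
        System.out.println(cleanableAfterBatch(markers, committed, true));  // []
    }
}
{code}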
was:Adding content
> Corrupted parquet file found in Hudi Flink MOR pipeline
> -------------------------------------------------------
>
> Key: HUDI-4880
> URL: https://issues.apache.org/jira/browse/HUDI-4880
> Project: Apache Hudi
> Issue Type: Bug
> Components: compaction, flink
> Reporter: Teng Huo
> Assignee: Teng Huo
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)