[GitHub] [hudi] beyond1920 opened a new issue, #9615: [SUPPORT] After enable speculation execution of spark compaction job, some broken parquet files might be generated

via GitHub Mon, 04 Sep 2023 19:06:23 -0700


beyond1920 opened a new issue, #9615:
URL: https://github.com/apache/hudi/issues/9615

Dear community,
After enabling speculative execution for the Spark compaction job, some broken
parquet files might be generated. They cause subsequent jobs (reader jobs,
ingestion jobs, or compaction jobs) to fail.

When it happens, we observe the following:

- The compaction job succeeds, but after it finishes there are multiple base
files under the same file slice.
- HDFS returns the list of files sorted in alphabetical order by default. The
last base file in that order is broken: its size is 4 bytes, and it contains
only the header.

For example:
* 00000003-0_1690878922298+11800433-63-0-4636_20230902075342551.parquet
* 00000003-0_1690878922298+11800433-63-0-63_20230902075342551.parquet
These two files belong to the same file slice, but the first one is a normal
parquet file and the second one is a broken parquet file that contains only
the header. Because HDFS returns the list of files sorted in alphabetical
order by default, the fileSystemView currently chooses the second one as the
base file of this file slice. This causes subsequent jobs to fail; for
example, a reader job throws the following exception:
<img width="2087" alt="image"
src="https://github.com/apache/hudi/assets/1525333/89b07bc3-b628-4fab-b6ee-7fb3c289ce25">
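The ordering effect can be reproduced with a minimal Python sketch. It only
sorts the two file names from the example as plain strings; it is an
illustration, not Hudi's fileSystemView code, and `pick_last_alphabetically`
is a hypothetical helper introduced here.

```python
# File names copied from the example in this issue.
NORMAL = "00000003-0_1690878922298+11800433-63-0-4636_20230902075342551.parquet"
BROKEN = "00000003-0_1690878922298+11800433-63-0-63_20230902075342551.parquet"


def pick_last_alphabetically(names):
    """Hypothetical helper: return the name that sorts last as a plain string."""
    return sorted(names)[-1]


chosen = pick_last_alphabetically([NORMAL, BROKEN])
# The names first differ at "4636" vs "63", and "4" sorts before "6",
# so the header-only file ends up last in the listing.
print(chosen == BROKEN)
```

Any logic that takes the last entry of an alphabetical listing as the base
file would therefore select the header-only file in this case.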

The root cause is that, in the Spark job, the driver **kills** the slow
attempts **asynchronously**. So even after the Spark job has finished, some
attempts might still be running because they have not been killed yet. Such an
attempt might create a parquet writer and write the header, but be killed
before it flushes any data to storage, leaving the broken parquet file in the
file system.
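As a rough illustration of what "only contains header" means, the sketch
below checks whether a byte string has the outer shape of a complete Parquet
file. It relies on the Parquet format's `PAR1` magic at both ends of the file
plus a 4-byte footer length; `looks_complete` and `MIN_PARQUET_SIZE` are
hypothetical names, the check works on in-memory bytes rather than HDFS, and
it is not something Hudi is known to do.

```python
PARQUET_MAGIC = b"PAR1"
# Leading magic (4) + footer length field (4) + trailing magic (4).
MIN_PARQUET_SIZE = 12


def looks_complete(data: bytes) -> bool:
    """Cheap structural check; it does not parse or validate the footer."""
    return (
        len(data) >= MIN_PARQUET_SIZE
        and data.startswith(PARQUET_MAGIC)
        and data.endswith(PARQUET_MAGIC)
    )


# A 4-byte file holding only the leading magic, like the broken file above.
header_only = PARQUET_MAGIC
print(looks_complete(header_only))
```

The length check matters: a 4-byte file consisting of just `PAR1` both starts
and ends with the magic, so comparing the two ends alone would not reject it.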

I found this problem several months ago and solved it in our internal Hudi
version by sorting by size instead of by alphabetical order when composing the
file slice.
I'm wondering if there is a better solution. If not, I would like to create a
pull request to fix this problem.
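The size-based selection can be sketched as follows. This is only an
illustration of the idea, not the internal patch: `BaseFile` and
`pick_base_file` are hypothetical names, and the size of the normal file is a
made-up value (only the 4-byte size of the broken file comes from this
report).

```python
from typing import Iterable, NamedTuple


class BaseFile(NamedTuple):
    name: str
    size: int  # length in bytes, as reported by the file system listing


def pick_base_file(candidates: Iterable[BaseFile]) -> BaseFile:
    """Prefer the largest file; break ties by name so the choice is stable."""
    return max(candidates, key=lambda f: (f.size, f.name))


candidates = [
    # Assumed size for the normal file; the real value is not in the report.
    BaseFile("00000003-0_1690878922298+11800433-63-0-4636_20230902075342551.parquet", 1_048_576),
    # Header-only file left behind by the killed attempt.
    BaseFile("00000003-0_1690878922298+11800433-63-0-63_20230902075342551.parquet", 4),
]
print(pick_base_file(candidates).name)
```

Choosing by size assumes the complete file is always larger than a leftover
from a killed attempt, which holds for a header-only file but may not cover
every kind of partial write.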

