beyond1920 opened a new issue, #9615: URL: https://github.com/apache/hudi/issues/9615
Dear community,

After enabling speculative execution for a Spark compaction job, broken parquet files might be generated. They cause subsequent jobs (reader, ingestion, or compaction jobs alike) to fail.

When this happens, we observe the following:

- The compaction job succeeds, but after it finishes there are multiple base files under the same file slice.
- HDFS returns the list of files sorted in alphabetical order by default, and the last base file is broken: its size is 4 bytes and it contains only the header.

For example:

* 00000003-0_1690878922298+11800433-63-0-4636_20230902075342551.parquet
* 00000003-0_1690878922298+11800433-63-0-63_20230902075342551.parquet

These two files belong to the same file slice, but the first is a normal parquet file and the second is a broken one that contains only the header. Because HDFS returns the file list in alphabetical order by default, the fileSystemView currently chooses the second one as the base file of this file slice. This makes subsequent jobs fail; for example, a reader job throws the following exception:

<img width="2087" alt="image" src="https://github.com/apache/hudi/assets/1525333/89b07bc3-b628-4fab-b6ee-7fb3c289ce25">

The root cause is that, in a Spark job, the driver **kills** the slow attempts **asynchronously**. So even after the Spark job has finished, some attempts might still be running because they have not been killed yet. Such an attempt might create a parquet writer and write the header, but get killed before it flushes any data to storage, which leaves a broken parquet file in the file system.

I found this problem several months ago and solved it in our internal Hudi version by sorting by size instead of by alphabetical order when composing a file slice. I'm wondering whether there is a better solution. If not, I would like to create a pull request to fix this problem.
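To make the size-based workaround concrete, here is a minimal sketch in Python (chosen for brevity; it is not Hudi's actual Java code). `BaseFile`, `pick_alphabetical_last`, and `pick_largest` are hypothetical names, the 1 MiB size of the healthy file is made up for illustration, and treating "last in alphabetical order" as the current selection rule is a simplification of what the fileSystemView does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseFile:
    """Hypothetical stand-in for a base file listed under one file slice."""
    name: str
    size: int  # size in bytes


def pick_alphabetical_last(files):
    """Simplified model of the current behaviour: the last name in sorted order wins."""
    return sorted(files, key=lambda f: f.name)[-1]


def pick_largest(files):
    """Workaround sketch: prefer the largest file, using the name only as a tie-breaker."""
    return max(files, key=lambda f: (f.size, f.name))


# The two file names from the example above; only the 4-byte size comes from
# the report, the 1 MiB size is an assumed value for the healthy file.
files = [
    BaseFile("00000003-0_1690878922298+11800433-63-0-4636_20230902075342551.parquet", 1_048_576),
    BaseFile("00000003-0_1690878922298+11800433-63-0-63_20230902075342551.parquet", 4),
]

print("alphabetical pick:", pick_alphabetical_last(files).name)
print("size-based pick:  ", pick_largest(files).name)
```

With these inputs the alphabetical rule selects the 4-byte file, while the size-based rule selects the healthy one. Note that this only sidesteps the broken file at read time; it does not remove the leftover file from storage.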
