beyond1920 opened a new issue, #9615:
URL: https://github.com/apache/hudi/issues/9615

   Dear community,
   After enabling speculative execution for a Spark compaction job, broken 
parquet files might be generated. They cause subsequent jobs (reader jobs, 
ingestion jobs, or compaction jobs) to fail.
   
   When it happens, we observe the following:
   
   - The compaction job succeeds, but after it finishes there are multiple 
base files under the same file slice. 
   - HDFS returns the list of files sorted in alphabetical order by default. 
The last base file is broken: its size is 4 bytes and it contains only the 
header.
   
   For example:
   * 00000003-0_1690878922298+11800433-63-0-4636_20230902075342551.parquet
   * 00000003-0_1690878922298+11800433-63-0-63_20230902075342551.parquet
   These two files belong to the same file slice, but the first one is a normal 
parquet file and the second one is a broken parquet file that contains only the 
header. 
   Because HDFS returns the list of files sorted in alphabetical order by 
default, the fileSystemView currently chooses the second one as the base file 
of this file slice. This causes subsequent jobs to fail; for example, a reader 
job throws the following exception:
   <img width="2087" alt="image" 
src="https://github.com/apache/hudi/assets/1525333/89b07bc3-b628-4fab-b6ee-7fb3c289ce25">
   
   The root cause is that, in a Spark job, the driver **kills** slow attempts 
**asynchronously**. So even after the Spark job has finished, some attempts 
might still be running because they have not been killed yet. Such an attempt 
might create a parquet writer and write the header, but get killed before it 
flushes any data to storage, leaving a broken parquet file in the file system. 
   
   I found this problem several months ago and solved it in our internal Hudi 
version by sorting base files by size instead of in alphabetical order when 
composing a file slice. 
   I'm wondering whether there is a better solution. If not, I would like to 
create a pull request to fix this problem.
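
   The size-based selection can be sketched roughly as follows. This is not the 
actual patch and does not use Hudi's real classes: `BaseFileSelection`, 
`Candidate`, `pickLastByName`, and `pickBySize` are hypothetical names, and the 
size given for the normal file is a made-up value for illustration (only the 
4-byte size of the broken file comes from the report above).

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; Hudi's real file-slice logic lives in its file system view.
class BaseFileSelection {
    // Minimal stand-in for a base file candidate within one file slice.
    public static final class Candidate {
        public final String name;
        public final long sizeBytes;

        public Candidate(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    // Behaviour described in the issue: the alphabetically last file wins.
    public static Optional<Candidate> pickLastByName(List<Candidate> candidates) {
        return candidates.stream()
                .max(Comparator.comparing((Candidate c) -> c.name));
    }

    // Workaround described in the issue: the largest file wins.
    // The name is used only as a tie-breaker to keep the choice deterministic.
    public static Optional<Candidate> pickBySize(List<Candidate> candidates) {
        return candidates.stream()
                .max(Comparator.comparingLong((Candidate c) -> c.sizeBytes)
                        .thenComparing(c -> c.name));
    }
}
```

   With the two files from the example (the `-4636_` file at an assumed size of 
some megabytes, the `-63_` file at 4 bytes), `pickLastByName` returns the 
broken `-63_` file while `pickBySize` returns the normal `-4636_` file.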
   
   
   

