wuwenchi commented on pull request #4162: URL: https://github.com/apache/iceberg/pull/4162#issuecomment-1046757683
> @wuwenchi, was there a problem that this caused? Can you update the description with what this is fixing besides trying to be slightly more permissive?

@rdblue During a rewrite, this causes a Parquet file to be rewritten in full, producing a new file that is exactly the same as the source file. I think this is unnecessary.

Take the testRewriteDataFilesForLargeFile use case from the previous Spark 2.4 tests as an example. There are 3 files in the table:

big_a.parquet : 50k
small_b.parquet : 2k
small_c.parquet : 2k

With targetSize = 40k for the rewrite, two tasks are generated:

task1 : big_a.parquet
task2 : small_b.parquet + small_c.parquet

We expect task1 to be filtered out because it contains only one file. However, a single file may itself have been split, so we also have to check whether the task is a file split. The second filter condition determines whether the file needs to be split. Because that check gives the wrong result for Parquet files, task1 is retained, and big_a.parquet ends up being copied again into a new file.

So I added a check for Parquet files, so that task1 can be filtered out.
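The intended filtering can be illustrated with a small, language-neutral sketch. This is not Iceberg's code: `FileSlice`, `is_partial`, and `tasks_worth_rewriting` are hypothetical names, sizes are taken as 1k = 1024 bytes, and the partial-file check is a simplified assumption about what "is a file split" means. It only shows which tasks the two filter conditions are expected to keep in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSlice:
    """Hypothetical stand-in for one file entry in a rewrite task."""
    path: str
    file_size: int  # total size of the underlying file, in bytes
    start: int      # offset where this slice begins
    length: int     # number of bytes covered by this slice

    def is_partial(self) -> bool:
        # Assumption: a slice is a "split" if it does not cover the whole file.
        return self.start > 0 or self.length < self.file_size


def tasks_worth_rewriting(tasks):
    """Keep tasks that combine several files or cover only part of a file."""
    kept = []
    for task in tasks:
        if not task:
            continue  # nothing to rewrite
        if len(task) > 1 or task[0].is_partial():
            kept.append(task)
    return kept


K = 1024
task1 = [FileSlice("big_a.parquet", 50 * K, 0, 50 * K)]
task2 = [
    FileSlice("small_b.parquet", 2 * K, 0, 2 * K),
    FileSlice("small_c.parquet", 2 * K, 0, 2 * K),
]

kept = tasks_worth_rewriting([task1, task2])
kept_paths = [[s.path for s in task] for task in kept]
```

Under these assumptions only task2 survives: task1 holds a single whole file, so rewriting it would just produce a copy. If the partial-file check wrongly reports a whole Parquet file as a split, task1 is kept instead, which is the behavior described above.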
