wuwenchi commented on pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#issuecomment-1046757683


   > @wuwenchi, was there a problem that this caused? Can you update the 
description with what this is fixing besides trying to be slightly more 
permissive?
   
   @rdblue 
   During a rewrite, this causes a Parquet file to be rewritten in full, 
producing a new file that is identical to the source file. I think this 
is unnecessary.
   
   Take the testRewriteDataFilesForLargeFile test case from the previous 
Spark 2.4 tests as an example. The table contains 3 files:
   big_a.parquet : 50k
   small_b.parquet : 2k
   small_c.parquet : 2k
   
   With targetSize = 40k for the rewrite, two tasks are generated:
   task1 : big_a.parquet
   task2 : small_b.parquet + small_c.parquet
   
   We expect task1 to be filtered out because it contains only one file. 
However, a single-file task may also be one split of a larger file, so a 
second filter condition checks whether the task is a file split. 
   For Parquet files this check gives the wrong answer, so task1 is 
retained, and big_a.parquet ends up being copied into a new file.
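
   To make the two filter conditions concrete, here is a minimal sketch in Python. It is not the Iceberg code: `ScanPiece`, `covers_whole_file` and `tasks_worth_rewriting` are hypothetical names, and the byte-range comparison is only an assumed way of telling a whole-file scan from a file split. The faulty Parquet check is simulated by a predicate that never recognizes a whole-file scan.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ScanPiece:
    """Hypothetical stand-in for one file, or one slice of a file, in a task."""
    path: str
    file_size: int  # total size of the underlying file, in bytes
    start: int      # first byte covered by this piece
    length: int     # number of bytes covered by this piece


def covers_whole_file(piece: ScanPiece) -> bool:
    # Assumed check: the piece spans the file from the first byte to the last.
    return piece.start == 0 and piece.length == piece.file_size


def tasks_worth_rewriting(
    tasks: List[List[ScanPiece]],
    is_whole_file: Callable[[ScanPiece], bool] = covers_whole_file,
) -> List[List[ScanPiece]]:
    kept = []
    for task in tasks:
        # Condition 1: a task that combines several files is always kept.
        # Condition 2: a single-piece task is kept only if it is a file split.
        if len(task) > 1 or not is_whole_file(task[0]):
            kept.append(task)
    return kept


K = 1024
task1 = [ScanPiece("big_a.parquet", 50 * K, 0, 50 * K)]
task2 = [
    ScanPiece("small_b.parquet", 2 * K, 0, 2 * K),
    ScanPiece("small_c.parquet", 2 * K, 0, 2 * K),
]

# With a correct check, task1 is dropped and only task2 is rewritten.
kept = tasks_worth_rewriting([task1, task2])

# Simulated faulty check: the whole-file scan is treated as a split,
# so task1 is retained and big_a.parquet would be copied unchanged.
kept_buggy = tasks_worth_rewriting([task1, task2], is_whole_file=lambda p: False)
```

   In this sketch `kept` contains only task2, while `kept_buggy` still contains task1, which is the redundant copy described above.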
   
   So I added a check for Parquet files so that task1 can be filtered out.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


