shangxinli commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3587418543

   > | Data Files | Records per Data File | Normal Merge (s) | Parquet Merge (s) |
   > |---|---|---|---|
   > | 5 | 100*10000 | 4.7 | 0.9 |
   > | 5 | 10*10000 | 1.8 | 0.9 |
   > | 5 | 1*10000 | 0.8 | 0.9 |
   > | 40 | 10*10000 | 6.7 | 4.4 |
   > | 40 | 1*10000 | 4.1 | 3.6 |
   > | 40 | 100 | 3.4 | 3.4 |
   > | 100 | 1000 | 12.0 | 8.4 |
   > I ran some tests comparing Parquet merge with normal merge; the Parquet merge version I used is the original one, without any lineage-related changes. From these results, when there are many small files the performance advantage of Parquet merge is not particularly large, but when files contain many rows the advantage is significant. I suspect that validating the schema and reading the footer for every file introduces additional per-file overhead.
   > 
   > I suggest that once the lineage part is ready, we add corresponding tests, because adding lineage will introduce more complexity.
   > 
   > These were manual tests; results may vary and are for reference only.
   
   It would also depend on how large the records are. When records are tiny, the relative overhead of adding physical row lineage would be more significant.
   
   Also, if you run a Spark job to compare, the job itself has some overhead. In particular, when the files are very tiny, say 100 records, there is not much difference between reading the footer and reading the data itself.
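   To make the footer-overhead point concrete, here is a minimal, illustrative sketch (not Iceberg's actual code path). A Parquet file ends with the serialized footer, a 4-byte little-endian footer length, and the `PAR1` magic, so reading the footer costs one seek to the tail per file regardless of row count. The file built below is a fake stand-in with a Parquet-shaped trailer, used only to exercise the reader:

   ```python
   import os
   import struct
   import tempfile

   def read_footer(path):
       """Read only the Parquet footer, skipping all data pages.

       Cost is roughly constant per file (one seek + small read), while
       reading the data scales with row count - so with many tiny files
       the per-file footer work dominates the total merge time.
       """
       with open(path, "rb") as f:
           f.seek(-8, os.SEEK_END)                   # tail: length + magic
           footer_len = struct.unpack("<I", f.read(4))[0]
           assert f.read(4) == b"PAR1"               # trailing magic marker
           f.seek(-(8 + footer_len), os.SEEK_END)
           return f.read(footer_len)                 # serialized FileMetaData

   # Build a dummy file with a Parquet-shaped trailer (fake footer bytes,
   # fake data pages) purely to demonstrate the tail-only read.
   with tempfile.NamedTemporaryFile(delete=False, suffix=".parquet") as tmp:
       footer = b"\x15\x00" * 16                     # 32 fake Thrift bytes
       tmp.write(b"PAR1" + b"\x00" * 1024)           # header magic + fake data
       tmp.write(footer)
       tmp.write(struct.pack("<I", len(footer)) + b"PAR1")
       path = tmp.name

   footer_bytes = read_footer(path)
   print(len(footer_bytes))                          # 32
   os.remove(path)
   ```

   The takeaway matches the numbers above: the tail read is the same size whether the file holds 100 rows or 1,000,000, so the footer/schema-validation cost only amortizes well when files are large.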
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

