shangxinli commented on PR #14435: URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3587418543
> | Data Files | Records per Data File | Normal Merge (s) | Parquet Merge (s) |
> |---|---|---|---|
> | 5 | 100*10000 | 4.7 | 0.9 |
> | 5 | 10*10000 | 1.8 | 0.9 |
> | 5 | 1*10000 | 0.8 | 0.9 |
> | 40 | 10*10000 | 6.7 | 4.4 |
> | 40 | 1*10000 | 4.1 | 3.6 |
> | 40 | 100 | 3.4 | 3.4 |
> | 100 | 1000 | 12.0 | 8.4 |
>
> I ran some tests comparing Parquet merge with normal merge; the Parquet merge version I used is the original one, without any lineage-related changes. From the current results, when there are many files with small contents, the performance advantage of Parquet merge is not particularly large; when files contain many rows, the advantage is significant. I suspect that validating the schema and reading the footer for every file introduces additional overhead.
>
> I suggest that once the lineage part is ready, we add corresponding tests, because adding lineage will introduce more complexity.
>
> These were manual tests; results may vary and are for reference only.

It would also depend on how large the records are. When records are tiny, the overhead of adding physical row lineage becomes more significant. Also, if you run a Spark job to compare, the job itself adds some overhead. In particular, when files are very tiny, like 100 records, there is not much difference between reading just the footer and reading the data itself.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
