danielcweeks commented on issue #105: Basic Benchmarks for Iceberg Spark Data Source

URL: https://github.com/apache/incubator-iceberg/pull/105#issuecomment-469779958

@aokolnychyi Thanks for doing this, I think there are some really good insights and questions that come out of it. I would characterize some as the following (ordered from least controversial, from my perspective):

- The write path is effectively the same in performance (within error).
- For the file-skipping tests, I'm concerned that the difference in the number of files processed per task undermines any conclusions we can reasonably draw from the results. With task combining being different, the largest factor may be the work done within the task. The flat-data test demonstrates that, as you point out, and the nested-data test could be skewed in the other direction because all files are processed in a single task.
- For the materialized scans, I feel we would get more accurate results by better isolating the read path (i.e., exercising the read path directly rather than running through a full job). There are so many variables above the materialization that could affect performance, especially with such small datasets, that I don't have much confidence in the order of magnitude of the comparative difference or in how it would scale with different datasets.

Also, I would prefer that we don't actually commit the results of the benchmarks, as there are too many variables for any single run to be canonical. For example, the underlying filesystem (e.g., HDFS, S3) would likely affect performance in different areas (and scale differently across numbers of partitions, files, etc.). Hardware and local customizations can also have a pronounced impact. Benchmarks are notoriously controversial, so allowing them to be run and interpreted in the environment where they will actually be used is often the most relevant way to deal with all of these variables.
I would be in favor of adding this (minus the results), along with more instruction and guidance on how to use and configure it, if for no other reason than to have a good starting point for adding other benchmarks and iterating on improving their accuracy.
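To make the "isolate the read path" point concrete: the idea is to time just the code path under test with explicit warmup, rather than measuring an end-to-end job where scheduling and task combining dominate. The sketch below is a minimal, generic harness illustrating that pattern; the class and method names are purely illustrative and this is not the JMH-based setup the PR itself proposes, which handles forking, warmup, and statistics far more rigorously.

```java
import java.util.concurrent.TimeUnit;

/**
 * Illustrative micro-timing harness (hypothetical, not part of the PR):
 * warms up a task so JIT compilation settles, then reports the average
 * wall-clock time per iteration for the isolated code path.
 */
public final class MicroBench {

    private MicroBench() {}

    /**
     * @param task       the isolated code path to measure (e.g. just the reader)
     * @param warmups    untimed runs to let the JIT and caches settle
     * @param iterations timed runs averaged into the result
     * @return average milliseconds per iteration
     */
    public static double measureMillis(Runnable task, int warmups, int iterations) {
        for (int i = 0; i < warmups; i++) {
            task.run(); // untimed: JIT compilation, class loading, cache warmup
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run(); // timed runs
        }
        long elapsed = System.nanoTime() - start;
        return TimeUnit.NANOSECONDS.toMillis(elapsed) / (double) iterations;
    }

    public static void main(String[] args) {
        // Stand-in workload; in a real benchmark this would be the
        // data source's read path invoked directly, not a Spark job.
        Runnable workload = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable");
            }
        };
        System.out.println("avg ms/iteration: " + measureMillis(workload, 5, 20));
    }
}
```

Even a crude harness like this removes job submission, scheduling, and shuffle setup from the measurement, which is exactly the noise that makes the materialized-scan numbers hard to interpret at small dataset sizes.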