danielcweeks commented on issue #105: Basic Benchmarks for Iceberg Spark Data Source

URL: https://github.com/apache/incubator-iceberg/pull/105#issuecomment-469779958

@aokolnychyi Thanks for doing this, I think there are some really good insights and questions that come out of it. I would characterize some as the following (ordered from least controversial, from my perspective):

- The write path is effectively the same in performance (within error).
- For the file-skipping tests, I'm concerned that the difference in the number of files processed per task undermines any conclusions we can reasonably draw from the results. With task combining being different, the largest factor may be the work done within the task. The flat-data test demonstrates that, as you point out, and the nested-data test could be skewed in the other direction because all files are processed in a single task.
- For the materialized scans, I feel we would get more accurate results by better isolating the read path (i.e., exercising the read path directly rather than running through a full job). There are so many variables above the materialization that could affect performance, especially with such small datasets, that I don't have much confidence in the order of magnitude of the comparative difference or in how it would scale with different datasets.

Also, I would prefer that we don't actually commit the results of the benchmarks, as there are too many variables for any single run to be canonical. For example, the underlying filesystem (e.g., HDFS, S3) would likely affect performance in different areas (and scale differently across numbers of partitions, files, etc.). Hardware and local customizations can also have a pronounced impact. Benchmarks are notoriously controversial, so allowing them to be run and interpreted in the environment where they will actually be used is often the most relevant way to deal with all of these variables.
I would be in favor of adding this (minus the results), along with more instruction and guidance on how to use and configure it, if for no other reason than to have a good starting point for adding other benchmarks and iterating on improving their accuracy.
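To make the "isolate the read path" point concrete: the idea is to time just the code path under test with explicit warmup, rather than measuring an end-to-end job where scheduling and task combining dominate. The sketch below is a minimal, generic harness illustrating that pattern; the class and method names are purely illustrative and this is not the JMH-based setup the PR itself proposes, which handles forking, warmup, and statistics far more rigorously.

```java
import java.util.concurrent.TimeUnit;

/**
 * Illustrative micro-timing harness (hypothetical, not part of the PR):
 * warms up a task so JIT compilation settles, then reports the average
 * wall-clock time per iteration for the isolated code path.
 */
public final class MicroBench {

    private MicroBench() {}

    /**
     * @param task       the isolated code path to measure (e.g. just the reader)
     * @param warmups    untimed runs to let the JIT and caches settle
     * @param iterations timed runs averaged into the result
     * @return average milliseconds per iteration
     */
    public static double measureMillis(Runnable task, int warmups, int iterations) {
        for (int i = 0; i < warmups; i++) {
            task.run(); // untimed: JIT compilation, class loading, cache warmup
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run(); // timed runs
        }
        long elapsed = System.nanoTime() - start;
        return TimeUnit.NANOSECONDS.toMillis(elapsed) / (double) iterations;
    }

    public static void main(String[] args) {
        // Stand-in workload; in a real benchmark this would be the
        // data source's read path invoked directly, not a Spark job.
        Runnable workload = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable");
            }
        };
        System.out.println("avg ms/iteration: " + measureMillis(workload, 5, 20));
    }
}
```

Even a crude harness like this removes job submission, scheduling, and shuffle setup from the measurement, which is exactly the noise that makes the materialized-scan numbers hard to interpret at small dataset sizes.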