[GitHub] [iceberg] bryanck commented on pull request #5215: Core: Update MetricsConfig to use a default for first 32 columns

GitBox Sat, 09 Jul 2022 06:39:26 -0700


bryanck commented on PR #5215:
URL: https://github.com/apache/iceberg/pull/5215#issuecomment-1179546876


   This is already merged, but I thought I'd leave feedback anyway, in case it 
is useful.
   
   As a data engineer, many tables I have maintained have more than 32 
top-level columns. Often columns used for partitioning, sorting, auditing, and 
so forth are put at the end of a table schema, but these are some of the most 
frequently used in filtering. Also, additional columns are generally added at 
the end of the schema. The assumption that the first columns in a table schema 
are the most important to have stats on is not always accurate. 
   
   In testing 0.14, I ran into missing stats on tables, which was confusing and 
difficult to debug. I imagine it would be even more confusing for those new to 
Iceberg, who are the most likely to leave settings at the default.
   
   I feel a more sensible default is to keep the behavior of previous Iceberg 
versions (i.e. no column limit). An option could then be introduced to limit 
the number of columns, e.g. "first(32)", so that those who prefer a limit can 
set it on their tables. I feel it is better to err on the side of too many 
stats and dial that back as needed.
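
   To make the concern concrete, here is a rough sketch of which columns a 
first-32 default would skip, and of the per-column overrides one might set as a 
workaround. The helper names and the example schema are hypothetical. The 
`write.metadata.metrics.column.` prefix and the `truncate(16)` mode are, to my 
understanding, the existing per-column metrics table properties, but please 
check them against the docs for your version.

```python
# Hypothetical helpers for illustration; not part of Iceberg's API.
# Assumption: per-column metrics modes are set through table properties
# whose keys start with this prefix.
METRICS_COLUMN_PREFIX = "write.metadata.metrics.column."


def columns_past_limit(columns, limit=32):
    """Return the top-level columns that fall outside a first-N default."""
    return list(columns[limit:])


def metrics_overrides(columns, mode="truncate(16)"):
    """Build per-column metrics properties for the given column names."""
    return {METRICS_COLUMN_PREFIX + name: mode for name in columns}


# Made-up schema: 40 data columns, then partition and audit columns at the end.
schema = [f"c{i}" for i in range(40)] + ["event_date", "updated_at"]

# Columns that a first-32 default would leave without stats.
missed = columns_past_limit(schema)

# Properties that would explicitly request stats for the trailing columns.
overrides = metrics_overrides(["event_date", "updated_at"])
```

   In this made-up schema, the two columns most likely to appear in filters 
are among those past the limit, so each would need an explicit override.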
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

