[GitHub] [iceberg] pvary commented on pull request #3328: Hive handler estimate file size

GitBox Fri, 22 Oct 2021 04:17:58 -0700


pvary commented on pull request #3328:
URL: https://github.com/apache/iceberg/pull/3328#issuecomment-949532847



   Sorry for the many short comments (not too much time today), but let me 
summarise:
   - We should double-check everything I am stating here against Hive 
2.3.8/9 and 3.1.2. We were working on Hive 4.0.0 when we considered stats.
   - We found that the estimation causes issues for tables with many 
files. If we turn off `hive.stats.estimate`, we do not end up listing the 
directories recursively, so that could fix the planning performance issue.
   - We found that if `hive.stats.autogather` is true, then the statistics 
are collected, but there was a problem with `rowDataSize`. We fixed this in 
HIVE-24928 by propagating the Iceberg table statistics so that they are used 
as Hive statistics.
   - For automatic column statistics we need HIVE-25276.
   - We still have to consider other engines writing these tables, and we have 
to invalidate column statistics if another engine updates the table. For this 
we created this change: HIVE-25286
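
To make the two settings from the list concrete, here is a minimal sketch that renders them as session-level `SET` statements. The property names `hive.stats.estimate` and `hive.stats.autogather` come from the summary above; the `HiveStatsSettings` class and its methods are hypothetical helpers for illustration, not part of Hive or Iceberg. Whether these values behave the same way on Hive 2.3.8/9 and 3.1.2 still needs checking.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not a Hive or Iceberg API.
final class HiveStatsSettings {
    private HiveStatsSettings() {}

    // Session overrides matching the summary: turn off estimation
    // (which avoids the recursive directory listing) and keep
    // automatic statistics gathering on.
    static Map<String, String> overrides() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("hive.stats.estimate", "false");
        conf.put("hive.stats.autogather", "true");
        return conf;
    }

    // Renders each entry as a "SET key=value;" line, in insertion order.
    static String toSetStatements(Map<String, String> conf) {
        return conf.entrySet().stream()
            .map(e -> "SET " + e.getKey() + "=" + e.getValue() + ";")
            .collect(Collectors.joining("\n"));
    }
}
```

The rendered statements could be run at the start of a Hive session before planning a query against a table with many files.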
   
   I hope this finally helps 😄 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



