[ https://issues.apache.org/jira/browse/HIVE-23776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148114#comment-17148114 ]
Prasanth Jayachandran commented on HIVE-23776:
----------------------------------------------

[~pvary] I understand the performance concerns that basicstats bring, especially in cloud environments. But I would like to discuss the alternatives instead of just removing it, as there are certainly dependencies on file sizes and number of files which cannot be removed.

The rawDataSize is good, but it only represents the in-memory representation, which is fine for most optimizations but not for all. The ratio of totalFileSize to rawDataSize gives approximately the compression ratio, which is still beneficial for some optimizations (totalFileSize can be used for estimating the splits, estimating the number of containers/nodes required without running the scans, etc.). It is better to pay the cost once upfront during ETL than every time we run a query or desc formatted.

If the basicstats are published as counters from the tasks, then the Tez AM can aggregate them at DAG level (https://github.com/apache/hive/blob/6440d93981e6d6aab59ecf2e77ffa45cd84d47de/ql/src/test/results/clientpositive/llap/tez_compile_counters.q.out#L1524-L1530), which HS2 can use to store them in the metastore without ever doing a file listing. This is one such approach, and it can be abstracted out if this is required for other engines. We could explore alternative approaches as well. I do not think it is a good idea to remove it just because it is slow on one cloud filesystem.
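As a rough illustration of the counter-based idea above, the sketch below sums per-task file statistics into one total, then derives an approximate compression ratio and a split estimate from it. It is only a sketch: `BasicStatsSketch`, `TaskStats`, `aggregate`, `compressionRatio` and `estimateSplits` are hypothetical names, not Hive or Tez APIs, and defining the ratio as rawDataSize / totalFileSize and sizing splits by a fixed byte count are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Hive or Tez code: each task reports what it wrote,
// and the totals are summed without listing any files.
final class BasicStatsSketch {

    // Per-task (or aggregated) basic stats; the field names are assumptions.
    static final class TaskStats {
        final long numFiles;
        final long totalFileSize; // on-disk bytes
        final long rawDataSize;   // in-memory (uncompressed) bytes

        TaskStats(long numFiles, long totalFileSize, long rawDataSize) {
            this.numFiles = numFiles;
            this.totalFileSize = totalFileSize;
            this.rawDataSize = rawDataSize;
        }
    }

    // Sum the stats reported by all tasks, as an aggregator at DAG level might.
    static TaskStats aggregate(List<TaskStats> tasks) {
        long files = 0, onDisk = 0, raw = 0;
        for (TaskStats t : tasks) {
            files += t.numFiles;
            onDisk += t.totalFileSize;
            raw += t.rawDataSize;
        }
        return new TaskStats(files, onDisk, raw);
    }

    // Assumed definition: in-memory bytes per on-disk byte; 0.0 when nothing is on disk.
    static double compressionRatio(TaskStats s) {
        return s.totalFileSize == 0 ? 0.0 : (double) s.rawDataSize / s.totalFileSize;
    }

    // Ceiling division of on-disk size by an assumed fixed split size.
    static long estimateSplits(long totalFileSize, long splitSizeBytes) {
        if (splitSizeBytes <= 0) {
            throw new IllegalArgumentException("splitSizeBytes must be positive");
        }
        return (totalFileSize + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        TaskStats total = aggregate(Arrays.asList(
                new TaskStats(2, 100, 400),
                new TaskStats(3, 150, 600)));
        System.out.println("numFiles=" + total.numFiles
                + " totalFileSize=" + total.totalFileSize
                + " ratio=" + compressionRatio(total)
                + " splits=" + estimateSplits(total.totalFileSize, 100));
    }
}
```

The point of the sketch is that the totals come from values the writers already know at commit time, so a consumer would not need a separate filesystem scan to obtain them.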
> Retire quickstats autocollection
> --------------------------------
>
> Key: HIVE-23776
> URL: https://issues.apache.org/jira/browse/HIVE-23776
> Project: Hive
> Issue Type: Improvement
> Reporter: Zoltan Haindrich
> Assignee: Zoltan Haindrich
> Priority: Major
>
> this is about:
> * num files
> * datasize (sum of filesizes)
> * num erasure coded files
> right now these are scanned during every BasicStatsTask execution, which means some filesystem reads etc.; for small inserts these are visible in case the fs is a bit slower (s3 and friends)
> I don't think they are really in use... we rely more on columnstats, which are more accurate; and the datasize in this case is the "offline" (on-disk) size, while we should instead calculate with "online" sizes...
> proposal:
> * remove collection and storage of this data
> * collect it on the fly during "desc formatted" statements to provide it for informational purposes

--
This message was sent by Atlassian Jira
(v8.3.4#803005)