[ 
https://issues.apache.org/jira/browse/HIVE-23776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17148114#comment-17148114
 ] 

Prasanth Jayachandran commented on HIVE-23776:
----------------------------------------------

[~pvary] I understand the performance concerns that basicstats collection 
raises, especially in cloud environments. But I would like to discuss 
alternatives instead of just removing it, as there are certainly dependencies 
on file sizes and number of files that cannot be removed. rawDataSize is 
useful, but it only reflects the in-memory representation, which is good for 
most optimizations but not for all. Comparing totalFileSize with rawDataSize 
gives approximately the compression ratio, which is still beneficial for some 
optimizations (totalFileSize can be used for estimating the splits, estimating 
the number of containers/nodes required without running the scans, etc.). It is 
better to pay this cost once upfront during ETL than every time we run a query 
or desc formatted. If the basicstats are published as counters from the tasks, 
then the Tez AM can aggregate them at the DAG level 
(https://github.com/apache/hive/blob/6440d93981e6d6aab59ecf2e77ffa45cd84d47de/ql/src/test/results/clientpositive/llap/tez_compile_counters.q.out#L1524-L1530),
 and HS2 can store the aggregated values in the metastore without ever doing a 
file listing. This is one such approach, and it can be abstracted out if it is 
required for other engines. We could explore alternative approaches as well. I 
do not think it is a good idea to remove it just because it is slow on one 
cloud filesystem.
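
To make the counter-based idea concrete, here is a minimal sketch in Python. 
It is not Hive or Tez code: the counter names and the three helper functions 
are hypothetical and exist only for illustration. It assumes each task reports 
its own file count and byte sizes, sums them as a DAG-level aggregation would, 
and derives the stats without listing any files.

```python
from collections import Counter

# Hypothetical counter names, chosen for this sketch only.
NUM_FILES = "NUM_FILES"
TOTAL_FILE_SIZE = "TOTAL_FILE_SIZE"
RAW_DATA_SIZE = "RAW_DATA_SIZE"


def aggregate_task_counters(task_counters):
    """Sum per-task counters into one DAG-level view."""
    dag_totals = Counter()
    for counters in task_counters:
        dag_totals.update(counters)  # adds values key by key
    return dict(dag_totals)


def to_basic_stats(dag_totals):
    """Derive basic stats from aggregated counters, with no file listing."""
    total_file_size = dag_totals.get(TOTAL_FILE_SIZE, 0)
    raw_data_size = dag_totals.get(RAW_DATA_SIZE, 0)
    # In-memory bytes per on-disk byte; undefined when nothing was written.
    ratio = raw_data_size / total_file_size if total_file_size else None
    return {
        "numFiles": dag_totals.get(NUM_FILES, 0),
        "totalFileSize": total_file_size,
        "rawDataSize": raw_data_size,
        "compressionRatio": ratio,
    }


def estimate_splits(total_file_size, split_size_bytes):
    """Ceiling division: splits needed to cover the on-disk bytes."""
    return -(-total_file_size // split_size_bytes)
```

For example, two tasks that wrote 1 and 2 files would be passed to 
aggregate_task_counters as a list of two dicts, and the result of 
to_basic_stats is what would be stored in the metastore. The split estimate 
uses only the on-disk size, which is why rawDataSize alone is not enough.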

> Retire quickstats autocollection
> --------------------------------
>
>                 Key: HIVE-23776
>                 URL: https://issues.apache.org/jira/browse/HIVE-23776
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>            Priority: Major
>
> this is about:
> * num files
> * datasize (sum of filesizes)
> * num erasure coded files
> right now these are scanned during every BasicStatsTask execution - which 
> means some filesystem reads/etc - for small inserts these are visible in case 
> the fs is a bit slower (s3 and friends)
> I don't think they are really in use...we rely more on columnstats which are 
> more accurate; and the datasize in this case is the "offline" (on-disk) 
> size - while we should instead calculate with "online" sizes...
> proposal:
> * remove collection and storage of this data
> * collect it on the fly during "desc formatted" statements to provide them 
> for informational purposes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
