[
https://issues.apache.org/jira/browse/SPARK-21079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048381#comment-16048381
]
Tejas Patil commented on SPARK-21079:
-------------------------------------
[~ZenWzh] Unit tests won't catch this because they run against the local FS in
local mode. In prod setups, the individual partitions of a table may live in
different directories, and those directories may even belong to different HDFS
namenodes. This is usually done when there is a federation of namenodes and
files have to be balanced across them. [0] is how Apache Hive does its stats
collection for partitioned tables: it explicitly looks at the metadata of
every partition to get that partition's location.
[0] :
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java#L240-L327
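To make the difference concrete, here is a minimal sketch in plain Python (not
Spark or Hive code) that uses the local filesystem as a stand-in for HDFS. All
names in it (dir_size, table_level_size, per_partition_size, demo, and the
directory layout) are hypothetical and chosen only for illustration. It
compares walking the table-level directory with summing over each partition's
own location when one partition lives outside the table directory.

```python
import os
import tempfile


def dir_size(path):
    """Total size in bytes of all regular files under path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def table_level_size(table_location):
    # Looks only under the table-level directory, so any partition stored
    # elsewhere is silently missed.
    return dir_size(table_location)


def per_partition_size(partition_locations):
    # Sums over each partition's own location, wherever it lives.
    # Assumes the locations do not nest inside one another.
    return sum(dir_size(loc) for loc in partition_locations)


def demo():
    # Hypothetical layout: one partition under the table directory and one
    # in an unrelated directory, standing in for another namenode.
    with tempfile.TemporaryDirectory() as base:
        table_dir = os.path.join(base, "warehouse", "t")
        p1 = os.path.join(table_dir, "p=1")
        p2 = os.path.join(base, "other_namespace", "t_p2")
        os.makedirs(p1)
        os.makedirs(p2)
        with open(os.path.join(p1, "data"), "wb") as f:
            f.write(b"x" * 100)
        with open(os.path.join(p2, "data"), "wb") as f:
            f.write(b"y" * 250)
        return table_level_size(table_dir), per_partition_size([p1, p2])


if __name__ == "__main__":
    table_only, per_partition = demo()
    print(table_only, per_partition)
```

In this layout the table-level walk sees only the 100 bytes under the table
directory, while the per-partition sum sees all 350. If every partition were
stored outside the table directory, the table-level walk would report 0, which
matches the symptom in this issue.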
> ANALYZE TABLE fails to calculate totalSize for a partitioned table
> ------------------------------------------------------------------
>
> Key: SPARK-21079
> URL: https://issues.apache.org/jira/browse/SPARK-21079
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: Maria
> Labels: easyfix
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> ANALYZE TABLE table COMPUTE STATISTICS invoked for a partitioned table
> produces totalSize = 0.
> AnalyzeTableCommand fetches the table-level storage URI and calculates the
> total size of the files in the corresponding directory recursively. However,
> for partitioned tables, each partition has its own storage URI, which may not
> be a subdirectory of the table-level storage URI.