Zoltan Ivanfi created HIVE-17618:
------------------------------------

             Summary: Extend ANALYZE TABLE / DESCRIBE FORMATTED functionality 
with distribution of selected file-level metadata fields
                 Key: HIVE-17618
                 URL: https://issues.apache.org/jira/browse/HIVE-17618
             Project: Hive
          Issue Type: Improvement
            Reporter: Zoltan Ivanfi


DESCRIBE FORMATTED already shows the number of files:

{noformat}
[...]
Table Parameters:
    COLUMN_STATS_ACCURATE   true
    numFiles                14
    numRows                 15653
[...]
{noformat}

It would be useful to break this number down by different file-level metadata 
fields. One such field would be the compression codec used by each file. 
Currently there is no way to check whether the contents of a table are 
compressed, because some files may be compressed while others are not. A 
file-count breakdown could provide this missing information in the following form:

{noformat}
[...]
Table Parameters:
    COLUMN_STATS_ACCURATE   true
    numFiles                14
        breakdown by compression:
            Uncompressed:   3
            Snappy:         6
            Deflate:        5
    numRows                 15653
[...]
{noformat}
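
The breakdown above amounts to a group-by count over per-file metadata. A minimal sketch in Python, assuming the codec name of each file has already been read from the file itself; the `file_codecs` list and both helper functions are hypothetical and not part of Hive:

```python
from collections import Counter

def count_by(values):
    """Count occurrences of each value, most frequent first."""
    return Counter(values).most_common()

def format_breakdown(label, counts, indent=8):
    """Render counts in the indented layout proposed for DESCRIBE FORMATTED."""
    pad = " " * indent
    lines = [f"{pad}breakdown by {label}:"]
    for name, n in counts:
        key = name + ":"
        lines.append(f"{pad}    {key:<16}{n}")
    return "\n".join(lines)

# Hypothetical input: one codec name per file in the table.
file_codecs = ["Uncompressed"] * 3 + ["Snappy"] * 6 + ["Deflate"] * 5
compression_counts = count_by(file_codecs)
print(format_breakdown("compression", compression_counts))
```

The per-codec counts always sum to numFiles, so the breakdown can be checked against the existing statistic.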

Another useful breakdown would be by the writer field of Parquet files. Impala 
writes Parquet files slightly differently (string fields are not annotated with 
UTF8 by default, timestamps are not adjusted to UTC), so users may want to know 
what kind of Parquet files are in a table, but they currently have no way to 
query this. An example output for Parquet tables could look like:

{noformat}
[...]
Table Parameters:
    COLUMN_STATS_ACCURATE   true
    numFiles                14
        breakdown by compression:
            Uncompressed:   3
            Snappy:         6
            Deflate:        5
        breakdown by writer:
            parquet-mr:     9
            impala:         5
    numRows                 15653
[...]
{noformat}
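
A rough sketch of how the writer breakdown could be derived, assuming the writer is identified by the `created_by` string in the Parquet footer and that this string starts with the application name (as in `parquet-mr version ...`). The sample strings and the `writer_name` helper are made up for illustration:

```python
from collections import Counter

def writer_name(created_by):
    """Return the leading application token of a created_by string."""
    tokens = created_by.split()
    return tokens[0].lower() if tokens else "unknown"

# Hypothetical created_by values, one per Parquet file in the table.
created_by_values = (
    ["parquet-mr version 1.8.1 (build abc123)"] * 9
    + ["impala version 2.9.0 (build def456)"] * 5
)
writer_counts = Counter(writer_name(v) for v in created_by_values)
```

Files with a missing or empty writer string fall into an `unknown` bucket here, so the counts still add up to numFiles.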

Any other file-level metadata that we consider useful to the user could be 
incorporated as well. Since gathering file-level metadata is an expensive 
operation, it should be done when the user issues ANALYZE TABLE ... COMPUTE 
STATISTICS.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
