[ 
https://issues.apache.org/jira/browse/HIVE-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Steinbach updated HIVE-5324:
---------------------------------
    Labels: backward-incompatible orcfile statistics  (was: orcfile statistics)

> Extend record writer and ORC reader/writer interfaces to provide statistics
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-5324
>                 URL: https://issues.apache.org/jira/browse/HIVE-5324
>             Project: Hive
>          Issue Type: New Feature
>    Affects Versions: 0.13.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>              Labels: backward-incompatible, orcfile, statistics
>             Fix For: 0.13.0
>
>         Attachments: HIVE-5324.1.patch.txt, HIVE-5324.2.patch.txt, 
> HIVE-5324.3.patch.txt, HIVE-5324.4.patch.txt
>
>
> The current implementation computes statistics (number of rows and raw
> data size) for every single row processed. The processOp() method in
> FileSinkOperator gets the raw data size for each row from the serde and
> accumulates it in a hashmap while counting the number of rows. These
> accumulated statistics are then published to the metastore.
> ORC already stores enough statistics internally, and these can be used
> when publishing the stats to the metastore. This avoids the duplicated
> work that currently happens in processOp(). Getting the statistics
> directly from ORC is also very cheap, since they can be read straight
> from the file footer.
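
The contrast described above can be sketched in a few lines of Java. This is only an illustration of the idea, not the code from the attached patches: StatsSketch, Stats, StatsProvidingWriter and InMemoryWriter are hypothetical names, and rows are modelled as plain byte arrays rather than Hive's serialized objects.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatsSketch {

    /** Hypothetical holder for the two statistics named in the issue. */
    static final class Stats {
        final long numRows;
        final long rawDataSize;

        Stats(long numRows, long rawDataSize) {
            this.numRows = numRows;
            this.rawDataSize = rawDataSize;
        }
    }

    /** Hypothetical stand-in for a record writer that tracks its own statistics. */
    interface StatsProvidingWriter {
        void write(byte[] row);

        Stats getStats();
    }

    /** Per-row approach: the caller measures every row and accumulates in a map. */
    static Stats perRowStats(List<byte[]> rows) {
        Map<String, Long> acc = new HashMap<>();
        for (byte[] row : rows) {
            acc.merge("numRows", 1L, Long::sum);
            acc.merge("rawDataSize", (long) row.length, Long::sum);
        }
        return new Stats(acc.getOrDefault("numRows", 0L),
                         acc.getOrDefault("rawDataSize", 0L));
    }

    /** Writer-provided approach: the caller only writes, then asks once at the end. */
    static Stats writerStats(StatsProvidingWriter writer, List<byte[]> rows) {
        for (byte[] row : rows) {
            writer.write(row);
        }
        return writer.getStats();
    }

    /** Toy writer that keeps counters internally instead of a real file footer. */
    static final class InMemoryWriter implements StatsProvidingWriter {
        private long numRows;
        private long rawDataSize;

        @Override
        public void write(byte[] row) {
            numRows++;
            rawDataSize += row.length;
        }

        @Override
        public Stats getStats() {
            return new Stats(numRows, rawDataSize);
        }
    }

    public static void main(String[] args) {
        List<byte[]> rows = Arrays.asList(new byte[4], new byte[10], new byte[6]);
        Stats perRow = perRowStats(rows);
        Stats fromWriter = writerStats(new InMemoryWriter(), rows);
        System.out.println(perRow.numRows + " rows, " + perRow.rawDataSize + " bytes");
        System.out.println(fromWriter.numRows + " rows, " + fromWriter.rawDataSize + " bytes");
    }
}
```

Both paths report the same totals. In the second, the caller does no per-row bookkeeping and makes a single call once writing is finished, which is the saving the issue is after.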



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
