[ 
https://issues.apache.org/jira/browse/HIVE-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780753#comment-13780753
 ] 

Hudson commented on HIVE-5324:
------------------------------

SUCCESS: Integrated in Hive-trunk-hadoop1-ptest #185 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/185/])
HIVE-5324 : Extend record writer and ORC reader/writer interfaces to provide 
statistics (Prasanth J via Ashutosh Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1527149)
* 
/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/fileformat/base64/Base64TextOutputFormat.java
* 
/hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHFileOutputFormat.java
* 
/hive/trunk/hcatalog/core/src/test/java/org/apache/hcatalog/cli/DummyStorageHandler.java
* 
/hive/trunk/hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/HBaseBaseOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/FSRecordWriter.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinaryOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveIgnoreKeyTextOutputFormat.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveNullValueSequenceFileOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveOutputFormat.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HivePassThroughOutputFormat.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HivePassThroughRecordWriter.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveSequenceFileOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileOutputFormat.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroContainerOutputFormat.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordWriter.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/Reader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/ReaderImpl.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/Writer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java
* 
/hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInputOutputFormat.java
* 
/hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/udf/Rot13OutputFormat.java
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java

                
> Extend record writer and ORC reader/writer interfaces to provide statistics
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-5324
>                 URL: https://issues.apache.org/jira/browse/HIVE-5324
>             Project: Hive
>          Issue Type: New Feature
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile, statistics
>             Fix For: 0.13.0
>
>         Attachments: HIVE-5324.1.patch.txt, HIVE-5324.2.patch.txt, 
> HIVE-5324.3.patch.txt, HIVE-5324.4.patch.txt
>
>
> The current implementation for computing statistics (number of rows and raw 
> data size) happens for every single row processed. The processOp() method in 
> FileSinkOperator gets raw data size for each row from the serde and 
> accumulates the size in hashmap while counting the number of rows. This 
> accumulated statistics is then published to metastore. 
> In case of ORC, ORC already stores enough statistics internally which can be 
> made use of when publishing the stats to metastore. This will avoid the 
> duplication of work that is happening in the processOp(). Also getting the 
> statistics directly from ORC is very cheap (can directly read from the file 
> footer).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to