[jira] [Commented] (HIVE-4340) ORC should provide raw data size
[ https://issues.apache.org/jira/browse/HIVE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772548#comment-13772548 ]

Prasanth J commented on HIVE-4340:

[~ashutoshc] Thanks for your feedback. I broke up this patch into two patches (HIVE-5324 and HIVE-5325) as per your suggestion.

ORC should provide raw data size

Key: HIVE-4340
URL: https://issues.apache.org/jira/browse/HIVE-4340
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.11.0
Reporter: Kevin Wilfong
Assignee: Prasanth J
Attachments: HIVE-4340.1.patch.txt, HIVE-4340.2.patch.txt, HIVE-4340.3.patch.txt, HIVE-4340.4.patch.txt, HIVE-4340-java-only.4.patch.txt

ORC's SerDe currently does nothing, and hence does not calculate a raw data size. WriterImpl, however, has enough information to provide one. WriterImpl should compute a raw data size for each row, aggregate the sizes per stripe and record the total in the stripe information, as RC currently does in its key header, and allow the FileSinkOperator access to the size per row. FileSinkOperator should be able to get the raw data size from either the SerDe or the RecordWriter when the RecordWriter can provide it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
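The proposal in the issue description (a raw data size computed for each row and aggregated per stripe) could look roughly like the sketch below. It is only an illustration: the class RawSizeSketch, its methods, and the per-type byte sizes are assumptions made for the example and are not taken from WriterImpl or from any of the attached patches.

```java
// Hypothetical sketch: names and byte sizes are illustrative assumptions,
// not the actual ORC WriterImpl code.
final class RawSizeSketch {
    private long currentStripeRawSize = 0L;
    private long totalRawSize = 0L;

    // Assumed raw size of one value: fixed widths for numbers,
    // character count for strings, zero for nulls.
    static long rawSizeOf(Object value) {
        if (value == null) {
            return 0L;
        } else if (value instanceof Long || value instanceof Double) {
            return 8L;
        } else if (value instanceof Integer || value instanceof Float) {
            return 4L;
        } else if (value instanceof String) {
            return ((String) value).length();
        }
        throw new IllegalArgumentException("unsupported type: " + value.getClass());
    }

    // Adds one row to the current stripe and returns that row's raw size,
    // so a caller (for example a file sink) can see the size per row.
    long addRow(Object[] row) {
        long rowSize = 0L;
        for (Object value : row) {
            rowSize += rawSizeOf(value);
        }
        currentStripeRawSize += rowSize;
        return rowSize;
    }

    // Closes the current stripe: returns its aggregated raw size (the value
    // that would be recorded in the stripe information) and starts a new one.
    long flushStripe() {
        long stripeSize = currentStripeRawSize;
        totalRawSize += stripeSize;
        currentStripeRawSize = 0L;
        return stripeSize;
    }

    // Raw size summed over all stripes flushed so far.
    long totalRawSize() {
        return totalRawSize;
    }
}
```

In this sketch the per-row size is returned from addRow(), which mirrors the request that the FileSinkOperator be allowed access to the size per row, while flushStripe() yields the per-stripe aggregate.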
[jira] [Commented] (HIVE-4340) ORC should provide raw data size
[ https://issues.apache.org/jira/browse/HIVE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771191#comment-13771191 ]

Ashutosh Chauhan commented on HIVE-4340:

Thanks [~prasanth_j] for picking this one up. I suggest breaking the patch into two: one that proposes new stats-gathering and stats-providing interfaces on RecordWriter and RecordReader, and another jira for the ORC implementation of these two.

ORC should provide raw data size

Key: HIVE-4340
URL: https://issues.apache.org/jira/browse/HIVE-4340
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.11.0
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
Attachments: HIVE-4340.1.patch.txt, HIVE-4340.2.patch.txt, HIVE-4340.3.patch.txt, HIVE-4340.4.patch.txt, HIVE-4340-java-only.4.patch.txt
[jira] [Commented] (HIVE-4340) ORC should provide raw data size
[ https://issues.apache.org/jira/browse/HIVE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768739#comment-13768739 ]

Prasanth J commented on HIVE-4340:

I tried enhancing this patch to support SerDeStats in ORC in a slightly more efficient and less intrusive way. Currently, stats are gathered for each row in the processOp() method of FileSinkOperator: for each row, a new SerDeStats object is created and the stats are accumulated in a hashmap. This works well when the underlying storage format does not gather statistics itself. ORC, however, already gathers many statistics while writing the data, and these can be leveraged to provide SerDeStats. The statistics gathered by ORC can be retrieved in the closeOp() method of FileSinkOperator, which is more efficient than row-by-row processing of SerDe statistics. The uploaded patch implements this approach.

ORC should provide raw data size

Key: HIVE-4340
URL: https://issues.apache.org/jira/browse/HIVE-4340
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.11.0
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
Attachments: HIVE-4340.1.patch.txt, HIVE-4340.2.patch.txt, HIVE-4340.3.patch.txt, HIVE-4340.4.patch.txt, HIVE-4340-java-only.4.patch.txt
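The difference between gathering stats row by row and asking the writer for them once at close time can be sketched as follows. This is a simplified illustration only: SketchStats, SketchStatsProvidingWriter, SketchOrcLikeWriter and SketchFileSink are hypothetical stand-ins invented for the example, not Hive's SerDeStats, RecordWriter or FileSinkOperator, and the raw size is approximated by string length.

```java
// Hypothetical sketch: all types here are illustrative stand-ins,
// not the actual Hive classes or interfaces.
final class SketchStats {
    long rowCount;
    long rawDataSize;
}

// Assumed interface for a writer that can report its own statistics.
interface SketchStatsProvidingWriter {
    void write(String row);

    SketchStats getStats();
}

// A writer that, like ORC, already tracks statistics while writing.
final class SketchOrcLikeWriter implements SketchStatsProvidingWriter {
    private final SketchStats stats = new SketchStats();

    @Override
    public void write(String row) {
        stats.rowCount++;
        stats.rawDataSize += row.length(); // assumed raw size: character count
    }

    @Override
    public SketchStats getStats() {
        return stats;
    }
}

// A sink that does no per-row stats work and reads the stats once on close.
final class SketchFileSink {
    private final SketchStatsProvidingWriter writer;
    SketchStats finalStats;

    SketchFileSink(SketchStatsProvidingWriter writer) {
        this.writer = writer;
    }

    // Per-row path: only forwards the row; no stats object is created here.
    void processOp(String row) {
        writer.write(row);
    }

    // Close path: a single call retrieves everything the writer gathered.
    void closeOp() {
        finalStats = writer.getStats();
    }
}
```

The point of the design is that the per-row path allocates nothing for statistics; the cost of collecting them is paid once, in closeOp().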
[jira] [Commented] (HIVE-4340) ORC should provide raw data size
[ https://issues.apache.org/jira/browse/HIVE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768848#comment-13768848 ]

Prasanth J commented on HIVE-4340:

Added the UNION case to the ORC writer's raw data size computation.

ORC should provide raw data size

Key: HIVE-4340
URL: https://issues.apache.org/jira/browse/HIVE-4340
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.11.0
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
Attachments: HIVE-4340.1.patch.txt, HIVE-4340.2.patch.txt, HIVE-4340.3.patch.txt, HIVE-4340.4.patch.txt, HIVE-4340-java-only.4.patch.txt
[jira] [Commented] (HIVE-4340) ORC should provide raw data size
[ https://issues.apache.org/jira/browse/HIVE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768786#comment-13768786 ]

Prasanth J commented on HIVE-4340:

Review board entry https://reviews.apache.org/r/14162

ORC should provide raw data size

Key: HIVE-4340
URL: https://issues.apache.org/jira/browse/HIVE-4340
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.11.0
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
Attachments: HIVE-4340.1.patch.txt, HIVE-4340.2.patch.txt, HIVE-4340.3.patch.txt, HIVE-4340.4.patch.txt, HIVE-4340-java-only.4.patch.txt
[jira] [Commented] (HIVE-4340) ORC should provide raw data size
[ https://issues.apache.org/jira/browse/HIVE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641758#comment-13641758 ]

Namit Jain commented on HIVE-4340:

Compilation is failing:

[javac] /Users/njain/hive/hive_commit3/ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java:986: abstract method write(java.lang.Object) in org.apache.hadoop.hive.ql.io.orc.WriterImpl.TreeWriter cannot be accessed directly
[javac] super.write(obj);
[javac] ^

ORC should provide raw data size

Key: HIVE-4340
URL: https://issues.apache.org/jira/browse/HIVE-4340
Project: Hive
Issue Type: Improvement
Components: Serializers/Deserializers
Affects Versions: 0.11.0
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
Attachments: HIVE-4340.1.patch.txt, HIVE-4340.2.patch.txt
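javac reports "abstract method ... cannot be accessed directly" when a subclass calls super.write(obj) but the superclass declares write(Object) as abstract, so there is no inherited body to invoke. The sketch below shows one arrangement that compiles, where the base method stays concrete. SketchTreeWriter and SketchIntWriter are hypothetical stand-ins for illustration; they are not the WriterImpl classes, and this is not necessarily how the patch was corrected.

```java
// Hypothetical stand-ins for illustration; not the actual WriterImpl classes.
abstract class SketchTreeWriter {
    protected long nonNullCount = 0L;

    // Concrete (not abstract), so subclasses may legally call super.write(obj).
    // Declaring this method abstract instead would make the super call below
    // fail with "abstract method ... cannot be accessed directly".
    void write(Object obj) {
        if (obj != null) {
            nonNullCount++;
        }
    }
}

final class SketchIntWriter extends SketchTreeWriter {
    long sum = 0L;

    @Override
    void write(Object obj) {
        super.write(obj); // shared bookkeeping lives in the base class
        if (obj != null) {
            sum += ((Integer) obj).longValue();
        }
    }
}
```

If the base method genuinely has to be abstract, the alternative is to drop the super call and move any shared bookkeeping into a separate concrete helper method.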
[jira] [Commented] (HIVE-4340) ORC should provide raw data size
[ https://issues.apache.org/jira/browse/HIVE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641967#comment-13641967 ]

Kevin Wilfong commented on HIVE-4340:

Sorry, I hadn't tested the patch after refreshing it; it wasn't ready for review.

ORC should provide raw data size

Key: HIVE-4340
URL: https://issues.apache.org/jira/browse/HIVE-4340
Project: Hive
Issue Type: Improvement
Components: Serializers/Deserializers
Affects Versions: 0.11.0
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
Attachments: HIVE-4340.1.patch.txt, HIVE-4340.2.patch.txt, HIVE-4340.3.patch.txt
[jira] [Commented] (HIVE-4340) ORC should provide raw data size
[ https://issues.apache.org/jira/browse/HIVE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640180#comment-13640180 ]

Namit Jain commented on HIVE-4340:

+1

ORC should provide raw data size

Key: HIVE-4340
URL: https://issues.apache.org/jira/browse/HIVE-4340
Project: Hive
Issue Type: Improvement
Components: Serializers/Deserializers
Affects Versions: 0.11.0
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
Attachments: HIVE-4340.1.patch.txt
[jira] [Commented] (HIVE-4340) ORC should provide raw data size
[ https://issues.apache.org/jira/browse/HIVE-4340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636677#comment-13636677 ]

Kevin Wilfong commented on HIVE-4340:

https://reviews.facebook.net/D10179

ORC should provide raw data size

Key: HIVE-4340
URL: https://issues.apache.org/jira/browse/HIVE-4340
Project: Hive
Issue Type: Improvement
Components: Serializers/Deserializers
Affects Versions: 0.11.0
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong