[jira] [Commented] (HIVE-2192) Stats table schema incompatible after HIVE-2185
[ https://issues.apache.org/jira/browse/HIVE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13044709#comment-13044709 ] Tomasz Nykiel commented on HIVE-2192: - Thanks. Stats table schema incompatible after HIVE-2185 --- Key: HIVE-2192 URL: https://issues.apache.org/jira/browse/HIVE-2192 Project: Hive Issue Type: Bug Reporter: Ning Zhang Assignee: Tomasz Nykiel Fix For: 0.8.0 Attachments: HIVE-2192.patch HIVE-2185 introduced a new column in the intermediate stats table. This introduces incompatibility between old and new branches (multiple branches could be deployed in production): the old branch will not work with the new schema, and the new branch will not work with the old schema. A solution would be to rename the stats table name (requires code change) or use a different database name (requires hive-default.xml conf change). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
[ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13043237#comment-13043237 ] Tomasz Nykiel commented on HIVE-2185: - I ran all tests. All quantities were the same as previously, but now the name of the metric changed. Thanks. extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics) Key: HIVE-2185 URL: https://issues.apache.org/jira/browse/HIVE-2185 Project: Hive Issue Type: New Feature Components: Serializers/Deserializers, Statistics Reporter: Tomasz Nykiel Assignee: Tomasz Nykiel Attachments: HIVE-2185.1.patch, HIVE-2185.2.patch, HIVE-2185.patch Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression. Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics. On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HIVE-2192) Stats table schema incompatible after HIVE-2185
[ https://issues.apache.org/jira/browse/HIVE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Nykiel updated HIVE-2192: Attachment: HIVE-2192.patch -Changes the table name used for collecting intermediate statistics by JDBCStatsPublisher -removes an empty TestStatsPublisher.java
Review Request: Stats table schema incompatible after HIVE-2185
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/851/ --- Review request for hive, Hairong Kuang and Ning Zhang. Summary --- HIVE-2185 introduces new statistics collected by StatsPublisher, which causes a change in the schema of the table used for collecting intermediate statistics by JDBCStatsPublisher. This patch changes the table name, to avoid conflicts with the previous version. Also this patch removes an empty JUnit java file. This addresses bug HIVE-2192. https://issues.apache.org/jira/browse/HIVE-2192 Diffs - trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1131108 Diff: https://reviews.apache.org/r/851/diff Testing --- JDBCStatsPublisherEnhanced JUnit test. Thanks, Tomasz
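The renaming approach described in this request can be sketched as follows. This is only an illustration under stated assumptions: the class name, the base table name, and the "_V" suffix scheme are hypothetical and are not taken from the actual change to JDBCStatsSetupConstants.

```java
// Hypothetical illustration; names and values are assumptions, not the contents of the patch.
final class StatsTableNaming {
    static final String BASE_NAME = "PARTITION_STAT_TABLE";

    /** Builds a table name that changes whenever the intermediate stats schema changes. */
    static String tableName(int schemaVersion) {
        if (schemaVersion < 1) {
            throw new IllegalArgumentException("schemaVersion must be >= 1");
        }
        // Version 1 keeps the legacy name, so a branch using the old schema is unaffected.
        return schemaVersion == 1 ? BASE_NAME : BASE_NAME + "_V" + schemaVersion;
    }
}
```

With a scheme like this, an old and a new branch deployed side by side would each read and write their own table, so neither sees a schema it does not expect.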
[jira] [Updated] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
[ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Nykiel updated HIVE-2185: Attachment: HIVE-2185.2.patch Fixed some minor issues. Renamed the metric to rawDataSize
Re: Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/785/#review718 ---
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java https://reviews.apache.org/r/785/#comment1457
should be:
long current = 0;
SerDeStats st = this.deserializer.getSerDeStats();
if (st != null) { current = st.getUncompressedSize(); }
since we do not hard-check which SerDe class is in use, and some of the unsupported classes return null.
- Tomasz
On 2011-05-26 02:52:55, Tomasz Nykiel wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/785/ --- (Updated 2011-05-26 02:52:55) Review request for hive.
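The null guard suggested in the review comment above can be tried in isolation. The sketch below is an assumption-laden stand-in: SerDeStatsStub and UncompressedSizeReader are hypothetical classes that model only the getUncompressedSize() getter mentioned in the comment, not the real Hive types.

```java
// Stand-in for the stats holder; only the getter named in the review comment is modeled.
final class SerDeStatsStub {
    private final long uncompressedSize;

    SerDeStatsStub(long uncompressedSize) {
        this.uncompressedSize = uncompressedSize;
    }

    long getUncompressedSize() {
        return uncompressedSize;
    }
}

// Hypothetical helper showing the guard; not part of Hive.
final class UncompressedSizeReader {
    /** Returns the reported size, or 0 when the SerDe supplies no stats object. */
    static long currentSize(SerDeStatsStub stats) {
        long current = 0;
        if (stats != null) {
            current = stats.getUncompressedSize();
        }
        return current;
    }
}
```

Defaulting to 0 matches the statement in the review summary that unsupported SerDe classes report an uncompressed size of 0.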
[jira] [Updated] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
[ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Nykiel updated HIVE-2185: Attachment: HIVE-2185.1.patch
Re: Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
On 2011-05-26 21:12:30, Ning Zhang wrote: trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java, line 100 https://reviews.apache.org/r/785/diff/1/?file=19586#file19586line100 should be = here
Yes.
On 2011-05-26 21:12:30, Ning Zhang wrote: trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java, line 82 https://reviews.apache.org/r/785/diff/1/?file=19585#file19585line82 Shouldn't isValidStatics() take key as a parameter rather than rowID? key should indicate which statistic this is, right?
Yes. It was a bug; I fixed it already, once I ran the HBase JUnit :)
- Tomasz
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/785/#review719 ---
On 2011-05-26 02:52:55, Tomasz Nykiel wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/785/ --- (Updated 2011-05-26 02:52:55) Review request for hive.
[jira] [Created] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics) Key: HIVE-2185 URL: https://issues.apache.org/jira/browse/HIVE-2185 Project: Hive Issue Type: New Feature Components: Serializers/Deserializers, Statistics Reporter: Tomasz Nykiel Assignee: Tomasz Nykiel Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. Here, we want to collect information about the sizes of uncompressed data, to be able to determine the efficiency of compression. Currently, a large part of statistics collection mechanism is hardcoded and not-easily extensible for other statistics. On top of adding the new statistic collected, it would be desirable to extend the collection mechanism, so any new statistics could be added easily. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
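To make the stated goal concrete: once the uncompressed size is collected alongside the size derived from the file system, compression efficiency reduces to a ratio of the two. A minimal sketch, assuming both sizes are available as byte counts; the class and method names are illustrative and not part of Hive.

```java
// Hypothetical helper; not part of Hive.
final class CompressionRatio {
    /**
     * Returns uncompressed size divided by on-disk size.
     * Returns 0.0 when either size is unknown or non-positive.
     */
    static double ratio(long uncompressedBytes, long onDiskBytes) {
        if (uncompressedBytes <= 0 || onDiskBytes <= 0) {
            return 0.0;
        }
        return (double) uncompressedBytes / (double) onDiskBytes;
    }
}
```

A ratio of 4.0 would mean the data occupies a quarter of its uncompressed size on disk.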
[jira] [Updated] (HIVE-2185) extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
[ https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Nykiel updated HIVE-2185: Attachment: HIVE-2185.patch
Review Request: extend table statistics to store the size of uncompressed data (+extend interfaces for collecting other types of statistics)
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/785/ --- Review request for hive.
Summary ---
Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands, we collect statistics about the number of rows per partition/table. Other statistics (e.g., total table/partition size) are derived from the file system. We introduce a new feature for collecting information about the sizes of uncompressed data, to be able to determine the efficiency of compression. On top of adding the new statistic, this patch extends the stats collection mechanism so that any new statistics can be added easily.
1. Serializer/deserializer classes are amended to collect the sizes of uncompressed data when serializing/deserializing objects. We support: Columnar SerDe, LazySimpleSerDe, LazyBinarySerDe. For other SerDe classes the uncompressed size will be 0.
2. StatsPublisher / StatsAggregator interfaces are extended to support multi-stats collection for both JDBC and HBase.
3. For INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and TableScanOperator, respectively, are extended to support multi-stats collection.
(2) and (3) enable easy extension for other types of statistics.
4. Collecting uncompressed size can be disabled by setting: hive.stats.collect.uncompressedsize = false
This addresses bug HIVE-2185.
https://issues.apache.org/jira/browse/HIVE-2185 Diffs - trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1127756 trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java 1127756 trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java 1127756 trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java 1127756 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java 1127756 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java 1127756 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java 1127756 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java 1127756 trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java PRE-CREATION trunk/hbase-handler/src/test/queries/hbase_stats.q 1127756 trunk/hbase-handler/src/test/results/hbase_stats.q.out 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java 1127756 
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java 1127756 trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java PRE-CREATION trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 1127756 trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java PRE-CREATION trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1127756 trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1127756 trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1127756 trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1127756 trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1127756 trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1127756 trunk/ql/src/test/results/clientpositive/combine2.q.out 1127756 trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out 1127756 trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1127756 trunk/ql/src/test/results/clientpositive/merge3.q.out 1127756
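The multi-stats idea in items (2) and (3) of the summary above can be sketched with an in-memory stand-in. Everything here is an assumption made for illustration: the class InMemoryStats, its method names, the key layout, and the statistic name "numRows" are hypothetical ("rawDataSize" is the metric name mentioned in the HIVE-2185 thread). The actual publishers write to JDBC or HBase and their signatures may differ.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for a publisher/aggregator pair; not the Hive interfaces.
final class InMemoryStats {
    // Task-level key -> (statistic name -> value).
    private final Map<String, Map<String, Long>> rows = new HashMap<>();

    /** Publishes several named statistics for one task-level key in a single call. */
    void publish(String fileId, Map<String, Long> stats) {
        rows.computeIfAbsent(fileId, k -> new HashMap<>()).putAll(stats);
    }

    /** Sums one named statistic over every key that starts with the given prefix. */
    long aggregate(String keyPrefix, String statName) {
        long sum = 0;
        for (Map.Entry<String, Map<String, Long>> e : rows.entrySet()) {
            if (e.getKey().startsWith(keyPrefix)) {
                sum += e.getValue().getOrDefault(statName, 0L);
            }
        }
        return sum;
    }
}
```

Because each row carries a map of named values, adding a further statistic only needs a new map entry at publish time and a new name at aggregation time; the storage shape stays the same.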
[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038757#comment-13038757 ] Tomasz Nykiel commented on HIVE-2144: - Thanks :) reduce workload generated by JDBCStatsPublisher --- Key: HIVE-2144 URL: https://issues.apache.org/jira/browse/HIVE-2144 Project: Hive Issue Type: Improvement Reporter: Ning Zhang Assignee: Tomasz Nykiel Fix For: 0.8.0 Attachments: HIVE-2144.1.patch, HIVE-2144.2.patch, HIVE-2144.patch In JDBCStatsPublisher, we first try a SELECT query to see if the specific ID was inserted by another task (most likely a speculative or previously failed task). Depending on whether the ID is there, an INSERT or UPDATE query is issued. So there are basically two queries per row inserted into the intermediate stats table. This workload could be halved if we insert the row anyway (it is very rare that IDs are duplicated) and use a different SQL query in the aggregation phase to dedup the ID (e.g., using group-by and max()). The benefit is that, even though the aggregation query is more expensive, it is only run once per query.
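The "insert anyway, dedup at aggregation" idea from the issue description can be sketched without a database. The code below mimics a group-by on ID with max() followed by a sum; the Row type, the prefix matching, and the class name are assumptions made for illustration and are not Hive code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of deduplicating at aggregation time; not Hive code.
final class DedupAggregation {
    /** One row of the intermediate stats table; duplicate IDs are allowed. */
    static final class Row {
        final String id;
        final long rowCount;

        Row(String id, long rowCount) {
            this.id = id;
            this.rowCount = rowCount;
        }
    }

    /** Keeps the largest ROW_COUNT per ID (group-by plus max), then sums the survivors. */
    static long aggregate(List<Row> rows, String idPrefix) {
        Map<String, Long> maxPerId = new HashMap<>();
        for (Row r : rows) {
            if (r.id.startsWith(idPrefix)) {
                maxPerId.merge(r.id, r.rowCount, Long::max);
            }
        }
        long sum = 0;
        for (long v : maxPerId.values()) {
            sum += v;
        }
        return sum;
    }
}
```

The publish path then needs only one INSERT per row, and the extra deduplication cost is paid once in the aggregation step.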
[jira] [Updated] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Nykiel updated HIVE-2144: Attachment: HIVE-2144.2.patch
Re: Review Request: HIVE-2144 reduce workload generated by JDBCStatsPublisher
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/#review702 ---
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java https://reviews.apache.org/r/765/#comment1402 Yes. That's correct.
trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java https://reviews.apache.org/r/765/#comment1403 ok.
trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java https://reviews.apache.org/r/765/#comment1404 I will amend the test cases to aggregate over prefixes. I will also add one simple test case to aggregate over exact match.
trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java https://reviews.apache.org/r/765/#comment1405 The original value inserted in line 120 is 200. Neither 100 nor 150 should change the values.
trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java https://reviews.apache.org/r/765/#comment1406 As discussed before, I will improve the test cases to aggregate over prefixes.
- Tomasz
On 2011-05-19 23:14:26, Tomasz Nykiel wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/ --- (Updated 2011-05-19 23:14:26) Review request for hive.
[jira] [Updated] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Nykiel updated HIVE-2144: Attachment: HIVE-2144.1.patch Fixed after revision 1.
[jira] [Updated] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Nykiel updated HIVE-2144: Attachment: HIVE-2144.patch
Review Request: HIVE-2144 reduce workload generated by JDBCStatsPublisher
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/
---

Review request for hive.

Summary
-------

Currently, the JDBCStatsPublisher executes two queries per inserted row of statistics: a first query to check whether the ID was already inserted by another task, and a second query to insert a new row or update the existing one. The update case occurs very rarely, since duplicates most likely originate from speculative or failed tasks.

Currently the schema of the stat table is the following:

  PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT )

and it does not have any integrity constraints declared. We amend it to:

  PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT )

HIVE-2144 improves performance by greedily performing the INSERT statement. Instead of executing two queries per row inserted, we execute one INSERT query. In the case of a primary key constraint violation, we perform a single UPDATE query. The UPDATE query needs to check whether the currently inserted stats are newer than the ones already in the table.

This addresses bug HIVE-2144.
    https://issues.apache.org/jira/browse/HIVE-2144

Diffs
-----

  trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125140
  trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION

Diff: https://reviews.apache.org/r/765/diff

Testing
-------

TestStatsPublisher JUnit test:
- basic behaviour
- multiple updates
- cleanup of the statistics table after aggregation

Standalone testing on the cluster:
- insert/analyze queries over non-partitioned/partitioned tables

NOTE: For correct behaviour, either the primary key index needs to be created, or the PARTITION_STAT_TABLE table needs to be dropped, which triggers creation of the table with the constraint declared.

Thanks,
Tomasz
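The insert-first flow described in the summary can be sketched roughly as below. This is not the patch itself: it uses Python's sqlite3 in place of JDBC, the function name `publish` is hypothetical, and "newer" is interpreted as "larger row count", which is an assumption. The guard is also written in its simplest form (`ROW_COUNT < ?`) rather than in whatever form the patch uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PARTITION_STAT_TABLE (ID VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT)"
)


def publish(conn, row_id, row_count):
    """Publish one stats row; return the number of SQL statements issued."""
    try:
        # Common case: the ID is new, so a single INSERT is enough.
        conn.execute(
            "INSERT INTO PARTITION_STAT_TABLE (ID, ROW_COUNT) VALUES (?, ?)",
            (row_id, row_count),
        )
        return 1
    except sqlite3.IntegrityError:
        # Rare case: primary key violation, the ID was already published.
        # Overwrite only if the new count is larger (assumed meaning of "newer").
        conn.execute(
            "UPDATE PARTITION_STAT_TABLE SET ROW_COUNT = ? "
            "WHERE ID = ? AND ROW_COUNT < ?",
            (row_count, row_id, row_count),
        )
        return 2


# 'task0' is a made-up ID: first publish, then a larger and a smaller re-publish.
calls = [publish(conn, "task0", 100), publish(conn, "task0", 120), publish(conn, "task0", 90)]
stored = conn.execute(
    "SELECT ROW_COUNT FROM PARTITION_STAT_TABLE WHERE ID = 'task0'"
).fetchone()[0]
print(calls, stored)
```

With this shape, a fresh ID costs one statement and a duplicate costs two, and no SELECT is issued up front.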
[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035493#comment-13035493 ] Tomasz Nykiel commented on HIVE-2144: -

Currently the schema of the stat table is the following: PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and it does not have any integrity constraints declared. We can amend it to: PARTITION_STAT_TABLE ( ID VARCHAR(255) UNIQUE, ROW_COUNT BIGINT ).

Then, instead of executing two queries per row inserted, we can execute one INSERT query, as we do currently. When the integrity constraint is violated (via the unique index), which surfaces as an exception that can be caught, we perform a single UPDATE query. The UPDATE query needs to check whether the currently inserted stats are newer than the ones already in the table:

    UPDATE PARTITION_STAT_TBL
    SET ROW_COUNT = new_value
    WHERE ID = rowID
      AND (0) new_value > (1) (SELECT TEMP.ROW_COUNT FROM (2) (SELECT ROW_COUNT FROM PARTITION_STAT_TBL WHERE ID = rowID) TEMP )

--(0) is a condition that checks whether the newly inserted value is greater than the one we already have.
--(1) and (2) are a work-around for MySQL, which does not allow a subquery to refer to the table that occurs in the update statement. Here, we basically materialize the value that we need for the comparison.
--(1) should theoretically have (LIMIT 1) to choose exactly one tuple; however, Derby does not support it, and by the unique constraint and the fact that the insert failed, there exists exactly one tuple matching the ID predicate.

To summarize, for non-existing rows, only one INSERT query will be executed instead of two. For existing rows, which seem to occur very infrequently, two queries will be executed instead of three.
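To make the guarded UPDATE concrete, here is a small sketch that runs the same statement shape against SQLite through Python's sqlite3. SQLite is only a stand-in for Derby/MySQL, so it does not demonstrate the MySQL restriction itself. The derived-table alias is renamed from TEMP to PREV to stay clear of SQLite keywords, and the helper name and sample values are hypothetical.

```python
import sqlite3

# Same shape as the UPDATE in the comment: the inner SELECT materializes the
# current value as a derived table, and the outer condition compares against it.
GUARDED_UPDATE = """
UPDATE PARTITION_STAT_TBL
SET ROW_COUNT = :new_value
WHERE ID = :row_id
  AND :new_value > (SELECT PREV.ROW_COUNT
                    FROM (SELECT ROW_COUNT
                          FROM PARTITION_STAT_TBL
                          WHERE ID = :row_id) PREV)
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PARTITION_STAT_TBL (ID VARCHAR(255) UNIQUE, ROW_COUNT BIGINT)")
conn.execute("INSERT INTO PARTITION_STAT_TBL VALUES ('task0', 100)")


def guarded_update(conn, row_id, new_value):
    """Run the guarded UPDATE and return how many rows it changed."""
    cur = conn.execute(GUARDED_UPDATE, {"row_id": row_id, "new_value": new_value})
    return cur.rowcount


smaller = guarded_update(conn, "task0", 90)   # not greater than 100: no change
larger = guarded_update(conn, "task0", 150)   # greater than 100: row is updated
stored = conn.execute(
    "SELECT ROW_COUNT FROM PARTITION_STAT_TBL WHERE ID = 'task0'"
).fetchone()[0]
print(smaller, larger, stored)
```

Because ID is unique, the inner SELECT returns exactly one row for an existing ID, which is why the scalar comparison works without LIMIT 1.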
[jira] [Commented] (HIVE-2144) reduce workload generated by JDBCStatsPublisher
[ https://issues.apache.org/jira/browse/HIVE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035634#comment-13035634 ] Tomasz Nykiel commented on HIVE-2144: - Yes, I agree. There are some subtle differences between UNIQUE and PK in Derby and MySQL (e.g., in MySQL the unique index allows null values, and in Derby it does not). So in general, a PK constraint will be more suitable. CREATE TABLE PARTITION_STAT_TBL ( ID VARCHAR(255) PRIMARY KEY, ROW_COUNT BIGINT ) works for both Derby and MySQL. After a quick check, it seems that it is supported by Oracle/MSSQL as well.