[ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910813#action_12910813 ]
HBase Review Board commented on HIVE-1361: ------------------------------------------ Message from: "John Sichi" <jsi...@facebook.com> ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/862/#review1262 ----------------------------------------------------------- trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java <http://review.cloudera.org/r/862/#comment4256> Hive conf additions should be accompanied by new entries in conf/hive-default.xml for documentation purposes. trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java <http://review.cloudera.org/r/862/#comment4257> Using e.toString() alone here may lose some of the diagnostics. LOG.error has an overload which takes a Throwable parameter; use that to make sure that all the diagnostics (e.g. nested throwables) are logged. trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java <http://review.cloudera.org/r/862/#comment4258> As a performance followup, we probably want to use delete(List<Delete>) for batching. trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java <http://review.cloudera.org/r/862/#comment4264> See perf comment above. Also, this scan+delete code could be shared to avoid duplication. trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java <http://review.cloudera.org/r/862/#comment4266> See comments in HBaseStatsAggregator regarding diagnostics. trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java <http://review.cloudera.org/r/862/#comment4267> Another perf note: for batch update, we can use setAutoFlush(false) and then flushCommits in closeConnection. trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java <http://review.cloudera.org/r/862/#comment4268> Probably need a followup to make this configurable. trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java <http://review.cloudera.org/r/862/#comment4269> What is this code for? trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java <http://review.cloudera.org/r/862/#comment4294> Isn't this going to throw an NPE if aggregateStats returns null after handling an error? trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java <http://review.cloudera.org/r/862/#comment4295> s/retur/return/ trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java <http://review.cloudera.org/r/862/#comment4274> typo: MapRedTaks trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java <http://review.cloudera.org/r/862/#comment4278> Some more overview (or just link to updated wiki doc) would be good here since the methods below reference things like temporary stats and aggregation without really explaining them. Also: I think having the publisher/aggregator implementations catch errors themselves is confusing. It would be cleaner to let them propagate the exceptions, and instead catch+suppress+warn in the calling code (under control of a strictness config param). trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java <http://review.cloudera.org/r/862/#comment4279> @param, @return? trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java <http://review.cloudera.org/r/862/#comment4277> Use correct Javadoc @param syntax, and add @return. trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java <http://review.cloudera.org/r/862/#comment4280> Use @param, @return trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsFactory.java <http://review.cloudera.org/r/862/#comment4282> For other plugin-loading code, we use JavaUtils.getClassLoader(). Should probably do the same here? trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsFactory.java <http://review.cloudera.org/r/862/#comment4281> Don't use printStackTrace; log the exception instead. trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java <http://review.cloudera.org/r/862/#comment4283> See comments on StatsAggregator regarding Javadoc. Also, s/statics/statistics/ trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java <http://review.cloudera.org/r/862/#comment4284> I don't think this warrants four exclamation marks. trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java <http://review.cloudera.org/r/862/#comment4288> Is it worth using a prepared statement here? Also, depending on the transaction isolation level, concurrent update attempts could result in spurious error logging, although the final result should still be valid. Using a positioned update can give more control over this (and allow the number of SQL statements executed to be reduced), but not all JDBC drivers support it. trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java <http://review.cloudera.org/r/862/#comment4286> This is superfluous. trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java <http://review.cloudera.org/r/862/#comment4292> I think this actually means "explicitly shut down the database" (which is different from closing the connection). - John > table/partition level statistics > -------------------------------- > > Key: HIVE-1361 > URL: https://issues.apache.org/jira/browse/HIVE-1361 > Project: Hadoop Hive > Issue Type: Sub-task > Components: Query Processor > Reporter: Ning Zhang > Assignee: Ahmed M Aly > Fix For: 0.7.0 > > Attachments: HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch > > > At the first step, we gather table-level stats for non-partitioned table and > partition-level stats for partitioned table. Future work could extend the > table level stats to partitioned table as well. > There are 3 major milestones in this subtask: > 1) extend the insert statement to gather table/partition level stats > on-the-fly. > 2) extend metastore API to support storing and retrieving stats for a > particular table/partition. > 3) add an ANALYZE TABLE [PARTITION] statement in Hive QL to gather stats for > existing tables/partitions. > The proposed stats are: > Partition-level stats: > - number of rows > - total size in bytes > - number of files > - max, min, average row sizes > - max, min, average file sizes > Table-level stats in addition to partition level stats: > - number of partitions -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.