[jira] Commented: (HIVE-1361) table/partition level statistics

HBase Review Board (JIRA) Fri, 17 Sep 2010 15:10:57 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910813#action_12910813
 ]


HBase Review Board commented on HIVE-1361:
------------------------------------------

Message from: "John Sichi" <jsi...@facebook.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/862/#review1262
-----------------------------------------------------------



trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<http://review.cloudera.org/r/862/#comment4256>

    Hive conf additions should be accompanied by new entries in 
conf/hive-default.xml for documentation purposes.



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
<http://review.cloudera.org/r/862/#comment4257>

    Using e.toString() alone here may lose some of the diagnostics.
    
    LOG.error has an overload which takes a Throwable parameter; use that to 
make sure that all the diagnostics (e.g. nested throwables) are logged.



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
<http://review.cloudera.org/r/862/#comment4258>

    As a performance followup, we probably want to use delete(List<Delete>) for 
batching.
    



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
<http://review.cloudera.org/r/862/#comment4264>

    See perf comment above.  Also, this scan+delete code could be shared to 
avoid duplication.
    



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java
<http://review.cloudera.org/r/862/#comment4266>

    See comments in HBaseStatsAggregator regarding diagnostics.
    



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java
<http://review.cloudera.org/r/862/#comment4267>

    Another perf note:  for batch update, we can use setAutoFlush(false) and 
then flushCommits in closeConnection.



trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java
<http://review.cloudera.org/r/862/#comment4268>

    Probably need a followup to make this configurable.



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
<http://review.cloudera.org/r/862/#comment4269>

    What is this code for?



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<http://review.cloudera.org/r/862/#comment4294>

    Isn't this going to throw an NPE if aggregateStats returns null after 
handling an error?



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<http://review.cloudera.org/r/862/#comment4295>

    s/retur/return/



trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java
<http://review.cloudera.org/r/862/#comment4274>

    typo:  MapRedTaks



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java
<http://review.cloudera.org/r/862/#comment4278>

    Some more overview (or just link to updated wiki doc) would be good here 
since the methods below reference things like temporary stats and aggregation 
without really explaining them.
    
    Also:  I think having the publisher/aggregator implementations catch errors 
themselves is confusing.  It would be cleaner to let them propagate the 
exceptions, and instead catch+suppress+warn in the calling code (under control 
of a strictness config param).
    



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java
<http://review.cloudera.org/r/862/#comment4279>

    @param, @return?



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java
<http://review.cloudera.org/r/862/#comment4277>

    Use correct Javadoc @param syntax, and add @return.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java
<http://review.cloudera.org/r/862/#comment4280>

    Use @param, @return



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsFactory.java
<http://review.cloudera.org/r/862/#comment4282>

    For other plugin-loading code, we use JavaUtils.getClassLoader().  Should 
probably do the same here?
    



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsFactory.java
<http://review.cloudera.org/r/862/#comment4281>

    Don't use printStackTrace; log the exception instead.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java
<http://review.cloudera.org/r/862/#comment4283>

    See comments on StatsAggregator regarding Javadoc.  Also, 
s/statics/statistics/



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java
<http://review.cloudera.org/r/862/#comment4284>

    I don't think this warrants four exclamation marks.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<http://review.cloudera.org/r/862/#comment4288>

    Is it worth using a prepared statement here?  
    
    Also, depending on the transaction isolation level, concurrent update 
attempts could result in spurious error logging, although the final result 
should still be valid.
    
    Using a positioned update can give more control over this (and allow the 
number of SQL statements executed to be reduced), but not all JDBC drivers 
support it.
    



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<http://review.cloudera.org/r/862/#comment4286>

    This is superfluous.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<http://review.cloudera.org/r/862/#comment4292>

    I think this actually means "explicitly shut down the database" (which is 
different from closing the connection).
    


- John





> table/partition level statistics
> --------------------------------
>
>                 Key: HIVE-1361
>                 URL: https://issues.apache.org/jira/browse/HIVE-1361
>             Project: Hadoop Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ahmed M Aly
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch
>
>
> At the first step, we gather table-level stats for non-partitioned table and 
> partition-level stats for partitioned table. Future work could extend the 
> table level stats to partitioned table as well. 
> There are 3 major milestones in this subtask: 
>  1) extend the insert statement to gather table/partition level stats 
> on-the-fly.
>  2) extend metastore API to support storing and retrieving stats for a 
> particular table/partition. 
>  3) add an ANALYZE TABLE [PARTITION] statement in Hive QL to gather stats for 
> existing tables/partitions. 
> The proposed stats are:
> Partition-level stats: 
>   - number of rows
>   - total size in bytes
>   - number of files
>   - max, min, average row sizes
>   - max, min, average file sizes
> Table-level stats in addition to partition level stats:
>   - number of partitions

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1361) table/partition level statistics

Reply via email to