-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6878/#review11502
-----------------------------------------------------------


Made it most of the way through the first page of the review. More comments to 
follow.


data/files/binary.txt
<https://reviews.apache.org/r/6878/#comment24683>

    This doesn't look like binary data to me?



data/files/bool.txt
<https://reviews.apache.org/r/6878/#comment24685>

    Let's try to combine this into a single table/file that contains all 
primitive types. I already have something like this that I've been using for 
the HiveServer2 work. Let's talk offline and see if you can reuse it here.



metastore/if/hive_metastore.thrift
<https://reviews.apache.org/r/6878/#comment24662>

    Please rearrange the order of these elements, e.g.
    
    dbName
    tableName
    partName
    colName
    colType
    ...
    
    Also, please make each element either "required" or "optional".



metastore/if/hive_metastore.thrift
<https://reviews.apache.org/r/6878/#comment24676>

    Please add comments for each of these fields. 
    
    Also, Thrift supports a Union type, and I think it would probably be good 
to split the stats fields into a union of separate column stats structs, e.g.:
    
    struct BooleanColumnStatisticsData {
      numNulls
      numTrues
      numFalses
    }
    
    struct IntColumnStatisticsData {
      numNulls
      lowValue
      highValue
      ...
    }
    
    struct StringColumnStatisticsData {
      numNulls
      avgColLen
      maxColLen
    }
    
    union ColumnStatisticsData {
      BooleanColumnStatisticsData
      ...
      IntColumnStatisticsData
      ...
      StringColumnStatisticsData
    }
    
    struct ColumnStatisticsDescriptor {
      dbName
      tableName
      partName
      ColumnStatisticsData
    }



metastore/if/hive_metastore.thrift
<https://reviews.apache.org/r/6878/#comment24677>

    Since support for histograms will be added in Pt2 I think it makes sense to 
remove this for now. If you want to keep it in, then I think it makes sense to 
split these into optional ScalarStats and HistogramStats substructs.



metastore/if/hive_metastore.thrift
<https://reviews.apache.org/r/6878/#comment24663>

    isTblLevel looks redundant. If partName is not set, isn't it implicit that 
isTblLevel==true?



metastore/if/hive_metastore.thrift
<https://reviews.apache.org/r/6878/#comment24678>

    Supporting bulk update semantics makes things a lot trickier. For example, 
this is going to throw an InvalidObjectException if one of the ColumnStatistics 
structs references a bad DB/Table/Partition, but will the caller be able to 
figure out which record is bad? I think for Pt1 of this project it would be a 
lot safer to only support the non-bulk version.
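
    To make the failure semantics concrete, here is a minimal sketch of the 
non-bulk shape, where one record is validated per call so an error always 
identifies the bad reference. The class and method names are illustrative, 
not the actual metastore API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the metastore. With one stats
// record per update call, a validation failure unambiguously points at
// the offending DB/Table/Partition reference.
public class StatsStore {
  private final Set<String> knownTables = new HashSet<>();
  private final Map<String, Long> numNullsByColumn = new HashMap<>();

  public void createTable(String table) {
    knownTables.add(table);
  }

  // Non-bulk update: exactly one record is validated and persisted.
  public void updateColumnStatistics(String table, String column, long numNulls) {
    if (!knownTables.contains(table)) {
      // The caller knows precisely which record failed.
      throw new IllegalArgumentException("Unknown table: " + table);
    }
    numNullsByColumn.put(table + "." + column, numNulls);
  }

  public Long getNumNulls(String table, String column) {
    return numNullsByColumn.get(table + "." + column);
  }
}
```

    With a bulk list, the same exception would leave the caller guessing which 
element of the batch was rejected.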



metastore/if/hive_metastore.thrift
<https://reviews.apache.org/r/6878/#comment24679>

    Maybe change the name to update_column_statistics so that it aligns more 
closely with delete_column_statistics (because most people think 
CREATE/UPDATE/DELETE, not CREATE/WRITE/DELETE).



metastore/if/hive_metastore.thrift
<https://reviews.apache.org/r/6878/#comment24680>

    Please add some comments explaining how each of these RPCs behaves, e.g. 
what happens if I call delete_column_statistics_by_table on a partition table 
that contains partition-level column statistics?



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
<https://reviews.apache.org/r/6878/#comment24686>

    Formatting.



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
<https://reviews.apache.org/r/6878/#comment24687>

    Formatting.



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
<https://reviews.apache.org/r/6878/#comment24688>

    Why was it necessary to make these changes to this method, 
drop_partition_common, etc?



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
<https://reviews.apache.org/r/6878/#comment24689>

    It would be more accurate to call this lowerCasePartitionColumnNames



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
<https://reviews.apache.org/r/6878/#comment24690>

    Replace count with a boolean isFirst variable.
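
    For example, a sketch of the pattern (illustrative code, not the actual 
method body):

```java
import java.util.List;

// Hypothetical sketch: a boolean isFirst flag reads more clearly than a
// running count that is only ever compared against zero.
public class PartNameBuilder {
  public static String buildPartName(List<String> cols, List<String> vals) {
    StringBuilder sb = new StringBuilder();
    boolean isFirst = true;
    for (int i = 0; i < cols.size(); i++) {
      if (!isFirst) {
        sb.append('/');  // separator before every element but the first
      }
      sb.append(cols.get(i)).append('=').append(vals.get(i));
      isFirst = false;
    }
    return sb.toString();
  }
}
```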



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
<https://reviews.apache.org/r/6878/#comment24691>

    Formatting.



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
<https://reviews.apache.org/r/6878/#comment24692>

    Formatting.



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
<https://reviews.apache.org/r/6878/#comment24694>

    Why is it required that all of the records in the batch update have to be 
either table or partition level stats records? If there's a good reason for 
enforcing this restriction then I think you should have two different write() 
RPCs (one for table-level and one for partition-level) instead of inferring 
this based on the type of the first record in the batch list.



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
<https://reviews.apache.org/r/6878/#comment24693>

    This leaves the caller in a bad position since she doesn't know which one 
of the statsObj records is bad. The easiest way to fix this is to not support 
bulk updates (at least for now).



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
<https://reviews.apache.org/r/6878/#comment24695>

    Combine this into one line? For example:
    
    startFunction("write_column_statistics: db=" ...
        + (isTblLevel ? "" : "partition=" + partName)
        + ...);



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
<https://reviews.apache.org/r/6878/#comment24696>

    Instead of passing in null I think it would be cleaner to split this into 
two functions: writeTableColumnStatistics and writePartitionColumnStatistics.
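
    A sketch of what the split might look like (hypothetical names and bodies, 
not the actual HiveMetaStore methods):

```java
// Hypothetical sketch: two explicit entry points delegate to a shared
// private helper, so no caller ever passes null for partName directly.
public class StatsWriter {
  // Table-level statistics: no partition involved.
  public String writeTableColumnStatistics(String db, String table) {
    return writeColumnStatistics(db, table, null);
  }

  // Partition-level statistics: partName is guaranteed non-null.
  public String writePartitionColumnStatistics(String db, String table, String partName) {
    return writeColumnStatistics(db, table, partName);
  }

  // Shared implementation; the null handling lives in exactly one place.
  private String writeColumnStatistics(String db, String table, String partName) {
    String key = db + "." + table;
    if (partName != null) {
      key = key + "/" + partName;
    }
    return key;  // stand-in for the real persistence logic
  }
}
```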



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
<https://reviews.apache.org/r/6878/#comment24697>

    In my opinion the problem with leaving lots of blank lines in the bodies of 
methods is that it makes it harder to spot the beginning and end of functions, 
which are typically signalled by two lines of whitespace. The convention in 
this file is to not leave lots of blank lines in method bodies, and it's kind 
of distracting to see code that doesn't follow that convention.



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
<https://reviews.apache.org/r/6878/#comment24704>

    This javadoc should go in IMetaStoreClient, and in place of it here we 
should use "/** {@inheritDoc} */"
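
    The pattern looks like this (illustrative interface and class, not the 
actual IMetaStoreClient):

```java
// Hypothetical sketch of the Javadoc placement: full documentation on the
// interface method, {@inheritDoc} on the implementation.
interface Greeter {
  /**
   * Returns a greeting for the given name.
   *
   * @param name the name to greet
   * @return the greeting string
   */
  String greet(String name);
}

public class SimpleGreeter implements Greeter {
  /** {@inheritDoc} */
  @Override
  public String greet(String name) {
    return "Hello, " + name;
  }
}
```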



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
<https://reviews.apache.org/r/6878/#comment24700>

    Probably split this up into updateTableColumnStatistics and 
updatePartitionColumnStatistics?



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
<https://reviews.apache.org/r/6878/#comment24699>

    Please either change this method's name to 
deleteColumnStatisticsByPartition, or change getColumnStatisticsByPartition to 
getPartitionColumnStatistics (the latter option sounds better to me).



metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
<https://reviews.apache.org/r/6878/#comment24698>

    Formatting.



ql/build.xml
<https://reviews.apache.org/r/6878/#comment24682>

    * Add "javolution.version=5.5.1" to ivy/libraries.properties
    * Reference "javolution-${javolution.version}.jar" here.


- Carl Steinbach


On Aug. 31, 2012, 10:04 p.m., Shreepadma Venugopalan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6878/
> -----------------------------------------------------------
> 
> (Updated Aug. 31, 2012, 10:04 p.m.)
> 
> 
> Review request for hive and Carl Steinbach.
> 
> 
> Description
> -------
> 
> This patch implements version 1 of the column statistics project in Hive. It 
> adds support for computing and persisting statistical summary of column 
> values in Hive Tables and Partitions. In order to support column statistics 
> in Hive, this patch does the following,
> 
> * Adds a new compute stats UDAF to compute scalar statistics for all 
> primitive Hive data types. In version 1 of the project, we support the 
> following scalar statistics on primitive types - estimate of number of 
> distinct values, number of null values, number of trues/falses for boolean 
> typed columns, max and avg length for string and binary typed columns, max 
> and min value for long and double typed columns. Note that version 1 of the 
> column stats project includes support for column statistics both at the table 
> and partition level.
> 
> * Adds Metastore schema tables to persist the newly added statistics both at 
> table and partition level.
> * Adds Metastore Thrift API to persist, retrieve and delete column statistics 
> at both table and partition level. 
> Please refer to the following wiki link for the details of the schema and the 
> Thrift API changes - 
> https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
> 
> * Extends the analyze table compute statistics statement to trigger 
> statistics computation and persistence for one or more columns. Please note 
> that statistics for multiple columns is computed through a single scan of the 
> table data. Please refer to the following wiki link for the syntax changes - 
> https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
> 
> One thing missing from the patch at this point is the metastore upgrade 
> scripts for MySQL/Derby/Postgres/Oracle. I'm waiting for the review to 
> finalize the metastore schema changes before I go ahead and add the upgrade 
> scripts.
> 
> In a follow on patch, as part of version 2 of the column statistics project, 
> we will add support for computing, persisting and retrieving histograms on 
> long and double typed column values.
> 
> Generated Thrift files have been removed for easier viewing. The JIRA page 
> has the patch with the generated Thrift files.
> 
> 
> This addresses bug HIVE-1362.
>     https://issues.apache.org/jira/browse/HIVE-1362
> 
> 
> Diffs
> -----
> 
>   data/files/UserVisits.dat PRE-CREATION 
>   data/files/binary.txt PRE-CREATION 
>   data/files/bool.txt PRE-CREATION 
>   data/files/double.txt PRE-CREATION 
>   data/files/employee.dat PRE-CREATION 
>   data/files/employee2.dat PRE-CREATION 
>   data/files/int.txt PRE-CREATION 
>   metastore/if/hive_metastore.thrift c4c6640 
>   metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 
> 5e4e4a4 
>   
> metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java 
> 4367f7a 
>   metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java 
> f1a6c8b 
>   metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 
> 045b550 
>   metastore/src/java/org/apache/hadoop/hive/metastore/RawStore.java 11c7f23 
>   metastore/src/java/org/apache/hadoop/hive/metastore/Warehouse.java 77d1caa 
>   
> metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java
>  PRE-CREATION 
>   
> metastore/src/model/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java
>  PRE-CREATION 
>   metastore/src/model/package.jdo 38ce6d5 
>   
> metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java
>  4e152f4 
>   metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java 
> 369eebe 
>   ql/build.xml 5de3f78 
>   ql/if/queryplan.thrift 05fbf58 
>   ql/ivy.xml 40f370b 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnStatsTask.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 425900d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Task.java 4446952 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/TaskFactory.java 79b87f1 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 271554c 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java
>  0b55ac4 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 
> 3874862 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java 
> b4ee673 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ExplainSemanticAnalyzer.java 
> e75a075 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ExportSemanticAnalyzer.java 
> 61bc7fd 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/FunctionSemanticAnalyzer.java 
> 6024dd4 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g e969fbe 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java 
> 09ef969 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/LoadSemanticAnalyzer.java 
> 22fa20f 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java a0ccbe6 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java 4e2b88a 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 97d1f09 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java 
> ad1a14c 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/StatsSemanticAnalyzer.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsDesc.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsWork.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/HiveOperation.java cb54753 
>   
> ql/src/java/org/apache/hadoop/hive/ql/udf/generic/DoubleNumDistinctValueEstimator.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/udf/generic/LongNumDistinctValueEstimator.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/udf/generic/StringNumDistinctValueEstimator.java
>  PRE-CREATION 
>   ql/src/test/queries/clientpositive/columnstats_partlvl.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/columnstats_tbllvl.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/compute_stats_binary.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/compute_stats_boolean.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/compute_stats_double.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/compute_stats_long.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/compute_stats_string.q PRE-CREATION 
>   ql/src/test/results/clientpositive/columnstats_partlvl.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/columnstats_tbllvl.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/compute_stats_binary.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/compute_stats_boolean.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/compute_stats_double.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/compute_stats_long.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/compute_stats_string.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/show_functions.q.out 02f6a94 
>   
> serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java
>  5430814 
> 
> Diff: https://reviews.apache.org/r/6878/diff/
> 
> 
> Testing
> -------
> 
> All the existing hive tests pass. Additionally this patch adds the following 
> unit tests,
> 
> * Tests to TestHiveMetaStore.java to test the Metastore schema and Thrift API 
> changes,
> * Tests to exercise compute_stats UDAF for all primitive types,
> * End to end test both at table and partition level for computing stats on 
> multiple columns. Note that these tests use the extended syntax of the 
> analyze command.
> 
> 
> Thanks,
> 
> Shreepadma Venugopalan
> 
>
