----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/6878/#review11502 -----------------------------------------------------------
Made it most of the way through the first page of the review. More comments to follow. data/files/binary.txt <https://reviews.apache.org/r/6878/#comment24683> This doesn't look like binary data to me? data/files/bool.txt <https://reviews.apache.org/r/6878/#comment24685> Let's try to combine this into a single table/file that contains all primitive types. I already have something like this that I've been using for the HiveServer2 work. Let's talk offline and see if you can reuse it here. metastore/if/hive_metastore.thrift <https://reviews.apache.org/r/6878/#comment24662> Please rearrange the order of these elements, e.g. dbName tableName partName colName colType ... Also, please make each element either "required" or "optional". metastore/if/hive_metastore.thrift <https://reviews.apache.org/r/6878/#comment24676> Please add comments for each of these fields. Also, Thrift supports a Union type, and I think it would probably be good to split the stats fields into a union of separate column stats structs, e.g: struct BooleanColumnStatisticsData { numNulls numTrues numFalses } struct IntColumnStatisticsData { numNulls lowValue maxValue .. } struct StringColumnStatisticsData { numNulls avgColLen maxColLen } union ColumnStatisticsData { BooleanColumnStatistics .. IntColumnStatistics .. StringColumnStatistics } struct ColumnStatisticsDescriptor { dbName tableName partName ColumnStatisticsData } metastore/if/hive_metastore.thrift <https://reviews.apache.org/r/6878/#comment24677> Since support for histograms will be added in Pt2 I think it makes sense to remove this for now. If you want to keep it in, then I think it makes sense to split these into optional ScalarStats and HistogramStats substructs. metastore/if/hive_metastore.thrift <https://reviews.apache.org/r/6878/#comment24663> isTblLevel looks redundant. If partName is not set, isn't it implicit that isTblLevel==true? metastore/if/hive_metastore.thrift <https://reviews.apache.org/r/6878/#comment24678> Supporting bulk update semantics makes things a lot trickier. For example, this is going to throw an InvalidObjectException if one of the ColumnStatistics structs references a bad DB/Table/Partition, but will the caller be able to figure out which record is bad? I think for Pt1 of this project it would be a lot safer to only support the non-bulk version. metastore/if/hive_metastore.thrift <https://reviews.apache.org/r/6878/#comment24679> Maybe change the name to update_column_statistics so that it aligns more closely with delete_column_statistics (because most people think CREATE/UPDATE/DELETE, not CREATE/WRITE/DELETE. metastore/if/hive_metastore.thrift <https://reviews.apache.org/r/6878/#comment24680> Please add some comments explaining how each of these RPCs behaves, e.g. what happens if I call delete_column_statistics_by_table on a partition table that contains partition-level column statistics? metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java <https://reviews.apache.org/r/6878/#comment24686> Formatting. metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java <https://reviews.apache.org/r/6878/#comment24687> Formatting. metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java <https://reviews.apache.org/r/6878/#comment24688> Why was it necessary to make these changes to this method, drop_partition_common, etc? metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java <https://reviews.apache.org/r/6878/#comment24689> It would be more accurate to call this lowerCasePartitionColumnNames metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java <https://reviews.apache.org/r/6878/#comment24690> Replace count with a boolean isFirst variable. metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java <https://reviews.apache.org/r/6878/#comment24691> Formatting. metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java <https://reviews.apache.org/r/6878/#comment24692> Formatting. metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java <https://reviews.apache.org/r/6878/#comment24694> Why is it required that all of the records in the batch update have to be either table or partition level stats records? If there's a good reason for enforcing this restriction then I think you should have two different write() RPCs (one for table-level and one for partition-level) instead of inferring this based on the type of the first record in the batch list. metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java <https://reviews.apache.org/r/6878/#comment24693> This leaves the caller in a bad position since she doesn't know which one of the statsObj records is bad. The easiest way to fix this is to not support bulk updates (at least for now). metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java <https://reviews.apache.org/r/6878/#comment24695> Combine this into one line? For example startFunction("write_column_statistics: db=" ... + (isTblLevel ? "" : "partition=" + partName) + ....) metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java <https://reviews.apache.org/r/6878/#comment24696> Instead of passing in null I think it would be cleaner to split this into two functions: writeTableColumnStatistics and writePartitionColumnStatistics. metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java <https://reviews.apache.org/r/6878/#comment24697> In my opinion the problem with leaving lots of blank lines in the bodies of methods is that it makes it harder to spot the beginning and end of functions which are typically signalled by two lines of whitespace. The convention in this file is to not leave lots of blank lines in method bodies, and it's kind of distracting to see code that doesn't follow that convention. metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java <https://reviews.apache.org/r/6878/#comment24704> This javadoc should go in IMetaStoreClient, and in place of it here we should use "/** {@inheritDoc} */" metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java <https://reviews.apache.org/r/6878/#comment24700> Probably split this up into updateTableColumnStatistics and updatePartitionColumnStatistics? metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java <https://reviews.apache.org/r/6878/#comment24699> Please either change this method's name to deleteColumnStatisticsByPartition, or change getColumnStatisticsByPartition to getPartitionColumnStatistics (the latter option sounds better to me). metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java <https://reviews.apache.org/r/6878/#comment24698> Formatting. ql/build.xml <https://reviews.apache.org/r/6878/#comment24682> * Add "javolution.version=5.5.1" to ivy/libraries.properties * Reference javolution-${javolution.version}.jar" here. - Carl Steinbach On Aug. 31, 2012, 10:04 p.m., Shreepadma Venugopalan wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/6878/ > ----------------------------------------------------------- > > (Updated Aug. 31, 2012, 10:04 p.m.) > > > Review request for hive and Carl Steinbach. > > > Description > ------- > > This patch implements version 1 of the column statistics project in Hive. It > adds support for computing and persisting statistical summary of column > values in Hive Tables and Partitions. In order to support column statistics > in Hive, this patch does the following, > > * Adds a new compute stats UDAF to compute scalar statistics for all > primitive Hive data types. In version 1 of the project, we support the > following scalar statistics on primitive types - estimate of number of > distinct values, number of null values, number of trues/falses for boolean > typed columsn, max and avg length for string and binary typed columns, max > and min value for long and double typed columns. Note that version 1 of the > column stats project includes support for column statistics both at the table > and partition level. > > * Adds Metastore schema tables to persist the newly added statistics both at > table and partition level. > * Adds Metastore Thrift API to persist, retrieve and delete column statistics > at both table and partition level. > Please refer to the following wiki link for the details of the schema and the > Thrift API changes - > https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive > > * Extends the analyze table compute statistics statement to trigger > statistics computation and persistence for one or more columns. Please note > that statistics for multiple columns is computed through a single scan of the > table data. Please refer to the following wiki link for the syntax changes - > https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive > > One thing missing from the patch at this point is the metastore upgrade > scrips for MySQL/Derby/Postgres/Oracle. I'm waiting for the review to > finalize the metastore schema changes before I go ahead and add the upgrade > scripts. > > In a follow on patch, as part of version 2 of the column statistics project, > we will add support for computing, persisting and retrieving histograms on > long and double typed column values. > > Generated Thrift files have been removed for viewing pleasure. JIRA page has > the patch with the generated Thrift files. > > > This addresses bug HIVE-1362. > https://issues.apache.org/jira/browse/HIVE-1362 > > > Diffs > ----- > > data/files/UserVisits.dat PRE-CREATION > data/files/binary.txt PRE-CREATION > data/files/bool.txt PRE-CREATION > data/files/double.txt PRE-CREATION > data/files/employee.dat PRE-CREATION > data/files/employee2.dat PRE-CREATION > data/files/int.txt PRE-CREATION > metastore/if/hive_metastore.thrift c4c6640 > metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java > 5e4e4a4 > > metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java > 4367f7a > metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java > f1a6c8b > metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java > 045b550 > metastore/src/java/org/apache/hadoop/hive/metastore/RawStore.java 11c7f23 > metastore/src/java/org/apache/hadoop/hive/metastore/Warehouse.java 77d1caa > > metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java > PRE-CREATION > > metastore/src/model/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java > PRE-CREATION > metastore/src/model/package.jdo 38ce6d5 > > metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java > 4e152f4 > metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java > 369eebe > ql/build.xml 5de3f78 > ql/if/queryplan.thrift 05fbf58 > ql/ivy.xml 40f370b > ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnStatsTask.java > PRE-CREATION > ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 425900d > ql/src/java/org/apache/hadoop/hive/ql/exec/Task.java 4446952 > ql/src/java/org/apache/hadoop/hive/ql/exec/TaskFactory.java 79b87f1 > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 271554c > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java > 0b55ac4 > ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java > 3874862 > ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java > b4ee673 > ql/src/java/org/apache/hadoop/hive/ql/parse/ExplainSemanticAnalyzer.java > e75a075 > ql/src/java/org/apache/hadoop/hive/ql/parse/ExportSemanticAnalyzer.java > 61bc7fd > ql/src/java/org/apache/hadoop/hive/ql/parse/FunctionSemanticAnalyzer.java > 6024dd4 > ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g e969fbe > ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java > 09ef969 > ql/src/java/org/apache/hadoop/hive/ql/parse/LoadSemanticAnalyzer.java > 22fa20f > ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java a0ccbe6 > ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java 4e2b88a > ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 97d1f09 > ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java > ad1a14c > ql/src/java/org/apache/hadoop/hive/ql/parse/StatsSemanticAnalyzer.java > PRE-CREATION > ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsDesc.java > PRE-CREATION > ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsWork.java > PRE-CREATION > ql/src/java/org/apache/hadoop/hive/ql/plan/HiveOperation.java cb54753 > > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/DoubleNumDistinctValueEstimator.java > PRE-CREATION > > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java > PRE-CREATION > > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/LongNumDistinctValueEstimator.java > PRE-CREATION > > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java > PRE-CREATION > > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/StringNumDistinctValueEstimator.java > PRE-CREATION > ql/src/test/queries/clientpositive/columnstats_partlvl.q PRE-CREATION > ql/src/test/queries/clientpositive/columnstats_tbllvl.q PRE-CREATION > ql/src/test/queries/clientpositive/compute_stats_binary.q PRE-CREATION > ql/src/test/queries/clientpositive/compute_stats_boolean.q PRE-CREATION > ql/src/test/queries/clientpositive/compute_stats_double.q PRE-CREATION > ql/src/test/queries/clientpositive/compute_stats_long.q PRE-CREATION > ql/src/test/queries/clientpositive/compute_stats_string.q PRE-CREATION > ql/src/test/results/clientpositive/columnstats_partlvl.q.out PRE-CREATION > ql/src/test/results/clientpositive/columnstats_tbllvl.q.out PRE-CREATION > ql/src/test/results/clientpositive/compute_stats_binary.q.out PRE-CREATION > ql/src/test/results/clientpositive/compute_stats_boolean.q.out PRE-CREATION > ql/src/test/results/clientpositive/compute_stats_double.q.out PRE-CREATION > ql/src/test/results/clientpositive/compute_stats_long.q.out PRE-CREATION > ql/src/test/results/clientpositive/compute_stats_string.q.out PRE-CREATION > ql/src/test/results/clientpositive/show_functions.q.out 02f6a94 > > serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java > 5430814 > > Diff: https://reviews.apache.org/r/6878/diff/ > > > Testing > ------- > > All the existing hive tests pass. Additionally this patch adds the following > unit tests, > > * Tests to TestHiveMetaStore.java to test the Metastore schema and Thrift API > changes, > * Tests to exercise compute_stats UDAF for all primitive types, > * End to end test both at table and partition level for computing stats on > multiple columns. Note that these tests use the extended syntax of the > analyze command. > > > Thanks, > > Shreepadma Venugopalan > >