----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/765/#review696 -----------------------------------------------------------
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java <https://reviews.apache.org/r/765/#comment1395> Here we need to catch SQLRecoverableException and retry. trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java <https://reviews.apache.org/r/765/#comment1396> If these parameters present in conf/hive-default.xml, you don't need to set them again here since new JobConf() should read from hive-default.xml. trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java <https://reviews.apache.org/r/765/#comment1397> the usual use case for aggregateStats() is that the key should be the prefix (e.g., file_000) of the string inserted by publishStats, so that all keys that match the prefix will be aggregated. Can you add one more test for aggregateStats('file_000')? trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java <https://reviews.apache.org/r/765/#comment1398> won't this also change the stats at the 2nd publishStat()? trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java <https://reviews.apache.org/r/765/#comment1399> also add another aggStats for prefix. - Ning On 2011-05-19 23:14:26, Tomasz Nykiel wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/765/ > ----------------------------------------------------------- > > (Updated 2011-05-19 23:14:26) > > > Review request for hive. > > > Summary > ------- > > Currently, the JDBCStatsPublisher executes two queries per inserted row of > statistics, first query to check if the ID was inserted by another task, and > second query to insert a new or update the existing row. > The latter occurs very rarely, since duplicates most likely originate from > speculative failed tasks. > > Currently the schema of the stat table is the following: > > PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have > any integrity constraints declared. > > We amend it to: > > PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ). > > HIVE-2144 improves on performance by greedily performing the insertion > statement. > Then instead of executing two queries per row inserted, we can execute one > INSERT query. > In the case primary key constraint violation, we perform a single UPDATE > query. > The UPDATE query needs to check the condition, if the currently inserted > stats are "newer" then the ones already in the table. > > > This addresses bug HIVE-2144. > https://issues.apache.org/jira/browse/HIVE-2144 > > > Diffs > ----- > > > trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java > 1125140 > trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java > PRE-CREATION > > Diff: https://reviews.apache.org/r/765/diff > > > Testing > ------- > > TestStatsPublisher JUnit test: > - basic behaviour > - multiple updates > - cleanup of the statistics table after aggregation > > Standalone testing on the cluster. > - insert/analyze queries over non-partitioned/partitioned tables > > NOTE. For the correct behaviour, the primary_key index needs to be created, > or the PARTITION_STAT_TABLE table dropped - which triggers creation of the > table with the constraint declared. > > > Thanks, > > Tomasz > >