Re: Review Request: HIVE-2144 reduce workload generated by JDBCStatsPublisher

Ning Zhang Fri, 20 May 2011 00:26:30 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/#review696
-----------------------------------------------------------




trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1395>

    Here we need to catch SQLRecoverableException and retry.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1396>

    If these parameters present in conf/hive-default.xml, you don't need to set 
them again here since new JobConf() should read from hive-default.xml.



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1397>

    the usual use case for aggregateStats() is that the key should be the 
prefix (e.g., file_000) of the string inserted by publishStats, so that all 
keys that match the prefix will be aggregated.
    
    Can you add one more test for aggregateStats('file_000')?



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1398>

    won't this also change the stats at the 2nd publishStat()?



trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1399>

    also add another aggStats for prefix.


- Ning


On 2011-05-19 23:14:26, Tomasz Nykiel wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/765/
> -----------------------------------------------------------
> 
> (Updated 2011-05-19 23:14:26)
> 
> 
> Review request for hive.
> 
> 
> Summary
> -------
> 
> Currently, the JDBCStatsPublisher executes two queries per inserted row of 
> statistics, first query to check if the ID was inserted by another task, and 
> second query to insert a new or update the existing row.
> The latter occurs very rarely, since duplicates most likely originate from 
> speculative failed tasks.
> 
> Currently the schema of the stat table is the following:
> 
> PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT ) and does not have 
> any integrity constraints declared.
> 
> We amend it to:
> 
> PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).
> 
> HIVE-2144 improves on performance by greedily performing the insertion 
> statement.
> Then instead of executing two queries per row inserted, we can execute one 
> INSERT query.
> In the case primary key constraint violation, we perform a single UPDATE 
> query.
> The UPDATE query needs to check the condition, if the currently inserted 
> stats are "newer" then the ones already in the table.
> 
> 
> This addresses bug HIVE-2144.
>     https://issues.apache.org/jira/browse/HIVE-2144
> 
> 
> Diffs
> -----
> 
>   
> trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
>  1125140 
>   trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/765/diff
> 
> 
> Testing
> -------
> 
> TestStatsPublisher JUnit test:
> - basic behaviour
> - multiple updates
> - cleanup of the statistics table after aggregation
> 
> Standalone testing on the cluster.
> - insert/analyze queries over non-partitioned/partitioned tables
> 
> NOTE. For the correct behaviour, the primary_key index needs to be created, 
> or the PARTITION_STAT_TABLE table dropped - which triggers creation of the 
> table with the constraint declared.
> 
> 
> Thanks,
> 
> Tomasz
> 
>

Re: Review Request: HIVE-2144 reduce workload generated by JDBCStatsPublisher

Reply via email to