Re: Review Request: HIVE-2144 reduce workload generated by JDBCStatsPublisher

Ning Zhang Mon, 23 May 2011 14:16:25 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/765/#review709
-----------------------------------------------------------




trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1417>

    Can you add a comment explaining in what situation this exception is
    thrown? It would help readers who have not noticed the primary key
    constraint in the DDL.



trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
<https://reviews.apache.org/r/765/#comment1418>

    remove this?


- Ning


On 2011-05-21 01:49:07, Tomasz Nykiel wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/765/
> -----------------------------------------------------------
> 
> (Updated 2011-05-21 01:49:07)
> 
> 
> Review request for hive.
> 
> 
> Summary
> -------
> 
> Currently, the JDBCStatsPublisher executes two queries per inserted row of 
> statistics: the first checks whether the ID was already inserted by another 
> task, and the second inserts a new row or updates the existing one.
> An update occurs very rarely, since duplicates most likely originate from 
> speculative failed tasks.
> 
> Currently, the schema of the stat table is the following:
> 
> PARTITION_STAT_TABLE ( ID VARCHAR(255), ROW_COUNT BIGINT )
> 
> It does not have any integrity constraints declared.
> 
> We amend it to:
> 
> PARTITION_STAT_TABLE ( ID VARCHAR(255) PRIMARY KEY , ROW_COUNT BIGINT ).
> 
> HIVE-2144 improves performance by greedily performing the INSERT 
> statement.
> Instead of executing two queries per inserted row, we can then execute a 
> single INSERT query.
> In the case of a primary key constraint violation, we perform a single 
> UPDATE query.
> The UPDATE query must check whether the stats currently being inserted 
> are "newer" than the ones already in the table.
> 
> 
> This addresses bug HIVE-2144.
>     https://issues.apache.org/jira/browse/HIVE-2144
> 
> 
> Diffs
> -----
> 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java 1125468 
>   trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/765/diff
> 
> 
> Testing
> -------
> 
> TestStatsPublisher JUnit test:
> - basic behaviour
> - multiple updates
> - cleanup of the statistics table after aggregation
> 
> Standalone testing on the cluster.
> - insert/analyze queries over non-partitioned/partitioned tables
> 
> NOTE: For correct behaviour, either the primary key index needs to be 
> created, or the PARTITION_STAT_TABLE table needs to be dropped, which 
> triggers creation of the table with the constraint declared.
> 
> 
> Thanks,
> 
> Tomasz
> 
>
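
For readers skimming the thread, the insert-first strategy described in the
quoted summary might look roughly like the sketch below. It is not the code
under review: the class name, the Outcome enum, the SQL strings, and the
interpretation of "newer" as "larger ROW_COUNT" are all assumptions made for
illustration. It also assumes the driver reports a duplicate primary key
either as SQLIntegrityConstraintViolationException or with an SQLState in
class 23, which varies between JDBC drivers.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

// Hypothetical sketch of "INSERT first, UPDATE only on a duplicate key".
class GreedyStatsPublish {
    enum Outcome { INSERTED, UPDATED, KEPT_EXISTING }

    static final String INSERT_SQL =
        "INSERT INTO PARTITION_STAT_TABLE (ID, ROW_COUNT) VALUES (?, ?)";

    // Assumption: "newer" is modelled here as a larger row count.
    static final String UPDATE_SQL =
        "UPDATE PARTITION_STAT_TABLE SET ROW_COUNT = ? "
        + "WHERE ID = ? AND ROW_COUNT < ?";

    // A duplicate ID surfaces as an integrity constraint violation because
    // ID is declared PRIMARY KEY. Drivers differ in how they report it, so
    // check both the exception subclass and SQLState class 23.
    static boolean isDuplicateKey(SQLException e) {
        String state = e.getSQLState();
        return e instanceof SQLIntegrityConstraintViolationException
            || (state != null && state.startsWith("23"));
    }

    static Outcome publish(Connection conn, String id, long rowCount)
            throws SQLException {
        // Common case: one INSERT and nothing else.
        try (PreparedStatement insert = conn.prepareStatement(INSERT_SQL)) {
            insert.setString(1, id);
            insert.setLong(2, rowCount);
            insert.executeUpdate();
            return Outcome.INSERTED;
        } catch (SQLException e) {
            if (!isDuplicateKey(e)) {
                throw e; // unrelated failure, do not mask it
            }
        }
        // Rare case: another task already published this ID, so issue a
        // single conditional UPDATE.
        try (PreparedStatement update = conn.prepareStatement(UPDATE_SQL)) {
            update.setLong(1, rowCount);
            update.setString(2, id);
            update.setLong(3, rowCount);
            return update.executeUpdate() > 0
                ? Outcome.UPDATED
                : Outcome.KEPT_EXISTING;
        }
    }
}
```

Catching the constraint violation in one narrowly scoped place also gives a
natural spot for the explanatory comment requested in the review above.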
