Hi Krishna, 

It's a good question. We should update the design doc on the wiki to include 
the design decisions. But here you go:

Hive previously updated the numRows stat using Hadoop counters, but the 
counters were not reliable: we saw wrong stats from time to time. There are 
several reasons why we chose JDBC/HBase as the intermediate storage for stats 
publishing instead of fixing Hadoop counters:

 1) AFAIK, Hadoop counters were not designed to be highly reliable or scalable 
to a large number of updates (correct me if I'm wrong). In our production 
environment, the peak stats publishing rate is around 2k QPS, which means the 
JT would have to handle a counter update every 0.5 msec. In addition, 
aggregating the counters adds more work to the JT, which is already heavily 
loaded. And this is for only 1 stat. If you have 3 stats to collect, you can 
put them into 3 columns and do 1 insert into RDBMS/HBase to "publish" them. 
With Hadoop counters, you'd need 3 counter updates, which is not as scalable 
down the road. 

 2) Even if Hadoop counters were fixed and scaled to what we expect, the 
turnaround time is high and Hive would have to add a Hadoop shim for old Hadoop 
releases. It's a pain and not as nice as supporting the feature out of the box. 
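
To make point 1 concrete, here is a rough sketch of the "3 stats in 3 columns, 
1 insert" idea, using an in-memory SQLite table as a stand-in for the 
RDBMS/HBase store. It is not Hive's actual schema or code: the table name, the 
raw_size and num_files columns, and the publish/aggregate helpers are all made 
up for illustration (only numRows comes from the discussion above). The numbers 
at the end just redo the 2k QPS arithmetic, assuming each publish carries 3 
stats.

```python
import sqlite3

# Hypothetical schema: only num_rows comes from the thread; the other two
# stat columns and all table/function names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE temp_stats ("
    " task_id TEXT PRIMARY KEY,"
    " num_rows INTEGER, raw_size INTEGER, num_files INTEGER)"
)


def publish(task_id, num_rows, raw_size, num_files):
    """One task publishes all three stats with a single insert."""
    conn.execute(
        "INSERT INTO temp_stats VALUES (?, ?, ?, ?)",
        (task_id, num_rows, raw_size, num_files),
    )


def aggregate():
    """The aggregation task sums each column across all published rows."""
    return conn.execute(
        "SELECT SUM(num_rows), SUM(raw_size), SUM(num_files) FROM temp_stats"
    ).fetchone()


publish("task_0", 100, 4096, 1)
publish("task_1", 250, 8192, 2)
totals = aggregate()

# Rough load comparison at the peak rate quoted above (assumed: 3 stats).
peak_qps = 2000
num_stats = 3
interval_ms = 1000.0 / peak_qps         # time between updates for one stat
counter_updates = peak_qps * num_stats  # one counter update per stat
row_inserts = peak_qps                  # one insert, however many columns
```

With counters the update rate grows with the number of stats (6000/sec here), 
while the row-based approach stays at one insert per publish (2000/sec).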

On Jun 20, 2011, at 6:26 AM, Krishna Kumar wrote:

> Any reason why persistent stores such as jdbc and hbase are supported for 
> temporary stats storage IIUC, but hadoop counters were not used for the tasks 
> to 'publish' their stats for the aggregation task to pick it up from?
> 
> Cheers,
> Krishna
