Hi,

> Been experimenting a little with vectorized execution in hive 0.13 and
> found that group-by is super slow on string columns. This simple query is
> 13x slower when vectorization is enabled (c_customer_id is string). Don't
> see this problem with int types.

I think the performance issue is due to the row-count triggers for
flushing the in-memory aggregations. This shouldn't happen to you in the
hive-1.0 branch, but for 0.13 there is a fairly easy workaround to the
performance issue.

> select c_customer_id from customer group by c_customer_id limit 10;

A very odd query that one, since it is one of the few patterns which
speeds up with an extra ORDER BY.

select c_customer_id from customer group by c_customer_id order by c_customer_id limit 10;

tends to run faster than regular group-by + fetch limit as it shuffles
less data (10 keys per map task).

Try the same with

set hive.vectorized.groupby.checkinterval=1024;
set hive.vectorized.groupby.flush.percent=0.8;
set hive.limit.pushdown.memory.usage=0.04;
set hive.optimize.reducededuplication.min.reducer=1;
-- above only if you're on MRv2, in Tez the default (4) is the faster option

That combination of operators should be triggering the fastest codepath.

@lefty: the limit pushdown seems to be missing in docs as the Top-N
memory size.

Cheers,
Gopal
