Hi,

>Been experimenting a little with vectorized execution in hive 0.13 and
>found that group-by is super slow on string columns. This simple query is
>13x slower when vectorization is enabled (c_customer_id is string). Don't
>see this problem with int types.

I think the performance issue is due to the row-count triggers for
flushing the in-memory aggregations.

This shouldn't happen to you in the hive-1.0 branch, but for 0.13 there is
a fairly easy workaround to the performance issue.

>select c_customer_id from customer group by c_customer_id limit 10;

That is a very odd query, since it is one of the few patterns that
speed up with an extra ORDER BY.

select c_customer_id from customer group by c_customer_id order by
c_customer_id limit 10;

tends to run faster than regular group-by + fetch limit as it shuffles
less data (10 keys per map task).

Try the same with

set hive.vectorized.groupby.checkinterval=1024;
set hive.vectorized.groupby.flush.percent=0.8;
set hive.limit.pushdown.memory.usage=0.04;

set hive.optimize.reducededuplication.min.reducer=1;
# set the above only if you're on MRv2; in Tez the default (4) is the faster option

That combination of operators should be triggering the fastest codepath.
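
Put together, a 0.13 session would look roughly like the sketch below. The
hive.vectorized.execution.enabled line is an assumption about how
vectorization was switched on for the test; everything else is just the
settings and the ORDER BY variant of the query from above. The values are
starting points, not tuned numbers.

```sql
-- Assumption: vectorization was enabled per-session like this.
set hive.vectorized.execution.enabled=true;

-- Vectorized group-by hash: check interval (in entries) and the
-- fraction of the hash flushed once the memory threshold is exceeded.
set hive.vectorized.groupby.checkinterval=1024;
set hive.vectorized.groupby.flush.percent=0.8;

-- Memory fraction for the Top-N hash used by limit pushdown.
set hive.limit.pushdown.memory.usage=0.04;

-- MRv2 only; on Tez leave this at the default (4).
set hive.optimize.reducededuplication.min.reducer=1;

select c_customer_id
from customer
group by c_customer_id
order by c_customer_id
limit 10;
```

Running the same query once with hive.vectorized.execution.enabled=false
gives a baseline to compare against.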

@lefty: the limit pushdown setting (hive.limit.pushdown.memory.usage, the
Top-N memory size) seems to be missing from the docs.

Cheers,
Gopal

