Hi All,

I'm a developer at Qubole (http://www.qubole.com) looking at Hadoop and Hive. 
In my past life, I was on the optimizer team of Greenplum Parallel Database. 
I'm a newbie to the Hive mailing list, so apologies for any missteps. I've done 
some searching in the Hive mailing list and JIRA and have not found any 
discussions around this topic - please feel free to redirect me to any old 
discussions I might've missed.

A class of queries we're interested in optimizing are top-k queries i.e. 
queries of the form:

(1) SELECT x, y from T order by z limit 10

You can imagine similar query with aggregates:

(2) SELECT x, y, count(*) as c from T group by x, y order by c desc limit 10

I'll continue my discussion with example (1) for simplicity. The way such a 
query is executed, every mapper sorts all rows from T and writes it to local 
files. Reducers (in this example, singular) read these files and merge them. 
These rows are fed to the limit operator which stops after 10 rows. 

The change I'm proposing is a combination of Hive and Hadoop changes which will 
greatly improve the performance of such queries:

Hadoop change:
        - New parameter map.sort.limitrecords which determines how many records 
each mapper in a job will send to every reducer
        - When writing out local files after sorting, map-task stops after 
map.sort.limitrecords records for each reducer
        - Effectively, each mapper sends out its top-K records

Hive change:
        - Determining when the Top-K optimization is applicable and setting K 
in ReduceSinkDesc
        - Passing the K value along to MapredWork
        - ExecDriver sets map.sort.limitrecords before executing the job 
corresponding to the MapredWork

This change will reduce the amount of I/O that happens on the map-side (writing 
only 10 rows per reducer as opposed to entire table) and can have a big effect 
on performance. Furthermore, it is possible to make the sort on the mapper side 
a top-k sort which can further improve performance - but the deep pocket is 
really the I/O savings. In my experiments, I see a 5x performance improvement 
for such queries.

Please let me know if this is of general interest - I'll be happy to contribute 
this back to the community. I'll also be mailing the Hadoop mailing list about 
this.

Thanks
Siva

Reply via email to