Suppose I have a large-ish table (over a billion rows) and I want to grab the first 5 million or so rows. When I run the query "create table foo_subset as select col1, col2, col3 from foo limit 5000000", the job launches a single reducer, which runs for quite a while. Looking at the HDFS_BYTES_READ/WRITTEN counters, I see that the reducer read the entire table (250 GB) just to write about 700 MB. This seems wasteful. Is there a reason the reducer can't simply count records and terminate once it has written 5,000,000 (or whatever the limit is)? Are there workarounds to make the process faster?
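For concreteness, here is the statement as I'm running it, along with the kind of shortcut I was hoping might exist. (The TABLESAMPLE variant is only a guess on my part; I haven't verified that block sampling is available in our Hive version or that it behaves this way inside a CTAS, so please correct me if that's off base.)

    -- current statement: one reducer, full 250 GB scan for a ~700 MB result
    CREATE TABLE foo_subset AS
    SELECT col1, col2, col3
    FROM foo
    LIMIT 5000000;

    -- workaround I was wondering about: block sampling to read only a slice
    -- of the table's blocks (0.3% of 250 GB is roughly the 700 MB I need),
    -- then LIMIT to trim the result to exactly 5 million rows
    CREATE TABLE foo_subset AS
    SELECT col1, col2, col3
    FROM foo TABLESAMPLE(0.3 PERCENT) s
    LIMIT 5000000;

Would something like that actually avoid scanning the whole table, or does it still funnel everything through the single reducer?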
Thanks, --Leo