Suppose I have a large-ish table (over a billion rows) and I want to grab the first 5 million or so rows. When I run the query "create table foo_subset as select col1, col2, col3 from foo limit 5000000", the job launches a single reducer, which runs for quite a while. Looking at the HDFS_BYTES_READ/WRITTEN counters, I see that the reducer read the entire table (250 GB) just to write about 700 MB. This seems wasteful. Is there a reason the reducer can't simply count records and terminate once it has written 5,000,000 (or whatever the limit is)? Are there workarounds to make the process faster?
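For concreteness, here is the statement as I'm running it, along with the kind of shortcut I was hoping might exist. (The TABLESAMPLE variant is only a guess on my part; I haven't verified that block sampling is available in our Hive version or that it behaves this way inside a CTAS, so please correct me if that's off base.)

    -- current statement: one reducer, full 250 GB scan for a ~700 MB result
    CREATE TABLE foo_subset AS
    SELECT col1, col2, col3
    FROM foo
    LIMIT 5000000;

    -- workaround I was wondering about: block sampling to read only a slice
    -- of the table's blocks (0.3% of 250 GB is roughly the 700 MB I need),
    -- then LIMIT to trim the result to exactly 5 million rows
    CREATE TABLE foo_subset AS
    SELECT col1, col2, col3
    FROM foo TABLESAMPLE(0.3 PERCENT) s
    LIMIT 5000000;

Would something like that actually avoid scanning the whole table, or does it still funnel everything through the single reducer?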
Thanks, --Leo