Hi all,

I've been dumping tables from MySQL and loading them manually into
HDFS, but decided to look at the DBInputFormat to better automate
the process.

I see it issuing "select ... from ... order by id limit ...", which
takes ages (several minutes) on my large tables, since I use MyISAM
and the query sits in the "Sorting result" state.

Is there anything I should watch out for if I customise the
DBInputFormat to select the max(id) in the getCount(), and use that to
create ID ranges for the splits, and then issue the selects with:

  select ... from ... where id between <lower> and <upper> order by id?

It does mean that the splits won't be equal in size, as there are
gaps in the IDs, and some might be empty, but it is a very fast
select statement.
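
To make the idea concrete, here is a rough sketch of how I imagine
computing the ranges. It is plain Java with no Hadoop classes; the
class and method names (IdRangeSplits, ranges, query) and the table
name in the usage note are made up for illustration and are not part
of DBInputFormat. It assumes a numeric id column and that the min and
max IDs have already been fetched.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: carves an ID span into ranges.
public class IdRangeSplits {

    // Returns inclusive [lower, upper] pairs that cover minId..maxId
    // without overlapping. At most numSplits ranges are produced.
    public static List<long[]> ranges(long minId, long maxId, int numSplits) {
        List<long[]> out = new ArrayList<>();
        if (numSplits < 1 || maxId < minId) {
            return out; // nothing to split
        }
        long total = maxId - minId + 1;
        long chunk = (total + numSplits - 1) / numSplits; // ceiling division
        for (long lower = minId; lower <= maxId; lower += chunk) {
            long upper = Math.min(lower + chunk - 1, maxId);
            out.add(new long[] {lower, upper});
        }
        return out;
    }

    // Builds the per-split select. BETWEEN is inclusive at both ends,
    // which is why the ranges above must not share a boundary ID.
    public static String query(String table, long lower, long upper) {
        return "SELECT * FROM " + table
                + " WHERE id BETWEEN " + lower + " AND " + upper
                + " ORDER BY id";
    }
}
```

For example, ranges(1, 10, 3) gives 1-4, 5-8 and 9-10, and
query("mytable", 1, 4) gives the select for the first split. A range
that falls entirely in a gap simply returns no rows.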

Thanks for any pointers,

Tim
