Hi all, I've been dumping tables from MySQL and loading them manually into HDFS, but decided to look at DBInputFormat to better automate the process.
I see it issuing "select ... from ... order by id limit ...", which takes ages (several minutes) on my large tables, since I use MyISAM and the query hangs around in the "Sorting result" state.

Is there anything I should watch out for if I customise DBInputFormat to select max(id) in getCount(), use that to create ID ranges for the splits, and then issue the selects as:

select ... from ... where id between <lower> and <upper> order by id

It does mean the splits won't be equal, since there are holes in the ID sequence, and some might be empty, but it is a very fast select statement.

Thanks for any pointers,
Tim
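For reference, the range-based split idea described above could be sketched roughly as follows. This is only an illustration in plain Java, not actual DBInputFormat code: the class `IdRangeSplits`, its methods, and the table and column names are all hypothetical, and it assumes the lower bound comes from something like min(id) (or simply 1) alongside the max(id) lookup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of Hadoop. It only shows the arithmetic
// for turning an ID span into per-split BETWEEN queries.
public class IdRangeSplits {

    /** Inclusive ID range covered by one split. */
    public static final class Range {
        public final long lower;
        public final long upper;

        Range(long lower, long upper) {
            this.lower = lower;
            this.upper = upper;
        }
    }

    /**
     * Cuts [minId, maxId] into at most numSplits contiguous ranges of equal
     * width. Because of holes in the ID sequence, the row counts per range
     * may differ, and some ranges may match no rows at all.
     */
    public static List<Range> computeRanges(long minId, long maxId, int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be >= 1");
        }
        List<Range> ranges = new ArrayList<>();
        if (maxId < minId) {
            return ranges; // empty table: nothing to split
        }
        long total = maxId - minId + 1;
        long chunk = (total + numSplits - 1) / numSplits; // ceiling division
        for (long lower = minId; lower <= maxId; lower += chunk) {
            long upper = Math.min(lower + chunk - 1, maxId);
            ranges.add(new Range(lower, upper));
        }
        return ranges;
    }

    /** Builds the per-split query; all names are supplied by the caller. */
    public static String buildQuery(String columns, String table, String idColumn, Range r) {
        return "SELECT " + columns + " FROM " + table
                + " WHERE " + idColumn + " BETWEEN " + r.lower + " AND " + r.upper
                + " ORDER BY " + idColumn;
    }

    public static void main(String[] args) {
        // Example values only; in practice minId/maxId would come from the database.
        for (Range r : computeRanges(1, 10, 3)) {
            System.out.println(buildQuery("id, name", "my_table", "id", r));
        }
    }
}
```

The ranges are inclusive and do not overlap, so each row falls into exactly one split as long as the table is not modified while the job runs. The sketch ignores overflow for IDs close to Long.MAX_VALUE.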