Hi,

On Sun, Dec 26, 2010 at 6:29 PM, Black, Michael (IS) <[email protected]> wrote:
> I assume there's a way to make a specific # of splits and add each document
> to the separate splits...but I'll be darned if I can find the docs or an
> example to show this.

Would CombineFileInputFormat and CombineFileSplit be what you're looking for?

Doc links:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/CombineFileSplit.html

> As I said I'm using hadoop-0.20.2 which I know makes a difference as so many
> things get deprecated on each release. Old references don't seem to work.

The API marked deprecated in 0.20.{0,1,2} has been un-deprecated in the
0.21.0 release and is also considered the "stable" API. You can continue
using it, as it is still supported. (Maybe 0.20.3 will have it un-deprecated
too; I'm not sure what the status on that is, although doing so would surely
help avoid beginner confusion.)

-- 
Harsh J
www.harshj.com
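
P.S. (illustration only) The "specific # of splits" part of the question can be sketched in plain Java. As far as I know, CombineFileInputFormat bounds its splits by size rather than by count, so a fixed count would probably need custom grouping logic in a subclass's getSplits(). The sketch below shows only that grouping step and uses no Hadoop classes. The names FixedCountSplitPacker, InputFile and pack are hypothetical, and the "largest file first, into the lightest group" rule is an assumption made for the example, not Hadoop's own algorithm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: plain Java, no Hadoop classes involved.
final class FixedCountSplitPacker {

    /** Hypothetical stand-in for an input file: a path and its length in bytes. */
    static final class InputFile {
        final String path;
        final long length;

        InputFile(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /**
     * Distributes the files over exactly numSplits groups. Files are taken
     * largest first, and each one goes to the group that currently holds the
     * fewest bytes. Groups may stay empty if there are fewer files than groups.
     */
    static List<List<InputFile>> pack(List<InputFile> files, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        List<List<InputFile>> groups = new ArrayList<>();
        long[] bytesPerGroup = new long[numSplits];
        for (int i = 0; i < numSplits; i++) {
            groups.add(new ArrayList<>());
        }

        // Sort a copy so the caller's list is left untouched.
        List<InputFile> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong((InputFile f) -> f.length).reversed());

        for (InputFile file : sorted) {
            // Find the group with the fewest bytes so far (first one wins ties).
            int lightest = 0;
            for (int i = 1; i < numSplits; i++) {
                if (bytesPerGroup[i] < bytesPerGroup[lightest]) {
                    lightest = i;
                }
            }
            groups.get(lightest).add(file);
            bytesPerGroup[lightest] += file.length;
        }
        return groups;
    }
}
```

In a real job, each resulting group would then have to be turned into one CombineFileSplit (paths, start offsets and lengths) inside the custom input format. That part is left out here because it depends on the Hadoop version in use.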
