Re: Advice wanted

Doug Cutting Thu, 26 Oct 2006 11:10:52 -0700

Andrzej Bialecki wrote:

Grant Ingersoll wrote:
2. This time, instead of tokens I have X number of whole documents that need to be translated from source to destination and the way the translation systems work, it is best to have the whole document together when getting a translation. My plan here is to implement my own InputFormat again, this time returning the whole document from the RecordReader.next() and overriding getSplits() in InputFormatBase to return only one split per file, regardless of numSplits. Again, I would need to put the metadata somewhere, either the JobConf or the key.
Is there a better way of doing this or am I on the right track?
Basically, it's ok - the only problematic aspect is that if you have millions of documents then using this method you will get millions of map tasks to execute, because you create as many splits (hence, map tasks) as there are files ... perhaps a better way would be to first wrap these documents into a single SequenceFile consisting of <fileName, fileContent>, and use SequenceFileInputFormat.

Another approach to this is to create a file listing the names of the files in a big flat text file, then use that file as the input, with TextInputFormat. Then map() will be passed file names, and can open them, translate them and collect the output. That avoids having to append the content of all the files.
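
A minimal sketch of the file-list idea, in plain Java with no Hadoop classes, so it only illustrates the per-record logic and not the actual MapReduce API. The class FileListSketch, the methods mapOne(), run() and demo(), and the upper-casing translate() are all hypothetical stand-ins: run() plays the role of the framework handing each line of the list file to map(), and translate() stands in for the real translation system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class FileListSketch {

    // Hypothetical placeholder for the real translation system.
    static String translate(String document) {
        return document.toUpperCase(Locale.ROOT);
    }

    // What map() would do for one record: the value is a file name,
    // so open that file, read the whole document, and translate it.
    static String mapOne(String fileName) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(fileName));
        return translate(new String(bytes, StandardCharsets.UTF_8));
    }

    // Write the flat list file: one document path per line.
    static void writeListing(Path listing, List<Path> docs) throws IOException {
        List<String> names = new ArrayList<>();
        for (Path doc : docs) {
            names.add(doc.toAbsolutePath().toString());
        }
        Files.write(listing, names, StandardCharsets.UTF_8);
    }

    // Stand-in for the framework: feed each line of the list file to
    // mapOne() and collect "<fileName> TAB <translation>" records.
    static List<String> run(Path listing) throws IOException {
        List<String> collected = new ArrayList<>();
        for (String line : Files.readAllLines(listing, StandardCharsets.UTF_8)) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines in the list file
            }
            collected.add(line + "\t" + mapOne(line));
        }
        return collected;
    }

    // Small end-to-end demo on two temporary documents.
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("docs");
            Path a = Files.write(dir.resolve("a.txt"), "hello".getBytes(StandardCharsets.UTF_8));
            Path b = Files.write(dir.resolve("b.txt"), "world".getBytes(StandardCharsets.UTF_8));
            Path listing = dir.resolve("files.txt");
            writeListing(listing, Arrays.asList(a, b));
            return run(listing);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String record : demo()) {
            System.out.println(record);
        }
    }
}
```

In a real job the list file would be split by line ranges, so many file names can share one map task, and the documents would be opened through the job's file system rather than java.nio.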


Doug
