Andrzej Bialecki wrote:
Grant Ingersoll wrote:
2. This time, instead of tokens, I have X whole documents that need to be translated from source to destination. Because of the way the translation systems work, it is best to keep each document whole when requesting a translation. My plan here is to implement my own InputFormat again, this time returning the whole document from RecordReader.next() and overriding getSplits() in InputFormatBase to return only one split per file, regardless of numSplits. Again, I would need to put the metadata somewhere, either in the JobConf or in the key.

Is there a better way of doing this or am I on the right track?
Basically, it's OK. The only problematic aspect is that if you have millions of documents, this method will give you millions of map tasks to execute, because you create as many splits (and hence map tasks) as there are files. Perhaps a better way would be to first wrap these documents into a single SequenceFile consisting of <fileName, fileContent> records, and then use SequenceFileInputFormat.
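
To make the packing idea concrete, here is a minimal sketch in plain Java. It is only an illustration of the <fileName, fileContent> record layout: the class DocPacker and its pack/unpack methods are hypothetical names, and the container written here is a simple JDK DataOutputStream stream, not the real SequenceFile format or the Hadoop API. It also assumes the documents are UTF-8 text and each fits in memory.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a SequenceFile of <fileName, fileContent> records.
// It uses only the JDK; it is NOT the Hadoop SequenceFile format or API.
class DocPacker {

    // Append every input file to one container as (name, length, bytes) records.
    static void pack(List<Path> inputs, Path container) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(container)))) {
            for (Path p : inputs) {
                byte[] content = Files.readAllBytes(p);
                out.writeUTF(p.getFileName().toString()); // key: file name
                out.writeInt(content.length);             // length of the value
                out.write(content);                       // value: the whole document
            }
        }
    }

    // Read the records back in order; each entry is one whole document.
    static Map<String, String> unpack(Path container) throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(container)))) {
            while (true) {
                String name;
                try {
                    name = in.readUTF();
                } catch (EOFException end) {
                    break; // no more records
                }
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                docs.put(name, new String(content, StandardCharsets.UTF_8));
            }
        }
        return docs;
    }
}
```

The point of the layout is that one large container can be cut into a modest number of splits, while every record still carries a complete document together with its name.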

Another approach is to list the names of the files in one big flat text file, then use that file as the input, with TextInputFormat. map() will then be passed file names, and can open each file, translate it and collect the output. That avoids having to append the content of all the files.
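
As a rough illustration of the file-list approach, the sketch below simulates it in plain Java without Hadoop. FileListDriver, its map() and run() methods, and the translate function passed in are all hypothetical names: run() plays the part of the framework feeding one line of the list file at a time, and map() receives a file name rather than file content. The real map() signature and output collector in Hadoop differ, and the sketch assumes the files are UTF-8 text readable from wherever the task runs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Plain-JDK simulation of "the input is a list of file names".
// This is not the Hadoop Mapper interface; names here are illustrative only.
class FileListDriver {

    // Stand-in for map(): the input value is a file name, so the task opens
    // the file itself, translates the whole document, and collects the result.
    static void map(String fileName, UnaryOperator<String> translate,
                    Map<String, String> output) throws IOException {
        String document = new String(
                Files.readAllBytes(Paths.get(fileName)), StandardCharsets.UTF_8);
        output.put(fileName, translate.apply(document));
    }

    // Stand-in for the framework: one map() call per non-empty line of the list.
    static Map<String, String> run(Path listFile, UnaryOperator<String> translate)
            throws IOException {
        Map<String, String> output = new LinkedHashMap<>();
        for (String line : Files.readAllLines(listFile, StandardCharsets.UTF_8)) {
            String fileName = line.trim();
            if (!fileName.isEmpty()) {
                map(fileName, translate, output);
            }
        }
        return output;
    }
}
```

Because the list file holds only names, it stays small even for millions of documents, and the number of map tasks follows the splits of the list rather than the number of files.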

Doug
