Andrzej Bialecki wrote:
Grant Ingersoll wrote:
2. This time, instead of tokens, I have X whole documents
that need to be translated from source to destination, and given the way
the translation systems work, it is best to have the whole document
together when getting a translation. My plan here is to implement my
own InputFormat again, this time returning the whole document from the
RecordReader.next() and overriding getSplits() in InputFormatBase to
return only one split per file, regardless of numSplits. Again, I
would need to put the metadata somewhere, either the JobConf or the key.
Is there a better way of doing this or am I on the right track?
Basically, it's OK. The only problematic aspect is that if you have
millions of documents, this method will give you millions of map tasks
to execute, because you create as many splits (and hence map tasks) as
there are files. Perhaps a better way would be to first wrap these
documents into a single SequenceFile of <fileName, fileContent> records,
and use SequenceFileInputFormat.
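
To make the packing idea concrete, here is a minimal plain-Java sketch of
writing many small documents into one container of <fileName, fileContent>
records and reading them back. It is only a stand-in: the record layout and
the PackedDocs class are hypothetical, and it does not use or reproduce the
Hadoop SequenceFile format or API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for "one container of <fileName, fileContent> records".
// This is NOT the SequenceFile format; it only illustrates the packing idea.
class PackedDocs {

    // Write a record count, then one (name, length, bytes) record per document.
    static byte[] pack(Map<String, String> docs) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeInt(docs.size());
                for (Map.Entry<String, String> e : docs.entrySet()) {
                    byte[] content = e.getValue().getBytes(StandardCharsets.UTF_8);
                    out.writeUTF(e.getKey());      // key: file name
                    out.writeInt(content.length);  // value: length-prefixed content
                    out.write(content);
                }
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the records back, one whole document per record.
    static Map<String, String> unpack(byte[] packed) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            Map<String, String> docs = new LinkedHashMap<>();
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                docs.put(name, new String(content, StandardCharsets.UTF_8));
            }
            return docs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "first document");
        docs.put("b.txt", "second document");
        System.out.println(unpack(pack(docs)).keySet());
    }
}
```

Each record still carries a whole document, so the translation step sees
complete documents, while the number of input files (and hence splits) no
longer grows with the number of documents.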
Another approach is to list the names of the files in one big flat
text file, then use that file as the input, with
TextInputFormat. Then map() will be passed file names, and can open
them, translate them and collect the output. That avoids having to
append the content of all the files.
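
As a rough illustration of the file-list approach, the following plain-Java
sketch reads a list file one line at a time, treats each line as a file name,
opens that file, "translates" it, and collects the output. It deliberately
avoids the Hadoop API: FileListDemo, mapOne, run and runDemo are hypothetical
names, and upper-casing stands in for the real translation system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: the list file is the job input, each line is a file name.
class FileListDemo {

    // Plays the role of map(): receives one file name, opens the file,
    // and translates the whole document in one go.
    static String mapOne(String fileName, UnaryOperator<String> translate) {
        try {
            byte[] raw = Files.readAllBytes(Paths.get(fileName));
            return translate.apply(new String(raw, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Plays the role of the framework: feeds each line of the list file to
    // mapOne and collects <fileName, translation> pairs.
    static Map<String, String> run(Path listFile, UnaryOperator<String> translate) {
        try {
            Map<String, String> output = new LinkedHashMap<>();
            for (String line : Files.readAllLines(listFile, StandardCharsets.UTF_8)) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines in the list
                }
                output.put(line, mapOne(line, translate));
            }
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds two small documents plus a list file in a temp directory and runs
    // the sketch over them; upper-casing is a placeholder for translation.
    static Map<String, String> runDemo() {
        try {
            Path dir = Files.createTempDirectory("docs");
            String[] contents = {"first document", "second document"};
            List<String> names = new ArrayList<>();
            for (int i = 0; i < contents.length; i++) {
                Path doc = dir.resolve("doc" + i + ".txt");
                Files.write(doc, contents[i].getBytes(StandardCharsets.UTF_8));
                names.add(doc.toString());
            }
            Path listFile = dir.resolve("files.txt");
            Files.write(listFile, names, StandardCharsets.UTF_8);
            return run(listFile, String::toUpperCase);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        runDemo().forEach((name, text) -> System.out.println(name + " -> " + text));
    }
}
```

In a real job the listed paths would have to be readable from whichever node
runs the map task, which this local sketch does not model.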
Doug