Grant Ingersoll wrote:
Hi,
I have two tasks (although one is kind of a special case of the
other), and I am looking for some advice on how best to take
advantage of Hadoop for them:
1. I have a list of tokens that need to be translated from the
source language to the destination language. My approach is to
take the tokens, write them out to the FileSystem, one per line,
and then distribute (map) them onto the cluster for translation as
Text. I am not sure, however, how best to pass along the
metadata needed (source language and destination language). My
thoughts are to add the source and dest lang. to the JobConf (but
I could also see encoding it into the name of the file on the file
system and then into the key). Then during the Map phase, I would
need to either get the properties out of the JobConf or decode the
key to figure out the source and target languages.
Source and target language are two configuration properties for the
whole job, so passing them inside JobConf seems like the best way.
Each map/reduce task will get the same JobConf, including your
properties.
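Since Hadoop itself can't run in-line here, the idea can be sketched with a toy stand-in for JobConf (which is essentially a string key/value configuration shipped to every task). The property names below are made up for the example; the real class is org.apache.hadoop.mapred.JobConf, with the same set()/get() pattern, read back in the mapper's configure(JobConf) method.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for org.apache.hadoop.mapred.JobConf: a per-job
// string key/value configuration that every map/reduce task receives.
class FakeJobConf {
    private final Map<String, String> props = new HashMap<>();
    void set(String key, String value) { props.put(key, value); }
    String get(String key) { return props.get(key); }
}

public class JobConfSketch {
    public static void main(String[] args) {
        // Job setup: record the language pair once for the whole job.
        // Property names ("translate.src.lang" etc.) are hypothetical.
        FakeJobConf conf = new FakeJobConf();
        conf.set("translate.src.lang", "en");
        conf.set("translate.dst.lang", "de");

        // In a real mapper you would read these back in configure(JobConf).
        String src = conf.get("translate.src.lang");
        String dst = conf.get("translate.dst.lang");
        System.out.println(src + " -> " + dst); // prints "en -> de"
    }
}
```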
Cool, I figured JobConf was also distributed, but wasn't 100% certain.
Using the standard TextInputFormat you get <lineNo, lineSrcText> in
your map(), which you would then output after translation as
<lineSrcText, lineTgtText>. If you accidentally have duplicate
lines in the input, you will get multiple values in reduce, because
the same lineSrcText key would be associated with multiple
translations.
Makes sense.
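To see why duplicate input lines surface as multiple values under one key, the shuffle's grouping step can be mimicked in plain Java (a sketch; translate() is a dummy stand-in for the real translation call):

```java
import java.util.*;

public class ShuffleSketch {
    // Dummy "translation"; real code would call the translation system.
    static String translate(String src) { return src.toUpperCase(); }

    // map() emits <lineSrcText, lineTgtText>; the framework then groups
    // values by key before reduce(), which is what this map models.
    static Map<String, List<String>> mapAndGroup(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.computeIfAbsent(line, k -> new ArrayList<>())
                   .add(translate(line));
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Input with an accidental duplicate line: reduce() sees "hello"
        // once, but with two translated values in its Iterator.
        Map<String, List<String>> grouped =
            mapAndGroup(Arrays.asList("hello", "world", "hello"));
        System.out.println(grouped.get("hello")); // prints [HELLO, HELLO]
    }
}
```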
Actually, if you need to translate this into several languages, you
could loop in map() through all target languages, and output as
many translation tuples as needed, as <lineSrcText, <lang,
lineTgtText>> - then in your reduce() you would get them nicely
collected under a single key (lineSrcText) and all translated
values in the Iterator.
No need there, each job will be one language pair, but it is an
interesting idea that may be worth pursuing down the road.
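For reference, the fan-out suggested above can be sketched in plain Java, leaving out the Hadoop OutputCollector wiring (translate() is again a dummy stand-in):

```java
import java.util.*;

public class MultiLangMapSketch {
    // Dummy per-language "translation"; a real mapper would call the
    // translation system for each target language.
    static String translate(String src, String lang) {
        return "[" + lang + "]" + src;
    }

    // map(): one input line fans out to one <lineSrcText, <lang, tgt>>
    // tuple per target language; reduce() then gets all of them
    // collected under the single key lineSrcText.
    static List<String[]> mapLine(String lineSrcText, List<String> targetLangs) {
        List<String[]> tuples = new ArrayList<>();
        for (String lang : targetLangs) {
            tuples.add(new String[] {
                lineSrcText, lang, translate(lineSrcText, lang)
            });
        }
        return tuples;
    }

    public static void main(String[] args) {
        for (String[] t : mapLine("hello", Arrays.asList("de", "fr", "es"))) {
            System.out.println(t[0] + " -> <" + t[1] + ", " + t[2] + ">");
        }
    }
}
```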
2. This time, instead of tokens I have X number of whole documents
that need to be translated from source to destination and the way
the translation systems work, it is best to have the whole
document together when getting a translation. My plan here is to
implement my own InputFormat again, this time returning the whole
document from the RecordReader.next() and overriding getSplits()
in InputFormatBase to return only one split per file, regardless
of numSplits. Again, I would need to put the metadata somewhere,
either the JobConf or the key.
Is there a better way of doing this or am I on the right track?
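The core of the whole-document plan above is a record reader that turns one file into one <fileName, fileContent> record. In plain Java terms it boils down to something like this (a sketch only; the actual Hadoop wiring of getSplits() returning one split per file and RecordReader.next() is omitted):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WholeFileSketch {
    // The essence of a whole-document record reader: one file yields
    // exactly one <fileName, fileContent> record, so the translation
    // system always sees the complete document.
    static String[] readWholeFile(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return new String[] {
            file.getFileName().toString(),
            new String(bytes, StandardCharsets.UTF_8)
        };
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("doc", ".txt");
        Files.write(tmp, "entire document body".getBytes(StandardCharsets.UTF_8));
        String[] record = readWholeFile(tmp);
        System.out.println(record[0] + " -> " + record[1]);
        Files.delete(tmp);
    }
}
```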
Basically, it's ok - the only problematic aspect is that if you
have millions of documents then using this method you will get
millions of map tasks to execute, because you create as many splits
(hence, map tasks) as there are files ... perhaps a better way
would be to first wrap these documents into a single SequenceFile
consisting of <fileName, fileContent>, and use
SequenceFileInputFormat.
OK, that makes more sense. I wasn't totally clear on SeqFile, but
based on what you said and looking at it again that seems like a much
better way to handle it.
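The packing idea can be illustrated with a toy record stream in plain Java. To be clear, this is NOT the real SequenceFile on-disk format - Hadoop's SequenceFile.Writer and SequenceFileInputFormat handle that - it only shows the shape of the data: many small documents wrapped into one file of <fileName, fileContent> records.

```java
import java.io.*;
import java.util.LinkedHashMap;
import java.util.Map;

public class PackSketch {
    // Toy illustration of the SequenceFile idea: pack many documents
    // into a single stream of <fileName, fileContent> records, so the
    // job reads one big file instead of millions of tiny ones.
    static byte[] pack(Map<String, String> docs) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (Map.Entry<String, String> e : docs.entrySet()) {
            out.writeUTF(e.getKey());   // key: file name
            out.writeUTF(e.getValue()); // value: whole file content
        }
        return bos.toByteArray();
    }

    static Map<String, String> unpack(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        Map<String, String> docs = new LinkedHashMap<>();
        while (in.available() > 0) {
            docs.put(in.readUTF(), in.readUTF());
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "first document");
        docs.put("b.txt", "second document");
        System.out.println(unpack(pack(docs)).get("b.txt")); // prints "second document"
    }
}
```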
Thanks,
Grant