Hi,
I have two tasks (one is really a special case of the other) and am
looking for advice on how best to tackle them with Hadoop:
1. I have a list of tokens that need to be translated from the
source language to the destination language. My approach is to take
the tokens, write them out to the FileSystem, one per line, and then
distribute (map) them onto the cluster for translation as Text. I
am not sure, however, how best to pass along the metadata needed
(source language and destination language). My thought is to add
the source and destination languages to the JobConf (though I could
also see encoding them into the name of the file on the file system
and from there into the key). Then, during the map phase, I would
either read the properties back out of the JobConf or decode the key
to recover the source and target languages.
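The JobConf route is just a set/get pair: call job.set("source.lang", src)
in the driver and job.get("source.lang") inside the Mapper's
configure(JobConf) (the property names here are my own, not anything
Hadoop prescribes). If I instead go the key-encoding route, the decode
is mechanical; a plain-Java sketch, assuming a hypothetical naming
convention of "<src>-<dest>_<name>" for the file/key:

```java
// Decode a (source, destination) language pair from a key such as
// "en-fr_tokens.txt". The "<src>-<dest>_<name>" convention is an
// assumption of this sketch, not a Hadoop convention.
public class LangPair {
    public final String source;
    public final String dest;

    public LangPair(String source, String dest) {
        this.source = source;
        this.dest = dest;
    }

    /** Parse "<src>-<dest>_<name>" (or bare "<src>-<dest>") into a pair. */
    public static LangPair fromKey(String key) {
        int us = key.indexOf('_');
        String pair = (us >= 0) ? key.substring(0, us) : key;
        String[] parts = pair.split("-");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Unrecognized key: " + key);
        }
        return new LangPair(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        LangPair p = LangPair.fromKey("en-fr_tokens.txt");
        System.out.println(p.source + " -> " + p.dest); // prints "en -> fr"
    }
}
```

Either way the mapper ends up with the pair; the JobConf version keeps
the file names clean, while the key version lets one job mix language
pairs.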
2. This time, instead of tokens, I have X whole documents that need
to be translated from source to destination, and given the way the
translation systems work, it is best to have the whole document
together when requesting a translation. My plan here is to implement my
own InputFormat again, this time returning the whole document from
the RecordReader.next() and overriding getSplits() in InputFormatBase
to return only one split per file, regardless of numSplits. Again, I
would need to put the metadata somewhere, either the JobConf or the key.
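One note on the splits: depending on the Hadoop version, it may be
simpler to override the protected isSplitable(...) hook in the input
format to return false than to reimplement getSplits(), though I'd
have to check whether my InputFormatBase exposes it. The heart of the
RecordReader itself is just slurping the file into one value, which
next() would emit once before returning false. A plain-Java sketch of
that piece (in the real RecordReader the InputStream would come from
FileSystem.open(path); here it is a byte stream so the sketch stands
alone):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Core of a whole-document RecordReader: read every byte of the split's
// stream and hand the document back as a single value.
public class WholeFileReader {
    /** Read the entire stream and return it as one String. */
    public static String readWholeDocument(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "the whole document, as one record".getBytes(StandardCharsets.UTF_8));
        // prints "the whole document, as one record"
        System.out.println(readWholeDocument(in));
    }
}
```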
Is there a better way of doing this, or am I on the right track?
Thanks,
Grant