Hi,

I have two tasks (one is essentially a special case of the other) and am looking for advice on how best to take advantage of Hadoop for them:

1. I have a list of tokens that need to be translated from a source language to a destination language. My approach is to write the tokens out to the FileSystem, one per line, and then distribute (map) them onto the cluster for translation as Text. I am not sure, however, how best to pass along the metadata needed (the source and destination languages). My thought is to add the source and destination languages to the JobConf (though I could also see encoding them into the name of the file on the file system and then into the key). During the Map phase, I would then either read the properties from the JobConf or decode the key to figure out the source and target languages.
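A minimal sketch of the JobConf approach, assuming the old org.apache.hadoop.mapred API; the property names ("translate.src.lang", "translate.dest.lang") and the translate() call are made up for illustration:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Job setup would stash the language pair in the JobConf, e.g.:
//   JobConf conf = new JobConf(TranslateJob.class);
//   conf.set("translate.src.lang", "en");
//   conf.set("translate.dest.lang", "fr");

public class TranslateMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String srcLang;
  private String destLang;

  // configure() is called once per task with the job's configuration,
  // so the metadata is read here, not once per record.
  @Override
  public void configure(JobConf job) {
    srcLang = job.get("translate.src.lang");
    destLang = job.get("translate.dest.lang");
  }

  @Override
  public void map(LongWritable key, Text token,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // translate() stands in for the actual translation-system call.
    String translated = translate(token.toString(), srcLang, destLang);
    output.collect(token, new Text(translated));
  }

  private String translate(String token, String src, String dest) {
    return token; // stub
  }
}
```

The JobConf route keeps the input files free of metadata, at the cost of one job per language pair; encoding the pair into the file name (and thus the key) would let a single job mix pairs.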

2. This time, instead of tokens, I have X whole documents that need to be translated from source to destination, and given the way the translation systems work, it is best to have the whole document together when requesting a translation. My plan here is to implement my own InputFormat again, this time returning the whole document from RecordReader.next() and overriding getSplits() in InputFormatBase to return only one split per file, regardless of numSplits. Again, I would need to put the metadata somewhere, either in the JobConf or in the key.
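A sketch of the whole-document InputFormat, again assuming the old mapred API. Rather than reimplementing getSplits(), overriding isSplitable() to return false gets the base class to emit one split per file; the class and key-naming choices here are illustrative, not a definitive implementation:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<Text, Text> {

  // Declaring files unsplittable makes the inherited getSplits()
  // emit exactly one split per file, regardless of numSplits, so
  // the whole document reaches a single map task.
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }

  @Override
  public RecordReader<Text, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  static class WholeFileRecordReader implements RecordReader<Text, Text> {
    private final FileSplit split;
    private final JobConf job;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    // next() returns the entire file as one record. The key carries
    // the file name, which could also encode the language pair
    // (e.g. "doc1.en-fr.txt") if the JobConf route is not used.
    @Override
    public boolean next(Text key, Text value) throws IOException {
      if (processed) {
        return false;
      }
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(job);
      byte[] contents = new byte[(int) split.getLength()];
      FSDataInputStream in = fs.open(file);
      try {
        IOUtils.readFully(in, contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      key.set(file.getName());
      value.set(contents, 0, contents.length);
      processed = true;
      return true;
    }

    @Override
    public Text createKey() { return new Text(); }

    @Override
    public Text createValue() { return new Text(); }

    @Override
    public long getPos() { return processed ? split.getLength() : 0; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() throws IOException { }
  }
}
```

One caveat with this approach: each document must fit in a task's memory, since the record reader buffers the whole file before handing it to the mapper.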

Is there a better way of doing this or am I on the right track?

Thanks,
Grant
