Hi,
I have two tasks (one is really a special case of the other) and am
looking for advice on how best to tackle them with Hadoop:
1. I have a list of tokens that need to be translated from the
source language to the destination language. My approach is to take
the tokens, write them out to the FileSystem, one per line, and then
distribute (map) them onto the cluster for translation as Text. I
am not sure, however, how best to pass along the metadata needed
(source language and destination language). My thought is to add
the source and destination languages to the JobConf (though I could
also see encoding them into the name of the file on the file system
and from there into the key). Then, during the map phase, I would
either read the properties back out of the JobConf or decode the key
to recover the source and target languages.
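The JobConf route is just a set/get pair: call job.set("source.lang", src)
in the driver and job.get("source.lang") inside the Mapper's
configure(JobConf) (the property names here are my own, not anything
Hadoop prescribes). If I instead go the key-encoding route, the decode
is mechanical; a plain-Java sketch, assuming a hypothetical naming
convention of "<src>-<dest>_<name>" for the file/key:

```java
// Decode a (source, destination) language pair from a key such as
// "en-fr_tokens.txt". The "<src>-<dest>_<name>" convention is an
// assumption of this sketch, not a Hadoop convention.
public class LangPair {
    public final String source;
    public final String dest;

    public LangPair(String source, String dest) {
        this.source = source;
        this.dest = dest;
    }

    /** Parse "<src>-<dest>_<name>" (or bare "<src>-<dest>") into a pair. */
    public static LangPair fromKey(String key) {
        int us = key.indexOf('_');
        String pair = (us >= 0) ? key.substring(0, us) : key;
        String[] parts = pair.split("-");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Unrecognized key: " + key);
        }
        return new LangPair(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        LangPair p = LangPair.fromKey("en-fr_tokens.txt");
        System.out.println(p.source + " -> " + p.dest); // prints "en -> fr"
    }
}
```

Either way the mapper ends up with the pair; the JobConf version keeps
the file names clean, while the key version lets one job mix language
pairs.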
2. This time, instead of tokens, I have X whole documents that need
to be translated from source to destination, and given the way the
translation systems work, it is best to have the whole document
together when requesting a translation. My plan here is to implement my
own InputFormat again, this time returning the whole document from
the RecordReader.next() and overriding getSplits() in InputFormatBase
to return only one split per file, regardless of numSplits. Again, I
would need to put the metadata somewhere, either the JobConf or the key.
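One note on the splits: depending on the Hadoop version, it may be
simpler to override the protected isSplitable(...) hook in the input
format to return false than to reimplement getSplits(), though I'd
have to check whether my InputFormatBase exposes it. The heart of the
RecordReader itself is just slurping the file into one value, which
next() would emit once before returning false. A plain-Java sketch of
that piece (in the real RecordReader the InputStream would come from
FileSystem.open(path); here it is a byte stream so the sketch stands
alone):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Core of a whole-document RecordReader: read every byte of the split's
// stream and hand the document back as a single value.
public class WholeFileReader {
    /** Read the entire stream and return it as one String. */
    public static String readWholeDocument(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "the whole document, as one record".getBytes(StandardCharsets.UTF_8));
        // prints "the whole document, as one record"
        System.out.println(readWholeDocument(in));
    }
}
```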
Is there a better way of doing this, or am I on the right track?
Thanks,
Grant