Grant Ingersoll wrote:
Hi,
I have two tasks (although one is really a special case of the other)
and I am looking for some advice on how best to take advantage of
Hadoop with them:
1. I have a list of tokens that need to be translated from the source
language to the destination language. My approach is to take the
tokens, write them out to the FileSystem, one per line, and then
distribute (map) them onto the cluster for translation as Text. I am
not sure, however, how best to pass along the metadata needed (source
language and destination language). My thoughts are to add the source
and dest lang. to the JobConf (but I could also see encoding it into
the name of the file on the file system and then into the key). Then
during the Map phase, I would need to either get the properties out of
the JobConf or decode the key to figure out the source and target
languages.
Source and target language are configuration properties for the
whole job, so passing them in the JobConf seems like the best way. Each
map/reduce task gets the same JobConf, including your properties.
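A minimal sketch of this, against the old org.apache.hadoop.mapred API; the property names "xlate.src" and "xlate.dst" are invented for the example:

```java
// Job setup side - stash the two languages in the JobConf:
JobConf job = new JobConf(TranslateJob.class);
job.set("xlate.src", "en");
job.set("xlate.dst", "fr");

// Mapper side - configure() receives that same JobConf on every task:
public class TranslateMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String srcLang;
  private String dstLang;

  @Override
  public void configure(JobConf conf) {
    srcLang = conf.get("xlate.src");
    dstLang = conf.get("xlate.dst");
  }
  // map() can then use srcLang/dstLang for every record.
}
```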
Using the standard TextInputFormat you get <byteOffset, lineSrcText> in
your map() (the key is the position of the line in the file, not a line
number), which you would then output after translation as <lineSrcText,
lineTgtText>. If you accidentally have duplicate lines in the input, you
will get multiple values in reduce(), because the same lineSrcText key
would be associated with multiple translations.
Actually, if you need to translate this into several target languages,
you could loop in map() through all of them, and output as many
translation tuples as needed, as <lineSrcText, <lang, lineTgtText>> -
then in your reduce() you would get them nicely collected under a single
key (lineSrcText), with all translated values in the Iterator.
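That loop might look like the sketch below; translate() and the targetLangs field are stand-ins for your own translation call and configuration, and the <lang, lineTgtText> pair is encoded as a tab-joined Text value for simplicity:

```java
// Emits one <lineSrcText, "lang\ttranslation"> pair per target language,
// so reduce() collects every translation of a source line under one key.
public void map(LongWritable offset, Text line,
                OutputCollector<Text, Text> out, Reporter reporter)
    throws IOException {
  String src = line.toString();
  for (String lang : targetLangs) {          // e.g. {"fr", "de", "es"}
    out.collect(line, new Text(lang + "\t" + translate(src, lang)));
  }
}
```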
2. This time, instead of tokens, I have X whole documents that need to
be translated from source to destination and, given the way the
translation systems work, it is best to have the whole document
available when requesting a translation. My plan here is to implement my
own InputFormat again, this time returning the whole document from the
RecordReader.next() and overriding getSplits() in InputFormatBase to
return only one split per file, regardless of numSplits. Again, I
would need to put the metadata somewhere, either the JobConf or the key.
Is there a better way of doing this or am I on the right track?
Basically, it's OK - the only problematic aspect is that if you have
millions of documents, this method will produce millions of map tasks,
because you create as many splits (and hence map tasks) as there are
files. Perhaps a better way would be to first wrap these documents into
a single SequenceFile of <fileName, fileContent> records, and use
SequenceFileInputFormat.
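The packing step could be a one-off piece of local code along these lines; the output path, the docPaths list, and the UTF-8 assumption are all illustrative:

```java
// Wraps each document into a single SequenceFile as <fileName, fileContent>.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, new Path("docs.seq"), Text.class, Text.class);
try {
  for (Path doc : docPaths) {                 // docPaths: your input files
    byte[] buf = new byte[(int) fs.getFileStatus(doc).getLen()];
    FSDataInputStream in = fs.open(doc);
    in.readFully(buf);                        // whole document in one record
    in.close();
    writer.append(new Text(doc.getName()),
                  new Text(new String(buf, "UTF-8")));
  }
} finally {
  writer.close();
}
```

With SequenceFileInputFormat over docs.seq, each map() call then receives one complete document as its value, and the file name travels along as the key.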
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com