Grant Ingersoll wrote:
Hi,

I have two tasks (although one is kind of a special case of the other) and am looking for some advice on how best to tackle them with Hadoop:

1. I have a list of tokens that need to be translated from the source language to the destination language. My approach is to take the tokens, write them out to the FileSystem, one per line, and then distribute (map) them onto the cluster for translation as Text. I am not sure, however, how best to pass along the metadata needed (source language and destination language). My thoughts are to add the source and dest lang. to the JobConf (but I could also see encoding it into the name of the file on the file system and then into the key). Then during the Map phase, I would need to either get the properties out of the JobConf or decode the key to figure out the source and target languages.

Source and target language are two configuration properties for the whole job, so passing them inside JobConf seems like the best way. Each map/reduce task will get the same JobConf, including your properties.
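A rough sketch of that, against the old org.apache.hadoop.mapred API — the property names ("translate.src.lang", "translate.tgt.lang") and the translate() helper are made up here, not anything Hadoop provides:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TranslateMapper extends MapReduceBase implements Mapper {

  private String srcLang;
  private String tgtLang;

  // configure() is called once per task with the job's JobConf, so every
  // map task sees the same two properties set by the driver.
  public void configure(JobConf job) {
    srcLang = job.get("translate.src.lang");
    tgtLang = job.get("translate.tgt.lang");
  }

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    String src = ((Text) value).toString();
    // Emit <lineSrcText, lineTgtText>.
    output.collect(new Text(src), new Text(translate(src, srcLang, tgtLang)));
  }

  // Placeholder for whatever translation service you actually call.
  private String translate(String text, String from, String to) {
    return text;
  }
}
```

In the driver you would set the two properties before submitting, e.g. conf.set("translate.src.lang", "en") and conf.set("translate.tgt.lang", "de").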

Using the standard TextInputFormat you get <lineNo, lineSrcText> in your map(), which you would then output after translation as <lineSrcText, lineTgtText>. If you accidentally have duplicate lines in the input, you will get multiple values in reduce, because the same lineSrcText key would be associated with multiple translations.

Actually, if you need to translate this into several languages, you could loop in map() over all target languages and output as many translation tuples as needed, as <lineSrcText, <lang, lineTgtText>> - then in your reduce() you would get them nicely collected under a single key (lineSrcText), with all translated values in the Iterator.
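The multi-language variant only changes the map() body — this sketch assumes the target languages arrive as a comma-separated JobConf property (again a made-up name), and that <lang, lineTgtText> is encoded as one tab-separated Text value, since a map output value must be a single Writable:

```java
// Inside the same Mapper as before; tgtLangs would be parsed in
// configure(), e.g.: tgtLangs = job.get("translate.tgt.langs").split(",");
public void map(WritableComparable key, Writable value,
                OutputCollector output, Reporter reporter)
    throws IOException {
  String src = ((Text) value).toString();
  for (int i = 0; i < tgtLangs.length; i++) {
    String lang = tgtLangs[i];
    // One tuple per target language; reduce() then sees all translations
    // of one source line together in its Iterator, keyed by lineSrcText.
    output.collect(new Text(src),
                   new Text(lang + "\t" + translate(src, srcLang, lang)));
  }
}
```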


2. This time, instead of tokens I have X number of whole documents that need to be translated from source to destination and the way the translation systems work, it is best to have the whole document together when getting a translation. My plan here is to implement my own InputFormat again, this time returning the whole document from the RecordReader.next() and overriding getSplits() in InputFormatBase to return only one split per file, regardless of numSplits. Again, I would need to put the metadata somewhere, either the JobConf or the key.

Is there a better way of doing this or am I on the right track?

Basically, it's ok - the only problematic aspect is that if you have millions of documents, this method will give you millions of map tasks to execute, because you create as many splits (hence, map tasks) as there are files. Perhaps a better way would be to first wrap these documents into a single SequenceFile consisting of <fileName, fileContent> pairs, and use SequenceFileInputFormat.
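Packing the documents could look roughly like this - a sketch only: the local "docs" directory, the "docs.seq" output path, and the readFile() helper are all assumptions, and the whole-file read assumes each document fits in memory:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackDocs {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // One SequenceFile of <fileName, fileContent> instead of N small files.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("docs.seq"), Text.class, Text.class);
    File[] docs = new File("docs").listFiles();
    for (int i = 0; i < docs.length; i++) {
      writer.append(new Text(docs[i].getName()),
                    new Text(readFile(docs[i])));
    }
    writer.close();
  }

  // Naive whole-file read into a String.
  private static String readFile(File f) throws IOException {
    FileInputStream in = new FileInputStream(f);
    byte[] buf = new byte[(int) f.length()];
    int off = 0;
    while (off < buf.length) {
      int n = in.read(buf, off, buf.length - off);
      if (n < 0) break;
      off += n;
    }
    in.close();
    return new String(buf, "UTF-8");
  }
}
```

The job then sets conf.setInputFormat(SequenceFileInputFormat.class), and each map() call receives one whole document as its value, with the file name as the key - no custom InputFormat needed.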


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

