Grant Ingersoll wrote:
Hi,
I have two tasks (although one is kind of a special case of the
other), and I am looking for some advice on how best to take
advantage of Hadoop for them:
1. I have a list of tokens that need to be translated from the
source language to the destination language. My approach is to
take the tokens, write them out to the FileSystem, one per line,
and then distribute (map) them onto the cluster for translation as
Text. I am not sure, however, how best to pass along the
metadata needed (source language and destination language). My
thoughts are to add the source and dest lang. to the JobConf (but
I could also see encoding it into the name of the file on the file
system and then into the key). Then during the Map phase, I would
need to either get the properties out of the JobConf or decode the
key to figure out the source and target languages.
Source and target language are two configuration properties for the
whole job, so passing them inside JobConf seems like the best way.
Each map/reduce task will get the same JobConf, including your
properties.
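Since Hadoop itself can't run in-line here, the idea can be sketched with a toy stand-in for JobConf (which is essentially a string key/value configuration shipped to every task). The property names below are made up for the example; the real class is org.apache.hadoop.mapred.JobConf, with the same set()/get() pattern, read back in the mapper's configure(JobConf) method.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for org.apache.hadoop.mapred.JobConf: a per-job
// string key/value configuration that every map/reduce task receives.
class FakeJobConf {
    private final Map<String, String> props = new HashMap<>();
    void set(String key, String value) { props.put(key, value); }
    String get(String key) { return props.get(key); }
}

public class JobConfSketch {
    public static void main(String[] args) {
        // Job setup: record the language pair once for the whole job.
        // Property names ("translate.src.lang" etc.) are hypothetical.
        FakeJobConf conf = new FakeJobConf();
        conf.set("translate.src.lang", "en");
        conf.set("translate.dst.lang", "de");

        // In a real mapper you would read these back in configure(JobConf).
        String src = conf.get("translate.src.lang");
        String dst = conf.get("translate.dst.lang");
        System.out.println(src + " -> " + dst); // prints "en -> de"
    }
}
```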
Cool, I figured JobConf was also distributed, but wasn't 100% certain.
Using the standard TextInputFormat you get <lineNo, lineSrcText> in
your map(), which you would then output after translation as
<lineSrcText, lineTgtText>. If you accidentally have duplicate
lines in the input, you will get multiple values in reduce, because
the same lineSrcText key would be associated with multiple
translations.
Makes sense.
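To see why duplicate input lines surface as multiple values under one key, the shuffle's grouping step can be mimicked in plain Java (a sketch; translate() is a dummy stand-in for the real translation call):

```java
import java.util.*;

public class ShuffleSketch {
    // Dummy "translation"; real code would call the translation system.
    static String translate(String src) { return src.toUpperCase(); }

    // map() emits <lineSrcText, lineTgtText>; the framework then groups
    // values by key before reduce(), which is what this map models.
    static Map<String, List<String>> mapAndGroup(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.computeIfAbsent(line, k -> new ArrayList<>())
                   .add(translate(line));
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Input with an accidental duplicate line: reduce() sees "hello"
        // once, but with two translated values in its Iterator.
        Map<String, List<String>> grouped =
            mapAndGroup(Arrays.asList("hello", "world", "hello"));
        System.out.println(grouped.get("hello")); // prints [HELLO, HELLO]
    }
}
```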
Actually, if you need to translate this into several languages, you
could loop in map() through all target languages, and output as
many translation tuples as needed, as <lineSrcText, <lang,
lineTgtText>> - then in your reduce() you would get them nicely
collected under a single key (lineSrcText) and all translated
values in the Iterator.
No need there, each job will be one language pair, but it is an
interesting idea that may be worth pursuing down the road.
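For reference, the fan-out suggested above can be sketched in plain Java, leaving out the Hadoop OutputCollector wiring (translate() is again a dummy stand-in):

```java
import java.util.*;

public class MultiLangMapSketch {
    // Dummy per-language "translation"; a real mapper would call the
    // translation system for each target language.
    static String translate(String src, String lang) {
        return "[" + lang + "]" + src;
    }

    // map(): one input line fans out to one <lineSrcText, <lang, tgt>>
    // tuple per target language; reduce() then gets all of them
    // collected under the single key lineSrcText.
    static List<String[]> mapLine(String lineSrcText, List<String> targetLangs) {
        List<String[]> tuples = new ArrayList<>();
        for (String lang : targetLangs) {
            tuples.add(new String[] {
                lineSrcText, lang, translate(lineSrcText, lang)
            });
        }
        return tuples;
    }

    public static void main(String[] args) {
        for (String[] t : mapLine("hello", Arrays.asList("de", "fr", "es"))) {
            System.out.println(t[0] + " -> <" + t[1] + ", " + t[2] + ">");
        }
    }
}
```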
2. This time, instead of tokens I have X number of whole documents
that need to be translated from source to destination and the way
the translation systems work, it is best to have the whole
document together when getting a translation. My plan here is to
implement my own InputFormat again, this time returning the whole
document from the RecordReader.next() and overriding getSplits()
in InputFormatBase to return only one split per file, regardless
of numSplits. Again, I would need to put the metadata somewhere,
either the JobConf or the key.
Is there a better way of doing this or am I on the right track?
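The core of the whole-document plan above is a record reader that turns one file into one <fileName, fileContent> record. In plain Java terms it boils down to something like this (a sketch only; the actual Hadoop wiring of getSplits() returning one split per file and RecordReader.next() is omitted):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WholeFileSketch {
    // The essence of a whole-document record reader: one file yields
    // exactly one <fileName, fileContent> record, so the translation
    // system always sees the complete document.
    static String[] readWholeFile(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return new String[] {
            file.getFileName().toString(),
            new String(bytes, StandardCharsets.UTF_8)
        };
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("doc", ".txt");
        Files.write(tmp, "entire document body".getBytes(StandardCharsets.UTF_8));
        String[] record = readWholeFile(tmp);
        System.out.println(record[0] + " -> " + record[1]);
        Files.delete(tmp);
    }
}
```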
Basically, it's ok - the only problematic aspect is that if you
have millions of documents then using this method you will get
millions of map tasks to execute, because you create as many splits
(hence, map tasks) as there are files ... perhaps a better way
would be to first wrap these documents into a single SequenceFile
consisting of <fileName, fileContent>, and use
SequenceFileInputFormat.
OK, that makes more sense. I wasn't totally clear on SeqFile, but
based on what you said and looking at it again that seems like a much
better way to handle it.
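The packing idea can be illustrated with a toy record stream in plain Java. To be clear, this is NOT the real SequenceFile on-disk format - Hadoop's SequenceFile.Writer and SequenceFileInputFormat handle that - it only shows the shape of the data: many small documents wrapped into one file of <fileName, fileContent> records.

```java
import java.io.*;
import java.util.LinkedHashMap;
import java.util.Map;

public class PackSketch {
    // Toy illustration of the SequenceFile idea: pack many documents
    // into a single stream of <fileName, fileContent> records, so the
    // job reads one big file instead of millions of tiny ones.
    static byte[] pack(Map<String, String> docs) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (Map.Entry<String, String> e : docs.entrySet()) {
            out.writeUTF(e.getKey());   // key: file name
            out.writeUTF(e.getValue()); // value: whole file content
        }
        return bos.toByteArray();
    }

    static Map<String, String> unpack(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        Map<String, String> docs = new LinkedHashMap<>();
        while (in.available() > 0) {
            docs.put(in.readUTF(), in.readUTF());
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "first document");
        docs.put("b.txt", "second document");
        System.out.println(unpack(pack(docs)).get("b.txt")); // prints "second document"
    }
}
```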
Thanks,
Grant