Grant Ingersoll wrote:
Hi,
I have two tasks (although one is really a special case of the other)
and I am looking for some advice on how best to take advantage of
Hadoop with them:
1. I have a list of tokens that need to be translated from the source
language to the destination language. My approach is to take the
tokens, write them out to the FileSystem, one per line, and then
distribute (map) them onto the cluster for translation as Text. I am
not sure, however, how best to pass along the metadata needed (source
language and destination language). My thoughts are to add the source
and dest lang. to the JobConf (but I could also see encoding it into
the name of the file on the file system and then into the key). Then
during the Map phase, I would need to either get the properties out of
the JobConf or decode the key to figure out the source and target
languages.
Source and target language are configuration properties for the
whole job, so passing them in the JobConf seems like the best way. Each
map/reduce task gets the same JobConf, including your properties.
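A minimal sketch of this, against the old org.apache.hadoop.mapred API; the property names "xlate.src" and "xlate.dst" are invented for the example:

```java
// Job setup side - stash the two languages in the JobConf:
JobConf job = new JobConf(TranslateJob.class);
job.set("xlate.src", "en");
job.set("xlate.dst", "fr");

// Mapper side - configure() receives that same JobConf on every task:
public class TranslateMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String srcLang;
  private String dstLang;

  @Override
  public void configure(JobConf conf) {
    srcLang = conf.get("xlate.src");
    dstLang = conf.get("xlate.dst");
  }
  // map() can then use srcLang/dstLang for every record.
}
```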
Using the standard TextInputFormat you get <byteOffset, lineSrcText> in
your map() (the key is the position of the line in the file, not a line
number), which you would then output after translation as <lineSrcText,
lineTgtText>. If you accidentally have duplicate lines in the input, you
will get multiple values in reduce(), because the same lineSrcText key
would be associated with multiple translations.
Actually, if you need to translate this into several target languages,
you could loop in map() through all of them, and output as many
translation tuples as needed, as <lineSrcText, <lang, lineTgtText>> -
then in your reduce() you would get them nicely collected under a single
key (lineSrcText), with all translated values in the Iterator.
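That loop might look like the sketch below; translate() and the targetLangs field are stand-ins for your own translation call and configuration, and the <lang, lineTgtText> pair is encoded as a tab-joined Text value for simplicity:

```java
// Emits one <lineSrcText, "lang\ttranslation"> pair per target language,
// so reduce() collects every translation of a source line under one key.
public void map(LongWritable offset, Text line,
                OutputCollector<Text, Text> out, Reporter reporter)
    throws IOException {
  String src = line.toString();
  for (String lang : targetLangs) {          // e.g. {"fr", "de", "es"}
    out.collect(line, new Text(lang + "\t" + translate(src, lang)));
  }
}
```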
2. This time, instead of tokens, I have X whole documents that need to
be translated from source to destination and, given the way the
translation systems work, it is best to have the whole document
available when requesting a translation. My plan here is to implement my
own InputFormat again, this time returning the whole document from the
RecordReader.next() and overriding getSplits() in InputFormatBase to
return only one split per file, regardless of numSplits. Again, I
would need to put the metadata somewhere, either the JobConf or the key.
Is there a better way of doing this or am I on the right track?
Basically, it's OK - the only problematic aspect is that if you have
millions of documents, this method will produce millions of map tasks,
because you create as many splits (and hence map tasks) as there are
files. Perhaps a better way would be to first wrap these documents into
a single SequenceFile of <fileName, fileContent> records, and use
SequenceFileInputFormat.
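The packing step could be a one-off piece of local code along these lines; the output path, the docPaths list, and the UTF-8 assumption are all illustrative:

```java
// Wraps each document into a single SequenceFile as <fileName, fileContent>.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, new Path("docs.seq"), Text.class, Text.class);
try {
  for (Path doc : docPaths) {                 // docPaths: your input files
    byte[] buf = new byte[(int) fs.getFileStatus(doc).getLen()];
    FSDataInputStream in = fs.open(doc);
    in.readFully(buf);                        // whole document in one record
    in.close();
    writer.append(new Text(doc.getName()),
                  new Text(new String(buf, "UTF-8")));
  }
} finally {
  writer.close();
}
```

With SequenceFileInputFormat over docs.seq, each map() call then receives one complete document as its value, and the file name travels along as the key.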
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com