Eugeny N Dzhurinsky wrote:
On Tue, Nov 20, 2007 at 05:42:28PM +0100, Andrzej Bialecki wrote:
I work with an application that faces a somewhat similar problem. If you
keep using large keys, at some point tasks will start failing with an
OutOfMemory exception, because Hadoop needs to keep many keys completely in
RAM ...
The solution I found was to replace large keys with equivalent small keys,
i.e. hashes. Large content (both keys and values) is stored in HDFS. For
large keys (which are still somewhat smaller than values) I use a MapFile
of <hash, largeKey>, and large values I store in regular files on HDFS,
named by the hash.
This way, I can use <hash, largeKey> as the input to map-reduce, and
retrieve large values lazily from HDFS from a pre-defined path + "/" +
hash.
Okay, good point, thank you. So as far as I understand, in the map job I
will take the blob, store it in HDFS using a Path built in some way, and then
obtain an InputStream from the file somehow? Or did I just miss something?
Yes, exactly like this.
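For reference, here is a minimal sketch of that flow using the FileSystem API
(the /data/blobs location, the MD5 hashing and the BlobStore helper class are
just assumptions for illustration, not anything prescribed by Hadoop):

    import java.io.InputStream;
    import java.security.MessageDigest;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlobStore {
      // base directory for blobs -- an arbitrary choice for this sketch
      private static final Path BLOB_DIR = new Path("/data/blobs");

      /** Store a blob under a path derived from the MD5 hash of its large key. */
      public static Path writeBlob(FileSystem fs, byte[] largeKey, byte[] blob)
          throws Exception {
        String hash = toHex(MessageDigest.getInstance("MD5").digest(largeKey));
        Path p = new Path(BLOB_DIR, hash);
        FSDataOutputStream out = fs.create(p);
        try {
          out.write(blob);
        } finally {
          out.close();
        }
        return p;
      }

      /** Later (e.g. inside a map task) retrieve the blob lazily as an InputStream. */
      public static InputStream openBlob(FileSystem fs, String hash) throws Exception {
        return fs.open(new Path(BLOB_DIR, hash));
      }

      private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
      }
    }

You would obtain the FileSystem instance with FileSystem.get(conf) inside the
task, and emit the <hash, largeKey> pairs separately to build the MapFile
mentioned above.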
2) Our application needs to get data from these saved BLOBs later, as well as
some meta-data associated with each BLOB. Moreover, there should be an ability
to find the BLOB using certain criteria. As far as I understand Map/Reduce, it ...
The question is whether you need this search to occur as a batch job, or in
real-time. The solution I outlined above allows you to quickly look up the
largeKey and also locate the file with the BLOB.
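For the lookup side, a rough sketch, assuming the MapFile lives at /data/keys
and that both the hash and the large key are stored as Text (both of those are
assumptions for this example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class KeyLookup {
      /** Resolve a hash back to the original large key via the <hash, largeKey> MapFile. */
      public static Text lookup(String hash) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // "/data/keys" is a made-up location for this sketch
        MapFile.Reader keys = new MapFile.Reader(fs, "/data/keys", conf);
        try {
          Text largeKey = new Text();
          // get() fills largeKey and returns it, or returns null if the hash is absent
          return (Text) keys.get(new Text(hash), largeKey);
        } finally {
          keys.close();
        }
      }
    }

Once you have the hash you can open the BLOB file directly, as in the previous
sketch.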
Well, I was thinking about scheduling a job with "search conditions" as an
input; that job would perform the search and return key/value pairs somehow,
to be used later. Those key/value pairs will be used to refer to other
blobs, if that matters. And there could be several such jobs in the
system at the same time, so I would need to separate them somehow to avoid
data interference.
I probably don't understand your requirements - what you described looks
like one of the example applications supplied with Hadoop, namely Grep.
This application selects specific input records according to a pattern.
Input data is split into portions with the help of the InputFormat, and a
single map task is assigned to each split. Then map tasks are allocated to
as many nodes as needed, taking into consideration the job priority,
cluster capacity, max. number of tasks per node, etc.
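As a rough illustration of such a pattern-matching mapper (written against the
org.apache.hadoop.mapred API; the SearchMapper class and the "search.pattern"
property are made up for this sketch, and details differ between Hadoop
versions):

    import java.io.IOException;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    /** Emits only the records whose content matches the configured pattern. */
    public class SearchMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private Pattern pattern;

      public void configure(JobConf job) {
        // the "search conditions" are passed to every task through the job
        // configuration; "search.pattern" is just a made-up property name
        pattern = Pattern.compile(job.get("search.pattern", ".*"));
      }

      public void map(LongWritable offset, Text record,
          OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        if (pattern.matcher(record.toString()).find()) {
          // emit the matching record; you could also emit the hash or metadata instead
          output.collect(record, new Text(""));
        }
      }
    }

Each concurrent search would simply be a separate job with its own output
directory, which also avoids the data interference you mention.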
Looks like I might be missing some core concept of Hadoop. As far as I
understood, the data is always kept on the same node that executes the job
which produces the data. This means the blobs will be stored on the same
node where the job was started, and other nodes will not know anything about
such blobs, so when doing some search the jobs will be started on all nodes,
and some of the jobs will return data and some will not, but the end user will
be presented with the merged data?
If you use HDFS, and you run datanodes and tasktrackers on each node,
then Hadoop will try to execute map tasks on the node that holds the
part of input data allocated to the task. We are talking about
individual data blocks here - a single file consists of many data blocks
and they may be located on many datanodes. In other words, parts of your
blob will be located on different machines across the cluster. HDFS
hides this and provides a simple InputStream to retrieve the whole content.
Then, each reduce task will produce one part of the output, again as a
file on HDFS - so you will possibly get multiple files on HDFS which
together form the output of a job. These files are located on HDFS, so
again individual data blocks may be located anywhere on the cluster, but
using FileSystem.open(...) you can retrieve the complete content of
such files as an InputStream (or wrap it in a more complex reader, such as
DataInputStream, SequenceFile.Reader, MapFile.Reader, etc.).
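A small sketch of reading such output, assuming the reducers wrote their part
files as SequenceFiles of <Text, Text> (the ReadJobOutput class and the way
the output directory is passed in are illustrative only):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ReadJobOutput {
      /** Iterate over all part files produced by the reducers of one job. */
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0] is the job output directory; each reducer wrote one part-* file
        for (FileStatus st : fs.listStatus(new Path(args[0]))) {
          if (!st.getPath().getName().startsWith("part-")) continue;
          SequenceFile.Reader in = new SequenceFile.Reader(fs, st.getPath(), conf);
          try {
            Text key = new Text();
            Text value = new Text();
            while (in.next(key, value)) {
              System.out.println(key + "\t" + value);
            }
          } finally {
            in.close();
          }
        }
      }
    }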
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com