Eugeny N Dzhurinsky wrote:
Hello, gentlemen!
We are trying to adapt Hadoop to suit our application (or rather, to adapt our
application to fit Map/Reduce and Hadoop ;)), and I have several questions:
1) During the map part of a job, our application creates a BLOB which we will
need to save and then re-use in another part of the application. This BLOB has
a unique but large key that comes from the application's domain; by nature the
key is a string. The content of the BLOB is used within the map job, so first
of all we need to obtain the content (the BLOB), pass a stream created over it
to another, legacy part of our application, which does not need to know
anything about Hadoop (in general it just accepts an instance of
java.io.InputStream), and then return a key/value pair to the reducer (where
the key is the original unique key we adopted and the value is the BLOB we got
after processing in the map job).
After looking at the Hadoop API documentation, I found that there are several
implementations of OutputFormat available, but I'm not sure which one to use
to keep a large value (which could be several tens of megabytes in size).
Also, how would we get the content of the BLOB to pass it back to the legacy
application? We want to avoid keeping the BLOB in RAM because of its size.
Hi Eugeny,
I work with an application that faces a somewhat similar problem. If you
keep using large keys, at some point tasks will start failing with an
OutOfMemoryError, because Hadoop needs to keep many keys completely in RAM ...
The solution I found was to replace the large keys with equivalent small
keys, i.e. hashes. Large content (both keys and values) is stored in HDFS.
For the large keys (which are still somewhat smaller than the values) I use
a MapFile of <hash, largeKey>, and the large values I store in regular files
on HDFS, named by the hash.
This way I can use <hash, largeKey> as the input to map-reduce, and retrieve
the large values lazily from HDFS from a pre-defined path + "/" + hash.
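To make that layout concrete, here is a minimal sketch in Java. The /blobs
directory, the /blobs/keymap MapFile and the choice of MD5 are my own
assumptions for illustration, not something Hadoop prescribes. The BLOB is
streamed straight into HDFS, so it never has to be held in RAM:

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.SortedMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MD5Hash;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class BlobStore {

  // Hypothetical layout: one file per BLOB under /blobs/<hash>,
  // plus a MapFile of <hash, largeKey> under /blobs/keymap.
  private static final Path BLOB_DIR = new Path("/blobs");
  private static final String KEY_MAP = "/blobs/keymap";

  /** Derive a small, fixed-size key from the large application key. */
  public static String smallKey(String largeKey) throws IOException {
    return MD5Hash.digest(largeKey.getBytes("UTF-8")).toString();
  }

  /** Stream a BLOB into HDFS under its hash, without ever buffering it fully in RAM. */
  public static String saveBlob(Configuration conf, String largeKey, InputStream blob)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    String hash = smallKey(largeKey);
    FSDataOutputStream out = fs.create(new Path(BLOB_DIR, hash));
    IOUtils.copyBytes(blob, out, conf, true);  // copies in small buffers, then closes both streams
    return hash;
  }

  /**
   * Write the <hash, largeKey> index.  MapFile.Writer requires keys in sorted
   * order, which is why a SortedMap is used here; in practice this index is
   * often produced by a map-reduce job (e.g. via MapFileOutputFormat) instead.
   */
  public static void writeKeyMap(Configuration conf, SortedMap<String, String> hashToLargeKey)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    MapFile.Writer writer = new MapFile.Writer(conf, fs, KEY_MAP, Text.class, Text.class);
    try {
      for (Map.Entry<String, String> e : hashToLargeKey.entrySet()) {
        writer.append(new Text(e.getKey()), new Text(e.getValue()));
      }
    } finally {
      writer.close();
    }
  }
}

Naming the value files by the hash keeps the HDFS paths short and uniform no
matter how long the original application keys are.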
2) Our application needs to get data from these saved BLOBs later, as well as
some meta-data associated with each BLOB. Moreover, there should be a way to
find a BLOB by certain criteria.

The question is whether you need this search to occur as a batch job or in
real time. The solution I outlined above lets you quickly look up the
largeKey and also locate the file that holds the BLOB.
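For the real-time case, here is a small lookup sketch along the same lines,
again assuming the hypothetical /blobs layout from the previous sketch: the
MapFile resolves the hash back to the original large key, and FileSystem.open()
returns an FSDataInputStream, which is a plain java.io.InputStream that can be
handed to the legacy code without loading the whole value into memory:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class BlobLookup {

  /** Resolve the original large key from its hash via the <hash, largeKey> MapFile. */
  public static String largeKeyFor(Configuration conf, String hash) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, "/blobs/keymap", conf);  // hypothetical path
    try {
      Text largeKey = new Text();
      return reader.get(new Text(hash), largeKey) == null ? null : largeKey.toString();
    } finally {
      reader.close();
    }
  }

  /**
   * Open the BLOB as a plain java.io.InputStream for the legacy code;
   * FSDataInputStream streams from HDFS, so the value is never held in RAM.
   */
  public static InputStream openBlob(Configuration conf, String hash) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    return fs.open(new Path("/blobs", hash));  // hypothetical path, matches the layout above
  }
}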
As far as I understand Map/Reduce, it should be possible to spawn a set of
jobs that would be executed on different data nodes, and the results of the
searches returned by these jobs would be collected and "reduced" later. The
question here is: does Hadoop take care of running as many jobs as there are
data nodes available in the system, and of passing each job to the remote
node that holds the input data used to start it?

Input data is split into portions (InputSplits) by the InputFormat, and a
single map task is assigned to each split. Map tasks are then allocated to as
many nodes as needed, taking into consideration the job priority, cluster
capacity, the maximum number of tasks per node, etc. The framework also tries
to run each map task on a node that holds a local replica of its input split,
so the data locality you ask about is largely handled for you.
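As a rough illustration, here is a job-submission sketch with the old
org.apache.hadoop.mapred API. IdentityMapper/IdentityReducer are placeholders
for whatever search logic you would plug in, and /blobs/keymap is the
hypothetical index from above. The point is that the number of map tasks
follows from the splits the InputFormat produces; you do not assign tasks to
nodes yourself:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SearchJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SearchJob.class);
    conf.setJobName("blob-search");

    // The InputFormat decides how the input is split; one map task per split,
    // and the framework prefers nodes that hold a local copy of that split.
    // SequenceFileInputFormat can read the data file of a MapFile directory.
    conf.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(conf, new Path("/blobs/keymap"));  // hypothetical <hash, largeKey> index
    FileOutputFormat.setOutputPath(conf, new Path(args[0]));

    // Placeholders for the real search mapper/reducer.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    JobClient.runJob(conf);
  }
}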
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com