Hello, gentlemen!

We are trying to adapt Hadoop to suit our application (or, mostly, to adapt our
application to fit Map/Reduce and Hadoop ;) ), and I have several questions:

1) In the map part of a job, our application creates a BLOB which we need to
save and then re-use in another part of the application. The BLOB has a unique
but large key that comes from the application's domain; by nature the key is a
string. The content of the BLOB is used within the map task, so first of all we
need to obtain the content (the BLOB), pass a stream created on it to a legacy
part of our application, which does not need to know anything about Hadoop (in
general it just accepts an instance of java.io.InputStream), and then emit a
key/value pair to the reducer, where the key is the original unique key and the
value is the BLOB produced by the map task.
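To make this concrete, here is a rough sketch of what I imagine the mapper
would look like. The input/output types (Text key, BytesWritable value) are
just my assumption, and the pass-through copy stands in for the call into our
legacy code, which only ever sees a java.io.InputStream:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch only: Text key = our domain key, BytesWritable value = raw BLOB bytes.
    public class BlobMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {

        @Override
        protected void map(Text key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            // Expose the BLOB as a plain InputStream, so the legacy code
            // sees only java.io and never a Hadoop type.
            InputStream in = new ByteArrayInputStream(value.getBytes(), 0, value.getLength());

            // Our legacy code would consume 'in' here and hand back new BLOB
            // content; in this sketch we simply copy the bytes through.
            ByteArrayOutputStream processed = new ByteArrayOutputStream();
            IOUtils.copyBytes(in, processed, context.getConfiguration(), true);

            // Emit the original domain key with the processed BLOB as the value.
            context.write(key, new BytesWritable(processed.toByteArray()));
        }
    }

The obvious problem with this sketch is that the whole BLOB lives in memory as
a byte array, which brings me to the question below.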

After looking at the Hadoop API documentation, I found that several
implementations of OutputFormat are available, but I'm not sure which one we
should use to store such a large value (it could be several tens of megabytes
in size). Also, how would we get the content of the BLOB back to the legacy
application? We want to avoid keeping the BLOB in RAM because of its size.
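One idea I had to avoid holding the BLOB in RAM is to stream it straight into
its own HDFS file from the map task and emit only (domain key, path) through
the normal output format. The helper below is just my own sketch; the class
name and the "/blobs/<key>" layout are made up, and it assumes the key is safe
to use as a file name:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class BlobStore {

        /**
         * Streams the processed BLOB into its own HDFS file and returns the
         * path, so the emitted key/value pair only has to carry the key and
         * the path instead of tens of megabytes of value.
         */
        public static Path saveBlob(Configuration conf, String domainKey, InputStream blob)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            // Made-up layout: one file per BLOB under a common directory.
            Path target = new Path("/blobs/" + domainKey);
            FSDataOutputStream out = fs.create(target);
            // Copies in fixed-size chunks and closes both streams, so only
            // the copy buffer is ever held in RAM.
            IOUtils.copyBytes(blob, out, conf, true);
            return target;
        }

        /** Re-opens a stored BLOB as a plain InputStream for the legacy code. */
        public static InputStream openBlob(Configuration conf, String domainKey)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            return fs.open(new Path("/blobs/" + domainKey));
        }
    }

Would this be a reasonable pattern, or is one of the existing OutputFormat
implementations a better fit for values of this size?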

2) Our application later needs to read data back from these saved BLOBs, as
well as some metadata associated with each BLOB. Moreover, it should be
possible to find a BLOB by certain criteria. As far as I understand Map/Reduce,
it should be possible to spawn a set of tasks which are executed on different
data nodes, and the search results returned by these tasks are then collected
and "reduced". The question here is: does Hadoop take care of running as many
tasks as there are data nodes available, and of sending each task to the node
that already holds its input data?
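For what it's worth, my current understanding is that Hadoop splits the input,
runs one map task per split, and tries to schedule each task on a node that
already holds that split's data, so I imagine the search side as a map-side
filter roughly like the one below. The metadata-as-Text records and the
"search.criteria" property are entirely my own invention:

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of a map-side filter: input records are (domain key, metadata),
    // output records are the keys whose metadata matches the search criteria.
    public class BlobSearchMapper extends Mapper<Text, Text, Text, Text> {

        private String criteria;

        @Override
        protected void setup(Context context) {
            // "search.criteria" is a property name I made up; it would be set
            // on the job's Configuration before submission.
            criteria = context.getConfiguration().get("search.criteria", "");
        }

        @Override
        protected void map(Text key, Text metadata, Context context)
                throws IOException, InterruptedException {
            // Trivial match for the sketch; real matching would parse the metadata.
            if (metadata.toString().contains(criteria)) {
                context.write(key, metadata);
            }
        }
    }

The reducer would then just collect the matching keys from all map tasks. Is
that roughly how it is supposed to work?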

Thank you in advance!

-- 
Eugene N Dzhurinsky

