Hello, gentlemen! We are trying to adapt Hadoop to suit our application (or rather, to adapt our application to fit Map/Reduce and Hadoop ;) ), and I have several questions:
1) During the map part of a job, our application creates a BLOB which we need to save and later re-use in another part of the application. This BLOB has a unique but large key, which comes from the application's domain and is by nature a string. The content of the BLOB is used within the map task, so first of all we need to obtain the content, pass a stream opened over this BLOB to a legacy part of our application, which does not need to know anything about Hadoop (it simply accepts an instance of java.io.InputStream), and then emit a key/value pair to the reducer (where the key is the original unique key and the value is the BLOB produced by the map task). Looking at the Hadoop API documentation, I found that several implementations of OutputFormat are available, but I'm not sure which one to use to store a large value (which could be several tens of megabytes in size). Also, how would we later get the content of the BLOB back to pass it to the legacy application? We want to avoid keeping the BLOB in RAM because of its size. A rough sketch of what we currently have in mind is included below.

2) Our application later needs to read data from these saved BLOBs, as well as some metadata associated with each BLOB. Moreover, there should be a way to find a BLOB by certain criteria. As far as I understand Map/Reduce, it should be possible to spawn a set of tasks which are executed on different data nodes, and the search results returned by these tasks are collected and "reduced" later. The question here is: does Hadoop take care of running as many tasks as there are data nodes available in the system, and does it send each task to the node that holds the input data the task works on?
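Here is the sketch for question 1. Everything below is only a rough idea of what we are considering, not working code: BlobMapper and LegacyProcessor are placeholder names, the /blobs/processed/ path is made up, and we are not sure this is the right way to use the API. The idea is to let the legacy code read the BLOB through the InputStream returned by FileSystem.open() and write its result straight back to HDFS, so only the key and a path reference go through the shuffle instead of the BLOB bytes themselves:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: key = our application key, value = HDFS path of the raw BLOB.
// Output: key = the same application key, value = HDFS path of the processed BLOB.
public class BlobMapper extends Mapper<Text, Text, Text, Text> {

    @Override
    protected void map(Text key, Text blobPath, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());

        Path in = new Path(blobPath.toString());
        // Placeholder output location; in practice we would derive it from the key.
        Path out = new Path("/blobs/processed/" + key.toString());

        InputStream blobStream = fs.open(in);             // a plain java.io.InputStream for the legacy code
        FSDataOutputStream resultStream = fs.create(out);
        try {
            // LegacyProcessor is a placeholder for our existing component,
            // which only knows about java.io streams.
            new LegacyProcessor().process(blobStream, resultStream);
        } finally {
            blobStream.close();
            resultStream.close();
        }

        // Emit only the key and a reference to the stored result, not the bytes,
        // so the tens-of-megabytes BLOB never has to be held in RAM.
        context.write(key, new Text(out.toString()));
    }
}

Would that be a reasonable approach, or is there an OutputFormat that is better suited for values of this size?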
Thank you in advance!

-- 
Eugene N Dzhurinsky
