Eugeny N Dzhurinsky wrote:
On Tue, Nov 20, 2007 at 05:42:28PM +0100, Andrzej Bialecki wrote:
I work with an application that faces a somewhat similar problem. If you
keep using large keys, at some point tasks will start failing with an
OutOfMemory exception, because Hadoop needs to keep many keys completely in
RAM ...
The solution I found was to replace large keys with equivalent small keys,
i.e. hashes. Large content (both keys and values) is stored in HDFS. For
large keys (which are still somewhat smaller than values) I use a MapFile
of <hash, largeKey>, and large values I store in regular files on HDFS,
named by the hash.
This way, I can use <hash, largeKey> as the input to map-reduce, and
retrieve large values lazily from HDFS from a pre-defined path + "/" +
hash.
Okay, good point, thank you. So as far as I understand, in the map job I
will take the blob, store it in HDFS using a Path built in some way, and then
obtain an InputStream from the file somehow? Or did I just miss something?
Yes, exactly like this.
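For reference, here is a minimal sketch of that flow using the FileSystem API
(the /data/blobs location, the MD5 hashing and the BlobStore helper class are
just assumptions for illustration, not anything prescribed by Hadoop):

    import java.io.InputStream;
    import java.security.MessageDigest;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlobStore {
      // base directory for blobs -- an arbitrary choice for this sketch
      private static final Path BLOB_DIR = new Path("/data/blobs");

      /** Store a blob under a path derived from the MD5 hash of its large key. */
      public static Path writeBlob(FileSystem fs, byte[] largeKey, byte[] blob)
          throws Exception {
        String hash = toHex(MessageDigest.getInstance("MD5").digest(largeKey));
        Path p = new Path(BLOB_DIR, hash);
        FSDataOutputStream out = fs.create(p);
        try {
          out.write(blob);
        } finally {
          out.close();
        }
        return p;
      }

      /** Later (e.g. inside a map task) retrieve the blob lazily as an InputStream. */
      public static InputStream openBlob(FileSystem fs, String hash) throws Exception {
        return fs.open(new Path(BLOB_DIR, hash));
      }

      private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
      }
    }

You would obtain the FileSystem instance with FileSystem.get(conf) inside the
task, and emit the <hash, largeKey> pairs separately to build the MapFile
mentioned above.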
2) Our application needs to get data from these saved BLOBs later, as well as
some meta-data associated with each BLOB. Moreover, there should be an ability
to find the BLOB using certain criteria. As far as I understand Map/Reduce, it ...
The question is whether you need this search to occur as a batch job, or in
real-time. The solution I outlined above allows you to quickly look up the
largeKey and also locate the file with the BLOB.
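For the lookup side, a rough sketch, assuming the MapFile lives at /data/keys
and that both the hash and the large key are stored as Text (both of those are
assumptions for this example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class KeyLookup {
      /** Resolve a hash back to the original large key via the <hash, largeKey> MapFile. */
      public static Text lookup(String hash) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // "/data/keys" is a made-up location for this sketch
        MapFile.Reader keys = new MapFile.Reader(fs, "/data/keys", conf);
        try {
          Text largeKey = new Text();
          // get() fills largeKey and returns it, or returns null if the hash is absent
          return (Text) keys.get(new Text(hash), largeKey);
        } finally {
          keys.close();
        }
      }
    }

Once you have the hash you can open the BLOB file directly, as in the previous
sketch.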
Well, I was thinking about scheduling a job with "search conditions" as an
input; that job would perform the search and return key/value pairs somehow,
to be used later. Those key/value pairs will be used to refer to other
blobs, if that matters. And there could be several such jobs in the
system at the same time, so I would need to separate them somehow to avoid
data interference.
I probably don't understand your requirements - what you described looks
like one of the example applications supplied with Hadoop, namely Grep.
This application selects specific input records according to a pattern.
Input data is split into portions with the help of the InputFormat, and a
single map task is assigned to each split. Then map tasks are allocated to
as many nodes as needed, taking into consideration the job priority,
cluster capacity, max. number of tasks per node, etc.
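As a rough illustration of such a pattern-matching mapper (written against the
org.apache.hadoop.mapred API; the SearchMapper class and the "search.pattern"
property are made up for this sketch, and details differ between Hadoop
versions):

    import java.io.IOException;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    /** Emits only the records whose content matches the configured pattern. */
    public class SearchMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private Pattern pattern;

      public void configure(JobConf job) {
        // the "search conditions" are passed to every task through the job
        // configuration; "search.pattern" is just a made-up property name
        pattern = Pattern.compile(job.get("search.pattern", ".*"));
      }

      public void map(LongWritable offset, Text record,
          OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        if (pattern.matcher(record.toString()).find()) {
          // emit the matching record; you could also emit the hash or metadata instead
          output.collect(record, new Text(""));
        }
      }
    }

Each concurrent search would simply be a separate job with its own output
directory, which also avoids the data interference you mention.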
Looks like I might be missing some core concept of Hadoop. As far as I
understood, the data is always kept on the same node that executes the job
which produces the data. This means the blobs will be stored on the same
node where the job was started, and other nodes will not know anything about
such blobs, so when doing some search the jobs will be started on all nodes,
and some of the jobs will return data and some will not, but the end user will
be presented with the merged data?
If you use HDFS, and you run datanodes and tasktrackers on each node,
then Hadoop will try to execute map tasks on the node that holds the
part of input data allocated to the task. We are talking about
individual data blocks here - a single file consists of many data blocks
and they may be located on many datanodes. In other words, parts of your
blob will be located on different machines across the cluster. HDFS
hides this and provides a simple InputStream to retrieve the whole content.
Then, each reduce task will produce one part of the output, again as a
file on HDFS - so you will possibly get multiple files on HDFS which
together form the output of a job. These files are located on HDFS, so
again individual data blocks may be located anywhere on the cluster, but
using FileSystem.open(...) you can retrieve the complete content of
such files as an InputStream (or wrap it in a more complex reader, such as
DataInputStream, SequenceFile.Reader, MapFile.Reader, etc.).
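A small sketch of reading such output, assuming the reducers wrote their part
files as SequenceFiles of <Text, Text> (the ReadJobOutput class and the way
the output directory is passed in are illustrative only):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ReadJobOutput {
      /** Iterate over all part files produced by the reducers of one job. */
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0] is the job output directory; each reducer wrote one part-* file
        for (FileStatus st : fs.listStatus(new Path(args[0]))) {
          if (!st.getPath().getName().startsWith("part-")) continue;
          SequenceFile.Reader in = new SequenceFile.Reader(fs, st.getPath(), conf);
          try {
            Text key = new Text();
            Text value = new Text();
            while (in.next(key, value)) {
              System.out.println(key + "\t" + value);
            }
          } finally {
            in.close();
          }
        }
      }
    }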
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com