Re: Some new requests about mapreduce

Doug Cutting Mon, 06 Nov 2006 09:04:26 -0800

Feng Jiang wrote:

I think some features are very useful for us:


1. Multi-key types supported in input. For example: SEQ file A holds <Ka, Va>
pairs, and SEQ file B holds <Kb, Vb> pairs. I can simply add both of these
files as input files, and the map function could be map(Object, Object). This
way, I don't have to wrap Ka and Kb into ObjectWritable, and the program
will be more readable.


This is addressed by http://issues.apache.org/jira/browse/HADOOP-372.

2. Value comparator supported. Current Hadoop supports a key comparator, so
I can specify the order of the keys in the reduce phase. But sometimes I
also need to specify the order of the value sequence in the reduce phase.
For example, the values in the reduce phase consist of Shop and Goods, and I
want the Shop object to always be the first object in the values because the
output needs the shop info. Currently I have to store the Goods info in a
buffer until the Shop object has been found.


This is addressed by http://issues.apache.org/jira/browse/HADOOP-485.
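To make the requested ordering concrete, here is a minimal plain-Java sketch of a comparator that places the Shop record ahead of the Goods records. It does not use any Hadoop API and is not how HADOOP-485 is implemented; the Item class, its fields, and sortValues are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ShopFirstOrdering {
    // Hypothetical value type: either a shop record or a goods record.
    static final class Item {
        final boolean isShop;
        final String name;

        Item(boolean isShop, String name) {
            this.isShop = isShop;
            this.name = name;
        }
    }

    // Shop records sort first (false < true on the negated flag);
    // records of the same kind are ordered by name.
    static final Comparator<Item> SHOP_FIRST =
        Comparator.comparing((Item i) -> !i.isShop).thenComparing(i -> i.name);

    // Returns a sorted copy, leaving the input list untouched.
    static List<Item> sortValues(List<Item> values) {
        List<Item> sorted = new ArrayList<>(values);
        sorted.sort(SHOP_FIRST);
        return sorted;
    }

    public static void main(String[] args) {
        List<Item> values = Arrays.asList(
            new Item(false, "pears"),
            new Item(true, "corner shop"),
            new Item(false, "apples"));
        for (Item item : sortValues(values)) {
            System.out.println((item.isShop ? "Shop  " : "Goods ") + item.name);
        }
    }
}
```

With the values delivered in this order, a reducer could read the shop info from the first value and stream the goods without buffering them.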

3. More efficient "ObjectWritable". Looking at ObjectWritable's
implementation, the class type information is always written into the
sequence file. But in many cases both key and value are pretty small, and the
class type information is much larger than the key & value themselves.

ObjectWritable is not used so much for bulk data, but rather for small items, like RPC parameters, so the size overhead is usually not an issue. Where are you finding this overhead onerous?
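For a rough sense of the overhead being discussed, the sketch below compares the size of a bare int with the size of the same int preceded by a class name. It uses only java.io and is not ObjectWritable's actual wire format; the class TypeTagOverhead and its methods are hypothetical, and the class-name string is just a sample.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TypeTagOverhead {
    // Bytes needed for the int payload alone.
    static int rawSize(int value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(value);
            out.flush();
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Bytes needed when a class name is written ahead of every payload.
    static int taggedSize(String className, int value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(className); // 2-byte length prefix + encoded name
            out.writeInt(value);
            out.flush();
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.hadoop.io.IntWritable"; // sample name only
        System.out.println("raw bytes:    " + rawSize(7));
        System.out.println("tagged bytes: " + taggedSize(name, 7));
    }
}
```

For a handful of RPC parameters the extra bytes are negligible; they only add up when the tag is repeated for every record of a large file.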

4. Compression supported. A sequence file contains a lot of similar data; if
it could be compressed before it is actually written to disk, a lot of time
would be saved. For example, if the value type is ObjectWritable, there must
be a lot of class declaration information that could be compressed. In my
experience, 20% of bandwidth and disk space would be saved.


SequenceFile already supports compression, with extensible codecs:

http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.html

So far, this uses Java's built-in codecs. In the next release we also hope to include native support for zlib and lzo, greatly improving compression performance.


http://issues.apache.org/jira/browse/HADOOP-538

Lzo doesn't compress quite as well as zlib, but it's much faster. In particular, zlib is generally slower than disk & net, while lzo is faster. So zlib tends to save space but not time, while lzo should save both.
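As a small illustration of how well repetitive records compress with Java's built-in deflate support, here is a sketch that uses only java.util.zip. It does not go through SequenceFile or its codec interface; the class RepetitiveDataCompression, its methods, and the sample record layout are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class RepetitiveDataCompression {
    // Builds records that each repeat the same class-name prefix.
    static byte[] sampleRecords(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("org.apache.hadoop.io.IntWritable\t").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses with the JDK's deflate implementation at default settings.
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buf)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reverses compress(); used here to check the round trip.
    static byte[] decompress(byte[] packed) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleRecords(1000);
        byte[] packed = compress(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + packed.length);
    }
}
```

This only measures space. Whether compression also saves time depends on codec speed relative to disk and network throughput, which is the zlib versus lzo trade-off described above.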


Doug
