[
https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tom White updated HADOOP-1986:
------------------------------
Attachment: SequenceFileWriterBenchmark.java
I've written a local benchmark to see the effect of the patch. Focusing on
RandomWriter: its map input is not read from disk and there are no reducers,
so the bulk of the processing is writing the random output to a SequenceFile.
The benchmark simulates this pattern by writing Writable keys and values to
an in-memory filesystem. The file was 256MB, with keys and values of 256
bytes each. Here are the numbers (using Java 6), averaged over 50 runs:
Trunk: 1301912844 ns
Patch: 1338563600 ns
This is a 2.8% overhead. When writing to disk I get the following numbers:
Trunk: 5431308533 ns
Patch: 5604898533 ns
A 3.2% overhead. I was surprised by this, as I had thought the overhead would
be insignificant compared to disk IO.
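
For reference, the write loop the benchmark exercises looks roughly like the
following. This is a minimal sketch only: it writes to the local filesystem
rather than the in-memory one, the class name SequenceFileWriteSketch is
illustrative, and the attached SequenceFileWriterBenchmark.java is the
authoritative version. The record count just mirrors the 256MB / 256-byte
setup above.

import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;

public class SequenceFileWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The real benchmark uses an in-memory filesystem; the local
    // filesystem here reproduces the on-disk numbers instead.
    FileSystem fs = FileSystem.getLocal(conf);
    Path file = new Path("/tmp/bench.seq");

    // 256-byte keys and 256-byte values, enough records for ~256MB.
    int recordSize = 256 + 256;
    int numRecords = (256 * 1024 * 1024) / recordSize;

    Random random = new Random();
    byte[] bytes = new byte[256];
    BytesWritable key = new BytesWritable();
    BytesWritable value = new BytesWritable();

    long start = System.nanoTime();
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, file,
        BytesWritable.class, BytesWritable.class);
    try {
      for (int i = 0; i < numRecords; i++) {
        random.nextBytes(bytes);
        key.set(bytes, 0, bytes.length);
        random.nextBytes(bytes);
        value.set(bytes, 0, bytes.length);
        writer.append(key, value);  // the call path the patch changes
      }
    } finally {
      writer.close();
    }
    System.out.println("elapsed ns: " + (System.nanoTime() - start));
  }
}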
I altered the patch to special-case SequenceFile.Writer.append(Writable,
Writable) and the times were the same as trunk (within 0.2% in either
direction).
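
Concretely, the special case is just a fast-path dispatch in append(): if
both key and value are Writables, write them exactly as trunk does and skip
the serializer indirection. A self-contained sketch of the idea follows; the
class and field names are illustrative stand-ins, not what the patch uses.

import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class AppendDispatchSketch {
  /** Stand-in for the pluggable serializer from the patch. */
  public interface Serializer<T> {
    void serialize(T t) throws IOException;
  }

  private final DataOutputStream out;
  private final Serializer<Object> keySerializer;
  private final Serializer<Object> valSerializer;

  public AppendDispatchSketch(DataOutputStream out,
      Serializer<Object> keySerializer, Serializer<Object> valSerializer) {
    this.out = out;
    this.keySerializer = keySerializer;
    this.valSerializer = valSerializer;
  }

  /** General entry point introduced by the patch. */
  public void append(Object key, Object val) throws IOException {
    if (key instanceof Writable && val instanceof Writable) {
      append((Writable) key, (Writable) val);  // fast path, no indirection
    } else {
      keySerializer.serialize(key);            // general, pluggable path
      valSerializer.serialize(val);
    }
  }

  /** Special-cased fast path: identical to what trunk does today. */
  public void append(Writable key, Writable val) throws IOException {
    key.write(out);
    val.write(out);
  }
}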
So it seems to me that we can avoid any overhead by special-casing Writable.
Besides SequenceFile.Writer, this would also need to be done in
MapTask.MapOutputBuffer and ReduceTask.ValuesIterator. I think it can be done
with minimal code duplication. It is obviously not as clean a solution as the
current patch, but given the performance constraints and the general desire
to get this issue fixed, I think it is the best way to proceed.
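
For context, the general mechanism the fast path bypasses is a pluggable
serializer/deserializer pair, roughly of the shape below. This is an
illustration of the idea only; the exact names and signatures in
serializer-v5.patch may differ.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

interface Serializer<T> {
  void open(OutputStream out) throws IOException;  // bind to an output stream
  void serialize(T t) throws IOException;          // write one object
  void close() throws IOException;
}

interface Deserializer<T> {
  void open(InputStream in) throws IOException;    // bind to an input stream
  T deserialize(T reuse) throws IOException;       // read one object, optionally reusing it
  void close() throws IOException;
}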
Thoughts?
> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
> Key: HADOOP-1986
> URL: https://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.17.0
>
> Attachments: hadoop-serializer-v2.tar.gz,
> SequenceFileWriterBenchmark.java, SerializableWritable.java,
> serializer-v1.patch, serializer-v2.patch, serializer-v3.patch,
> serializer-v4.patch, serializer-v5.patch
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable
> key-value pairs. While it's possible to write Writable wrappers for other
> serialization frameworks (such as Thrift), this is not very convenient: it
> would be nicer to be able to use arbitrary types directly, without explicit
> wrapping and unwrapping.