[
https://issues.apache.org/jira/browse/HADOOP-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576391#action_12576391
]
Chris Douglas commented on HADOOP-2853:
---------------------------------------
I talked with Arun, and we suspect this might work:
From the map, emit <<x, host>, hostStats> and <<y, host>, <uri, uriStat>>,
such that x < y (obviously, if the tags are equal, let your comparator sort by host).
Write a partitioner such that all records sharing a host will go to the same reduce.
Define a comparator such that your keys tagged with 'x' are seen first by the reducer,
and use these to build a Map from host -> hostStats. Following them will be
all your records tagged with 'y', which will be your URIs. Equivalently, tag
all values coming out of the map as: <<x, host>, hostStats>,
<<y, uri>, uriStat>
And write your comparator to extract the host from the URI for keys tagged with
'y'. This keeps your types consistent. Again, the partitioner must ensure that
all records with the same host go to the same reducer. The reducer can shed the
tag and emit <uri, uriStat'> records. Would this work?
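A minimal plain-Java sketch of the tagged-key scheme described above (no Hadoop classes; the `TaggedKey` type, the tag values, and the host-extraction logic are illustrative assumptions, not the actual Writable/Partitioner you'd register with a JobConf):

```java
import java.util.*;

// Sketch: keys tagged 'x' (host stats) sort before keys tagged 'y' (URI stats)
// for the same host, and partitioning depends only on the host.
public class TaggedKeyDemo {
    static final int X = 0, Y = 1; // x < y, as in the comment

    static class TaggedKey implements Comparable<TaggedKey> {
        final int tag;
        final String value; // the host for tag X, a full URI for tag Y

        TaggedKey(int tag, String value) { this.tag = tag; this.value = value; }

        // For keys tagged 'y', extract the host from the URI, so the
        // comparator groups each host's keys together.
        String host() {
            if (tag == X) return value;
            int slash = value.indexOf('/', value.indexOf("//") + 2);
            return slash < 0 ? value : value.substring(0, slash);
        }

        // Order by host first, then by tag: hostStats precede uriStats.
        @Override
        public int compareTo(TaggedKey o) {
            int c = host().compareTo(o.host());
            return c != 0 ? c : Integer.compare(tag, o.tag);
        }
    }

    // Partition on the host only, so a host's stats record and all of its
    // URI records land on the same reduce.
    static int partition(TaggedKey k, int numReduces) {
        return (k.host().hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    public static void main(String[] args) {
        List<TaggedKey> keys = new ArrayList<>(Arrays.asList(
            new TaggedKey(Y, "http://a.com/page1"),
            new TaggedKey(X, "http://b.com"),
            new TaggedKey(Y, "http://b.com/page2"),
            new TaggedKey(X, "http://a.com")));
        Collections.sort(keys);
        for (TaggedKey k : keys)
            System.out.println(k.tag + " " + k.value);
    }
}
```

In a real job, the comparison logic would live in a `WritableComparator` set as the output key comparator, and the partition method in a `Partitioner` implementation; the sketch only demonstrates the ordering and grouping invariants.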
> Add Writable for very large lists of key / value pairs
> ------------------------------------------------------
>
> Key: HADOOP-2853
> URL: https://issues.apache.org/jira/browse/HADOOP-2853
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Affects Versions: 0.17.0
> Reporter: Andrzej Bialecki
> Fix For: 0.17.0
>
> Attachments: sequenceWritable-v1.patch, sequenceWritable-v2.patch,
> sequenceWritable-v3.patch, sequenceWritable-v4.patch,
> sequenceWritable-v5.patch
>
>
> Some map-reduce jobs need to aggregate and process very long lists as a
> single value. This usually happens when keys from a large domain are mapped
> into a small domain, and their associated values cannot be aggregated into
> few values but need to be preserved as members of a large list. Currently
> this can be implemented as a MapWritable or ArrayWritable - however, Hadoop
> needs to deserialize the current key and value completely into memory, which
> for extremely large values causes frequent OOM exceptions. This also works
> only with lists of relatively small size (e.g. 1000 records).
> This patch is an implementation of a Writable that can handle arbitrarily
> long lists. Initially it keeps an internal buffer (which can be
> (de)-serialized in the ordinary way), and if the list size exceeds a certain
> threshold it is spilled to an external SequenceFile (hence the name) on a
> configured FileSystem. The content of this Writable can be iterated, and the
> data is pulled either from the internal buffer or from the external file in a
> transparent way.
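The buffer-then-spill idea in the description can be sketched in plain Java as follows (this is an illustrative assumption, not the patch's actual SequenceWritable API: it spills to a local temp file of text lines rather than a SequenceFile on a configured FileSystem):

```java
import java.io.*;
import java.util.*;

// Sketch: records are buffered in memory up to a threshold, then appended to
// an external file; iteration reads the buffer first, then streams the file.
public class SpillingList implements Iterable<String>, Closeable {
    private final int threshold;
    private final List<String> buffer = new ArrayList<>();
    private File spillFile;       // assumption: local temp file, not HDFS
    private PrintWriter spillOut;

    public SpillingList(int threshold) { this.threshold = threshold; }

    public void add(String record) throws IOException {
        if (buffer.size() < threshold) { buffer.add(record); return; }
        if (spillOut == null) {   // first record past the threshold: spill
            spillFile = File.createTempFile("spill", ".txt");
            spillFile.deleteOnExit();
            spillOut = new PrintWriter(new BufferedWriter(new FileWriter(spillFile)));
        }
        spillOut.println(record);
    }

    // Transparent iteration: the caller cannot tell whether a record came
    // from the in-memory buffer or the external file.
    @Override
    public Iterator<String> iterator() {
        if (spillOut != null) spillOut.flush();
        return new Iterator<String>() {
            private final Iterator<String> mem = buffer.iterator();
            private BufferedReader reader;
            private String next = advance();

            private String advance() {
                if (mem.hasNext()) return mem.next();
                try {
                    if (reader == null) {
                        if (spillFile == null) return null;
                        reader = new BufferedReader(new FileReader(spillFile));
                    }
                    return reader.readLine(); // null at EOF ends iteration
                } catch (IOException e) { throw new UncheckedIOException(e); }
            }

            @Override public boolean hasNext() { return next != null; }
            @Override public String next() {
                if (next == null) throw new NoSuchElementException();
                String cur = next;
                next = advance();
                return cur;
            }
        };
    }

    @Override
    public void close() {
        if (spillOut != null) spillOut.close();
        if (spillFile != null) spillFile.delete();
    }
}
```

Only the threshold-sized buffer is ever held in memory, which is what lets this shape of structure avoid the OOM problem described above for very long lists.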