[ http://issues.apache.org/jira/browse/HADOOP-363?page=comments#action_12421194 ]

Dick King commented on HADOOP-363:
----------------------------------
In http://issues.apache.org/jira/browse/HADOOP-363#action_12421184 , Eric asked for a concrete example. I'm determining word frequencies by inverting documents, interpreting each document as a mapping from document to its unique words -- but of course most documents use mostly common words. Although the corpus contains documents in multiple languages, I expect my proposal to be effective on this piece of code because, for any cluster of a few dozen or a few hundred documents, not very many distinct words will appear. With my proposal, each mapper would spill very little; this change would keep most of the document/unique-word accumulation in core.

Then there are the runs that produce my word-pair frequency tables: mappings from words to mappings from words to counts. This application is indeed exposed to the problem Eric describes, but again not badly, for statistical reasons, with the additional mitigating fact that documents containing words in one language are deemed to contain additional words only in that same language.

Eric's point about key/value pairs having values of varying size is in fact salient, and as Eric points out we are already exposed to possible problems just building the 100K key/value pairs. That can blow heap space all by itself without the combiner ever having been called.

We can use something like Runtime.getRuntime().freeMemory() [after a System.gc() call if needed]. Probably this would be a good idea whether or not we add this enhancement. There could be a parameter, expected_object_size_for_new_key_value_pairs, that would guide both the number of key/value pairs created by each mapper before combining [replacing the 100K magic number] and the number of new elements. However, we don't really need such a parameter: we could instead check free memory after creating each hundred output key/value pairs and decide whether to continue collecting output or to spill.
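The spill policy sketched above could look something like the following. This is only an illustration, not Hadoop code: the class name, the 100-pair check interval, and the 16 MB free-memory safety margin are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed policy: instead of spilling at a fixed
// 100K-pair threshold, check free heap every 100 collected pairs, and only
// pay for a GC when the cheap freeMemory() reading already looks low.
public class MemoryGuardedCollector {
    private static final int CHECK_INTERVAL = 100;        // pairs between checks (assumed)
    private static final long MIN_FREE_BYTES = 16 << 20;  // assumed safety margin: 16 MB

    private final List<Object> buffer = new ArrayList<Object>();

    /** Returns true when the caller should spill (i.e., run the combiner / flush). */
    public boolean collect(Object key, Object value) {
        buffer.add(key);
        buffer.add(value);
        if ((buffer.size() / 2) % CHECK_INTERVAL != 0) {
            return false;                              // cheap path: no memory check
        }
        Runtime rt = Runtime.getRuntime();
        if (rt.freeMemory() >= MIN_FREE_BYTES) {
            return false;                              // plenty of headroom; keep going
        }
        rt.gc();                                       // GC only when memory looks low
        return rt.freeMemory() < MIN_FREE_BYTES;       // spill only if still low after GC
    }
}
```

Note that freeMemory() reports free space in the heap as currently sized, so the margin here is conservative: the JVM may still be able to grow the heap.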
GCs are much too expensive to perform every hundred objects, but we only need to do one when we call freeMemory() and don't see enough memory to continue without one.

-dk

> When combiners exist, postpone mappers' spills of map output to disk until
> combiners are unsuccessful.
> ------------------------------------------------------------------------------------------------------
>
>          Key: HADOOP-363
>          URL: http://issues.apache.org/jira/browse/HADOOP-363
>      Project: Hadoop
>   Issue Type: Improvement
>   Components: mapred
>     Reporter: Dick King
>
> When a map/reduce job is set up with a combiner, the mapper tasks each build
> up an in-heap collection of 100K key/value pairs -- and then apply the
> combiner to reduce that to whatever it becomes by applying the combiner to
> sets with like keys before spilling to disk to send it to the reducers.
> Typically running the combiner consumes a lot less resources than shipping
> the data, especially since the data end up in a reducer where probably the
> same code will be run anyway.
> I would like to see this changed so that when the combiner shrinks the 100K
> key/value pairs to less than, say, 90K, we just keep running the mapper and
> combiner alternately until we get enough distinct keys to make this unlikely
> to be worthwhile [or until we run out of input, of course].
> This has two costs: the whole internal buffer has to be re-sorted so we can
> apply the combiner even though as few as 10K new elements have been added,
> and in some cases we'll call the combiner on many singletons.
> The first of these costs can be avoided by doing a mini-sort in the new pairs
> section and doing a merge to develop the combiner sets and the new sorted
> retained elements section.
> The second of these costs can be avoided by detecting what would otherwise be
> singleton combiner calls and not making them, which is a good idea in itself
> even if we don't decide to do this reform.
> The two techniques combine well; recycled elements of the buffer need not be
> combined if there's no new element with the same key.
> -dk
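The mini-sort, merge, and singleton-skipping ideas in the quoted description can be sketched together as below. This is a minimal illustration under assumptions: the class and method names are hypothetical, summing integer counts stands in for an arbitrary combiner, and modern Java collection methods are used for brevity.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: sort only the newly added pairs (mini-sort), merge them
// with the already-sorted retained section, and invoke the "combiner" (here,
// summing counts) only on runs longer than one -- singletons pass through.
public class MergeCombineSketch {

    static List<Map.Entry<String, Integer>> mergeAndCombine(
            List<Map.Entry<String, Integer>> retained,  // already sorted by key
            List<Map.Entry<String, Integer>> fresh) {   // unsorted new pairs
        fresh.sort(Map.Entry.comparingByKey());         // mini-sort: new section only

        // Standard two-way merge of the two sorted sections.
        List<Map.Entry<String, Integer>> merged = new ArrayList<>();
        int i = 0, j = 0;
        while (i < retained.size() || j < fresh.size()) {
            boolean takeRetained = j >= fresh.size()
                || (i < retained.size()
                    && retained.get(i).getKey().compareTo(fresh.get(j).getKey()) <= 0);
            merged.add(takeRetained ? retained.get(i++) : fresh.get(j++));
        }

        // Combine runs of equal keys; skip the combiner for singleton runs.
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        int k = 0;
        while (k < merged.size()) {
            int run = k + 1;
            while (run < merged.size()
                   && merged.get(run).getKey().equals(merged.get(k).getKey())) {
                run++;
            }
            if (run - k == 1) {
                out.add(merged.get(k));                 // singleton: no combiner call
            } else {
                int sum = 0;                            // "combiner" = sum the counts
                for (int m = k; m < run; m++) sum += merged.get(m).getValue();
                out.add(new SimpleEntry<>(merged.get(k).getKey(), sum));
            }
            k = run;
        }
        return out;
    }
}
```

The final refinement mentioned above -- skipping the combiner for retained elements that match no new key -- would require tracking which section each merged element came from, which is omitted here to keep the sketch short.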
