Hi Radim,

I may be misunderstanding your suggestion, but many M/R jobs actually require running the two phases one after the other, and hence storing the intermediate results somewhere. While some jobs can slightly reduce intermediate memory usage by using a combiner function (e.g., the word-count example), I don't see how we can avoid intermediate storage altogether.
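To make the combiner point concrete, here is a minimal sketch in plain Python. It does not use the Infinispan MapReduce API: `map_phase`, `stored_items`, and the two mappers are hypothetical names made up for illustration, and the data sizes are arbitrary. It simulates the map phase on a single node and counts how many values that node has to hold before the reduce phase starts, once for word count and once for a group-by on colour.

```python
from operator import add


def word_count_mapper(document):
    # Emit (word, 1) for every word in the document.
    for word in document.split():
        yield word, 1


def group_by_colour_mapper(car):
    # Emit (colour, [car]) so that values for one colour can be concatenated.
    _car_id, _brand, colour = car
    yield colour, [car]


def map_phase(entries, mapper, combiner=None):
    """Simulate the map phase on one node and return its intermediate results.

    Without a combiner every emitted value is kept. With a combiner, values
    for the same key are folded together as soon as they are emitted.
    """
    intermediate = {}
    for entry in entries:
        for key, value in mapper(entry):
            if key not in intermediate:
                intermediate[key] = [value]
            elif combiner is None:
                intermediate[key].append(value)
            else:
                intermediate[key] = [combiner(intermediate[key][0], value)]
    return intermediate


def stored_items(intermediate):
    # Count the individual items a node must hold: a list counts once per
    # element, any other value counts as one item.
    total = 0
    for values in intermediate.values():
        for value in values:
            total += len(value) if isinstance(value, list) else 1
    return total


# Word count: 250 documents of 4 words each, only 2 distinct words.
documents = ["red car red car"] * 250
wc_plain = stored_items(map_phase(documents, word_count_mapper))
wc_combined = stored_items(map_phase(documents, word_count_mapper, add))

# Group-by: 1000 cars that all share the same colour.
cars = [(car_id, "acme", "red") for car_id in range(1000)]
gb_plain = stored_items(map_phase(cars, group_by_colour_mapper))
gb_combined = stored_items(map_phase(cars, group_by_colour_mapper, add))

print("word count:", wc_plain, "->", wc_combined)
print("group by:  ", gb_plain, "->", gb_combined)
```

In this sketch the combiner shrinks the word-count intermediate data to one value per distinct word, while for the group-by it only concatenates lists, so the node still holds every car. That is the case where some form of intermediate storage seems hard to avoid.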

Thanks,
Etienne (leads project, as Evangelos who initiated the thread)

On 17 Feb 2014, at 08:48, Radim Vansa <[email protected]> wrote:

> I think that the intermediate cache is not required at all. The M/R
> algorithm itself can (and should!) run with memory occupied by the
> result of reduction. The current implementation with Map first and
> Reduce after that will always have these problems, using a cache for
> temporary caching the result is only a workaround.
>
> The only situation when temporary cache could be useful is when the
> result grows linearly (or close to that or even more) with the amount of
> reduced entries. This would be the case for groupBy producing Map<Color,
> List<Entry>> from all entries in cache. Then the task does not scale and
> should be redesigned anyway, but flushing the results into cache backed
> by cache store could help.
>
> Radim
>
> On 02/14/2014 04:54 PM, Vladimir Blagojevic wrote:
>> Tristan,
>>
>> Actually they are not addressed in this pull request but the feature
>> where custom output cache is used instead of results being returned is
>> next in the implementation pipeline.
>>
>> Evangelos, indeed, depending on a reducer function all intermediate
>> KOut/VOut pairs might be moved to a single node. How would custom cache
>> help in this case?
>>
>> Regards,
>> Vladimir
>>
>>
>> On 2/14/2014, 10:16 AM, Tristan Tarrant wrote:
>>> Hi Evangelos,
>>>
>>> you might be interested in looking into a current pull request which
>>> addresses some (all?) of these issues
>>>
>>> https://github.com/infinispan/infinispan/pull/2300
>>>
>>> Tristan
>>>
>>> On 14/02/2014 16:10, Evangelos Vazaios wrote:
>>>> Hello everyone,
>>>>
>>>> I started using the MapReduce implementation of Infinispan and I came
>>>> across some possible limitations. Thus, I want to make some suggestions
>>>> about the MapReduce (MR) implementation of Infinispan.
>>>> Depending on the algorithm, there might be some memory problems,
>>>> especially for intermediate results.
>>>> An example of such a case is group by. Suppose that we have a cluster
>>>> of 2 nodes with 2 GB available. Let a distributed cache, where simple
>>>> car objects (id,brand,colour) are stored and the total size of data is
>>>> 3.5GB. If all objects have the same colour, then all 3.5 GB would go to
>>>> only one reducer, as a result an OutOfMemoryException will be thrown.
>>>>
>>>> To overcome these limitations, I propose to add as parameter the name of
>>>> the intermediate cache to be used. This will enable the creation of a
>>>> custom configured cache that deals with the memory limitations.
>>>>
>>>> Another feature that I would like to have is to set the name of the
>>>> output cache. The reasoning behind this is similar to the one mentioned
>>>> above.
>>>>
>>>> I wait for your thoughts on these two suggestions.
>>>>
>>>> Regards,
>>>> Evangelos
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> [email protected]
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> --
> Radim Vansa <[email protected]>
> JBoss DataGrid QA
