Hi Etienne,

how does the requirement that all data for a key is provided to the Reducer as a whole work for distributed caches? There you'd get only a subset of the whole mapped set on each node (AFAIK each node maps its local entries and performs a local reduction before executing the "global" reduction). Or are these M/R jobs applicable only to local caches? I have to admit I have only limited knowledge of M/R; could you give me an example where the algorithm works in a distributed environment and still cannot be parallelized?
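For concreteness, here is roughly how I picture the combiner case that you mention below: a minimal word-count sketch against the MapReduceTask API. I'm writing this from memory, so details (e.g. combinedWith()) may be slightly off, and the WordCountSketch class and countWords() helper are just made-up names for illustration:

import java.util.Iterator;
import java.util.Map;
import java.util.StringTokenizer;

import org.infinispan.Cache;
import org.infinispan.distexec.mapreduce.Collector;
import org.infinispan.distexec.mapreduce.MapReduceTask;
import org.infinispan.distexec.mapreduce.Mapper;
import org.infinispan.distexec.mapreduce.Reducer;

public class WordCountSketch {

   // Emits (word, 1) for every word in the values stored on the local node.
   static class WordCountMapper implements Mapper<String, String, String, Integer> {
      @Override
      public void map(String key, String value, Collector<String, Integer> c) {
         StringTokenizer tokens = new StringTokenizer(value);
         while (tokens.hasMoreTokens()) {
            c.emit(tokens.nextToken().toLowerCase(), 1);
         }
      }
   }

   // Sums the counts for one word; the same class serves as the per-node
   // combiner and as the global reducer, because summing is associative.
   static class WordCountReducer implements Reducer<String, Integer> {
      @Override
      public Integer reduce(String word, Iterator<Integer> counts) {
         int sum = 0;
         while (counts.hasNext()) {
            sum += counts.next();
         }
         return sum;
      }
   }

   public static Map<String, Integer> countWords(Cache<String, String> cache) {
      return new MapReduceTask<String, String, String, Integer>(cache)
            .mappedWith(new WordCountMapper())
            .combinedWith(new WordCountReducer())  // local reduction on each node
            .reducedWith(new WordCountReducer())   // global reduction, per key
            .execute();
   }
}

With such a combiner, the intermediate data on each node shrinks to one counter per distinct word before anything is shipped for the global reduction. My question is whether there are realistic jobs where this kind of incremental reduction is impossible in a distributed setup.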
Thanks,
Radim

On 02/17/2014 09:18 AM, Etienne Riviere wrote:
> Hi Radim,
>
> I might misunderstand your suggestion, but many M/R jobs actually require
> running the two phases one after the other, and hence storing the
> intermediate results somewhere. While some jobs may slightly reduce
> intermediate memory usage by using a combiner function (e.g., the
> word-count example), I don't see how we can avoid intermediate storage
> altogether.
>
> Thanks,
> Etienne
>
> On 17 Feb 2014, at 08:48, Radim Vansa <[email protected]> wrote:
>
>> I think that the intermediate cache is not required at all. The M/R
>> algorithm itself can (and should!) run with memory occupied only by the
>> result of the reduction. The current implementation with Map first and
>> Reduce after that will always have these problems; using a cache for
>> temporarily storing the results is only a workaround.
>>
>> The only situation where a temporary cache could be useful is when the
>> result grows linearly (or close to that, or even faster) with the number
>> of reduced entries. This would be the case for a groupBy producing
>> Map<Color, List<Entry>> from all entries in the cache. Such a task does
>> not scale and should be redesigned anyway, but flushing the results into
>> a cache backed by a cache store could help.
>>
>> Radim
>>
>> On 02/14/2014 04:54 PM, Vladimir Blagojevic wrote:
>>> Tristan,
>>>
>>> Actually they are not addressed in this pull request, but the feature
>>> where a custom output cache is used instead of results being returned is
>>> next in the implementation pipeline.
>>>
>>> Evangelos, indeed, depending on the reducer function all intermediate
>>> KOut/VOut pairs might be moved to a single node. How would a custom
>>> cache help in this case?
>>>
>>> Regards,
>>> Vladimir
>>>
>>>
>>> On 2/14/2014, 10:16 AM, Tristan Tarrant wrote:
>>>> Hi Evangelos,
>>>>
>>>> you might be interested in looking into a current pull request which
>>>> addresses some (all?) of these issues
>>>>
>>>> https://github.com/infinispan/infinispan/pull/2300
>>>>
>>>> Tristan
>>>>
>>>> On 14/02/2014 16:10, Evangelos Vazaios wrote:
>>>>> Hello everyone,
>>>>>
>>>>> I started using the MapReduce implementation of Infinispan and I came
>>>>> across some possible limitations. Thus, I want to make some suggestions
>>>>> about the MapReduce (MR) implementation of Infinispan.
>>>>> Depending on the algorithm, there might be some memory problems,
>>>>> especially for intermediate results.
>>>>> An example of such a case is group by. Suppose that we have a cluster
>>>>> of 2 nodes with 2 GB available each, and a distributed cache where
>>>>> simple car objects (id, brand, colour) are stored, with a total data
>>>>> size of 3.5 GB. If all objects have the same colour, then all 3.5 GB
>>>>> would go to only one reducer, and as a result an OutOfMemoryException
>>>>> will be thrown.
>>>>>
>>>>> To overcome these limitations, I propose to add as a parameter the name
>>>>> of the intermediate cache to be used. This will enable the creation of
>>>>> a custom-configured cache that deals with the memory limitations.
>>>>>
>>>>> Another feature that I would like to have is to set the name of the
>>>>> output cache. The reasoning behind this is similar to the one mentioned
>>>>> above.
>>>>>
>>>>> I look forward to your thoughts on these two suggestions.
>>>>>
>>>>> Regards,
>>>>> Evangelos

--
Radim Vansa <[email protected]>
JBoss DataGrid QA

_______________________________________________
infinispan-dev mailing list
[email protected]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
