Thanks, good to know.

On Wednesday, August 24, 2016 at 12:52:23 AM UTC-7, P. Oscar Boykin wrote:
>
> Yes, by default scalding attempts map-side aggregation of any commutative
> operation (which we assume toList to be, since there is no ordering here).
>
> Your solution is a fine one here. Another solution is to turn down the
> size of the map-side cache (see Config.scala for options on this). Another
> approach is to use a different map-side cache that automatically tunes its
> own size based on memory usage and cache hit-rate.
>
> We could disable map-side caching for toList, since it will usually be
> very unlikely to help (toList is not an information-reducing operation).
> Perhaps that is a good solution to reduce the chance of this problem.
>
> On Tue, Aug 23, 2016 at 19:56 Kostya Salomatin <[email protected]> wrote:
>
>> Hey scalding pros,
>>
>> I've got a strange Java heap space issue in my mapper. I've got a fix
>> that helps, but I would like to understand better what is going on under
>> the hood, why my fix helps, and whether there is an alternative solution
>> (e.g. changing job parameters). This is the code in question:
>>
>> pipe
>>   .map { candidateSet => (candidateSet.key, candidateSet.candidates) }
>>   .collect { case (Some(key), candidates) => (key, candidates) }
>>   .group
>>   //.forceToReducers - adding this line solves the problem
>>   .toList // this does not cause the issue, the rows have unique keys
>>   .mapValues { _.flatten }
>>
>> After this group, the pipe is joined with another pipe using the same key,
>> so I keep it as UnsortedGrouped[K, V].
>>
>> The data has unique keys, so there are no map-side reductions, and the
>> .toList call is actually redundant. My guess is that the mapper tries to
>> execute some map-side sorting / data optimization, and this is what causes
>> the problems. The default amount of memory is sufficient for all job
>> overheads (it works fine for lots of other jobs). Just to be sure, I
>> increased the heap size significantly, and it did not help.
>>
>> .forceToReducers solves the problem. It was my semi-intelligent guess: I
>> expected this call to turn off some mapper logic that is redundant in the
>> case of unique keys, but I still don't understand exactly why it helped.
>> It could be the way the input data is buffered and sorted in memory.
>>
>> Any ideas?
>>
>> Thanks,
>> Kostya
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Scalding Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>
>
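
For anyone reading this thread later, here is a rough sketch of why a map-side cache cannot help toList when keys are unique. It is a toy model only, not Scalding's or Cascading's actual implementation: the names MapSideCache, put, flush, and MapSideCacheDemo are hypothetical and were made up for this illustration. The cache combines values that share a key and evicts the oldest entry once it holds more than `capacity` keys.

```scala
import scala.collection.mutable

// Toy model of a bounded map-side aggregation cache (hypothetical, for
// illustration only; this is NOT Scalding's implementation).
final class MapSideCache[K, V](capacity: Int)(combine: (V, V) => V) {
  // LinkedHashMap keeps insertion order, so `head` is the oldest entry.
  private val cache = mutable.LinkedHashMap.empty[K, V]

  // Number of times an incoming value was merged into a cached one.
  var hits: Int = 0

  // Records that left the cache and would be sent on to the reducers.
  val emitted: mutable.ArrayBuffer[(K, V)] = mutable.ArrayBuffer.empty[(K, V)]

  def put(key: K, value: V): Unit =
    cache.get(key) match {
      case Some(previous) =>
        hits += 1
        cache.update(key, combine(previous, value))
      case None =>
        cache.update(key, value)
        if (cache.size > capacity) {
          // Cache is over capacity: evict and emit the oldest entry.
          val oldest = cache.head
          cache.remove(oldest._1)
          emitted += oldest
        }
    }

  // Emit whatever is still cached at the end of the map task.
  def flush(): Unit = {
    emitted ++= cache
    cache.clear()
  }
}

object MapSideCacheDemo {
  // Models toList: each value becomes a one-element List, and lists for
  // the same key are concatenated.
  def run(keys: Seq[Int], capacity: Int): MapSideCache[Int, List[String]] = {
    val cache = new MapSideCache[Int, List[String]](capacity)(_ ++ _)
    keys.foreach(k => cache.put(k, List(s"v$k")))
    cache.flush()
    cache
  }

  def main(args: Array[String]): Unit = {
    val unique = run(1 to 1000, capacity = 100)
    val repeated = run((1 to 1000).map(_ % 10), capacity = 100)
    println(s"unique keys:   hits=${unique.hits}, emitted=${unique.emitted.size}")
    println(s"repeated keys: hits=${repeated.hits}, emitted=${repeated.emitted.size}")
  }
}
```

With unique keys the cache never gets a hit, so every record is emitted unchanged after sitting in memory for a while. With toList the cached values are whole lists, and in the job above each value is already a collection of candidates, so in this model the cache pays a memory cost and removes no data. Skipping the cache, as .forceToReducers does, or shrinking it, avoids that cost.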
