Yes, by default Scalding attempts map-side aggregation of any commutative operation (which we assume toList to be, since there is no ordering here).
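To make the map-side aggregation point concrete, here is a toy model in plain Scala. It is not Scalding's actual cache: `MapSideCache`, `capacity`, and `MapSideCacheDemo` are hypothetical names, and the oldest-entry eviction policy is an assumption made only for illustration. It counts how many records leave the mapper when values are combined by list concatenation, as toList does.

```scala
import scala.collection.mutable

// Toy model of a bounded map-side aggregation cache (hypothetical, not Scalding code).
final class MapSideCache[K, V](capacity: Int)(combine: (V, V) => V) {
  private val cache = mutable.LinkedHashMap.empty[K, V]
  var emitted: Int = 0 // number of records sent on to the shuffle

  def put(key: K, value: V): Unit =
    cache.get(key) match {
      case Some(prev) =>
        // Cache hit: merge into the existing entry, nothing is emitted.
        cache.update(key, combine(prev, value))
      case None =>
        // Cache miss: if the cache is full, evict (emit) the oldest entry first.
        if (cache.size >= capacity) {
          cache.remove(cache.head._1)
          emitted += 1
        }
        cache.update(key, value)
    }

  // At the end of the map task, everything still cached is emitted.
  def flush(): Unit = {
    emitted += cache.size
    cache.clear()
  }
}

object MapSideCacheDemo {
  // Feeds one single-element list per key through the cache, toList-style,
  // and returns the number of records emitted by the mapper.
  def run(keys: Seq[Int], capacity: Int): Int = {
    val cache = new MapSideCache[Int, List[Int]](capacity)(_ ++ _)
    keys.foreach(k => cache.put(k, List(k)))
    cache.flush()
    cache.emitted
  }
}

// 1000 records sharing one key collapse into a single emitted record:
//   MapSideCacheDemo.run(Seq.fill(1000)(1), 100) == 1
// 1000 records with unique keys never hit the cache, so all 1000 are emitted:
//   MapSideCacheDemo.run(1 to 1000, 100) == 1000
```

With unique keys the cache never gets a hit, so it saves no shuffle traffic while still holding up to `capacity` complete values on the mapper's heap at once. If each value is itself a large collection, that retained memory is a plausible source of the heap pressure described in this thread.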
Your solution is a fine one here. Another solution is to turn down the size of the map-side cache (see Config.scala for options on this). Another approach is to use a different map-side cache that automatically tunes its own size based on memory usage and cache hit rate. We could disable map-side caching for toList, since it is usually very unlikely to help (toList is not an information-reducing operation). Perhaps that is a good solution to reduce the chance of this problem.

On Tue, Aug 23, 2016 at 19:56 Kostya Salomatin <[email protected]> wrote:

> Hey scalding pros,
>
> I've got a strange java heap space issue in my mapper. I've got a fix that
> helps, but I would like to understand better what is going on under the
> hood, why my fix helps and whether there is an alternative solution (e.g.
> changing job parameters). This is the code in question:
>
> pipe
>   .map { candidateSet => (candidateSet.key, candidateSet.candidates) }
>   .collect { case (Some(key), candidates) => (key, candidates) }
>   .group
>   //.forceToReducers - adding this line solves the problem
>   .toList // this does not cause the issue, the rows have unique keys
>   .mapValues {_.flatten}
>
> After this group the pipe is joined with another pipe using the same key,
> so I keep it as UnsortedGrouped[K,V].
>
> The data has unique keys, so there are no map-side reductions, and the .toList
> call is actually redundant. My guess is that the mapper tries to execute some
> map-side sorting / data optimization and this is what causes problems. The
> default amount of memory is sufficient for all job overheads (works fine
> for lots of other jobs); just to be sure, I increased the heap size
> significantly and it did not help.
>
> .forceToReducers solves the problem. It was my semi-intelligent guess: I
> expected this call to turn off some mapper logic that was redundant in case
> of unique keys, but I still don't understand why exactly it helped. Could
> be the way the input data is buffered and sorted in memory.
>
> Any ideas?
>
> Thanks,
> Kostya
>
> --
> You received this message because you are subscribed to the Google Groups
> "Scalding Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
