Yes, by default Scalding attempts map-side aggregation of any commutative
operation (which we assume toList to be, since there is no ordering here).

Your solution is a fine one here. Another solution is to turn down the size
of the map-side cache (see Config.scala for options on this). Another
approach is to use a different map-side cache that automatically tunes its
own size based on memory usage and cache hit-rate.

We could disable map-side caching for toList, since it is usually very
unlikely to help (toList is not an information-reducing operation).
Perhaps that is a good way to reduce the chance of this problem.
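
To make the point about toList concrete, here is a rough sketch in plain
Scala (standard library only). It is not Scalding's actual cache: the names
BoundedCache and MapSideCacheSketch, the oldest-first eviction and the
capacity of 100 are all made up for illustration. It only shows that a
bounded map-side cache shrinks the output for something like a sum, but
not for toList.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a bounded map-side cache (NOT Scalding's code).
// Values for a key already in the cache are merged with `combine`; when the
// cache is full, the oldest entry is evicted and sent on to the shuffle.
final class BoundedCache[K, V](capacity: Int)(combine: (V, V) => V) {
  require(capacity >= 1, "capacity must be at least 1")
  private val entries = mutable.LinkedHashMap.empty[K, V]

  // Returns the evicted entry if adding a new key forced one out.
  def put(key: K, value: V): Option[(K, V)] =
    entries.get(key) match {
      case Some(existing) =>
        entries.update(key, combine(existing, value))
        None
      case None =>
        val evicted =
          if (entries.size >= capacity) {
            val oldest = entries.head
            entries.remove(oldest._1)
            Some(oldest)
          } else None
        entries.update(key, value)
        evicted
    }

  // Emits whatever is still cached at the end of the map task.
  def flush(): List[(K, V)] = {
    val rest = entries.toList
    entries.clear()
    rest
  }
}

object MapSideCacheSketch {
  // All records that would leave the mapper after passing through the cache.
  def emitted[K, V](records: Seq[(K, V)], capacity: Int)(combine: (V, V) => V): List[(K, V)] = {
    val cache = new BoundedCache[K, V](capacity)(combine)
    val spilled = records.flatMap { case (k, v) => cache.put(k, v) }.toList
    spilled ++ cache.flush()
  }

  def main(args: Array[String]): Unit = {
    // Sum over 10 repeated keys: the cache collapses 1000 records.
    val counts = (1 to 1000).map(i => (i % 10, 1L))
    println(emitted(counts, 100)(_ + _).size)

    // toList over unique keys: nothing merges, the cache only holds memory.
    val unique = (1 to 1000).map(i => (i, List(i)))
    println(emitted(unique, 100)(_ ++ _).size)
  }
}
```

With the sum, 1000 input records leave the mapper as 10. With toList over
unique keys, all 1000 records come out the other side unchanged, so the
cache only costs memory while the records sit in it.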

On Tue, Aug 23, 2016 at 19:56 Kostya Salomatin <[email protected]> wrote:

> Hey scalding pros,
>
> I've got a strange java heap space issue in my mapper. I've got a fix that
> helps, but I would like to understand better what is going on under the
> hood, why my fix helps and whether there is an alternative solution (e.g.
> changing job parameters). This is the code in question:
>
> pipe
>   .map { candidateSet => (candidateSet.key, candidateSet.candidates) }
>   .collect { case (Some(key), candidates) => (key, candidates) }
>   .group
>   //.forceToReducers - adding this line solves the problem
>   .toList // this does not cause the issue, the rows have unique keys
>   .mapValues {_.flatten}
>
>
> After this group the pipe is joined with another pipe using the same key,
> so I keep it as UnsortedGrouped[K, V]
>
> The data has unique keys, so there are no map-side reductions, and the
> .toList call is actually redundant. My guess is that the mapper tries to
> execute some map-side sorting / data optimization and this is what causes
> the problems. The default amount of memory is sufficient for all job
> overheads (works fine for lots of other jobs). Just to be sure, I increased
> the heap size significantly and it did not help.
>
> .forceToReducers solves the problem, it was my semi-intelligent guess, I
> expected this call to turn off some mapper logic that was redundant in case
> of unique keys, but still I don't understand why exactly it helped. Could
> be the way the input data is buffered and sorted in memory.
>
> Any ideas?
>
> Thanks,
> Kostya
>
> --
> You received this message because you are subscribed to the Google Groups
> "Scalding Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
