Hey Scalding pros,
I've got a strange Java heap space issue in my mapper. I have a fix that
helps, but I would like to understand better what is going on under the
hood, why the fix helps, and whether there is an alternative solution (e.g.
changing job parameters). This is the code in question:
pipe
  .map { candidateSet => (candidateSet.key, candidateSet.candidates) }
  .collect { case (Some(key), candidates) => (key, candidates) }
  .group
  // .forceToReducers - adding this line solves the problem
  .toList // this does not cause the issue, the rows have unique keys
  .mapValues { _.flatten }
After this group, the pipe is joined with another pipe on the same key,
so I keep it as an UnsortedGrouped[K, V].
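For reference, the downstream step looks roughly like this (a minimal
sketch only; otherPipe and its value type are placeholders for the real
second pipe, which isn't shown here):

  // sketch of the downstream join, assuming a second pipe
  // otherPipe: TypedPipe[(K, W)] keyed on the same K
  val grouped = pipe
    .map { candidateSet => (candidateSet.key, candidateSet.candidates) }
    .collect { case (Some(key), candidates) => (key, candidates) }
    .group
    .toList
    .mapValues { _.flatten }

  val joined = grouped.join(otherPipe.group)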
The data has unique keys, so there are no map-side reductions, and the
.toList call is actually redundant. My guess is that the mapper tries to
do some map-side sorting / data optimization, and that is what causes the
problem. The default amount of memory is sufficient for all job overheads
(it works fine for lots of other jobs); just to be sure, I increased the
heap size significantly and it did not help.
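By "changing job parameters" I mean something like the following sketch,
assuming Hadoop's map-side sort buffer is the relevant knob (I'm not sure
it is; the class name and the value 512 are just for illustration):

  import com.twitter.scalding.{ Args, Job }

  class CandidatesJob(args: Args) extends Job(args) {
    // bump the map-side sort buffer; io.sort.mb is the pre-YARN name,
    // mapreduce.task.io.sort.mb on newer Hadoop versions
    override def config: Map[AnyRef, AnyRef] =
      super.config ++ Map("io.sort.mb" -> "512")

    // ... pipeline as above ...
  }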
.forceToReducers solves the problem. It was my semi-intelligent guess: I
expected this call to turn off some mapper logic that was redundant in the
case of unique keys, but I still don't understand why exactly it helped.
It could be the way the input data is buffered and sorted in memory.
Any ideas?
Thanks,
Kostya