Hey Scalding pros,

I've got a strange Java heap space issue in my mapper. I have a fix that 
helps, but I would like to understand better what is going on under the 
hood, why my fix helps, and whether there is an alternative solution (e.g. 
changing job parameters). This is the code in question:

pipe
  .map { candidateSet => (candidateSet.key, candidateSet.candidates) }
  .collect { case (Some(key), candidates) => (key, candidates) }
  .group
  //.forceToReducers - adding this line solves the problem
  .toList // this does not cause the issue, the rows have unique keys
  .mapValues { _.flatten }


After this grouping, the pipe is joined with another pipe on the same key, 
so I keep it as UnsortedGrouped[K, V].

The data has unique keys, so there are no map-side reductions, and the 
.toList call is actually redundant. My guess is that the mapper tries to 
do some map-side sorting or data optimization, and this is what causes the 
problem. The default amount of memory is sufficient for all job overheads 
(it works fine for lots of other jobs). Just to be sure, I increased the 
heap size significantly, and it did not help.
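
To make the "redundant" claim concrete, here is a small sketch using plain 
Scala collections only. It models the semantics of the pipeline above, not 
how Scalding or Hadoop actually execute it, so it says nothing about memory 
use. CandidateSet with a String key and a List[Int] of candidates is a 
hypothetical stand-in for my real type, and the helper names are made up 
for illustration.

```scala
object UniqueKeysSketch {
  // Hypothetical stand-in for the real record type (assumption).
  final case class CandidateSet(key: Option[String], candidates: List[Int])

  // Mirrors .map + .collect: keep only the rows that actually have a key.
  def keyed(rows: List[CandidateSet]): List[(String, List[Int])] =
    rows
      .map(cs => (cs.key, cs.candidates))
      .collect { case (Some(key), candidates) => (key, candidates) }

  // Mirrors .group.toList.mapValues(_.flatten) on an in-memory collection:
  // gather all values per key into a list, then flatten that list.
  def groupedAndFlattened(rows: List[CandidateSet]): Map[String, List[Int]] =
    keyed(rows)
      .groupBy { case (key, _) => key }
      .map { case (key, kvs) => key -> kvs.map(_._2).flatten }

  def main(args: Array[String]): Unit = {
    val rows = List(
      CandidateSet(Some("a"), List(1, 2)),
      CandidateSet(None, List(3)), // dropped by collect
      CandidateSet(Some("b"), List(4))
    )
    // With unique keys, each group holds exactly one list, so grouping and
    // flattening returns every key with its original candidates unchanged.
    assert(groupedAndFlattened(rows) == keyed(rows).toMap)
    println(groupedAndFlattened(rows))
  }
}
```

So logically the grouping step does no work on this data; whatever fails 
must come from how the step is executed rather than from what it computes.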

Adding .forceToReducers solves the problem. It was a semi-intelligent 
guess: I expected this call to turn off some mapper logic that is redundant 
when the keys are unique. Still, I don't understand why exactly it helped. 
It could be related to the way the input data is buffered and sorted in 
memory.

Any ideas?

Thanks,
Kostya
