Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/931#issuecomment-49254607
Hey, so I rebased this PR and made it mergeable in my own branch,
https://github.com/mateiz/spark/tree/spark-931. However, in doing this I
realized that there may be some fundamental problems here.
The main issue is that AppendOnlyMap and ExternalAppendOnlyMap both require
that there be only one value per key: the in-memory AOM becomes very
inefficient otherwise, and the EAOM depends on that behavior. This means that
for sort, we have to group values into (Key, ArrayBuffer[Value]) pairs, which
will consume more memory by default than our in-memory sort, and will make us
crash if a single key has too many values (a failure mode we already have
today, but one that would hit sooner here). Thus it seems that long-term we
need a very different solution here, basically an external merge-sort.
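
To make this concrete, here's a quick standalone sketch of the grouping that
the one-value-per-key constraint forces (plain Scala with a mutable HashMap,
not the actual AppendOnlyMap code; the names are just illustrative). Every
value for a key gets folded into a single per-key buffer, so one heavily
repeated key accumulates all of its values in memory at once:

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}

object GroupIntoBuffers {
  def group[K, V](records: Iterator[(K, V)]): HashMap[K, ArrayBuffer[V]] = {
    val map = new HashMap[K, ArrayBuffer[V]]
    for ((k, v) <- records) {
      // Keep exactly one entry per key, as AppendOnlyMap requires, at the
      // cost of one growing ArrayBuffer per distinct key.
      map.getOrElseUpdate(k, new ArrayBuffer[V]) += v
    }
    map
  }

  def main(args: Array[String]): Unit = {
    // A skewed key: all of its values pile into a single in-memory buffer,
    // which is what can OOM us when one key has too many values.
    val records = Iterator.fill(5)(("hot", 1)) ++ Iterator(("cold", 2))
    group(records).foreach { case (k, vs) => println(s"$k -> $vs") }
  }
}
```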
A second, possibly less serious issue is that changing the EAOM to use
comparator.compare instead of hash-code equality makes it less efficient in
the default hashing-based case: instead of saving each key's hash code in an
Int and reusing it to compare against other keys in various places, we now
recompute the hash every time we compare a pair of elements.
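
Here's a hypothetical illustration of that cost (standalone Scala, not the
EAOM internals): with hashing, each key's hash is computed once and cached,
and comparisons short-circuit on a cheap Int compare; an ordering defined
over hash codes has to recompute hashCode() on every compare call.

```scala
final case class CachedKey[K](key: K) {
  val cachedHash: Int = key.hashCode()   // computed once, reused everywhere
}

object HashCaching {
  // Hash-based path: compare cached Ints first; only call equals on a match.
  def fastEqual[K](a: CachedKey[K], b: CachedKey[K]): Boolean =
    a.cachedHash == b.cachedHash && a.key == b.key

  // Comparator-based path: hashCode() runs again on every comparison.
  def hashOrdering[K]: Ordering[K] =
    Ordering.by((k: K) => k.hashCode())

  def main(args: Array[String]): Unit = {
    val (a, b) = (CachedKey("spark"), CachedKey("shuffle"))
    println(fastEqual(a, b))                                   // one Int compare
    println(hashOrdering[String].compare("spark", "shuffle"))  // recomputes hashes
  }
}
```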
For these reasons I'd actually hold off on merging this (even my merged
version) until we implement an external merge-sort as part of sort-based
shuffle. Then we can use that data structure here.
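
For reference, here's a rough sketch of the kind of external merge-sort I
mean, under purely illustrative assumptions (Int records, temp files, a fixed
in-memory chunk size; this is not the eventual Spark implementation): sort
chunks in memory, spill each sorted run to disk, then merge the runs with a
priority queue, so duplicate keys stream through the merge instead of piling
up in one buffer.

```scala
import java.io._
import scala.collection.mutable

object ExternalMergeSort {
  def sort(input: Iterator[Int], maxInMemory: Int): Iterator[Int] = {
    // Phase 1: sort fixed-size chunks in memory and spill each run to a file.
    val runs = input.grouped(maxInMemory).map { chunk =>
      val file = File.createTempFile("spill-", ".run")
      file.deleteOnExit()
      val out = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream(file)))
      chunk.sorted.foreach(out.writeInt)
      out.close()
      file
    }.toSeq

    // Phase 2: stream the runs back and merge them with a min-heap that
    // holds one head element per run, never the full data set.
    val readers = runs.map(f =>
      new DataInputStream(new BufferedInputStream(new FileInputStream(f))))
    def read(r: DataInputStream): Option[Int] =
      try Some(r.readInt()) catch { case _: EOFException => r.close(); None }

    val heap = mutable.PriorityQueue.empty[(Int, Int)](
      Ordering.by[(Int, Int), Int](_._1).reverse)
    readers.zipWithIndex.foreach { case (r, i) =>
      read(r).foreach(v => heap.enqueue((v, i)))
    }

    new Iterator[Int] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): Int = {
        val (v, i) = heap.dequeue()
        // Refill the heap from whichever run we just consumed from.
        read(readers(i)).foreach(nv => heap.enqueue((nv, i)))
        v
      }
    }
  }

  def main(args: Array[String]): Unit =
    println(sort(Iterator(5, 3, 3, 9, 1, 3, 7, 2), maxInMemory = 3).toList)
}
```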