[GitHub] spark pull request: SPARK-4550. In sort-based shuffle, store map o...

pwendell Wed, 22 Apr 2015 08:30:17 -0700

Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4450#issuecomment-95237844

Hey @sryza - two higher level questions as I'm doing a deeper review of
this.

1. This seems predicated on the idea that serialization streams can safely
be sliced with respect to individual objects. In general that's a bit of a
gamble, given that both of our serializers only officially support use through
a stream interface. I was actually surprised this works given that Kryo does
object reference tracking, however, I looked and it turns out that Kryo will
reset its internal tracking maps after every outer call to writeObject():
https://github.com/EsotericSoftware/kryo/blob/kryo-2.21/src/com/esotericsoftware/kryo/Kryo.java#L488.
Is the idea that we just assume Kryo will continue to have this property, or
at least some configuration where this will work?
2. The second broad question is whether the chained buffer is necessary,
since it constitutes a lot of the complexity in this patch. Is the chain buffer
just an additional optimization or is it in some way required for using
serailized outputs? The way it is now, you are going to allocate a large number
of 8KB buffers, at random memory locations.... overall the fragmentation this
creates could make GC slower for the JVM. The "win" is that you don't need to
use a buffer that does copying to double growth, but given that the copying is
amortized over a relatively small number of doublings, I'm not sure I see why
that's a big deal. When GC tries to move your little buffers around, it will
likely have to copy them anyways. I guess I'd like to better understand the
evidence that this chaining is clearly better than the simpler alternative of
using a managed, growing buffer.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: SPARK-4550. In sort-based shuffle, store map o...

Reply via email to