Based on my, limited, read through of the code that uses this, it seems like often a new KryoSerializerInstance is created for whatever task and then it falls out of scope, instead of being reused. I did notice that comment about a pool size of 1, however if the use is generally how I just described, I don't think that really does anything.
I would be happy to open a PR but, pardon my ignorance, how would I go about doing that properly? Do I need to open a JIRA issue first? Also how would I demonstrate performance gains? Do you guys use something like ScalaMeter? Thanks for your help! On Wed, Oct 24, 2018 at 2:37 PM Sean Owen <sro...@gmail.com> wrote: > I don't know; possibly just because it wasn't available whenever Kryo > was first used in the project. > > Skimming the code, the KryoSerializerInstance looks like a wrapper > that provides a Kryo object to do work. It already maintains a 'pool' > of just 1 instance. Is the point that KryoSerializer can share a > KryoPool across KryoSerializerInstances that provides them with a Kryo > rather than allocate a new one? makes sense, though I believe the > concern is always whether that somehow shares state or config in a way > that breaks something. I see there's already a reset() call in here to > try to avoid that. > > Well, seems worth a PR, especially if you can demonstrate some > performance gains. > > On Wed, Oct 24, 2018 at 3:09 PM Patrick Brown > <patrick.barry.br...@gmail.com> wrote: > > > > Hi, > > > > I am wondering about the implementation of KryoSerializer, specifically > the lack of use of KryoPool, which is recommended by Kryo themselves. > > > > Looking at the code, it seems that frequently KryoSerializer.newInstance > is called, followed by a serialize and then this instance goes out of > scope, this seems like it causes frequent creation of Kryo instances, > something which the Kryo documentation says is expensive. > > > > By doing flame graphs on our own running software (it processes a lot of > small jobs) it seems like a good amount of time is spent on this. > > > > I have a small patch we are using internally which implements a reused > KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to > avoid the creation of many Kryo instances. I am wonder if I am missing > something as to why this isn't done already. If not I am wondering if this > might be a patch that Spark would be interested in merging in, and how I > might go about that. > > > > Thanks, > > > > Patrick >