It's not so much the KryoSerializerInstance that's the problem, but that it will always make a new Kryo (although at most 1). You mean to supply it with a reference to a pool instead, shared across all KryoSerializerInstance? plausible yeah.
See https://spark.apache.org/contributing.html for guidance but basically you'd make a JIRA and then a pull request at apache/spark, with the JIRA number in the title and some other conventions. On Thu, Oct 25, 2018 at 11:51 AM Patrick Brown <patrick.barry.br...@gmail.com> wrote: > > Based on my, limited, read through of the code that uses this, it seems like > often a new KryoSerializerInstance is created for whatever task and then it > falls out of scope, instead of being reused. I did notice that comment about > a pool size of 1, however if the use is generally how I just described, I > don't think that really does anything. > > I would be happy to open a PR but, pardon my ignorance, how would I go about > doing that properly? Do I need to open a JIRA issue first? Also how would I > demonstrate performance gains? Do you guys use something like ScalaMeter? > > Thanks for your help! > > On Wed, Oct 24, 2018 at 2:37 PM Sean Owen <sro...@gmail.com> wrote: >> >> I don't know; possibly just because it wasn't available whenever Kryo >> was first used in the project. >> >> Skimming the code, the KryoSerializerInstance looks like a wrapper >> that provides a Kryo object to do work. It already maintains a 'pool' >> of just 1 instance. Is the point that KryoSerializer can share a >> KryoPool across KryoSerializerInstances that provides them with a Kryo >> rather than allocate a new one? makes sense, though I believe the >> concern is always whether that somehow shares state or config in a way >> that breaks something. I see there's already a reset() call in here to >> try to avoid that. >> >> Well, seems worth a PR, especially if you can demonstrate some >> performance gains. >> >> On Wed, Oct 24, 2018 at 3:09 PM Patrick Brown >> <patrick.barry.br...@gmail.com> wrote: >> > >> > Hi, >> > >> > I am wondering about the implementation of KryoSerializer, specifically >> > the lack of use of KryoPool, which is recommended by Kryo themselves. >> > >> > Looking at the code, it seems that frequently KryoSerializer.newInstance >> > is called, followed by a serialize and then this instance goes out of >> > scope, this seems like it causes frequent creation of Kryo instances, >> > something which the Kryo documentation says is expensive. >> > >> > By doing flame graphs on our own running software (it processes a lot of >> > small jobs) it seems like a good amount of time is spent on this. >> > >> > I have a small patch we are using internally which implements a reused >> > KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to >> > avoid the creation of many Kryo instances. I am wonder if I am missing >> > something as to why this isn't done already. If not I am wondering if this >> > might be a patch that Spark would be interested in merging in, and how I >> > might go about that. >> > >> > Thanks, >> > >> > Patrick --------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org