It's not so much the KryoSerializerInstance that's the problem, but
that it will always make a new Kryo (although at most 1). You mean to
supply it with a reference to a pool instead, shared across all
KryoSerializerInstance? plausible yeah.

See https://spark.apache.org/contributing.html for guidance but
basically you'd make a JIRA and then a pull request at apache/spark,
with the JIRA number in the title and some other conventions.
On Thu, Oct 25, 2018 at 11:51 AM Patrick Brown
<patrick.barry.br...@gmail.com> wrote:
>
> Based on my, limited, read through of the code that uses this, it seems like 
> often a new KryoSerializerInstance is created for whatever task and then it 
> falls out of scope, instead of being reused. I did notice that comment about 
> a pool size of 1, however if the use is generally how I just described, I 
> don't think that really does anything.
>
> I would be happy to open a PR but, pardon my ignorance, how would I go about 
> doing that properly? Do I need to open a JIRA issue first? Also how would I 
> demonstrate performance gains? Do you guys use something like ScalaMeter?
>
> Thanks for your help!
>
> On Wed, Oct 24, 2018 at 2:37 PM Sean Owen <sro...@gmail.com> wrote:
>>
>> I don't know; possibly just because it wasn't available whenever Kryo
>> was first used in the project.
>>
>> Skimming the code, the KryoSerializerInstance looks like a wrapper
>> that provides a Kryo object to do work. It already maintains a 'pool'
>> of just 1 instance. Is the point that KryoSerializer can share a
>> KryoPool across KryoSerializerInstances that provides them with a Kryo
>> rather than allocate a new one? makes sense, though I believe the
>> concern is always whether that somehow shares state or config in a way
>> that breaks something. I see there's already a reset() call in here to
>> try to avoid that.
>>
>> Well, seems worth a PR, especially if you can demonstrate some
>> performance gains.
>>
>> On Wed, Oct 24, 2018 at 3:09 PM Patrick Brown
>> <patrick.barry.br...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I am wondering about the implementation of KryoSerializer, specifically 
>> > the lack of use of KryoPool, which is recommended by Kryo themselves.
>> >
>> > Looking at the code, it seems that frequently KryoSerializer.newInstance 
>> > is called, followed by a serialize and then this instance goes out of 
>> > scope, this seems like it causes frequent creation of Kryo instances, 
>> > something which the Kryo documentation says is expensive.
>> >
>> > By doing flame graphs on our own running software (it processes a lot of 
>> > small jobs) it seems like a good amount of time is spent on this.
>> >
>> > I have a small patch we are using internally which implements a reused 
>> > KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to 
>> > avoid the creation of many Kryo instances. I am wonder if I am missing 
>> > something as to why this isn't done already. If not I am wondering if this 
>> > might be a patch that Spark would be interested in merging in, and how I 
>> > might go about that.
>> >
>> > Thanks,
>> >
>> > Patrick

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Reply via email to