Ivan Vergiliev created SPARK-5269:
-------------------------------------

             Summary: BlockManager.dataDeserialize always creates a new 
serializer instance
                 Key: SPARK-5269
                 URL: https://issues.apache.org/jira/browse/SPARK-5269
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Ivan Vergiliev


BlockManager.dataDeserialize always creates a new instance of the serializer, 
which is pretty slow in some cases. I'm using Kryo serialization and have a 
custom registrator, and its register method is showing up as taking about 15% 
of the execution time in my profiles. This started happening after I increased 
the number of keys in a job with a shuffle phase by a factor of 40.

One solution I can think of is to cache a ThreadLocal SerializerInstance for 
the defaultSerializer, and only create a new instance when a custom serializer 
is passed in. AFAICT a custom serializer is only passed from 
DiskStore.getValues, which in turn depends on the serializer passed to 
ExternalSorter. I don't know how often that path is used, but I think this 
could still be a good win for the standard use case.
Oh, and also - ExternalSorter already has a SerializerInstance of its own, so 
if the getValues method is only called from a single thread, maybe we could 
pass that in directly?
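The ThreadLocal idea could look roughly like the sketch below. This is a 
minimal Java illustration with stand-in Serializer/SerializerInstance classes, 
not Spark's actual API - the point is just that the default serializer's 
instance (where the expensive Kryo registrator runs) is created once per 
thread and reused, while a custom serializer still gets a fresh instance:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for Spark's Serializer; names are illustrative only.
class Serializer {
    static final AtomicInteger instancesCreated = new AtomicInteger();

    SerializerInstance newInstance() {
        // Expensive in the real code: this is where a custom Kryo
        // registrator's register method would run.
        instancesCreated.incrementAndGet();
        return new SerializerInstance();
    }
}

class SerializerInstance {}

class BlockManagerSketch {
    private final Serializer defaultSerializer = new Serializer();

    // One cached instance per thread for the default serializer.
    private final ThreadLocal<SerializerInstance> cachedDefaultInstance =
        ThreadLocal.withInitial(defaultSerializer::newInstance);

    // Mirrors the proposed dataDeserialize behavior: reuse the cached
    // instance for the default serializer, create a fresh one otherwise.
    SerializerInstance instanceFor(Serializer serializer) {
        if (serializer == null || serializer == defaultSerializer) {
            return cachedDefaultInstance.get();
        }
        return serializer.newInstance();
    }
}

public class Main {
    public static void main(String[] args) {
        BlockManagerSketch bm = new BlockManagerSketch();
        SerializerInstance a = bm.instanceFor(null);
        SerializerInstance b = bm.instanceFor(null);
        // Same thread -> same cached instance, only one creation.
        System.out.println(a == b);
        System.out.println(Serializer.instancesCreated.get());
    }
}
```

Repeated deserialization calls on the same thread then pay the instance 
creation cost only once, instead of once per call.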

I'd be happy to try a patch but would probably need a confirmation from someone 
that this approach would indeed work (or an idea for another).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
