GitHub user witgo opened a pull request:

    The SerializerInstance instance used when deserializing a TaskResult is not reused

    ## What changes were proposed in this pull request?
    The following code is called when a `DirectTaskResult` instance is deserialized:
      def value(): T = {
        if (valueObjectDeserialized) {
          valueObject
        } else {
          // Each deserialization creates a new instance of SerializerInstance,
          // which is very time-consuming.
          val resultSer = SparkEnv.get.serializer.newInstance()
          valueObject = resultSer.deserialize(valueBytes)
          valueObjectDeserialized = true
          valueObject
        }
      }
    When a stage has a large number of tasks, reusing the SerializerInstance instance can improve scheduling performance by about three times.
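    The reuse idea can be sketched in plain Scala, independent of Spark. This is a hypothetical illustration, not the actual patch: `SerInstance`, `TaskResultDeser`, and `Demo` are stand-in names, with Java serialization standing in for Spark's serializer. The point is that a per-thread cached instance avoids paying the construction cost on every `value()` call:

    ```scala
    import java.io._

    // Stand-in for Spark's SerializerInstance: construction is the expensive part.
    class SerInstance {
      SerInstance.created += 1
      def deserialize[T](bytes: Array[Byte]): T = {
        val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
        try in.readObject().asInstanceOf[T] finally in.close()
      }
    }
    object SerInstance { var created = 0 }

    object TaskResultDeser {
      // One cached instance per thread, mirroring the reuse idea in this PR.
      private val cached = new ThreadLocal[SerInstance] {
        override def initialValue(): SerInstance = new SerInstance
      }
      def value[T](bytes: Array[Byte]): T = cached.get().deserialize[T](bytes)
    }

    object Demo {
      // Serialize a sample result once, up front.
      val bytes: Array[Byte] = {
        val bos = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(bos)
        out.writeObject("result"); out.close()
        bos.toByteArray
      }
      def main(args: Array[String]): Unit = {
        // Many task results deserialized on one thread reuse a single instance.
        (1 to 1000).foreach(_ => TaskResultDeser.value[String](bytes))
        println(SerInstance.created) // 1 on a single thread
      }
    }
    ```

    A `ThreadLocal` is used here because a single shared instance would only be safe if the serializer itself were thread-safe.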
    The test data is TPC-DS 2 TB (Parquet) and the SQL statement is as follows:
    select  i_item_id, 
            avg(ss_quantity) agg1,
            avg(ss_list_price) agg2,
            avg(ss_coupon_amt) agg3,
            avg(ss_sales_price) agg4 
     from store_sales, customer_demographics, date_dim, item, promotion
     where ss_sold_date_sk = d_date_sk and
           ss_item_sk = i_item_sk and
           ss_cdemo_sk = cd_demo_sk and
           ss_promo_sk = p_promo_sk and
           cd_gender = 'M' and 
           cd_marital_status = 'M' and
           cd_education_status = '4 yr Degree' and
           (p_channel_email = 'N' or p_channel_event = 'N') and
           d_year = 2001 
     group by i_item_id
     order by i_item_id
     limit 100;
    `spark-defaults.conf` file:
    spark.master                           yarn-client
    spark.executor.instances               20
    spark.driver.memory                    16g
    spark.executor.memory                  30g
    spark.executor.cores                   5
    spark.default.parallelism              100 
    spark.sql.shuffle.partitions           100000 
    spark.driver.maxResultSize              0
    spark.rpc.netty.dispatcher.numThreads   8
    spark.executor.extraJavaOptions          -XX:+UseG1GC 
-XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M 
    spark.cleaner.referenceTracking.blocking true
    spark.cleaner.referenceTracking.blocking.shuffle true
    Performance test results are as follows:
    with this patch | without this patch
    ------------ | -------------
    54.5 s | 231.7 s
    ## How was this patch tested?
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull SPARK-17930

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15512
commit 037871d8843760fbbdeab344d8228bfaeba6f6ae
Author: Guoqiang Li <>
Date:   2016-10-16T03:18:00Z

    The SerializerInstance instance used when deserializing a TaskResult is not reused

