GitHub user witgo opened a pull request:

    The SerializerInstance instance used when deserializing a TaskResult is not reused

    ## What changes were proposed in this pull request?
    The following code is called when a `DirectTaskResult` instance is deserialized:
      def value(): T = {
        if (valueObjectDeserialized) {
          valueObject
        } else {
          // Each deserialization creates a new instance of SerializerInstance,
          // which is very time-consuming.
          val resultSer = SparkEnv.get.serializer.newInstance()
          valueObject = resultSer.deserialize(valueBytes)
          valueObjectDeserialized = true
          valueObject
        }
      }
    When a stage has a large number of tasks, reusing the SerializerInstance instance can improve scheduling performance by about three times.
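    The reuse idea can be sketched in plain Scala, independent of Spark. This is a hypothetical illustration, not the actual patch: `SerInstance`, `TaskResultDeser`, and `Demo` are stand-in names, with Java serialization standing in for Spark's serializer. The point is that a per-thread cached instance avoids paying the construction cost on every `value()` call:

    ```scala
    import java.io._

    // Stand-in for Spark's SerializerInstance: construction is the expensive part.
    class SerInstance {
      SerInstance.created += 1
      def deserialize[T](bytes: Array[Byte]): T = {
        val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
        try in.readObject().asInstanceOf[T] finally in.close()
      }
    }
    object SerInstance { var created = 0 }

    object TaskResultDeser {
      // One cached instance per thread, mirroring the reuse idea in this PR.
      private val cached = new ThreadLocal[SerInstance] {
        override def initialValue(): SerInstance = new SerInstance
      }
      def value[T](bytes: Array[Byte]): T = cached.get().deserialize[T](bytes)
    }

    object Demo {
      // Serialize a sample result once, up front.
      val bytes: Array[Byte] = {
        val bos = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(bos)
        out.writeObject("result"); out.close()
        bos.toByteArray
      }
      def main(args: Array[String]): Unit = {
        // Many task results deserialized on one thread reuse a single instance.
        (1 to 1000).foreach(_ => TaskResultDeser.value[String](bytes))
        println(SerInstance.created) // 1 on a single thread
      }
    }
    ```

    A `ThreadLocal` is used here because a single shared instance would only be safe if the serializer itself were thread-safe.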
    The test data is TPC-DS 2 TB (Parquet) and the SQL statement is as follows:
    select  i_item_id, 
            avg(ss_quantity) agg1,
            avg(ss_list_price) agg2,
            avg(ss_coupon_amt) agg3,
            avg(ss_sales_price) agg4 
     from store_sales, customer_demographics, date_dim, item, promotion
     where ss_sold_date_sk = d_date_sk and
           ss_item_sk = i_item_sk and
           ss_cdemo_sk = cd_demo_sk and
           ss_promo_sk = p_promo_sk and
           cd_gender = 'M' and 
           cd_marital_status = 'M' and
           cd_education_status = '4 yr Degree' and
           (p_channel_email = 'N' or p_channel_event = 'N') and
           d_year = 2001 
     group by i_item_id
     order by i_item_id
     limit 100;
    `spark-defaults.conf` file:
    spark.master                           yarn-client
    spark.executor.instances               20
    spark.driver.memory                    16g
    spark.executor.memory                  30g
    spark.executor.cores                   5
    spark.default.parallelism              100 
    spark.sql.shuffle.partitions           100000 
    spark.driver.maxResultSize              0
    spark.rpc.netty.dispatcher.numThreads   8
    spark.executor.extraJavaOptions          -XX:+UseG1GC 
-XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M 
    spark.cleaner.referenceTracking.blocking true
    spark.cleaner.referenceTracking.blocking.shuffle true
    Performance test results are as follows:
    with this patch | without this patch
    ------------ | -------------
    54.5 s | 231.7 s
    ## How was this patch tested?
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull SPARK-17930

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15512
commit 037871d8843760fbbdeab344d8228bfaeba6f6ae
Author: Guoqiang Li <>
Date:   2016-10-16T03:18:00Z

    The SerializerInstance instance used when deserializing a TaskResult is not reused

