GitHub user bogdanrdc opened a pull request:
https://github.com/apache/spark/pull/22201
[SPARK-25209][SQL] Avoid deserializer check in Dataset.apply when Dataset
is actually DataFrame
## What changes were proposed in this pull request?
Dataset.apply calls dataset.deserializer (to provide an early error), which
ends up running the full Analyzer on the deserializer. This can take tens of
milliseconds, depending on how big the plan is.
Because Dataset.apply is called by many Dataset operations, such as
Dataset.where, this check can add significant overhead to short queries.
According to a comment in the PR that introduced this check, we can at
least remove this check for DataFrames:
https://github.com/apache/spark/pull/20402#discussion_r164338267
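To illustrate the idea, here is a minimal, self-contained sketch that does not depend on Spark. `Row`, `ToyDataset`, `EagerCheckSketch` and `analyzerRuns` are hypothetical stand-ins invented for this example, not the real Spark classes, and the runtime-class comparison is only an assumption about how "is actually a DataFrame" could be detected; the actual patch may do it differently.

```scala
import scala.reflect.ClassTag

// Hypothetical stand-in for Spark's Row; not the real class.
final case class Row(values: Seq[Any])

object EagerCheckSketch {
  // Counts how many times the simulated expensive analysis has run.
  var analyzerRuns: Int = 0

  final class ToyDataset[T](val tag: ClassTag[T]) {
    // Simulates resolving the deserializer, which in Spark runs the Analyzer.
    lazy val deserializer: String = {
      analyzerRuns += 1
      s"deserializer for ${tag.runtimeClass.getSimpleName}"
    }
  }

  def apply[T](implicit tag: ClassTag[T]): ToyDataset[T] = {
    val ds = new ToyDataset[T](tag)
    // Force the eager check only for typed datasets. A row-typed dataset
    // (a DataFrame) is assumed to always deserialize, so the check is skipped.
    if (tag.runtimeClass != classOf[Row]) {
      ds.deserializer
    }
    ds
  }
}
```

In this sketch, `EagerCheckSketch.apply[Row]` leaves `analyzerRuns` unchanged, while `EagerCheckSketch.apply[String]` increments it once. Typed datasets still get the early error; only the untyped case skips the cost.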
## How was this patch tested?
Existing tests + manual benchmark
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/bogdanrdc/spark deserializer-fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22201.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22201
----
commit 7089e035253c80bd143f3af4d12f39643e9eaf84
Author: Bogdan Raducanu <bogdan@...>
Date: 2018-08-23T12:11:34Z
optimization
----