GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/10747
[SPARK-12813][SQL] Eliminate serialization for back to back operations
The goal of this PR is to eliminate unnecessary translations when there are
back-to-back `MapPartitions` operations. In order to achieve this I also made
the following simplifications:
- Operators no longer have hold encoders, instead they have only the
expressions that they need. The benefits here are twofold: the expressions are
visible to transformations so go through the normal resolution/binding process.
now that they are visible we can change them on a case by case basis.
- Operators no longer have type parameters. Since the engine is
responsible for its own type checking, having the types visible to the complier
was an unnecessary complication. We still leverage the scala compiler in the
companion factory when constructing a new operator, but after this the types
are discarded.
Deferred to a follow up PR:
- Remove as much of the resolution/binding from Dataset/GroupedDataset as
possible. We should still eagerly check resolution and throw an error though in
the case of mismatches for an `as` operation.
- Eliminate serializations in more cases by adding more cases to
`EliminateSerialization`
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark encoderExpressions
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10747.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10747
----
commit 4615c9614a2a63dbb716b97093ec83801ebdeefd
Author: Michael Armbrust <[email protected]>
Date: 2016-01-13T23:15:12Z
[SPARK-12813][SQL] Eliminate serialization for back to back operations
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]