GitHub user drewrobb opened a pull request:
https://github.com/apache/spark/pull/23062
[SPARK-8288][SQL] ScalaReflection can use companion object constructor
## What changes were proposed in this pull request?
This change fixes a particular scenario where default Spark SQL can't
encode (thrift) types that are generated by twitter scrooge. These types are a
trait that extends `scala.ProductX` with a constructor defined only in a
companion object, rather than an actual case class. The actual case class used
is a child class, but that type is almost never referred to in code. The type
has no corresponding constructor symbol and causes an exception. For all other
purposes, these classes act just like case classes, so it is unfortunate that
Spark SQL can't serialize them as nicely as it can actual case classes. For a
full example of a scrooge-codegen class, see
https://gist.github.com/anonymous/ba13d4b612396ca72725eaa989900314.
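A minimal sketch of the shape these generated types take (a hypothetical `Point` type, not actual scrooge output):

```scala
// Hypothetical, minimal analogue of a scrooge-generated thrift type: a trait
// extending ProductN whose only public "constructor" is the companion's apply.
trait Point extends Product2[Int, Int] {
  def x: Int = _1
  def y: Int = _2
  def canEqual(other: Any): Boolean = other.isInstanceOf[Point]
}

object Point {
  // The companion apply plays the role of the constructor; the concrete
  // implementation class is private and never referred to in user code.
  def apply(x: Int, y: Int): Point = new Immutable(x, y)

  private class Immutable(val _1: Int, val _2: Int) extends Point
}
```

Call sites read exactly like a case class (`Point(3, 4)`), yet `Point` itself is a trait with no constructor symbol for `ScalaReflection` to find.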
This change catches the case where the type has no constructor but does
have an `apply` method on the type's companion object. This allows for thrift
types to be serialized/deserialized with implicit encoders the same way as
normal case classes. This fix had to be done in three places where the
constructor is assumed to be an actual constructor:
1) In serializing, determining the schema for the dataframe relies on
inspecting its constructor (`ScalaReflection.constructParams`). Here we fall
back to using the companion constructor arguments.
2) In deserializing or evaluating, in the Java codegen
(`NewInstance.doGenCode`), the type couldn't be constructed with the `new`
keyword. If there is no constructor, the generated code is changed to call the
companion object's `apply` method instead.
3) In deserializing or evaluating without codegen, the constructor is
directly invoked (`NewInstance.constructor`). This was fixed by using Scala
reflection to get the actual companion `apply` method.
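Cases 1) and 3) can be sketched roughly as follows: Scala runtime reflection looks up the companion object's `apply` method, whose parameters can stand in for constructor parameters (case 1) and which can be invoked in place of a constructor (case 3). This is a standalone illustration under assumed names (`Coord`, `CompanionApplyDemo`), not the actual Spark code:

```scala
import scala.reflect.runtime.{universe => ru}

// Hypothetical companion-constructed type standing in for a scrooge class.
trait Coord extends Product2[Int, Int] {
  def canEqual(other: Any): Boolean = other.isInstanceOf[Coord]
}
object Coord {
  def apply(x: Int, y: Int): Coord = new Coord { val _1 = x; val _2 = y }
}

object CompanionApplyDemo {
  private val mirror = ru.runtimeMirror(getClass.getClassLoader)

  // Resolve the companion module and its `apply` method symbol.
  private val companion = ru.typeOf[Coord].typeSymbol.companion.asModule
  private val applySym  = companion.info.member(ru.TermName("apply")).asMethod

  // Case 1: the schema can fall back to the apply method's parameters.
  def schemaParams(): List[(String, String)] =
    applySym.paramLists.head.map(p => (p.name.toString, p.typeSignature.toString))

  // Case 3: reflect the singleton companion instance and invoke apply
  // in place of a JVM constructor.
  def construct(x: Int, y: Int): Coord = {
    val instance    = mirror.reflectModule(companion).instance
    val applyMirror = mirror.reflect(instance).reflectMethod(applySym)
    applyMirror(x, y).asInstanceOf[Coord]
  }
}
```

Note this relies on the `scala-reflect` runtime universe (Scala 2), which is what `NewInstance` uses on the non-codegen path.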
The return type of `findConstructor` was changed because the companion's
`apply` method can't be represented as a `java.lang.reflect.Constructor`.
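The reason the old return type can't work: the trait compiles to a JVM interface, which has no constructors at all, while the companion `apply` is an ordinary method on the module class. A quick illustration with a hypothetical `Pt` trait:

```scala
// Hypothetical companion-constructed type, as in the scrooge case.
trait Pt extends Product2[Int, Int] {
  def canEqual(other: Any): Boolean = other.isInstanceOf[Pt]
}
object Pt {
  def apply(x: Int, y: Int): Pt = new Pt { val _1 = x; val _2 = y }
}

object NoConstructorDemo {
  def main(args: Array[String]): Unit = {
    // The trait is a JVM interface: Java reflection finds no constructors,
    // so there is no java.lang.reflect.Constructor to return.
    println(classOf[Pt].getConstructors.length) // 0
    // The companion's apply, by contrast, is a plain method on the module class.
    val apply = Pt.getClass.getMethod("apply", classOf[Int], classOf[Int])
    val p = apply.invoke(Pt, Int.box(3), Int.box(4)).asInstanceOf[Pt]
    println(p._1) // 3
  }
}
```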
There might be situations in which this approach would also fail in a new
way, but it does at a minimum work for the specific scrooge example and will
not impact cases that were already succeeding prior to this change.
Note: this fix does not enable using scrooge thrift enums; additional work
is necessary for that. With this patch, it seems like you could patch
`com.twitter.scrooge.ThriftEnum` to extend `_root_.scala.Product1[Int]` with
`def _1 = value` to get Spark's implicit encoders to handle enums, but I have
yet to try this method myself.
Note: I previously opened a PR for this issue, but was only able to fix
case 1) there: https://github.com/apache/spark/pull/18766
## How was this patch tested?
I've fixed all 3 cases and added two tests that use a class that is similar
to a scrooge-generated one. The test in `ScalaReflectionSuite` checks case 1),
and the additional assertions in `ObjectExpressionsSuite` check cases 2) and 3).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/drewrobb/spark SPARK-8288
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/23062.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #23062
----
commit e27f933710c289d5bc1ddcad38eecde0e555ac60
Author: Drew Robb <drewrobb@...>
Date: 2017-07-29T01:45:00Z
ScalaReflection can use companion object constructor
----