Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/9190#discussion_r42778416
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala ---
@@ -46,13 +47,27 @@ trait Encoder[T] {

   /**
    * Returns an object of type `T`, extracting the required values from the provided row. Note that
-   * you must bind the encoder to a specific schema before you can call this function.
+   * you must `bind` an encoder to a specific schema before you can call this function.
    */
  def fromRow(row: InternalRow): T

   /**
    * Returns a new copy of this encoder, where the expressions used by `fromRow` are bound to the
-   * given schema
+   * given schema.
    */
  def bind(schema: Seq[Attribute]): Encoder[T]
--- End diff ---
I agree that this needs to be reworked. In particular, we should
separate resolution from binding (as mentioned in the PR description). The way
we are doing it today allows us to do very efficient codegen (no extra copies)
and correctly handles things like joins that produce ambiguous column names
(since internally we are binding to AttributeReferences).
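To make the distinction concrete, here is a minimal sketch of what separating
resolution from binding could look like; `resolve`, `Attribute`, and
`InternalRow` below are simplified stand-ins, not the actual Catalyst types:

```scala
// Minimal sketch of a resolve/bind split; these are simplified stand-ins,
// not the actual Catalyst types.
object ResolutionSketch {
  type InternalRow = IndexedSeq[Any]               // stand-in for Catalyst's InternalRow
  case class Attribute(name: String, exprId: Long) // stand-in for AttributeReference

  trait Encoder[T] {
    // Resolution: tie plain column names to concrete AttributeReferences
    // (by exprId), so ambiguous names produced by joins are disambiguated
    // once, up front.
    def resolve(output: Seq[Attribute]): Encoder[T]

    // Binding: rewrite the already-resolved expressions into ordinal
    // accesses against a physical schema, which is what lets codegen
    // avoid extra copies.
    def bind(schema: Seq[Attribute]): Encoder[T]

    // Only valid on a resolved and bound encoder.
    def fromRow(row: InternalRow): T
  }
}
```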
Given the limited time before the 1.6 code freeze, I'd rather mark the Encoder
API as private and focus on fleshing out the user-facing API. I think that long
term we'll do what you suggest: add a wrapper that reorders input for custom
encoders, and stick with pure expressions for the built-in encoders for
performance reasons.
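As a rough illustration of that wrapper idea, the sketch below reorders an
incoming row into the field order a custom encoder expects before delegating
to it; all names and shapes here are hypothetical, and a real implementation
would differ:

```scala
// Rough sketch of a reordering wrapper for custom encoders; all names and
// shapes here are hypothetical simplifications, not real Catalyst classes.
object ReorderingSketch {
  type InternalRow = IndexedSeq[Any] // stand-in for Catalyst's InternalRow

  trait Encoder[T] { def fromRow(row: InternalRow): T }

  // Projects the incoming row into the field order the wrapped encoder
  // expects, then delegates. This costs one extra copy per row, which is
  // why the built-in encoders would keep binding pure expressions instead.
  class ReorderingEncoder[T](underlying: Encoder[T], ordinals: Seq[Int])
      extends Encoder[T] {
    def fromRow(row: InternalRow): T =
      underlying.fromRow(ordinals.map(i => row(i)).toIndexedSeq)
  }
}
```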