Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/9190#discussion_r42778416
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/Encoder.scala ---
@@ -46,13 +47,27 @@ trait Encoder[T] {

   /**
    * Returns an object of type `T`, extracting the required values from the provided row. Note that
-   * you must bind the encoder to a specific schema before you can call this function.
+   * you must `bind` an encoder to a specific schema before you can call this function.
    */
  def fromRow(row: InternalRow): T

   /**
    * Returns a new copy of this encoder, where the expressions used by `fromRow` are bound to the
-   * given schema
+   * given schema.
    */
  def bind(schema: Seq[Attribute]): Encoder[T]
--- End diff ---
I agree that this needs to be reworked. In particular, we should
separate resolution from binding (as mentioned in the PR description). The way
we are doing it today allows us to do very efficient codegen (no extra copies)
and correctly handles things like joins that produce ambiguous column names
(since internally we are binding to AttributeReferences).
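To make the distinction concrete, here is a minimal sketch of what separating
resolution from binding could look like; `resolve`, `Attribute`, and
`InternalRow` below are simplified stand-ins, not the actual Catalyst types:

```scala
// Minimal sketch of a resolve/bind split; these are simplified stand-ins,
// not the actual Catalyst types.
object ResolutionSketch {
  type InternalRow = IndexedSeq[Any]               // stand-in for Catalyst's InternalRow
  case class Attribute(name: String, exprId: Long) // stand-in for AttributeReference

  trait Encoder[T] {
    // Resolution: tie plain column names to concrete AttributeReferences
    // (by exprId), so ambiguous names produced by joins are disambiguated
    // once, up front.
    def resolve(output: Seq[Attribute]): Encoder[T]

    // Binding: rewrite the already-resolved expressions into ordinal
    // accesses against a physical schema, which is what lets codegen
    // avoid extra copies.
    def bind(schema: Seq[Attribute]): Encoder[T]

    // Only valid on a resolved and bound encoder.
    def fromRow(row: InternalRow): T
  }
}
```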
Given the limited time before the 1.6 code freeze, I'd rather mark the Encoder
API as private and focus on fleshing out the user-facing API. I think that long
term we'll do what you suggest: add a wrapper that reorders input for custom
encoders, and stick with pure expressions for the built-in encoders for
performance reasons.
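As a rough illustration of that wrapper idea, the sketch below reorders an
incoming row into the field order a custom encoder expects before delegating
to it; all names and shapes here are hypothetical, and a real implementation
would differ:

```scala
// Rough sketch of a reordering wrapper for custom encoders; all names and
// shapes here are hypothetical simplifications, not real Catalyst classes.
object ReorderingSketch {
  type InternalRow = IndexedSeq[Any] // stand-in for Catalyst's InternalRow

  trait Encoder[T] { def fromRow(row: InternalRow): T }

  // Projects the incoming row into the field order the wrapped encoder
  // expects, then delegates. This costs one extra copy per row, which is
  // why the built-in encoders would keep binding pure expressions instead.
  class ReorderingEncoder[T](underlying: Encoder[T], ordinals: Seq[Int])
      extends Encoder[T] {
    def fromRow(row: InternalRow): T =
      underlying.fromRow(ordinals.map(i => row(i)).toIndexedSeq)
  }
}
```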