[GitHub] spark pull request: [SPARK-15657][SQL] RowEncoder should validate the data t...

cloud-fan Tue, 31 May 2016 10:06:02 -0700

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13401#discussion_r65222808
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala ---
    @@ -721,8 +716,55 @@ case class GetExternalRowField(
               "cannot be null.");
           }
     
    -      final ${ctx.javaType(dataType)} ${ev.value} = $getField;
    +      final Object ${ev.value} = ${row.value}.get($index);
          """
         ev.copy(code = code, isNull = "false")
       }
     }
    +
    +/**
    + * Validates the actual data type of input expression at runtime.  If it doesn't match the
    + * expectation, throw an exception.
    + */
    +case class ValidateExternalType(child: Expression, expected: DataType)
    --- End diff --
    
    The problem is that we can't trust it. When users call
    `createDataFrame(rows, schema)`, we should definitely validate the
    passed-in rows. I don't think performance matters much here, since this
    validation happens only at the beginning of the data flow. One potential
    issue is that `Dataset.map` can return a row, in which case users provide
    a schema that we should trust. However, I don't think we should expose
    `RowEncoder` to users, and `Dataset.map` should never return a row.
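
    To make the "validate the passed-in rows" idea concrete, here is a minimal,
    Spark-free sketch of a runtime type check applied to each field of an
    external row. It is not the code in this pull request: `SimpleType`,
    `ExternalTypeCheck`, `validate`, and `validateRow` are hypothetical names
    invented for illustration, and the three types shown stand in for a much
    larger set.

```scala
// Hypothetical, simplified stand-ins for a schema's data types.
// These are NOT Spark's DataType classes; they exist only for this sketch.
sealed trait SimpleType
case object IntType extends SimpleType
case object StringType extends SimpleType
case object DoubleType extends SimpleType

object ExternalTypeCheck {
  // Returns the value unchanged if its runtime class matches the expected
  // type; otherwise throws. Null is rejected here for simplicity.
  def validate(value: Any, expected: SimpleType): Any = {
    val matches = (value, expected) match {
      case (_: Int, IntType)       => true
      case (_: String, StringType) => true
      case (_: Double, DoubleType) => true
      case _                       => false
    }
    if (!matches) {
      val actual = Option(value).map(_.getClass.getName).getOrElse("null")
      throw new RuntimeException(
        s"$actual is not a valid external type for schema of $expected")
    }
    value
  }

  // Validates every field of a row (modelled as Seq[Any]) against the schema.
  def validateRow(row: Seq[Any], schema: Seq[SimpleType]): Seq[Any] = {
    require(row.length == schema.length, "row and schema differ in length")
    row.zip(schema).map { case (v, t) => validate(v, t) }
  }
}
```

    With this sketch, `ExternalTypeCheck.validateRow(Seq(1, "a"), Seq(IntType,
    StringType))` returns the row unchanged, while passing `Seq("1", "a")` with
    the same schema throws, because the first field is a `String` where an
    `Int` is expected. The check costs one pattern match per field, which fits
    the point above that it runs only once, at the start of the data flow.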



