GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/10650
[SPARK-12696] Backport Dataset Bug fixes
We've fixed a lot of bugs in master, and since this is experimental in 1.6
we should consider backporting the fixes. The only thing that is obviously
risky to me is 0e07ed3; we might try to remove that.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark dataset-backports
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10650.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10650
----
commit 5025da6d8cd99ac793397c8e06ba2dbefa737ea8
Author: Wenchen Fan <[email protected]>
Date: 2015-12-10T07:11:13Z
[SPARK-12252][SPARK-12131][SQL] refactor MapObjects to make it less hacky
In https://github.com/apache/spark/pull/10133 we found that we should
ensure the children of `TreeNode` are all accessible in the `productIterator`,
or the behavior will be very confusing.
In this PR, I try to fix this problem by exposing the `loopVar`.
This also fixes SPARK-12131 which is caused by the hacky `MapObjects`.
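As a rough, hedged sketch of the `productIterator` point (a toy tree, not
Spark's actual `TreeNode`): children that are not constructor arguments are
invisible to traversals that walk `productIterator`.
```scala
// Toy illustration only: generic tree transformations typically walk
// productIterator, so a child stored outside the case-class parameters
// is silently skipped.
sealed trait Node extends Product {
  // Collect all children that are reachable through productIterator.
  def visibleChildren: Seq[Node] =
    productIterator.collect { case n: Node => n }.toSeq
}

case class Leaf(value: Int) extends Node

// `hidden` is not a constructor argument, so productIterator never sees it.
case class Branch(left: Node, right: Node) extends Node {
  val hidden: Node = Leaf(-1)
}

object ProductIteratorDemo extends App {
  val tree = Branch(Leaf(1), Leaf(2))
  // Prints List(Leaf(1), Leaf(2)): the hidden child is skipped, which is
  // exactly the kind of confusing behavior the commit message describes.
  println(tree.visibleChildren)
}
```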
Author: Wenchen Fan <[email protected]>
Closes #10239 from cloud-fan/map-objects.
commit c938a4375932fcf8cbdeefe55449798033d55bb1
Author: Wenchen Fan <[email protected]>
Date: 2015-12-15T00:48:11Z
[SPARK-12274][SQL] WrapOption should not have type constraint for child
I think it was a mistake, and we had not caught it until
https://github.com/apache/spark/pull/10260, which began to check whether the
`fromRowExpression` is resolved.
Author: Wenchen Fan <[email protected]>
Closes #10263 from cloud-fan/encoder.
commit 7d1d389323a0a4732f0df6d7b3af5f83a05dc5eb
Author: gatorsmile <[email protected]>
Date: 2015-12-15T02:33:45Z
[SPARK-12188][SQL][FOLLOW-UP] Code refactoring and comment correction in
Dataset APIs
marmbrus This PR is to address your comment. Thanks for your review!
Author: gatorsmile <[email protected]>
Closes #10214 from gatorsmile/followup12188.
commit 862cb1281be8e2b147634a6bbb70c1ab0c0061f0
Author: Nong Li <[email protected]>
Date: 2015-12-16T00:55:58Z
[SPARK-12271][SQL] Improve error message when Dataset.as[ ] has
incompatible schemas.
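For context, a hedged sketch of the kind of call that hits this error path,
in 1.6 `spark-shell` style (`Person` and the column names are illustrative):
```scala
import sqlContext.implicits._

// Illustrative class; its fields do not match the DataFrame's columns below.
case class Person(name: String, age: Int)

// The columns are ("id", "count"), so as[Person] cannot resolve "name"/"age",
// and the resulting AnalysisException is what this commit improves.
val df = Seq((1, 10L)).toDF("id", "count")
df.as[Person].collect()
```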
Author: Nong Li <[email protected]>
Closes #10260 from nongli/spark-11271.
commit d62701d1782df0c79bf9341ad988551175d1eb2c
Author: Wenchen Fan <[email protected]>
Date: 2015-12-16T21:18:56Z
[SPARK-12320][SQL] throw exception if the number of fields does not line up
for Tuple encoder
Author: Wenchen Fan <[email protected]>
Closes #10293 from cloud-fan/err-msg.
commit 1a12f086acc61de0aa0e947d50bcab93ae235fed
Author: gatorsmile <[email protected]>
Date: 2015-12-16T21:22:34Z
[SPARK-12164][SQL] Decode the encoded values and then display
Based on the suggestions from marmbrus and cloud-fan in
https://github.com/apache/spark/pull/10165, this PR prints the decoded
values (user objects) in `Dataset.show`
```scala
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2),
KryoClassData("c", 3)).toDS()
ds.show(20, false);
```
The current output is like
```
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 97, 2]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 98, 4]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 99, 6]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
After the fix, the output will look like the one below, if and only if users
override the `toString` function in the class `KryoClassData`
```scala
override def toString: String = s"KryoClassData($a, $b)"
```
```
+-------------------+
|value |
+-------------------+
|KryoClassData(a, 1)|
|KryoClassData(b, 2)|
|KryoClassData(c, 3)|
+-------------------+
```
If users do not override the `toString` function, the results will be like
```
+---------------------------------------+
|value |
+---------------------------------------+
|org.apache.spark.sql.KryoClassData68ef|
|org.apache.spark.sql.KryoClassData6915|
|org.apache.spark.sql.KryoClassData693b|
+---------------------------------------+
```
Question: should we add another optional parameter to `show` that decides
whether it displays the hex values or the decoded object values?
Author: gatorsmile <[email protected]>
Closes #10215 from gatorsmile/showDecodedValue.
commit 0e07ed3d3e10f148a2be3fe2d03c8427ffd8a9b1
Author: Wenchen Fan <[email protected]>
Date: 2015-12-21T20:47:07Z
[SPARK-12321][SQL] JSON format for TreeNode (use reflection)
An alternative solution to https://github.com/apache/spark/pull/10295:
instead of implementing a JSON format for every logical/physical plan and
expression, use reflection to implement it in `TreeNode`.
Here I use pre-order traversal to flatten a plan tree into a plan list, and
add an extra field `num-children` to each plan node, so that we can reconstruct
the tree from the list.
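A hedged, simplified sketch of that flatten/rebuild idea (a toy node class;
the real implementation works on `TreeNode` via reflection and emits JSON):
```scala
// Toy plan node: pre-order flattening records each node's child count so the
// tree can be reconstructed from the flat list.
case class PlanNode(name: String, children: Seq[PlanNode])

object TreeFlattenDemo extends App {
  // Pre-order flatten: parent first, then each subtree in order.
  def flatten(n: PlanNode): Seq[(String, Int)] =
    (n.name, n.children.size) +: n.children.flatMap(flatten)

  // Rebuild by consuming the list front to back, recursing once per recorded child.
  def rebuild(entries: List[(String, Int)]): (PlanNode, List[(String, Int)]) = {
    val (name, numChildren) :: rest = entries
    var remaining = rest
    val children = (1 to numChildren).map { _ =>
      val (child, left) = rebuild(remaining)
      remaining = left
      child
    }
    (PlanNode(name, children), remaining)
  }

  val plan = PlanNode("Sort", Seq(PlanNode("Project", Seq(PlanNode("LocalRelation", Nil)))))
  val flat = flatten(plan).toList
  println(flat)                       // List((Sort,1), (Project,1), (LocalRelation,0))
  println(rebuild(flat)._1 == plan)   // true
}
```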
example json:
logical plan tree:
```
[ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Sort",
"num-children" : 1,
"order" : [ [ {
"class" : "org.apache.spark.sql.catalyst.expressions.SortOrder",
"num-children" : 1,
"child" : 0,
"direction" : "Ascending"
}, {
"class" :
"org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "i",
"dataType" : "integer",
"nullable" : true,
"metadata" : { },
"exprId" : {
"id" : 10,
"jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
},
"qualifiers" : [ ]
} ] ],
"global" : false,
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
"num-children" : 1,
"projectList" : [ [ {
"class" : "org.apache.spark.sql.catalyst.expressions.Alias",
"num-children" : 1,
"child" : 0,
"name" : "i",
"exprId" : {
"id" : 10,
"jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
},
"qualifiers" : [ ]
}, {
"class" : "org.apache.spark.sql.catalyst.expressions.Add",
"num-children" : 2,
"left" : 0,
"right" : 1
}, {
"class" :
"org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "a",
"dataType" : "integer",
"nullable" : true,
"metadata" : { },
"exprId" : {
"id" : 0,
"jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
},
"qualifiers" : [ ]
}, {
"class" : "org.apache.spark.sql.catalyst.expressions.Literal",
"num-children" : 0,
"value" : "1",
"dataType" : "integer"
} ], [ {
"class" : "org.apache.spark.sql.catalyst.expressions.Alias",
"num-children" : 1,
"child" : 0,
"name" : "j",
"exprId" : {
"id" : 11,
"jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
},
"qualifiers" : [ ]
}, {
"class" : "org.apache.spark.sql.catalyst.expressions.Multiply",
"num-children" : 2,
"left" : 0,
"right" : 1
}, {
"class" :
"org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "a",
"dataType" : "integer",
"nullable" : true,
"metadata" : { },
"exprId" : {
"id" : 0,
"jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
},
"qualifiers" : [ ]
}, {
"class" : "org.apache.spark.sql.catalyst.expressions.Literal",
"num-children" : 0,
"value" : "2",
"dataType" : "integer"
} ] ],
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.LocalRelation",
"num-children" : 0,
"output" : [ [ {
"class" :
"org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "a",
"dataType" : "integer",
"nullable" : true,
"metadata" : { },
"exprId" : {
"id" : 0,
"jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
},
"qualifiers" : [ ]
} ] ],
"data" : [ ]
} ]
```
Author: Wenchen Fan <[email protected]>
Closes #10311 from cloud-fan/toJson-reflection.
commit b8a8ba40fcebbe37c7ffaeac91e0dd90b7a072a3
Author: Cheng Lian <[email protected]>
Date: 2015-12-22T11:41:44Z
[SPARK-12371][SQL] Runtime nullability check for NewInstance
This PR adds a new expression `AssertNotNull` to ensure non-nullable fields
of products and case classes don't receive null values at runtime.
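A hedged illustration of the scenario the new check guards against, in 1.6
`spark-shell` style (the `Record` class and values are illustrative):
```scala
import sqlContext.implicits._

// `i` is a primitive Int, so Record can never hold a null for it.
case class Record(i: Int, s: String)

// The first column is nullable at the SQL level but maps to the non-nullable
// field Record.i; the runtime check should surface this clearly instead of
// letting a null slip through.
val df = Seq[(java.lang.Integer, String)]((null, "a"), (1, "b")).toDF("i", "s")
df.as[Record].collect()
```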
Author: Cheng Lian <[email protected]>
Closes #10331 from liancheng/dataset-nullability-check.
commit 36adc41a57284f58449e8df68e50d9baf7a042a8
Author: Cheng Lian <[email protected]>
Date: 2015-12-23T02:21:00Z
[SPARK-12478][SQL] Bugfix: Dataset fields of product types can't be null
When creating extractors for product types (i.e. case classes and tuples),
a null check is missing, so we always assume input product values are
non-null.
This PR adds a null check in the extractor expression for product types.
The null check is stripped off for top-level product fields, which are mapped
to the outermost `Row`s, since they can't be null.
Thanks cloud-fan for helping to investigate this issue!
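As a hedged illustration of the fixed case, in 1.6 `spark-shell` style (the
`Inner`/`Outer` classes are illustrative):
```scala
import sqlContext.implicits._

case class Inner(i: Int)
case class Outer(inner: Inner)

// The second element's nested, product-typed field is null; the extractor now
// checks for this instead of assuming the value is non-null.
val ds = Seq(Outer(Inner(1)), Outer(null)).toDS()
ds.collect()
```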
Author: Cheng Lian <[email protected]>
Closes #10431 from liancheng/spark-12478.top-level-null-field.
commit 7ca2d78b2e1d03ed1ffd5855c538529835a8c2c0
Author: gatorsmile <[email protected]>
Date: 2015-12-30T06:28:59Z
[SPARK-12564][SQL] Improve missing column AnalysisException
```
org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input columns text;
```
Let's put a `:` after `columns` and put the columns in `[]` so that they
match the `toString` of DataFrame.
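With that change, the same failure would presumably read something like:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input columns: [text];
```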
Author: gatorsmile <[email protected]>
Closes #10518 from gatorsmile/improveAnalysisExceptionMsg.
commit d42f9a62f61ee602ed56274e44254f91f9ea05a5
Author: Wenchen Fan <[email protected]>
Date: 2015-12-30T18:56:08Z
[SPARK-12495][SQL] use true as default value for propagateNull in
NewInstance
In most cases we should propagate null when calling `NewInstance`, and so far
there is only one case where we should stop null propagation: creating a
product/Java bean. So I think it makes more sense to propagate null by default.
This also fixes a bug when encoding a null array/map, which was first
discovered in https://github.com/apache/spark/pull/10401
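For illustration, a hedged sketch of that null-array case in the 1.6
`spark-shell` (the Tuple1 wrapper is only there so the product encoder is
used; the values are illustrative):
```scala
import sqlContext.implicits._

// A tuple whose Seq-typed field is null: encoding and decoding this kind of
// value previously hit the null-propagation bug described above.
val ds = Seq(Tuple1[Seq[Int]](null), Tuple1(Seq(1, 2))).toDS()
ds.collect()
```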
Author: Wenchen Fan <[email protected]>
Closes #10443 from cloud-fan/encoder.
commit 80773f5f6d6f792989364633ccf9c6589a6c6372
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-01-01T07:48:05Z
[SPARK-11743][SQL] Move the test for arrayOfUDT
A follow-up PR for #9712. Move the test for arrayOfUDT.
Author: Liang-Chi Hsieh <[email protected]>
Closes #10538 from viirya/move-udt-test.
commit b6295c105406cfde46275ab8fef31fc57d1d86ba
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-01-05T18:19:56Z
[SPARK-12438][SQL] Add SQLUserDefinedType support for encoder
JIRA: https://issues.apache.org/jira/browse/SPARK-12438
ScalaReflection lacks support for SQLUserDefinedType. We should add it.
Author: Liang-Chi Hsieh <[email protected]>
Closes #10390 from viirya/encoder-udt.
commit 87fc0ffb67e6538b2b850e0fd36ba6e2c63fc549
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-01-05T20:33:21Z
[SPARK-12439][SQL] Fix toCatalystArray and MapObjects
JIRA: https://issues.apache.org/jira/browse/SPARK-12439
In toCatalystArray, we should look at the data type returned by `dataTypeFor`
instead of `silentSchemaFor` to determine whether the element is a native
type. An obvious problem arises when the element is of type `Option[Int]`:
`silentSchemaFor` will return Int, and we will wrongly recognize the element
as a native type.
There is another problem when using Option as an array element. When we encode
data like Seq(Some(1), Some(2), None) with an encoder, we will use MapObjects
to construct an array for it later. But in MapObjects, we don't check whether
the return value of lambdaFunction is null or not. That causes a bug where the
decoded data for Seq(Some(1), Some(2), None) would be Seq(1, 2, -1) instead of
Seq(1, 2, null).
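A hedged round-trip sketch of that second problem, in 1.6 `spark-shell` style
(the Tuple1 wrapper is only there so the product encoder is used):
```scala
import sqlContext.implicits._

// The array elements are Options; before this fix the decoded data came back
// as Seq(1, 2, -1) instead of Seq(1, 2, null).
val ds = Seq(Tuple1(Seq(Some(1), Some(2), None))).toDS()
ds.collect()
```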
Author: Liang-Chi Hsieh <[email protected]>
Closes #10391 from viirya/fix-catalystarray.
----