GitHub user vofque opened a pull request:
https://github.com/apache/spark/pull/22708
[Spark 21402] Fix java array/map of structs deserialization
When deserializing values of ArrayType with struct elements or MapType with
struct keys/values in java beans, fields of structs get mixed up.
I suggest using struct data types retrieved from resolved input data
instead of inferring them from java beans.
## What changes were proposed in this pull request?
_In case of array:_
MapObjects expression is used to map array elements to java beans. Struct
type of elements is inferred from java bean structure and ends up with mixed up
field order.
I used UnresolvedMapObjects instead of MapObjects, which allows to provide
element type for MapObjects during analysis based on the resolved input data,
not on the java bean.
_In case of map:_
Invocations of "keyArray" and "valueArray" functions are used to extract
arrays of keys and values. Struct type of keys or values is also inferred from
java bean structure and ends up with mixed up field order.
I created a new UnresolvedInvoke expression as a temporary substitution of
Invoke expression while no actual data is available. It allows to provide the
resulting data type during analysis based on the resolved input data, not on
the java bean (similar to UnresolvedMapObjects).
Key and value arrays are then fed to MapObjects expression which I replaced
with UnresolvedMapObjects, just like in case of ArrayType.
Finally I added resolution of UnresolvedInvoke expressions in
Analyzer.resolveExpression method as an additional pattern matching case.
## How was this patch tested?
Created and ran unit tests for two described cases.
Built complete project on travis.
@michalsenkyr @cloud-fan @marmbrus @liancheng
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vofque/spark SPARK-21402
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22708.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22708
----
commit d3578bb3776a79b97f2713f752127849e7910368
Author: Vladimir Kuriatkov <vofque@...>
Date: 2018-10-12T10:49:18Z
Merge pull request #2 from apache/master
Synch with apache:master
commit 7576ba7c9b830e45b454a21978fbb0a4275064a6
Author: Vladimir Kuriatkov <vladimir_kuriatkov@...>
Date: 2018-10-11T12:56:15Z
Java array/map of structs deserialization fixed
commit c9ffcd4de5738dfabb124aae28ef7b79b36e48d6
Author: Vladimir Kuriatkov <vofque@...>
Date: 2018-10-12T11:10:38Z
Merge pull request #3 from vofque/bugfix
Java array/map of structs deserialization fixed
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]