GitHub user yhuai opened a pull request:
https://github.com/apache/spark/pull/4826
[SPARK-5950][SQL]Insert array into a metastore table saved as parquet
should work when using datasource api
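For context, a minimal scenario of the kind this PR fixes might look like the following. This is a hypothetical repro, assuming a `HiveContext` in scope as `sqlContext`; the table name and data are illustrative, not taken from the JIRA:

```scala
// Hypothetical repro: create a metastore table stored as parquet through
// the data source api, then append a DataFrame with an array column.
val df = sqlContext
  .createDataFrame(Seq((1, Seq(1, 2, 3)), (2, Seq.empty[Int])))
  .toDF("key", "values")
df.saveAsTable("arr_table")   // creates the metastore parquet table
df.insertInto("arr_table")    // appending the array column used to fail
```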
This PR contains the following changes:
1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is
a middle ground between DataType's equality check and
`DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it
performs the `equalsIgnoreNullability` check and also verifies that the
nullability of `from` is compatible with that of `to`. For example, the
nullability of `ArrayType(IntegerType, containsNull = false)` is compatible
with that of `ArrayType(IntegerType, containsNull = true)` (for an array
without null values, we can always say it may contain null values). However,
the nullability of `ArrayType(IntegerType, containsNull = true)` is not
compatible with that of `ArrayType(IntegerType, containsNull = false)` (for
an array that may have null values, we cannot say it has no null values). A
sketch of this rule follows the list below.
2. For the `resolved` field of `InsertIntoTable`, use
`equalsIgnoreCompatibleNullability` instead of the plain equality check on
the data types.
3. For our data source write path, when appending data, we always use the
schema of the existing table to write the data. This is important for
parquet, since nullability directly impacts how values are encoded and
decoded. If we do not do this, we may read corrupted values from a set of
parquet files generated with different nullability settings (sketched,
together with point 4, after this list).
4. When creating a new parquet table, we always set
nullable/containsNull/valueContainsNull to true. So, we will never hit the
situation where we cannot append data because containsNull/valueContainsNull
in an Array/Map column of the existing table has already been set to
`false`. This change makes the whole data pipeline more robust.
5. Update the equality check of the JSON relation. Since JSON does not
really care about nullability, `equalsIgnoreNullability` seems a better
choice for comparing the schemata of two JSON tables.
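To make points 1 and 2 concrete, here is a minimal sketch of the asymmetric
compatibility rule, written against Spark's public type API;
`compatibleNullability` is an illustrative name, not the method the PR
actually adds:

```scala
import org.apache.spark.sql.types._

// Returns true when data typed as `from` can safely be written into a
// slot typed as `to`: `to` must be at least as permissive about nulls.
def compatibleNullability(from: DataType, to: DataType): Boolean =
  (from, to) match {
    case (ArrayType(fromElem, fromNull), ArrayType(toElem, toNull)) =>
      (toNull || !fromNull) && compatibleNullability(fromElem, toElem)
    case (MapType(fromKey, fromVal, fromNull), MapType(toKey, toVal, toNull)) =>
      (toNull || !fromNull) &&
        compatibleNullability(fromKey, toKey) &&
        compatibleNullability(fromVal, toVal)
    case (StructType(fromFields), StructType(toFields)) =>
      fromFields.length == toFields.length &&
        fromFields.zip(toFields).forall { case (f, t) =>
          f.name == t.name &&
            (t.nullable || !f.nullable) &&
            compatibleNullability(f.dataType, t.dataType)
        }
    case (fromType, toType) => fromType == toType
  }

// The asymmetry from point 1:
// compatibleNullability(ArrayType(IntegerType, false),
//                       ArrayType(IntegerType, true))   // true
// compatibleNullability(ArrayType(IntegerType, true),
//                       ArrayType(IntegerType, false))  // false
```

In the `resolved` check of `InsertIntoTable`, `from` plays the role of the
incoming data's type and `to` the table's type, which is why the check only
needs to hold in one direction.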
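Points 3 and 4 can be sketched in the same spirit; `forceNullable` and
`schemaForWrite` are hypothetical helpers, not the actual code paths the PR
touches:

```scala
import org.apache.spark.sql.types._

// Point 4: force nullable/containsNull/valueContainsNull to true at every
// level of a schema before a new parquet table is created.
def forceNullable(dt: DataType): DataType = dt match {
  case ArrayType(elem, _) =>
    ArrayType(forceNullable(elem), containsNull = true)
  case MapType(key, value, _) =>
    MapType(forceNullable(key), forceNullable(value), valueContainsNull = true)
  case StructType(fields) =>
    StructType(fields.map(f =>
      f.copy(dataType = forceNullable(f.dataType), nullable = true)))
  case other => other
}

// Point 3: when appending, reuse the existing table's schema so that every
// parquet file in the table is written with the same nullability settings.
def schemaForWrite(appending: Boolean,
                   dataSchema: StructType,
                   existingSchema: Option[StructType]): StructType =
  existingSchema match {
    case Some(tableSchema) if appending => tableSchema
    case _ => forceNullable(dataSchema).asInstanceOf[StructType]
  }
```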
cc @marmbrus @liancheng
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yhuai/spark insertNullabilityCheck
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4826.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4826
----
commit 4ec17fd28a45e42db92144af9cb8a8e7e796eb40
Author: Yin Huai <[email protected]>
Date: 2015-02-27T21:20:00Z
Failed test.
commit 8f19fe520080f064b50dc5885f221889c2612eea
Author: Yin Huai <[email protected]>
Date: 2015-02-27T21:20:57Z
equalsIgnoreCompatibleNullability
commit 9a266114fca979c69468709ed82fbb99fe2595e6
Author: Yin Huai <[email protected]>
Date: 2015-02-27T21:26:33Z
Make InsertIntoTable happy.
commit 0a703e751cf0ebcd481f2f7dd66cc7bdea529f04
Author: Yin Huai <[email protected]>
Date: 2015-02-27T21:38:07Z
Test failed again since we cannot read correct content.
commit bf50d7383e499cbf1e3964a9391d4e9b56607f32
Author: Yin Huai <[email protected]>
Date: 2015-02-28T05:33:43Z
When appending data, we use the schema of the existing table instead of the
schema of the new data.
commit 8bd008b403140b430344d669727410de7b4bc235
Author: Yin Huai <[email protected]>
Date: 2015-02-28T05:34:54Z
nullable, containsNull, and valueContainsNull will be always true for
parquet data.
commit b2c06f8c4e67450650b2a58c5168eb31cd490641
Author: Yin Huai <[email protected]>
Date: 2015-02-28T05:35:30Z
Ignore nullability in JSON relation's equality check.
commit e4f397cea7ec0dc21a714b75a7254bb275319fc2
Author: Yin Huai <[email protected]>
Date: 2015-02-28T05:35:54Z
Unit tests.
----