Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/16030
@brkyvz @maropu Actually, we do allow users to create partitioned tables
whose data schema contains (part of) the partition columns, and there are
[test][1] [cases][2] for this use case.
This use case is mostly useful when you are trying to reorganize an
existing dataset into a partitioned form. Say you have a JSON dataset
containing all the tweets in 2016 and you'd like to partition it by date. By
allowing the data schema to contain partition columns, you can simply put
JSON files of the same date into the same directory. Otherwise, you'd have to
run an ETL job to erase the date column from the dataset, which can be
time-consuming.
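To make the layout concrete, here is a minimal sketch of that scenario (the paths, the `date` column name, and the `text` field are all made up for illustration):

```scala
// Hypothetical layout: each JSON file already contains a `date` column,
// and files are simply grouped into date-named partition directories:
//
//   /data/tweets/date=2016-01-01/part-0.json  -> {"date": "2016-01-01", "text": "..."}
//   /data/tweets/date=2016-01-02/part-0.json  -> {"date": "2016-01-02", "text": "..."}
//
// No ETL pass is needed to strip `date` from the files; on read, Spark
// merges the discovered partition column with the data schema:
val tweets = spark.read.json("/data/tweets")
tweets.printSchema()  // the schema includes `date` exactly once
```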
As for the query @maropu mentioned in the PR description, the query itself
is problematic because it lacks a user-specified schema to override the data
type of the partition column `a`. Ideally, partition discovery would fill in
the correct data type `LongType`, but that's impossible since the directory
path doesn't expose that information. That's why a user-specified schema is
necessary.
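As a sketch, assuming a table partitioned by a column `a` whose path values happen to parse as integers, an explicit schema is what forces the wider type (the path and the second column `b` are hypothetical):

```scala
import org.apache.spark.sql.types._

// Partition discovery alone would infer `a` as IntegerType from directory
// names like /table/a=1/..., so we declare the intended LongType explicitly:
val schema = StructType(Seq(
  StructField("a", LongType),
  StructField("b", StringType)))

val df = spark.read.schema(schema).parquet("/table")
// df.schema now reports `a` as LongType
```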
This query worked in 2.0.2 because of the bug @brkyvz fixed: Spark used to
ignore the data types of partition columns specified in the user-specified
schema. Now that the bug is fixed, this query no longer works as expected.
In short:
1. This isn't a regression; the original query itself is problematic.
2. For this PR, we can either just close it or try to provide a better
error message in the read path (asking the user to provide a user-specified
schema) when:
- A partition column `p` also appears in the data schema, and
- The discovered data type of `p` is different from the data type
specified in the data schema.
An alternative is to override the discovered partition column data type
using the one in the data schema if any. But I'd say this change is probably
too risky at this moment for 2.1.
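A rough sketch of what that read-path check might look like (all names here are illustrative, not actual Spark internals):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.StructType

// Hypothetical validation: for every partition column that also appears in
// the data schema, require that the discovered type matches the declared
// one, and point the user at a user-specified schema when it doesn't.
def checkPartitionColumnTypes(
    dataSchema: StructType,
    discoveredPartitionSchema: StructType): Unit = {
  for (partField <- discoveredPartitionSchema) {
    dataSchema.find(_.name == partField.name).foreach { dataField =>
      if (dataField.dataType != partField.dataType) {
        throw new AnalysisException(
          s"Partition column '${partField.name}' was discovered with type " +
          s"${partField.dataType}, but appears in the data schema with type " +
          s"${dataField.dataType}. Please provide a user-specified schema " +
          "to resolve the conflict.")
      }
    }
  }
}
```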
[1]:
https://github.com/apache/spark/blob/c51c7725944d60738e2bac3e11f6aea74812905c/sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala#L44-L66
[2]:
https://github.com/apache/spark/blob/c51c7725944d60738e2bac3e11f6aea74812905c/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala#L43-L66