GitHub user skambha opened a pull request:
https://github.com/apache/spark/pull/19747
[Spark-22431][SQL] Ensure that the datatype in the schema for the
table/view metadata is parseable by Spark before persisting it
## What changes were proposed in this pull request?
* JIRA: [SPARK-22431](https://issues.apache.org/jira/browse/SPARK-22431): Creating Permanent view with illegal type
**Description:**
- It is possible in Spark SQL to create a permanent view that uses a
nested field with an illegal name.
- For example, if we create the following view:
```create view x as select struct('a' as `$q`, 1 as b) q```
- A simple select fails with the following exception:
```
select * from x;
org.apache.spark.SparkException: Cannot recognize hive type string:
struct<$q:string,b:int>
at
org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
at
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
at
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
...
```
**Issue/Analysis**: Right now, we can create a view with a schema that
cannot be read back by Spark from the Hive metastore. For more details, please
see the analysis and the proposed fix options in comment 1 and comment 2 of the
[SPARK-22431](https://issues.apache.org/jira/browse/SPARK-22431) JIRA.
**Proposed changes**:
- Fix the Hive table/view code path to check whether the schema datatype is
parseable by Spark before persisting it in the metastore. The change is
localized to HiveClientImpl and performs a check similar to the one in
fromHiveColumn. It is fail-fast, so we avoid writing something to the
metastore that we are unable to read back.
- Added new unit tests.
- Ran the SQL-related unit test suites (hive/test, sql/test,
catalyst/test): OK.
With the fix:
```
create view x as select struct('a' as `$q`, 1 as b) q;
17/11/14 19:16:03 ERROR SparkSQLDriver: Failed in [create view x as select
struct('a' as `$q`, 1 as b) q]
org.apache.spark.SparkException: Cannot recognize the data type:
struct<$q:string,b:int>
at
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:907)
at
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:901)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
```
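The fail-fast idea under **Proposed changes** can be illustrated with a small, self-contained model. This is only a sketch in Python, not the code in this patch: `parse_type`, `ToyMetastore`, `UnparseableTypeError`, and the tiny type grammar are all hypothetical stand-ins, and the assumption is simply that the reader accepts only plain identifiers as struct field names. The point it shows is that the same parser used on the read path is run on every column type before anything is written.

```python
import re

# Hypothetical toy grammar: a few primitives plus struct<name:type,...>.
PRIMITIVES = {"string", "int", "bigint", "double", "boolean"}
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


class UnparseableTypeError(ValueError):
    """Raised when a type string cannot be read back by the toy parser."""


def split_top_level(body):
    """Split 'a:int,b:struct<c:int,d:int>' on commas outside angle brackets."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(body[start:i])
            start = i + 1
    parts.append(body[start:])
    return parts


def parse_type(text):
    """Parse a type string; raise UnparseableTypeError if it is not readable."""
    text = text.strip()
    if text in PRIMITIVES:
        return text
    if text.startswith("struct<") and text.endswith(">"):
        fields = []
        for part in split_top_level(text[len("struct<"):-1]):
            name, sep, field_type = part.partition(":")
            if not sep or not IDENT.fullmatch(name.strip()):
                raise UnparseableTypeError(
                    f"Cannot recognize the data type: {text}")
            fields.append((name.strip(), parse_type(field_type)))
        return ("struct", fields)
    raise UnparseableTypeError(f"Cannot recognize the data type: {text}")


class ToyMetastore:
    """Stand-in for a metastore: maps view name -> {column: type string}."""

    def __init__(self):
        self.views = {}

    def create_view(self, name, column_types):
        # Fail fast: verify every column type with the read-path parser
        # before anything is persisted.
        for type_string in column_types.values():
            parse_type(type_string)
        self.views[name] = dict(column_types)
```

In this model, creating a view whose column type is `struct<$q:string,b:int>` raises `UnparseableTypeError` and leaves `views` untouched, while `struct<a:string,b:int>` is stored normally. Reusing the read-path parser for the check keeps the write path and the read path from drifting apart.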
## How was this patch tested?
- New unit tests have been added.
@hvanhovell, please review and share your thoughts/comments. Thank you so
much.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/skambha/spark spark22431
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19747.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19747
----
commit c5824feb40af633ab480b311495ecb7737705c3a
Author: Sunitha Kambhampati <[email protected]>
Date: 2017-11-14T12:38:17Z
Add check to ensure that the schema col datatype is parseable before
persisting to metastore, and add unit tests
commit ce474b7b028bba45c8bd29c31308503626baafbc
Author: Sunitha Kambhampati <[email protected]>
Date: 2017-11-14T16:02:00Z
Add : in error message
commit d5b553438d8740716e402c0210e3d121a48c2c64
Author: Sunitha Kambhampati <[email protected]>
Date: 2017-11-14T16:07:28Z
Remove empty line
commit 626703310aa269a9351a2cf7b6ce23f8e4ab095a
Author: Sunitha Kambhampati <[email protected]>
Date: 2017-11-14T16:20:06Z
remove empty line
----