GitHub user yhuai opened a pull request:
https://github.com/apache/spark/pull/3406
[WIP][SPARK-4476][SQL] Use MapType for dict in json which has unique keys
in each row.
This PR introduces the following changes:
* Users can provide a schema with MapTypes to a JSON dataset.
* Users can use `spark.sql.schema.convertStructToMap` to control whether we
try to convert a `StructType` with too many fields (the number of fields is
greater than or equal to the threshold set by
`spark.sql.schema.convertStructToMapThreshold`) to a `MapType`. The conversion
is attempted only when all fields have the same data type or a tightest common
data type exists for the field data types (found using
`HiveTypeCoercion.findTightestCommonType`). Right now, this conversion will not
be applied to the schema itself (the top level `StructType`). Only inner
structs are candidates for this auto conversion.
* I am introducing one new `private[sql]` method to `DataType`
(`transformUp`) and one new `private[sql]` method to `StructType`
(`transformFieldTypeUp`). These two can be used to transform data types based
on the given partial function. My implementation of struct to map conversion is
based on these two methods. Another example is `nullTypeToStringType` in
`JsonRDD`, which transforms all `NullType`s to `StringType`s for a given
`StructType`.
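
To make the idea behind the last two points concrete, here is a minimal, self-contained sketch. It does not use Spark's classes: the toy `DataType` hierarchy, the `StructToMapSketch` object, and the `structToMap` helper are all hypothetical and only mirror the names used above. It also assumes the simplest case where all fields already share one data type, instead of calling `HiveTypeCoercion.findTightestCommonType`, and it assumes string keys for the resulting map. The actual code in this PR may differ.

```scala
// Toy model of a data type tree. These are NOT Spark's classes.
sealed trait DataType {
  // Apply `rule` bottom-up: rebuild the children first, then try the rule
  // on the rebuilt node. Nodes the rule does not match are kept as-is.
  def transformUp(rule: PartialFunction[DataType, DataType]): DataType = {
    val rebuilt: DataType = this match {
      case StructType(fields) =>
        StructType(fields.map(f => f.copy(dataType = f.dataType.transformUp(rule))))
      case MapType(k, v) => MapType(k.transformUp(rule), v.transformUp(rule))
      case ArrayType(e)  => ArrayType(e.transformUp(rule))
      case leaf          => leaf
    }
    rule.applyOrElse(rebuilt, identity[DataType])
  }
}
case object NullType extends DataType
case object StringType extends DataType
case object LongType extends DataType
final case class ArrayType(elementType: DataType) extends DataType
final case class MapType(keyType: DataType, valueType: DataType) extends DataType
final case class StructField(name: String, dataType: DataType)
final case class StructType(fields: Seq[StructField]) extends DataType {
  // Transform only the field types, so this (top level) struct itself is
  // never handed to the rule.
  def transformFieldTypeUp(rule: PartialFunction[DataType, DataType]): StructType =
    StructType(fields.map(f => f.copy(dataType = f.dataType.transformUp(rule))))
}

object StructToMapSketch {
  // Replace every NullType with StringType.
  val nullTypeToStringType: PartialFunction[DataType, DataType] = {
    case NullType => StringType
  }

  // Hypothetical helper: convert a struct to a map when it has at least
  // `threshold` fields and all fields have exactly the same data type.
  def structToMap(threshold: Int): PartialFunction[DataType, DataType] = {
    case StructType(fields)
        if fields.size >= threshold && fields.map(_.dataType).distinct.size == 1 =>
      MapType(StringType, fields.head.dataType)
  }

  // An inner struct with three LongType fields under a two-field top level.
  val schema: StructType = StructType(Seq(
    StructField("id", LongType),
    StructField("counts", StructType(Seq(
      StructField("a", LongType),
      StructField("b", LongType),
      StructField("c", LongType))))))

  // The inner struct becomes MapType(StringType, LongType); the top level
  // stays a StructType.
  val converted: StructType = schema.transformFieldTypeUp(structToMap(threshold = 3))
}
```

Because the rule is a partial function, each transformation only has to describe the nodes it wants to change, and the traversal logic is shared between the struct to map conversion and `nullTypeToStringType`.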
This is a WIP patch. I will add more unit tests and complete the newly added
ones later.
JIRA: https://issues.apache.org/jira/browse/SPARK-4476
@davies @marmbrus @willb
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yhuai/spark mapTypeInJson
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3406.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3406
----
commit 6d3075ed4ea62888f15675d0eede4d9a16fdfac6
Author: Yin Huai <[email protected]>
Date: 2014-11-19T21:23:57Z
Accept MapType in the schema provided to jsonFile/jsonRDD.
commit fd3b300c0d2d6a30ff56ab94a76ea70a91c7738e
Author: Yin Huai <[email protected]>
Date: 2014-11-21T20:00:35Z
Optionally convert a StructType with too many fields to a MapType.
commit f0c38aa724e83349b152f094210ddd30296e2e53
Author: Yin Huai <[email protected]>
Date: 2014-11-21T20:03:07Z
Merge remote-tracking branch 'upstream/master' into mapTypeInJson
Conflicts:
sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala
----