[
https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-23520.
----------------------------------
Resolution: Incomplete
> Add support for MapType fields in JSON schema inference
> -------------------------------------------------------
>
> Key: SPARK-23520
> URL: https://issues.apache.org/jira/browse/SPARK-23520
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 2.2.1
> Reporter: David Courtinot
> Priority: Major
> Labels: bulk-closed
>
> _InferSchema_ currently does not support inferring _MapType_ fields from JSON
> data, and for a good reason: they are indistinguishable from structs in JSON
> format. In issue
> [SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494], I proposed
> to expose some methods of _InferSchema_ to users so that they can build on
> top of the inference primitives defined by this class. In this issue, I'm
> proposing to give the user more control by letting them specify a set of
> fields that should be forced to _MapType_.
> *Use-case*
> Some JSON datasets contain high-cardinality fields, namely fields whose key
> space is very large. These fields shouldn't be interpreted as _StructType_
> for the following reasons:
> * it's not really what they are. The key space as well as the value space
> may both be infinite, so what best defines the schema of this data is the
> type of the keys and the type of the values, not a struct containing all
> possible key-value pairs.
> * interpreting high-cardinality fields as structs can lead to enormous
> schemata that don't even fit into memory.
> *Proposition*
> We would add a public overloaded signature for _InferSchema.inferField_ that
> accepts a set of field accessors (a class that can represent access to any
> JSON field, including nested ones) for which we do not want to recurse and
> instead force a schema. That would make it possible, in particular, to ask
> that a few fields be inferred as maps rather than structs.
> I am very open to discussing this with people who are better versed in the
> Spark codebase than I am, because I realize my proposition can feel somewhat
> patchy. I'll be more than happy to provide some development effort if we
> manage to sketch a reasonably easy solution.
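
To make the proposition quoted above concrete, here is a minimal sketch in
plain Python. It does not use Spark and does not mirror the internals of
InferSchema. The names infer, merge, infer_all and forced_maps are hypothetical,
as is the choice to represent a field accessor as a tuple of keys. It only
illustrates the idea: when the path of a JSON object is in the forced set, the
inference merges the types of its values into a single map value type instead
of recursing into one struct field per key.

```python
from typing import Any, FrozenSet, Iterable, Tuple

# Hypothetical stand-in for a "field accessor": the tuple of keys leading to a field.
Path = Tuple[str, ...]


def merge(a: Any, b: Any) -> Any:
    """Merge two inferred types, falling back to "string" on conflict."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, str) and isinstance(b, str) and {a, b} == {"long", "double"}:
        return "double"
    if isinstance(a, dict) and isinstance(b, dict) and a.keys() == b.keys():
        kind = next(iter(a))  # "struct", "map" or "array"
        if kind == "struct":
            # Union of the fields, merging the types of fields present on both sides.
            fields = dict(a["struct"])
            for name, field_type in b["struct"].items():
                fields[name] = merge(fields.get(name, "null"), field_type)
            return {"struct": fields}
        return {kind: merge(a[kind], b[kind])}
    return "string"


def infer(value: Any, forced_maps: FrozenSet[Path] = frozenset(), path: Path = ()) -> Any:
    """Infer a type for one parsed JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        element_type: Any = "null"
        for item in value:
            element_type = merge(element_type, infer(item, forced_maps, path))
        return {"array": element_type}
    if isinstance(value, dict):
        if path in forced_maps:
            # Forced map: one value type for all keys, no per-key struct field.
            # Values are inferred under the (assumed) wildcard path segment "*".
            value_type: Any = "null"
            for item in value.values():
                value_type = merge(value_type, infer(item, forced_maps, path + ("*",)))
            return {"map": value_type}
        return {"struct": {k: infer(v, forced_maps, path + (k,)) for k, v in value.items()}}
    raise TypeError(f"unsupported JSON value: {value!r}")


def infer_all(records: Iterable[Any], forced_maps: FrozenSet[Path] = frozenset()) -> Any:
    """Infer one schema for a collection of parsed JSON records."""
    schema: Any = "null"
    for record in records:
        schema = merge(schema, infer(record, forced_maps))
    return schema


records = [
    {"id": 1, "counts": {"user_1": 3, "user_2": 5}},
    {"id": 2, "counts": {"user_3": 7}},
]
as_struct = infer_all(records)
as_map = infer_all(records, forced_maps=frozenset({("counts",)}))
```

With default inference, the struct under "counts" gains one field per distinct
key seen in the data, so it grows with the key space. With ("counts",) forced,
the schema for that field stays a single map type regardless of how many keys
appear.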