[ 
https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23520.
----------------------------------
    Resolution: Incomplete

> Add support for MapType fields in JSON schema inference
> -------------------------------------------------------
>
>                 Key: SPARK-23520
>                 URL: https://issues.apache.org/jira/browse/SPARK-23520
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 2.2.1
>            Reporter: David Courtinot
>            Priority: Major
>              Labels: bulk-closed
>
> _InferSchema_ currently does not support inferring _MapType_ fields from JSON 
> data, and for a good reason: they are indistinguishable from structs in JSON 
> format. In issue 
> [SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494], I proposed 
> to expose some methods of _InferSchema_ to users so that they can build on 
> top of the inference primitives defined by this class. In this issue, I'm 
> proposing to add more control to the user by letting them specify a set of 
> fields that should be forced as _MapType._
> *Use-case*
> Some JSON datasets contain high-cardinality fields, namely fields whose key 
> space is very large. These fields shouldn't be interpreted as _StructType_ 
> for the following reasons:
>  * it's not really what they are. The key space as well as the value space 
> may both be infinite, so what best defines the schema of this data is the 
> type of the keys and the type of the values, not a struct containing all 
> possible key-value pairs.
>  * interpreting high-cardinality fields as structs can lead to enormous 
> schemata that don't even fit into memory.
> *Proposition*
> We would add a public overloaded signature for _InferSchema.inferField_ that 
> accepts a set of field accessors (a class that represents access to any JSON 
> field, including nested ones) for which we do not want to recurse and instead 
> force a schema. That would make it possible, in particular, to request that a 
> few fields be inferred as maps rather than structs.
> I am very open to discussing this with people who are better versed in the 
> Spark codebase than I am, because I realize my proposition can feel somewhat 
> patchy. I'll be more than happy to provide some development effort if we 
> manage to sketch a reasonably easy solution.
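
To make the proposition above concrete, here is a rough, self-contained sketch of the idea in plain Python. It is not Spark code and does not reflect the actual _InferSchema_ API: the names `infer_field`, `merge`, `infer_schema` and `map_paths`, as well as the dict-based type representation, are hypothetical and chosen only for illustration. A "field accessor" is modeled as a tuple of keys, and any object found at a listed path is inferred as a map (string keys, since JSON object keys are always strings) instead of a struct.

```python
from functools import reduce
from typing import Any, FrozenSet, Tuple

Path = Tuple[str, ...]  # hypothetical "field accessor": keys from the root


def merge(a: Any, b: Any) -> Any:
    """Merge two inferred types, widening where they disagree."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, str) and isinstance(b, str):
        # long and double widen to double; anything else falls back to string
        return "double" if {a, b} == {"long", "double"} else "string"
    if isinstance(a, dict) and isinstance(b, dict) and a["type"] == b["type"]:
        if a["type"] == "struct":
            fields = dict(a["fields"])
            for name, t in b["fields"].items():
                fields[name] = merge(fields[name], t) if name in fields else t
            return {"type": "struct", "fields": fields}
        if a["type"] == "map":
            return {"type": "map", "value": merge(a["value"], b["value"])}
        if a["type"] == "array":
            return {"type": "array", "element": merge(a["element"], b["element"])}
    return "string"


def infer_field(value: Any, path: Path = (),
                map_paths: FrozenSet[Path] = frozenset()) -> Any:
    """Infer a type for one JSON value; objects at map_paths become maps."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must come before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        elems = (infer_field(v, path, map_paths) for v in value)
        return {"type": "array", "element": reduce(merge, elems, "null")}
    if path in map_paths:
        # Forced map: do not create one struct field per key; only track
        # the merged value type. "*" stands for "any key" in nested paths.
        vals = (infer_field(v, path + ("*",), map_paths) for v in value.values())
        return {"type": "map", "value": reduce(merge, vals, "null")}
    return {"type": "struct",
            "fields": {k: infer_field(v, path + (k,), map_paths)
                       for k, v in value.items()}}


def infer_schema(records, map_paths: FrozenSet[Path] = frozenset()) -> Any:
    """Infer and merge the types of a collection of parsed JSON records."""
    return reduce(merge, (infer_field(r, (), map_paths) for r in records), "null")


records = [
    {"id": 1, "counts": {"user_a": 3}},
    {"id": 2, "counts": {"user_b": 5, "user_c": 1.5}},
]
as_struct = infer_schema(records)
as_map = infer_schema(records, frozenset({("counts",)}))
```

With default inference, `counts` gains one struct field per distinct key seen in the data, which is what makes high-cardinality fields blow up the schema. With `("counts",)` listed in `map_paths`, the inferred type stays a single map entry regardless of how many keys appear.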



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

