[
https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304577#comment-14304577
]
Corey J. Nolet commented on SPARK-5260:
---------------------------------------
I'm thinking all the schema-specific functions should be pulled out into an
object called JsonSchemaFunctions. The allKeysWithValueTypes() and createSchema()
functions should be exposed via the public API and documented well with respect
to their intended use.
For the project I have that uses these functions, I am actually running
allKeysWithValueTypes() over my entire RDD as it's being saved to a sequence
file, using an Accumulator[Set[(String, DataType)]] to aggregate all the schema
elements for the RDD into a final Set. I can then store off the schema and later
call createSchema() to get the final StructType that can be used with the SQL
table. I also had to write an isConflicted(Set[(String, DataType)]) function to
determine whether a JSON object or JSON array was also encountered as a
primitive type in one of the records in the RDD, or vice versa.
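To illustrate the pattern, here is a minimal, self-contained sketch. The DataType hierarchy below is a stand-in for Spark SQL's real one, mergeSchemas simulates the Accumulator's merge step as a plain set union, and all names here are illustrative, not the actual JsonRDD internals:

```scala
// Stand-in DataType hierarchy (the real one lives in Spark SQL).
sealed trait DataType
case object StringType extends DataType
case object LongType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class StructType(fields: Seq[(String, DataType)]) extends DataType

object JsonSchemaFunctions {
  private def isComplex(dt: DataType): Boolean = dt match {
    case _: ArrayType | _: StructType => true
    case _                            => false
  }

  // Simulates the Accumulator's merge: per-record schema sets are
  // unioned into one final Set of (key, type) pairs.
  def mergeSchemas(perRecord: Seq[Set[(String, DataType)]]): Set[(String, DataType)] =
    perRecord.foldLeft(Set.empty[(String, DataType)])(_ union _)

  // True if some key was seen both as a complex type (JSON object or
  // array) and as a primitive type across the records.
  def isConflicted(schema: Set[(String, DataType)]): Boolean =
    schema.groupBy(_._1).values.exists { entries =>
      entries.map { case (_, dt) => isComplex(dt) }.size > 1
    }
}

// Example: "payload" appears as both a struct and a string -> conflict.
val perRecord = Seq(
  Set[(String, DataType)](("id", LongType), ("payload", StructType(Seq(("x", LongType))))),
  Set[(String, DataType)](("id", LongType), ("payload", StringType))
)
val merged = JsonSchemaFunctions.mergeSchemas(perRecord)
println(JsonSchemaFunctions.isConflicted(merged)) // true
```

Note that two differing primitives for the same key (say, StringType and LongType) are not treated as a conflict here, since primitives can be widened when the final StructType is built; only a complex-vs-primitive clash is unrecoverable.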
> Expose JsonRDD.allKeysWithValueTypes() in a utility class
> ----------------------------------------------------------
>
> Key: SPARK-5260
> URL: https://issues.apache.org/jira/browse/SPARK-5260
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Corey J. Nolet
> Assignee: Corey J. Nolet
>
> I have found this method extremely useful when implementing my own strategy
> for inferring a schema from parsed json. For now, I've actually copied the
> method right out of the JsonRDD class into my own project but I think it
> would be immensely useful to keep the code in Spark and expose it publicly
> somewhere else, like an object called JsonSchema.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)