[
https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219966#comment-15219966
]
Michel Lemay edited comment on SPARK-12436 at 3/31/16 2:53 PM:
---------------------------------------------------------------
Well, right now, I don't see what would be the minimum change necessary to
accommodate the following scenario:
Step 1:
- Read batches of JSON (from streaming) and infer the schema, where fields
meant to hold structs or arrays have null or empty-string values.
- Persist the DataFrames to Parquet files
Step 2:
- Read back the folder of Parquet files with "mergeSchema" set to true
ex:
batch at time 1: { "field": null } => write to parquet
batch at time 2: { "field": [ 999 ] } => write to parquet
Problems arise at several levels:
1) Type inference in JSON internally uses NullType for null and empty strings.
This is because some generators output "" when an object is not defined, i.e.
"" or null can stand in for a StructType. At the end of schema inference, the
JSON data source canonicalizes these leftover NullTypes into a nullable
StringType. Changing this behavior to output NullType seems fair but leads to
the next problem.
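To make point 1 concrete, here is a rough sketch in plain Python. It is a toy
model and not Spark's code: types are plain strings and tuples, nested objects
are left out, and the names infer_value, widen, canonicalize, infer_schema and
the keep_null flag are all made up for illustration. The fallback to string for
otherwise incompatible types is also an assumption of the sketch.

```python
# Toy model only: types are plain strings/tuples, not Spark's DataType classes.
NULL = "null"
STRING = "string"


def infer_value(value):
    """Infer a toy type for one JSON value; null and "" both map to NULL."""
    if value is None or value == "":
        return NULL
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return STRING
    if isinstance(value, list):
        element = NULL
        for item in value:
            element = widen(element, infer_value(item))
        return ("array", element)
    raise TypeError("nested objects are left out of this sketch")


def widen(left, right):
    """Combine two toy types seen for the same field."""
    if left == right:
        return left
    if left == NULL:
        return right
    if right == NULL:
        return left
    if isinstance(left, tuple) and isinstance(right, tuple):
        return ("array", widen(left[1], right[1]))
    if {left, right} == {"long", "double"}:
        return "double"
    return STRING  # assumed fallback for otherwise incompatible types


def canonicalize(toy_type, keep_null=False):
    """Replace leftover NULL placeholders by STRING unless keep_null is set."""
    if toy_type == NULL:
        return NULL if keep_null else STRING
    if isinstance(toy_type, tuple):
        return ("array", canonicalize(toy_type[1], keep_null))
    return toy_type


def infer_schema(records, keep_null=False):
    """Infer a toy schema (field name -> toy type) for one batch of records."""
    schema = {}
    for record in records:
        for name, value in record.items():
            schema[name] = widen(schema.get(name, NULL), infer_value(value))
    return {name: canonicalize(t, keep_null) for name, t in schema.items()}
```

With the two batches from the example, infer_schema([{"field": None}]) gives a
string column while infer_schema([{"field": [999]}]) gives an array column, so
the two files end up with different types for the same field. Passing
keep_null=True models the proposed change of leaving NullType in the result.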
2) Writing to Parquet supports only a limited set of types, and NullType is not
one of them. Either we skip NullType columns entirely, but then we lose the
empty-string values that were converted to NullType earlier in JSON schema
inference, or we keep JSON inference and Parquet output as they are right now
and output nullable strings, but that leads to the next problem.
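The first option in point 2 can be sketched as follows, again in plain Python
with a toy schema (field name -> type string or ("array", element) tuple). The
helpers is_null_like and drop_null_columns are hypothetical names, not an
existing Spark API; the point is only to show where the empty strings go
missing.

```python
NULL = "null"


def is_null_like(toy_type):
    """True for NULL and for arrays whose element type is null-like."""
    if toy_type == NULL:
        return True
    return (
        isinstance(toy_type, tuple)
        and toy_type[0] == "array"
        and is_null_like(toy_type[1])
    )


def drop_null_columns(schema, rows):
    """Return (schema, rows) without the columns a writer could not store."""
    kept = {name: t for name, t in schema.items() if not is_null_like(t)}
    trimmed = [{name: row.get(name) for name in kept} for row in rows]
    return kept, trimmed


# A batch where "field" only ever held "" or null, so it was inferred as NULL.
schema = {"id": "long", "field": NULL}
rows = [{"id": 1, "field": ""}, {"id": 2, "field": None}]
written_schema, written_rows = drop_null_columns(schema, rows)
```

After the call, written_rows no longer distinguishes the row that had "" from
the row that had null, which is the loss described above.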
3) When reading multiple Parquet files, schema merging (in StructType) does not
know how to merge a StringType with any other type and will throw an error. The
only way this could work is if we knew whether a StringType column contains
only nulls and empty strings, and allowed the merge in that case.
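A sketch of what point 3 would need, in plain Python. This is a toy model, not
StructType's merge: the per-column "blank" flag (meaning the column is known to
hold only nulls and empty strings) is purely hypothetical, and I am not aware
of anything that records it today. strict_merge, lenient_merge and
SchemaMergeError are made-up names.

```python
class SchemaMergeError(Exception):
    """Raised when two toy column types cannot be merged."""


def strict_merge(left, right):
    """Merge two toy types; a string never merges with a different type."""
    if left == right:
        return left
    if isinstance(left, tuple) and isinstance(right, tuple):
        return ("array", strict_merge(left[1], right[1]))
    raise SchemaMergeError(f"cannot merge {left!r} with {right!r}")


def lenient_merge(left, left_blank, right, right_blank):
    """Merge two columns given hypothetical 'only null/empty' flags.

    Returns a (merged_type, merged_blank) pair. A string column flagged as
    blank carries no type information, so the other side wins.
    """
    if left == "string" and left_blank:
        return right, right_blank
    if right == "string" and right_blank:
        return left, left_blank
    return strict_merge(left, right), False
```

For the example batches, strict_merge("string", ("array", "long")) raises,
while lenient_merge("string", True, ("array", "long"), False) returns the array
type because the string side is known to be blank.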
Anything else I'm missing?
was (Author: flamingmike):
Well, right now, I don't see what would be the minimum change necessary to
accommodate the following scenario:
Step 1:
- Read batches of JSON (from streaming) and infer the schema, where fields
meant to hold structs or arrays have null or empty-string values.
- Persist the DataFrames to Parquet files
Step 2:
- Read back the folder of Parquet files with "mergeSchema" set to true
ex:
batch at time 1: { "field": null } => write to parquet
batch at time 2: { "field": [ 999 ] } => write to parquet
Problems arise at several levels:
1) Type inference in JSON internally uses NullType for null and empty strings.
This is because some generators output "" when an object is not defined, i.e.
"" or null can stand in for a StructType. At the end of schema inference, the
JSON data source canonicalizes these leftover NullTypes into a nullable
StringType. Changing this behavior to output NullType seems fair but leads to
the next problem.
2) Writing to Parquet supports only a limited set of types, and NullType is not
one of them. Either we skip NullType columns entirely, but then we lose the
empty-string values that were converted to NullType earlier in JSON schema
inference, or we keep JSON inference and Parquet output as they are right now,
and that leads to the next problem.
3) When reading multiple Parquet files, schema merging (in StructType) does not
know how to merge an empty StringType with any other type and will throw an
error. The only way this could work is if we had the concept of a StringType
without values. This would allow us to check whether a StringType is full of
nulls and empty strings and allow the merge.
Anything else I'm missing?
> If all values of a JSON field is null, JSON's inferSchema should return
> NullType instead of StringType
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-12436
> URL: https://issues.apache.org/jira/browse/SPARK-12436
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Reynold Xin
> Labels: starter
>
> Right now, JSON's inferSchema will return {{StringType}} for a field that
> always has null values or an {{ArrayType(StringType)}} for a field that
> always has empty array values. Although this behavior makes writing JSON data
> to other data sources easy (i.e. when writing data, we do not need to remove
> those {{NullType}} or {{ArrayType(NullType)}} columns), it makes it hard for
> downstream applications to reason about the actual schema of the data and thus
> makes schema merging hard. We should allow JSON's inferSchema to return
> {{NullType}} and {{ArrayType(NullType)}}. Also, we need to make sure that when
> we write data out, we remove those {{NullType}} or {{ArrayType(NullType)}}
> columns first.
> Besides {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same
> thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields).
> To finish this work, we need to finish the following sub-tasks:
> * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}.
> * Determine whether we need to add the operation of removing {{NullType}} and
> {{ArrayType(NullType)}} columns from the data that will be written out for all
> data sources (i.e. data sources based on our data source API and Hive tables).
> Or, we should just add this operation for certain data sources (e.g.
> Parquet). For example, we may not need this operation for Hive because Hive
> has VoidObjectInspector.
> * Implement the change and get it merged to Spark master.