[ 
https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219966#comment-15219966
 ] 

Michel Lemay edited comment on SPARK-12436 at 3/31/16 2:51 PM:
---------------------------------------------------------------

Well, right now, I don't see what would be the minimum change necessary to 
accommodate the following scenario:

Step 1:
- Read batches of JSON (from streaming) and infer the schema, with field values 
being null or an empty string where structs and arrays are expected.
- Persist the DataFrames to Parquet files

Step 2: 
- Read back the folder full of Parquet files with "mergeSchema" set to true

ex:
batch at time 1: { "field": null }  => write to parquet
batch at time 2: { "field": [ 999 ] }  => write to parquet
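
To make the example concrete, here is a toy model in plain Python of the inference behavior described in this thread. It is only a sketch, not Spark's code: the type tags are plain strings standing in for Spark's DataType objects, and the helper names (infer_value_type, widen, canonicalize, infer_batch_schema) are made up for illustration. The fallback to StringType on conflicting types is likewise a simplification assumed for this sketch.

```python
import json

# Toy type tags; these strings only stand in for Spark's DataType objects.
NULL, STRING, LONG = "NullType", "StringType", "LongType"


def widen(a, b):
    """Combine two toy types seen for the same field."""
    if a == b:
        return a
    if a == NULL:
        return b
    if b == NULL:
        return a
    if isinstance(a, tuple) and isinstance(b, tuple):
        return ("ArrayType", widen(a[1], b[1]))
    return STRING  # assumed fallback for conflicting types in this sketch


def infer_value_type(value):
    """Infer a toy type for one parsed JSON value."""
    if value is None or value == "":
        return NULL  # null and "" are both treated as "type unknown"
    if isinstance(value, bool):
        return "BooleanType"
    if isinstance(value, int):
        return LONG
    if isinstance(value, float):
        return "DoubleType"
    if isinstance(value, str):
        return STRING
    if isinstance(value, list):
        element = NULL
        for item in value:
            element = widen(element, infer_value_type(item))
        return ("ArrayType", element)
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def canonicalize(data_type):
    """Turn leftover NullTypes into StringType at the end of inference."""
    if data_type == NULL:
        return STRING
    if isinstance(data_type, tuple):
        return ("ArrayType", canonicalize(data_type[1]))
    return data_type


def infer_batch_schema(json_lines):
    """Infer a {field name: toy type} schema for one batch of JSON lines."""
    schema = {}
    for line in json_lines:
        for name, value in json.loads(line).items():
            schema[name] = widen(schema.get(name, NULL), infer_value_type(value))
    return {name: canonicalize(t) for name, t in schema.items()}


schema_t1 = infer_batch_schema(['{ "field": null }'])     # batch at time 1
schema_t2 = infer_batch_schema(['{ "field": [ 999 ] }'])  # batch at time 2
print(schema_t1)
print(schema_t2)
```

In this model the first batch ends up with a string column and the second with an array-of-long column, which is the mismatch the rest of this comment is about.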


Problems arise at different levels:

1) Type inference in JSON internally uses NullType for null and empty strings. 
This is because some generators output "" when an object is not defined, i.e. 
"" or null => StructType. At the end of schema inference, the JSON datasource 
canonicalizes these leftover NullTypes into a nullable StringType. Changing 
this behavior to output NullType seems fair but leads to the next problem. 

2) Writing as Parquet supports only a limited set of types, and NullType is not 
one of them. Either we skip NullType columns entirely, in which case we lose the 
empty-string values that were converted to NullType earlier in JSON schema 
inference, OR we keep JSON inference and Parquet output as they are right 
now, which leads to the next problem.

3) When reading multiple Parquet files, schema merging (in StructType) does not 
know how to merge an empty StringType with any other type and will throw an 
error. The only way this could work is if we had the concept of a StringType 
without values. This would allow us to check whether a StringType column is 
full of nulls and empty strings and, if so, allow the merge.
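
As a rough illustration of point 3, the following plain-Python sketch contrasts a strict merge with a merge that knows about a "StringType without values". Nothing here is Spark API: Column, has_values, merge_strict and merge_lenient are hypothetical names, and the has_values flag is exactly the piece of information that does not exist today.

```python
from typing import NamedTuple

STRING = "StringType"  # toy type tag, not Spark's StringType object


class Column(NamedTuple):
    data_type: object  # toy type tag (string or tuple)
    has_values: bool   # hypothetical: False if only nulls / empty strings were seen


class SchemaMergeError(Exception):
    """Raised when two toy column types cannot be merged."""


def merge_strict(left, right):
    """Merge only identical types, modeling the failure described in point 3."""
    if left.data_type == right.data_type:
        return Column(left.data_type, left.has_values or right.has_values)
    raise SchemaMergeError(f"cannot merge {left.data_type} with {right.data_type}")


def _is_placeholder(column):
    """A StringType column that never carried a real value."""
    return column.data_type == STRING and not column.has_values


def merge_lenient(left, right):
    """Like merge_strict, but a valueless StringType yields to the other type."""
    if left.data_type == right.data_type:
        return Column(left.data_type, left.has_values or right.has_values)
    if _is_placeholder(left):
        return right
    if _is_placeholder(right):
        return left
    raise SchemaMergeError(f"cannot merge {left.data_type} with {right.data_type}")


batch1 = Column(STRING, has_values=False)                   # { "field": null }
batch2 = Column(("ArrayType", "LongType"), has_values=True)  # { "field": [ 999 ] }

try:
    merge_strict(batch1, batch2)
    strict_failed = False
except SchemaMergeError:
    strict_failed = True

merged = merge_lenient(batch1, batch2)
print(strict_failed, merged)
```

The open design question is where has_values would come from, since a Parquet footer schema alone does not say whether a string column ever held a non-empty value.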

Anything else I'm missing?


> If all values of a JSON field is null, JSON's inferSchema should return 
> NullType instead of StringType
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12436
>                 URL: https://issues.apache.org/jira/browse/SPARK-12436
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Reynold Xin
>              Labels: starter
>
> Right now, JSON's inferSchema will return {{StringType}} for a field that 
> always has null values or an {{ArrayType(StringType)}} for a field that 
> always has empty array values. Although this behavior makes writing JSON data 
> to other data sources easy (i.e. when writing data, we do not need to remove 
> those {{NullType}} or {{ArrayType(NullType)}} columns), it makes it hard for 
> downstream applications to reason about the actual schema of the data and thus 
> makes schema merging hard. We should allow JSON's inferSchema to return 
> {{NullType}} and {{ArrayType(NullType)}}. Also, we need to make sure that when 
> we write data out, we remove those {{NullType}} or {{ArrayType(NullType)}} 
> columns first. 
> Besides  {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same 
> thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields). 
> To finish this work, we need to finish the following sub-tasks:
> * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}.
> * Determine whether we need to add the operation of removing {{NullType}} and 
> {{ArrayType(NullType)}} columns from the data that will be written out for all 
> data sources (i.e. data sources based on our data source API and Hive tables). 
> Or, we should just add this operation for certain data sources (e.g. 
> Parquet). For example, we may not need this operation for Hive because Hive 
> has VoidObjectInspector.
> * Implement the change and get it merged to Spark master.
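
The column-removal step from the quoted description could look roughly like the toy Python sketch below. It is not Spark code: schemas are plain dicts of string type tags, and is_writable and drop_unwritable_columns are hypothetical helper names. It also shows the concern raised in point 2 of the comment, namely that empty-string values disappear together with the dropped column. For simplicity it does not look inside non-empty structs.

```python
NULL = "NullType"  # toy type tag, not Spark's NullType object


def is_writable(data_type):
    """Return False for the toy types the quoted issue proposes to strip."""
    if data_type == NULL:
        return False
    if isinstance(data_type, tuple) and data_type[0] == "ArrayType":
        return is_writable(data_type[1])  # ArrayType(NullType) is dropped too
    if isinstance(data_type, dict):
        return len(data_type) > 0         # a struct with 0 fields is dropped
    return True


def drop_unwritable_columns(schema, rows):
    """Remove unwritable columns from both the toy schema and the rows."""
    kept = {name: t for name, t in schema.items() if is_writable(t)}
    pruned_rows = [{name: row.get(name) for name in kept} for row in rows]
    return kept, pruned_rows


schema = {
    "id": "LongType",
    "field": NULL,               # only null / "" seen during inference
    "tags": ("ArrayType", NULL),  # only empty arrays seen
    "meta": {},                  # toy struct with 0 fields
}
rows = [
    {"id": 1, "field": "", "tags": [], "meta": {}},
    {"id": 2, "field": None, "tags": [], "meta": {}},
]

kept_schema, kept_rows = drop_unwritable_columns(schema, rows)
print(kept_schema)
print(kept_rows)
```

Note that the "" in the first row is gone after pruning, so a reader of the written data can no longer tell it apart from a null.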


