[jira] [Updated] (SPARK-44991) Spark json schema inference and fromJson api having inconsistent behavior

nirav patel (Jira) Wed, 30 Aug 2023 12:24:35 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-44991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


nirav patel updated SPARK-44991:
--------------------------------
    Description: 
The Spark JSON reader can infer the data type of a field. I am ingesting 
millions of data points and generating a `DataFrameA`. I notice that schema 
inference marks a field containing many integers and empty strings as a Long. 
That behavior is fine: I do not set `primitivesAsString` because I do want 
primitive type inference. I store `DataFrameA` in `TableA`.

However, this inference behavior is not respected by the `fromJson` or 
`from_json` API when I try to write new data to `TableA`. That is, if I read a 
chunk of new input data using 
`spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')` 
the reader complains that an empty string cannot be cast to Long.

PS: `getStruct(TableA)` is a pseudo-method that returns the `struct` of the 
TableA schema somehow, and `/path/to/more/data` is a new dataset in which some 
records have an empty string as the value of this field.

 

I think that if the reader does not complain about empty strings during schema 
inference, it should not complain when reading without inference either. Maybe 
it could treat an empty string as null, just as schema inference does. An empty 
string is a legal value for a String field but not for a numeric field, so I 
see no reason not to treat it as null. Another option is to add a reader 
option, treatEmptyAsNull, so the behavior is more explicit?
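
To make the proposal concrete, here is a minimal sketch in plain Python (not 
Spark code) of what treating empty strings as null for numeric fields could 
look like. The helper name `empty_to_null` and the sample records are 
hypothetical. The schema dictionary is assumed to follow the JSON layout that 
`StructType.json()` produces. It only illustrates the suggested semantics, not 
how the reader is or would be implemented.

```python
import json

# Hypothetical helper, not part of Spark: it mimics what a `treatEmptyAsNull`
# reader option might do for fields whose declared type is numeric.
NUMERIC_TYPES = {"byte", "short", "integer", "long", "float", "double"}


def empty_to_null(record, schema):
    """Return a copy of `record` where "" in numeric fields becomes None."""
    cleaned = dict(record)
    for field in schema["fields"]:
        field_type = field["type"]
        # Nested types are dicts in this layout; only simple type names match.
        is_numeric = isinstance(field_type, str) and field_type in NUMERIC_TYPES
        if is_numeric and cleaned.get(field["name"]) == "":
            cleaned[field["name"]] = None
    return cleaned


# Assumed schema, written in the JSON layout of StructType.json().
schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "string", "nullable": True, "metadata": {}},
        {"name": "count", "type": "long", "nullable": True, "metadata": {}},
    ],
}

# Made-up input lines: the second record carries "" in the Long field.
lines = ['{"id": "a", "count": 42}', '{"id": "", "count": ""}']
rows = [empty_to_null(json.loads(line), schema) for line in lines]
```

In this sketch the empty string in the String field `id` is kept as-is, while 
the empty string in the Long field `count` becomes null, which is the 
distinction argued for above.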

PS: I marked this as a bug, but it may be better suited as an improvement.

  was:
The Spark JSON reader can infer the data type of a field. I am ingesting 
millions of data points and generating a `DataFrameA`. I notice that schema 
inference marks a field containing many integers and empty strings as a Long. 
That behavior is fine: I do not set `primitivesAsString` because I do want 
primitive type inference. I store `DataFrameA` in `TableA`.

However, this inference behavior is not respected by the `fromJson` or 
`from_json` API when I try to write new data to `TableA`. That is, if I read a 
chunk of input data using 
`spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')` 
the reader complains that an empty string cannot be cast to Long. 
`getStruct(TableA)` is a pseudo-method that returns the `struct` of the TableA 
schema somehow, and `/path/to/more/data` has an empty string as the value of 
this field in some places.

I think that if the reader does not complain about empty strings during schema 
inference, it should not complain when reading without inference either. Maybe 
it could treat an empty string as null, just as schema inference does, or at 
least offer an additional option, treatEmptyAsNull, so the behavior is more 
explicit for application users?

PS: I marked this as a bug, but it may be better suited as an improvement.


> Spark json schema inference and fromJson api having inconsistent behavior
> -------------------------------------------------------------------------
>
>                 Key: SPARK-44991
>                 URL: https://issues.apache.org/jira/browse/SPARK-44991
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.2
>            Reporter: nirav patel
>            Priority: Major
>
> The Spark JSON reader can infer the data type of a field. I am ingesting 
> millions of data points and generating a `DataFrameA`. I notice that schema 
> inference marks a field containing many integers and empty strings as a 
> Long. That behavior is fine: I do not set `primitivesAsString` because I do 
> want primitive type inference. I store `DataFrameA` in `TableA`.
> However, this inference behavior is not respected by the `fromJson` or 
> `from_json` API when I try to write new data to `TableA`. That is, if I read 
> a chunk of new input data using 
> `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')` 
> the reader complains that an empty string cannot be cast to Long.
> PS: `getStruct(TableA)` is a pseudo-method that returns the `struct` of the 
> TableA schema somehow, and `/path/to/more/data` is a new dataset in which 
> some records have an empty string as the value of this field.
>  
> I think that if the reader does not complain about empty strings during 
> schema inference, it should not complain when reading without inference 
> either. Maybe it could treat an empty string as null, just as schema 
> inference does. An empty string is a legal value for a String field but not 
> for a numeric field, so I see no reason not to treat it as null. Another 
> option is to add a reader option, treatEmptyAsNull, so the behavior is more 
> explicit?
> PS: I marked this as a bug, but it may be better suited as an improvement.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
