[ https://issues.apache.org/jira/browse/SPARK-44991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
nirav patel updated SPARK-44991:
--------------------------------
Description:
The Spark JSON reader can infer the datatype of a field. I am ingesting millions of datapoints and generating a `DataFrameA`. What I notice is that schema inference marks the datatype of a field containing many integers and empty strings as a long. That is acceptable behavior: I don't set `primitivesAsString` because I do want primitive type inference. I store `DataFrameA` into `TableA`.

Now, this inference behavior is not respected by the `fromJson`/`from_json` API when I try to write new data to `TableA`. That is, if I read a chunk of input data using `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the reader complains that an empty string cannot be cast to a long. `getStruct(TableA)` is a pseudo method that somehow returns the `struct` of TableA's schema, and `/path/to/more/data` has an empty string as the value of this field for some records.

I think that if the reader doesn't complain about empty strings during schema inference, it shouldn't complain when reading without inference either. Maybe treat empty as null, just as during schema inference, or at least provide an additional option, `treatEmptyAsNull`, so the behavior is more explicit for application users.

ps - I marked this as a bug, but it may be better suited as an improvement.

was:
The Spark JSON reader can infer the datatype of a field. I am ingesting millions of datapoints and generating a `DataFrameA`. What I notice is that schema inference marks the datatype of a field containing many integers and empty strings as a long. That is acceptable behavior: I don't set `primitivesAsString` because I do want proper primitive type inference. I store `DataFrameA` into `TableA`. Now, this inference behavior is not respected by the `fromJson` API when I try to write new data to `TableA` generated using my schema inference approach.
That is, if I read a chunk of input data using `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the reader complains that an empty string cannot be cast to a long. `getStruct(TableA)` is a pseudo method that somehow returns the `struct` of TableA's schema.

> Spark json datasource reader and fromJson api having inconsistent behavior
> --------------------------------------------------------------------------
>
>                 Key: SPARK-44991
>                 URL: https://issues.apache.org/jira/browse/SPARK-44991
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.2
>            Reporter: nirav patel
>            Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org