[ 
https://issues.apache.org/jira/browse/SPARK-44991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nirav patel updated SPARK-44991:
--------------------------------
    Summary: Spark json schema inference and fromJson api having inconsistent 
behavior  (was: Spark json datasource reader and fromJson api having 
inconsistent behavior)

> Spark json schema inference and fromJson api having inconsistent behavior
> -------------------------------------------------------------------------
>
>                 Key: SPARK-44991
>                 URL: https://issues.apache.org/jira/browse/SPARK-44991
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.2
>            Reporter: nirav patel
>            Priority: Major
>
> The Spark JSON reader can infer the data type of a field. I am ingesting
> millions of data points and generating a `DataFrameA`. I notice that schema
> inference marks a field containing many integers and empty strings as Long.
> That is acceptable, since I don't set `primitivesAsString` because I do want
> primitive type inference. I store `DataFrameA` into `TableA`.
> However, this inference behavior is not respected by the `fromJson` or
> `from_json` API when I try to write new data to `TableA`. That is, if I read
> a chunk of input data using
> `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`,
> the reader complains that an empty string cannot be cast to Long.
> `getStruct(TableA)` is a pseudo-method that returns the `struct` of TableA's
> schema, and `/path/to/more/data` has an empty string as the value of this
> field in some records.
> I think that if the reader doesn't complain about empty strings during schema
> inference, it shouldn't complain when reading without inference either. Maybe
> treat empty strings as null, just as during schema inference, or at least
> provide an additional option such as `treatEmptyAsNull` so the behavior is
> explicit for application users.
> PS: I marked this as a bug, but it could be better suited as an improvement.
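
To make the requested behavior concrete, here is a minimal sketch of treating empty strings as null, done as a pre-processing step on JSON Lines text before the data reaches a reader with an explicit schema. It is plain Python and not a Spark API: the helper `empty_strings_to_null`, the field name `count`, and the sample records are all hypothetical, and `treatEmptyAsNull` remains only the reporter's proposed option name.

```python
import json

# Hypothetical helper, not part of Spark. It emulates the proposed
# "treatEmptyAsNull" behavior for selected top-level fields of one JSON record.
def empty_strings_to_null(line, numeric_fields):
    """Return the record as JSON text with "" replaced by null in numeric_fields."""
    record = json.loads(line)
    for field in numeric_fields:
        # Only an exact empty string is rewritten; 0, null and missing keys are untouched.
        if record.get(field) == "":
            record[field] = None
    return json.dumps(record)

# Assumed sample input: "count" stands in for the field stored as Long in TableA.
raw_lines = [
    '{"id": "a", "count": 12}',
    '{"id": "b", "count": ""}',
]

cleaned = [empty_strings_to_null(line, ["count"]) for line in raw_lines]
for line in cleaned:
    print(line)
```

The cleaned lines could then be written back out and read with the table's schema. The sketch only handles top-level fields, so nested structs would need a recursive variant.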


