[ https://issues.apache.org/jira/browse/SPARK-44991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
nirav patel updated SPARK-44991:
--------------------------------
Description:
The Spark JSON reader can infer the datatype of a field. I am ingesting millions of datapoints and generating a `DataFrameA`. What I notice is that schema inference marks the datatype of a field containing many integers and empty strings as a long. That is acceptable behavior: I don't set `primitivesAsString` because I do want primitive type inference. I store `DataFrameA` into `TableA`.

Now, this inference behavior is not respected by the `fromJson`/`from_json` API when I try to write new data to `TableA`. That is, if I read a chunk of input data using `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the reader complains that an empty string cannot be cast to a long. `getStruct(TableA)` is a pseudo method that somehow returns the `struct` of TableA's schema, and `/path/to/more/data` has an empty string as the value of this field for some records.

I think that if the reader doesn't complain about empty strings during schema inference, it shouldn't complain when reading without inference either. Maybe treat empty as null, just as during schema inference, or at least provide an additional option, `treatEmptyAsNull`, so the behavior is more explicit for application users.

ps - I marked this as a bug, but it may be better suited as an improvement.

was:
The Spark JSON reader can infer the datatype of a field. I am ingesting millions of datapoints and generating a `DataFrameA`. What I notice is that schema inference marks the datatype of a field containing many integers and empty strings as a long. That is acceptable behavior: I don't set `primitivesAsString` because I do want proper primitive type inference. I store `DataFrameA` into `TableA`. Now, this inference behavior is not respected by the `fromJson` API when I try to write new data to `TableA` generated using my schema inference approach.
That is, if I read a chunk of input data using `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the reader complains that an empty string cannot be cast to a long. `getStruct(TableA)` is a pseudo method that somehow returns the `struct` of TableA's schema.

> Spark json datasource reader and fromJson api having inconsistent behavior
> --------------------------------------------------------------------------
>
>                 Key: SPARK-44991
>                 URL: https://issues.apache.org/jira/browse/SPARK-44991
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.2
>            Reporter: nirav patel
>            Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org