[
https://issues.apache.org/jira/browse/SPARK-44991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
nirav patel updated SPARK-44991:
--------------------------------
Description:
The Spark JSON reader can infer the data type of a field. I am ingesting
millions of data points and generating a `DataFrameA`. I notice that schema
inference marks a field containing many integers and empty strings as a Long.
That behavior is fine: I don't set `primitivesAsString` because I do want
primitive type inference. I store `DataFrameA` into `TableA`.
Now, this inference behavior is not respected by `fromJson` of the `from_json`
API when I try to write new data to `TableA`. That is, if I read a chunk of new
input data using
`spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`
the reader complains that an empty string cannot be cast to Long.
PS: `getStruct(TableA)` is a pseudo-method that somehow returns the `struct` of
the TableA schema, and `/path/to/more/data` is a new dataset in which some
records have an empty string as the value for this field.
I think that if the reader doesn't complain about empty strings during schema
inference, it shouldn't complain when reading without inference either. Maybe
it could treat an empty string as null, just as schema inference does. An empty
string is a legal value for a String field but not for numeric fields, so I
don't see any reason not to treat it as null. Another option is to add a reader
option, `treatEmptyAsNull`, so the behavior is more explicit.
PS: I marked this as a bug, but it may be better suited as an improvement.
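To make the requested "empty string as null" semantics concrete, here is a
minimal pure-Python sketch of a pre-processing step that could be applied to
the raw JSON lines before they reach a reader with an explicit schema. It is
only an illustration, not Spark's behavior or API: the helper names
`numeric_fields` and `empty_to_null`, the `NUMERIC_TYPES` list, and the example
schema and records are all assumptions made for this sketch. It handles only
top-level fields and assumes the schema is given in the JSON form that
`StructType.json()` produces.

```python
import json

# Simple Spark type names that cannot hold an empty string
# (an assumed, non-exhaustive list for this sketch; decimals are not covered).
NUMERIC_TYPES = {"byte", "short", "integer", "long", "float", "double"}


def numeric_fields(schema_json: str) -> set:
    """Return the names of top-level numeric fields in a StructType JSON string."""
    schema = json.loads(schema_json)
    return {
        field["name"]
        for field in schema["fields"]
        # Nested types are dicts, so only plain string type names are checked.
        if isinstance(field["type"], str) and field["type"] in NUMERIC_TYPES
    }


def empty_to_null(line: str, fields: set) -> str:
    """Rewrite one JSON record so that "" in a numeric field becomes null."""
    record = json.loads(line)
    for name in fields:
        if record.get(name) == "":
            record[name] = None
    return json.dumps(record)


# Hypothetical schema standing in for TableA.
table_a_schema = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "label", "type": "string", "nullable": True, "metadata": {}},
    ],
})

# Hypothetical new records; the first has an empty string in the Long field.
raw_lines = ['{"id": "", "label": ""}', '{"id": 7, "label": "a"}']

fields = numeric_fields(table_a_schema)
cleaned = [empty_to_null(line, fields) for line in raw_lines]
print(cleaned)
```

The empty string in the String field `label` is left alone, since it is a legal
value there; only the numeric field `id` is rewritten to null.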
was:
Spark json reader can infer datatype of a fields. I am ingesting millions of
datapoints and generating a `DataFrameA`. what i notice that Schema inference
mark datatype of a field with tons of Integers and Empty Strings as a Long.
That is an okay behavior as I don't set `primitivesAsString` cause I do want
primitive type inference. I store `DataFrameA` into `TableA`
Now, this inference behavior is not respected by `fromJson` of `from_json` api
when I am trying to write new data on `TableA`. Means, if I read a chunk of
input data into using
`spark.read.schema(fromJson(getStruct(TableA)).json('/path/to/more/data')`
reader complains that EmptyString cannot be cast to Long . `getStruct(TableA)`
is psuedo method that returns `struct` of TableA schema somehow. and
`/path/to/more/data` have some value for this fields as an empty string.
I think if reader doesnt complain about Empty string during schema inference it
shouldn't complain either on reading without inference. May be treat Empty as
Null just like during schema inference or at least give an additional option -
treatEmptyAsNull so it's more explicit for application users?
ps - i marked it as bug but could be more suited as improvements.
> Spark json schema inference and fromJson api having inconsistent behavior
> -------------------------------------------------------------------------
>
> Key: SPARK-44991
> URL: https://issues.apache.org/jira/browse/SPARK-44991
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2
> Reporter: nirav patel
> Priority: Major