[
https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621158#comment-17621158
]
Joris Van den Bossche commented on ARROW-18106:
-----------------------------------------------
cc [~benpharkins]
> [C++] JSON reader ignores explicit schema with default
> unexpected_field_behavior="infer"
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-18106
> URL: https://issues.apache.org/jira/browse/ARROW-18106
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: json
>
> Not 100% sure this is a "bug", but at least I find it an unexpected interplay
> between two options.
> By default, when reading json, we _infer_ the data type of columns, and when
> specifying an explicit schema, we _also_ by default infer the type of columns
> that are not specified in the explicit schema. The docs for
> {{unexpected_field_behavior}}:
> > How JSON fields outside of explicit_schema (if given) are treated
> But it seems that if you specify a schema, and the parsing of one of the
> columns fails according to that schema, we still fall back to this default of
> inferring the data type (while I would have expected an error, since we
> should only infer for columns _not_ in the schema).
> Example code using pyarrow:
> {code:python}
> import io
> import pyarrow as pa
> from pyarrow import json
> s_json = """{"column":"2022-09-05T08:08:46.000"}"""
> opts = json.ParseOptions(explicit_schema=pa.schema([("column",
> pa.timestamp("s"))]))
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> Parsing fails here because the value has milliseconds while the type has
> second resolution ("s"). Instead of raising an error, the explicit schema is
> ignored and we get a string column as result:
> {code}
> pyarrow.Table
> column: string
> ----
> column: [["2022-09-05T08:08:46.000"]]
> {code}
> But when adding {{unexpected_field_behavior="ignore"}}, we actually get the
> expected parse error:
> {code:python}
> opts = json.ParseOptions(explicit_schema=pa.schema([("column",
> pa.timestamp("s"))]), unexpected_field_behavior="ignore")
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> gives
> {code}
> ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't
> parse:2022-09-05T08:08:46.000
> {code}
> This might be specific to timestamps: I don't directly see a similar
> issue with e.g. {{"column": "A"}} and setting the schema to "column" being
> int64.
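
The semantics the report expects can be sketched in plain Python. This is only an illustration and not Arrow's implementation: `read_record` and `parse_timestamp_s` are hypothetical names, and the explicit schema is modeled as a dict mapping field names to converter functions. Fields named in the schema are converted strictly and any failure propagates; only fields outside the schema are subject to `unexpected_field_behavior`.

```python
from datetime import datetime


def parse_timestamp_s(value):
    """Hypothetical converter: ISO-8601 at second resolution only.

    Raises ValueError if the string carries fractional seconds.
    """
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")


def read_record(record, explicit_schema, unexpected_field_behavior="infer"):
    """Toy model of the expected behavior (assumption, not Arrow code).

    explicit_schema: dict of field name -> converter function.
    """
    out = {}
    for name, value in record.items():
        if name in explicit_schema:
            # Field is in the schema: convert strictly, never fall back.
            out[name] = explicit_schema[name](value)
        elif unexpected_field_behavior == "infer":
            # Field is outside the schema: keep the value as parsed
            # (stand-in for type inference).
            out[name] = value
        elif unexpected_field_behavior == "ignore":
            continue
        elif unexpected_field_behavior == "error":
            raise ValueError(f"unexpected field: {name!r}")
        else:
            raise ValueError(
                f"unknown unexpected_field_behavior: {unexpected_field_behavior!r}"
            )
    return out


schema = {"column": parse_timestamp_s}
try:
    read_record({"column": "2022-09-05T08:08:46.000"}, schema)
    failed = False
except ValueError:
    failed = True
```

In this model the record from the report fails under every setting of `unexpected_field_behavior`, because `column` is named in the schema and its conversion is never replaced by inference.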