[ 
https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621158#comment-17621158
 ] 

Joris Van den Bossche commented on ARROW-18106:
-----------------------------------------------

cc [~benpharkins]

> [C++] JSON reader ignores explicit schema with default 
> unexpected_field_behavior="infer"
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-18106
>                 URL: https://issues.apache.org/jira/browse/ARROW-18106
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: json
>
> Not 100% sure this is a "bug", but at least I find it an unexpected interplay 
> between two options.
> By default, when reading json, we _infer_ the data type of columns, and when 
> specifying an explicit schema, we _also_ by default infer the type of columns 
> that are not specified in the explicit schema. The docs for 
> {{unexpected_field_behavior}}:
> > How JSON fields outside of explicit_schema (if given) are treated
> But it seems that if you specify a schema, and the parsing of one of the 
> columns fails according to that schema, we still fall back to this default of 
> inferring the data type (while I would have expected an error, since we 
> should only infer for columns _not_ in the schema).
> Example code using pyarrow:
> {code:python}
> import io
> import pyarrow as pa
> from pyarrow import json
> s_json = """{"column":"2022-09-05T08:08:46.000"}"""
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]))
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> The parsing fails here because the value has milliseconds and the type is 
> "s", but instead of raising an error, the explicit schema is ignored and we 
> get a string column as result:
> {code}
> pyarrow.Table
> column: string
> ----
> column: [["2022-09-05T08:08:46.000"]]
> {code}
> But when adding {{unexpected_field_behavior="ignore"}}, we actually get the 
> expected parse error:
> {code:python}
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]), unexpected_field_behavior="ignore")
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> gives
> {code}
> ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
> parse:2022-09-05T08:08:46.000
> {code}
> It might be that this is specific to timestamps; I don't directly see a 
> similar issue with e.g. {{"column": "A"}} and setting the schema for 
> "column" to int64.
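
To make the expected semantics from the description concrete, here is a minimal pure-Python sketch. It is not Arrow's implementation: `parse_timestamp_s` and `read_record` are hypothetical helpers, and a plain dict of converter functions stands in for the explicit schema. It only illustrates the rule that fields covered by the explicit schema are converted strictly, while the "infer" / "ignore" / "error" choice applies solely to fields outside the schema.

```python
import json
from datetime import datetime


def parse_timestamp_s(value):
    """Hypothetical converter: ISO 8601 string with whole-second resolution."""
    # strptime raises ValueError if fractional seconds such as ".000" remain.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")


def read_record(record, explicit_schema, unexpected_field_behavior="infer"):
    """Hypothetical reader: explicit_schema maps field name -> converter."""
    if unexpected_field_behavior not in ("infer", "ignore", "error"):
        raise ValueError(f"unknown option: {unexpected_field_behavior!r}")
    out = {}
    for name, value in record.items():
        if name in explicit_schema:
            # Field is in the schema: convert strictly, never fall back
            # to inference, regardless of unexpected_field_behavior.
            try:
                out[name] = explicit_schema[name](value)
            except (TypeError, ValueError) as exc:
                raise ValueError(
                    f"could not convert field {name!r}: {value!r}"
                ) from exc
        elif unexpected_field_behavior == "infer":
            # Field is outside the schema: keep the value as JSON parsed it.
            out[name] = value
        elif unexpected_field_behavior == "error":
            raise ValueError(f"unexpected field {name!r}")
        # "ignore": drop the field silently.
    return out


schema = {"column": parse_timestamp_s}
bad = json.loads('{"column": "2022-09-05T08:08:46.000", "extra": 1}')
good = json.loads('{"column": "2022-09-05T08:08:46", "extra": 1}')
```

With this sketch, `read_record(bad, schema)` raises under all three settings because "column" is in the schema, whereas `read_record(good, schema)` keeps "extra" under "infer" and drops it under "ignore".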



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
