[
https://issues.apache.org/jira/browse/ARROW-9020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129001#comment-17129001
]
Joris Van den Bossche commented on ARROW-9020:
----------------------------------------------
Ah, I see that in C++ there is an additional {{UnexpectedFieldBehavior}}
option, which determines what to do with fields that are not in the explicit
schema:
https://github.com/apache/arrow/blob/d00c50a6ca0d88e3458742091c59f0fc5c2fc7de/cpp/src/arrow/json/options.h#L32-L39
It can be set to ignore such fields, but the default is to keep them and
infer their type.
However, this option does not seem to be exposed in Python. Exposing it is
the enhancement needed to enable this functionality from Python.
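
To make the difference between these behaviours concrete, here is a rough
pure-Python model of the semantics. It is only a sketch, not how Arrow
implements it: {{select_fields}} and its {{unexpected}} parameter are
hypothetical names, columns are plain lists rather than Arrow arrays, and the
"error" mode is included on the assumption that the linked enum also offers a
mode that rejects unexpected fields.

```python
import json


def select_fields(json_lines, schema_fields, unexpected="infer"):
    """Hypothetical model of handling fields that are not in an explicit schema.

    unexpected="infer"  keeps unexpected fields (the default described above)
    unexpected="ignore" drops unexpected fields
    unexpected="error"  raises on the first unexpected field
    """
    if unexpected not in ("infer", "ignore", "error"):
        raise ValueError(f"unknown behaviour: {unexpected!r}")

    # Schema fields always come first, even if absent from the data.
    columns = {name: [] for name in schema_fields}
    rows = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

    for i, row in enumerate(rows):
        for name in row:
            if name in columns or unexpected == "ignore":
                continue
            if unexpected == "error":
                raise ValueError(f"unexpected field: {name!r}")
            # "infer": add a new column, back-filled with nulls for earlier rows.
            columns[name] = [None] * i
        for name, values in columns.items():
            values.append(row.get(name))
    return columns
```

With the one-line tmp.json from the report, the default mode reproduces the
reported column order (schema fields first, then the remaining fields from the
file), while {{unexpected="ignore"}} yields the single-column result the
reporter expected.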
> [Python] read_json won't respect explicit_schema in parse_options
> -----------------------------------------------------------------
>
> Key: ARROW-9020
> URL: https://issues.apache.org/jira/browse/ARROW-9020
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.17.1
> Environment: CPython 3.8.2, MacOS Mojave 10.14.6
> Reporter: Felipe Santos
> Priority: Major
> Fix For: 0.17.1
>
>
> I am trying to read a JSON file using an explicit schema, but it looks like
> the schema is ignored. Moreover, if my schema contains a field that is not
> present in the JSON file, the output table contains all the fields in
> the JSON file plus the fields of my schema that were not found in the file.
> A minimal example:
> {code:python}
> import pyarrow as pa
> from pyarrow import json
> # allowing for type inference
> print(json.read_json('tmp.json'))
> # prints:
> # pyarrow.Table
> # foo: string
> # baz: string
> # using an explicit schema that would read only "foo"
> schema = pa.schema([('foo', pa.string())])
> print(json.read_json('tmp.json',
> parse_options=json.ParseOptions(explicit_schema=schema)))
> # prints:
> # pyarrow.Table
> # foo: string
> # baz: string
> # using an explicit schema that would read only "not_a_field",
> # which is not present in the json file
> schema = pa.schema([('not_a_field', pa.string())])
> print(json.read_json('tmp.json',
> parse_options=json.ParseOptions(explicit_schema=schema)))
> # prints:
> # pyarrow.Table
> # not_a_field: string
> # foo: string
> # baz: string
> {code}
> And the tmp.json file looks like:
> {code:json}
> {"foo": "bar", "baz": "1"}
> {code}