[
https://issues.apache.org/jira/browse/ARROW-9020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128996#comment-17128996
]
Joris Van den Bossche commented on ARROW-9020:
----------------------------------------------
[~felipegssantos] thanks for the report!
The Python and C++ documentation seem somewhat contradictory about the purpose
of the keyword.
In Python we have the following, which seems to indicate that fields not in the
schema are ignored and thus absent from the result:
bq. Optional explicit schema (no type inference, ignores other fields).
While C++ only says that it disables type inference on those fields, and says
nothing about ignoring other fields:
bq. Optional explicit schema (disables type inference on those fields)
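To make the two readings concrete, here is a small pure-Python sketch using only the standard library. It is not Arrow's actual implementation: the helper `column_names` and its `ignore_other_fields` flag are made up for illustration, and the column ordering (schema fields first, then other fields in order of first appearance) is an assumption based on the output reported in the issue.

```python
import json


def column_names(json_lines, schema_fields, ignore_other_fields):
    """Return the column names a reader would produce under each reading.

    Hypothetical helper for illustration only; not how Arrow is implemented.
    """
    # Columns from the explicit schema always come first, in schema order.
    columns = list(schema_fields)
    if ignore_other_fields:
        # Reading 1 (Python docstring): fields outside the schema are dropped.
        return columns
    # Reading 2 (C++ docstring): other fields are still read (with type
    # inference) and appended in order of first appearance.
    for line in json_lines.splitlines():
        if not line.strip():
            continue
        for key in json.loads(line):
            if key not in columns:
                columns.append(key)
    return columns


# Same content as the tmp.json file from the report.
data = '{"foo": "bar", "baz": "1"}\n'

print(column_names(data, ["foo"], ignore_other_fields=True))
# ['foo']
print(column_names(data, ["foo"], ignore_other_fields=False))
# ['foo', 'baz']
print(column_names(data, ["not_a_field"], ignore_other_fields=False))
# ['not_a_field', 'foo', 'baz']
```

Under the second reading the sketch produces the same columns as reported in the issue, so the observed behavior appears consistent with the C++ docstring rather than the Python one.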
So I am not really sure what the original intent of the keyword was. cc
[~bkietz] ?
> [Python] read_json won't respect explicit_schema in parse_options
> -----------------------------------------------------------------
>
> Key: ARROW-9020
> URL: https://issues.apache.org/jira/browse/ARROW-9020
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.17.1
> Environment: CPython 3.8.2, MacOS Mojave 10.14.6
> Reporter: Felipe Santos
> Priority: Major
> Fix For: 0.17.1
>
>
> I am trying to read a JSON file using an explicit schema, but it looks like
> the schema is ignored. Moreover, if my schema contains a field not present
> in the JSON file, the output table contains all the fields in the JSON file
> plus the schema fields not found in the file.
> A minimal example:
> {code:python}
> import pyarrow as pa
> from pyarrow import json
> # allowing for type inference
> print(json.read_json('tmp.json'))
> # prints:
> # pyarrow.Table
> # foo: string
> # baz: string
> # using an explicit schema that would read only "foo"
> schema = pa.schema([('foo', pa.string())])
> print(json.read_json('tmp.json',
>                      parse_options=json.ParseOptions(explicit_schema=schema)))
> # prints:
> # pyarrow.Table
> # foo: string
> # baz: string
> # using an explicit schema that would read only "not_a_field",
> # which is not present in the json file
> schema = pa.schema([('not_a_field', pa.string())])
> print(json.read_json('tmp.json',
>                      parse_options=json.ParseOptions(explicit_schema=schema)))
> # prints:
> # pyarrow.Table
> # not_a_field: string
> # foo: string
> # baz: string
> {code}
> And the tmp.json file looks like:
> {code:json}
> {"foo": "bar", "baz": "1"}
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)