[ https://issues.apache.org/jira/browse/ARROW-9020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krisztian Szucs reassigned ARROW-9020:
--------------------------------------
Assignee: Krisztian Szucs
> [Python] read_json won't respect explicit_schema in parse_options
> -----------------------------------------------------------------
>
> Key: ARROW-9020
> URL: https://issues.apache.org/jira/browse/ARROW-9020
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.17.1
> Environment: CPython 3.8.2, MacOS Mojave 10.14.6
> Reporter: Felipe Santos
> Assignee: Krisztian Szucs
> Priority: Major
> Fix For: 1.0.0
>
>
> I am trying to read a JSON file using an explicit schema, but it looks like
> the schema is ignored. Moreover, if my schema contains a field that is not
> present in the JSON file, then the output table contains all the fields in
> the JSON file plus the fields of my schema that were not found in the file.
> A minimal example:
> {code:python}
> import pyarrow as pa
> from pyarrow import json
> # allowing for type inference
> print(json.read_json('tmp.json'))
> # prints:
> # pyarrow.Table
> # foo: string
> # baz: string
> # using an explicit schema that would read only "foo"
> schema = pa.schema([('foo', pa.string())])
> print(json.read_json('tmp.json',
>     parse_options=json.ParseOptions(explicit_schema=schema)))
> # prints:
> # pyarrow.Table
> # foo: string
> # baz: string
> # using an explicit schema that would read only "not_a_field",
> # which is not present in the json file
> schema = pa.schema([('not_a_field', pa.string())])
> print(json.read_json('tmp.json',
>     parse_options=json.ParseOptions(explicit_schema=schema)))
> # prints:
> # pyarrow.Table
> # not_a_field: string
> # foo: string
> # baz: string
> {code}
> And the tmp.json file looks like:
> {code:json}
> {"foo": "bar", "baz": "1"}
> {code}
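
For reference, the behavior the reporter appears to expect can be sketched in plain Python, independent of pyarrow: only the fields named in the explicit schema are kept, and a schema field absent from the file becomes a null column. This is an illustration of the expected semantics only, not Arrow's implementation. The helper name project_json_lines is hypothetical, the null-fill for missing fields is an assumption about the reporter's intent, and json here is the standard-library module, not pyarrow.json.

```python
import json  # standard-library JSON parser (not pyarrow.json)


def project_json_lines(text, field_names):
    """Parse newline-delimited JSON, keeping only the fields in field_names.

    Fields not listed in field_names are dropped. Listed fields that are
    missing from a record are filled with None (assumed behavior).
    Returns a dict mapping each field name to its column, a list of values.
    """
    columns = {name: [] for name in field_names}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for name in field_names:
            columns[name].append(record.get(name))
    return columns


# Same content as tmp.json in the report
data = '{"foo": "bar", "baz": "1"}\n'

print(project_json_lines(data, ["foo"]))          # {'foo': ['bar']}
print(project_json_lines(data, ["not_a_field"]))  # {'not_a_field': [None]}
```

In both calls the "baz" column is absent from the result, which is what the report expected from explicit_schema.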