christiangiessleraracom opened a new issue, #36060:
URL: https://github.com/apache/arrow/issues/36060
### Describe the usage question you have. Please include as many useful details as possible.
The problem is as follows:
If an optional field is defined in the schema but missing from the JSON, it
is still created in the pyarrow table, including all nested fields that are
specified in the schema (with null values).
Is this the intended behaviour, or is there an option so that fields absent
from the JSON are also left out of the table?
Here is a data example. Our production data schema is of course much more
complex and more deeply nested, but this illustrates what I am doing:
Schema (all fields are nullable):
```
field1: struct<subfield1: double, subfield2: double>
field2: timestamp[ms]
field3: double
```
JSON file:
```json
{
  "field3": 123.4
}
```
Python code handling the data:
```python
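import pyarrow.json as pajson
import pyarrow.parquet as pq

# pa_schema, tmp_file_name, dataset_path and hashvalue are defined elsewhere.
read_options = pajson.ReadOptions(block_size=1600000000)
# Parse against the explicit schema; JSON fields not in the schema are ignored.
parse_options = pajson.ParseOptions(
    explicit_schema=pa_schema,
    unexpected_field_behavior="ignore",
)
table = pajson.read_json(
    tmp_file_name, read_options=read_options, parse_options=parse_options
)
pq.write_to_dataset(
    table=table,
    root_path=dataset_path,
    basename_template=hashvalue + ".parquet",
    existing_data_behavior="overwrite_or_ignore",
    schema=pa_schema,
)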
```
Table debug output from evaluating `table` in PyCharm:
```
column_names: ['field1', 'field2', 'field3']
columns:
[
-- is_valid:
[
false
]
-- child 0 type: double
[
null
]
-- child 1 type: double
[
null
]
]
[
[
null
]
]
[
[
123.4
]
]
```
### Component(s)
Python