sidneymau opened a new issue, #37054:
URL: https://github.com/apache/arrow/issues/37054
### Describe the bug, including details regarding any error messages, version, and platform.
When writing Parquet files with a schema such as
```
schema = pa.schema(
    [
        pa.field("field_1", pa.float64(), metadata={"field_1": "description of field 1"}),
        pa.field("field_2", pa.int32(), metadata={"field_2": "description of field 2"}),
    ],
    metadata={
        "pyarrow version": pa.__version__,
    },
)
```
the `write_dataset` function from `pyarrow.dataset` does not preserve the
field-level metadata (but does preserve the schema metadata). In contrast,
a `ParquetWriter` from `pyarrow.parquet` preserves all of the metadata.
I also noticed that the resulting schemas, with and without per-field metadata,
compare as equal, though I do not know if that is intentional.
A brief example follows:
```
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

FORMAT = "parquet"

# define the schema, including optional per-field metadata and general metadata
schema = pa.schema(
    [
        pa.field("field_1", pa.float64(), metadata={"field_1": "description of field 1"}),
        pa.field("field_2", pa.int32(), metadata={"field_2": "description of field 2"}),
    ],
    # metadata={
    #     "pyarrow version": pa.__version__,
    # },
)

# alternatively, add additional metadata to the schema
schema = schema.with_metadata(
    {
        "pyarrow version": pa.__version__,
    },
)

# make some quick data
table = pa.table([[0.1, 0.2, 0.3, 0.4], [1, 2, 3, 4]], schema=schema)

# Write using the dataset API
ds.write_dataset(
    table,
    "dataset_output",
    format=FORMAT,
    schema=schema,
)

# Write using the Parquet API
with pq.ParquetWriter("writer_output.parquet", schema) as writer:
    writer.write(table)

# Read both outputs back and compare their schemas
dataset_schema = ds.dataset("dataset_output").schema
writer_schema = ds.dataset("writer_output.parquet").schema

print("Dataset schema")
print(dataset_schema)
print("")
print("Writer schema")
print(writer_schema)
print("")
print(f"Dataset schema == Writer schema? {dataset_schema == writer_schema}")
```
Output:
```
Dataset schema
field_1: double
field_2: int32
-- schema metadata --
pyarrow version: '12.0.0'

Writer schema
field_1: double
  -- field metadata --
  field_1: 'description of field 1'
field_2: int32
  -- field metadata --
  field_2: 'description of field 2'
-- schema metadata --
pyarrow version: '12.0.0'

Dataset schema == Writer schema? True
```
I could not find any documentation suggesting that this is expected behavior.
### Component(s)
Parquet, Python