[GitHub] [arrow] sidneymau opened a new issue, #37054: write_dataset does not preserve field metadata from schema

via GitHub Mon, 07 Aug 2023 14:50:09 -0700


sidneymau opened a new issue, #37054:
URL: https://github.com/apache/arrow/issues/37054


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When writing parquet files with a schema such as
   ```
   schema = pa.schema(
       [
           pa.field("field_1", pa.float64(), metadata={"field_1": "description of field 1"}),
           pa.field("field_2", pa.int32(), metadata={"field_2": "description of field 2"})
       ],
       metadata={
           "pyarrow version": pa.__version__,
       },
   )
   ```
   the `write_dataset` function from `pyarrow.dataset` does not preserve the 
field-level metadata (but does preserve the schema metadata). In contrast, 
using a `ParquetWriter` from `pyarrow.parquet`, all of the metadata is 
preserved.
   
   I also noticed that the resulting schemas, with and without per-field metadata, 
compare as equal, though I do not know if that is intentional.
   
   A brief example follows:
   ```
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds
   
   
   FORMAT = "parquet"
   
   # define the schema, including optional per-field metadata and general metadata
   schema = pa.schema(
       [
           pa.field("field_1", pa.float64(), metadata={"field_1": "description of field 1"}),
           pa.field("field_2", pa.int32(), metadata={"field_2": "description of field 2"})
       ],
       # metadata={
       #     "pyarrow version": pa.__version__,
       # },
   )
   
   # alternatively, add additional metadata to the schema
   schema = schema.with_metadata(
       {
           "pyarrow version": pa.__version__,
       },
   )
   
   # make some quick data
   table = pa.table([[0.1, 0.2, 0.3, 0.4], [1, 2, 3, 4]], schema=schema)
   
   # Write using the dataset API
   ds.write_dataset(
       table,
       "dataset_output",
       format=FORMAT,
       schema=schema,
   )
   
   # Write using the Parquet API
   with pq.ParquetWriter("writer_output.parquet", schema) as writer:
       writer.write(table)
   
   dataset_schema = ds.dataset("dataset_output").schema
   writer_schema = ds.dataset("writer_output.parquet").schema
   
   print("Dataset schema")
   print(dataset_schema)
   print("")
   print("Writer schema")
   print(writer_schema)
   print("")
   print(f"Dataset schema == Writer schema? {dataset_schema == writer_schema}")
   ```
   output:
   ```
   Dataset schema
   field_1: double
   field_2: int32
   -- schema metadata --
   pyarrow version: '12.0.0'
   
   Writer schema
   field_1: double
     -- field metadata --
     field_1: 'description of field 1'
   field_2: int32
     -- field metadata --
     field_2: 'description of field 2'
   -- schema metadata --
   pyarrow version: '12.0.0'
   
   Dataset schema == Writer schema? True
   ```
   
   I could not find any documentation suggesting that this is the expected behavior.
   
   ### Component(s)
   
   Parquet, Python

