losze1cj opened a new issue #8025: URL: https://github.com/apache/arrow/issues/8025
It seems that the `ParquetWriter` doesn't behave as expected when I pass it a pyarrow schema that comes out of a pyarrow table. Approaching the problem in two ways, I notice unexpected behavior.

If I construct a pyarrow schema from the data types, I get a schema that has no metadata attached:

```
print(pyarrow_schema)
---
sample_column1: string
sample_column2: date32[day]
sample_column3: float
```

Binding that schema to the `ParquetWriter` and to a pyarrow table, then writing it out, gives the expected result: a queryable, well-formed Parquet file. (I'm adding an external schema on top of the file to query it through Redshift Spectrum.)

```python
pqwriter = pq.ParquetWriter(out_io, schema=pyarrow_schema, compression='snappy')
df = pa.Table.from_pandas(df, schema=pyarrow_schema)
pqwriter.write_table(table=df)
```

However, if I create the schema, bind it to the table, and then pass the table's schema to the `ParquetWriter`, the result is a bad Parquet file:

```python
df = pa.Table.from_pandas(df, schema=pyarrow_schema)
pqwriter = pq.ParquetWriter(out_io, schema=df.schema, compression='snappy')
pqwriter.write_table(table=df)
```

What I notice is that the schema coming from the pyarrow table has metadata attached, but removing the metadata does not seem to solve the issue:

```python
df = pa.Table.from_pandas(df, schema=pyarrow_schema)
pqwriter = pq.ParquetWriter(out_io, schema=df.schema.remove_metadata(), compression='snappy')
pqwriter.write_table(table=df)
```

Should I report a bug?

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org