losze1cj opened a new issue #8025: URL: https://github.com/apache/arrow/issues/8025
It seems that the `ParquetWriter` doesn't behave as expected when I pass it a pyarrow schema that comes out of a pyarrow table. Approaching the problem in two ways, I notice unexpected behavior.

If I construct a pyarrow schema from the data types, I get a schema that has no metadata attached:

```
print(pyarrow_schema)
---
sample_column1: string
sample_column2: date32[day]
sample_column3: float
```

Binding that schema to the `ParquetWriter` and to a pyarrow table, then writing it out, gives the expected result: a queryable, well-formed Parquet file. (I'm adding an external schema on top of the file to query it through Redshift Spectrum.)

```python
pqwriter = pq.ParquetWriter(out_io, schema=pyarrow_schema, compression='snappy')
df = pa.Table.from_pandas(df, schema=pyarrow_schema)
pqwriter.write_table(table=df)
```

However, if I create the schema, bind it to the table, and then pass the table's schema to the `ParquetWriter`, the result is a bad Parquet file:

```python
df = pa.Table.from_pandas(df, schema=pyarrow_schema)
pqwriter = pq.ParquetWriter(out_io, schema=df.schema, compression='snappy')
pqwriter.write_table(table=df)
```

What I notice is that the schema coming from the pyarrow table has metadata attached, but removing the metadata does not seem to solve the issue:

```python
df = pa.Table.from_pandas(df, schema=pyarrow_schema)
pqwriter = pq.ParquetWriter(out_io, schema=df.schema.remove_metadata(), compression='snappy')
pqwriter.write_table(table=df)
```

Should I report a bug?

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org