Mr James Kelly created ARROW-10862:
--------------------------------------
Summary: Overriding Parquet schema when loading to SageMaker to
inspect bad data
Key: ARROW-10862
URL: https://issues.apache.org/jira/browse/ARROW-10862
Project: Apache Arrow
Issue Type: New Feature
Components: Python
Environment: AWS SageMaker / S3
Reporter: Mr James Kelly
Following this SO post:
[https://stackoverflow.com/questions/53725691/pyarrow-lib-schema-vs-pyarrow-parquet-schema]
I am attempting to find a way to override the schema of a Parquet file
stored in S3. One date column contains some bad dates, which cause the load of
the entire Parquet file to fail.
I have tried defining a schema, but I get AttributeError: 'pyarrow.lib.schema'
object has no attribute 'to_arrow_schema' -- the same error as in the SO post
above. I have attempted the workaround suggested there by Wes McKinney:
creating a dummy df, saving it to Parquet, reading the schema from it, and
passing that schema in place of the one embedded in the Parquet file:
pq.ParquetDataset(my_filepath, filesystem=s3,
                  schema=dummy_schema).read_pandas().to_pandas()
I get an error message telling me that my schema is different! (It was supposed
to be!)
Can you either allow schemas to be overridden or, even better, suggest a way to
load a Parquet file where some of the values in a date column are '0001-01-01'?
Thanks,
James Kelly
--
This message was sent by Atlassian Jira
(v8.3.4#803005)