Mr James Kelly created ARROW-10862:
--------------------------------------

             Summary: Overriding Parquet schema when loading to SageMaker to 
inspect bad data
                 Key: ARROW-10862
                 URL: https://issues.apache.org/jira/browse/ARROW-10862
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
         Environment: AWS SageMaker / S3
            Reporter: Mr James Kelly


Following this SO post: 
[https://stackoverflow.com/questions/53725691/pyarrow-lib-schema-vs-pyarrow-parquet-schema]

I am attempting to find a way to override the Parquet schema for a Parquet file 
stored in S3. One date column has some bad dates, which cause the load of the 
entire Parquet file to fail.

I have tried defining a schema, but I get AttributeError: 'pyarrow.lib.Schema' 
object has no attribute 'to_arrow_schema' -- the same error as in the SO post above.

I have attempted the workaround suggested by Wes McKinney in the SO post above: 
create a dummy DataFrame, save it to Parquet, read the schema from it, and pass 
that in place of the schema embedded in the Parquet file:

pq.ParquetDataset(my_filepath, filesystem=s3, 
schema=dummy_schema).read_pandas().to_pandas()

I get an error message telling me that my schema is different! (It was supposed 
to be!)

 

Can you either allow schemas to be overridden or, even better, suggest a way to 
load a Parquet file where some of the values in a date column are '0001-01-01'?

 

Thanks,

James Kelly

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)