tooptoop4 opened a new issue, #35569:
URL: https://github.com/apache/arrow/issues/35569

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Because of schema evolution, some of my parquet files have more columns than others.
   I am trying to read them all in one go (there are over 20,000 small files under
   different partition folders) because I want to write their data out into a single
   big file:
   
   ```
   original_files = pq.ParquetDataset("bucket/folder", filesystem=s3_src, partitioning="hive")
   ```
   
   I get this error:
   ```
   ValueError: Schema in partition[year=0, month=0, day=14, hour=9] bucket/folder/year=2023/month=02/day=22/hour=14/redact.snappy.parquet was different.
   ```
   
   How can I achieve this in pyarrow?
   PySpark has an automatic mergeSchema option:
   https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging
   
   ### Component(s)
   
   Parquet, Python

