[ https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119035#comment-17119035 ]
Ira Saktor commented on ARROW-8964:
-----------------------------------
Thank you very much for your fast answer. In the meantime, regarding schema
specification, could you please tell me if there is a way in pyarrow.dataset to
read the schema from a specific parquet file? I could then simply pass it one of
the recent parquet files to infer the schema from.
I know how to load a schema with pyarrow.parquet, however the non-legacy dataset
in pyarrow.parquet doesn't yet support schema specification, so I was hoping to
manage this with pyarrow.dataset, if that's possible.
> Pyarrow: improve reading of partitioned parquet datasets whose schema changed
> -----------------------------------------------------------------------------
>
> Key: ARROW-8964
> URL: https://issues.apache.org/jira/browse/ARROW-8964
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.17.1
> Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow
> 0.17.1
> Reporter: Ira Saktor
> Priority: Major
>
> Hi there, I'm encountering the following issue when reading from HDFS:
>
> *My situation:*
> I have a partitioned parquet dataset in HDFS, whose recent partitions contain
> parquet files with more columns than the older ones. When I try to read data
> using pyarrow.dataset.dataset and filter on recent data, I still get only the
> columns that are also contained in the old parquet files. I'd like to somehow
> merge the schemas, or use the schema of the parquet files from which data ends
> up being loaded.
> *When using:*
> `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning='hive')
> .to_table(filter=my_filter_expression).to_pandas()`
> Is there a way to handle schema changes so that the data read contains all
> columns?
> Everything works fine when I copy the needed parquet files into a separate
> folder, but that is a very inconvenient way of working.
>