[
https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118988#comment-17118988
]
Joris Van den Bossche commented on ARROW-8964:
----------------------------------------------
Existing issue for exposing this in python: ARROW-8221
> Pyarrow: improve reading of partitioned parquet datasets whose schema changed
> -----------------------------------------------------------------------------
>
> Key: ARROW-8964
> URL: https://issues.apache.org/jira/browse/ARROW-8964
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.17.1
> Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow
> 0.17.1
> Reporter: Ira Saktor
> Priority: Major
>
> Hi there, I'm encountering the following issue when reading from HDFS:
>
> *My situation:*
> I have a partitioned parquet dataset in HDFS whose recent partitions contain
> parquet files with more columns than the older ones. When I try to read data
> using pyarrow.dataset.dataset and filter on recent data, I still get only the
> columns that are also contained in the old parquet files. I'd like to somehow
> merge the schemas, or use the schema of the parquet files from which data
> actually ends up being loaded.
> *when using:*
> `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning='hive',
> filters=my_filter_expression).to_table().to_pandas()`
> Is there a way to handle schema changes so that the read data contains all
> columns?
> Everything works fine when I copy the needed parquet files into a separate
> folder, but that is a very inconvenient way of working.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)