[
https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118988#comment-17118988
]
Joris Van den Bossche commented on ARROW-8964:
----------------------------------------------
Existing issue for exposing this in python: ARROW-8221
> Pyarrow: improve reading of partitioned parquet datasets whose schema changed
> -----------------------------------------------------------------------------
>
> Key: ARROW-8964
> URL: https://issues.apache.org/jira/browse/ARROW-8964
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.17.1
> Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow
> 0.17.1
> Reporter: Ira Saktor
> Priority: Major
>
> Hi there, I'm encountering the following issue when reading from HDFS:
>
> *My situation:*
> I have a partitioned parquet dataset in HDFS whose recent partitions contain
> parquet files with more columns than the older ones. When I try to read data
> using pyarrow.dataset.dataset and filter on recent data, I still get only the
> columns that are also contained in the old parquet files. I'd like to somehow
> merge the schemas, or use the schema of the parquet files from which data
> actually ends up being loaded.
> *when using:*
> `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning='hive',
> filters=my_filter_expression).to_table().to_pandas()`
> Is there a way to handle schema changes so that the read data contains all
> columns?
> Everything works fine when I copy the needed parquet files into a separate
> folder, but that is a very inconvenient way of working.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)