[jira] [Commented] (ARROW-8964) Pyarrow: improve reading of partitioned parquet datasets whose schema changed

Ira Saktor (Jira) Thu, 28 May 2020 12:48:14 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119035#comment-17119035
 ]


Ira Saktor commented on ARROW-8964:
-----------------------------------

Thank you very much for your fast answer. In the meantime, regarding schema 
specification, could you please tell me if there is a way in pyarrow.dataset 
to read the schema from a specific parquet file? I could then simply pass it 
one of the recent parquet files to infer the schema from.

I know how to load a schema with pyarrow.parquet; however, the non-legacy 
dataset implementation in pyarrow.parquet doesn't yet support schema 
specification, so I was hoping to manage this with pyarrow.dataset, if that's 
possible.

> Pyarrow: improve reading of partitioned parquet datasets whose schema changed
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-8964
>                 URL: https://issues.apache.org/jira/browse/ARROW-8964
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.17.1
>         Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow 
> 0.17.1
>            Reporter: Ira Saktor
>            Priority: Major
>
> Hi there, I'm encountering the following issue when reading from HDFS:
>  
> *My situation:*
> I have a partitioned parquet dataset in HDFS, whose recent partitions contain 
> parquet files with more columns than the older ones. When I try to read data 
> using pyarrow.dataset.dataset and filter on recent data, I still get only the 
> columns that are also contained in the old parquet files. I'd like to somehow 
> merge the schemas, or use the schema of the parquet files from which the data 
> actually ends up being loaded.
> *when using:*
> `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning = 'hive', 
> filters = my_filter_expression).to_table().to_pandas()`
> Is there a way to handle schema changes so that the data read contains all 
> columns?
> Everything works fine when I copy the needed parquet files into a separate 
> folder; however, that is a very inconvenient way of working. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
