[ 
https://issues.apache.org/jira/browse/ARROW-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9720:
-----------------------------------------
    Fix Version/s: 11.0.0
                       (was: 10.0.0)

> [Python] Long-term fate of pyarrow.parquet.ParquetDataset
> ---------------------------------------------------------
>
>                 Key: ARROW-9720
>                 URL: https://issues.apache.org/jira/browse/ARROW-9720
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset-parquet-legacy, dataset-parquet-read
>             Fix For: 11.0.0
>
>
> The business logic of the python implementation of reading partitioned 
> parquet datasets in {{pyarrow.parquet.ParquetDataset}} has been ported to C++ 
> (ARROW-3764), and has also been optionally enabled in ParquetDataset(..) by 
> using {{use_legacy_dataset=False}} (ARROW-8039).
> But the question still is: what do we do with this class long term? 
> So for users who now do:
> {code}
> dataset = pq.ParquetDataset(...)
> dataset.metadata
> table = dataset.read()
> {code}
> what should they do in the future?  
> Do we keep a class like this (but backed by the pyarrow.dataset 
> implementation), or do we deprecate the class entirely, pointing users to 
> {{dataset = ds.dataset(..., format="parquet")}} ?
> In any case, we should strive to entirely delete the current custom python 
> implementation, but we could keep a {{ParquetDataset}} class that wraps or 
> inherits {{pyarrow.dataset.FileSystemDataset}} and adds some parquet 
> specifics to it (e.g. access to the parquet schema, the common metadata, 
> exposing the parquet-specific constructor keywords more easily, etc.). 
> Features the {{ParquetDataset}} currently has that are not exactly covered by 
> pyarrow.dataset:
> - Partitioning information (the {{.partitions}} attribute)
> - Access to common metadata ({{.metadata_path}}, {{.common_metadata_path}} 
> and {{.metadata}} attributes)
> - ParquetSchema of the dataset



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
