[jira] [Commented] (ARROW-9720) [Python] Long-term fate of pyarrow.parquet.ParquetDataset

Apache Arrow JIRA Bot (Jira) Mon, 02 Jan 2023 09:54:10 -0800


    [ https://issues.apache.org/jira/browse/ARROW-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653672#comment-17653672 ]


Apache Arrow JIRA Bot commented on ARROW-9720:
----------------------------------------------

This issue was last updated over 90 days ago, which may indicate that it is 
no longer being actively worked on. To better reflect the current state, the 
issue is being unassigned per 
[project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
Please feel free to re-take assignment of the issue if it is being actively 
worked on, or if you plan to start that work soon.

> [Python] Long-term fate of pyarrow.parquet.ParquetDataset
> ---------------------------------------------------------
>
>                 Key: ARROW-9720
>                 URL: https://issues.apache.org/jira/browse/ARROW-9720
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset-parquet-legacy, dataset-parquet-read
>             Fix For: 11.0.0
>
>
> The business logic of the Python implementation for reading partitioned 
> parquet datasets in {{pyarrow.parquet.ParquetDataset}} has been ported to C++ 
> (ARROW-3764), and can also be optionally enabled in {{ParquetDataset(..)}} by 
> using {{use_legacy_dataset=False}} (ARROW-8039).
> But the question remains: what do we do with this class long term? 
> So for users who now do:
> {code}
> dataset = pq.ParquetDataset(...)
> dataset.metadata
> table = dataset.read()
> {code}
> what should they do in the future?  
> Do we keep a class like this (but backed by the {{pyarrow.dataset}} 
> implementation), or do we deprecate the class entirely, pointing users to 
> {{dataset = ds.dataset(..., format="parquet")}}?
> In any case, we should strive to entirely delete the current custom python 
> implementation, but we could keep a {{ParquetDataset}} class that wraps or 
> inherits {{pyarrow.dataset.FileSystemDataset}} and adds some parquet 
> specifics to it (e.g. access to the parquet schema, the common metadata, 
> exposing the parquet-specific constructor keywords more easily, ..). 
> Features the {{ParquetDataset}} currently has that are not exactly covered by 
> {{pyarrow.dataset}}:
> - Partitioning information (the {{.partitions}} attribute)
> - Access to common metadata ({{.metadata_path}}, {{.common_metadata_path}} 
> and {{.metadata}} attributes)
> - ParquetSchema of the dataset



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
