[ https://issues.apache.org/jira/browse/ARROW-8039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056496#comment-17056496 ]

Joris Van den Bossche commented on ARROW-8039:
----------------------------------------------

> So the idea would be that read_table would be the function that gets the new 
> Dataset option, and ParquetDataset would be unchanged (just no longer 
> encouraged for use).

That would be an option, yes.

To give some context from dask's usage: they actually do _not_ use the 
ParquetDataset.read() method. They use many other parts of the class (getting 
the partitioning information, the pieces, the metadata, etc.), but they do not 
read the full dataset. For reading, they use ParquetDatasetPiece.read() (see 
the sketch below).
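
Roughly, that usage pattern looks like the following (a sketch of the legacy 
ParquetDataset API as I understand dask uses it, not their actual code; the 
path is a placeholder):

{code:python}
import pyarrow.parquet as pq

dataset = pq.ParquetDataset("/data/partitioned_dataset")

# Inspect the dataset without reading any data
partitions = dataset.partitions   # partitioning information
pieces = dataset.pieces           # list of ParquetDatasetPiece objects
schema = dataset.schema           # schema from the (common) metadata

# Read a single piece (roughly one dask task), not the full dataset
table = pieces[0].read(partitions=partitions)
{code}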

Now, dask's usage may not be typical, so it would be good to check how 
ParquetDataset gets used in some other places.

For example on StackOverflow:
 * The top answer on reading a partitioned dataset from S3 uses 
ParquetDataset().read().to_pandas() (see the sketch after this list): 
[https://stackoverflow.com/questions/45043554/how-to-read-a-list-of-parquet-files-from-s3-as-a-pandas-dataframe-using-pyarrow/48809552#48809552]
 * Some other, less popular S3-related questions also mention ParquetDataset 
with basically the same usage pattern.
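
The pattern in those answers boils down to something like this (bucket and 
path are placeholders):

{code:python}
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
dataset = pq.ParquetDataset("my-bucket/path/to/dataset", filesystem=fs)
df = dataset.read().to_pandas()
{code}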

Now, there might still be value in a two-step approach (creating the dataset, 
then reading it) instead of a one-step {{read_table}}, since the former allows 
some inspection of the dataset before reading it. 
But that is exactly what {{pyarrow.dataset.Dataset}} already provides. So the 
question is whether a ParquetDataset is still needed. 
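
To make that concrete, the two flows look roughly like this with the current 
APIs (path and column name are placeholders):

{code:python}
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# One-step: read everything in a single call
table = pq.read_table("/data/partitioned_dataset")

# Two-step: create the dataset, inspect it, then read only what is needed
dataset = ds.dataset("/data/partitioned_dataset", format="parquet",
                     partitioning="hive")
print(dataset.schema)                         # inspect before reading
table = dataset.to_table(columns=["value"])   # materialize a subset
{code}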

I suppose such a subclass might be useful to directly expose the 
parquet-specific things (e.g. without needing to specify {{format="parquet"}}, 
or by exposing ParquetFileFormatOptions directly in the constructor of 
ParquetDataset, etc). I think something like this _is_ useful, but then I 
would rather model it after dataset.Dataset, to keep it consistent with that 
new API, rather than after parquet.ParquetDataset (which would introduce 
inconsistencies with the new API), maybe just with a {{read()}} method for 
basic backwards compatibility (but otherwise following the API of 
dataset.Dataset).
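
To sketch what I mean (purely hypothetical, the names and signatures are not a 
proposal of the exact API):

{code:python}
import pyarrow.dataset as ds

class ParquetDataset:
    """Hypothetical shim: parquet-specific constructor, dataset.Dataset-style
    methods, and read() kept only for basic backwards compatibility."""

    def __init__(self, path_or_paths, filesystem=None, partitioning="hive",
                 read_options=None):
        # format="parquet" is implied; parquet-specific options are exposed
        # directly instead of requiring a separate FileFormat object
        fmt = ds.ParquetFileFormat(read_options=read_options)
        self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                                   format=fmt, partitioning=partitioning)

    @property
    def schema(self):
        return self._dataset.schema

    def to_table(self, columns=None, filter=None):
        return self._dataset.to_table(columns=columns, filter=filter)

    # backwards-compatible spelling of to_table()
    read = to_table
{code}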

> [C++][Python][Dataset] Assemble a minimal ParquetDataset shim
> -------------------------------------------------------------
>
>                 Key: ARROW-8039
>                 URL: https://issues.apache.org/jira/browse/ARROW-8039
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++ - Dataset, Python
>    Affects Versions: 0.16.0
>            Reporter: Ben Kietzman
>            Assignee: Ben Kietzman
>            Priority: Major
>             Fix For: 0.17.0
>
>
> Assemble a minimal ParquetDataset shim backed by {{pyarrow.dataset.*}}. 
> Replace the existing ParquetDataset with the shim by default, and allow 
> opt-out for users who need the current ParquetDataset.
> This is mostly exploratory, to see which of the python tests fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
