jorisvandenbossche commented on code in PR #13199:
URL: https://github.com/apache/arrow/pull/13199#discussion_r880100079
##########
python/pyarrow/dataset.py:
##########
@@ -622,62 +631,74 @@ def dataset(source, schema=None, format=None,
filesystem=None,
Examples
--------
+ Creating an example pa.Table:
Review Comment:
```suggestion
Creating an example Table:
```
##########
python/pyarrow/dataset.py:
##########
@@ -155,46 +155,55 @@ def partitioning(schema=None, field_names=None,
flavor=None,
Specify the Schema for paths like "/2009/June":
- >>> partitioning(pa.schema([("year", pa.int16()), ("month", pa.string())]))
+ >>> import pyarrow as pa
+ >>> import pyarrow.dataset as ds
+ >>> ds.partitioning(pa.schema([("year", pa.int16()),
+ ... ("month", pa.string())]))
+ <pyarrow._dataset.DirectoryPartitioning object at ...>
Review Comment:
How useful is it to see those outputs? (We could try to make the repr more
informative, though.)
If we think the output is not useful, we could also do something like
`part = ..` so we don't have to show the output.
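
For example, something like the following (a sketch; the name `part` is just
a placeholder):

```python
>>> part = ds.partitioning(pa.schema([("year", pa.int16()),
...                                   ("month", pa.string())]))
```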
##########
python/pyarrow/dataset.py:
##########
@@ -622,62 +631,74 @@ def dataset(source, schema=None, format=None,
filesystem=None,
Examples
--------
+ Creating an example pa.Table:
+
+ >>> import pyarrow as pa
+ >>> import pyarrow.parquet as pq
+ >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+ ... 'n_legs': [2, 2, 4, 4, 5, 100],
+ ... 'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+ ... "Brittle stars", "Centipede"]})
+ >>> pq.write_table(table, "file.parquet")
+
Opening a single file:
- >>> dataset("path/to/file.parquet", format="parquet")
+ >>> import pyarrow.dataset as ds
+ >>> dataset = ds.dataset("file.parquet", format="parquet")
+ >>> dataset.to_table()
+ pyarrow.Table
+ year: int64
+ n_legs: int64
+ animal: string
+ ----
+ year: [[2020,2022,2021,2022,2019,2021]]
+ n_legs: [[2,2,4,4,5,100]]
+ animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
Opening a single file with an explicit schema:
- >>> dataset("path/to/file.parquet", schema=myschema, format="parquet")
+ >>> myschema = pa.schema([
+ ... ('n_legs', pa.int64()),
+ ... ('animal', pa.string())])
+ >>> dataset = ds.dataset("file.parquet", schema=myschema, format="parquet")
+ >>> dataset.to_table()
+ pyarrow.Table
+ n_legs: int64
+ animal: string
+ ----
+ n_legs: [[2,2,4,4,5,100]]
+ animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
Opening a dataset for a single directory:
- >>> dataset("path/to/nyc-taxi/", format="parquet")
- >>> dataset("s3://mybucket/nyc-taxi/", format="parquet")
+ >>> ds.write_dataset(table, "partitioned_dataset", format="parquet",
+ ... partitioning=['year'])
+ >>> dataset = ds.dataset("partitioned_dataset", format="parquet")
+ >>> dataset.to_table()
+ pyarrow.Table
+ n_legs: int64
+ animal: string
+ ----
+ n_legs: [[5],[2],[4,100],[2,4]]
+ animal: [["Brittle stars"],["Flamingo"],...["Parrot","Horse"]]
+
+ >>> # Single directory from a S3 bucket
+ >>> # dataset("s3://mybucket/nyc-taxi/", format="parquet")
Opening a dataset from a list of relative local paths:
- >>> dataset([
- ... "part0/data.parquet",
- ... "part1/data.parquet",
- ... "part3/data.parquet",
+ >>> dataset = ds.dataset([
+ ... "partitioned_dataset/2019/part-0.parquet",
+ ... "partitioned_dataset/2020/part-0.parquet",
+ ... "partitioned_dataset/2021/part-0.parquet",
... ], format='parquet')
-
- With filesystem provided:
-
- >>> paths = [
- ... 'part0/data.parquet',
- ... 'part1/data.parquet',
- ... 'part3/data.parquet',
- ... ]
- >>> dataset(paths, filesystem='file:///directory/prefix', format='parquet')
Review Comment:
> There are some examples I removed from ds.dataset that I will add back as
> a follow-up when I work on the docstring examples for
> [Filesystems](https://issues.apache.org/jira/browse/ARROW-16091)

I would keep them here for now, but add a `# doctest: +SKIP` to the
lines that wouldn't yet work.
(It's certainly fine to handle them only in a follow-up.)
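
For illustration, keeping the removed filesystem example with a skip
directive could look something like this (a sketch; the paths are the
placeholders from the removed lines):

```python
>>> paths = [
...     'part0/data.parquet',
...     'part1/data.parquet',
...     'part3/data.parquet',
... ]
>>> dataset(paths, filesystem='file:///directory/prefix',
...         format='parquet')  # doctest: +SKIP
```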
##########
python/pyarrow/dataset.py:
##########
@@ -622,62 +631,74 @@ def dataset(source, schema=None, format=None,
filesystem=None,
Examples
--------
+ Creating an example pa.Table:
+
+ >>> import pyarrow as pa
+ >>> import pyarrow.parquet as pq
+ >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+ ... 'n_legs': [2, 2, 4, 4, 5, 100],
+ ... 'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+ ... "Brittle stars", "Centipede"]})
+ >>> pq.write_table(table, "file.parquet")
Review Comment:
Wondering: do those files end up in the directory from which you are
running pytest / where dataset.py is located?
If so, we will probably want to clean them up in some way, so you
don't end up with a bunch of files in your repo from running the doctests.
(It might be possible to change Python's current working directory to a
temporary directory in conftest.py?)
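
For example, a minimal sketch of such a fixture, using pytest's built-in
`tmp_path` and `monkeypatch` fixtures (the fixture name is hypothetical):

```python
# conftest.py (sketch): run each collected test in a temporary directory
import pytest

@pytest.fixture(autouse=True)
def _doctest_cwd(tmp_path, monkeypatch):
    # tmp_path is a per-test temporary directory created by pytest;
    # monkeypatch.chdir changes the cwd and restores it afterwards,
    # so files written by doctests never land in the repo
    monkeypatch.chdir(tmp_path)
```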
##########
python/pyarrow/dataset.py:
##########
@@ -622,62 +631,74 @@ def dataset(source, schema=None, format=None,
filesystem=None,
Examples
--------
+ Creating an example pa.Table:
+
+ >>> import pyarrow as pa
+ >>> import pyarrow.parquet as pq
+ >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+ ... 'n_legs': [2, 2, 4, 4, 5, 100],
+ ... 'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+ ... "Brittle stars", "Centipede"]})
+ >>> pq.write_table(table, "file.parquet")
+
Opening a single file:
- >>> dataset("path/to/file.parquet", format="parquet")
+ >>> import pyarrow.dataset as ds
+ >>> dataset = ds.dataset("file.parquet", format="parquet")
+ >>> dataset.to_table()
+ pyarrow.Table
+ year: int64
+ n_legs: int64
+ animal: string
+ ----
+ year: [[2020,2022,2021,2022,2019,2021]]
+ n_legs: [[2,2,4,4,5,100]]
+ animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
Opening a single file with an explicit schema:
- >>> dataset("path/to/file.parquet", schema=myschema, format="parquet")
+ >>> myschema = pa.schema([
+ ... ('n_legs', pa.int64()),
+ ... ('animal', pa.string())])
+ >>> dataset = ds.dataset("file.parquet", schema=myschema, format="parquet")
+ >>> dataset.to_table()
+ pyarrow.Table
+ n_legs: int64
+ animal: string
+ ----
+ n_legs: [[2,2,4,4,5,100]]
+ animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
Opening a dataset for a single directory:
- >>> dataset("path/to/nyc-taxi/", format="parquet")
- >>> dataset("s3://mybucket/nyc-taxi/", format="parquet")
Review Comment:
For those two cases, it might also be fine to just add ` # doctest: +SKIP`
(seeing the output doesn't add much value?)
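
That is, keep the lines but skip them when running the doctests, something
like:

```python
>>> dataset("path/to/nyc-taxi/", format="parquet")  # doctest: +SKIP
>>> dataset("s3://mybucket/nyc-taxi/", format="parquet")  # doctest: +SKIP
```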
##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2462,26 +2460,26 @@ def read_pandas(self, **kwargs):
Examples
--------
- Generate an example dataset:
+ Generate an example parquet file:
>>> import pyarrow as pa
- >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
- ... 'n_legs': [2, 2, 4, 4, 5, 100],
- ... 'animal': ["Flamingo", "Parrot", "Dog", "Horse",
- ... "Brittle stars", "Centipede"]})
+ >>> import pandas as pd
+ >>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+ ... 'n_legs': [2, 2, 4, 4, 5, 100],
+ ... 'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+ ... "Brittle stars", "Centipede"]})
+ >>> table = pa.Table.from_pandas(df)
>>> import pyarrow.parquet as pq
- >>> pq.write_to_dataset(table, root_path='dataset_v2_read_pandas',
- ... partition_cols=['year'],
- ... use_legacy_dataset=False)
- >>> dataset = pq._ParquetDatasetV2('dataset_v2_read_pandas/')
+ >>> pq.write_table(table, 'table_V2.parquet')
+ >>> dataset = pq._ParquetDatasetV2('table_V2.parquet')
Review Comment:
```suggestion
>>> dataset = pq.ParquetDataset('table_V2.parquet')
```
(we should show the public API)