[ 
https://issues.apache.org/jira/browse/ARROW-15270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470020#comment-17470020
 ] 

Will Jones commented on ARROW-15270:
------------------------------------

{quote}
The files were saved only with month as a field.
{quote}

The datasets API is meant to populate the partition fields when it loads your 
data. The problem is that you are using "directory partitioning", so you need 
to pass in the partitioning schema when loading. (Hive-style partitions, like 
{{year=2021/}}, don't require this.)

Try this:

{code:python}
import pyarrow.dataset as ds
import pyarrow as pa

table = pa.table({'month': [1, 2, 3, 4, 5],
                  'year': [2001, 2002, 2003, 2004, 2004]})

partitioning = ds.partitioning(pa.schema([("year", pa.int64())]))

ds.write_dataset(
    data=table, 
    base_dir="partitioned", 
    format="ipc", 
    partitioning=partitioning
)

my_ds = ds.dataset(
    "partitioned", 
    format="ipc", 
    partitioning=partitioning
)
my_ds.to_table(filter=(ds.field("year") == 2001) | (ds.field("year") == 2002))
{code}
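
If you would rather not repeat the schema at read time, the same data can be written with hive-flavored partitioning, where the directory names carry the field name (e.g. {{year=2001/}}). This is only a minimal sketch; {{partitioned_hive}} is an arbitrary output path, not anything from the original report:

{code:python}
import pyarrow.dataset as ds
import pyarrow as pa

table = pa.table({'month': [1, 2, 3, 4, 5],
                  'year': [2001, 2002, 2003, 2004, 2004]})

# The "hive" flavor encodes the field name into each directory name,
# so the partition field can be discovered when the dataset is read back.
hive_partitioning = ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive")

ds.write_dataset(
    data=table,
    base_dir="partitioned_hive",
    format="ipc",
    partitioning=hive_partitioning
)

# No explicit partitioning schema needed here; "hive" is enough.
my_ds = ds.dataset("partitioned_hive", format="ipc", partitioning="hive")
my_ds.to_table(filter=(ds.field("year") == 2001) | (ds.field("year") == 2002))
{code}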

> [Python] Make dataset.dataset() accept a list of directories as source
> ----------------------------------------------------------------------
>
>                 Key: ARROW-15270
>                 URL: https://issues.apache.org/jira/browse/ARROW-15270
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Reynaldo Rojas Zelaya
>            Priority: Major
>
> Currently, if I partition a dataset as shown below, a directory 
> {{partitioned}} is created along with {{2001, 2002, 2003, 2004}} as 
> subdirectories. But then, if I wanted to only read the partitions 
> corresponding to years {{2001, 2002}}, I wouldn't have a straightforward 
> way of doing so.
> {code:python}
> >>> table = pa.table({'month': [1, 2, 3, 4, 5], 'year': [2001, 2002, 2003, 2004, 2004]})
> >>> table
> pyarrow.Table
> month: int64
> year: int64
> ----
> month: [[1,2,3,4,5]]
> year: [[2001,2002,2003,2004,2004]]
> >>> ds.write_dataset(data=table, base_dir="partitioned", format="ipc", partitioning=ds.partitioning(pa.schema([("year", pa.int64())])))
> >>> f = fs.SubTreeFileSystem(base_path='partitioned', base_fs=fs.LocalFileSystem())
> >>> ds.dataset(source=['2001','2002'], filesystem=f, format="ipc")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/userr2232/Documents/misc/first-julia-nn/lib/python3.9/site-packages/pyarrow/dataset.py", line 683, in dataset
>     return _filesystem_dataset(source, **kwargs)
>   File "/Users/userr2232/Documents/misc/first-julia-nn/lib/python3.9/site-packages/pyarrow/dataset.py", line 423, in _filesystem_dataset
>     fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
>   File "/Users/userr2232/Documents/misc/first-julia-nn/lib/python3.9/site-packages/pyarrow/dataset.py", line 344, in _ensure_multiple_sources
>     raise IsADirectoryError(
> IsADirectoryError: Path 2001 points to a directory, but only file paths are 
> supported. To construct a nested or union dataset pass a list of dataset 
> objects instead.{code}
> Since {{dataset.write_dataset()}} produces this file structure, maybe 
> {{dataset.dataset()}} should accept a list of directories as {{source}}?
>  
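
As the error message notes, the current workaround is to build one dataset per directory and pass the list of dataset objects. A rough sketch of that approach, reusing the {{SubTreeFileSystem}} from the report, might look like:

{code:python}
import pyarrow.dataset as ds
from pyarrow import fs

f = fs.SubTreeFileSystem(base_path='partitioned', base_fs=fs.LocalFileSystem())

# One child dataset per partition directory we actually want to read.
children = [ds.dataset(year, filesystem=f, format="ipc") for year in ['2001', '2002']]

# Passing a list of Dataset objects creates a union dataset over them.
union = ds.dataset(children)
union.to_table()
{code}

Note that, read this way, the {{year}} column is not reconstructed from the directory names, which is part of what the requested feature would need to address.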



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
