[jira] [Commented] (ARROW-15270) [Python] Make dataset.dataset() accept a list of directories as source

Will Jones (Jira) Thu, 06 Jan 2022 08:30:14 -0800


    [ https://issues.apache.org/jira/browse/ARROW-15270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470028#comment-17470028 ]


Will Jones commented on ARROW-15270:
------------------------------------

{quote}
The reason I'm suggesting this feature is because I think that 
{{dataset.dataset()}} should, in some kind of way, be an inverse operation to 
{{dataset.write_dataset()}}.
{quote}

I think it _is_ intended to work that way, just perhaps not in the way you
are expecting.

It might make sense to allow {{dataset.dataset()}} to take a list of
directories, but I think the behavior would differ from what you want. It
would likely create a union dataset by performing dataset discovery separately
in each directory you pass. For the use case you are showing, that has two
disadvantages: (1) it may be slower than running the discovery process once;
(2) it won't materialize the {{year}} column in the resulting table, because
each year directory would be a discovery root rather than a partition segment.

> [Python] Make dataset.dataset() accept a list of directories as source
> ----------------------------------------------------------------------
>
>                 Key: ARROW-15270
>                 URL: https://issues.apache.org/jira/browse/ARROW-15270
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Reynaldo Rojas Zelaya
>            Priority: Major
>
> Currently, if I partition a dataset as shown below, a directory 
> {{partitioned}} is created along with {{2001, 2002, 2003, 2004}} as 
> subdirectories. But then, if I wanted to only read the partitions 
> corresponding to years {{2001, 2002}}, I wouldn't have a straightforward 
> way of doing so.
> {code:python}
> >>> import pyarrow as pa
> >>> import pyarrow.dataset as ds
> >>> from pyarrow import fs
> >>> table = pa.table({'month': [1, 2, 3, 4, 5],
> ...                   'year': [2001, 2002, 2003, 2004, 2004]})
> >>> table
> pyarrow.Table
> month: int64
> year: int64
> ----
> month: [[1,2,3,4,5]]
> year: [[2001,2002,2003,2004,2004]]
> >>> ds.write_dataset(data=table, base_dir="partitioned", format="ipc",
> ...                  partitioning=ds.partitioning(pa.schema([("year", pa.int64())])))
> >>> f = fs.SubTreeFileSystem(base_path='partitioned',
> ...                          base_fs=fs.LocalFileSystem())
> >>> ds.dataset(source=['2001','2002'], filesystem=f, format="ipc")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/userr2232/Documents/misc/first-julia-nn/lib/python3.9/site-packages/pyarrow/dataset.py", line 683, in dataset
>     return _filesystem_dataset(source, **kwargs)
>   File "/Users/userr2232/Documents/misc/first-julia-nn/lib/python3.9/site-packages/pyarrow/dataset.py", line 423, in _filesystem_dataset
>     fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
>   File "/Users/userr2232/Documents/misc/first-julia-nn/lib/python3.9/site-packages/pyarrow/dataset.py", line 344, in _ensure_multiple_sources
>     raise IsADirectoryError(
> IsADirectoryError: Path 2001 points to a directory, but only file paths are supported. To construct a nested or union dataset pass a list of dataset objects instead.
> {code}
> Since {{dataset.write_dataset()}} produces this file structure, maybe 
> {{dataset.dataset()}} should accept a list of directories as {{source}}?
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
