Reynaldo Oscar Rojas Zelaya created ARROW-15270:
---------------------------------------------------

             Summary: [Python] Make dataset.dataset() accept a list of directories as source
                 Key: ARROW-15270
                 URL: https://issues.apache.org/jira/browse/ARROW-15270
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Reynaldo Oscar Rojas Zelaya


Currently, if I partition a dataset as shown below, a directory {{partitioned}} 
is created with {{2001, 2002, 2003, 2004}} as subdirectories. However, if I 
want to read only the partitions for the years {{2001, 2002}}, there is no 
straightforward way to do so: passing the two directories as {{source}} fails.

{code:python}
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> from pyarrow import fs
>>> table = pa.table({'month': [1, 2, 3, 4, 5],
...                   'year': [2001, 2002, 2003, 2004, 2004]})
>>> table
pyarrow.Table
month: int64
year: int64
----
month: [[1,2,3,4,5]]
year: [[2001,2002,2003,2004,2004]]
>>> ds.write_dataset(data=table, base_dir="partitioned", format="ipc",
...                  partitioning=ds.partitioning(pa.schema([("year", pa.int64())])))
>>> f = fs.SubTreeFileSystem(base_path='partitioned',
...                          base_fs=fs.LocalFileSystem())
>>> ds.dataset(source=['2001','2002'], filesystem=f, format="ipc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/userr2232/Documents/misc/first-julia-nn/lib/python3.9/site-packages/pyarrow/dataset.py", line 683, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/Users/userr2232/Documents/misc/first-julia-nn/lib/python3.9/site-packages/pyarrow/dataset.py", line 423, in _filesystem_dataset
    fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
  File "/Users/userr2232/Documents/misc/first-julia-nn/lib/python3.9/site-packages/pyarrow/dataset.py", line 344, in _ensure_multiple_sources
    raise IsADirectoryError(
IsADirectoryError: Path 2001 points to a directory, but only file paths are supported. To construct a nested or union dataset pass a list of dataset objects instead.
{code}

Since {{dataset.write_dataset()}} produces this file structure, maybe 
{{dataset.dataset()}} should accept a list of directories as {{source}}?

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
