[
https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265891#comment-17265891
]
Joris Van den Bossche commented on ARROW-11250:
-----------------------------------------------
So for {{fs.info()}}, when you pass a path without a trailing slash, it returns
the path with a slash, and if you pass it with a slash, it returns the path
without one? That sounds weird, but I can't immediately think of a way it could
influence the dataset reading (we mostly use the user-specified path, not the
"name" key in the dict).
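To make that comparison easy to repeat, here is a minimal sketch that calls {{info()}} with both path variants and collects the returned "name" values. The helper {{compare_info_names}} and the stub class {{SlashFlippingFS}} are hypothetical names made up for illustration; the stub only imitates the behaviour described above and is not how adlfs is implemented. With a real filesystem you would pass your own {{fs}} object instead of the stub.
{code:python}
def compare_info_names(fs, path):
    """Call fs.info() with and without a trailing slash; return the "name" values."""
    base = path.rstrip("/")
    results = {}
    for variant in (base, base + "/"):
        try:
            results[variant] = fs.info(variant)["name"]
        except FileNotFoundError:
            # Record a missing path instead of aborting the comparison
            results[variant] = None
    return results


class SlashFlippingFS:
    """Hypothetical stand-in that flips the trailing slash, as described above."""

    def info(self, path):
        name = path.rstrip("/") if path.endswith("/") else path + "/"
        return {"name": name, "type": "directory", "size": 0}


print(compare_info_names(SlashFlippingFS(), "dev/test-split"))
{code}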
To test it further directly with our fsspec filesystem wrapper, can you try
some variations/combinations:
{code:python}
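import fsspec
from pyarrow.fs import PyFileSystem, FSSpecHandler, FileSelector
# account_name / account_key: your Azure storage credentials
fs = fsspec.filesystem(protocol="abfs", account_name=account_name, account_key=account_key)
fs_pa = PyFileSystem(FSSpecHandler(fs))
fs_pa.get_file_info("dev/test-split")
# recursive listing of the same directory
fs_pa.get_file_info(FileSelector("dev/test-split", recursive=True))
# repeat the first call after the listing
fs_pa.get_file_info("dev/test-split")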
{code}
> [Python] Inconsistent behavior calling ds.dataset()
> ---------------------------------------------------
>
> Key: ARROW-11250
> URL: https://issues.apache.org/jira/browse/ARROW-11250
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> adal 1.2.5 pyh9f0ad1d_0 conda-forge
> adlfs 0.5.9 pyhd8ed1ab_0 conda-forge
> apache-airflow 1.10.14 pypi_0 pypi
> azure-common 1.1.24 py_0 conda-forge
> azure-core 1.9.0 pyhd3deb0d_0 conda-forge
> azure-datalake-store 0.0.51 pyh9f0ad1d_0 conda-forge
> azure-identity 1.5.0 pyhd8ed1ab_0 conda-forge
> azure-nspkg 3.0.2 py_0 conda-forge
> azure-storage-blob 12.6.0 pyhd3deb0d_0 conda-forge
> azure-storage-common 2.1.0 py37hc8dfbb8_3 conda-forge
> fsspec 0.8.5 pyhd8ed1ab_0 conda-forge
> jupyterlab_pygments 0.1.2 pyh9f0ad1d_0 conda-forge
> pandas 1.2.0 py37ha9443f7_0
> pyarrow 2.0.0 py37h4935f41_6_cpu conda-forge
> Reporter: Lance Dacey
> Priority: Minor
> Labels: azureblob, dataset, python
> Fix For: 4.0.0
>
>
> In a Jupyter notebook, I have noticed that sometimes I am not able to read a
> dataset which certainly exists on Azure Blob.
>
> {code:java}
> fs = fsspec.filesystem(protocol="abfs", account_name=account_name, account_key=account_key)
> {code}
>
> One example of this is reading a dataset in one cell:
>
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code}
>
> Then in another cell I try to read the same dataset:
>
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> ---------------------------------------------------------------------------
> FileNotFoundError Traceback (most recent call last)
> <ipython-input-514-bf63585a0c1b> in <module>
> ----> 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source,
> schema, format, filesystem, partitioning, partition_base_dir,
> exclude_invalid_files, ignore_prefixes)
> 669 # TODO(kszucs): support InMemoryDataset for a table input
> 670 if _is_path_like(source):
> --> 671 return _filesystem_dataset(source, **kwargs)
> 672 elif isinstance(source, (tuple, list)):
> 673 if all(_is_path_like(elem) for elem in source):
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in
> _filesystem_dataset(source, schema, filesystem, partitioning, format,
> partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
> 426 fs, paths_or_selector = _ensure_multiple_sources(source,
> filesystem)
> 427 else:
> --> 428 fs, paths_or_selector = _ensure_single_source(source,
> filesystem)
> 429
> 430 options = FileSystemFactoryOptions(
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in
> _ensure_single_source(path, filesystem)
> 402 paths_or_selector = [path]
> 403 else:
> --> 404 raise FileNotFoundError(path)
> 405
> 406 return filesystem, paths_or_selector
> FileNotFoundError: dev/test-split
> {code}
>
> If I reset the kernel, it works again. It also works if I change the path
> slightly, like adding a "/" at the end (so basically it just does not work
> if I read the same dataset twice):
>
> {code:java}
> ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
> {code}
>
>
> The other strange behavior I have noticed is that if I read a dataset
> inside of my Jupyter notebook, it takes only a few seconds:
>
> {code:java}
> %%time
> dataset = ds.dataset("dev/test-split",
> partitioning=ds.partitioning(pa.schema([("date", pa.date32())]),
> flavor="hive"),
> filesystem=fs,
> exclude_invalid_files=False)
> CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s Wall time: 2.58 s{code}
>
> Now, on the exact same server when I try to run the same code against the
> same dataset in Airflow it takes over 3 minutes (comparing the timestamps in
> my logs between right before I read the dataset, and immediately after the
> dataset is available to filter):
> {code:java}
> [2021-01-14 03:52:04,011] INFO - Reading dev/test-split
> [2021-01-14 03:55:17,360] INFO - Processing dataset in batches
> {code}
> This is probably not a pyarrow issue, but what are some potential causes that
> I can look into? I have one example where it is 9 seconds to read the dataset
> in Jupyter, but then 11 *minutes* in Airflow. I don't really know what to
> investigate - as I mentioned, the Jupyter notebook and Airflow are on the
> same server and both are deployed using Docker. Airflow is using the
> CeleryExecutor.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)