Lance Dacey created ARROW-11250:

             Summary: [Python] Inconsistent behavior calling ds.dataset()
                 Key: ARROW-11250
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
         Environment: Ubuntu 18.04

adal                      1.2.5              pyh9f0ad1d_0    conda-forge
adlfs                     0.5.9              pyhd8ed1ab_0    conda-forge
apache-airflow            1.10.14                  pypi_0    pypi
azure-common              1.1.24                     py_0    conda-forge
azure-core                1.9.0              pyhd3deb0d_0    conda-forge
azure-datalake-store      0.0.51             pyh9f0ad1d_0    conda-forge
azure-identity            1.5.0              pyhd8ed1ab_0    conda-forge
azure-nspkg               3.0.2                      py_0    conda-forge
azure-storage-blob        12.6.0             pyhd3deb0d_0    conda-forge
azure-storage-common      2.1.0            py37hc8dfbb8_3    conda-forge
fsspec                    0.8.5              pyhd8ed1ab_0    conda-forge
jupyterlab_pygments       0.1.2              pyh9f0ad1d_0    conda-forge
pandas                    1.2.0            py37ha9443f7_0
pyarrow                   2.0.0           py37h4935f41_6_cpu    conda-forge
            Reporter: Lance Dacey
             Fix For: 3.0.0

In a Jupyter notebook, I have noticed that sometimes I am not able to read a 
dataset which certainly exists on Azure Blob.

fs = fsspec.filesystem(protocol="abfs", account_name=account_name, account_key=account_key)
One example of this is reading a dataset in one cell:

ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)

Then in another cell I try to read the same dataset:

ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-514-bf63585a0c1b> in <module>
----> 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)

/opt/conda/lib/python3.8/site-packages/pyarrow/ in dataset(source, 
schema, format, filesystem, partitioning, partition_base_dir, 
exclude_invalid_files, ignore_prefixes)
    669     # TODO(kszucs): support InMemoryDataset for a table input
    670     if _is_path_like(source):
--> 671         return _filesystem_dataset(source, **kwargs)
    672     elif isinstance(source, (tuple, list)):
    673         if all(_is_path_like(elem) for elem in source):

/opt/conda/lib/python3.8/site-packages/pyarrow/ in 
_filesystem_dataset(source, schema, filesystem, partitioning, format, 
partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    426         fs, paths_or_selector = _ensure_multiple_sources(source, 
    427     else:
--> 428         fs, paths_or_selector = _ensure_single_source(source, 
    430     options = FileSystemFactoryOptions(

/opt/conda/lib/python3.8/site-packages/pyarrow/ in 
_ensure_single_source(path, filesystem)
    402         paths_or_selector = [path]
    403     else:
--> 404         raise FileNotFoundError(path)
    406     return filesystem, paths_or_selector

FileNotFoundError: dev/test-split

If I reset the kernel, it works again. It also works if I change the path 
slightly, like adding a "/" at the end (so basically it just does not work if I 
read the same dataset twice):

ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
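
One way to narrow this down might be to ask the filesystem object directly what it thinks the path is, before and after clearing its listing cache. The sketch below assumes that a stale fsspec directory-listing cache could be involved, which nothing in this report confirms. `probe_path` is a hypothetical helper name. `fs.info()` and `fs.invalidate_cache()` are standard fsspec filesystem methods.

```python
def probe_path(fs, path):
    """Return the path type reported by `fs` before and after a cache reset.

    `fs` is any fsspec-style filesystem object. This is a diagnostic sketch,
    not a fix: it only shows whether clearing the listing cache changes what
    the filesystem reports for `path`.
    """
    def path_type():
        try:
            # fsspec's info() returns a dict with a "type" entry
            # ("file" or "directory") for paths that exist.
            return fs.info(path)["type"]
        except FileNotFoundError:
            return "missing"

    before = path_type()
    # Drop any cached directory listings held by the filesystem instance.
    fs.invalidate_cache()
    after = path_type()
    return before, after
```

If `probe_path(fs, "dev/test-split")` reports "missing" before the reset and "directory" after it, that would point at cached listings rather than at pyarrow itself.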


The other strange behavior I have noticed is that if I read a dataset inside 
my Jupyter notebook, it is fast:

dataset = ds.dataset("dev/test-split", 
partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), 

CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s
Wall time: 2.58 s

Now, on the exact same server when I try to run the same code against the same 
dataset in Airflow it takes over 3 minutes (comparing the timestamps in my logs 
between right before I read the dataset, and immediately after the dataset is 
available to filter):
[2021-01-14 03:52:04,011] INFO - Reading dev/test-split
[2021-01-14 03:55:17,360] INFO - Processing dataset in batches
This is probably not a pyarrow issue, but what are some potential causes that I 
can look into? I have one example where it is 9 seconds to read the dataset in 
Jupyter, but then 11 *minutes* in Airflow. I don't really know what to 
investigate - as I mentioned, the Jupyter notebook and Airflow are on the same 
server and both are deployed using Docker. Airflow is using the CeleryExecutor.
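
To compare the two environments step by step, a small timing wrapper around each call could help. This is only a sketch using the standard library. The `timed` helper and the "dataset_timing" logger name are hypothetical and not part of pyarrow or Airflow.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("dataset_timing")


@contextmanager
def timed(label, results=None):
    """Log how long the wrapped block takes, in seconds.

    If `results` is a dict, the elapsed time is also stored under `label`
    so that runs from Jupyter and Airflow can be compared afterwards.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s took %.3f s", label, elapsed)
        if results is not None:
            results[label] = elapsed
```

Wrapping the `ds.dataset(...)` call in `with timed("ds.dataset"):` in both the notebook and the Airflow task would show whether the extra minutes are spent in dataset discovery itself or somewhere around it.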

