[ https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266151#comment-17266151 ]

Lance Dacey commented on ARROW-11250:
-------------------------------------

Good idea - I was able to list all of the files and print the info quickly. One 
interesting thing is that ds.dataset() failed right after, though, and the 
error message is a little different.

 

My input path was "dev/case-history/" with the final slash. This shows that it 
took 8 seconds to get len(fs.find()), which is about the same amount of time it 
takes to read ds.dataset() in Jupyter. The error message is different from the 
usual one, though, and it mentions something about a dircache:

 
{code:java}
[2021-01-15 15:51:47,158] INFO - Reading /dev/case-history/
[2021-01-15 15:51:55,607] INFO - 9682
[2021-01-15 15:51:55,892] INFO - {'name': '/dev/case-history', 'size': 0, 'type': 'directory'}
[2021-01-15 15:51:55,893] {taskinstance.py:1150} ERROR - '/dev/case-history/'
Traceback (most recent call last):
...
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 671, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 428, in _filesystem_dataset
    fs, paths_or_selector = _ensure_single_source(source, filesystem)
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 395, in _ensure_single_source
    file_info = filesystem.get_file_info([path])[0]
  File "pyarrow/_fs.pyx", line 434, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_fs.pyx", line 1012, in pyarrow._fs._cb_get_file_info_vector
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/fs.py", line 195, in get_file_info
    info = self.fs.info(path)
  File "/opt/conda/lib/python3.7/site-packages/adlfs/spec.py", line 522, in info
    fetch_from_azure = (path and self._ls_from_cache(path) is None) or refresh
  File "/opt/conda/lib/python3.7/site-packages/fsspec/spec.py", line 336, in _ls_from_cache
    return self.dircache[path]
  File "/opt/conda/lib/python3.7/site-packages/fsspec/dircache.py", line 62, in __getitem__
    return self._cache[item]  # maybe raises KeyError
KeyError: '/dev/case-history/'
{code}
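
So the KeyError comes straight from fsspec's directory-listing cache: the 
listing is presumably stored under the slash-less key, but the lookup uses my 
trailing-slash path. A minimal sketch of that mismatch, poking fsspec's 
DirCache directly (the cache entry below is made up):
{code:java}
from fsspec.dircache import DirCache

# DirCache is a plain mapping keyed on the exact path string, so
# "/dev/case-history" and "/dev/case-history/" are two different keys.
cache = DirCache()
cache["/dev/case-history"] = [{"name": "/dev/case-history/part-0.parquet"}]

print("/dev/case-history" in cache)   # True
print("/dev/case-history/" in cache)  # False -> the KeyError above
{code}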
 

I edited my DAG and changed the input path to "dev/case-history" with no final 
slash, and the error was different (note that fs.info() always either removes 
or adds the final slash in the name of the path):
{code:java}
[2021-01-15 15:36:25,603] INFO - {'name': '/dev/case-history/', 'size': 0, 'type': 'directory'}
[2021-01-15 15:36:25,604] ERROR - /dev/case-history
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 671, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 428, in _filesystem_dataset
    fs, paths_or_selector = _ensure_single_source(source, filesystem)
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 404, in _ensure_single_source
    raise FileNotFoundError(path)
FileNotFoundError: /dev/case-history
{code}
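
Both tracebacks point at stale state inside the cached filesystem instance. 
Maybe clearing the cached listings right before ds.dataset() would work around 
it - a sketch, assuming the dircache is the culprit (adlfs 0.5.9 may not 
implement the public invalidate_cache() hook, so I clear the mapping as well):
{code:java}
import pyarrow.dataset as ds

# Drop fsspec/adlfs's cached directory listings so fs.info()/fs.ls() go back
# to Azure instead of consulting a stale dircache entry.
fs.invalidate_cache()   # public hook; may be a no-op on this adlfs version
fs.dircache.clear()     # clear the underlying DirCache mapping directly

dataset = ds.dataset("dev/case-history", partitioning="hive", filesystem=fs)
{code}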
 

Without any fs.info() or fs.find() calls, it took 11 minutes to read the same 
dataset... from 17:45 to 17:56:
{code:java}
[2021-01-14 17:45:10,470] INFO - Reading /dev/case-history/
[2021-01-14 17:56:58,307] INFO - Processing dataset in batches
{code}
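
Since resetting the kernel fixes it (see the issue description below), fsspec's 
instance cache might also be involved: fsspec.filesystem() with the same 
arguments returns the same cached instance, dircache and all, even across 
Airflow tasks in one worker process. A hedged sketch of opting out of that 
cache (account_name and account_key are placeholders):
{code:java}
import fsspec

# Ask fsspec for a fresh AzureBlobFileSystem rather than the process-wide
# cached instance, so each task starts with an empty dircache.
fs = fsspec.filesystem(
    "abfs",
    account_name=account_name,
    account_key=account_key,
    skip_instance_cache=True,
)
{code}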
 

> [Python] Inconsistent behavior calling ds.dataset()
> ---------------------------------------------------
>
>                 Key: ARROW-11250
>                 URL: https://issues.apache.org/jira/browse/ARROW-11250
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 18.04
> adal                      1.2.5              pyh9f0ad1d_0    conda-forge
> adlfs                     0.5.9              pyhd8ed1ab_0    conda-forge
> apache-airflow            1.10.14                  pypi_0    pypi
> azure-common              1.1.24                     py_0    conda-forge
> azure-core                1.9.0              pyhd3deb0d_0    conda-forge
> azure-datalake-store      0.0.51             pyh9f0ad1d_0    conda-forge
> azure-identity            1.5.0              pyhd8ed1ab_0    conda-forge
> azure-nspkg               3.0.2                      py_0    conda-forge
> azure-storage-blob        12.6.0             pyhd3deb0d_0    conda-forge
> azure-storage-common      2.1.0            py37hc8dfbb8_3    conda-forge
> fsspec                    0.8.5              pyhd8ed1ab_0    conda-forge
> jupyterlab_pygments       0.1.2              pyh9f0ad1d_0    conda-forge
> pandas                    1.2.0            py37ha9443f7_0
> pyarrow                   2.0.0           py37h4935f41_6_cpu    conda-forge
>            Reporter: Lance Dacey
>            Priority: Minor
>              Labels: azureblob, dataset, python
>             Fix For: 4.0.0
>
>
> In a Jupyter notebook, I have noticed that sometimes I am not able to read a 
> dataset which certainly exists on Azure Blob.
>  
> {code:java}
> fs = fsspec.filesystem(protocol="abfs", account_name=account_name, account_key=account_key)
> {code}
>  
> One example of this is reading a dataset in one cell:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code}
>  
> Then in another cell I try to read the same dataset:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> ---------------------------------------------------------------------------
> FileNotFoundError                         Traceback (most recent call last)
> <ipython-input-514-bf63585a0c1b> in <module>
> ----> 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
>     669     # TODO(kszucs): support InMemoryDataset for a table input
>     670     if _is_path_like(source):
> --> 671         return _filesystem_dataset(source, **kwargs)
>     672     elif isinstance(source, (tuple, list)):
>     673         if all(_is_path_like(elem) for elem in source):
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
>     426         fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
>     427     else:
> --> 428         fs, paths_or_selector = _ensure_single_source(source, filesystem)
>     429 
>     430     options = FileSystemFactoryOptions(
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _ensure_single_source(path, filesystem)
>     402         paths_or_selector = [path]
>     403     else:
> --> 404         raise FileNotFoundError(path)
>     405 
>     406     return filesystem, paths_or_selector
> FileNotFoundError: dev/test-split
> {code}
>  
> If I reset the kernel, it works again. It also works if I change the path 
> slightly, like adding a "/" at the end (so basically it just does not work if 
> I read the same dataset twice):
>  
> {code:java}
> ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
> {code}
>  
>  
> The other strange behavior I have noticed is that if I read a dataset inside 
> my Jupyter notebook, it only takes a couple of seconds:
>  
> {code:java}
> %%time
> dataset = ds.dataset(
>     "dev/test-split",
>     partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), flavor="hive"),
>     filesystem=fs,
>     exclude_invalid_files=False,
> )
> CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s
> Wall time: 2.58 s
> {code}
>  
> Now, on the exact same server, when I try to run the same code against the 
> same dataset in Airflow, it takes over 3 minutes (comparing the timestamps in 
> my logs between right before I read the dataset and immediately after the 
> dataset is available to filter):
> {code:java}
> [2021-01-14 03:52:04,011] INFO - Reading dev/test-split
> [2021-01-14 03:55:17,360] INFO - Processing dataset in batches
> {code}
> This is probably not a pyarrow issue, but what are some potential causes that 
> I can look into? I have one example where it takes 9 seconds to read the 
> dataset in Jupyter, but then 11 *minutes* in Airflow. I don't really know what 
> to investigate - as I mentioned, the Jupyter notebook and Airflow are on the 
> same server and both are deployed using Docker. Airflow is using the 
> CeleryExecutor.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
