[
https://issues.apache.org/jira/browse/ARROW-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510023#comment-17510023
]
Joris Van den Bossche commented on ARROW-15910:
-----------------------------------------------
[~crogers923] thanks for the quick follow-up.
It's strange that it correctly sees a directory, but then the actual reading
fails with "FileNotFound" (thinking it is a file, not a directory).
But I remember now that a few weeks ago we had a similar issue on the user
mailing list, with gcsfs giving a similar error (see my answer at
https://lists.apache.org/thread/d0fccn94ovt2hh6cgyktcvz127x5pysw). In that
case, it mattered whether you called the "info" method for the first or the
second time. Can you check that here as well? Is the output you show above
what you get when running the code for the first time (after restarting the
interactive console session)?
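
To make that check concrete, here is a minimal sketch of what I mean. The helper name compare_info_calls is made up for illustration and is not part of gcsfs or pyarrow; it only assumes the filesystem object has the fsspec-style info(path) method that returns a dict with a "type" key. The path is the placeholder one from the report.

```python
def compare_info_calls(fs, path):
    """Call fs.info(path) twice and collect what each call reports.

    Hypothetical helper for illustration: `fs` is any fsspec-style
    filesystem (for example gcsfs.GCSFileSystem()). Each entry in the
    returned list is either the "type" value from the info dict or a
    short string describing the FileNotFoundError that was raised.
    """
    results = []
    for _ in range(2):
        try:
            results.append(fs.info(path)["type"])
        except FileNotFoundError as exc:
            results.append(f"FileNotFoundError: {exc}")
    return results


# Usage, to be run in a freshly restarted session (left commented out
# because it needs gcsfs and access to the bucket):
#
# import gcsfs
# fs = gcsfs.GCSFileSystem()
# print(compare_info_calls(fs, "gs://myBucket/features/MyParquet.parquet/"))
```

If the two entries differ (for example "file" on the first call and "directory" on the second), that would point to the same behaviour as in the mailing list thread.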
> [Python] pyarrow.parquet.read_table either returns FileNotFound or
> ArrowInvalid
> -------------------------------------------------------------------------------
>
> Key: ARROW-15910
> URL: https://issues.apache.org/jira/browse/ARROW-15910
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 6.0.1, 7.0.0
> Environment: GCP JupyterLab notebooks
> Reporter: Callista Rogers
> Priority: Major
>
> Running the code below results in {{"GetFileInfo() yielded path
> 'myBucket/features/MyParquet.parquet/year=2022/part-0019.snappy.parquet'
> which is outside base dir 'gs://myBucket/features/MyParquet.parquet/' "}}
> {code}
> import pyarrow.parquet as pq
> import gcsfs
>
> file_path = "gs://myBucket/features/MyParquet.parquet/"
> fs = gcsfs.GCSFileSystem()
> table = pq.read_table(file_path, filesystem=fs)
> {code}
> Removing the gs:// from file_path results in a {{FileNotFoundError}}. Any
> variation of / or // at the beginning of the path gives me the 'outside base
> dir' error.
> I also ran the below and got valid results using both file_path patterns, so
> I know it finds the path just fine.
> {code}
> from pyarrow.fs import FileSelector, PyFileSystem, FSSpecHandler
> filesys = PyFileSystem(FSSpecHandler(fs))
> selector = FileSelector(file_path, recursive=True)
> filesys.get_file_info(selector)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)