[
https://issues.apache.org/jira/browse/ARROW-15910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-15910:
------------------------------------------
Description:
Running the code below results in {{"GetFileInfo() yielded path
'myBucket/features/MyParquet.parquet/year=2022/part-0019.snappy.parquet' which
is outside base dir 'gs://myBucket/features/MyParquet.parquet/' "}}
{code}
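import gcsfs
import pyarrow.parquet as pq

# Partitioned Parquet dataset directory on GCS, given with the gs:// scheme.
file_path = "gs://myBucket/features/MyParquet.parquet/"
fs = gcsfs.GCSFileSystem()
# This call raises the "outside base dir" error quoted above.
table = pq.read_table(file_path, filesystem=fs)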
{code}
Removing the gs:// prefix from file_path results in a {{FileNotFoundError}}
instead. Any variation of a leading / or // on the path gives the same
'outside base dir' error.
I also ran the code below and got valid results with both file_path patterns,
so I know the filesystem itself finds the path just fine.
{code}
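from pyarrow.fs import FileSelector, PyFileSystem, FSSpecHandler

# fs and file_path are the objects defined in the first snippet.
filesys = PyFileSystem(FSSpecHandler(fs))
selector = FileSelector(file_path, recursive=True)
# Returns valid results for the dataset directory.
filesys.get_file_info(selector)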
{code}
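One possible reading of the 'outside base dir' message, offered as an assumption and not a confirmed diagnosis: the base dir still carries the gs:// scheme while the listed file path does not, so a plain prefix comparison between the two fails. The sketch below only illustrates that mismatch with ordinary strings. The helpers {{strip_scheme}} and {{is_inside_base_dir}} are hypothetical names made up for this illustration and are not part of pyarrow or gcsfs.

```python
def strip_scheme(uri: str) -> str:
    """Drop a leading 'scheme://' prefix, if any (hypothetical helper)."""
    _scheme, sep, rest = uri.partition("://")
    return rest if sep else uri


def is_inside_base_dir(path: str, base_dir: str) -> bool:
    """Plain string-prefix containment check (illustrative, not Arrow's code)."""
    base = base_dir.rstrip("/") + "/"
    return path.startswith(base)


# The two strings quoted in the error message above.
base_dir = "gs://myBucket/features/MyParquet.parquet/"
listed = "myBucket/features/MyParquet.parquet/year=2022/part-0019.snappy.parquet"

# With the scheme left on the base dir, the prefix check fails.
print(is_inside_base_dir(listed, base_dir))

# With the scheme stripped from the base dir, the same check passes.
print(is_inside_base_dir(listed, strip_scheme(base_dir)))
```

This does not explain the {{FileNotFoundError}} seen when the scheme is removed from file_path; it only shows why the two strings in the error message would not match under a simple prefix comparison.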
> [Python] pyarrow.parquet.read_table either returns FileNotFound or
> ArrowInvalid
> -------------------------------------------------------------------------------
>
> Key: ARROW-15910
> URL: https://issues.apache.org/jira/browse/ARROW-15910
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 6.0.1, 7.0.0
> Environment: GCP JupyterLab notebooks
> Reporter: Callista Rogers
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.1#820001)