Thanks! Removing the "gs://" prefix indeed fixes it.

On Tue, Aug 2, 2022 at 4:01 PM Will Jones <will.jones...@gmail.com> wrote:
> Hi Li Jin,
>
> I'm not sure yet what changed, but I believe you can fix that error simply
> by omitting the scheme prefix from the URI and just using the path when
> loading the dataset. Here's my repro:
>
> import pyarrow as pa
> import pyarrow.dataset as ds
> from pyarrow.fs import S3FileSystem
>
> s3fs = S3FileSystem(
>     endpoint_override="https://storage.googleapis.com",
>     anonymous=True
> )
>
> uri = "gs://voltrondata-labs-datasets/nyc-taxi"
>
> # This works: uri[5:] drops the 5-character "gs://" prefix
> ds.dataset(uri[5:], filesystem=s3fs)
>
> # With prefix causes error
> ds.dataset(uri, filesystem=s3fs)
> # ArrowInvalid: Expected an S3 object path of the form 'bucket/key...', got
> # a URI: 'gs://voltrondata-labs-datasets/nyc-taxi'
>
> Best,
>
> Will Jones
>
> On Mon, Aug 1, 2022 at 9:00 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> > Hello!
> >
> > We recently updated Arrow to 7.0.0 and hit an error with our old code
> > (details below). I wonder if there is a new way to do this with the
> > current version?
> >
> > import pandas as pd
> > import pyarrow
> > import pyarrow.parquet as pq
> >
> > df = pd.DataFrame({"aa": [1, 2, 3], "bb": [1, 2, 3]})
> > uri = "gs://amp_bucket_liao/try"
> > s3fs = # ...
> >
> > pq.write_to_dataset(
> >     table=pyarrow.Table.from_pandas(df=df, preserve_index=True),
> >     root_path=uri, filesystem=s3fs, partition_cols=["aa"]
> > )
> > # so far it works fine.
> >
> > # The following gives an error (message below)
> > test_df = pq.read_table(
> >     source=uri, filesystem=s3fs
> > )
> >
> > Error:
> >
> > /home/tsdist/vats_deployments/modeling.env.interactive-bc9b04a0-708b-45b8-90bc-14b9ca6ee9ba/ext/public/python/pyarrow/7/0/x/dist/lib/python3.9/pyarrow/error.pxi
> > in pyarrow.lib.check_status()
> >      97
> >      98     if status.IsInvalid():
> > ---> 99         raise ArrowInvalid(message)
> >     100     elif status.IsIOError():
> >     101         # Note: OSError constructor is
> >
> > ArrowInvalid: GetFileInfo() yielded path
> > 'amp_bucket_liao/try/aa=3/235add6629d44a2f8fa4ec772340b73d.parquet',
> > which is outside base dir 'gs://amp_bucket_liao/try'
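
For anyone landing on this thread later: the fix discussed above (drop the scheme prefix and pass only the bucket/key path together with an explicit filesystem) can be wrapped in a small helper instead of hard-coding a slice like uri[5:]. This is only a sketch. The name strip_scheme is hypothetical and not part of pyarrow, and it assumes the URI uses the usual "scheme://" form.

```python
def strip_scheme(uri: str) -> str:
    """Return 'bucket/key...' for 'scheme://bucket/key...'.

    Hypothetical helper, not a pyarrow API. Strings without a
    '://' separator are returned unchanged.
    """
    _scheme, sep, rest = uri.partition("://")
    return rest if sep else uri


uri = "gs://amp_bucket_liao/try"
path = strip_scheme(uri)  # bucket/key path without the scheme

# Assumed usage, with s3fs configured as in the messages above:
# test_df = pq.read_table(source=path, filesystem=s3fs)
```

Partitioning on "://" rather than slicing a fixed number of characters keeps the helper working for other schemes such as "s3://", and leaves paths that are already bare untouched.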