Hello!
We recently updated Arrow to 7.0.0 and hit an error with our old code
(details below). Is there a new way to do this with the current
version?
import pandas as pd
import pyarrow
import pyarrow.parquet as pq
df = pd.DataFrame({"aa": [1, 2, 3], "bb": [1, 2, 3]})
uri = "gs://amp_bucket_liao/try"
s3fs = # ...
pq.write_to_dataset(
    table=pyarrow.Table.from_pandas(df=df, preserve_index=True),
    root_path=uri, filesystem=s3fs, partition_cols=["aa"]
)
# So far this works fine.
# The following call raises an error (message below).
test_df = pq.read_table(
    source=uri, filesystem=s3fs
)
Error:
/home/tsdist/vats_deployments/modeling.env.interactive-bc9b04a0-708b-45b8-90bc-14b9ca6ee9ba/ext/public/python/pyarrow/7/0/x/dist/lib/python3.9/pyarrow/error.pxi
in pyarrow.lib.check_status()
97
98 if status.IsInvalid():
---> 99 raise ArrowInvalid(message)
100 elif status.IsIOError():
101 # Note: OSError constructor is
ArrowInvalid: GetFileInfo() yielded path
'amp_bucket_liao/try/aa=3/235add6629d44a2f8fa4ec772340b73d.parquet',
which is outside base dir 'gs://amp_bucket_liao/try'
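
Reading the message, the listed file path has no "gs://" scheme while the base dir still carries it, so the prefix comparison fails. A minimal sketch of one possible workaround, assuming that a scheme-less path is acceptable when a filesystem object is passed explicitly. `strip_scheme` is a hypothetical helper, not part of pyarrow, and this is untested against the setup above:

```python
from urllib.parse import urlsplit


def strip_scheme(uri: str) -> str:
    """Return 'bucket/key' for 'gs://bucket/key'; leave scheme-less paths unchanged."""
    parts = urlsplit(uri)
    if not parts.scheme:
        return uri
    # netloc is the bucket, path is the key including its leading slash
    return parts.netloc + parts.path


uri = "gs://amp_bucket_liao/try"
bare_path = strip_scheme(uri)
print(bare_path)

# Assumed usage with the filesystem object from the snippet above:
# test_df = pq.read_table(source=bare_path, filesystem=s3fs)
```

If this is the cause, the same scheme-less path would presumably also be used as root_path in write_to_dataset, so that both calls agree on the form of the path.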