[GitHub] [arrow] westonpace commented on issue #10492: Doc update ? For Reading and Writing the Apache Parquet Format

GitBox Wed, 09 Jun 2021 10:25:41 -0700


westonpace commented on issue #10492:
URL: https://github.com/apache/arrow/issues/10492#issuecomment-857888278

> Cannot submit a bug since it's not especially a direct issue but it's more
> something not complete or up to date in the documentation

Please do create a JIRA issue. Arrow uses JIRA to track all changes (bugs,
doc changes, CI improvements, new features), so you don't have to worry about
whether this counts as a bug. These sound like valid concerns, and a JIRA
issue would be appropriate.

> There is a chapter for "Reading from Partitioned Datasets", that's great
> ... but works only with a local storage and adding a Data Lake URL to a
> recursive folder don't work, missing the ability to read partitioned parquet
> files from Cloud

That chapter covers the legacy datasets API (ParquetDataset). You may be
better served by reading up on the new datasets API:
https://arrow.apache.org/docs/python/dataset.html#dataset . The new API
accepts a URL as a path, although it currently has first-class support only
for S3 and HDFS. To use Azure Data Lake directly, you would need to create a
filesystem for it, because the datasets API needs to be able to list files,
search for files, create files, etc.

That being said, you might be able to make something work by combining
Arrow's support for fsspec-compatible filesystems
(https://arrow.apache.org/docs/python/filesystems.html#using-fsspec-compatible-filesystems)
with https://github.com/dask/adlfs .

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
