anjakefala commented on issue #33618:
URL: https://github.com/apache/arrow/issues/33618#issuecomment-1396304799
> If local filesystems are the odd one out, perhaps it doesn't make sense in
> the filesystem API and we should instead recommend a work around for that
> special case.

Since the `datasets` API does make use of `arrow::fs`, and the local
filesystem is a very common use-case, I do think there is an argument to be
made for implementing this change, even if it only applies to the localFS.
> it seems worth it to check what other filesystems (S3, Azure, GCS, HDFS,
> etc.) such an optimization is meaningful for.

@wjones127 did some preliminary research on this front:

> The following provide LastModified + Size at no additional cost:
> * S3 ([docs](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_Examples))
> * GCS ([docs](https://cloud.google.com/storage/docs/json_api/v1/objects/list))
> * Azure Blob store ([docs](https://learn.microsoft.com/en-us/rest/api/storageservices/list-blobs?tabs=azure-ad#response-body)), although you can request to leave out the metadata (not sure if this is any faster though)

I tried to answer this question for HDFS ([docs
1](https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/FileSystemShell.html#ls),
[docs
2](https://apache.googlesource.com/hadoop-common/+/e39a8dcd753c0e26c62a67026a554cf2242b7c0f/hadoop-hdfs-project/hadoop-hdfs/src/main/native/libhdfs/hdfs.c#2974)).
It seems like `hdfs dfs` also uses `stat()` and would probably benefit from a
similar change, but I am not sure how much work that would involve.
`GetFileInfo` seems to be the API used to set the metadata.
@westonpace suggested that the change would be to add something like
`needs_extended_file_info` to the file selector or to the `GetFileInfo` call.
Would it make sense to add an option like that so that, for the local
filesystem, we can decide whether to use the `stat` or the `readdir` system
call? Is it feasible to add an argument to a function in `arrow::fs` that only
applies to one of the filesystems and does nothing for the others? If it is, I
can review the relevant code and propose a change.
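
To make the question concrete, here is a rough Python sketch of the behaviour I have in mind. It is not `arrow::fs` code: `FileInfo`, `list_dir`, and the way `needs_extended_file_info` is passed are all hypothetical names used only for illustration. It relies on `os.scandir`, which reads directory entries and can usually report the entry type without a per-file `stat`; the extra `stat` call is made only when size and modification time are requested.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileInfo:
    """Hypothetical stand-in for a file-info record (not the Arrow class)."""
    path: str
    is_dir: bool
    size: Optional[int] = None      # filled only when extended info is requested
    mtime: Optional[float] = None   # filled only when extended info is requested


def list_dir(base_dir: str, needs_extended_file_info: bool = True) -> List[FileInfo]:
    """List the entries directly under base_dir.

    The flag name mirrors the suggestion in this thread; it is an assumption,
    not an existing option.
    """
    infos = []
    with os.scandir(base_dir) as entries:
        for entry in entries:
            # The entry type usually comes from the directory listing itself,
            # so on most platforms this does not need a stat call per file.
            info = FileInfo(entry.path, entry.is_dir(follow_symlinks=False))
            if needs_extended_file_info:
                # Size and mtime require the extra per-file stat.
                st = entry.stat(follow_symlinks=False)
                info.size = st.st_size
                info.mtime = st.st_mtime
            infos.append(info)
    return sorted(infos, key=lambda i: i.path)
```

A caller that only needs paths (for example, dataset discovery) would pass `needs_extended_file_info=False`; filesystems whose listing call already returns size and modification time could simply ignore the flag.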