anjakefala commented on issue #33618:
URL: https://github.com/apache/arrow/issues/33618#issuecomment-1396304799

   > If local filesystems are the odd one out, perhaps it doesn't make sense in 
the filesystem API and we should instead recommend a work around for that 
special case.
   
   Since the `datasets` API does make use of `arrow::fs`, and the local 
filesystem is a very common use case, I do think there is an argument to be 
made for implementing this change, even if it only applies to the local 
filesystem. 
   
   >  it seems worth it to check what other filesystems (S3, Azure, GCS, HDFS, 
etc.) such an optimization is meaningful for.
   
   @wjones127  did some preliminary research on this front:
   > The following provide LastModified + Size at no additional cost:
   > * S3 ([docs](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_Examples))
   > * GCS ([docs](https://cloud.google.com/storage/docs/json_api/v1/objects/list))
   > * Azure Blob store ([docs](https://learn.microsoft.com/en-us/rest/api/storageservices/list-blobs?tabs=azure-ad#response-body)), although you can request to leave out the metadata (not sure if this is any faster though)
   
   I tried to answer this question for HDFS 
([docs 1](https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/FileSystemShell.html#ls), 
[docs 2](https://apache.googlesource.com/hadoop-common/+/e39a8dcd753c0e26c62a67026a554cf2242b7c0f/hadoop-hdfs-project/hadoop-hdfs/src/main/native/libhdfs/hdfs.c#2974)).
   
   It seems like `hdfs dfs` also uses `stat()`, and would probably also benefit 
from a similar change, but I am not sure how much work that would involve. 
`GetFileInfo` seems to be the API used to set the metadata. 
   
   @westonpace suggested that the change would be to add something like 
`needs_extended_file_info` to the file selector or the `GetFileInfo` call. Would 
it make sense to add something like that, so that for the local filesystem we 
can decide whether to use the `stat` or the `readdir` system call? Is it 
feasible to add an argument to a function in `arrow::fs` that only applies to 
one of the filesystems and does nothing for the others? If so, I can review the 
relevant code and propose a change. 
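
   To make the idea concrete, here is a rough Python sketch of the behaviour 
being discussed. It is not Arrow code: the `FileSelector` and `FileInfo` 
classes and the `list_local` function are hypothetical stand-ins, and only the 
flag name `needs_extended_file_info` comes from the suggestion above. It 
assumes that a directory scan (what `readdir` gives you) is enough to learn 
names and entry types, and that size and modification time need a separate 
`stat` call per entry.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileSelector:
    """Hypothetical selector; not the real arrow::fs::FileSelector."""
    base_dir: str
    # Flag name taken from the discussion; the default keeps today's behaviour.
    needs_extended_file_info: bool = True


@dataclass
class FileInfo:
    """Hypothetical result record."""
    path: str
    is_dir: bool
    size: Optional[int] = None      # None when extended info was skipped
    mtime: Optional[float] = None   # None when extended info was skipped


def list_local(selector: FileSelector) -> List[FileInfo]:
    """List one directory, calling stat only when extended info is requested."""
    infos = []
    with os.scandir(selector.base_dir) as entries:
        for entry in entries:
            # The entry type usually comes from the directory scan itself,
            # so this normally does not need an extra system call.
            info = FileInfo(entry.path, entry.is_dir(follow_symlinks=False))
            if selector.needs_extended_file_info:
                # Size and mtime are not part of the directory entry on
                # POSIX systems, so they need a per-entry stat.
                st = entry.stat(follow_symlinks=False)
                info.size = st.st_size
                info.mtime = st.st_mtime
            infos.append(info)
    return infos
```

   A filesystem whose listing call already returns size and modification time 
(as the object stores above appear to) could simply ignore the flag and fill 
in every field, so callers would see a superset of what they asked for.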


