[ https://issues.apache.org/jira/browse/ARROW-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124012#comment-17124012 ]
Antoine Pitrou commented on ARROW-8950:
---------------------------------------

Would it be OK to be able to disable it in {{S3Options}}?

> [C++] Make head optional in s3fs
> --------------------------------
>
>                 Key: ARROW-8950
>                 URL: https://issues.apache.org/jira/browse/ARROW-8950
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Remi Dettai
>            Assignee: Antoine Pitrou
>            Priority: Major
>
> When you open an input file with the s3fs, it issues a HEAD request to S3 to
> check whether the file is present/authorized and to get its size
> (https://github.com/apache/arrow/blob/f16f76ab7693ae085e82f4269a0a0bc23770bef9/cpp/src/arrow/filesystem/s3fs.cc#L407).
> This call comes with a non-negligible cost:
> * it adds latency
> * it is priced the same as a GET request by AWS
> I fail to see use cases where this call is really crucial:
> * if the file is not present/authorized, failing at first read seems to have
> mostly the same effect as failing on opening. I agree that it is kind of
> "usual" for an _open_ call to fail eagerly, so to avoid surprises we could
> add a flag indicating that we don't need to fail when running _OpenInputFile_
> on an inaccessible file.
> * getting the size can be done on the first read, and could be mostly
> avoided on the caller side if the filesystem API provided read-from-end
> capabilities (compatible with fs reads using _ios::end_ and, on HTTP
> filesystems, with _bytes=-xxx_). In the worst case, the call to _head_ could
> be done lazily when calling _getSize()_.
> I agree that it makes things a bit more complex, and I understand that you
> would not want to complicate the generic fs API because of blob storage
> behavior. But obviously there are workloads where this has a significant
> impact.
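
For illustration, the lazy-size idea from the description could look roughly like the sketch below. This is not the Arrow API: {{LazySizeFile}}, the {{head}} callback and the {{eager_check}} flag are hypothetical names made up for this example, and the callback merely stands in for the real HEAD request. It assumes C++17.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for an S3 input file; not part of Arrow.
class LazySizeFile {
 public:
  // `head` plays the role of the HEAD request: it returns the object size,
  // or throws if the object is missing or unauthorized (assumption).
  LazySizeFile(std::string path,
               std::function<std::int64_t(const std::string&)> head,
               bool eager_check)
      : path_(std::move(path)), head_(std::move(head)) {
    if (eager_check) {
      // Fail eagerly at open time, as the existing behavior does.
      size_ = head_(path_);
    }
  }

  // Issue the HEAD request on first use only, then serve the cached value.
  std::int64_t GetSize() {
    if (!size_.has_value()) {
      size_ = head_(path_);
    }
    return *size_;
  }

 private:
  std::string path_;
  std::function<std::int64_t(const std::string&)> head_;
  std::optional<std::int64_t> size_;
};
```

With {{eager_check}} set to false, opening the file costs no request at all, and callers that never ask for the size never pay for the HEAD. The trade-off is the one raised in the description: a missing or unauthorized object is only reported at the first size query or read.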