Remi Dettai created ARROW-8950:
----------------------------------

             Summary: [C++] Make head optional in s3fs
                 Key: ARROW-8950
                 URL: https://issues.apache.org/jira/browse/ARROW-8950
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Remi Dettai


When you open an input file with s3fs, it issues a HEAD request to S3 to 
check that the file is present/authorized and to get its size 
(https://github.com/apache/arrow/blob/f16f76ab7693ae085e82f4269a0a0bc23770bef9/cpp/src/arrow/filesystem/s3fs.cc#L407).

This call comes with a non-negligible cost:
 * adds latency
 * priced the same as a GET request by AWS

I fail to see use cases where this call is really crucial:
 * if the file is not present/authorized, failing at first read seems to have 
mostly the same effect as failing on opening. I agree that it is kind of 
"usual" for an _open_ call to fail eagerly, so to avoid surprises we could add 
a flag indicating if we don't need to fail when running _OpenInputFile_ on an 
inaccessible file.
 * getting the size can be done on the first read, and could mostly be avoided 
on the caller side if the filesystem API provided read-from-end capabilities 
(compatible with fs reads using _ios::end_ and on HTTP filesystems with 
_bytes=-xxx_). In the worst case, the call to _head_ could be done lazily when 
calling _getSize()_.
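
To make the two points above concrete, here is a minimal sketch of what a lazy 
variant could look like. It is not Arrow code: _FakeObjectStore_, 
_LazyInputFile_, the _eager_check_ flag and _ReadFromEnd_ are all hypothetical 
names made up for illustration, and the store is an in-memory stand-in that 
only counts requests. It assumes C++17 for _std::optional_.

```cpp
// Hypothetical sketch: none of these types exist in Arrow.
#include <algorithm>
#include <cstdint>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// In-memory stand-in for an object store; it counts requests so the
// cost of each access pattern is visible.
struct FakeObjectStore {
  std::map<std::string, std::string> objects;
  int head_requests = 0;
  int get_requests = 0;

  // Models a HEAD request: returns the object size, throws if missing.
  int64_t Head(const std::string& key) {
    ++head_requests;
    auto it = objects.find(key);
    if (it == objects.end()) throw std::runtime_error("not found: " + key);
    return static_cast<int64_t>(it->second.size());
  }

  // Models a GET with a suffix range ("bytes=-n"): returns the last n bytes
  // (or the whole object if it is shorter than n).
  std::string GetSuffix(const std::string& key, int64_t n) {
    ++get_requests;
    auto it = objects.find(key);
    if (it == objects.end()) throw std::runtime_error("not found: " + key);
    const std::string& data = it->second;
    const int64_t size = static_cast<int64_t>(data.size());
    const int64_t len = std::min(std::max<int64_t>(n, 0), size);
    return data.substr(static_cast<size_t>(size - len));
  }
};

// Hypothetical input file that skips the HEAD request unless asked for it.
class LazyInputFile {
 public:
  // eager_check == true mimics the current behaviour (HEAD on open, so a
  // missing object fails immediately); false defers any failure to first use.
  LazyInputFile(FakeObjectStore* store, std::string key, bool eager_check)
      : store_(store), key_(std::move(key)) {
    if (eager_check) size_ = store_->Head(key_);
  }

  // Issues the HEAD request only on the first call, then reuses the result.
  int64_t GetSize() {
    if (!size_) size_ = store_->Head(key_);
    return *size_;
  }

  // Reads the last n bytes without needing to know the object size.
  std::string ReadFromEnd(int64_t n) { return store_->GetSuffix(key_, n); }

 private:
  FakeObjectStore* store_;
  std::string key_;
  std::optional<int64_t> size_;
};
```

With this shape, a caller that only needs the tail of the object pays for one 
GET and no HEAD, while a caller that does ask for the size pays for exactly one 
HEAD. The trade-off is the one mentioned above: with _eager_check_ off, a 
missing or unauthorized object is only reported on the first read or size query.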

I agree that it makes things a bit more complex, and I understand that you 
would not want to complicate the generic fs API because of blob storage 
behavior. But there are clearly workloads where this has a significant impact.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
