[
https://issues.apache.org/jira/browse/ARROW-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124066#comment-17124066
]
Remi Dettai commented on ARROW-8950:
------------------------------------
You need more than this. If you don't have a "read from end" type function in
the filesystem API, you will still need to get the size first in order to read
the end of the file. The primary use case for this is of course Parquet, where
you need to read the footer first.
A workaround, if we don't want to extend the generic filesystem API, would be to
provide the file size manually when opening the file. This way you could use
file sizes obtained in batches from list commands or from some kind of catalog.
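
To make the two options concrete, here is a minimal sketch of how the tail of a
file (for Parquet, the last 8 bytes hold the footer length and the "PAR1" magic)
could be requested. The function names are hypothetical and are not part of the
Arrow filesystem API; the first helper assumes the size is already known from a
list call or a catalog, the second uses an HTTP suffix range, and the third is
the local-file equivalent using a seek relative to the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Range header value when the object size is already known (hypothetical
// helper): no HEAD request is needed to locate the last n bytes.
std::string TailRangeWithKnownSize(std::int64_t file_size, std::int64_t n) {
  assert(file_size > 0 && n > 0);
  const std::int64_t start = file_size > n ? file_size - n : 0;
  return "bytes=" + std::to_string(start) + "-" + std::to_string(file_size - 1);
}

// HTTP suffix range: asks the server for the last n bytes without knowing
// the object size (the "bytes=-xxx" form).
std::string TailRangeSuffix(std::int64_t n) {
  assert(n > 0);
  return "bytes=-" + std::to_string(n);
}

// Local-file equivalent: seek relative to the end, then read the tail.
// Returns an empty string if the file cannot be opened.
std::string ReadTailLocal(const std::string& path, std::int64_t n) {
  assert(n > 0);
  std::ifstream in(path, std::ios::binary);
  if (!in) return {};
  in.seekg(-n, std::ios::end);
  if (!in) {  // file is shorter than n bytes: fall back to the whole file
    in.clear();
    in.seekg(0, std::ios::beg);
  }
  std::string tail(static_cast<std::size_t>(n), '\0');
  in.read(&tail[0], n);
  tail.resize(static_cast<std::size_t>(in.gcount()));
  return tail;
}
```

With a known size of 1000 bytes, the last 8 bytes map to "bytes=992-999"; the
suffix form "bytes=-8" asks for the same bytes without needing the size up front.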
> [C++] Make head optional in s3fs
> --------------------------------
>
> Key: ARROW-8950
> URL: https://issues.apache.org/jira/browse/ARROW-8950
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Remi Dettai
> Assignee: Antoine Pitrou
> Priority: Major
>
> When you open an input file with s3fs, it issues a HEAD request to S3 to
> check that the file is present/authorized and to get its size
> (https://github.com/apache/arrow/blob/f16f76ab7693ae085e82f4269a0a0bc23770bef9/cpp/src/arrow/filesystem/s3fs.cc#L407).
> This call comes with a non-negligible cost:
> * it adds latency
> * it is priced the same as a GET request by AWS
> I fail to see use cases where this call is really crucial:
> * if the file is not present/authorized, failing at first read seems to have
> mostly the same effect as failing on opening. I agree that it is kind of
> "usual" for an _open_ call to fail eagerly, so to avoid surprises we could
> add a flag indicating that _OpenInputFile_ does not need to fail eagerly
> on an inaccessible file.
> * getting the size can be done on the first read, and could be mostly
> avoided on the caller side if the filesystem API provided read-from-end
> capabilities (compatible with fs reads using _ios::end_ and, on HTTP
> filesystems, with _bytes=-xxx_). In the worst case, the call to _head_ could
> be done lazily when calling _getSize()_.
> I agree that it makes things a bit more complex, and I understand that you
> would not want to complicate the generic filesystem API because of blob
> storage behavior. But obviously there are workloads where this has a
> significant impact.
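
As a rough illustration of the lazy _head_ idea in the description above, the
sketch below defers the size lookup until it is first requested and skips it
entirely when the caller supplies a size. The class and its members are
hypothetical and do not reflect the actual s3fs implementation; the `head`
callback merely stands in for the HEAD request. It assumes C++17 for
`std::optional`.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical input-file wrapper, not Arrow's actual S3 file class.
class LazySizedFile {
 public:
  // `head` stands in for the HEAD request; `known_size` is an optional
  // caller-provided size (e.g. obtained from a list call or a catalog).
  explicit LazySizedFile(std::function<std::int64_t()> head,
                         std::optional<std::int64_t> known_size = std::nullopt)
      : head_(std::move(head)), size_(known_size) {
    assert(static_cast<bool>(head_));
  }

  // Runs the HEAD-like callback at most once, and only if the size was not
  // provided when the file was opened.
  std::int64_t GetSize() {
    if (!size_) size_ = head_();
    return *size_;
  }

 private:
  std::function<std::int64_t()> head_;
  std::optional<std::int64_t> size_;
};
```

Opening the file then costs no request at all; the callback only runs if
`GetSize()` is called on a file opened without a size.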
--
This message was sent by Atlassian Jira
(v8.3.4#803005)