Sahil Takiar created IMPALA-9606:
------------------------------------
Summary: ABFS reads should use hdfsPreadFully
Key: IMPALA-9606
URL: https://issues.apache.org/jira/browse/IMPALA-9606
Project: IMPALA
Issue Type: Bug
Components: Backend
Reporter: Sahil Takiar
Assignee: Sahil Takiar
In IMPALA-8525, hdfs preads were enabled by default when reading data from S3.
IMPALA-8525 deferred enabling preads for ABFS because they didn't significantly
improve performance. After some more investigation into the ABFS input streams,
I think it is safe to use {{hdfsPreadFully}} for ABFS reads.
The ABFS client uses a different model for fetching data compared to S3A.
Details are beyond the scope of this JIRA, but it is related to a feature in
ABFS called "read-aheads". ABFS has logic to pre-fetch data it *thinks* will be
required by the client. By default, it pre-fetches (number of cores) * 4 MB of
data. If the requested data is already in the client cache, it is read from the cache.
However, there is no real drawback to using {{hdfsPreadFully}} for ABFS reads.
It's definitely safer, because while the current implementation of ABFS always
returns the amount of requested data, only the {{hdfsPreadFully}} API makes
that guarantee.
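To illustrate the difference between a plain positional read and a "fully" variant, here is a minimal C++ sketch. It does not call libhdfs or ABFS; {{PreadFn}}, {{PreadFully}} and {{MakeShortReader}} are hypothetical names made up for this example, and the in-memory reader only simulates a stream that may return fewer bytes than requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <string>

// Hypothetical positional-read callback (not a libhdfs type). Like a plain
// pread, it may return fewer bytes than requested. It returns the number of
// bytes read, 0 at end of file, or -1 on error.
using PreadFn = std::function<int64_t(int64_t offset, char* buf, int64_t len)>;

// Keeps issuing positional reads until exactly `len` bytes have been copied
// into `buf`. Returns true on success, false on error or premature EOF.
bool PreadFully(const PreadFn& pread, int64_t offset, char* buf, int64_t len) {
  assert(buf != nullptr && len >= 0);
  int64_t total = 0;
  while (total < len) {
    int64_t n = pread(offset + total, buf + total, len - total);
    if (n <= 0) return false;  // error (-1) or EOF (0) before the range was filled
    total += n;
  }
  return true;
}

// Toy in-memory "stream" that never returns more than `max_chunk` bytes per
// call, to simulate short reads. This is an assumption for the demo only.
PreadFn MakeShortReader(const std::string& data, int64_t max_chunk) {
  return [data, max_chunk](int64_t offset, char* buf, int64_t len) -> int64_t {
    if (offset < 0 || len < 0) return -1;
    const int64_t size = static_cast<int64_t>(data.size());
    if (offset >= size) return 0;  // EOF
    const int64_t n = std::min({len, max_chunk, size - offset});
    std::memcpy(buf, data.data() + offset, static_cast<std::size_t>(n));
    return n;
  };
}
```

With a reader capped at 3 bytes per call, a single 6-byte read comes back short, while {{PreadFully}} loops until all 6 bytes are present or reports failure. A caller that relies on the plain read would have to write that loop itself, which is why the "fully" call is the safer choice even if the current ABFS stream happens to return everything in one call.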