[jira] [Commented] (IMPALA-9606) ABFS reads should use hdfsPreadFully

ASF subversion and git services (Jira) Thu, 01 Oct 2020 17:13:29 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-9606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205896#comment-17205896
 ]


ASF subversion and git services commented on IMPALA-9606:
---------------------------------------------------------

Commit 8e9cf51f6b328f500acf7c577289c5b888fd15d2 in impala's branch 
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8e9cf51 ]

IMPALA-9606: ABFS reads should use hdfsPreadFully

Similar to IMPALA-8525, but for ABFS instead of S3A.
I don't expect this to improve performance as much as it did
for S3A, although ad-hoc testing still shows a marginal gain
(about 5% faster scans). The reason is that the ABFS and S3A
clients are implemented very differently: ABFS already reads all
requested data in a single hdfsRead call.

I ran the query 'select * from abfs_test_store_sales order by
ss_net_profit limit 10;' several times to validate that perf
does not regress. In fact, it does improve slightly for this query.
The table 'abfs_test_store_sales' is a copy of the mini-cluster's
tpcds_parquet.store_sales, except that it is not partitioned.

Testing:
* Tested against an ABFS storage account I have access to
* Ran several queries to validate there are no functional
  or perf regressions.

Change-Id: I994ea30cf31abc66f5d82d9b3c8e185d2bd06147
Reviewed-on: http://gerrit.cloudera.org:8080/16531
Reviewed-by: Joe McDonnell <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> ABFS reads should use hdfsPreadFully
> ------------------------------------
>
>                 Key: IMPALA-9606
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9606
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> In IMPALA-8525, hdfs preads were enabled by default when reading data from 
> S3. IMPALA-8525 deferred enabling preads for ABFS because they didn't 
> significantly improve performance. After some more investigation into the 
> ABFS input streams, I think it is safe to use {{hdfsPreadFully}} for ABFS 
> reads.
> The ABFS client uses a different model for fetching data compared to S3A. 
> Details are beyond the scope of this JIRA, but it is related to a feature in 
> ABFS called "read-aheads". ABFS has logic to pre-fetch data it *thinks* will 
> be required by the client. By default, it pre-fetches (number of cores) * 4 MB 
> of data. If the requested data exists in the client cache, it is read from 
> the cache.
> However, there is no real drawback to using {{hdfsPreadFully}} for ABFS 
> reads. It's definitely safer, because while the current implementation of 
> ABFS always returns the amount of requested data, only the {{hdfsPreadFully}} 
> API makes that guarantee.
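
The guarantee mentioned above can be illustrated with a minimal sketch. It
does not use libhdfs: ShortReadStream and this PreadFully are hypothetical
stand-ins written for illustration only, and the 0 / -1 return convention is
an assumption of the sketch. The point is that a plain positional read may
return fewer bytes than requested, while a "read fully" call loops until the
whole range has been read or fails.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical in-memory stream (not a libhdfs type). Pread() deliberately
// returns at most max_chunk bytes per call to mimic a short positional read.
class ShortReadStream {
 public:
  ShortReadStream(std::string data, std::size_t max_chunk)
      : data_(std::move(data)), max_chunk_(max_chunk) {}

  // Copies up to len bytes starting at offset into buf.
  // Returns the number of bytes copied, or 0 at end of data.
  std::int64_t Pread(std::int64_t offset, void* buf, std::size_t len) const {
    assert(buf != nullptr && offset >= 0);
    const std::size_t pos = static_cast<std::size_t>(offset);
    if (pos >= data_.size()) return 0;
    const std::size_t n = std::min({len, max_chunk_, data_.size() - pos});
    std::memcpy(buf, data_.data() + pos, n);
    return static_cast<std::int64_t>(n);
  }

 private:
  std::string data_;
  std::size_t max_chunk_;
};

// Illustrative read-fully loop: keeps issuing positional reads until exactly
// len bytes are in buf. Returns 0 on success, -1 if the data ends first.
int PreadFully(const ShortReadStream& s, std::int64_t offset, void* buf,
               std::size_t len) {
  char* out = static_cast<char*>(buf);
  std::size_t done = 0;
  while (done < len) {
    const std::int64_t n =
        s.Pread(offset + static_cast<std::int64_t>(done), out + done, len - done);
    if (n <= 0) return -1;  // End of data before the request was satisfied.
    done += static_cast<std::size_t>(n);
  }
  return 0;
}
```

A caller of the single-shot Pread has to check the returned byte count and
retry itself; a caller of PreadFully only has to check for success or failure.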



