[ 
https://issues.apache.org/jira/browse/IMPALA-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632491#comment-17632491
 ] 

ASF subversion and git services commented on IMPALA-11704:
----------------------------------------------------------

Commit 15b07ff1fb348be2c75e2176e88feb5ef76fde42 in impala's branch 
refs/heads/master from Michael Smith
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=15b07ff1f ]

IMPALA-11704: (Addendum) fix crash on open for HDFS cache

When trying to read from HDFS cache, ReadFromCache calls
FileReader::Open(false) to force the file to open. The prior commit for
IMPALA-11704 didn't allow for that case when using a data cache, as the
data cache check would always happen. This resulted in a crash calling
CachedFile as exclusive_hdfs_fh_ was nullptr. Tests only catch this when
reading from HDFS cache with data cache enabled.

Replaces explicit arguments to override FileReader behavior with a flag
to communicate whether FileReader supports delayed open. Then the caller
can choose whether to call Open before read. Also simplifies calls to
ReadFromPos as it already has a pointer to ScanRange and can check
whether file handle caching is enabled directly. The Open call in
DoInternalRead uses a slightly wider net by only checking UseDataCache.
If the data cache is unavailable or the lookup misses, the file is then opened.
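
As a rough illustration of the delayed-open idea described above, here is a minimal C++ sketch. It is not Impala's code: SketchReader, SupportsDelayedOpen, ScanRead, and the force_open and cache_hit parameters are hypothetical names invented for this example, and the data cache is simulated by a boolean. The reader only advertises whether it can defer opening; the caller decides whether to call Open before the read.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a file reader; not Impala's actual FileReader.
class SketchReader {
 public:
  explicit SketchReader(bool use_data_cache) : use_data_cache_(use_data_cache) {}

  // Tells the caller whether it may skip Open() before reading.
  // Assumption for this sketch: only a reader backed by a data cache can defer.
  bool SupportsDelayedOpen() const { return use_data_cache_; }

  // Idempotent open; counts how often the (simulated) expensive open happens.
  void Open() {
    if (!is_open_) {
      is_open_ = true;
      ++open_calls_;
    }
  }

  // Serves from the simulated data cache on a hit; otherwise opens the file
  // lazily (if not already open) and reads from it.
  std::string Read(bool cache_hit) {
    if (use_data_cache_ && cache_hit) return "cache";
    Open();
    return "file";
  }

  int open_calls() const { return open_calls_; }

 private:
  bool use_data_cache_;
  bool is_open_ = false;
  int open_calls_ = 0;
};

// Caller-side policy: open eagerly when the caller needs a handle up front
// (force_open, e.g. a path that reads through the handle directly) or when
// the reader cannot defer the open.
std::string ScanRead(SketchReader& reader, bool force_open, bool cache_hit) {
  if (force_open || !reader.SupportsDelayedOpen()) reader.Open();
  return reader.Read(cache_hit);
}
```

With this shape, a data cache hit on the normal path never opens the file, while a caller that needs the handle up front still gets an opened reader.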

Adds a select from tpch.nation to the query for test_data_cache.py as
something that triggers checking the HDFS cache.

Change-Id: I741488d6195e586917de220a39090895886a2dc5
Reviewed-on: http://gerrit.cloudera.org:8080/19228
Reviewed-by: Joe McDonnell <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Remote Ozone scans are slow even after data cache warmup
> --------------------------------------------------------
>
>                 Key: IMPALA-11704
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11704
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.1.1
>            Reporter: Michael Smith
>            Assignee: Michael Smith
>            Priority: Major
>             Fix For: Impala 4.2.0
>
>
> From [~drorke]:
> {quote}
> Running some basic performance sanity tests ... with Impala TPC-DS queries 
> against Ozone vs HDFS.  Impala appears to be using its data cache for both 
> Ozone and HDFS remote reads, but in the case of Ozone reads I'm still seeing 
> long scan times and high I/O wait times even after cache warmup. Excerpts 
> below from profiles of q90.  Note in both cases the Impala profiles show 100% 
> cache hit rates but for some reason the scan IO wait times are still much 
> longer for the Ozone scans.
> {noformat}
> HDFS:
> - TotalTime: 1s924ms
> - ScannerIoWaitTime: 52.037ms
> Ozone:
> - TotalTime: 8s917ms
> - ScannerIoWaitTime: 7s454ms{noformat}
> If I disable the local cache explicitly via query option I get the following 
> times for the same scan:
> {noformat}
> HDFS:
> - TotalTime: 7s792ms
> - ScannerIoWaitTime: 6s244ms
> Ozone:
> - TotalTime: 8s963ms
> - ScannerIoWaitTime: 7s464ms{noformat}
> {quote}
> Investigating a bit, [~joemcdonnell] noticed in the Ozone profile
> {noformat}
>  - ScannerIoWaitTime: 7s454ms
>  - TotalRawHdfsOpenFileTime: 5s782ms
> {noformat}
> Based on profile differences around {{TotalRawHdfsOpenFileTime=5s782ms}} (vs 
> {{0ms}} for HDFS), I believe this is a difference in performance when using 
> the data cache but the file handle cache is disabled. That traces back to an 
> incomplete implementation of 
> [IMPALA-10147|https://issues.apache.org/jira/browse/IMPALA-10147].
> A data read:
> 1. [Checks that it can open a file 
> handle|https://github.infra.cloudera.com/CDH/Impala/blob/CDWH-2022.0.10.1/be/src/runtime/io/scan-range.cc#L199].
>  When file handle cache is enabled, this is a 
> [noop|https://github.infra.cloudera.com/CDH/Impala/blob/CDWH-2022.0.10.1/be/src/runtime/io/hdfs-file-reader.cc#L67].
> 2. It will then try to read data. If data cache is enabled, it will [try to 
> read from the data 
> cache|https://github.infra.cloudera.com/CDH/Impala/blob/CDWH-2022.0.10.1/be/src/runtime/io/hdfs-file-reader.cc#L137].
> 3. If data cache hits, that data is returned and any open file handles are 
> unused.
> When the file handle cache is disabled, opening the file handle [calls 
> hdfsOpenFile and 
> hdfsSeek|https://github.infra.cloudera.com/CDH/Impala/blob/CDWH-2022.0.10.1/be/src/runtime/io/hdfs-file-reader.cc#L70-L72].
>  {{hdfsOpenFile}} in particular is monitored and added to the profile as 
> {{TotalRawHdfsOpenFileTime}}. That time in the Ozone profile accounts for 
> most of the difference in performance between HDFS and Ozone in this case.
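
The read steps described in the issue can be modeled with a small counting sketch in C++. This is a toy model, not Impala's implementation: ReadStats, SimulateReads, and their parameters are hypothetical, and every read is assumed to behave identically. It only shows why opening the file before consulting the data cache costs one raw open per read when the file handle cache is disabled, even at a 100% data cache hit rate.

```cpp
#include <cassert>

// Hypothetical counters for the toy model; not Impala profile counters.
struct ReadStats {
  int raw_opens = 0;   // simulated hdfsOpenFile calls
  int cache_hits = 0;  // reads served from the data cache
  int file_reads = 0;  // reads served from the file itself
};

// Models the pre-fix flow: step 1 opens a handle, steps 2-3 try the data cache.
ReadStats SimulateReads(int num_reads, bool file_handle_cache_enabled,
                        bool data_cache_hit) {
  ReadStats stats;
  for (int i = 0; i < num_reads; ++i) {
    // Step 1: with the file handle cache enabled the open is a no-op here
    // (fetching a cached handle on a miss is not counted as a raw open in
    // this sketch); without it, every read pays a raw open.
    if (!file_handle_cache_enabled) ++stats.raw_opens;
    // Steps 2-3: on a data cache hit the handle opened above goes unused.
    if (data_cache_hit) {
      ++stats.cache_hits;
    } else {
      ++stats.file_reads;
    }
  }
  return stats;
}
```

In this model, 100 reads with the handle cache disabled and a warm data cache still perform 100 raw opens, which mirrors the nonzero open time seen in the Ozone profile above.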


