Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12968 )
Change subject: IMPALA-8322: Add periodic dirty check of done_ in ThreadTokenAvailableCb
......................................................................

IMPALA-8322: Add periodic dirty check of done_ in ThreadTokenAvailableCb

When HdfsScanNode is cancelled or hits an error, SetDoneInternal() holds HdfsScanNode::lock_ while it runs RequestContext::Cancel(), which can wait on IO threads to complete outstanding IOs. This can cause a cascade of blocked threads that causes Prepare() to take a significant time and cause datastream sender timeouts.

The specific scenario seen has this set of threads:

Thread 1: A DiskIoMgr thread is blocked on IO in hdfsOpenFile() or hdfsRead(), holding HdfsFileReader::lock_.

Thread 2: An HDFS scanner thread is blocked in HdfsScanNode::SetDoneInternal() -> RequestContext::Cancel() -> ScanRange::CancelInternal(), waiting on HdfsFileReader::lock_. It is holding HdfsScanNode::lock_.

Thread 3: A thread in ThreadResourceMgr::DestroyPool() -> (a few layers) -> HdfsScanNode::ThreadTokenAvailableCb() is blocked waiting on HdfsScanNode::lock_ while holding ThreadResourceMgr::lock_.

Thread 4: A thread in FragmentInstanceState::Prepare() -> RuntimeState::Init() -> ThreadResourceMgr::CreatePool() is blocked waiting on ThreadResourceMgr::lock_.

When Prepare() takes a significant time, datastream senders will time out waiting for the datastream receivers to start up. This causes failed queries. S3 has higher latencies for IO and does not have file handle caching, so S3 is more susceptible to this issue than other platforms.

This changes HdfsScanNode::ThreadTokenAvailableCb() to periodically do a dirty check of HdfsScanNode::done_ when waiting to acquire the lock. This avoids the blocking experienced by Thread 3 in the example above.
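
The periodic dirty check described above can be illustrated with a minimal standalone sketch. This is not the code from the patch: ScanNodeSketch, SetDone(), the slow_cancel callback, the poll interval, and DemoCallbackBailsOutWhileLockHeld() are hypothetical names introduced for illustration, and the sketch assumes the done flag is an atomic that is published before the slow cancellation step runs under the lock. It uses std::timed_mutex from the C++ standard library rather than Impala's own lock types.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-in for HdfsScanNode; not the actual Impala class.
class ScanNodeSketch {
 public:
  // Marks the node done, then runs a potentially slow cancellation step while
  // still holding lock_ (mirroring the role of Thread 2 in the scenario).
  // Assumption: done_ is published before the slow step starts.
  void SetDone(const std::function<void()>& slow_cancel) {
    std::lock_guard<std::timed_mutex> l(lock_);
    done_.store(true, std::memory_order_release);
    slow_cancel();
  }

  // Returns true if the callback acquired lock_ and may start work, or false
  // if it gave up because done_ was observed to be set.
  bool ThreadTokenAvailableCb(std::chrono::milliseconds poll_interval) {
    assert(poll_interval.count() > 0);
    std::unique_lock<std::timed_mutex> l(lock_, std::defer_lock);
    while (!l.try_lock_for(poll_interval)) {
      // Dirty check: read done_ without holding lock_. If the node is already
      // done there is nothing useful to do, so stop waiting for the lock.
      if (done_.load(std::memory_order_acquire)) return false;
    }
    // Re-check under the lock before doing any real work.
    if (done_.load(std::memory_order_acquire)) return false;
    // ... a real callback would start scanner threads here ...
    return true;
  }

 private:
  std::timed_mutex lock_;
  std::atomic<bool> done_{false};
};

// Returns true if the callback bails out while another thread is still
// holding lock_ inside SetDone().
bool DemoCallbackBailsOutWhileLockHeld() {
  const std::chrono::milliseconds tick(1);
  ScanNodeSketch node;
  std::atomic<bool> in_cancel{false};
  std::atomic<bool> release{false};
  std::thread canceller([&] {
    node.SetDone([&] {
      in_cancel.store(true);
      // Simulate cancellation that is stuck waiting on slow IO.
      while (!release.load()) std::this_thread::sleep_for(tick);
    });
  });
  while (!in_cancel.load()) std::this_thread::sleep_for(tick);
  // lock_ is still held by the canceller; the callback must return anyway.
  bool started = node.ThreadTokenAvailableCb(std::chrono::milliseconds(5));
  release.store(true);
  canceller.join();
  return !started;
}
```

In this sketch the callback never waits longer than roughly one poll interval once done_ is set, so a caller holding an outer lock (ThreadResourceMgr::lock_ in the scenario) can release it promptly. The re-check after acquiring the lock covers the case where done_ is set between the last dirty check and the successful acquisition.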
Testing:
- Ran tests on normal HDFS and repeatedly on S3

Change-Id: I4881a3e5bfda64e8d60af95ad13b450cf7f8c130
Reviewed-on: http://gerrit.cloudera.org:8080/12968
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
---
M be/src/common/names.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/runtime/io/request-ranges.h
4 files changed, 51 insertions(+), 31 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/12968
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I4881a3e5bfda64e8d60af95ad13b450cf7f8c130
Gerrit-Change-Number: 12968
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
