[ 
https://issues.apache.org/jira/browse/IMPALA-7517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599160#comment-16599160
 ] 

ASF subversion and git services commented on IMPALA-7517:
---------------------------------------------------------

Commit 13e93e75bf1cab44f80ceee17b0b0abde8ccd034 in impala's branch 
refs/heads/master from [~tlipcon]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=13e93e7 ]

IMPALA-7517. Fix hang in scanner threads when soft limit is exceeded

As described in the JIRA, when scanner threads see that the soft limit
has been exceeded, they try to shut down. In some particular
interleavings, this would cause all of the scanner threads to exit
without any of them marking the scan as completed.

This patch adds a new fault point to inject fake soft limit errors, and
adds this fault point to the scanner test. With the previous placement
of the soft limit check, the test hung queries pretty reliably. With
the new placement of the soft limit check, the test now passes.

Change-Id: I3dc1a2ec79c823575d7d40e7b52216dea5b0ddde
Reviewed-on: http://gerrit.cloudera.org:8080/11369
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Todd Lipcon <[email protected]>


> Hung scanner when soft memory limit exceeded
> --------------------------------------------
>
>                 Key: IMPALA-7517
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7517
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.1.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>             Fix For: Impala 3.1.0
>
>
> As reported on the mailing list, this is a regression due to IMPALA-7096 
> (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the 
> following code:
> {code}
>     // Stop extra threads if we're over a soft limit in order to free up memory.
>     if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
>       break;
>     }
>  
>     // Done with range and it completed successfully
>     if (progress_.done()) {
>       // All ranges are finished.  Indicate we are done.
>       SetDone();
>       break;
>     }
>  
>     if (scan_range == nullptr && num_unqueued_files == 0) {
>       unique_lock<mutex> l(lock_);
>       // All ranges have been queued and DiskIoMgr has no more new ranges for this scan
>       // node to process. This means that every range is either done or being processed by
>       // another thread.
>       all_ranges_started_ = true;
>       break;
>     }
>   }
> {code}
>  
> What if we have the following scenario:
>   
>  T1) grab scan range 1 and start processing
>   
>  T2) grab scan range 2 and start processing
>   
>  T1) finish scan range 1 and see that 'progress_' is not done()
>  T1) loop around, get no scan range (there are no more), so set 
> all_ranges_started_ and break
>  T1) thread exits
>   
>  T2) finish scan range 2
>  T2) happen to hit a soft memory limit error due to pressure from other exec 
> nodes, etc. Since we aren't the first thread, we break (even though the 
> first thread is no longer running).
>  T2) thread exits
>   
>  Note that neither thread reaches the call to SetDone(), because T2 breaks on 
> the memory limit error _before_ checking progress_.done().
>   
>  Thus, the query will hang forever.
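
The interleaving above can be replayed deterministically. The following is a
minimal single-threaded sketch, not the actual Impala code: ScanState,
ExitChecks, RunScenario and CheckOrder are hypothetical names made up for
illustration, the num_unqueued_files condition is left out, and
"kProgressFirst" is only one plausible reordering of the checks. The commit
message says the soft limit check was moved but does not say where, so the
real patch may differ.

```cpp
#include <cassert>

// Hypothetical model of the exit checks at the bottom of the scanner loop.
enum class CheckOrder { kSoftLimitFirst, kProgressFirst };

struct ScanState {
  int total_ranges = 2;
  int next_range = 0;        // index of the next unassigned scan range
  int completed_ranges = 0;  // stand-in for progress_
  bool done = false;         // stand-in for the effect of SetDone()
  bool all_ranges_started = false;
  bool soft_limit_exceeded = false;

  bool ProgressDone() const { return completed_ranges == total_ranges; }
  // Returns true if a range was handed out (scan_range != nullptr).
  bool GrabRange() {
    if (next_range >= total_ranges) return false;
    ++next_range;
    return true;
  }
  void FinishRange() { ++completed_ranges; }
};

// Returns true if the calling thread breaks out of its loop.
bool ExitChecks(ScanState& s, bool first_thread, bool had_range,
                CheckOrder order) {
  if (order == CheckOrder::kSoftLimitFirst) {
    // Ordering quoted in the issue: soft limit check before progress check.
    if (!first_thread && s.soft_limit_exceeded) return true;
    if (s.ProgressDone()) { s.done = true; return true; }
  } else {
    // Assumed alternative: check progress before the soft limit.
    if (s.ProgressDone()) { s.done = true; return true; }
    if (!first_thread && s.soft_limit_exceeded) return true;
  }
  if (!had_range) { s.all_ranges_started = true; return true; }
  return false;
}

// Replays the T1/T2 interleaving; returns whether the scan was marked done.
bool RunScenario(CheckOrder order) {
  ScanState s;
  bool t1_range = s.GrabRange();  // T1 grabs scan range 1
  bool t2_range = s.GrabRange();  // T2 grabs scan range 2

  s.FinishRange();                // T1 finishes range 1 (1 of 2 complete)
  bool t1_exits = ExitChecks(s, /*first_thread=*/true, t1_range, order);
  assert(!t1_exits);              // T1 loops around
  t1_range = s.GrabRange();       // no ranges left
  t1_exits = ExitChecks(s, /*first_thread=*/true, t1_range, order);
  assert(t1_exits);               // T1 sets all_ranges_started and exits
  (void)t1_exits;

  s.FinishRange();                // T2 finishes range 2 (2 of 2 complete)
  s.soft_limit_exceeded = true;   // pressure from other exec nodes
  ExitChecks(s, /*first_thread=*/false, t2_range, order);  // T2 exits
  return s.done;
}
```

In this model, RunScenario(CheckOrder::kSoftLimitFirst) returns false: both
threads have exited and nothing marked the scan as done, which corresponds to
the hang. RunScenario(CheckOrder::kProgressFirst) returns true, because T2
sees that progress is complete before it looks at the soft limit.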



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)