[ 
https://issues.apache.org/jira/browse/IMPALA-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766371#comment-16766371
 ] 

Thomas Tauber-Marshall commented on IMPALA-8183:
------------------------------------------------

Alright - figured out what's going on.

The test is designed to make ReportExecStatus() RPCs fail by backing up the 
control service queue. Prior to IMPALA-4555, after a failed ReportExecStatus() 
we would wait 'report_status_retry_interval_ms' between retries, which was 
100ms by default and wasn't touched by the test. That 100ms was right on the 
edge of being enough time for the coordinator to keep up with processing the 
reports, so some would fail but most would succeed. It was always possible 
to hit IMPALA-2990 in this setup, but it was unlikely.

Now, we wait 'status_report_interval_ms' instead. By default this is 5000ms, 
which should give the coordinator even more time and make these issues less 
likely. However, the test sets 'status_report_interval_ms' to 10ms, which isn't 
nearly enough time for the coordinator to do its processing, so lots of the 
ReportExecStatus() RPCs fail and we hit IMPALA-2990 pretty often.
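
To illustrate the effect of the interval, here is a rough toy model in Python. 
It is only a sketch and not Impala code: the function name, the three senders, 
the 20ms processing time and the queue capacity of 4 are all made-up 
assumptions. It just counts how many reports a bounded queue accepts or 
rejects when every sender retries on the same fixed interval.

```python
def simulate_reports(interval_ms, num_senders=3, service_ms=20,
                     queue_capacity=4, duration_ms=10_000):
    """Toy model (hypothetical numbers): returns (accepted, rejected)."""
    queue_len = 0   # reports waiting in the bounded queue
    busy_ms = 0     # time left on the report currently being processed
    accepted = rejected = 0
    for t in range(duration_ms):
        # The receiver works off its current report, then pulls the next one.
        if busy_ms > 0:
            busy_ms -= 1
        if busy_ms == 0 and queue_len > 0:
            queue_len -= 1
            busy_ms = service_ms
        # Every sender fires on the same interval, whether or not the
        # previous attempt was accepted.
        if t % interval_ms == 0:
            for _ in range(num_senders):
                if queue_len < queue_capacity:
                    queue_len += 1
                    accepted += 1
                else:
                    rejected += 1
    return accepted, rejected


if __name__ == "__main__":
    for interval in (5000, 100, 10):
        print(interval, simulate_reports(interval))
```

With these assumed numbers, the 5000ms and 100ms intervals leave the receiver 
enough time to drain the queue, while at 10ms reports arrive faster than they 
can be processed and most attempts are rejected.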

I'm not sure what the solution is yet. The test will need to be reworked, but 
at least this isn't a bug in IMPALA-4555.

> TestRPCTimeout.test_reportexecstatus_retry times out
> ----------------------------------------------------
>
>                 Key: IMPALA-8183
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8183
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 3.2.0
>            Reporter: Andrew Sherman
>            Assignee: Thomas Tauber-Marshall
>            Priority: Blocker
>
> There are 2 forms of failure: one where the test itself times out, and one 
> where the whole test run times out, suspiciously just after running 
> test_reportexecstatus_retry.
> {quote}
> Error Message
> Failed: Timeout >7200s
> Stacktrace
> custom_cluster/test_rpc_timeout.py:143: in test_reportexecstatus_retry
>     self.execute_query_verify_metrics(self.TEST_QUERY, None, 10)
> custom_cluster/test_rpc_timeout.py:45: in execute_query_verify_metrics
>     self.execute_query(query, query_options)
> common/impala_test_suite.py:601: in wrapper
>     return function(*args, **kwargs)
> common/impala_test_suite.py:632: in execute_query
>     return self.__execute_query(self.client, query, query_options)
> common/impala_test_suite.py:699: in __execute_query
>     return impalad_client.execute(query, user=user)
> common/impala_connection.py:174: in execute
>     return self.__beeswax_client.execute(sql_stmt, user=user)
> beeswax/impala_beeswax.py:183: in execute
>     handle = self.__execute_query(query_string.strip(), user=user)
> beeswax/impala_beeswax.py:360: in __execute_query
>     self.wait_for_finished(handle)
> beeswax/impala_beeswax.py:384: in wait_for_finished
>     time.sleep(0.05)
> E   Failed: Timeout >7200s
> {quote}
> {quote}
> Test run timed out. This probably happened due to a hung thread which can be 
> confirmed by looking at the stacktrace of running impalad processes at 
> /data/jenkins/workspace/xxx/repos/Impala/logs/timeout_stacktrace
> {quote}


