Hi Valencia,

It does look like a timing-related failure, but maybe a different one from IMPALA-3772. You could try applying this fix we have in review: https://gerrit.cloudera.org/#/c/3450/1
It's curious that there are all these timing failures. Are you running on a small VM or something like that? We typically run the tests on a fairly modest 2-core VM and we don't generally see these tests failing. There are a few tests that we know will fail on very slow machines or build types; we've used "@SkipIfBuildType.not_dev_build" and "specific_build_type_timeout" to deal with some of those cases.

On Fri, Jun 24, 2016 at 12:07 AM, Valencia Serrao <[email protected]> wrote:

> Hi Tim,
>
> I am seeing 'timed out' assertions for 2 custom cluster tests in
> test_admission_controller.py: tests test_admission_controller_with_flags
> and test_admission_controller_with_configs. I put a debug statement at
> line number 512 in test_admission_controller.py, as follows:
>
>     def run(self):
>       client = None
>       try:
>         try:
>           .............
>         except ImpalaBeeswaxException as e:
>           if "Rejected" in str(e):
>             ............
>           elif "exceeded timeout" in str(e):
>             LOG.debug("Query %s timed out", self.query_num)
>             self.query_state = 'TIMED OUT'
>             print "Query " + self.query_state  # added this line
>             return
>           else:
>             raise e
>       finally:
>         ..................
>
> I found that the queries in both test cases are getting timed out:
>
>     Query TIMED OUT
>     Query TIMED OUT
>
> The metrics printed in the logs are as follows:
>
>     Final Metric: {'dequeued': 13, 'rejected': 0, 'released': 28, 'admitted': 28, 'queued': 15, 'timed-out': 2}
>
> The assertion is similar to the one mentioned in JIRA IMPALA-3772
> <https://issues.cloudera.org/browse/IMPALA-3772>.
>
> Is this issue similar to the one you mentioned earlier in this thread?
>
> Regards,
> Valencia
>
>
> From: Nishidha Panpaliya/Austin/Contr/IBM
> To: Tim Armstrong <[email protected]>
> Cc: [email protected], Manish Patil/Austin/Contr/IBM@IBMUS,
> Sudarshan Jagadale/Austin/Contr/IBM@IBMUS, Valencia Serrao/Austin/Contr/IBM@IBMUS
> Date: 06/24/2016 11:56 AM
> Subject: Re: Custom cluster test failure in test_exchange_delays.py
> ------------------------------
>
> Thanks a lot Tim.
>
> We tried running the query on the impala shell after starting the Impala
> cluster with the given parameters, but the query still passes. So we tried
> changing the delay to 20000 and we got the expected exception. The same
> thing was verified in the test case too, by changing the test argument for
> the delay.
>
> But as you said, if the problem is timing sensitive and it is seen on
> other platforms too, we would not change the test case (to increase the
> delay) just to make it pass. We can ignore the failure.
>
> Thanks again,
> Nishidha
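As a rough illustration of the two mechanisms Tim mentions at the top of this mail, the sketch below shows how a test can be skipped, or given a longer timeout, on slow build types. Only the names SkipIfBuildType.not_dev_build and specific_build_type_timeout come from the mail; the module paths, the slow_build_timeout keyword, and the concrete timeout values are assumptions, so treat this as a sketch rather than the actual Impala test code.

  # Hedged sketch: import paths, the slow_build_timeout keyword and the numbers
  # below are assumptions; check tests/common/ in the Impala repo for the real
  # definitions of SkipIfBuildType and specific_build_type_timeout.
  import pytest
  from tests.common.skip import SkipIfBuildType                 # assumed path
  from tests.common.environ import specific_build_type_timeout  # assumed path


  @SkipIfBuildType.not_dev_build   # skip entirely on non-dev (e.g. release/coverage) builds
  @pytest.mark.execute_serially
  def test_that_needs_a_dev_build():
    pass  # ... test body that only behaves predictably on dev builds ...


  def test_with_build_specific_timeout():
    # Pick a larger timeout when the build type is known to be slow
    # (e.g. code coverage); 60/600 seconds are illustrative values only.
    timeout_s = specific_build_type_timeout(60, slow_build_timeout=600)
    assert timeout_s >= 60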
> From: Tim Armstrong <[email protected]>
> To: [email protected]
> Cc: Sudarshan Jagadale/Austin/Contr/IBM@IBMUS, Valencia Serrao/Austin/Contr/IBM@IBMUS,
> Manish Patil/Austin/Contr/IBM@IBMUS, Nishidha Panpaliya/Austin/Contr/IBM@IBMUS
> Date: 06/23/2016 10:30 PM
> Subject: Re: Custom cluster test failure in test_exchange_delays.py
> ------------------------------
>
> Hmm, that test is potentially timing sensitive. We've seen problems when
> running with slow builds (e.g. code coverage) or when running it on a
> particularly slow machine, e.g. a single-core VM. It's probably OK to skip
> the test on PowerPC if this is the case.
>
> The query is expected to fail, but in this case no failure is happening.
> It's a "custom cluster test" that configures the cluster in a way that
> queries will fail with a timeout. It's test coverage for a bug where, if
> the timeout happened, Impala returned incorrect results.
>
> If you run the query on Impala with the default startup arguments it
> should succeed.
>
> If you start up Impala with the special configuration used by those tests,
> it should fail. E.g. locally I get:
>
> tarmstrong@tarmstrong-box:~/Impala/Impala$ ./bin/start-impala-cluster.py --impalad_args=--datastream_sender_timeout_ms=5000 --impalad_args=--stress_datastream_recvr_delay_ms=10000
> Starting State Store logging to /home/tarmstrong/Impala/Impala/logs/cluster/statestored.INFO
> Starting Catalog Service logging to /home/tarmstrong/Impala/Impala/logs/cluster/catalogd.INFO
> Starting Impala Daemon logging to /home/tarmstrong/Impala/Impala/logs/cluster/impalad.INFO
> Starting Impala Daemon logging to /home/tarmstrong/Impala/Impala/logs/cluster/impalad_node1.INFO
> Starting Impala Daemon logging to /home/tarmstrong/Impala/Impala/logs/cluster/impalad_node2.INFO
> MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
> MainThread: Getting num_known_live_backends from tarmstrong-box:25000
> MainThread: Waiting for num_known_live_backends=3. Current value: 0
> MainThread: Getting num_known_live_backends from tarmstrong-box:25000
> MainThread: Waiting for num_known_live_backends=3. Current value: 0
> MainThread: Getting num_known_live_backends from tarmstrong-box:25000
> MainThread: Waiting for num_known_live_backends=3. Current value: 2
> MainThread: Getting num_known_live_backends from tarmstrong-box:25000
> MainThread: Waiting for num_known_live_backends=3. Current value: 2
> MainThread: Getting num_known_live_backends from tarmstrong-box:25000
> MainThread: num_known_live_backends has reached value: 3
> Waiting for Catalog... Status: 63 DBs / 1091 tables (ready=True)
> MainThread: Getting num_known_live_backends from tarmstrong-box:25001
> MainThread: num_known_live_backends has reached value: 3
> Waiting for Catalog... Status: 63 DBs / 1091 tables (ready=True)
> MainThread: Getting num_known_live_backends from tarmstrong-box:25002
> MainThread: num_known_live_backends has reached value: 3
> Waiting for Catalog... Status: 63 DBs / 1091 tables (ready=True)
> Impala Cluster Running with 3 nodes.
> tarmstrong@tarmstrong-box:~/Impala/Impala$ impala-shell.sh
> Starting Impala Shell without Kerberos authentication
> Connected to tarmstrong-box.ca.cloudera.com:21000
> Server version: impalad version 2.6.0-cdh5-INTERNAL DEBUG (build fe23dbf0465220a0c40a5c8431cb6a536e19dc6b)
>
> ***********************************************************************************
> Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.
> (Impala Shell v2.6.0-cdh5-INTERNAL (fe23dbf) built on Fri May 13 11:15:16 PDT 2016)
>
> You can run a single query from the command line using the '-q' option.
> ***********************************************************************************
> [tarmstrong-box.ca.cloudera.com:21000] > select count(*)
>                                        > from tpch.lineitem
>                                        > inner join tpch.orders on l_orderkey = o_orderkey
>                                        > ;
> Query: select count(*)
> from tpch.lineitem
> inner join tpch.orders on l_orderkey = o_orderkey
> WARNINGS:
> Sender timed out waiting for receiver fragment instance: 4cbdf04962743c:faa6717f926b5183
>
> (1 of 2 similar)
>
> You could try increasing the delay on your setup to see if you can replicate the failure.
>
>
> On Thu, Jun 23, 2016 at 3:54 AM, Nishidha Panpaliya <[email protected]> wrote:
>
> > Hi All,
> >
> > On power8, we are getting 3 custom cluster test failures: 2 test cases
> > failed in test_admission_controller.py and 1 in test_exchange_delays.py.
> > I investigated the test failure in test_exchange_delays.py and below are
> > my findings.
> >
> > 1. The test case that failed is "test_exchange_small_delay". This test
> >    uses the input test file "QueryTest/exchange-delays",
> >    --stress_datastream_recvr_delay_ms=10000 and
> >    --datastream_sender_timeout_ms=5000.
> > 2. The test is expected to throw an exception, with the message mentioned
> >    in the CATCH section of QueryTest/exchange-delays.
> > 3. However, at our end, the query in this test does not throw any
> >    exception; but since QueryTest/exchange-delays has a CATCH section,
> >    the test case fails due to the assertion in
> >    tests/common/impala_test_suite.py below:
> >        if 'CATCH' in test_section:
> >          assert test_section['CATCH'].strip() == ''
> > 4. If I remove the CATCH section from exchange-delays.test, then this
> >    test case passes; however, another test case in the same test file
> >    fails, as it throws an exception as per the inputs given to it but
> >    its CATCH section is missing.
> > 5. On another RHEL ppc machine, this test randomly passes, i.e. both
> >    test cases throw the exception as expected.
> >
> > I'm really confused as to what parameter is leading the test case
> > "test_exchange_small_delay" to not throw any exception in my setup, or
> > what should actually happen.
> > I checked the latest cdh5-trunk code on GitHub and it also has the same
> > test code and the same content in the query test file.
> >
> > Kindly provide me some pointers.
> >
> > Thanks,
> > Nishidha
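For reference, the --stress_datastream_recvr_delay_ms and --datastream_sender_timeout_ms flags Tim passes to start-impala-cluster.py above are the same kind of flags a custom cluster test pins through its decorator, which is why the timeout failure only reproduces under that special configuration. A rough sketch of the shape of such a test follows; the class name is illustrative and the exact decorator arguments and flag values in test_exchange_delays.py may differ.

  # Rough sketch of a custom cluster test wired to the flags shown above.
  # The class name and decorator argument formatting are illustrative; check
  # tests/custom_cluster/test_exchange_delays.py for the real definition.
  import pytest
  from tests.common.custom_cluster_test_suite import CustomClusterTestSuite  # assumed path


  class TestExchangeDelaysSketch(CustomClusterTestSuite):

    @pytest.mark.execute_serially
    @CustomClusterTestSuite.with_args(
        "--stress_datastream_recvr_delay_ms=10000 "
        "--datastream_sender_timeout_ms=5000")
    def test_exchange_small_delay(self, vector):
      # Runs the queries in QueryTest/exchange-delays and checks each section,
      # including any CATCH section, against the actual query result or error.
      self.run_test_case('QueryTest/exchange-delays', vector)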
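To make the CATCH section Nishidha describes concrete: Impala's .test files are split into delimited sections, and a CATCH section tells the framework that the query must fail with a matching error; if the query instead succeeds, the assert quoted from impala_test_suite.py above fires, which is exactly the failure being reported. A sketch of that shape follows, using the query and error string from Tim's shell session; the actual contents of QueryTest/exchange-delays.test may differ.

  ====
  ---- QUERY
  select count(*)
  from tpch.lineitem
  inner join tpch.orders on l_orderkey = o_orderkey
  ---- CATCH
  Sender timed out waiting for receiver fragment instance
  ====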
