Hi Tim,

Thanks for the fix; applying it resolved the two timed-out assertion failures mentioned earlier.
I'm executing these tests on a VM with the following configuration:

OS: Ubuntu 15.10
Architecture: ppc64le
RAM: 110 GB
HDD: 210 GB
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 1

Regards,
Valencia

From: Tim Armstrong <[email protected]>
To: Valencia Serrao/Austin/Contr/IBM@IBMUS
Cc: [email protected], Manish Patil/Austin/Contr/IBM@IBMUS, Sudarshan Jagadale/Austin/Contr/IBM@IBMUS, Nishidha Panpaliya/Austin/Contr/IBM@IBMUS
Date: 06/24/2016 09:17 PM
Subject: Re: Custom cluster test failure in test_admission_controller.py

Hi Valencia,

It does look like a timing-related failure, but maybe a different one from IMPALA-3772. You could try applying this fix we have in review: https://gerrit.cloudera.org/#/c/3450/1

It's curious that there are all these timing failures. Are you running on a small VM or something like that? We typically run the tests on a fairly modest 2-core VM and we don't generally see these tests failing. There are a few tests that we know will fail on very slow machines or build types; we've used "@SkipIfBuildType.not_dev_build" and "specific_build_type_timeout" to deal with some of those cases.

On Fri, Jun 24, 2016 at 12:07 AM, Valencia Serrao <[email protected]> wrote:

Hi Tim,

I am seeing 'timed out' assertions for 2 custom cluster tests in test_admission_controller.py: test_admission_controller_with_flags and test_admission_controller_with_configs.

I added a debug statement at line 512 of test_admission_controller.py, as follows:

    def run(self):
      client = None
      try:
        try:
          .............
        except ImpalaBeeswaxException as e:
          if "Rejected" in str(e):
            ............
          elif "exceeded timeout" in str(e):
            LOG.debug("Query %s timed out", self.query_num)
            self.query_state = 'TIMED OUT'
            print "Query " + self.query_state  # added this line
            return
          else:
            raise e
      finally:
        ..................

I found that the queries in both test cases are timing out:

Query TIMED OUT
Query TIMED OUT

The metrics printed in the logs are as follows:

Final Metric: {'dequeued': 13, 'rejected': 0, 'released': 28, 'admitted': 28, 'queued': 15, 'timed-out': 2}

The assertion is similar to the one mentioned in JIRA IMPALA-3772. Is this issue similar to the one you mentioned earlier in this thread?

Regards,
Valencia

From: Nishidha Panpaliya/Austin/Contr/IBM
To: Tim Armstrong <[email protected]>
Cc: [email protected], Manish Patil/Austin/Contr/IBM@IBMUS, Sudarshan Jagadale/Austin/Contr/IBM@IBMUS, Valencia Serrao/Austin/Contr/IBM@IBMUS
Date: 06/24/2016 11:56 AM
Subject: Re: Custom cluster test failure in test_exchange_delays.py

Thanks a lot Tim. We tried running the query on the impala shell after starting the impala cluster with the given parameters, but the query is still passing. So we tried changing the delay to 20000 and we got the expected exception. The same thing is verified in the test case too, by changing the test argument for the delay. But as you said, if the problem is timing sensitive and it is seen on other platforms too, we would not change the test case (to increase the delay) just to make it pass. We can ignore the failure.

Thanks again,
Nishidha
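For reference, below is a minimal sketch of the skip mechanism Tim mentions above ("@SkipIfBuildType.not_dev_build") applied to the exchange-delays test. It assumes the usual layout of the Impala test tree; the class name, decorator arguments, and module paths are illustrative and may not match the real test_exchange_delays.py.

    # Sketch only: how a timing-sensitive custom cluster test might be skipped
    # on slow (non-dev) builds. Module paths and decorator usage are assumptions
    # based on the Impala test tree, not the actual test file contents.
    from tests.common.custom_cluster_test_suite import CustomClusterTestSuite
    from tests.common.skip import SkipIfBuildType

    class TestExchangeDelays(CustomClusterTestSuite):

      @SkipIfBuildType.not_dev_build  # skip on slow builds such as code coverage
      @CustomClusterTestSuite.with_args(
          impalad_args="--stress_datastream_recvr_delay_ms=10000 "
                       "--datastream_sender_timeout_ms=5000")
      def test_exchange_small_delay(self, vector):
        # The CATCH section of the .test file expects the sender-timeout error.
        self.run_test_case('QueryTest/exchange-delays', vector)

The other helper Tim names, specific_build_type_timeout, appears to take the opposite approach: instead of skipping, the test asks for a larger timeout when the build type is known to be slow, so it still runs but with more headroom.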
From: Tim Armstrong <[email protected]>
To: [email protected]
Cc: Sudarshan Jagadale/Austin/Contr/IBM@IBMUS, Valencia Serrao/Austin/Contr/IBM@IBMUS, Manish Patil/Austin/Contr/IBM@IBMUS, Nishidha Panpaliya/Austin/Contr/IBM@IBMUS
Date: 06/23/2016 10:30 PM
Subject: Re: Custom cluster test failure in test_exchange_delays.py

Hmm, that test is potentially timing sensitive. We've seen problems when running with slow builds (e.g. code coverage) or when running it on a particularly slow machine, e.g. a single-core VM. It's probably OK to skip the test on PowerPC if this is the case.

The query is expected to fail, but in this case no failure is happening. It's a "custom cluster test" that configures the cluster in a way that queries will fail with a timeout. It's test coverage for a bug where, if the timeout happened, Impala returned incorrect results. If you run the query on Impala with the default startup arguments it should succeed. If you start up Impala with the special configuration used by those tests, it should fail. E.g. locally I get:

tarmstrong@tarmstrong-box:~/Impala/Impala$ ./bin/start-impala-cluster.py --impalad_args=--datastream_sender_timeout_ms=5000 --impalad_args=--stress_datastream_recvr_delay_ms=10000
Starting State Store logging to /home/tarmstrong/Impala/Impala/logs/cluster/statestored.INFO
Starting Catalog Service logging to /home/tarmstrong/Impala/Impala/logs/cluster/catalogd.INFO
Starting Impala Daemon logging to /home/tarmstrong/Impala/Impala/logs/cluster/impalad.INFO
Starting Impala Daemon logging to /home/tarmstrong/Impala/Impala/logs/cluster/impalad_node1.INFO
Starting Impala Daemon logging to /home/tarmstrong/Impala/Impala/logs/cluster/impalad_node2.INFO
MainThread: Found 3 impalad/1 statestored/1 catalogd process(es)
MainThread: Getting num_known_live_backends from tarmstrong-box:25000
MainThread: Waiting for num_known_live_backends=3. Current value: 0
MainThread: Getting num_known_live_backends from tarmstrong-box:25000
MainThread: Waiting for num_known_live_backends=3. Current value: 0
MainThread: Getting num_known_live_backends from tarmstrong-box:25000
MainThread: Waiting for num_known_live_backends=3. Current value: 2
MainThread: Getting num_known_live_backends from tarmstrong-box:25000
MainThread: Waiting for num_known_live_backends=3. Current value: 2
MainThread: Getting num_known_live_backends from tarmstrong-box:25000
MainThread: num_known_live_backends has reached value: 3
Waiting for Catalog... Status: 63 DBs / 1091 tables (ready=True)
MainThread: Getting num_known_live_backends from tarmstrong-box:25001
MainThread: num_known_live_backends has reached value: 3
Waiting for Catalog... Status: 63 DBs / 1091 tables (ready=True)
MainThread: Getting num_known_live_backends from tarmstrong-box:25002
MainThread: num_known_live_backends has reached value: 3
Waiting for Catalog... Status: 63 DBs / 1091 tables (ready=True)
Impala Cluster Running with 3 nodes.

tarmstrong@tarmstrong-box:~/Impala/Impala$ impala-shell.sh
Starting Impala Shell without Kerberos authentication
Connected to tarmstrong-box.ca.cloudera.com:21000
Server version: impalad version 2.6.0-cdh5-INTERNAL DEBUG (build fe23dbf0465220a0c40a5c8431cb6a536e19dc6b)
***********************************************************************************
Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.
(Impala Shell v2.6.0-cdh5-INTERNAL (fe23dbf) built on Fri May 13 11:15:16 PDT 2016)
You can run a single query from the command line using the '-q' option.
***********************************************************************************
[tarmstrong-box.ca.cloudera.com:21000] > select count(*)
> from tpch.lineitem
> inner join tpch.orders on l_orderkey = o_orderkey
> ;
Query: select count(*) from tpch.lineitem inner join tpch.orders on l_orderkey = o_orderkey
WARNINGS: Sender timed out waiting for receiver fragment instance: 4cbdf04962743c:faa6717f926b5183 (1 of 2 similar)

You could try increasing the delay on your setup to see if you can replicate the failure.

On Thu, Jun 23, 2016 at 3:54 AM, Nishidha Panpaliya <[email protected]> wrote:

Hi All,

On Power8, we are getting 3 failures in the custom cluster tests: 2 test cases failed in test_admission_controller.py and 1 in test_exchange_delays.py. I investigated the failure in test_exchange_delays.py and below are my findings. The failing test case is "test_exchange_small_delay".

1. This test uses the input test file "QueryTest/exchange-delays" with --stress_datastream_recvr_delay_ms=10000 and --datastream_sender_timeout_ms=5000.
2. The test is expected to throw an exception, with the message given in the CATCH section of QueryTest/exchange-delays.
3. However, at our end the query in this test does not throw any exception, but since QueryTest/exchange-delays has a CATCH section, the test case fails due to the assertion in tests/common/impala_test_suite.py:

       if 'CATCH' in test_section:
         assert test_section['CATCH'].strip() == ''

4. If I remove the CATCH section from the exchange-delays.test file, then this test case passes; however, another test case in the same test file fails, as it throws the exception expected for its inputs but the CATCH section is missing.
5. On another RHEL ppc machine, this test randomly passes, i.e. both test cases throw the exception as expected.

I'm really confused as to what parameter is leading the test case "test_exchange_small_delay" to not throw any exception in my setup, or what should actually happen. I checked the latest cdh5-trunk code on GitHub and it has the same test code and the same content in the query test file. Kindly provide me some pointers.

Thanks,
Nishidha
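To make the interaction in point 3 concrete, here is a simplified, hypothetical paraphrase in Python of how a CATCH section is checked. It is not the real code from tests/common/impala_test_suite.py, and the function name and error text are only illustrative.

    # Simplified paraphrase of the CATCH handling (not the real harness code).
    def verify_catch_section(test_section, exception_text=None):
      """test_section is the parsed .test file section; exception_text is the
      error raised by the query, or None if the query succeeded."""
      expected_error = test_section.get('CATCH', '').strip()
      if exception_text is not None:
        # Query failed: the raised error must contain the expected message.
        assert expected_error and expected_error in exception_text
      else:
        # Query succeeded: any non-empty CATCH section fails the test here,
        # which is the assertion seen for test_exchange_small_delay on ppc64le.
        assert expected_error == ''

In other words, the assertion only fires when the query finishes without error even though the .test file says an error is expected, which matches Nishidha's later observation that raising the delay to 20000 produces the expected exception.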
