[ https://issues.apache.org/jira/browse/IMPALA-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773655#comment-16773655 ]

ASF subversion and git services commented on IMPALA-7119:
---------------------------------------------------------

Commit 9fdb93987cf13f346ad56c1b273a1e0fed86fd10 in impala's branch refs/heads/2.x from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=9fdb939 ]

IMPALA-7119: Restart whole minicluster when HDFS replication stalls

After loading data, we wait for HDFS to replicate
all of the blocks appropriately. If this takes too long,
we restart HDFS. However, restarting HDFS can cause HBase
to fail, because HBase becomes unable to write its logs.
In general, there is no real reason to keep HBase and the
other minicluster components running while HDFS is being
restarted.

This changes the HDFS health check to restart the whole
minicluster and Impala rather than just HDFS.

Testing:
 - Tested with a modified version that always does
   the restart in the HDFS health check and verified
   that the tests pass

Change-Id: I58ffe301708c78c26ee61aa754a06f46c224c6e2
Reviewed-on: http://gerrit.cloudera.org:8080/10665
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
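
For illustration, here is a minimal sketch in Python of the kind of health check the commit describes: poll HDFS for under-replicated blocks and, if replication stalls past a deadline, restart the whole minicluster instead of just HDFS. The timeout, poll interval, and restart helpers below are hypothetical stand-ins (the real logic lives in Impala's data-loading scripts); only the hdfs fsck parsing reflects actual HDFS output.

{noformat}
#!/usr/bin/env python
# Illustrative sketch only; helper names and thresholds are hypothetical.
import subprocess
import time

REPLICATION_TIMEOUT_S = 600  # assumed deadline, not taken from the commit
POLL_INTERVAL_S = 10

def under_replicated_blocks():
    """Parse the 'hdfs fsck /' summary for the under-replicated count."""
    out = subprocess.check_output(["hdfs", "fsck", "/"]).decode()
    for line in out.splitlines():
        if "Under-replicated blocks:" in line:
            # e.g. " Under-replicated blocks:  12 (0.5 %)"
            return int(line.split(":")[1].split()[0])
    return 0

def restart_whole_minicluster():
    # Hypothetical commands standing in for the minicluster and Impala
    # restart scripts; restarting everything avoids leaving HBase running
    # against a freshly bounced HDFS.
    subprocess.check_call(["testdata/bin/run-all.sh"])
    subprocess.check_call(["bin/start-impala-cluster.py"])

def wait_for_replication_or_restart():
    deadline = time.time() + REPLICATION_TIMEOUT_S
    while time.time() < deadline:
        if under_replicated_blocks() == 0:
            return  # replication caught up; leave the cluster alone
        time.sleep(POLL_INTERVAL_S)
    # Replication stalled: restart the whole minicluster, not just HDFS,
    # to sidestep the HBase failure mode described above.
    restart_whole_minicluster()
{noformat}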


> HBase tests failing with RetriesExhausted and "RuntimeException: couldn't retrieve HBase table"
> -----------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-7119
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7119
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 2.13.0
>            Reporter: Tim Armstrong
>            Assignee: Joe McDonnell
>            Priority: Major
>              Labels: broken-build, flaky
>             Fix For: Impala 3.1.0
>
>
> 64820211a2d30238093f1c4cd03bc268e3a01638
> {noformat}
>     metadata.test_compute_stats.TestHbaseComputeStats.test_hbase_compute_stats_incremental[exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 5000, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     metadata.test_compute_stats.TestHbaseComputeStats.test_hbase_compute_stats[exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 5000, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     query_test.test_mt_dop.TestMtDop.test_mt_dop[mt_dop: 1 | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     query_test.test_mt_dop.TestMtDop.test_compute_stats[mt_dop: 1 | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     query_test.test_hbase_queries.TestHBaseQueries.test_hbase_scan_node[exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     query_test.test_queries.TestHdfsQueries.test_file_partitions[exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     query_test.test_mt_dop.TestMtDop.test_mt_dop[mt_dop: 0 | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     query_test.test_observability.TestObservability.test_scan_summary
>     query_test.test_mt_dop.TestMtDop.test_compute_stats[mt_dop: 0 | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 | location: GETNEXT_SCANNER | action: FAIL | query: select 1 from alltypessmall order by id limit 100]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | mt_dop: 0 | location: OPEN | action: CANCEL | query: select c from (select id c from alltypessmall order by id limit 10) v where c = 1]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | mt_dop: 0 | location: CLOSE | action: MEM_LIMIT_EXCEEDED | query: select count(*) from alltypessmall]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 | location: PREPARE | action: MEM_LIMIT_EXCEEDED | query: select count(int_col) from alltypessmall group by id]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 | location: OPEN | action: MEM_LIMIT_EXCEEDED | query: select * from alltypessmall union all select * from alltypessmall]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 | location: CLOSE | action: MEM_LIMIT_EXCEEDED | query: select row_number() over (partition by int_col order by id) from alltypessmall]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 | location: CLOSE | action: MEM_LIMIT_EXCEEDED | query: select 1 from alltypessmall order by id]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 | location: CLOSE | action: MEM_LIMIT_EXCEEDED | query: select * from alltypes]
>     verifiers.test_verify_metrics.TestValidateMetrics.test_metrics_are_zero
>     org.apache.impala.planner.PlannerTest.org.apache.impala.planner.PlannerTest
>     org.apache.impala.planner.S3PlannerTest.org.apache.impala.planner.S3PlannerTest
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 | location: GETNEXT | action: FAIL | query: select 1 from alltypessmall a join alltypessmall b on a.id != b.id]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 | location: PREPARE_SCANNER | action: MEM_LIMIT_EXCEEDED | query: select 1 from alltypessmall a join alltypessmall b on a.id = b.id]
> {noformat}
> {noformat}
> 21:22:44 Running org.apache.impala.planner.S3PlannerTest
> 21:22:44 Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 450.328 sec <<< FAILURE! - in org.apache.impala.planner.S3PlannerTest
> 21:22:44 org.apache.impala.planner.S3PlannerTest  Time elapsed: 450.328 sec  <<< ERROR!
> 21:22:44      at org.apache.impala.datagenerator.HBaseTestDataRegionAssignment.<init>(HBaseTestDataRegionAssignment.java:68)
> 21:22:44      at org.apache.impala.planner.PlannerTestBase.setUp(PlannerTestBase.java:120)
> 21:22:44      at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> 21:22:44      at org.apache.impala.datagenerator.HBaseTestDataRegionAssignment.<init>(HBaseTestDataRegionAssignment.java:68)
> 21:22:44      at org.apache.impala.planner.PlannerTestBase.setUp(PlannerTestBase.java:120)
> 21:22:44      at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> 21:22:44      at org.apache.impala.datagenerator.HBaseTestDataRegionAssignment.<init>(HBaseTestDataRegionAssignment.java:68)
> 21:22:44      at org.apache.impala.planner.PlannerTestBase.setUp(PlannerTestBase.java:120)
> 21:22:44      at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> 21:22:44 Running org.apache.impala.planner.PlannerTest
> 21:22:44 Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 450.602 sec <<< FAILURE! - in org.apache.impala.planner.PlannerTest
> 21:22:44 org.apache.impala.planner.PlannerTest  Time elapsed: 450.602 sec  <<< ERROR!
> 21:22:44      at org.apache.impala.datagenerator.HBaseTestDataRegionAssignment.<init>(HBaseTestDataRegionAssignment.java:68)
> 21:22:44      at org.apache.impala.planner.PlannerTestBase.setUp(PlannerTestBase.java:120)
> 21:22:44      at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> 21:22:44      at org.apache.impala.datagenerator.HBaseTestDataRegionAssignment.<init>(HBaseTestDataRegionAssignment.java:68)
> 21:22:44      at org.apache.impala.planner.PlannerTestBase.setUp(PlannerTestBase.java:120)
> 21:22:44      at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> 21:22:44      at org.apache.impala.datagenerator.HBaseTestDataRegionAssignment.<init>(HBaseTestDataRegionAssignment.java:68)
> 21:22:44      at org.apache.impala.planner.PlannerTestBase.setUp(PlannerTestBase.java:120)
> 21:22:44      at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> {noformat}
> {noformat}
> 22:53:05 =================================== FAILURES ===================================
> 22:53:05  TestFailpoints.test_failpoints[table_format: hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | mt_dop: 4 | location: GETNEXT_SCANNER | action: FAIL | query: select 1 from alltypessmall order by id limit 100]
> 22:53:05 failure/test_failpoints.py:102: in test_failpoints
> 22:53:05     raise e
> 22:53:05 E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> 22:53:05 E    INNER EXCEPTION: <class 'beeswaxd.ttypes.BeeswaxException'>
> 22:53:05 E    MESSAGE: RuntimeException: couldn't retrieve HBase table (functional_hbase.alltypessmall) info:
> 22:53:05 E   Connection refused
> 22:53:05 E   CAUSED BY: ConnectException: Connection refused
> 22:53:05  TestFailpoints.test_failpoints[table_format: hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | mt_dop: 0 | location: OPEN | action: CANCEL | query: select c from (select id c from alltypessmall order by id limit 10) v where c = 1]
> 22:53:05 failure/test_failpoints.py:102: in test_failpoints
> 22:53:05     raise e
> 22:53:05 E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> 22:53:05 E    INNER EXCEPTION: <class 'beeswaxd.ttypes.BeeswaxException'>
> 22:53:05 E    MESSAGE: RuntimeException: couldn't retrieve HBase table (functional_hbase.alltypessmall) info:
> 22:53:05 E   Connection refused
> 22:53:05 E   CAUSED BY: ConnectException: Connection refused
> {noformat}
> {noformat}
> 23:21:02  TestHbaseComputeStats.test_hbase_compute_stats_incremental[exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 5000, 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
> 23:21:02 [gw3] linux2 -- Python 2.7.5 /data/jenkins/workspace/impala-asf-2.x-core/repos/Impala/bin/../infra/python/env/bin/python
> 23:21:02 metadata/test_compute_stats.py:147: in test_hbase_compute_stats_incremental
> 23:21:02     unique_database)
> 23:21:02 common/impala_test_suite.py:405: in run_test_case
> 23:21:02     result = self.__execute_query(target_impalad_client, query, user=user)
> 23:21:02 common/impala_test_suite.py:620: in __execute_query
> 23:21:02     return impalad_client.execute(query, user=user)
> 23:21:02 common/impala_connection.py:160: in execute
> 23:21:02     return self.__beeswax_client.execute(sql_stmt, user=user)
> 23:21:02 beeswax/impala_beeswax.py:173: in execute
> 23:21:02     handle = self.__execute_query(query_string.strip(), user=user)
> 23:21:02 beeswax/impala_beeswax.py:341: in __execute_query
> 23:21:02     self.wait_for_completion(handle)
> 23:21:02 beeswax/impala_beeswax.py:361: in wait_for_completion
> 23:21:02     raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
> 23:21:02 E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> 23:21:02 E    Query aborted:RuntimeException: couldn't retrieve HBase table (functional_hbase.alltypessmall) info:
> 23:21:02 E   This server is in the failed servers list: localhost/127.0.0.1:16202
> 23:21:02 E   CAUSED BY: FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:16202
> {noformat}


