[ https://issues.apache.org/jira/browse/IMPALA-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500915#comment-16500915 ]

Joe McDonnell commented on IMPALA-7119:
---------------------------------------

During dataload, we check that HDFS is replicating blocks correctly. In 
this case, dataload spends 23 minutes waiting for HDFS replication to catch up 
and ends up restarting the HDFS NameNode. This covers the period from 20:14 to 20:38:
{noformat}
20:14:49 Waiting for HDFS replication (logging to 
/data/jenkins/workspace/impala-asf-2.x-core/repos/Impala/logs/data_loading/wait-hdfs-replication.log)...
 
20:38:16   Waiting for HDFS replication OK (Took: 23 min 27 sec){noformat}
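The replication check can be sketched as follows. This is an illustrative reconstruction, not the actual dataload script: the function names and the parsing of the fsck summary line are assumptions, based only on the "under replicated blocks remaining" polling visible in the log below.

```python
import re

def count_under_replicated(fsck_output):
    """Parse `hdfs fsck` summary output and return the number of
    under-replicated blocks, or 0 if the summary line is absent."""
    # fsck prints a summary line of the form:
    #   Under-replicated blocks:   1026 (7.88 %)
    m = re.search(r"Under-replicated blocks:\s+(\d+)", fsck_output)
    return int(m.group(1)) if m else 0

def making_progress(previous_count, current_count):
    """Treat HDFS as making progress only if the under-replicated
    count dropped since the previous poll; the real check restarts
    HDFS after 120 seconds with no progress."""
    return current_count < previous_count

# Example against a canned fsck summary line (the real script would
# poll fsck repeatedly, sleeping 120 seconds between checks):
sample = "Under-replicated blocks:\t1026 (7.88 %)"
print(count_under_replicated(sample))  # 1026
```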
The HDFS NameNode is killed and restarted at 20:23 (from 
logs/data_loading/wait-hdfs-replication.log):
{noformat}
FSCK ended at Fri Jun 01 20:22:59 PDT 2018 in 128 milliseconds


The filesystem under path '/test-warehouse' is HEALTHY
There are under-replicated blocks in HDFS and HDFS is not making progress in 
120 seconds. Attempting to restart HDFS to resolve this issue.
Stopping kudu
Stopping kms
Stopping hdfs
Starting hdfs (Web UI - http://localhost:5070)
Namenode started
Starting kms (Web UI - http://localhost:9600)
Starting kudu (Web UI - http://localhost:8051)
The cluster is running
1026 under replicated blocks remaining.
Sleeping for 120 seconds before rechecking.{noformat}
When the NameNode starts up again (at 20:23:04), it is in safe mode. HBase 
remains running across the restart, and the HBase master dies a few seconds 
later because it cannot write a log (from 
logs/cluster/hbase/hbase-jenkins-master-impala-ec2-centos74-m5-4xlarge-ondemand-1a7c.vpc.cloudera.com.out):
{noformat}
18/06/01 20:23:20 WARN wal.WALProcedureStore: failed to create log file with 
id=5
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException):
 Cannot create file/hbase/MasterProcWALs/state-00000000000000000005.log. Name 
node is in safe mode.
The reported blocks 13014 has reached the threshold 0.9990 of total blocks 
13014. The number of live datanodes 3 has reached the minimum number 0. In safe 
mode extension. Safe mode will be turned off automatically in 24 seconds.
...
18/06/01 20:23:20 FATAL wal.WALProcedureStore: Unable to roll the log
18/06/01 20:23:20 FATAL master.HMaster: Master server abort: loaded 
coprocessors are: []
18/06/01 20:23:20 INFO regionserver.HRegionServer: STOPPED: Stopped by 
WALProcedureStoreSyncThread
18/06/01 20:23:20 INFO regionserver.HRegionServer: Stopping infoServer
...
18/06/01 20:23:20 INFO regionserver.HRegionServer: stopping server 
localhost,16000,1527906274301; zookeeper connection closed.
18/06/01 20:23:20 INFO regionserver.HRegionServer: 
master/localhost/127.0.0.1:16000 exiting{noformat}
This is the last output from the HBase master. Without the master, I believe we 
cannot read any HBase tables.

One option is to restart HBase when we restart HDFS. 
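That option could be sketched as below. This is a proposal sketch, not existing code: `is_in_safe_mode` and `restart_hbase` are hypothetical callables standing in for the cluster-control scripts, and a shell version could instead run `hdfs dfsadmin -safemode wait` before restarting HBase, so the master never sees a safe-mode write error.

```python
import time

def restart_hbase_after_hdfs(is_in_safe_mode, restart_hbase,
                             timeout_s=300, poll_s=5):
    """After restarting HDFS, wait for the NameNode to leave safe
    mode, then restart HBase. Both arguments are hypothetical
    callables wrapping the cluster-control scripts."""
    deadline = time.time() + timeout_s
    while is_in_safe_mode():
        if time.time() >= deadline:
            raise RuntimeError(
                "NameNode still in safe mode after %d seconds" % timeout_s)
        time.sleep(poll_s)
    # Safe mode is off; HBase can now roll its WAL logs safely.
    restart_hbase()
```

In the failure above, the safe-mode extension lasted only ~24 seconds, so a short wait here would have let the HBase master survive instead of aborting on the WAL roll.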

> HBase tests failing with RetriesExhausted and "RuntimeException: couldn't 
> retrieve HBase table"
> -----------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-7119
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7119
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 2.13.0
>            Reporter: Tim Armstrong
>            Assignee: Joe McDonnell
>            Priority: Major
>              Labels: broken-build, flaky
>
> 64820211a2d30238093f1c4cd03bc268e3a01638
> {noformat}
>     
> metadata.test_compute_stats.TestHbaseComputeStats.test_hbase_compute_stats_incremental[exec_option:
>  {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 5000, 
> 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 
> 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     
> metadata.test_compute_stats.TestHbaseComputeStats.test_hbase_compute_stats[exec_option:
>  {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 5000, 
> 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 
> 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     query_test.test_mt_dop.TestMtDop.test_mt_dop[mt_dop: 1 | exec_option: 
> {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 
> 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 
> 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     query_test.test_mt_dop.TestMtDop.test_compute_stats[mt_dop: 1 | 
> exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | table_format: hbase/none]
>     
> query_test.test_hbase_queries.TestHBaseQueries.test_hbase_scan_node[exec_option:
>  {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 
> 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 
> 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     query_test.test_queries.TestHdfsQueries.test_file_partitions[exec_option: 
> {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 
> 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 
> 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     query_test.test_mt_dop.TestMtDop.test_mt_dop[mt_dop: 0 | exec_option: 
> {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 
> 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 
> 'exec_single_node_rows_threshold': 0} | table_format: hbase/none]
>     query_test.test_observability.TestObservability.test_scan_summary
>     query_test.test_mt_dop.TestMtDop.test_compute_stats[mt_dop: 0 | 
> exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | table_format: hbase/none]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: 
> hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | mt_dop: 4 | location: GETNEXT_SCANNER | action: FAIL | query: select 1 
> from alltypessmall order by id limit 100]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: 
> hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | mt_dop: 0 | location: OPEN | action: CANCEL | query: select c from 
> (select id c from alltypessmall order by id limit 10) v where c = 1]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: 
> hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | mt_dop: 0 | location: CLOSE | action: MEM_LIMIT_EXCEEDED | query: select 
> count(*) from alltypessmall]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: 
> hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | mt_dop: 4 | location: PREPARE | action: MEM_LIMIT_EXCEEDED | query: 
> select count(int_col) from alltypessmall group by id]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: 
> hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | mt_dop: 4 | location: OPEN | action: MEM_LIMIT_EXCEEDED | query: select 
> * from alltypessmall union all select * from alltypessmall]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: 
> hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | mt_dop: 4 | location: CLOSE | action: MEM_LIMIT_EXCEEDED | query: select 
> row_number() over (partition by int_col order by id) from alltypessmall]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: 
> hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | mt_dop: 4 | location: CLOSE | action: MEM_LIMIT_EXCEEDED | query: select 
> 1 from alltypessmall order by id]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: 
> hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | mt_dop: 4 | location: CLOSE | action: MEM_LIMIT_EXCEEDED | query: select 
> * from alltypes]
>     verifiers.test_verify_metrics.TestValidateMetrics.test_metrics_are_zero
>     
> org.apache.impala.planner.PlannerTest.org.apache.impala.planner.PlannerTest
>     
> org.apache.impala.planner.S3PlannerTest.org.apache.impala.planner.S3PlannerTest
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: 
> hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | mt_dop: 4 | location: GETNEXT | action: FAIL | query: select 1 from 
> alltypessmall a join alltypessmall b on a.id != b.id]
>     failure.test_failpoints.TestFailpoints.test_failpoints[table_format: 
> hbase/none | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | mt_dop: 4 | location: PREPARE_SCANNER | action: MEM_LIMIT_EXCEEDED | 
> query: select 1 from alltypessmall a join alltypessmall b on a.id = b.id]
> {noformat}
> {noformat}
> 21:22:44 Running org.apache.impala.planner.S3PlannerTest
> 21:22:44 Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 450.328 sec <<< FAILURE! - in org.apache.impala.planner.S3PlannerTest
> 21:22:44 org.apache.impala.planner.S3PlannerTest  Time elapsed: 450.328 sec  
> <<< ERROR!
> 21:22:44      at 
> org.apache.impala.datagenerator.HBaseTestDataRegionAssignment.<init>(HBaseTestDataRegionAssignment.java:68)
> 21:22:44      at 
> org.apache.impala.planner.PlannerTestBase.setUp(PlannerTestBase.java:120)
> 21:22:44      at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> 21:22:44      at 
> org.apache.impala.datagenerator.HBaseTestDataRegionAssignment.<init>(HBaseTestDataRegionAssignment.java:68)
> 21:22:44      at 
> org.apache.impala.planner.PlannerTestBase.setUp(PlannerTestBase.java:120)
> 21:22:44      at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> 21:22:44      at 
> org.apache.impala.datagenerator.HBaseTestDataRegionAssignment.<init>(HBaseTestDataRegionAssignment.java:68)
> 21:22:44      at 
> org.apache.impala.planner.PlannerTestBase.setUp(PlannerTestBase.java:120)
> 21:22:44      at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> 21:22:44 Running org.apache.impala.planner.PlannerTest
> 21:22:44 Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 450.602 sec <<< FAILURE! - in org.apache.impala.planner.PlannerTest
> 21:22:44 org.apache.impala.planner.PlannerTest  Time elapsed: 450.602 sec  
> <<< ERROR!
> 21:22:44      at 
> org.apache.impala.datagenerator.HBaseTestDataRegionAssignment.<init>(HBaseTestDataRegionAssignment.java:68)
> 21:22:44      at 
> org.apache.impala.planner.PlannerTestBase.setUp(PlannerTestBase.java:120)
> 21:22:44      at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> 21:22:44      at 
> org.apache.impala.datagenerator.HBaseTestDataRegionAssignment.<init>(HBaseTestDataRegionAssignment.java:68)
> 21:22:44      at 
> org.apache.impala.planner.PlannerTestBase.setUp(PlannerTestBase.java:120)
> 21:22:44      at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> 21:22:44      at 
> org.apache.impala.datagenerator.HBaseTestDataRegionAssignment.<init>(HBaseTestDataRegionAssignment.java:68)
> 21:22:44      at 
> org.apache.impala.planner.PlannerTestBase.setUp(PlannerTestBase.java:120)
> 21:22:44      at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> {noformat}
> {noformat}
> 22:53:05 =================================== FAILURES 
> ===================================
> 22:53:05  TestFailpoints.test_failpoints[table_format: hbase/none | 
> exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | mt_dop: 4 | location: GETNEXT_SCANNER | action: FAIL | query: select 1 
> from alltypessmall order by id limit 100] 
> 22:53:05 failure/test_failpoints.py:102: in test_failpoints
> 22:53:05     raise e
> 22:53:05 E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> 22:53:05 E    INNER EXCEPTION: <class 'beeswaxd.ttypes.BeeswaxException'>
> 22:53:05 E    MESSAGE: RuntimeException: couldn't retrieve HBase table 
> (functional_hbase.alltypessmall) info:
> 22:53:05 E   Connection refused
> 22:53:05 E   CAUSED BY: ConnectException: Connection refused
> 22:53:05  TestFailpoints.test_failpoints[table_format: hbase/none | 
> exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'debug_action': None, 'exec_single_node_rows_threshold': 
> 0} | mt_dop: 0 | location: OPEN | action: CANCEL | query: select c from 
> (select id c from alltypessmall order by id limit 10) v where c = 1] 
> 22:53:05 failure/test_failpoints.py:102: in test_failpoints
> 22:53:05     raise e
> 22:53:05 E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> 22:53:05 E    INNER EXCEPTION: <class 'beeswaxd.ttypes.BeeswaxException'>
> 22:53:05 E    MESSAGE: RuntimeException: couldn't retrieve HBase table 
> (functional_hbase.alltypessmall) info:
> 22:53:05 E   Connection refused
> 22:53:05 E   CAUSED BY: ConnectException: Connection refused
> {noformat}
> {noformat}
> 23:21:02  
> TestHbaseComputeStats.test_hbase_compute_stats_incremental[exec_option: 
> {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 5000, 
> 'disable_codegen': False, 'abort_on_error': 1, 'debug_action': None, 
> 'exec_single_node_rows_threshold': 0} | table_format: hbase/none] 
> 23:21:02 [gw3] linux2 -- Python 2.7.5 
> /data/jenkins/workspace/impala-asf-2.x-core/repos/Impala/bin/../infra/python/env/bin/python
> 23:21:02 metadata/test_compute_stats.py:147: in 
> test_hbase_compute_stats_incremental
> 23:21:02     unique_database)
> 23:21:02 common/impala_test_suite.py:405: in run_test_case
> 23:21:02     result = self.__execute_query(target_impalad_client, query, 
> user=user)
> 23:21:02 common/impala_test_suite.py:620: in __execute_query
> 23:21:02     return impalad_client.execute(query, user=user)
> 23:21:02 common/impala_connection.py:160: in execute
> 23:21:02     return self.__beeswax_client.execute(sql_stmt, user=user)
> 23:21:02 beeswax/impala_beeswax.py:173: in execute
> 23:21:02     handle = self.__execute_query(query_string.strip(), user=user)
> 23:21:02 beeswax/impala_beeswax.py:341: in __execute_query
> 23:21:02     self.wait_for_completion(handle)
> 23:21:02 beeswax/impala_beeswax.py:361: in wait_for_completion
> 23:21:02     raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
> 23:21:02 E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> 23:21:02 E    Query aborted:RuntimeException: couldn't retrieve HBase table 
> (functional_hbase.alltypessmall) info:
> 23:21:02 E   This server is in the failed servers list: 
> localhost/127.0.0.1:16202
> 23:21:02 E   CAUSED BY: FailedServerException: This server is in the failed 
> servers list: localhost/127.0.0.1:16202
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
