[ https://issues.apache.org/jira/browse/HDFS-13834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588093#comment-16588093 ]
CR Hota commented on HDFS-13834:
--------------------------------
[~elgoiri] Thanks for reviewing.
We should not fail the router. The whole point of catching any unexpected
exception is to keep serving traffic even if connection creation fails for a
particular namenode.
Right now the thread dies, which leaves the router working with a very limited
set of connections (whatever was successfully created before the thread died).
The actual exception we found is:
{code:java}
2018-06-22 12:43:00,758 ERROR org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Fatal error caught by connection creator
java.lang.IllegalArgumentException: java.net.UnknownHostException: hadooplithiumnamenode02-sjc1.prod.uber.internal
	at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
	at org.apache.hadoop.hdfs.server.federation.router.ConnectionPool.newConnection(ConnectionPool.java:339)
	at org.apache.hadoop.hdfs.server.federation.router.ConnectionPool.newConnection(ConnectionPool.java:293)
	at org.apache.hadoop.hdfs.server.federation.router.ConnectionManager$ConnectionCreator.run(ConnectionManager.java:415)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}
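The trace shows an UnknownHostException wrapped in an IllegalArgumentException, which is unchecked, so a handler for IOException alone does not see it. The following is a minimal sketch of that effect, not router code: the class UncheckedEscapeDemo and the methods newConnection and tryCreate are hypothetical stand-ins, and the wrapping is an assumption based only on the exception message above.

```java
import java.io.IOException;
import java.net.UnknownHostException;

final class UncheckedEscapeDemo {

  /**
   * Hypothetical stand-in for a connection factory. It wraps the checked
   * UnknownHostException in an unchecked IllegalArgumentException, which is
   * the shape of the exception reported in the log above (assumption).
   */
  static void newConnection(String host) throws IOException {
    throw new IllegalArgumentException(new UnknownHostException(host));
  }

  /** Reports which handler, if any, observed the failure. */
  static String tryCreate(String host) {
    try {
      try {
        newConnection(host);
        return "created";
      } catch (IOException e) {
        // Only checked I/O failures arrive here.
        return "caught by IOException handler";
      }
    } catch (RuntimeException e) {
      // The unchecked wrapper skips the inner handler entirely.
      return "escaped IOException handler: " + e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    // Prints: escaped IOException handler: IllegalArgumentException
    System.out.println(tryCreate("namenode.example.invalid"));
  }
}
```

In a worker loop without the outer handler, the same exception would propagate out of run() and end the thread.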
> RBF: Connection creator thread should catch Throwable
> -----------------------------------------------------
>
> Key: HDFS-13834
> URL: https://issues.apache.org/jira/browse/HDFS-13834
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: CR Hota
> Assignee: CR Hota
> Priority: Critical
> Attachments: HDFS-13834.0.patch, HDFS-13834.1.patch
>
>
> The connection creator thread is a single thread that is responsible for
> creating all downstream namenode connections.
> This is a very critical thread and hence should not die under
> exception/error scenarios.
> We saw this behavior in production systems, where the thread died, leaving the
> router process in a bad state.
> The thread should also catch a generic error/exception.
> {code}
> @Override
> public void run() {
>   while (this.running) {
>     try {
>       ConnectionPool pool = this.queue.take();
>       try {
>         int total = pool.getNumConnections();
>         int active = pool.getNumActiveConnections();
>         if (pool.getNumConnections() < pool.getMaxSize() &&
>             active >= MIN_ACTIVE_RATIO * total) {
>           ConnectionContext conn = pool.newConnection();
>           pool.addConnection(conn);
>         } else {
>           LOG.debug("Cannot add more than {} connections to {}",
>               pool.getMaxSize(), pool);
>         }
>       } catch (IOException e) {
>         // Only IOException is handled; unchecked exceptions escape run().
>         LOG.error("Cannot create a new connection", e);
>       }
>     } catch (InterruptedException e) {
>       LOG.error("The connection creator was interrupted");
>       this.running = false;
>     }
>   }
> }
> {code}
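A minimal sketch of the idea in the issue title follows: the inner handler is widened to Throwable so that one failed request does not end the worker thread. This is not the attached patch. The names ResilientCreatorSketch, Request, Creator and demo are hypothetical, and the queue of requests only approximates the pool queue in the snippet above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class ResilientCreatorSketch {

  /** Hypothetical stand-in for one queued connection request. */
  interface Request {
    void create() throws Exception;
  }

  static final class Creator implements Runnable {
    private final BlockingQueue<Request> queue;
    private volatile boolean running = true;
    final AtomicInteger failures = new AtomicInteger();

    Creator(BlockingQueue<Request> queue) {
      this.queue = queue;
    }

    void shutdown() {
      this.running = false;
    }

    @Override
    public void run() {
      while (this.running) {
        try {
          Request request = this.queue.take();
          try {
            request.create();
          } catch (Throwable t) {
            // Record the failure and keep looping instead of dying.
            failures.incrementAndGet();
            System.err.println("Cannot create a new connection: " + t);
          }
        } catch (InterruptedException e) {
          // Interruption is still treated as a request to stop.
          Thread.currentThread().interrupt();
          this.running = false;
        }
      }
    }
  }

  /** Queues a failing request, a normal one, then a stop request. */
  static String demo() {
    BlockingQueue<Request> queue = new LinkedBlockingQueue<>();
    Creator creator = new Creator(queue);
    AtomicBoolean ranAfterFailure = new AtomicBoolean(false);

    queue.add(() -> { throw new IllegalArgumentException("unknown host"); });
    queue.add(() -> ranAfterFailure.set(true));
    queue.add(creator::shutdown);

    Thread thread = new Thread(creator, "creator-sketch");
    thread.start();
    try {
      thread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return "failures=" + creator.failures.get()
        + ", ranAfterFailure=" + ranAfterFailure.get();
  }

  public static void main(String[] args) {
    // Prints: failures=1, ranAfterFailure=true
    System.out.println(demo());
  }
}
```

Catching Throwable also swallows serious errors such as OutOfMemoryError, so whether to log and continue or to escalate in those cases is a design choice this sketch leaves open.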