[ 
https://issues.apache.org/jira/browse/HDFS-13834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588093#comment-16588093
 ] 

CR Hota commented on HDFS-13834:
--------------------------------

[~elgoiri] Thanks for reviewing.

We should not fail the router. The whole point of catching any unexpected 
exception is to continue serving traffic even if connection creation fails 
for a particular namenode.

Right now the thread dies, leaving the router to work with a very limited set 
of connections (whatever was successfully created before the thread died).

The actual exception we found is:

 
{code:java}
2018-06-22 12:43:00,758 ERROR org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Fatal error caught by connection creator 
java.lang.IllegalArgumentException: java.net.UnknownHostException: hadooplithiumnamenode02-sjc1.prod.uber.internal
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
        at org.apache.hadoop.hdfs.server.federation.router.ConnectionPool.newConnection(ConnectionPool.java:339)
        at org.apache.hadoop.hdfs.server.federation.router.ConnectionPool.newConnection(ConnectionPool.java:293)
        at org.apache.hadoop.hdfs.server.federation.router.ConnectionManager$ConnectionCreator.run(ConnectionManager.java:415)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}
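
To illustrate the point, here is a minimal, self-contained sketch of a creator loop that survives an unchecked exception. It is not the router code or the attached patch: the class `CreatorLoopSketch`, the `ConnectionTask` interface, the `STOP` sentinel and the host name `nn2.example.invalid` are all hypothetical stand-ins. The second task throws an `IllegalArgumentException` wrapping an `UnknownHostException`, the same shape as the trace above, which a `catch (IOException e)` alone would not handle.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CreatorLoopSketch {

  /** Hypothetical stand-in for the work done by pool.newConnection(). */
  interface ConnectionTask {
    String create() throws IOException;
  }

  /** Sentinel that tells the loop to stop; only needed for this demo. */
  private static final ConnectionTask STOP = () -> "stop";

  /** Runs the tasks on one creator thread and returns the connections made. */
  static List<String> runAll(List<ConnectionTask> tasks)
      throws InterruptedException {
    BlockingQueue<ConnectionTask> queue = new LinkedBlockingQueue<>(tasks);
    queue.put(STOP);
    List<String> created = Collections.synchronizedList(new ArrayList<>());

    Thread creator = new Thread(() -> {
      while (true) {
        try {
          ConnectionTask task = queue.take();
          if (task == STOP) {
            return;
          }
          try {
            created.add(task.create());
          } catch (IOException e) {
            System.err.println("Cannot create a new connection: " + e);
          } catch (Throwable t) {
            // Unchecked exceptions and errors land here instead of
            // terminating the thread, so later tasks are still served.
            System.err.println("Unexpected error, continuing: " + t);
          }
        } catch (InterruptedException e) {
          // Restore the interrupt flag and stop the loop.
          Thread.currentThread().interrupt();
          return;
        }
      }
    }, "creator-sketch");
    creator.start();
    creator.join();
    return created;
  }

  public static void main(String[] args) throws InterruptedException {
    List<ConnectionTask> tasks = Arrays.asList(
        () -> "nn1",
        () -> {
          // Same shape as the failure in the stack trace above.
          throw new IllegalArgumentException(
              new UnknownHostException("nn2.example.invalid"));
        },
        () -> "nn3");
    System.out.println(runAll(tasks));
  }
}
```

Running `main` prints `[nn1, nn3]`: the failing task is logged and skipped, and the task queued after it is still processed. Without the `catch (Throwable t)` branch the thread would terminate at the second task and `nn3` would never be created.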

> RBF: Connection creator thread should catch Throwable
> -----------------------------------------------------
>
>                 Key: HDFS-13834
>                 URL: https://issues.apache.org/jira/browse/HDFS-13834
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: CR Hota
>            Assignee: CR Hota
>            Priority: Critical
>         Attachments: HDFS-13834.0.patch, HDFS-13834.1.patch
>
>
> The connection creator thread is a single thread that is responsible for 
> creating all downstream namenode connections.
> This is a very critical thread and hence should not die under 
> exception/error scenarios.
> We saw this behavior in production systems, where the thread died, leaving 
> the router process in a bad state.
> The thread should also catch generic errors/exceptions.
> {code}
>     @Override
>     public void run() {
>       while (this.running) {
>         try {
>           ConnectionPool pool = this.queue.take();
>           try {
>             int total = pool.getNumConnections();
>             int active = pool.getNumActiveConnections();
>             if (pool.getNumConnections() < pool.getMaxSize() &&
>                 active >= MIN_ACTIVE_RATIO * total) {
>               ConnectionContext conn = pool.newConnection();
>               pool.addConnection(conn);
>             } else {
>               LOG.debug("Cannot add more than {} connections to {}",
>                   pool.getMaxSize(), pool);
>             }
>           } catch (IOException e) {
>             LOG.error("Cannot create a new connection", e);
>           }
>         } catch (InterruptedException e) {
>           LOG.error("The connection creator was interrupted");
>           this.running = false;
>         }
>       }
>     }
> {code}



