bruce xu created ZEPPELIN-3507:
----------------------------------

             Summary: Restart interpreter setting may cause lock contention and 
lead to ZeppelinServer return 503 code
                 Key: ZEPPELIN-3507
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3507
             Project: Zeppelin
          Issue Type: Bug
          Components: zeppelin-interpreter
    Affects Versions: 0.7.2
         Environment: centos 6, jdk 1.8
            Reporter: bruce xu
         Attachments: zeppelin-jstack-20180525.log

When user restart interpreter setting, it may cause ZeppelinServer return 503 
code because of lock contention.

1) In restart interpreter setting operation, the *Client.recv_cancel* method 
may hung for a long time because of not receiving anything from server.  
However, this method will hold a lock of concurrentHashMap.

 
{code:java}
"qtp203777958-12450" prio=10 tid=0x00007f49b0001800 nid=0x30c5 runnable 
[0x00007f4a313af000]

   java.lang.Thread.State: RUNNABLE

        at java.net.SocketInputStream.socketRead0(Native Method)

        at java.net.SocketInputStream.read(SocketInputStream.java:152)

        at java.net.SocketInputStream.read(SocketInputStream.java:122)

        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)

        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)

        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)

        - locked <0x0000000795526d30> (a java.io.BufferedInputStream)

        at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)

        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)

        at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)

        at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)

        at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)

        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)

        at 
org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_cancel(RemoteInterpreterService.java:291)

        at 
org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.cancel(RemoteInterpreterService.java:276)

        at 
org.apache.zeppelin.interpreter.remote.RemoteInterpreter.cancel(RemoteInterpreter.java:419)

        at 
org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:106)

        at org.apache.zeppelin.notebook.Paragraph.jobAbort(Paragraph.java:477)

        at org.apache.zeppelin.scheduler.Job.abort(Job.java:240)

        at 
org.apache.zeppelin.interpreter.InterpreterSettingManager.stopJobByIntpGroup(InterpreterSettingManager.java:1019)

        at 
org.apache.zeppelin.interpreter.InterpreterSettingManager.stopJobAllInterpreter(InterpreterSettingManager.java:1009)

        at 
org.apache.zeppelin.interpreter.InterpreterSettingManager.restart(InterpreterSettingManager.java:963)

        - locked <0x000000078031cb78> (a java.util.concurrent.ConcurrentHashMap)

        at 
org.apache.zeppelin.rest.InterpreterRestApi.restartSetting(InterpreterRestApi.java:191)

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.lang.reflect.Method.invoke(Method.java:606)

        at 
org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:180)

        at 
org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)

        at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:205)

        at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:102)

        at 
org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:58)

        at 
org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:94)
{code}
 

 

2) So what is ConcurrentHashMap?

>From the stack information and source code, I find that the lock is 
>*interpreterSetting*.

 
{code:java}
public void restart(String settingId, String user) {
  synchronized (interpreterSettings) {
    InterpreterSetting intpsetting = interpreterSettings.get(settingId);
    // Check if dependency in specified path is changed
    // If it did, overwrite old dependency jar with new one
    copyDependenciesFromLocalPath(intpsetting);
    if (intpsetting != null) {

      stopJobAllInterpreter(intpsetting, user);

      intpsetting.closeAndRmoveAllInterpreterGroups(user);

    } else {
      throw new InterpreterException("Interpreter setting id " + settingId
              + " not found");
    }
  }
}
{code}
3) How can this lock cause lock contention?

Because *NotebookServer* has a *onWebSocketText* method which will try to get 
the lock of InterpreterSetting object. 
{code:java}
"qtp203777958-12456" prio=10 tid=0x00007f49bc006800 nid=0x46ab waiting for 
monitor entry [0x00007f4a30a6f000]

   java.lang.Thread.State: BLOCKED (on object monitor)

        at 
org.apache.zeppelin.interpreter.InterpreterSettingManager.get(InterpreterSettingManager.java:1035)

        - waiting to lock <0x000000078031cb78> (a 
java.util.concurrent.ConcurrentHashMap)

        at 
org.apache.zeppelin.interpreter.InterpreterSettingManager.getInterpreterSettings(InterpreterSettingManager.java:450)

        at 
org.apache.zeppelin.socket.NotebookServer.sendAllAngularObjects(NotebookServer.java:2115)

        at 
org.apache.zeppelin.socket.NotebookServer.sendNote(NotebookServer.java:724)

        at 
org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:220)

        at 
org.apache.zeppelin.socket.NotebookSocket.onWebSocketText(NotebookSocket.java:59)

        at 
org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onTextMessage(JettyListenerEventDriver.java:128)

        at 
org.eclipse.jetty.websocket.common.message.SimpleTextMessage.messageComplete(SimpleTextMessage.java:69)

        at 
org.eclipse.jetty.websocket.common.events.AbstractEventDriver.appendMessage(AbstractEventDriver.java:65)

        at 
org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onTextFrame(JettyListenerEventDriver.java:122)

        at 
org.eclipse.jetty.websocket.common.events.AbstractEventDriver.incomingFrame(AbstractEventDriver.java:161)

        at 
org.eclipse.jetty.websocket.common.WebSocketSession.incomingFrame(WebSocketSession.java:309)

        at 
org.eclipse.jetty.websocket.common.extensions.ExtensionStack.incomingFrame(ExtensionStack.java:214)

        at 
org.eclipse.jetty.websocket.common.Parser.notifyFrame(Parser.java:220)

        at org.eclipse.jetty.websocket.common.Parser.parse(Parser.java:258)

        at 
org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.readParse(AbstractWebSocketConnection.java:632)

        at 
org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:480)

        at 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)

        at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)

        at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)

        at java.lang.Thread.run(Thread.java:745)
{code}
When there is a lot of threads trying to get the lock, it will cause lock 
contention and slow down the ZeppelinServer performance until 503 code returned.

 

 

 

The jstack log file is attached below.

[^zeppelin-jstack-20180525.log]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to