sarutak opened a new pull request #28437:
URL: https://github.com/apache/spark/pull/28437
### What changes were proposed in this pull request?
This PR adds a workaround for an issue that occasionally occurs when
`SparkContext#stop()` is called.
I think this issue can occur on macOS with OpenJDK / OracleJDK 1.8.
When it occurs, the following stack trace is shown.
```
20/05/03 02:17:54 WARN AbstractConnector:
java.io.IOException: No such file or directory
    at sun.nio.ch.NativeThread.signal(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.implCloseSelectableChannel(ServerSocketChannelImpl.java:292)
    at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:234)
    at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115)
    at org.eclipse.jetty.server.ServerConnector.close(ServerConnector.java:368)
    at org.eclipse.jetty.server.AbstractNetworkConnector.shutdown(AbstractNetworkConnector.java:105)
    at org.eclipse.jetty.server.Server.doStop(Server.java:439)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.stop(AbstractLifeCycle.java:89)
    at org.apache.spark.ui.ServerInfo.stop(JettyUtils.scala:501)
    at org.apache.spark.ui.WebUI.$anonfun$stop$2(WebUI.scala:173)
    at org.apache.spark.ui.WebUI.$anonfun$stop$2$adapted(WebUI.scala:173)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.ui.WebUI.stop(WebUI.scala:173)
    at org.apache.spark.ui.SparkUI.stop(SparkUI.scala:101)
    at org.apache.spark.SparkContext.$anonfun$stop$6(SparkContext.scala:1966)
    at org.apache.spark.SparkContext.$anonfun$stop$6$adapted(SparkContext.scala:1966)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.SparkContext.$anonfun$stop$5(SparkContext.scala:1966)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1966)
    at org.apache.spark.repl.Main$.$anonfun$doMain$3(Main.scala:79)
    at org.apache.spark.repl.Main$.$anonfun$doMain$3$adapted(Main.scala:79)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.repl.Main$.doMain(Main.scala:79)
    at org.apache.spark.repl.Main$.main(Main.scala:58)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:934)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1013)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1022)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
This issue happens when Jetty's acceptor thread is shrunk (removed from the
thread pool) before the main thread sends a signal to it.
Jetty's acceptor thread waits for a new connection request and is blocked in
`accept(this.fd, newfd, isaa)` in
[`sun.nio.ch.ServerSocketChannelImpl#accept`](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l241).
When `org.eclipse.jetty.server.Server.doStop` is called in the main thread,
the main thread reaches [this
code](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l280).
The server socket descriptor is closed by `nd.preClose` in the main
thread.
Then, on macOS, `accept()` in the acceptor thread throws an exception due to
"Bad file descriptor".
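
The interaction between a blocked `accept()` and a concurrent `close()` can be observed outside Jetty with a minimal Java sketch. The class and method names (`AcceptCloseDemo`, `closeWhileAccepting`) and the 200 ms wait are illustrative assumptions, not part of this PR or of Jetty. The exception type and the moment at which `accept()` returns are platform- and JDK-dependent, so the sketch only records what was thrown.

```java
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.util.concurrent.atomic.AtomicReference;

class AcceptCloseDemo {
    /**
     * Closes a listening channel from the calling thread while another
     * thread is blocked in accept(), and returns what accept() threw.
     */
    static Throwable closeWhileAccepting() throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0)); // ephemeral port
        AtomicReference<Throwable> failure = new AtomicReference<>();

        // Stand-in for the acceptor thread: nobody connects, so accept() blocks.
        Thread acceptor = new Thread(() -> {
            try {
                server.accept();
            } catch (Throwable t) {
                failure.set(t);
            }
        }, "demo-acceptor");
        acceptor.start();

        Thread.sleep(200);   // assumed to be long enough for the thread to block
        server.close();      // stand-in for the close performed during doStop
        acceptor.join(5000); // the acceptor ends once accept() has failed
        return failure.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("accept() ended with: " + closeWhileAccepting());
    }
}
```

In this sketch the acceptor thread exits as soon as `accept()` fails, which is the kind of early thread exit the description above is concerned with.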
After the exception is thrown, the acceptor thread continues to [fetch a
task](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L783).
If the thread obtains the `SHRINK` task
[here](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L854),
the thread is shrunk.
If the acceptor thread finishes before `NativeThread.signal` is called in
the main thread, this issue happens.
Because the stack trace is printed by the logger, it is difficult to
suppress.
According to [this
condition](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L842),
shrinking does not happen when the idle timeout is 0. So this PR adds a
workaround that sets the idle timeout to 0 immediately before stopping.
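
The reasoning behind the workaround can be sketched as a small, simplified model of the shrink check. This is not Jetty's actual code: the class `ShrinkCondition`, the method `mayShrink`, and its parameters are hypothetical, and treating any non-positive idle timeout as "never shrink" is an assumption based on the linked condition. The time comparison is the expression used in the reproduction procedure.

```java
import java.util.concurrent.TimeUnit;

class ShrinkCondition {
    /**
     * Simplified model of when an idle pool thread may be shrunk.
     * Hypothetical helper for illustration only, not Jetty's implementation.
     */
    static boolean mayShrink(long idleTimeoutMs, int threads, int minThreads,
                             long lastShrinkNanos, long nowNanos) {
        if (idleTimeoutMs <= 0) {
            // Assumption: a non-positive idle timeout disables shrinking,
            // which is what the workaround relies on.
            return false;
        }
        if (threads <= minThreads) {
            // The pool is already at its minimum size.
            return false;
        }
        // Same comparison as in the reproduction procedure.
        return (nowNanos - lastShrinkNanos) > TimeUnit.MILLISECONDS.toNanos(idleTimeoutMs);
    }

    public static void main(String[] args) {
        long twoMinutes = TimeUnit.MINUTES.toNanos(2);
        // Idle timeout of one minute, pool above its minimum: shrinking is possible.
        System.out.println(mayShrink(60_000L, 10, 8, 0L, twoMinutes));
        // Idle timeout forced to 0 before stopping: shrinking is ruled out.
        System.out.println(mayShrink(0L, 10, 8, 0L, twoMinutes));
    }
}
```

With the timeout set to 0, the model never reports a shrink regardless of pool size or elapsed time, so under this model the acceptor thread cannot be removed ahead of `NativeThread.signal`.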
On Linux, the acceptor thread remains blocked in `accept` even
after `nd.preClose` is called in the main thread.
The acceptor thread returns from `accept` when `NativeThread.signal` is
called in the main thread.
It seems that the behavior of the `accept` system call invoked by `accept`
differs between Linux and macOS.
So, I believe this issue doesn't happen on Linux.
Also, the implementation of `NativeThread.signal` was changed slightly in
[OpenJDK 9](http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/7b17bff2ea36) for
macOS.
So this issue doesn't happen on macOS with OpenJDK 9+.
You can reproduce this issue with the following procedure using a debugger.
1. Launch spark-shell in local mode with JDWP enabled.
2. Access the WebUI. This is needed to increase the number of SparkUI threads
above minThreads so that the shrink condition can be met.
3. Enable the following breakpoints. Do not suspend all threads
when a thread reaches one of the breakpoints. Only the thread that reaches the
line should be suspended.
3.1 [long now = System.nanoTime(); at
org.eclipse.jetty.util.thread.QueuedThreadPool#idleJobPoll](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L850)
3.2 [NativeThread.signal(th); at
sun.nio.ch.ServerSocketChannelImpl#implCloseSelectableChannel](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l283)
3.3 [thread = 0; at
ServerSocketChannelImpl#accept](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l247)
4. Quit spark-shell.
5. Wait until a thread reaches breakpoint `3.1` and the
following condition becomes true (the idle timeout of those threads is 1 minute;
you can check the condition with the expression evaluation feature if your
debugger supports it).
`(System.nanoTime() - last) > TimeUnit.MILLISECONDS.toNanos(_idleTimeout)`
6. The acceptor thread named `SparkUI-<N>-acceptor-0` should be suspended at
breakpoint `3.3`, so resume this thread. It will reach
breakpoint `3.1`; resume it again. Then, the acceptor thread will be
shrunk.
7. Resume all the remaining threads.
### Why are the changes needed?
This stack trace does not indicate a bug in Spark, but it confuses users.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with the reproduction procedure above and confirmed that the acceptor
thread is no longer shrunk.