[
https://issues.apache.org/jira/browse/SPARK-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428285#comment-15428285
]
Thomas Graves commented on SPARK-16914:
---------------------------------------
Actually, with SPARK-14963 the cluster admin is supposed to set the recovery
path to something that is durable and/or critical to the NM. Normally, if this
path is bad, the entire NM will not start, so this should work around the
problem if things are configured properly; i.e., it should crash if the path is bad.
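To illustrate the "fail at startup if the path is bad" idea, here is a minimal sketch of a probe that checks whether a directory can be created, written, read back, and cleaned up. The class name RecoveryPathCheck and the method isUsable are hypothetical and are not part of Spark or YARN. A check like this only catches problems that surface as Java I/O errors at the time it runs; it cannot rule out a disk that goes bad later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Spark or YARN.
public class RecoveryPathCheck {

  /**
   * Returns true if dir exists (or can be created) and a probe file
   * can be written, read back intact, and deleted.
   */
  static boolean isUsable(Path dir) {
    try {
      // Fails with an IOException if dir is an existing regular file.
      Files.createDirectories(dir);
      Path probe = Files.createTempFile(dir, "probe", ".tmp");
      try {
        byte[] expected = "recovery-probe".getBytes(StandardCharsets.UTF_8);
        Files.write(probe, expected);
        return Arrays.equals(expected, Files.readAllBytes(probe));
      } finally {
        Files.deleteIfExists(probe);
      }
    } catch (IOException | SecurityException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Defaults to the JVM temp directory when no path is given.
    Path dir = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
    if (!isUsable(dir)) {
      System.err.println("Recovery path is not usable: " + dir);
      System.exit(1);
    }
    System.out.println("Recovery path looks usable: " + dir);
  }
}
```

Running it with the configured recovery directory as the only argument exits with a non-zero status when the probe fails, which is the same "refuse to start" behavior described above.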
> NodeManager crashes when Spark is registering executor information into leveldb
> -------------------------------------------------------------------------------
>
> Key: SPARK-16914
> URL: https://issues.apache.org/jira/browse/SPARK-16914
> Project: Spark
> Issue Type: Bug
> Components: Shuffle
> Affects Versions: 1.6.2
> Reporter: cen yuhai
>
> {noformat}
> Stack: [0x00007fb5b53de000,0x00007fb5b54df000], sp=0x00007fb5b54dcba8, free space=1018k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> C [libc.so.6+0x896b1] memcpy+0x11
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> j org.fusesource.leveldbjni.internal.NativeDB$DBJNI.Put(JLorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)J+0
> j org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)V+11
> j org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeBuffer;Lorg/fusesource/leveldbjni/internal/NativeBuffer;)V+18
> j org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;[B[B)V+36
> j org.fusesource.leveldbjni.internal.JniDB.put([B[BLorg/iq80/leveldb/WriteOptions;)Lorg/iq80/leveldb/Snapshot;+28
> j org.fusesource.leveldbjni.internal.JniDB.put([B[B)V+10
> j org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.registerExecutor(Ljava/lang/String;Ljava/lang/String;Lorg/apache/spark/network/shuffle/protocol/ExecutorShuffleInfo;)V+61
> J 8429 C2 org.apache.spark.network.server.TransportRequestHandler.handle(Lorg/apache/spark/network/protocol/RequestMessage;)V (100 bytes) @ 0x00007fb5f27ff6cc [0x00007fb5f27fdde0+0x18ec]
> J 8371 C2 org.apache.spark.network.server.TransportChannelHandler.channelRead0(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (10 bytes) @ 0x00007fb5f242df20 [0x00007fb5f242de80+0xa0]
> J 6853 C2 io.netty.channel.SimpleChannelInboundHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (74 bytes) @ 0x00007fb5f215587c [0x00007fb5f21557e0+0x9c]
> J 5872 C2 io.netty.handler.timeout.IdleStateHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (42 bytes) @ 0x00007fb5f2183268 [0x00007fb5f2183100+0x168]
> J 5849 C2 io.netty.handler.codec.MessageToMessageDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (158 bytes) @ 0x00007fb5f2191524 [0x00007fb5f218f5a0+0x1f84]
> J 5941 C2 org.apache.spark.network.util.TransportFrameDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (170 bytes) @ 0x00007fb5f220a230 [0x00007fb5f2209fc0+0x270]
> J 7747 C2 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read()V (363 bytes) @ 0x00007fb5f264465c [0x00007fb5f2644140+0x51c]
> J 8008% C2 io.netty.channel.nio.NioEventLoop.run()V (162 bytes) @ 0x00007fb5f26f6764 [0x00007fb5f26f63c0+0x3a4]
> j io.netty.util.concurrent.SingleThreadEventExecutor$2.run()V+13
> j java.lang.Thread.run()V+11
> v ~StubRoutines::call_stub
> {noformat}
> The relevant code in Spark is in ExternalShuffleBlockResolver:
> {code}
> /** Registers a new Executor with all the configuration we need to find its shuffle files. */
> public void registerExecutor(
>     String appId,
>     String execId,
>     ExecutorShuffleInfo executorInfo) {
>   AppExecId fullId = new AppExecId(appId, execId);
>   logger.info("Registered executor {} with {}", fullId, executorInfo);
>   try {
>     if (db != null) {
>       byte[] key = dbAppExecKey(fullId);
>       byte[] value = mapper.writeValueAsString(executorInfo).getBytes(Charsets.UTF_8);
>       db.put(key, value);
>     }
>   } catch (Exception e) {
>     logger.error("Error saving registered executors", e);
>   }
>   executors.put(fullId, executorInfo);
> }
> {code}
> There is a problem with disk1