[
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002614#comment-15002614
]
Vinod Kone commented on MESOS-3851:
-----------------------------------
I looked at the libprocess code, and it looks like a single connection should
be reused?
From:
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L1539
{code}
void SocketManager::send(Message* message, const Socket::Kind& kind)
{
  CHECK(message != NULL);

  const Address& address = message->to.address;

  Option<Socket> socket = None();
  bool connect = false;

  synchronized (mutex) {
    // Check if there is already a socket.
    bool persist = persists.count(address) > 0;
    bool temp = temps.count(address) > 0;
    if (persist || temp) {
      int s = persist ? persists[address] : temps[address];
      CHECK(sockets.count(s) > 0);
      socket = *sockets[s];

      // Update whether or not this socket should get disposed after
      // there is no more data to send.
      if (!persist) {
        dispose.insert(socket.get());
      }

      if (outgoing.count(socket.get()) > 0) {
        outgoing[socket.get()].push(new MessageEncoder(socket.get(), message));
        return;
      } else {
        // Initialize the outgoing queue.
        outgoing[socket.get()];
      }
    } else {
      // No persistent or temporary socket to the socket address
      // currently exists, so we create a temporary one.
      // The kind of socket we create is passed in as an argument.
      // This allows us to support downgrading the connection type
      // from SSL to POLL if enabled.
      Try<Socket> create = Socket::create(kind);
      if (create.isError()) {
        VLOG(1) << "Failed to send, create socket: " << create.error();
        delete message;
        return;
      }
      socket = create.get();
      int s = socket.get();

      sockets[s] = new Socket(socket.get());
      addresses[s] = address;

      temps[address] = s;
      dispose.insert(s);

      // Initialize the outgoing queue.
      outgoing[s];

      connect = true;
    }
  }

  if (connect) {
    CHECK_SOME(socket);
    socket.get().connect(address)
      .onAny(lambda::bind(
          &SocketManager::send_connect,
          this,
          lambda::_1,
          new Socket(socket.get()),
          message));
  } else {
    // If we're not connecting and we haven't added the encoder to
    // the 'outgoing' queue then schedule it to be sent.
    internal::send(
        new MessageEncoder(socket.get(), message),
        new Socket(socket.get()));
  }
}
{code}
AFAICT, the race is only possible if 2 sockets are created, but even for temp
connections (without a link) only one socket is created. Well, I guess it is
possible for 2 temp sockets to be created, if the first one gets closed
(because it has no more data to send) before the 2nd message arrives.
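To make that failure mode concrete, here is a minimal, self-contained sketch
(hypothetical types and names, not the actual Mesos code): if the registration
reply and the task go out on two different temp sockets, nothing orders them
relative to each other, so the executor can process the task first and trip
the equivalent of {{CHECK_SOME(executorInfo)}}.
{code}
#include <cassert>
#include <optional>
#include <string>

// Stand-ins for the real ExecutorInfo/TaskInfo protobufs.
struct ExecutorInfo { std::string executorId; };
struct TaskInfo { std::string taskId; };

struct CommandExecutorSketch
{
  // Unset until the ExecutorRegisteredMessage is processed.
  std::optional<ExecutorInfo> executorInfo;

  void registered(const ExecutorInfo& info) { executorInfo = info; }

  void launchTask(const TaskInfo& task)
  {
    // Plays the role of CHECK_SOME(executorInfo): aborts the process
    // if the task is delivered before registration has been handled.
    assert(executorInfo.has_value());
  }
};

int main()
{
  CommandExecutorSketch executor;

  // Delivery order seen in the crashing runs: the RunTaskMessage arrives
  // (e.g. on a second temp socket) before the ExecutorRegisteredMessage,
  // so launchTask() runs first and the assert fires.
  executor.launchTask(TaskInfo{"3"});              // aborts here
  executor.registered(ExecutorInfo{"executor-3"}); // never reached
}
{code}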
> Investigate recent crashes in Command Executor
> ----------------------------------------------
>
> Key: MESOS-3851
> URL: https://issues.apache.org/jira/browse/MESOS-3851
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Anand Mazumdar
> Priority: Blocker
> Labels: mesosphere
>
> Post https://reviews.apache.org/r/38900, i.e. updating CommandExecutor to
> support rootfs, there seem to be some tests showing frequent crashes due to
> assert violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6-0000 at
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682 1236 exec.cpp:297]
> I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
> @ 0x7f9f5a7db3fa google::LogMessage::Fail()
> I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns
> @ 0x7f9f5a7db359 google::LogMessage::SendToLog()
> @ 0x7f9f5a7dad6a google::LogMessage::Flush()
> @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal()
> @ 0x48d00a _CheckFatal::~_CheckFatal()
> @ 0x49c99d
> mesos::internal::CommandExecutorProcess::launchTask()
> @ 0x4b3dd7
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @ 0x4c470c
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9f5a761b1b std::function<>::operator()()
> @ 0x7f9f5a749935 process::ProcessBase::visit()
> @ 0x7f9f5a74d700 process::DispatchEvent::visit()
> @ 0x48e004 process::ProcessBase::serve()
> @ 0x7f9f5a745d21 process::ProcessManager::resume()
> @ 0x7f9f5a742f52
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f9f5a74cf2c
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f9f5a74cedc
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f9f5a74ce6e
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x7f9f5a74cdc5
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x7f9f5a74cd5e
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x7f9f5624f1e0 (unknown)
> @ 0x7f9f564a8df5 start_thread
> @ 0x7f9f559b71ad __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race in which the executor receives a
> {{RunTaskMessage}} before the {{ExecutorRegisteredMessage}}, leading to the
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log:
> https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535
> Another related failure, from {{ExamplesTest.PersistentVolumeFramework}}:
> {code}
> @ 0x7f4f71529cbd google::LogMessage::SendToLog()
> I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager
> successfully handled status update acknowledgement (UUID:
> 721c7316-5580-4636-a83a-098e3bd4ed1f) for task
> ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework
> ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f-0000
> @ 0x7f4f715296ce google::LogMessage::Flush()
> @ 0x7f4f7152c402 google::LogMessageFatal::~LogMessageFatal()
> @ 0x48d00a _CheckFatal::~_CheckFatal()
> @ 0x49c99d
> mesos::internal::CommandExecutorProcess::launchTask()
> @ 0x4b3dd7
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @ 0x4c470c
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f4f714b047f std::function<>::operator()()
> @ 0x7f4f71498299 process::ProcessBase::visit()
> @ 0x7f4f7149c064 process::DispatchEvent::visit()
> @ 0x48e004 process::ProcessBase::serve()
> @ 0x7f4f71494685 process::ProcessManager::resume()
> {code}
> Full logs at:
> https://builds.apache.org/job/Mesos/1191/COMPILER=gcc,CONFIGURATION=--verbose,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull