[ https://issues.apache.org/jira/browse/DISPATCH-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
michael goulish updated DISPATCH-106:
-------------------------------------
Description:
With the standard 6-node demo network (A-D, X, Y), after killing and
restarting node Y, I see a bad link on router D, which causes D to crash.
Here is the sequence of events from the router logs and the topologist
testing program:
01:05:05.367 Killing router Y, pid 20074
01:05:05.367 Sleeping 30 seconds
01:05:35.367 Restarting router Y, pid 20120
01:05:38 Router D : last "valid origins" post to its log file :
Node QDR.C valid origins: []
01:05:46 Router D posts to its log file:
Exited Router Flux Mode
01:06:05.368 checking for crash after node bounce
( no crash detected )
01:06:17 last post to router D log file
ROUTER_LS (trace) RCVD: RA(id=QDR.X area=0 inst=1422165872
ls_seq=2 mobile_seq=0)
01:06:35.369 second check for crash. (none detected)
01:06:35.370 getting topology
( Node D fails to respond. PID 20072 )
( core file, timestamped 01:06 )
Here is the backtrace from router D's core file:
{
#0 pn_string_get (string=0xfdfdfdfdbabecafe) at
/home/mick/rh-qpid-proton/proton-c/src/object/string.c:120
#1 0x00007ff73fa8e752 in qd_router_link_name (link=0x7ff72800b2d0) at
/home/mick/dispatch/src/router_agent.c:112
#2 0x00007ff73fa8e7dd in qd_entity_refresh_router_link
(entity=0x7ff7300c9b50, impl=0x7ff72800b2d0)
at /home/mick/dispatch/src/router_agent.c:120
#3 0x0000003e40805d8c in ffi_call_unix64 () from /lib64/libffi.so.6
#4 0x0000003e408056bc in ffi_call () from /lib64/libffi.so.6
#5 0x00007ff737d2dc8b in _ctypes_callproc () from
/usr/lib64/python2.7/lib-dynload/_ctypes.so
#6 0x00007ff737d27a85 in PyCFuncPtr_call () from
/usr/lib64/python2.7/lib-dynload/_ctypes.so
#7 0x00000036df44a0d3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#8 0x00000036df4de37c in PyEval_EvalFrameEx () from
/lib64/libpython2.7.so.1.0
#9 0x00000036df4e21dd in PyEval_EvalCodeEx () from
/lib64/libpython2.7.so.1.0
#10 0x00000036df4e088f in PyEval_EvalFrameEx () from
/lib64/libpython2.7.so.1.0
#11 0x00000036df4e21dd in PyEval_EvalCodeEx () from
/lib64/libpython2.7.so.1.0
#12 0x00000036df4e088f in PyEval_EvalFrameEx () from
/lib64/libpython2.7.so.1.0
#13 0x00000036df4e21dd in PyEval_EvalCodeEx () from
/lib64/libpython2.7.so.1.0
#14 0x00000036df46f0d8 in ?? () from /lib64/libpython2.7.so.1.0
#15 0x00000036df44a0d3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#16 0x00000036df4590c5 in ?? () from /lib64/libpython2.7.so.1.0
#17 0x00000036df44a0d3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#18 0x00000036df44a1b5 in ?? () from /lib64/libpython2.7.so.1.0
#19 0x00000036df44a29e in PyObject_CallFunction () from
/lib64/libpython2.7.so.1.0
#20 0x00007ff73fa8d77f in qd_io_rx_handler (context=0x7ff736321e68,
msg=0x7ff728019bd0, link_id=0)
at /home/mick/dispatch/src/python_embedded.c:519
#21 0x00007ff73fa92533 in router_rx_handler (context=0x1db5fd0,
link=0x7ff730008710, delivery=0x7ff73004cc50)
at /home/mick/dispatch/src/router_node.c:922
#22 0x00007ff73fa7fa16 in do_receive (pnd=0x1e359a0) at
/home/mick/dispatch/src/container.c:221
#23 0x00007ff73fa7fea3 in process_handler (container=0x1dbd6f0,
unused=0x1e0a050, qd_conn=0x1e2c6a0)
at /home/mick/dispatch/src/container.c:362
#24 0x00007ff73fa80135 in handler (handler_context=0x1dbd6f0,
conn_context=0x1e0a050, event=QD_CONN_EVENT_PROCESS,
qd_conn=0x1e2c6a0) at /home/mick/dispatch/src/container.c:438
#25 0x00007ff73fa98346 in process_connector (qd_server=0x1d78460,
cxtr=0x1e1b9b0)
at /home/mick/dispatch/src/server.c:322
#26 0x00007ff73fa98c1f in thread_run (arg=0x1d70d30) at
/home/mick/dispatch/src/server.c:546
#27 0x0000003e3dc07ee5 in start_thread () from /lib64/libpthread.so.0
...
}
Let's go up to qd_router_link_name
at /home/mick/dispatch/src/router_agent.c:112
(gdb) print * link
$1 =
{
prev = 0x7ff72800b210,
next = 0x7ff72800b390,
mask_bit = 3,
link_type = QD_LINK_ROUTER,
link_direction = QD_OUTGOING,
owning_addr = 0x1d7d6c0,
waypoint = 0x0,
link = 0x7ff7280099d0,
connected_link = 0x0,
ref = 0x7ff72800f350,
target = 0x0,
event_fifo =
{
head = 0x0,
tail = 0x0,
scratch = 0x0,
size = 0
},
msg_fifo =
{
head = 0x7ff73003c230,
tail = 0x7ff73003bb70,
scratch = 0x7ff73003b9f0,
size = 102
}
}
(gdb) print * (link->link)
$2 =
{
pn_sess = 0x7ff72804b7b0,
pn_link = 0x7ff72804d6a0,
context = 0x7ff72800b2d0,
node = 0x1db6bb0,
drain_mode = false
}
(gdb) print * (link->link->pn_link)
$3 = {
endpoint = {
type = 33686018,
state = 33686018,
error = 0x202020202020202,
condition = {
name = 0x202020202020202,
description = 0x202020202020202,
info = 0x202020202020202
},
remote_condition = {
name = 0x202020202020202,
description = 0x202020202020202,
info = 0x202020202020202
},
endpoint_next = 0x202020202020202,
endpoint_prev = 0x202020202020202,
transport_next = 0x202020202020202,
transport_prev = 0x202020202020202,
modified = 2,
freed = 2,
posted_final = 2
},
source = {
address = 0x202020202020202,
properties = 0x202020202020202,
capabilities = 0x202020202020202,
outcomes = 0x202020202020202,
filter = 0x202020202020202,
durability = (PN_DELIVERIES | unknown: 33686016),
expiry_policy = 33686018,
timeout = 33686018,
type = 33686018,
distribution_mode = (PN_DIST_MODE_MOVE | unknown: 33686016),
dynamic = 2
},
target = {
address = 0x202020202020202,
properties = 0x202020202020202,
capabilities = 0x202020202020202,
outcomes = 0x202020202020202,
filter = 0x202020202020202,
durability = (PN_DELIVERIES | unknown: 33686016),
expiry_policy = 33686018,
( etc. -- it's all garbage. )
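
A quick way to read the pn_link dump above: the "garbage" values are all the
same byte repeated. The sketch below is Python, not router code, and the
helper name repeated_byte is made up for illustration. It decodes values
copied from the gdb output. The last line assumes PN_DELIVERIES and
PN_DIST_MODE_MOVE both have the value 2 in the proton headers. A uniform fill
like this often means the memory was freed or reused, but nothing in the dump
confirms what wrote it.

```python
def repeated_byte(value, width):
    """Return the fill byte if `value` is one byte repeated `width` times, else None.

    Hypothetical helper for reading the dump; not part of dispatch or proton.
    """
    raw = value.to_bytes(width, "little")
    return raw[0] if len(set(raw)) == 1 else None


# 32-bit fields such as endpoint.type and endpoint.state
print(hex(33686018))                          # 0x2020202
print(repeated_byte(33686018, 4))             # 2

# 64-bit pointer fields such as endpoint.error and source.address
print(repeated_byte(0x202020202020202, 8))    # 2

# The pointer passed to pn_string_get in frame #0 is not a uniform fill
print(repeated_byte(0xfdfdfdfdbabecafe, 8))   # None

# gdb shows enum fields as "(NAME | unknown: 33686016)"; if NAME is 2
# (an assumption), the raw value is the same 0x02 fill as everything else.
print(hex(33686016 | 2))                      # 0x2020202
```

So every field shown in the pn_link struct is consistent with the whole
object having been overwritten with 0x02 bytes, while the qd_link that points
at it (link->link) still looks intact.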
> pn link corruption after router restart
> ---------------------------------------
>
> Key: DISPATCH-106
> URL: https://issues.apache.org/jira/browse/DISPATCH-106
> Project: Qpid Dispatch
> Issue Type: Bug
> Components: Router Node
> Affects Versions: 0.3
> Reporter: michael goulish