Hello
On Tue, Jan 29, 2013 at 8:42 PM, G Jones glenn.calt...@gmail.com wrote:
Hi,
As mentioned previously, we've been noticing failures of
tcpborphserver3 at a rate that has become annoying enough to finally
track down. We compiled from the github source on the ROACH2 itself
with debugging enabled and ran through gdb. The failure results are
described below. The problem seems to occur during the starttap
command. We'll forward along the raw katcp command we're using, but
the curious thing is why base which comes from:
base = gs->s_raw_mode->r_map + gs->s_register->e_pos_base;
is pointing to invalid memory sometimes.
So I am the author of this.
gs->s_raw_mode->r_map is the pointer to the FPGA memory, which is mapped
into the address space of the process.
gs->s_register->e_pos_base is the offset of the register within this memory area.
These pointers can become invalid when the FPGA is reprogrammed. There is
a test to exit when the FPGA is reprogrammed, but (going by your
report) it may not be sufficient.
Does the crash occur while the FPGA is being (re)programmed? If so, while
I try to understand the failure, you could do an explicit tap stop
before reprogramming the FPGA (as a temporary workaround).
If it occurs on other occasions, then this problem becomes more interesting.
regards
marc
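
To make the failure mode above concrete, here is a minimal sketch of a guard that could run before the register pointer is computed. It is not the tcpborphserver code: the struct names, the helper checked_register_base and the choice to return NULL are assumptions made for illustration. Only the field names r_map, r_map_size, e_pos_base and e_len_base are taken from the gdb dump in the forwarded message.

```c
#include <stddef.h>
#include <sys/mman.h> /* MAP_FAILED */

/* Hypothetical stand-ins holding only the fields that appear in the gdb
 * dump; the real structures in tcpborphserver have more members. */
struct raw_map_sketch {
  void *r_map;             /* start of the mmap()ed FPGA region */
  unsigned int r_map_size; /* size of that region in bytes */
};

struct register_sketch {
  unsigned int e_pos_base; /* byte offset of the register in the region */
  unsigned int e_len_base; /* length of the register in bytes */
};

/* Return a pointer to the register, or NULL if the mapping is missing
 * or the register does not fit inside the mapped region. */
static void *checked_register_base(const struct raw_map_sketch *rm,
                                   const struct register_sketch *re)
{
  if (rm == NULL || re == NULL) {
    return NULL;
  }
  if (rm->r_map == NULL || rm->r_map == MAP_FAILED) {
    return NULL; /* never mapped, or mmap() failed */
  }
  if (re->e_pos_base > rm->r_map_size) {
    return NULL; /* offset lies beyond the mapping */
  }
  if (re->e_len_base > rm->r_map_size - re->e_pos_base) {
    return NULL; /* register would run past the end of the mapping */
  }
  return (char *)rm->r_map + re->e_pos_base;
}
```

A check like this only catches a missing or failed mapping and out-of-range offsets. It cannot detect a mapping that was torn down while r_map still holds the old address, so it would complement the reprogramming test rather than replace it.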
-- Forwarded message --
From: Ramon E. Creager rcrea...@nrao.edu
Date: Tue, Jan 29, 2013 at 3:09 PM
Subject: [Gbsapp] tcpborphserver3 failure in tg.c
I've gotten the tcpborphserver to fail under the debugger, but because I
don't yet understand the memory management used in this program I'm not
yet sure what the problem is, so I'm putting this out in case someone
who understands the tcpborphserver can help isolate the problem more
quickly than I can. The segv occurs in tg.c, line 421. The gdb output is:
Program received signal SIGSEGV, Segmentation fault.
0x100092d4 in write_mac_fpga (gs=0x107b7928, offset=0, mac=0x107b7970
"\002\002\n\021") at tg.c:421
421 *((uint32_t *)(base + offset)) = value;
(gdb) where
#0 0x100092d4 in write_mac_fpga (gs=0x107b7928, offset=0,
mac=0x107b7970 "\002\002\n\021") at tg.c:421
#1 0x1000a140 in configure_fpga (gs=0x107b7928) at tg.c:877
#2 0x1000ae68 in create_getap (d=0x107878b8, instance=0,
name=0x10795da0 "gbe0", tap=0x10795d9b "tap0",
ip=0x10795da5 "10.17.0.65", port=6, mac=0x10795db6
"02:02:0A:11:00:41", period=10) at tg.c:1167
#3 0x1000b258 in insert_getap (d=0x107878b8, name=0x10795da0 "gbe0",
tap=0x10795d9b "tap0",
ip=0x10795da5 "10.17.0.65", port=6, mac=0x10795db6
"02:02:0A:11:00:41", period=10) at tg.c:1230
#4 0x1000b514 in tap_start_cmd (d=0x107878b8, argc=6) at tg.c:1290
#5 0x100143bc in call_katcp (d=0x107878b8) at dispatch.c:879
#6 0x100145cc in dispatch_katcp (d=0x107878b8) at dispatch.c:951
#7 0x10018994 in run_shared_katcp (d=0x10782008) at shared.c:659
#8 0x1001cbe8 in run_core_loop_katcp (dl=0x10782008) at server.c:699
#9 0x1001d0c0 in run_config_server_katcp (dl=0x10782008, file=0x0,
count=32, host=0x10047c90 "7147", port=0)
at server.c:832
#10 0x10002034 in main (argc=3, argv=0xbff188f4) at main.c:196
(gdb) frame 1
#1 0x1000a140 in configure_fpga (gs=0x107b7928) at tg.c:877
877 if(write_mac_fpga(gs, GO_MAC, gs->s_mac_binary) < 0){
(gdb) frame 0
#0 0x100092d4 in write_mac_fpga (gs=0x107b7928, offset=0,
mac=0x107b7970 "\002\002\n\021") at tg.c:421
421 *((uint32_t *)(base + offset)) = value;
(gdb) list
416
417 value = ( 0x0 & 0xff00) |
418 ( 0x0 & 0xff) |
419 ((mac[0] << 8) & 0xff00) |
420 (mac[1] & 0xff);
421 *((uint32_t *)(base + offset)) = value;
422
423
424 value = ((mac[2] << 24) & 0xff00) |
425 ((mac[3] << 16) & 0xff) |
(gdb) print base
$1 = (void *) 0x1033fff
(gdb) print offset
$2 = 0
(gdb) print value
$3 = 514
(gdb)
'base' is a void * which is set like this:
base = gs->s_raw_mode->r_map + gs->s_register->e_pos_base;
(back to gdb):
(gdb) print *(gs->s_raw_mode)
$12 = {r_registers = 0x10783d80, r_hwmon = 0x10783d90, r_fpga = 1, r_map
= 0x, r_map_size = 33554432,
r_image = 0x0, r_bof_dir = 0x10783da0 "/boffiles", r_top_register =
17314052, r_argc = 3,
r_argv = 0xbff188f4, r_chassis = 0x107876e0, r_taps = 0x10785820,
r_instances = 0}
(gdb) print *(gs->s_register)
$13 = {e_pos_base = 16990208, e_len_base = 16384, e_pos_offset = 0
'\000', e_len_offset = 0 '\000',
e_mode = 3 '\003'}
(gdb)
I should add that 'base' is pointing to memory gdb says it cannot access
(hence the segv):
(gdb) print *(uint32_t *)base
Cannot access memory at address 0x1033fff
Ray
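
The values printed in the gdb session can be cross-checked with a little pointer arithmetic. Since base is the sum of r_map and e_pos_base, subtracting e_pos_base (16990208, which is 0x1034000) from base (0x1033fff) gives the mapping start that the sum implies. The sketch below does that subtraction with 32-bit wraparound, assuming 32-bit pointers as the addresses in the trace suggest. The helper name implied_map_start is hypothetical and is not part of tcpborphserver.

```c
#include <stdint.h>

/* Recover the mapping start implied by base = r_map + e_pos_base.
 * Unsigned 32-bit arithmetic wraps modulo 2^32, which matches pointer
 * arithmetic on a 32-bit target (an assumption about this platform). */
static uint32_t implied_map_start(uint32_t base, uint32_t pos_base)
{
  return base - pos_base;
}
```

Calling implied_map_start(0x1033fff, 16990208) yields 0xffffffff. On Linux that is the bit pattern of (void *)-1, the value mmap() returns as MAP_FAILED, so one possible reading is that the mapping failed or was marked invalid before base was computed. This is only an inference from the printed values, since the r_map field itself is cut off in the dump above. The offset itself looks plausible: e_pos_base plus e_len_base (16384) is well inside r_map_size (33554432).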