Re: kernel BUG at kernel/timer.c:407!
Quoting Andre Noll: | On 10:24, Andre Noll wrote: | > > Can you confirm that, when the `culprit' patch is reverted (as per | > > the first email), the bug disappears? | > | > I double-checked this before sending the first mail, but I also will | > try the patch you sent today and report tomorrow. | | I now can confirm that also with 2.6.21-rc3 your patch makes the bug | go away, so feel free to add my Signed-Off. Would have liked to do that since in effect you identified this issue. For a sign-off it is probably too late, but in any case it is noted in the commit message. Glad to hear that the bug is remedied - that means two issues solved :) Thanks again Gerrit - To unsubscribe from this list: send the line "unsubscribe dccp" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at kernel/timer.c:407!
On 10:24, Andre Noll wrote: > > Can you confirm that, when the `culprit' patch is reverted (as per > > the first email), the bug disappears? > > I double-checked this before sending the first mail, but I also will > try the patch you sent today and report tomorrow. I now can confirm that also with 2.6.21-rc3 your patch makes the bug go away, so feel free to add my Signed-Off. Thanks Andre -- The only person who always got his work done by Friday was Robinson Crusoe signature.asc Description: Digital signature
Re: kernel BUG at kernel/timer.c:407!
On 07:13, Gerrit Renker wrote: > honestly, I have been unable to reproduce the bug, but there is a point > in what you said regarding the CCID-split patch. In fact, you identified > a real bug. > > I have looked at your application: > * the server listens, accepts and > * then writes to the client > > This is bi-directional mode. In TCP no problem. In DCCP that patch > has disabled bidirectional mode since at that time there were > strange interactions. This explains why the bug is also present in yesterday's Davem's tree, even with your two patches from yesterday applied. > In short: the only mode that is currently supported is > * server listens, accepts > * and only reads > o client connects and > o only writes > > That works with the test applications (ttcp and iperf) since this is > exactly the way they operate. But it constrains socket programming. Yes, I tried yesterday to reproduce the bug with ttcp but wasn't able to do so. > Can you confirm that, when the `culprit' patch is reverted (as per > the first email), the bug disappears? I double-checked this before sending the first mail, but I also will try the patch you sent today and report tomorrow. > In any way, thanks a lot for identifying this obvious restriction, > this has been a great help. You're welcome. > I think it is great that real applications appear and I would like to > use this paraslash application. Feel free to ask questions, or send patches :) > If there are any other problems, please > do email to this list. Will do. This was really the first problem I had with dccp. I implemented paraslash's dccp sender/receiver in March 2006 and it worked very well from the first day on. There have been only minor changes to this code since then. Andre -- The only person who always got his work done by Friday was Robinson Crusoe signature.asc Description: Digital signature
Re: kernel BUG at kernel/timer.c:407!
Andre, honestly, I have been unable to reproduce the bug, but there is a point in what you said regarding the CCID-split patch. In fact, you identified a real bug. I have looked at your application: * the server listens, accepts and * then writes to the client This is bi-directional mode. In TCP no problem. In DCCP that patch has disabled bidirectional mode since at that time there were strange interactions. In short: the only mode that is currently supported is * server listens, accepts * and only reads o client connects and o only writes That works with the test applications (ttcp and iperf) since this is exactly the way they operate. But it constrains socket programming. I will ask Arnaldo to revert all these stupid changes that I caused. This will allow full bidirectional connections in the first place. I have looked at the CCID3 code, it seems to be ok with that. Can you confirm that, when the `culprit' patch is reverted (as per the first email), the bug disappears? That would double the reasons. In any way, thanks a lot for identifying this obvious restriction, this has been a great help. I think it is great that real applications appear and I would like to use this paraslash application. If there are any other problems, please do email to this list. Thanks again Gerrit - To unsubscribe from this list: send the line "unsubscribe dccp" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at kernel/timer.c:407!
Further insights after several kernel recompilations. | I checked 2.6.19.x (which works), 2.6.20, 2.6.20.1 and 2.6.21-rc1 | (which all have the bug). 2.6.21-rc2 wouldn't compile on my system | due to some apic isssue. Since I couldn't reproduce the bug on the current kernel (2.6.21-rc2), I reverted to 2.6.21-rc1 and tried the paraslash application as described. I got a "main: dccp connect error" when connecting to the paraserver (which was listening on port 5001, as `ss' reported). With wireshark, I could see that the handshake and the close request were both handled correctly, so it seems that the connect error was rather down to a user error. After recompiling to the most recent 2.6.20-rc2, same error. I also changed configuration to PREEMPT but need to leave further testing for now. | | > If it is a very recent | >one, can you please check whether you get the following message in your | >syslog: | > "[...] listen_overflow!" | > | >If yes (dccp_debug should be turned on), then very likely setting listen(fd, 1) | >instead of listen(fd,0) may remove any strange effects. | >The reason is a recent change in sk_accept_queue_is_full which causes a different | >treatment of zero-sized listen-accept queues. | | Will test this evening and report tomorrow. This is indeed a(nother) bug - zero-sized listen queues currently do not work. Patch submitted to netdev. | | > 2. With the most recent davem-2.6 kernel I was not able to reproduce this bug. It | >should, after some more thought, really make no difference whether you are using | >loopback (127.0.0.1) or not. | | I can try this kernel as well. I'm currently downloading | | git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.git | | Hope, that's the right one. Yes that should work. There is also a net-2.6.22.git but I think it is only there for the more recent changes which not yet have made it into 2.6. | >(a) an update of this patch | >(b) a second patch to do with timer initialisation of child sockets. | >In particular (a) may help. | | OK. Are these patches against the current linus-tree? They should work on any recent 2.6.19/20 tree (maybe with some offset). | >I managed to get the paraslash application built, but could not figure out how to | >populate the user lists and required configuration files. | | That's explained in the INSTALL file. But you don't need any Thanks - it was not spite not to rtfm, rather since busy with a lot of other stuff at the moment. | >I don't understand your code fully yet, but with the more recent stack trace I | >was wondering whether this has to do with setting the listen socket non-blocking | >(mark_fd_nonblock), which is done both in sender and receiver. | | IMHO it's considered good practice to set all fds which are used for | select() to non-blocking mode. AFAIR the reason is the situation | where a network packet arrives but is discarded because of a checksum | error. In this case it might happen that select() indicates readability | of an fd, but a subsequent read() blocks nevertheless. Maybe it's | unneccesssary to set an fd to non-blocking mode if it is only used | for writing. But it won't hurt either, so.. Oh sorry - it was not my intention to talk about programming style. The code is fine and you are right with what you are saying. It is just that I wanted to pin down the exact cause of the bug as in `it happens on non-blocking socket when X and Y and Z hold' &) - To unsubscribe from this list: send the line "unsubscribe dccp" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at kernel/timer.c:407!
On 09:20, Gerrit Renker wrote: > thank you for continued testing. I spent a long while yesterday evening > isolating > possible causes. Here are further pointers > > 1. Can you please tell us which kernel you are using? I checked 2.6.19.x (which works), 2.6.20, 2.6.20.1 and 2.6.21-rc1 (which all have the bug). 2.6.21-rc2 wouldn't compile on my system due to some apic isssue. > If it is a very recent >one, can you please check whether you get the following message in your >syslog: > "[...] listen_overflow!" > >If yes (dccp_debug should be turned on), then very likely setting > listen(fd, 1) >instead of listen(fd,0) may remove any strange effects. >The reason is a recent change in sk_accept_queue_is_full which causes a > different >treatment of zero-sized listen-accept queues. Will test this evening and report tomorrow. > 2. With the most recent davem-2.6 kernel I was not able to reproduce this > bug. It >should, after some more thought, really make no difference whether you are > using >loopback (127.0.0.1) or not. I can try this kernel as well. I'm currently downloading git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.git Hope, that's the right one. > 3. I analyzed the reverted patch you identified. There is indeed a loophole > (which >has not become visible so far), hence I will send >(a) an update of this patch >(b) a second patch to do with timer initialisation of child sockets. >In particular (a) may help. OK. Are these patches against the current linus-tree? > 4. It may be worth trying a different application, e.g. > >http://www.erg.abdn.ac.uk/users/gerrit/dccp/apps/ttcp_dccp.tar.gz >in order to find out which combination of system calls triggers the bug > condition. Will test and report. >I managed to get the paraslash application built, but could not figure out > how to >populate the user lists and required configuration files. That's explained in the INSTALL file. But you don't need any configuration files and I _think_ you don't even need a paraslash user to reproduce the bug, an empty ~/.paraslash/server.users should do. Just start para_server with the autoplay (-a) option, i.e. para_server -a --random_dir=/some/dir/containing/an/mp3/file Then para_recv -r dccp triggers the bug. >I don't understand your code fully yet, but with the more recent stack > trace I >was wondering whether this has to do with setting the listen socket > non-blocking >(mark_fd_nonblock), which is done both in sender and receiver. IMHO it's considered good practice to set all fds which are used for select() to non-blocking mode. AFAIR the reason is the situation where a network packet arrives but is discarded because of a checksum error. In this case it might happen that select() indicates readability of an fd, but a subsequent read() blocks nevertheless. Maybe it's unneccesssary to set an fd to non-blocking mode if it is only used for writing. But it won't hurt either, so.. Thanks Andre -- The only person who always got his work done by Friday was Robinson Crusoe signature.asc Description: Digital signature
Re: kernel BUG at kernel/timer.c:407!
Hi Andre, thank you for continued testing. I spent a long while yesterday evening isolating possible causes. Here are further pointers 1. Can you please tell us which kernel you are using? If it is a very recent one, can you please check whether you get the following message in your syslog: "[...] listen_overflow!" If yes (dccp_debug should be turned on), then very likely setting listen(fd, 1) instead of listen(fd,0) may remove any strange effects. The reason is a recent change in sk_accept_queue_is_full which causes a different treatment of zero-sized listen-accept queues. 2. With the most recent davem-2.6 kernel I was not able to reproduce this bug. It should, after some more thought, really make no difference whether you are using loopback (127.0.0.1) or not. 3. I analyzed the reverted patch you identified. There is indeed a loophole (which has not become visible so far), hence I will send (a) an update of this patch (b) a second patch to do with timer initialisation of child sockets. In particular (a) may help. 4. It may be worth trying a different application, e.g. http://www.erg.abdn.ac.uk/users/gerrit/dccp/apps/ttcp_dccp.tar.gz in order to find out which combination of system calls triggers the bug condition. I managed to get the paraslash application built, but could not figure out how to populate the user lists and required configuration files. I don't understand your code fully yet, but with the more recent stack trace I was wondering whether this has to do with setting the listen socket non-blocking (mark_fd_nonblock), which is done both in sender and receiver. Again, many thanks for providing detailed information Gerrit Quoting Andre Noll: | The bug remains, but the backtrace is slightly different, | see below. | | > > The BUG is caused via the following chain: | > > | > > 1. dccp_write_xmit(sk, 0) (due to !block) | > > 1. dccp_sendmsg | > > 2. ccid2_hc_tx_send_packet -> with hctx->ccid2hctx_pipe >= hctx->ccid2hctx_cwnd | > >(see above, pipe=cwnd=1) ==> returns 1 | > > 3. in dccp_write_xmit(sk, 0): | > >if (!block) { /* this is true here */ | > > sk_reset_timer(sk, &dp->dccps_xmit_timer, | > > msecs_to_jiffies(err)+jiffies) | > >==> BUG() | > > | <7>dccp_set_state: listen(c1580030) LISTEN -> CLOSED | > > This may be a clue: this socket has not gone past listen state (i.e. not entered server) | > | > Yes, the bug happens in para_server just at the moment the first client | > connects. No data is transfered to the client. I'll look into the kernel | > dccp code a bit this evening as well. | | Found nothing suspicious. Apparently, dccp_connect() in | net/cddp/output.c is never called as this is the only place where | dp->dccps_xmit_timer.function is set, and the BUG in kernel/timer.c | indicates that this function pointer is NULL. - To unsubscribe from this list: send the line "unsubscribe dccp" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at kernel/timer.c:407!
On 19:59, Andre Noll wrote: > > 2. This is using CCID2, which has not been maintained for a while. Can > >you please try CCID 3 also, e.g. by using the following sysctls: > > > >sysctl -w net.dccp.default.rx_ccid=3 > >sysctl -w net.dccp.default.tx_ccid=3 > >sysctl -w net.dccp.default.tx_qlen=5 > >sysctl -w net.dccp.default.seq_window=100 > >sysctl -w net.dccp.default.send_ackvec=0 > > Will do this today in the evening and report again tomorrow. There was no sysctl on this system, but echo 3 > /proc/net/dccp/default/rx_ccid should do the same, so I did this (and analogously for the other three values). The bug remains, but the backtrace is slightly different, see below. > > The BUG is caused via the following chain: > > > > 1. dccp_write_xmit(sk, 0) (due to !block) > > 1. dccp_sendmsg > > 2. ccid2_hc_tx_send_packet -> with hctx->ccid2hctx_pipe >= > > hctx->ccid2hctx_cwnd > >(see above, pipe=cwnd=1) ==> returns 1 > > 3. in dccp_write_xmit(sk, 0): > >if (!block) { /* this is true here */ > > sk_reset_timer(sk, &dp->dccps_xmit_timer, > > msecs_to_jiffies(err)+jiffies) > >==> BUG() > > | <7>dccp_set_state: listen(c1580030) LISTEN -> CLOSED > > This may be a clue: this socket has not gone past listen state (i.e. not > > entered server) > > Yes, the bug happens in para_server just at the moment the first client > connects. No data is transfered to the client. I'll look into the kernel > dccp code a bit this evening as well. Found nothing suspicious. Apparently, dccp_connect() in net/cddp/output.c is never called as this is the only place where dp->dccps_xmit_timer.function is set, and the BUG in kernel/timer.c indicates that this function pointer is NULL. Regards, Andre [ cut here ] kernel BUG at kernel/timer.c:407! invalid opcode: [#1] PREEMPT CPU:0 EIP:0060:[]Not tainted VLI EFLAGS: 00210246 (2.6.20 #14) EIP is at mod_timer+0x1d/0x21 eax: df6e473c ebx: df6e4400 ecx: edx: fffc231e esi: df6e4400 edi: db534c4c ebp: 01a1 esp: dce2be0c ds: 007b es: 007b ss: 0068 Process para_server (pid: 1263, ti=dce2a000 task=de089550 task.ti=dce2a000) Stack: c03747b6 db579d80 c0416949 dce2be20 dce2be98 c0511fc0 df6e4400 dce2be98 dce2bedc c03b5470 01a1 c0443f20 dce2be98 dcf9b800 df788200 c0371128 01a1 dce2bedc dce2bedc 0001 dce2bed4 Call Trace: [] sk_reset_timer+0xc/0x15 [] dccp_sendmsg+0x105/0x155 [] inet_sendmsg+0x47/0x5f [] do_sock_write+0x93/0xa2 [] sock_aio_write+0x74/0x8c [] pcibios_setup+0x104/0x3b2 [] inet_csk_accept+0x8b/0x163 [] do_sync_write+0xc8/0x117 [] write_chan+0x1a5/0x1e4 [] __wake_up+0x4b/0x77 [] autoremove_wake_function+0x0/0x43 [] tty_ldisc_deref+0x51/0x84 [] copy_to_user+0x2b/0x37 [] vfs_write+0x18f/0x194 [] sys_write+0x41/0x6a [] syscall_call+0x7/0xb === Code: 5f c3 0f 0b eb fe 5b 5e 5f e9 be aa 2f 00 8b 48 0c 85 c9 74 16 39 50 08 74 05 e9 cb fe ff ff 8b 08 85 c9 74 f5 b8 01 00 00 00 c3 <0f> 0b eb fe 83 ec 0c 89 5c 24 04 89 74 24 08 89 c3 8b 00 31 f6 EIP: [] mod_timer+0x1d/0x21 SS:ESP 0068:dce2be0c -- The only person who always got his work done by Friday was Robinson Crusoe signature.asc Description: Digital signature
Re: kernel BUG at kernel/timer.c:407!
On 16:42, Gerrit Renker wrote: > I would much rather hunt down this bug than revert this patch, since > reverting it > means that almost twice the work is done per each packet and truly > bidirectional > connections are not currently supported. Agreed. Let's try to find the bug. > With initial tests here I could not reproduce the bug and had difficulties > since I have not all required libraries for paraslash. You only need libssl-dev to reproduce the bug. All the other libraries are optional and configure should detect what is installed on your system. > 1. I see you are using loopback (127.0.0.1) where packets are routed back >internally: does the bug persist when CCID2 sender and CCID2 receiver >are on different hosts? Not so easy to find out because I only have one machine and no internet at home, and no kernel with dccp support at work. I'll see what I can do. > 2. This is using CCID2, which has not been maintained for a while. Can >you please try CCID 3 also, e.g. by using the following sysctls: > >sysctl -w net.dccp.default.rx_ccid=3 >sysctl -w net.dccp.default.tx_ccid=3 >sysctl -w net.dccp.default.tx_qlen=5 >sysctl -w net.dccp.default.seq_window=100 >sysctl -w net.dccp.default.send_ackvec=0 Will do this today in the evening and report again tomorrow. > 3. Caveat: `Server' listens, `Client' connects. I could not build the > paraslash app >due to missing libraries, but found that dccp_recv_open calls connect in > dccp_recv.c >while dccp_open() in dccp_send.c calls listen(). It seems that the roles > are reversed, >is it possible to swap this in the application and does the problem > persist? Well, that's possible, but IMHO does not make much sense: The server app waits for incoming connections (not neccessary only dccp) and responds by sending the audio stream to the client. That's how it's supposed to be, no? > The BUG is caused via the following chain: > > 1. dccp_write_xmit(sk, 0) (due to !block) > 1. dccp_sendmsg > 2. ccid2_hc_tx_send_packet -> with hctx->ccid2hctx_pipe >= > hctx->ccid2hctx_cwnd >(see above, pipe=cwnd=1) ==> returns 1 > 3. in dccp_write_xmit(sk, 0): >if (!block) { /* this is true here */ > sk_reset_timer(sk, &dp->dccps_xmit_timer, > msecs_to_jiffies(err)+jiffies) >==> BUG() > | <7>dccp_set_state: listen(c1580030) LISTEN -> CLOSED > This may be a clue: this socket has not gone past listen state (i.e. not > entered server) Yes, the bug happens in para_server just at the moment the first client connects. No data is transfered to the client. I'll look into the kernel dccp code a bit this evening as well. Thanks for your help Andre -- The only person who always got his work done by Friday was Robinson Crusoe signature.asc Description: Digital signature
Re: kernel BUG at kernel/timer.c:407!
Hi Andre, thank you for providing a detailed report. The mentioned patch avoids sending received information to both rx/tx halves of a CCID. There are two possibilities: (a) The patch itself has a bug The patch has been tested extensively over months (but using mainly CCID3, as CCID2 gave bad performance) or (b) There is a problem in the CCID2 code, since there have been many kernel changes and the code was not so much maintained. I would much rather hunt down this bug than revert this patch, since reverting it means that almost twice the work is done per each packet and truly bidirectional connections are not currently supported. With initial tests here I could not reproduce the bug and had difficulties since I have not all required libraries for paraslash. 1. I see you are using loopback (127.0.0.1) where packets are routed back internally: does the bug persist when CCID2 sender and CCID2 receiver are on different hosts? 2. This is using CCID2, which has not been maintained for a while. Can you please try CCID 3 also, e.g. by using the following sysctls: sysctl -w net.dccp.default.rx_ccid=3 sysctl -w net.dccp.default.tx_ccid=3 sysctl -w net.dccp.default.tx_qlen=5 sysctl -w net.dccp.default.seq_window=100 sysctl -w net.dccp.default.send_ackvec=0 3. Caveat: `Server' listens, `Client' connects. I could not build the paraslash app due to missing libraries, but found that dccp_recv_open calls connect in dccp_recv.c while dccp_open() in dccp_send.c calls listen(). It seems that the roles are reversed, is it possible to swap this in the application and does the problem persist? 4. Notwithstanding (3), the BUG() condition in mod_timer should not be triggered, so any further information - in particular the tests in (1,2) would be good to do. Gerrit | ccid2_hc_tx_send_packet: pipe=1 cwnd=1 | [ cut here ] | kernel BUG at kernel/timer.c:407! | invalid opcode: [#1] | PREEMPT | CPU:0 | EIP:0060:[]Not tainted VLI | EFLAGS: 00210246 (2.6.20.1 #12) | EIP is at mod_timer+0x28/0x2c | eax: c15807d4 ebx: c1580414 ecx: edx: fffbda11 | esi: c1580414 edi: dea75644 ebp: dc637db8 esp: dc637db8 | ds: 007b es: 007b ss: 0068 | Process para_server (pid: 1085, ti=dc636000 task=de5df030 task.ti=dc636000) | Stack: dc637dc4 c0369b66 0001 dc637df8 c0407ddc dc637de0 c15804bc 00200292 | dc637de4 c041588c c15804b0 dc637df8 dea75644 c1580414 dc7b2ec4 | dc637e1c c040971d dc637e08 dc637e8c c0529780 c1580414 | Call Trace: | [] show_trace_log_lvl+0x1a/0x30 | [] show_stack_log_lvl+0x8d/0xaa | [] show_registers+0x1a8/0x312 | [] die+0x109/0x226 | [] do_trap+0x7e/0xb4 | [] do_invalid_op+0xa3/0xad | [] error_code+0x74/0x7c | [] sk_reset_timer+0xf/0x19 | [] dccp_write_xmit+0x224/0x22c | [] dccp_sendmsg+0x10a/0x15a | [] inet_sendmsg+0x44/0x55 | [] do_sock_write+0x9a/0xa9 | [] sock_aio_write+0x6a/0x7b | [] do_sync_write+0xc7/0x116 | [] vfs_write+0x182/0x187 | [] sys_write+0x3d/0x64 | [] syscall_call+0x7/0xb | === The BUG is caused via the following chain: 1. dccp_write_xmit(sk, 0) (due to !block) 1. dccp_sendmsg 2. ccid2_hc_tx_send_packet -> with hctx->ccid2hctx_pipe >= hctx->ccid2hctx_cwnd (see above, pipe=cwnd=1) ==> returns 1 3. in dccp_write_xmit(sk, 0): if (!block) { /* this is true here */ sk_reset_timer(sk, &dp->dccps_xmit_timer, msecs_to_jiffies(err)+jiffies) ==> BUG() | <7>dccp_set_state: listen(c1580030) LISTEN -> CLOSED This may be a clue: this socket has not gone past listen state (i.e. not entered server) - To unsubscribe from this list: send the line "unsubscribe dccp" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html