Re: panic on 2.6.24rc5
Quoting Tomasz Grobelny:
| On Friday 28 December 2007, I wrote:
| On Wednesday 26 December 2007, you wrote:
| What are the panics you are getting? It might be worth posting them to
| the list.
|
| Here is the screenshot I captured a few days ago. Details:
| - kernel-vanilla 2.6.24rc5,
| Now I'm using kernel as described in Arnaldo's mail (davem/net-2.6.25 +
| patches 0001 to 0051).
|
| - netem+tbf limited traffic on lo interface,
| - the panic was preceded by several dmesg entries like that:
| When netem is not used there are no BUG entries in dmesg, if it is I get:
| BUG: err=1 after ccid_hc_tx_packet_sent

The `err' refers to the return value of dccp_transmit_skb(); the corresponding errno number is defined in include/asm-generic/errno-base.h:

    #define EPERM 1 /* Operation not permitted */

Are you running a security-enhanced Linux (SELinux) or as non-root?

| - dccp seemed to be painfully slow at sending packets from queue (but I
| have no numbers to prove that),
| Ok, now I do have numbers. I wrote a program (sclient) which sends an 80 byte
| packet every 100ms. Here are the times of arrival (i.e. time(0)) on the other
| end of the connection (note that this is on loopback interface with no
| limiting):
| 1199023603
| 1199023603
| 1199023603
| 1199023603
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023607
| 1199023607
| 1199023607
| 1199023607
| 1199023607
| 1199023608
| 1199023608
| 1199023609
| 1199023610
| 1199023613
| 1199023615
| 1199023619
| 1199023624
| 1199023633
| 1199023642
| 1199023659
| 1199023677
| 1199023713
| 1199023749
| 1199023813
| 1199023877
| 1199023941
| 1199024005
| 1199024069
| 1199024133
| 1199024197
| 1199024261
| 1199024325
|
| during this time I get 4 lines in dmesg:
| dccp_hdlr_ack_ratio: Not changing RX Ack Ratio from 1 to 0
| dccp_hdlr_ack_ratio: Not changing TX Ack Ratio from 1 to 2
| dccp_hdlr_ack_ratio: Not changing RX Ack Ratio from 1 to 2
| dccp_hdlr_ack_ratio: Not changing TX Ack Ratio from 1 to 0
|

Did you set net.dccp.default.ack_ratio=1? It looks like it. In any case, this output looks good, since
 * the CCID3 sender disables Ack Ratio (Ack Ratio is only used by CCID2)
 * since you are using loopback, you get both sender and receiver messages (i.e. one sender message is always paired with one receiver message)
 * the CCID3 receiver does not touch Ack Ratio and leaves it at the default value for Ack Ratio (2)
 * the "Not changing {R,T}X Ack Ratio" messages are due to the fact that the Ack Ratio handler is currently disabled. This is a precaution, since CCID2 is not yet well-behaved with Ack Ratios different from 1
 * the Ack Ratio messages should in general not affect CCID3 performance, since the CCID3 code ignores Ack Ratio (i.e. they are only informative)

| As you can see for the first few seconds all is fine (packets arrive 9-10 a
| second), but then the speed drops to 1 packet every 64 seconds. Can anybody
| reproduce that? Or what may I be doing wrong?

I will also try the setting you described. Is it possible for you to run your program between two computers (on the testbeds here it works ok)?

The 64 seconds means that CCID3 is dying very badly; 64 seconds is the maximum packet delay (t_mbi in RFC 3448), so there is something really strange going on. In combination with the -EPERM error it may be that the loss rate is very, very high.

We would really be grateful for any further hints that you could give us. Here are a few more things to test:
 * you can see your setting in the Request/Response handshake; the set of features negotiated for the connection is in the Response
 * many CCID3 parameters are printed out in dccp_probe, but I don't know how well this works when using loopback. Some scripts are on http://www.erg.abdn.ac.uk/users/gerrit/dccp/testing_dccp/
 * wireshark (source version) is also able to decode the receiver reports by the CCID3 receiver (X_recv/loss p)

-
To unsubscribe from this list: send the line "unsubscribe dccp" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
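The arrival times quoted above can be summarised by looking at the gaps between successive seconds in which a packet arrived. The following is only a sketch for reading the numbers, not part of sclient or of the kernel: the function names are made up for illustration, the data is copied from the report as (second, packet count) pairs, and T_MBI is the 64 second limit mentioned in the reply.

```python
# Arrival times from the report, as (time(0) second, packets seen) pairs.
ARRIVALS = [
    (1199023603, 4), (1199023604, 10), (1199023605, 10), (1199023606, 10),
    (1199023607, 5), (1199023608, 2), (1199023609, 1), (1199023610, 1),
    (1199023613, 1), (1199023615, 1), (1199023619, 1), (1199023624, 1),
    (1199023633, 1), (1199023642, 1), (1199023659, 1), (1199023677, 1),
    (1199023713, 1), (1199023749, 1), (1199023813, 1), (1199023877, 1),
    (1199023941, 1), (1199024005, 1), (1199024069, 1), (1199024133, 1),
    (1199024197, 1), (1199024261, 1), (1199024325, 1),
]

T_MBI = 64  # seconds; maximum back-off interval named in the reply


def second_gaps(arrivals):
    """Gaps (in seconds) between successive seconds that saw a packet."""
    seconds = [t for t, _ in arrivals]
    return [b - a for a, b in zip(seconds, seconds[1:])]


def trailing_gaps_at_limit(arrivals, limit=T_MBI):
    """Count how many gaps at the end of the trace equal the limit."""
    count = 0
    for gap in reversed(second_gaps(arrivals)):
        if gap != limit:
            break
        count += 1
    return count


if __name__ == "__main__":
    print("packets received:", sum(n for _, n in ARRIVALS))
    print("gaps between arrival seconds:", second_gaps(ARRIVALS))
    print("trailing gaps at t_mbi:", trailing_gaps_at_limit(ARRIVALS))
```

On this data the gaps stay at 1 second while packets flow normally, then grow as 3, 2, 4, 5, 9, 9, 17, 18, 36, 36 (roughly doubling every two packets, given the one-second resolution of time(0)) and finally settle at 64 seconds for the last nine packets.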
Re: panic on 2.6.24rc5
On Sun, Dec 30, 2007 at 04:18:36PM +0100, Tomasz Grobelny wrote:
| On Friday 28 December 2007, I wrote:
| | On Wednesday 26 December 2007, you wrote:
| | | What are the panics you are getting? It might be worth posting them to
| | | the list.
| |
| | Here is the screenshot I captured a few days ago. Details:
| | - kernel-vanilla 2.6.24rc5,
|
| Now I'm using kernel as described in Arnaldo's mail (davem/net-2.6.25 +
| patches 0001 to 0051).

dccp_hdlr_ack_ratio is not in net-2.6.25, which means it comes from one of Gerrit's patches 0001 to 0051. So, to help us understand where the problem is, you could try building a kernel without applying any of the 0001 to 0051 patches. Could you do this and report the results?

I'm also assuming you are using CCID2, either by explicitly using feature negotiation setsockopt calls or by using the default, which is CCID2. If this is the case, it would also be interesting, before rebuilding the kernel, to try CCID3, as the problem you're experiencing with netem is exactly in the interface between the core DCCP code and the CCID being used.

- Arnaldo
Re: panic on 2.6.24rc5
On Tue, Jan 01, 2008 at 10:30:56PM +0100, Tomasz Grobelny wrote:
| On Tuesday 01 January 2008, Arnaldo Carvalho de Melo wrote:
| | On Sun, Dec 30, 2007 at 04:18:36PM +0100, Tomasz Grobelny wrote:
| | | On Friday 28 December 2007, I wrote:
| | | | On Wednesday 26 December 2007, you wrote:
| | | | | What are the panics you are getting? It might be worth posting them to
| | | | | the list.
| | | |
| | | | Here is the screenshot I captured a few days ago. Details:
| | | | - kernel-vanilla 2.6.24rc5,
| | |
| | | Now I'm using kernel as described in Arnaldo's mail (davem/net-2.6.25 +
| | | patches 0001 to 0051).
| |
| | dccp_hdlr_ack_ratio is not in net-2.6.25, which means it comes from one of
| | Gerrit's patches 0001 to 0051. So, to help us understand where the problem
| | is, you could try building a kernel without applying any of the 0001 to
| | 0051 patches. Could you do this and report the results?
|
| But what should I exactly test? Just whether the delays are gone or
| something more? I'll try to when I have some time (hopefully during weekend).

Whether the kernel oopses, whether the results are the same, or whether the problem was introduced by Gerrit's patches. I.e. you would help us to narrow down the problem by doing a binary search over kernels built at different points in the changeset history. Please take a look at Documentation/BUG-HUNTING in the kernel sources.

The process is somewhat time consuming and it's understandable if you can't perform it; your reports are already of great help. But if you can help us narrow down exactly when the bugs you noticed appeared, or whether they were always present after some kernel builds, we'd be really grateful :-)

| | I'm also assuming you are using CCID2, either by explicitly using feature
| | negotiation setsockopt calls or by using the default, which is CCID2.
|
| In fact I was using ccid3. When I switched to ccid2 it started to work more or
| less ok. It seems that for whatever reason ccid_hc_tx_send_packet is returning
| too big values (up to 64000).

That is an excellent data point. The ccid3 code is way more complex than ccid2, so trying with both is always valuable.

| | If this is the case, it would also be interesting, before rebuilding the
| | kernel, to try CCID3, as the problem you're experiencing with netem is
| | exactly in the interface between the core DCCP code and the CCID being used.
|
| The problem with netem exists with both ccid2 and ccid3. I suspect that when
| all three elements of the connection (server, client and netem) are on one host
| netem is able to communicate packet loss by returning error. If netem was on a
| different host the packet would be sent correctly (no BUG: err=1 after
| ccid_hc_tx_packet_sent) but dropped on another host. I think that in this
| situation dccp should behave as if the packet was simply dropped.

I can't work on this right now, will look at it tomorrow, but thanks for the data points!

- Arnaldo
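The binary search over the patch series that Arnaldo asks for can be sketched as follows. This is an illustration under stated assumptions, not a tool from the thread: first_bad_patch and is_bad are hypothetical names, is_bad(k) stands for "build a kernel with patches 0001 to k applied, boot it, and check whether the problem shows up", the unpatched tree is assumed good, the fully patched tree is assumed bad, and the problem is assumed to persist once introduced.

```python
def first_bad_patch(n_patches, is_bad):
    """Return the number of the first patch that introduces the problem.

    Assumes a kernel with 0 patches applied is good, a kernel with all
    n_patches applied is bad, and that every kernel after the first bad
    one is bad too.
    """
    good, bad = 0, n_patches  # known-good and known-bad patch counts
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid    # problem already present with patches 1..mid
        else:
            good = mid   # problem appears somewhere after patch mid
    return bad


if __name__ == "__main__":
    # Hypothetical stand-in for the build-and-test step: pretend that
    # patch 37 introduced the regression.
    tested = []

    def fake_test(k):
        tested.append(k)
        return k >= 37

    print("first bad patch:", first_bad_patch(51, fake_test))
    print("kernels built and tested:", tested)
```

For a series of 51 patches this needs at most six kernel builds instead of up to 51, which is the reason to bisect even though each step is slow.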
Re: panic on 2.6.24rc5
On Wed, Jan 02, 2008 at 01:57:14AM +0100, Tomasz Grobelny wrote:
| On Wednesday 02 January 2008, Arnaldo Carvalho de Melo wrote:
| | Whether the kernel oopses, whether the results are the same, or whether the
| | problem was introduced by Gerrit's patches. I.e. you would help us to narrow
| | down the problem by doing a binary search over kernels built at different
| | points in the changeset history.
|
| Oh, and by the way: does there exist any set of automated tests for dccp? It
| would be nice to have one, wouldn't it? Otherwise accepting any patch is quite
| risky...

There are test programs, documented in the wiki, and there is peer review too :-) And DCCP on Linux was written in such a way that a large part of its core engine is actually shared with TCP, so it benefits from a much bigger set of developers and testers.

But please feel free to add more automated tests, it'll benefit us all.

- Arnaldo