Re: panic on 2.6.24rc5

2008-01-07 Thread Gerrit Renker

Quoting Tomasz Grobelny:
| On Friday 28 December 2007, I wrote:
|  On Wednesday 26 December 2007, you wrote:
|   What are the panics you are getting? It might be worth posting them to
|   the list.
| 
|  Here is the screenshot I captured a few days ago. Details:
|   - kernel-vanilla 2.6.24rc5,
| Now I'm using kernel as described in Arnaldo's mail (davem/net-2.6.25 + 
| patches 0001 to 0051).
| 
|   - netem+tbf limited traffic on lo interface,
|   - the panic was preceded by several dmesg entries like that:
| When netem is not used there are no BUG entries in dmesg; if it is, I get:
| BUG: err=1 after ccid_hc_tx_packet_sent 
The `err' refers to the return value of dccp_transmit_skb(), the
corresponding errno number is in include/asm-generic/errno-base.h

#define EPERM            1      /* Operation not permitted */

Are you running a security-enhanced Linux (SELinux), or running as non-root?
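
For reference, a minimal Python sketch (not part of the DCCP code) that maps a numeric value to its errno name. It assumes that `err` in the BUG line really is an errno value, as in the lookup above; `describe_errno` is a hypothetical helper name.

```python
import errno
import os


def describe_errno(err):
    """Return (symbolic name, message) for an errno value.

    Kernel code usually returns errors negated (e.g. -EPERM), so the
    sign is dropped before the lookup.
    """
    code = abs(err)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # err=1 as printed in the BUG line quoted above
    print(describe_errno(1))
```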

| 
|   - dccp seemed to be painfully slow at sending packets from queue (but I
|  have no numbers to prove that),
| Ok, now I do have numbers. I wrote a program (sclient) which sends an 80 byte 
| packet every 100ms. Here are the times of arrival (ie. time(0)) on the other 
| end of the connection (note that this is on loopback interface with no 
| limiting):
| 1199023603
| 1199023603
| 1199023603
| 1199023603
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023604
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023605
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023606
| 1199023607
| 1199023607
| 1199023607
| 1199023607
| 1199023607
| 1199023608
| 1199023608
| 1199023609
| 1199023610
| 1199023613
| 1199023615
| 1199023619
| 1199023624
| 1199023633
| 1199023642
| 1199023659
| 1199023677
| 1199023713
| 1199023749
| 1199023813
| 1199023877
| 1199023941
| 1199024005
| 1199024069
| 1199024133
| 1199024197
| 1199024261
| 1199024325
| 
| during this time I get 4 lines in dmesg:
| dccp_hdlr_ack_ratio: Not changing RX Ack Ratio from 1 to 0
| dccp_hdlr_ack_ratio: Not changing TX Ack Ratio from 1 to 2
| dccp_hdlr_ack_ratio: Not changing RX Ack Ratio from 1 to 2
| dccp_hdlr_ack_ratio: Not changing TX Ack Ratio from 1 to 0
| 
Did you set net.dccp.default.ack_ratio=1? It looks like it. In any
case, this output looks good, since
 * the CCID3 sender disables Ack Ratio (Ack Ratio is only used by CCID2)
 * since you are using loopback, you get both sender/receiver messages
   (i.e. one sender message is always paired with one receiver message)
 * the CCID3 receiver does not touch Ack Ratio and leaves it at the default
   value for Ack Ratio (2)
 * the "Not changing {R,T}X Ack Ratio" messages are due to the fact that
   the Ack Ratio handler is currently disabled. This is a precaution
   since CCID2 is not well-behaved yet with Ack Ratios different from 1
 * the Ack Ratio messages should in general not affect CCID3 performance
   since the CCID3 code ignores Ack Ratio (i.e. they are only informative)
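
To check the value in question, the sysctl can be read through procfs. A minimal sketch, assuming the usual mapping of dotted sysctl names onto /proc/sys paths; `sysctl_path` and `read_sysctl` are hypothetical helper names, and the file only exists when the dccp module is loaded and the kernel provides this sysctl.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its procfs path."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the sysctl value as a string, or None if it is not available."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    value = read_sysctl("net.dccp.default.ack_ratio")
    print("ack_ratio:", "not available" if value is None else value)
```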

| As you can see for the first few seconds all is fine (packets arrive 9-10 a 
| second), but then the speed drops to 1 packet every 64 seconds. Can anybody 
| reproduce that? Or what may I be doing wrong?
I will also try the setup you described. Is it possible for you to run
your program between two computers (on the testbeds here it works ok)?
The 64 seconds means that CCID3 is dying very badly: 64 seconds is the
maximum inter-packet interval (t_mbi in RFC 3448), so something really
strange is going on. In combination with the -EPERM error it may be that
the loss rate is very high.
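
The back-off can be read directly off the arrival times quoted above. A minimal sketch, using the last 20 timestamps from that list and the 80-byte packet size from the report; the variable names are made up for illustration, and s/t_mbi as the lowest sending rate is taken from RFC 3448.

```python
# Last 20 arrival times (seconds since the epoch) from the list quoted above.
arrivals = [
    1199023610, 1199023613, 1199023615, 1199023619, 1199023624,
    1199023633, 1199023642, 1199023659, 1199023677, 1199023713,
    1199023749, 1199023813, 1199023877, 1199023941, 1199024005,
    1199024069, 1199024133, 1199024197, 1199024261, 1199024325,
]

T_MBI = 64        # maximum inter-packet interval in seconds (RFC 3448)
PACKET_SIZE = 80  # bytes per packet, as sent by sclient in the report

# Time between consecutive arrivals.
gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]

# Lowest sending rate TFRC allows: one packet every t_mbi seconds.
min_rate = PACKET_SIZE / T_MBI  # bytes per second

if __name__ == "__main__":
    print("gaps (s):", gaps)
    print("minimum rate (bytes/s):", min_rate)
```

The gaps roughly double (3, 2, 4, 5, 9, 9, 17, 18, 36, 36) and then stay at 64 seconds, i.e. the sender ends up at one 80-byte packet per t_mbi, or 1.25 bytes per second.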

We would really be grateful for any further hints that you could give us.

Here are a few more things to test:
 * you can see your settings in the Request/Response handshake; the set
   of features negotiated for the connection is in the Response
  
 * many CCID3 parameters are printed out in dccp_probe, but I don't know
   how well this works when using loopback. Some scripts are on
   http://www.erg.abdn.ac.uk/users/gerrit/dccp/testing_dccp/

 * wireshark (source version) is also able to decode the feedback reports
   sent by the CCID3 receiver (X_recv / loss event rate p)
-
To unsubscribe from this list: send the line unsubscribe dccp in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: panic on 2.6.24rc5

2008-01-01 Thread Arnaldo Carvalho de Melo
On Sun, Dec 30, 2007 at 04:18:36PM +0100, Tomasz Grobelny wrote:
> On Friday 28 December 2007, I wrote:
> > On Wednesday 26 December 2007, you wrote:
> > > What are the panics you are getting? It might be worth posting them to
> > > the list.
> >
> > Here is the screenshot I captured a few days ago. Details:
> >  - kernel-vanilla 2.6.24rc5,
> Now I'm using kernel as described in Arnaldo's mail (davem/net-2.6.25 +
> patches 0001 to 0051).

dccp_hdlr_ack_ratio is not in net-2.6.25, which means it comes from one of
the 0001 to 0051 patches from Gerrit. So, to help us understand where the
problem is, you could try building a kernel without applying any of the
0001 to 0051 patches.

Could you do this and report the results?

I'm also assuming you are using CCID2, either by explicitly using
feature negotiation setsockopt calls or by using the default, which is
CCID2. If this is the case, it would also be interesting, before
rebuilding the kernel, to try CCID3, as the problem you're
experiencing when using netem is exactly in the interface between the
core DCCP code and the CCID being used.

- Arnaldo


Re: panic on 2.6.24rc5

2008-01-01 Thread Arnaldo Carvalho de Melo
On Tue, Jan 01, 2008 at 10:30:56PM +0100, Tomasz Grobelny wrote:
> On Tuesday 01 January 2008, Arnaldo Carvalho de Melo wrote:
> > On Sun, Dec 30, 2007 at 04:18:36PM +0100, Tomasz Grobelny wrote:
> > > On Friday 28 December 2007, I wrote:
> > > > On Wednesday 26 December 2007, you wrote:
> > > > > What are the panics you are getting? It might be worth posting them
> > > > > to the list.
> > > >
> > > > Here is the screenshot I captured a few days ago. Details:
> > > >  - kernel-vanilla 2.6.24rc5,
> > >
> > > Now I'm using kernel as described in Arnaldo's mail (davem/net-2.6.25 +
> > > patches 0001 to 0051).
> >
> > dccp_hdlr_ack_ratio is not in net-2.6.25, which means it comes from one of
> > the 0001 to 0051 patches from Gerrit. So, to help us understand where the
> > problem is, you could try building a kernel without applying any of the
> > 0001 to 0051 patches.
> >
> > Could you do this and report the results?
>
> But what exactly should I test? Just whether the delays are gone or something
> more? I'll try when I have some time (hopefully during the weekend).

Whether the kernel oopses, and whether the results are the same or the
problem was introduced by the patches from Gerrit. I.e. you would help us
narrow down the problem by doing a binary search over kernels built from
the changeset history.

Please take a look at Documentation/BUG-HUNTING in the kernel sources.
The process is somewhat time consuming and it's understandable if you
can't perform it; your reports are already of great help. But if you can
help us narrow down exactly when the bugs you notice appeared, or whether
they were always present after some kernel builds, we'd be really
grateful :-)
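
The binary search over the patch series can be sketched as follows. This is only an illustration, assuming the bug is absent without the patches, present with all 51 applied, and stays present once introduced; `first_bad_patch` and `is_bad` are hypothetical names, and `is_bad(k)` stands for building and testing a kernel with patches 0001 to k applied.

```python
def first_bad_patch(n_patches, is_bad):
    """Return the number of the first patch that introduces the bug.

    Assumes is_bad(0) is False (no patches applied), is_bad(n_patches)
    is True, and that the bug stays present once it has been introduced.
    """
    good, bad = 0, n_patches
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid    # bug already present: culprit is at or before mid
        else:
            good = mid   # still fine: culprit is after mid
    return bad


if __name__ == "__main__":
    tested = []

    def fake_test(k):
        # Stand-in for "build and boot a kernel with patches 1..k".
        # Patch 37 is an arbitrary example, not a finding from this thread.
        tested.append(k)
        return k >= 37

    print(first_bad_patch(51, fake_test), "found after", len(tested), "builds")
```

With 51 patches this needs at most 6 builds instead of up to 51.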
 
> > I'm also assuming you are using CCID2, either by explicitly using
> > feature negotiation setsockopt calls or by using the default, that is
>
> In fact I was using ccid3. When I switched to ccid2 it started to work more
> or less ok. It seems that for whatever reason ccid_hc_tx_send_packet is
> returning values that are too large (up to 64000).

That is an excellent data point. The ccid3 code is way more complex than
ccid2, so testing with both is always valuable.
 
> > CCID2. If this is the case, it would also be interesting, before
> > rebuilding the kernel, to try CCID3, as the problem you're
> > experiencing when using netem is exactly in the interface between the
> > core DCCP code and the CCID being used.
>
> The problem with netem exists with both ccid2 and ccid3. I suspect that when
> all three elements of the connection (server, client and netem) are on one
> host, netem is able to communicate packet loss by returning an error. If netem
> were on a different host, the packet would be sent correctly (no BUG: err=1
> after ccid_hc_tx_packet_sent) but dropped on the other host. I think that in
> this situation dccp should behave as if the packet was simply dropped.

I can't work on this right now, will look at it tomorrow, but thanks for
the data points!

- Arnaldo


Re: panic on 2.6.24rc5

2008-01-01 Thread Arnaldo Carvalho de Melo
On Wed, Jan 02, 2008 at 01:57:14AM +0100, Tomasz Grobelny wrote:
> On Wednesday 02 January 2008, Arnaldo Carvalho de Melo wrote:
> > Whether the kernel oopses, and whether the results are the same or the
> > problem was introduced by the patches from Gerrit. I.e. you would help us
> > narrow down the problem by doing a binary search over kernels built from
> > the changeset history.
>
> Oh, and by the way: is there any set of automated tests for dccp? It
> would be nice to have one, wouldn't it? Otherwise accepting any patch is
> quite risky...

There are test programs, documented in the wiki, and there is peer
review too :-)

And DCCP on Linux was written in such a way that a large part of its
core engine is actually shared with TCP, benefiting from a much bigger
set of developers and testers.

But please feel free to add more automated tests, it'll benefit us all.

- Arnaldo