Re: [lwip-users] lwip and/or general tcp problems

Paul Butler Mon, 09 Apr 2007 09:10:36 -0700

Thomas,

Thanks for your response.  I think most of my original questions have been 
traced to a corrupt MAC address in the segment sent using lwip (visible in the 
log file I sent you).  This causes the receiver to ignore the segment, and lwip 
(correctly) starts retransmitting.  Whether the result is a 2 s gap, or whether 
the connection recovers at all, depends on the number of segments initially 
sent with a corrupt MAC address, and on whether lwip continues to use a corrupt 
MAC address for the retransmits.


Is there any known issue with lwip 1.1.0 that could allow the destination MAC 
address to become corrupt, with DHCP (or possibly just other network traffic) 
causing the symptom to occur more often?  Have you or anyone else experienced 
this issue?

On a general note, can you clarify how the 8192 window size for the 
transmitting device would impact my throughput?  It's my understanding that 
this is the receive window size for the lwip (transmitting) device.  If I were 
implementing a high bandwidth link in both directions, this window would limit 
how much the other device could send to the lwip device, but I don't have any 
data flowing in that direction currently.  Or am I missing something?
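
As a rough rule of thumb, a TCP sender can have at most one window of 
unacknowledged data in flight per round trip, so throughput is bounded by 
window / RTT.  The sketch below only illustrates that arithmetic; the function 
name is hypothetical, and the 1 ms round-trip time is an assumption based on 
the "<1ms" LAN figure quoted later in this thread.  Which window actually 
applies to the lwip sender (the peer's advertised window, lwip's own send 
buffer, or the congestion window) is not settled by this calculation.

```c
#include <stdint.h>

/* Upper bound on TCP throughput, in bits per second, when at most
 * window_bytes may be unacknowledged during one round trip of rtt_us
 * microseconds.  Returns 0 if rtt_us is 0.  Hypothetical helper, not
 * part of lwip. */
uint64_t window_limited_bps(uint32_t window_bytes, uint32_t rtt_us)
{
    if (rtt_us == 0) {
        return 0;
    }
    /* bytes -> bits, then scale from "per rtt_us" to "per second" */
    return (uint64_t)window_bytes * 8u * 1000000u / rtt_us;
}
```

With an assumed 1 ms round trip, window_limited_bps(8192, 1000) gives 
65,536,000 bit/s, so under that assumption even an 8192-byte limit would sit 
above the ~20Mb/s target.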

Regards,
Paul

  ----- Original Message ----- 
  From: Taranowski, Thomas (SWCOE) 
  To: Paul Butler ; Mailing list for lwIP users 
  Sent: April 6, 2007 2:59 AM
  Subject: RE: [lwip-users] lwip and/or general tcp problems


   

   


------------------------------------------------------------------------------

  From: Paul Butler [mailto:[EMAIL PROTECTED] 
  Sent: Tuesday, April 03, 2007 12:04 PM
  To: Taranowski, Thomas (SWCOE)
  Subject: Re: [lwip-users] lwip and/or general tcp problems

   

  Thomas,

   

  Thanks for allowing me to use your personal email address.  I've attached a 
log file I've also sent to Kieran, but he has yet to respond to it.  My initial 
problems appear to have been tracked to two sources.  The first is a problem 
with the Nagle algorithm implementation from 1.1.0, and the second is a problem 
where my transmitting app (lwip on an Analog Devices' DSP) changes the MAC 
address without changing the IP address.  For the first problem, ADI has 
already identified the cause and provided a fix.  For the second, they have not 
yet identified the cause.

   

  If you compare segments 49 and 50 in the Wireshark (Ethereal) log I've 
attached, you can see that although the destination IP is still 192.168.16.36 
(correct), the MAC address changes from 00:06:1b:c5:d5:06 (correct) to 
00:10:24:28:d5:06 (incorrect).  Presumably, the receiving MAC looks at the MAC 
address and discards the segment.  In this case, the transmitter retransmits at 
segment 55 using the correct MAC address and the system recovers.  Later in the 
same file, segments 133 and 134 are sent to the same incorrect MAC address, but 
the retransmissions at segments 151, 153, 155, and 157 are sent to a new 
incorrect MAC address (00:11:43:ea:d5:06).  

   

  If you could comment on this problem of the MAC address getting corrupted 
momentarily, it would be a big help.  Given that several identical segments are 
sent and it is a many-bit error, I don't think the problem is introduced on the 
wire.  Please let me know if the raw pcap file is stripped out, and I will 
resend in a password protected zip.
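
  The byte-level difference between the MAC addresses quoted above can be 
checked with a short sketch.  The helper and constant names are hypothetical; 
only the three addresses come from the log described in this message.

```c
#include <stdint.h>

#define MAC_LEN 6

/* Addresses quoted in this message. */
static const uint8_t MAC_GOOD[MAC_LEN] = {0x00, 0x06, 0x1b, 0xc5, 0xd5, 0x06};
static const uint8_t MAC_BAD1[MAC_LEN] = {0x00, 0x10, 0x24, 0x28, 0xd5, 0x06};
static const uint8_t MAC_BAD2[MAC_LEN] = {0x00, 0x11, 0x43, 0xea, 0xd5, 0x06};

/* Number of bit positions in which two MAC addresses differ. */
unsigned mac_bit_diff(const uint8_t *a, const uint8_t *b)
{
    unsigned bits = 0;
    for (int i = 0; i < MAC_LEN; i++) {
        uint8_t x = (uint8_t)(a[i] ^ b[i]);
        while (x) {
            bits += x & 1u;
            x >>= 1;
        }
    }
    return bits;
}

/* Bitmask of differing byte positions: bit i is set when byte i differs
 * (byte 0 is the first byte of the address). */
unsigned mac_byte_diff_mask(const uint8_t *a, const uint8_t *b)
{
    unsigned mask = 0;
    for (int i = 0; i < MAC_LEN; i++) {
        if (a[i] != b[i]) {
            mask |= 1u << i;
        }
    }
    return mask;
}
```

  For both incorrect addresses only bytes 1 to 3 differ from the correct one 
(15 and 12 flipped bits respectively), while the first byte and the last two 
bytes are intact.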

   

  Thanks again,

  Paul

    ----- Original Message ----- 

    From: Taranowski, Thomas (SWCOE) 

    To: Mailing list for lwIP users 

    Sent: April 3, 2007 2:30 PM

    Subject: RE: [lwip-users] lwip and/or general tcp problems

     

    You can send them to my personal address, but any zip files need to have a 
password, otherwise they get stripped out by the firewall.  

     


----------------------------------------------------------------------------

    From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Paul Butler
    Sent: Monday, March 26, 2007 11:01 AM
    To: Mailing list for lwIP users
    Subject: Re: [lwip-users] lwip and/or general tcp problems

     

    Thomas, 

     

    Thanks for your response.  I have added additional information below your 
responses.  I sent a response with a first round of logfiles and my header file 
included, but it appears they got stripped out or the email blocked.  Is there 
a way to send attachments to the list, or may I send them to your personal 
address instead of the list?

     

    Paul

      ----- Original Message ----- 

      From: Taranowski, Thomas (SWCOE) 

      To: Mailing list for lwIP users 

      Sent: March 23, 2007 7:39 PM

      Subject: RE: [lwip-users] lwip and/or general tcp problems

       

      I am working on a data acquisition system using an Analog Devices' 
Blackfin BF537, which has a 100Mb/s MAC and utilizes a port of lwip.  The lwip 
port appears to be derived from STABLE-0_6_3.  My application requires high 
throughput on the ethernet interface (~20Mb/s), so I have been creating very 
simple applications to run on the embedded processor with lwip to test the 
throughput and reliability of the setup.  The sample application on the BF537 
simply creates, binds, and listens on a socket, and then in an infinite loop 
accepts a single connection and then while that connection is open sends large 
packets (1460 bytes) on the connection.  I have a simple LabVIEW application 
that receives the data, and I have also been using the Wireshark analyzer to 
look at the transfers.  In this configuration, I am experiencing the following 
that I would really appreciate some insight on:

       

      1) When lwip is configured to use DHCP, it is very difficult to maintain 
a high throughput.  In fact, the connection very frequently times out after 
transferring just a few packets.  I don't see much other traffic related to 
having the DHCP server on the LAN, and I use a switch to isolate the 
transmitting device and the receiving PC.

       

      [TT] This could be a function of the configuration of your DHCP server, 
and the length of lease that is granted during the initial dhcp negotiation.

      I will confirm this.  I have attached a log file showing a case that 
timed out after a few transfers (070323 DHCP Startup Failed, some data.pcap), 
and one that failed with no data transferred (070323 DHCP Startup Failed.pcap).

       

      2) When not using DHCP, in general the connection is more reliable.  
However, there appears to be a "cold start" issue, where when the devices on 
the LAN (transmitter, switch, and receiving PC) are powered on for the first 
time the connection has trouble establishing itself.  A few packets will 
transfer successfully, followed by a dropped packet with no successful 
retransmissions over 30 seconds.

      [TT] This is pretty hard to diagnose.  To my mind, it sounds like it 
could be a problem with the way the application is designed to behave at 
system startup.  To diagnose this more closely, sniffer logs would be needed.

      I have attached a log file showing this failure (070323 Startup 
Failure.pcap).  Do you have a recommendation for the way the system should 
start up?

      [TT] Not really.  I might try to isolate the problem by trying various 
ordered startup procedures, then maybe a fix would present itself.

      Is a delay between accepting the connection and transmitting data likely 
to improve this issue?  There is already a considerable delay between when I 
power the switch and when I make the connection.

      [TT] It could.  If you try to send before the link has been established, 
there could be some problems with dropped packets.  DHCP can work up to some 
fairly long waits, which would delay establishment of any connections.  If your 
port incorrectly (as I just found mine does) marks the netif as 'up' at 
interface open time when dhcp is enabled, then there could be some issues.  The 
dhcp framework marks the interface as 'up' via netif_set_up() once the dhcp 
bind occurs.

      3) Again without DHCP, I can observe stalls in the transmitted data 
stream.  Normally, packets are transmitted more than once a millisecond (up to 
eight or ten per millisecond), but occasionally there are periods of ~150ms 
where no data is transmitted.  The receive window has not closed, and there is 
no indication of dropped packets or retransmission in the log file.

       

      [TT] It could be that the transmit window (assuming TCP) is full.  It 
could also be something to do with the multitude of #defines that tune the 
performance/space in opt.h.  Some sniffer logs may shed some light on the 
issue.  What window size does the remote end advertise?

      The remote end advertises a 64k window size.  I wasn't clear on a lot of 
the #defines - I've attached my option header file, could you comment? Is there 
somewhere I have limited my transmit window to just a few segments?

      [TT] Yes, in the sniffer log I see your transmit window is limited to 
8192, which is pretty small.  This is governed via the TCP_WND #define in your 
lwipopts.h.
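
      For orientation, the TCP-related options usually tuned together in 
lwipopts.h look roughly like the fragment below.  The option names are those 
described in lwIP's opt.h; the values are illustrative assumptions only, not 
taken from Paul's header file and not a recommendation.  opt.h describes 
TCP_WND as the size of a TCP window and TCP_SND_BUF as the TCP sender buffer 
space, so both are worth checking when looking for a transmit-side limit.

```c
/* lwipopts.h fragment: illustrative values only (assumptions). */
#define TCP_MSS           1460            /* maximum segment size in bytes */
#define TCP_WND           8192            /* TCP window size in bytes */
#define TCP_SND_BUF       (8 * TCP_MSS)   /* sender buffer space in bytes */
#define TCP_SND_QUEUELEN  (4 * TCP_SND_BUF / TCP_MSS)  /* sender queue length, in pbufs */
```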

      4) Still without DHCP, I observe ~2s stalls.  These appear to be caused 
by >1 dropped packet, which results in the first dropped packet being resent by 
fast retransmission, and all other packets being resent by the retransmission 
timers.

      [TT] This sounds like half-duplex Ethernet operation to me.  Make sure 
you don't have any half-duplex hubs floating around on your network.  These 
will cause random wait times on the order you mentioned.

      I confirmed that the 3 devices comprising my LAN (embedded device, HP 
switch, and IBM laptop) are all at least 10/100 auto-negotiate half/full 
duplex, and the IBM laptop is a 1Gb device.  Other than forcing the devices to 
100Mb full duplex, is there a way to confirm that nobody is operating at half 
duplex?

      [TT] Not without some access to the driver statistics, or a LAN 
analyzer.  If you have access to some driver statistics, and you see any 
collisions, then you know there's a half-duplex device on that segment.

      Can you clarify why a half-duplex hub would cause random waits?

      [TT] It's due to the collision handling protocols of the CSMA/CD thing.  
I'm having trouble viewing the 802.3 standard at the moment, but the basic 
operation is as follows.  If a node starts to send an Ethernet frame, but 
detects a collision, it backs off for a random interval, which, if I recall 
correctly, can range upwards of a second, before it attempts a retransmit.
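
      For reference, the size of these waits can be estimated from the 
truncated binary exponential backoff used by half-duplex Ethernet.  The sketch 
below assumes the usual 802.3 parameters for 10 and 100 Mb/s links (a slot time 
of 512 bit times, an exponent capped at 10, and the frame abandoned after 16 
attempts); the function names are hypothetical.  Under those assumptions a 
single backoff is at most 1023 slot times, about 52 ms at 10 Mb/s, so the 
figure recalled above looks high.

```c
#include <stdint.h>

#define BACKOFF_LIMIT   10u   /* exponent stops growing after 10 collisions */
#define ATTEMPT_LIMIT   16u   /* frame is abandoned after 16 attempts */
#define SLOT_BIT_TIMES  512u  /* slot time for 10 and 100 Mb/s Ethernet */

/* Largest possible backoff, in slot times, after the n-th collision
 * (n >= 1).  The actual wait is a random value from 0 to this maximum. */
uint32_t max_backoff_slots(unsigned n)
{
    unsigned k = n < BACKOFF_LIMIT ? n : BACKOFF_LIMIT;
    return (1u << k) - 1u;
}

/* Worst-case sum of all backoff waits for one frame, in nanoseconds,
 * on a link running at mbps megabits per second.  Returns 0 if mbps
 * is 0.  Backoffs follow collisions 1 to 15; the 16th failed attempt
 * abandons the frame instead of backing off again. */
uint64_t worst_case_backoff_ns(unsigned mbps)
{
    uint64_t slots = 0;
    if (mbps == 0) {
        return 0;
    }
    for (unsigned n = 1; n < ATTEMPT_LIMIT; n++) {
        slots += max_backoff_slots(n);
    }
    /* one bit time lasts 1000 / mbps nanoseconds */
    return slots * SLOT_BIT_TIMES * 1000u / mbps;
}
```

      With these parameters the worst-case total comes to roughly 366 ms at 
10 Mb/s and roughly 37 ms at 100 Mb/s.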

       

      Can anyone confirm that any or all of these behaviors is unexpected in a 
LAN environment (RTT normally <1ms)?  Although I'm new to this, it seems 
surprising that my little LAN with <15' CAT5 cable segments is so likely to 
have corrupted or lost packets.  

      [TT] An old hub or faulty connector can cause all sorts of issues.  I'd 
revert back to as simple a network as possible, and proceed from there, adding 
segments until some bad behavior is exhibited.

      I can try this with just a crossover cable, but there's not much room to 
go simpler. For the DHCP problems, can you recommend a simple way to add a DHCP 
server without connecting into my full office network?

      [TT] I loaded up one of my targets with an Ubuntu install, then installed 
the dhcpd3 server.  This gives me additional visibility into what's going on 
with the DHCP negotiation, and I can try out various options, etc.

       

      Can anyone give me some guidance on what to expect regarding lost 
packets?  

      [TT] An analysis I did some time back for an avionics platform concluded 
that I could expect that the phy, at a minimum, would cause one lost/corrupt 
packet per 24 hour period on a 3 in. long peer to peer link.  It seems to me 
that a dozen a day on a small network would not be unusual.

      A dozen a day doesn't sound unreasonable. I'm currently able to generate 
what I assume are lost/corrupt packets within a 20 or 30 second log file.

      Are the recovery processes I've observed correct behavior?  Should only a 
single packet be resent using fast retransmission?  Is there anything inherent 
in the stack that could cause brief pauses in the data stream?  Why does using 
DHCP apparently make it so difficult to establish and maintain a 
high-throughput connection, particularly since there doesn't seem to be any 
other traffic on the LAN?
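
      On the fast retransmission question, generic TCP behaviour (as described 
in RFC 5681) is that the third duplicate ACK triggers retransmission of the one 
segment that appears to be missing; without further mechanisms, any other lost 
segments wait for the retransmission timer.  The sketch below illustrates only 
that duplicate-ACK counting.  The struct and function names are hypothetical 
and it has not been checked against the lwip 1.1.0 source.

```c
#include <stdint.h>

#define DUPACK_THRESHOLD 3u   /* fast-retransmit threshold from RFC 5681 */

/* Hypothetical sender state; field names are illustrative, not lwip's. */
struct sender {
    uint32_t last_ack;     /* most recent cumulative ACK number seen */
    unsigned dupacks;      /* consecutive duplicate ACKs for last_ack */
    unsigned retransmits;  /* fast retransmissions triggered so far */
};

/* Process one incoming ACK number.  Returns 1 when the single segment
 * starting at last_ack should be fast-retransmitted, otherwise 0.
 * Simplification: any ACK number different from last_ack is treated as
 * new data being acknowledged; sequence wraparound and stale ACKs are
 * not handled. */
int on_ack(struct sender *s, uint32_t ack)
{
    if (ack == s->last_ack) {
        s->dupacks++;
        if (s->dupacks == DUPACK_THRESHOLD) {
            s->retransmits++;
            return 1;   /* resend exactly one segment */
        }
        return 0;       /* below or beyond the threshold: no resend */
    }
    s->last_ack = ack;  /* new data acknowledged, restart counting */
    s->dupacks = 0;
    return 0;
}
```

      In this model a burst of duplicate ACKs for the same sequence number 
produces one retransmission, which would match the pattern described in item 4 
above if more than one segment was lost.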

       

      Apologies for the multiple questions, but I needed to start somewhere, 
and I've already reached the limit of what the Analog Devices' support 
engineers can help with.  I can provide the log files from Wireshark if that 
would be helpful, but some are very large (tens of megabytes).  I'd also be 
interested if anyone can suggest other resources to further my understanding of 
networking and TCP/IP issues.

      [TT] You'd start by locating the portions of the capture logs that show 
aberrant behavior. 

      I'll follow up with those logfiles shortly. Is there an easier way to cut 
them down to size than using the editcap command-line utility? 

      [TT] I sometimes use the GUI, highlight the sections I want, then save 
the selection to a file.  I've never tried the editcap, but it sounds painful.

      Thanks,

       

      Paul Butler

       


--------------------------------------------------------------------------

      _______________________________________________
      lwip-users mailing list
      [email protected]
      http://lists.nongnu.org/mailman/listinfo/lwip-users




