RE: [lwip-users] How to optimize raw UDP performance

Bill Auerbach Thu, 24 Sep 2009 11:55:42 -0700

There's some risk in disabling the UDP checksum, but it's low.  From what I
see, UDP loss happens in units of whole packets, not bytes within packets.
On Windows it can be bad: I've seen 30-40 contiguous packets dropped just by
minimizing a window (even one belonging to an application other than the
UDP-based one).  OTOH, I can run 150,000 packets a second for an hour without
a drop (on a LAN, as a matter of fact).  Be prepared to contend with the
non-lwIP end of the connection at higher speeds.
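
For reference, the two options involved are the ones named in the quoted
message below.  A minimal lwipopts.h fragment, assuming your port reads its
configuration from that file (a Xilinx-generated BSP may set these elsewhere):

```c
/* lwipopts.h (fragment) */
#define CHECKSUM_GEN_UDP    0   /* do not compute checksums on outgoing UDP */
#define CHECKSUM_CHECK_UDP  0   /* do not verify checksums on incoming UDP  */
```

With IPv4, a UDP checksum field of zero means "no checksum computed", so
receivers still accept the datagrams; the Ethernet FCS remains the only
integrity check on the payload.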

Optimize your Ethernet driver.  If you can't send UDP packets at 980+ Mb/s,
your driver isn't optimal.  Although you don't need that speed, the faster
each packet is sent the better.  Note: to measure the driver, use a static
copy of one UDP pbuf and send it repeatedly.
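
To judge how close a driver is to the wire, it helps to know the ceiling.
The sketch below is plain arithmetic for standard Ethernet (no VLAN tag, IPv4
without options); the function names are made up for illustration.  For
full-size frames on a 1 Gb/s link it gives about 81,274 packets/s and about
957 Mb/s of UDP payload, so a figure like 980+ presumably counts header bytes
as well.  That reading is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame bytes on the wire (standard Ethernet, IPv4 without options). */
enum {
    ETH_PREAMBLE_SFD = 8,    /* 7-byte preamble + start-of-frame delimiter */
    ETH_HEADER       = 14,   /* dst MAC, src MAC, EtherType */
    ETH_FCS          = 4,    /* frame check sequence */
    ETH_IFG          = 12,   /* minimum inter-frame gap */
    ETH_MIN_PAYLOAD  = 46,   /* shorter payloads are padded to this */
    ETH_MAX_PAYLOAD  = 1500,
    IPV4_HEADER      = 20,
    UDP_HEADER       = 8
};

/* Bytes one UDP datagram occupies on the wire, including preamble and gap. */
uint32_t wire_bytes(uint32_t udp_payload)
{
    uint32_t eth_payload = IPV4_HEADER + UDP_HEADER + udp_payload;
    assert(eth_payload <= ETH_MAX_PAYLOAD);   /* fragmentation not modeled */
    if (eth_payload < ETH_MIN_PAYLOAD)
        eth_payload = ETH_MIN_PAYLOAD;
    return ETH_PREAMBLE_SFD + ETH_HEADER + eth_payload + ETH_FCS + ETH_IFG;
}

/* Upper bound on datagrams per second for a link of link_bps bits/s. */
double max_packets_per_sec(double link_bps, uint32_t udp_payload)
{
    return link_bps / (8.0 * wire_bytes(udp_payload));
}

/* Upper bound on UDP payload throughput in Mb/s. */
double max_payload_mbps(double link_bps, uint32_t udp_payload)
{
    return max_packets_per_sec(link_bps, udp_payload) * udp_payload * 8.0 / 1e6;
}
```

Calling max_packets_per_sec(1e9, 1472) gives the full-frame packet rate; a
driver test that sends one static packet in a loop can be compared directly
against it.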

This will help you:

lwIP Output Call Tree (per packet)

udp_sendto         Looks up which netif has IP addr - calls:
udp_sendto_if      Adds UDP header, fills in, chksums it - calls:
ip_output_if       Adds IP header, fills in, chksums it - calls:
etharp_output      Adds Eth header - calls:
etharp_query       Looks up MAC from dest IP (ARP) - calls:
etharp_send_ip     Fills in 2 MAC addrs - calls:
netif->linkoutput  Raw packet send

The later in this chain you make your call, the better.  There is a HUGE
difference between udp_sendto and etharp_query!  The speed killers are the
ARP lookup, the redundant address checks, a few pbuf_header calls and small
copies in these routines.  I optimized etharp.c using a faster cache test,
moved these functions to on-chip memory (this is big if you can do it), and
replaced the SMEMCPYs and for-loop MAC copies with more efficient copies.
After that, calling etharp_query directly got over 700 Mb/s (100 MHz Cyclone
III FPGA running NIOS II - this may be close to your platform).  Compare this
to udp_sendto_if, which was only about 325 Mb/s.  In the end I resorted to
using my own routines to build UDP packets (one pbuf with the IP/UDP header
chained to the payload).  With checksums disabled I get 969 Mb/s.  (I had a
goal to get close to wire speed if possible.)  I had to time this on the
target side - Windows can only keep up with short bursts at this speed (250
packets or less), and Wireshark has some difficulties but will also capture
short bursts.  I timed bursts of 100 packets in Wireshark to validate the
times recorded on the target.  These times were taken with nothing else
going on in the system.
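
What the "faster cache test" looked like is not spelled out above.  One
plausible form, sketched here as an assumption and not as the actual etharp.c
change, is to remember the table slot that matched last time and test it
before scanning.  All names (arp_entry, arp_set, arp_lookup, ARP_TABLE_SIZE)
are hypothetical and not lwIP API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ARP_TABLE_SIZE 10   /* hypothetical size for this sketch */

struct arp_entry {
    uint32_t ip;        /* IPv4 address, network byte order */
    uint8_t  mac[6];
    uint8_t  valid;
};

static struct arp_entry arp_table[ARP_TABLE_SIZE];
static int last_hit = -1;   /* index of the most recent match, or -1 */

/* Store a mapping in slot idx (no aging or eviction in this sketch). */
void arp_set(int idx, uint32_t ip, const uint8_t mac[6])
{
    assert(idx >= 0 && idx < ARP_TABLE_SIZE);
    arp_table[idx].ip = ip;
    memcpy(arp_table[idx].mac, mac, 6);
    arp_table[idx].valid = 1;
    /* last_hit needs no reset: the fast path re-checks ip and valid. */
}

/* Return the MAC for ip, or NULL if an ARP request would be needed. */
const uint8_t *arp_lookup(uint32_t ip)
{
    /* Fast path: a sender streaming to one peer hits this every time. */
    if (last_hit >= 0 && arp_table[last_hit].valid &&
        arp_table[last_hit].ip == ip)
        return arp_table[last_hit].mac;

    /* Slow path: linear scan, then remember where the match was. */
    for (int i = 0; i < ARP_TABLE_SIZE; i++) {
        if (arp_table[i].valid && arp_table[i].ip == ip) {
            last_hit = i;
            return arp_table[i].mac;
        }
    }
    return NULL;
}
```

For a stream to a single destination the per-packet cost drops to one
compare, which is the point of shortcutting the lookup in the first place.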

My speeds reflect changes in several areas and do not reflect what is
possible by changing *only* lwIP or *only* the driver.  My goal was to make
Ethernet communications as fast as possible, with no rules about what could
and could not be changed.
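
For the hand-built packet approach, here is a minimal sketch of a
precomputed Ethernet/IPv4/UDP header that would sit in its own buffer ahead
of the payload.  It is an illustration under assumptions, not the routines
described above: the helper names are hypothetical, the UDP checksum is left
at zero (which IPv4 permits), and the IPv4 header checksum is still computed
in software on the assumption that the MAC does not offload it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN  14
#define IP_HLEN   20
#define UDP_HLEN  8
#define HDR_LEN   (ETH_HLEN + IP_HLEN + UDP_HLEN)   /* 42 bytes */

/* Store a 16-bit value in network (big-endian) byte order. */
static void put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/* RFC 1071 ones'-complement checksum over len bytes (len must be even). */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Fill in every field that stays the same for the whole stream. */
void hdr_init(uint8_t hdr[HDR_LEN],
              const uint8_t dst_mac[6], const uint8_t src_mac[6],
              const uint8_t src_ip[4], const uint8_t dst_ip[4],
              uint16_t src_port, uint16_t dst_port)
{
    uint8_t *ip = hdr + ETH_HLEN, *udp = ip + IP_HLEN;
    memset(hdr, 0, HDR_LEN);
    memcpy(hdr, dst_mac, 6);
    memcpy(hdr + 6, src_mac, 6);
    put16(hdr + 12, 0x0800);      /* EtherType: IPv4 */
    ip[0] = 0x45;                 /* version 4, header length 5 words */
    put16(ip + 6, 0x4000);        /* flags: don't fragment */
    ip[8] = 64;                   /* TTL */
    ip[9] = 17;                   /* protocol: UDP */
    memcpy(ip + 12, src_ip, 4);
    memcpy(ip + 16, dst_ip, 4);
    put16(udp + 0, src_port);
    put16(udp + 2, dst_port);
    /* UDP checksum (udp + 6) stays 0: "not computed" in IPv4. */
}

/* Per-packet work: two length fields, the IP id, and the IP checksum. */
void hdr_update(uint8_t hdr[HDR_LEN], uint16_t payload_len, uint16_t ip_id)
{
    uint8_t *ip = hdr + ETH_HLEN, *udp = ip + IP_HLEN;
    assert(payload_len <= 1500 - IP_HLEN - UDP_HLEN);
    put16(ip + 2, (uint16_t)(IP_HLEN + UDP_HLEN + payload_len));
    put16(ip + 4, ip_id);
    put16(ip + 10, 0);            /* checksum field is zero while summing */
    put16(ip + 10, inet_checksum(ip, IP_HLEN));
    put16(udp + 4, (uint16_t)(UDP_HLEN + payload_len));
}
```

Call hdr_init once when the stream is set up and hdr_update before each
send.  If the payload length never changes and a constant IP id is
acceptable, even hdr_update can be done once, leaving nothing per packet but
the hand-off to the driver.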

Bill

>-----Original Message-----
>From: [email protected]
>[mailto:[email protected]] On
>Behalf Of Max Bobrov
>Sent: Thursday, September 24, 2009 1:17 PM
>To: Mailing list for lwIP users
>Subject: Re: [lwip-users] How to optimize raw UDP performance
>
>Bill: Thank you! disable CHECKSUM_CHECK_UDP and CHECKSUM_GEN_UDP gave
>a considerable increase in performance. Xilinx gui interface for lwip
>could use some significant improvement to make this and many other
>features more accessible.
>
>Chris: I've increased some of these values (listed below) but haven't
>seen much improvement from that. Do these look ok or have you had
>better success with others?
>
>#define MEM_ALIGNMENT 8
>#define MEM_SIZE 262144
>#define MEMP_NUM_PBUF 32
>#define MEMP_NUM_UDP_PCB 8
>#define MEMP_NUM_TCP_PCB 32
>#define MEMP_NUM_TCP_PCB_LISTEN 8
>#define MEMP_NUM_TCP_SEG 256
>#define LWIP_USE_HEAP_FROM_INTERRUPT 1
>
>#define MEMP_NUM_SYS_TIMEOUT 8
>#define PBUF_POOL_SIZE 256
>#define PBUF_POOL_BUFSIZE 2048
>#define PBUF_LINK_HLEN 16
>

>On Wed, Sep 23, 2009 at 11:06 PM, Chris Strahm <[email protected]> wrote:
>> Actually someone else reported to me that turning the checksum off in lwIP
>> actually made it slower.  I have not checked the reason for this, but that
>> was someone else's experience.  There is a big difference in whether you use
>> 8/16/32 bit memcpy type routines.  Also if you can write it in asm.  Since
>> yours is FPGA, little different.  Also same kind of thing for checksum.  Asm
>> will be faster.  Sometimes the difference in how a particular variable or
>> address pointer is generated by C can result in very big difference in code.
>> You have to look at everything when it comes to high performance.
>>
>> Also what is the size of your PBUFs and your blocks in your DMA or MAC ISR.
>> I assume for a 1G Enet system you probably want the maximum, about 1536
>> each.
>>
>> Chris.
>>
>> _______________________________________________
>> lwip-users mailing list
>> [email protected]
>> http://lists.nongnu.org/mailman/listinfo/lwip-users

