Re: [ofa-general] Support for CXGB3 RNIC on P6

2009-02-03 Thread Krishna Kumar2
Sorry for the vagueness. There are no messages in /var/log (or on the
console/dmesg). To answer Jon's question: lspci doesn't show the device.

No lights come up on the adapter, which I guess is the normal behavior until
the device is successfully probed and recognized.

thanks,

- KK

Roland Dreier rdre...@cisco.com wrote on 02/03/2009 09:42:39 PM:

 Roland Dreier rdre...@cisco.com
 02/03/2009 09:42 PM

 To

 Krishna Kumar2/India/i...@ibmin

 cc

 openfabrics general@lists.openfabrics.org

 Subject

 Re: [ofa-general] Support for CXGB3 RNIC on P6

   My colleague (at a different site) is trying to get a couple of Chelsio
   RNIC adapters working on p6 systems but for some reason the cards aren't
   recognized on bootup.

 What do you mean by "aren't recognized on bootup"?  More details on the
 specific problem are needed to diagnose it.

  - R.



[ofa-general] Support for CXGB3 RNIC on P6

2009-02-02 Thread Krishna Kumar2

Hi,

My colleague (at a different site) is trying to get a couple of Chelsio RNIC
adapters working on p6 systems, but for some reason the cards aren't
recognized on bootup. The same cards work on my xSeries systems, and the
following are the messages I get there (there are no messages on his p6
systems):

Feb  1 11:42:49 localhost kernel: Chelsio T3 Network Driver - version 1.1.1-ko
Feb  1 11:42:49 localhost kernel: cxgb3 :22:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
Feb  1 11:42:49 localhost kernel: input: Power Button (FF) as /class/input/input1
Feb  1 11:42:49 localhost kernel: ACPI: Power Button (FF) [PWRF]
Feb  1 11:42:49 localhost kernel: cxgb3 :22:00.0: Port 0 using 4 queue sets.
Feb  1 11:42:49 localhost kernel: eth2: Chelsio T310 10GBASE-R RNIC (rev 4) PCI Express x8 MSI-X
Feb  1 11:42:49 localhost kernel: eth2: 128MB CM, 256MB PMTX, 256MB PMRX, S/N: PT49070050

Is this revision of cxgb3 (rev4) not supported on p6? Or are we missing
something to get it to work?

thanks,

- KK



Re: [ofa-general] Test programs supporting RNIC's.

2008-04-09 Thread Krishna Kumar2
Hi Dotan,

  ib_clock_test ib_read_lat   ib_write_bw_postlist
  ib_rdma_bw    ib_send_bw    ib_write_lat
  ib_rdma_lat   ib_send_lat
  ib_read_bw
 
  Out of this, only ib_rdma_bw seems to be CMA enabled. Is this the
  only program that supports RNIC's?
 
 Yes.

 I know that there are plans to add the CMA support to all of the ib_*
 applications as well (but today, only ib_rdma_bw and ib_rdma_lat support it).

Yes, I had forgotten ib_rdma_lat in the list of CMA-enabled apps. But
somehow it didn't work for me. I need to reboot into the distro OS and
locate the error; I will post it later.

Thanks,

- KK



Re: [ofa-general] Test programs supporting RNIC's.

2008-04-09 Thread Krishna Kumar2
Hi Steve,

Steve Wise [EMAIL PROTECTED] wrote on 04/09/2008 07:33:35 PM:

 Krishna, if you are interested, you could add cma support to the rest of
 these.  I can help by answering questions and/or testing things...

If no one else is already doing this, I can start doing this in the
background. Will follow up with you if I need any help.

Thanks,

- KK



[ofa-general] Test programs supporting RNIC's.

2008-04-08 Thread Krishna Kumar2
Hi,

I am testing Chelsio cxgb3 RNICs on RHEL5.2 (beta). The following
list of applications is installed on the system as part of the OFED
install option:

ib_clock_test ib_read_lat   ib_write_bw_postlist
ib_rdma_bw    ib_send_bw    ib_write_lat
ib_rdma_lat   ib_send_lat
ib_read_bw

Of these, only ib_rdma_bw seems to be CMA enabled. Is this the
only program that supports RNICs?

Thanks,

- KK



[ofa-general] Re: Status of NFS-RDMA ?

2008-01-30 Thread Krishna Kumar2
Hi James,

Since you had mentioned in an earlier email that the NFS-RDMA server side
will be present in OFED1.4, do you know whether a port of the server code to
OFED1.3 (when it comes out) will happen? Is there any effort for that, any
work ongoing, any help required, etc.?

I couldn't find the release timeline for OFED1.4; is there a link on the
OpenFabrics homepage?

Thanks,

- KK

[EMAIL PROTECTED] wrote on 01/29/2008 08:23:46 PM:



 On Tue, 29 Jan 2008, Pawel Dziekonski wrote:

 
  On Mon, 28 Jan 2008 at 10:14:22AM -0500, James Lentini wrote:
  
  
   On Sat, 26 Jan 2008, Pawel Dziekonski wrote:
  
I pulled Tom's tree from new url and build a kernel.
  
   If you enabled support for INFINIBAND drivers (IB and iWARP support)
   and NFS client/server support, the kernel should be ready to go (run
   grep RDMA /your_kernel_sources/.config to confirm that
   CONFIG_SUNRPC_XPRT_RDMA is either m or y).
  
   NFS/RDMA doesn't require OFED be installed. OFED is a release of the
   Linux kernel sources and some userspace libraries/tools. If you are
 
then I downloaded OFED from
http://www.mellanox.com/downloads/NFSoRDMA/OFED-1.2-NFS-RDMA.gz,
  
   I don't know what the above URL contains. The latest code is in Tom
   Tucker's tree (and now NFS server maintainer Bruce Fields tree). It
is
 
 
  hi,
 
  back to subject on a proper mailing list.
 
  I have a 3 year experience with mellanox hardware and IBGold so I
  basically know what OFED is all about. up to now i was only using
  IBGold since IB drivers appeared in kernel pretty recently.

 You'll want to use the mainline kernel's IB drivers for NFS/RDMA.
 We've been developing the NFS/RDMA software on the OpenFabrics (aka
 OpenIB) code since it was merged into 2.6.10 in Dec 2004.

  currently I have new hardware. I'm running Tom's kernel and already
  did some MPI tests. SDP is not working, probably because sdp kernel
  modules where not build. ;) I understand that those modules are only
  available from ofa-kernel. please correct me if i'm wrong.

 Correct. SDP has never been submitted to mainline Linux.

  system is Scientic Linux 4.5, which is supposed to be a fully
  compatible RH4 clone. hardware is Supermicro mobos with Mellanox
  MT25204 and Flextronisc switch.
 
  error log from ofa-kernel build:

 Is your goal to build a kernel with an NFS/RDMA server? If so, the
 kernel sources from Tom Tucker's git tree are the ones you want, not
 the old OFED 1.2-based packages which are out of date.

 Did you try setting up the NFS/RDMA server on the kernel used for your
 MPI tests above?

make[1]: Entering directory `/usr/src/ib/xprt-switch-2.6'
test -e include/linux/autoconf.h -a -e include/config/auto.conf || (        \
echo;                                                                       \
echo "  ERROR: Kernel configuration is invalid.";                           \
echo "  include/linux/autoconf.h or include/config/auto.conf are missing."; \
echo "  Run 'make oldconfig && make prepare' on kernel src to fix it.";     \
echo;                                                                       \
/bin/false)

obviously, doing 'make oldconfig && make prepare' does not help.
anyway, above mentioned files do exist:

# ls -la /usr/src/ib/xprt-switch-2.6/{include/linux/autoconf.h,include/config/auto.conf}
-rw-r--r--  1 root root 10156 Jan 25 17:42 /usr/src/ib/xprt-switch-2.6/include/config/auto.conf
-rw-r--r--  1 root root 14733 Jan 25 17:42 /usr/src/ib/xprt-switch-2.6/include/linux/autoconf.h

despite of above, compilation continues but fails with:
   
  gcc -Wp,-MD,/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/core/.mad.o.d -nostdinc
      -isystem /usr/lib/gcc/x86_64-redhat-linux/3.4.6/include -D__KERNEL__
      -I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/include
      -I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/include -Iinclude
      -include include/linux/autoconf.h
      -include /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/include/linux/autoconf.h
      -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common
      -Werror-implicit-function-declaration -Os -m64 -mno-red-zone -mcmodel=kernel -pipe
      -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx
      -mno-sse2 -mno-3dnow -maccumulate-outgoing-args -DCONFIG_AS_CFI=1
      -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -fomit-frame-pointer -Wdeclaration-after-statement
      -DMODULE -DKBUILD_STR(s)=#s -DKBUILD_BASENAME=KBUILD_STR(mad)
      -DKBUILD_MODNAME=KBUILD_STR(ib_mad) -c
      -o /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/core/.tmp_mad.o
      /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/core/mad.c
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/core/mad.c: In function `ib_mad_init_module':
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/core/mad.c:2966: error: too many arguments to function `kmem_cache_create'
make[4]: *** 

[ofa-general] Re: Status of NFS-RDMA ?

2008-01-30 Thread Krishna Kumar2
Hi Jeff & James,

Great. If you let me know when the bits are ready (I don't always read the
mailing list), I should be able to get some testing done.

Thanks,

- KK

Jeff Becker [EMAIL PROTECTED] wrote on 01/30/2008 11:02:09 PM:

 Hi all.

 James Lentini wrote:
  On Wed, 30 Jan 2008, Krishna Kumar2 wrote:
 
 
  Hi James,
 
  Since you had mentioned in an earlier email that NFS-RDMA server
  side will be present in OFED1.4,
 
 
  Actually, that was Tziporet.
 
 
  do you know if any port of the server code to OFED1.3 (when it comes
  out) will happen? Is there any effort for that, any work ongoing,
  any help required, etc?
 
 
  Jeff Becker had looked into this. We would definitely appreciate the
  help.
 
 I have set up a git tree for NFSoRDMA and successfully merged it with,
 and built it on OFED 1.3-rcx. I'm currently doing the backports (SLES 10
 SP1 first). All this is in preparation for OFED 1.4, as that is when
 NFSoRDMA will be included in OFED.  I think I have this
 patching/backporting stuff under control. However, my testing resources
 are limited. Thus depending on your platform, I might be able to point
 you at OFED 1.3 based bits for testing if/when they are ready. Thanks.

 -jeff

  The NFS framework has changed significantly in several areas in recent
  kernel releases. This has made backporting the NFS/RDMA code to older
  kernels challenging.
 
  If you are interested in working on OFED1.3 support, let us know.
 
 
  I couldn't find the release time lines for OFED1.4, is there any
  link on openfabrics homepage?
 
 
  I'm not involved with the OFED1.4 planning. Tziporet, is there
  information on this?
 
 
  Thanks,
 
  - KK
 
  [EMAIL PROTECTED] wrote on 01/29/2008 08:23:46 PM:
 
 
  On Tue, 29 Jan 2008, Pawel Dziekonski wrote:
 
 
  On Mon, 28 Jan 2008 at 10:14:22AM -0500, James Lentini wrote:
 
  On Sat, 26 Jan 2008, Pawel Dziekonski wrote:
 
 
  I pulled Tom's tree from new url and build a kernel.
 
  If you enabled support for INFINIBAND drivers (IB and iWARP
support)
  and NFS client/server support, the kernel should be ready to go
(run
  grep RDMA /your_kernel_sources/.config to confirm that
  CONFIG_SUNRPC_XPRT_RDMA is either m or y).
 
  NFS/RDMA doesn't require OFED be installed. OFED is a release of
the
  Linux kernel sources and some userspace libraries/tools. If you are
 
  then I downloaded OFED from
  http://www.mellanox.com/downloads/NFSoRDMA/OFED-1.2-NFS-RDMA.gz,
 
  I don't know what the above URL contains. The latest code is in Tom
  Tucker's tree (and now NFS server maintainer Bruce Fields tree). It
 
  is
 
  hi,
 
  back to subject on a proper mailing list.
 
  I have a 3 year experience with mellanox hardware and IBGold so I
  basically know what OFED is all about. up to now i was only using
  IBGold since IB drivers appeared in kernel pretty recently.
 
  You'll want to use the mainline kernel's IB drivers for NFS/RDMA.
  We've been developing the NFS/RDMA software on the OpenFabrics (aka
  OpenIB) code since it was merged into 2.6.10 in Dec 2004.
 
 
  currently I have new hardware. I'm running Tom's kernel and already
  did some MPI tests. SDP is not working, probably because sdp kernel
  modules where not build. ;) I understand that those modules are only
  available from ofa-kernel. please correct me if i'm wrong.
 
  Correct. SDP has never been submitted to mainline Linux.
 
 
  system is Scientic Linux 4.5, which is supposed to be a fully
  compatible RH4 clone. hardware is Supermicro mobos with Mellanox
  MT25204 and Flextronisc switch.
 
  error log from ofa-kernel build:
 
  Is your goal to build a kernel with an NFS/RDMA server? If so, the
  kernel sources from Tom Tucker's git tree are the ones you want, not
  the old OFED 1.2-based packages which are out of date.
 
  Did you try setting up the NFS/RDMA server on the kernel used for
your
  MPI tests above?
 
 
  make[1]: Entering directory `/usr/src/ib/xprt-switch-2.6'
  test -e include/linux/autoconf.h -a -e include/config/auto.conf || (        \
  echo;                                                                       \
  echo "  ERROR: Kernel configuration is invalid.";                           \
  echo "  include/linux/autoconf.h or include/config/auto.conf are missing."; \
  echo "  Run 'make oldconfig && make prepare' on kernel src to fix it.";     \
  echo;                                                                       \
  /bin/false)

  obviously, doing 'make oldconfig && make prepare' does not help.
  anyway, above mentioned files do exist:

  # ls -la /usr/src/ib/xprt-switch-2.6/{include/linux/autoconf.h,include/config/auto.conf}
  -rw-r--r--  1 root root 10156 Jan 25 17:42 /usr/src/ib/xprt-switch-2.6/include/config/auto.conf
  -rw-r--r--  1 root root 14733 Jan 25 17:42 /usr/src/ib/xprt-switch-2.6/include/linux/autoconf.h

  despite of above, compilation continues but fails with:

  gcc -Wp,-MD,/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers

Re: [ofa-general] [PATCH] ipoib: Bug fix in ipoib_poll - resend

2007-11-27 Thread Krishna Kumar2
Hi Eli,


t = min(IPOIB_NUM_WC, max);
n = ib_poll_cq(priv->cq, t, priv->ibwc);
 +  if (unlikely(n < 1))
 +          break;

for (i = 0; i < n; i++) {
   struct ib_wc *wc = priv->ibwc + i;

The 'for' loop (followed by the "if (n != t) break" check)
should take care of this, shouldn't it?
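
For context, here is a rough paraphrase of the surrounding loop in the
2.6.24-era ipoib_poll() (not verbatim source; done, budget and priv come
from the enclosing function):

    while (done < budget) {
            int max = budget - done;
            int t = min(IPOIB_NUM_WC, max);
            int n = ib_poll_cq(priv->cq, t, priv->ibwc);
            int i;

            for (i = 0; i < n; i++) {
                    struct ib_wc *wc = priv->ibwc + i;
                    /* ... handle the completion, increment done ... */
            }

            if (n != t)
                    break;  /* CQ drained: fewer completions than asked for */
    }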

Thanks,

- KK



Re: [ofa-general] Re: [PATCH] IPoIB: Remove redundant check in xmit handler

2007-11-19 Thread Krishna Kumar2
Hi Roland,

I am not sure if my answer was clear, so I will try to be
clearer:

qdisc_run() first checks netif_queue_stopped(dev), and then, if
it can get the __LINK_STATE_QDISC_RUNNING bit, it calls
__qdisc_run(), which does the actual xmit. Subsequent calls
to xmit within __qdisc_run also check netif_queue_stopped.

So there is no way that xmit can be called with a stopped queue,
as the core checks this for every skb. And no other cpu can execute
this at the same time, because the RUNNING bit is held. So this is a
completely safe removal of the per-skb check.
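
For reference, a rough paraphrase of the helpers being described (2.6.23-era
code, from memory, not verbatim): only the CPU that wins the RUNNING bit ever
calls into the driver, and it re-checks the stopped state for every skb.

    static inline void qdisc_run(struct net_device *dev)
    {
            if (!netif_queue_stopped(dev) &&
                !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
                    __qdisc_run(dev);
    }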

I have tested this code extensively as part of batching skbs
and have never hit it.

Thanks,

- KK


 Hi Roland,

  This check was added because of a real problem seen in practice a
  while ago.  Has something changed in the tx queue locking that makes
  it redundant now?

 I am not sure of how it was earlier, but currently a device's xmit can be
 called only on one cpu at a time (by holding the __LINK_STATE_QDISC_RUNNING
 bit in qdisc_run). And the queue_stopped check is present before xmit.

  I seem to remember that I could make the problem race trigger pretty
  fast by making the tx queue very small so that it got stopped a lot.

 I just tested with a smaller queue size (tx queue size=4), put a debug in
 the queue_stopped check in xmit(), and a counter to find how many times the
 queue was stopped (in ipoib_send). After a 20 min test run with 64 threads,
 the queue was stopped 16.5 million times, but the debug never hit.

 I tested with buffer sizes varying from 128 to 16K bytes (though TSO/GSO is
 not implemented in IPoIB anyway).

 Thanks,

 - KK



Re: [ofa-general] Re: [PATCH] IPoIB: Remove redundant check in xmit handler

2007-11-19 Thread Krishna Kumar2
Hi Roland,

Roland Dreier [EMAIL PROTECTED] wrote on 11/20/2007 09:55:00 AM:

   qdisc_run() first checks netif_queue_stopped(dev), and then if
   it can get the __LINK_STATE_QDISC_RUNNING bit, it calls
   __qdisc_run() which will do the actual xmit. Subsequent calls
   to xmit within __qdisc_run checks for netif_queue_stopped.
  
   So there is no way that xmit can be called with a stopped queue
   as the core does it for every skb. And no other cpu can execute
   this at the same time as the RUNNING bit is held. So this is a
   completely safe removal of check for every skb.

 Hmm, I don't see any changes that prevent the race I originally
 described in http://oss.sgi.com/archives/netdev/2004-12/msg00474.html
 Maybe your test may just not be able to hit the race, or am I missing
 something?

(Thanks for the link, I hadn't seen this earlier)

That race will not happen, as CPU#2 cannot call qdisc_restart when CPU#1
is holding the RUNNING bit. In this case, CPU#2 simply queues the skb
to dev->q and returns, while CPU#1 finds this new skb in its iteration
of __qdisc_run (and ends up processing all queued skbs after checking for
a stopped queue).
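
A rough sketch of the loop being referred to (paraphrased from that era's
net/sched/sch_generic.c, not verbatim): the owner of the RUNNING bit keeps
dequeueing until the qdisc is empty or the queue is stopped, so skbs
enqueued meanwhile by other CPUs are picked up in the same run.

    void __qdisc_run(struct net_device *dev)
    {
            do {
                    if (!qdisc_restart(dev))   /* dequeue one skb and xmit it */
                            break;
            } while (!netif_queue_stopped(dev));

            clear_bit(__LINK_STATE_QDISC_RUNNING, &dev->state);
    }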

 Anyway medium-term I want to move IPoIB away from LLTX so this doesn't
 matter that much.

Are you planning for 2.6.25?

Thanks,

- KK



Re: [ofa-general] Re: [PATCH] IPoIB: Remove redundant check in xmit handler

2007-11-19 Thread Krishna Kumar2
I forgot to mention: maybe the RUNNING bit was added as part of this bug
fix, which means no LLTX driver really needs to check for this condition
again in its xmit handler. That might explain the problem you faced then.

thanks,

- KK

Krishna Kumar2/India/IBM wrote on 11/20/2007 10:57:31 AM:

 Hi Roland,

 Roland Dreier [EMAIL PROTECTED] wrote on 11/20/2007 09:55:00 AM:

qdisc_run() first checks netif_queue_stopped(dev), and then if
it can get the __LINK_STATE_QDISC_RUNNING bit, it calls
__qdisc_run() which will do the actual xmit. Subsequent calls
to xmit within __qdisc_run checks for netif_queue_stopped.
   
So there is no way that xmit can be called with a stopped queue
as the core does it for every skb. And no other cpu can execute
this at the same time as the RUNNING bit is held. So this is a
completely safe removal of check for every skb.
 
  Hmm, I don't see any changes that prevent the race I originally
  described in http://oss.sgi.com/archives/netdev/2004-12/msg00474.html
  Maybe your test may just not be able to hit the race, or am I missing
  something?

 (Thanks for the link, I hadn't seen this earlier)

 That race will not happen as CPU#2 cannot call qdisc_restart when CPU#1
 is holding the RUNNING bit. In this case, CPU#2 simply queue's the skb
 to dev-q and returns, while CPU#1 finds this new skb in it's iteration
 of __qdisc_run (and ends up processing all queue'd skb after checking for
 stopped queue).

  Anyway medium-term I want to move IPoIB away from LLTX so this doesn't
  matter that much.

 Are you planning for 2.6.25?

 Thanks,

 - KK



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-11 Thread Krishna Kumar2
Hi Dave,

David Miller wrote on 10/10/2007 02:13:31 AM:

  Hopefully that new qdisc will just use the TX rings of the hardware
  directly. They are typically large enough these days. That might avoid
  some locking in this critical path.

 Indeed, I also realized last night that for the default qdiscs
 we do a lot of stupid useless work.  If the queue is a FIFO
 and the device can take packets, we should send it directly
 and not stick it into the qdisc at all.

Since you are talking about how it should be done in the *current* code,
I feel LLTX drivers will not work nicely with this.

Actually I was trying this change a couple of weeks back, but felt
that doing so would result in out-of-order packets (skbs present in
the queue which were not sent out due to LLTX failure will be sent out
only at the next net_tx_action, while other skbs are sent ahead of them).

One option is to first call qdisc_run() and then process this skb,
but that is ugly (requeue handling).

However I guess this can be done cleanly once LLTX is removed.
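
To make the idea concrete, here is a minimal, purely illustrative sketch
(not a submitted patch) of the "send directly when the FIFO is empty" path
being discussed; the requeue/out-of-order question raised above is exactly
what this simple form does not handle for LLTX drivers:

    /* Hypothetical helper: try to bypass the qdisc for an empty FIFO. */
    static int try_direct_xmit(struct sk_buff *skb, struct net_device *dev)
    {
            if (dev->qdisc->q.qlen == 0 && !netif_queue_stopped(dev) &&
                netif_tx_trylock(dev)) {
                    int rc = dev->hard_start_xmit(skb, dev);

                    netif_tx_unlock(dev);
                    if (rc == NETDEV_TX_OK)
                            return NETDEV_TX_OK;
            }
            return -1;   /* caller enqueues to the qdisc and calls qdisc_run() */
    }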

Thanks,

- KK



[ofa-general] Re: [PATCHES] TX batching

2007-10-09 Thread Krishna Kumar2
J Hadi Salim [EMAIL PROTECTED] wrote on 10/08/2007 07:35:20 PM:

 I dont see something from Krishna's approach that i can take and reuse.
 This maybe because my old approaches have evolved from the same path.
 There is a long list but as a sample: i used to do a lot more work while
 holding the queue lock which i have now moved post queue lock; i dont
 have any speacial interfaces/tricks just for batching, i provide hints
 to the core of how much the driver can take etc etc. I have offered
 Krishna co-authorship if he makes the IPOIB driver to work on my
 patches, that offer still stands if he chooses to take it.

My feeling is that since the approaches are very different, it would
be a good idea to test the two for performance. Do you mind me doing
that? Of course, others and/or you are more than welcome to do the same.

I had sent a note to you yesterday about this; please let me know
either way.

*** Previous mail **

Hi Jamal,

If you don't mind, I am trying to run your approach vs. mine to get some
results for comparison.

For starters, I am having issues with iperf when using your infrastructure
code with my IPoIB driver - about 100MB is sent and then everything stops
for some reason. The changes I made in the IPoIB driver to support batching
are to set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit
and increase xmit_win on every xmit completion. Is there anything else that
is required from the driver?
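
For illustration, a hypothetical sketch of the xmit_win bookkeeping
described above (NETIF_F_BTX and dev->xmit_win come from Jamal's batching
patches; the helper name is made up):

    /* Shrink the window when a WR is posted, grow it on send completion. */
    static inline void ipoib_adjust_xmit_win(struct net_device *dev, int completed)
    {
            if (completed)
                    dev->xmit_win++;        /* a TX completion freed one slot */
            else if (dev->xmit_win > 1)
                    dev->xmit_win--;        /* one more send posted to the QP */
    }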

thanks,

- KK



[ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Krishna Kumar2
Hi Peter,

Waskiewicz Jr, Peter P [EMAIL PROTECTED] wrote on
10/09/2007 04:03:42 AM:

  true, that needs some resolution. Heres a hand-waving thought:
  Assuming all packets of a specific map end up in the same
  qdiscn queue, it seems feasible to ask the qdisc scheduler to
  give us enough packages (ive seen people use that terms to
  refer to packets) for each hardware ring's available space.
  With the patches i posted, i do that via
  dev->xmit_win that assumes only one view of the driver; essentially a
  single ring.
  If that is doable, then it is up to the driver to say i have
  space for 5 in ring[0], 10 in ring[1] 0 in ring[2] based on
  what scheduling scheme the driver implements - the dev->blist
  can stay the same. Its a handwave, so there may be issues
  there and there could be better ways to handle this.
 
  Note: The other issue that needs resolving that i raised
  earlier was in regards to multiqueue running on multiple cpus
  servicing different rings concurently.

 I can see the qdisc being modified to send batches per queue_mapping.
 This shouldn't be too difficult, and if we had the xmit_win per queue
 (in the subqueue struct like Dave pointed out).

I hope my understanding of multiqueue is correct for this mail to make
sense :-)

Isn't it enough that the multiqueue+batching drivers handle skbs
belonging to different queues themselves, instead of the qdisc having
to figure that out? This will reduce costs for most skbs that are
neither batched nor sent to multiqueue devices.

E.g., the driver can keep processing skbs and put them on the correct
tx_queue as long as the mapping remains the same. If the mapping changes,
it posts the earlier skbs (with the correct lock) and then iterates over
the other skbs that have the next different mapping, and so on.

(This is required only if the driver is supposed to transmit 1 skb per
call; otherwise it is not an issue.)
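
A hedged sketch of the driver-side grouping just described (illustrative
only; the batch list follows the posted batching patches, everything else
here is made up):

    static int xmit_batch_grouped(struct net_device *dev, struct sk_buff_head *blist)
    {
            struct sk_buff *skb = __skb_dequeue(blist);

            while (skb) {
                    u16 ring = skb->queue_mapping;

                    /* take the TX lock for this ring (driver specific) ... */
                    do {
                            /* ... post skb to this ring ... */
                            skb = __skb_dequeue(blist);
                    } while (skb && skb->queue_mapping == ring);
                    /* ... release the ring lock; a leftover skb starts the next group */
            }
            return NETDEV_TX_OK;
    }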

Alternatively, supporting drivers could return a different code on
mapping change, like: NETDEV_TX_MAPPING_CHANGED (for batching only)
so that qdisc_run() could retry. Would that work?

Secondly, having xmit_win per queue: would it help in the multiple-skb
case? Currently there is no way to tell the qdisc to dequeue skbs from
a particular band - it returns the skb from the highest-priority band.

thanks,

- KK



[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Krishna Kumar2
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:

  Isn't it enough that the multiqueue+batching drivers handle skbs
  belonging to different queue's themselves, instead of qdisc having
  to figure that out? This will reduce costs for most skbs that are
  neither batched nor sent to multiqueue devices.
 
  Eg, driver can keep processing skbs and put to the correct tx_queue
  as long as mapping remains the same. If the mapping changes, it posts
  earlier skbs (with the correct lock) and then iterates for the other
  skbs that have the next different mapping, and so on.

 The complexity in most of these suggestions is beginning to drive me a
 bit crazy :-)

 This should be the simplest thing in the world, when TX queue has
 space, give it packets.  Period.

 When I hear suggestions like have the driver pick the queue in
 -hard_start_xmit() and return some special status if the queue
 becomes different.  you know, I really begin to wonder :-)

 If we have to go back, get into the queueing layer locks, have these
 special cases, and whatnot, what's the point?

I understand your point, but the qdisc code itself needs almost no
change; it is as small as:

qdisc_restart()
{
  ...
  case NETDEV_TX_MAPPING_CHANGED:
      /*
       * Driver sent some skbs from one mapping, and found others
       * are for a different queue_mapping. Try again.
       */
      ret = 1; /* guaranteed to have at least 1 skb in batch list */
      break;
  ...
}

Alternatively, if the driver does all the dirty work, the qdisc needs no
change at all. However, I am not sure if this addresses all the
concerns raised by you, Peter, Jamal, and others.

 This code should eventually be able to run lockless all the way to the
 TX queue handling code of the driver.  The queueing code should know
 what TX queue the packet will be bound for, and always precisely
 invoke the driver in a state where the driver can accept the packet.

This sounds like a good idea :)

I need to think more on this, especially as my batching sends multiple skbs
of possibly different mappings to the device, and those skbs stay in the
batch list if the driver couldn't send them out.

thanks,

- KK



[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Krishna Kumar2
David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:

 Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
 it.

Great, this will make life easy. Any idea how long that would take?
It seems simple enough to do.

thanks,

- KK



[ofa-general] Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface

2007-10-08 Thread Krishna Kumar2
Hi Jamal,

If you don't mind, I am trying to run your approach vs. mine to get some
results for comparison.

For starters, I am having issues with iperf when using your infrastructure
code with my IPoIB driver - about 100MB is sent and then everything stops
for some reason. The changes I made in the IPoIB driver to support batching
are to set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit
and increase xmit_win on every xmit completion. Is there anything else that
is required from the driver?

thanks,

- KK

J Hadi Salim [EMAIL PROTECTED] wrote on 10/08/2007 12:06:23 AM:

 This patch introduces the netdevice interface for batching.

 cheers,
 jamal


 [attachment oct07-p1of3 deleted by Krishna Kumar2/India/IBM]



[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Krishna Kumar2
J Hadi Salim [EMAIL PROTECTED] wrote on 10/08/2007 06:47:24 PM:

 two, there should _never_ be any requeueing even if LLTX in the previous
 patches when i supported them; if there is, it is a bug. This is because
 we dont send more than what the driver asked for via xmit_win. So if it
 asked for more than it can handle, that is a bug. If its available space
 changes while we are sending to it, that too is a bug.

The driver might ask for 10 and we send 10, but an LLTX driver might fail
to get the lock and return TX_LOCKED. I haven't seen your code in great
detail, but don't you requeue in that case too?

- KK



[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-07 Thread Krishna Kumar2
 jamal wrote:

   +   while ((skb = __skb_dequeue(skbs)) != NULL)
   +      q->ops->requeue(skb, q);


  ->requeue queues at the head, so this looks like it would reverse
  the order of the skbs.

 Excellent catch!  thanks; i will fix.

 As a side note: Any batching driver should _never_ have to requeue; if
 it does it is buggy. And the non-batching ones if they ever requeue will
 be a single packet, so not much reordering.

On the contrary, batching LLTX drivers (if that is not ruled out) will very
often requeue, resulting in heavy reordering. The fix looks good though.
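
A minimal sketch of one way the reversal could be avoided, assuming
->requeue() inserts at the head of the qdisc as noted above: walk the batch
list from its tail so the original order is restored.

    while ((skb = __skb_dequeue_tail(skbs)) != NULL)
            q->ops->requeue(skb, q);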

- KK



[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000

2007-09-21 Thread Krishna Kumar2
Hi Or,

Sorry about the delay; I ran into various bugs and then the system froze
for about 2 days.

Or Gerlitz [EMAIL PROTECTED] wrote on 09/17/2007 03:22:58 PM:

 good, please test with rev5 and let us know.

I tested with rev5 and this is what I found (different from what I
said earlier about EHCA):

The original code had almost zero retransmissions (for a particular run):
1 for EHCA and 0 for MTHCA. With batching, both had high retransmissions:
73680 for EHCA and 70268 for MTHCA.

It seems I was wrong when I said EHCA was having no issues. So far
I have identical retransmission numbers for E1000 only.

 transmission of 4K batched packets sounds like a real problem for the
 receiver side, with 0.5K send/recv queue size, its 8 batches of 512
 packets each were for each RX there is completion (WC) to process, SKB
 to alloc and post to the QP where for the TX there's only posting to the
 QP, processes one (?) WC and free 512 SKBs.

The receiver and sender both have 4K WRs. I had earlier changed batching
so that IPoIB will send at most 2 skbs even if more are present in the
queue, and send 2 more after the first two, and so on. But that too gave
high retransmission numbers.

 If indeed the situation is so unsymmetrical, I am starting to think that
 the CPU utilization at the sender side might be much higher with
 batching then without batching, have you looked into that?

Overall it is almost the same. I had used netperf (about 1 month back) and
it gave almost the same numbers. I haven't tried recently. Even in the
regular code, though batching is not done, qdisc_restart() does xmit in a
tight loop. The only difference is that dev->queue_lock is DROPPED/GOT for
each skb, and dev->tx_lock is held for shorter times. I avoid the former
and have no control over the latter.

 I am not with you. Looking on 2.6.22 and 2.6.23-rc5, for both their
 ipoib-NAPI mechanism is implemented through the function ipoib_poll
 being the polling api for the network stack etc, so what is the old API
 and where does this difference exist?

I meant the NAPI before Stephen Hemminger's conversion. He had changed the
old NAPI to the newer one (where the driver doesn't get *budget, etc.).

 You might want to try something lighter such as iperf udp test, where a
 nice criteria would be to compare bandwidth AND packet loss between
 no-batching and batching. As for the MTU, the default is indeed 2K
 (2044) but its always to just know the facts, namely what was the mtu
 during the test.

OK, that is a good idea. I will try it over the weekend.

 if you have user space libraries installed, load ib_uverbs and run the
 command ibv_devinfo, you will see all the infiniband devices on your
 system and for each its device id and firmware version. If not, you
 should be looking on

 /sys/class/infiniband/$device/hca_type
 and
 /sys/class/infiniband/$device/fw_ver

Both of these files are absent, though ehca0 is present.
For mthca, the values are: MT23108 and 3.5.0.

Thanks,

- KK



[ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()

2007-09-19 Thread Krishna Kumar2
Hi Jan-Bernd,

Jan-Bernd Themann [EMAIL PROTECTED] wrote on 09/19/2007 06:53:48 PM:

 If I understood it right, the problem you describe (quota update in
 __napi_schedule) can cause further problems when you choose the
 following numbers:

 CPU1: A. process 99 pkts
 CPU1: B. netif_rx_complete()
 CPU2: interrupt occurs, netif_rx_schedule is called, net_rx_action triggered:
 CPU2: C. set quota = 100 (__napi_schedule)
 CPU2: D. call poll(), process 1 pkt
 CPU2: D.2 call netif_rx_complete() (quota not exceeded)
 CPU2: E. net_rx_action: set quota=99
 CPU1: F. net_rx_action: set quota=99 - 99 = 0
 CPU1: G. modify list (list_move_tail) although netif_rx_complete has been called

 Step G would fail as the device is not in the list due
 to netif_rx_complete. This case can occur for all
 devices running on an SMP system where interrupts are not pinned.

I think list_move should be OK whether the device is on the list or not.
But it should not come to that code since work (99) != weight (100).
If work == weight, then the driver would not have done the complete, and
the next/prev would not be set to POISON.

I like the clean changes made by Dave to fix this, and will test it
today (if I can get my crashed system to come up).

Thanks,

- KK



[ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()

2007-09-19 Thread Krishna Kumar2
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 09/19/2007 09:35:57 PM:

 The NAPI_STATE_SCHED flag bit should provide all of the necessary
 synchornization.

 Only the setter of that bit should add the NAPI instance to the
 polling list.

 The polling loop runs atomically on the cpu where the NAPI instance
 got added to the per-cpu polling list.  And therefore decisions to
 complete NAPI are serialized too.

 That serialized completion decision is also when the list deletion
 occurs.

Regarding the list deletion, isn't the race I mentioned still present?
If done < budget, the driver does netif_rx_complete (at which time some
other cpu can add this NAPI to its list). But the first cpu might do some
more actions on the napi - for example, ipoib_poll() calls
ib_req_notify_cq(priv->cq) - when the other cpu might have started using
this napi.

(net_rx_action's 'list_move' however will not execute since work != weight)

Thanks,

- KK



[ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()

2007-09-19 Thread Krishna Kumar2
Ran 4/16/64-thread iperf on the latest bits with this patch and saw no
issues after 30 mins. I used to consistently get the bug within 1-2 mins
with just 4 threads prior to this patch.

Tested-by: Krishna Kumar [EMAIL PROTECTED]
(if there is any value in that)

thanks,

- KK

David Miller [EMAIL PROTECTED] wrote on 09/20/2007 10:42:24 AM:

 From: Krishna Kumar2 [EMAIL PROTECTED]
 Date: Thu, 20 Sep 2007 10:40:33 +0530

  I like the clean changes made by Dave to fix this, and will test it
  today (if I can get my crashed system to come up).

 I would very much appreciate this testing, as I'm rather sure we've
 plugged up the most serious holes at this point.



[ofa-general] Re: [PATCH 1/2] IPoIB: Fix unregister_netdev hang

2007-09-18 Thread Krishna Kumar2
Hi Roland,

Roland Dreier [EMAIL PROTECTED] wrote on 09/18/2007 07:57:24 PM:

   While using IPoIB over EHCA (rc6 bits), unregister_netdev hangs with

 I don't think you're actually using rc6 bits, since in your patch you
have:

   -poll_more:

 and I think that is only in Dave's net-2.6.24 tree now, right?

Nope, that was what I downloaded yesterday:

VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 23
EXTRAVERSION =-rc6
NAME = Pink Farting Weasel

  +        if (likely(!ib_req_notify_cq(priv->cq,
  +                                     IB_CQ_NEXT_COMP |
  +                                     IB_CQ_REPORT_MISSED_EVENTS)))

 It is possible for an interrupt to happen immediately right here,
 before the netif_rx_complete(), so that netif_rx_schedule() gets
 called while we are still on the poll list.

  +netif_rx_complete(dev, napi);

To be clear, netif_rx_schedule while we are still on the poll list will not
do any harm, as it does nothing since NAPI_STATE_SCHED is still set (it is
cleared by netif_rx_complete, which has not yet run). Effectively we
lose/delay processing of an interrupt, if I understood the code right.

I agree with you on the new patch.
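
For reference, a sketch of the ordering being agreed on here (paraphrased
from the eventual fix, not verbatim): complete NAPI first, then re-arm the
CQ, and if the re-arm reports missed events, reschedule and poll again.

    if (done < budget) {
            netif_rx_complete(dev, napi);
            if (unlikely(ib_req_notify_cq(priv->cq,
                                          IB_CQ_NEXT_COMP |
                                          IB_CQ_REPORT_MISSED_EVENTS)) &&
                netif_rx_reschedule(dev, napi))
                    goto poll_more;
    }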

thanks,

- KK



[ofa-general] Re: [PATCH 1/2] IPoIB: Fix unregister_netdev hang

2007-09-18 Thread Krishna Kumar2
Hi Roland,

 Please double check your tree.  I just very carefully looked at my
 trees, and the poll_more: label is added in commit 6b460a71 ([NET]:
 Make NAPI polling independent of struct net_device objects.) which is
 only in the net-2.6.24 tree.  Of course Dave did not change the
 version information in the Makefile since he wouldn't want Linus to
 pick up any extra strange changes when he pulls, so a net-2.6.24 tree
 will look like 2.6.23-rc6 as you quoted.

 And the refcounting bug I fixed is only in net-2.6.24.

You are absolutely right. My wording was incorrect, I should have said
net-2.6.24 (which is *at* rev rc6).

   To be clear, netif_rx_schedule while we are still in the poll list
will not
   do any harm as it does nothing since NAPI_STATE_SCHED is still set
(cleared
   by netif_rx_complete which has not yet run). Effectively we
lost/delayed
   processing an interrupt, if I understood the code right.

 Right, we lose an interrupt, and since the CQ events are one-shot, we
 never get another one, and the interface is effectively dead.

Thanks,

- KK



[ofa-general] Re: [PATCH 3/10 REV5] [sched] Modify qdisc_run to support batching

2007-09-16 Thread Krishna Kumar2
Hi Evgeniy,

Evgeniy Polyakov [EMAIL PROTECTED] wrote on 09/14/2007 05:45:19 PM:

  +  if (skb->next) {
  +          int count = 0;
  +
  +          do {
  +                  struct sk_buff *nskb = skb->next;
  +
  +                  skb->next = nskb->next;
  +                  __skb_queue_tail(dev->skb_blist, nskb);
  +                  count++;
  +          } while (skb->next);

 Could it be list_move()-like function for skb lists?
 I'm pretty sure if you change first and the last skbs and ke of the
 queue in one shot, result will be the same.

I have to do a bit more, like updating the count, etc., but I agree it is
doable. I had mentioned in my PATCH 0/10 that I will later try this
suggestion that you provided last time.

 Actually how many skbs are usually batched in your load?

It depends; e.g. when the tx lock is not acquired, I get batching of up to
8-10 skbs (assuming the tx lock was not acquired quite a few times). But
when the queue gets blocked, I have seen batching of up to 4K skbs (if
tx_queue_len is 4K).

  + /* Reset destructor for kfree_skb to work */
  + skb->destructor = DEV_GSO_CB(skb)->destructor;
  + kfree_skb(skb);

 Why do you free first skb in the chain?

This is the GSO code, which has segmented 'skb' into skbs 1..n; those
segments are sent out and freed by the driver, which means the dummy 'skb'
(without any data) remains to be freed.

Thanks,

- KK



[ofa-general] Re: [PATCH 2/10 REV5] [core] Add skb_blist support for batching

2007-09-16 Thread Krishna Kumar2
Hi Evgeniy,

Evgeniy Polyakov [EMAIL PROTECTED] wrote on 09/14/2007 06:16:38 PM:

  +   if (dev->features & NETIF_F_BATCH_SKBS) {
  +           /* Driver supports batching skb */
  +           dev->skb_blist = kmalloc(sizeof *dev->skb_blist, GFP_KERNEL);
  +           if (dev->skb_blist)
  +                   skb_queue_head_init(dev->skb_blist);
  +   }
  +

 A nitpick is that you should use sizeof(struct ...) and I think it
 requires flag clearing in case of failed initialization?

I thought it better to use sizeof *var in case the name of the structure
changes. Also, the flag is not cleared since I could try to enable batching
later, and it could succeed at that time. When skb_blist is allocated,
batching is enabled; otherwise it is disabled (the feature flag just
indicates that the driver supports batching).
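
A hedged sketch of that convention (NETIF_F_BATCH_SKBS and dev->skb_blist
follow the posted patches; the helper itself is illustrative): the feature
flag only advertises that the driver can batch, while a non-NULL skb_blist
means batching is currently enabled.

    static int dev_change_tx_batching(struct net_device *dev, int enable)
    {
            if (!(dev->features & NETIF_F_BATCH_SKBS))
                    return -EOPNOTSUPP;

            if (enable && !dev->skb_blist) {
                    dev->skb_blist = kmalloc(sizeof(*dev->skb_blist), GFP_KERNEL);
                    if (!dev->skb_blist)
                            return -ENOMEM;
                    skb_queue_head_init(dev->skb_blist);
            } else if (!enable && dev->skb_blist) {
                    skb_queue_purge(dev->skb_blist);
                    kfree(dev->skb_blist);
                    dev->skb_blist = NULL;
            }
            return 0;
    }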

Thanks,

- KK



[ofa-general] Re: [PATCH 10/10 REV5] [E1000] Implement batching

2007-09-16 Thread Krishna Kumar2
Hi Evgeniy,

Evgeniy Polyakov [EMAIL PROTECTED] wrote on 09/14/2007 06:17:14 PM:

  if (unlikely(skb->len <= 0)) {
          dev_kfree_skb_any(skb);
  -       return NETDEV_TX_OK;
  +       return NETDEV_TX_DROPPED;
  }

 This changes could actually go as own patch, although not sure it is
 ever used. just a though, not a stopper.

Since this flag is new and useful only for batching, I feel it is OK to
include it in this patch.

  +   if (!skb || (blist && skb_queue_len(blist))) {
  +          /*
  +           * Either batching xmit call, or single skb case but there are
  +           * skbs already in the batch list from previous failure to
  +           * xmit - send the earlier skbs first to avoid out of order.
  +           */
  +          if (skb)
  +                  __skb_queue_tail(blist, skb);
  +          skb = __skb_dequeue(blist);

 Why is it put at the end?

There is a bug that I had explained in rev4 (see XXX below) resulting
in sending out skbs out of order. The fix is that if the driver gets
called with a skb but there are older skbs already in the batch list
(which failed to get sent out), send those skbs first before this one.

Thanks,

- KK

[XXX] Dave had suggested using batching only in the net_tx_action case.
When I implemented that in earlier revisions, there were lots of TCP
retransmissions (about 18,000 to every 1 in the regular code). I found the
reason for part of that problem: skbs get queued up in dev->qdisc (when the
tx lock was not got or the queue was blocked); when net_tx_action is called
later, it passes the batch list as an argument to qdisc_run, and this
results in skbs being moved to the batch list; then the batching xmit also
fails due to tx lock failure; the next many regular xmits of a single skb
go through the fast path (passing a NULL batch list to qdisc_run) and send
those skbs out to the device while the previous skbs are cooling their
heels in the batch list.

The first fix was to not pass NULL/batch-list to qdisc_run() but to always
check whether skbs are present in the batch list when trying to xmit. This
reduced retransmissions by a third (from 18,000 to around 12,000), but led
to another problem while testing - iperf transmits almost zero data for a
higher number of parallel flows, like 64 or more (and when I run iperf for
a 2 min run, it takes about 5-6 mins, and reports that it ran 0 secs and
that the amount of data transferred is a few MBs). I don't know why this
happens with this being the only change (any ideas are very appreciated).

The second fix that resolved this was to revert back to Dave's suggestion
to use batching only in the net_tx_action case, and to modify the driver to
see whether skbs are present in the batch list and send them out first
before sending the current skb. I still see huge retransmissions for IPoIB
(but not for E1000), though they have reduced to 12,000 from the earlier
18,000.




[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000

2007-09-16 Thread Krishna Kumar2
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 09/17/2007 04:47:48 AM:

 The only major complaint I have about this patch series is that
 the IPoIB part should just be one big changeset.  Otherwise the
 tree is not bisectable, for example the initial ipoib header file
 change breaks the build.

Right, I will change it accordingly.

 On a lower priority, I question the indirection of skb_blist by making
 it a pointer.  For what?  Saving 12 bytes on 64-bit?  That kmalloc()'d
 thing is a nearly guarenteed cache and/or TLB miss.  Just inline the
 thing, we generally don't do crap like this anywhere else.

The intention was to avoid having two flags (one to say the driver supports
batching and a second to indicate whether batching is on/off), so that I
could test skb_blist as the indication of whether batching is on/off. But
your point on the cache miss is absolutely correct, and I will change this
part to be inline.

thanks,

- KK



[ofa-general] Re: [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching

2007-09-16 Thread Krishna Kumar2
Hi Randy,

Randy Dunlap [EMAIL PROTECTED] wrote on 09/15/2007 12:07:09 AM:

  +   To fix this problem, error cases where driver xmit gets called with
a
  +   skb must code as follows:
  +  1. If driver xmit cannot get tx lock, return NETDEV_TX_LOCKED
  + as usual. This allows qdisc to requeue the skb.
  +  2. If driver xmit got the lock but failed to send the skb, it
  + should return NETDEV_TX_BUSY but before that it should have
  + queue'd the skb to the batch list. In this case, the qdisc

queued

  + does not requeue the skb.

Since this was a new section that I added to the documentation, this error
crept in. Thanks for catching it, and for the review comments/ack :)

thanks,

- KK



[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000

2007-09-16 Thread Krishna Kumar2
Hi Or,

 So with ipoib/mthca you still see this 1 : 18.5K retransmission rate
 (with no noticeable retransmission increase for E1000) you were
 reporting at the V4 post?! if this is the case, I think it calls for
 further examination, where help from Mellanox could ease things, I guess.

What I will do today/tomorrow is run rev5 (which I didn't run for mthca) on
both ehca and mthca, get statistics, and send them out. Otherwise what you
stated is correct as far as rev4 goes. After giving the latest details, I
would appreciate any help from Mellanox developers.

 By saying that with ehca you see normal level retransmissions - 2 times
 the regular code do you mean 1 : 2 retransmission rate between batching
 to no batching?

Correct: for every 1 retransmission in the regular code, I see two
retransmissions in the batching case (which I assume is due to overflow at
the receiver side, as I sometimes batch up to 4K skbs). I will post the
exact numbers in the next post.

 I am not sure this was mentioned over the threads, but clearly two sides
 are needed for the dance here, namely I think you want to do your tests
 (both the no batching and with batching) with something like NAPI
 enabled at the receiver side, 2.6.23-rc5 has NAPI

I was using 2.6.23-rc1 on the receiver (which also has NAPI, but uses the
old API - the same function, ipoib_poll()).

 is this with no delay set or not? connected or datagram mode? mtu?
 netperf command? system spec (specifically hca device id and fw
 version), etc?

This is TCP (without No Delay), datagram mode; I didn't change the MTU from
the default (is it 2K?).

The command is iperf with various options for the different test
buffer sizes/threads.

Regarding id/etc, this is what dmesg has:

Sep 16 22:49:26 elm3b39 kernel: eHCA Infiniband Device Driver (Rel.:
SVNEHCA_0023)
Sep 16 22:49:26 elm3b39 kernel: xics_enable_irq: irq=36868: ibm_int_on
returned -3

There are *fw* files for mthca0, but I don't see any for ehca in
/sys/class, so I am not sure (since these are PCI-e cards, nothing shows up
in lspci -v). What should I look for?

Thanks,

- KK



[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000

2007-09-16 Thread Krishna Kumar2
[Removing Jeff as requested from thread :) ]

Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 09/17/2007 07:55:02 AM:

 From: jamal [EMAIL PROTECTED]
 Date: Sun, 16 Sep 2007 22:14:21 -0400

  I still think this work - despite my vested interest - needs more
  scrutiny from a performance perspective.

 Absolutely.

 There are tertiary issues I'm personally interested in, for example
 how well this stuff works when we enable software GSO on a non-TSO
 capable card.

 In such a case the GSO segment should be split right before we hit the
 driver and then all the sub-segments of the original GSO frame batched
 in one shot down to the device driver.

 In this way you'll get a large chunk of the benefit of TSO without
 explicit hardware support for the feature.

 There are several cards (some even 10GB) that will benefit immensely
 from this.

I have tried this on ehca which does not support TSO. I added GSO flag at
the ipoib layer (and that resulted in a panic/fix that is mentioned in
this patchset). I will re-run tests for this and submit results.

Thanks,

- KK



[ofa-general] [RESEND] Implement batching skb API and support in IPoIB

2007-09-14 Thread Krishna Kumar2

Hi Dave,

I am re-sending in case you didn't get this earlier, and am also sending
REV5 of the patch. I will send the patch for e1000e on Monday or Tuesday
after making the changes and testing over the weekend.

thanks,

- KK

__

Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 08/29/2007 10:21:50 AM:

 From: Krishna Kumar2 [EMAIL PROTECTED]
 Date: Wed, 29 Aug 2007 08:53:30 +0530

  I am scp'ng from 192.168.1.1 to 192.168.1.2 and captured at the send
  side.

 Bad choice of test, this is cpu limited since the scp
 has to encrypt and MAC hash all the data it sends.

 Use something like straight ftp or bw_tcp from lmbench.

I used bw_tcp from lmbench-3. I transferred 500MB and captured
the tcpdump, and analysis at various points gave these pipeline sizes:

26064, 27792, 22888, 23168, 23448, 20272, 23168, 4344, 10136,
164792, 35920, 26344, 24336, 24336, 23168, 25784, 23168,

There was one huge 164K, otherwise most were in smaller ranges
like 20-30K. I ran the following test script:

SERVER=192.168.1.2
BYTES=100m
BUFFERSIZES="4096 16384 65536 131072 262144"
PROCS="1 8"
ITERATIONS=5

for m in $BUFFERSIZES
do
  for procs in $PROCS
  do
    echo "TEST: Size:$m Procs:$procs"
    bw_tcp -N $ITERATIONS -m $m -M $BYTES -P $procs $SERVER
  done
done

Result is:

Test without batching:
#   Size    Procs  BW (MB/s)
1   4096    1      117.39
2   16384   1      117.49
3   65536   1      117.55
4   131072  1      117.55
5   262144  1      117.58

6   4096    8      117.18
7   16384   8      117.47
8   65536   8      117.54
9   131072  8      117.59
10  262144  8      117.55

Test with batching:
#   Size    Procs  BW (MB/s)
1   4096    1      117.39
2   16384   1      117.48
3   65536   1      117.55
4   131072  1      117.58
5   262144  1      117.58

6   4096    8      117.19
7   16384   8      117.46
8   65536   8      117.53
9   131072  8      117.55
10  262144  8      117.60

So it doesn't seem to harm e1000.

Can someone give a link to the E1000E driver? I couldn't find it
after downloading Jeff's netdev-2.6 tree.

Thanks,

- KK



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-28 Thread Krishna Kumar2
Hi Dave,

I am scp'ing from 192.168.1.1 to 192.168.1.2 and capturing at the send
side.

192.168.1.1.37201 > 192.168.1.2.ssh: P 837178092:837178596(504) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837178596:837181492(2896) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837181492:837184388(2896) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837184388:837188732(4344) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837188732:837193076(4344) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837193076:837194524(1448) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.2.ssh > 192.168.1.1.37201: . ack 837165060 win 3338 <nop,nop,timestamp 123919397 36791208>

Data in pipeline: 837194524 - 837165060 = 29464.

In most cases, I am getting 7K, 8K, 13K, and rarely close to 16K. I ran
iperf with 4K, 16K & 32K (as I could run multiple threads instead of a
single process). Results are (for E1000 82547GI chipset, BW in KB/s):

Test                Org BW  New BW  %
Size:4096 Procs:1   114612  114644  .02
Size:16394 Procs:1  114634 114644   0
Size:32768 Procs:1  114645 114643   0

And for multiple threads:

Test                Org BW  New BW   %
Size:4096 Procs:8   114632  114637   0
Size:4096 Procs:16  114639  114637  0
Size:4096 Procs:64  114893  114800 -.08
Size:16394 Procs:8  114641  114642  0
Size:16394 Procs:16 114642 114643  0
Size:16394 Procs:64 114911 114781  -.11
Size:32768 Procs:8  114638  114639   0
Size:32768 Procs:16 114642  114645  0
Size:32768 Procs:64 114932  114777 -.13

I will run netperf and report CPU utilization too.

Thanks,

- KK

David Miller [EMAIL PROTECTED] wrote on 08/21/2007 12:48:24 PM:

 From: Krishna Kumar2 [EMAIL PROTECTED]
 Date: Fri, 17 Aug 2007 11:36:03 +0530

   I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
   run a longer one tonight). The results are (results in KB/s, and %):
 
  I ran a 8.5 hours run with no batching + another 8.5 hours run with
  batching (Buffer sizes: 32 128 512 4096 16384, Threads: 1 8 32,
  Each test run time: 3 minutes, Iterations to average: 5). TCP seems
  to get a small improvement.

 Using 16K buffer size really isn't going to keep the pipe full enough
 for TSO.  And realistically applications queue much more data at a
 time.  Also, with smaller buffer sizes can have negative effects for
 the dynamic receive and send buffer growth algorithm the kernel uses,
 it might consider the connection application limited for too long.

 I would really prefer to see numbers that use buffer sizes more in
 line with the amount of data that is typically inflight on a 1G
 connection on a local network.

 Do a tcpdump during the height of the transfer to see about what this
 value is.  When an ACK comes in, compare the sequence number it's
 ACK'ing with the sequence number of the most recently sent frame.
 The difference is approximately the pipe size at maximum congestion
 window assuming a loss free local network.



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-28 Thread Krishna Kumar2
[EMAIL PROTECTED] wrote on 08/29/2007 10:21:50 AM:

 From: Krishna Kumar2 [EMAIL PROTECTED]
 Date: Wed, 29 Aug 2007 08:53:30 +0530

  I am scp'ng from 192.168.1.1 to 192.168.1.2 and captured at the send
  side.

 Bad choice of test, this is cpu limited since the scp
 has to encrypt and MAC hash all the data it sends.

 Use something like straight ftp or bw_tcp from lmbench.

OK

 Using a different tool seems strange to me, why not just adjust the
 buffer size with command line options in the benchmark you were using
 in the first place?

The reason was to run parallel copies, not for buffer limitations.
Let me use the same tool for benchmark.

Thanks,

- KK



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-22 Thread Krishna Kumar2
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 08/22/2007 09:52:29 AM:

 From: Krishna Kumar2 [EMAIL PROTECTED]
 Date: Wed, 22 Aug 2007 09:41:52 +0530

snip
   Because TSO does batching already, so it's a very good
   tit for tat comparison of the new batching scheme
   vs. an existing one.
 
  I am planning to do more testing on your suggestion over the
  weekend, but I had a comment. Are you saying that TSO and
  batching should be mutually exclusive so hardware that doesn't
  support TSO (like IB) only would benefit?
 
  But even if they can co-exist, aren't cases like sending
  multiple small skbs better handled with batching?

 I'm not making any suggestions, so don't read that into anything I've
 said :-)

 I think the jury is still out, but seeing TSO perform even slightly
 worse with the batching changes in place would be very worrysome.
 This applies to both throughput and cpu utilization.

Does turning off batching solve that problem? What I mean by that is:
batching can be disabled if a TSO device is worse for some cases. Infact
something that I had changed my latest code is to not enable batching
in register_netdevice (in Rev4 which I am sending in a few mins), rather
the user has to explicitly turn 'on' batching.

Wondering if that is what you are concerned about. In any case, I will
test your case on Monday (I am on vacation for next couple of days).

Thanks,

- KK



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-22 Thread Krishna Kumar2
David Miller [EMAIL PROTECTED] wrote on 08/22/2007 02:44:40 PM:

 From: Krishna Kumar2 [EMAIL PROTECTED]
 Date: Wed, 22 Aug 2007 12:33:04 +0530

  Does turning off batching solve that problem? What I mean by that is:
  batching can be disabled if a TSO device is worse for some cases.

 This new batching stuff isn't going to be enabled or disabled
 on a per-device basis just to get parity with how things are
 now.

 It should be enabled by default, and give at least as good
 performance as what can be obtained right now.

That was how it was in earlier revisions. In revision4 I coded it so
that it is enabled only if explicitly set by the user. I can revert
that change.

 Otherwise it's a clear regression.

Definitely. For drivers that support it, it should not reduce performance.

Thanks,

- KK



[ofa-general] Re: [PATCH 10/10 Rev4] [E1000] Implement batching

2007-08-22 Thread Krishna Kumar2
Hi Auke,

Kok, Auke [EMAIL PROTECTED] wrote on 08/22/2007 08:09:31 PM:

 Krishna,

 while I appreciate the patch I would have preferred a patch to e1000e.
Not only
 does the e1000e driver remove a lot of the workarounds for old silicon,
it is
 also a good way for us to move the current e1000 driver into a bit more
stable
 maintenance mode.

 Do you think you can write this patch for e1000e instead? code-wise a lot
of
 things are still the same, so your patch should be relatively easy to
generate.

 e1000e currently lives in a branch from jeff garzik's netdev-2.6 tree

Definitely, I will pick it up and generate a patch.

Thanks,

- KK



[ofa-general] Re: [PATCH 1/10 Rev4] [Doc] HOWTO Documentation for batching

2007-08-22 Thread Krishna Kumar2
Hi Randy,

Thanks for your suggestions. Will clean up those changes.

- KK

Randy Dunlap [EMAIL PROTECTED] wrote on 08/22/2007 09:20:13 PM:

 On Wed, 22 Aug 2007 13:58:58 +0530 Krishna Kumar wrote:

  Add Documentation describing batching skb xmit capability.
 
  Signed-off-by: Krishna Kumar [EMAIL PROTECTED]
  ---
   batching_skb_xmit.txt |   78
++
   1 files changed, 78 insertions(+)
 
  diff -ruNp org/Documentation/networking/batching_skb_xmit.txt
 new/Documentation/networking/batching_skb_xmit.txt
  --- org/Documentation/networking/batching_skb_xmit.txt   1970-01-01
05:30:
 00.0 +0530
  +++ new/Documentation/networking/batching_skb_xmit.txt   2007-08-22
10:21:
 19.0 +0530
  @@ -0,0 +1,78 @@
  +   HOWTO for batching skb xmit support
  +   ---
  +
  +Section 1: What is batching skb xmit
  +Section 2: How batching xmit works vs the regular xmit
  +Section 3: How drivers can support batching
  +Section 4: How users can work with batching
  +
  +
  +Introduction: Kernel support for batching skb
  +--
  +
  +A new capability to support xmit of multiple skbs is provided in the
netdevice
  +layer. Drivers which enable this capability should be able to process
multiple
  +skbs in a single call to their xmit handler.
  +
  +
  +Section 1: What is batching skb xmit
  +-
  +
  +   This capability is optionally enabled by a driver by setting the
  +   NETIF_F_BATCH_SKBS bit in dev->features. The pre-requisite for a

  prerequisite

  +   driver to use this capability is that it should have a reasonably

I would say reasonably-sized.

  +   sized hardware queue that can process multiple skbs.
  +
  +
  +Section 2: How batching xmit works vs the regular xmit
  +---
  +
  +   The network stack gets called from upper layer protocols with a
single
  +   skb to transmit. This skb is first enqueue'd and an attempt is made
to

enqueued

  +   transmit it immediately (via qdisc_run). However, events like tx
lock
  +   contention, tx queue stopped, etc, can result in the skb not
getting

   etc.,

  +   sent out and it remains in the queue. When the next xmit is called
or
  +   when the queue is re-enabled, qdisc_run could potentially find
  +   multiple packets in the queue, and iteratively send them all out
  +   one-by-one.
  +
  +   Batching skb xmit is a mechanism to exploit this situation where
all
  +   skbs can be passed in one shot to the device. This reduces driver
  +   processing, locking at the driver (or in stack for ~LLTX drivers)
  +   gets amortized over multiple skbs, and in case of specific drivers
  +   where every xmit results in a completion processing (like IPoIB) -
  +   optimizations can be made in the driver to request a completion for
  +   only the last skb that was sent which results in saving interrupts
  +   for every (but the last) skb that was sent in the same batch.
  +
  +   Batching can result in significant performance gains for systems
that
  +   have multiple data stream paths over the same network interface
card.
  +
  +
  +Section 3: How drivers can support batching
  +-
  +
  +   Batching requires the driver to set the NETIF_F_BATCH_SKBS bit in
  +   dev->features.
  +
  +   The driver's xmit handler should be modified to process multiple
skbs
  +   instead of one skb. The driver's xmit handler is called either with
a


an

  +   skb to transmit or NULL skb, where the latter case should be
handled
  +   as a call to xmit multiple skbs. This is done by sending out all
skbs
  +   in the dev->skb_blist list (where it was added by the core stack).
  +
  +
  +Section 4: How users can work with batching
  +-
  +
  +   Batching can be disabled for a particular device, e.g. on desktop
  +   systems if only one stream of network activity for that device is
  +   taking place, since performance could be slightly affected due to
  +   extra processing that batching adds (unless packets are getting
  +   sent fast resulting in stopped queue's). Batching can be enabled if

queues).

  +   more than one stream of network activity per device is being done,
  +   e.g. on servers; or even desktop usage with multiple browser, chat,
  +   file transfer sessions, etc.
  +
  +   Per device batching can be enabled/disabled by passing 'on' or
'off'
  +   respectively to ethtool.

with what other parameter(s), e.g.,

ethtool <dev> batching on/off ?

 ---
 ~Randy
 *** Remember to use Documentation/SubmitChecklist when testing your code
***


[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-21 Thread Krishna Kumar2
David Miller [EMAIL PROTECTED] wrote on 08/22/2007 12:21:43 AM:

 From: jamal [EMAIL PROTECTED]
 Date: Tue, 21 Aug 2007 08:30:22 -0400

  On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
 
   Using 16K buffer size really isn't going to keep the pipe full enough
   for TSO.
 
  Why the comparison with TSO (or GSO for that matter)?

 Because TSO does batching already, so it's a very good
 tit for tat comparison of the new batching scheme
 vs. an existing one.

I am planning to do more testing on your suggestion over the
weekend, but I had a comment. Are you saying that TSO and
batching should be mutually exclusive so hardware that doesn't
support TSO (like IB) only would benefit?

But even if they can co-exist, aren't cases like sending
multiple small skbs better handled with batching?

Thanks,

- KK



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-17 Thread Krishna Kumar2

Hi Dave,

 I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
 run a longer one tonight). The results are (results in KB/s, and %):

I ran a 8.5 hours run with no batching + another 8.5 hours run with
batching (Buffer sizes: 32 128 512 4096 16384, Threads: 1 8 32,
Each test run time: 3 minutes, Iterations to average: 5). TCP seems
to get a small improvement.

Thanks,

- KK

---
   TCP
---
Size:32 Procs:1      3415    3321   -2.75
Size:128 Procs:1     13094   13388  2.24
Size:512 Procs:1     49037   50683  3.35
Size:4096 Procs:1    114646  114619 -.02
Size:16384 Procs:1   114626  114644 .01

Size:32 Procs:8      22675   22633  -.18
Size:128 Procs:8     77994   77297  -.89
Size:512 Procs:8     114716  114711 0
Size:4096 Procs:8    114637  114636 0
Size:16384 Procs:8   95814   114638 19.64

Size:32 Procs:32     23240   23349  .46
Size:128 Procs:32    82284   82247  -.04
Size:512 Procs:32    114885  114769 -.10
Size:4096 Procs:32   95735   114634 19.74
Size:16384 Procs:32  114736  114641 -.08

Average:             1151534 1190210 3.36%
---
No Delay:
-
Size:32 Procs:1      3002    2873   -4.29
Size:128 Procs:1     11853   11801  -.43
Size:512 Procs:1     45565   45837  .59
Size:4096 Procs:1    114511  114485 -.02
Size:16384 Procs:1   114521  114555 .02

Size:32 Procs:8      8026    8029   .03
Size:128 Procs:8     31589   31573  -.05
Size:512 Procs:8     111506  105766 -5.14
Size:4096 Procs:8    114455  114454 0
Size:16384 Procs:8   95833   114491 19.46

Size:32 Procs:32     8005    8027   .27
Size:128 Procs:32    31475   31505  .09
Size:512 Procs:32    114558  113687 -.76
Size:4096 Procs:32   114784  114447 -.29
Size:16384 Procs:32  114719  114496 -.19

Average:             1046026 1034402 -1.11%
---



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-14 Thread Krishna Kumar2
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 08/08/2007 04:19:00 PM:

 From: Krishna Kumar [EMAIL PROTECTED]
 Date: Wed, 08 Aug 2007 15:01:14 +0530

  RESULTS: The performance improvement for TCP No Delay is in the range
of -8%
 to 320% (with -8% being the sole negative), with many individual
tests
 giving 50% or more improvement (I think it is to do with the hw
slots
 getting full quicker resulting in more batching when the queue gets
 woken). The results for TCP is in the range of -11% to 93%, with
most
 of the tests (8/12) giving improvements.

 Not because I think it obviates your work, but rather because I'm
 curious, could you test a TSO-in-hardware driver converted to
 batching and see how TSO alone compares to batching for a pure
 TCP workload?

 I personally don't think it will help for that case at all as
 TSO likely does better job of coalescing the work _and_ reducing
 bus traffic as well as work in the TCP stack.

I used E1000 (guess the choice is OK as e1000_tso returns TRUE. My
hw is 82547GI).

You are right, it doesn't help the TSO case at all (in fact it degrades). Two
things to note though:
  - E1000 may not be suitable for adding batching (which is no
    longer a new API, as I have changed it already).
  - Small skbs, where TSO doesn't come into the picture, still seem to
    improve. A couple of cases for large skbs did show some
    improvement (like 4K, TCP No Delay, 32 procs).

[Total segment retransmissions for the original code test run: 2220 & for the
new code test run: 1620. So the retransmission problem that I was
getting seems to be an IPoIB bug, though I did have to fix one bug
in my networking component where I was calling qdisc_run(NULL) for
regular xmit path and change to always use batching. The problem is
that skb1 - skb10 may be present in the queue after each of them
failed to be sent out, then net_tx_action fires which batches all of
these into the blist and tries to send them out again, which also
fails (eg tx lock fail or queue full), then the next single skb xmit
will send the latest skb ignoring the 10 skbs that are already waiting
in the batching list. These 10 skbs are sent out only the next time
net_tx_action is called, so out of order skbs result. This fix reduced
retransmissions from 180,000 to 55,000 or so. When I changed IPoIB
driver to use iterative sends of each skb instead of creating multiple
Work Request's, that number went down to 15].
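
A minimal sketch of the fix described above, assuming the __qdisc_run(dev,
blist) signature from the earlier revisions of this patchset (the actual
code may differ): the regular xmit path stops passing a NULL batch list, so
any skbs that net_tx_action left in dev->skb_blist are flushed in order
before newer skbs can overtake them.

static inline void qdisc_run(struct net_device *dev)
{
        if (!netif_queue_stopped(dev) &&
            !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
                /* previously the xmit path passed NULL here */
                __qdisc_run(dev, dev->skb_blist);
}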

I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
run a longer one tonight). The results are (results in KB/s, and %):

Test Case  Org BW   New BW  % Change

  TCP

Size:32 Procs:1     1848    3918    112.01
Size:32 Procs:8     21888   21555   -1.52
Size:32 Procs:32    19317   22433   16.13

Size:256 Procs:1    15584   25991   66.78
Size:256 Procs:8    110937  74565   -32.78
Size:256 Procs:32   105767  98967   -6.42

Size:4096 Procs:1   81910   96073   17.29
Size:4096 Procs:8   113302  94040   -17.00
Size:4096 Procs:32  109664  105522  -3.77

TCP No Delay:
--
Size:32 Procs:1     2688    3177    18.19
Size:32 Procs:8     6568    10588   61.20
Size:32 Procs:32    6573    7838    19.24

Size:256 Procs:1    7869    12724   61.69
Size:256 Procs:8    65652   45652   -30.46
Size:256 Procs:32   95114   112279  18.04

Size:4096 Procs:1   95302   84664   -11.16
Size:4096 Procs:8   19  89111   -19.80
Size:4096 Procs:32  109249  113919  4.27

I will submit Rev4 with suggested changes (including single merged
API) on Thursday after some more testing.

Thanks,

- KK



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-14 Thread Krishna Kumar2

Forgot to mention one thing:

 This fix reduced
 retransmissions from 180,000 to 55,000 or so. When I changed IPoIB
 driver to use iterative sends of each skb instead of creating multiple
 Work Request's, that number went down to 15].

This also reduced the TCP No Delay performance gains from huge percentages
like 200-400% to almost the same as the original code. So fixing this
problem in IPoIB (driver?) will enable use of multiple Work Requests &
Work Completions, rather than limiting batching to a single WR/WC.

thanks,

- KK


__

Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 08/08/2007 04:19:00 PM:

 From: Krishna Kumar [EMAIL PROTECTED]
 Date: Wed, 08 Aug 2007 15:01:14 +0530

  RESULTS: The performance improvement for TCP No Delay is in the range
of -8%
 to 320% (with -8% being the sole negative), with many individual
tests
 giving 50% or more improvement (I think it is to do with the hw
slots
 getting full quicker resulting in more batching when the queue gets
 woken). The results for TCP is in the range of -11% to 93%, with
most
 of the tests (8/12) giving improvements.

 Not because I think it obviates your work, but rather because I'm
 curious, could you test a TSO-in-hardware driver converted to
 batching and see how TSO alone compares to batching for a pure
 TCP workload?

 I personally don't think it will help for that case at all as
 TSO likely does better job of coalescing the work _and_ reducing
 bus traffic as well as work in the TCP stack.

I used E1000 (guess the choice is OK as e1000_tso returns TRUE. My
hw is 82547GI).

You are right, it doesn't help TSO case at all (infact degrades). Two
things to note though:
  - E1000 may not be suitable for adding batching (which is no
longer a new API, as I have changed it already).
  - Small skbs where TSO doesn't come into picture still seems to
improve. A couple of cases for large skbs did result in some
improved (like 4K, TCP No Delay, 32 procs).

[Total segments retransmission for original code test run: 2220  for
new code test run: 1620. So the retransmission problem that I was
getting seems to be an IPoIB bug, though I did have to fix one bug
in my networking component where I was calling qdisc_run(NULL) for
regular xmit path and change to always use batching. The problem is
that skb1 - skb10 may be present in the queue after each of them
failed to be sent out, then net_tx_action fires which batches all of
these into the blist and tries to send them out again, which also
fails (eg tx lock fail or queue full), then the next single skb xmit
will send the latest skb ignoring the 10 skbs that are already waiting
in the batching list. These 10 skbs are sent out only the next time
net_tx_action is called, so out of order skbs result. This fix reduced
retransmissions from 180,000 to 55,000 or so. When I changed IPoIB
driver to use iterative sends of each skb instead of creating multiple
Work Request's, that number went down to 15].

I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
run a longer one tonight). The results are (results in KB/s, and %):

Test Case  Org BW   New BW  % Change

  TCP

Size:32 Procs:1 18483918112.01
Size:32 Procs:8 21888   21555   -1.52
Size:32 Procs:3219317   22433   16.13

Size:256 Procs:115584   25991   66.78
Size:256 Procs:8110937  74565   -32.78
Size:256 Procs:32   105767  98967   -6.42

Size:4096 Procs:1   81910   96073   17.29
Size:4096 Procs:8   113302  94040   -17.00
Size:4096 Procs:32  109664  105522  -3.77

TCP No Delay:
--
Size:32 Procs:1 2688317718.19
Size:32 Procs:8 656810588   61.20
Size:32 Procs:326573783819.24

Size:256 Procs:1786912724   61.69
Size:256 Procs:865652   45652   -30.46
Size:256 Procs:32   95114   112279  18.04

Size:4096 Procs:1   95302   84664   -11.16
Size:4096 Procs:8   19  89111   -19.80
Size:4096 Procs:32  109249  113919  4.27

I will submit Rev4 with suggested changes (including single merged
API) on Thursday after some more testing.

Thanks,

- KK

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit 

[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-09 Thread Krishna Kumar2
David Miller [EMAIL PROTECTED] wrote on 08/09/2007 09:57:27 AM:

  Patrick had suggested calling dev_hard_start_xmit() instead of
  conditionally calling the new API and to remove the new API
  entirely. The driver determines whether batching is required or
  not depending on (skb==NULL) or not. Would that approach be fine
  with this single interface goal ?

 It is a valid posibility.

 Note that this is similar to how we handle TSO, the driver
 sets the feature bit and in its -hard_start_xmit() it checks
 the SKB for the given offload property.

Great, I will try to get rid of two paths entirely, and see how to
re-arrange the code cleanly.

thanks,

- KK



[ofa-general] Submitted Rev3 of Batching API

2007-08-08 Thread Krishna Kumar2

Hi all,

I just submitted Rev3 of the batching skb patch with support for IPoIB.

I am still getting a huge number of retransmissions, 1,255,207 vs 67, or
18,750 retransmissions in the new code for every single retransmission
with the original code. But even with so much extra processing and so many
extra skbs, I am getting around 20% overall BW improvement.

I would really appreciate if some experts can review the code on the list,
and also suggest something regarding the retransmissions.

thanks,

- KK



[ofa-general] Re: [PATCH 2/9 Rev3] [core] Add skb_blist hard_start_xmit_batch

2007-08-08 Thread Krishna Kumar2
Hi Evgeniy,

Evgeniy Polyakov [EMAIL PROTECTED] wrote on 08/08/2007 05:31:43 PM:

  +int dev_change_tx_batch_skb(struct net_device *dev, unsigned long
new_batch_skb)
  +{
  +   int ret = 0;
  +   struct sk_buff_head *blist;
  +
  +   if (!dev->hard_start_xmit_batch) {
  +  /* Driver doesn't support batching skb API */
  +  ret = -ENOTSUPP;
  +  goto out;
  +   }
  +
  +   /* Handle invalid argument */
  +   if (new_batch_skb < 0) {
  +  ret = -EINVAL;
  +  goto out;
  +   }
 It is unsigned, how can it be less than zero?

Yuck, originally I had it as int and changed to ulong and forgot to
remove this check.

 And actually you use it just like a binary flag (casted to/from u32 in
 the code, btw), so why not using ethtool_value directly here?

I still need to check if the value is changing, so the one check is needed.
Later I am using it as a value directly.

  +   /* Check if new value is same as the current */
  +   if (!!dev->skb_blist == !!new_batch_skb)
  +  goto out;
  +
  +   if (new_batch_skb &&
  +   (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
  +  ret = -ENOMEM;
  +  goto out;
  +   }
  +
  +   spin_lock(&dev->queue_lock);
  +   if (new_batch_skb) {
  +  skb_queue_head_init(blist);
  +  dev->skb_blist = blist;
  +   } else
  +  free_batching(dev);
  +   spin_unlock(&dev->queue_lock);

 This needs bh lock too, since blist is accessed by qdisc_restart.

Yes, had it in the code, put it in the list of changes, but missed it for
some reason :(

  +int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
  +{
  +   if (!list_empty(&ptype_all))
  +  dev_queue_xmit_nit(skb, dev);
  +
  +   if (netif_needs_gso(dev, skb)) {
  +  if (unlikely(dev_gso_segment(skb))) {
  + kfree(skb);
  + return 0;
  +  }
  +
  +  if (skb->next) {
  + int count = 0;
  +
  + do {
  +struct sk_buff *nskb = skb->next;
  +
  +skb->next = nskb->next;
  +__skb_queue_tail(dev->skb_blist, nskb);
  +count++;
  + } while (skb->next);

 Is it possible to move list without iterating over each entry?

Though I cannot see anything obvious to do that, let me see if
something is possible, as it would make a good difference.

thanks for your suggestions,

- KK



[ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching

2007-08-08 Thread Krishna Kumar2
Evgeniy Polyakov [EMAIL PROTECTED] wrote on 08/08/2007 05:44:02 PM:

 On Wed, Aug 08, 2007 at 03:01:45PM +0530, Krishna Kumar
([EMAIL PROTECTED]) wrote:
  +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
  +   struct sk_buff_head *blist, struct sk_buff **skbp)
  +{
  +   if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <=
1))) {
  +  return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
  +   } else {
  +  int max = dev->tx_queue_len - skb_queue_len(blist);
  +  struct sk_buff *skb;
  +
  +  while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
  + max -= dev_add_skb_to_blist(skb, dev);
  +
  +  *skbp = NULL;
  +  return 1;   /* we have atleast one skb in blist */
  +   }
  +}

 Same here - is it possible to get a list in one go instead of pulling
 one-by-one, since it forces quite a few additional unneeded lock
 get/releases. What about dev_dequeue_number_skb(dev, q, num), which will
 grab the lock and move a list of skbs from one queue to provided head.

OK, I will try this out.
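
For reference, a rough sketch of the kind of helper being suggested; the
name dev_dequeue_number_skb and its exact semantics are assumptions here,
not something in the posted patches:

static int dev_dequeue_number_skb(struct net_device *dev, struct Qdisc *q,
                                  struct sk_buff_head *blist, int num)
{
        struct sk_buff *skb;
        int moved = 0;

        /* caller holds dev->queue_lock; move up to 'num' skbs in one pass */
        while (moved < num && (skb = q->dequeue(q)) != NULL) {
                __skb_queue_tail(blist, skb);
                moved++;
        }
        return moved;
}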

  @@ -158,7 +198,10 @@ static inline int qdisc_restart(struct n
  /* And release queue */
   spin_unlock(&dev->queue_lock);
 
  -   ret = dev_hard_start_xmit(skb, dev);
  +   if (likely(skb))
  +  ret = dev_hard_start_xmit(skb, dev);
  +   else
  +  ret = dev->hard_start_xmit_batch(dev);

 Perfectionism says that having array of two functions and calling one of
 them via array_func_pointer[!!skb] will be much faster. Just a though.
 It is actually much faster than if/else on x86 at least.

Thinking about this - I would have to store the 2-pointer array in dev
itself, wasting some space, and the two function pointers would also have
different signatures since one takes an extra argument. Will ponder some
more :)
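
For what it is worth, a hedged sketch of how the array dispatch could avoid
both problems (the wrapper and array names are made up for illustration, and
dev_hard_start_xmit is assumed to be visible at the call site):

typedef int (*xmit_fn_t)(struct sk_buff *skb, struct net_device *dev);

static int batch_xmit_wrapper(struct sk_buff *skb, struct net_device *dev)
{
        /* skb is NULL on this path; the batch handler only needs dev */
        return dev->hard_start_xmit_batch(dev);
}

static const xmit_fn_t xmit_fn[2] = {
        batch_xmit_wrapper,     /* [0]: skb == NULL, flush dev->skb_blist */
        dev_hard_start_xmit,    /* [1]: normal single-skb transmit */
};

/* the call site in qdisc_restart() would then read:
 *      ret = xmit_fn[!!skb](skb, dev);
 * so the array lives in sch_generic.c rather than in each net_device. */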

thanks,

- KK



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-08 Thread Krishna Kumar2
Herbert Xu [EMAIL PROTECTED] wrote on 08/08/2007 07:12:47 PM:

 On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
 
  Not because I think it obviates your work, but rather because I'm
  curious, could you test a TSO-in-hardware driver converted to
  batching and see how TSO alone compares to batching for a pure
  TCP workload?

 You could even lower the bar by disabling TSO and enabling
 software GSO.

I will try with E1000 (though I didn't see improvement when I tested a long
time back). The difference I expect is that TSO would help with large
packets, not necessarily with small/medium packets, and definitely not in
the case of multiple different skbs (as opposed to a single large skb)
getting queued. I think these are two different workloads.

  I personally don't think it will help for that case at all as
  TSO likely does better job of coalescing the work _and_ reducing
  bus traffic as well as work in the TCP stack.

 I agree.  I suspect the bulk of the effort is in getting
 these skb's created and processed by the stack so that by
 the time that they're exiting the qdisc there's not much
 to be saved anymore.

However, I am getting a large improvement for IPoIB specifically for this
same case. The reason: batching will help only when the queue gets full and
stopped (and to a lesser extent if the tx lock was not acquired, which
results in a smaller amount of batching that can be done).

thanks,

- KK



[ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching

2007-08-08 Thread Krishna Kumar2
Hi Patrick,

Patrick McHardy [EMAIL PROTECTED] wrote on 08/08/2007 07:35:17 PM:

 Krishna Kumar wrote:
snip
  +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
  +   struct sk_buff_head *blist, struct sk_buff **skbp)
  +{
  +   if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <=
1))) {
  +  return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
  +   } else {
  +  int max = dev->tx_queue_len - skb_queue_len(blist);
  +  struct sk_buff *skb;
  +
  +  while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
  + max -= dev_add_skb_to_blist(skb, dev);
  +
  +  *skbp = NULL;
  +  return 1;   /* we have atleast one skb in blist */
  +   }
  +}

 I think you missed my previous reply to this in the flood of
 responses (or I missed your reply), so I'm copying it below:

Sorry, but I didn't get your post on this point earlier (thanks for
posting again).

 The entire idea of a single queue after qdiscs that is refilled
 independently of transmission times etc. makes me worry a bit.
 By changing the timing you're effectively changing the qdiscs'
 behaviour, at least in some cases. SFQ is a good example, but I
 believe it affects most work-conserving qdiscs. Think of this
 situation:

 100 packets of flow 1 arrive
 50 packets of flow 1 are sent
 100 packets for flow 2 arrive
 remaining packets are sent

 On the wire you'll first see 50 packets of flow 1, then 100 packets
 alternating between flow 1 and 2, then 50 packets of flow 2.

 With your additional queue all packets of flow 1 are pulled out of
 the qdisc immediately and put in the fifo. When the 100 packets of
 the second flow arrive they will also get pulled out immediately
 and are put in the fifo behind the remaining 50 packets of flow 1.
 So what you get on the wire is:

 100 packets of flow 1
 100 packets of flow 2

In the normal case (qdisc run from xmit and not from tx_action), the code
that executes is the same as the regular code, with no difference in wire
behavior, due to the check:
  if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
(I always pass blist as NULL in the xmit path).

With SFQ for the above scenario, my code will first send out 50 F1 skbs
iteratively (blist==NULL), get blocked so that 50 skbs accumulate on F1
and 100 on F2; then when it is woken up, it will batch, but it will pick
up 50 F1 and 50 F2 skbs in alternate order and put them in the queue, and
finally pick up the remaining 50 F2 skbs and put those too in the queue.
Since I am calling dev_dequeue_skb, I am assured of RR when using SFQ.
The only time I would have 100 skbs from F1 in my queue (and hence sent
out first) is if there are NO F2 skbs.

 So SFQ is without any effect. This is not completely avoidable of
 course, but you can and should limit the damage by only pulling
 out as many packets as the driver can take and have the driver
 stop the queue afterwards.

I feel this cannot happen. Please correct me if I am wrong.

Thanks,

- KK



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-08 Thread Krishna Kumar2
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 08/09/2007 03:31:37 AM:

  What do you generally think of the patch/implementation ? :)

 We have two driver implementation paths on receive and now
 we'll have two on send, and that's not a good trend.

Correct.

 In an ideal world all the drivers would be NAPI and netif_rx()
 would only be used by tunneling drivers and similar in the
 protocol layers.  And likewise all sends would go through
 -hard_start_xmit().

 If you can come up with a long term strategy that gets rid of
 the special transmit method, that'd be great.

 We should make Linux network drivers easy to write, not more difficult
 by constantly adding most interfaces than we consolidate.

I think that is a good top level view, and I agree with that.

Patrick had suggested calling dev_hard_start_xmit() instead of
conditionally calling the new API, and removing the new API
entirely. The driver determines whether batching is required
depending on whether skb==NULL. Would that approach be fine
with this single interface goal?
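
As a rough illustration of that single interface (a hedged sketch only;
xyz_xmit_one is a made-up helper and the requeue-on-error handling is an
assumption, not the posted IPoIB code):

static int xyz_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        if (skb)                        /* normal single-skb transmit */
                return xyz_xmit_one(skb, dev);

        /* skb == NULL: the stack is asking us to flush the batch list */
        while ((skb = __skb_dequeue(dev->skb_blist)) != NULL) {
                int ret = xyz_xmit_one(skb, dev);

                if (ret != NETDEV_TX_OK) {
                        /* put it back; the stack will retry later */
                        __skb_queue_head(dev->skb_blist, skb);
                        return ret;
                }
        }
        return NETDEV_TX_OK;
}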

Thanks,

- KK



[ofa-general] [RFH] IPoIB retransmission when sending multiple WR's to device

2007-08-02 Thread Krishna Kumar2
(Request For Help)

Hi,

On the same topic that I wrote about earlier, I put debugs
in my code to store all skbs in bufferA when enqueing multiple
skbs, and store all skbs to bufferB just before doing post.
During post, I compare the two buffers to make sure that I am
not posting in the wrong order, and that never happens.

But I am getting a huge amount of retransmissions anyway,
and I don't understand why this happens. It is either the
IPoIB driver changes I made, or driver that is processing
multiple WR's, or firmware that is sending out of order.
I don't see high retransmission when I run over E1000 (had
ported E1000 driver too, just to make sure that the core
networking code is not putting skbs out of order, and
results confirm it is correct). The retransmission number
is around 200 vs 6000-1 / sec for Original/New IPoIB
drivers. The result of retransmission is affecting when I
have high number of threads (like 64), but in small # of
threads, even with the retransmission, I am getting very
good improvement in throughput. But with 64 threads, I am
getting a fall from 270 MBPs to 250 MBPs.

Since I am really stuck for the last couple of weeks on this,
can someone help by taking a quick look at my code (attached
as just the functions that are changed), and/or suggest some
way to debug/fix this ?

Thanks,

- KK

(See attached file: CODE)


[ofa-general] Re: [RFH] IPoIB retransmission when sending multiple WR's to device

2007-08-02 Thread Krishna Kumar2
Hi Roland,

Roland Dreier [EMAIL PROTECTED] wrote on 08/02/2007 09:59:23 PM:

   On the same topic that I wrote about earlier, I put debugs
   in my code to store all skbs in bufferA when enqueing multiple
   skbs, and store all skbs to bufferB just before doing post.
   During post, I compare the two buffers to make sure that I am
   not posting in the wrong order, and that never happens.
  
   But I am getting a huge amount of retransmissions anyway,

 Why do you think the retransmissions are related to things being sent
 out of order?  Is it possible you're just sending much faster and
 overrunning the receiver's queue of posted receives?

I cannot be sure of that. But in regular code too, batching is done
in qdisc_run() in a different sense - it sends out packets *iteratively*.
In this case, I see only 225 retransmission for the entire run of all
tests (while in my code, I see 100,000 or more (I think I gave wrong
number earlier, this is the right one - 200 vs 100,000).

Is there any way to avoid the situation you are talking about ? I am
already setting recv_queue_size=4096 when loading ipoib (so for mthca
too).

Thanks,

- KK



[ofa-general] Re: [RFH] IPoIB retransmission when sending multiple WR's to device

2007-08-02 Thread Krishna Kumar2

Hi Roland,

I did one more test to check the out-of-order theory. I changed my new API
to be:

/* Original code, unmodified */
ipoib_start_xmit()
{
  original code
}

/* Added new xmit which is identical to original code but doesn't get the
lock */
ipoib_start_xmit_nolock()
{
  original code but without getting lock
}

/* Batching API */
ipoib_start_xmit_batch()
{
  get_lock()
  while (skbs in queue) {
ret = ipoib_start_xmit_nolock()
  }
  unlock()
}
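
In (hedged) C terms, the batch wrapper of this experiment would look roughly
as below; priv->tx_lock and the use of dev->skb_blist as the queue are
assumptions about the test code, not a verbatim copy:

static int ipoib_start_xmit_batch(struct net_device *dev)
{
        struct ipoib_dev_priv *priv = netdev_priv(dev);
        struct sk_buff *skb;
        unsigned long flags;

        spin_lock_irqsave(&priv->tx_lock, flags);
        while ((skb = __skb_dequeue(dev->skb_blist)) != NULL)
                ipoib_start_xmit_nolock(skb, dev);      /* one WR per skb */
        spin_unlock_irqrestore(&priv->tx_lock, flags);

        return NETDEV_TX_OK;
}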

This in effect is fast sends of multiple skbs while holding the lock once
for all the skbs, without going back to the IP stack. The only difference
is that this creates one WR per skb instead of multiple WR's. I got nearly
identical retransmissions (around 240 for the full run) as compared to the
original code (225), while the multiple-WR code had 100,000
retransmissions.

I agree that sending multiple WR's is still faster than this, but this is
the second best I could do, and still there is no increase in
retransmissions. The tx_queue_len on ib0 is 8192, and the recv/send queue
sizes are both 4K. The receiver shows no errors/drops in ifconfig nor in
netstat -s.

Is there anything that can be concluded/suspected, or something else I
could try ?

The latest run of the batching API gave me 719,679 retransmissions for a 16
min test run
of 16/64 threads (iperf), which comes to 600 retransmissions per sec, which
is more than
the retransmissions during the entire run for the regular code!

thanks,

- KK

__

Hi Roland,

Roland Dreier [EMAIL PROTECTED] wrote on 08/02/2007 09:59:23 PM:

   On the same topic that I wrote about earlier, I put debugs
   in my code to store all skbs in bufferA when enqueing multiple
   skbs, and store all skbs to bufferB just before doing post.
   During post, I compare the two buffers to make sure that I am
   not posting in the wrong order, and that never happens.
  
   But I am getting a huge amount of retransmissions anyway,

 Why do you think the retransmissions are related to things being sent
 out of order?  Is it possible you're just sending much faster and
 overrunning the receiver's queue of posted receives?

I cannot be sure of that. But in regular code too, batching is done
in qdisc_run() in a different sense - it sends out packets *iteratively*.
In this case, I see only 225 retransmission for the entire run of all
tests (while in my code, I see 100,000 or more (I think I gave wrong
number earlier, this is the right one - 200 vs 100,000).

Is there any way to avoid the situation you are talking about ? I am
already setting recv_queue_size=4096 when loading ipoib (so for mthca
too).

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Processing multiple WR's in MTHCA (other drivers).

2007-08-01 Thread Krishna Kumar2

Hi,

If I call post_send with multiple WR's (as a list of 'next'-linked WR's)
from IPoIB to the driver, is it guaranteed that those WR's will go out on
the wire in the same sequence? Will the send sequence be: WR[0], WR[1],
..., WR[n]; or can this not be guaranteed by the driver after it issues
the doorbell?

The reason for asking is that when I modify the networking + ipoib code to
send multiple skbs, I see a huge number of retransmissions (original code:
250, modified code: 5000 or more) as though skbs were sent out of order,
and any amount of staring at the code is not helping.

The code to send multiple WR's is (hopefully readable):

/*
 * Normally start_index is 0 to indicate send from WR[0], but if one
 * post_send fails, start_index is set to the index of the first untried
 * WR by ipoib_send(), which then calls post_send() again. I have also not
 * seen any failures from the provider, nor a subsequent retry that could
 * cause this retransmission to happen.
 */
static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn,
int start_index, int num_skbs,
struct ib_send_wr **bad_wr)
{
int ret;
struct ib_send_wr *last_wr, *next_wr;

last_wr = &priv->tx_wr[start_index + num_skbs - 1];

/* Set Completion Notification for last WR */
last_wr->send_flags = IB_SEND_SIGNALED;

/* Terminate the last WR */
next_wr = last_wr->next;
last_wr->next = NULL;

/* Send all the WR's in one doorbell */
ret = ib_post_send(priv->qp, &priv->tx_wr[start_index], bad_wr);

/* Restore send_flags & WR chain */
last_wr->send_flags = 0;
last_wr->next = next_wr;

return ret;
}

(I am also getting a completion with multiple WR's in each completion
handler, which
is good).
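
For completeness, a hedged sketch of how a caller might drive post_send()
above; the chaining loop, num_skbs, and the assumption that tx_wr[] is an
array of WRs are mine, not the actual ipoib_send() changes:

        struct ib_send_wr *bad_wr;
        int i, ret;

        /* link the WRs so a single doorbell covers the whole batch */
        for (i = 0; i < num_skbs - 1; i++)
                priv->tx_wr[i].next = &priv->tx_wr[i + 1];

        ret = post_send(priv, qpn, 0, num_skbs, &bad_wr);
        if (unlikely(ret))
                ipoib_warn(priv, "post_send failed for %d skbs\n", num_skbs);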

Thanks,

- KK



[ofa-general] Re: Processing multiple WR's in MTHCA (other drivers).

2007-08-01 Thread Krishna Kumar2
Yes, I saw that, and also 10.8.3. I am just wondering if the hardware is
actually compliant with that, the reason being the huge amount of
retransmissions that I am getting.

thanks,

- KK


Michael S. Tsirkin [EMAIL PROTECTED] wrote on 08/01/2007 02:47:45
PM:

 Quoting Krishna Kumar2 [EMAIL PROTECTED]:
 Subject: Processing multiple WR's in MTHCA (other drivers).
  If I call post_send with multiple WR's (as a list of 'next' linked
WR's) from
  IPoIB to the driver, is it guaranteed that those WR's will go out on
the wire in
  the same sequence?

 See chapter 9.5 transaction ordering:
C9-25: A requester shall transmit request messages in the order that
the
Work Queue Elements (WQEs) were posted.


 --
 MST



[ofa-general] Re: [PATCH 00/10] Implement batching skb API

2007-07-24 Thread Krishna Kumar2
Jamal,

This is silly. I am not responding to this type of presumptuous and
insulting mails.

Regards,

- KK

J Hadi Salim [EMAIL PROTECTED] wrote on 07/25/2007 12:58:20 AM:

 KK,

 On Tue, 2007-24-07 at 09:14 +0530, Krishna Kumar2 wrote:

 
  J Hadi Salim [EMAIL PROTECTED] wrote on 07/23/2007 06:02:01 PM:


  Actually you have not sent netperf results with prep and without prep.

 My results were based on pktgen (which i explained as testing the
 driver). I think depending on netperf without further analysis is
 simplistic. It was like me doing forwarding tests on these patches.

   So _which_ non-LLTX driver doesnt do that? ;-
 
  I have no idea since I haven't looked at all drivers. Can you tell
which
  all non-LLTX drivers does that ? I stated this as the sole criterea.

 The few i have peeked at all do it. I also think the e1000 should be
 converted to be non-LLTX. The rest of netdev is screaming to kill LLTX.

   tun driver doesnt use it either - but i doubt that makes it bloat
 
  Adding extra code that is currently not usable (esp from a submission
  point) is bloat.

 So far i have converted 3 drivers, 1 of them doesnt use it. Two more
 driver conversions are on the way, they will both use it. How is this
 bloat again?
 A few emails back you said if only IPOIB can use batching then thats
 good enough justification.

   You waltz in, have the luxury of looking at my code, presentations,
many
   discussions with me etc ...
 
  luxury ?
  I had implemented the entire thing even before knowing that you
  are working on something similar! and I had sent the first proposal to
  netdev,

 I saw your patch at the end of may (or at least 2 weeks after you said
 it existed). That patch has very little resemblance to what you just
 posted conceptwise or codewise. I could post it if you would give me
 permission.

  *after* which you told that you have your own code and presentations
(which
  I had never seen earlier - I joined netdev a few months back, earlier I
was
  working on RDMA, Infiniband as you know).

 I am gonna assume you didnt know of my work - which i have been making
 public for about 3 years. Infact i talked about this topic when i
 visited your office in 2006 on a day you were not present, so it is
 plausible you didnt hear of it.

   And it didn't give me any great
  ideas either, remember I had posted results for E1000 at the time of
  sending the proposals.

 In mid-June you sent me a series of patches which included anything from
 changing variable names to combining qdisc_restart and about everything
 i referred to as being cosmetic differences in your posted patches. I
 took two of those and incorporated them in. One was an XXX in my code
 already to allocate the dev-blist
 (Commit: bb4464c5f67e2a69ffb233fcf07aede8657e4f63).
 The other one was a mechanical removal of the blist being passed
 (Commit: 0e9959e5ee6f6d46747c97ca8edc91b3eefa0757).
 Some of the others i asked you to defer. For example, the reason i gave
 you for not merging any qdisc_restart_combine changes is because i was
 waiting for Dave to swallow the qdisc_restart changes i made; otherwise
 maintainance becomes extremely painful for me.
 Sridhar actually provided a lot more valuable comments and fixes but has
 not planted a flag on behalf of the queen of spain like you did.

  However I do give credit in my proposal to you for what
  ideas that your provided (without actual code), and the same I did for
other
  people who did the same, like Dave, Sridhar. BTW, you too had
discussions with me,
  and I sent some patches to improve your code too,

 I incorporated two of your patches and asked for deferal of others.
 These patches have now shown up in what you claim as the difference. I
 just call them cosmetic difference not to downplay the importance of
 having an ethtool interface but because they do not make batching
 perform any better. The real differences are those two items. I am
 suprised you havent cannibalized those changes as well. I thought you
 renamed them to something else; according to your posting:
 This patch will work with drivers updated by Jamal, Matt & Michael Chan
 with minor modifications - rename xmit_win to xmit_slots & rename batch
 handler. Or maybe thats a future plan you have in mind?

  so it looks like a two
  way street to me (and that is how open source works and should).

 Open source is a lot more transparent than that.

 You posted a question, which was part of your research. I responded and
 told you i have patches; you asked me for them and i promptly ported
 them from pre-2.6.18 to the latest kernel at the time.

 The nature of this batching work is one of performance. So numbers are
 important. If you had some strong disagreements on something in the
 architecture, then it would be of great value to explain it in a
 technical detail - and more importantly to provide some numbers to say
 why it is a bad idea. You get numbers by running some tests.
 You did none of the above. Your

[ofa-general] Re: [PATCH 02/12 -Rev2] Changes to netdevice.h

2007-07-24 Thread Krishna Kumar2
Hi Patrick,

Krishna Kumar2/India/IBM wrote on 07/23/2007 08:27:53 AM:

 Hi Patrick,

 Patrick McHardy [EMAIL PROTECTED] wrote on 07/22/2007 10:36:51 PM:

  Krishna Kumar wrote:
   @@ -472,6 +474,9 @@ struct net_device
   void *priv;   /* pointer to private data   */
   int (*hard_start_xmit) (struct sk_buff *skb,
  struct net_device *dev);
   +   int (*hard_start_xmit_batch) (struct net_device
   +   *dev);
   +
 
 
   Is this function really needed? Can't you just call hard_start_xmit with
   a NULL skb and have the driver use dev->blist?

 Probably not. I will see how to do it this way and get back to you.

I think this is a good idea and makes the code everywhere simpler. I
will try this change and test to make sure it doesn't have any
negative impact. I will most likely send out Rev3 tomorrow.

Thanks,

- KK



[ofa-general] Re: [PATCH 02/10] Networking include file changes.

2007-07-23 Thread Krishna Kumar2
Hi Sridhar,

Sridhar Samudrala [EMAIL PROTECTED] wrote on 07/23/2007 11:29:39 AM:

 Krishna Kumar2 wrote:
  Hi Sridhar,
 
  Sridhar Samudrala [EMAIL PROTECTED] wrote on 07/20/2007 10:55:05 PM:
  diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
  --- org/include/net/pkt_sched.h   2007-07-20 07:49:28.0 +0530
  +++ new/include/net/pkt_sched.h   2007-07-20 08:30:22.0 +0530
  @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
 struct rtattr *tab);
   extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
 
  -extern void __qdisc_run(struct net_device *dev);
  +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head
  *blist);
  Why do we need this additional 'blist' argument?
  Is this different from dev-skb_blist?
 
  It is the same, but I want to call it mostly with NULL and rarely with
the
  batch list pointer (so it is related to your other question). My
original
  code didn't have this and was trying batching in all cases. But in most
  xmit's (probably almost all), there will be only one packet in the
queue to
  send and batching will never happen. When there is a lock contention or
if
  the queue is stopped, then the next iteration will find 1 packets. But
I
  still will try no batching for the lock failure case as there be
probably
  2 packets (one from previous time and 1 from this time, or 3 if two
  failures,
  etc), and try batching only when queue was stopped from net_tx_action
(this
  was based on Dave Miller's idea).

 Is it right to say that the above change is to get this behavior?
If qdisc_run() is called from dev_queue_xmit() don't use batching.
If qdisc_run() is called from net_tx_action(), do batching.

Correct.

 Isn't it possible to have multiple skb's in the qdisc queue in the
 first case?

It is possible but rarer (so it would be unnecessary checking most of the
time). From net_tx_action you are guaranteed to have multiple skbs, but from
xmit you will almost always get one skb (since most single-skb sends will go
out OK). And also in the xmit path, it is more likely to have a few skbs
compared to possibly hundreds in the net_tx_action path.

 If this additional argument is used to indicate if we should do batching
 or not, then passing a flag may be much more cleaner than passing the
blist.

OK, I will add this as another action item to check (along with Patrick's
suggestion to use single API) and will get back.

- KK



[ofa-general] Re: [PATCH 00/12 -Rev2] Implement batching skb API

2007-07-23 Thread Krishna Kumar2
 I have started a 10 run test for various buffer sizes and processes, and
 will post the results on Monday.

The 10 iteration run results for Rev2 are (average) :

------------------------------------------------
Test Case      Org      New      %Change
------------------------------------------------

  TCP 1 Process
Size:32        2703     3063     13.31
Size:128       12948    12217    -5.64
Size:512       48108    55384    15.12
Size:4096      129089   132586   2.70
Average:       192848   203250   5.39

  TCP 4 Processes
Size:32        10389    10768    3.64
Size:128       39694    42265    6.47
Size:512       159563   156373   -1.99
Size:4096      268094   256008   -4.50
Average:       477740   465414   -2.58


  TCP No Delay 1 Process
Size:32        2606     2950     13.20
Size:128       8115     11864    46.19
Size:512       39113    42608    8.93
Size:4096      103966   105333   1.31
Average:       153800   162755   5.82


  TCP No Delay 4 Processes
Size:32        4213     8727     107.14
Size:128       17579    35143    99.91
Size:512       70803    123936   75.04
Size:4096      203541   225259   10.67
Average:       296136   393065   32.73

------------------------------------------------
Average:       1120524  1224484  9.28%

There are three cases that degrade a little (up to -5.6%), but there are 13
cases that improve, and many of those are in the 13% to over 100% range
(7 cases).

Thanks,

- KK

Krishna Kumar2/India/[EMAIL PROTECTED] wrote on 07/22/2007 02:34:57 PM:

 This set of patches implements the batching API, and makes the following
 changes resulting from the review of the first set:

 Changes :
 -
 1.  Changed skb_blist from pointer to static as it saves only 12 bytes
 (i386), but bloats the code.
 2.  Removed requirement for driver to set features & NETIF_F_BATCH_SKBS
 in register_netdev to enable batching as it is redundant. Changed
this
 flag to NETIF_F_BATCH_ON and it is set by register_netdev, and other
 user changable calls can modify this bit to enable/disable batching.
 3.  Added ethtool support to enable/disable batching (not tested).
 4.  Added rtnetlink support to enable/disable batching (not tested).
 5.  Removed MIN_QUEUE_LEN_BATCH for batching as high performance drivers
 should not have a small queue anyway (adding bloat).
 6.  skbs are purged from dev_deactivate instead of from unregister_netdev
 to drop all references to the device.
 7.  Removed changelog in source code in sch_generic.c, and unrelated
renames
 from sch_generic.c (lockless, comments).
 8.  Removed xmit_slots entirely, as it was adding bloat (code and header)
 and not adding value (it is calculated and set twice in internal send
 routine and handle work completion, and referenced once in batch
xmit;
 and can instead be calculated once in xmit).

 Issues :
 
 1. Remove /sysfs support completely ?
 2. Whether rtnetlink support is required as GSO has only ethtool ?

 Patches are described as:
Mail 0/12  : This mail.
Mail 1/12  : HOWTO documentation.
Mail 2/12  : Changes to netdevice.h
Mail 3/12  : dev.c changes.
Mail 4/12  : Ethtool changes.
Mail 5/12  : sysfs changes.
Mail 6/12  : rtnetlink changes.
   Mail 7/12  : Change in qdisc_run & qdisc_restart API, modify callers
to use this API.
Mail 8/12  : IPoIB include file changes.
Mail 9/12  : IPoIB verbs changes
Mail 10/12 : IPoIB multicast, CM changes
Mail 11/12 : IPoIB xmit API addition
Mail 12/12 : IPoIB xmit internals changes (ipoib_ib.c)

 I have started a 10 run test for various buffer sizes and processes, and
 will post the results on Monday.

 Please review and provide feedback/ideas; and consider for inclusion.

 Thanks,

 - KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 03/12 -Rev2] dev.c changes.

2007-07-23 Thread Krishna Kumar2
Hi Evgeniy,

Evgeniy Polyakov [EMAIL PROTECTED] wrote on 07/23/2007 04:14:28 PM:
  +/*
  + * dev_change_tx_batching - Enable or disable batching for a driver that
  + * supports batching.

  +   /* Check if new value is same as the current */
  +   if (!!(dev->features & NETIF_F_BATCH_ON) == !!new_batch_skb)
  +  goto out;

 o_O

 Scratched head for too long before understood what it means :)

Is there an easy way to do this ?
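
One possible rewrite, sketched here purely for illustration (the bool
locals are hypothetical and not from the patch):

    bool batch_on = (dev->features & NETIF_F_BATCH_ON) != 0;
    bool want_on  = (new_batch_skb != 0);

    if (batch_on == want_on)
        goto out;    /* no change requested */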

  +   spin_lock(&dev->queue_lock);
  +   if (new_batch_skb) {
  +  dev->features |= NETIF_F_BATCH_ON;
  +  dev->tx_queue_len = 1;
  +   } else {
  +  if (!skb_queue_empty(&dev->skb_blist))
  + skb_queue_purge(&dev->skb_blist);
  +  dev->features &= ~NETIF_F_BATCH_ON;
  +  dev->tx_queue_len = 1;
  +   }
  +   spin_unlock(&dev->queue_lock);

 Hmm, should this also stop interrupts?

That is a good question, and I am not sure. I thought it was not
required, though adding it doesn't affect the code either. Can someone
tell me whether disabling bh is required here, and why? (I couldn't
figure out the intention of the bh disabling in dev_queue_xmit either;
is it just to disable preemption?)

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 00/10] Implement batching skb API

2007-07-23 Thread Krishna Kumar2
Hi Jamal,

J Hadi Salim [EMAIL PROTECTED] wrote on 07/23/2007 06:02:01 PM:

 Yes, and these results were sent to you as well a while back.
 When i get the time when i get back i will look em up in my test machine
 and resend.

Actually you have not sent netperf results with prep and without prep.

  No. I see value only in non-LLTX drivers which also gets the same TX
lock
  in the RX path.

 So _which_ non-LLTX driver doesnt do that? ;-

I have no idea since I haven't looked at all drivers. Can you tell which
non-LLTX drivers do that ? I stated this as the sole criterion.

 tun driver doesnt use it either - but i doubt that makes it bloat

Adding extra code that is currently not usable (especially from a
submission point of view) is bloat.

 You waltz in, have the luxury of looking at my code, presentations, many
 discussions with me etc ...

Luxury ? I had implemented the entire thing even before knowing that you
were working on something similar! And I sent the first proposal to netdev,
*after* which you said you had your own code and presentations (which I had
never seen earlier; I joined netdev a few months back, and earlier I was
working on RDMA and Infiniband as you know). It didn't give me any great
ideas either; remember I had posted results for E1000 at the time of
sending the proposals. However, I do give credit in my proposal to you for
the ideas you provided (without actual code), and I did the same for other
people who helped, like Dave and Sridhar. BTW, you too had discussions with
me, and I sent some patches to improve your code, so it looks like a two-way
street to me (and that is how open source works and should work).

 When i ask for differences to code you produced, they now seem to sum up
 to the two below. You dont think theres some honest issue with this
 picture?

Two changes ? That's it ? I gave a big list of changes between our
implementations, but you twist my words to conclude there are just two (by
conveniently labelling everything else cosmetic or potentially useful!).
Even my restart routine used a single API from the first day; I would never
imagine using multiple APIs. Our code probably doesn't have even one line
that looks remotely similar!

To clarify : I suggested that you could send patches for the two *missing*
items if you can show they add value (and not the rest, as I consider
those will not improve the code/logic/algo).

  (lacking in frankness, candor, or sincerity; falsely or hypocritically
  ingenuous; insincere)  Sorry, no response to personal comments and
  have a flame-war :)

 Give me a better description.

Sorry, no personal comments. In fact I will avoid responding to baits and
innuendoes from now on.

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 00/10] Implement batching skb API

2007-07-22 Thread Krishna Kumar2
Hi Jamal,

J Hadi Salim [EMAIL PROTECTED] wrote on 07/21/2007 06:48:41 PM:

 - Use a single qdisc interface to avoid code duplication and reduce
   maintainability (sch_generic.c size reduces by ~9%).
 - Has per device configurable parameter to turn on/off batching.
 - qdisc_restart gets slightly modified while looking simple without
   any checks for batching vs regular code (infact only two lines
have
   changed - 1. instead of dev_dequeue_skb, a new batch-aware
function
   is called; and 2. an extra call to hard_start_xmit_batch.

 - No change in__qdisc_run other than a new argument (from DM's
idea).
 - Applies to latest net-2.6.23 compared to 2.6.22-rc4 code.

 All the above are cosmetic differences. To me is the highest priority
 is making sure that batching is useful and what the limitations are.
 At some point, when all looks good - i dont mind adding an ethtool
 interface to turn off/on batching, merge with the new qdisc restart path
 instead of having a parallel path, solicit feedback on naming, where to
 allocate structs etc etc. All that is low prio if batching across a
 variety of hardware and applications doesnt prove useful. At the moment,
 i am unsure theres consistency to justify push batching in.

Batching need not be useful for all hardware. If there is hardware that can
usefully exploit batching (IPoIB is clearly a good candidate, as both the TX
and the TX completion paths can handle multiple skbs, and I haven't looked
at other drivers to see whether any of them can do something similar), then
IMHO it makes sense to enable batching for that hardware. It is up to the
other drivers to determine whether converting to the batching API makes
sense or not. And as indicated, the total size increase for adding the
kernel support is also insignificant: 0.03%, or 1164 bytes (using the
'size' command).

 Having said that below are the main architectural differences we have
 which is what we really need to discuss and see what proves useful:

  - Batching algo/processing is different (eg. if
qdisc_restart() finds
   one skb in the batch list, it will try to batch more (upto a
limit)
   instead of sending that out and batching the rest in the next
call.

 This sounds a little more aggressive but maybe useful.
 I have experimented with setting upper bound limits (current patches
 have a pktgen interface to set the max to send) and have concluded that
 it is unneeded. Probing by letting the driver tell you what space is
 available has proven to be the best approach. I have been meaning to
 remove the code in pktgen which allows these limits.

I don't quite agree with that approach. E.g., if the blist is empty and the
driver says there is space for one packet, you will add one packet, the
driver sends it out, and the device is stopped (with potentially a lot of
skbs on dev->q). Then no packets are added till the queue is enabled, at
which time a flood of skbs will be processed, increasing latency and
holding the lock for a single longer duration. My approach mitigates
holding the lock for longer times and instead sends skbs to the device as
long as we are within the limits.

In fact in my rev2 patch (being sent today or tomorrow after handling
Patrick's and Stephen's comments), I am even removing the driver-specific
xmit_slots, as I find it adds bloat and requires more cycles than
calculating the value each time xmit is done (of course in your approach it
is required since the stack uses it).

 - Jamal's code has a separate hw prep handler called from the stack,
   and results are accessed in driver during xmit later.

 I have explained the reasoning to this a few times. A recent response to
 Michael Chan is here:
 http://marc.info/?l=linux-netdevm=118346921316657w=2

Since E1000 doesn't seem to use the TX lock on RX (at least I couldn't find
it), I feel having prep will not help, as no other cpu can execute the
queue/xmit code anyway (E1000 is also a LLTX driver). Other drivers that
hold the tx lock could get an improvement, however.

 And heres a response to you that i havent heard back on:
 http://marc.info/?l=linux-netdevm=118355539503924w=2

That is because it answered my query :) It is what I was expecting, but
thanks
for the explanation.

 My tests so far indicate this interface is useful. It doesnt apply well

I wonder if you tried enabling/disabling 'prep' on E1000 to see how the
performance is affected. If it helps, I guess you could send me a patch to
add that and I can also test it to see what the effect is. I didn't add it
since IPoIB wouldn't be able to exploit it (unless someone is kind enough
to show me how to).

 So if i was to sum up this, (it would be useful discussion to have on
 these) the real difference is:

 a) you have an extra check on refilling the skb list when you find that
 it has a single skb. I tagged this as being potentially useful.

It is very useful since extra processing is not required for one skb case -

[ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition

2007-07-22 Thread Krishna Kumar2
Hi Micheal,

Michael S. Tsirkin [EMAIL PROTECTED] wrote on 07/22/2007 03:11:36
PM:

  +   /*
  +* Handle skbs completion from tx_tail to wr_id. It is possible to
  +* handle WC's from earlier post_sends (possible multiple) in this
  +* iteration as we move from tx_tail to wr_id, since if the last
  +* WR (which is the one which had a completion request) failed to
be
  +* sent for any of those earlier request(s), no completion
  +* notification is generated for successful WR's of those earlier
  +* request(s).
  +*/

 AFAIK a signalled WR will always generate a completion.
 What am I missing?

Yes, signalled WR will generate a completion. I am trying to catch the case
where, say, I send 64 skbs and set signalling for only the last skb and the
others are set to NO signalling. Now if the driver found the last WR was
bad
for some reason, it will synchronously fail the send for that WR (which
happens to be the only one that is signalled). So after the 1 to 63 skbs
are
finished, there will be no completion called. That was my understanding of
how
this works, and coded it that way so that the next post will clean up the
previous one's completion.
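
A minimal sketch of that signalling scheme (illustrative only; this
chained-WR helper is not the actual IPoIB patch code):

    /* Chain a batch of WRs and request a completion only on the last one,
     * so a single CQE covers every skb from tx_tail up to that wr_id.
     * The caller is assumed to have filled opcode, sg_list, wr_id, etc. */
    static int post_skb_batch(struct ib_qp *qp, struct ib_send_wr *wrs, int n)
    {
        struct ib_send_wr *bad_wr;
        int i;

        for (i = 0; i < n; i++) {
            wrs[i].next = (i == n - 1) ? NULL : &wrs[i + 1];
            /* only the final WR asks for a completion */
            wrs[i].send_flags = (i == n - 1) ? IB_SEND_SIGNALED : 0;
        }
        return ib_post_send(qp, wrs, &bad_wr);
    }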

 
  + /*
  +  * Better error handling can be done here, like free
  +  * all untried skbs if err == -ENOMEM. However at this
  +  * time, we re-try all the skbs, all of which will
  +  * likely fail anyway (unless device finished sending
  +  * some out in the meantime). This is not a regression
  +  * since the earlier code is not doing this either.
  +  */

 Are you retrying posting skbs? Why is this a good idea?
 AFAIK, earlier code did not retry posting WRs at all.

Not exactly. If I send 64 skbs to the device and the provider returned a
bad WR at skb #50, then I will have to try skb #51-64 again since the
provider has not attempted to send those out, as it bails out at the first
failure. The provider of course has already sent out skb #1-49 before
returning failure at skb #50. So it is not strictly retry, just xmit of the
next skbs, which is what the current code also does. I tested this part out
by simulating errors in mthca_post_send and verified that the next
iteration clears up the remaining skbs.
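
Continuing the array-of-WRs sketch above (illustrative only): ib_post_send
fills in bad_wr on a synchronous failure, and with a contiguous WR array
the number of accepted WRs falls out of simple pointer arithmetic:

    ret = ib_post_send(qp, wrs, &bad_wr);
    if (ret) {
        int posted = bad_wr - wrs;    /* WRs [0, posted) were accepted */

        /* skbs for WRs [posted, n) were never attempted; leave them
         * queued so the next xmit call sends them, as described above. */
    }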

 The comment seems to imply that post send fails as a result of SQ
overflow -

Correct.

 do you see SQ overflow errors in your testing?

No.

 AFAIK, IPoIB should never overflow the SQ.

Correct. It should never happen unless IPoIB has a bug :) I guess the
comment
should be removed ?

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 06/12 -Rev2] rtnetlink changes.

2007-07-22 Thread Krishna Kumar2
Hi Patrick,

Patrick McHardy [EMAIL PROTECTED] wrote on 07/22/2007 10:40:37 PM:

 Krishna Kumar wrote:
  diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
  --- org/include/linux/if_link.h   2007-07-20 16:33:35.0 +0530
  +++ rev2/include/linux/if_link.h   2007-07-20 16:35:08.0 +0530
  @@ -78,6 +78,8 @@ enum
  IFLA_LINKMODE,
  IFLA_LINKINFO,
   #define IFLA_LINKINFO IFLA_LINKINFO
  +   IFLA_TXBTHSKB,  /* Driver support for Batch'd skbs */
  +#define IFLA_TXBTHSKB IFLA_TXBTHSKB


 Ughh what a name :) I prefer pronouncable names since they are
 much easier to remember and don't need comments explaining
 what they mean.

 But I actually think offering just an ethtool interface would
 be better, at least for now.

Great, I will remove /sys and rtnetlink and keep the Ethtool i/f.

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 02/12 -Rev2] Changes to netdevice.h

2007-07-22 Thread Krishna Kumar2
Hi Patrick,

Patrick McHardy [EMAIL PROTECTED] wrote on 07/22/2007 10:36:51 PM:

 Krishna Kumar wrote:
  @@ -472,6 +474,9 @@ struct net_device
  void *priv;   /* pointer to private data   */
  int (*hard_start_xmit) (struct sk_buff *skb,
 struct net_device *dev);
  +   int (*hard_start_xmit_batch) (struct net_device
  +   *dev);
  +


 Is this function really needed? Can't you just call hard_start_xmit with
 a NULL skb and have the driver use dev->blist?

Probably not. I will see how to do it this way and get back to you.
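
A rough sketch of Patrick's suggestion (hypothetical; foo_xmit_one and the
exact return handling are made up for illustration):

    /* A NULL skb would mean "drain whatever is queued on the blist". */
    static int foo_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        if (!skb) {
            while ((skb = __skb_dequeue(&dev->skb_blist)) != NULL)
                foo_xmit_one(dev, skb);
            return NETDEV_TX_OK;
        }
        return foo_xmit_one(dev, skb);    /* normal single-skb path */
    }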

  /* These may be needed for future network-power-down code. */
  unsigned long  trans_start;   /* Time (in jiffies) of last Tx
*/
 
  @@ -582,6 +587,8 @@ struct net_device
   #define   NETDEV_ALIGN  32
   #define   NETDEV_ALIGN_CONST   (NETDEV_ALIGN - 1)
 
  +#define BATCHING_ON(dev)   ((dev->features & NETIF_F_BATCH_ON) != 0)
  +
   static inline void *netdev_priv(const struct net_device *dev)
   {
  return dev-priv;
  @@ -832,6 +839,8 @@ extern int  dev_set_mac_address(struct n
  struct sockaddr *);
   extern int  dev_hard_start_xmit(struct sk_buff *skb,
  struct net_device *dev);
  +extern int  dev_add_skb_to_blist(struct sk_buff *skb,
  +struct net_device *dev);


 Again, function signatures should be introduced in the same patch
 that contains the function. Splitting by file doesn't make sense.

Right. I did it for some but missed this. Sorry, will redo.

thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 00/10] Implement batching skb API

2007-07-22 Thread Krishna Kumar2
Hi Jamal,

J Hadi Salim [EMAIL PROTECTED] wrote on 07/22/2007 06:21:09 PM:

 My concern is there is no consistency in results. I see improvements on
 something which you say dont. You see improvement in something that
 Evgeniy doesnt etc.

Hmmm ? Evgeniy has not even tested my code to find some regression :) And
you may possibly not find much improvement in E1000 when you run iperf
(which
is what I do) compared to pktgen. I can re-run and confirm this since my
last
E1000 run was quite some time back.

My point is that batching not being viable for E1000 (or tg3) need not be
the sole criterion for inclusion. If IPoIB or other drivers can take
advantage of it and get better results, then batching can be considered.
Maybe E1000 too can get improvements if someone with more expertise tries
to add this API (not judging your driver writing capabilities, just stating
that driver writers will know more knobs to exploit a complex device like
E1000).

  Since E1000 doesn't seem to use the TX lock on RX (atleast I couldn't
find
  it),
  I feel having prep will not help as no other cpu can execute the
queue/xmit
  code anyway (E1000 is also a LLTX driver).

 My experiments show it is useful (in a very visible way using pktgen)
 for e1000 to have the prep() interface.

I meant : have you compared results of batching with prep on vs prep off,
and
what is the difference in BW ?

   Other driver that hold tx lock could get improvement however.

 So you do see the value then with non LLTX drivers, right? ;-

No. I see value only in non-LLTX drivers which also get the same TX lock
in the RX path. If different locks are taken by TX/RX, then since you are
holding queue_lock before calling 'prep', this excludes other TX from
running at the same time. In that case, taking the tx_lock earlier in order
to do the 'prep' will not cause any degradation (since no other tx can run
anyway, while rx can run as it gets a different lock).

 The value is also there in LLTX drivers even if in just formating a skb
 ready for transmit. If this is not clear i could do a much longer
 writeup on my thought evolution towards adding prep().

In LLTX drivers, the driver does the 'prep' without holding the tx_lock in
any case, so there should be no improvement. Could you send the write-up,
since I really don't see the value in prep unless the driver is non-LLTX
*and* TX/RX hold the same TX lock. I think that is the sole criterion,
right ?

  If it helps, I guess you could send me a patch to
  add that and I can also test it to see what the effect is. I didn't add
it
  since IPoIB wouldn't be able to exploit it (unless someone is kind
enough
  to show me how to).

 Such core code should not just be focussed on IPOIB.

There is *nothing* IPoIB specific in my code, nor any IPoIB focus. I said
adding prep doesn't work for IPoIB, so it is pointless to add bloat to the
code until some code can actually take advantage of this feature (I am sure
you will agree). Which is why I also asked you to please send me a patch if
you find it useful for any driver, rather than rejecting the idea outright.

  I think the code I have is ready and stable,

 I am not sure how to intepret that - are you saying all-is-good and we
 should just push your code in?

I am only too well aware that Dave will not accept just any code (having
experienced this with Mobile IPv6 a long time back, when he said to move
most of it to userspace, and he was absolutely correct :). What I meant to
say is that there isn't much point in saying that your code is not ready,
or you are using an old code base, or it has multiple restart functions, or
is not tested enough, etc., and then saying let's re-do/rethink the whole
implementation when my code is already working and giving good results.
Unless you have some design issues with it, or the code is written badly,
is not maintainable, is not Linux style compliant, is buggy, or will not
handle some case/workload; those types of issues.

OTOH, if you find some cases that are better handled with :
  1. prep handler
  2. xmit_win (which I don't have now),
then please send me patches and I will also test out and incorporate.

 It sounds disingenuous but i may have misread you.

(lacking in frankness, candor, or sincerity; falsely or hypocritically
ingenuous; insincere)  Sorry, no response to personal comments and
have a flame-war :)

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.

2007-07-21 Thread Krishna Kumar2
Stephen Hemminger [EMAIL PROTECTED] wrote on 07/20/2007
09:52:03 PM:
 Patrick McHardy [EMAIL PROTECTED] wrote:

  Krishna Kumar2 wrote:
   Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 03:37:20 PM:
  
  
  
   rtnetlink support seems more important than sysfs to me.
  
  
   Thanks, I will add that as a patch. The reason to add to sysfs is
that
   it is easier to change for a user (and similar to tx_queue_len).
  
 

 But since batching is so similar to TSO, it really should be part of the
 flags and controlled by ethtool like other offload flags.

So should I add all three interfaces (or which ones) :

  1. /sys (like for tx_queue_len)
  2. netlink
  3. ethtool.

Or only 2 & 3 are enough ?

thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 03/10] dev.c changes.

2007-07-21 Thread Krishna Kumar2
Hi Sridhar,

Sridhar Samudrala [EMAIL PROTECTED] wrote on 07/20/2007 11:14:19 PM:

  @@ -1566,7 +1605,7 @@ gso:
/* reset queue_mapping to zero */
skb-queue_mapping = 0;
rc = q-enqueue(skb, q);
  - qdisc_run(dev);
  + qdisc_run(dev, NULL);

 OK. So you are passing a NULL blist here. However, i am
 not sure why batching is not used in this situation.

Actually it could be used, but in most cases there will be only one skb. If
I pass the blist here, the result (for the batching case) will be to put
one single skb into the blist and call the new xmit API. That wastes
cycles, as we take a skb out of the queue (as in the regular code), then
add it to the blist (different in the new code), and then the driver has to
remove this skb from the blist (different in the new code). I could try
batching but then require that there is more than one skb before adding to
the blist (or the blist doesn't already have skbs, in which case adding
even one skb makes sense). Also, it would have a slight impact on regular
drivers, where for each xmit one extra dereference of dev->skb_blist (which
is always NULL) is made, which was another reason to always pass NULL.
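
Illustrative only (call sites simplified; signatures as in the patch under
discussion):

    /* dev_queue_xmit(): almost always a single skb, so skip batching */
    qdisc_run(dev, NULL);

    /* net_tx_action(): the queue was stopped, several skbs may be waiting */
    qdisc_run(dev, dev->skb_blist);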

I will check what the results are by giving passing blist here
too and make the above change. I will run tests for that (as
well as NETPERF RR test as asked by Evgeniy).

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 05/10] sch_generic.c changes.

2007-07-21 Thread Krishna Kumar2
Hi Patrick,

Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 11:46:36 PM:

 Krishna Kumar wrote:
  +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
  +   struct sk_buff_head *blist,
  +   struct sk_buff **skbp)
  +{
  +   if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
  +  return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
  +   } else {
  +  int max = dev->tx_queue_len - skb_queue_len(blist);


 I'm assuming the driver will simply leave excess packets in the
 blist for the next run.

Yes, and the next run will be scheduled even if no more xmits are called,
due to qdisc_restart()'s call to the driver returning:
  BUSY : driver failed to send all, net_tx_action will handle this later
         (the case you mentioned)
  OK   : qlen is > 0, return 1 and __qdisc_run() will retry (where the
         blist len will become zero, as the driver processed EVERYTHING on
         the blist)

 The check for tx_queue_len is wrong though,
 its only a default which can be overriden and some qdiscs don't
 care for it at all.

I think it should not matter whether qdiscs use this or not, or even if it
is modified (unless it is made zero, in which case this breaks). The
intention behind this check is to make sure that no more than tx_queue_len
skbs are in all queues put together (the qdisc q plus dev->skb_blist),
otherwise the blist can become too large and break the idea of
tx_queue_len. Is that a good justification ?
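
A small sketch of the invariant being argued for (illustrative only, based
on the quoted get_skb() fragment; variable names follow that code):

    /* never let the qdisc plus blist exceed dev->tx_queue_len skbs */
    int max = dev->tx_queue_len - skb_queue_len(blist);

    while (max-- > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
        __skb_queue_tail(blist, skb);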

  -void __qdisc_run(struct net_device *dev)
  +void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)


 And the patches should really be restructured so this change is
 in the same patch changing the header and the caller, for example.

Ah, OK.

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 02/10] Networking include file changes.

2007-07-21 Thread Krishna Kumar2
Hi Sridhar,

Sridhar Samudrala [EMAIL PROTECTED] wrote on 07/20/2007 10:55:05 PM:
  diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
  --- org/include/net/pkt_sched.h   2007-07-20 07:49:28.0 +0530
  +++ new/include/net/pkt_sched.h   2007-07-20 08:30:22.0 +0530
  @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
 struct rtattr *tab);
   extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
 
  -extern void __qdisc_run(struct net_device *dev);
  +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head
*blist);

 Why do we need this additional 'blist' argument?
 Is this different from dev-skb_blist?

It is the same, but I want to call it mostly with NULL and rarely with the
batch list pointer (so it is related to your other question). My original
code didn't have this and tried batching in all cases. But in most xmits
(probably almost all), there will be only one packet in the queue to send,
and batching will never happen. When there is lock contention or the queue
is stopped, the next iteration will find > 1 packets. But I will still skip
batching for the lock failure case, as there will probably be only 2
packets (one from the previous time and 1 from this time, or 3 if two
failures, etc.), and try batching only when the queue was stopped from
net_tx_action (this was based on Dave Miller's idea).

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Results Scripts for : [PATCH 00/10] Implement batching skb API

2007-07-20 Thread Krishna Kumar2

Attached file contains scripts for running tests and parsing results :

(See attached file: scripts.tar)

The result of a 10 run (average) TCP iperf (and 1 netperf for UDP) is
given below.

Thanks,

- KK

-
Test configuration : Single cross-over cable for MTHCA cards (MT23108) on
two PPC64 systems; both systems are 8-CPU P5 1.5 GHz machines with 8GB memory.

A. TCP results for a 10 run average are as follows (using iperf, could not
   run netperf in parallel as it is not synchronized):

First number  : Orig BW in KB/s.
Second number : New BW in KB/s.
Third number  : Percentage change.

   IPoIB was configured with 512 sendq size while default configuration (128)
   gave positives for most test cases but more negatives for 512 and 4K buffer
   sizes.

Buffer Size 32
TCP Threads:1             :    3126      3169      1.4
TCP Threads:4             :    9739     10889     11.8
TCP Threads:16            :   35383     47218     33.4
TCP Threads:64            :   85147     84196     -1.1
Average : 9.05%

TCP No Delay: Threads:1   :    1990      2976     49.5
TCP No Delay: Threads:4   :    8137      8770      7.7
TCP No Delay: Threads:16  :   31714     37308     17.63
TCP No Delay: Threads:64  :   72830     81892     12.44
Average : 14.19%

Buffer Size 128
TCP Threads:1             :   12674     13339      5.2
TCP Threads:4             :   37889     40816      7.7
TCP Threads:16            :  141342    165935     17.3
TCP Threads:64            :  199813    196283     -1.7
Average : 6.29%

TCP No Delay: Threads:1   :    7732     11272     45.7
TCP No Delay: Threads:4   :   33348     35222      5.6
TCP No Delay: Threads:16  :  120507    143960     19.5
TCP No Delay: Threads:64  :  195459    193875     -0.8
Average : 7.64%

Buffer Size 512
TCP Threads:1             :   42256     55735     31.9
TCP Threads:4             :  161237    161777      0.3
TCP Threads:16            :  227911    231781      1.7
TCP Threads:64            :  229779    223152     -2.9
Average : 1.70%

TCP No Delay: Threads:1   :   30065     42500     41.3
TCP No Delay: Threads:4   :   79076    125848     59.1
TCP No Delay: Threads:16  :  225725    224155     -0.7
TCP No Delay: Threads:64  :  231220    223664     -3.26
Average : 8.84%

Buffer Size 4096
TCP Threads:1             :  119364    135445     13.5
TCP Threads:4             :  261301    256754     -1.7
TCP Threads:16            :  246889    247065      0.07
TCP Threads:64            :  237613    234185     -1.4
Average : 0.95%

TCP No Delay: Threads:1   :  102187    104087      1.9
TCP No Delay: Threads:4   :  204139    243169     19.1
TCP No Delay: Threads:16  :  245529    242519     -1.2
TCP No Delay: Threads:64  :  236826    233382     -1.4
Average : 4.37%

-
B. Using netperf to run 1 process UDP (1 run, measured with 128 sendq
   size, will be re-doing with 512 sendq and for 10 runs average) :

---------------------------------------------------------------------
        Org                 New                 Perc
   BW       Service     BW       Service     BW       Service
---------------------------------------------------------------------
    6.40    1277.64      6.50    1272.41     1.56      -0.40
   24.80     663.01     25.80     318.13     4.03     -52.01
  101.80      81.02    101.90      80.63     0.09      -0.48
  395.70      20.77    395.90      20.74     0.05      -0.14
 1172.90       7.00   1156.80       7.10    -1.37       1.42

---------------------------------------------------------------------

scripts.tar
Description: Binary data
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[ofa-general] Re: [PATCH 00/10] Implement batching skb API

2007-07-20 Thread Krishna Kumar2
Stephen Hemminger [EMAIL PROTECTED] wrote on 07/20/2007
12:48:48 PM:

 You may see worse performance with batching in the real world when
 running over WAN's.  Like TSO, batching will generate back to back packet
 trains that are subject to multi-packet synchronized loss. The problem is
that
 intermediate router queues are often close to full, and when a long
string
 of packets arrives back to back only the first ones will get in, the rest
 get dropped.  Normal sends have at least minimal pacing so they are less
 likely do get synchronized drop.

Hi Stephen,

OK. The difference that I could see is that in the existing code the
minimal pacing could also lead to (possibly slightly less) loss, since
sends are quick iterations at the IP layer, while with batching sends are
iterative at the driver layer.

Is it an issue ? Any suggestions ?

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 00/10] Implement batching skb API

2007-07-20 Thread Krishna Kumar2
Stephen Hemminger [EMAIL PROTECTED] wrote on 07/20/2007
12:48:48 PM:

 You may see worse performance with batching in the real world when
 running over WAN's.  Like TSO, batching will generate back to back packet
 trains that are subject to multi-packet synchronized loss. The problem is
that
 intermediate router queues are often close to full, and when a long
string
 of packets arrives back to back only the first ones will get in, the rest
 get dropped.  Normal sends have at least minimal pacing so they are less
 likely do get synchronized drop.

Also, I forgot to mention in the previous mail: if performance is seen to
be dipping, batching can be disabled on WANs with:

echo 0 > /sys/class/net/<dev>/tx_batch_skbs

and use batching on local/site networks in that case.

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 05/10] sch_generic.c changes.

2007-07-20 Thread Krishna Kumar2
Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 03:41:01 PM:

 Krishna Kumar wrote:
  diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
  --- org/net/sched/sch_generic.c   2007-07-20 07:49:28.0 +0530
  +++ new/net/sched/sch_generic.c   2007-07-20 08:30:22.0 +0530
  @@ -9,6 +9,11 @@
* Authors:   Alexey Kuznetsov, [EMAIL PROTECTED]
*  Jamal Hadi Salim, [EMAIL PROTECTED] 990601
*  - Ingress support
  + *
  + * New functionality:
  + *  Krishna Kumar, [EMAIL PROTECTED], July 2007
  + *  - Support for sending multiple skbs to devices that support
  + *new api - dev->hard_start_xmit_batch()


 No new changelogs in source code please, git keeps track of that.

Ah, didn't know this, thanks for letting me know.

  -static inline int qdisc_restart(struct net_device *dev)
  +static inline int qdisc_restart(struct net_device *dev,
  +struct sk_buff_head *blist)
   {
  struct Qdisc *q = dev-qdisc;
  struct sk_buff *skb;
  -   unsigned lockless;
  +   unsigned getlock;  /* whether we need to get lock or not */


 Unrelated rename, please get rid of this to reduce the noise.

OK, I guess I should have sent that change earlier :) The reason to change
the name is to avoid (double-negative) checks like :

  if (!lockless)
to
  if (getlock).

I will remove these changes.

thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 03/10] dev.c changes.

2007-07-20 Thread Krishna Kumar2
Hi Patrick,

Thanks for your comments.

Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 03:34:30 PM:

 The queue length can be changed through multiple interfaces, if that
 really is important you need to catch these cases too.

I have a TODO comment in net-sysfs.c which is to catch this case.

  +  } else {
  + dev->skb_blist = kmalloc(sizeof *dev->skb_blist,
  +   GFP_KERNEL);


 Why not simply put the head in struct net_device? It seems to me that
 this could also be used for gso_skb.

Without going into GSO, it is wasting some 32 bytes on i386 since most
drivers
don't export this API.

 Queue purging should be done in dev_deactivate.

I originally had it in dev_deactivate, but when I did an ifdown eth0 /
ifup eth0, the system panic'd. The first solution I thought of was to
initialize the skb_blist in dev_change_flags() rather than in
register_netdev(), but then I felt that a series of ifup/ifdowns would
unnecessarily check/malloc/free/initialize stuff, and so thought of
putting it in unregister_netdev (where it is balanced with
register_netdev).

Is there any reason to move this ?

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.

2007-07-20 Thread Krishna Kumar2
Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 03:37:20 PM:

 Krishna Kumar wrote:
  Support to turn on/off batching from /sys.


 rtnetlink support seems more important than sysfs to me.

Thanks, I will add that as a patch. The reason to add to sysfs is that
it is easier to change for a user (and similar to tx_queue_len).

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 03/10] dev.c changes.

2007-07-20 Thread Krishna Kumar2
Hi Patrick,

Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 04:50:37 PM:

  I have a TODO comment in net-sysfs.c which is to catch this case.
 

 I noticed that. Still wondering why it is important at all though.

I saw another mail of yours on the marc list on this same topic (which
still hasn't come to me in the mail), so I will answer both :

 Is there any downside in using batching with smaller queue sizes?

I think there is, but as yet I don't have any data (and 16 is probably
higher than required) to show it. If the queue size is very small (like 4),
the extra processing to maintain this list may take more cycles than the
performance gain from sending out a few skbs, especially since most xmits
will send out 1 skb and skb batching takes place less often (when the tx
lock fails or the queue gets full).

OTOH, there might be a gain even in sending out 2 skbs; the problem is in
doing the extra processing before xmit and not at the time of xmit.

Does this sound OK ? If so, I will add the code to implement the TODO for
tx_queue_len checking too.

  Without going into GSO, it is wasting some 32 bytes on i386 since most
  drivers don't export this API.

 32 bytes? I count 16, - 4 for the pointer, so its 12 bytes of waste.
 If you'd use it for gso_skb it would come down to 8 bytes. struct
 net_device is a pig already, and there are better ways to reduce this
 than starting to allocating single members with a few bytes IMO.

Sorry, I wanted to say 12 bytes on 32 bit system but mixed it up and
said 32 bytes. So I guess static allocation is better then, and it will
also help in performance as memory access is not required (offsetof
should work).

 Yes, packets can be holding references to various stuff and
 these should be released on device down. As I said above I
 don't really like the allocation, but even if you want to
 keep it, just do the purging and dev_deactivate and keep the
 freeing in unregister_netdev (actually I guess it should be
 free_netdev to handle register_netdevice errors).

Right, that makes it clean to do (and avoid stale packets on down).
I will make both these changes now.
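
A minimal sketch of those two changes (illustrative only; assumes the
static skb_blist member discussed above):

    /* in register_netdevice(), roughly: */
    skb_queue_head_init(&dev->skb_blist);

    /* in dev_deactivate(), roughly: drop any batched skbs so no stale
     * references survive an interface down */
    skb_queue_purge(&dev->skb_blist);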

Thanks for these suggestions,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 03/10] dev.c changes.

2007-07-20 Thread Krishna Kumar2
Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007:

 I can't really argue about the numbers, but it seems to me that only
 devices which *usually* have a sufficient queue length will support
 this, and anyone setting the queue length of a gbit device to 16 is
 begging for trouble anyway. So it doesn't really seem worth to bloat
 the code for handling an insane configuration as long as it doesn't
 break.

Ah, I get your point now. So if the driver sets BATCHING and the user then
sets queue_len to (say) 4, then poor results are expected (and the kernel
doesn't need to try to fix it). Same for a driver setting BATCHING when its
queue is small in the first place, which no driver writer should do anyway.
I think it makes the code a lot easier too. Will update.

thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 03/10] dev.c changes.

2007-07-20 Thread Krishna Kumar2
Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 04:50:37 PM:

 32 bytes? I count 16, - 4 for the pointer, so its 12 bytes of waste.
 If you'd use it for gso_skb it would come down to 8 bytes. struct
 net_device is a pig already, and there are better ways to reduce this
 than starting to allocating single members with a few bytes IMO.

Currently, this allocated pointer is an indication to let kernel users
(qdisc_restart, setting/resetting tx_batch_skbs) know whether batching
is enabled or disabled. Removing the pointer and making it static means
those users cannot figure out this information. Adding another field to
netdev may be a bad idea, so I am thinking of overloading dev->features
to add a new flag (other than NETIF_F_BATCH_SKBS, since that is a driver
capabilities flag) which can be set/cleared based on the NETIF_F_BATCH_SKBS
bit. Does this approach sound OK ?
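
A hypothetical sketch of that flag split (bit values and placement are
illustrative, not taken from the actual patch):

    #define NETIF_F_BATCH_SKBS  (1 << 29)   /* driver capability (illustrative bit) */
    #define NETIF_F_BATCH_ON    (1 << 30)   /* run-time enable (illustrative bit) */

    /* in register_netdev(): enable by default when the driver supports it */
    if (dev->features & NETIF_F_BATCH_SKBS)
        dev->features |= NETIF_F_BATCH_ON;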

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 03/10] dev.c changes.

2007-07-20 Thread Krishna Kumar2

(My Notes crashed when I hit the Send button, so not sure if this went
out).

__

Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 04:50:37 PM:

 32 bytes? I count 16, - 4 for the pointer, so its 12 bytes of waste.
 If you'd use it for gso_skb it would come down to 8 bytes. struct
 net_device is a pig already, and there are better ways to reduce this
 than starting to allocating single members with a few bytes IMO.

Currently, this allocated pointer is an indication to let kernel users
(qdisc_restart, setting/resetting tx_batch_skbs) know whether batching
is enabled or disabled. Removing the pointer and making it static means
those users cannot figure out this information. Adding another field to
netdev may be a bad idea, so I am thinking of overloading dev->features
to add a new flag (other than NETIF_F_BATCH_SKBS, since that is a driver
capabilities flag) which can be set/cleared based on the NETIF_F_BATCH_SKBS
bit. Does this approach sound OK ?

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 00/10] Implement batching skb API

2007-07-20 Thread Krishna Kumar2
Hi Evgeniy,

Evgeniy Polyakov [EMAIL PROTECTED] wrote on 07/20/2007 06:24:23 PM:

  After fine-tuning qdisc and other changes, I modified IPoIB to use this
API,
  and now get good gains. Summary for TCP  No Delay: 1 process improves
for
  all cases from 1.4% to 49.5%; 4 process has almost identical
improvements
  from -1.7% to 59.1%; 16 process case also improves in the range of
-1.2% to
  33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%).
UDP
  was tested with 1 process netperf with small increase in BW but big
  improvement in Service Demand. Netperf latency tests show small drop in
  transaction rate (results in separate attachment).

 What about round-robin tcp time and latency test? In theory such batching
 mode should not change that timings, but practice can show new aspects.
 I will review code later this week (likely tomorrow) and if there will
 be some issues return back.

I had run RR test quite some time back and don't have the result at this
time,
other than remembering it was almost the same as the original. As I am
running
some tests on those systems at this time, I can send the results of RR
tomorrow.

Thanks,

- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general