[dpdk-dev] capture packets on VM

2016-07-15 Thread Matt Laswell
Hey Raja,

When you bind the ports to the DPDK poll mode drivers, the kernel no longer
has visibility into them.  This makes some sense intuitively - it would be
very bad for the kernel and a user mode application to both attempt to
control the ports.  This is why tools like tcpdump and wireshark don't work
(and why the ports generally don't show up in ifconfig).

If you just want to know that packets are flowing, an easy way to do it is
simply to emit messages (via printf or the logging subsystem of your
choice) or increment counters when you receive packets.  If you want to
verify a little bit of information about the packets but don't need full
capture, you can either add some parsing information to your messages, or
build out more stats.

However, if you want to actually capture the packet contents, it's a little
trickier.  You can write your own packet-capture application, of course,
but that might be a bigger task than you're looking for.  You can also
instantiate a KNI interface and either copy or forward the packets to it
(and, from there, you can do tcpdump on the kernel side of the interface).
I seem to recall that there's been some work done on tcpdump-like
applications within DPDK, but I don't remember what state those efforts
are in at present.

--
Matt Laswell
laswell at infinite.io
infinite io, inc.

On Fri, Jul 15, 2016 at 12:54 AM, Raja Jayapal  wrote:

> Hi All,
>
> I have installed dpdk on VM and would like to know how to capture the
> packets on dpdk ports.
> I am sending traffic from host  and want to know how to confirm whether
> the packets are flowing via dpdk ports.
> I tried with tcpdump and wireshark but could not capture the packets
> inside VM.
> setup : bridge1(Host)--- VM(Guest with DPDK) - bridge2(Host)
>
> Please suggest.
>
> Thanks,
> Raja
>


[dpdk-dev] backtracing from within the code

2016-06-27 Thread Matt Laswell
I've done something similar to what's described in the link below.  But
it's worth pointing out that it's using printf() inside a signal handler,
which isn't safe. If your use case is catching SIGSEGV, for example,
solutions built on printf() will usually work, but can deadlock.  One way
around the problem is to call write() directly, passing it stdout's file
handle.

For example, I have this in my code:
#define WRITE_STRING(fd, s) write((fd), (s), strlen(s))

In my signal handlers, I use the above like this:
WRITE_STRING(STDOUT_FILENO, "Stack trace:\n");

This approach is a little bit more cumbersome to code, but safer.

The last time I looked, DPDK's rte_dump_stack() used vfprintf(), which
isn't safe in a signal handler.  However, it's been several DPDK releases
since I peeked at the details.

--
Matt Laswell
Principal Software Engineer
infinite io, inc.
laswell at infinite.io


On Sat, Jun 25, 2016 at 9:07 AM, Rosen, Rami  wrote:

> Hi,
> If you are willing to skip static methods and use the GCC backtrace, you
> can
> try this example (it worked for me, but it was quite a time ago):
> http://www.helicontech.co.il/?id=linuxbt
>
> Regards,
> Rami Rosen
> Intel Corporation
>
> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen Hemminger
> Sent: Friday, June 24, 2016 8:46 PM
> To: Thomas Monjalon 
> Cc: Catalin Vasile ; dev at dpdk.org; Dumitrescu,
> Cristian 
> Subject: Re: [dpdk-dev] backtracing from within the code
>
> On Fri, 24 Jun 2016 12:05:26 +0200
> Thomas Monjalon  wrote:
>
> > 2016-06-24 09:25, Dumitrescu, Cristian:
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Catalin Vasile
> > > > I'm trying to add a feature to DPDK and I'm having a hard time
> printing a
> > > > backtrace.
> > > > I tried using this[1] functions for printing, but it does not print
> more than one
> > > > function. Maybe it lacks the symbols it needs.
> > [...]
> > > It eventually calls rte_dump_stack() in file
> lib/librte_eal/linuxapp/eal/eal_debug.c, which calls backtrace(), which is
> probably what you are looking for.
> >
> > Example:
> > 5: [build/app/testpmd(_start+0x29) [0x416f69]]
> > 4: [/usr/lib/libc.so.6(__libc_start_main+0xf0) [0x7eff3b757610]]
> > 3: [build/app/testpmd(main+0x2ff) [0x416b3f]]
> > 2: [build/app/testpmd(init_port_config+0x88) [0x419a78]]
> > 1: [build/lib/librte_eal.so.2.1(rte_dump_stack+0x18) [0x7eff3c126488]]
> >
> > Please tell us if you have some cases where rte_dump_stack() does not
> work.
> > I do not remember what are the constraints to have it working.
> > Your binary is not stripped?
>
> The GCC backtrace doesn't work well because it can't find static functions.
> I ended up using libunwind to get a better back trace.
>


[dpdk-dev] Kernel panic in KNI

2016-04-07 Thread Matt Laswell
Hey Robert,

Thanks for the insight.  I work with Jay on the code he's asking about; we
only have one mbuf pool that we use for all packets.  Mostly, this is for
the reasons that you describe, as well as for the sake of simplicity.  As
it happens, the stack trace we're seeing makes it look as though either the
mbuf's data pointer is screwed up, or the VA translation done on it is.  I
suspect that we're getting to a failure mode similar to the one you
experienced, though perhaps for different reasons.

Thanks,
Matt

On Wed, Apr 6, 2016 at 5:30 PM, Sanford, Robert  wrote:

> Hi Jay,
>
> I won't try to interpret your kernel stack trace. But, I'll tell you about
> a KNI-related problem that we once experienced, and the symptom was a
> kernel hang.
>
> The problem was that we were passing mbufs allocated out of one mempool,
> to a KNI context that we had set up with a different mempool (on a
> different CPU socket). The KNI kernel driver converts the user-space mbuf
> virtual address (VA) to a kernel VA by adding the difference between the
> user and kernel VAs of the mempool used to create the KNI context. So, if
> an mbuf comes from a different mempool, the calculated address will
> probably be VERY BAD.
>
> Could this be your problem?
>
> --
> Robert
>
>
> On 4/6/16 4:16 PM, "Jay Rolette"  wrote:
>
> >I had a system lockup hard a couple of days ago and all we were able to
> >get
> >was a photo of the LCD monitor with most of the kernel panic on it. No way
> >to scroll back the buffer and nothing in the logs after we rebooted. Not
> >surprising with a kernel panic due to an exception during interrupt
> >processing. We have a serial console attached in case we are able to get
> >it
> >to happen again, but it's not easy to reproduce (hours of runtime for this
> >instance).
> >
> >Ran the photo through OCR software to get a text version of the dump, so
> >possible I missed some fixups in this:
> >
> >[39178.433262] RDX: 00ba RSI: 881fd2f350ee RDI:
> >a12520669126180a
> >[39178.464020] RBP: 880433966970 R08: a12520669126180a R09:
> >881fd2f35000
> >[39178.495091] R10:  R11: 881fd2f88000 R12:
> >883fd1a75ee8
> >[39178.526594] R13: 00ba R14: 7fdad5a66780 R15:
> >883715ab6780
> >[39178.559011] FS:  77fea740() GS:881fffc0()
> >knlGS:
> >[39178.592005] CS:  0010 DS:  ES:  CR0: 80050033
> >[39178.623931] CR2: 77ea2000 CR3: 001fd156f000 CR4:
> >001407f0
> >[39178.656187] Stack:
> >[39178.689025] c067c7ef 00ba 00ba
> >881fd2f88000
> >[39178.722682] 4000 883fd0bbd09c 883fd1a75ee8
> >8804339bb9c8
> >[39178.756525] 81658456 881fcd2ec40c c0680700
> >880436bad800
> >[39178.790577] Call Trace:
> >[39178.824420] [] ? kni_net_tx+0xef/0x1a0 [rte_kni]
> >[39178.859190] [] dev_hard_start_xmit+0x316/0x5c0
> >[39178.893426] [] sch_direct_xmit+0xee/0x1c0
> >[39178.927435] [] __dev_queue_xmit+0x200/0x4d0
> >[39178.961684] [] dev_queue_xmit+0x10/0x20
> >[39178.996194] [] neigh_connected_output+0x67/0x100
> >[39179.031098] [] ip_finish_output+0x1d8/0x850
> >[39179.066709] [] ip_output+0x58/0x90
> >[39179.101551] [] ip_local_out_sk+0x30/0x40
> >[39179.136823] [] ip_queue_xmit+0x13f/0x3d0
> >[39179.171742] [] tcp_transmit_skb+0x47c/0x900
> >[39179.206854] [] tcp_write_xmit+0x110/0xcb0
> >[39179.242335] [] __tcp_push_pending_frames+0x2e/0xc0
> >[39179.277632] [] tcp_push+0xec/0x120
> >[39179.311768] [] tcp_sendmsg+0xb9/0xce0
> >[39179.346934] [] ? tcp_recvmsg+0x6e2/0xba0
> >[39179.385586] [] inet_sendmsg+0x64/0x60
> >[39179.424228] [] ? apparmor_socket_sendmsg+0x21/0x30
> >[39179.458658] [] sock_sendmsg+0x86/0xc0
> >[39179.493220] [] ? __inet_stream_connect+0xa5/0x320
> >[39179.528033] [] ? __fdget+0x13/0x20
> >[39179.561214] [] SYSC_sendto+0x121/0x1c0
> >[39179.594665] [] ? aa_sk_perm.isra.4+0x6d/0x150
> >[39179.626893] [] ? read_tsc+0x9/0x20
> >[39179.658654] [] ? ktime_get_ts+0x48/0xe0
> >[39179.689944] [] SyS_sendto+0xe/0x10
> >[39179.719575] [] system_call_fastpath+0x1a/0x1f
> >[39179.748760] Code: 43 58 48 2b 43 50 88 43 4e 5b 5d c3 66 0f 1f 84 00 00
> >00 00 00 e8 fb fb ff ff eb e2 90 90 90 90 90 90 90
> > 90 48 89 f8 48 89 d1  a4 c3 03 83 e2 07 f3 48 .15 89 d1 f3 a4 c3 20
> >4c
> >8b % 4c 86
> >[39179.808690] RIP  [] memcpy+0x6/0x110
> >[39179.837238]  RSP 
> >[39179.933755] ---[ end trace 2971562f425e2cf8 ]---
> >[39179.964856] Kernel panic - not syncing: Fatal exception in interrupt
> >[39179.992896] Kernel Offset: 0x0 from 0x8100 (relocation
> >range: 0x8000-0xbfff)
> >[39180.024617] ---[ end Kernel panic - not syncing: Fatal exception in
> >interrupt
> >
> >It blew up when kni_net_tx() called memcpy() to copy data from the skb to
> >an mbuf.
> >
> >Disclosure: I'm not a Linux device driver guy. I dip into the kernel as
> >needed. Plenty of experience doing RTOS and 

[dpdk-dev] How to approach packet TX lockups

2015-11-17 Thread Matt Laswell
Thanks, I'll give that a try.

In my environment, I'm pretty sure we're using the fully-featured
ixgbe_xmit_pkts() and not _simple().   If setting rs_thresh=1 is safer,
I'll stick with that.

Again, thanks to all for the assistance.

- Matt

On Tue, Nov 17, 2015 at 10:20 AM, Ananyev, Konstantin <
konstantin.ananyev at intel.com> wrote:

> Hi Matt,
>
>
>
> As I said, at least try to upgrade the contents of the shared code to the
> latest one.
>
> In previous releases: lib/librte_pmd_ixgbe/ixgbe, now located at:
> drivers/net/ixgbe/.
>
>
>
> > For reference, my transmit function is  rte_eth_tx_burst().
>
> I meant which ixgbe TX function it points to: ixgbe_xmit_pkts() or
> ixgbe_xmit_pkts_simple()?
>
> For ixgbe_xmit_pkts_simple(), don't set tx_rs_thresh > 32;
>
> for ixgbe_xmit_pkts(), the safest way is to set tx_rs_thresh=1.
>
> Though, as I understand from your previous mails, you already did that and
> it didn't help.
>
> Konstantin
>
>
>
>
>
> *From:* Matt Laswell [mailto:laswell at infiniteio.com]
> *Sent:* Tuesday, November 17, 2015 3:05 PM
> *To:* Ananyev, Konstantin
> *Cc:* Stephen Hemminger; dev at dpdk.org
>
> *Subject:* Re: [dpdk-dev] How to approach packet TX lockups
>
>
>
> Hey Konstantin,
>
>
>
> Moving from 1.6r2 to 2.2 is going to be a pretty significant change due to
> things like changes in the MBuf format, API differences, etc.  Even as an
> experiment, that's an awfully large change to absorb.  Is there a subset
> that you're referring to that could be more readily included without
> modifying so many touch points into DPDK?
>
>
>
> For reference, my transmit function is  rte_eth_tx_burst().  It seems to
> reliably tell me that it has enqueued all of the packets that I gave it,
> however the stats from rte_eth_stats_get() indicate that no packets are
> actually being sent.
>
>
>
> Thanks,
>
>
>
> - Matt
>
>
>
> On Tue, Nov 17, 2015 at 8:44 AM, Ananyev, Konstantin <
> konstantin.ananyev at intel.com> wrote:
>
>
>
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Matt Laswell
> > Sent: Tuesday, November 17, 2015 2:24 PM
> > To: Stephen Hemminger
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] How to approach packet TX lockups
> >
> > Yes, we're on 1.6r2.  That said, I've tried a number of different values
> > for the thresholds without a lot of luck.  Setting wthresh/hthresh/
> pthresh
> > to 0/0/32 or 0/0/0 doesn't appear to fix things.  And, as Matthew
> > suggested, I'm pretty sure using 0 for the thresholds leads to auto-
> config
> > by the driver.  I also tried 1/1/32, which required that I also change
> the
> > rs_thresh value from 0 to 1 to work around a panic in PMD initialization
> > ("TX WTHRESH must be set to 0 if tx_rs_thresh is greater than 1").
> >
> > Any other suggestions?
>
> That's not the only DPDK code that has changed since 1.6.
> I am pretty sure that we have also had a new update of the shared code
> since then (and, as I remember, probably more than one).
> One suggestion would be to at least try to upgrade the shared code to the
> latest.
> Another one - even if you can't upgrade to 2.2 in your production
> environment,
> it is probably worth doing so in some test environment and then checking
> whether the problem persists.
> If yes, then we'll need some guidance on how to reproduce it.
>
> Another question: which TX function do you use?
> Konstantin
>
>
> >
> > On Mon, Nov 16, 2015 at 7:31 PM, Stephen Hemminger <
> > stephen at networkplumber.org> wrote:
> >
> > > On Mon, 16 Nov 2015 18:49:15 -0600
> > > Matt Laswell  wrote:
> > >
> > > > Hey Stephen,
> > > >
> > > > Thanks a lot; that's really useful information.  Unfortunately, I'm
> at a
> > > > stage in our release cycle where upgrading to a new version of DPDK
> isn't
> > > > feasible.  Any chance you (or others reading this) has a pointer to
> the
> > > > relevant changes?  While I can't afford to upgrade DPDK entirely,
> > > > backporting targeted fixes is more doable.
> > > >
> > > > Again, thanks.
> > > >
> > > > - Matt
> > > >
> > > >
> > > > On Mon, Nov 16, 2015 at 6:12 PM, Stephen Hemminger <
> > > > stephen at networkplumber.org> wrote:
> > > >
> > > > > On Mon, 16 Nov 2015 17:48:35 -0600
> > > > > Matt Laswell  wrote:
> > > > >
> > > > > > Hey Folks,
> > > > >

[dpdk-dev] How to approach packet TX lockups

2015-11-17 Thread Matt Laswell
Hey Konstantin,

Moving from 1.6r2 to 2.2 is going to be a pretty significant change due to
things like changes in the MBuf format, API differences, etc.  Even as an
experiment, that's an awfully large change to absorb.  Is there a subset
that you're referring to that could be more readily included without
modifying so many touch points into DPDK?

For reference, my transmit function is  rte_eth_tx_burst().  It seems to
reliably tell me that it has enqueued all of the packets that I gave it,
however the stats from rte_eth_stats_get() indicate that no packets are
actually being sent.

Thanks,

- Matt

On Tue, Nov 17, 2015 at 8:44 AM, Ananyev, Konstantin <
konstantin.ananyev at intel.com> wrote:

>
>
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Matt Laswell
> > Sent: Tuesday, November 17, 2015 2:24 PM
> > To: Stephen Hemminger
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] How to approach packet TX lockups
> >
> > Yes, we're on 1.6r2.  That said, I've tried a number of different values
> > for the thresholds without a lot of luck.  Setting
> wthresh/hthresh/pthresh
> > to 0/0/32 or 0/0/0 doesn't appear to fix things.  And, as Matthew
> > suggested, I'm pretty sure using 0 for the thresholds leads to
> auto-config
> > by the driver.  I also tried 1/1/32, which required that I also change
> the
> > rs_thresh value from 0 to 1 to work around a panic in PMD initialization
> > ("TX WTHRESH must be set to 0 if tx_rs_thresh is greater than 1").
> >
> > Any other suggestions?
>
> That's not the only DPDK code that has changed since 1.6.
> I am pretty sure that we have also had a new update of the shared code
> since then (and, as I remember, probably more than one).
> One suggestion would be to at least try to upgrade the shared code to the
> latest.
> Another one - even if you can't upgrade to 2.2 in your production
> environment,
> it is probably worth doing so in some test environment and then checking
> whether the problem persists.
> If yes, then we'll need some guidance on how to reproduce it.
>
> Another question: which TX function do you use?
> Konstantin
>
> >
> > On Mon, Nov 16, 2015 at 7:31 PM, Stephen Hemminger <
> > stephen at networkplumber.org> wrote:
> >
> > > On Mon, 16 Nov 2015 18:49:15 -0600
> > > Matt Laswell  wrote:
> > >
> > > > Hey Stephen,
> > > >
> > > > Thanks a lot; that's really useful information.  Unfortunately, I'm
> at a
> > > > stage in our release cycle where upgrading to a new version of DPDK
> isn't
> > > > feasible.  Any chance you (or others reading this) has a pointer to
> the
> > > > relevant changes?  While I can't afford to upgrade DPDK entirely,
> > > > backporting targeted fixes is more doable.
> > > >
> > > > Again, thanks.
> > > >
> > > > - Matt
> > > >
> > > >
> > > > On Mon, Nov 16, 2015 at 6:12 PM, Stephen Hemminger <
> > > > stephen at networkplumber.org> wrote:
> > > >
> > > > > On Mon, 16 Nov 2015 17:48:35 -0600
> > > > > Matt Laswell  wrote:
> > > > >
> > > > > > Hey Folks,
> > > > > >
> > > > > > I sent this to the users email list, but I'm not sure how many
> > > people are
> > > > > > actively reading that list at this point.  I'm dealing with a
> > > situation
> > > > > in
> > > > > > which my application loses the ability to transmit packets out
> of a
> > > port
> > > > > > during times of moderate stress.  I'd love to hear suggestions
> for
> > > how to
> > > > > > approach this problem, as I'm a bit at a loss at the moment.
> > > > > >
> > > > > > Specifically, I'm using DPDK 1.6r2 running on Ubuntu 14.04LTS on
> > > Haswell
> > > > > > processors.  I'm using the 82599 controller, configured to spread
> > > packets
> > > > > > across multiple queues.  Each queue is accessed by a different
> lcore
> > > in
> > > > > my
> > > > > > application; there is therefore concurrent access to the
> controller,
> > > but
> > > > > > not to any of the queues.  We're binding the ports to the igb_uio
> > > driver.
> > > > > > The symptoms I see are these:
> > > > > >
> > > > > >
> > > > > >- All transmit out of a particular port 

[dpdk-dev] How to approach packet TX lockups

2015-11-17 Thread Matt Laswell
Yes, we're on 1.6r2.  That said, I've tried a number of different values
for the thresholds without a lot of luck.  Setting wthresh/hthresh/pthresh
to 0/0/32 or 0/0/0 doesn't appear to fix things.  And, as Matthew
suggested, I'm pretty sure using 0 for the thresholds leads to auto-config
by the driver.  I also tried 1/1/32, which required that I also change the
rs_thresh value from 0 to 1 to work around a panic in PMD initialization
("TX WTHRESH must be set to 0 if tx_rs_thresh is greater than 1").

Any other suggestions?

On Mon, Nov 16, 2015 at 7:31 PM, Stephen Hemminger <
stephen at networkplumber.org> wrote:

> On Mon, 16 Nov 2015 18:49:15 -0600
> Matt Laswell  wrote:
>
> > Hey Stephen,
> >
> > Thanks a lot; that's really useful information.  Unfortunately, I'm at a
> > stage in our release cycle where upgrading to a new version of DPDK isn't
> > feasible.  Any chance you (or others reading this) has a pointer to the
> > relevant changes?  While I can't afford to upgrade DPDK entirely,
> > backporting targeted fixes is more doable.
> >
> > Again, thanks.
> >
> > - Matt
> >
> >
> > On Mon, Nov 16, 2015 at 6:12 PM, Stephen Hemminger <
> > stephen at networkplumber.org> wrote:
> >
> > > On Mon, 16 Nov 2015 17:48:35 -0600
> > > Matt Laswell  wrote:
> > >
> > > > Hey Folks,
> > > >
> > > > I sent this to the users email list, but I'm not sure how many
> people are
> > > > actively reading that list at this point.  I'm dealing with a
> situation
> > > in
> > > > which my application loses the ability to transmit packets out of a
> port
> > > > during times of moderate stress.  I'd love to hear suggestions for
> how to
> > > > approach this problem, as I'm a bit at a loss at the moment.
> > > >
> > > > Specifically, I'm using DPDK 1.6r2 running on Ubuntu 14.04LTS on
> Haswell
> > > > processors.  I'm using the 82599 controller, configured to spread
> packets
> > > > across multiple queues.  Each queue is accessed by a different lcore
> in
> > > my
> > > > application; there is therefore concurrent access to the controller,
> but
> > > > not to any of the queues.  We're binding the ports to the igb_uio
> driver.
> > > > The symptoms I see are these:
> > > >
> > > >
> > > >- All transmit out of a particular port stops
> > > >- rte_eth_tx_burst() indicates that it is sending all of the
> packets
> > > >that I give to it
> > > >- rte_eth_stats_get() gives me stats indicating that no packets
> are
> > > >being sent on the affected port.  Also, no tx errors, and no pause
> > > frames
> > > >sent or received (opackets = 0, obytes = 0, oerrors = 0, etc.)
> > > >- All other ports continue to work normally
> > > >- The affected port continues to receive packets without problems;
> > > only
> > > >TX is affected
> > > >- Resetting the port via rte_eth_dev_stop() and
> rte_eth_dev_start()
> > > >restores things and packets can flow again
> > > >- The problem is replicable on multiple devices, and doesn't
> follow
> > > one
> > > >particular port
> > > >
> > > > I've tried calling rte_mbuf_sanity_check() on all packets before
> sending
> > > > them.  I've also instrumented my code to look for packets that have
> > > already
> > > > been sent or freed, as well as cycles in chained packets being
> sent.  I
> > > > also put a lock around all accesses to rte_eth* calls to synchronize
> > > access
> > > > to the NIC.  Given some recent discussion here, I also tried
> changing the
> > > > TX RS threshold from 0 to 32, 16, and 1.  None of these strategies
> proved
> > > > effective.
> > > >
> > > > Like I said at the top, I'm a little at a loss at this point.  If you
> > > were
> > > > dealing with this set of symptoms, how would you proceed?
> > > >
> > >
> > > I remember some issues with old DPDK 1.6 with some of the prefetch
> > > thresholds on 82599. You would be better off going to a later DPDK
> > > version.
> > >
>
> I hope you are on 1.6.0r2 at least??
>
> With older DPDK there was no way to get driver to tell you what the
> preferred settings were for pthresh/hthresh/wthresh. And the values
> in Intel sample applications were broken on some hardware.
>
> I remember reverse engineering the safe values from reading the Linux
> driver.
>
> The Linux driver is much better tested than the DPDK one...
> In the Linux driver, the Transmit Descriptor Controller (txdctl)
> is fixed at (for transmit)
>wthresh = 1
>hthresh = 1
>pthresh = 32
>
> The DPDK 2.2 driver uses:
> wthresh = 0
> hthresh = 0
> pthresh = 32
>
>
>
>
>
>
>


[dpdk-dev] How to approach packet TX lockups

2015-11-16 Thread Matt Laswell
Hey Stephen,

Thanks a lot; that's really useful information.  Unfortunately, I'm at a
stage in our release cycle where upgrading to a new version of DPDK isn't
feasible.  Any chance you (or others reading this) has a pointer to the
relevant changes?  While I can't afford to upgrade DPDK entirely,
backporting targeted fixes is more doable.

Again, thanks.

- Matt


On Mon, Nov 16, 2015 at 6:12 PM, Stephen Hemminger <
stephen at networkplumber.org> wrote:

> On Mon, 16 Nov 2015 17:48:35 -0600
> Matt Laswell  wrote:
>
> > Hey Folks,
> >
> > I sent this to the users email list, but I'm not sure how many people are
> > actively reading that list at this point.  I'm dealing with a situation
> in
> > which my application loses the ability to transmit packets out of a port
> > during times of moderate stress.  I'd love to hear suggestions for how to
> > approach this problem, as I'm a bit at a loss at the moment.
> >
> > Specifically, I'm using DPDK 1.6r2 running on Ubuntu 14.04LTS on Haswell
> > processors.  I'm using the 82599 controller, configured to spread packets
> > across multiple queues.  Each queue is accessed by a different lcore in
> my
> > application; there is therefore concurrent access to the controller, but
> > not to any of the queues.  We're binding the ports to the igb_uio driver.
> > The symptoms I see are these:
> >
> >
> >- All transmit out of a particular port stops
> >- rte_eth_tx_burst() indicates that it is sending all of the packets
> >that I give to it
> >- rte_eth_stats_get() gives me stats indicating that no packets are
> >being sent on the affected port.  Also, no tx errors, and no pause
> frames
> >sent or received (opackets = 0, obytes = 0, oerrors = 0, etc.)
> >- All other ports continue to work normally
> >- The affected port continues to receive packets without problems;
> only
> >TX is affected
> >- Resetting the port via rte_eth_dev_stop() and rte_eth_dev_start()
> >restores things and packets can flow again
> >- The problem is replicable on multiple devices, and doesn't follow
> one
> >particular port
> >
> > I've tried calling rte_mbuf_sanity_check() on all packets before sending
> > them.  I've also instrumented my code to look for packets that have
> already
> > been sent or freed, as well as cycles in chained packets being sent.  I
> > also put a lock around all accesses to rte_eth* calls to synchronize
> access
> > to the NIC.  Given some recent discussion here, I also tried changing the
> > TX RS threshold from 0 to 32, 16, and 1.  None of these strategies proved
> > effective.
> >
> > Like I said at the top, I'm a little at a loss at this point.  If you
> were
> > dealing with this set of symptoms, how would you proceed?
> >
>
> I remember some issues with old DPDK 1.6 with some of the prefetch
> thresholds on 82599. You would be better off going to a later DPDK
> version.
>


[dpdk-dev] How to approach packet TX lockups

2015-11-16 Thread Matt Laswell
Hey Folks,

I sent this to the users email list, but I'm not sure how many people are
actively reading that list at this point.  I'm dealing with a situation in
which my application loses the ability to transmit packets out of a port
during times of moderate stress.  I'd love to hear suggestions for how to
approach this problem, as I'm a bit at a loss at the moment.

Specifically, I'm using DPDK 1.6r2 running on Ubuntu 14.04LTS on Haswell
processors.  I'm using the 82599 controller, configured to spread packets
across multiple queues.  Each queue is accessed by a different lcore in my
application; there is therefore concurrent access to the controller, but
not to any of the queues.  We're binding the ports to the igb_uio driver.
The symptoms I see are these:


   - All transmit out of a particular port stops
   - rte_eth_tx_burst() indicates that it is sending all of the packets
   that I give to it
   - rte_eth_stats_get() gives me stats indicating that no packets are
   being sent on the affected port.  Also, no tx errors, and no pause frames
   sent or received (opackets = 0, obytes = 0, oerrors = 0, etc.)
   - All other ports continue to work normally
   - The affected port continues to receive packets without problems; only
   TX is affected
   - Resetting the port via rte_eth_dev_stop() and rte_eth_dev_start()
   restores things and packets can flow again
   - The problem is replicable on multiple devices, and doesn't follow one
   particular port

I've tried calling rte_mbuf_sanity_check() on all packets before sending
them.  I've also instrumented my code to look for packets that have already
been sent or freed, as well as cycles in chained packets being sent.  I
also put a lock around all accesses to rte_eth* calls to synchronize access
to the NIC.  Given some recent discussion here, I also tried changing the
TX RS threshold from 0 to 32, 16, and 1.  None of these strategies proved
effective.

Like I said at the top, I'm a little at a loss at this point.  If you were
dealing with this set of symptoms, how would you proceed?

Thanks in advance.

--
Matt Laswell
infinite io, inc.
laswell at infiniteio.com


[dpdk-dev] DPDK Port Mirroring

2015-07-09 Thread Matt Laswell
Keith speaks truth.  If I were going to do what you're describing, I would
do the following:

1. Start with the l2fwd example application.
2. Remove the part where it modifies the Ethernet MAC address of received
packets.
3. Add a call to clone mbufs via rte_pktmbuf_clone() and send the cloned
packets out of the port of your choice.

As long as you don't need to modify the packets - and if you're mirroring,
you shouldn't - simply cloning received packets and sending them out your
mirror port should get you most of the way there.

On Thu, Jul 9, 2015 at 3:17 PM, Wiles, Keith  wrote:

>
>
> On 7/9/15, 12:26 PM, "dev on behalf of Assaad, Sami (Sami)"
>  
> wrote:
>
> >Hello,
> >
> >I want to build a DPDK app that is able to port-mirror all ingress
> >traffic from two 10G interfaces.
> >
> >1.   Is it possible in port-mirroring traffic consisting of 450byte
> >packets at 20G without losing more than 5% of traffic?
> >
> >2.   Would you have any performance results due to packet copying?
>
> Do you need to copy the packet? If you increment the reference count, you
> can send the packet to both ports without having to copy it.
> >
> >3.   Would you have any port mirroring DPDK sample code?
>
> DPDK does not have a port mirroring example, but you could grab the l2fwd or
> l3fwd and modify it to do what you want.
> >
> >Thanks in advance.
> >
> >Best Regards,
> >Sami Assaad.
>
>


[dpdk-dev] Packet Cloning

2015-05-28 Thread Matt Laswell
Hey Kyle,

That's one way you can handle it, though I suspect you'll end up with some
complexity elsewhere in your code to deal with remembering whether you
should look at the original data or the copied and modified data.  Another
way is just to make a copy of the original mbuf, but have your copy API
stop after it reaches some particular point.  Perhaps just the L2-L4
headers, perhaps a few hundred bytes into payload, or perhaps something
else entirely. This all gets very application dependent, of course.  How
much is "enough" is going to depend heavily on what you're trying to
accomplish.

-- 
Matt Laswell
infinite io, inc.
laswell at infiniteio.com


On Thu, May 28, 2015 at 10:38 AM, Kyle Larose  wrote:

> I'm fairly new to dpdk, so I may be completely out to lunch on this, but
> here's an idea to possibly improve performance compared to a straight copy
> of the entire packet. If this idea makes sense, perhaps it could be added
> to the mbuf library as an extension of the clone functionality?
>
> If you are only modifying the headers (say the Ethernet header), is it
> possible to make a copy of only the first N bytes (say 32 bytes)?
>
> For example, you make two new "main" mbufs, which contain duplicate
> metadata, and a copy of the first 32 bytes of the packet. Call them A and
> B. Have both A and B chain to the original mbuf (call it O), which is
> reference counted as with the normal clone functionality. Then, you adjust
> the O such that its start data is 32 bytes into the packet.
>
> When you transmit A, it will send its own copy of the 32 bytes, plus the
> unaltered remaining data contained in O. A will be freed, and the refcount
> of O decremented. When you transmit B, it will work the same as with the
> previous one, except that when the refcount on O is decremented, it reaches
> zero and it is freed as well.
>
> I'm not sure if this makes sense in all cases (for example, maybe it's
> just faster to allocate separate mbufs for 64-byte packets). Perhaps that
> could also be handled transparently underneath the hood.
>
> Thoughts?
>
> Thanks,
>
> Kyle
>
> On Thu, May 28, 2015 at 11:10 AM, Matt Laswell 
> wrote:
>
>> Since Padam is going to be altering payload, he likely cannot use that
>> API.
>> The rte_pktmbuf_clone() API doesn't make a copy of the payload.  Instead,
>> it gives you a second mbuf whose payload pointer points back to the
>> contents of the first (and also increments the reference counter on the
>> first so that it isn't actually freed until all clones are accounted for).
>> This is very fast, which is good.  However, since there's only really one
>> buffer full of payload, changes in the original also affect the clone and
>> vice versa.  This can have surprising and unpleasant side effects that may
>> not show up until you are under load, which is awesome*.
>>
>> For what it's worth, if you need to be able to modify the copy while
>> leaving the original alone, I don't believe that there's a good solution
>> within DPDK.   However, writing your own API to copy rather than clone a
>> packet mbuf isn't difficult.
>>
>> --
>> Matt Laswell
>> infinite io, inc.
>> laswell at infiniteio.com
>>
>> * Don't ask me how I know how much awesome fun this can be, though I
>> suspect you can guess.
>>
>> On Thu, May 28, 2015 at 9:52 AM, Stephen Hemminger <
>> stephen at networkplumber.org> wrote:
>>
>> > On Thu, 28 May 2015 17:15:42 +0530
>> > Padam Jeet Singh  wrote:
>> >
>> > > Hello,
>> > >
>> > > Is there a function in DPDK to completely clone a pkt_mbuf including
>> the
>> > segments?
>> > >
>> > > I am trying to build a packet mirroring application which sends packet
>> > out through two separate interfaces, but the packet payload needs to be
>> > altered before send.
>> > >
>> > > Thanks,
>> > > Padam
>> > >
>> > >
>> >
>> > Isn't this what you want?
>> >
>> > /**
>> >  * Creates a "clone" of the given packet mbuf.
>> >  *
>> >  * Walks through all segments of the given packet mbuf, and for each of
>> > them:
>> >  *  - Creates a new packet mbuf from the given pool.
>> >  *  - Attaches newly created mbuf to the segment.
>> >  * Then updates pkt_len and nb_segs of the "clone" packet mbuf to match
>> > values
>> >  * from the original packet mbuf.
>> >  *
>> >  * @param md
>> >  *   The packet mbuf to be cloned.
>> >  * @param mp
>> >  *   The mempool from which the "clone" mbufs are allocated.
>> >  * @return
>> >  *   - The pointer to the new "clone" mbuf on success.
>> >  *   - NULL if allocation fails.
>> >  */
>> > static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md,
>> > struct rte_mempool *mp)
>> >
>>
>
>


[dpdk-dev] Packet Cloning

2015-05-28 Thread Matt Laswell
Since Padam is going to be altering payload, he likely cannot use that API.
The rte_pktmbuf_clone() API doesn't make a copy of the payload.  Instead,
it gives you a second mbuf whose payload pointer points back to the
contents of the first (and also increments the reference counter on the
first so that it isn't actually freed until all clones are accounted for).
This is very fast, which is good.  However, since there's only really one
buffer full of payload, changes in the original also affect the clone and
vice versa.  This can have surprising and unpleasant side effects that may
not show up until you are under load, which is awesome*.

For what it's worth, if you need to be able to modify the copy while
leaving the original alone, I don't believe that there's a good solution
within DPDK.   However, writing your own API to copy rather than clone a
packet mbuf isn't difficult.
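The difference is easy to demonstrate with a toy reference-counted buffer — plain C standing in for mbufs, none of it the DPDK API:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of clone vs. copy semantics. A "clone" shares the payload
 * and bumps a refcount (like rte_pktmbuf_clone); a "copy" gets its own
 * payload. */
struct buf {
    uint8_t *payload;
    size_t   len;
    int     *refcnt;   /* shared among all clones of one payload */
};

static struct buf make_buf(const char *data, size_t len)
{
    struct buf b;
    b.len = len;
    b.payload = malloc(len);
    memcpy(b.payload, data, len);
    b.refcnt = malloc(sizeof(int));
    *b.refcnt = 1;
    return b;
}

static struct buf clone_buf(struct buf *o)       /* shares payload */
{
    struct buf b = *o;
    (*b.refcnt)++;
    return b;
}

static struct buf copy_buf(const struct buf *o)  /* independent payload */
{
    return make_buf((const char *)o->payload, o->len);
}
```

Writing through a clone is visible through the original — exactly the "surprising under load" behavior described above — while a copy stays independent.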

-- 
Matt Laswell
infinite io, inc.
laswell at infiniteio.com

* Don't ask me how I know how much awesome fun this can be, though I
suspect you can guess.

On Thu, May 28, 2015 at 9:52 AM, Stephen Hemminger <
stephen at networkplumber.org> wrote:

> On Thu, 28 May 2015 17:15:42 +0530
> Padam Jeet Singh  wrote:
>
> > Hello,
> >
> > Is there a function in DPDK to completely clone a pkt_mbuf including the
> segments?
> >
> > I am trying to build a packet mirroring application which sends packet
> out through two separate interfaces, but the packet payload needs to be
> altered before send.
> >
> > Thanks,
> > Padam
> >
> >
>
> Isn't this what you want?
>
> /**
>  * Creates a "clone" of the given packet mbuf.
>  *
>  * Walks through all segments of the given packet mbuf, and for each of
> them:
>  *  - Creates a new packet mbuf from the given pool.
>  *  - Attaches newly created mbuf to the segment.
>  * Then updates pkt_len and nb_segs of the "clone" packet mbuf to match
> values
>  * from the original packet mbuf.
>  *
>  * @param md
>  *   The packet mbuf to be cloned.
>  * @param mp
>  *   The mempool from which the "clone" mbufs are allocated.
>  * @return
>  *   - The pointer to the new "clone" mbuf on success.
>  *   - NULL if allocation fails.
>  */
> static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md,
> struct rte_mempool *mp)
>


[dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.

2015-05-08 Thread Matt Laswell
On Fri, May 8, 2015 at 5:54 PM, Ravi Kerur  wrote:

>
>
> On Fri, May 8, 2015 at 3:29 PM, Matt Laswell 
> wrote:
>
>>
>>
>> On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur  wrote:
>>
>>> This patch replaces memcmp in librte_hash with rte_memcmp which is
>>> implemented with AVX/SSE instructions.
>>>
>>> +static inline int
>>> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
>>> +{
>>> +   const uint8_t *src_1 = (const uint8_t *)_src_1;
>>> +   const uint8_t *src_2 = (const uint8_t *)_src_2;
>>> +   int ret = 0;
>>> +
>>> +   if (n & 0x80)
>>> +   return rte_cmp128(src_1, src_2);
>>> +
>>> +   if (n & 0x40)
>>> +   return rte_cmp64(src_1, src_2);
>>> +
>>> +   if (n & 0x20) {
>>> +   ret = rte_cmp32(src_1, src_2);
>>> +   n -= 0x20;
>>> +   src_1 += 0x20;
>>> +   src_2 += 0x20;
>>> +   }
>>>
>>>
>> Pardon me for butting in, but this seems incorrect for the first two
>> cases listed above, as the function as written will only compare the first
>> 128 or 64 bytes of each source and return the result.  The pattern
>> expressed in the 32 byte case appears more correct, as it compares the
>> first 32 bytes and then lets later pieces of the function handle the
>> smaller remaining bits of the sources. Also, if this function is to handle
>> arbitrarily large source data, the 128 byte case needs to be in a loop.
>>
>> What am I missing?
>>
>
> Current max hash key length supported is 64 bytes, hence no comparison is
> done after 64 bytes. 128 bytes comparison is added to measure performance
> only and there is no use-case as of now. With the current use-cases its not
> required but if there is a need to handle large arbitrary data upto 128
> bytes it can be modified.
>

Ah, gotcha.  I misunderstood and thought that this was meant to be a
generic AVX/SSE enabled memcmp() replacement, and that the use of it in
rte_hash was meant merely as a test case.   If it's more limited than that,
carry on, though you might want to make a note of it in the documentation.
I suspect others will misinterpret the name as I did.

--
Matt Laswell
infinite io, inc.
laswell at infiniteio.com


[dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.

2015-05-08 Thread Matt Laswell
On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur  wrote:

> This patch replaces memcmp in librte_hash with rte_memcmp which is
> implemented with AVX/SSE instructions.
>
> +static inline int
> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> +{
> +   const uint8_t *src_1 = (const uint8_t *)_src_1;
> +   const uint8_t *src_2 = (const uint8_t *)_src_2;
> +   int ret = 0;
> +
> +   if (n & 0x80)
> +   return rte_cmp128(src_1, src_2);
> +
> +   if (n & 0x40)
> +   return rte_cmp64(src_1, src_2);
> +
> +   if (n & 0x20) {
> +   ret = rte_cmp32(src_1, src_2);
> +   n -= 0x20;
> +   src_1 += 0x20;
> +   src_2 += 0x20;
> +   }
>
>
Pardon me for butting in, but this seems incorrect for the first two cases
listed above, as the function as written will only compare the first 128 or
64 bytes of each source and return the result.  The pattern expressed in
the 32 byte case appears more correct, as it compares the first 32 bytes
and then lets later pieces of the function handle the smaller remaining
bits of the sources. Also, if this function is to handle arbitrarily large
source data, the 128 byte case needs to be in a loop.

What am I missing?
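To make the critique concrete, here is the shape the dispatch needs if it is to handle arbitrary sizes — a loop over fixed-size chunks plus a remainder — sketched in plain C with memcmp standing in for the vector compares:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Early-return shape from the patch: only the first chunk is compared,
 * so any difference past byte 127 is silently missed. */
static int truncated_cmp(const uint8_t *a, const uint8_t *b, size_t n)
{
    if (n & 0x80)
        return memcmp(a, b, 0x80);   /* ignores bytes 128..n-1 */
    return memcmp(a, b, n);
}

/* Correct shape for arbitrary n: loop over 128-byte chunks, then the
 * tail. In the real patch each memcmp would be an AVX/SSE compare. */
static int chunked_cmp(const uint8_t *a, const uint8_t *b, size_t n)
{
    while (n >= 0x80) {
        int r = memcmp(a, b, 0x80);
        if (r)
            return r;
        a += 0x80;
        b += 0x80;
        n -= 0x80;
    }
    return n ? memcmp(a, b, n) : 0;
}
```

Two 200-byte buffers differing only at byte 150 compare "equal" under the first shape and unequal under the second.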

--
Matt Laswell
infinite io, inc.
laswell at infiniteio.com


[dpdk-dev] Beyond DPDK 2.0

2015-04-24 Thread Matt Laswell
On Fri, Apr 24, 2015 at 12:39 PM, Jay Rolette 
wrote:
>
> I can tell you that if DPDK were GPL-based, my company wouldn't be using
> it. I suspect we wouldn't be the only ones...
>

I want to emphasize this point.  It's unsurprising that Jay and I agree,
since we work together.  But I can say with quite a bit of confidence that
my last employer also would stop using DPDK if it were GPL licensed.   Or,
if they didn't jettison it entirely, they would never move beyond the last
BSD-licensed version.  If you want to incentivize companies to support
DPDK, the first step is to ensure they're using it.  For that reason, GPL
seems like a step in the wrong direction to me.

- Matt


[dpdk-dev] Symmetric RSS Hashing, Part 2

2015-03-30 Thread Matt Laswell
That's really encouraging.  Thanks!

One thing I'll note is that if my reading of the original paper is
accurate, the 0x6d5a value isn't there in order to cause symmetry - other
repeated 16-bit values will do that, as you've seen.  What the 0x6d5a value
gets you is symmetry while preserving RSS's effectiveness at load spreading
with typical traffic data.  Not all 16-bit values will do this.
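For anyone curious, the symmetry is easy to check in software. Below is a minimal Toeplitz ("soft RSS") sketch over an IPv4 4-tuple — my own illustration, not DPDK code. With the key built from repeated 0x6d5a, swapping src/dst leaves the hash unchanged because the key is periodic in 16 bits and the IP and port fields sit at offsets that are multiples of 16 bits:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software Toeplitz hash (the RSS algorithm) over `len` input bytes.
 * `key` must be at least len + 4 bytes long. */
static uint32_t toeplitz(const uint8_t *in, size_t len, const uint8_t *key)
{
    uint32_t hash = 0;
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8) | key[3];
    size_t kbit = 32;                     /* next key bit to shift in */

    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {
            if (in[i] & (1u << b))
                hash ^= window;
            window <<= 1;
            if (key[kbit / 8] & (0x80u >> (kbit % 8)))
                window |= 1;
            kbit++;
        }
    }
    return hash;
}

/* Pack an IPv4 4-tuple the way RSS hashes it: src IP, dst IP,
 * src port, dst port, all big-endian. */
static void pack_tuple(uint8_t out[12], uint32_t sip, uint32_t dip,
                       uint16_t sp, uint16_t dp)
{
    out[0] = (uint8_t)(sip >> 24); out[1] = (uint8_t)(sip >> 16);
    out[2] = (uint8_t)(sip >> 8);  out[3] = (uint8_t)sip;
    out[4] = (uint8_t)(dip >> 24); out[5] = (uint8_t)(dip >> 16);
    out[6] = (uint8_t)(dip >> 8);  out[7] = (uint8_t)dip;
    out[8] = (uint8_t)(sp >> 8);   out[9] = (uint8_t)sp;
    out[10] = (uint8_t)(dp >> 8);  out[11] = (uint8_t)dp;
}
```

Filling a standard 40-byte RSS key with 0x6d, 0x5a repeated and hashing a tuple both ways demonstrates the symmetry directly.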

--
Matt Laswell
infinite io, inc.
laswell at infiniteio.com

On Mon, Mar 30, 2015 at 10:00 AM, Vladimir Medvedkin 
wrote:

> Matthew,
>
> I don't use any special tricks to make symmetric RSS work. Furthermore, it
> works not only with 0x6d5a.
>
> Regards,
> Vladimir
>
> 2015-03-28 23:11 GMT+03:00 Matthew Hall :
>
> > On Sat, Mar 28, 2015 at 12:10:20PM +0300, Vladimir Medvedkin wrote:
> > > I just verify RSS symmetric in my code, all works great.
> > > ...
> > > By the way, maybe it will be usefull to add softrss function in DPDK?
> >
> > Vladimir,
> >
> > All of this is super-awesome code. I agree having SW RSS would be quite
> > nice.
> > Then you could more easily support things like virtio-net and other stuff
> > which doesn't have RSS.
> >
> > Did you have to use any special tricks to get the 0x6d5a to work? I
> wasn't
> > quite
> > sure how to initialize that and get it to run right.
> >
> > Matthew.
> >
>


[dpdk-dev] Symmetric RSS Hashing, Part 2

2015-03-23 Thread Matt Laswell
Hey Folks,

I have essentially the same question as Matthew.  Has there been progress
in this area?

--
Matt Laswell
infinite io, inc.
laswell at infiniteio.com


On Sat, Mar 14, 2015 at 3:47 PM, Matthew Hall  wrote:

> A few months ago we had this thread about symmetric hashing of TCP in RSS:
>
> http://dpdk.org/ml/archives/dev/2014-December/010148.html
>
> I was wondering if we ever did figure out how to get the 0x6d5a hash key
> mentioned in there to work, or another alternative one.
>
> Thanks,
> Matthew.


[dpdk-dev] Question about link up/down events and transmit queues

2015-03-10 Thread Matt Laswell
Just a bit more on this.  We've found that when a link goes down, the TX
descriptor ring appears to fill up with packets fairly quickly, and then
calls to rte_eth_tx_burst() start returning zero.  Our application handles
this case, and frees the mbufs that could not be sent.

However, when link is reestablished, the TX descriptor ring appears to stay
full.  Hence, subsequent calls to rte_eth_tx_burst() continue to return
zero, and we continue to free the mbufs without sending them.  Frankly,
this was surprising, as I had assumed that the TX descriptor ring would
be emptied when the link came back up, either by sending the enqueued
packets, or by reinitializing.

I've tried calling rte_eth_dev_start() and rte_eth_promiscuous_enable() in
order to restart everything.  That appears to work, at least on the
combination of drivers that I tested with.  Can somebody please tell me
whether this is the preferred way to recover from link down?

Thanks,

--
Matt Laswell
*infinite io, inc.*
laswell at infiniteio.com


On Tue, Mar 10, 2015 at 10:47 AM, Matt Laswell 
wrote:

> Hey Folks,
>
> I'm running into an issue that I hope is obvious and simple.  We're
> running DPDK 1.6.2 with an 82599 NIC.  We find that if, while running
> traffic, we disconnect a port and then later reconnect it, we never regain
> the ability to transmit packets out of that port after it comes back up.
> Specifically, our calls to rte_eth_tx_burst() get return values that
> indicate that no packets could be sent.
>
> Is there an additional step that we have to do on link down/up operations,
> perhaps to tell the NIC to flush its descriptor ring?
>
> Thanks in advance for your help.
>
> --
> Matt Laswell
> *infinite io, inc.*
> laswell at infiniteio.com
>


[dpdk-dev] Question about link up/down events and transmit queues

2015-03-10 Thread Matt Laswell
Hey Folks,

I'm running into an issue that I hope is obvious and simple.  We're running
DPDK 1.6.2 with an 82599 NIC.  We find that if, while running traffic, we
disconnect a port and then later reconnect it, we never regain the ability
to transmit packets out of that port after it comes back up.
Specifically, our calls to rte_eth_tx_burst() get return values that
indicate that no packets could be sent.

Is there an additional step that we have to do on link down/up operations,
perhaps to tell the NIC to flush its descriptor ring?

Thanks in advance for your help.

--
Matt Laswell
*infinite io, inc.*
laswell at infiniteio.com


[dpdk-dev] Appropriate DPDK data structures for TCP sockets

2015-02-23 Thread Matt Laswell
Hey Matthew,

I've mostly worked on stackless systems over the last few years, but I have
done a fair bit of work on high performance, highly scalable connection
tracking data structures.  In that spirit, here are a few counterintuitive
insights I've gained over the years.  Perhaps they'll be useful to you.
Apologies in advance for likely being a bit long-winded.

First, you really need to take cache performance into account when you're
choosing a data structure.  Something like a balanced tree can seem awfully
appealing at first blush, either on its own or as a chaining mechanism for
a hash table.  But the problem with trees is that there really isn't much
locality of reference in your memory use - every single step in your
descent ends up being a cache miss.  This hurts you twice: once that you
end up stalled waiting for the next node in the tree to load from main
memory, and again when you have to reload whatever you pushed out of cache
to get it.

It's often better if, instead of a tree, you do linear search across arrays
of hash values.  It's easy to size the array so that it is exactly one
cache line long, and you can generally do linear search of the whole thing
in less time than it takes to do a single cache line fill.   If you find a
match, you can do full verification against the full tuple as needed.
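As a sketch of that layout — all sizes illustrative — a bucket is one cache line of 32-bit hash tags, scanned linearly before any full-tuple verification:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAGS_PER_BUCKET 16   /* 16 x 4 bytes = one 64-byte cache line */

struct bucket {
    uint32_t tag[TAGS_PER_BUCKET];   /* hash values; 0 marks an empty slot
                                      * (a real table needs a scheme for
                                      * hashes that are legitimately 0) */
};

/* Linear scan of one cache line; returns slot index or -1. A hit here
 * would then be confirmed against the full 5-tuple stored elsewhere. */
static int bucket_find(const struct bucket *b, uint32_t hash)
{
    for (int i = 0; i < TAGS_PER_BUCKET; i++)
        if (b->tag[i] == hash)
            return i;
    return -1;
}

static int bucket_insert(struct bucket *b, uint32_t hash)
{
    for (int i = 0; i < TAGS_PER_BUCKET; i++)
        if (b->tag[i] == 0) {
            b->tag[i] = hash;
            return i;
        }
    return -1;                        /* bucket full */
}
```

The whole scan touches exactly one cache line, which is the point: one fill instead of one per tree level.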

Second, rather than synchronizing (perhaps with locks, perhaps with
lockless data structures), it's often beneficial to create multiple
threads, each of which holds a fraction of your connection tracking data.
Every connection belongs to a single one of these threads, selected perhaps
by hash or RSS value, and all packets from the connection go through that
single thread.  This approach has a couple of advantages.  First,
obviously, no slowdowns for synchronization.  But, second, I've found that
when you are spreading packets from a single connection across many compute
elements, you're inevitably going to start putting packets out of order.
In many applications, this ultimately leads to some additional processing
to put things back in order, which gives away the performance gains you
achieved.  Of course, this approach brings its own set of complexities, and
challenges for your application, and doesn't always spread the work as
efficiently across all of your cores.  But it might be worth considering.

Third, it's very worthwhile to have a cache for the most recently accessed
connection.  First, because network traffic is bursty, and you'll
frequently see multiple packets from the same connection in succession.
Second, because it can make life easier for your application code.  If you
have multiple places that need to access connection data, you don't have to
worry so much about the cost of repeated searches.  Again, this may or may
not matter for your particular application.  But for ones I've worked on,
it's been a win.
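That last-connection cache can be as simple as one remembered pointer checked before the full lookup (types and the stubbed table search are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct conn { uint32_t hash; int hits; };

/* One-entry cache in front of a (stubbed) full table search. Because
 * traffic is bursty, consecutive packets often belong to the same
 * connection and skip the search entirely. */
static struct conn *last_conn;

static struct conn *table_lookup(struct conn *table, size_t n, uint32_t hash)
{
    for (size_t i = 0; i < n; i++)     /* stand-in for the real search */
        if (table[i].hash == hash)
            return &table[i];
    return NULL;
}

static struct conn *conn_lookup(struct conn *table, size_t n, uint32_t hash)
{
    if (last_conn && last_conn->hash == hash)
        return last_conn;              /* fast path: no search at all */
    last_conn = table_lookup(table, n, hash);
    return last_conn;
}
```

In a per-thread design like the one described above, the cached pointer needs no locking, since only the owning thread ever touches it.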

Anyway, as predicted, this post has gone far too long for a Monday
morning.  Regardless, I hope you found it useful.  Let me know if you have
questions or comments.

--
Matt Laswell
infinite io, inc.
laswell at infiniteio.com

On Sun, Feb 22, 2015 at 10:50 PM, Matthew Hall 
wrote:

>
> On Feb 22, 2015, at 4:02 PM, Stephen Hemminger 
> wrote:
> > Use userspace RCU? or BSD RB_TREE
>
> Thanks Stephen,
>
> I think the RB_TREE stuff is single threaded mostly.
>
> But user-space RCU looks quite good indeed, I didn't know somebody ported
> it out of the kernel. I'll check it out.
>
> Matthew.


[dpdk-dev] A question about hugepage initialization time

2014-12-09 Thread Matt Laswell
Hey Everybody,

Thanks for the feedback.  Yeah, we're pretty sure that the amount of memory
we work with is atypical, and we're hitting something that isn't an issue
for most DPDK users.

To clarify, yes, we're using 1GB hugepages, and we set them up via
hugepagesz and hugepages= in our kernel's grub line.  We find that when we
use four 1GB huge pages, eal memory init takes a couple of seconds, which
is no big deal.  When we use 128 1GB pages, though, memory init can take
several minutes.   The concern is that we will very likely use even more
memory in the future.  Our boot time is mostly just a nuisance now;
nonlinear growth in memory init time may transform it into a larger problem.
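For reference, the boot-time reservation described above is typically spelled like this on the kernel command line (the page count and file path are example values, not our exact configuration):

```shell
# /etc/default/grub -- reserve 128 x 1GB hugepages at boot (example values)
GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=128"
# then regenerate the grub config (update-grub or grub2-mkconfig) and reboot
```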

We've had to disable transparent hugepages due to latency issues with
in-memory databases.  I'll have to look at the possibility of alternative
memset implementations.  Perhaps some profiler time is in my future.

Again, thanks to everybody for the useful information.

--
Matt Laswell
laswell at infiniteio.com
infinite io, inc.

On Tue, Dec 9, 2014 at 1:06 PM, Matthew Hall  wrote:

> On Tue, Dec 09, 2014 at 10:33:59AM -0600, Matt Laswell wrote:
> > Our DPDK application deals with very large in memory data structures, and
> > can potentially use tens or even hundreds of gigabytes of hugepage
> memory.
>
> What you're doing is an unusual use case and this is open source code where
> nobody might have tested and QA'ed this yet.
>
> So my recommendation would be adding some rte_log statements to measure the
> various steps in the process to see what's going on. Also using the Linux
> Perf
> framework to do low-overhead sampling-based profiling, and making sure
> you've
> got everything compiled with debug symbols so you can see what's consuming
> the
> execution time.
>
> You might find that it makes sense to use some custom allocators like
> jemalloc
> alongside of the DPDK allocators, including perhaps "transparent hugepage
> mode" in your process, and some larger page sizes to reduce the number of
> pages.
>
> You can also use this handy kernel options, hugepagesz= hugepages=N .
> This creates guaranteed-contiguous known-good hugepages during boot which
> initialize much more quickly with less trouble and glitches in my
> experience.
>
> https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
> https://www.kernel.org/doc/Documentation/vm/transhuge.txt
>
> There is no one-size-fits-all solution but these are some possibilities.
>
> Good Luck,
> Matthew.
>


[dpdk-dev] A question about hugepage initialization time

2014-12-09 Thread Matt Laswell
Hey Folks,

Our DPDK application deals with very large in memory data structures, and
can potentially use tens or even hundreds of gigabytes of hugepage memory.
During the course of development, we've noticed that as the number of huge
pages increases, the memory initialization time during EAL init gets to be
quite long, lasting several minutes at present.  The growth in init time
doesn't appear to be linear, which is concerning.

This is a minor inconvenience for us and our customers, as memory
initialization makes our boot times a lot longer than it would otherwise
be.  Also, my experience has been that really long operations often are
hiding errors - what you think is merely a slow operation is actually a
timeout of some sort, often due to misconfiguration. This leads to two
questions:

1. Does the long initialization time suggest that there's an error
happening under the covers?
2. If not, is there any simple way that we can shorten memory
initialization time?

Thanks in advance for your insights.

--
Matt Laswell
laswell at infiniteio.com
infinite io, inc.


[dpdk-dev] Load-balancing position field in DPDK load_balancer sample app vs. Hash table

2014-11-15 Thread Matt Laswell
Fantastic.  Thanks for the assist.

--
Matt Laswell
laswell at infiniteio.com
infinite io, inc.


On Sat, Nov 15, 2014 at 1:10 AM, Yerden Zhumabekov 
wrote:

>  Hello Matt,
>
> You can specify RSS configuration through rte_eth_dev_configure() function
> supplied with this structure:
>
> struct rte_eth_conf port_conf = {
> .rxmode = {
> .mq_mode= ETH_MQ_RX_RSS,
>  ...
> },
> .rx_adv_conf = {
> .rss_conf = {
> .rss_key = NULL,
> .rss_hf = ETH_RSS_IPV4 | ETH_RSS_IPV6,
> },
> },
> .
> };
>
> In this case, RSS-hash is calculated over IP addresses only and with
> default RSS key. Look at lib/librte_ether/rte_ethdev.h for other
> definitions.
>
>
> 15.11.2014 0:49, Matt Laswell wrote:
>
> Hey Folks,
>
>  This thread has been tremendously helpful, as I'm looking at adding
> RSS-based load balancing to my application in the not too distant future.
> Many thanks to all who have contributed, especially regarding symmetric RSS.
>
>  Not to derail the conversation too badly, but could one of you point me
> to some example code that demonstrates the steps needed to configure RSS?
> We're using Niantic NICs, so I assume that this is pretty standard stuff,
> but having an example to study is a real leg up.
>
>  Again, thanks for all of the information.
>
>  --
> Matt Laswell
> laswell at infiniteio.com
> infinite io, inc.
>
> On Fri, Nov 14, 2014 at 10:57 AM, Chilikin, Andrey <
> andrey.chilikin at intel.com> wrote:
>
>> Fortville supports symmetrical hashing on HW level, a patch for i40e PMD
>> was submitted a couple of weeks ago. For Niantic you can use symmetrical
>> rss key recommended by Konstantin.
>>
>> Regards,
>> Andrey
>>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev, Konstantin
>> Sent: Friday, November 14, 2014 4:50 PM
>> To: Yerden Zhumabekov; Kamraan Nasim; dev at dpdk.org
>> Cc: Yuanzhang Hu
>> Subject: Re: [dpdk-dev] Load-balancing position field in DPDK
>> load_balancer sample app vs. Hash table
>>
>> > -Original Message-
>> > From: Yerden Zhumabekov [mailto:e_zhumabekov at sts.kz]
>> > Sent: Friday, November 14, 2014 4:23 PM
>> > To: Ananyev, Konstantin; Kamraan Nasim; dev at dpdk.org
>> > Cc: Yuanzhang Hu
>> > Subject: Re: [dpdk-dev] Load-balancing position field in DPDK
>> > load_balancer sample app vs. Hash table
>> >
>> > I'd like to interject a question here.
>> >
>> > In case of flow classification, one might possibly prefer for packets
>> > from the same flow to fall on the same logical core. With this '%'
>> > load balancing, it would require to get the same RSS hash value for
>> > packets with direct (src to dst) and swapped (dst to src) IPs and
>> > ports. Am I correct that hardware RSS calculation cannot provide this
>> symmetry?
>>
>> As I remember, it is possible but you have to tweak rss key values.
>> Here is a paper describing how to do that:
>> http://www.ndsl.kaist.edu/~shinae/papers/TR-symRSS.pdf
>>
>> Konstantin
>>
>> >
>> > 14.11.2014 20:44, Ananyev, Konstantin ?:
>> > > If you have a NIC that is capable to do HW hash computation, then
>> > > you can do your load balancing based on that value.
>> > > Let say ixgbe/igb/i40e NICs can calculate RSS hash value based on
>> > > different combinations of dst/src Ips, dst/src ports.
>> > > This value can be stored inside mbuf for each RX packet by PMD RX
>> function.
>> > > Then you can do:
>> > > worker_id = mbuf->hash.rss % n_workers
>> > >
>> > > That might provide better balancing than using just a one-byte
>> > > value, plus should be a bit faster, as in that case your balancer
>> > > code doesn't need to touch the packet's data.
>> > >
>> > > Konstantin
>> >
>> > --
>> > Sincerely,
>> >
>> > Yerden Zhumabekov
>> > State Technical Service
>> > Astana, KZ
>> >
>>
>>
>
> --
> Sincerely,
>
> Yerden Zhumabekov
> State Technical Service
> Astana, KZ
>
>


[dpdk-dev] Load-balancing position field in DPDK load_balancer sample app vs. Hash table

2014-11-14 Thread Matt Laswell
Hey Folks,

This thread has been tremendously helpful, as I'm looking at adding
RSS-based load balancing to my application in the not too distant future.
Many thanks to all who have contributed, especially regarding symmetric RSS.

Not to derail the conversation too badly, but could one of you point me to
some example code that demonstrates the steps needed to configure RSS?
We're using Niantic NICs, so I assume that this is pretty standard stuff,
but having an example to study is a real leg up.

Again, thanks for all of the information.

--
Matt Laswell
laswell at infiniteio.com
infinite io, inc.

On Fri, Nov 14, 2014 at 10:57 AM, Chilikin, Andrey <
andrey.chilikin at intel.com> wrote:

> Fortville supports symmetrical hashing on HW level, a patch for i40e PMD
> was submitted a couple of weeks ago. For Niantic you can use symmetrical
> rss key recommended by Konstantin.
>
> Regards,
> Andrey
>
> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev, Konstantin
> Sent: Friday, November 14, 2014 4:50 PM
> To: Yerden Zhumabekov; Kamraan Nasim; dev at dpdk.org
> Cc: Yuanzhang Hu
> Subject: Re: [dpdk-dev] Load-balancing position field in DPDK
> load_balancer sample app vs. Hash table
>
> > -Original Message-
> > From: Yerden Zhumabekov [mailto:e_zhumabekov at sts.kz]
> > Sent: Friday, November 14, 2014 4:23 PM
> > To: Ananyev, Konstantin; Kamraan Nasim; dev at dpdk.org
> > Cc: Yuanzhang Hu
> > Subject: Re: [dpdk-dev] Load-balancing position field in DPDK
> > load_balancer sample app vs. Hash table
> >
> > I'd like to interject a question here.
> >
> > In case of flow classification, one might possibly prefer for packets
> > from the same flow to fall on the same logical core. With this '%'
> > load balancing, it would require to get the same RSS hash value for
> > packets with direct (src to dst) and swapped (dst to src) IPs and
> > ports. Am I correct that hardware RSS calculation cannot provide this
> symmetry?
>
> As I remember, it is possible but you have to tweak rss key values.
> Here is a paper describing how to do that:
> http://www.ndsl.kaist.edu/~shinae/papers/TR-symRSS.pdf
>
> Konstantin
>
> >
> > 14.11.2014 20:44, Ananyev, Konstantin ?:
> > > If you have a NIC that is capable to do HW hash computation, then
> > > you can do your load balancing based on that value.
> > > Let say ixgbe/igb/i40e NICs can calculate RSS hash value based on
> > > different combinations of dst/src Ips, dst/src ports.
> > > This value can be stored inside mbuf for each RX packet by PMD RX
> function.
> > > Then you can do:
> > > worker_id = mbuf->hash.rss % n_workers
> > >
> > > That might provide better balancing than using just a one-byte
> > > value, plus should be a bit faster, as in that case your balancer code
> > > doesn't need to touch the packet's data.
> > >
> > > Konstantin
> >
> > --
> > Sincerely,
> >
> > Yerden Zhumabekov
> > State Technical Service
> > Astana, KZ
> >
>
>


[dpdk-dev] segmented recv ixgbevf

2014-11-05 Thread Matt Laswell
Hey Folks,

I ran into the same issue that Alex is describing here, and I wanted to
expand just a little bit on his comments, as the documentation isn't very
clear.

Per the documentation, the two arguments to rte_pktmbuf_pool_init() are a
pointer to the memory pool that contains the newly-allocated mbufs and an
opaque pointer.  The docs are pretty vague about what the opaque pointer
should point to or what it's contents mean; all of the examples I looked at
just pass a NULL pointer. The docs for this function describe the opaque
pointer this way:

"A pointer that can be used by the user to retrieve useful information for
mbuf initialization. This pointer comes from the init_arg parameter of
rte_mempool_create()
<http://www.dpdk.org/doc/api/rte__mempool_8h.html#a7dc1d01a45144e3203c36d1800cb8f17>
."

This is a little bit misleading.  Under the covers, rte_pktmbuf_pool_init()
doesn't treat the opaque pointer as a pointer at all.  Rather, it just
converts it to a uint16_t which contains the desired mbuf size.   If it
receives 0 (in other words, if you passed in a NULL pointer), it will use
2048 bytes + RTE_PKTMBUF_HEADROOM.  Hence, incoming jumbo frames will be
segmented into 2K chunks.
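The behavior follows the same pattern as this little sketch — a value smuggled through a void pointer, with NULL falling back to a default. This is a model of the mechanism, not the DPDK source, and the headroom constant here is illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define PKTMBUF_HEADROOM 128   /* illustrative stand-in for RTE_PKTMBUF_HEADROOM */

/* Model of how rte_pktmbuf_pool_init() interprets its "opaque" argument:
 * not as a pointer at all, but as a uint16_t mbuf data size. Passing
 * NULL therefore yields 0, which selects the 2K default. */
static uint16_t resolve_mbuf_size(void *opaque_arg)
{
    uint16_t size = (uint16_t)(uintptr_t)opaque_arg;
    if (size == 0)                        /* a NULL pointer was passed */
        size = 2048 + PKTMBUF_HEADROOM;   /* default: 2K segments */
    return size;
}
```

So to receive 9000-byte jumbo frames unsegmented, the caller has to pass the desired size cast into the pointer argument rather than NULL.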

Any chance we could get an improvement to the documentation about this
parameter?  It seems as though the opaque pointer isn't a pointer and
probably shouldn't be opaque.

Hope this helps the next person who comes across this behavior.

--
Matt Laswell
infinite io, inc.

On Thu, Oct 30, 2014 at 7:48 AM, Alex Markuze  wrote:

> For posterity.
>
> 1.When using MTU larger then 2K its advised to provide the value
> to rte_pktmbuf_pool_init.
> 2.ixgbevf rounds down the ("MBUF size" - RTE_PKTMBUF_HEADROOM) to the
> nearest 1K multiple when deciding on the receiving capabilities [buffer
> size]of the Buffers in the pool.
> The function SRRCTL register,  is considered here for some reason?
>


[dpdk-dev] Question about ASLR

2014-09-08 Thread Matt Laswell
Bruce,

That's tremendously helpful.  Thanks for the information.

--
Matt Laswell
*infinite io*


On Sun, Sep 7, 2014 at 2:52 PM, Richardson, Bruce <
bruce.richardson at intel.com> wrote:

> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Matt Laswell
> > Sent: Friday, September 05, 2014 7:57 PM
> > To: dev at dpdk.org
> > Subject: [dpdk-dev] Question about ASLR
> >
> > Hey Folks,
> >
> > A colleague noticed warnings in section 23.3 of the programmer's guide
> > about the use of address space layout randomization with multiprocess
> DPDK
> > applications.  And, upon inspection, it appears that ASLR is enabled on
> our
> > target systems.  We've never seen a problem that we could trace back to
> > ASLR, and we've never seen a warning during EAL memory initialization,
> > either, which is strange.
> >
> > Given the choice, we would prefer to keep ASLR for security reasons.
> Given
> > that in our problem domain:
> >- We are running a multiprocess DPDK application
> >- We run only one DPDK application, which is a single compiled binary
> >- We have exactly one process running per logical core
> >- We're OK with interrupts coming just to the primary
> >- We handle interaction from our control plane via a separate shared
> > memory space
> >
> > Is it OK in this circumstance to leave ASLR enabled?  I think it probably
> > is, but would love to hear reasons why not and/or pitfalls that we need
> to
> > avoid.
> >
> > Thanks in advance.
> >
> > --
> > Matt Laswell
> > *infinite io*
>
> Having ASLR enabled will just introduce a small element of uncertainty in
> the application startup process as you the memory mappings used by your app
> will move about from run to run. In certain cases we've seen some of the
> secondary multi-process application examples fail to start at random once
> every few hundred times (IIRC correctly - this was some time back).
> Presumably the chances of the secondary failing to start will vary
> depending on how ASLR has adjusted the memory mappings in the primary.
> So, with ASLR on, we've found occasionally that mappings will fail, in
> which case the solution is really just to retry the app again and ASLR will
> re-randomise it differently and it will likely start. Disabling ASLR gives
> repeatability in this regard - your app will always start successfully - or
> if there is something blocking the memory maps from being replicated -
> always fail to start (in which case you try passing EAL parameters to hint
> the primary process to use different mapping addresses).
>
> In your case, you are not seeing any problems thus far, so likely if
> secondary process startup failures do occur, they should hopefully work
> fine by just trying again! Whether this element of uncertainty is
> acceptable or not is your choice :-). One thing you could try, to find out
> what the issues might be with your app, is to just try running it
> repeatedly in a script, killing it after a couple of seconds. This should
> tell you how often, if ever, initialization failures are to be expected
> when using ASLR.
>
> Hope this helps,
> Regards,
> /Bruce
>
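Bruce's suggestion above - run the app repeatedly from a script, killing it after a couple of seconds, and see how often initialization fails - can be sketched roughly as follows. This is only an illustration, not from the thread: `./my_dpdk_app`, the run count, and the sleep interval are placeholders you would adapt (Bruce suggests a few hundred runs in practice).

```shell
#!/bin/sh
# Repeatedly start the app and count how often EAL initialization
# fails under ASLR.  "./my_dpdk_app" is a placeholder binary.
app="${APP:-./my_dpdk_app}"
runs=5          # use a few hundred for a meaningful sample
fails=0
i=0
while [ "$i" -lt "$runs" ]; do
    "$app" >/dev/null 2>&1 &
    pid=$!
    sleep 1                          # give EAL time to set up mappings
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid" 2>/dev/null      # still running: init succeeded
        wait "$pid" 2>/dev/null
    else
        fails=$((fails + 1))         # exited already: count as failure
    fi
    i=$((i + 1))
done
echo "EAL init failures: $fails/$runs"
```

A nonzero failure count that goes away with ASLR disabled would point at the mapping-collision issue Bruce describes.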


[dpdk-dev] Question about ASLR

2014-09-05 Thread Matt Laswell
Hey Folks,

A colleague noticed warnings in section 23.3 of the programmer's guide
about the use of address space layout randomization with multiprocess DPDK
applications.  And, upon inspection, it appears that ASLR is enabled on our
target systems.  We've never seen a problem that we could trace back to
ASLR, and we've never seen a warning during EAL memory initialization,
either, which is strange.

Given the choice, we would prefer to keep ASLR for security reasons.  Given
that in our problem domain:
   - We are running a multiprocess DPDK application
   - We run only one DPDK application, which is a single compiled binary
   - We have exactly one process running per logical core
   - We're OK with interrupts coming just to the primary
   - We handle interaction from our control plane via a separate shared
memory space

Is it OK in this circumstance to leave ASLR enabled?  I think it probably
is, but would love to hear reasons why not and/or pitfalls that we need to
avoid.

Thanks in advance.

--
Matt Laswell
*infinite io*
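For reference, the ASLR setting the programmer's guide warns about is a standard Linux sysctl; a quick way to inspect it (a generic sketch, not something from this thread) is:

```shell
# 0 = ASLR disabled, 1 = conservative randomization,
# 2 = full randomization (the usual Linux default).
aslr=$(cat /proc/sys/kernel/randomize_va_space)
echo "randomize_va_space = $aslr"
# To disable until the next reboot (as root):
#   sysctl -w kernel.randomize_va_space=0
```

Checking the value first confirms whether the target systems actually have full randomization enabled before deciding whether to turn it off.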


[dpdk-dev] DPDK with Ubuntu 14.04?

2014-07-11 Thread Matt Laswell
Thanks Roger,

We saw similar issues with regard to kcompat.h.  Can I ask if you've done
anything beyond the example applications under 14.04?

--
Matt Laswell
infinite io


On Thu, Jul 10, 2014 at 7:07 PM, Wiles, Roger Keith <
keith.wiles at windriver.com> wrote:

>  The one problem I had with 14.04 was the kcompat.h file. It looks like a
> hash routine has changed its arguments. I edited the kcompat.h file and was
> able to change the code to allow DPDK to build. It is not a fix but it
> worked for me.
>
>  lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
>
>  /*  Changed the next line to use (3,13,8) instead of (3,14,0) KeithW
> */
> #if ( LINUX_VERSION_CODE < KERNEL_VERSION(3,13,8) )
> #if (!(RHEL_RELEASE_CODE && RHEL_RELEASE_CODE >=
> RHEL_RELEASE_VERSION(7,0)))
> #ifdef NETIF_F_RXHASH
> #define PKT_HASH_TYPE_L3 0
>
>  *Hope that works.*
>
>  *Keith **Wiles*, Principal Technologist with CTO office, *Wind River*
> mobile 972-213-5533
>
>
>  On Jul 10, 2014, at 5:56 PM, Matt Laswell  wrote:
>
> Hey Folks,
>
> I know that official support hasn't moved past Ubuntu 12.04 LTS yet, but
> does anybody have any practical experience running with 14.04 LTS?  My team
> has run into one compilation error so far with 1.7, but other than that
> things look OK at first blush.  I'd like to move my product to 14.04 for a
> variety of reasons, but would hate to spend time chasing down subtle
> incompatibilities.  I'm guessing we're not the first ones to try this...
>
> Thanks.
>
> --
> Matt Laswell
> infinite io
>
>
>


[dpdk-dev] DPDK with Ubuntu 14.04?

2014-07-10 Thread Matt Laswell
Hey Folks,

I know that official support hasn't moved past Ubuntu 12.04 LTS yet, but
does anybody have any practical experience running with 14.04 LTS?  My team
has run into one compilation error so far with 1.7, but other than that
things look OK at first blush.  I'd like to move my product to 14.04 for a
variety of reasons, but would hate to spend time chasing down subtle
incompatibilities.  I'm guessing we're not the first ones to try this...

Thanks.

--
Matt Laswell
infinite io


[dpdk-dev] Ability to/impact of running with smaller page sizes

2014-07-01 Thread Matt Laswell
Thanks everybody,

It sounds as though what I'm looking for may be possible, especially with
1.7, but will require some tweaking and there will most definitely be a
performance hit.  That's great information.  This is still just an
experiment for us, and it's not at all guaranteed that I'm going to move
towards smaller pages, but I very much appreciate the insights.

--
Matt Laswell


On Tue, Jul 1, 2014 at 6:51 AM, Burakov, Anatoly 
wrote:

> Hi Matt,
>
> > I'm curious - is it possible in practical terms to run DPDK without
> hugepages?
>
> Starting with release 1.7.0, support for VFIO was added, which allows
> using DPDK without hugepages at all (including RX/TX rings) via the
> --no-huge command-line parameter. Bear in mind though that you'll have to
> have IOMMU/VT-d enabled (i.e. no VM support, only host-based) and a
> supported kernel version (3.6+) to use VFIO; the memory size will
> be limited to 1G, and it won't work with multiprocess. I don't have any
> performance figures on that unfortunately.
>
> Best regards,
> Anatoly Burakov
> DPDK SW Engineer
>


[dpdk-dev] Ability to/impact of running with smaller page sizes

2014-06-30 Thread Matt Laswell
Hey Folks,

In my application, I'm seeing some design considerations in a project I'm
working on that push me towards the use of smaller memory page sizes.  I'm
curious - is it possible in practical terms to run DPDK without hugepages?
 If so, does anybody have any practical experience (or a
back-of-the-envelope estimate) of how badly such a configuration would hurt
performance?  For sake of argument, assume that virtually all of the memory
being used is in pre-allocated mempools (e.g. lots of rte_mempool_create(),
very little rte_malloc()).

Thanks in advance for your help.

-- 
Matt Laswell