from:"Ananyev, Konstantin"

[dpdk-dev] [PATCH v2] i40e: Fix eth_i40e_dev_init sequence on ThunderX

2016-12-01 Thread Ananyev, Konstantin


Hi Jerin,

> > > > > > >
> > > > > > > i40e_asq_send_command: rd32 & wr32 under ThunderX gives 
> > > > > > > unpredictable
> > > > > > >results. To solve this include rte memory 
> > > > > > > barriers
> > > > > > >
> > > > > > > Signed-off-by: Satha Rao 
> > > > > > > ---
> > > > > > >  drivers/net/i40e/base/i40e_osdep.h | 14 ++
> > > > > > >  1 file changed, 14 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/net/i40e/base/i40e_osdep.h 
> > > > > > > b/drivers/net/i40e/base/i40e_osdep.h
> > > > > > > index 38e7ba5..ffa3160 100644
> > > > > > > --- a/drivers/net/i40e/base/i40e_osdep.h
> > > > > > > +++ b/drivers/net/i40e/base/i40e_osdep.h
> > > > > > > @@ -158,7 +158,13 @@ do { 
> > > > > > >\
> > > > > > >   ((volatile uint32_t *)((char *)(a)->hw_addr + (reg)))
> > > > > > >  static inline uint32_t i40e_read_addr(volatile void *addr)
> > > > > > >  {
> > > > > > > +#if defined(RTE_ARCH_ARM64)
> > > > > > > + uint32_t val = rte_le_to_cpu_32(I40E_PCI_REG(addr));
> > > > > > > + rte_rmb();
> > > > > > > + return val;
> > > > > >
> > > > > > If you really need an rmb/wmb with MMIO read/writes on ARM,
> > > > > > I think you can avoid #ifdefs here and use rte_smp_rmb/rte_smp_wmb.
> > > > > > BTW, I suppose if you need it for i40e, you would need it for other 
> > > > > > devices too.
> > > > >
> > > > > Yes. ARM would need for all devices(typically, the devices on 
> > > > > external PCI bus).
> > > > > I guess rte_smp_rmb may not be the correct abstraction. So we need 
> > > > > more of
> > > > > rte_rmb() as we need only non smp variant on IO side. I guess then it 
> > > > > make sense to
> > > > > create new abstraction in eal with following variants so that each 
> > > > > arch
> > > > > gets opportunity to make what it makes sense that specific platform
> > > > >
> > > > > rte_readb_relaxed
> > > > > rte_readw_relaxed
> > > > > rte_readl_relaxed
> > > > > rte_readq_relaxed
> > > > > rte_writeb_relaxed
> > > > > rte_writew_relaxed
> > > > > rte_writel_relaxed
> > > > > rte_writeq_relaxed
> > > > > rte_readb
> > > > > rte_readw
> > > > > rte_readl
> > > > > rte_readq
> > > > > rte_writeb
> > > > > rte_writew
> > > > > rte_writel
> > > > > rte_writeq
> > > > >
> > > > > Thoughts ?
> > > > >
> > > >
> > > > That seems like a lot of API calls!
> > > > Perhaps you can clarify - why would the rte_smp_rmb() not work for you?
> > >
> > > Currently arm64 mapped DMB as rte_smp_rmb() for smp case.
> > >
> > > Ideally for io barrier and non smp case, we need to map it as DSB and it 
> > > is
> > > bit heavier than DMB
> >
> > Ok, so you need some new macro, like rte_io_(r|w)mb or so, that would 
> > expand into dmb
> > for ARM,  correct?
> 
> The io barrier expands to dsb.
> http://lxr.free-electrons.com/source/arch/arm64/include/asm/io.h#L110

Sorry, yes I meant  DSB here.

> 
> >
> > >
> > > The linux kernel arm64 mappings
> > > http://lxr.free-electrons.com/source/arch/arm64/include/asm/io.h#L142
> > >
> > > DMB vs DSB
> > > https://community.arm.com/thread/3833
> > >
> > > The relaxed one are without any barriers.(the use case like accessing on
> > > chip peripherals may need only relaxed versions)
> > >
> > > Thoughts on new rte EAL abstraction?
> >
> > Looks like a lot of macros but if you guys think that would help - NP with 
> > that :)
> 
> I don't have strong opinion here. If there is concern on a lot of macros
> then, I can introduce only "rte_io_(r|w)mb" instead of 
> read[b|w|l|q]/write[b|w|l|q]/relaxed.
> let me know?

I think we can have both.
The question is in the amount of work need to be done.

> 
> > Again, in that case we probably can get rid of driver specific pci reg 
> > read/write defines.
> Yes. But, That's going to have a lot of change :-(

Yes I agree, the changes would be quite significant.

> 
> If there is no objection then I will introduce
> "read[b|w|l|q]/write[b|w|l|q]/relaxed" and then change all external pcie 
> drivers
> with new macros.

That seems like a good idea to me.
Though as you said that seems quite a significant change.
Probably make sense to do it in 2 stages (just a suggestion): 
First introduce rte_io_(r|w)mb and fix with it existing issues in the 
particular drivers.
Second replace existing PMD specific xxx_read/write_addr() with your new 
generic 

Konstantin

[dpdk-dev] [PATCH v2] i40e: Fix eth_i40e_dev_init sequence on ThunderX

2016-11-30 Thread Ananyev, Konstantin

Hi Jerin,

> 
> On Tue, Nov 22, 2016 at 01:46:54PM +, Bruce Richardson wrote:
> > On Tue, Nov 22, 2016 at 03:46:38AM +0530, Jerin Jacob wrote:
> > > On Sun, Nov 20, 2016 at 11:21:43PM +, Ananyev, Konstantin wrote:
> > > > Hi
> > > > >
> > > > > i40e_asq_send_command: rd32 & wr32 under ThunderX gives unpredictable
> > > > >results. To solve this include rte memory 
> > > > > barriers
> > > > >
> > > > > Signed-off-by: Satha Rao 
> > > > > ---
> > > > >  drivers/net/i40e/base/i40e_osdep.h | 14 ++
> > > > >  1 file changed, 14 insertions(+)
> > > > >
> > > > > diff --git a/drivers/net/i40e/base/i40e_osdep.h 
> > > > > b/drivers/net/i40e/base/i40e_osdep.h
> > > > > index 38e7ba5..ffa3160 100644
> > > > > --- a/drivers/net/i40e/base/i40e_osdep.h
> > > > > +++ b/drivers/net/i40e/base/i40e_osdep.h
> > > > > @@ -158,7 +158,13 @@ do { 
> > > > >\
> > > > >   ((volatile uint32_t *)((char *)(a)->hw_addr + (reg)))
> > > > >  static inline uint32_t i40e_read_addr(volatile void *addr)
> > > > >  {
> > > > > +#if defined(RTE_ARCH_ARM64)
> > > > > + uint32_t val = rte_le_to_cpu_32(I40E_PCI_REG(addr));
> > > > > + rte_rmb();
> > > > > + return val;
> > > >
> > > > If you really need an rmb/wmb with MMIO read/writes on ARM,
> > > > I think you can avoid #ifdefs here and use rte_smp_rmb/rte_smp_wmb.
> > > > BTW, I suppose if you need it for i40e, you would need it for other 
> > > > devices too.
> > >
> > > Yes. ARM would need for all devices(typically, the devices on external 
> > > PCI bus).
> > > I guess rte_smp_rmb may not be the correct abstraction. So we need more of
> > > rte_rmb() as we need only non smp variant on IO side. I guess then it 
> > > make sense to
> > > create new abstraction in eal with following variants so that each arch
> > > gets opportunity to make what it makes sense that specific platform
> > >
> > > rte_readb_relaxed
> > > rte_readw_relaxed
> > > rte_readl_relaxed
> > > rte_readq_relaxed
> > > rte_writeb_relaxed
> > > rte_writew_relaxed
> > > rte_writel_relaxed
> > > rte_writeq_relaxed
> > > rte_readb
> > > rte_readw
> > > rte_readl
> > > rte_readq
> > > rte_writeb
> > > rte_writew
> > > rte_writel
> > > rte_writeq
> > >
> > > Thoughts ?
> > >
> >
> > That seems like a lot of API calls!
> > Perhaps you can clarify - why would the rte_smp_rmb() not work for you?
> 
> Currently arm64 mapped DMB as rte_smp_rmb() for smp case.
> 
> Ideally for io barrier and non smp case, we need to map it as DSB and it is
> bit heavier than DMB

Ok, so you need some new macro, like rte_io_(r|w)mb or so, that would expand 
into dmb
for ARM,  correct?

> 
> The linux kernel arm64 mappings
> http://lxr.free-electrons.com/source/arch/arm64/include/asm/io.h#L142
> 
> DMB vs DSB
> https://community.arm.com/thread/3833
> 
> The relaxed one are without any barriers.(the use case like accessing on
> chip peripherals may need only relaxed versions)
> 
> Thoughts on new rte EAL abstraction?

Looks like a lot of macros but if you guys think that would help - NP with that 
:)
Again, in that case we probably can get rid of driver specific pci reg 
read/write defines.

Konstantin

> 
> >
> > /Bruce

[dpdk-dev] [PATCH v12 0/6] add Tx preparation

2016-11-30 Thread Ananyev, Konstantin


Hi Harish,
> 
> 
> >We need attention of every PMD developers on this thread.
> >
> >Reminder of what Konstantin suggested:
> >"
> >- if the PMD supports TX offloads AND
> >- if to be able use any of these offloads the upper layer SW would have
> >to:
> >* modify the contents of the packet OR
> >* obey HW specific restrictions
> >then it is a PMD developer responsibility to provide tx_prep() that would
> >implement
> >expected modifications of the packet contents and restriction checks.
> >Otherwise, tx_prep() implementation is not required and can be safely set
> >to NULL.
> >"
> >
> >I copy/paste also my previous conclusion:
> >
> >Before txprep, there is only one API: the application must prepare the
> >packets checksum itself (get_psd_sum in testpmd).
> >With txprep, the application have 2 choices: keep doing the job itself
> >or call txprep which calls a PMD-specific function.
> >The question is: does non-Intel drivers need a checksum preparation for
> >TSO?
> >Will it behave well if txprep does nothing in these drivers?
> >
> >When looking at the code, most of drivers handle the TSO flags.
> >But it is hard to know whether they rely on the pseudo checksum or not.
> >
> >git grep -l 'PKT_TX_UDP_CKSUM\|PKT_TX_TCP_CKSUM\|PKT_TX_TCP_SEG'
> >drivers/net/
> >
> >drivers/net/bnxt/bnxt_txr.c
> >drivers/net/cxgbe/sge.c
> >drivers/net/e1000/em_rxtx.c
> >drivers/net/e1000/igb_rxtx.c
> >drivers/net/ena/ena_ethdev.c
> >drivers/net/enic/enic_rxtx.c
> >drivers/net/fm10k/fm10k_rxtx.c
> >drivers/net/i40e/i40e_rxtx.c
> >drivers/net/ixgbe/ixgbe_rxtx.c
> >drivers/net/mlx4/mlx4.c
> >drivers/net/mlx5/mlx5_rxtx.c
> >drivers/net/nfp/nfp_net.c
> >drivers/net/qede/qede_rxtx.c
> >drivers/net/thunderx/nicvf_rxtx.c
> >drivers/net/virtio/virtio_rxtx.c
> >drivers/net/vmxnet3/vmxnet3_rxtx.c
> >
> >Please, we need a comment for each driver saying
> >"it is OK, we do not need any checksum preparation for TSO"
> >or
> >"yes we have to implement tx_prepare or TSO will not work in this mode"
> >
> 
> qede PMD doesn?t currently support TSO yet, it only supports Tx TCP/UDP/IP
> csum offloads.
> So Tx preparation isn?t applicable. So as of now -
> "it is OK, we do not need any checksum preparation for TSO"

Thanks for the answer.
Though please note that it not only for TSO.
This is for any TX offload for which the upper layer SW would have
to modify the contents of the packet.
Though as I can see for qede neither PKT_TX_IP_CKSUM or PKT_TX_TCP_CKSUM
exhibits any extra requirements for the user.
Is that correct?

Konstantin   


> 
> 
> Thanks,
> Harish

[dpdk-dev] [PATCH v12 0/6] add Tx preparation

2016-11-30 Thread Ananyev, Konstantin

Hi John,

> 
> Hi,
> -john
> 
> > -Original Message-
> > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > Sent: Monday, November 28, 2016 3:03 AM
> > To: dev at dpdk.org; Rahul Lakkireddy ;
> > Stephen Hurd ; Jan Medala
> > ; Jakub Palider ; John Daley
> > (johndale) ; Adrien Mazarguil
> > ; Alejandro Lucero
> > ; Harish Patil
> > ; Rasesh Mody ; Jerin
> > Jacob ; Yuanhan Liu
> > ; Yong Wang 
> > Cc: Tomasz Kulasek ;
> > konstantin.ananyev at intel.com; olivier.matz at 6wind.com
> > Subject: Re: [dpdk-dev] [PATCH v12 0/6] add Tx preparation
> >
> > We need attention of every PMD developers on this thread.
> >
> > Reminder of what Konstantin suggested:
> > "
> > - if the PMD supports TX offloads AND
> > - if to be able use any of these offloads the upper layer SW would have to:
> > * modify the contents of the packet OR
> > * obey HW specific restrictions
> > then it is a PMD developer responsibility to provide tx_prep() that would
> > implement expected modifications of the packet contents and restriction
> > checks.
> > Otherwise, tx_prep() implementation is not required and can be safely set to
> > NULL.
> > "
> >
> > I copy/paste also my previous conclusion:
> >
> > Before txprep, there is only one API: the application must prepare the
> > packets checksum itself (get_psd_sum in testpmd).
> > With txprep, the application have 2 choices: keep doing the job itself or 
> > call
> > txprep which calls a PMD-specific function.
> > The question is: does non-Intel drivers need a checksum preparation for
> > TSO?
> > Will it behave well if txprep does nothing in these drivers?
> >
> > When looking at the code, most of drivers handle the TSO flags.
> > But it is hard to know whether they rely on the pseudo checksum or not.
> >
> > git grep -l 'PKT_TX_UDP_CKSUM\|PKT_TX_TCP_CKSUM\|PKT_TX_TCP_SEG'
> > drivers/net/
> >
> > drivers/net/bnxt/bnxt_txr.c
> > drivers/net/cxgbe/sge.c
> > drivers/net/e1000/em_rxtx.c
> > drivers/net/e1000/igb_rxtx.c
> > drivers/net/ena/ena_ethdev.c
> > drivers/net/enic/enic_rxtx.c
> > drivers/net/fm10k/fm10k_rxtx.c
> > drivers/net/i40e/i40e_rxtx.c
> > drivers/net/ixgbe/ixgbe_rxtx.c
> > drivers/net/mlx4/mlx4.c
> > drivers/net/mlx5/mlx5_rxtx.c
> > drivers/net/nfp/nfp_net.c
> > drivers/net/qede/qede_rxtx.c
> > drivers/net/thunderx/nicvf_rxtx.c
> > drivers/net/virtio/virtio_rxtx.c
> > drivers/net/vmxnet3/vmxnet3_rxtx.c
> >
> > Please, we need a comment for each driver saying "it is OK, we do not need
> > any checksum preparation for TSO"
> > or
> > "yes we have to implement tx_prepare or TSO will not work in this mode"
> 
> I like the idea of tx prep since it should make for cleaner apps.
> 
> For enic, I believe the answer is " it is OK, we do not need any checksum 
> preparation".
> 
> Prior to now, it was necessary to set IP checksum to 0 and put in a TCP/UDP 
> pseudo header. But there is a hardware overwrite of
> checksums option which makes preparation in software unnecessary and it is 
> testing out well so far. I plan to enable it in 17.02. TSO is also
> being enabled for 17.02 and it does not look like any prep is required. So 
> I'm going with " txprep NULL pointer is OK for enic", but may have
> to change my mind if something comes up in testing.

That's great thanks.
Other non-Intel PMD maintainers, please any feedback? 
Konstantin

> 
> -john

[dpdk-dev] [PATCH v12 0/6] add Tx preparation

2016-11-30 Thread Ananyev, Konstantin

Hi Adrien,

> 
> On Mon, Nov 28, 2016 at 12:03:06PM +0100, Thomas Monjalon wrote:
> > We need attention of every PMD developers on this thread.
> 
> I've been following this thread from the beginning while working on rte_flow
> and wanted to see where it was headed before replying. (I know, v11 was
> submitted about 1 month ago but still.)
> 
> > Reminder of what Konstantin suggested:
> > "
> > - if the PMD supports TX offloads AND
> > - if to be able use any of these offloads the upper layer SW would have to:
> > * modify the contents of the packet OR
> > * obey HW specific restrictions
> > then it is a PMD developer responsibility to provide tx_prep() that would 
> > implement
> > expected modifications of the packet contents and restriction checks.
> > Otherwise, tx_prep() implementation is not required and can be safely set 
> > to NULL.
> > "
> >
> > I copy/paste also my previous conclusion:
> >
> > Before txprep, there is only one API: the application must prepare the
> > packets checksum itself (get_psd_sum in testpmd).
> > With txprep, the application have 2 choices: keep doing the job itself
> > or call txprep which calls a PMD-specific function.
> 
> Something is definitely needed here, and only PMDs can provide it. I think
> applications should not have to clear checksum fields or initialize them to
> some magic value, same goes for any other offload or hardware limitation
> that needs to be worked around.
> 
> tx_prep() is one possible answer to this issue, however as mentioned in the
> original patch it can be very expensive if exposed by the PMD.
> 
> Another issue I'm more concerned about is the way limitations are managed
> (struct rte_eth_desc_lim). While not officially tied to tx_prep(), this
> structure contains new fields that are only relevant to a few devices, and I
> fear it will keep growing with each new hardware quirk to manage, breaking
> ABIs in the process.

Well, if some new HW capability/limitation would arise and we'd like to support
it in DPDK, then yes we probably would need to think how to incorporate it here.
Do you have anything particular in mind here?

> 
> What are applications supposed to do, check each of them regardless before
> attempting to send a burst?
> 
> I understand tx_prep() automates this process, however I'm wondering why
> isn't the TX burst function doing that itself. Using nb_mtu_seg_max as an
> example, tx_prep() has an extra check in case of TSO that the TX burst
> function does not perform. This ends up being much more expensive to
> applications due to the additional loop doing redundant testing on each
> mbuf.
> 
> If, say as a performance improvement, we decided to leave the validation
> part to the TX burst function; what remains in tx_prep() is basically heavy
> "preparation" requiring mbuf changes (i.e. erasing checksums, for now).
> 
> Following the same logic, why can't such a thing be made part of the TX
> burst function as well (through a direct call to rte_phdr_cksum_fix()
> whenever necessary). From an application standpoint, what are the advantages
> of having to:
> 
>  if (tx_prep()) // iterate and update mbufs as needed
>  tx_burst(); // iterate and send
> 
> Compared to:
> 
>  tx_burst(); // iterate, update as needed and send

I think that was discussed extensively quite a lot previously here:
As Thomas already replied - main motivation is to allow user
to execute them on different stages of packet TX pipeline,
and probably on different cores.
I think that provides better flexibility to the user to when/where
do these preparations and hopefully would lead to better performance.

Though, if you or any other PMD developer/maintainer would prefer
for particular PMD to combine both functionalities into tx_burst() and
keep tx_prep() as NOP - this is still possible too.  

> 
> Note that PMDs could still provide different TX callbacks depending on the
> set of enabled offloads so performance is not unnecessarily impacted.
> 
> In my opinion the second approach is both faster to applications and more
> friendly from a usability perspective, am I missing something obvious?
> 
> > The question is: does non-Intel drivers need a checksum preparation for TSO?
> > Will it behave well if txprep does nothing in these drivers?
> >
> > When looking at the code, most of drivers handle the TSO flags.
> > But it is hard to know whether they rely on the pseudo checksum or not.
> >
> > git grep -l 'PKT_TX_UDP_CKSUM\|PKT_TX_TCP_CKSUM\|PKT_TX_TCP_SEG' 
> > drivers/net/
> >
> > drivers/net/bnxt/bnxt_txr.c
> > drivers/net/cxgbe/sge.c
> > drivers/net/e1000/em_rxtx.c
> > drivers/net/e1000/igb_rxtx.c
> > drivers/net/ena/ena_ethdev.c
> > drivers/net/enic/enic_rxtx.c
> > drivers/net/fm10k/fm10k_rxtx.c
> > drivers/net/i40e/i40e_rxtx.c
> > drivers/net/ixgbe/ixgbe_rxtx.c
> > drivers/net/mlx4/mlx4.c
> > drivers/net/mlx5/mlx5_rxtx.c
> > drivers/net/nfp/nfp_net.c
> > drivers/net/qede/qede_rxtx.c
> > drivers/net/thunderx/nicvf_rxtx.c
> >

[dpdk-dev] [PATCH RFC 1/2] net/i40e: allow bulk alloc for the max size desc ring

2016-11-29 Thread Ananyev, Konstantin

Hi Ilya,

> Ping.
> 
> Best regards, Ilya Maximets.
> 
> On 19.10.2016 17:07, Ilya Maximets wrote:
> > The only reason why bulk alloc disabled for the rings with
> > more than (I40E_MAX_RING_DESC - RTE_PMD_I40E_RX_MAX_BURST)
> > descriptors is the possible out-of-bound access to the dma
> > memory. But it's the artificial limit and can be easily
> > avoided by allocating of RTE_PMD_I40E_RX_MAX_BURST more
> > descriptors in memory. This will not interfere the HW and,
> > as soon as all rings' memory zeroized, Rx functions will
> > work correctly.
> >
> > This change allows to use vectorized Rx functions with
> > 4096 descriptors in Rx ring which is important to achieve
> > zero packet drop rate in high-load installations.
> >
> > Signed-off-by: Ilya Maximets 
> > ---
> >  drivers/net/i40e/i40e_rxtx.c | 24 +---
> >  1 file changed, 13 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > index 7ae7d9f..1f76691 100644
> > --- a/drivers/net/i40e/i40e_rxtx.c
> > +++ b/drivers/net/i40e/i40e_rxtx.c
> > @@ -409,15 +409,6 @@ check_rx_burst_bulk_alloc_preconditions(__rte_unused 
> > struct i40e_rx_queue *rxq)
> >  "rxq->rx_free_thresh=%d",
> >  rxq->nb_rx_desc, rxq->rx_free_thresh);
> > ret = -EINVAL;
> > -   } else if (!(rxq->nb_rx_desc < (I40E_MAX_RING_DESC -
> > -   RTE_PMD_I40E_RX_MAX_BURST))) {
> > -   PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions: "
> > -"rxq->nb_rx_desc=%d, "
> > -"I40E_MAX_RING_DESC=%d, "
> > -"RTE_PMD_I40E_RX_MAX_BURST=%d",
> > -rxq->nb_rx_desc, I40E_MAX_RING_DESC,
> > -RTE_PMD_I40E_RX_MAX_BURST);
> > -   ret = -EINVAL;
> > }
> >  #else
> > ret = -EINVAL;
> > @@ -1698,8 +1689,19 @@ i40e_dev_rx_queue_setup(struct rte_eth_dev *dev,
> > rxq->rx_deferred_start = rx_conf->rx_deferred_start;
> >
> > /* Allocate the maximun number of RX ring hardware descriptor. */
> > -   ring_size = sizeof(union i40e_rx_desc) * I40E_MAX_RING_DESC;
> > -   ring_size = RTE_ALIGN(ring_size, I40E_DMA_MEM_ALIGN);
> > +   len = I40E_MAX_RING_DESC;
> > +
> > +#ifdef RTE_LIBRTE_I40E_RX_ALLOW_BULK_ALLOC
> > +   /**
> > +* Allocating a little more memory because vectorized/bulk_alloc Rx
> > +* functions doesn't check boundaries each time.
> > +*/
> > +   len += RTE_PMD_I40E_RX_MAX_BURST;
> > +#endif
> > +

Looks good to me.
One question, though do we really need '+#ifdef 
RTE_LIBRTE_I40E_RX_ALLOW_BULK_ALLOC' here?
Why just not remove this ifdef here and always add allocate extra descriptors.
Konstantin

> > +   ring_size = RTE_ALIGN(len * sizeof(union i40e_rx_desc),
> > + I40E_DMA_MEM_ALIGN);
> > +
> > rz = rte_eth_dma_zone_reserve(dev, "rx_ring", queue_idx,
> >   ring_size, I40E_RING_BASE_ALIGN, socket_id);
> > if (!rz) {
> >

[dpdk-dev] [PATCH] doc: postpone ABI changes for Tx prepare

2016-11-10 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Thomas Monjalon
> Sent: Wednesday, November 9, 2016 10:31 PM
> To: Kulasek, TomaszX 
> Cc: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH] doc: postpone ABI changes for Tx prepare
> 
> The changes for the feature "Tx prepare" should be made in version 17.02.
> 
> Signed-off-by: Thomas Monjalon 
> ---
>  doc/guides/rel_notes/deprecation.rst | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst 
> b/doc/guides/rel_notes/deprecation.rst
> index 1a9e1ae..ab6014d 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -8,8 +8,8 @@ API and ABI deprecation notices are to be posted here.
>  Deprecation Notices
>  ---
> 
> -* In 16.11 ABI changes are planned: the ``rte_eth_dev`` structure will be
> -  extended with new function pointer ``tx_pkt_prep`` allowing verification
> +* In 17.02 ABI changes are planned: the ``rte_eth_dev`` structure will be
> +  extended with new function pointer ``tx_pkt_prepare`` allowing verification
>and processing of packet burst to meet HW specific requirements before
>transmit. Also new fields will be added to the ``rte_eth_desc_lim`` 
> structure:
>``nb_seg_max`` and ``nb_mtu_seg_max`` providing information about number of
> --

Acked-by: Konstantin Ananyev 

> 2.7.0

[dpdk-dev] [PATCH] examples/l3fwd: force CRC stripping for i40evf

2016-11-09 Thread Ananyev, Konstantin



> -Original Message-
> From: Topel, Bjorn
> Sent: Wednesday, November 9, 2016 11:28 AM
> To: Ananyev, Konstantin ; dev at dpdk.org; 
> Zhang, Helin 
> Cc: Xu, Qian Q ; Yao, Lei A ; 
> Wu, Jingjing ;
> thomas.monjalon at 6wind.com
> Subject: Re: [dpdk-dev] [PATCH] examples/l3fwd: force CRC stripping for i40evf
> 
> >> Correct, so the broader question would be "what is the correct
> >> behavior for an example application, when a port configuration
> >> isn't supported by the hardware?".
> >>
> >> My stand, FWIW, is that igb and ixgbe should have the same
> >> semantics as i40e currently has, i.e. return an error to the user
> >> if the port is mis-configured, NOT changing the setting behind the
> >> users back.
> >>
> >
> > Fine by me, but then it means that the fix haw to include changes
> > for all apps plus ixgbe and igb PMDs, correct? :)
> 
> Ugh. Correct, I guess. :-)
> 
> As for ixgbe and igb - they need a patch changing from silent ignore to
> explicit error. Regarding the apps, I guess all the apps that rely on
> that disabling CRC stripping always work, need some work. Or should all
> the example applications have CRC stripping *enabled* by default? I'd
> assume that all DPDK supported NICs has support for CRC stripping

[dpdk-dev] [PATCH] examples/l3fwd: force CRC stripping for i40evf

2016-11-09 Thread Ananyev, Konstantin



> 
> Adding Helin to the conversation.
> 
> > That's a common problem across Intel VF devices. Bothe igb and ixgbe
> > overcomes that problem just by forcing ' rxmode.hw_strip_crc = 1;' at
> > PMD itself:
> 
> [...]
> 
> > Wonder, can't i40e VF do the same?
> 
> Right, however, and this (silent failure) approach was rejected by the
> maintainers. Please, refer to this thread:
> http://dpdk.org/ml/archives/dev/2016-April/037523.html
> 
>  > BTW, all other examples would experience same problem too, right?
> 
> Correct, so the broader question would be "what is the correct behavior
> for an example application, when a port configuration isn't supported by
> the hardware?".
> 
> My stand, FWIW, is that igb and ixgbe should have the same semantics as
> i40e currently has, i.e. return an error to the user if the port is
> mis-configured, NOT changing the setting behind the users back.
> 

Fine by me, but then it means that the fix haw to include changes for all apps
plus ixgbe and igb PMDs, correct? :)

Konstantin

[dpdk-dev] [PATCH] examples/l3fwd: force CRC stripping for i40evf

2016-11-09 Thread Ananyev, Konstantin


Hi,

> 
> Commit 1bbcc5d21129 ("i40evf: report error for unsupported CRC
> stripping config") broke l3fwd, since it was forcing that CRC was
> kept. Now, if i40evf is running, CRC stripping will be enabled.
> 
> Signed-off-by: Bj?rn T?pel 
> ---
>  examples/l3fwd/main.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c
> index 7223e773107e..b60278794135 100644
> --- a/examples/l3fwd/main.c
> +++ b/examples/l3fwd/main.c
> @@ -906,6 +906,14 @@ main(int argc, char **argv)
>   n_tx_queue = MAX_TX_QUEUE_PER_PORT;
>   printf("Creating queues: nb_rxq=%d nb_txq=%u... ",
>   nb_rx_queue, (unsigned)n_tx_queue );
> + rte_eth_dev_info_get(portid, _info);
> + if (dev_info.driver_name &&
> + strcmp(dev_info.driver_name, "net_i40e_vf") == 0) {
> + /* i40evf require that CRC stripping is enabled. */
> + port_conf.rxmode.hw_strip_crc = 1;
> + } else {
> + port_conf.rxmode.hw_strip_crc = 0;
> + }

That's a common problem across Intel VF devices.
Bothe igb and ixgbe overcomes that problem just by forcing ' 
rxmode.hw_strip_crc = 1;'
at PMD itself:

 static int
ixgbevf_dev_configure(struct rte_eth_dev *dev)
{
...
   /*
 * VF has no ability to enable/disable HW CRC
 * Keep the persistent behavior the same as Host PF
 */
#ifndef RTE_LIBRTE_IXGBE_PF_DISABLE_STRIP_CRC
if (!conf->rxmode.hw_strip_crc) {
PMD_INIT_LOG(NOTICE, "VF can't disable HW CRC Strip");
conf->rxmode.hw_strip_crc = 1;
}
#else
if (conf->rxmode.hw_strip_crc) {
PMD_INIT_LOG(NOTICE, "VF can't enable HW CRC Strip");
conf->rxmode.hw_strip_crc = 0;
}
#endif

Wonder, can't i40e VF do the same?
BTW, all other examples would experience same problem too, right?
Konstantin

>   ret = rte_eth_dev_configure(portid, nb_rx_queue,
>   (uint16_t)n_tx_queue, _conf);
>   if (ret < 0)
> @@ -946,7 +954,6 @@ main(int argc, char **argv)
>   printf("txq=%u,%d,%d ", lcore_id, queueid, socketid);
>   fflush(stdout);
> 
> - rte_eth_dev_info_get(portid, _info);
>   txconf = _info.default_txconf;
>   if (port_conf.rxmode.jumbo_frame)
>   txconf->txq_flags = 0;
> --
> 2.9.3

[dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation

2016-11-01 Thread Ananyev, Konstantin


Hi Thomas,

> > > I suggest to integrate it in the beginning of 17.02 cycle, with the hope
> > > that you can convince other developers to implement it in other drivers,
> > > so we could finally enable it in the default config.
> >
> > Ok, any insights then, how we can convince people to do that?
> 
> You just have to explain clearly what this new feature is bringing
> and what will be the future possibilities.
> 
> > BTW,  it means then that tx_prep() should become part of mandatory API
> > to be implemented by each PMD doing TX offloads, right?
> 
> Right.
> The question is "what means mandatory"?

For me "mandatory" here would mean that:
 - if the PMD supports TX offloads AND
 - if to be able use any of these offloads the upper layer SW would have to:
- modify the contents of the packet OR
- obey HW specific restrictions
 then it is a PMD developer responsibility to provide tx_prep() that would 
implement
 expected modifications of the packet contents and restriction checks.
Otherwise, tx_prep() implementation is not required and can be safely set to 
NULL.  

Does it sounds good enough to everyone?

> Should we block some patches for non-compliant drivers?

If we agree that it should be a 'mandatory' one - and patch in question breaks
that requirement, then probably yes.

> Should we remove offloads capability from non-compliant drivers?

Do you mean existing PMDs?
Are there any particular right now, that can't work properly with testpmd 
csumonly mode?

Konstantin

[dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation

2016-10-28 Thread Ananyev, Konstantin


> 
> 2016-10-28 11:34, Ananyev, Konstantin:
> > > > 2016-10-27 16:24, Ananyev, Konstantin:
> > > > > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > > > > > 2016-10-27 15:52, Ananyev, Konstantin:
> > > > > > > > 2016-10-26 14:56, Tomasz Kulasek:
> > > > > > > > > --- a/config/common_base
> > > > > > > > > +++ b/config/common_base
> > > > > > > > > +CONFIG_RTE_ETHDEV_TX_PREP=y
> > > > > > > >
> > > > > > > > We cannot enable it until it is implemented in every drivers.
> > > > > > >
> > > > > > > Not sure why?
> > > > > > > If tx_pkt_prep == NULL, then rte_eth_tx_prep() would just act as 
> > > > > > > noop.
> > > > > > > Right now it is not mandatory for the PMD to implement it.
> > > > > >
> > > > > > If it is not implemented, the application must do the preparation 
> > > > > > by itself.
> > > > > > From patch 6:
> > > > > > "
> > > > > > Removed pseudo header calculation for udp/tcp/tso packets from
> > > > > > application and used Tx preparation API for packet preparation and
> > > > > > verification.
> > > > > > "
> > > > > > So how does it behave with other drivers?
> > > > >
> > > > > Hmm so it seems that we broke testpmd csumonly mode for non-intel 
> > > > > drivers..
> > > > > My bad, missed that part completely.
> > > > > Yes, then I suppose for now we'll need to support both (with and 
> > > > > without) code paths for testpmd.
> > > > > Probably a new fwd mode or just extra parameter for the existing one?
> > > > > Any other suggestions?
> > > >
> > > > Please think how we can use it in every applications.
> > > > It is not ready.
> > > > Either we introduce the API without enabling it, or we implement it
> > > > in every drivers.
> > >
> > > I understand your position here, but just like to point that:
> > > 1) It is a new functionality optional to use.
> > >  The app is free not to use that functionality and still do the 
> > > preparation itself
> > >  (as it has to do it now).
> > > All existing apps would keep working as expected without using that 
> > > function.
> > > Though if the app developer knows that for all HW models he plans to 
> > > run on
> > > tx_prep is implemented - he is free to use it.
> > > 2) It would be difficult for Tomasz (and other Intel guys) to 
> > > implement tx_prep()
> > >  for all non-Intel HW that DPDK supports right now.
> > >  We just don't have all the actual HW in stock and probably adequate 
> > > knowledge of it.
> > > So we depend here on the good will of other PMD mainaners/developers 
> > > to implement
> > > tx_prep() for these devices.
> > > From other side, if it will be disabled by default, then, I think,
> > > PMD developers just wouldn't be motivated to implement it.
> > > So it will be left untested and unused forever.
> >
> > Actually as another thought:
> > Can we have it enabled by default, but mark it as experimental or so?
> > If memory serves me right, we've done that for cryptodev in the past, no?
> 
> Cryptodev was a whole new library.
> We won't play the game "find which function is experimental or not".
> 
> We should not enable a function until it is fully implemented.
> 
> If the user really understands that it will work only with few drivers
> then he can change the build configuration himself.
> Enabling in the default configuration is a message to say that it works
> everywhere without any risk.
> It's so simple that I don't even understand why I must argue for.
> 
> And by the way, it is late for 16.11.

Ok, I understand your concern about enabling it by default and testpmd breakage,
but what else you believe is not ready? 

> I suggest to integrate it in the beginning of 17.02 cycle, with the hope
> that you can convince other developers to implement it in other drivers,
> so we could finally enable it in the default config.

Ok, any insights then, how we can convince people to do that?
BTW,  it means then that tx_prep() should become part of mandatory API
to be implemented by each PMD doing TX offloads, right?   

> 
> Oh, and I don't trust that nobody were thinking that it would break testpmd
> for non-Intel drivers.

Well, believe it or not, but yes, I missed that one.
I think I already admitted that it was my fault, and apologized for that.
But sure, it is your choice to trust me here or not.
Konstantin

[dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation

2016-10-28 Thread Ananyev, Konstantin



> 
> Hi Thomasz,
> 
> >
> > 2016-10-27 16:24, Ananyev, Konstantin:
> > > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > > > 2016-10-27 15:52, Ananyev, Konstantin:
> > > > > > Hi Tomasz,
> > > > > >
> > > > > > This is a major new function in the API and I still have some 
> > > > > > comments.
> > > > > >
> > > > > > 2016-10-26 14:56, Tomasz Kulasek:
> > > > > > > --- a/config/common_base
> > > > > > > +++ b/config/common_base
> > > > > > > +CONFIG_RTE_ETHDEV_TX_PREP=y
> > > > > >
> > > > > > We cannot enable it until it is implemented in every drivers.
> > > > >
> > > > > Not sure why?
> > > > > If tx_pkt_prep == NULL, then rte_eth_tx_prep() would just act as noop.
> > > > > Right now it is not mandatory for the PMD to implement it.
> > > >
> > > > If it is not implemented, the application must do the preparation by 
> > > > itself.
> > > > From patch 6:
> > > > "
> > > > Removed pseudo header calculation for udp/tcp/tso packets from
> > > > application and used Tx preparation API for packet preparation and
> > > > verification.
> > > > "
> > > > So how does it behave with other drivers?
> > >
> > > Hmm so it seems that we broke testpmd csumonly mode for non-intel 
> > > drivers..
> > > My bad, missed that part completely.
> > > Yes, then I suppose for now we'll need to support both (with and without) 
> > > code paths for testpmd.
> > > Probably a new fwd mode or just extra parameter for the existing one?
> > > Any other suggestions?
> >
> > Please think how we can use it in every applications.
> > It is not ready.
> > Either we introduce the API without enabling it, or we implement it
> > in every drivers.
> 
> I understand your position here, but just like to point that:
> 1) It is a new functionality optional to use.
>  The app is free not to use that functionality and still do the 
> preparation itself
>  (as it has to do it now).
> All existing apps would keep working as expected without using that 
> function.
> Though if the app developer knows that for all HW models he plans to run 
> on
> tx_prep is implemented - he is free to use it.
> 2) It would be difficult for Tomasz (and other Intel guys) to implement 
> tx_prep()
>  for all non-Intel HW that DPDK supports right now.
>  We just don't have all the actual HW in stock and probably adequate 
> knowledge of it.
> So we depend here on the good will of other PMD mainaners/developers to 
> implement
> tx_prep() for these devices.
> From other side, if it will be disabled by default, then, I think,
> PMD developers just wouldn't be motivated to implement it.
> So it will be left untested and unused forever.

Actually as another thought:
Can we have it enabled by default, but mark it as experimental or so?
If memory serves me right, we've done that for cryptodev in the past, no?
Konstantin

> 
> >
> > > > > > >  struct rte_eth_dev {
> > > > > > >   eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive 
> > > > > > > function. */
> > > > > > >   eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit 
> > > > > > > function. */
> > > > > > > + eth_tx_prep_t tx_pkt_prep; /**< Pointer to PMD transmit prepare 
> > > > > > > function. */
> > > > > > >   struct rte_eth_dev_data *data;  /**< Pointer to device data */
> > > > > > >   const struct eth_driver *driver;/**< Driver for this device */
> > > > > > >   const struct eth_dev_ops *dev_ops; /**< Functions exported by 
> > > > > > > PMD */
> > > > > >
> > > > > > Could you confirm why tx_pkt_prep is not in dev_ops?
> > > > > > I guess we want to have several implementations?
> > > > >
> > > > > Yes, it depends on configuration options, same as tx_pkt_burst.
> > > > >
> > > > > >
> > > > > > Shouldn't we have a const struct control_dev_ops and a struct 
> > > > > > datapath_dev_ops?
> > > > >
> > > > > That's probably a good idea, but I suppose it is out of scope for 
> > > > > that patch.
> > > >
> > > > No it's not out of scope.
> > > > It answers to the question "why is it added in this structure and not 
> > > > dev_ops".
> > > > We won't do this change when nothing else is changed in the struct.
> > >
> > > Not sure I understood you here:
> > > Are you saying datapath_dev_ops/controlpath_dev_ops have to be introduced 
> > > as part of that patch?
> > > But that's a lot of  changes all over rte_ethdev.[h,c].
> > > It definitely worse a separate patch (might be some discussion) for me.
> >
> > Yes it could be a separate patch in the same patchset.
> 
> Honestly, I think it is a good idea, but it is too late and too risky to do 
> such change right now.
> We are on RC2 right now, just few days before RC3...
> Can't that wait till 17.02?
> From my understanding - it is pure code restructuring, without any 
> functionality affected.
> Konstantin
>

[dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation

2016-10-28 Thread Ananyev, Konstantin

Hi Thomasz,

> 
> 2016-10-27 16:24, Ananyev, Konstantin:
> > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > > 2016-10-27 15:52, Ananyev, Konstantin:
> > > > > Hi Tomasz,
> > > > >
> > > > > This is a major new function in the API and I still have some 
> > > > > comments.
> > > > >
> > > > > 2016-10-26 14:56, Tomasz Kulasek:
> > > > > > --- a/config/common_base
> > > > > > +++ b/config/common_base
> > > > > > +CONFIG_RTE_ETHDEV_TX_PREP=y
> > > > >
> > > > > We cannot enable it until it is implemented in every drivers.
> > > >
> > > > Not sure why?
> > > > If tx_pkt_prep == NULL, then rte_eth_tx_prep() would just act as noop.
> > > > Right now it is not mandatory for the PMD to implement it.
> > >
> > > If it is not implemented, the application must do the preparation by 
> > > itself.
> > > From patch 6:
> > > "
> > > Removed pseudo header calculation for udp/tcp/tso packets from
> > > application and used Tx preparation API for packet preparation and
> > > verification.
> > > "
> > > So how does it behave with other drivers?
> >
> > Hmm so it seems that we broke testpmd csumonly mode for non-intel drivers..
> > My bad, missed that part completely.
> > Yes, then I suppose for now we'll need to support both (with and without) 
> > code paths for testpmd.
> > Probably a new fwd mode or just extra parameter for the existing one?
> > Any other suggestions?
> 
> Please think how we can use it in every applications.
> It is not ready.
> Either we introduce the API without enabling it, or we implement it
> in every drivers.

I understand your position here, but just like to point that:
1) It is a new functionality optional to use.
 The app is free not to use that functionality and still do the preparation 
itself
 (as it has to do it now).
All existing apps would keep working as expected without using that 
function.
Though if the app developer knows that for all HW models he plans to run on
tx_prep is implemented - he is free to use it.
2) It would be difficult for Tomasz (and other Intel guys) to implement 
tx_prep()
 for all non-Intel HW that DPDK supports right now.
 We just don't have all the actual HW in stock and probably adequate 
knowledge of it.
So we depend here on the good will of other PMD mainaners/developers to 
implement
tx_prep() for these devices. 
From other side, if it will be disabled by default, then, I think,
PMD developers just wouldn't be motivated to implement it. 
So it will be left untested and unused forever.   

> 
> > > > > >  struct rte_eth_dev {
> > > > > > eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive 
> > > > > > function. */
> > > > > > eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit 
> > > > > > function. */
> > > > > > +   eth_tx_prep_t tx_pkt_prep; /**< Pointer to PMD transmit prepare 
> > > > > > function. */
> > > > > > struct rte_eth_dev_data *data;  /**< Pointer to device data */
> > > > > > const struct eth_driver *driver;/**< Driver for this device */
> > > > > > const struct eth_dev_ops *dev_ops; /**< Functions exported by 
> > > > > > PMD */
> > > > >
> > > > > Could you confirm why tx_pkt_prep is not in dev_ops?
> > > > > I guess we want to have several implementations?
> > > >
> > > > Yes, it depends on configuration options, same as tx_pkt_burst.
> > > >
> > > > >
> > > > > Shouldn't we have a const struct control_dev_ops and a struct 
> > > > > datapath_dev_ops?
> > > >
> > > > That's probably a good idea, but I suppose it is out of scope for that 
> > > > patch.
> > >
> > > No it's not out of scope.
> > > It answers to the question "why is it added in this structure and not 
> > > dev_ops".
> > > We won't do this change when nothing else is changed in the struct.
> >
> > Not sure I understood you here:
> > Are you saying datapath_dev_ops/controlpath_dev_ops have to be introduced 
> > as part of that patch?
> > But that's a lot of  changes all over rte_ethdev.[h,c].
> > It definitely worse a separate patch (might be some discussion) for me.
> 
> Yes it could be a separate patch in the same patchset.

Honestly, I think it is a good idea, but it is too late and too risky to do 
such change right now.
We are on RC2 right now, just few days before RC3...
Can't that wait till 17.02?
>From my understanding - it is pure code restructuring, without any 
>functionality affected.
Konstantin

[dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation

2016-10-28 Thread Ananyev, Konstantin



> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Friday, October 28, 2016 11:22 AM
> To: Ananyev, Konstantin ; Kulasek, TomaszX 
> 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation
> 
> 2016-10-28 10:15, Ananyev, Konstantin:
> > > From: Ananyev, Konstantin
> > > > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > > > > 2016-10-27 15:52, Ananyev, Konstantin:
> > > > > > > 2016-10-26 14:56, Tomasz Kulasek:
> > > > > > > > --- a/config/common_base
> > > > > > > > +++ b/config/common_base
> > > > > > > > +CONFIG_RTE_ETHDEV_TX_PREP=y
> > > > > > >
> > > > > > > We cannot enable it until it is implemented in every drivers.
> > > > > >
> > > > > > Not sure why?
> > > > > > If tx_pkt_prep == NULL, then rte_eth_tx_prep() would just act as 
> > > > > > noop.
> > > > > > Right now it is not mandatory for the PMD to implement it.
> > > > >
> > > > > If it is not implemented, the application must do the preparation by
> > > > itself.
> > > > > From patch 6:
> > > > > "
> > > > > Removed pseudo header calculation for udp/tcp/tso packets from
> > > > > application and used Tx preparation API for packet preparation and
> > > > > verification.
> > > > > "
> > > > > So how does it behave with other drivers?
> > > >
> > > > Hmm so it seems that we broke testpmd csumonly mode for non-intel
> > > > drivers..
> > > > My bad, missed that part completely.
> > > > Yes, then I suppose for now we'll need to support both (with and 
> > > > without)
> > > > code paths for testpmd.
> > > > Probably a new fwd mode or just extra parameter for the existing one?
> > > > Any other suggestions?
> > > >
> > >
> > > I had sent txprep engine in v2 
> > > (http://dpdk.org/dev/patchwork/patch/15775/), but I'm opened on the 
> > > suggestions. If you like it I can
> resent
> > > it in place of csumonly modification.
> >
> > I still not sure it is worth to have another version of csum...
> > Can we introduce a new global variable in testpmd and a new command:
> > testpmd> csum tx_prep
> > or so?
> > Looking at current testpmd patch, I suppose the changes will be minimal.
> > What do you think?
> 
> No please no!
> The problem is not in testpmd.
> The problem is in every applications.
> Should we prepare the checksums or let tx_prep do it?

Not sure, I understood you...
Right now we don't' change other apps.
They would work as before.
If people would like to start to use tx_prep in their apps -
they are free to do that.
If they like to keep doing that manually - that's fine too.
>From other side we need an ability to test (and demonstrate) that new 
>functionality.
So we do need changes in testpmd.
Konstantin



> The result will depend of the driver used.

[dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation

2016-10-28 Thread Ananyev, Konstantin

Hi Tomasz,

> 
> Hi
> 
> > -Original Message-
> > From: Ananyev, Konstantin
> > Sent: Thursday, October 27, 2016 18:24
> > To: Thomas Monjalon 
> > Cc: Kulasek, TomaszX ; dev at dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation
> >
> >
> >
> > > -Original Message-
> > > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > > Sent: Thursday, October 27, 2016 5:02 PM
> > > To: Ananyev, Konstantin 
> > > Cc: Kulasek, TomaszX ; dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation
> > >
> > > 2016-10-27 15:52, Ananyev, Konstantin:
> > > >
> > > > >
> > > > > Hi Tomasz,
> > > > >
> > > > > This is a major new function in the API and I still have some
> > comments.
> > > > >
> > > > > 2016-10-26 14:56, Tomasz Kulasek:
> > > > > > --- a/config/common_base
> > > > > > +++ b/config/common_base
> > > > > > +CONFIG_RTE_ETHDEV_TX_PREP=y
> > > > >
> > > > > We cannot enable it until it is implemented in every drivers.
> > > >
> > > > Not sure why?
> > > > If tx_pkt_prep == NULL, then rte_eth_tx_prep() would just act as noop.
> > > > Right now it is not mandatory for the PMD to implement it.
> > >
> > > If it is not implemented, the application must do the preparation by
> > itself.
> > > From patch 6:
> > > "
> > > Removed pseudo header calculation for udp/tcp/tso packets from
> > > application and used Tx preparation API for packet preparation and
> > > verification.
> > > "
> > > So how does it behave with other drivers?
> >
> > Hmm so it seems that we broke testpmd csumonly mode for non-intel
> > drivers..
> > My bad, missed that part completely.
> > Yes, then I suppose for now we'll need to support both (with and without)
> > code paths for testpmd.
> > Probably a new fwd mode or just extra parameter for the existing one?
> > Any other suggestions?
> >
> 
> I had sent txprep engine in v2 (http://dpdk.org/dev/patchwork/patch/15775/), 
> but I'm opened on the suggestions. If you like it I can resent
> it in place of csumonly modification.

I still not sure it is worth to have another version of csum...
Can we introduce a new global variable in testpmd and a new command:
testpmd> csum tx_prep
or so? 
Looking at current testpmd patch, I suppose the changes will be minimal.
What do you think?
Konstantin 

> 
> Tomasz
> 
> > >
> > > > > >  struct rte_eth_dev {
> > > > > > eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive
> > function. */
> > > > > > eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit
> > > > > > function. */
> > > > > > +   eth_tx_prep_t tx_pkt_prep; /**< Pointer to PMD transmit
> > > > > > +prepare function. */
> > > > > > struct rte_eth_dev_data *data;  /**< Pointer to device data */
> > > > > > const struct eth_driver *driver;/**< Driver for this device */
> > > > > > const struct eth_dev_ops *dev_ops; /**< Functions exported by
> > > > > > PMD */
> > > > >
> > > > > Could you confirm why tx_pkt_prep is not in dev_ops?
> > > > > I guess we want to have several implementations?
> > > >
> > > > Yes, it depends on configuration options, same as tx_pkt_burst.
> > > >
> > > > >
> > > > > Shouldn't we have a const struct control_dev_ops and a struct
> > datapath_dev_ops?
> > > >
> > > > That's probably a good idea, but I suppose it is out of scope for that
> > patch.
> > >
> > > No it's not out of scope.
> > > It answers to the question "why is it added in this structure and not
> > dev_ops".
> > > We won't do this change when nothing else is changed in the struct.
> >
> > Not sure I understood you here:
> > Are you saying datapath_dev_ops/controlpath_dev_ops have to be introduced
> > as part of that patch?
> > But that's a lot of  changes all over rte_ethdev.[h,c].
> > It definitely worse a separate patch (might be some discussion) for me.
> > Konstantin
> >
> >

[dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation

2016-10-27 Thread Ananyev, Konstantin



> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Thursday, October 27, 2016 5:02 PM
> To: Ananyev, Konstantin 
> Cc: Kulasek, TomaszX ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation
> 
> 2016-10-27 15:52, Ananyev, Konstantin:
> >
> > >
> > > Hi Tomasz,
> > >
> > > This is a major new function in the API and I still have some comments.
> > >
> > > 2016-10-26 14:56, Tomasz Kulasek:
> > > > --- a/config/common_base
> > > > +++ b/config/common_base
> > > > +CONFIG_RTE_ETHDEV_TX_PREP=y
> > >
> > > We cannot enable it until it is implemented in every drivers.
> >
> > Not sure why?
> > If tx_pkt_prep == NULL, then rte_eth_tx_prep() would just act as noop.
> > Right now it is not mandatory for the PMD to implement it.
> 
> If it is not implemented, the application must do the preparation by itself.
> From patch 6:
> "
> Removed pseudo header calculation for udp/tcp/tso packets from
> application and used Tx preparation API for packet preparation and
> verification.
> "
> So how does it behave with other drivers?

Hmm so it seems that we broke testpmd csumonly mode for non-intel drivers..
My bad, missed that part completely.
Yes, then I suppose for now we'll need to support both (with and without) code 
paths for testpmd.
Probably a new fwd mode or just extra parameter for the existing one?
Any other suggestions?

> 
> > > >  struct rte_eth_dev {
> > > > eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive 
> > > > function. */
> > > > eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit 
> > > > function. */
> > > > +   eth_tx_prep_t tx_pkt_prep; /**< Pointer to PMD transmit prepare 
> > > > function. */
> > > > struct rte_eth_dev_data *data;  /**< Pointer to device data */
> > > > const struct eth_driver *driver;/**< Driver for this device */
> > > > const struct eth_dev_ops *dev_ops; /**< Functions exported by 
> > > > PMD */
> > >
> > > Could you confirm why tx_pkt_prep is not in dev_ops?
> > > I guess we want to have several implementations?
> >
> > Yes, it depends on configuration options, same as tx_pkt_burst.
> >
> > >
> > > Shouldn't we have a const struct control_dev_ops and a struct 
> > > datapath_dev_ops?
> >
> > That's probably a good idea, but I suppose it is out of scope for that 
> > patch.
> 
> No it's not out of scope.
> It answers to the question "why is it added in this structure and not 
> dev_ops".
> We won't do this change when nothing else is changed in the struct.

Not sure I understood you here:
Are you saying datapath_dev_ops/controlpath_dev_ops have to be introduced as 
part of that patch?
But that's a lot of  changes all over rte_ethdev.[h,c].
It definitely worse a separate patch (might be some discussion) for me.
Konstantin

[dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation

2016-10-27 Thread Ananyev, Konstantin



> 
> Hi Tomasz,
> 
> This is a major new function in the API and I still have some comments.
> 
> 2016-10-26 14:56, Tomasz Kulasek:
> > --- a/config/common_base
> > +++ b/config/common_base
> > +CONFIG_RTE_ETHDEV_TX_PREP=y
> 
> We cannot enable it until it is implemented in every drivers.

Not sure why?
If tx_pkt_prep == NULL, then rte_eth_tx_prep() would just act as noop.
Right now it is not mandatory for the PMD to implement it.

> 
> >  struct rte_eth_dev {
> > eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive function. */
> > eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit function. */
> > +   eth_tx_prep_t tx_pkt_prep; /**< Pointer to PMD transmit prepare 
> > function. */
> > struct rte_eth_dev_data *data;  /**< Pointer to device data */
> > const struct eth_driver *driver;/**< Driver for this device */
> > const struct eth_dev_ops *dev_ops; /**< Functions exported by PMD */
> 
> Could you confirm why tx_pkt_prep is not in dev_ops?
> I guess we want to have several implementations?

Yes, it depends on configuration options, same as tx_pkt_burst.

> 
> Shouldn't we have a const struct control_dev_ops and a struct 
> datapath_dev_ops?

That's probably a good idea, but I suppose it is out of scope for that patch.
Konstantin

[dpdk-dev] mbuf changes

2016-10-25 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Adrien Mazarguil
> Sent: Tuesday, October 25, 2016 2:48 PM
> To: Morten Br?rup 
> Cc: Richardson, Bruce ; Wiles, Keith 
> ; dev at dpdk.org; Olivier Matz
> ; Oleg Kuporosov 
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Br?rup wrote:
> > Comments inline.
> 
> I'm only replying to the nb_segs bits here.
> 
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > > Sent: Tuesday, October 25, 2016 1:14 PM
> > > To: Adrien Mazarguil
> > > Cc: Morten Br?rup; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> > > Kuporosov
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Br?rup wrote:
> > > > > Comments inline.
> > > > >
> > > > > Med venlig hilsen / kind regards
> > > > > - Morten Br?rup
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> > > > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > > To: Bruce Richardson
> > > > > > Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz; Oleg
> > > > > > Kuporosov
> > > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > > >
> > > > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > > > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> > > > > > [...]
> > > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup
> > > > > >  wrote:
> > > > > > [...]
> > > > > > > > > 5.
> > > > > > > > >
> > > > > > > > > And here?s something new to think about:
> > > > > > > > >
> > > > > > > > > m->next already reveals if there are more segments to a
> > > packet.
> > > > > > Which purpose does m->nb_segs serve that is not already covered
> > > by
> > > > > > m-
> > > > > > >next?
> > > > > > >
> > > > > > > It is duplicate info, but nb_segs can be used to check the
> > > > > > > validity
> > > > > > of
> > > > > > > the next pointer without having to read the second mbuf
> > > cacheline.
> > > > > > >
> > > > > > > Whether it's worth having is something I'm happy enough to
> > > > > > > discuss, though.
> > > > > >
> > > > > > Although slower in some cases than a full blown "next packet"
> > > > > > pointer, nb_segs can also be conveniently abused to link several
> > > > > > packets and their segments in the same list without wasting
> > > space.
> > > > >
> > > > > I don?t understand that; can you please elaborate? Are you abusing
> > > m->nb_segs as an index into an array in your application? If that is
> > > the case, and it is endorsed by the community, we should get rid of m-
> > > >nb_segs and add a member for application specific use instead.
> > > >
> > > > Well, that's just an idea, I'm not aware of any application using
> > > > this, however the ability to link several packets with segments seems
> > > > useful to me (e.g. buffering packets). Here's a diagram:
> > > >
> > > >  .---.   .---.   .---.   .---.   .---
> > > ---
> > > >  | pkt 0 |   | seg 1 |   | seg 2 |   | pkt 1 |   |
> > > pkt 2
> > > >  |  next --->|  next --->|  next --->|  next --->|
> > > ...
> > > >  | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
> > > >  `---'   `---'   `---'   `---'   `---
> > > ---
> >
> > I see. It makes it possible to refer to a burst of packets (with segments 
> > or not) by a single mbuf reference, as an alternative to the current
> design pattern of using an array and length (struct rte_mbuf **mbufs, 
> unsigned count).
> >
> > This would require implementation in the PMDs etc.
> >
> > And even in this case, m->nb_segs does not need to be an integer, but could 
> > be replaced by a single bit indicating if the segment is a
> continuation of a packet or the beginning (alternatively the end) of a 
> packet, i.e. the bit can be set for either the first or the last segment in
> the packet.

We do need nb_segs -  at least for TX.
That's how TX function calculates how many TXDs it needs to allocate(and fill).
Of-course it can re-scan whole chain of segments to count them, but I think
it would slowdown things even more.
Though yes, I suppose it can be moved to the second cahe-line.
Konstantin

> 
> Sure however if we keep the current definition, a single bit would not be
> enough as it must be nonzero for the buffer to be valid. I think a 8 bit
> field is not that expensive for a counter.
> 
> > It is an almost equivalent alternative to the fundamental design pattern of 
> > using an array of mbuf with count, which is widely implemented
> in DPDK. And m->next still lives in the second cache line, so I don't see any 
> gain by this.
> 
> That's right, it does not have to live in the first cache line, my only
>

[dpdk-dev] mbuf changes

2016-10-25 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> Sent: Tuesday, October 25, 2016 1:49 PM
> To: Richardson, Bruce ; Morten Br?rup  smartsharesystems.com>
> Cc: Adrien Mazarguil ; Wiles, Keith 
> ; dev at dpdk.org; Oleg Kuporosov
> 
> Subject: Re: [dpdk-dev] mbuf changes
> 
> 
> 
> On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Br?rup wrote:
> >> Comments at the end.
> >>
> >> Med venlig hilsen / kind regards
> >> - Morten Br?rup
> >>
> >>> -Original Message-
> >>> From: Bruce Richardson [mailto:bruce.richardson at intel.com]
> >>> Sent: Tuesday, October 25, 2016 2:20 PM
> >>> To: Morten Br?rup
> >>> Cc: Adrien Mazarguil; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> >>> Kuporosov
> >>> Subject: Re: [dpdk-dev] mbuf changes
> >>>
> >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Br?rup wrote:
>  Comments inline.
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > Richardson
> > Sent: Tuesday, October 25, 2016 1:14 PM
> > To: Adrien Mazarguil
> > Cc: Morten Br?rup; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> > Kuporosov
> > Subject: Re: [dpdk-dev] mbuf changes
> >
> > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> >> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Br?rup wrote:
> >>> Comments inline.
> >>>
> >>> Med venlig hilsen / kind regards
> >>> - Morten Br?rup
> >>>
> >>>
>  -Original Message-
>  From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
>  Sent: Tuesday, October 25, 2016 11:39 AM
>  To: Bruce Richardson
>  Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz;
>  Oleg Kuporosov
>  Subject: Re: [dpdk-dev] mbuf changes
> 
>  On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> >>> wrote:
> > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith
> >>> wrote:
>  [...]
> >>> On Oct 24, 2016, at 10:49 AM, Morten Br?rup
>   wrote:
>  [...]
> 
> > One other point I'll mention is that we need to have a
> > discussion on how/where to add in a timestamp value into
> >>> the
> > mbuf. Personally, I think it can be in a union with the
> > sequence
> > number value, but I also suspect that 32-bits of a
> >>> timestamp
> > is not going to be enough for
>  many.
> >
> > Thoughts?
> 
>  If we consider that timestamp representation should use
> > nanosecond
>  granularity, a 32-bit value may likely wrap around too
> >>> quickly
>  to be useful. We can also assume that applications requesting
>  timestamps may care more about latency than throughput, Oleg
> > found
>  that using the second cache line for this purpose had a
> > noticeable impact [1].
> 
>   [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> >>>
> >>> I agree with Oleg about the latency vs. throughput importance
> >>> for
> > such applications.
> >>>
> >>> If you need high resolution timestamps, consider them to be
> > generated by the NIC RX driver, possibly by the hardware itself
> > (http://w3new.napatech.com/features/time-precision/hardware-time-
> > stamp), so the timestamp belongs in the first cache line. And I am
> > proposing that it should have the highest possible accuracy, which
> > makes the value hardware dependent.
> >>>
> >>> Furthermore, I am arguing that we leave it up to the
> >>> application
> >>> to
> > keep track of the slowly moving bits (i.e. counting whole seconds,
> > hours and calendar date) out of band, so we don't use precious
> >>> space
> > in the mbuf. The application doesn't need the NIC RX driver's fast
> > path to capture which date (or even which second) a packet was
> > received. Yes, it adds complexity to the application, but we can't
> > set aside 64 bit for a generic timestamp. Or as a weird tradeoff:
> > Put the fast moving 32 bit in the first cache line and the slow
> > moving 32 bit in the second cache line, as a placeholder for the
> >>> application to fill out if needed.
> > Yes, it means that the application needs to check the time and
> > update its variable holding the slow moving time once every second
> > or so; but that should be doable without significant effort.
> >>
> >> That's a good point, however without a 64 bit value, elapsed time
> >> between two arbitrary mbufs cannot be measured reliably due to
> >>> not
> >> enough context, one way or another the low resolution value is
> >> also
> > needed.
> >>
> >> Obviously latency-sensitive applications are unlikely to perform
> >>

[dpdk-dev] mbuf changes

2016-10-25 Thread Ananyev, Konstantin

Hi everyone,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> Sent: Tuesday, October 25, 2016 9:54 AM
> To: Morten Br?rup 
> Cc: Wiles, Keith ; dev at dpdk.org; Olivier Matz 
> 
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Br?rup wrote:
> > Comments inline.
> >
> > Med venlig hilsen / kind regards
> > - Morten Br?rup
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > > Sent: Monday, October 24, 2016 6:26 PM
> > > To: Wiles, Keith
> > > Cc: Morten Br?rup; dev at dpdk.org; Olivier Matz
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> > > >
> > > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup  > > > > smartsharesystems.com>
> > > wrote:
> > > > >
> > > > > First of all: Thanks for a great DPDK Userspace 2016!
> > > > >
> > > > >
> > > > >
> > > > > Continuing the Userspace discussion about Olivier Matz?s proposed mbuf
> > > changes...
> > >
> > > Thanks for keeping the discussion going!
> > > > >
> > > > >
> > > > >
> > > > > 1.
> > > > >
> > > > > Stephen Hemminger had a noteworthy general comment about keeping
> > > metadata for the NIC in the appropriate section of the mbuf: Metadata
> > > generated by the NIC?s RX handler belongs in the first cache line, and
> > > metadata required by the NIC?s TX handler belongs in the second cache 
> > > line.
> > > This also means that touching the second cache line on ingress should be
> > > avoided if possible; and Bruce Richardson mentioned that for this reason 
> > > m-
> > > >next was zeroed on free().
> > > > >
> > > Thinking about it, I suspect there are more fields we can reset on free
> > > to save time on alloc. Refcnt, as discussed below is one of them, but so
> > > too could be the nb_segs field and possibly others.
> > Yes. Consider the use of rte_pktmbuf_reset() or add a 
> > rte_pktmbuf_prealloc() for this purpose.
> >
> Possibly. We just want to make sure that we don't reset too much either!
> :-)
> 
> >
> > > > > 2.
> > > > >
> > > > > There seemed to be consensus that the size of m->refcnt should match
> > > the size of m->port because a packet could be duplicated on all physical
> > > ports for L3 multicast and L2 flooding.
> > > > >
> > > > > Furthermore, although a single physical machine (i.e. a single server)
> > > with 255 physical ports probably doesn?t exist, it might contain more than
> > > 255 virtual machines with a virtual port each, so it makes sense extending
> > > these mbuf fields from 8 to 16 bits.
> > > >
> > > > I thought we also talked about removing the m->port from the mbuf as it
> > > is not really needed.
> > > >
> > > Yes, this was mentioned, and also the option of moving the port value to
> > > the second cacheline, but it appears that NXP are using the port value
> > > in their NIC drivers for passing in metadata, so we'd need their
> > > agreement on any move (or removal).
> > >
> > If a single driver instance services multiple ports, so the ports are not 
> > polled individually, the m->port member will be required to identify
> the port. In that case, it might also used elsewhere in the ingress path, and 
> thus appropriate having in the first cache line. 

Ok, but these days most of devices have multiple rx queues.
So identify the RX source properly you need not only port, but port+queue (at 
least 3 bytes).
But I suppose better to wait for NXP input here. 

>So yes, this needs
> further discussion with NXP.
> >
> >
> > > > > 3.
> > > > >
> > > > > Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to
> > > the second cache line, which then generated questions from the audience
> > > about the real life purpose of m->port, and if m->port could be removed
> > > from the mbuf structure.
> > > > >
> > > > >
> > > > >
> > > > > 4.
> > > > >
> > > > > I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on
> > > first allocation. This is based on the assumption that other mbuf fields
> > > must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper
> > > than setting it to 1.
> > > > >
> > > > > Furthermore (regardless of m->refcnt offset), I suggested that it is
> > > not required to modify m->refcnt when allocating and freeing the mbuf, 
> > > thus
> > > saving one write operation on both alloc() and free(). However, this
> > > assumes that m->refcnt debugging, e.g. underrun detection, is not 
> > > required.
> > >
> > > I don't think it really matters what sentinal value is used for the
> > > refcnt because it can't be blindly assigned on free like other fields.
> > > However, I think 0 as first reference value becomes more awkward
> > > than 1, because we need to deal with underflow. Consider the situation
> > > where we have two references to the mbuf, so refcnt is 1, and both are
> > > freed at the same time. Since the refcnt is

[dpdk-dev] rte_kni_tx_burst() hangs because of no freedescriptors

2016-10-25 Thread Ananyev, Konstantin

Hi Helin,

> 
> From: yingzhi [mailto:kaitoy at qq.com]
> Sent: Monday, October 24, 2016 6:39 PM
> To: Zhang, Helin
> Cc: dev at dpdk.org
> Subject: Re: RE: [dpdk-dev] rte_kni_tx_burst() hangs because of no 
> freedescriptors
> 
> Hi Helin,
> 
> Thanks for your response, to answer your questions:
> 1. we send only one packet each time calling rte_kni_tx_burst(), which means 
> the last argument is 1.
> 2. it returns 0 because the free mbuf function inside tx_burst will not free 
> any mbuf:
> 
>   if (txq->nb_tx_free < txq->tx_free_thresh)
>   ixgbe_tx_free_bufs(txq);
> 
> after this operation, the txq->nb_tx_free is not increased and keeps to be 
> "0" eventually.
> 
> I did some tests today, I commented out this section of 
> ixgbe_rxtx_vec_common.h -> ixgbe_tx_free_bufs
> 
>   status = txq->tx_ring[txq->tx_next_dd].wb.status;
>   if (!(status & IXGBE_ADVTXD_STAT_DD))
>   return 0;
> 
> After ignoring DD bit check, our app runs for about 6 hours without issue. I 
> suspect there is something wrong in my program set the DD bit
> somewhere. One of the possible cause currently I suspect is as far as I know, 
> rte_pktmbuf_free(mbuf.array[k]) will free the mbuf of the
> packet and any fragmented packets following by it. But in our application 
> such as below code snippet:
> 
> auto nb_tx = rte_eth_tx_burst(port, queue, mbuf.array, (uint16_t) 
> nb_rx);
> if (unlikely(nb_tx < nb_rx)) {
> for (unsigned k = nb_tx; k < nb_rx; k++) {
> rte_pktmbuf_free(mbuf.array[k]);
> }
> }
> [Zhang, Helin] it seems above code piece has memory leak, if the buffer is 
> chained. After all memory leaked, then the issue comes. Please
> try to check if this is the root cause!
> 
> In this case if there are fragmented packets and failed transmission, may 
> cause a mbuf be freed multiple times.

Yes rte_pktmbuf_free() will free all chained segments for that packet.
The code above seems correct to me
(if of course .you don't try to free these packets again somewhere later).  
Can you clarify where do you think is a memory leak here?
Konstantin

> 
> Above is just my suspect, need to do some tests later today or tomorrow.
> 
> Thanks
> -- Original --
> From:  "Zhang, Helin";;
> Date:  Mon, Oct 24, 2016 11:33 AM
> To:  "yingzhi";
> Cc:  "dev at dpdk.org";
> Subject:  RE: [dpdk-dev] rte_kni_tx_burst() hangs because of no 
> freedescriptors
> 
> Hi Yingzhi
> 
> Thank you for the reporting! The description is not so clear at least for me.
> Please help to narrown down the issue by youself.
> How many packets would it have for calling TX function?
> Why it would return 0 after calling TX function? No memory? Or return from 
> else? Have you found anything?
> 
> Regards,
> Helin
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] > dpdk.org]> On Behalf Of yingzhi
> > Sent: Sunday, October 23, 2016 9:30 PM
> > To: users; dev at dpdk.org
> > Subject: [dpdk-dev] rte_kni_tx_burst() hangs because of no free descriptors
> >
> > -
> > Hi Experts,
> >
> > Background:
> >
> > We are using DPDK to develop a LoadBalancer following below logic: When
> > a new packet is received:
> >  1. if the dst_addr is management IP, forward to KNI. 2. if the dst_addr is 
> > in
> > VIP list, select backend and forward(modify dst mac address). 3. otherwise
> > drop the packet.
> >
> > At this stage, we use one single thread for KNI forwarding and another for
> > VIP forwarding(forward to eth).
> >
> > DPDK version: 16.07
> >  NIC: 82599ES 10-Gigabit SFI/SFP+ Network Connection
> >  Linux: 14.04.1-Ubuntu x64
> >
> > Promblem description:
> >
> > The program runs correctly for sometime(around 2 hours for 400Mb traffic).
> > But it it will hang. When problem happens, rte_eth_tx_burst() will not able 
> > to
> > send out any packets(always returns 0). We tracked into that function and
> > noticed it is actually calling ixgbe driver's ixgbe_xmit_pkts_vec() 
> > function in
> > our environment, because we use default tx queue configuration, after
> > printing some info, we found if the free function works fine:
> >  tx_rs_thresh: 32, tx_free_thresh: 32, nb_tx_free: 31
> >
> > it will trigger free and make 32 more free descriptors:
> >  tx_rs_thresh: 32, tx_free_thresh: 32, nb_tx_free: 62
> >
> > but when something going wrong, it will no longer free anything:
> >  tx_rs_thresh: 32, tx_free_thresh: 32, nb_tx_free: 0 tx_rs_thresh: 32,
> > tx_free_thresh: 32, nb_tx_free: 0
> >
> > It may related with the DD flag of the descriptor but we are not quite sure.
> >
> > Our program logic:
> >
> > create two mbuf pools on socket 0, one for rx_queue and one for kni. (all
> > lcore threads runs on socket0)
> >
> > init kni interface with rte_kni_alloc()
> >
> >
> > init one NIC interface with
> >  rte_eth_dev_configure();

[dpdk-dev] [PATCH v10 0/6] add Tx preparation

2016-10-24 Thread Ananyev, Konstantin

> 
> As discussed in that thread:
> 
> http://dpdk.org/ml/archives/dev/2015-September/023603.html
> 
> Different NIC models depending on HW offload requested might impose
> different requirements on packets to be TX-ed in terms of:
> 
>  - Max number of fragments per packet allowed
>  - Max number of fragments per TSO segments
>  - The way pseudo-header checksum should be pre-calculated
>  - L3/L4 header fields filling
>  - etc.
> 
> 
> MOTIVATION:
> ---
> 
> 1) Some work cannot (and didn't should) be done in rte_eth_tx_burst.
>However, this work is sometimes required, and now, it's an
>application issue.
> 
> 2) Different hardware may have different requirements for TX offloads,
>other subset can be supported and so on.
> 
> 3) Some parameters (e.g. number of segments in ixgbe driver) may hung
>device. These parameters may be vary for different devices.
> 
>For example i40e HW allows 8 fragments per packet, but that is after
>TSO segmentation. While ixgbe has a 38-fragment pre-TSO limit.
> 
> 4) Fields in packet may require different initialization (like e.g. will
>require pseudo-header checksum precalculation, sometimes in a
>different way depending on packet type, and so on). Now application
>needs to care about it.
> 
> 5) Using additional API (rte_eth_tx_prep) before rte_eth_tx_burst let to
>prepare packet burst in acceptable form for specific device.
> 
> 6) Some additional checks may be done in debug mode keeping tx_burst
>implementation clean.
> 
> 
> PROPOSAL:
> -
> 
> To help user to deal with all these varieties we propose to:
> 
> 1) Introduce rte_eth_tx_prep() function to do necessary preparations of
>packet burst to be safely transmitted on device for desired HW
>offloads (set/reset checksum field according to the hardware
>requirements) and check HW constraints (number of segments per
>packet, etc).
> 
>While the limitations and requirements may differ for devices, it
>requires to extend rte_eth_dev structure with new function pointer
>"tx_pkt_prep" which can be implemented in the driver to prepare and
>verify packets, in devices specific way, before burst, what should to
>prevent application to send malformed packets.
> 
> 2) Also new fields will be introduced in rte_eth_desc_lim:
>nb_seg_max and nb_mtu_seg_max, providing an information about max
>segments in TSO and non-TSO packets acceptable by device.
> 
>This information is useful for application to not create/limit
>malicious packet.
> 
> 
> APPLICATION (CASE OF USE):
> --
> 
> 1) Application should to initialize burst of packets to send, set
>required tx offload flags and required fields, like l2_len, l3_len,
>l4_len, and tso_segsz
> 
> 2) Application passes burst to the rte_eth_tx_prep to check conditions
>required to send packets through the NIC.
> 
> 3) The result of rte_eth_tx_prep can be used to send valid packets
>and/or restore invalid if function fails.
> 
> e.g.
> 
>   for (i = 0; i < nb_pkts; i++) {
> 
>   /* initialize or process packet */
> 
>   bufs[i]->tso_segsz = 800;
>   bufs[i]->ol_flags = PKT_TX_TCP_SEG | PKT_TX_IPV4
>   | PKT_TX_IP_CKSUM;
>   bufs[i]->l2_len = sizeof(struct ether_hdr);
>   bufs[i]->l3_len = sizeof(struct ipv4_hdr);
>   bufs[i]->l4_len = sizeof(struct tcp_hdr);
>   }
> 
>   /* Prepare burst of TX packets */
>   nb_prep = rte_eth_tx_prep(port, 0, bufs, nb_pkts);
> 
>   if (nb_prep < nb_pkts) {
>   printf("tx_prep failed\n");
> 
>   /* nb_prep indicates here first invalid packet. rte_eth_tx_prep
>* can be used on remaining packets to find another ones.
>*/
> 
>   }
> 
>   /* Send burst of TX packets */
>   nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_prep);
> 
>   /* Free any unsent packets. */
> 
> v10 changes:
>  - moved drivers tx calback check in rte_eth_tx_prep after queue_id check
> 
> v9 changes:
>  - fixed headers structure fragmentation check
>  - moved fragmentation check into rte_validate_tx_offload()
> 
> v8 changes:
>  - mbuf argument in rte_validate_tx_offload declared as const
> 
> v7 changes:
>  - comments reworded/added
>  - changed errno values returned from Tx prep API
>  - added check in rte_phdr_cksum_fix if headers are in the first
>data segment and can be safetly modified
>  - moved rte_validate_tx_offload to rte_mbuf
>  - moved rte_phdr_cksum_fix to rte_net.h
>  - removed rte_pkt.h new file as useless
> 
> v6 changes:
> - added performance impact test results to the patch description
> 
> v5 changes:
>  - rebased csum engine modification
>  - added information to the csum engine about performance tests
>  - some performance improvements
> 
> v4 changes:
>  - tx_prep is now set to default behavior (NULL) for simple/vector path
>in

[dpdk-dev] [PATCH v8 1/6] ethdev: add Tx preparation

2016-10-24 Thread Ananyev, Konstantin



> -Original Message-
> From: Kulasek, TomaszX
> Sent: Monday, October 24, 2016 1:49 PM
> To: Ananyev, Konstantin ; dev at dpdk.org
> Cc: olivier.matz at 6wind.com
> Subject: RE: [PATCH v8 1/6] ethdev: add Tx preparation
> 
> Hi Konstantin,
> 
> > -Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Monday, October 24, 2016 14:15
> > To: Kulasek, TomaszX ; dev at dpdk.org
> > Cc: olivier.matz at 6wind.com
> > Subject: RE: [PATCH v8 1/6] ethdev: add Tx preparation
> >
> > Hi Tomasz,
> >
> 
> [...]
> 
> > >
> > > +/**
> > > + * Fix pseudo header checksum
> > > + *
> > > + * This function fixes pseudo header checksum for TSO and non-TSO
> > > +tcp/udp in
> > > + * provided mbufs packet data.
> > > + *
> > > + * - for non-TSO tcp/udp packets full pseudo-header checksum is counted
> > and set
> > > + *   in packet data,
> > > + * - for TSO the IP payload length is not included in pseudo header.
> > > + *
> > > + * This function expects that used headers are in the first data
> > > +segment of
> > > + * mbuf, and are not fragmented.
> > > + *
> > > + * @param m
> > > + *   The packet mbuf to be validated.
> > > + * @return
> > > + *   0 if checksum is initialized properly
> > > + */
> > > +static inline int
> > > +rte_phdr_cksum_fix(struct rte_mbuf *m) {
> > > + struct ipv4_hdr *ipv4_hdr;
> > > + struct ipv6_hdr *ipv6_hdr;
> > > + struct tcp_hdr *tcp_hdr;
> > > + struct udp_hdr *udp_hdr;
> > > + uint64_t ol_flags = m->ol_flags;
> > > + uint64_t inner_l3_offset = m->l2_len;
> > > +
> > > + if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
> > > + inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
> > > +
> > > + /* headers are fragmented */
> > > + if (unlikely(rte_pktmbuf_data_len(m) >= inner_l3_offset + m->l3_len
> > +
> > > + m->l4_len))
> >
> > Might be better to move that check into rte_validate_tx_offload(), so it
> > would be called only when TX_DEBUG is on.
> 
> While unfragmented headers are not general requirements for Tx offloads, and 
> this requirement is for this particular implementation,
> maybe for performance reasons will be better to keep it here, and just add 
> #if DEBUG to leave rte_validate_tx_offload more generic.

Hmm and what is the advantage to pollute that code with more ifdefs?
Again, why unfragmented headers are not general requirements?
As long as DPDK pseudo-headear csum calculation routines can't handle 
fragmented case,
it pretty much is a general requirement, no?
Konstantin

> 
> > Another thing, shouldn't it be:
> > if (rte_pktmbuf_data_len(m) < inner_l3_offset + m->l3_len + m->l4_len) ?
> 
> Yes, it should.
> 
> > Konstantin
> >
> 
> Tomasz

[dpdk-dev] [PATCH v8 1/6] ethdev: add Tx preparation

2016-10-24 Thread Ananyev, Konstantin

Hi Tomasz,

> 
>  /**
> + * Validate general requirements for tx offload in mbuf.
> + *
> + * This function checks correctness and completeness of Tx offload settings.
> + *
> + * @param m
> + *   The packet mbuf to be validated.
> + * @return
> + *   0 if packet is valid
> + */
> +static inline int
> +rte_validate_tx_offload(const struct rte_mbuf *m)
> +{
> + uint64_t ol_flags = m->ol_flags;
> +
> + /* Does packet set any of available offloads? */
> + if (!(ol_flags & PKT_TX_OFFLOAD_MASK))
> + return 0;
> +
> + /* IP checksum can be counted only for IPv4 packet */
> + if ((ol_flags & PKT_TX_IP_CKSUM) && (ol_flags & PKT_TX_IPV6))
> + return -EINVAL;
> +
> + /* IP type not set when required */
> + if (ol_flags & (PKT_TX_L4_MASK | PKT_TX_TCP_SEG))
> + if (!(ol_flags & (PKT_TX_IPV4 | PKT_TX_IPV6)))
> + return -EINVAL;
> +
> + /* Check requirements for TSO packet */
> + if (ol_flags & PKT_TX_TCP_SEG)
> + if ((m->tso_segsz == 0) ||
> + ((ol_flags & PKT_TX_IPV4) &&
> + !(ol_flags & PKT_TX_IP_CKSUM)))
> + return -EINVAL;
> +
> + /* PKT_TX_OUTER_IP_CKSUM set for non outer IPv4 packet. */
> + if ((ol_flags & PKT_TX_OUTER_IP_CKSUM) &&
> + !(ol_flags & PKT_TX_OUTER_IPV4))
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +/**
>   * Dump an mbuf structure to a file.
>   *
>   * Dump all fields for the given packet mbuf and all its associated
> diff --git a/lib/librte_net/rte_net.h b/lib/librte_net/rte_net.h
> index d4156ae..79669d7 100644
> --- a/lib/librte_net/rte_net.h
> +++ b/lib/librte_net/rte_net.h
> @@ -38,6 +38,11 @@
>  extern "C" {
>  #endif
> 
> +#include 
> +#include 
> +#include 
> +#include 
> +
>  /**
>   * Structure containing header lengths associated to a packet, filled
>   * by rte_net_get_ptype().
> @@ -86,6 +91,91 @@ struct rte_net_hdr_lens {
>  uint32_t rte_net_get_ptype(const struct rte_mbuf *m,
>   struct rte_net_hdr_lens *hdr_lens, uint32_t layers);
> 
> +/**
> + * Fix pseudo header checksum
> + *
> + * This function fixes pseudo header checksum for TSO and non-TSO tcp/udp in
> + * provided mbufs packet data.
> + *
> + * - for non-TSO tcp/udp packets full pseudo-header checksum is counted and 
> set
> + *   in packet data,
> + * - for TSO the IP payload length is not included in pseudo header.
> + *
> + * This function expects that used headers are in the first data segment of
> + * mbuf, and are not fragmented.
> + *
> + * @param m
> + *   The packet mbuf to be validated.
> + * @return
> + *   0 if checksum is initialized properly
> + */
> +static inline int
> +rte_phdr_cksum_fix(struct rte_mbuf *m)
> +{
> + struct ipv4_hdr *ipv4_hdr;
> + struct ipv6_hdr *ipv6_hdr;
> + struct tcp_hdr *tcp_hdr;
> + struct udp_hdr *udp_hdr;
> + uint64_t ol_flags = m->ol_flags;
> + uint64_t inner_l3_offset = m->l2_len;
> +
> + if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
> + inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
> +
> + /* headers are fragmented */
> + if (unlikely(rte_pktmbuf_data_len(m) >= inner_l3_offset + m->l3_len +
> + m->l4_len))

Might be better to move that check into rte_validate_tx_offload(),
so it would be called only when TX_DEBUG is on.
Another thing, shouldn't it be:
if (rte_pktmbuf_data_len(m) < inner_l3_offset + m->l3_len + m->l4_len)
?
Konstantin


> + return -ENOTSUP;
> +
> + if ((ol_flags & PKT_TX_UDP_CKSUM) == PKT_TX_UDP_CKSUM) {
> + if (ol_flags & PKT_TX_IPV4) {
> + ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> + inner_l3_offset);
> +
> + if (ol_flags & PKT_TX_IP_CKSUM)
> + ipv4_hdr->hdr_checksum = 0;
> +
> + udp_hdr = (struct udp_hdr *)((char *)ipv4_hdr +
> + m->l3_len);
> + udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr,
> + ol_flags);
> + } else {
> + ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> + inner_l3_offset);
> + /* non-TSO udp */
> + udp_hdr = rte_pktmbuf_mtod_offset(m, struct udp_hdr *,
> + inner_l3_offset + m->l3_len);
> + udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr,
> + ol_flags);
> + }
> + } else if ((ol_flags & PKT_TX_TCP_CKSUM) ||
> + (ol_flags & PKT_TX_TCP_SEG)) {
> + if (ol_flags & PKT_TX_IPV4) {
> + ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> + inner_l3_offset);
> +
> +

[dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation

2016-10-19 Thread Ananyev, Konstantin

Hi guys,

> >
> > > +static inline int
> > > +rte_phdr_cksum_fix(struct rte_mbuf *m) {
> > > + struct ipv4_hdr *ipv4_hdr;
> > > + struct ipv6_hdr *ipv6_hdr;
> > > + struct tcp_hdr *tcp_hdr;
> > > + struct udp_hdr *udp_hdr;
> > > + uint64_t ol_flags = m->ol_flags;
> > > + uint64_t inner_l3_offset = m->l2_len;
> > > +
> > > + if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
> > > + inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
> > > +
> > > + if ((ol_flags & PKT_TX_UDP_CKSUM) == PKT_TX_UDP_CKSUM) {
> > > + if (ol_flags & PKT_TX_IPV4) {
> > > + ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > > + inner_l3_offset);
> > > +
> > > + if (ol_flags & PKT_TX_IP_CKSUM)
> > > + ipv4_hdr->hdr_checksum = 0;
> > > +
> > > + udp_hdr = (struct udp_hdr *)((char *)ipv4_hdr + m-
> > >l3_len);
> > > + udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr,
> > ol_flags);
> > > + } else {
> > > + ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > > + inner_l3_offset);
> > > + /* non-TSO udp */
> > > + udp_hdr = rte_pktmbuf_mtod_offset(m, struct udp_hdr *,
> > > + inner_l3_offset + m->l3_len);
> > > + udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr,
> > ol_flags);
> > > + }
> > > + } else if ((ol_flags & PKT_TX_TCP_CKSUM) ||
> > > + (ol_flags & PKT_TX_TCP_SEG)) {
> > > + if (ol_flags & PKT_TX_IPV4) {
> > > + ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > > + inner_l3_offset);
> > > +
> > > + if (ol_flags & PKT_TX_IP_CKSUM)
> > > + ipv4_hdr->hdr_checksum = 0;
> > > +
> > > + /* non-TSO tcp or TSO */
> > > + tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + m-
> > >l3_len);
> > > + tcp_hdr->cksum = rte_ipv4_phdr_cksum(ipv4_hdr, 
> > > ol_flags);
> > > + } else {
> > > + ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > > + inner_l3_offset);
> > > + /* non-TSO tcp or TSO */
> > > + tcp_hdr = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
> > > + inner_l3_offset + m->l3_len);
> > > + tcp_hdr->cksum = rte_ipv6_phdr_cksum(ipv6_hdr, 
> > > ol_flags);
> > > + }
> > > + }
> > > + return 0;
> > > +}
> > > +
> > > +#endif /* _RTE_PKT_H_ */
> > >
> >
> > The function expects that all the network headers are in the first, and
> > that each of them is contiguous.
> >
> 
> Yes, I see...

Yes it does.
I suppose it is a legitimate restriction  (assumptions) for those who'd like to 
use that function.
But that's a good point and I suppose we need to state it explicitly in the API 
comments.

> 
> > Also, I had an interesting remark from Stephen [1] on a similar code.
> > If the mbuf is a clone, it will modify the data of the direct mbuf, which
> > should be read-only. Note that it is likely to happen in a TCP stack,
> > because the packet is kept locally in case it has to be retransmitted.
> > Cloning a mbuf is more efficient than duplicating it.
> >
> > I plan to fix it in virtio code by "uncloning" the headers.
> >
> > [1] http://dpdk.org/ml/archives/dev/2016-October/048873.html

This subject is probably a bit off-topic... 
My position on it - it shouldn't be a PMD responsibility to make these kind of 
checks.
I think it should be upper layer responsibility to provide to the PMD an mbuf
that can be safely used (and in that case modified) by the driver.
Let say if upper layer would like to use packet clone for TCP retransmissions,
It can easily overcome that problem by cloning only data part of the packet,
and putting L2/L3/L4 headers in a separate (head) mbuf and chaining it with
cloned data mbuf.

Konstantin

[dpdk-dev] [PATCH 1/3] mbuf: embedding timestamp into the packet

2016-10-19 Thread Ananyev, Konstantin


Hi lads,
> 
> 
> 
> On 10/13/2016 04:35 PM, Oleg Kuporosov wrote:
> > The hard requirement of financial services industry is accurate
> > timestamping aligned with the packet itself. This patch is to satisfy
> > this requirement:
> >
> > - include uint64_t timestamp field into rte_mbuf with minimal impact to
> >   throughput/latency. Keep it just simple uint64_t in ns (more than 580
> >   years) would be enough for immediate needs while using full
> >   struct timespec with twice bigger size would have much stronger
> >   performance impact as missed cacheline0.
> >
> > - it is possible as there is 6-bytes gap in 1st cacheline (fast path)
> >   and moving uint16_t vlan_tci_outer field to 2nd cacheline.
> >
> > - such move will only impact for pretty rare usable VLAN RX stripping
> >   mode for outer TCI (it used only for one NIC i40e from the whole set and
> >   allows to keep minimal performance impact for RX/TX timestamps.
> 
> This argument is difficult to accept. One can say you are adding
> a field for a pretty rare case used by only one NIC :)
> 
> Honestly, I'm not able to judge whether timestamp is more important than
> vlan_tci_outer. As room is tight in the first cache line, your patch
> submission is the occasion to raise the question: how to decide what
> should be in the first part of the mbuf? There are also some other
> candidates for moving: m->seqn is only used in librte_reorder and it
> is not set in the RX part of a driver.
> 
> About the timestamp, it would be valuable to have other opinions,
> not only about the placement of the field in the structure, but also
> to check that this API is also usable for other NICs.
> 
> Have you measured the impact of having the timestamp in the second part
> of the mbuf?

My vote also would be to have timestamp in the second cache line.
About moving seqn to the 2-nd cache line too - that's probably a fair point.

About the rest of the patch: 
Do you really need to put that code into the PMDs itself?
Can't the same result be achieved by using RX callbacks?
Again that approach would work with any PMD and
there would be no need to modify PMD code itself.

Another thing, that I am in doubt why to use system time?
Wouldn't raw HW TSC value (rte_rdtsc()) do here?
Konstantin

> 
> Changing the mbuf structure should happen as rarely as possible, and we
> have to make sure we take the correct decisions. I think we will
> discuss this at dpdk userland 2016.
> 
> 
> Apart from that, I wonder if an ol_flag should be added to tell that
> the timestamp field is valid in the mbuf.
> 
> Regards,
> Olivier

[dpdk-dev] [PATCH v6 0/6] add Tx preparation

2016-10-18 Thread Ananyev, Konstantin



> 
> As discussed in that thread:
> 
> http://dpdk.org/ml/archives/dev/2015-September/023603.html
> 
> Different NIC models depending on HW offload requested might impose different 
> requirements on packets to be TX-ed in terms of:
> 
>  - Max number of fragments per packet allowed
>  - Max number of fragments per TSO segments
>  - The way pseudo-header checksum should be pre-calculated
>  - L3/L4 header fields filling
>  - etc.
> 
> 
> MOTIVATION:
> ---
> 
> 1) Some work cannot (and didn't should) be done in rte_eth_tx_burst.
>However, this work is sometimes required, and now, it's an
>application issue.
> 
> 2) Different hardware may have different requirements for TX offloads,
>other subset can be supported and so on.
> 
> 3) Some parameters (e.g. number of segments in ixgbe driver) may hung
>device. These parameters may be vary for different devices.
> 
>For example i40e HW allows 8 fragments per packet, but that is after
>TSO segmentation. While ixgbe has a 38-fragment pre-TSO limit.
> 
> 4) Fields in packet may require different initialization (like e.g. will
>require pseudo-header checksum precalculation, sometimes in a
>different way depending on packet type, and so on). Now application
>needs to care about it.
> 
> 5) Using additional API (rte_eth_tx_prep) before rte_eth_tx_burst let to
>prepare packet burst in acceptable form for specific device.
> 
> 6) Some additional checks may be done in debug mode keeping tx_burst
>implementation clean.
> 
> 
> PROPOSAL:
> -
> 
> To help user to deal with all these varieties we propose to:
> 
> 1) Introduce rte_eth_tx_prep() function to do necessary preparations of
>packet burst to be safely transmitted on device for desired HW
>offloads (set/reset checksum field according to the hardware
>requirements) and check HW constraints (number of segments per
>packet, etc).
> 
>While the limitations and requirements may differ for devices, it
>requires to extend rte_eth_dev structure with new function pointer
>"tx_pkt_prep" which can be implemented in the driver to prepare and
>verify packets, in devices specific way, before burst, what should to
>prevent application to send malformed packets.
> 
> 2) Also new fields will be introduced in rte_eth_desc_lim:
>nb_seg_max and nb_mtu_seg_max, providing an information about max
>segments in TSO and non-TSO packets acceptable by device.
> 
>This information is useful for application to not create/limit
>malicious packet.
> 
> 
> APPLICATION (CASE OF USE):
> --
> 
> 1) Application should to initialize burst of packets to send, set
>required tx offload flags and required fields, like l2_len, l3_len,
>l4_len, and tso_segsz
> 
> 2) Application passes burst to the rte_eth_tx_prep to check conditions
>required to send packets through the NIC.
> 
> 3) The result of rte_eth_tx_prep can be used to send valid packets
>and/or restore invalid if function fails.
> 
> e.g.
> 
>   for (i = 0; i < nb_pkts; i++) {
> 
>   /* initialize or process packet */
> 
>   bufs[i]->tso_segsz = 800;
>   bufs[i]->ol_flags = PKT_TX_TCP_SEG | PKT_TX_IPV4
>   | PKT_TX_IP_CKSUM;
>   bufs[i]->l2_len = sizeof(struct ether_hdr);
>   bufs[i]->l3_len = sizeof(struct ipv4_hdr);
>   bufs[i]->l4_len = sizeof(struct tcp_hdr);
>   }
> 
>   /* Prepare burst of TX packets */
>   nb_prep = rte_eth_tx_prep(port, 0, bufs, nb_pkts);
> 
>   if (nb_prep < nb_pkts) {
>   printf("tx_prep failed\n");
> 
>   /* nb_prep indicates here first invalid packet. rte_eth_tx_prep
>* can be used on remaining packets to find another ones.
>*/
> 
>   }
> 
>   /* Send burst of TX packets */
>   nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_prep);
> 
>   /* Free any unsent packets. */
> 
> 
> v5 changes:
>  - rebased csum engine modification
>  - added information to the csum engine about performance tests
>  - some performance improvements
> 
> v4 changes:
>  - tx_prep is now set to default behavior (NULL) for simple/vector path
>in fm10k, i40e and ixgbe drivers to increase performance, when
>Tx offloads are not intentionally available
> 
> v3 changes:
>  - reworked csum testpmd engine instead adding new one,
>  - fixed checksum initialization procedure to include also outer
>checksum offloads,
>  - some minor formattings and optimalizations
> 
> v2 changes:
>  - rte_eth_tx_prep() returns number of packets when device doesn't
>support tx_prep functionality,
>  - introduced CONFIG_RTE_ETHDEV_TX_PREP allowing to turn off tx_prep
> 
> 
> Tomasz Kulasek (6):
>   ethdev: add Tx preparation
>   e1000: add Tx preparation
>   fm10k: add Tx preparation
>   i40e: add Tx preparation
>   ixgbe: add Tx preparation
>   testpmd: use Tx preparation in

[dpdk-dev] [PATCH v2 0/3] net: fix out of order Rx read issue

2016-10-18 Thread Ananyev, Konstantin



> 
> In vPMD, when load Rx desc with _mm_loadu_si128,
> volatile point will be cast into non-volatile point.
> So GCC is allowed to reorder the load instructions,
> while Rx read's correctness is reply on these load
> instructions to follow a backward sequence strictly,
> so we add compile barrier to prevent compiler reorder.
> We already met this issue on i40e with GCC6 and we
> fixed this on ixgbe and fm10k also.
> 
> v2:
> - fix check-git-log.sh warning.
> - add more detail commit message.
> 
> Qi Zhang (3):
>   net/i40e: fix out of order Rx read issue
>   net/ixgbe: fix out of order Rx read issue
>   net/fm10k: fix out of ofder Rx read issue
> 
>  drivers/net/fm10k/fm10k_rxtx_vec.c | 3 +++
>  drivers/net/i40e/i40e_rxtx_vec.c   | 3 +++
>  drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c | 3 +++
>  3 files changed, 9 insertions(+)
> 
> --

Acked-by: Konstantin Ananyev 

> 2.7.4

[dpdk-dev] [PATCH v2 1/2] drivers/i40e: fix X722 macro absence result in compile

2016-10-16 Thread Ananyev, Konstantin

Hi Jeff,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Jeff Guo
> Sent: Sunday, October 16, 2016 2:40 AM
> To: Zhang, Helin ; Wu, Jingjing  intel.com>
> Cc: dev at dpdk.org; Guo, Jia 
> Subject: [dpdk-dev] [PATCH v2 1/2] drivers/i40e: fix X722 macro absence 
> result in compile
> 
> Since some register only be supported by X722 but may not be supported
> by other NICs, so add X722 macro to distinguish that to avoid compile error
> when the X722 macro is undefined.


Two probably silly questions:
1) So who will setup X722_SUPPORT macro?
Is that a user responsibility when he is building dpdk i40e PMD?
If so, why it is not a rte_config option?
2) Why this all has to be build  time decision?
Why nor run-time?
Why i40e driver can't support all devices (including x722)
and invoke different config functions (write different registers)
based on device type/id information?
As it does for other device types/ids?

Konstantin

> 
> Fixes: d0a349409bd7 (?i40e: support AQ based RSS config?)
> Fixes: 001a1c0f98f4 ("ethdev: get registers width")
> Fixes: a0454b5d2e08 (?i40e: update device ids?)
> Fixes: 647d1eaf758b (?i40evf: support AQ based RSS config?)
> Fixes: 3058891a2b02 (?net/i40e: move PCI device ids to the driver?)
> Fixes: d9efd0136ac1 (?i40e: add EEPROM and registers dumping?)
> Signed-off-by: Jeff Guo 
> 
> ---
> v2:
> fix compile error when x722 macro is not define.
> ---
>  drivers/net/i40e/i40e_ethdev.c| 36 ++-
>  drivers/net/i40e/i40e_ethdev.h| 17 +++
>  drivers/net/i40e/i40e_ethdev_vf.c | 27 +++
>  drivers/net/i40e/i40e_regs.h  | 96 
> +++
>  drivers/net/i40e/i40e_rxtx.c  | 18 +++-
>  5 files changed, 191 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
> index d0640b9..920fd6d 100644
> --- a/drivers/net/i40e/i40e_ethdev.c
> +++ b/drivers/net/i40e/i40e_ethdev.c
> @@ -468,13 +468,17 @@ static const struct rte_pci_id pci_id_i40e_map[] = {
>   { RTE_PCI_DEVICE(I40E_INTEL_VENDOR_ID, I40E_DEV_ID_10G_BASE_T4) },
>   { RTE_PCI_DEVICE(I40E_INTEL_VENDOR_ID, I40E_DEV_ID_25G_B) },
>   { RTE_PCI_DEVICE(I40E_INTEL_VENDOR_ID, I40E_DEV_ID_25G_SFP28) },
> +#ifdef X722_SUPPORT
> +#ifdef X722_A0_SUPPORT
>   { RTE_PCI_DEVICE(I40E_INTEL_VENDOR_ID, I40E_DEV_ID_X722_A0) },
> +#endif /* X722_A0_SUPPORT */
>   { RTE_PCI_DEVICE(I40E_INTEL_VENDOR_ID, I40E_DEV_ID_KX_X722) },
>   { RTE_PCI_DEVICE(I40E_INTEL_VENDOR_ID, I40E_DEV_ID_QSFP_X722) },
>   { RTE_PCI_DEVICE(I40E_INTEL_VENDOR_ID, I40E_DEV_ID_SFP_X722) },
>   { RTE_PCI_DEVICE(I40E_INTEL_VENDOR_ID, I40E_DEV_ID_1G_BASE_T_X722) },
>   { RTE_PCI_DEVICE(I40E_INTEL_VENDOR_ID, I40E_DEV_ID_10G_BASE_T_X722) },
>   { RTE_PCI_DEVICE(I40E_INTEL_VENDOR_ID, I40E_DEV_ID_SFP_I_X722) },
> +#endif /* X722_SUPPORT */
>   { .vendor_id = 0, /* sentinel */ },
>  };
> 
> @@ -3182,6 +3186,7 @@ i40e_get_rss_lut(struct i40e_vsi *vsi, uint8_t *lut, 
> uint16_t lut_size)
>   if (!lut)
>   return -EINVAL;
> 
> +#ifdef X722_SUPPORT
>   if (pf->flags & I40E_FLAG_RSS_AQ_CAPABLE) {
>   ret = i40e_aq_get_rss_lut(hw, vsi->vsi_id, TRUE,
> lut, lut_size);
> @@ -3190,12 +3195,15 @@ i40e_get_rss_lut(struct i40e_vsi *vsi, uint8_t *lut, 
> uint16_t lut_size)
>   return ret;
>   }
>   } else {
> +#endif /* X722_SUPPORT */
>   uint32_t *lut_dw = (uint32_t *)lut;
>   uint16_t i, lut_size_dw = lut_size / 4;
> 
>   for (i = 0; i < lut_size_dw; i++)
>   lut_dw[i] = I40E_READ_REG(hw, I40E_PFQF_HLUT(i));
> +#ifdef X722_SUPPORT
>   }
> +#endif /* X722_SUPPORT */
> 
>   return 0;
>  }
> @@ -3213,6 +3221,7 @@ i40e_set_rss_lut(struct i40e_vsi *vsi, uint8_t *lut, 
> uint16_t lut_size)
>   pf = I40E_VSI_TO_PF(vsi);
>   hw = I40E_VSI_TO_HW(vsi);
> 
> +#ifdef X722_SUPPORT
>   if (pf->flags & I40E_FLAG_RSS_AQ_CAPABLE) {
>   ret = i40e_aq_set_rss_lut(hw, vsi->vsi_id, TRUE,
> lut, lut_size);
> @@ -3221,13 +3230,16 @@ i40e_set_rss_lut(struct i40e_vsi *vsi, uint8_t *lut, 
> uint16_t lut_size)
>   return ret;
>   }
>   } else {
> +#endif /* X722_SUPPORT */
>   uint32_t *lut_dw = (uint32_t *)lut;
>   uint16_t i, lut_size_dw = lut_size / 4;
> 
>   for (i = 0; i < lut_size_dw; i++)
>   I40E_WRITE_REG(hw, I40E_PFQF_HLUT(i), lut_dw[i]);
>   I40E_WRITE_FLUSH(hw);
> +#ifdef X722_SUPPORT
>   }
> +#endif /* X722_SUPPORT */
> 
>   return 0;
>  }
> @@ -3508,8 +3520,10 @@ i40e_pf_parameter_init(struct rte_eth_dev *dev)
>   pf->lan_nb_qps = 1;
>   } else {
>   pf->flags |= I40E_FLAG_RSS;
> +#ifdef X722_SUPPORT
>   if (hw->mac.type ==

[dpdk-dev] [PATCH] eal: avoid unnecessary conflicts over rte_config file

2016-10-13 Thread Ananyev, Konstantin


> 
> It's true that users can patch around this problem (and I started off doing 
> just that), but why impose this inconvenience on users when DPDK
> can just "do the right thing" to begin with? For example, it took me several 
> hours to figure out why the problem was occurring and then to
> hunt down the --file-prefix solution. Is there some reason why it would be a 
> bad idea to fix this in DPDK?

[dpdk-dev] [PATCH] mempool: fix search of maximum contiguous pages

2016-10-13 Thread Ananyev, Konstantin


> 
> > > > diff --git a/lib/librte_mempool/rte_mempool.c
> > > > b/lib/librte_mempool/rte_mempool.c
> > > > index 71017e1..e3e254a 100644
> > > > --- a/lib/librte_mempool/rte_mempool.c
> > > > +++ b/lib/librte_mempool/rte_mempool.c
> > > > @@ -426,9 +426,12 @@ rte_mempool_populate_phys_tab(struct
> > > > rte_mempool *mp, char *vaddr,
> > > >
> > > > for (i = 0; i < pg_num && mp->populated_size < mp->size; i += 
> > > > n) {
> > > >
> > > > +   phys_addr_t paddr_next;
> > > > +   paddr_next = paddr[i] + pg_sz;
> > > > +
> > > > /* populate with the largest group of contiguous pages 
> > > > */
> > > > for (n = 1; (i + n) < pg_num &&
> > > > -paddr[i] + pg_sz == paddr[i+n]; n++)
> > > > +paddr_next == paddr[i+n]; n++, paddr_next 
> > > > += pg_sz)
> > > > ;
> > >
> > > Good catch.
> > > Why not just paddr[i + n - 1] != paddr[i + n]?
> >
> > Sorry, I meant 'paddr[i + n - 1] + pg_sz == paddr[i+n]' off course.
> >
> > > Then you don't need extra variable (paddr_next) here.
> > > Konstantin
> 
> Thank you, Konstantin
> 'paddr[i + n - 1] + pg_sz = paddr[i + n]' also can fix it and have straight 
> meaning.
> But I assume that my revision with paddr_next += pg_sz may have a bit better 
> performance.

I don't think there would be any real difference, again it is not performance 
critical code-path.

> By the way, paddr[i] + n * pg_sz = paddr[i + n] can also resolve it.

Yes, that's one seems even better for me - make things more clear.
Konstantin

> 
> /Wei
> 
> > >
> > > >
> > > > ret = rte_mempool_populate_phys(mp, vaddr + i * pg_sz,
> > > > --
> > > > 2.7.4

[dpdk-dev] [PATCH] eal: avoid unnecessary conflicts over rte_config file

2016-10-13 Thread Ananyev, Konstantin


Hi John,

> Before this patch, DPDK used the file ~/.rte_config as a lock to detect
> potential interference between multiple DPDK applications running on the
> same machine. However, if a single user ran DPDK applications concurrently
> on several different machines, and if the user's home directory was shared
> between the machines via NFS, DPDK would incorrectly detect conflicts
> for all but the first application and abort them. This patch fixes the
> problem by incorporating the machine name into the config file name (e.g.,
> ~/.rte_hostname_config).
> 
> Signed-off-by: John Ousterhout 
> ---
>  doc/guides/prog_guide/multi_proc_support.rst | 11 +++
>  lib/librte_eal/common/eal_common_proc.c  |  8 ++--
>  lib/librte_eal/common/eal_filesystem.h   | 15 +--
>  3 files changed, 22 insertions(+), 12 deletions(-)
> 
> diff --git a/doc/guides/prog_guide/multi_proc_support.rst 
> b/doc/guides/prog_guide/multi_proc_support.rst
> index badd102..a54fa1c 100644
> --- a/doc/guides/prog_guide/multi_proc_support.rst
> +++ b/doc/guides/prog_guide/multi_proc_support.rst
> @@ -129,10 +129,13 @@ Support for this usage scenario is provided using the 
> ``--file-prefix`` paramete
> 
>  By default, the EAL creates hugepage files on each hugetlbfs filesystem 
> using the rtemap_X filename,
>  where X is in the range 0 to the maximum number of hugepages -1.
> -Similarly, it creates shared configuration files, memory mapped in each 
> process, using the /var/run/.rte_config filename,
> -when run as root (or $HOME/.rte_config when run as a non-root user;
> -if filesystem and device permissions are set up to allow this).
> -The rte part of the filenames of each of the above is configurable using the 
> file-prefix parameter.
> +Similarly, it creates shared configuration files, memory mapped in each 
> process.
> +When run as root, the name of the configuration file will be
> +/var/run/.rte_*host*_config, where *host* is the name of the machine.
> +When run as a non-root user, the the name of the configuration file will be
> +$HOME/.rte_*host*_config (if filesystem and device permissions are set up to 
> allow this).
> +If the ``--file-prefix`` parameter has been specified, its value will be used
> +in place of "rte" in the file names.

I am not sure that we need to handle all such cases inside EAL.
User can easily overcome that problem by just adding something like:
--file-prefix=`uname -n`
to his command-line.
Konstantin

> 
>  In addition to specifying the file-prefix parameter,
>  any DPDK applications that are to be run side-by-side must explicitly limit 
> their memory use.
> diff --git a/lib/librte_eal/common/eal_common_proc.c 
> b/lib/librte_eal/common/eal_common_proc.c
> index 12e0fca..517aa0c 100644
> --- a/lib/librte_eal/common/eal_common_proc.c
> +++ b/lib/librte_eal/common/eal_common_proc.c
> @@ -45,12 +45,8 @@ rte_eal_primary_proc_alive(const char *config_file_path)
> 
>   if (config_file_path)
>   config_fd = open(config_file_path, O_RDONLY);
> - else {
> - char default_path[PATH_MAX+1];
> - snprintf(default_path, PATH_MAX, RUNTIME_CONFIG_FMT,
> -  default_config_dir, "rte");
> - config_fd = open(default_path, O_RDONLY);
> - }
> + else
> + config_fd = open(eal_runtime_config_path(), O_RDONLY);
>   if (config_fd < 0)
>   return 0;
> 
> diff --git a/lib/librte_eal/common/eal_filesystem.h 
> b/lib/librte_eal/common/eal_filesystem.h
> index fdb4a70..4929aa3 100644
> --- a/lib/librte_eal/common/eal_filesystem.h
> +++ b/lib/librte_eal/common/eal_filesystem.h
> @@ -41,7 +41,7 @@
>  #define EAL_FILESYSTEM_H
> 
>  /** Path of rte config file. */
> -#define RUNTIME_CONFIG_FMT "%s/.%s_config"
> +#define RUNTIME_CONFIG_FMT "%s/.%s_%s_config"
> 
>  #include 
>  #include 
> @@ -59,11 +59,22 @@ eal_runtime_config_path(void)
>   static char buffer[PATH_MAX]; /* static so auto-zeroed */
>   const char *directory = default_config_dir;
>   const char *home_dir = getenv("HOME");
> + static char nameBuffer[1000];
> + int result;
> 
>   if (getuid() != 0 && home_dir != NULL)
>   directory = home_dir;
> +
> + /*
> +  * Include the name of the host in the config file path. Otherwise,
> +  * if DPDK applications run on different hosts but share a home
> +  * directory (e.g. via NFS), they will choose the same config
> +  * file and conflict unnecessarily.
> +  */
> + result = gethostname(nameBuffer, sizeof(nameBuffer)-1);
>   snprintf(buffer, sizeof(buffer) - 1, RUNTIME_CONFIG_FMT, directory,
> - internal_config.hugefile_prefix);
> + internal_config.hugefile_prefix,
> + (result == 0) ? nameBuffer : "unknown-host");
>   return buffer;
>  }
> 
> --
> 2.8.3

[dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx

2016-10-13 Thread Ananyev, Konstantin



> -Original Message-
> From: Richardson, Bruce
> Sent: Thursday, October 13, 2016 11:19 AM
> To: Ananyev, Konstantin 
> Cc: Vladyslav Buslov ; Wu, Jingjing 
> ; Yigit, Ferruh ;
> Zhang, Helin ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch 
> instructions for bulk rx
> 
> On Wed, Oct 12, 2016 at 12:04:39AM +, Ananyev, Konstantin wrote:
> > Hi Vladislav,
> >
> > > > > > >
> > > > > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > > > > Added prefetch of first packet payload cacheline in
> > > > > > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > > > > > i40e_rx_alloc_bufs
> > > > > > > >
> > > > > > > > Signed-off-by: Vladyslav Buslov
> > > > > > > > 
> > > > > > > > ---
> > > > > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +--
> > > > > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct
> > > > i40e_rx_queue
> > > > > > *rxq)
> > > > > > > > /* Translate descriptor info to mbuf parameters 
> > > > > > > > */
> > > > > > > > for (j = 0; j < nb_dd; j++) {
> > > > > > > > mb = rxep[j].mbuf;
> > > > > > > > +   rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > > > > > RTE_PKTMBUF_HEADROOM));
> > > > > >
> > > > > > Why did prefetch here? I think if application need to deal with
> > > > > > packet, it is more suitable to put it in application.
> > > > > >
> > > > > > > > qword1 = rte_le_to_cpu_64(\
> > > > > > > > 
> > > > > > > > rxdp[j].wb.qword1.status_error_len);
> > > > > > > > pkt_len = ((qword1 &
> > > > > > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > > > > > *rxq)
> > > > > > > >
> > > > > > > > rxdp = >rx_ring[alloc_idx];
> > > > > > > > for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > > > > -   if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > > > > +   if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > > > > > /* Prefetch next mbuf */
> > > > > > > > -   rte_prefetch0(rxep[i + 1].mbuf);
> > > > > > > > +   rte_prefetch0([i + 
> > > > > > > > 1].mbuf->cacheline0);
> > > > > > > > +   rte_prefetch0([i +
> > > > > > > > + 1].mbuf->cacheline1);
> > > >
> > > > I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> > > > specially for that case.
> > >
> > > Thanks for pointing that out.
> > > I'll submit new patch if you decide to move forward with this development.
> > >
> > > >
> > > > > > > > +   }
> > > > > > Agree with this change. And when I test it by testpmd with iofwd, no
> > > > > > performance increase is observed but minor decrease.
> > > > > > Can you share will us when it will benefit the performance in your
> > > > scenario ?
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > > Jingjing
> > > > >
> > > > > Hello Jingjing,
> > > > >
> > > > > Thanks for code review.
> > > > >
> > > > > My use case: We have simple distributor

[dpdk-dev] [PATCH] mempool: fix search of maximum contiguous pages

2016-10-13 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev, Konstantin
> Sent: Thursday, October 13, 2016 10:54 AM
> To: Dai, Wei ; dev at dpdk.org; Gonzalez Monroy, Sergio 
> ; Tan, Jianfeng
> ; Dai, Wei 
> Subject: Re: [dpdk-dev] [PATCH] mempool: fix search of maximum contiguous 
> pages
> 
> Hi
> 
> >
> > Signed-off-by: Wei Dai 
> > ---
> >  lib/librte_mempool/rte_mempool.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_mempool/rte_mempool.c 
> > b/lib/librte_mempool/rte_mempool.c
> > index 71017e1..e3e254a 100644
> > --- a/lib/librte_mempool/rte_mempool.c
> > +++ b/lib/librte_mempool/rte_mempool.c
> > @@ -426,9 +426,12 @@ rte_mempool_populate_phys_tab(struct rte_mempool *mp, 
> > char *vaddr,
> >
> > for (i = 0; i < pg_num && mp->populated_size < mp->size; i += n) {
> >
> > +   phys_addr_t paddr_next;
> > +   paddr_next = paddr[i] + pg_sz;
> > +
> > /* populate with the largest group of contiguous pages */
> > for (n = 1; (i + n) < pg_num &&
> > -paddr[i] + pg_sz == paddr[i+n]; n++)
> > +paddr_next == paddr[i+n]; n++, paddr_next += pg_sz)
> > ;
> 
> Good catch.
> Why not just paddr[i + n - 1] != paddr[i + n]?

Sorry, I meant 'paddr[i + n - 1] + pg_sz == paddr[i+n]' off course.

> Then you don't need extra variable (paddr_next) here.
> Konstantin
> 
> >
> > ret = rte_mempool_populate_phys(mp, vaddr + i * pg_sz,
> > --
> > 2.7.4

[dpdk-dev] [PATCH] mempool: fix search of maximum contiguous pages

2016-10-13 Thread Ananyev, Konstantin

Hi 

> 
> Signed-off-by: Wei Dai 
> ---
>  lib/librte_mempool/rte_mempool.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/librte_mempool/rte_mempool.c 
> b/lib/librte_mempool/rte_mempool.c
> index 71017e1..e3e254a 100644
> --- a/lib/librte_mempool/rte_mempool.c
> +++ b/lib/librte_mempool/rte_mempool.c
> @@ -426,9 +426,12 @@ rte_mempool_populate_phys_tab(struct rte_mempool *mp, 
> char *vaddr,
> 
>   for (i = 0; i < pg_num && mp->populated_size < mp->size; i += n) {
> 
> + phys_addr_t paddr_next;
> + paddr_next = paddr[i] + pg_sz;
> +
>   /* populate with the largest group of contiguous pages */
>   for (n = 1; (i + n) < pg_num &&
> -  paddr[i] + pg_sz == paddr[i+n]; n++)
> +  paddr_next == paddr[i+n]; n++, paddr_next += pg_sz)
>   ;

Good catch.
Why not just paddr[i + n - 1] != paddr[i + n]?
Then you don't need extra variable (paddr_next) here.
Konstantin

> 
>   ret = rte_mempool_populate_phys(mp, vaddr + i * pg_sz,
> --
> 2.7.4

[dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx

2016-10-12 Thread Ananyev, Konstantin

Hi Vladislav,

> > > > >
> > > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > > Added prefetch of first packet payload cacheline in
> > > > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > > > i40e_rx_alloc_bufs
> > > > > >
> > > > > > Signed-off-by: Vladyslav Buslov
> > > > > > 
> > > > > > ---
> > > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +--
> > > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct
> > i40e_rx_queue
> > > > *rxq)
> > > > > > /* Translate descriptor info to mbuf parameters */
> > > > > > for (j = 0; j < nb_dd; j++) {
> > > > > > mb = rxep[j].mbuf;
> > > > > > +   rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > > > RTE_PKTMBUF_HEADROOM));
> > > >
> > > > Why did prefetch here? I think if application need to deal with
> > > > packet, it is more suitable to put it in application.
> > > >
> > > > > > qword1 = rte_le_to_cpu_64(\
> > > > > > rxdp[j].wb.qword1.status_error_len);
> > > > > > pkt_len = ((qword1 &
> > > > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > > > *rxq)
> > > > > >
> > > > > > rxdp = >rx_ring[alloc_idx];
> > > > > > for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > > -   if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > > +   if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > > > /* Prefetch next mbuf */
> > > > > > -   rte_prefetch0(rxep[i + 1].mbuf);
> > > > > > +   rte_prefetch0([i + 
> > > > > > 1].mbuf->cacheline0);
> > > > > > +   rte_prefetch0([i +
> > > > > > + 1].mbuf->cacheline1);
> >
> > I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> > specially for that case.
> 
> Thanks for pointing that out.
> I'll submit new patch if you decide to move forward with this development.
> 
> >
> > > > > > +   }
> > > > Agree with this change. And when I test it by testpmd with iofwd, no
> > > > performance increase is observed but minor decrease.
> > > > Can you share will us when it will benefit the performance in your
> > scenario ?
> > > >
> > > >
> > > > Thanks
> > > > Jingjing
> > >
> > > Hello Jingjing,
> > >
> > > Thanks for code review.
> > >
> > > My use case: We have simple distributor thread that receives packets
> > > from port and distributes them among worker threads according to VLAN
> > and MAC address hash.
> > >
> > > While working on performance optimization we determined that most of
> > CPU usage of this thread is in DPDK.
> > > As and optimization we decided to switch to rx burst alloc function,
> > > however that caused additional performance degradation compared to
> > scatter rx mode.
> > > In profiler two major culprits were:
> > >   1. Access to packet data Eth header in application code. (cache miss)
> > >   2. Setting next packet descriptor field to NULL in DPDK
> > > i40e_rx_alloc_bufs code. (this field is in second descriptor cache
> > > line that was not
> > > prefetched)
> >
> > I wonder what will happen if we'll remove any prefetches here?
> > Would it make things better or worse (and by how much)?
> 
> In our case it causes few per cent PPS degradation on next=NULL assignment 
> but it seems that JingJing's test doesn't confirm it.
> 
> >
> > > After applying my fixes performance improved compared to scatter rx
> > mode.
> > >
> > > I assumed that prefetch of first cache line of packet data belongs to
> > > DPDK because it is done in scatter rx mode. (in
> > > i40e_recv_scattered_pkts)
> > > It can be moved to application side but IMO it is better to be consistent
> > across all rx modes.
> >
> > I would agree with Jingjing here, probably PMD should avoid to prefetch
> > packet's data.
> 
> Actually I can see some valid use cases where it is beneficial to have this 
> prefetch in driver.
> In our sw distributor case it is trivial to just prefetch next packet on each 
> iteration because packets are processed one by one.
> However when we move this functionality to hw by means of 
> RSS/vfunction/FlowDirector(our long term goal) worker threads will receive
> packets directly from rx queues of NIC.
> First operation of worker thread is to perform bulk lookup in hash table by 
> destination MAC. This will cause cache miss on accessing each
> eth header and can't be easily mitigated in application code.
> I assume it is ubiquitous use case for DPDK.

Yes it is a quite common use-case.
Though I many cases it is possible to

[dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx

2016-10-11 Thread Ananyev, Konstantin

Hi Vladislav,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Vladyslav Buslov
> Sent: Monday, October 10, 2016 6:06 PM
> To: Wu, Jingjing ; Yigit, Ferruh  intel.com>; Zhang, Helin 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch 
> instructions for bulk rx
> 
> > -Original Message-
> > From: Wu, Jingjing [mailto:jingjing.wu at intel.com]
> > Sent: Monday, October 10, 2016 4:26 PM
> > To: Yigit, Ferruh; Vladyslav Buslov; Zhang, Helin
> > Cc: dev at dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > instructions for bulk rx
> >
> >
> >
> > > -Original Message-
> > > From: Yigit, Ferruh
> > > Sent: Wednesday, September 14, 2016 9:25 PM
> > > To: Vladyslav Buslov ; Zhang, Helin
> > > ; Wu, Jingjing 
> > > Cc: dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > > instructions for bulk rx
> > >
> > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > Added prefetch of first packet payload cacheline in
> > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > i40e_rx_alloc_bufs
> > > >
> > > > Signed-off-by: Vladyslav Buslov 
> > > > ---
> > > >  drivers/net/i40e/i40e_rxtx.c | 7 +--
> > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue
> > *rxq)
> > > > /* Translate descriptor info to mbuf parameters */
> > > > for (j = 0; j < nb_dd; j++) {
> > > > mb = rxep[j].mbuf;
> > > > +   rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > RTE_PKTMBUF_HEADROOM));
> >
> > Why did prefetch here? I think if application need to deal with packet, it 
> > is
> > more suitable to put it in application.
> >
> > > > qword1 = rte_le_to_cpu_64(\
> > > > rxdp[j].wb.qword1.status_error_len);
> > > > pkt_len = ((qword1 &
> > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > *rxq)
> > > >
> > > > rxdp = >rx_ring[alloc_idx];
> > > > for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > -   if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > +   if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > /* Prefetch next mbuf */
> > > > -   rte_prefetch0(rxep[i + 1].mbuf);
> > > > +   rte_prefetch0([i + 1].mbuf->cacheline0);
> > > > +   rte_prefetch0([i + 1].mbuf->cacheline1);

I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
specially for that case.

> > > > +   }
> > Agree with this change. And when I test it by testpmd with iofwd, no
> > performance increase is observed but minor decrease.
> > Can you share will us when it will benefit the performance in your scenario 
> > ?
> >
> >
> > Thanks
> > Jingjing
> 
> Hello Jingjing,
> 
> Thanks for code review.
> 
> My use case: We have simple distributor thread that receives packets from 
> port and distributes them among worker threads according to
> VLAN and MAC address hash.
> 
> While working on performance optimization we determined that most of CPU 
> usage of this thread is in DPDK.
> As and optimization we decided to switch to rx burst alloc function, however 
> that caused additional performance degradation compared to
> scatter rx mode.
> In profiler two major culprits were:
>   1. Access to packet data Eth header in application code. (cache miss)
>   2. Setting next packet descriptor field to NULL in DPDK i40e_rx_alloc_bufs 
> code. (this field is in second descriptor cache line that was not
> prefetched)

I wonder what will happen if we'll remove any prefetches here?
Would it make things better or worse (and by how much)?

> After applying my fixes performance improved compared to scatter rx mode.
> 
> I assumed that prefetch of first cache line of packet data belongs to DPDK 
> because it is done in scatter rx mode. (in
> i40e_recv_scattered_pkts)
> It can be moved to application side but IMO it is better to be consistent 
> across all rx modes.

I would agree with Jingjing here, probably PMD should avoid to prefetch 
packet's data. 
Konstantin

[dpdk-dev] [PATCH] eal: fix c++ compilation issue with rte_delay_us()

2016-10-05 Thread Ananyev, Konstantin

Hi Thomas,

> 
> 2016-10-03 18:27, Konstantin Ananyev:
> > When compiling with C++, it treats
> > void (*rte_delay_us)(unsigned int us);
> > as definition of the global variable.
> > So further linking with librte_eal fails.
> >
> > Fixes: b4d63fb62240 ("eal: customize delay function")
> 
> Applied, thanks
> 
> I don't understand why it was not failing with C compilation?

Don't know off hand.
Yes, I would expect gcc to fail with same symptoms too.
But by some reason it puts it makes it a 'common' symbol:

$ cat rttm1.c

#include 
#include 
#include 

int main(int argc, char *argv[])
{
int ret = rte_eal_init(argc, argv);
rte_delay_us(1);
printf("return code: %d\n", ret);
return ret;
}

$ gcc -m64 -pthread -o rttm1 rttm1.o -ldl   -L/${RTE_SDK}/${RTE_TARGET}/lib 
-Wl,-lrte_eal
$ nm rttm1.o | grep rte_delay_us
0008 C rte_delay_us

Konstantin

[dpdk-dev] [PATCH] mbuf: fix atomic refcnt update synchronization

2016-09-03 Thread Ananyev, Konstantin

Hi,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Linzhe Lee
> Sent: Saturday, September 3, 2016 3:05 AM
> To: Stephen Hemminger 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] mbuf: fix atomic refcnt update synchronization
> 
> yes,stephen.
> 
> my config file here: http://pastebin.com/N0RKGArh
> 
> 2016-09-03 0:51 GMT+08:00 Stephen Hemminger :
> > On Sat, 3 Sep 2016 00:31:50 +0800
> > Linzhe Lee  wrote:
> >
> >> Thanks for reply, Stephen.
> >>
> >>
> >>
> >> I'm in x86-64, my cpu is `Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz`.
> >>
> >>
> >>
> >> When allocation mbuf in program1, and transfer it to program2 for
> >> free via ring, the program1 might meet assert in allocate mbuf sometimes.
> >> (`RTE_ASSERT(rte_mbuf_refcnt_read(m) == 0);`)

If you believe there is a problem inside rte_mbuf code,
please provide a test program to reproduce the issue.
So far, I personally don't see any issue in the rte_mbuf code.
Konstantin


> >>
> >>
> >>
> >> but when I using gdb to check it, the refcnt field of mbuf is already
> >> zero. so I believe the problem came from the cache line problem or
> >> incorrect optimization.
> >>
> >>
> >>
> >> When apply this patch, the problem seems solved. I'm submitting it
> >> for your comments.
> >
> > Are you sure you have REFCNT_ATOMIC set?

[dpdk-dev] [PATCH v6 2/9] acl: add altivec intrinsics for dpdk acl on ppc_64

2016-08-31 Thread Ananyev, Konstantin

> 
> This patch adds port for ACL library in ppc64le.
> 
> Signed-off-by: Gowrishankar Muthukrishnan  linux.vnet.ibm.com>
> ---
>  app/test-acl/main.c |   4 +
>  config/defconfig_ppc_64-power8-linuxapp-gcc |   1 -
>  lib/librte_acl/Makefile |   2 +
>  lib/librte_acl/acl.h|   4 +
>  lib/librte_acl/acl_run.h|   2 +
>  lib/librte_acl/acl_run_altivec.c|  47 
>  lib/librte_acl/acl_run_altivec.h| 329 
> 
>  lib/librte_acl/rte_acl.c|  13 ++
>  lib/librte_acl/rte_acl.h|   1 +
>  9 files changed, 402 insertions(+), 1 deletion(-)  create mode 100644 
> lib/librte_acl/acl_run_altivec.c  create mode 100644
> lib/librte_acl/acl_run_altivec.h
> 

I am not a ppc expert, and I don't have ppc HW to try,
But from IA perspective - all looks good.
Acked-by: Konstantin Ananyev

[dpdk-dev] [PATCH 0/6] add Tx preparation

2016-08-31 Thread Ananyev, Konstantin



> 
> On Fri, 26 Aug 2016 18:22:52 +0200
> Tomasz Kulasek  wrote:
> 
> > As discussed in that thread:
> >
> > http://dpdk.org/ml/archives/dev/2015-September/023603.html
> >
> > Different NIC models depending on HW offload requested might impose
> > different requirements on packets to be TX-ed in terms of:
> >
> >  - Max number of fragments per packet allowed
> >  - Max number of fragments per TSO segments
> >  - The way pseudo-header checksum should be pre-calculated
> >  - L3/L4 header fields filling
> >  - etc.
> >
> >
> > MOTIVATION:
> > ---
> >
> > 1) Some work cannot (and didn't should) be done in rte_eth_tx_burst.
> >However, this work is sometimes required, and now, it's an
> >application issue.
> 
> Why not? You are adding an additional API burden on every application.
> 
> >
> > 2) Different hardware may have different requirements for TX offloads,
> >other subset can be supported and so on.
> 
> These need to be reported by API so that application can handle it.

If you read the patch description, you'll see that we do both:
- provide tx_prep()
- "2) Also new fields will be introduced in rte_eth_desc_lim: 
   nb_seg_max and nb_mtu_seg_max, providing an information about max
   segments in TSO and non-TSO packets acceptable by device.

   This information is useful for application to not create/limit
   malicious packet."

> Doing these transformations in tx_prep seems late in the process.

Why is that?
It is totally up to the application to decide ahat stage it wants to call 
tx_prep() for each packet -
just after it formed and mbuf to be TX-ed, or just before calling tx_burst() 
for it, or somewhere in btetween. 

> 
> >
> > 3) Some parameters (e.g. number of segments in ixgbe driver) may hung
> >device. These parameters may be vary for different devices.
> >
> >For example i40e HW allows 8 fragments per packet, but that is after
> >TSO segmentation. While ixgbe has a 38-fragment pre-TSO limit.
> 
> Seems better to handle these limits as exceptions in i40e_tx_burst etc; 
> rather than a pre-step. Look at how Linux driver API works, several
> drivers have to have an exception linearize path.

Hmm, doesn't it contradicts with your statement above:
' Doing these transformations in tx_prep seems late in the process.'? :)
I suppose we all know that Linux kernel driver and DPDK PMD usage model is 
quite different. 
As a rule of thumb we try to avoid modifying packet data inside the tx_burst() 
itself.
Having this functionality in a different function gives upper layer a choice 
when it is better
to modify packet contents and hopefully hide/minimize memory access latencies.  
   

> 
> >
> > 4) Fields in packet may require different initialization (like e.g. will
> >require pseudo-header checksum precalculation, sometimes in a
> >different way depending on packet type, and so on). Now application
> >needs to care about it.
> 
> Once again, the driver should do this in Tx.

Once again, I really doubt it should.

> 
> 
> >
> > 5) Using additional API (rte_eth_tx_prep) before rte_eth_tx_burst let to
> >prepare packet burst in acceptable form for specific device.
> >
> > 6) Some additional checks may be done in debug mode keeping tx_burst
> >implementation clean.
> 
> Most of this could be done by refactoring existing tx_burst in drivers.
> Much of the code seems to be written as the "let's write a 2000 line function 
> because that is most efficient" rather than "let's write small
> steps and let the compiler optimize it"

I don't see how that could be easily done inside tx_burst() without signifcatn 
performance loss.
Especially if we have a pipeline model, when we have one or several t produce 
mbufs to be TX-ed,
and one or several lcores that doing actual TX for these packets.

Konstantin

[dpdk-dev] [PATCH] acl: use rte_calloc for temporary memory allocation

2016-08-31 Thread Ananyev, Konstantin

Hi Vlad, 

> 
> Hi Konstantin,
> 
> Thanks for your feedback.
> 
> Would you accept this change as config file compile-time parameter with libc 
> calloc as default?
> It is one line change only so it is easy to ifdef.

It is an option, but the main requirements from the community
is to minimize number of build time config options, and instead provide
ability to the user to configure things at runtime.
That's why I thought about something like:

+ /* use EAL hugepages for temporary memory allocations,
+  * might improve build time, build increases hugepages
+  * demand significantly.
+  */
#define RTE_ACL_CFG_FLAG_HMEM   1

struct rte_acl_config {
uint32_t num_categories; /**< Number of categories to build with. */
uint32_t num_fields; /**< Number of field definitions. */
struct rte_acl_field_def defs[RTE_ACL_MAX_FIELDS];
/**< array of field definitions. */
size_t max_size;
/**< max memory limit for internal run-time structures. */
+   uint64_t flags;
}; 

And then change tb_pool() to use either calloc() or rte_alloc() based on the 
flags value.
Another, probably even better and more flexible way is to allow user to specify 
his own alloc routine:

struct rte_acl_config {
uint32_t num_categories; /**< Number of categories to build with. */
uint32_t num_fields; /**< Number of field definitions. */
struct rte_acl_field_def defs[RTE_ACL_MAX_FIELDS];
/**< array of field definitions. */
size_t max_size;
/**< max memory limit for internal run-time structures. */
+   void *(tballoc)( (size_t, size_t);
+   void (*tbfree)(void *);
};

In that case user can provide his own alloc/free based on rte_calloc/rte_free
or even on something else.
The only drawback I see with that approaches, that in that case, you'll need to 
follow
ABI compliance rules, which probably means that your change might not make 
16.11.

Konstantin 

> 
> Regards,
> Vlad
> 
> -Original Message-
> From: Ananyev, Konstantin [mailto:konstantin.ananyev at intel.com]
> Sent: Wednesday, August 31, 2016 4:28 AM
> To: Vladyslav Buslov
> Cc: dev at dpdk.org
> Subject: RE: [PATCH] acl: use rte_calloc for temporary memory allocation
> 
> Hi Vladyslav,
> 
> > -Original Message-
> > From: Vladyslav Buslov [mailto:vladyslav.buslov at harmonicinc.com]
> > Sent: Tuesday, August 16, 2016 3:01 PM
> > To: Ananyev, Konstantin 
> > Cc: dev at dpdk.org
> > Subject: [PATCH] acl: use rte_calloc for temporary memory allocation
> >
> > Acl build process uses significant amount of memory which degrades
> > performance by causing page walks when memory is allocated on regular heap 
> > using libc calloc.
> >
> > This commit changes tb_mem to allocate temporary memory on huge pages with 
> > rte_calloc.
> 
> We deliberately used standard system memory allocation routines (calloc/free) 
> here.
> With current design build phase was never considered to be an 'RT' phase 
> operation.
> It is pretty cpu and memory expensive.
> So if we'll use RTE memory for build phase it could easily consume all (or 
> most) of it, and might cause DPDK process failure or degradation.
> If you really feel that you (and other users) would benefit from using 
> rte_calloc here (I personally still in doubt), then at least it should be a
> new field inside rte_acl_config, that would allow user to control that 
> behavior.
> Though, as I said above, librte_acl was never designed to ' to allocate tens 
> of thousands of ACLs at runtime'.
> To add ability to add/delete rules at runtime while keeping lookup time 
> reasonably low some new approach need to be introduced.
> Konstantin
> 
> >
> > Signed-off-by: Vladyslav Buslov 
> > ---
> >  lib/librte_acl/tb_mem.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_acl/tb_mem.c b/lib/librte_acl/tb_mem.c index
> > 157e608..c373673 100644
> > --- a/lib/librte_acl/tb_mem.c
> > +++ b/lib/librte_acl/tb_mem.c
> > @@ -52,7 +52,7 @@ tb_pool(struct tb_mem_pool *pool, size_t sz)
> > size_t size;
> >
> > size = sz + pool->alignment - 1;
> > -   block = calloc(1, size + sizeof(*pool->block));
> > +   block = rte_calloc("ACL_TBMEM_BLOCK", 1, size +
> > +sizeof(*pool->block), 0);
> > if (block == NULL) {
> > RTE_LOG(ERR, MALLOC, "%s(%zu)\n failed, currently allocated "
> > "by pool: %zu bytes\n", __func__, sz, pool->alloc);
> > --
> > 2.8.3

[dpdk-dev] [PATCH] acl: use rte_calloc for temporary memory allocation

2016-08-31 Thread Ananyev, Konstantin

Hi Vladyslav,

> -Original Message-
> From: Vladyslav Buslov [mailto:vladyslav.buslov at harmonicinc.com]
> Sent: Tuesday, August 16, 2016 3:01 PM
> To: Ananyev, Konstantin 
> Cc: dev at dpdk.org
> Subject: [PATCH] acl: use rte_calloc for temporary memory allocation
> 
> Acl build process uses significant amount of memory which degrades 
> performance by causing page walks when memory is allocated on
> regular heap using libc calloc.
> 
> This commit changes tb_mem to allocate temporary memory on huge pages with 
> rte_calloc.

We deliberately used standard system memory allocation routines (calloc/free) 
here.
With current design build phase was never considered to be an 'RT' phase 
operation.
It is pretty cpu and memory expensive.
So if we'll use RTE memory for build phase it could easily consume all (or most)
of it, and might cause DPDK process failure or degradation.
If you really feel that you (and other users) would benefit from using
rte_calloc here (I personally still in doubt), then at least it should be a new
field inside rte_acl_config, that would allow user to control that behavior.
Though, as I said above, librte_acl was never designed to ' to allocate tens of 
thousands of ACLs at runtime'.
To add ability to add/delete rules at runtime while keeping lookup time 
reasonably low
some new approach need to be introduced.  
Konstantin

> 
> Signed-off-by: Vladyslav Buslov 
> ---
>  lib/librte_acl/tb_mem.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/librte_acl/tb_mem.c b/lib/librte_acl/tb_mem.c index 
> 157e608..c373673 100644
> --- a/lib/librte_acl/tb_mem.c
> +++ b/lib/librte_acl/tb_mem.c
> @@ -52,7 +52,7 @@ tb_pool(struct tb_mem_pool *pool, size_t sz)
>   size_t size;
> 
>   size = sz + pool->alignment - 1;
> - block = calloc(1, size + sizeof(*pool->block));
> + block = rte_calloc("ACL_TBMEM_BLOCK", 1, size + sizeof(*pool->block),
> +0);
>   if (block == NULL) {
>   RTE_LOG(ERR, MALLOC, "%s(%zu)\n failed, currently allocated "
>   "by pool: %zu bytes\n", __func__, sz, pool->alloc);
> --
> 2.8.3

[dpdk-dev] [PATCH] app/testpmd: fix RSS-hash-key size

2016-08-04 Thread Ananyev, Konstantin

Hi Awal,

As I said in the offline discussion, here are few nits
that I think need to be addressed.
See below. 
Konstantin

> 
> RSS hash-key-size is retrieved from device configuration instead of using a 
> fixed size of 40 bytes.
> 
> Fixes: f79959ea1504 ("app/testpmd: allow to configure RSS hash key")
> 
> Signed-off-by: Mohammad Abdul Awal 
> ---
>  app/test-pmd/cmdline.c | 24 +---  app/test-pmd/config.c  
> | 17 ++---
>  2 files changed, 31 insertions(+), 10 deletions(-)
> 
> diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c index 
> f90befc..14412b4 100644
> --- a/app/test-pmd/cmdline.c
> +++ b/app/test-pmd/cmdline.c
> @@ -1608,7 +1608,6 @@ struct cmd_config_rss_hash_key {
>   cmdline_fixed_string_t key;
>  };
> 
> -#define RSS_HASH_KEY_LENGTH 40
>  static uint8_t
>  hexa_digit_to_value(char hexa_digit)
>  {
> @@ -1640,20 +1639,30 @@ cmd_config_rss_hash_key_parsed(void *parsed_result,
>  __attribute__((unused)) void *data)  {
>   struct cmd_config_rss_hash_key *res = parsed_result;
> - uint8_t hash_key[RSS_HASH_KEY_LENGTH];
> + uint8_t hash_key[16 * 4];

No need for hard-coded constants.
I'd suggest to keep RSS_HASH_KEY_LENGTH, just increase it to 52 (or might be 
even bigger) value.

>   uint8_t xdgt0;
>   uint8_t xdgt1;
>   int i;
> + struct rte_eth_dev_info dev_info;
> + uint8_t hash_key_size;
> 
> + memset(_info, 0, sizeof(dev_info));
> + rte_eth_dev_info_get(res->port_id, _info);
> + if (dev_info.hash_key_size > 0) {

&& dev_info.hash_key_size <= sizeof(hash_key) {

> + hash_key_size = dev_info.hash_key_size;
> + } else {
> + printf("dev_info did not provide a valid hash key size\n");
> + return;
> + }
>   /* Check the length of the RSS hash key */
> - if (strlen(res->key) != (RSS_HASH_KEY_LENGTH * 2)) {
> + if (strlen(res->key) != (hash_key_size * 2)) {
>   printf("key length: %d invalid - key must be a string of %d"
>  "hexa-decimal numbers\n", (int) strlen(res->key),
> -RSS_HASH_KEY_LENGTH * 2);
> +hash_key_size * 2);
>   return;
>   }
>   /* Translate RSS hash key into binary representation */
> - for (i = 0; i < RSS_HASH_KEY_LENGTH; i++) {
> + for (i = 0; i < hash_key_size; i++) {
>   xdgt0 = parse_and_check_key_hexa_digit(res->key, (i * 2));
>   if (xdgt0 == 0xFF)
>   return;
> @@ -1663,7 +1672,7 @@ cmd_config_rss_hash_key_parsed(void *parsed_result,
>   hash_key[i] = (uint8_t) ((xdgt0 * 16) + xdgt1);
>   }
>   port_rss_hash_key_update(res->port_id, res->rss_type, hash_key,
> -  RSS_HASH_KEY_LENGTH);
> + hash_key_size);
>  }
> 
>  cmdline_parse_token_string_t cmd_config_rss_hash_key_port = @@ -1692,7 
> +1701,8 @@ cmdline_parse_inst_t
> cmd_config_rss_hash_key = {
>   "port config X rss-hash-key ipv4|ipv4-frag|ipv4-tcp|ipv4-udp|"
>   "ipv4-sctp|ipv4-other|ipv6|ipv6-frag|ipv6-tcp|ipv6-udp|"
>   "ipv6-sctp|ipv6-other|l2-payload|"
> - "ipv6-ex|ipv6-tcp-ex|ipv6-udp-ex 80 hexa digits\n",
> + "ipv6-ex|ipv6-tcp-ex|ipv6-udp-ex "
> + "80 hexa digits (104 hexa digits for fortville)\n",

No need to mention particular NIC (Fortville) here.
I'd say better: 'array of hex digits (variable size, NIC dependent)' or so.

>   .tokens = {
>   (void *)_config_rss_hash_key_port,
>   (void *)_config_rss_hash_key_config,
> diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c index 
> bfcbff9..851408b 100644
> --- a/app/test-pmd/config.c
> +++ b/app/test-pmd/config.c
> @@ -1012,14 +1012,25 @@ void
>  port_rss_hash_conf_show(portid_t port_id, char rss_info[], int show_rss_key) 
>  {
>   struct rte_eth_rss_conf rss_conf;
> - uint8_t rss_key[10 * 4] = "";
> + uint8_t rss_key[16 * 4] = "";

better rss_key[RSS_HASH_KEY_LENGTH ], and I think there is no need to put '0' 
into first element.

>   uint64_t rss_hf;
>   uint8_t i;
>   int diag;
> + struct rte_eth_dev_info dev_info;
> + uint8_t hash_key_size;
> 
>   if (port_id_is_invalid(port_id, ENABLED_WARN))
>   return;
> 
> + memset(_info, 0, sizeof(dev_info));
> + rte_eth_dev_info_get(port_id, _info);
> + if (dev_info.hash_key_size > 0) {
> + hash_key_size = dev_info.hash_key_size;
> + } else {
> + printf("dev_info did not provide a valid hash key size\n");
> + return;
> + }
> +
>   rss_conf.rss_hf = 0;
>   for (i = 0; i < RTE_DIM(rss_type_table); i++) {
>   if (!strcmp(rss_info, rss_type_table[i].str)) @@ -1028,7 
> +1039,7 @@ port_rss_hash_conf_show(portid_t port_id, char
> rss_info[], int show_rss_key)
> 
>   /* Get RSS hash key if asked to display it

[dpdk-dev] [PATCH] i40e: enable i40e pmd on ARM platform

2016-08-03 Thread Ananyev, Konstantin


Hi Jianbo,

> > Hi, Jianbo
> >
> > I have tested you patch on my X86 platform,  the single core performance 
> > for Non-vector PMD will have about 1Mpps drop
> > Non-vector PMD single core performance with patch   :  ~33.9 
> > Mpps
> > Non-vector PMD single core performance without patch:  ~35.1 Mpps
> > Is there any way to avoid such performance drop on X86? Thanks.
> >
> 
> I think we can place a compiling condition before rte_rmb() to avoid 
> performance decrease on x86.
> For example:  #if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64)

I suppose you can use rte_smp_rmb() here?
Konstantin

> 
> Thanks!
> Jianbo
> 
> > BRs
> > Lei
> >
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Jianbo Liu
> > Sent: Tuesday, August 2, 2016 2:58 PM
> > To: dev at dpdk.org; Zhang, Helin ; Wu, Jingjing
> > 
> > Cc: Jianbo Liu 
> > Subject: [dpdk-dev] [PATCH] i40e: enable i40e pmd on ARM platform
> >
> > And add read memory barrier to avoid status inconsistency between two RX 
> > descriptors readings.
> >
> > Signed-off-by: Jianbo Liu 
> > ---
> >  config/defconfig_arm64-armv8a-linuxapp-gcc | 2 +-
> >  doc/guides/nics/overview.rst   | 2 +-
> >  drivers/net/i40e/i40e_rxtx.c   | 2 ++
> >  3 files changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/config/defconfig_arm64-armv8a-linuxapp-gcc
> > b/config/defconfig_arm64-armv8a-linuxapp-gcc
> > index 1a17126..08f282b 100644
> > --- a/config/defconfig_arm64-armv8a-linuxapp-gcc
> > +++ b/config/defconfig_arm64-armv8a-linuxapp-gcc
> > @@ -46,6 +46,6 @@ CONFIG_RTE_EAL_IGB_UIO=n
> >
> >  CONFIG_RTE_LIBRTE_IVSHMEM=n
> >  CONFIG_RTE_LIBRTE_FM10K_PMD=n
> > -CONFIG_RTE_LIBRTE_I40E_PMD=n
> > +CONFIG_RTE_LIBRTE_I40E_INC_VECTOR=n
> >
> >  CONFIG_RTE_SCHED_VECTOR=n
> > diff --git a/doc/guides/nics/overview.rst
> > b/doc/guides/nics/overview.rst index 6abbae6..5175591 100644
> > --- a/doc/guides/nics/overview.rst
> > +++ b/doc/guides/nics/overview.rst
> > @@ -138,7 +138,7 @@ Most of these differences are summarized below.
> > Linux VFIO Y Y   Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y  
> >Y   Y Y
> > Other kdrv Y Y  
> >  Y
> > ARMv7   
> >  Y Y Y
> > -   ARMv8  Y Y Y Y  
> >  Y Y   Y Y
> > +   ARMv8  Y   Y Y Y Y  
> >  Y Y   Y Y
> > Power8 Y Y  
> >  Y
> > TILE-Gx 
> >  Y
> > x86-32 Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y  
> >  Y   Y Y Y
> > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > b/drivers/net/i40e/i40e_rxtx.c index 554d167..4004b8e 100644
> > --- a/drivers/net/i40e/i40e_rxtx.c
> > +++ b/drivers/net/i40e/i40e_rxtx.c
> > @@ -994,6 +994,8 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > I40E_RXD_QW1_STATUS_SHIFT;
> > }
> >
> > +   rte_rmb();
> > +
> > /* Compute how many status bits were set */
> > for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
> > nb_dd += s[j] & (1 <<
> > I40E_RX_DESC_STATUS_DD_SHIFT);
> > --
> > 2.4.11
> >

[dpdk-dev] [PATCH v3] i40: fix the VXLAN TSO issue

2016-07-29 Thread Ananyev, Konstantin

Hi Jianfeng,

> 
> Hi,
> 
> On 7/18/2016 7:56 PM, Zhe Tao wrote:
> > Problem:
> > When using the TSO + VXLAN feature in i40e, the outer UDP length
> > fields in the multiple UDP segments which are TSOed by the i40e will
> > have a wrong value.
> >
> > Fix this problem by adding the tunnel type field in the i40e
> > descriptor which was missed before.
> >
> > Fixes: 77b8301733c3 ("i40e: VXLAN Tx checksum offload")
> >
> > Signed-off-by: Zhe Tao 
> > ---
> > v2: edited the comments
> > v3: added external IP offload flag when TSO is enabled for tunnelling
> > packets
> >
> >   app/test-pmd/csumonly.c  | 29 +
> >   drivers/net/i40e/i40e_rxtx.c | 12 +---
> >   lib/librte_mbuf/rte_mbuf.h   | 16 +++-
> >   3 files changed, 45 insertions(+), 12 deletions(-)
> >
> ...
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 15e3a10..90812ea 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -133,6 +133,17 @@ extern "C" {
> >   /* add new TX flags here */
> >
> >   /**
> > + * Bits 45:48 used for the tunnel type.
> > + * When doing Tx offload like TSO or checksum, the HW needs to
> > +configure the
> > + * tunnel type into the HW descriptors.
> > + */
> > +#define PKT_TX_TUNNEL_VXLAN   (1ULL << 45)
> > +#define PKT_TX_TUNNEL_GRE   (2ULL << 45)
> > +#define PKT_TX_TUNNEL_IPIP(3ULL << 45)
> > +/* add new TX TUNNEL type here */
> > +#define PKT_TX_TUNNEL_MASK(0xFULL << 45)
> > +
> 
> Above flag bits are added so that i40e driver can tell tunnel type of this 
> packet (udp or gre or ipip), just interested to know how about just do
> a simple analysis like below without introducing these flags?
> 
> if outer_ether.proto == ipv4
>  l4_proto = ipv4_hdr->next_proto;
> else outer_ether.proto == ipv6
>  l4_proto = ipv6_hdr->next_proto;
> 
> switch (l4_proto)
>  case ipv4:
>  case ipv6:
>  tunnel_type = ipip;
>  break;
>  case udp:
>  tunnel_type = udp;
>  break;
>  case gre:
>  tunnel_type = gre;
>  break;
>  default:
>   error;

Right now none of our PMDs reads/writes actual packet data,
and I think it is better to keep it that way.
It is an upper layer responsibility to specify which offload
needs to be enabled and provide necessary information. 
Konstantin 

> 
> Thanks,
> Jianfeng
> 
> 
> > +/**
> >* Second VLAN insertion (QinQ) flag.
> >*/
> >   #define PKT_TX_QINQ_PKT(1ULL << 49)   /**< TX packet with double VLAN 
> > inserted. */
> > @@ -867,7 +878,10 @@ struct rte_mbuf {
> > union {
> > uint64_t tx_offload;   /**< combined for easy fetch */
> > struct {
> > -   uint64_t l2_len:7; /**< L2 (MAC) Header Length. */
> > +   uint64_t l2_len:7;
> > +   /**< L2 (MAC) Header Length if it isn't a tunneling pkt.
> > +* for tunnel it is outer L4 len+tunnel len+inner L2 len
> > +*/
> > uint64_t l3_len:9; /**< L3 (IP) Header Length. */
> > uint64_t l4_len:8; /**< L4 (TCP/UDP) Header Length. */
> > uint64_t tso_segsz:16; /**< TCP TSO segment size */

[dpdk-dev] [PATCH v2] doc: announce ABI change for rte_eth_dev structure

2016-07-28 Thread Ananyev, Konstantin

Hi Olivier,

> 
> Hi,
> 
> Jumping into this thread, it looks it's the last pending patch remaining for 
> the release.
> 
> For reference, the idea of tx_prep() was mentionned some time ago in 
> http://dpdk.org/ml/archives/dev/2014-May/002504.html
> 
> Few comments below.
> 
> On 07/28/2016 03:01 PM, Ananyev, Konstantin wrote:
> > Right now to make HW TX offloads to work user is required to do particular 
> > actions:
> > 1. set mbuf.ol_flags properly.
> > 2. setup mbuf.tx_offload fields properly.
> > 3. update L3/L4 header fields in a particular way.
> >
> > We move #3 into tx_prep(), to hide that complexity from the user simplify 
> > things for him.
> > Though if he still prefers to do #3  by himself - that's ok too.
> 
> I think moving #3 out of the application is a good idea. Today, for TSO, the 
> offload dpdk API requires to set a specific pseudo header
> checksum (which does not include the ip len, as expected by Intel drivers), 
> and set the IP checksum to 0.
> 
> In our app, the network stack sets the TCP checksum to the standard pseudo 
> header checksum, and before sending the mbuf:
> - packets are split in sw if the driver does not support tso
> - the tcp csum is patched to match dpdk api if the driver supports tso
> 
> In the patchs I've recently sent adding tso support for virtio-pmd, it 
> conforms to the dpdk API (phdr csum without ip len), so the tx function
> need to patch the mbuf data inside the driver, which is something what we 
> want to avoid, for some good reasons explained by Konstantin.

Yep, that would be another good use-case for tx_prep() I suppose.

> 
> So, I think having a tx_prep would also be the occasion to standardize a bit 
> the dpdk offload api, and let the driver-specific stuff in tx_prep().
> 
> Konstantin, any opinion about this?

Yes, that sounds like a good thing to me.

> 
> >>> Another main purpose of tx_prep(): for multi-segment packets is to
> >>> check that number of segments doesn't exceed  HW limit.
> >>> Again right now users have to do that on their own.
> 
> If calling tx_prep() is optional, does it mean that this check may be done 
> twice? (once in tx_prep() and once in tx_burst())

I meant 'optional' in a way, that if user doesn't want to use tx_prep() and
do step #3 from the above on his own (what happens now) that is still ok.
But I think step #3 (modify packet's data) still needs to be done before 
tx_burst()  is called for the packets.

> 
> What would be the advantage of doing this check in tx_prep() instead of 
> keeping it in tx_burst(), as it does not touch the mbuf data?
> 
> >>> 3.  Having it a s separate function would allow user control when/where
> >>>to call it, let say only for some packets, or probably call 
> >>> tx_prep()
> >>>on one core, and do actual tx_burst() for these packets on the 
> >>> other.
> 
> Yes, from what I remember, the pipeline model was the main reason why we do 
> not just modify the packet in tx_burst(). Right?

Yes.

> 
> >>>>> If you refer as lost cycles here something like:
> >>>>> RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->tx_prep, -ENOTSUP); then
> >>>>> yes.
> >>>>> Though comparing to actual work need to be done for most HW TX
> >>>>> offloads, I think it is neglectable.
> >>>>
> >>>> Not sure.
> 
> I think doing this test on a per-bulk basis should not impact performance.
> 
> > To be honest, I don't understand what is your concern here.
> > That proposed change doesn't break any existing functionality, doesn't
> > introduce any new requirements to the existing API, and wouldn't
> > introduce any performance regression for existing apps.
> > It is a an extension, and user is free not to use it, if it doesn't fit his 
> > needs.
> >  From other side there are users who are interested in that
> > functionality, and they do have use-cases for  it.
> 
> In my opinion, using tx_prep() will implicitly become mandatory as soon as 
> the application want to do offload. An application that won't use
> it will have to prepare the mbuf, and this preparation will depend on the 
> device, which is not acceptable inside an application.

Yes, I also hope that most apps that do use TX offloads will start to use it,
as I think it will be much more convenient way, then what we have right now.
I just liked to emphasis that user wouldn't be forced to.
Konstantin

> 
> 
> So, to conclude, the api change notification looks good to me, even if there 
> is still some room to discuss the implementation details.
> 
> 
> Acked-by: Olivier Matz

[dpdk-dev] [PATCH v2] doc: announce ABI change for rte_eth_dev structure

2016-07-28 Thread Ananyev, Konstantin



> -Original Message-
> From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> Sent: Thursday, July 28, 2016 12:39 PM
> To: Ananyev, Konstantin 
> Cc: Thomas Monjalon ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] doc: announce ABI change for rte_eth_dev 
> structure
> 
> On Thu, Jul 28, 2016 at 10:36:07AM +, Ananyev, Konstantin wrote:
> > > If it does not cope up then it can skip tx'ing in the actual tx
> > > burst itself and move the "skipped" tx packets to end of the list in
> > > the tx burst so that application can take the action on "skipped"
> > > packet after the tx burst
> >
> > Sorry, that's too cryptic for me.
> > Can you reword it somehow?
> 
> OK.
> 1) lets say application requests 32 packets to send it using tx_burst.
> 2) packets are from p0 to p31
> 3) in driver due to some reason, it is not able to send the packets due to 
> some constraints in the driver(say expect p2 and p16 everything
> else sent successfully by the driver)
> 4) driver can move p2 and p16 at pkt[0] and pkt[1] on tx_burst and return 30
> 5) application can take action on p2 and p16 based the return value of 30(ie 
> 32-30 = 2 packets needs to handle at pkt[0] and pkt[1]

That would introduce packets reordering and unnecessary complicate the PMD TX 
functions.
Again it would require changes in *all* existing PMD tx functions.
So we don't plan to do things that way.

> 
> 
> >
> > >
> > >
> > > > Instead it just setups the ol_flags, fills tx_offload fields and calls 
> > > > tx_prep().
> > > > Please read the original Tomasz's patch, I think he explained
> > > > possible use-cases with lot of details.
> > >
> > > Sorry, it is not very clear in terms of use cases.
> >
> > Ok, what I meant to say:
> > Right now, if user wants to use HW TX cksum/TSO offloads he might have to:
> > - setup ipv4 header cksum field.
> > - calculate the pseudo header checksum
> > - setup tcp/udp cksum field.
> >
> > Rules how these calculations need to be done and which fields need to
> > be updated, may vary depending on HW underneath and requested offloads.
> > tx_prep() - supposed to hide all these nuances from user and allow him
> > to use TX HW offloads in a transparent way.
> 
> Not sure I understand it completely. Bit contradicting with below statement
> |We would document what tx_prep() supposed to do, and in what cases user
> |don't need it.

How that contradicts?
Right now to make HW TX offloads to work user is required to do particular 
actions:
1. set mbuf.ol_flags properly.
2. setup mbuf.tx_offload fields properly.
3. update L3/L4 header fields in a particular way.

We move #3 into tx_prep(), to hide that complexity from the user simplify 
things for him.
Though if he still prefers to do #3  by himself - that's ok too.

> 
> How about introducing a new ethdev generic eal command-line mode OR new 
> ethdev_configure hint that PMD driver is in "tx_prep-
> >tx_burst" mode instead of just tx_burst? That way no fast-path performance 
> >degradation for the PMD that does not need it

User might want to send different packets over different devices,
or even over different queues over the same device.
Or even he might want to call tx_prep() for one group of packets,
but skip for different group of packets for the same TX queue. 
So I think we should allow user to decide when/where to call it.

> 
> 
> >
> > Another main purpose of tx_prep(): for multi-segment packets is to
> > check that number of segments doesn't exceed  HW limit.
> > Again right now users have to do that on their own.
> >
> > >
> > > In HW perspective, It it tries to avoid the illegal state. But not
> > > sure calling "back to back" tx prepare and then tx burst how does it
> > > improve the situation as the check illegal state check introduce in
> > > actual tx burst it self.
> > >
> > > In SW perspective, its try to avoid sending malformed packets. In my
> > > view the same can achieved with existing tx burst it self as PMD is
> > > the one finally send the packets on the wire.
> >
> > Ok, so your question is: why not to put that functionality into
> > tx_burst() itself, right?
> > For few reasons:
> > 1. putting that functionality into tx_burst() would introduce unnecessary
> > slowdown for cases when that functionality is not needed
> > (one segment per packet, no HW offloads).
> 
> These parameters can be configured on init time

No always, see above.

> 
> > 2. User might don't want to use tx_prep() - he/she might have i

[dpdk-dev] [PATCH] doc: announce renaming of ethdev library

2016-07-28 Thread Ananyev, Konstantin



>
> The right name of ethdev should be dpdk_netdev. However:
> 1/ We are using rte_ prefix in the code and library names.
> 2/ The API uses rte_ethdev
 > That's why 16.11 will just have the rte_ prefix prepended to the
> library filename as every other libraries.
>
> Signed-off-by: Thomas Monjalon 
 > ---
 >  doc/guides/rel_notes/deprecation.rst | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/doc/guides/rel_notes/deprecation.rst
> b/doc/guides/rel_notes/deprecation.rst
> index f502f86..7a55037 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -11,6 +11,9 @@ Deprecation Notices
>  * The log history is deprecated.
>It is voided in 16.07 and will be removed in release 16.11.
>
> +* The ethdev library file will be renamed from libethdev.* to
> +librte_ethdev.*
> +  in release 16.11 in order to have a more consistent namespace.
> +
>  * The ethdev hotplug API is going to be moved to EAL with a
> notification
>mechanism added to crypto and ethdev libraries so that hotplug
> is
> now
 >available to both of them. This API will be stripped of the
> device arguments
> --
Acked-by: Konstantin Ananyev 

> 2.7.0

[dpdk-dev] [PATCH v2] doc: announce ABI change for rte_eth_dev structure

2016-07-28 Thread Ananyev, Konstantin



> > > >
> > > > > -Original Message-
> > > > > From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> > > > > Sent: Wednesday, July 27, 2016 6:11 PM
> > > > > To: Thomas Monjalon 
> > > > > Cc: Kulasek, TomaszX ; dev at dpdk.org;
> > > > > Ananyev, Konstantin 
> > > > > Subject: Re: [dpdk-dev] [PATCH v2] doc: announce ABI change for
> > > > > rte_eth_dev structure
> > > > >
> > > > > On Wed, Jul 27, 2016 at 01:59:01AM -0700, Thomas Monjalon wrote:
> > > > > > > > Signed-off-by: Tomasz Kulasek 
> > > > > > > > ---
> > > > > > > > +* In 16.11 ABI changes are plained: the ``rte_eth_dev``
> > > > > > > > +structure will be
> > > > > > > > +  extended with new function pointer ``tx_pkt_prep`` allowing
> > > > > > > > +verification
> > > > > > > > +  and processing of packet burst to meet HW specific
> > > > > > > > +requirements before
> > > > > > > > +  transmit. Also new fields will be added to the 
> > > > > > > > ``rte_eth_desc_lim`` structure:
> > > > > > > > +  ``nb_seg_max`` and ``nb_mtu_seg_max`` provideing
> > > > > > > > +information about number of
> > > > > > > > +  segments limit to be transmitted by device for TSO/non-TSO 
> > > > > > > > packets.
> > > > > > >
> > > > > > > Acked-by: Konstantin Ananyev 
> > > > > >
> > > > > > I think I understand you want to split the TX processing:
> > > > > > 1/ modify/write in mbufs
> > > > > > 2/ write in HW
> > > > > > and let application decide:
> > > > > > - where the TX prep is done (which core)
> > > > >
> > > > > In what basics applications knows when and where to call tx_pkt_prep 
> > > > > in fast path.
> > > > > if all the time it needs to call before tx_burst then the PMD won't
> > > > > have/don't need this callback waste cycles in fast path.Is this the 
> > > > > expected behavior ?
> > > > > Anything think it as compile time to make other PMDs wont suffer 
> > > > > because of this change.
> > > >
> > > > Not sure what suffering you are talking about...
> > > > Current model - i.e. when application does preparations (or doesn't if
> > > > none is required) on its own and just call tx_burst() would still be 
> > > > there.
> > > > If the app doesn't want to use tx_prep() by some reason - that still
> > > > ok, and decision is up to the particular app.
> > >
> > > So my question is in what basics application decides to call the 
> > > preparation.
> > > Can you tell me the use case in application perspective?
> >
> > I suppose one most common use-case when application uses HW TX offloads,
> > and don't' to cope on its own which L3/L4 header fields need to be filled
> > for that particular dev_type/hw_offload combination.
> 
> If it does not cope up then it can skip tx'ing in the actual tx burst
> itself and move the "skipped" tx packets to end of the list in the tx
> burst so that application can take the action on "skipped" packet after
> the tx burst

Sorry, that's too cryptic for me.
Can you reword it somehow?

> 
> 
> > Instead it just setups the ol_flags, fills tx_offload fields and calls 
> > tx_prep().
> > Please read the original Tomasz's patch, I think he explained possible 
> > use-cases
> > with lot of details.
> 
> Sorry, it is not very clear in terms of use cases.

Ok, what I meant to say:
Right now, if user wants to use HW TX cksum/TSO offloads he might have to:
- setup ipv4 header cksum field.
- calculate the pseudo header checksum
- setup tcp/udp cksum field.

Rules how these calculations need to be done and which fields need to be 
updated,
may vary depending on HW underneath and requested offloads.
tx_prep() - supposed to hide all these nuances from user and allow him to use 
TX HW offloads
in a transparent way.

Another main purpose of tx_prep(): for multi-segment packets is to check
that number of segments doesn't exceed  HW limit.
Again right now users have to do that on their own.

> 
> In HW perspective, It it tries to avoid the illegal state. But not sure
> calling "back to back" tx prepare and th

[dpdk-dev] [PATCH v2] doc: announce ABI change for rte_eth_dev structure

2016-07-27 Thread Ananyev, Konstantin


> 
> On Wed, Jul 27, 2016 at 05:33:01PM +, Ananyev, Konstantin wrote:
> >
> >
> > > -Original Message-
> > > From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> > > Sent: Wednesday, July 27, 2016 6:11 PM
> > > To: Thomas Monjalon 
> > > Cc: Kulasek, TomaszX ; dev at dpdk.org;
> > > Ananyev, Konstantin 
> > > Subject: Re: [dpdk-dev] [PATCH v2] doc: announce ABI change for
> > > rte_eth_dev structure
> > >
> > > On Wed, Jul 27, 2016 at 01:59:01AM -0700, Thomas Monjalon wrote:
> > > > > > Signed-off-by: Tomasz Kulasek 
> > > > > > ---
> > > > > > +* In 16.11 ABI changes are plained: the ``rte_eth_dev``
> > > > > > +structure will be
> > > > > > +  extended with new function pointer ``tx_pkt_prep`` allowing
> > > > > > +verification
> > > > > > +  and processing of packet burst to meet HW specific
> > > > > > +requirements before
> > > > > > +  transmit. Also new fields will be added to the 
> > > > > > ``rte_eth_desc_lim`` structure:
> > > > > > +  ``nb_seg_max`` and ``nb_mtu_seg_max`` provideing
> > > > > > +information about number of
> > > > > > +  segments limit to be transmitted by device for TSO/non-TSO 
> > > > > > packets.
> > > > >
> > > > > Acked-by: Konstantin Ananyev 
> > > >
> > > > I think I understand you want to split the TX processing:
> > > > 1/ modify/write in mbufs
> > > > 2/ write in HW
> > > > and let application decide:
> > > > - where the TX prep is done (which core)
> > >
> > > In what basics applications knows when and where to call tx_pkt_prep in 
> > > fast path.
> > > if all the time it needs to call before tx_burst then the PMD won't
> > > have/don't need this callback waste cycles in fast path.Is this the 
> > > expected behavior ?
> > > Anything think it as compile time to make other PMDs wont suffer because 
> > > of this change.
> >
> > Not sure what suffering you are talking about...
> > Current model - i.e. when application does preparations (or doesn't if
> > none is required) on its own and just call tx_burst() would still be there.
> > If the app doesn't want to use tx_prep() by some reason - that still
> > ok, and decision is up to the particular app.
> 
> So my question is in what basics application decides to call the preparation.
> Can you tell me the use case in application perspective?

I suppose one most common use-case when application uses HW TX offloads,
and don't' to cope on its own which L3/L4 header fields need to be filled
for that particular dev_type/hw_offload combination.
Instead it just setups the ol_flags, fills tx_offload fields and calls 
tx_prep().
Please read the original Tomasz's patch, I think he explained possible 
use-cases 
with lot of details.  

> and what if the PMD does not implement that callback then it is of waste 
> cycles. Right?

If you refer as lost cycles here something like:
RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->tx_prep, -ENOTSUP);
then yes.
Though comparing to actual work need to be done for most HW TX offloads,
I think it is neglectable.
Again, as I said before, it is totally voluntary for the application.
Konstantin 

> 
> Jerin
> 
> 
> > Konstantin
> >
> > >
> > >
> > > > - what to do if the TX prep fail
> > > > So adding some processing in this first part becomes "not too
> > > > expensive" or "manageable" from the application point of view.
> > > >
> > > > If I well understand the intent,
> > > >
> > > > Acked-by: Thomas Monjalon  (except
> > > > typos ;)

[dpdk-dev] [PATCH v2] doc: announce ABI change for rte_eth_dev structure

2016-07-27 Thread Ananyev, Konstantin



> -Original Message-
> From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> Sent: Wednesday, July 27, 2016 6:11 PM
> To: Thomas Monjalon 
> Cc: Kulasek, TomaszX ; dev at dpdk.org; 
> Ananyev, Konstantin 
> Subject: Re: [dpdk-dev] [PATCH v2] doc: announce ABI change for rte_eth_dev 
> structure
> 
> On Wed, Jul 27, 2016 at 01:59:01AM -0700, Thomas Monjalon wrote:
> > > > Signed-off-by: Tomasz Kulasek 
> > > > ---
> > > > +* In 16.11 ABI changes are plained: the ``rte_eth_dev`` structure
> > > > +will be
> > > > +  extended with new function pointer ``tx_pkt_prep`` allowing
> > > > +verification
> > > > +  and processing of packet burst to meet HW specific requirements
> > > > +before
> > > > +  transmit. Also new fields will be added to the ``rte_eth_desc_lim`` 
> > > > structure:
> > > > +  ``nb_seg_max`` and ``nb_mtu_seg_max`` provideing information
> > > > +about number of
> > > > +  segments limit to be transmitted by device for TSO/non-TSO packets.
> > >
> > > Acked-by: Konstantin Ananyev 
> >
> > I think I understand you want to split the TX processing:
> > 1/ modify/write in mbufs
> > 2/ write in HW
> > and let application decide:
> > - where the TX prep is done (which core)
> 
> In what basics applications knows when and where to call tx_pkt_prep in fast 
> path.
> if all the time it needs to call before tx_burst then the PMD won't 
> have/don't need this callback waste cycles in fast path.Is this the expected
> behavior ?
> Anything think it as compile time to make other PMDs wont suffer because of 
> this change.

Not sure what suffering you are talking about...
Current model - i.e. when application does preparations (or doesn't if none is 
required)
on its own and just call tx_burst() would still be there.
If the app doesn't want to use tx_prep() by some reason - that still ok,
and decision is up to the particular app. 
Konstantin

> 
> 
> > - what to do if the TX prep fail
> > So adding some processing in this first part becomes "not too
> > expensive" or "manageable" from the application point of view.
> >
> > If I well understand the intent,
> >
> > Acked-by: Thomas Monjalon  (except typos ;)

[dpdk-dev] ACL: BUG: rte_acl_classify_scalar mismatch when use a special rule

2016-07-27 Thread Ananyev, Konstantin

Hi,

> 
> define a rule as following:
> 
> struct acl_ipv4_rule acl_rule[] = {
> {
> .data = {.userdata = 103, .category_mask = 1, .priority = 1},
> /* proto */
> .field[0] = {.value.u8 = 0, .mask_range.u8 = 0x0,},
> /* source IPv4 */
> .field[1] = {.value.u32 = IPv4(0, 0, 0, 0), .mask_range.u32 = 0,},
> /* destination IPv4 */
> .field[2] = {.value.u32 = IPv4(192, 168, 2, 4), .mask_range.u32 = 
> 32,},
> /* source port */
> .field[3] = {.value.u16 = 0, .mask_range.u16 = 0x,},
> /* destination port */
> .field[4] = {.value.u16 = 1024, .mask_range.u16 = 0x,},
> },
> };
> 
> build a pkt like this:
> 
> pv4_hdr->next_proto_id = 6;
> ipv4_hdr->src_addr = rte_cpu_to_be_32(IPv4(10, 18, 2, 3));
> ipv4_hdr->dst_addr = rte_cpu_to_be_32(IPv4(192, 168, 2, 4));
> port = (uint16_t*)((unsigned char*)ipv4_hdr + sizeof(struct ipv4_hdr));
> port[0] = rte_cpu_to_be_16();
> port[1] = rte_cpu_to_be_16(4608);
> 
> rte_acl_classify_scalar will mismatch this packet!
> 
> i readed rte_acl_classify_scalar function, and found the reason:
> 
> while (flows.started > 0) {
> 
> input0 = GET_NEXT_4BYTES(parms, 0);
> input1 = GET_NEXT_4BYTES(parms, 1);
> 
> for (n = 0; n < 4; n++) {
> 
> transition0 = scalar_transition(flows.trans,
> transition0, (uint8_t)input0);
> input0 >>= CHAR_BIT;
> 
> transition1 = scalar_transition(flows.trans,
> transition1, (uint8_t)input1);
> input1 >>= CHAR_BIT;
> }
> 
> while ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
> transition0 = acl_match_check(transition0,
> 0, ctx, parms, , resolve_priority_scalar);
> transition1 = acl_match_check(transition1,
> 1, ctx, parms, , resolve_priority_scalar);
> }
> }
> 
> everytime, scalar get 4bytes to transition, and usually it work well, but if 
> we set a acl rule as prior, mismatch will appear.
> this is because field[3] is a 100% wild node, so it was removed as a 
> deactivated field.
> 
> in this situation, when rte_acl_classify_scalar runs, proto/sip/dip match ok, 
> and then it skip sport because it was removed.
> now input0 is a int value(4 bytes) started form dport.
> it will get a match-node after 2 bytes match(dport is a short value), but 
> cycle stoped untill n = 4, finally it translated to another node which is
> not a match-node, the mismatch happened.
> 
> i'm not sure search_sse_8/search_sse_4/search_avx2x16 is Ok.
> 
> how to fix it?
> i think set GET_NEXT_4BYTES to GET_NEXT_BYTE will solve this problem, but it 
> will influence performance.
> another way, don't use acl_rule_stats to remove deactivated field, but code 
> will change a lot.

If you believe there is a problem, could you try to reproduce it with 
app/testacl,
and provide a rule file and a trace file?
Thanks
Konstantin

[dpdk-dev] [PATCH v2] doc: announce ABI change for mbuf structure

2016-07-27 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> Sent: Wednesday, July 20, 2016 8:16 AM
> To: dev at dpdk.org
> Cc: jerin.jacob at caviumnetworks.com; thomas.monjalon at 6wind.com; 
> Richardson, Bruce 
> Subject: [dpdk-dev] [PATCH v2] doc: announce ABI change for mbuf structure
> 
> For 16.11, the mbuf structure will be modified implying ABI breakage.
> Some discussions already took place here:
> http://www.dpdk.org/dev/patchwork/patch/12878/
> 
> Signed-off-by: Olivier Matz 
> ---
> 
> v1->v2:
> - reword the sentences to keep things more open, as suggested by Bruce
> 
>  doc/guides/rel_notes/deprecation.rst | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst 
> b/doc/guides/rel_notes/deprecation.rst
> index f502f86..b9f5a93 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -41,3 +41,9 @@ Deprecation Notices
>  * The mempool functions for single/multi producer/consumer are deprecated and
>will be removed in 16.11.
>It is replaced by rte_mempool_generic_get/put functions.
> +
> +* ABI changes are planned for 16.11 in the ``rte_mbuf`` structure: some 
> fields
> +  may be reordered to facilitate the writing of ``data_off``, ``refcnt``, and
> +  ``nb_segs`` in one operation, because some platforms have an overhead if 
> the
> +  store address is not naturally aligned. Other mbuf fields, such as the
> +  ``port`` field, may be moved or removed as part of this mbuf work.
> --

Acked-by: Konstantin Ananyev 
> 2.8.1

[dpdk-dev] [PATCH v2 1/2] eal: remove redundant codes to parse --lcores

2016-07-26 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wei Dai
> Sent: Tuesday, July 26, 2016 10:52 AM
> To: dev at dpdk.org
> Cc: Dai, Wei 
> Subject: [dpdk-dev] [PATCH v2 1/2] eal: remove redundant codes to parse 
> --lcores
> 
> local variable i is not referred by other codes in the function 
> eal_parse_lcores( ), so it can be removed.
> 
> Signed-off-by: Wei Dai 
> ---
>  lib/librte_eal/common/eal_common_options.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/lib/librte_eal/common/eal_common_options.c 
> b/lib/librte_eal/common/eal_common_options.c
> index 481c732..c5bf98c 100644
> --- a/lib/librte_eal/common/eal_common_options.c
> +++ b/lib/librte_eal/common/eal_common_options.c
> @@ -578,7 +578,6 @@ eal_parse_lcores(const char *lcores)
>   struct rte_config *cfg = rte_eal_get_configuration();
>   static uint16_t set[RTE_MAX_LCORE];
>   unsigned idx = 0;
> - int i;
>   unsigned count = 0;
>   const char *lcore_start = NULL;
>   const char *end = NULL;
> @@ -593,9 +592,6 @@ eal_parse_lcores(const char *lcores)
>   /* Remove all blank characters ahead and after */
>   while (isblank(*lcores))
>   lcores++;
> - i = strlen(lcores);
> - while ((i > 0) && isblank(lcores[i - 1]))
> - i--;

I suppose originally it meant to do something  like that:
while ((i > 0) && isblank(lcores[i - 1]))
lcores[i--] = 0;

to get rid of blank characters at the end of the line, no?
Konstantin

> 
>   CPU_ZERO();
> 
> --
> 2.5.5

[dpdk-dev] [dpdk-users] RSS Hash not working for XL710/X710 NICs for some RX mbuf sizes

2016-07-26 Thread Ananyev, Konstantin




Hi Dumitru,

> 
> Hi Beilei,
> 
> On Mon, Jul 25, 2016 at 12:04 PM, Take Ceara  
> wrote:
> > Hi Beilei,
> >
> > On Mon, Jul 25, 2016 at 5:24 AM, Xing, Beilei  
> > wrote:
> >> Hi,
> >>
> >>> -Original Message-
> >>> From: Take Ceara [mailto:dumitru.ceara at gmail.com]
> >>> Sent: Friday, July 22, 2016 8:32 PM
> >>> To: Xing, Beilei 
> >>> Cc: Zhang, Helin ; Wu, Jingjing 
> >>> ; dev at dpdk.org
> >>> Subject: Re: [dpdk-dev] [dpdk-users] RSS Hash not working for
> >>> XL710/X710 NICs for some RX mbuf sizes
> >>>
> >>> I was using the test-pmd "txonly" implementation which sends fixed 
> >>> UDP packets from 192.168.0.1:1024 -> 192.168.0.2:1024.
> >>>
> >>> I changed the test-pmd tx_only code so that it sends traffic with 
> >>> incremental destination IP: 192.168.0.1:1024 -> [192.168.0.2,
> >>> 192.168.0.12]:1024
> >>> I also dumped the source and destination IPs in the "rxonly"
> >>> pkt_burst_receive function.
> >>> Then I see that packets are indeed sent to different queues but 
> >>> the
> >>> mbuf->hash.rss value is still 0.
> >>>
> >>> ./testpmd -c 1 -n 4 -w :01:00.3 -w :81:00.3 -- -i
> >>> --coremask=0x0 --rxq=16 --txq=16 --mbuf-size 1152 --txpkts 
> >>> 1024 --enable-rx-cksum --rss-udp
> >>>
> >>> [...]
> >>>
> >>>  - Receive queue=0xf
> >>>   PKT_RX_RSS_HASH
> >>>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - type=0x0800 -
> >>> length=1024 - nb_segs=1 - RSS queue=0xa - (outer) L2 type: ETHER -
> >>> (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80006 -
> >>> (outer)
> >>> L4 type: UDP - Tunnel type: Unknown - RSS hash=0x0 - Inner L2 type:
> >>> Unknown - RSS queue=0xf - RSS queue=0x7 - (outer) L2 type: ETHER -
> >>> (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 DIP=C0A80007 -
> >>> (outer)
> >>> L4 type: UDP - Tunnel type: Unknown - Inner L2 type: Unknown - 
> >>> Inner
> >>> L3 type: Unknown - Inner L4 type: Unknown
> >>>  - Receive queue=0x7
> >>>   PKT_RX_RSS_HASH
> >>>   src=68:05:CA:38:6D:63 - dst=02:00:00:00:00:01 - (outer) L2 type:
> >>> ETHER - (outer) L3 type: IPV4_EXT_UNKNOWN SIP=C0A80001 
> >>> DIP=C0A80009
> >>> -
> >>> type=0x0800 - length=1024 - nb_segs=1 - Inner L3 type: Unknown - 
> >>> Inner
> >>> L4 type: Unknown - RSS hash=0x0 - (outer) L4 type: UDP - Tunnel type:
> >>> Unknown - Inner L2 type: Unknown - Inner L3 type: Unknown - RSS
> >>> queue=0x7 - Inner L4 type: Unknown
> >>>
> >>> [...]
> >>>
> >>> testpmd> stop
> >>>   --- Forward Stats for RX Port= 0/Queue= 0 -> TX Port= 1/Queue= 0 
> >>> ---
> >>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue= 1 -> TX Port= 1/Queue= 1 
> >>> ---
> >>>   RX-packets: 59 TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue= 2 -> TX Port= 1/Queue= 2 
> >>> ---
> >>>   RX-packets: 48 TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue= 3 -> TX Port= 1/Queue= 3 
> >>> ---
> >>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue= 4 -> TX Port= 1/Queue= 4 
> >>> ---
> >>>   RX-packets: 59 TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue= 5 -> TX Port= 1/Queue= 5 
> >>> ---
> >>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue= 6 -> TX Port= 1/Queue= 6 
> >>> ---
> >>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue= 7 -> TX Port= 1/Queue= 7 
> >>> ---
> >>>   RX-packets: 48 TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue= 8 -> TX Port= 1/Queue= 8 
> >>> ---
> >>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue= 9 -> TX Port= 1/Queue= 9 
> >>> ---
> >>>   RX-packets: 59 TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue=10 -> TX Port= 1/Queue=10 
> >>> ---
> >>>   RX-packets: 48 TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue=11 -> TX Port= 1/Queue=11 
> >>> ---
> >>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue=12 -> TX Port= 1/Queue=12 
> >>> ---
> >>>   RX-packets: 59 TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue=13 -> TX Port= 1/Queue=13 
> >>> ---
> >>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port= 0/Queue=14 -> TX Port= 1/Queue=14 
> >>> ---
> >>>   RX-packets: 0  TX-packets: 32 TX-dropped: 0
> >>>   --- Forward Stats for RX Port=

[dpdk-dev] [PATCH] ring: fix sc dequeue performance issue

2016-07-24 Thread Ananyev, Konstantin



> -Original Message-
> From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> Sent: Sunday, July 24, 2016 6:08 PM
> To: dev at dpdk.org
> Cc: thomas.monjalon at 6wind.com; Ananyev, Konstantin  intel.com>; Jerin Jacob
> 
> Subject: [dpdk-dev] [PATCH] ring: fix sc dequeue performance issue
> 
> Use of rte_smb_wmb() instead of rte_smb_rmb() in sc dequeue function creates 
> the additional overhead of waiting for all the STOREs to be
> completed to local buffer from ring buffer memory. The sc dequeue function 
> demands only LOAD-STORE barrier where LOADs from ring
> buffer memory needs to be completed before tail pointer update. Changing to 
> rte_smb_rmb() to enable the required LOAD-STORE barrier.
> 
> Fixes: ecc7d10e448e ("ring: guarantee dequeue ordering before tail update")
> 
> Signed-off-by: Jerin Jacob 
> ---
>  lib/librte_ring/rte_ring.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 
> f928324..0e22e69 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -756,7 +756,7 @@ __rte_ring_sc_do_dequeue(struct rte_ring *r, void 
> **obj_table,
> 
>   /* copy in table */
>   DEQUEUE_PTRS();
> - rte_smp_wmb();
> + rte_smp_rmb();
> 
>   __RING_STAT_ADD(r, deq_success, n);
>   r->cons.tail = cons_next;
> --

Acked-by: Konstantin Ananyev 

> 2.5.5

[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-23 Thread Ananyev, Konstantin



> -Original Message-
> From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> Sent: Saturday, July 23, 2016 12:49 PM
> To: Ananyev, Konstantin 
> Cc: Thomas Monjalon ; Juhamatti Kuusisaari 
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee 
> ordering before tail update
> 
> On Sat, Jul 23, 2016 at 11:15:27AM +, Ananyev, Konstantin wrote:
> >
> >
> > > -Original Message-
> > > From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> > > Sent: Saturday, July 23, 2016 11:39 AM
> > > To: Ananyev, Konstantin 
> > > Cc: Thomas Monjalon ; Juhamatti
> > > Kuusisaari ; dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH] lib: change rte_ring dequeue to
> > > guarantee ordering before tail update
> > >
> > > On Sat, Jul 23, 2016 at 10:14:51AM +, Ananyev, Konstantin wrote:
> > > > Hi lads,
> > > >
> > > > > On Sat, Jul 23, 2016 at 11:02:33AM +0200, Thomas Monjalon wrote:
> > > > > > 2016-07-23 8:05 GMT+02:00 Jerin Jacob  > > > > > caviumnetworks.com>:
> > > > > > > On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
> > > > > > >> > > Consumer queue dequeuing must be guaranteed to be done
> > > > > > >> > > fully before the tail is updated. This is not
> > > > > > >> > > guaranteed with a read barrier, changed to a write
> > > > > > >> > > barrier just before tail update which in
> > > > > practice guarantees correct order of reads and writes.
> > > > > > >> > >
> > > > > > >> > > Signed-off-by: Juhamatti Kuusisaari
> > > > > > >> > > 
> > > > > > >> >
> > > > > > >> > Acked-by: Konstantin Ananyev
> > > > > > >> > 
> > > > > > >>
> > > > > > >> Applied, thanks
> > > > > > >
> > > > > > > There was ongoing discussion on this
> > > > > > > http://dpdk.org/ml/archives/dev/2016-July/044168.html
> > > > > >
> > > > > > Sorry Jerin, I forgot this email.
> > > > > > The problem is that nobody replied to your email and you did
> > > > > > not nack the v2 of this patch.
> > > >
> > > > It's probably my bad.
> > > > I acked the patch before Jerin response, and forgot to reply later.
> > > >
> > > > > >
> > > > > > > This change may not be required as it has the performance impact.
> > > > > >
> > > > > > We need to clearly understand what is the performance impact
> > > > > > (numbers and use cases) on one hand, and is there a real bug
> > > > > > fixed by this patch on the other hand?
> > > > >
> > > > > IHMO, there is no real bug here. rte_smb_rmb() provides the
> > > > > LOAD-STORE barrier to make sure tail pointer WRITE happens only after 
> > > > > prior LOADS.
> > > >
> > > > Yep, from what I read at the link Jerin provided, indeed it seems 
> > > > rte_smp_rmb() is enough for the arm arch here...
> > > > For ppc, as I can see both rte_smp_rmb()/rte_smp_wmb() emits the same 
> > > > instruction.
> > > >
> > > > >
> > > > > Thoughts?
> > > >
> > > > Wonder how big is a performance impact?
> > >
> > > With this change we need to wait for addtional STORES to be completed to 
> > > local buffer in addtion to LOADS from ring buffers memory.
> >
> > I understand that, just wonder did you see any real performance difference?
> 
> Yeah...

Ok, then I don't see any good reason why we shouldn't revert it.
I suppose the best way would be to submit a new patch for RC5 to revert the 
changes.
Do you prefer to submit it yourself and I'll ack it or visa-versa?
Thanks
Konstantin 

> 
> > Probably with ring_perf_autotest/mempool_perf_autotest or something?
> 
> W/O change
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 4 
> MP/MC single enq/dequeue: 16 SP/SC burst enq/dequeue
> (size: 8): 0 MP/MC burst enq/dequeue (size: 8): 2 SP/SC burst enq/dequeue 
> (size: 32): 0 MP/MC burst enq/dequeue (size: 32): 0
> 
> ### Testing empty dequeue ###
> SC empty

[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-23 Thread Ananyev, Konstantin



> -Original Message-
> From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> Sent: Saturday, July 23, 2016 11:39 AM
> To: Ananyev, Konstantin 
> Cc: Thomas Monjalon ; Juhamatti Kuusisaari 
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee 
> ordering before tail update
> 
> On Sat, Jul 23, 2016 at 10:14:51AM +, Ananyev, Konstantin wrote:
> > Hi lads,
> >
> > > On Sat, Jul 23, 2016 at 11:02:33AM +0200, Thomas Monjalon wrote:
> > > > 2016-07-23 8:05 GMT+02:00 Jerin Jacob  > > > caviumnetworks.com>:
> > > > > On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
> > > > >> > > Consumer queue dequeuing must be guaranteed to be done
> > > > >> > > fully before the tail is updated. This is not guaranteed
> > > > >> > > with a read barrier, changed to a write barrier just before
> > > > >> > > tail update which in
> > > practice guarantees correct order of reads and writes.
> > > > >> > >
> > > > >> > > Signed-off-by: Juhamatti Kuusisaari
> > > > >> > > 
> > > > >> >
> > > > >> > Acked-by: Konstantin Ananyev 
> > > > >>
> > > > >> Applied, thanks
> > > > >
> > > > > There was ongoing discussion on this
> > > > > http://dpdk.org/ml/archives/dev/2016-July/044168.html
> > > >
> > > > Sorry Jerin, I forgot this email.
> > > > The problem is that nobody replied to your email and you did not
> > > > nack the v2 of this patch.
> >
> > It's probably my bad.
> > I acked the patch before Jerin response, and forgot to reply later.
> >
> > > >
> > > > > This change may not be required as it has the performance impact.
> > > >
> > > > We need to clearly understand what is the performance impact
> > > > (numbers and use cases) on one hand, and is there a real bug fixed
> > > > by this patch on the other hand?
> > >
> > > IHMO, there is no real bug here. rte_smb_rmb() provides the
> > > LOAD-STORE barrier to make sure tail pointer WRITE happens only after 
> > > prior LOADS.
> >
> > Yep, from what I read at the link Jerin provided, indeed it seems 
> > rte_smp_rmb() is enough for the arm arch here...
> > For ppc, as I can see both rte_smp_rmb()/rte_smp_wmb() emits the same 
> > instruction.
> >
> > >
> > > Thoughts?
> >
> > Wonder how big is a performance impact?
> 
> With this change we need to wait for addtional STORES to be completed to 
> local buffer in addtion to LOADS from ring buffers memory.

I understand that, just wonder did you see any real performance difference?
Probably with ring_perf_autotest/mempool_perf_autotest or something?
Konstantin 

> 
> > If there is a real one, I suppose we can revert the patch?
> 
> Request to revert this one as their no benifts for other architectures and 
> indeed it creates addtional delay in waiting for STORES to complete
> in ARM.
> Lets do the correct thing by reverting it.
> 
> Jerin
> 
> 
> 
> > Konstantin
> >
> > >
> > > >
> > > > Please guys make things clear and we'll revert if needed.

[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-23 Thread Ananyev, Konstantin

Hi lads,

> On Sat, Jul 23, 2016 at 11:02:33AM +0200, Thomas Monjalon wrote:
> > 2016-07-23 8:05 GMT+02:00 Jerin Jacob :
> > > On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
> > >> > > Consumer queue dequeuing must be guaranteed to be done fully
> > >> > > before the tail is updated. This is not guaranteed with a read 
> > >> > > barrier, changed to a write barrier just before tail update which in
> practice guarantees correct order of reads and writes.
> > >> > >
> > >> > > Signed-off-by: Juhamatti Kuusisaari
> > >> > > 
> > >> >
> > >> > Acked-by: Konstantin Ananyev 
> > >>
> > >> Applied, thanks
> > >
> > > There was ongoing discussion on this
> > > http://dpdk.org/ml/archives/dev/2016-July/044168.html
> >
> > Sorry Jerin, I forgot this email.
> > The problem is that nobody replied to your email and you did not nack
> > the v2 of this patch.

It's probably my bad.
I acked the patch before Jerin response, and forgot to reply later. 

> >
> > > This change may not be required as it has the performance impact.
> >
> > We need to clearly understand what is the performance impact (numbers
> > and use cases) on one hand, and is there a real bug fixed by this
> > patch on the other hand?
> 
> IHMO, there is no real bug here. rte_smb_rmb() provides the LOAD-STORE 
> barrier to make sure tail pointer WRITE happens only after prior
> LOADS.

Yep, from what I read at the link Jerin provided, indeed it seems rte_smp_rmb() 
is enough for the arm arch here...
For ppc, as I can see both rte_smp_rmb()/rte_smp_wmb() emits the same 
instruction.

> 
> Thoughts?

Wonder how big is a performance impact?
If there is a real one, I suppose we can revert the patch?
Konstantin 

> 
> >
> > Please guys make things clear and we'll revert if needed.

[dpdk-dev] [PATCH v2] doc: announce ABI change for rte_eth_dev structure

2016-07-21 Thread Ananyev, Konstantin



> 
> This is an ABI deprecation notice for DPDK 16.11 in librte_ether about
> changes in rte_eth_dev and rte_eth_desc_lim structures.
> 
> As discussed in that thread:
> 
> http://dpdk.org/ml/archives/dev/2015-September/023603.html
> 
> Different NIC models depending on HW offload requested might impose
> different requirements on packets to be TX-ed in terms of:
> 
>  - Max number of fragments per packet allowed
>  - Max number of fragments per TSO segments
>  - The way pseudo-header checksum should be pre-calculated
>  - L3/L4 header fields filling
>  - etc.
> 
> 
> MOTIVATION:
> ---
> 
> 1) Some work cannot (and didn't should) be done in rte_eth_tx_burst.
>However, this work is sometimes required, and now, it's an
>application issue.
> 
> 2) Different hardware may have different requirements for TX offloads,
>other subset can be supported and so on.
> 
> 3) Some parameters (eg. number of segments in ixgbe driver) may hung
>device. These parameters may be vary for different devices.
> 
>For example i40e HW allows 8 fragments per packet, but that is after
>TSO segmentation. While ixgbe has a 38-fragment pre-TSO limit.
> 
> 4) Fields in packet may require different initialization (like eg. will
>require pseudo-header checksum precalculation, sometimes in a
>different way depending on packet type, and so on). Now application
>needs to care about it.
> 
> 5) Using additional API (rte_eth_tx_prep) before rte_eth_tx_burst let to
>prepare packet burst in acceptable form for specific device.
> 
> 6) Some additional checks may be done in debug mode keeping tx_burst
>implementation clean.
> 
> 
> PROPOSAL:
> -
> 
> To help user to deal with all these varieties we propose to:
> 
> 1. Introduce rte_eth_tx_prep() function to do necessary preparations of
>packet burst to be safely transmitted on device for desired HW
>offloads (set/reset checksum field according to the hardware
>requirements) and check HW constraints (number of segments per
>packet, etc).
> 
>While the limitations and requirements may differ for devices, it
>requires to extend rte_eth_dev structure with new function pointer
>"tx_pkt_prep" which can be implemented in the driver to prepare and
>verify packets, in devices specific way, before burst, what should to
>prevent application to send malformed packets.
> 
> 2. Also new fields will be introduced in rte_eth_desc_lim:
>nb_seg_max and nb_mtu_seg_max, providing an information about max
>segments in TSO and non-TSO packets acceptable by device.
> 
>This information is useful for application to not create/limit
>malicious packet.
> 
> 
> APPLICATION (CASE OF USE):
> --
> 
> 1) Application should to initialize burst of packets to send, set
>required tx offload flags and required fields, like l2_len, l3_len,
>l4_len, and tso_segsz
> 
> 2) Application passes burst to the rte_eth_tx_prep to check conditions
>required to send packets through the NIC.
> 
> 3) The result of rte_eth_tx_prep can be used to send valid packets
>and/or restore invalid if function fails.
> 
> eg.
> 
>   for (i = 0; i < nb_pkts; i++) {
> 
>   /* initialize or process packet */
> 
>   bufs[i]->tso_segsz = 800;
>   bufs[i]->ol_flags = PKT_TX_TCP_SEG | PKT_TX_IPV4
>   | PKT_TX_IP_CKSUM;
>   bufs[i]->l2_len = sizeof(struct ether_hdr);
>   bufs[i]->l3_len = sizeof(struct ipv4_hdr);
>   bufs[i]->l4_len = sizeof(struct tcp_hdr);
>   }
> 
>   /* Prepare burst of TX packets */
>   nb_prep = rte_eth_tx_prep(port, 0, bufs, nb_pkts);
> 
>   if (nb_prep < nb_pkts) {
>   printf("tx_prep failed\n");
> 
>   /* drop or restore invalid packets */
> 
>   }
> 
>   /* Send burst of TX packets */
>   nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_prep);
> 
>   /* Free any unsent packets. */
> 
> 
> 
> Signed-off-by: Tomasz Kulasek 
> ---
>  doc/guides/rel_notes/deprecation.rst |7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst 
> b/doc/guides/rel_notes/deprecation.rst
> index f502f86..485aacb 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -41,3 +41,10 @@ Deprecation Notices
>  * The mempool functions for single/multi producer/consumer are deprecated and
>will be removed in 16.11.
>It is replaced by rte_mempool_generic_get/put functions.
> +
> +* In 16.11 ABI changes are plained: the ``rte_eth_dev`` structure will be
> +  extended with new function pointer ``tx_pkt_prep`` allowing verification
> +  and processing of packet burst to meet HW specific requirements before
> +  transmit. Also new fields will be added to the ``rte_eth_desc_lim`` 
> structure:
> +  ``nb_seg_max`` and ``nb_mtu_seg_max`` provideing information about number 
>

[dpdk-dev] [PATCH 04/12] mbuf: add function to calculate a checksum

2016-07-21 Thread Ananyev, Konstantin

Hi Olivier,

> 
> This function can be used to calculate the checksum of data embedded in mbuf, 
> that can be composed of several segments.
> 
> This function will be used by the virtio pmd in next commits to calculate the 
> checksum in software in case the protocol is not recognized.
> 
> Signed-off-by: Olivier Matz 
> ---
>  doc/guides/rel_notes/release_16_11.rst |  5 
>  lib/librte_mbuf/rte_mbuf.c | 55 
> --
>  lib/librte_mbuf/rte_mbuf.h | 13 
>  lib/librte_mbuf/rte_mbuf_version.map   |  1 +
>  4 files changed, 72 insertions(+), 2 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/release_16_11.rst 
> b/doc/guides/rel_notes/release_16_11.rst
> index 6a591e2..da70f3b 100644
> --- a/doc/guides/rel_notes/release_16_11.rst
> +++ b/doc/guides/rel_notes/release_16_11.rst
> @@ -53,6 +53,11 @@ New Features
>Added two new functions ``rte_get_rx_ol_flag_list()`` and
>``rte_get_tx_ol_flag_list()`` to dump offload flags as a string.
> 
> +* **Added a functions to calculate the checksum of data in a mbuf.**
> +
> +  Added a new function ``rte_pktmbuf_cksum()`` to process the checksum
> + of  data embedded in an mbuf chain.
> +
>  Resolved Issues
>  ---
> 
> diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c index 
> 56f37e6..0304245 100644
> --- a/lib/librte_mbuf/rte_mbuf.c
> +++ b/lib/librte_mbuf/rte_mbuf.c
> @@ -60,6 +60,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

As a nit, do we need to introduce a dependency for librte_mbuf on librte_net?
Might be better to put this functionality into librte_net?
Konstantin

[dpdk-dev] [PATCH] doc: announce ABI change for rte_eth_dev structure

2016-07-20 Thread Ananyev, Konstantin

Hi Thomas,

> Hi,
> 
> This patch announces an interesting change in the DPDK design.
> 
> 2016-07-20 16:24, Tomasz Kulasek:
> > This is an ABI deprecation notice for DPDK 16.11 in librte_ether about
> > changes in rte_eth_dev and rte_eth_desc_lim structures.
> >
> > In 16.11, we plan to introduce rte_eth_tx_prep() function to do
> > necessary preparations of packet burst to be safely transmitted on
> > device for desired HW offloads (set/reset checksum field according to
> > the hardware requirements) and check HW constraints (number of
> > segments per packet, etc).
> >
> > While the limitations and requirements may differ for devices, it
> > requires to extend rte_eth_dev structure with new function pointer
> > "tx_pkt_prep" which can be implemented in the driver to prepare and
> > verify packets, in devices specific way, before burst, what should to
> > prevent application to send malformed packets.
> >
> > Also new fields will be introduced in rte_eth_desc_lim: nb_seg_max and
> > nb_mtu_seg_max, providing an information about max segments in TSO and
> > non TSO packets acceptable by device.
> 
> We cannot acknowledge such notice without a prior design discussion.
> Please explain why you plan to work on this change and give a draft of the 
> new structures (a RFC patch would be ideal).

I think it is not really a deprecation note, but announce ABI change for 
rte_ethdev.h structures.
The plan is to implement what was proposed & discussed the following thread:
http://dpdk.org/ml/archives/dev/2015-September/023603.html

Konstantin

[dpdk-dev] [PATCH v3] i40: fix the VXLAN TSO issue

2016-07-19 Thread Ananyev, Konstantin



> 
> Problem:
> When using the TSO + VXLAN feature in i40e, the outer UDP length fields in 
> the multiple UDP segments which are TSOed by the i40e will
> have a wrong value.
> 
> Fix this problem by adding the tunnel type field in the i40e descriptor which 
> was missed before.
> 
> Fixes: 77b8301733c3 ("i40e: VXLAN Tx checksum offload")
> 
> Signed-off-by: Zhe Tao 
> ---
> v2: edited the comments
> v3: added external IP offload flag when TSO is enabled for tunnelling packets
> 
>  app/test-pmd/csumonly.c  | 29 +
>  drivers/net/i40e/i40e_rxtx.c | 12 +---
>  lib/librte_mbuf/rte_mbuf.h   | 16 +++-
>  3 files changed, 45 insertions(+), 12 deletions(-)
> 
> diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c index 
> ac4bd8f..aaa006f 100644
> --- a/app/test-pmd/csumonly.c
> +++ b/app/test-pmd/csumonly.c
> @@ -204,7 +204,8 @@ parse_ethernet(struct ether_hdr *eth_hdr, struct 
> testpmd_offload_info *info)  static void  parse_vxlan(struct
> udp_hdr *udp_hdr,
>   struct testpmd_offload_info *info,
> - uint32_t pkt_type)
> + uint32_t pkt_type,
> + uint64_t *ol_flags)
>  {
>   struct ether_hdr *eth_hdr;
> 
> @@ -215,6 +216,7 @@ parse_vxlan(struct udp_hdr *udp_hdr,
>   RTE_ETH_IS_TUNNEL_PKT(pkt_type) == 0)
>   return;
> 
> + *ol_flags |= PKT_TX_TUNNEL_VXLAN;


Hmm, I don't actually see much difference between that version and the previous 
one.
 Regarding your comment on V2:
" this flag is for tunnelling type, and CTD is based on whether we need to do 
the
external ip offload and TSO.so this flag will not cause one extra CTD."
I think CTD selection should be based not only on is EIP cksum is enabled or 
not.
You can have tunneled packet with TSO on over IPv6, right?
I think for i40e we need CTD each time PKT_TX_TUNNEL_ is on.


>   info->is_tunnel = 1;
>   info->outer_ethertype = info->ethertype;
>   info->outer_l2_len = info->l2_len;
> @@ -231,7 +233,9 @@ parse_vxlan(struct udp_hdr *udp_hdr,
> 
>  /* Parse a gre header */
>  static void
> -parse_gre(struct simple_gre_hdr *gre_hdr, struct testpmd_offload_info *info)
> +parse_gre(struct simple_gre_hdr *gre_hdr,
> +   struct testpmd_offload_info *info,
> +   uint64_t *ol_flags)
>  {
>   struct ether_hdr *eth_hdr;
>   struct ipv4_hdr *ipv4_hdr;
> @@ -242,6 +246,8 @@ parse_gre(struct simple_gre_hdr *gre_hdr, struct 
> testpmd_offload_info *info)
>   if ((gre_hdr->flags & _htons(~GRE_SUPPORTED_FIELDS)) != 0)
>   return;
> 
> + *ol_flags |= PKT_TX_TUNNEL_GRE;
> +
>   gre_len += sizeof(struct simple_gre_hdr);
> 
>   if (gre_hdr->flags & _htons(GRE_KEY_PRESENT)) @@ -417,7 +423,7 @@ 
> process_inner_cksums(void *l3_hdr, const struct
> testpmd_offload_info *info,
>   * packet */
>  static uint64_t
>  process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
> - uint16_t testpmd_ol_flags)
> + uint16_t testpmd_ol_flags, uint64_t orig_ol_flags)
>  {
>   struct ipv4_hdr *ipv4_hdr = outer_l3_hdr;
>   struct ipv6_hdr *ipv6_hdr = outer_l3_hdr; @@ -428,7 +434,8 @@ 
> process_outer_cksums(void *outer_l3_hdr, struct
> testpmd_offload_info *info,
>   ipv4_hdr->hdr_checksum = 0;
>   ol_flags |= PKT_TX_OUTER_IPV4;
> 
> - if (testpmd_ol_flags & TESTPMD_TX_OFFLOAD_OUTER_IP_CKSUM)
> + if ((testpmd_ol_flags & TESTPMD_TX_OFFLOAD_OUTER_IP_CKSUM) ||
> + (info->tso_segsz != 0))
>   ol_flags |= PKT_TX_OUTER_IP_CKSUM;

Why do you need to always raise OUTER_IP_CKSUM when TSO is enabled?

>   else
>   ipv4_hdr->hdr_checksum = rte_ipv4_cksum(ipv4_hdr); @@ 
> -442,6 +449,9 @@ process_outer_cksums(void
> *outer_l3_hdr, struct testpmd_offload_info *info,
>* hardware supporting it today, and no API for it. */
> 
>   udp_hdr = (struct udp_hdr *)((char *)outer_l3_hdr + info->outer_l3_len);
> + if ((orig_ol_flags & PKT_TX_TCP_SEG) &&
> + ((orig_ol_flags & PKT_TX_TUNNEL_MASK) == PKT_TX_TUNNEL_VXLAN))
> + udp_hdr->dgram_cksum = 0;
>   /* do not recalculate udp cksum if it was 0 */
>   if (udp_hdr->dgram_cksum != 0) {
>   udp_hdr->dgram_cksum = 0;
> @@ -705,15 +715,18 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
>   if (info.l4_proto == IPPROTO_UDP) {
>   struct udp_hdr *udp_hdr;
>   udp_hdr = (struct udp_hdr *)((char *)l3_hdr +
> - info.l3_len);
> - parse_vxlan(udp_hdr, , m->packet_type);
> +info.l3_len);
> + parse_vxlan(udp_hdr, , m->packet_type,
> + _flags);
>   } else if (info.l4_proto == IPPROTO_GRE) {
>   struct

[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-15 Thread Ananyev, Konstantin

> 
> Consumer queue dequeuing must be guaranteed to be done fully before the tail 
> is updated. This is not guaranteed with a read barrier,
> changed to a write barrier just before tail update which in practice 
> guarantees correct order of reads and writes.
> 
> Signed-off-by: Juhamatti Kuusisaari 
> ---
>  lib/librte_ring/rte_ring.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h index 
> eb45e41..14920af 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -748,7 +748,7 @@ __rte_ring_sc_do_dequeue(struct rte_ring *r, void 
> **obj_table,
> 
> /* copy in table */
> DEQUEUE_PTRS();
> -   rte_smp_rmb();
> +   rte_smp_wmb();
> 
> __RING_STAT_ADD(r, deq_success, n);
> r->cons.tail = cons_next;
> --

Acked-by: Konstantin Ananyev 

> 2.9.0
>

[dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location

2016-07-15 Thread Ananyev, Konstantin

Hi Jerin,

> >
> >
> > > The CPU also
> > > knows already the value that will be written to cons.tail and that
> > > value does not depend on the previous read either. The CPU does not know 
> > > we are planning to do a spinlock there, so it might do things
> out-of-order without proper dependencies.
> > >
> > > > For  __rte_ring_sc_do_dequeue(), I think you right, we might need
> > > > something stronger.
> > > > I don't want to put rte_smp_mb() here as it would cause full HW
> > > > barrier even on machines with strong memory order (IA).
> > > > I think that rte_smp_wmb() might be enough here:
> > > > it would force cpu to wait till writes in DEQUEUE_PTRS() are
> > > > become visible, which means reads have to be completed too.
> > >
> > > In practice I think that rte_smp_wmb() would work fine, even though
> > > it is not strictly according to the book. Below solution would be my
> > > proposal as a fix to the issue of sc dequeueing (and also to mc 
> > > dequeueing, if we have the problem of CPU completely ignoring the
> spinlock in reality there):
> > >
> > > DEQUEUE_PTRS();
> > > ..
> > > rte_smp_wmb();
> > > r->cons.tail = cons_next;
> >
> > As I said in previous email - it looks good for me for
> > _rte_ring_sc_do_dequeue(), but I am interested to hear what  ARM and PPC 
> > maintainers think about it.
> > Jan, Jerin do you have any comments on it?
> 
> Actually it is NOT performance effective and difficult to capture the ORDER 
> dependency with plane store and load barriers on WEAK
> ordered machines.
> Beyond plane store and load barriers, We need to express  #LoadLoad, 
> #LoadStore,#StoreStore barrier dependency with Acquire and
> Release Semantics in Arch neutral code(Looks like this is compiler barrier on 
> IA) http://preshing.com/20120913/acquire-and-release-
> semantics/
> 
> For instance, Full barrier CAS(__sync_bool_compare_and_swap) will not be 
> required for weak ordered machine in MP case.
> I can send out a RFC version of ring implementation changes required with 
> acquire-and-release-semantics.
> If it has performance degradation on IA then we can separate it out through 
> conditional compilation flag.
> 
> GCC Built-in Functions for Memory Model Aware Atomic Operations 
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

I am not sure what exactly changes you are planning,
but I suppose I'd just wait for your RFC here.
Though my question was: what do you think about current 
_rte_ring_sc_do_dequeue()? 
Do you agree that rmb() is not sufficient here and does Juhamatti patch:
http://dpdk.org/dev/patchwork/patch/14846/
looks good to you?
It looks good to me ,and I am going to ACK it, but thought you'd better
have a look too. 
Thanks
Konstantin


> 
> Thoughts ?
> 
> Jerin
> 
> > Chao, sorry but I still not sure why PPC is considered as architecture with 
> > strong memory ordering?
> > Might be I am missing something obvious here.
> > Thank
> > Konstantin
> >

[dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location

2016-07-14 Thread Ananyev, Konstantin


Hi Juhamatti,

> 
> Hi Konstantin,
> 
> > > > > > It is quite safe to move the barrier before DEQUEUE because
> > > > > > after the DEQUEUE there is nothing really that we would want
> > > > > > to protect
> > with a
> > > > read barrier.
> > > > >
> > > > > I don't think so.
> > > > > If you remove barrier after DEQUEUE(), that means on systems
> > > > > with relaxed memory ordering cons.tail could be updated before
> > > > > DEQUEUE() will be finished and producer can overwrite queue
> > > > > entries that were
> > not
> > > > yet dequeued.
> > > > > So if cpu can really do such speculative out of order loads,
> > > > > then we do need for  __rte_ring_sc_do_dequeue() something like:
> > > > >
> > > > > rte_smp_rmb();
> > > > > DEQUEUE_PTRS();
> > > > > rte_smp_rmb();
> > >
> > > You have a valid point here, there needs to be a guarantee that
> > > cons_tail
> > cannot
> > > be updated before DEQUEUE is completed. Nevertheless, my point was
> > that it is
> > > not guaranteed with a read barrier anyway. The implementation has
> > > the
> > following
> > > sequence
> > >
> > > DEQUEUE_PTRS(); (i.e. READ/LOAD)
> > > rte_smp_rmb();
> > > ..
> > > r->cons.tail = cons_next; (i.e WRITE/STORE)
> > >
> > > Above read barrier does not guarantee any ordering for the following
> > writes/stores.
> > > As a guarantee is needed, I think we in fact need to change the read
> > > barrier
> > on the
> > > dequeue to a full barrier, which guarantees the read+write order, as
> > follows
> > >
> > > DEQUEUE_PTRS();
> > > rte_smp_mb();
> > > ..
> > > r->cons.tail = cons_next;
> > >
> > > If you agree, I can for sure prepare another patch for this issue.
> >
> > Hmm, I think for __rte_ring_mc_do_dequeue() we are ok with smp_rmb(),
> > as we have to read cons.tail anyway.
> 
> Are you certain that this read creates strong enough dependency between read 
> of cons.tail and the write of it on the mc_do_dequeue()? 

Yes, I believe so.

> I think it does not really create any control dependency there as the next 
> write is not dependent of the result of the read.

I think it is dependent: cons.tail can be updated only if it's current value is 
eual to precomputed before cons_head.
So cpu has to read cons.tail value first.

> The CPU also
> knows already the value that will be written to cons.tail and that value does 
> not depend on the previous read either. The CPU does not
> know we are planning to do a spinlock there, so it might do things 
> out-of-order without proper dependencies.
> 
> > For  __rte_ring_sc_do_dequeue(), I think you right, we might need
> > something stronger.
> > I don't want to put rte_smp_mb() here as it would cause full HW
> > barrier even on machines with strong memory order (IA).
> > I think that rte_smp_wmb() might be enough here:
> > it would force cpu to wait till writes in DEQUEUE_PTRS() are become
> > visible, which means reads have to be completed too.
> 
> In practice I think that rte_smp_wmb() would work fine, even though it is not 
> strictly according to the book. Below solution would be my
> proposal as a fix to the issue of sc dequeueing (and also to mc dequeueing, 
> if we have the problem of CPU completely ignoring the spinlock
> in reality there):
> 
> DEQUEUE_PTRS();
> ..
> rte_smp_wmb();
> r->cons.tail = cons_next;

As I said in previous email - it looks good for me for 
_rte_ring_sc_do_dequeue(),
but I am interested to hear what  ARM and PPC maintainers think about it.
Jan, Jerin do you have any comments on it?
Chao, sorry but I still not sure why PPC is considered as architecture with 
strong memory ordering?
Might be I am missing something obvious here.
Thank
Konstantin

> 
> --
>  Juhamatti
> 
> > Another option would be to define a new macro: rte_weak_mb() or so,
> > that would be expanded into CB on boxes with strong memory model, and
> > to full MB on machines with relaxed ones.
> > Interested to hear what ARM and PPC guys think.
> > Konstantin
> >
> > P.S. Another thing a bit off-topic - for PPC guys:
> > As I can see smp_rmb/smp_wmb are just a complier barriers:
> > find lib/librte_eal/common/include/arch/ppc_64/ -type f | xargs grep
> > smp_ lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h:#define
> > rte_smp_mb() rte_mb()
> > lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h:#define
> > rte_smp_wmb() rte_compiler_barrier()
> > lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h:#define
> > rte_smp_rmb() rte_compiler_barrier()
> > My knowledge about PPC architecture is rudimental, but is that really 
> > enough?
> >
> > >
> > > Thanks,
> > > --
> > >  Juhamatti
> > >
> > > > > Konstantin
> > > > >
> > > > > > The read
> > > > > > barrier is mapped to a compiler barrier on strong memory model
> > > > > > systems and this works fine too as the order of the head,tail
> > > > > > updates is still guaranteed on the new location. Even if the
> > > > > > problem would be theoretical on most systems, it is worth
> > > > > > fixing as the risk for
> > > > problems is

[dpdk-dev] [PATCH 0/3] add new command line options and error handling in pdump

2016-07-14 Thread Ananyev, Konstantin


This patch set contains
1)Error handling fixes in pdump library.
2)Support of server and client socket path command line options in pdump tool.
3)Default socket path name fixes in pdump library doc.

Reshma Pattan (3):
  pdump: fix error handlings
  app/pdump: add new command line options for socket paths
  doc: fix default socket path names

 app/pdump/main.c| 57 +++--
 doc/guides/prog_guide/pdump_lib.rst | 12   
doc/guides/sample_app_ug/pdump.rst  | 31 ++--
 lib/librte_pdump/rte_pdump.c| 26 +
 4 files changed, 97 insertions(+), 29 deletions(-)

Acked-by: Konstantin Ananyev 

--
2.5.0

[dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location

2016-07-13 Thread Ananyev, Konstantin


Hi Juhamatti,

> 
> Hello,
> 
> > > Hi Juhamatti,
> > >
> > > >
> > > > Hello,
> > > >
> > > > > > > > -Original Message-
> > > > > > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of
> > > > > > > > Juhamatti Kuusisaari
> > > > > > > > Sent: Monday, July 11, 2016 11:21 AM
> > > > > > > > To: dev at dpdk.org
> > > > > > > > Subject: [dpdk-dev] [PATCH] lib: move rte_ring read barrier
> > > > > > > > to correct location
> > > > > > > >
> > > > > > > > Fix the location of the rte_ring data dependency read barrier.
> > > > > > > > It needs to be called before accessing indexed data to
> > > > > > > > ensure that the data itself is guaranteed to be correctly 
> > > > > > > > updated.
> > > > > > > >
> > > > > > > > See more details at kernel/Documentation/memory-barriers.txt
> > > > > > > > section 'Data dependency barriers'.
> > > > > > >
> > > > > > >
> > > > > > > Any explanation why?
> > > > > > > From my point smp_rmb()s are on the proper places here :)
> > > > > > > Konstantin
> > > > > >
> > > > > > The problem here is that on a weak memory model system the CPU
> > > > > > is allowed to load the address data out-of-order in advance.
> > > > > > If the read barrier is after the DEQUEUE, you might end up
> > > > > > having the old data there on a race situation when the buffer is
> > continuously full.
> > > > > > Having it before the DEQUEUE guarantees that the load is not
> > > > > > done in advance.
> > > > >
> > > > > Sorry, still didn't see any race condition in the current code.
> > > > > Can you provide any particular example?
> > > > > From other side, moving smp_rmb() before dequeueing the objects,
> > > > > could introduce a race condition, on cpus where later writes can
> > > > > be reordered with earlier reads.
> > > >
> > > > Here is a simplified example sequence from time perspective:
> > > > 1. Consumer CPU (CCPU) loads value y from r->ring[x] out-of-order
> > > > (the key of the problem)
> > >
> > > To read the value of ring[x] cpu has to calculate x first.
> > > And to calculate x it needs to read cons.head and prod.tail first.
> > > Are you saying that some modern cpu can:
> > >  -'speculate' value of cons.head and  prod.tail
> > >   (based on what?)
> > >  -calculate x based on these speculated values.
> > > - read ring[x]
> > > - read cons.head and prod.tail
> > > - if read values are not equal to speculated ones , then
> > >   re-caluclate x and re-read ring[x]
> > > - else use speculatively read ring[x]
> > > ?
> > > If such thing is possible  (is it really? and if yes on which cpu?),
> >
> > As I can see, neither ARM or PPC support  such things.
> > Both of them do obey address dependency.
> > (ARM & PPC guys feel free to correct me here, if I am wrong here).
> > So what cpu we are talking about?
> 
> I checked that too, indeed the problem I described seems to be more academic
> than even theoretical and does not apply to current CPUs. So I agree here and
> this makes this patch unneeded, I'll withdraw it. However, the implementation
> may still have another issue, see below.
> 
> > > then yes, we might need an extra  smp_rmb() before DEQUEUE_PTRS() for
> > > __rte_ring_sc_do_dequeue().
> > > For __rte_ring_mc_do_dequeue(), I think we are ok, as there is CAS
> > > just before DEQUEUE_PTRS().
> > >
> > > > 2. Producer CPU (PCPU) updates r->ring[x] to value be z 3. PCPU
> > > > updates prod_tail to be x 4. CCPU updates cons_head to be x 5. CCPU
> > > > loads r->ring[x] by using out-of-order loaded value y [is z in
> > > > reality]
> > > >
> > > > The problem here is that on weak memory model, the CCPU is allowed
> > > > to load
> > > > r->ring[x] value in advance, if it decides to do so (CCPU needs to
> > > > r->be able to see
> > > > in advance that x will be an interesting index worth loading). The
> > > > index value x is updated atomically,  but it does not matter here.
> > > > Also, the write barrier on PCPU side guarantees that CCPU cannot see
> > > > update of x before PCPU has really updated the r->ring[x] to z and
> > > > moved the tail, but still allows to do the out-of-order loads without
> > proper read barrier.
> > > >
> > > > When the read barrier is moved between steps 4 and 5, it disallows
> > > > to use any out-of-order loads so far and forces to drop r->ring[x] y
> > > > value and load current value z.
> > > >
> > > > The ring queue appears to work well as this is a rare corner case.
> > > > Due to the head,tail-structure the problem needs queue to be full
> > > > and also CCPU needs to see r->ring[x] update later than it does the
> > > > out-of-order load. In addition, the HW needs to be able to predict
> > > > and choose the load to the future index (which should be quite
> > > > possible, considering modern CPUs). If you have seen in the past
> > > > problems and noticed that a larger ring queue works better as a
> > workaround, you may have encountered the problem already.
> > >
> > > I don't understand what means 'larger rings works better'

[dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location

2016-07-12 Thread Ananyev, Konstantin

> 
> 
> Hi Juhamatti,
> 
> >
> > Hello,
> >
> > > > > > -Original Message-
> > > > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Juhamatti
> > > > > > Kuusisaari
> > > > > > Sent: Monday, July 11, 2016 11:21 AM
> > > > > > To: dev at dpdk.org
> > > > > > Subject: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to
> > > > > > correct location
> > > > > >
> > > > > > Fix the location of the rte_ring data dependency read barrier.
> > > > > > It needs to be called before accessing indexed data to ensure that
> > > > > > the data itself is guaranteed to be correctly updated.
> > > > > >
> > > > > > See more details at kernel/Documentation/memory-barriers.txt
> > > > > > section 'Data dependency barriers'.
> > > > >
> > > > >
> > > > > Any explanation why?
> > > > > From my point smp_rmb()s are on the proper places here :) Konstantin
> > > >
> > > > The problem here is that on a weak memory model system the CPU is
> > > > allowed to load the address data out-of-order in advance.
> > > > If the read barrier is after the DEQUEUE, you might end up having the
> > > > old data there on a race situation when the buffer is continuously full.
> > > > Having it before the DEQUEUE guarantees that the load is not done in
> > > > advance.
> > >
> > > Sorry, still didn't see any race condition in the current code.
> > > Can you provide any particular example?
> > > From other side, moving smp_rmb() before dequeueing the objects, could
> > > introduce a race condition, on cpus where later writes can be reordered 
> > > with
> > > earlier reads.
> >
> > Here is a simplified example sequence from time perspective:
> > 1. Consumer CPU (CCPU) loads value y from r->ring[x] out-of-order
> > (the key of the problem)
> 
> To read the value of ring[x] cpu has to calculate x first.
> And to calculate x it needs to read cons.head and prod.tail first.
> Are you saying that some modern cpu can:
>  -'speculate' value of cons.head and  prod.tail
>   (based on what?)
>  -calculate x based on these speculated values.
> - read ring[x]
> - read cons.head and prod.tail
> - if read values are not equal to speculated ones , then
>   re-caluclate x and re-read ring[x]
> - else use speculatively read ring[x]
> ?
> If such thing is possible  (is it really? and if yes on which cpu?),

As I can see, neither ARM or PPC support  such things.
Both of them do obey address dependency.
(ARM & PPC guys feel free to correct me here, if I am wrong here).
So what cpu we are talking about?
Konstantin

> then yes, we might need an extra  smp_rmb() before DEQUEUE_PTRS()
> for __rte_ring_sc_do_dequeue().
> For __rte_ring_mc_do_dequeue(), I think we are ok, as
> there is CAS just before DEQUEUE_PTRS().
> 
> > 2. Producer CPU (PCPU) updates r->ring[x] to value be z
> > 3. PCPU updates prod_tail to be x
> > 4. CCPU updates cons_head to be x
> > 5. CCPU loads r->ring[x] by using out-of-order loaded value y [is z in 
> > reality]
> >
> > The problem here is that on weak memory model, the CCPU is allowed to load
> > r->ring[x] value in advance, if it decides to do so (CCPU needs to be able 
> > to see
> > in advance that x will be an interesting index worth loading). The index 
> > value x
> > is updated atomically,  but it does not matter here. Also, the write 
> > barrier on PCPU
> > side guarantees that CCPU cannot see update of x before PCPU has really 
> > updated
> > the r->ring[x] to z and moved the tail, but still allows to do the 
> > out-of-order loads
> > without proper read barrier.
> >
> > When the read barrier is moved between steps 4 and 5, it disallows to use
> > any out-of-order loads so far and forces to drop r->ring[x] y value and
> > load current value z.
> >
> > The ring queue appears to work well as this is a rare corner case. Due to 
> > the
> > head,tail-structure the problem needs queue to be full and also CCPU needs
> > to see r->ring[x] update later than it does the out-of-order load. In 
> > addition,
> > the HW needs to be able to predict and choose the load to the future index
> > (which should be quite possible, considering modern CPUs). If you have seen
> > in the past problems and noticed that a larger ring queue works better as a
> > workaround, you may have encountered the problem already.
> 
> I don't understand what means 'larger rings works better' here.
> What we are talking about is  race condition, that if hit, would
> cause data corruption and most likely a crash.
> 
> >
> > It is quite safe to move the barrier before DEQUEUE because after the 
> > DEQUEUE
> > there is nothing really that we would want to protect with a read barrier.
> 
> I don't think so.
> If you remove barrier after DEQUEUE(), that means on systems with relaxed 
> memory ordering
> cons.tail could be updated before DEQUEUE() will be finished and producer can 
> overwrite
> queue entries that were not yet dequeued.
> So if cpu can really do such speculative out of order loads,
> then we do need for  __rte_ring_sc_do_dequeue() something

[dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location

2016-07-12 Thread Ananyev, Konstantin


Hi Juhamatti,

> 
> Hello,
> 
> > > > > -Original Message-
> > > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Juhamatti
> > > > > Kuusisaari
> > > > > Sent: Monday, July 11, 2016 11:21 AM
> > > > > To: dev at dpdk.org
> > > > > Subject: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to
> > > > > correct location
> > > > >
> > > > > Fix the location of the rte_ring data dependency read barrier.
> > > > > It needs to be called before accessing indexed data to ensure that
> > > > > the data itself is guaranteed to be correctly updated.
> > > > >
> > > > > See more details at kernel/Documentation/memory-barriers.txt
> > > > > section 'Data dependency barriers'.
> > > >
> > > >
> > > > Any explanation why?
> > > > From my point smp_rmb()s are on the proper places here :) Konstantin
> > >
> > > The problem here is that on a weak memory model system the CPU is
> > > allowed to load the address data out-of-order in advance.
> > > If the read barrier is after the DEQUEUE, you might end up having the
> > > old data there on a race situation when the buffer is continuously full.
> > > Having it before the DEQUEUE guarantees that the load is not done in
> > > advance.
> >
> > Sorry, still didn't see any race condition in the current code.
> > Can you provide any particular example?
> > From other side, moving smp_rmb() before dequeueing the objects, could
> > introduce a race condition, on cpus where later writes can be reordered with
> > earlier reads.
> 
> Here is a simplified example sequence from time perspective:
> 1. Consumer CPU (CCPU) loads value y from r->ring[x] out-of-order
> (the key of the problem)

To read the value of ring[x] cpu has to calculate x first.
And to calculate x it needs to read cons.head and prod.tail first.
Are you saying that some modern cpu can:
 -'speculate' value of cons.head and  prod.tail
  (based on what?)
 -calculate x based on these speculated values.
- read ring[x]
- read cons.head and prod.tail
- if read values are not equal to speculated ones , then
  re-caluclate x and re-read ring[x]
- else use speculatively read ring[x] 
?
If such thing is possible  (is it really? and if yes on which cpu?),
then yes, we might need an extra  smp_rmb() before DEQUEUE_PTRS()
for __rte_ring_sc_do_dequeue().
For __rte_ring_mc_do_dequeue(), I think we are ok, as
there is CAS just before DEQUEUE_PTRS().

> 2. Producer CPU (PCPU) updates r->ring[x] to value be z
> 3. PCPU updates prod_tail to be x
> 4. CCPU updates cons_head to be x
> 5. CCPU loads r->ring[x] by using out-of-order loaded value y [is z in 
> reality]
> 
> The problem here is that on weak memory model, the CCPU is allowed to load
> r->ring[x] value in advance, if it decides to do so (CCPU needs to be able to 
> see
> in advance that x will be an interesting index worth loading). The index 
> value x
> is updated atomically,  but it does not matter here. Also, the write barrier 
> on PCPU
> side guarantees that CCPU cannot see update of x before PCPU has really 
> updated
> the r->ring[x] to z and moved the tail, but still allows to do the 
> out-of-order loads
> without proper read barrier.
> 
> When the read barrier is moved between steps 4 and 5, it disallows to use
> any out-of-order loads so far and forces to drop r->ring[x] y value and
> load current value z.
> 
> The ring queue appears to work well as this is a rare corner case. Due to the
> head,tail-structure the problem needs queue to be full and also CCPU needs
> to see r->ring[x] update later than it does the out-of-order load. In 
> addition,
> the HW needs to be able to predict and choose the load to the future index
> (which should be quite possible, considering modern CPUs). If you have seen
> in the past problems and noticed that a larger ring queue works better as a
> workaround, you may have encountered the problem already.

I don't understand what means 'larger rings works better' here. 
What we are talking about is  race condition, that if hit, would
cause data corruption and most likely a crash.

> 
> It is quite safe to move the barrier before DEQUEUE because after the DEQUEUE
> there is nothing really that we would want to protect with a read barrier.

I don't think so.
If you remove barrier after DEQUEUE(), that means on systems with relaxed 
memory ordering
cons.tail could be updated before DEQUEUE() will be finished and producer can 
overwrite
queue entries that were not yet dequeued.
So if cpu can really do such speculative out of order loads,
then we do need for  __rte_ring_sc_do_dequeue() something like:

rte_smp_rmb();
DEQUEUE_PTRS();
rte_smp_rmb();

Konstantin

> The read
> barrier is mapped to a compiler barrier on strong memory model systems and 
> this
> works fine too as the order of the head,tail updates is still guaranteed on 
> the new
> location. Even if the problem would be theoretical on most systems, it is 
> worth fixing
> as the risk for problems is very low.
> 
> --
>  Juhamatti
> 
> > Konstantin
> 
> 
> 
> 
> >

[dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location

2016-07-11 Thread Ananyev, Konstantin

> Hi,
> 
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Juhamatti
> > > Kuusisaari
> > > Sent: Monday, July 11, 2016 11:21 AM
> > > To: dev at dpdk.org
> > > Subject: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct
> > > location
> > >
> > > Fix the location of the rte_ring data dependency read barrier.
> > > It needs to be called before accessing indexed data to ensure that the
> > > data itself is guaranteed to be correctly updated.
> > >
> > > See more details at kernel/Documentation/memory-barriers.txt
> > > section 'Data dependency barriers'.
> >
> >
> > Any explanation why?
> > From my point smp_rmb()s are on the proper places here :) Konstantin
> 
> The problem here is that on a weak memory model system the CPU is
> allowed to load the address data out-of-order in advance.
> If the read barrier is after the DEQUEUE, you might end up having the old
> data there on a race situation when the buffer is continuously full.
> Having it before the DEQUEUE guarantees that the load is not done
> in advance.

Sorry, still didn't see any race condition in the current code.
Can you provide any particular example?
>From other side, moving smp_rmb() before dequeueing the objects,
could introduce a race condition, on cpus where later writes can be reordered
with  earlier reads.
Konstantin

> 
> On Intel, it should not matter due to different memory model, so this is
> limited to weak memory model systems.
> 
> --
>  Juhamatti
> 
> > >
> > > Signed-off-by: Juhamatti Kuusisaari 
> > > ---
> > >  lib/librte_ring/rte_ring.h | 6 --
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > > index eb45e41..a923e49 100644
> > > --- a/lib/librte_ring/rte_ring.h
> > > +++ b/lib/librte_ring/rte_ring.h
> > > @@ -662,9 +662,10 @@ __rte_ring_mc_do_dequeue(struct rte_ring *r,
> > void **obj_table,
> > >   cons_next);
> > > } while (unlikely(success == 0));
> > >
> > > +   rte_smp_rmb();
> > > +
> > > /* copy in table */
> > > DEQUEUE_PTRS();
> > > -   rte_smp_rmb();
> > >
> > > /*
> > >  * If there are other dequeues in progress that preceded us,
> > > @@ -746,9 +747,10 @@ __rte_ring_sc_do_dequeue(struct rte_ring *r,
> > void **obj_table,
> > > cons_next = cons_head + n;
> > > r->cons.head = cons_next;
> > >
> > > +   rte_smp_rmb();
> > > +
> > > /* copy in table */
> > > DEQUEUE_PTRS();
> > > -   rte_smp_rmb();
> > >
> > > __RING_STAT_ADD(r, deq_success, n);
> > > r->cons.tail = cons_next;
> > > --
> > > 2.9.0
> > >
> > >
> > >
> > ==
> > ==
> > > The information contained in this message may be privileged and
> > > confidential and protected from disclosure. If the reader of this
> > > message is not the intended recipient, or an employee or agent
> > > responsible for delivering this message to the intended recipient, you
> > > are hereby notified that any reproduction, dissemination or
> > > distribution of this communication is strictly prohibited. If you have
> > > received this communication in error, please notify us immediately by
> > > replying to the message and deleting it from your computer. Thank you.
> > > Coriant-Tellabs
> > >
> > ==
> > ==

[dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct location

2016-07-11 Thread Ananyev, Konstantin


Hi ,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Juhamatti Kuusisaari
> Sent: Monday, July 11, 2016 11:21 AM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH] lib: move rte_ring read barrier to correct 
> location
> 
> Fix the location of the rte_ring data dependency read barrier.
> It needs to be called before accessing indexed data to ensure
> that the data itself is guaranteed to be correctly updated.
> 
> See more details at kernel/Documentation/memory-barriers.txt
> section 'Data dependency barriers'.


Any explanation why?
>From my point smp_rmb()s are on the proper places here :)
Konstantin

> 
> Signed-off-by: Juhamatti Kuusisaari 
> ---
>  lib/librte_ring/rte_ring.h | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index eb45e41..a923e49 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -662,9 +662,10 @@ __rte_ring_mc_do_dequeue(struct rte_ring *r, void 
> **obj_table,
>   cons_next);
> } while (unlikely(success == 0));
> 
> +   rte_smp_rmb();
> +
> /* copy in table */
> DEQUEUE_PTRS();
> -   rte_smp_rmb();
> 
> /*
>  * If there are other dequeues in progress that preceded us,
> @@ -746,9 +747,10 @@ __rte_ring_sc_do_dequeue(struct rte_ring *r, void 
> **obj_table,
> cons_next = cons_head + n;
> r->cons.head = cons_next;
> 
> +   rte_smp_rmb();
> +
> /* copy in table */
> DEQUEUE_PTRS();
> -   rte_smp_rmb();
> 
> __RING_STAT_ADD(r, deq_success, n);
> r->cons.tail = cons_next;
> --
> 2.9.0
> 
> 
> 
> The information contained in this message may be privileged
> and confidential and protected from disclosure. If the reader
> of this message is not the intended recipient, or an employee
> or agent responsible for delivering this message to the
> intended recipient, you are hereby notified that any reproduction,
> dissemination or distribution of this communication is strictly
> prohibited. If you have received this communication in error,
> please notify us immediately by replying to the message and
> deleting it from your computer. Thank you. Coriant-Tellabs
>

[dpdk-dev] specific driver API - was bypass code cleanup

2016-07-11 Thread Ananyev, Konstantin



> > > > > Hmmm. It's true it is cleaner. But I am not sure having a generic API
> > > > > for bypass is a good idea at all.
> > > > > I was thinking to totally remove it.
> > > >
> > > > Why to remove it?
> > > > As I know there are people who use that functionality.
> > > >
> > > > > Maybe we can try to have a specific API by including ixgbe_bypass.h in
> > > > > the application.
> > > >
> > > > Hmm, isn't that what we were trying to get rid of in last few years?
> > > > HW specific stuff?
> > >
> > > Yes exactly.
> > > I have the feeling the bypass API is specific to ixgbe. Isn't it?
> >
> > As far as I know, yes.
> >
> > > As we will probably see other features specific to only one device.
> > > Instead of adding a function in the generic API, I think it may be
> > > saner to include a driver header.
> >
> > But that means use has to make decision based on HW id/type of the device,
> > the thing we were trying to get rid of in last few releases, no?
> 
> Not really. If an application requires the bypass feature, we can assume
> it will be used only on ixgbe NICs.

Bypass HW doesn't present on all ixgbe devices.
Only few specific models have that functionality.
So we still to provide and API for user to query is that functionality is 
supported or not,
user still has to make a decision based on HW id. 

> Having some generic APIs helps to deploy DPDK applications on heterogeous
> machines. But if an application rely on something hardware specific, there
> is no benefit of using a "fake generic layer" I guess.

Ok, what is your definition of 'heterogeous machines' then?
Let say today, as I know, only i40e and ixgbe can do HW mirroring of the 
traffic.
Is that  generic enough, or not?
I suppose we can find huge number of examples, when functionality differes
between different HW models.
As I remember that was discussed a while ago, and generic conclusion was:
avoid device specific API exposed to the application layer. 

> 
> > > Then if it appears to be used
> > > in more devices, it can be generalized.
> > > What do you think of this approach?
> >
> > We talked few times about introducing sort of ioctl() call, to communicate
> > about HW specific features.
> > Might be a bypass I a good candidate to be moved into this ioctl() thing...
> 
> I don't see how making an ioctl-like would be better than directly including
> a specific header.

User and application writer don't have to guess on what device his code will 
work.
He can just query what ioctl IDs that device support, and then use the 
supported ones.

> 
> > But I suppose it's too late for 16.07 to start such big changes.
> 
> Of course yes.

Ok, then I misunderstood you here.

> 
> > If you don't like bypass API to be a generic one, my suggestion would be
> > to leave it as it is for 16.07, and start a discussion what it should look 
> > like
> > for 16.11.
> 
> That's what we are doing here.
> I've changed the title to give a better visibility to the thread.

Ok, thanks.
Konstantin

[dpdk-dev] [PATCH] lib/librte_ether: bypass code cleanup

2016-07-11 Thread Ananyev, Konstantin


> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Thomas Monjalon
> > > Hmmm. It's true it is cleaner. But I am not sure having a generic API
> > > for bypass is a good idea at all.
> > > I was thinking to totally remove it.
> >
> > Why to remove it?
> > As I know there are people who use that functionality.
> >
> > > Maybe we can try to have a specific API by including ixgbe_bypass.h in
> > > the application.
> >
> > Hmm, isn't that what we were trying to get rid of in last few years?
> > HW specific stuff?
> 
> Yes exactly.
> I have the feeling the bypass API is specific to ixgbe. Isn't it?

As far as I know, yes.

> 
> As we will probably see other features specific to only one device.
> Instead of adding a function in the generic API, I think it may be
> saner to include a driver header.

But that means use has to make decision based on HW id/type of the device,
the thing we were trying to get rid of in last few releases, no?

> Then if it appears to be used
> in more devices, it can be generalized.
> What do you think of this approach?

We talked few times about introducing sort of ioctl() call, to communicate
about HW specific features.
Might be a bypass I a good candidate to be moved into this ioctl() thing...
But I suppose it's too late for 16.07 to start such big changes.
If you don't like bypass API to be a generic one, my suggestion would be
to leave it as it is for 16.07, and start a discussion what it should look like
for 16.11. 

Konstantin

[dpdk-dev] [PATCH] lib/librte_ether: bypass code cleanup

2016-07-11 Thread Ananyev, Konstantin

Hi Thomas,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Thomas Monjalon
> Sent: Monday, July 11, 2016 8:11 AM
> To: Lu, Wenzhuo
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] lib/librte_ether: bypass code cleanup
> 
> 2016-07-11 14:21, Wenzhuo Lu:
> > In testpmd code, device id is used directly to check if bypass
> > is supported. But APP should not know the details of HW, the NIC
> > specific info should not be exposed here.
> 
> Right
> 
> > This patch adds a new rte API to check if bypass is supported.
> 
> Hmmm. It's true it is cleaner. But I am not sure having a generic API
> for bypass is a good idea at all.
> I was thinking to totally remove it.

Why to remove it?
As I know there are people who use that functionality.

> Maybe we can try to have a specific API by including ixgbe_bypass.h in
> the application.

Hmm, isn't that what we were trying to get rid of in last few years?
HW specific stuff?

Konstantin

[dpdk-dev] [PATCH v2] i40: fix the VXLAN TSO issue

2016-07-07 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev, Konstantin
> Sent: Thursday, July 07, 2016 11:51 AM
> To: Tao, Zhe; dev at dpdk.org
> Cc: Tao, Zhe; Wu, Jingjing
> Subject: Re: [dpdk-dev] [PATCH v2] i40: fix the VXLAN TSO issue
> 
> 
> Hi Tao,
> 
> Sorry hit send button too early by accident :)
> 
> > >
> > > Problem:
> > > When using the TSO + VXLAN feature in i40e, the outer UDP length fields in
> > > the multiple UDP segments which are TSOed by the i40e will have the
> > > wrong value.
> > >
> > > Fix this problem by adding the tunnel type field in the i40e descriptor
> > > which was missed before.
> > >
> > > Fixes: 77b8301733c3 ("i40e: VXLAN Tx checksum offload")
> > >
> > > Signed-off-by: Zhe Tao 
> > > ---
> > > V2: Edited some comments for mbuf structure and i40e driver.
> > >
> > >  app/test-pmd/csumonly.c  | 26 +++---
> > >  drivers/net/i40e/i40e_rxtx.c | 12 +---
> > >  lib/librte_mbuf/rte_mbuf.h   | 16 +++-
> > >  3 files changed, 43 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
> > > index ac4bd8f..d423c20 100644
> > > --- a/app/test-pmd/csumonly.c
> > > +++ b/app/test-pmd/csumonly.c
> > > @@ -204,7 +204,8 @@ parse_ethernet(struct ether_hdr *eth_hdr, struct 
> > > testpmd_offload_info *info)
> > >  static void
> > >  parse_vxlan(struct udp_hdr *udp_hdr,
> > >   struct testpmd_offload_info *info,
> > > - uint32_t pkt_type)
> > > + uint32_t pkt_type,
> > > + uint64_t *ol_flags)
> > >  {
> > >   struct ether_hdr *eth_hdr;
> > >
> > > @@ -215,6 +216,7 @@ parse_vxlan(struct udp_hdr *udp_hdr,
> > >   RTE_ETH_IS_TUNNEL_PKT(pkt_type) == 0)
> > >   return;
> > >
> > > + *ol_flags |= PKT_TX_TUNNEL_VXLAN;

Do we always have to setup tunnelling flags here?
Obviously it would mean an extra CTD per packet and might slowdown things.
In fact, I think current patch wouldn't work correctly if
TESTPMD_TX_OFFLOAD_OUTER_IP_CKSUM is not set.
So, can we do it only when TSO is enabled or outer IP checksum is enabled?

> > >   info->is_tunnel = 1;
> > >   info->outer_ethertype = info->ethertype;
> > >   info->outer_l2_len = info->l2_len;
> > > @@ -231,7 +233,9 @@ parse_vxlan(struct udp_hdr *udp_hdr,
> > >
> > >  /* Parse a gre header */
> > >  static void
> > > -parse_gre(struct simple_gre_hdr *gre_hdr, struct testpmd_offload_info 
> > > *info)
> > > +parse_gre(struct simple_gre_hdr *gre_hdr,
> > > +   struct testpmd_offload_info *info,
> > > +   uint64_t *ol_flags)
> > >  {
> > >   struct ether_hdr *eth_hdr;
> > >   struct ipv4_hdr *ipv4_hdr;
> > > @@ -242,6 +246,8 @@ parse_gre(struct simple_gre_hdr *gre_hdr, struct 
> > > testpmd_offload_info *info)
> > >   if ((gre_hdr->flags & _htons(~GRE_SUPPORTED_FIELDS)) != 0)
> > >   return;
> > >
> > > + *ol_flags |= PKT_TX_TUNNEL_GRE;
> > > +
> > >   gre_len += sizeof(struct simple_gre_hdr);
> > >
> > >   if (gre_hdr->flags & _htons(GRE_KEY_PRESENT))
> > > @@ -417,7 +423,7 @@ process_inner_cksums(void *l3_hdr, const struct 
> > > testpmd_offload_info *info,
> > >   * packet */
> > >  static uint64_t
> > >  process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info 
> > > *info,
> > > - uint16_t testpmd_ol_flags)
> > > + uint16_t testpmd_ol_flags, uint64_t orig_ol_flags)
> > >  {
> > >   struct ipv4_hdr *ipv4_hdr = outer_l3_hdr;
> > >   struct ipv6_hdr *ipv6_hdr = outer_l3_hdr;
> > > @@ -442,6 +448,9 @@ process_outer_cksums(void *outer_l3_hdr, struct 
> > > testpmd_offload_info *info,
> > >* hardware supporting it today, and no API for it. */
> > >
> > >   udp_hdr = (struct udp_hdr *)((char *)outer_l3_hdr + info->outer_l3_len);
> > > + if ((orig_ol_flags & PKT_TX_TCP_SEG) &&
> > > + ((orig_ol_flags & PKT_TX_TUNNEL_MASK) == PKT_TX_TUNNEL_VXLAN))
> > > + udp_hdr->dgram_cksum = 0;
> > >   /* do not recalculate udp cksum if it was 0 */
> > >   if (udp_hdr->dgram_cksum != 0) {
> > >   udp_hdr->dgram_cksum = 0;
> > > @@ -705,15 +714,18 @@ pkt_burst_checksum_forward(struct fwd_str

[dpdk-dev] [PATCH v2] i40: fix the VXLAN TSO issue

2016-07-07 Thread Ananyev, Konstantin


Hi Tao,

Sorry hit send button too early by accident :)

> >
> > Problem:
> > When using the TSO + VXLAN feature in i40e, the outer UDP length fields in
> > the multiple UDP segments which are TSOed by the i40e will have the
> > wrong value.
> >
> > Fix this problem by adding the tunnel type field in the i40e descriptor
> > which was missed before.
> >
> > Fixes: 77b8301733c3 ("i40e: VXLAN Tx checksum offload")
> >
> > Signed-off-by: Zhe Tao 
> > ---
> > V2: Edited some comments for mbuf structure and i40e driver.
> >
> >  app/test-pmd/csumonly.c  | 26 +++---
> >  drivers/net/i40e/i40e_rxtx.c | 12 +---
> >  lib/librte_mbuf/rte_mbuf.h   | 16 +++-
> >  3 files changed, 43 insertions(+), 11 deletions(-)
> >
> > diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
> > index ac4bd8f..d423c20 100644
> > --- a/app/test-pmd/csumonly.c
> > +++ b/app/test-pmd/csumonly.c
> > @@ -204,7 +204,8 @@ parse_ethernet(struct ether_hdr *eth_hdr, struct 
> > testpmd_offload_info *info)
> >  static void
> >  parse_vxlan(struct udp_hdr *udp_hdr,
> > struct testpmd_offload_info *info,
> > -   uint32_t pkt_type)
> > +   uint32_t pkt_type,
> > +   uint64_t *ol_flags)
> >  {
> > struct ether_hdr *eth_hdr;
> >
> > @@ -215,6 +216,7 @@ parse_vxlan(struct udp_hdr *udp_hdr,
> > RTE_ETH_IS_TUNNEL_PKT(pkt_type) == 0)
> > return;
> >
> > +   *ol_flags |= PKT_TX_TUNNEL_VXLAN;
> > info->is_tunnel = 1;
> > info->outer_ethertype = info->ethertype;
> > info->outer_l2_len = info->l2_len;
> > @@ -231,7 +233,9 @@ parse_vxlan(struct udp_hdr *udp_hdr,
> >
> >  /* Parse a gre header */
> >  static void
> > -parse_gre(struct simple_gre_hdr *gre_hdr, struct testpmd_offload_info 
> > *info)
> > +parse_gre(struct simple_gre_hdr *gre_hdr,
> > + struct testpmd_offload_info *info,
> > + uint64_t *ol_flags)
> >  {
> > struct ether_hdr *eth_hdr;
> > struct ipv4_hdr *ipv4_hdr;
> > @@ -242,6 +246,8 @@ parse_gre(struct simple_gre_hdr *gre_hdr, struct 
> > testpmd_offload_info *info)
> > if ((gre_hdr->flags & _htons(~GRE_SUPPORTED_FIELDS)) != 0)
> > return;
> >
> > +   *ol_flags |= PKT_TX_TUNNEL_GRE;
> > +
> > gre_len += sizeof(struct simple_gre_hdr);
> >
> > if (gre_hdr->flags & _htons(GRE_KEY_PRESENT))
> > @@ -417,7 +423,7 @@ process_inner_cksums(void *l3_hdr, const struct 
> > testpmd_offload_info *info,
> >   * packet */
> >  static uint64_t
> >  process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
> > -   uint16_t testpmd_ol_flags)
> > +   uint16_t testpmd_ol_flags, uint64_t orig_ol_flags)
> >  {
> > struct ipv4_hdr *ipv4_hdr = outer_l3_hdr;
> > struct ipv6_hdr *ipv6_hdr = outer_l3_hdr;
> > @@ -442,6 +448,9 @@ process_outer_cksums(void *outer_l3_hdr, struct 
> > testpmd_offload_info *info,
> >  * hardware supporting it today, and no API for it. */
> >
> > udp_hdr = (struct udp_hdr *)((char *)outer_l3_hdr + info->outer_l3_len);
> > +   if ((orig_ol_flags & PKT_TX_TCP_SEG) &&
> > +   ((orig_ol_flags & PKT_TX_TUNNEL_MASK) == PKT_TX_TUNNEL_VXLAN))
> > +   udp_hdr->dgram_cksum = 0;
> > /* do not recalculate udp cksum if it was 0 */
> > if (udp_hdr->dgram_cksum != 0) {
> > udp_hdr->dgram_cksum = 0;
> > @@ -705,15 +714,18 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
> > if (info.l4_proto == IPPROTO_UDP) {
> > struct udp_hdr *udp_hdr;
> > udp_hdr = (struct udp_hdr *)((char *)l3_hdr +
> > -   info.l3_len);
> > -   parse_vxlan(udp_hdr, , m->packet_type);
> > +  info.l3_len);
> > +   parse_vxlan(udp_hdr, , m->packet_type,
> > +   _flags);
> > } else if (info.l4_proto == IPPROTO_GRE) {
> > struct simple_gre_hdr *gre_hdr;
> > gre_hdr = (struct simple_gre_hdr *)
> > ((char *)l3_hdr + info.l3_len);
> > -   parse_gre(gre_hdr, );
> > +   parse_gre(gre_hdr, , _flags);
> > } else if (info.l4_proto == IPPROTO_IPIP) {
> > void *encap_ip_hdr;
> > +
> > +   ol_flags |= PKT_TX_TUNNEL_IPIP;
> > encap_ip_hdr = (char *)l3_hdr + info.l3_len;
> > parse_encap_ip(encap_ip_hdr, );
> > }
> > @@ -745,7 +757,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
> >  * processed in hardware. */
> > if (info.is_tunnel == 1) {
> > ol_flags |= process_outer_cksums(outer_l3_hdr, ,
> > -   testpmd_ol_flags);
> > +   testpmd_ol_flags, ol_flags);
> >

[dpdk-dev] [PATCH v2] i40: fix the VXLAN TSO issue

2016-07-07 Thread Ananyev, Konstantin

Hi Tao,

> 
> Problem:
> When using the TSO + VXLAN feature in i40e, the outer UDP length fields in
> the multiple UDP segments which are TSOed by the i40e will have the
> wrong value.
> 
> Fix this problem by adding the tunnel type field in the i40e descriptor
> which was missed before.
> 
> Fixes: 77b8301733c3 ("i40e: VXLAN Tx checksum offload")
> 
> Signed-off-by: Zhe Tao 
> ---
> V2: Edited some comments for mbuf structure and i40e driver.
> 
>  app/test-pmd/csumonly.c  | 26 +++---
>  drivers/net/i40e/i40e_rxtx.c | 12 +---
>  lib/librte_mbuf/rte_mbuf.h   | 16 +++-
>  3 files changed, 43 insertions(+), 11 deletions(-)
> 
> diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
> index ac4bd8f..d423c20 100644
> --- a/app/test-pmd/csumonly.c
> +++ b/app/test-pmd/csumonly.c
> @@ -204,7 +204,8 @@ parse_ethernet(struct ether_hdr *eth_hdr, struct 
> testpmd_offload_info *info)
>  static void
>  parse_vxlan(struct udp_hdr *udp_hdr,
>   struct testpmd_offload_info *info,
> - uint32_t pkt_type)
> + uint32_t pkt_type,
> + uint64_t *ol_flags)
>  {
>   struct ether_hdr *eth_hdr;
> 
> @@ -215,6 +216,7 @@ parse_vxlan(struct udp_hdr *udp_hdr,
>   RTE_ETH_IS_TUNNEL_PKT(pkt_type) == 0)
>   return;
> 
> + *ol_flags |= PKT_TX_TUNNEL_VXLAN;
>   info->is_tunnel = 1;
>   info->outer_ethertype = info->ethertype;
>   info->outer_l2_len = info->l2_len;
> @@ -231,7 +233,9 @@ parse_vxlan(struct udp_hdr *udp_hdr,
> 
>  /* Parse a gre header */
>  static void
> -parse_gre(struct simple_gre_hdr *gre_hdr, struct testpmd_offload_info *info)
> +parse_gre(struct simple_gre_hdr *gre_hdr,
> +   struct testpmd_offload_info *info,
> +   uint64_t *ol_flags)
>  {
>   struct ether_hdr *eth_hdr;
>   struct ipv4_hdr *ipv4_hdr;
> @@ -242,6 +246,8 @@ parse_gre(struct simple_gre_hdr *gre_hdr, struct 
> testpmd_offload_info *info)
>   if ((gre_hdr->flags & _htons(~GRE_SUPPORTED_FIELDS)) != 0)
>   return;
> 
> + *ol_flags |= PKT_TX_TUNNEL_GRE;
> +
>   gre_len += sizeof(struct simple_gre_hdr);
> 
>   if (gre_hdr->flags & _htons(GRE_KEY_PRESENT))
> @@ -417,7 +423,7 @@ process_inner_cksums(void *l3_hdr, const struct 
> testpmd_offload_info *info,
>   * packet */
>  static uint64_t
>  process_outer_cksums(void *outer_l3_hdr, struct testpmd_offload_info *info,
> - uint16_t testpmd_ol_flags)
> + uint16_t testpmd_ol_flags, uint64_t orig_ol_flags)
>  {
>   struct ipv4_hdr *ipv4_hdr = outer_l3_hdr;
>   struct ipv6_hdr *ipv6_hdr = outer_l3_hdr;
> @@ -442,6 +448,9 @@ process_outer_cksums(void *outer_l3_hdr, struct 
> testpmd_offload_info *info,
>* hardware supporting it today, and no API for it. */
> 
>   udp_hdr = (struct udp_hdr *)((char *)outer_l3_hdr + info->outer_l3_len);
> + if ((orig_ol_flags & PKT_TX_TCP_SEG) &&
> + ((orig_ol_flags & PKT_TX_TUNNEL_MASK) == PKT_TX_TUNNEL_VXLAN))
> + udp_hdr->dgram_cksum = 0;
>   /* do not recalculate udp cksum if it was 0 */
>   if (udp_hdr->dgram_cksum != 0) {
>   udp_hdr->dgram_cksum = 0;
> @@ -705,15 +714,18 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
>   if (info.l4_proto == IPPROTO_UDP) {
>   struct udp_hdr *udp_hdr;
>   udp_hdr = (struct udp_hdr *)((char *)l3_hdr +
> - info.l3_len);
> - parse_vxlan(udp_hdr, , m->packet_type);
> +info.l3_len);
> + parse_vxlan(udp_hdr, , m->packet_type,
> + _flags);
>   } else if (info.l4_proto == IPPROTO_GRE) {
>   struct simple_gre_hdr *gre_hdr;
>   gre_hdr = (struct simple_gre_hdr *)
>   ((char *)l3_hdr + info.l3_len);
> - parse_gre(gre_hdr, );
> + parse_gre(gre_hdr, , _flags);
>   } else if (info.l4_proto == IPPROTO_IPIP) {
>   void *encap_ip_hdr;
> +
> + ol_flags |= PKT_TX_TUNNEL_IPIP;
>   encap_ip_hdr = (char *)l3_hdr + info.l3_len;
>   parse_encap_ip(encap_ip_hdr, );
>   }
> @@ -745,7 +757,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
>* processed in hardware. */
>   if (info.is_tunnel == 1) {
>   ol_flags |= process_outer_cksums(outer_l3_hdr, ,
> - testpmd_ol_flags);
> + testpmd_ol_flags, ol_flags);
>   }
> 
>   /* step 4: fill the mbuf meta data (flags and header lengths) */
> diff --git a/drivers/net/i40e/i40e_rxtx.c

[dpdk-dev] [PATCH v2] net/ixgbe: fix compilation when offload flags disabled

2016-07-01 Thread Ananyev, Konstantin


> The ixgbe driver does not compile if CONFIG_RTE_IXGBE_RX_OLFLAGS_ENABLE=n
> because the macro has not the proper number of parameters. To reproduce
> the issue:
> 
>   make config T=x86_64-native-linuxapp-gcc
>   sed -i 
> 's,CONFIG_RTE_IXGBE_RX_OLFLAGS_ENABLE=y,CONFIG_RTE_IXGBE_RX_OLFLAGS_ENABLE=n,'
>  build/.config
>   make -j4
>   [...]
>ixgbe_rxtx_vec_sse.c: In function ?_recv_raw_pkts_vec?:
>ixgbe_rxtx_vec_sse.c:345:53: error: macro "desc_to_olflags_v" passed 3 
> arguments, but takes just 2
>   desc_to_olflags_v(descs, vlan_flags, _pkts[pos]);
> ^
>ixgbe_rxtx_vec_sse.c:345:3: error: ?desc_to_olflags_v? undeclared (first 
> use in this function)
>   desc_to_olflags_v(descs, vlan_flags, _pkts[pos]);
>   ^
>ixgbe_rxtx_vec_sse.c:345:3: note: each undeclared identifier is reported 
> only once for each function it appears in
>ixgbe_rxtx_vec_sse.c:231:10: error: variable ?vlan_flags? set but not used 
> [-Werror=unused-but-set-variable]
>  uint8_t vlan_flags;
>  ^
>cc1: all warnings being treated as errors
> 
> This patch fixes the number of arguments in th macro, and ensure that
> vlan_flags is marked as used to avoid the third error.
> 
> Fixes: b37b528d957c ("mbuf: add new Rx flags for stripped VLAN")
> Reported-by: Amin Tootoonchian 
> Signed-off-by: Olivier Matz 
> ---
>  drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c 
> b/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c
> index 4f95deb..1c4fd7c 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c
> @@ -197,7 +197,9 @@ desc_to_olflags_v(__m128i descs[4], uint8_t vlan_flags,
>   rx_pkts[3]->ol_flags = vol.e[3];
>  }
>  #else
> -#define desc_to_olflags_v(desc, rx_pkts) do {} while (0)
> +#define desc_to_olflags_v(desc, vlan_flags, rx_pkts) do { \
> + RTE_SET_USED(vlan_flags); \
> + } while (0)
>  #endif
> 
>  /*
> --

Acked-by: Konstantin Ananyev 

> 2.8.1

[dpdk-dev] [PATCH] mk: fix acl library static linking

2016-06-30 Thread Ananyev, Konstantin


Hi Thomas,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Sergio Gonzalez Monroy
> Sent: Thursday, June 30, 2016 3:02 PM
> To: Thomas Monjalon
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] mk: fix acl library static linking
> 
> On 30/06/2016 13:44, Thomas Monjalon wrote:
> > 2016-06-30 13:04, Sergio Gonzalez Monroy:
> >> On 30/06/2016 12:38, Thomas Monjalon wrote:
> >>> Does it need to be commented in rte.app.mk?
> >>> The other libs are in whole-archive to support dlopen of drivers.
> >>> But the problem here is not because of a driver use.
> >> There seem to be a bunch of libraries under --whole-archive scope that
> >> are not
> >> PMDs, ie. cfgfile, cmdline...
> >>
> >> What is the criteria?
> > The criteria is a bit vague. We must try to include only libs which can
> > be used by a driver.
> > cmdline should probably not be there.
> > Does it make sense to use cfgfile in a driver? maybe yes.
> 
> So as it is, ACL autotest is broken when building static libs
> (non-combined).
> For combined libs we usually wrap libdpdk.a with --whole-archive, thus it is
> not an issue.
> 
> Just thinking a bit more about the 'dlopen of drivers' case you
> mentioned before,
> shouldn't the driver have proper dependencies and therefore need shared
> DPDK libraries?
> What does happen if binary/app and driver are built against different
> library versions?
> Where does it say that we do support this use case?
> 
> Sergio
> 

So are you going to apply this patch?
Right now acl just can't be used properly in case of static library build.
Thanks
Konstantin

[dpdk-dev] [PATCH] mk: fix acl library static linking

2016-06-30 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Sergio Gonzalez Monroy
> Sent: Thursday, June 30, 2016 12:10 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH] mk: fix acl library static linking
> 
> Since below commit, ACL library is outside the scope of --whole-archive
> and ACL autotest fails.
> 
>   RTE>>acl_autotest
>   ACL: allocation of 25166728 bytes on socket 9 for ACL_acl_ctx failed
>   ACL: rte_acl_add_rules(acl_ctx): rule #1 is invalid
>   Line 1584: SSE classify with zero categories failed!
>   Test Failed
> 
> This is the result of the linker picking weak over non-weak functions.
> 
> Fixes: 95dc3c3cf31c ("mk: reduce scope of whole-archive static linking")
> 
> Signed-off-by: Sergio Gonzalez Monroy 
> ---
>  mk/rte.app.mk | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 83314ca..7f89fd4 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -76,12 +76,12 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_IP_FRAG)+= 
> -lrte_ip_frag
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_METER)  += -lrte_meter
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED)  += -lrte_sched
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_LPM)+= -lrte_lpm
> -_LDLIBS-$(CONFIG_RTE_LIBRTE_ACL)+= -lrte_acl
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_JOBSTATS)   += -lrte_jobstats
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_POWER)  += -lrte_power
> 
>  _LDLIBS-y += --whole-archive
> 
> +_LDLIBS-$(CONFIG_RTE_LIBRTE_ACL)+= -lrte_acl
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_TIMER)  += -lrte_timer
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_HASH)   += -lrte_hash
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)  += -lrte_vhost
> --
> 2.4.11

Acked-by: Konstantin Ananyev

[dpdk-dev] [PATCH v2] mbuf:rearrange mbuf to be more mbuf chain friendly

2016-06-27 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Keith Wiles
> Sent: Saturday, June 25, 2016 4:56 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH v2] mbuf:rearrange mbuf to be more mbuf chain 
> friendly
> 
> Move the next pointer to the first cacheline of the rte_mbuf structure
> and move the offload values to the second cacheline to give better
> performance to applications using chained mbufs.
> 
> Enabled by a configuration option CONFIG_RTE_MBUF_CHAIN_FRIENDLY default
> is set to No.

First, it would make ixgbe and i40e vector RX functions to work incorrectly.
Second, I don't think we can afford to allow people swap mbuf fields in the way 
they like.
Otherwise we'll end-up with totally unmaintainable code pretty soon.
So NACK.  

Konstantin

> 
> Signed-off-by: Keith Wiles 
> ---
>  config/common_base |  2 +
>  .../linuxapp/eal/include/exec-env/rte_kni_common.h |  8 +++
>  lib/librte_mbuf/rte_mbuf.h | 67 
> +++---
>  3 files changed, 56 insertions(+), 21 deletions(-)
> 
> diff --git a/config/common_base b/config/common_base
> index 379a791..f7c624e 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -405,6 +405,8 @@ CONFIG_RTE_LIBRTE_MBUF_DEBUG=n
>  CONFIG_RTE_MBUF_DEFAULT_MEMPOOL_OPS="ring_mp_mc"
>  CONFIG_RTE_MBUF_REFCNT_ATOMIC=y
>  CONFIG_RTE_PKTMBUF_HEADROOM=128
> +# Set to y if needing to be mbuf chain friendly.
> +CONFIG_RTE_MBUF_CHAIN_FRIENDLY=n
> 
>  #
>  # Compile librte_timer
> diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h 
> b/lib/librte_eal/linuxapp/eal/include/exec-
> env/rte_kni_common.h
> index 2acdfd9..44d65cd 100644
> --- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> +++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> @@ -120,11 +120,19 @@ struct rte_kni_mbuf {
>   char pad2[4];
>   uint32_t pkt_len;   /**< Total pkt len: sum of all segment 
> data_len. */
>   uint16_t data_len;  /**< Amount of data in segment buffer. */
> +#ifdef RTE_MBUF_CHAIN_FRIENDLY
> + char pad3[8];
> + void *next;
> 
>   /* fields on second cache line */
> + char pad4[16] __attribute__((__aligned__(RTE_CACHE_LINE_MIN_SIZE)));
> + void *pool;
> +#else
> + /* fields on second cache line */
>   char pad3[8] __attribute__((__aligned__(RTE_CACHE_LINE_MIN_SIZE)));
>   void *pool;
>   void *next;
> +#endif
>  };
> 
>  /*
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 15e3a10..6e6ba0e 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -765,6 +765,28 @@ typedef uint8_t  MARKER8[0];  /**< generic marker with 
> 1B alignment */
>  typedef uint64_t MARKER64[0]; /**< marker that allows us to overwrite 8 bytes
> * with a single assignment */
> 
> +typedef union {
> + uint32_t rss; /**< RSS hash result if RSS enabled */
> + struct {
> + union {
> + struct {
> + uint16_t hash;
> + uint16_t id;
> + };
> + uint32_t lo;
> + /**< Second 4 flexible bytes */
> + };
> + uint32_t hi;
> + /**< First 4 flexible bytes or FD ID, dependent on
> + PKT_RX_FDIR_* flag in ol_flags. */
> + } fdir;   /**< Filter identifier if FDIR enabled */
> + struct {
> + uint32_t lo;
> + uint32_t hi;
> + } sched;  /**< Hierarchical scheduler */
> + uint32_t usr; /**< User defined tags. See rte_distributor_process() 
> */
> +} rss_hash_t;
> +
>  /**
>   * The generic rte_mbuf, containing a packet mbuf.
>   */
> @@ -824,28 +846,31 @@ struct rte_mbuf {
>   uint16_t data_len;/**< Amount of data in segment buffer. */
>   /** VLAN TCI (CPU order), valid if PKT_RX_VLAN_STRIPPED is set. */
>   uint16_t vlan_tci;
> +#ifdef RTE_MBUF_CHAIN_FRIENDLY
> + /*
> +  * Move offload into the second cache line and next in the first.
> +  * Better performance for applications using chained mbufs to have
> +  * the next pointer in the first cache line.
> +  * If you change this structure, you must change the user-mode
> +  * version in rte_kni_common.h to match the new layout.
> +  */
> + uint32_t seqn; /**< Sequence number. See also rte_reorder_insert() */
> + uint16_t vlan_tci_outer;  /**< Outer VLAN Tag Control Identifier (CPU 
> order) */
> + struct rte_mbuf *next;/**< Next segment of scattered packet. */
> +
> + /* second cache line - fields only used in slow path or on TX */
> + MARKER cacheline1 __rte_cache_min_aligned;
> +
> + rss_hash_t hash;  /**< hash information */
> 
>   union {
> - uint32_t rss; /**< RSS hash result if RSS enabled */
> -

[dpdk-dev] [PATCH] ixgbe:enable configuration for old ptype behavior

2016-06-27 Thread Ananyev, Konstantin

> The default behavior is to NOT support the old ptype behavior,
> but enabling the configuration option the old ptype style
> can be supported.
> 
> Add support for old behaviour until we have a cleaner solution using
> a configuration option CONFIG_RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOUR,
> which is defaulted to not set.

I think it was explained why we had to diable ptype recognition for vRX, please 
see here:
http://dpdk.org/browse/dpdk/commit/drivers/net/ixgbe/ixgbe_rxtx_vec.c?id=d9a2009a81089093645fea2e04b51dd37edf3e6f
I think that introducing a compile time option to re-enable incomplete
and not fully correct functionality is a very bad approach.
So NACK.
Konstantin  

> 
> Signed-off-by: Keith Wiles 
> ---
>  config/common_base |  2 ++
>  drivers/net/ixgbe/ixgbe_ethdev.c   |  6 +
>  drivers/net/ixgbe/ixgbe_rxtx_vec.c | 52 
> +++---
>  3 files changed, 57 insertions(+), 3 deletions(-)
> 
> diff --git a/config/common_base b/config/common_base
> index bdde2e7..05e69bc 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -160,6 +160,8 @@ CONFIG_RTE_LIBRTE_IXGBE_DEBUG_DRIVER=n
>  CONFIG_RTE_LIBRTE_IXGBE_PF_DISABLE_STRIP_CRC=n
>  CONFIG_RTE_IXGBE_INC_VECTOR=y
>  CONFIG_RTE_IXGBE_RX_OLFLAGS_ENABLE=y
> +# Enable to restore old ptype behavior
> +CONFIG_RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOR=n
> 
>  #
>  # Compile burst-oriented I40E PMD driver
> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c 
> b/drivers/net/ixgbe/ixgbe_ethdev.c
> index e11a431..068b92b 100644
> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> @@ -3076,7 +3076,13 @@ ixgbe_dev_supported_ptypes_get(struct rte_eth_dev *dev)
>   if (dev->rx_pkt_burst == ixgbe_recv_pkts ||
>   dev->rx_pkt_burst == ixgbe_recv_pkts_lro_single_alloc ||
>   dev->rx_pkt_burst == ixgbe_recv_pkts_lro_bulk_alloc ||
> +#ifdef RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOR
> + dev->rx_pkt_burst == ixgbe_recv_pkts_bulk_alloc ||
> + dev->rx_pkt_burst == ixgbe_recv_pkts_vec ||
> + dev->rx_pkt_burst == ixgbe_recv_scattered_pkts_vec)
> +#else
>   dev->rx_pkt_burst == ixgbe_recv_pkts_bulk_alloc)
> +#endif
>   return ptypes;
>   return NULL;
>  }
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> index 12190d2..2e0d50b 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> @@ -228,6 +228,10 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>   );
>   __m128i dd_check, eop_check;
>   uint8_t vlan_flags;
> +#ifdef RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOR
> + __m128i desc_mask = _mm_set_epi32(0x, 0x,
> +   0x, 0x07F0);
> +#endif
> 
>   /* nb_pkts shall be less equal than RTE_IXGBE_MAX_RX_BURST */
>   nb_pkts = RTE_MIN(nb_pkts, RTE_IXGBE_MAX_RX_BURST);
> @@ -268,8 +272,14 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>   13, 12,  /* octet 12~13, 16 bits data_len */
>   0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
>   13, 12,  /* octet 12~13, low 16 bits pkt_len */
> +#ifdef RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOR
> + 0xFF, 0xFF,  /* skip high 16 bits pkt_type */
> + 1,   /* octet 1, 8 bits pkt_type field */
> + 0/* octet 0, 4 bits offset 4 pkt_type field */
> +#else
>   0xFF, 0xFF,  /* skip 32 bit pkt_type */
>   0xFF, 0xFF
> +#endif
>   );
> 
>   /* Cache is empty -> need to scan the buffer rings, but first move
> @@ -291,6 +301,9 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
>   for (pos = 0, nb_pkts_recd = 0; pos < nb_pkts;
>   pos += RTE_IXGBE_DESCS_PER_LOOP,
>   rxdp += RTE_IXGBE_DESCS_PER_LOOP) {
> +#ifdef RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOR
> + __m128i descs0[RTE_IXGBE_DESCS_PER_LOOP];
> +#endif
>   __m128i descs[RTE_IXGBE_DESCS_PER_LOOP];
>   __m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
>   __m128i zero, staterr, sterr_tmp1, sterr_tmp2;
> @@ -301,18 +314,30 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct 
> rte_mbuf **rx_pkts,
> 
>   /* Read desc statuses backwards to avoid race condition */
>   /* A.1 load 4 pkts desc */
> +#ifdef RTE_IXGBE_ENABLE_OLD_PTYPE_BEHAVIOR
> + descs0[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
> +#else
>   descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
> -
> +#endif
>   /* B.2 copy 2 mbuf point into rx_pkts  */
>   _mm_storeu_si128((__m128i *)_pkts[pos], mbp1);
> 
>   /* B.1 load 1 mbuf point */
>   mbp2 = _mm_loadu_si128((__m128i *)_ring[pos+2]);
> 
> +#ifdef

[dpdk-dev] [PATCH v6 1/4] lib/librte_ether: support device reset

2016-06-21 Thread Ananyev, Konstantin



> 
> Hi Konstantin,
> 
> > Hi Jerin,
> >
> > > -Original Message-
> > > From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> > > Sent: Tuesday, June 21, 2016 9:56 AM
> > > To: Lu, Wenzhuo
> > > Cc: Stephen Hemminger; dev at dpdk.org; Ananyev, Konstantin; Richardson, 
> > > Bruce; Chen, Jing D; Liang, Cunming; Wu, Jingjing;
> Zhang,
> > > Helin; thomas.monjalon at 6wind.com
> > > Subject: Re: [dpdk-dev] [PATCH v6 1/4] lib/librte_ether: support device 
> > > reset
> > >
> > > On Tue, Jun 21, 2016 at 08:24:36AM +, Lu, Wenzhuo wrote:
> > > > Hi Jerin,
> > >
> > > Hi Wenzhuo,
> > >
> > > > > > > > > On Mon, Jun 20, 2016 at 02:24:27PM +0800, Wenzhuo Lu wrote:
> > > > > > > > > > Add an API to reset the device.
> > > > > > > > > > It's for VF device in this scenario, kernel PF + DPDK VF.
> > > > > > > > > > When the PF port down->up, APP should call this API to reset
> > > > > > > > > > VF port. Most likely, APP should call it in its management
> > > > > > > > > > thread and guarantee the thread safe. It means APP should 
> > > > > > > > > > stop
> > > > > > > > > > the rx/tx and the device, then reset the device, then 
> > > > > > > > > > recover
> > > > > > > > > > the device and rx/tx.
> > > > > > > > >
> > > > > > > > > Following is _a_ use-case for Device reset. But may be not be
> > > > > > > > > _the_ use case. IMO, We need to first say expected behavior of
> > > > > > > > > this API and add a use-case later.
> > > > > > > > >
> > > > > > > > > Other use-case would be, PCIe VF with functional level reset 
> > > > > > > > > for
> > > > > > > > > SRIOV migration.
> > > > > > > > > Are we on same page?
> > > > > > > >
> > > > > > > >
> > > > > > > > In my experience with Linux devices, this is normally handled by
> > > > > > > > the device driver in the start routine.  Since any use case 
> > > > > > > > which
> > > > > > > > needs this is going to do a stop/reset/start sequence, why not
> > > > > > > > just have the VF device driver do this in the start routine?.
> > > > > > > >
> > > > > > > > Adding yet another API and state transistion if not necessary
> > > > > > > > increases the complexity and required test cases for all 
> > > > > > > > devices.
> > > > > > >
> > > > > > > I agree with Stephen here.I think if application needs to call 
> > > > > > > start
> > > > > > > after the device reset then we could add this logic in start 
> > > > > > > itself
> > > > > > > rather exposing a yet another API
> > > > > > Do you mean changing the device_start to include all these actions, 
> > > > > > stop
> > > > > device -> stop queue -> re-setup queue -> start queue -> start device 
> > > > > ?
> > > > >
> > > > > What was the expected API call sequence when you were introduced this 
> > > > > API?
> > > > >
> > > > > Point was to have implicit device reset in the API call 
> > > > > sequence(Wherever make
> > > > > sense for specific PMD)
> > > > I think the API call sequence depends on the implementation of the APP. 
> > > > Let's say if there's not this reset API, APP can use this
> API
> > > call sequence to handle the PF link down/up event, rte_eth_dev_close -> 
> > > rte_eth_rx_queue_setup -> rte_eth_tx_queue_setup -
> >
> > > rte_eth_dev_start.
> > > > Actually our purpose is to use this reset API instead of the API call 
> > > > sequence. You can see the reset API is not necessary. The
> benefit
> > > is to save the code for APP.
> > >
> > > Then I am bit confused with original commit log description.
> > > |
> > > |It means APP should stop the rx/tx and the device, then reset the
>

[dpdk-dev] [PATCH v3 1/2] mempool: add stack (lifo) mempool handler

2016-06-21 Thread Ananyev, Konstantin



> -Original Message-
> From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> Sent: Tuesday, June 21, 2016 4:35 AM
> To: Ananyev, Konstantin
> Cc: Thomas Monjalon; dev at dpdk.org; Hunt, David; olivier.matz at 6wind.com; 
> viktorin at rehivetech.com; shreyansh.jain at nxp.com
> Subject: Re: [dpdk-dev] [PATCH v3 1/2] mempool: add stack (lifo) mempool 
> handler
> 
> On Mon, Jun 20, 2016 at 05:56:40PM +, Ananyev, Konstantin wrote:
> >
> >
> > > -Original Message-
> > > From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> > > Sent: Monday, June 20, 2016 3:22 PM
> > > To: Ananyev, Konstantin
> > > Cc: Thomas Monjalon; dev at dpdk.org; Hunt, David; olivier.matz at 
> > > 6wind.com; viktorin at rehivetech.com; shreyansh.jain at nxp.com
> > > Subject: Re: [dpdk-dev] [PATCH v3 1/2] mempool: add stack (lifo) mempool 
> > > handler
> > >
> > > On Mon, Jun 20, 2016 at 01:58:04PM +, Ananyev, Konstantin wrote:
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Thomas 
> > > > > Monjalon
> > > > > Sent: Monday, June 20, 2016 2:54 PM
> > > > > To: Jerin Jacob
> > > > > Cc: dev at dpdk.org; Hunt, David; olivier.matz at 6wind.com; viktorin 
> > > > > at rehivetech.com; shreyansh.jain at nxp.com
> > > > > Subject: Re: [dpdk-dev] [PATCH v3 1/2] mempool: add stack (lifo) 
> > > > > mempool handler
> > > > >
> > > > > 2016-06-20 18:55, Jerin Jacob:
> > > > > > On Mon, Jun 20, 2016 at 02:08:10PM +0100, David Hunt wrote:
> > > > > > > This is a mempool handler that is useful for pipelining apps, 
> > > > > > > where
> > > > > > > the mempool cache doesn't really work - example, where we have one
> > > > > > > core doing rx (and alloc), and another core doing Tx (and 
> > > > > > > return). In
> > > > > > > such a case, the mempool ring simply cycles through all the mbufs,
> > > > > > > resulting in a LLC miss on every mbuf allocated when the number of
> > > > > > > mbufs is large. A stack recycles buffers more effectively in this
> > > > > > > case.
> > > > > > >
> > > > > > > Signed-off-by: David Hunt 
> > > > > > > ---
> > > > > > >  lib/librte_mempool/Makefile|   1 +
> > > > > > >  lib/librte_mempool/rte_mempool_stack.c | 145 
> > > > > > > +
> > > > > >
> > > > > > How about moving new mempool handlers to drivers/mempool? (or 
> > > > > > similar).
> > > > > > In future, adding HW specific handlers in lib/librte_mempool/ may 
> > > > > > be bad idea.
> > > > >
> > > > > You're probably right.
> > > > > However we need to check and understand what a HW mempool handler 
> > > > > will be.
> > > > > I imagine the first of them will have to move handlers in drivers/
> > > >
> > > > Does it mean it we'll have to move mbuf into drivers too?
> > > > Again other libs do use mempool too.
> > > > Why not just lib/librte_mempool/arch/
> > > > ?
> > >
> > > I was proposing only to move only the new
> > > handler(lib/librte_mempool/rte_mempool_stack.c). Not any library or any
> > > other common code.
> > >
> > > Just like DPDK crypto device, Even if it is software implementation its
> > > better to move in driver/crypto instead of lib/librte_cryptodev
> > >
> > > "lib/librte_mempool/arch/" is not correct place as it is platform specific
> > > not architecture specific and HW mempool device may be PCIe or platform
> > > device.
> >
> > Ok, but why rte_mempool_stack.c has to be moved?
> 
> Just thought of having all the mempool handlers at one place.
> We can't move all HW mempool handlers at lib/librte_mempool/

Yep, but from your previous mail I thought we might have specific ones
for specific devices, no? 
If so, why to put them in one place, why just not in:
Drivers/xxx_dev/xxx_mempool.[h,c]
?
And keep generic ones in lib/librte_mempool
?
Konstantin

> 
> Jerin
> 
> > I can hardly imagine it is a 'platform sepcific'.
> > From my understanding it is a generic code.
> > Konstantin
> >
> >
> > >
> > > > Konstantin
> > > >
> > > >
> > > > > Jerin, are you volunteer?

[dpdk-dev] [PATCH v6 1/4] lib/librte_ether: support device reset

2016-06-21 Thread Ananyev, Konstantin

Hi Jerin,

> -Original Message-
> From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> Sent: Tuesday, June 21, 2016 9:56 AM
> To: Lu, Wenzhuo
> Cc: Stephen Hemminger; dev at dpdk.org; Ananyev, Konstantin; Richardson, 
> Bruce; Chen, Jing D; Liang, Cunming; Wu, Jingjing; Zhang,
> Helin; thomas.monjalon at 6wind.com
> Subject: Re: [dpdk-dev] [PATCH v6 1/4] lib/librte_ether: support device reset
> 
> On Tue, Jun 21, 2016 at 08:24:36AM +, Lu, Wenzhuo wrote:
> > Hi Jerin,
> 
> Hi Wenzhuo,
> 
> > > > > > > On Mon, Jun 20, 2016 at 02:24:27PM +0800, Wenzhuo Lu wrote:
> > > > > > > > Add an API to reset the device.
> > > > > > > > It's for VF device in this scenario, kernel PF + DPDK VF.
> > > > > > > > When the PF port down->up, APP should call this API to reset
> > > > > > > > VF port. Most likely, APP should call it in its management
> > > > > > > > thread and guarantee the thread safe. It means APP should stop
> > > > > > > > the rx/tx and the device, then reset the device, then recover
> > > > > > > > the device and rx/tx.
> > > > > > >
> > > > > > > Following is _a_ use-case for Device reset. But may be not be
> > > > > > > _the_ use case. IMO, We need to first say expected behavior of
> > > > > > > this API and add a use-case later.
> > > > > > >
> > > > > > > Other use-case would be, PCIe VF with functional level reset for
> > > > > > > SRIOV migration.
> > > > > > > Are we on same page?
> > > > > >
> > > > > >
> > > > > > In my experience with Linux devices, this is normally handled by
> > > > > > the device driver in the start routine.  Since any use case which
> > > > > > needs this is going to do a stop/reset/start sequence, why not
> > > > > > just have the VF device driver do this in the start routine?.
> > > > > >
> > > > > > Adding yet another API and state transistion if not necessary
> > > > > > increases the complexity and required test cases for all devices.
> > > > >
> > > > > I agree with Stephen here.I think if application needs to call start
> > > > > after the device reset then we could add this logic in start itself
> > > > > rather exposing a yet another API
> > > > Do you mean changing the device_start to include all these actions, stop
> > > device -> stop queue -> re-setup queue -> start queue -> start device ?
> > >
> > > What was the expected API call sequence when you were introduced this API?
> > >
> > > Point was to have implicit device reset in the API call sequence(Wherever 
> > > make
> > > sense for specific PMD)
> > I think the API call sequence depends on the implementation of the APP. 
> > Let's say if there's not this reset API, APP can use this API
> call sequence to handle the PF link down/up event, rte_eth_dev_close -> 
> rte_eth_rx_queue_setup -> rte_eth_tx_queue_setup ->
> rte_eth_dev_start.
> > Actually our purpose is to use this reset API instead of the API call 
> > sequence. You can see the reset API is not necessary. The benefit
> is to save the code for APP.
> 
> Then I am bit confused with original commit log description.
> |
> |It means APP should stop the rx/tx and the device, then reset the
> |device, then recover the device and rx/tx.
> |
> I was under impression that it a low level reset API for this device? Is
> n't it?
> 
> The other issue is generalized outlook of the API, Certain PMD will not
> have PF link down/up event? Link down/up and only connected to VF and PF
> only for configuration.
> 
> How about fixing it more transparently in PMD driver itself as
> PMD driver knows the PF link up/down event, Is it possible to
> recover the VF on that event if its only matter of resetting it?

I think we already went through that discussion on the list.
Unfortunately with current dpdk design it is hardly possible.
To achieve that we need to introduce some sort of synchronisation
between IO and control APIs (locking or so).
Actually I am not sure why having a special reset function will be a problem.
Yes, it would exist only for VFs, for PF it could be left unimplemented.
Though it definitely seems more convenient from user point of view,
they would know: to handle VF reset event, they just need to call that
particular function, not to re-implement their own.

Konstantin

> 
> Jerin

[dpdk-dev] supported packet types

2016-06-21 Thread Ananyev, Konstantin

Hi Olivier,

> 
> Hi Konstantin,
> 
> On 06/16/2016 01:29 PM, Ananyev, Konstantin wrote:
> >>>> I suggest instead to set the ptype
> >>>> in an opportunistic fashion instead:
> >>>> - if the driver/hw knows the ptype, set it
> >>>> - else, set it to unknown
> >>>
> >>> That's what PMD does now... and I don't think it can do much more -
> >>> only interpret information provided by the HW in a proper way.
> >>> Probably I misunderstood you here...
> >>
> >> My suggestion was to remove get_supported_ptypes an set the ptype in
> >> mbuf when the pmd recognize a type.
> >>
> >> But we could also keep get_supported_ptypes() for ptypes that will
> >> always be recognized, allowing a PMD to set any other ptype if it
> >> is recognized. This is probably what we have today in some PMDs, I
> >> think it just requires some clarification.
> >
> > Yes, +1 to the second option.
> 
> What about the following API comment?
> 
> '''
> Retrieve the supported packet types of an Ethernet device.
> 
> When a packet type is announced as supported, it *must* be recognized by
> the PMD. For instance, if RTE_PTYPE_L2_ETHER, RTE_PTYPE_L2_ETHER_VLAN
> and RTE_PTYPE_L3_IPV4 are announced, the PMD must return the following
> packet types for these packets:
> - Ether/IPv4  -> RTE_PTYPE_L2_ETHER | RTE_PTYPE_L3_IPV4
> - Ether/Vlan/IPv4 -> RTE_PTYPE_L2_ETHER_VLAN | RTE_PTYPE_L3_IPV4
> - Ether/   -> RTE_PTYPE_L2_ETHER
> - Ether/Vlan/ -> RTE_PTYPE_L2_ETHER_VLAN
> 
> When a packet is received by a PMD, the most precise type must be
> returned among the ones supported. However a PMD is allowed to set
> packet type that is not in the supported list, at the condition that it
> is more precise. Therefore, a PMD announcing no supported packet types
> can still set a matching packet type in a received packet.
> '''
> 
> If it's fine I'll submit it as a patch.

Yep, looks good to me.
Thanks
Konstantin

[dpdk-dev] [PATCH v3 1/2] mempool: add stack (lifo) mempool handler

2016-06-20 Thread Ananyev, Konstantin



> -Original Message-
> From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> Sent: Monday, June 20, 2016 3:22 PM
> To: Ananyev, Konstantin
> Cc: Thomas Monjalon; dev at dpdk.org; Hunt, David; olivier.matz at 6wind.com; 
> viktorin at rehivetech.com; shreyansh.jain at nxp.com
> Subject: Re: [dpdk-dev] [PATCH v3 1/2] mempool: add stack (lifo) mempool 
> handler
> 
> On Mon, Jun 20, 2016 at 01:58:04PM +, Ananyev, Konstantin wrote:
> >
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Thomas Monjalon
> > > Sent: Monday, June 20, 2016 2:54 PM
> > > To: Jerin Jacob
> > > Cc: dev at dpdk.org; Hunt, David; olivier.matz at 6wind.com; viktorin at 
> > > rehivetech.com; shreyansh.jain at nxp.com
> > > Subject: Re: [dpdk-dev] [PATCH v3 1/2] mempool: add stack (lifo) mempool 
> > > handler
> > >
> > > 2016-06-20 18:55, Jerin Jacob:
> > > > On Mon, Jun 20, 2016 at 02:08:10PM +0100, David Hunt wrote:
> > > > > This is a mempool handler that is useful for pipelining apps, where
> > > > > the mempool cache doesn't really work - example, where we have one
> > > > > core doing rx (and alloc), and another core doing Tx (and return). In
> > > > > such a case, the mempool ring simply cycles through all the mbufs,
> > > > > resulting in a LLC miss on every mbuf allocated when the number of
> > > > > mbufs is large. A stack recycles buffers more effectively in this
> > > > > case.
> > > > >
> > > > > Signed-off-by: David Hunt 
> > > > > ---
> > > > >  lib/librte_mempool/Makefile|   1 +
> > > > >  lib/librte_mempool/rte_mempool_stack.c | 145 
> > > > > +
> > > >
> > > > How about moving new mempool handlers to drivers/mempool? (or similar).
> > > > In future, adding HW specific handlers in lib/librte_mempool/ may be 
> > > > bad idea.
> > >
> > > You're probably right.
> > > However we need to check and understand what a HW mempool handler will be.
> > > I imagine the first of them will have to move handlers in drivers/
> >
> > Does it mean it we'll have to move mbuf into drivers too?
> > Again other libs do use mempool too.
> > Why not just lib/librte_mempool/arch/
> > ?
> 
> I was proposing only to move only the new
> handler(lib/librte_mempool/rte_mempool_stack.c). Not any library or any
> other common code.
> 
> Just like DPDK crypto device, Even if it is software implementation its
> better to move in driver/crypto instead of lib/librte_cryptodev
> 
> "lib/librte_mempool/arch/" is not correct place as it is platform specific
> not architecture specific and HW mempool device may be PCIe or platform
> device.

Ok, but why rte_mempool_stack.c has to be moved?
I can hardly imagine it is a 'platform sepcific'.
>From my understanding it is a generic code.
Konstantin


> 
> > Konstantin
> >
> >
> > > Jerin, are you volunteer?

[dpdk-dev] [PATCH v3 1/2] mempool: add stack (lifo) mempool handler

2016-06-20 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Thomas Monjalon
> Sent: Monday, June 20, 2016 2:54 PM
> To: Jerin Jacob
> Cc: dev at dpdk.org; Hunt, David; olivier.matz at 6wind.com; viktorin at 
> rehivetech.com; shreyansh.jain at nxp.com
> Subject: Re: [dpdk-dev] [PATCH v3 1/2] mempool: add stack (lifo) mempool 
> handler
> 
> 2016-06-20 18:55, Jerin Jacob:
> > On Mon, Jun 20, 2016 at 02:08:10PM +0100, David Hunt wrote:
> > > This is a mempool handler that is useful for pipelining apps, where
> > > the mempool cache doesn't really work - example, where we have one
> > > core doing rx (and alloc), and another core doing Tx (and return). In
> > > such a case, the mempool ring simply cycles through all the mbufs,
> > > resulting in a LLC miss on every mbuf allocated when the number of
> > > mbufs is large. A stack recycles buffers more effectively in this
> > > case.
> > >
> > > Signed-off-by: David Hunt 
> > > ---
> > >  lib/librte_mempool/Makefile|   1 +
> > >  lib/librte_mempool/rte_mempool_stack.c | 145 
> > > +
> >
> > How about moving new mempool handlers to drivers/mempool? (or similar).
> > In future, adding HW specific handlers in lib/librte_mempool/ may be bad 
> > idea.
> 
> You're probably right.
> However we need to check and understand what a HW mempool handler will be.
> I imagine the first of them will have to move handlers in drivers/

Does it mean it we'll have to move mbuf into drivers too?
Again other libs do use mempool too.
Why not just lib/librte_mempool/arch/
?
Konstantin


> Jerin, are you volunteer?

[dpdk-dev] [PATCH v3] i40e: configure MTU

2016-06-20 Thread Ananyev, Konstantin

Hi Beilei,

> -Original Message-
> From: Xing, Beilei
> Sent: Monday, June 20, 2016 1:05 PM
> To: Ananyev, Konstantin; Yong Wang; Olivier Matz; Wu, Jingjing
> Cc: dev at dpdk.org; Julien Meunier; Thomas Monjalon
> Subject: RE: [dpdk-dev] [PATCH v3] i40e: configure MTU
> 
> 
> > -Original Message-
> > From: Ananyev, Konstantin
> > Sent: Monday, June 20, 2016 4:05 PM
> > To: Xing, Beilei ; Yong Wang
> > ; Olivier Matz ; Wu,
> > Jingjing 
> > Cc: dev at dpdk.org; Julien Meunier ; Thomas
> > Monjalon 
> > Subject: RE: [dpdk-dev] [PATCH v3] i40e: configure MTU
> >
> >
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Xing, Beilei
> > > Sent: Monday, June 20, 2016 5:50 AM
> > > To: Yong Wang; Olivier Matz; Wu, Jingjing
> > > Cc: dev at dpdk.org; Julien Meunier; Thomas Monjalon
> > > Subject: Re: [dpdk-dev] [PATCH v3] i40e: configure MTU
> > >
> > >
> > > > -Original Message-
> > > > From: Yong Wang [mailto:yongwang at vmware.com]
> > > > Sent: Friday, June 17, 2016 1:40 AM
> > > > To: Olivier Matz ; Xing, Beilei
> > > > ; Wu, Jingjing 
> > > > Cc: dev at dpdk.org; Julien Meunier ; 
> > > > Thomas
> > > > Monjalon 
> > > > Subject: Re: [dpdk-dev] [PATCH v3] i40e: configure MTU
> > > >
> > > > On 5/16/16, 5:27 AM, "dev on behalf of Olivier Matz"
> > > >  wrote:
> > > >
> > > > >Hi Beilei,
> > > > >
> > > > >On 05/13/2016 10:15 AM, Beilei Xing wrote:
> > > > >> This patch enables configuring MTU for i40e.
> > > > >> Since changing MTU needs to reconfigure queue, stop port first
> > > > >> before configuring MTU.
> > > > >>
> > > > >> Signed-off-by: Beilei Xing 
> > > > >> ---
> > > > >> v3 changes:
> > > > >>  Add frame size with extra I40E_VLAN_TAG_SIZE.
> > > > >>  Delete i40e_dev_rx_init(pf) cause it will be called when port 
> > > > >> starts.
> > > > >>
> > > > >> v2 changes:
> > > > >>  If mtu is not within the allowed range, return -EINVAL instead of
> > -EBUSY.
> > > > >>  Delete rxq reconfigure cause rxq reconfigure will be finished in
> > > > >> i40e_dev_rx_init.
> > > > >>
> > > > >>  drivers/net/i40e/i40e_ethdev.c | 34
> > > > >> ++
> > > > >>  1 file changed, 34 insertions(+)
> > > > >>
> > > > >> [...]
> > > > >> +static int
> > > > >> +i40e_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu) {
> > > > >> +struct i40e_pf *pf =
> > I40E_DEV_PRIVATE_TO_PF(dev->data->dev_private);
> > > > >> +struct rte_eth_dev_data *dev_data = pf->dev_data;
> > > > >> +uint32_t frame_size = mtu + ETHER_HDR_LEN
> > > > >> +  + ETHER_CRC_LEN + I40E_VLAN_TAG_SIZE;
> > > > >> +int ret = 0;
> > > > >> +
> > > > >> +/* check if mtu is within the allowed range */
> > > > >> +if ((mtu < ETHER_MIN_MTU) || (frame_size >
> > I40E_FRAME_SIZE_MAX))
> > > > >> +return -EINVAL;
> > > > >> +
> > > > >> +/* mtu setting is forbidden if port is start */
> > > > >> +if (dev_data->dev_started) {
> > > > >> +PMD_DRV_LOG(ERR,
> > > > >> +"port %d must be stopped before 
> > > > >> configuration\n",
> > > > >> +dev_data->port_id);
> > > > >> +return -ENOTSUP;
> > > > >> +}
> > > > >
> > > > >I'm not convinced that ENOTSUP is the proper return value here.
> > > > >It is usually returned when a function is not implemented, which is
> > > > >not the case here: the function is implemented but is forbidden
> > > > >because the port is running.
> > > > >
> > > > >I saw that Julien commented on your v1 that the return value should
> > > > >be one of:
> > > > > - (0) if successful.
> > > > > - (-ENOTSUP) if operatio

[dpdk-dev] [PATCH v3] i40e: configure MTU

2016-06-20 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Xing, Beilei
> Sent: Monday, June 20, 2016 5:50 AM
> To: Yong Wang; Olivier Matz; Wu, Jingjing
> Cc: dev at dpdk.org; Julien Meunier; Thomas Monjalon
> Subject: Re: [dpdk-dev] [PATCH v3] i40e: configure MTU
> 
> 
> > -Original Message-
> > From: Yong Wang [mailto:yongwang at vmware.com]
> > Sent: Friday, June 17, 2016 1:40 AM
> > To: Olivier Matz ; Xing, Beilei  > intel.com>;
> > Wu, Jingjing 
> > Cc: dev at dpdk.org; Julien Meunier ; Thomas
> > Monjalon 
> > Subject: Re: [dpdk-dev] [PATCH v3] i40e: configure MTU
> >
> > On 5/16/16, 5:27 AM, "dev on behalf of Olivier Matz"  > dpdk.org
> > on behalf of olivier.matz at 6wind.com> wrote:
> >
> > >Hi Beilei,
> > >
> > >On 05/13/2016 10:15 AM, Beilei Xing wrote:
> > >> This patch enables configuring MTU for i40e.
> > >> Since changing MTU needs to reconfigure queue, stop port first before
> > >> configuring MTU.
> > >>
> > >> Signed-off-by: Beilei Xing 
> > >> ---
> > >> v3 changes:
> > >>  Add frame size with extra I40E_VLAN_TAG_SIZE.
> > >>  Delete i40e_dev_rx_init(pf) cause it will be called when port starts.
> > >>
> > >> v2 changes:
> > >>  If mtu is not within the allowed range, return -EINVAL instead of 
> > >> -EBUSY.
> > >>  Delete rxq reconfigure cause rxq reconfigure will be finished in
> > >> i40e_dev_rx_init.
> > >>
> > >>  drivers/net/i40e/i40e_ethdev.c | 34
> > >> ++
> > >>  1 file changed, 34 insertions(+)
> > >>
> > >> [...]
> > >> +static int
> > >> +i40e_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu) {
> > >> +struct i40e_pf *pf = 
> > >> I40E_DEV_PRIVATE_TO_PF(dev->data->dev_private);
> > >> +struct rte_eth_dev_data *dev_data = pf->dev_data;
> > >> +uint32_t frame_size = mtu + ETHER_HDR_LEN
> > >> +  + ETHER_CRC_LEN + I40E_VLAN_TAG_SIZE;
> > >> +int ret = 0;
> > >> +
> > >> +/* check if mtu is within the allowed range */
> > >> +if ((mtu < ETHER_MIN_MTU) || (frame_size > I40E_FRAME_SIZE_MAX))
> > >> +return -EINVAL;
> > >> +
> > >> +/* mtu setting is forbidden if port is start */
> > >> +if (dev_data->dev_started) {
> > >> +PMD_DRV_LOG(ERR,
> > >> +"port %d must be stopped before 
> > >> configuration\n",
> > >> +dev_data->port_id);
> > >> +return -ENOTSUP;
> > >> +}
> > >
> > >I'm not convinced that ENOTSUP is the proper return value here.
> > >It is usually returned when a function is not implemented, which is not
> > >the case here: the function is implemented but is forbidden because the
> > >port is running.
> > >
> > >I saw that Julien commented on your v1 that the return value should be
> > >one of:
> > > - (0) if successful.
> > > - (-ENOTSUP) if operation is not supported.
> > > - (-ENODEV) if *port_id* invalid.
> > > - (-EINVAL) if *mtu* invalid.
> > >
> > >But I think your initial value (-EBUSY) was fine. Maybe it should be
> > >added in the API instead, with the following description:
> > >  (-EBUSY) if the operation is not allowed when the port is running
> >
> > AFAICT, the same check is not done for other drivers that implement the 
> > mac_set
> > op. Wouldn?t it make more sense to have the driver disable the port,
> > reconfigure and re-enable it in this case, instead of returning error code? 
> >  If the
> > consensus in DPDK is to have the application disable the port first, we 
> > need to
> > enforce this policy across all devices and clearly document this behavior.
> >
> Hi,
> For ixgbe/igb, there's register for MTU during runtime, but for i40e, setting 
> MTU is finished by firmware during configuration. There'll
> be some risk when disable the port, reconfigure and re-enable the port, so 
> return error code in this case currently.

Ok, but then what is the point to introduce mtu_set() for i40e at all?
Let's just leave it unimplemented, as it is right now.
Then users would know that for that device they have to do 
dev_stop/dev_configure().
Konstantin

> Thanks for your comments, we'll improve the document later.
> 
> > >This would allow the application to take its dispositions to stop the
> > >port and restart it with the proper jumbo_frame argument.
> > >
> > >+CC Thomas which maintains ethdev API.
> > >
> > >
> > >Regards,
> > >Olivier

[dpdk-dev] [PATCH v3] i40e: configure MTU

2016-06-17 Thread Ananyev, Konstantin



> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Yong Wang
> Sent: Thursday, June 16, 2016 6:52 PM
> To: Olivier Matz; Xing, Beilei; Wu, Jingjing
> Cc: dev at dpdk.org; Julien Meunier; Thomas Monjalon
> Subject: Re: [dpdk-dev] [PATCH v3] i40e: configure MTU
> 
> On 6/16/16, 10:40 AM, "dev on behalf of Yong Wang"  on behalf of yongwang at vmware.com> wrote:
> 
> >On 5/16/16, 5:27 AM, "dev on behalf of Olivier Matz"  >dpdk.org on behalf of olivier.matz at 6wind.com> wrote:
> >
> >>Hi Beilei,
> >>
> >>On 05/13/2016 10:15 AM, Beilei Xing wrote:
> >>> This patch enables configuring MTU for i40e.
> >>> Since changing MTU needs to reconfigure queue, stop port first
> >>> before configuring MTU.
> >>>
> >>> Signed-off-by: Beilei Xing 
> >>> ---
> >>> v3 changes:
> >>>  Add frame size with extra I40E_VLAN_TAG_SIZE.
> >>>  Delete i40e_dev_rx_init(pf) cause it will be called when port starts.
> >>>
> >>> v2 changes:
> >>>  If mtu is not within the allowed range, return -EINVAL instead of -EBUSY.
> >>>  Delete rxq reconfigure cause rxq reconfigure will be finished in
> >>>  i40e_dev_rx_init.
> >>>
> >>>  drivers/net/i40e/i40e_ethdev.c | 34 ++
> >>>  1 file changed, 34 insertions(+)
> >>>
> >>> [...]
> >>> +static int
> >>> +i40e_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
> >>> +{
> >>> + struct i40e_pf *pf = I40E_DEV_PRIVATE_TO_PF(dev->data->dev_private);
> >>> + struct rte_eth_dev_data *dev_data = pf->dev_data;
> >>> + uint32_t frame_size = mtu + ETHER_HDR_LEN
> >>> +   + ETHER_CRC_LEN + I40E_VLAN_TAG_SIZE;
> >>> + int ret = 0;
> >>> +
> >>> + /* check if mtu is within the allowed range */
> >>> + if ((mtu < ETHER_MIN_MTU) || (frame_size > I40E_FRAME_SIZE_MAX))
> >>> + return -EINVAL;
> >>> +
> >>> + /* mtu setting is forbidden if port is start */
> >>> + if (dev_data->dev_started) {
> >>> + PMD_DRV_LOG(ERR,
> >>> + "port %d must be stopped before configuration\n",
> >>> + dev_data->port_id);
> >>> + return -ENOTSUP;
> >>> + }
> >>
> >>I'm not convinced that ENOTSUP is the proper return value here.
> >>It is usually returned when a function is not implemented, which
> >>is not the case here: the function is implemented but is forbidden
> >>because the port is running.
> >>
> >>I saw that Julien commented on your v1 that the return value should
> >>be one of:
> >> - (0) if successful.
> >> - (-ENOTSUP) if operation is not supported.
> >> - (-ENODEV) if *port_id* invalid.
> >> - (-EINVAL) if *mtu* invalid.
> >>
> >>But I think your initial value (-EBUSY) was fine. Maybe it should be
> >>added in the API instead, with the following description:
> >>  (-EBUSY) if the operation is not allowed when the port is running
> >
> >AFAICT, the same check is not done for other drivers that implement
> >the mac_set op. Wouldn?t it make more sense to have the driver disable
> 
> Correction: this should read as mtu_set.
> 
> >the port, reconfigure and re-enable it in this case, instead of returning
> >error code?  If the consensus in DPDK is to have the application disable
> >the port first, we need to enforce this policy across all devices and
> >clearly document this behavior.

Other PMDs do allow to change mtu without stopping/reconfiguration the port
(under certain conditions of course).
As I remember that's why that function was introduced at the first place.
It is a bit strange that i40e doesn't allow that, might be the author can
comment why it is necessary.
Again the whole i40e_dev_mtu_set() looks a bit strange to me -
I can't see where the actual change of HW state is happening.
Konstantin

> >
> >>This would allow the application to take its dispositions to stop the
> >>port and restart it with the proper jumbo_frame argument.
> >>
> >>+CC Thomas which maintains ethdev API.
> >>
> >>
> >>Regards,
> >>Olivier
> >

[dpdk-dev] [PATCH v2] rte_hash: add scalable multi-writer insertion w/ Intel TSX

2016-06-16 Thread Ananyev, Konstantin

Hi Wei,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wei Shen
> Sent: Thursday, June 16, 2016 5:53 AM
> To: dev at dpdk.org
> Cc: De Lara Guarch, Pablo; stephen at networkplumber.org; Tai, Charlie; 
> Maciocco, Christian; Gobriel, Sameh; Shen, Wei1
> Subject: [dpdk-dev] [PATCH v2] rte_hash: add scalable multi-writer insertion 
> w/ Intel TSX
> 
> This patch introduced scalable multi-writer Cuckoo Hash insertion
> based on a split Cuckoo Search and Move operation using Intel
> TSX. It can do scalable hash insertion with 22 cores with little
> performance loss and negligible TSX abortion rate.
> 
> * Added an extra rte_hash flag definition to switch default
>   single writer Cuckoo Hash behavior to multiwriter.
> 
> * Added a make_space_insert_bfs_mw() function to do split Cuckoo
>   search in BFS order.
> 
> * Added tsx_cuckoo_move_insert() to do Cuckoo move in Intel TSX
>   protected manner.
> 
> * Added test_hash_multiwriter() as test case for multi-writer
>   Cuckoo Hash.
> 
> Signed-off-by: Shen Wei 
> Signed-off-by: Sameh Gobriel 
> ---
>  app/test/Makefile  |   1 +
>  app/test/test_hash_multiwriter.c   | 272 
> +
>  doc/guides/rel_notes/release_16_07.rst |  12 ++
>  lib/librte_hash/rte_cuckoo_hash.c  | 231 +---
>  lib/librte_hash/rte_hash.h |   3 +
>  5 files changed, 494 insertions(+), 25 deletions(-)
>  create mode 100644 app/test/test_hash_multiwriter.c
> 
> diff --git a/app/test/Makefile b/app/test/Makefile
> index 053f3a2..5476300 100644
> --- a/app/test/Makefile
> +++ b/app/test/Makefile
> @@ -120,6 +120,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_thash.c
>  SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_perf.c
>  SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_functions.c
>  SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_scaling.c
> +SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_multiwriter.c
> 
>  SRCS-$(CONFIG_RTE_LIBRTE_LPM) += test_lpm.c
>  SRCS-$(CONFIG_RTE_LIBRTE_LPM) += test_lpm_perf.c
> diff --git a/app/test/test_hash_multiwriter.c 
> b/app/test/test_hash_multiwriter.c
> new file mode 100644
> index 000..54a0d2c
> --- /dev/null
> +++ b/app/test/test_hash_multiwriter.c
> @@ -0,0 +1,272 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2016 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + * * Redistributions of source code must retain the above copyright
> + *notice, this list of conditions and the following disclaimer.
> + * * Redistributions in binary form must reproduce the above copyright
> + *notice, this list of conditions and the following disclaimer in
> + *the documentation and/or other materials provided with the
> + *distribution.
> + * * Neither the name of Intel Corporation nor the names of its
> + *contributors may be used to endorse or promote products derived
> + *from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "test.h"
> +
> +/*
> + * Check condition and return an error if true. Assumes that "handle" is the
> + * name of the hash structure pointer to be freed.
> + */
> +#define RETURN_IF_ERROR(cond, str, ...) do {\
> + if (cond) { \
> + printf("ERROR line %d: " str "\n", __LINE__,\
> + ##__VA_ARGS__); \
> + if (handle) \
> + rte_hash_free(handle);  \
> + return -1;  \
> + }   \
> +} while (0)
> +
> +#define

[dpdk-dev] supported packet types

2016-06-16 Thread Ananyev, Konstantin


Hi Olivier,

> 
> Hi Konstantin,
> 
> On 06/15/2016 04:08 PM, Ananyev, Konstantin wrote:
> >
> > Hi Olivier,
> >
> >> Hi Konstantin,
> >>
> >> On 04/29/2016 06:03 PM, Ananyev, Konstantin wrote:
> >>>> The following commit introduces a function to list the supported
> >>>> packet types of a device:
> >>>>
> >>>>http://dpdk.org/browse/dpdk/commit/?id=78a38edf66
> >>>>
> >>>> I would like to know what does "supported" precisely mean.
> >>>> Is it:
> >>>>
> >>>> 1/ - if a ptype is marked as supported, the driver MUST set
> >>>>   this ptype if the packet matches the format described in rte_mbuf.h
> >>>>
> >>>> -> if the ptype is not recognized, the application is sure
> >>>>that the packet is not one of the supported ptype
> >>>> -> but this is difficult to take advantage of this inside an
> >>>>application that supports several different ports model
> >>>>that do not support the same packet types
> >>>>
> >>>> 2/ - if a ptype is marked as supported, the driver CAN set
> >>>>   the ptype if the packet matches the format described in rte_mbuf.h
> >>>>
> >>>> -> if a ptype is not recognized, the application does a software
> >>>>fallback
> >>>> -> in this case, it would useless to have the get_supported_ptype()
> >>>>
> >>>> Can you confirm if the PMDs and l3fwd (the only user) expect 1/
> >>>> or 2/ ?
> >>>
> >>> 1)
> >>>
> >>>>
> >>>> Can you elaborate on the advantages of having this API?
> >>>
> >>> Application can rely on information provided by PMD avoid parsing packet 
> >>> manually to recognise it's pytpe.
> >>>
> >>>>
> >>>> And a supplementary question: if a ptype is not marked as supported,
> >>>> is it forbidden for a driver to set this ptype anyway?
> >>>
> >>> I suppose it is not forbidden, but there is no guarantee from PMD that it
> >>> would be able to recognise that ptype.
> >>>
> >>> Konstantin
> >>>
> >>>> Because we can
> >>>> imagine a hardware that can only recognize packets in some conditions
> >>>> (ex: can recognize IPv4 if there is no vlan).
> >>>>
> >>>> I think properly defining the meaning of "supported" is mandatory
> >>>> to have an application beeing able to use this feature, and avoid
> >>>> PMDs to behave differently because the API is unclear (like we've
> >>>> already seen for other features).
> >>
> >> Back on this. I've made some tests with ixgbe, and I'm afraid it
> >> will be difficult to ensure that when a ptype is advertised, it will
> >> always be set in the mbuf, whatever the layers below. Here are some
> >> examples:
> >>
> >
> > 1)
> >
> >> - double vlans
> >>
> >> Ether(type=0x88A8)/Dot1Q(vlan=0x666)/Dot1Q(vlan=0x666)/IP()/UDP()/Raw("x"*32)
> >>ixgbe advertises RTE_PTYPE_ETHER in supported ptypes
> >>returned ptype: RTE_PTYPE_UNKNOWN
> >>should be: L2_ETHER
> >>(this works with any unknown ethtype)
> >
> > 2)
> >
> >>
> >> - ip6 in ip6 tunnel
> >>ixgbe advertises RTE_PTYPE_TUNNEL_IP in supported ptypes
> >>Ether()/IPv6()/IPv6()/UDP()/Raw("x"*32)
> >>returned ptype: L2_ETHER L3_IPV6
> >>should be: L2_ETHER L3_IPV6 TUNNEL_IP INNER_L3_IPV6 INNER_L4_UDP
> >
> > 3)
> >
> >>
> >> - ip options
> >>Ether()/IP(options=IPOption('\x83\x03\x10'))/UDP()/Raw("x"*32)
> >>returned ptype: RTE_PTYPE_UNKNOWN
> >>should be: L2_ETHER L3_IPV4_EXT L4_UDP
> >
> > 4)
> >
> >>
> >> - ip inside ip with options
> >>Ether()/IP(options=IPOption('\x83\x03\x10'))/IP()/UDP()/Raw("x"*32)
> >>returned ptype: L2_ETHER L3_IPV4_EXT
> >>should be: L2_ETHER L3_IPV4_EXT TUNNEL_IP INNER_L3_IPV4 INNER_L4_UDP
> >
> >
> > I am marked your test-cases with numbers to simplify referencing to them.
> > 1) & 3) - I believe the reason is a bug in our ptype code inside ixgbe PMD 
&

[dpdk-dev] [PATCH v9 1/8] ethdev: use locks to protect Rx/Tx callback lists

2016-06-15 Thread Ananyev, Konstantin



> -Original Message-
> From: Richardson, Bruce
> Sent: Wednesday, June 15, 2016 3:22 PM
> To: Ananyev, Konstantin
> Cc: Ivan Boule; Thomas Monjalon; Pattan, Reshma; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v9 1/8] ethdev: use locks to protect Rx/Tx 
> callback lists
> 
> On Wed, Jun 15, 2016 at 03:20:27PM +0100, Ananyev, Konstantin wrote:
> >
> >
> > > -Original Message-
> > > From: Ivan Boule [mailto:ivan.boule at 6wind.com]
> > > Sent: Wednesday, June 15, 2016 3:07 PM
> > > To: Richardson, Bruce; Ananyev, Konstantin
> > > Cc: Thomas Monjalon; Pattan, Reshma; dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v9 1/8] ethdev: use locks to protect Rx/Tx 
> > > callback lists
> > >
> > > On 06/15/2016 03:29 PM, Bruce Richardson wrote:
> > > > On Wed, Jun 15, 2016 at 12:40:16PM +, Ananyev, Konstantin wrote:
> > > >> Hi Ivan,
> > > >>
> > > >>> -Original Message-
> > > >>> From: Ivan Boule [mailto:ivan.boule at 6wind.com]
> > > >>> Sent: Wednesday, June 15, 2016 1:15 PM
> > > >>> To: Thomas Monjalon; Ananyev, Konstantin
> > > >>> Cc: Pattan, Reshma; dev at dpdk.org
> > > >>> Subject: Re: [dpdk-dev] [PATCH v9 1/8] ethdev: use locks to protect 
> > > >>> Rx/Tx callback lists
> > > >>>
> > > >>> On 06/15/2016 10:48 AM, Thomas Monjalon wrote:
> > > >>>
> > > >>>>>
> > > >>>>>> I think the read access would need locking but we do not want it
> > > >>>>>> in fast path.
> > > >>>>>
> > > >>>>> I don't think it would be needed.
> > > >>>>> As I said - read/write interaction didn't change from what we have 
> > > >>>>> right now.
> > > >>>>> But if you have some particular scenario in mind that you believe 
> > > >>>>> would cause
> > > >>>>> a race condition - please speak up.
> > > >>>>
> > > >>>> If we add/remove a callback during a burst? Is it possible that the 
> > > >>>> next
> > > >>>> pointer would have a wrong value leading to a crash?
> > > >>>> Maybe we need a comment to state that we should not alter burst
> > > >>>> callbacks while running burst functions.
> > > >>>>
> > > >>>
> > > >>> Hi Reshma,
> > > >>>
> > > >>> You claim that the "rte_eth_rx_cb_lock" does not need to be hold in 
> > > >>> the
> > > >>> function "rte_eth_rx_burst()" while parsing the "post_rx_burst_cbs" 
> > > >>> list
> > > >>> of RX callbacks associated with the polled RX queue to safely remove 
> > > >>> RX
> > > >>> callback(s) in parallel.
> > > >>> The problem is not [only] with the setting and the loading of 
> > > >>> "cb->next"
> > > >>> that you assume to be atomic operations, which is certainly true on 
> > > >>> most
> > > >>> CPUs.
> > > >>> I see the 2 important following issues:
> > > >>>
> > > >>> 1) the "rte_eth_rxtx_callback" data structure associated with a 
> > > >>> removed
> > > >>> RX callback could still be accessed in the callback parsing loop of 
> > > >>> the
> > > >>> function "rte_eth_rx_burst()" after having been freed in parallel.
> > > >>>
> > > >>> BUT SUCH A BAD SITUATION WILL NOT CURRENTLY HAPPEN, THANKS TO THE NICE
> > > >>> MEMORY LEAK BUG in the function "rte_eth_remove_rx_callback()"  that
> > > >>> does not free the "rte_eth_rxtx_callback" data structure associated 
> > > >>> with
> > > >>> the removed callback !
> > > >>
> > > >> Yes, though it is documented behaviour, someone can probably
> > > >> refer it as a feature, not a bug ;)
> > > >>
> > > >
> > > > +1
> > > > This is definitely not a bug, this is absolutely by design. One may 
> > > > argue with
> > > > the design, but it was done for a definite reason, so as to avoid 
>

[dpdk-dev] [PATCH v9 1/8] ethdev: use locks to protect Rx/Tx callback lists

2016-06-15 Thread Ananyev, Konstantin



> -Original Message-
> From: Ivan Boule [mailto:ivan.boule at 6wind.com]
> Sent: Wednesday, June 15, 2016 3:07 PM
> To: Richardson, Bruce; Ananyev, Konstantin
> Cc: Thomas Monjalon; Pattan, Reshma; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v9 1/8] ethdev: use locks to protect Rx/Tx 
> callback lists
> 
> On 06/15/2016 03:29 PM, Bruce Richardson wrote:
> > On Wed, Jun 15, 2016 at 12:40:16PM +, Ananyev, Konstantin wrote:
> >> Hi Ivan,
> >>
> >>> -Original Message-
> >>> From: Ivan Boule [mailto:ivan.boule at 6wind.com]
> >>> Sent: Wednesday, June 15, 2016 1:15 PM
> >>> To: Thomas Monjalon; Ananyev, Konstantin
> >>> Cc: Pattan, Reshma; dev at dpdk.org
> >>> Subject: Re: [dpdk-dev] [PATCH v9 1/8] ethdev: use locks to protect Rx/Tx 
> >>> callback lists
> >>>
> >>> On 06/15/2016 10:48 AM, Thomas Monjalon wrote:
> >>>
> >>>>>
> >>>>>> I think the read access would need locking but we do not want it
> >>>>>> in fast path.
> >>>>>
> >>>>> I don't think it would be needed.
> >>>>> As I said - read/write interaction didn't change from what we have 
> >>>>> right now.
> >>>>> But if you have some particular scenario in mind that you believe would 
> >>>>> cause
> >>>>> a race condition - please speak up.
> >>>>
> >>>> If we add/remove a callback during a burst? Is it possible that the next
> >>>> pointer would have a wrong value leading to a crash?
> >>>> Maybe we need a comment to state that we should not alter burst
> >>>> callbacks while running burst functions.
> >>>>
> >>>
> >>> Hi Reshma,
> >>>
> >>> You claim that the "rte_eth_rx_cb_lock" does not need to be hold in the
> >>> function "rte_eth_rx_burst()" while parsing the "post_rx_burst_cbs" list
> >>> of RX callbacks associated with the polled RX queue to safely remove RX
> >>> callback(s) in parallel.
> >>> The problem is not [only] with the setting and the loading of "cb->next"
> >>> that you assume to be atomic operations, which is certainly true on most
> >>> CPUs.
> >>> I see the 2 important following issues:
> >>>
> >>> 1) the "rte_eth_rxtx_callback" data structure associated with a removed
> >>> RX callback could still be accessed in the callback parsing loop of the
> >>> function "rte_eth_rx_burst()" after having been freed in parallel.
> >>>
> >>> BUT SUCH A BAD SITUATION WILL NOT CURRENTLY HAPPEN, THANKS TO THE NICE
> >>> MEMORY LEAK BUG in the function "rte_eth_remove_rx_callback()"  that
> >>> does not free the "rte_eth_rxtx_callback" data structure associated with
> >>> the removed callback !
> >>
> >> Yes, though it is documented behaviour, someone can probably
> >> refer it as a feature, not a bug ;)
> >>
> >
> > +1
> > This is definitely not a bug, this is absolutely by design. One may argue 
> > with
> > the design, but it was done for a definite reason, so as to avoid paying the
> > penalty of having locks. It pushes more responsibility onto the app, but it
> > does allow the app to choose the best solution for managing the freeing of
> > memory for its situation. The alternative is to force all apps to pay the 
> > cost
> > of having locks, even if better options for freeing the memory are 
> > available.
> >
> > /Bruce
> >
> 
> -1 (not to say 0x)
> 
> This is definitely an API design bug !
> I would say that if you don't know how to free a resource that you
> allocate, it is very likely that you are wrong allocating it.
> And this is exactly what happens here with RX/TX callback data structures.
> This problem can easily be addressed by just changing the API as follows:
> 
> Change
>  void *
>  rte_eth_add_rx_callback(uint8_t port_id, uint16_t queue_id,
>  rte_rx_callback_fn fn, void *user_param)
> 
> to
>  int
>  rte_eth_add_rx_callback(uint8_t port_id, uint16_t queue_id,
>  struct rte_eth_rxtx_callback *cb)
> 
> In addition of solving the problem, this approach makes the API
> consistent and let the application allocate "rte_eth_rxtx_callback" data
> structures through any appropriate mean.

That might make API a bit cleaner, but I don't see how it fixes the generic 
problem:
there is still no way to know by the upper layer when it is safe to free/re-use
removed callback, but to make sure that all IO on that queue is stopped
(I.E. some external synchronisation around the queue).   
As you said in previous mail: 
> This is an example of a well-known more generic object deletion problem
> which must arrange to guarantee that a deleted object is not used and
> not accessible for use anymore before being actually deleted (freed, for
> instance).
Konstantin

> 
> Regards,
> Ivan
> 
> --
> Ivan Boule
> 6WIND Development Engineer

[dpdk-dev] supported packet types

2016-06-15 Thread Ananyev, Konstantin


Hi Olivier,

> Hi Konstantin,
> 
> On 04/29/2016 06:03 PM, Ananyev, Konstantin wrote:
> >> The following commit introduces a function to list the supported
> >> packet types of a device:
> >>
> >>   http://dpdk.org/browse/dpdk/commit/?id=78a38edf66
> >>
> >> I would like to know what does "supported" precisely mean.
> >> Is it:
> >>
> >> 1/ - if a ptype is marked as supported, the driver MUST set
> >>  this ptype if the packet matches the format described in rte_mbuf.h
> >>
> >>-> if the ptype is not recognized, the application is sure
> >>   that the packet is not one of the supported ptype
> >>-> but this is difficult to take advantage of this inside an
> >>   application that supports several different ports model
> >>   that do not support the same packet types
> >>
> >> 2/ - if a ptype is marked as supported, the driver CAN set
> >>  the ptype if the packet matches the format described in rte_mbuf.h
> >>
> >>-> if a ptype is not recognized, the application does a software
> >>   fallback
> >>-> in this case, it would useless to have the get_supported_ptype()
> >>
> >> Can you confirm if the PMDs and l3fwd (the only user) expect 1/
> >> or 2/ ?
> >
> > 1)
> >
> >>
> >> Can you elaborate on the advantages of having this API?
> >
> > Application can rely on information provided by PMD avoid parsing packet 
> > manually to recognise it's pytpe.
> >
> >>
> >> And a supplementary question: if a ptype is not marked as supported,
> >> is it forbidden for a driver to set this ptype anyway?
> >
> > I suppose it is not forbidden, but there is no guarantee from PMD that it
> > would be able to recognise that ptype.
> >
> > Konstantin
> >
> >> Because we can
> >> imagine a hardware that can only recognize packets in some conditions
> >> (ex: can recognize IPv4 if there is no vlan).
> >>
> >> I think properly defining the meaning of "supported" is mandatory
> >> to have an application beeing able to use this feature, and avoid
> >> PMDs to behave differently because the API is unclear (like we've
> >> already seen for other features).
> 
> Back on this. I've made some tests with ixgbe, and I'm afraid it
> will be difficult to ensure that when a ptype is advertised, it will
> always be set in the mbuf, whatever the layers below. Here are some
> examples:
> 

1)

> - double vlans
> 
> Ether(type=0x88A8)/Dot1Q(vlan=0x666)/Dot1Q(vlan=0x666)/IP()/UDP()/Raw("x"*32)
>   ixgbe advertises RTE_PTYPE_ETHER in supported ptypes
>   returned ptype: RTE_PTYPE_UNKNOWN
>   should be: L2_ETHER
>   (this works with any unknown ethtype)

2)

> 
> - ip6 in ip6 tunnel
>   ixgbe advertises RTE_PTYPE_TUNNEL_IP in supported ptypes
>   Ether()/IPv6()/IPv6()/UDP()/Raw("x"*32)
>   returned ptype: L2_ETHER L3_IPV6
>   should be: L2_ETHER L3_IPV6 TUNNEL_IP INNER_L3_IPV6 INNER_L4_UDP

3)

> 
> - ip options
>   Ether()/IP(options=IPOption('\x83\x03\x10'))/UDP()/Raw("x"*32)
>   returned ptype: RTE_PTYPE_UNKNOWN
>   should be: L2_ETHER L3_IPV4_EXT L4_UDP

4)

> 
> - ip inside ip with options
>   Ether()/IP(options=IPOption('\x83\x03\x10'))/IP()/UDP()/Raw("x"*32)
>   returned ptype: L2_ETHER L3_IPV4_EXT
>   should be: L2_ETHER L3_IPV4_EXT TUNNEL_IP INNER_L3_IPV4 INNER_L4_UDP


I am marked your test-cases with numbers to simplify referencing to them.
1) & 3) - I believe the reason is a bug in our ptype code inside ixgbe PMD :(
I submitted a patch:
http://dpdk.org/dev/patchwork/patch/13820/
Could you give it a try?
I think it should fix these issues.

2) & 4) - unfortunately ixgbe HW supports only ipv4->ipv6 tunneling.
All other combinations are not supported.
Right now I didn't decide what is a best way to address this problem.
Two thoughts  I have right now:
- either  remove tunnelling recognition (RTE_PTYPE_TUNNEL_IP and 
RTE_PTYPE_INNER_*)
from supported ptypes in ixgbe_dev_supported_ptypes_get().
- or split RTE_PTYPE_TUNNEL_IP into RTE_PTYPE_TUNNEL_IPV4 and 
RTE_PTYPE_TUNNEL_IPV6.
But that looks a bit ugly, again wuld probably be required for VXLAN/GRE and 
might be other
tunnelling protocols in future.
So don't really like any of them.
If you have any better idea, please shout.

> 
> I'm sure we can find more examples that do not return the expected
> result, knowing that ixgbe is probably one of the most complete
> driver in dpdk. I'm afraid of the behavior for other PMDs :)

It is in gene

[dpdk-dev] [PATCH v9 1/8] ethdev: use locks to protect Rx/Tx callback lists

2016-06-15 Thread Ananyev, Konstantin

Hi Ivan,

> -Original Message-
> From: Ivan Boule [mailto:ivan.boule at 6wind.com]
> Sent: Wednesday, June 15, 2016 1:15 PM
> To: Thomas Monjalon; Ananyev, Konstantin
> Cc: Pattan, Reshma; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v9 1/8] ethdev: use locks to protect Rx/Tx 
> callback lists
> 
> On 06/15/2016 10:48 AM, Thomas Monjalon wrote:
> 
> >>
> >>> I think the read access would need locking but we do not want it
> >>> in fast path.
> >>
> >> I don't think it would be needed.
> >> As I said - read/write interaction didn't change from what we have right 
> >> now.
> >> But if you have some particular scenario in mind that you believe would 
> >> cause
> >> a race condition - please speak up.
> >
> > If we add/remove a callback during a burst? Is it possible that the next
> > pointer would have a wrong value leading to a crash?
> > Maybe we need a comment to state that we should not alter burst
> > callbacks while running burst functions.
> >
> 
> Hi Reshma,
> 
> You claim that the "rte_eth_rx_cb_lock" does not need to be hold in the
> function "rte_eth_rx_burst()" while parsing the "post_rx_burst_cbs" list
> of RX callbacks associated with the polled RX queue to safely remove RX
> callback(s) in parallel.
> The problem is not [only] with the setting and the loading of "cb->next"
> that you assume to be atomic operations, which is certainly true on most
> CPUs.
> I see the 2 important following issues:
> 
> 1) the "rte_eth_rxtx_callback" data structure associated with a removed
> RX callback could still be accessed in the callback parsing loop of the
> function "rte_eth_rx_burst()" after having been freed in parallel.
> 
> BUT SUCH A BAD SITUATION WILL NOT CURRENTLY HAPPEN, THANKS TO THE NICE
> MEMORY LEAK BUG in the function "rte_eth_remove_rx_callback()"  that
> does not free the "rte_eth_rxtx_callback" data structure associated with
> the removed callback !

Yes, though it is documented behaviour, someone can probably
refer it as a feature, not a bug ;)

> 
> 2) As a consequence of 1), RX callbacks can be invoked/executed
> while/after being removed.
> If the application must free resources that it dynamically allocated to
> be used by the RX callback being removed, how to guarantee that the last
> invocation of that RX callback has been completed and that such a
> callback will never be invoked again, so that the resources can safely
> be freed?
> 
> This is an example of a well-known more generic object deletion problem
> which must arrange to guarantee that a deleted object is not used and
> not accessible for use anymore before being actually deleted (freed, for
> instance).

Yes, and as I wrote in other mail, IMO it needs to be addressed.
But again it is already existing problem in rte_ethdev,
and I think it shouldn't stop pdump integration.
Konstantin

> 
> Note that a lock cannot be used in the execution path of the
> rte_eth_rx_burst() function to address this issue, as locks MUST NEVER
> be introduced in the RX/TX path of the DPDK framework.
> 
> Of course, the same issues stand for TX callbacks.
> 
> Regards,
> Ivan
> 
> 
> 
> --
> Ivan Boule
> 6WIND Development Engineer

1 2 3 4 5 6 7 8 9 >

1 - 100 of 815 matches

Mail list logo