[dpdk-dev] [PATCHv3] examples/l3fwd: em: use hw accelerated crc hash function for arm64

2016-10-25 Thread Thomas Monjalon
2016-10-14 16:40, Hemant Agrawal:
> if machine-level CRC extensions are available, offload the
> hash to machine-provided functions, e.g. the armv8-a CRC extensions
> support it
> 
> Signed-off-by: Hemant Agrawal 
> Reviewed-by: Jerin Jacob 

Applied, thanks


[dpdk-dev] [PATCH v2 1/3] drivers: add name alias registration for rte_driver

2016-10-25 Thread Thomas Monjalon
2016-10-25 01:41, Yuanhan Liu:
> On Mon, Oct 24, 2016 at 12:22:21PM -0400, Jan Blunck wrote:
> > This adds infrastructure for drivers to allow being requested by an alias
> > so that a renamed driver can still get loaded by its legacy name.
> > 
> > Signed-off-by: Jan Blunck 
> > Reviewed-by: Maxime Coquelin 
> > Tested-by: Pablo de Lara 
> 
> Series Reviewed-by: Yuanhan Liu 

Applied, thanks


[dpdk-dev] [PATCH v5 06/21] eal/soc: introduce very essential SoC infra definitions

2016-10-25 Thread Shreyansh Jain
On Tuesday 25 October 2016 11:06 AM, Shreyansh Jain wrote:
> Hello Jan,
>
> On Monday 24 October 2016 09:51 PM, Jan Viktorin wrote:
>> On Mon, 24 Oct 2016 17:29:25 +0530
>> Shreyansh Jain  wrote:
>>
>>> From: Jan Viktorin 
>>>
>>> Define initial structures and functions for the SoC infrastructure.
>>> This patch supports only a very minimal set of functions for now.
>>> More features will be added in the following commits.
>>>
>>> Includes rte_device/rte_driver inheritance of
>>> rte_soc_device/rte_soc_driver.
>>>
>>> Signed-off-by: Jan Viktorin 
>>> Signed-off-by: Shreyansh Jain 
>>> Signed-off-by: Hemant Agrawal 
>>> ---
>>>  app/test/Makefile   |   1 +
>>>  app/test/test_soc.c |  90 +
>>>  lib/librte_eal/common/Makefile  |   2 +-
>>>  lib/librte_eal/common/eal_private.h |   4 +
>>>  lib/librte_eal/common/include/rte_soc.h | 138
>>> 
>>>  5 files changed, 234 insertions(+), 1 deletion(-)
>>>  create mode 100644 app/test/test_soc.c
>>>  create mode 100644 lib/librte_eal/common/include/rte_soc.h
>>>
>>> diff --git a/app/test/Makefile b/app/test/Makefile
>>
>> [...]
>>
>>> +++ b/lib/librte_eal/common/include/rte_soc.h
>>> @@ -0,0 +1,138 @@
>>
>> [...]
>>
>>> +
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +
>>> +#include 
>>> +#include 
>>> +
>>> +struct rte_soc_id {
>>> +const char *compatible; /**< OF compatible specification */
>>> +uint64_t priv_data; /**< SoC Driver specific data */
>>
>> Do you expect this to be a pointer?
>
> A 64-bit entry, which can be typecast to a pointer by implementations,
> if required. Or it might as well remain a 64-bit entry used as an ID.
>
>>
>>> +};
>>> +
>>
>> [...]
>>
>>> +
>>> +/**
>>> + * Initialization function for the driver called during SoC probing.
>>> + */
>>> +typedef int (soc_devinit_t)(struct rte_soc_driver *, struct
>>> rte_soc_device *);
>>> +
>>> +/**
>>> + * Uninitialization function for the driver called during hotplugging.
>>> + */
>>> +typedef int (soc_devuninit_t)(struct rte_soc_device *);
>>> +
>>> +/**
>>> + * A structure describing a SoC driver.
>>> + */
>>> +struct rte_soc_driver {
>>> +TAILQ_ENTRY(rte_soc_driver) next;  /**< Next in list */
>>> +struct rte_driver driver;  /**< Inherit core driver. */
>>> +soc_devinit_t *devinit;/**< Device initialization */
>>> +soc_devuninit_t *devuninit;/**< Device uninitialization */
>>
>> Shouldn't those functions be named probe/remove?
>
> Indeed. I think there was a comment on v4 as well - I thought I had
> fixed it but it seems I have mixed up my patches. I will send v6
> immediately with this fixed. Thanks for pointing out.

Ah, I just noticed that I did change it - but in Patch 11. Ideally, it 
should have been done here itself. My bad.

>
>>
>>> +const struct rte_soc_id *id_table; /**< ID table, NULL
>>> terminated */
>>> +};
>>> +
>>
>> [...]
>>
>>> +#endif
>>
>>
>>
>
> -
> Shreyansh
>

-
Shreyansh


[dpdk-dev] [PATCH 2/2] examples/tep_term: Fix packet len for multi-seg mbuf

2016-10-25 Thread Thomas Monjalon
> > For a multi-segment mbuf, ip->total_length should be pkt_len minus the
> > Ethernet header length.
> > 
> > Fixes: 4abe471ed6fc("examples/tep_term: implement VXLAN processing")
> > 
> > Signed-off-by: Michael Qiu 
> 
> Acked-by: Jianfeng Tan 

Applied, thanks


[dpdk-dev] [PATCH v6 2/2] app/test_pmd: fix DCB configuration

2016-10-25 Thread Bernard Iremonger
Data Center Bridging (DCB) configuration fails when SRIOV is
enabled if nb_rxq and nb_txq are not set to 1.

When dcb_mode is DCB_VT_ENABLED, set nb_rxq and nb_txq to 1.

The failure occurs during configuration of the ixgbe PMD when
it is started, in the ixgbe_check_mq_mode function, if nb_rxq
and nb_txq are not set to 1.

Fixes: 2a977b891f99 ("app/testpmd: fix DCB configuration")

Signed-off-by: Bernard Iremonger 
Acked-by: Wenzhuo Lu 
---
 app/test-pmd/testpmd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 6185be6..ee567c3 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2002,8 +2002,8 @@ init_port_dcb_config(portid_t pid,
 * and has the same number of rxq and txq in dcb mode
 */
if (dcb_mode == DCB_VT_ENABLED) {
-   nb_rxq = rte_port->dev_info.max_rx_queues;
-   nb_txq = rte_port->dev_info.max_tx_queues;
+   nb_rxq = 1;
+   nb_txq = 1;
} else {
/*if vt is disabled, use all pf queues */
if (rte_port->dev_info.vmdq_pool_base == 0) {
-- 
2.10.1



[dpdk-dev] [PATCH v6 1/2] net/ixgbe: support multiqueue mode VMDq DCB with SRIOV

2016-10-25 Thread Bernard Iremonger
The following changes have been made to allow Data Center Bridging
(DCB) configuration when SRIOV is enabled.

Modify ixgbe_check_mq_mode function,
when SRIOV is enabled, enable mq_mode
ETH_MQ_RX_VMDQ_DCB and ETH_MQ_TX_VMDQ_DCB.

Modify ixgbe_dcb_tx_hw_config function,
replace the struct ixgbe_hw parameter with a
struct rte_eth_dev parameter and handle SRIOV enabled.

Modify ixgbe_dev_mq_rx_configure function,
when SRIOV is enabled, enable mq_mode ETH_MQ_RX_VMDQ_DCB.

Modify ixgbe_configure_dcb function,
revise check on dev->data->nb_rx_queues.

Signed-off-by: Rahul R Shah 
Signed-off-by: Bernard Iremonger 
Acked-by: Wenzhuo Lu 
---
 drivers/net/ixgbe/ixgbe_ethdev.c | 11 ++-
 drivers/net/ixgbe/ixgbe_rxtx.c   | 35 ++-
 2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 4ca5747..4d5ce83 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -1977,6 +1977,9 @@ ixgbe_check_mq_mode(struct rte_eth_dev *dev)
/* check multi-queue mode */
switch (dev_conf->rxmode.mq_mode) {
case ETH_MQ_RX_VMDQ_DCB:
+   PMD_INIT_LOG(INFO, "ETH_MQ_RX_VMDQ_DCB mode supported 
in SRIOV");
+   dev->data->dev_conf.rxmode.mq_mode = ETH_MQ_RX_VMDQ_DCB;
+   break;
case ETH_MQ_RX_VMDQ_DCB_RSS:
/* DCB/RSS VMDQ in SRIOV mode, not implement yet */
PMD_INIT_LOG(ERR, "SRIOV active,"
@@ -2012,11 +2015,9 @@ ixgbe_check_mq_mode(struct rte_eth_dev *dev)

switch (dev_conf->txmode.mq_mode) {
case ETH_MQ_TX_VMDQ_DCB:
-   /* DCB VMDQ in SRIOV mode, not implement yet */
-   PMD_INIT_LOG(ERR, "SRIOV is active,"
-   " unsupported VMDQ mq_mode tx %d.",
-   dev_conf->txmode.mq_mode);
-   return -EINVAL;
+   PMD_INIT_LOG(INFO, "ETH_MQ_TX_VMDQ_DCB mode supported 
in SRIOV");
+   dev->data->dev_conf.txmode.mq_mode = ETH_MQ_TX_VMDQ_DCB;
+   break;
default: /* ETH_MQ_TX_VMDQ_ONLY or ETH_MQ_TX_NONE */
dev->data->dev_conf.txmode.mq_mode = 
ETH_MQ_TX_VMDQ_ONLY;
break;
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 2ce8234..b2d9f45 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1,7 +1,7 @@
 /*-
  *   BSD LICENSE
  *
- *   Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
  *   Copyright 2014 6WIND S.A.
  *   All rights reserved.
  *
@@ -3313,15 +3313,16 @@ ixgbe_vmdq_dcb_configure(struct rte_eth_dev *dev)

 /**
  * ixgbe_dcb_config_tx_hw_config - Configure general DCB TX parameters
- * @hw: pointer to hardware structure
+ * @dev: pointer to eth_dev structure
  * @dcb_config: pointer to ixgbe_dcb_config structure
  */
 static void
-ixgbe_dcb_tx_hw_config(struct ixgbe_hw *hw,
+ixgbe_dcb_tx_hw_config(struct rte_eth_dev *dev,
   struct ixgbe_dcb_config *dcb_config)
 {
uint32_t reg;
uint32_t q;
+   struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);

PMD_INIT_FUNC_TRACE();
if (hw->mac.type != ixgbe_mac_82598EB) {
@@ -3340,10 +3341,17 @@ ixgbe_dcb_tx_hw_config(struct ixgbe_hw *hw,
reg |= IXGBE_MTQC_VT_ENA;
IXGBE_WRITE_REG(hw, IXGBE_MTQC, reg);

-   /* Disable drop for all queues */
-   for (q = 0; q < 128; q++)
-   IXGBE_WRITE_REG(hw, IXGBE_QDE,
-   (IXGBE_QDE_WRITE | (q << IXGBE_QDE_IDX_SHIFT)));
+   if (RTE_ETH_DEV_SRIOV(dev).active == 0) {
+   /* Disable drop for all queues in VMDQ mode*/
+   for (q = 0; q < 128; q++)
+   IXGBE_WRITE_REG(hw, IXGBE_QDE,
+   (IXGBE_QDE_WRITE | (q << 
IXGBE_QDE_IDX_SHIFT)));
+   } else {
+   /* Enable drop for all queues in SRIOV mode */
+   for (q = 0; q < 128; q++)
+   IXGBE_WRITE_REG(hw, IXGBE_QDE,
+   (IXGBE_QDE_WRITE | (q << 
IXGBE_QDE_IDX_SHIFT) | IXGBE_QDE_ENABLE));
+   }

/* Enable the Tx desc arbiter */
reg = IXGBE_READ_REG(hw, IXGBE_RTTDCS);
@@ -3378,7 +3386,7 @@ ixgbe_vmdq_dcb_hw_tx_config(struct rte_eth_dev *dev,
vmdq_tx_conf->nb_queue_pools == ETH_16_POOLS ? 0x : 
0x);

/*Configure general DCB TX parameters*/
-   ixgbe_dcb_tx_hw_config(hw, 

[dpdk-dev] [PATCH v6 0/2] net/ixgbe: VMDq DCB with SRIOV

2016-10-25 Thread Bernard Iremonger
Changes in v6:
rebase to latest master.
revise commit messages.

Changes in v5:
fix enable/disable of the QDE bit in the PFQDE register.

Changes in v4:
changes to ixgbe patch following comments.

Changes in v3:
rebase to latest master.
update commit message for ixgbe patch
add testpmd patch.

Changes in v2:
rebase to  latest master.

Bernard Iremonger (2):
  net/ixgbe: support multiqueue mode VMDq DCB with SRIOV
  app/test_pmd: fix DCB configuration

 app/test-pmd/testpmd.c   |  4 ++--
 drivers/net/ixgbe/ixgbe_ethdev.c | 11 ++-
 drivers/net/ixgbe/ixgbe_rxtx.c   | 35 ++-
 3 files changed, 30 insertions(+), 20 deletions(-)

-- 
2.10.1



[dpdk-dev] [PATCH] pdump: revert PCI device name conversion

2016-10-25 Thread Reshma Pattan
Earlier, the ethdev library created device names in the
"bus:device.func" format, so the pdump library implemented its own
conversion method for changing the user-passed device name from the
"domain:bus:device.func" format to "bus:device.func" before finding
the port id by device name through the ethdev library calls.
Now, after the ethdev and EAL rework
http://dpdk.org/dev/patchwork/patch/15855/,
device names are created in the "domain:bus:device.func" format,
so the pdump library conversion is not needed any more; hence the
corresponding code is removed.

Signed-off-by: Reshma Pattan 
---
 lib/librte_pdump/rte_pdump.c | 37 ++---
 1 file changed, 2 insertions(+), 35 deletions(-)

diff --git a/lib/librte_pdump/rte_pdump.c b/lib/librte_pdump/rte_pdump.c
index ea5ccd9..504a1ce 100644
--- a/lib/librte_pdump/rte_pdump.c
+++ b/lib/librte_pdump/rte_pdump.c
@@ -226,29 +226,6 @@ pdump_tx(uint8_t port __rte_unused, uint16_t qidx 
__rte_unused,
 }

 static int
-pdump_get_dombdf(char *device_id, char *domBDF, size_t len)
-{
-   int ret;
-   struct rte_pci_addr dev_addr = {0};
-
-   /* identify if device_id is pci address or name */
-   ret = eal_parse_pci_DomBDF(device_id, &dev_addr);
-   if (ret < 0)
-   return -1;
-
-   if (dev_addr.domain)
-   ret = snprintf(domBDF, len, "%u:%u:%u.%u", dev_addr.domain,
-   dev_addr.bus, dev_addr.devid,
-   dev_addr.function);
-   else
-   ret = snprintf(domBDF, len, "%u:%u.%u", dev_addr.bus,
-   dev_addr.devid,
-   dev_addr.function);
-
-   return ret;
-}
-
-static int
 pdump_regitser_rx_callbacks(uint16_t end_q, uint8_t port, uint16_t queue,
struct rte_ring *ring, struct rte_mempool *mp,
uint16_t operation)
@@ -885,7 +862,6 @@ rte_pdump_enable_by_deviceid(char *device_id, uint16_t 
queue,
void *filter)
 {
int ret = 0;
-   char domBDF[DEVICE_ID_SIZE];

ret = pdump_validate_ring_mp(ring, mp);
if (ret < 0)
@@ -894,11 +870,7 @@ rte_pdump_enable_by_deviceid(char *device_id, uint16_t 
queue,
if (ret < 0)
return ret;

-   if (pdump_get_dombdf(device_id, domBDF, sizeof(domBDF)) > 0)
-   ret = pdump_prepare_client_request(domBDF, queue, flags,
-   ENABLE, ring, mp, filter);
-   else
-   ret = pdump_prepare_client_request(device_id, queue, flags,
+   ret = pdump_prepare_client_request(device_id, queue, flags,
ENABLE, ring, mp, filter);

return ret;
@@ -928,17 +900,12 @@ rte_pdump_disable_by_deviceid(char *device_id, uint16_t 
queue,
uint32_t flags)
 {
int ret = 0;
-   char domBDF[DEVICE_ID_SIZE];

ret = pdump_validate_flags(flags);
if (ret < 0)
return ret;

-   if (pdump_get_dombdf(device_id, domBDF, sizeof(domBDF)) > 0)
-   ret = pdump_prepare_client_request(domBDF, queue, flags,
-   DISABLE, NULL, NULL, NULL);
-   else
-   ret = pdump_prepare_client_request(device_id, queue, flags,
+   ret = pdump_prepare_client_request(device_id, queue, flags,
DISABLE, NULL, NULL, NULL);

return ret;
-- 
2.7.4



[dpdk-dev] [PATCH v10 1/6] ethdev: add Tx preparation

2016-10-25 Thread Kulasek, TomaszX


> -Original Message-
> From: Olivier Matz [mailto:olivier.matz at 6wind.com]
> Sent: Tuesday, October 25, 2016 16:42
> To: Kulasek, TomaszX ; dev at dpdk.org
> Cc: Ananyev, Konstantin 
> Subject: Re: [PATCH v10 1/6] ethdev: add Tx preparation
> 
> Hi Tomasz,
> 
> On 10/24/2016 06:51 PM, Tomasz Kulasek wrote:
> > Added API for `rte_eth_tx_prep`
> >
> > [...]
> > --- a/lib/librte_ether/rte_ethdev.h
> > +++ b/lib/librte_ether/rte_ethdev.h
> > @@ -182,6 +182,7 @@ extern "C" {
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include "rte_ether.h"
> >  #include "rte_eth_ctrl.h"
> >  #include "rte_dev_info.h"
> > @@ -699,6 +700,8 @@ struct rte_eth_desc_lim {
> > uint16_t nb_max;   /**< Max allowed number of descriptors. */
> > uint16_t nb_min;   /**< Min allowed number of descriptors. */
> > uint16_t nb_align; /**< Number of descriptors should be aligned to.
> > */
> > +   uint16_t nb_seg_max; /**< Max number of segments per whole
> packet. */
> > +   uint16_t nb_mtu_seg_max; /**< Max number of segments per one MTU */
> 
> Sorry if it was not clear in my previous review, but I think this should
> be better explained here. You said that the "limitation of number of
> segments may differ depend of TSO/non TSO".
> 
> As an application developer, I still have some difficulties clearly
> understanding what that means. Is it the maximum number of mbuf-segments
> that contain payload for one tcp-segment sent by the device?
> 
> In that case, it looks quite difficult to verify that in an application.
> It looks that this field is not used by validate_offload(), so how should
> it be used by an application?
> 

E.g. for i40e (from xl710 datasheet):

  8.4.1 Transmit Packet in System Memory
   ...
  A few rules related to the transmit packet in host memory are:
   ...
   - A single transmit packet may span up to 8 buffers (up to 8 data descriptors
 per packet including both the header and payload buffers).
   - The total number of data descriptors for the whole TSO (explained later on 
in
 this chapter) is unlimited as long as each segment within the TSO obeys
 the previous rule (up to 8 data descriptors per segment for both the TSO
 header and the segment payload buffers).
  ...

+#define I40E_TX_MAX_SEG UINT8_MAX
+#define I40E_TX_MAX_MTU_SEG 8

For the ixgbe driver there is one limitation, the same for TSO and non-TSO,
and it is always 40 - WTHRESH.

Such differences mean that these values should be checked in the tx_prep
callback (when it is required) for a specific device (e.g. i40e, ixgbe).

If other devices do not have such limitations, or their limits are higher
than DPDK can handle (nb_segs is uint8_t), then for performance reasons
they do not need to be checked.

validate_offload() is for the general checks; the driver-specific
implementation is the part in the callback function, e.g. in

  uint16_t ixgbe_prep_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts)

for ixgbe.

Values in the "struct rte_eth_desc_lim" provides an information to the
application, which should to let you to not create malicious packets or
at least to limit their number.

Using Tx preparation API, these checks are made transparently for the
application and when nb_segs are out of limits, tx_prep function fails.
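
For illustration, a minimal sketch of the intended application-side flow,
assuming the signature proposed in this series; port_id, queue_id, pkts and
nb_pkts are the application's own variables and handle_bad_packet() is a
hypothetical application handler:

	/* Sketch only, based on the API proposed in this series. */
	uint16_t nb_prep = rte_eth_tx_prep(port_id, queue_id, pkts, nb_pkts);

	if (unlikely(nb_prep < nb_pkts)) {
		/* pkts[nb_prep] does not meet the device requirements
		 * (e.g. nb_segs above nb_seg_max / nb_mtu_seg_max);
		 * rte_errno indicates the reason. */
		handle_bad_packet(pkts[nb_prep]); /* hypothetical handler */
	}

	uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);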

> 
> >  };
> >
> >  /**
> > @@ -1188,6 +1191,11 @@ typedef uint16_t (*eth_tx_burst_t)(void *txq,
> >uint16_t nb_pkts);
> >  /**< @internal Send output packets on a transmit queue of an Ethernet
> > device. */
> >
> > +typedef uint16_t (*eth_tx_prep_t)(void *txq,
> > +  struct rte_mbuf **tx_pkts,
> > +  uint16_t nb_pkts);
> > +/**< @internal Prepare output packets on a transmit queue of an
> > +Ethernet device. */
> > +
> >  typedef int (*flow_ctrl_get_t)(struct rte_eth_dev *dev,
> >struct rte_eth_fc_conf *fc_conf);  /**< @internal
> Get
> > current flow control parameter on an Ethernet device */ @@ -1622,6
> > +1630,7 @@ struct rte_eth_rxtx_callback {  struct rte_eth_dev {
> > eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive function.
> */
> > eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit function.
> > */
> > +   eth_tx_prep_t tx_pkt_prep; /**< Pointer to PMD transmit prepare
> > +function. */
> > struct rte_eth_dev_data *data;  /**< Pointer to device data */
> > const struct eth_driver *driver;/**< Driver for this device */
> > const struct eth_dev_ops *dev_ops; /**< Functions exported by PMD */
> > @@ -2816,6 +2825,93 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t
> queue_id,
> > return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts,
> > nb_pkts);  }
> >
> > +/**
> > + * Process a burst of output packets on a transmit queue of an Ethernet
> device.
> > + *
> > + * The rte_eth_tx_prep() function is invoked to prepare output
> > +packets to be
> > + * transmitted on the output queue *queue_id* of 

[dpdk-dev] mbuf changes

2016-10-25 Thread Shreyansh Jain
On Monday 24 October 2016 09:55 PM, Bruce Richardson wrote:
> On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
>>
>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup  
>>> wrote:
>>>
>>> First of all: Thanks for a great DPDK Userspace 2016!
>>>
>>>
>>>
>>> Continuing the Userspace discussion about Olivier Matz's proposed mbuf 
>>> changes...
>
> Thanks for keeping the discussion going!
>>>
>>>
>>>
>>> 1.
>>>
>>> Stephen Hemminger had a noteworthy general comment about keeping metadata 
>>> for the NIC in the appropriate section of the mbuf: Metadata generated by 
>>> the NIC's RX handler belongs in the first cache line, and metadata required 
>>> by the NIC's TX handler belongs in the second cache line. This also means 
>>> that touching the second cache line on ingress should be avoided if 
>>> possible; and Bruce Richardson mentioned that for this reason m->next was 
>>> zeroed on free().
>>>
> Thinking about it, I suspect there are more fields we can reset on free
> to save time on alloc. Refcnt, as discussed below is one of them, but so
> too could be the nb_segs field and possibly others.
>
>>>
>>>
>>> 2.
>>>
>>> There seemed to be consensus that the size of m->refcnt should match the 
>>> size of m->port because a packet could be duplicated on all physical ports 
>>> for L3 multicast and L2 flooding.
>>>
>>> Furthermore, although a single physical machine (i.e. a single server) with 
>>> 255 physical ports probably doesn't exist, it might contain more than 255 
>>> virtual machines with a virtual port each, so it makes sense extending 
>>> these mbuf fields from 8 to 16 bits.
>>
>> I thought we also talked about removing the m->port from the mbuf as it is 
>> not really needed.
>>
> Yes, this was mentioned, and also the option of moving the port value to
> the second cacheline, but it appears that NXP are using the port value
> in their NIC drivers for passing in metadata, so we'd need their
> agreement on any move (or removal).

I am not sure where NXP's NIC came into the picture on this, but now that it
is highlighted, this field is required for the libevent implementation [1].

A scheduler sending an event, which can be a packet, would only have the
information of a flow_id. Matching this back to a port without mbuf->port
would be very difficult (costly). There may be a way around this, but at
least in the current proposal I think port would be important to have - even
if in the second cache line.

But, off the top of my head, as of now it is not being used for any 
specific purpose in NXP's PMD implementation.

Even the SoC patches don't necessarily rely on it except using it 
because it is available.

@Bruce: where did you get the NXP context here from?

[1] http://dpdk.org/ml/archives/dev/2016-October/048592.html

>
>>>
>>>
>>>
>>> 3.
>>>
>>> Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to the 
>>> second cache line, which then generated questions from the audience about 
>>> the real life purpose of m->port, and if m->port could be removed from the 
>>> mbuf structure.
>>>
>>>
>>>
>>> 4.
>>>
>>> I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on first 
>>> allocation. This is based on the assumption that other mbuf fields must be 
>>> zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper than 
>>> setting it to 1.
>>>
>>> Furthermore (regardless of m->refcnt offset), I suggested that it is not 
>>> required to modify m->refcnt when allocating and freeing the mbuf, thus 
>>> saving one write operation on both alloc() and free(). However, this 
>>> assumes that m->refcnt debugging, e.g. underrun detection, is not required.
>
> I don't think it really matters what sentinel value is used for the
> refcnt because it can't be blindly assigned on free like other fields.
> However, I think 0 as first reference value becomes more awkward
> than 1, because we need to deal with underflow. Consider the situation
> where we have two references to the mbuf, so refcnt is 1, and both are
> freed at the same time. Since the refcnt is not-zero, then both cores
> will do an atomic decrement simultaneously giving a refcnt of -1. We can
> then set this back to zero before freeing, however, I'd still prefer to
> have refcnt be an accurate value so that it always stays positive, and
> we can still set it to "one" on free to avoid having to set on alloc.
>
> Also, if we set refcnt on free rather than alloc, it does set itself up
> as a good candidate for moving to the second cacheline. Fast-path
> processing does not normally update the value.
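
A minimal sketch of that scheme (decrement on free; the last owner resets the
count to "one" before returning the segment to the pool) - illustrative only,
not the current mbuf code:

	static inline void
	example_mbuf_free_seg(struct rte_mbuf *m)
	{
		/* drop one reference; only the last owner sees zero */
		if (rte_mbuf_refcnt_update(m, -1) == 0) {
			rte_mbuf_refcnt_set(m, 1); /* restore "one" on free */
			/* ... return the segment to its mempool ... */
		}
	}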
>
>>>
>>>
>>>
>>> 5.
>>>
>>> And here's something new to think about:
>>>
>>> m->next already reveals if there are more segments to a packet. Which 
>>> purpose does m->nb_segs serve that is not already covered by m->next?
>
> It is duplicate info, but nb_segs can be used to check the validity of
> the next pointer without having to read the second mbuf cacheline.
>
> Whether it's worth having is something I'm happy enough to 

[dpdk-dev] DPDK-QoS- Using un-used bandwidth within a class

2016-10-25 Thread Dumitrescu, Cristian
Hi Sreenaath,

I think you are simply hitting the known and documented performance issue with
using a single pipe. The hierarchical scheduler performance is optimized for
many pipes (hundreds, thousands, ...), not for a single pipe. The rationale is
that if you only need a handful of queues, you can simply use a single-level
class-based queueing device as opposed to the hierarchical scheduler.

Are you getting any issues when using hundreds/thousands of active (i.e. with 
traffic going through them) pipes?

Thanks,
Cristian

From: sreenaath vasudevan [mailto:sreenaat...@gmail.com]
Sent: Monday, October 24, 2016 8:05 AM
To: dev at dpdk.org; Dumitrescu, Cristian 
Subject: DPDK-QoS- Using un-used bandwidth within a class

Hi
I am using DPDK QoS and I find something strange. I am not sure if something is 
wrong with my config or my understanding of queue weights is wrong.
In my config, I am using only 1 port and 1 subport and 1 pipe. Within that 
pipe, I am using only the last class (C3). Port, subport and pipe are 
configured with 100Mbps speed.
C3 is given the entire pipe's TB rate i.e entire bandwidth in essence.
In C3, I am giving relative weights of 1:4:2:2 for the four queues q0,q1,q2,q3
When no other traffic is coming in to q0,q2,q3, I am pumping ~100Mbps in to q1. 
However, I am seeing only 40% of the traffic going through q1. In other words 
the max throughput allowed through the queue is based on its weight and the 
unused bandwidth is not used.
Cannot the unused bandwidth from q0,q2 and q3 be used for q1?

Note-
Following is the QoS config output spit out by DPDK in syslog



SCHED: Low level config for pipe profile 0:

token bucket: period = 10, credits per period = 1, size = 25000

Traffic classes: period = 125, credits per period = [0, 0, 0, 125000]

Traffic class 3 oversubscription: weight = 0

WRR cost: [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [4, 1, 2, 2]

SCHED: Low level config for subport 0:

Token bucket: period = 10, credits per period = 1, size = 25000

Traffic classes: period = 125, credits per period = [0, 0, 0, 125000]

Traffic class 3 oversubscription: wm min = 0, wm max = 0

--
regards
sreenaath


[dpdk-dev] [PATCH v2] mempool: fix search of maximum contiguous pages

2016-10-25 Thread Olivier Matz
From: Wei Dai 

paddr[i] + pg_sz always points to the start physical address of the
2nd page after paddr[i], so only up to 2 pages can be combined to
be used. With this revision, more than 2 pages can be used.

Fixes: 84121f197187 ("mempool: store memory chunks in a list")

Signed-off-by: Wei Dai 
Signed-off-by: Olivier Matz 
---
 lib/librte_mempool/rte_mempool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index 71017e1..e94e56f 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -428,7 +428,7 @@ rte_mempool_populate_phys_tab(struct rte_mempool *mp, char 
*vaddr,

/* populate with the largest group of contiguous pages */
for (n = 1; (i + n) < pg_num &&
-paddr[i] + pg_sz == paddr[i+n]; n++)
+paddr[i + n - 1] + pg_sz == paddr[i + n]; n++)
;

ret = rte_mempool_populate_phys(mp, vaddr + i * pg_sz,
-- 
2.8.1
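
To illustrate the fix, here is a small standalone sketch (not the library code)
of the grouping logic with the corrected condition. With pg_sz = 0x1000 and
paddr[] = { 0x1000, 0x2000, 0x3000, 0x8000 } it groups the first three pages,
whereas the old test paddr[i] + pg_sz == paddr[i+n] stopped after two:

	#include <rte_memory.h> /* phys_addr_t */

	/* Count how many pages starting at paddr[0] are physically contiguous. */
	static unsigned
	count_contig_pages(const phys_addr_t *paddr, unsigned pg_num, size_t pg_sz)
	{
		unsigned n;

		for (n = 1; n < pg_num && paddr[n - 1] + pg_sz == paddr[n]; n++)
			;
		return n;
	}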



[dpdk-dev] [PATCH] mempool: fix search of maximum contiguous pages

2016-10-25 Thread Olivier Matz
Hi Thomas,

On 10/25/2016 04:37 PM, Thomas Monjalon wrote:
> 2016-10-13 17:05, Olivier MATZ:
>> Hi Wei,
>>
>> On 10/13/2016 02:31 PM, Ananyev, Konstantin wrote:
>>>

>>> diff --git a/lib/librte_mempool/rte_mempool.c
>>> b/lib/librte_mempool/rte_mempool.c
>>> index 71017e1..e3e254a 100644
>>> --- a/lib/librte_mempool/rte_mempool.c
>>> +++ b/lib/librte_mempool/rte_mempool.c
>>> @@ -426,9 +426,12 @@ rte_mempool_populate_phys_tab(struct
>>> rte_mempool *mp, char *vaddr,
>>>
>>> for (i = 0; i < pg_num && mp->populated_size < mp->size; i += 
>>> n) {
>>>
>>> +   phys_addr_t paddr_next;
>>> +   paddr_next = paddr[i] + pg_sz;
>>> +
>>> /* populate with the largest group of contiguous pages 
>>> */
>>> for (n = 1; (i + n) < pg_num &&
>>> -paddr[i] + pg_sz == paddr[i+n]; n++)
>>> +paddr_next == paddr[i+n]; n++, paddr_next 
>>> += pg_sz)
>>> ;
>>
>> Good catch.
>> Why not just paddr[i + n - 1] != paddr[i + n]?
>
> Sorry, I meant 'paddr[i + n - 1] + pg_sz == paddr[i+n]' off course.
>
>> Then you don't need extra variable (paddr_next) here.
>> Konstantin

 Thank you, Konstantin
 'paddr[i + n - 1] + pg_sz = paddr[i + n]' also can fix it and have 
 straight meaning.
 But I assume that my revision with paddr_next += pg_sz may have a bit 
 better performance.
>>>
>>> I don't think there would be any real difference, again it is not 
>>> performance critical code-path.
>>>
 By the way, paddr[i] + n * pg_sz = paddr[i + n] can also resolve it.
>>>
>>> Yes, that's one seems even better for me - make things more clear.
>>
>> Thank you for fixing this.
>>
>> My vote would go for "addr[i + n - 1] + pg_sz == paddr[i + n]"
>>
>> If you feel "paddr[i] + n * pg_sz = paddr[i + n]" is clearer, I have no 
>> problem with it either.
> 
> No answer from Wei Dai.
> Please Olivier advise what to do with this patch.
> Thanks
> 

I think it's good to have this fix in 16.11.
I'm sending a v2 based on Wei's patch.

Olivier



[dpdk-dev] mbuf changes

2016-10-25 Thread Thomas Monjalon
2016-10-25 14:32, Ramia, Kannan Babu:
> I didn't get your question. The only information exchanged between the stages
> is the mbuf pointer, so the information has to be in the mbuf. Whether it's
> part of the meta data or in a private area in the packet buffer headroom - is
> that the question you are asking? The private area is application specific,
> while I am looking for the port information getting updated from driver to
> application, and it's generic to multiple applications.

Thanks, your answer is perfect (except it is top-posted ;)

The discussion is more about performance. So you agree that the port
information is not performance sensitive.
It appears that it could be information filled in a private area
at app level, because the application knows which port it is polling.
However we may need a way to put some generic information somewhere,
not in the first cache lines.


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com] 
> Sent: Tuesday, October 25, 2016 6:54 PM
> To: Ramia, Kannan Babu 
> Cc: dev at dpdk.org; Olivier Matz ; Morten Brørup 
> ; Ananyev, Konstantin  intel.com>; Richardson, Bruce ; Wiles, Keith 
> 
> Subject: Re: [dpdk-dev] mbuf changes
> 
> 2016-10-25 13:04, Ramia, Kannan Babu:
> > Port field is important meta information for application use, like
> > CGNAT, vEPC functions, etc.
> > I strongly recommend keeping the field in the mbuf meta data.
> 
> Have you tried to move this field outside of the mbuf?
> What is the performance degradation?
> We need more information than some assumptions.
> 




[dpdk-dev] mbuf changes

2016-10-25 Thread Olivier Matz


On 10/25/2016 04:25 PM, Morten Brørup wrote:
> It might also make sense documenting the mbuf fields in more detail 
> somewhere. E.g. the need for nb_segs in the NIC's TX handler.

Good point, I'll do it at the same time as the first rework
proposal.



[dpdk-dev] [PATCH v10 1/6] ethdev: add Tx preparation

2016-10-25 Thread Olivier Matz
Hi Tomasz,

On 10/24/2016 06:51 PM, Tomasz Kulasek wrote:
> Added API for `rte_eth_tx_prep`
> 
> [...]
> --- a/lib/librte_ether/rte_ethdev.h
> +++ b/lib/librte_ether/rte_ethdev.h
> @@ -182,6 +182,7 @@ extern "C" {
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "rte_ether.h"
>  #include "rte_eth_ctrl.h"
>  #include "rte_dev_info.h"
> @@ -699,6 +700,8 @@ struct rte_eth_desc_lim {
>   uint16_t nb_max;   /**< Max allowed number of descriptors. */
>   uint16_t nb_min;   /**< Min allowed number of descriptors. */
>   uint16_t nb_align; /**< Number of descriptors should be aligned to. */
> + uint16_t nb_seg_max; /**< Max number of segments per whole packet. 
> */
> + uint16_t nb_mtu_seg_max; /**< Max number of segments per one MTU */

Sorry if it was not clear in my previous review, but I think this should
be better explained here. You said that the "limitation of number
of segments may differ depend of TSO/non TSO".

As an application developer, I still have some difficulties
clearly understanding what that means. Is it the maximum number
of mbuf-segments that contain payload for one tcp-segment sent by
the device?

In that case, it looks quite difficult to verify that in an application.
It looks that this field is not used by validate_offload(), so how
should it be used by an application?


>  };
>  
>  /**
> @@ -1188,6 +1191,11 @@ typedef uint16_t (*eth_tx_burst_t)(void *txq,
>  uint16_t nb_pkts);
>  /**< @internal Send output packets on a transmit queue of an Ethernet 
> device. */
>  
> +typedef uint16_t (*eth_tx_prep_t)(void *txq,
> +struct rte_mbuf **tx_pkts,
> +uint16_t nb_pkts);
> +/**< @internal Prepare output packets on a transmit queue of an Ethernet 
> device. */
> +
>  typedef int (*flow_ctrl_get_t)(struct rte_eth_dev *dev,
>  struct rte_eth_fc_conf *fc_conf);
>  /**< @internal Get current flow control parameter on an Ethernet device */
> @@ -1622,6 +1630,7 @@ struct rte_eth_rxtx_callback {
>  struct rte_eth_dev {
>   eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive function. */
>   eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit function. */
> + eth_tx_prep_t tx_pkt_prep; /**< Pointer to PMD transmit prepare 
> function. */
>   struct rte_eth_dev_data *data;  /**< Pointer to device data */
>   const struct eth_driver *driver;/**< Driver for this device */
>   const struct eth_dev_ops *dev_ops; /**< Functions exported by PMD */
> @@ -2816,6 +2825,93 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
>   return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, 
> nb_pkts);
>  }
>  
> +/**
> + * Process a burst of output packets on a transmit queue of an Ethernet 
> device.
> + *
> + * The rte_eth_tx_prep() function is invoked to prepare output packets to be
> + * transmitted on the output queue *queue_id* of the Ethernet device 
> designated
> + * by its *port_id*.
> + * The *nb_pkts* parameter is the number of packets to be prepared which are
> + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
> + * allocated from a pool created with rte_pktmbuf_pool_create().
> + * For each packet to send, the rte_eth_tx_prep() function performs
> + * the following operations:
> + *
> + * - Check if packet meets devices requirements for tx offloads.
> + *
> + * - Check limitations about number of segments.
> + *
> + * - Check additional requirements when debug is enabled.
> + *
> + * - Update and/or reset required checksums when tx offload is set for 
> packet.
> + *
> + * The rte_eth_tx_prep() function returns the number of packets ready to be
> + * sent. A return value equal to *nb_pkts* means that all packets are valid 
> and
> + * ready to be sent.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + *   The value must be a valid port id.
> + * @param queue_id
> + *   The index of the transmit queue through which output packets must be
> + *   sent.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param tx_pkts
> + *   The address of an array of *nb_pkts* pointers to *rte_mbuf* structures
> + *   which contain the output packets.
> + * @param nb_pkts
> + *   The maximum number of packets to process.
> + * @return
> + *   The number of packets correct and ready to be sent. The return value 
> can be
> + *   less than the value of the *tx_pkts* parameter when some packet doesn't
> + *   meet devices requirements with rte_errno set appropriately.
> + */

Inserting here the previous comment:

>> Can we add the constraint that invalid packets are left untouched?
>>
>> I think most of the time there will be a software fallback in that case,
>> so it would be good to ensure that this function does not change the flags
>> or the packet data.
> 
> In current 

[dpdk-dev] [PATCH] mempool: fix search of maximum contiguous pages

2016-10-25 Thread Thomas Monjalon
2016-10-13 17:05, Olivier MATZ:
> Hi Wei,
> 
> On 10/13/2016 02:31 PM, Ananyev, Konstantin wrote:
> >
> >>
> > diff --git a/lib/librte_mempool/rte_mempool.c
> > b/lib/librte_mempool/rte_mempool.c
> > index 71017e1..e3e254a 100644
> > --- a/lib/librte_mempool/rte_mempool.c
> > +++ b/lib/librte_mempool/rte_mempool.c
> > @@ -426,9 +426,12 @@ rte_mempool_populate_phys_tab(struct
> > rte_mempool *mp, char *vaddr,
> >
> > for (i = 0; i < pg_num && mp->populated_size < mp->size; i += 
> > n) {
> >
> > +   phys_addr_t paddr_next;
> > +   paddr_next = paddr[i] + pg_sz;
> > +
> > /* populate with the largest group of contiguous pages 
> > */
> > for (n = 1; (i + n) < pg_num &&
> > -paddr[i] + pg_sz == paddr[i+n]; n++)
> > +paddr_next == paddr[i+n]; n++, paddr_next 
> > += pg_sz)
> > ;
> 
>  Good catch.
>  Why not just paddr[i + n - 1] != paddr[i + n]?
> >>>
> >>> Sorry, I meant 'paddr[i + n - 1] + pg_sz == paddr[i+n]' off course.
> >>>
>  Then you don't need extra variable (paddr_next) here.
>  Konstantin
> >>
> >> Thank you, Konstantin
> >> 'paddr[i + n - 1] + pg_sz = paddr[i + n]' also can fix it and have 
> >> straight meaning.
> >> But I assume that my revision with paddr_next += pg_sz may have a bit 
> >> better performance.
> >
> > I don't think there would be any real difference, again it is not 
> > performance critical code-path.
> >
> >> By the way, paddr[i] + n * pg_sz = paddr[i + n] can also resolve it.
> >
> > Yes, that's one seems even better for me - make things more clear.
> 
> Thank you for fixing this.
> 
> My vote would go for "addr[i + n - 1] + pg_sz == paddr[i + n]"
> 
> If you feel "paddr[i] + n * pg_sz = paddr[i + n]" is clearer, I have no 
> problem with it either.

No answer from Wei Dai.
Please Olivier advise what to do with this patch.
Thanks


[dpdk-dev] [PATCH] examples/ip_pipeline: fix gcc v6.2.1 build error

2016-10-25 Thread Thomas Monjalon
> > This patch fixes the misleading-indentation error when compiling the
> > ip_pipeline app with gcc v6.2.1.
> > 
> > Fixes: 3f2c9f3bb6c6 ("examples/ip_pipeline: add TAP port")
> > 
> > Signed-off-by: Jasvinder Singh 
> > ---
> >  examples/ip_pipeline/app.h | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> Acked-by: Cristian Dumitrescu 

Applied, thanks


[dpdk-dev] [PATCH] examples/ip_pipeline: fix freeBSD build error

2016-10-25 Thread Thomas Monjalon
> > Error log:
> >  CC init.o
> >  examples/ip_pipeline/init.c:38:22: fatal error: linux/if.h: No such file or
> > directory
> >  #include 
> >   ^
> > Fixes: 3f2c9f3bb6c6 ("examples/ip_pipeline: add TAP port")
> > 
> > Signed-off-by: Jasvinder Singh 
> 
> Acked-by: Cristian Dumitrescu 

Applied, thanks



[dpdk-dev] mbuf changes

2016-10-25 Thread Morten Brørup
Lots of good arguments from Konstantin below!

Thomas also wrote something worth repeating:
"The mbuf space must be kept jealously like a DPDK treasure :)"

Maybe we should take a step away from the keyboard and consider the decision 
process and guidance criteria for mbuf members.

Here is an example of what I am thinking:

1. It is the intention to optimize mbuf handling for the most likely fast path 
through the entire system on the most common applications.
2. It is the intention that the most likely fast path through the RX handler 
and RX processing of the most common applications only needs to touch the first 
cache line.
3. The first cache line should be preferred for metadata set by most common NIC 
hardware in the RX handler.
4. The second cache line should be preferred for metadata required by most 
common NIC hardware in the TX handler.
5. The first cache line could be used for metadata used by the RX processing in 
the most likely fast path of most common applications.
6. The second cache line could be used for metadata used by the TX processing 
in the most likely fast path of most common applications.
7. The RX and TX handlers are equally important, and the RX handler should not 
be optimized at the cost of the TX handler.

Please note the use of "most common" and "most likely". It should be clear 
which cases we are optimizing for.

We should probably also define "the most common applications" (e.g. a Layer 3 
router with or without VPN) and "the most likely fast path" (e.g. a single 
packet matching a series of lookups and ultimately forwarded to a single port).

It might also make sense documenting the mbuf fields in more detail somewhere. 
E.g. the need for nb_segs in the NIC's TX handler.


Med venlig hilsen / kind regards
- Morten Brørup


> -Original Message-
> From: Ananyev, Konstantin [mailto:konstantin.ananyev at intel.com]
> Sent: Tuesday, October 25, 2016 3:39 PM
> To: Olivier Matz; Richardson, Bruce; Morten Brørup
> Cc: Adrien Mazarguil; Wiles, Keith; dev at dpdk.org; Oleg Kuporosov
> Subject: RE: [dpdk-dev] mbuf changes
> 
> 
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> > Sent: Tuesday, October 25, 2016 1:49 PM
> > To: Richardson, Bruce ; Morten Brørup
> > 
> > Cc: Adrien Mazarguil ; Wiles, Keith
> > ; dev at dpdk.org; Oleg Kuporosov
> > 
> > Subject: Re: [dpdk-dev] mbuf changes
> >
> >
> >
> > On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> > >> Comments at the end.
> > >>
> > >> Med venlig hilsen / kind regards
> > >> - Morten Brørup
> > >>
> > >>> -Original Message-
> > >>> From: Bruce Richardson [mailto:bruce.richardson at intel.com]
> > >>> Sent: Tuesday, October 25, 2016 2:20 PM
> > >>> To: Morten Brørup
> > >>> Cc: Adrien Mazarguil; Wiles, Keith; dev at dpdk.org; Olivier Matz;
> > >>> Oleg Kuporosov
> > >>> Subject: Re: [dpdk-dev] mbuf changes
> > >>>
> > >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> >  Comments inline.
> > 
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > Richardson
> > > Sent: Tuesday, October 25, 2016 1:14 PM
> > > To: Adrien Mazarguil
> > > Cc: Morten Br?rup; Wiles, Keith; dev at dpdk.org; Olivier Matz;
> > > Oleg Kuporosov
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil
> wrote:
> > >>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > >>> Comments inline.
> > >>>
> > >>> Med venlig hilsen / kind regards
> > >>> - Morten Brørup
> > >>>
> > >>>
> >  -Original Message-
> >  From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> >  Sent: Tuesday, October 25, 2016 11:39 AM
> >  To: Bruce Richardson
> >  Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz;
> >  Oleg Kuporosov
> >  Subject: Re: [dpdk-dev] mbuf changes
> > 
> >  On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> > >>> wrote:
> > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith
> > >>> wrote:
> >  [...]
> > >>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> >   wrote:
> >  [...]
> > 
> > > One other point I'll mention is that we need to have a
> > > discussion on how/where to add in a timestamp value into
> > >>> the
> > > mbuf. Personally, I think it can be in a union with the
> > > sequence
> > > number value, but I also suspect that 32-bits of a
> > >>> timestamp
> > > is not going to be enough for
> >  many.
> > >
> > > Thoughts?
> > 
> >  If we consider that timestamp representation should use
> > > nanosecond
> >  granularity, a 32-bit value may 

[dpdk-dev] [PATCH] kni: fix build with kernel 4.9

2016-10-25 Thread Thomas Monjalon
> > compile error:
> >   CC [M]  .../lib/librte_eal/linuxapp/kni/igb_main.o
> > .../lib/librte_eal/linuxapp/kni/igb_main.c:2317:21:
> > error: initialization from incompatible pointer type
> > [-Werror=incompatible-pointer-types]
> >   .ndo_set_vf_vlan = igb_ndo_set_vf_vlan,
> >  ^~~
> > 
> > Linux kernel 4.9 updates API for ndo_set_vf_vlan:
> > Linux: 79aab093a0b5 ("net: Update API for VF vlan protocol 802.1ad
> > support")
> > 
> > Use new API for Linux kernels >= 4.9
> > 
> > Signed-off-by: Ferruh Yigit 
> Tested-by: Pablo de Lara 

Applied, thanks


[dpdk-dev] [PATCH v10 11/25] eal/pci: helpers for device name parsing/update

2016-10-25 Thread Pattan, Reshma
Hi Shreyansh,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Shreyansh Jain
> Sent: Friday, September 16, 2016 5:30 AM
> To: dev at dpdk.org
> Cc: viktorin at rehivetech.com; David Marchand ;
> hemant.agrawal at nxp.com; Thomas Monjalon
> ; Shreyansh Jain 
> Subject: [dpdk-dev] [PATCH v10 11/25] eal/pci: helpers for device name
> parsing/update
> 
> From: David Marchand 
> 
> - Move rte_eth_dev_create_unique_device_name() from ether/rte_ethdev.c to
>   common/include/rte_pci.h as rte_eal_pci_device_name(). Being a common
>   method, can be used across crypto/net PCI PMDs.
> - Remove crypto specific routine and fallback to common name function.
> - Introduce a eal private Update function for PCI device naming.
> 
> Signed-off-by: David Marchand 
> [Shreyansh: Merge crypto/pci helper patches]
> Signed-off-by: Shreyansh Jain 
> ---
>  lib/librte_cryptodev/rte_cryptodev.c| 27 +++---
>  lib/librte_eal/bsdapp/eal/eal_pci.c | 49
> +
>  lib/librte_eal/common/eal_private.h | 13 +
>  lib/librte_eal/common/include/rte_pci.h | 24 
>  lib/librte_eal/linuxapp/eal/eal_pci.c   | 13 +
>  lib/librte_ether/rte_ethdev.c   | 24 +++-
>  6 files changed, 107 insertions(+), 43 deletions(-)
> 
> diff --git a/lib/librte_cryptodev/rte_cryptodev.c
> b/lib/librte_cryptodev/rte_cryptodev.c
> index 2a3b649..c81e366 100644
> --- a/lib/librte_cryptodev/rte_cryptodev.c
> +++ b/lib/librte_cryptodev/rte_cryptodev.c
> @@ -365,23 +365,6 @@ rte_cryptodev_pmd_allocate(const char *name, int
> socket_id)
>   return cryptodev;
>  }
> 
>   *
>   * This function is private to EAL.
> diff --git a/lib/librte_eal/common/include/rte_pci.h
> b/lib/librte_eal/common/include/rte_pci.h
> index cf81898..e1f695f 100644
> --- a/lib/librte_eal/common/include/rte_pci.h
> +++ b/lib/librte_eal/common/include/rte_pci.h
> @@ -82,6 +82,7 @@ extern "C" {
>  /** Formatting string for PCI device identifier: Ex: :00:01.0 */  #define
> PCI_PRI_FMT "%.4" PRIx16 ":%.2" PRIx8 ":%.2" PRIx8 ".%" PRIx8
> +#define PCI_PRI_STR_SIZE sizeof(":XX:XX.X")
> 
>  /** Short formatting string, without domain, for PCI device: Ex: 00:01.0 */
> #define PCI_SHORT_PRI_FMT "%.2" PRIx8 ":%.2" PRIx8 ".%" PRIx8 @@ -308,6
> 
> +static inline void
> +rte_eal_pci_device_name(const struct rte_pci_addr *addr,
> + char *output, size_t size)
> +{
> + RTE_VERIFY(size >= PCI_PRI_STR_SIZE);
> + RTE_VERIFY(snprintf(output, size, PCI_PRI_FMT,
> + addr->domain, addr->bus,
> + addr->devid, addr->function) >= 0); }
> +
> 
> +int
> +pci_update_device(const struct rte_pci_addr *addr) {
> + char filename[PATH_MAX];
> +
> + snprintf(filename, sizeof(filename), "%s/" PCI_PRI_FMT,
> +  pci_get_sysfs_path(), addr->domain, addr->bus, addr->devid,
> +  addr->function);
> +
> + return pci_scan_one(filename, addr->domain, addr->bus, addr->devid,
> + addr->function);
> +}
> +


Earlier, device names were created in the "bus:deviceid.function" format, as
per the ethdev API below.
Now, after the new EAL API above, the name format is "domain:bus:deviceid.func".
Was that intentional, and why is that so?

> -static int
> -rte_eth_dev_create_unique_device_name(char *name, size_t size,
> - struct rte_pci_device *pci_dev)
> -{
> - int ret;
> -
> - ret = snprintf(name, size, "%d:%d.%d",
> - pci_dev->addr.bus, pci_dev->addr.devid,
> - pci_dev->addr.function);
> - if (ret < 0)
> - return ret;
> - return 0;
> -}
> -
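
For what it is worth, a small usage sketch of the helper shown in the patch,
assuming addr describes PCI device 0000:03:00.0; it shows that the new helper
always emits the domain, unlike the removed ethdev routine quoted above:

	char name[PCI_PRI_STR_SIZE];
	struct rte_pci_addr addr = {
		.domain = 0, .bus = 0x03, .devid = 0x00, .function = 0x0,
	};

	rte_eal_pci_device_name(&addr, name, sizeof(name));
	/* name is now "0000:03:00.0"; the removed helper produced "3:0.0" */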


[dpdk-dev] mbuf changes

2016-10-25 Thread Adrien Mazarguil
On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> Comments inline.

I'm only replying to the nb_segs bits here.

> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Tuesday, October 25, 2016 1:14 PM
> > To: Adrien Mazarguil
> > Cc: Morten Brørup; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> > Kuporosov
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > > > Comments inline.
> > > >
> > > > Med venlig hilsen / kind regards
> > > > - Morten Brørup
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> > > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > To: Bruce Richardson
> > > > > Cc: Wiles, Keith; Morten Brørup; dev at dpdk.org; Olivier Matz; Oleg
> > > > > Kuporosov
> > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > >
> > > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> > > > > [...]
> > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > > > >  wrote:
> > > > > [...]
> > > > > > > > 5.
> > > > > > > >
> > > > > > > > And here?s something new to think about:
> > > > > > > >
> > > > > > > > m->next already reveals if there are more segments to a
> > packet.
> > > > > Which purpose does m->nb_segs serve that is not already covered
> > by
> > > > > m-
> > > > > >next?
> > > > > >
> > > > > > It is duplicate info, but nb_segs can be used to check the
> > > > > > validity
> > > > > of
> > > > > > the next pointer without having to read the second mbuf
> > cacheline.
> > > > > >
> > > > > > Whether it's worth having is something I'm happy enough to
> > > > > > discuss, though.
> > > > >
> > > > > Although slower in some cases than a full blown "next packet"
> > > > > pointer, nb_segs can also be conveniently abused to link several
> > > > > packets and their segments in the same list without wasting
> > space.
> > > >
> > > > I don't understand that; can you please elaborate? Are you abusing
> > m->nb_segs as an index into an array in your application? If that is
> > the case, and it is endorsed by the community, we should get rid of m-
> > >nb_segs and add a member for application specific use instead.
> > >
> > > Well, that's just an idea, I'm not aware of any application using
> > > this, however the ability to link several packets with segments seems
> > > useful to me (e.g. buffering packets). Here's a diagram:
> > >
> > >  .---.   .---.   .---.   .---.   .---
> > ---
> > >  | pkt 0 |   | seg 1 |   | seg 2 |   | pkt 1 |   |
> > pkt 2
> > >  |  next --->|  next --->|  next --->|  next --->|
> > ...
> > >  | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
> > >  `---'   `---'   `---'   `---'   `---
> > ---
> 
> I see. It makes it possible to refer to a burst of packets (with segments or 
> not) by a single mbuf reference, as an alternative to the current design 
> pattern of using an array and length (struct rte_mbuf **mbufs, unsigned 
> count).
> 
> This would require implementation in the PMDs etc.
> 
> And even in this case, m->nb_segs does not need to be an integer, but could 
> be replaced by a single bit indicating if the segment is a continuation of a 
> packet or the beginning (alternatively the end) of a packet, i.e. the bit can 
> be set for either the first or the last segment in the packet.

Sure, however if we keep the current definition, a single bit would not be
enough as it must be nonzero for the buffer to be valid. I think an 8-bit
field is not that expensive for a counter.

> It is an almost equivalent alternative to the fundamental design pattern of 
> using an array of mbuf with count, which is widely implemented in DPDK. And 
> m->next still lives in the second cache line, so I don't see any gain by this.

That's right, it does not have to live in the first cache line, my only
concern was its entire removal.

> I still don't get how m->nb_segs can be abused without m->next.

By "abused" I mean that applications are not supposed to pass this kind of
mbuf lists directly to existing mbuf-handling functions (TX burst,
rte_pktmbuf_free() and so on), however these same applications (even PMDs)
can do so internally temporarily because it's so simple.

The next pointer of the last segment of a packet must still be set to NULL
every time a packet is retrieved from such a list to be processed.
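
Purely as an illustration of the layout in the diagram above (this is not an
existing API), popping one packet from such a chained list could look like:

	/* Illustrative only: each head carries nb_segs for its own segments and
	 * is linked to the next packet through its last segment's next pointer. */
	static struct rte_mbuf *
	pkt_list_pop(struct rte_mbuf **list)
	{
		struct rte_mbuf *head = *list;
		struct rte_mbuf *last = head;
		uint16_t i;

		if (head == NULL)
			return NULL;
		for (i = 1; i < head->nb_segs; i++) /* walk this packet's segments */
			last = last->next;
		*list = last->next;  /* head of the next packet, or NULL */
		last->next = NULL;   /* restore a well-formed single packet */
		return head;
	}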

> > However, nb_segs may be a good candidate for demotion, along with
> > possibly the port value, or the reference count.

Yes, I think that's fine as long as it's kept somewhere.

-- 
Adrien Mazarguil
6WIND


[dpdk-dev] mbuf changes

2016-10-25 Thread Thomas Monjalon
2016-10-25 13:04, Ramia, Kannan Babu:
> Port field is important meta information for application use, like
> CGNAT, vEPC functions, etc.
> I strongly recommend keeping the field in the mbuf meta data.

Have you tried to move this field outside of the mbuf?
What is the performance degradation?
We need more information than some assumptions.



[dpdk-dev] mbuf changes

2016-10-25 Thread Thomas Monjalon
2016-10-25 15:00, Olivier Matz:
> On 10/25/2016 12:22 PM, Morten Brørup wrote:
> > From: Ananyev, Konstantin
> >> From: Bruce Richardson
> >>> On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Brørup wrote:
>  From: Bruce Richardson
> > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> >> I thought we also talked about removing the m->port from the
> >> mbuf as it is not really needed.
> >>
> > Yes, this was mentioned, and also the option of moving the port
> > value to the second cacheline, but it appears that NXP are using
> > the port value in their NIC drivers for passing in metadata, so
> > we'd need their agreement on any move (or removal).
> >
>  If a single driver instance services multiple ports, so the ports
>  are not polled individually, the m->port member will be required to
>  identify the port.
>  In that case, it might also used elsewhere in the ingress path,
>  and thus appropriate having in the first cache line.
> >>
> >> Ok, but these days most devices have multiple rx queues.
> >> So to identify the RX source properly you need not only the port, but
> >> port+queue (at least 3 bytes).
> >> But I suppose better to wait for NXP input here.
> >>
> >>> So yes, this needs further discussion with NXP.
> > 
> > It seems that this concept might be somewhat similar to the
> > Flow Director information, which already has plenty of bits
> > in the first cache line. You should consider if this can be
> > somehow folded into the m->hash union.
> 
> I'll tend to agree with Morten.
> 
> First, I think that having more than 255 virtual links is possible,
> so increasing the port size to 16 bits is not a bad idea.
> 
> I think the port information is more useful for a network stack
> than the queue_id, but it may not be the case for all applications.
> On the other hand, the queue_id (or something else providing the same
> level of information) could be held in the flow director structure.
> 
> Now, the question is: do we really need the port to be inside the mbuf?
> Indeed, the application can already associate a bulk of packet with a
> port id.
> 
> The reason why I'll prefer to keep it is because I'm pretty sure that
> some applications use it. At least in the examples/ directory, we can
> find distributor, dpdk_qat, ipv4_multicast, load_balancer,
> packet_ordering. I did not check these apps in detail, but it makes me
> feel that other external applications could make use of the port field.

Having some applications using it does not mean that there is a good
justification to keep it.
I think the data must be stored inside the mbuf struct only if it
increases the performance of known use cases significantly.
So the questions should be:
- How significant are the use cases?
- What is the performance drop when having the data outside of the struct?

The mbuf space must be kept jealously like a DPDK treasure :)


[dpdk-dev] mbuf changes

2016-10-25 Thread Morten Brørup
I support that!

Med venlig hilsen / kind regards
- Morten Brørup


> -Original Message-
> From: Olivier Matz [mailto:olivier.matz at 6wind.com]
> Sent: Tuesday, October 25, 2016 3:14 PM
> To: Morten Brørup; dev at dpdk.org
> Subject: Re: mbuf changes
> 
> Hi,
> 
> On 10/24/2016 05:49 PM, Morten Brørup wrote:
> > And here?s something new to think about:
> >
> > m->next already reveals if there are more segments to a packet. Which
> > purpose does m->nb_segs serve that is not already covered by m->next?
> >
> 
> I was asking myself the same question some time ago:
> http://dpdk.org/ml/archives/dev/2016-May/039483.html
> 
> But it seems nb_segs is useful for PMDs on TX side, to anticipate how
> many descriptors a packet will use in the TX ring. It can also help a
> PMD to check that this packet is supported by the hardware (too many
> segments) without browsing the list.
> 
> So finally I think it should be kept in the mbuf. But as suggested by
> Bruce, it could go in the second cache line, since m->next is also
> there, at the condition that this field is set to 1 in mbuf_free().
> 
> Regards,
> Olivier



[dpdk-dev] DPDK & ASLR

2016-10-25 Thread Samir Shah
We know that it is recommended that ASLR be turned off so that the mappings
across primary and secondary remain the same.

Does ASLR need to be turned off system-wide, or only for the DPDK processes?
Could we use setarch/personality to disable ASLR for just the DPDK process and
leave it enabled for the rest of the system? Is there any experience to say
whether that would work or not?

- Samir
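
One per-process approach sometimes used on Linux (shown only as a sketch,
untested in the DPDK context) is to clear the randomization flag with
personality() and re-exec, so only that process runs with ASLR disabled;
running the binary under "setarch $(uname -m) -R" achieves the same effect:

	#include <sys/personality.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		(void)argc;
		/* If ASLR is still enabled for this process, disable it and
		 * re-exec so the new mappings are created without randomization. */
		if (!(personality(0xffffffff) & ADDR_NO_RANDOMIZE)) {
			personality(personality(0xffffffff) | ADDR_NO_RANDOMIZE);
			execv("/proc/self/exe", argv);
		}
		/* ... normal DPDK initialization (rte_eal_init, ...) follows ... */
		return 0;
	}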


[dpdk-dev] rte_eth_tx_burst() hangs because of no free descriptors

2016-10-25 Thread yingzhi
Hi Helin,


I'm doing comparison tests today (with fragmented and small traffic).


On the other hand, I checked some example code from DPDK source: 
load_balancer:
app_lcore_io_tx:


n_pkts = rte_eth_tx_burst(
        port,
        0,
        lp->tx.mbuf_out[port].array,
        (uint16_t) n_mbufs);

if (unlikely(n_pkts < n_mbufs)) {
        uint32_t k;
        for (k = n_pkts; k < n_mbufs; k++) {
                struct rte_mbuf *pkt_to_free =
                        lp->tx.mbuf_out[port].array[k];
                rte_pktmbuf_free(pkt_to_free);
        }
}

And fragmentation example
http://dpdk.org/doc/api-2.2/ip_fragmentation_2main_8c-example.html
ret = rte_eth_tx_burst(port, queueid, m_table, n);
if (unlikely(ret < n)) {
        do {
                rte_pktmbuf_free(m_table[ret]);
        } while (++ret < n);
}



They work basically the same way as in my app. Is there any other way to free
segmented packets?
The question is:
if an mbuf's next pointer is not NULL, rte_pktmbuf_free() will free every mbuf
in that linked list; but due to packet re-ordering, I guess those segments may
not be stored sequentially in the mbuf array. So how can we correctly free those
fragmented packets when rte_eth_tx_burst() didn't send them out and we need to
free them manually?


Thanks
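
For reference, a minimal sketch of the cleanup pattern used by the examples
quoted above (function and variable names are illustrative). Each element of
the array passed to rte_eth_tx_burst() is expected to be the head mbuf of one
packet; rte_pktmbuf_free() on a head walks m->next and frees that packet's
whole segment chain, so freeing only the unsent heads neither skips nor
double-frees segments:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

static inline void
example_send_burst(uint8_t port_id, uint16_t queue_id,
		   struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	uint16_t sent = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);

	/* Release every packet the driver did not accept; each call frees
	 * the head mbuf and all chained segments of that packet. */
	while (sent < nb_pkts)
		rte_pktmbuf_free(pkts[sent++]);
}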



-- Original --
From:  "Zhang, Helin";;
Date:  Tue, Oct 25, 2016 09:05 AM
To:  "yingzhi"; 
Cc:  "dev at dpdk.org"; 
Subject:  RE: RE: [dpdk-dev] rte_kni_tx_burst() hangs because of 
nofreedescriptors








From: yingzhi [mailto:kai...@qq.com] 
 Sent: Monday, October 24, 2016 6:39 PM
 To: Zhang, Helin
 Cc: dev at dpdk.org
 Subject: Re: RE: [dpdk-dev] rte_kni_tx_burst() hangs because of no 
freedescriptors





Hi Helin,





Thanks for your response, to answer your questions:


1. we send only one packet each time calling rte_kni_tx_burst(), which means 
the last argument is 1.


2. it returns 0 because the free mbuf function inside tx_burst will not free 
any mbuf:





  if (txq->nb_tx_free < txq->tx_free_thresh)
          ixgbe_tx_free_bufs(txq);






After this operation, txq->nb_tx_free is not increased and eventually stays at "0".





I did some tests today: I commented out this section of
ixgbe_rxtx_vec_common.h -> ixgbe_tx_free_bufs:





  status = txq->tx_ring[txq->tx_next_dd].wb.status;
  if (!(status & IXGBE_ADVTXD_STAT_DD))
          return 0;






After ignoring the DD bit check, our app runs for about 6 hours without issue. I
suspect there is something wrong in my program that interferes with the DD bit
somewhere. One possible cause I currently suspect: as far as I know,
rte_pktmbuf_free(mbuf.array[k]) will free the packet's mbuf and any fragmented
(chained) mbufs following it. But our application uses a code snippet such as
the one below:





auto nb_tx = rte_eth_tx_burst(port, queue, mbuf.array, (uint16_t) nb_rx);
if (unlikely(nb_tx < nb_rx)) {
        for (unsigned k = nb_tx; k < nb_rx; k++) {
                rte_pktmbuf_free(mbuf.array[k]);
        }
}

[Zhang, Helin] It seems the above code piece has a memory leak if the buffer is
chained. After all the memory has leaked, the issue appears. Please try to check
if this is the root cause!






In this case, if there are fragmented packets and a failed transmission, an mbuf
may be freed multiple times.





The above is just my suspicion; I need to do some tests later today or tomorrow.





Thanks


-- Original --


From:  "Zhang, Helin";;


Date:  Mon, Oct 24, 2016 11:33 AM


To:  "yingzhi"; 


Cc:  "dev at dpdk.org"; 


Subject:  RE: [dpdk-dev] rte_kni_tx_burst() hangs because of no freedescriptors






Hi Yingzhi

 Thank you for reporting this! The description is not so clear, at least for me.
 Please help to narrow down the issue yourself.
 How many packets does each call to the TX function send?
 Why does it return 0 after calling the TX function? No memory? Or does it return
from somewhere else? Have you found anything?

 Regards,
 Helin

 > -Original Message-
 > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of yingzhi
 > Sent: Sunday, October 23, 2016 9:30 PM
 > To: users; dev at dpdk.org
 > Subject: [dpdk-dev] rte_kni_tx_burst() hangs because of no free descriptors
 > 
 > -
 > Hi Experts,
 > 
 > Background:
 > 
 > We are using DPDK to develop a LoadBalancer following below logic: When
 > a new packet is received:
 >  1. if the dst_addr is management IP, forward to KNI. 2. if the dst_addr is 
 > in
 > VIP list, select backend and forward(modify dst mac address). 3. otherwise
 > 

[dpdk-dev] [PATCH] Revert "bonding: use existing enslaved device queues"

2016-10-25 Thread Bruce Richardson
On Tue, Oct 25, 2016 at 02:48:04PM +0100, Declan Doherty wrote:
> On 25/10/16 13:57, Bruce Richardson wrote:
> > On Mon, Oct 24, 2016 at 04:07:17PM +0100, Declan Doherty wrote:
> > > On 24/10/16 15:51, Jan Blunck wrote:
> > > > On Mon, Oct 24, 2016 at 7:02 AM, Declan Doherty
> > > >  wrote:
> > > > > On 14/10/16 00:37, Eric Kinzie wrote:
> > > > > > 
> > > > > > On Wed Oct 12 16:24:21 +0100 2016, Bruce Richardson wrote:
> > > > > > > 
> > > > > > > On Wed, Oct 12, 2016 at 04:24:54PM +0300, Ilya Maximets wrote:
> > > > > > > > 
> > > > > > > > On 07.10.2016 05:02, Eric Kinzie wrote:
> > > > > > > > > 
> > > > > > > > > On Wed Sep 07 15:28:10 +0300 2016, Ilya Maximets wrote:
> > > > > > > > > > 
> > > > > > > > > > This reverts commit 
> > > > > > > > > > 5b7bb2bda5519b7800f814df64d4e015282140e5.
> > > > > > > > > > 
> > > > > > > > > > It is necessary to reconfigure all queues every time because
> > > > > > > > > > configuration
> > > > > > > > > > can be changed.
> > > > > > > > > > 
> > > > > > > > > > For example, if we're reconfiguring bonding device with new 
> > > > > > > > > > memory
> > > > > > > > > > pool,
> > > > > > > > > > already configured queues will still use the old one. And 
> > > > > > > > > > if the old
> > > > > > > > > > mempool be freed, application likely will panic in attempt 
> > > > > > > > > > to use
> > > > > > > > > > freed mempool.
> > > > > > > > > > 
> > > > > > > > > > This happens when we use the bonding device with OVS 2.6 
> > > > > > > > > > while MTU
> > > > > > > > > > reconfiguration:
> > > > > > > > > > 
> > > > > > > > > > PANIC in rte_mempool_get_ops():
> > > > > > > > > > assert "(ops_index >= 0) && (ops_index < 
> > > > > > > > > > RTE_MEMPOOL_MAX_OPS_IDX)"
> > > > > > > > > > failed
> > > > > > > > > > 
> > > > > > > > > > Cc: 
> > > > > > > > > > Signed-off-by: Ilya Maximets 
> > > > > > > > > > ---
> > > > > > > > > >  drivers/net/bonding/rte_eth_bond_pmd.c | 10 ++
> > > > > > > > > >  1 file changed, 2 insertions(+), 8 deletions(-)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c
> > > > > > > > > > b/drivers/net/bonding/rte_eth_bond_pmd.c
> > > > > > > > > > index b20a272..eb5b6d1 100644
> > > > > > > > > > --- a/drivers/net/bonding/rte_eth_bond_pmd.c
> > > > > > > > > > +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
> > > > > > > > > > @@ -1305,8 +1305,6 @@ slave_configure(struct rte_eth_dev
> > > > > > > > > > *bonded_eth_dev,
> > > > > > > > > > struct bond_rx_queue *bd_rx_q;
> > > > > > > > > > struct bond_tx_queue *bd_tx_q;
> > > > > > > > > > 
> > > > > > > > > > -   uint16_t old_nb_tx_queues = 
> > > > > > > > > > slave_eth_dev->data->nb_tx_queues;
> > > > > > > > > > -   uint16_t old_nb_rx_queues = 
> > > > > > > > > > slave_eth_dev->data->nb_rx_queues;
> > > > > > > > > > int errval;
> > > > > > > > > > uint16_t q_id;
> > > > > > > > > > 
> > > > > > > > > > @@ -1347,9 +1345,7 @@ slave_configure(struct rte_eth_dev
> > > > > > > > > > *bonded_eth_dev,
> > > > > > > > > > }
> > > > > > > > > > 
> > > > > > > > > > /* Setup Rx Queues */
> > > > > > > > > > -   /* Use existing queues, if any */
> > > > > > > > > > -   for (q_id = old_nb_rx_queues;
> > > > > > > > > > -q_id < bonded_eth_dev->data->nb_rx_queues; 
> > > > > > > > > > q_id++) {
> > > > > > > > > > +   for (q_id = 0; q_id < 
> > > > > > > > > > bonded_eth_dev->data->nb_rx_queues;
> > > > > > > > > > q_id++) {
> > > > > > > > > > bd_rx_q = (struct bond_rx_queue
> > > > > > > > > > *)bonded_eth_dev->data->rx_queues[q_id];
> > > > > > > > > > 
> > > > > > > > > > errval =
> > > > > > > > > > rte_eth_rx_queue_setup(slave_eth_dev->data->port_id, q_id,
> > > > > > > > > > @@ -1365,9 +1361,7 @@ slave_configure(struct rte_eth_dev
> > > > > > > > > > *bonded_eth_dev,
> > > > > > > > > > }
> > > > > > > > > > 
> > > > > > > > > > /* Setup Tx Queues */
> > > > > > > > > > -   /* Use existing queues, if any */
> > > > > > > > > > -   for (q_id = old_nb_tx_queues;
> > > > > > > > > > -q_id < bonded_eth_dev->data->nb_tx_queues; 
> > > > > > > > > > q_id++) {
> > > > > > > > > > +   for (q_id = 0; q_id < 
> > > > > > > > > > bonded_eth_dev->data->nb_tx_queues;
> > > > > > > > > > q_id++) {
> > > > > > > > > > bd_tx_q = (struct bond_tx_queue
> > > > > > > > > > *)bonded_eth_dev->data->tx_queues[q_id];
> > > > > > > > > > 
> > > > > > > > > > errval =
> > > > > > > > > > rte_eth_tx_queue_setup(slave_eth_dev->data->port_id, q_id,
> > > > > > > > > > --
> > > > > > > > > > 2.7.4
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > NAK
> > > > > > > > > 
> > > > > > > > > There are still some users of this code.  Let's give them a 
> > > > > > > > > chance to
> > > > > > > > > comment before removing it.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Hi Eric,
> > > > > > 

[dpdk-dev] [PATCH] Revert "bonding: use existing enslaved device queues"

2016-10-25 Thread Bruce Richardson
On Wed, Oct 19, 2016 at 10:55:25AM +0100, Bruce Richardson wrote:
> On Thu, Oct 06, 2016 at 03:32:36PM +0100, Declan Doherty wrote:
> > On 07/09/16 13:28, Ilya Maximets wrote:
> > > This reverts commit 5b7bb2bda5519b7800f814df64d4e015282140e5.
> > > 
> > > It is necessary to reconfigure all queues every time because configuration
> > > can be changed.
> > > 
> > 
> > Hey Ilya, this makes sense. I guess this was my original intention but I
> > missed this case in my review of the change.
> > 
> > > For example, if we're reconfiguring bonding device with new memory pool,
> > > already configured queues will still use the old one. And if the old
> > > mempool be freed, application likely will panic in attempt to use
> > > freed mempool.
> > > 
> > > This happens when we use the bonding device with OVS 2.6 while MTU
> > > reconfiguration:
> > > 
> > > PANIC in rte_mempool_get_ops():
> > > assert "(ops_index >= 0) && (ops_index < RTE_MEMPOOL_MAX_OPS_IDX)" failed
> > > 
> > > Cc: 
> > > Signed-off-by: Ilya Maximets 
> > > ---
> > 
> > Acked-by: Declan Doherty 
> 
> Applied to dpdk-next-net/rel_16_11
> 
Patch taken out of branch due to on-going discussion in this thread. It
can be re-applied later if consensus is reached that it is ok.

Apologies for any confusion caused by this, but after discussion with
the driver maintainer, we feel this is the safest course for now.

/Bruce



[dpdk-dev] [PATCH] pci: Don't call probe callback if driver already loaded.

2016-10-25 Thread Ben Walker
If the user asks to probe multiple times, the probe
callback should only be called on devices that don't have
a driver already loaded.

This is useful if a driver is registered after the
execution of a program has started and the list of devices
needs to be re-scanned.

Signed-off-by: Ben Walker 
---
 lib/librte_eal/common/eal_common_pci.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/eal_common_pci.c 
b/lib/librte_eal/common/eal_common_pci.c
index 638cd86..971ad20 100644
--- a/lib/librte_eal/common/eal_common_pci.c
+++ b/lib/librte_eal/common/eal_common_pci.c
@@ -289,6 +289,10 @@ pci_probe_all_drivers(struct rte_pci_device *dev)
if (dev == NULL)
return -1;

+   /* Check if a driver is already loaded */
+   if (dev->driver != NULL)
+   return 0;
+
	TAILQ_FOREACH(dr, &pci_driver_list, next) {
rc = rte_eal_pci_probe_one_driver(dr, dev);
if (rc < 0)
-- 
2.7.4
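
A minimal usage sketch of the scenario described in the commit message,
assuming the 16.11-era EAL PCI API (rte_eal_pci_register()/rte_eal_pci_probe());
the driver name is hypothetical:

#include <rte_pci.h>

extern struct rte_pci_driver my_late_driver;  /* registered after startup */

static int
example_late_register_and_probe(void)
{
	/* Register a driver after rte_eal_init() has already probed once,
	 * then re-scan: with this patch, devices that already have a driver
	 * bound are skipped and only still-unbound devices are probed. */
	rte_eal_pci_register(&my_late_driver);
	return rte_eal_pci_probe();
}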



[dpdk-dev] mbuf changes

2016-10-25 Thread Morten Brørup
Comments inline.

Med venlig hilsen / kind regards
- Morten Brørup


> -Original Message-
> From: Shreyansh Jain [mailto:shreyansh.jain at nxp.com]
> Sent: Tuesday, October 25, 2016 1:54 PM
> To: Bruce Richardson
> Cc: Wiles, Keith; Morten Brørup; dev at dpdk.org; Olivier Matz
> Subject: Re: mbuf changes
> 
> On Monday 24 October 2016 09:55 PM, Bruce Richardson wrote:
> > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> >>
> >>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
>  wrote:
> >>>
> >>> First of all: Thanks for a great DPDK Userspace 2016!
> >>>
> >>>
> >>>
> >>> Continuing the Userspace discussion about Olivier Matz's proposed
> mbuf changes...
> >
> > Thanks for keeping the discussion going!
> >>>
> >>>
> >>>
> >>> 1.
> >>>
> >>> Stephen Hemminger had a noteworthy general comment about keeping
> metadata for the NIC in the appropriate section of the mbuf: Metadata
> generated by the NIC's RX handler belongs in the first cache line, and
> metadata required by the NIC's TX handler belongs in the second cache
> line. This also means that touching the second cache line on ingress
> should be avoided if possible; and Bruce Richardson mentioned that for
> this reason m->next was zeroed on free().
> >>>
> > Thinking about it, I suspect there are more fields we can reset on
> free
> > to save time on alloc. Refcnt, as discussed below is one of them, but
> so
> > too could be the nb_segs field and possibly others.
> >
> >>>
> >>>
> >>> 2.
> >>>
> >>> There seemed to be consensus that the size of m->refcnt should
> match the size of m->port because a packet could be duplicated on all
> physical ports for L3 multicast and L2 flooding.
> >>>
> >>> Furthermore, although a single physical machine (i.e. a single
> server) with 255 physical ports probably doesn't exist, it might
> contain more than 255 virtual machines with a virtual port each, so it
> makes sense extending these mbuf fields from 8 to 16 bits.
> >>
> >> I thought we also talked about removing the m->port from the mbuf as
> it is not really needed.
> >>
> > Yes, this was mentioned, and also the option of moving the port value
> to
> > the second cacheline, but it appears that NXP are using the port
> value
> > in their NIC drivers for passing in metadata, so we'd need their
> > agreement on any move (or removal).
> 
> I am not sure where NXP's NIC came into picture on this, but now that
> it
> is highlighted, this field is required for libevent implementation [1].
> 
> A scheduler sending an event, which can be a packet, would only have
> information of a flow_id. From this matching it back to a port, without
> mbuf->port, would be very difficult (costly). There may be way around
> this but at least in current proposal I think port would be important
> to
> have - even if in second cache line.
> 

Since it's a natural part of the RX handler, it rightfully belongs in the first 
cache line.

Admitting that I haven't read the referenced proposal in detail, I would like 
to reiterate that perhaps the Flow Director fields in the mbuf could be used 
instead of the port field.

> But, off the top of my head, as of now it is not being used for any
> specific purpose in NXP's PMD implementation.
> 
> Even the SoC patches don't necessarily rely on it except using it
> because it is available.
> 
> @Bruce: where did you get the NXP context here from?
> 
> [1] http://dpdk.org/ml/archives/dev/2016-October/048592.html
> 
> >
> >>>
> >>>
> >>>
> >>> 3.
> >>>
> >>> Someone (Bruce Richardson?) suggested moving m->refcnt and m->port
> to the second cache line, which then generated questions from the
> audience about the real life purpose of m->port, and if m->port could
> be removed from the mbuf structure.
> >>>
> >>>
> >>>
> >>> 4.
> >>>
> >>> I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0
> on first allocation. This is based on the assumption that other mbuf
> fields must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is
> cheaper than setting it to 1.
> >>>
> >>> Furthermore (regardless of m->refcnt offset), I suggested that it
> is not required to modify m->refcnt when allocating and freeing the
> mbuf, thus saving one write operation on both alloc() and free().
> However, this assumes that m->refcnt debugging, e.g. underrun
> detection, is not required.
> >
> > I don't think it really matters what sentinal value is used for the
> > refcnt because it can't be blindly assigned on free like other
> fields.
> > However, I think 0 as first reference value becomes more awkward
> > than 1, because we need to deal with underflow. Consider the
> situation
> > where we have two references to the mbuf, so refcnt is 1, and both
> are
> > freed at the same time. Since the refcnt is not-zero, then both cores
> > will do an atomic decrement simultaneously giving a refcnt of -1. We
> can
> > then set this back to zero before freeing, however, I'd still prefer
> to
> > have refcnt be an accurate value so that it always 

[dpdk-dev] mbuf changes

2016-10-25 Thread Olivier Matz


On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
>> Comments at the end.
>>
>> Med venlig hilsen / kind regards
>> - Morten Brørup
>>
>>> -Original Message-
>>> From: Bruce Richardson [mailto:bruce.richardson at intel.com]
>>> Sent: Tuesday, October 25, 2016 2:20 PM
>>> To: Morten Br?rup
>>> Cc: Adrien Mazarguil; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
>>> Kuporosov
>>> Subject: Re: [dpdk-dev] mbuf changes
>>>
>>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Br?rup wrote:
 Comments inline.

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> Richardson
> Sent: Tuesday, October 25, 2016 1:14 PM
> To: Adrien Mazarguil
> Cc: Morten Br?rup; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> Kuporosov
> Subject: Re: [dpdk-dev] mbuf changes
>
> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Br?rup wrote:
>>> Comments inline.
>>>
>>> Med venlig hilsen / kind regards
>>> - Morten Br?rup
>>>
>>>
 -Original Message-
 From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
 Sent: Tuesday, October 25, 2016 11:39 AM
 To: Bruce Richardson
 Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz;
 Oleg Kuporosov
 Subject: Re: [dpdk-dev] mbuf changes

 On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
>>> wrote:
> On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith
>>> wrote:
 [...]
>>> On Oct 24, 2016, at 10:49 AM, Morten Br?rup
  wrote:
 [...]

> One other point I'll mention is that we need to have a
> discussion on how/where to add in a timestamp value into
>>> the
> mbuf. Personally, I think it can be in a union with the
> sequence
> number value, but I also suspect that 32-bits of a
>>> timestamp
> is not going to be enough for
 many.
>
> Thoughts?

 If we consider that timestamp representation should use
> nanosecond
 granularity, a 32-bit value may likely wrap around too
>>> quickly
 to be useful. We can also assume that applications requesting
 timestamps may care more about latency than throughput, Oleg
> found
 that using the second cache line for this purpose had a
> noticeable impact [1].

  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
>>>
>>> I agree with Oleg about the latency vs. throughput importance
>>> for
> such applications.
>>>
>>> If you need high resolution timestamps, consider them to be
> generated by the NIC RX driver, possibly by the hardware itself
> (http://w3new.napatech.com/features/time-precision/hardware-time-
> stamp), so the timestamp belongs in the first cache line. And I am
> proposing that it should have the highest possible accuracy, which
> makes the value hardware dependent.
>>>
>>> Furthermore, I am arguing that we leave it up to the
>>> application
>>> to
> keep track of the slowly moving bits (i.e. counting whole seconds,
> hours and calendar date) out of band, so we don't use precious
>>> space
> in the mbuf. The application doesn't need the NIC RX driver's fast
> path to capture which date (or even which second) a packet was
> received. Yes, it adds complexity to the application, but we can't
> set aside 64 bit for a generic timestamp. Or as a weird tradeoff:
> Put the fast moving 32 bit in the first cache line and the slow
> moving 32 bit in the second cache line, as a placeholder for the
>>> application to fill out if needed.
> Yes, it means that the application needs to check the time and
> update its variable holding the slow moving time once every second
> or so; but that should be doable without significant effort.
>>
>> That's a good point, however without a 64 bit value, elapsed time
>> between two arbitrary mbufs cannot be measured reliably due to
>>> not
>> enough context, one way or another the low resolution value is
>> also
> needed.
>>
>> Obviously latency-sensitive applications are unlikely to perform
>> lengthy buffering and require this but I'm not sure about all the
>> possible use-cases. Considering many NICs expose 64 bit
>>> timestaps,
>> I suggest we do not truncate them.
>>
>> I'm not a fan of the weird tradeoff either, PMDs will be tempted
>> to fill the extra 32 bits whenever they can and negate the
>> performance improvement of the first cache line.
>
> I would tend to agree, and I don't really see any convenient way to
> avoid putting in a 64-bit field for the timestamp in cache-line 0.
> If we are 

[dpdk-dev] [PATCH] Revert "bonding: use existing enslaved device queues"

2016-10-25 Thread Declan Doherty
On 25/10/16 13:57, Bruce Richardson wrote:
> On Mon, Oct 24, 2016 at 04:07:17PM +0100, Declan Doherty wrote:
>> On 24/10/16 15:51, Jan Blunck wrote:
>>> On Mon, Oct 24, 2016 at 7:02 AM, Declan Doherty
>>>  wrote:
 On 14/10/16 00:37, Eric Kinzie wrote:
>
> On Wed Oct 12 16:24:21 +0100 2016, Bruce Richardson wrote:
>>
>> On Wed, Oct 12, 2016 at 04:24:54PM +0300, Ilya Maximets wrote:
>>>
>>> On 07.10.2016 05:02, Eric Kinzie wrote:

 On Wed Sep 07 15:28:10 +0300 2016, Ilya Maximets wrote:
>
> This reverts commit 5b7bb2bda5519b7800f814df64d4e015282140e5.
>
> It is necessary to reconfigure all queues every time because
> configuration
> can be changed.
>
> For example, if we're reconfiguring bonding device with new memory
> pool,
> already configured queues will still use the old one. And if the old
> mempool be freed, application likely will panic in attempt to use
> freed mempool.
>
> This happens when we use the bonding device with OVS 2.6 while MTU
> reconfiguration:
>
> PANIC in rte_mempool_get_ops():
> assert "(ops_index >= 0) && (ops_index < RTE_MEMPOOL_MAX_OPS_IDX)"
> failed
>
> Cc: 
> Signed-off-by: Ilya Maximets 
> ---
>  drivers/net/bonding/rte_eth_bond_pmd.c | 10 ++
>  1 file changed, 2 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c
> b/drivers/net/bonding/rte_eth_bond_pmd.c
> index b20a272..eb5b6d1 100644
> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
> @@ -1305,8 +1305,6 @@ slave_configure(struct rte_eth_dev
> *bonded_eth_dev,
> struct bond_rx_queue *bd_rx_q;
> struct bond_tx_queue *bd_tx_q;
>
> -   uint16_t old_nb_tx_queues = slave_eth_dev->data->nb_tx_queues;
> -   uint16_t old_nb_rx_queues = slave_eth_dev->data->nb_rx_queues;
> int errval;
> uint16_t q_id;
>
> @@ -1347,9 +1345,7 @@ slave_configure(struct rte_eth_dev
> *bonded_eth_dev,
> }
>
> /* Setup Rx Queues */
> -   /* Use existing queues, if any */
> -   for (q_id = old_nb_rx_queues;
> -q_id < bonded_eth_dev->data->nb_rx_queues; q_id++) {
> +   for (q_id = 0; q_id < bonded_eth_dev->data->nb_rx_queues;
> q_id++) {
> bd_rx_q = (struct bond_rx_queue
> *)bonded_eth_dev->data->rx_queues[q_id];
>
> errval =
> rte_eth_rx_queue_setup(slave_eth_dev->data->port_id, q_id,
> @@ -1365,9 +1361,7 @@ slave_configure(struct rte_eth_dev
> *bonded_eth_dev,
> }
>
> /* Setup Tx Queues */
> -   /* Use existing queues, if any */
> -   for (q_id = old_nb_tx_queues;
> -q_id < bonded_eth_dev->data->nb_tx_queues; q_id++) {
> +   for (q_id = 0; q_id < bonded_eth_dev->data->nb_tx_queues;
> q_id++) {
> bd_tx_q = (struct bond_tx_queue
> *)bonded_eth_dev->data->tx_queues[q_id];
>
> errval =
> rte_eth_tx_queue_setup(slave_eth_dev->data->port_id, q_id,
> --
> 2.7.4
>

 NAK

 There are still some users of this code.  Let's give them a chance to
 comment before removing it.
>>>
>>>
>>> Hi Eric,
>>>
>>> Are these users in CC-list? If not, could you, please, add them?
>>> This patch awaits in mail-list already more than a month. I think, it's
>>> enough
>>> time period for all who wants to say something. Patch fixes a real bug
>>> that
>>> prevent using of DPDK bonding in all applications that reconfigures
>>> devices
>>> in runtime including OVS.
>>>
>> Agreed.
>>
>> Eric, does reverting this patch cause you problems directly, or is your
>> concern
>> just with regards to potential impact to others?
>>
>> Thanks,
>> /Bruce
>
>
> This won't impact me directly.  The users are CCed (different thread)
> and I haven't seen any comment, so I no longer have any objection to
> reverting this change.
>
> Eric
>

 As there has been no further objections and this reinstates the original
 expected behavior of the bonding driver. I'm re-ack'ing for inclusion in
 release.

 Acked-by: Declan Doherty 
>>>
>>> Ok, I can revert the revert for us.
>>>
>>> Do I read this correctly that you are not interested in fixing this 
>>> properly?!
>>>
>>> Thanks,
>>> Jan
>>>
>>
>> Jan, sorry I missed the replies from last week due 

[dpdk-dev] [PATCH 0/2] crypto/qat: performance optimisation

2016-10-25 Thread De Lara Guarch, Pablo


> -Original Message-
> From: Trahe, Fiona
> Sent: Monday, October 24, 2016 5:00 AM
> To: dev at dpdk.org
> Cc: De Lara Guarch, Pablo; Trahe, Fiona; Griffin, John; Jain, Deepak K; 
> Kusztal,
> ArkadiuszX
> Subject: [PATCH 0/2] crypto/qat: performance optimisation
> 
> QAT PMD adjusts the buffer start address and offsets passed
> to the device so that the DMAs in and out of the device are
> 64-byte aligned.
> This gives more consistent throughput, which had been
> variable depending on how the application set up the mbuf.
> The message builder code had to be considerably re-factored
> to do this efficiently.
> Also performance test not taking IV prepend offsets
> into account were corrected.
> 
> Fiona Trahe (2):
>   crypto/qat: rework request builder for performance
>   app/test: use correct offsets in AES perf test
> 
>  app/test/test_cryptodev_perf.c   |  15 +-
>  drivers/crypto/qat/qat_adf/icp_qat_hw.h  |   5 +
>  drivers/crypto/qat/qat_adf/qat_algs.h|   1 +
>  drivers/crypto/qat/qat_adf/qat_algs_build_desc.c |   2 +
>  drivers/crypto/qat/qat_crypto.c  | 242 
> ---
>  5 files changed, 185 insertions(+), 80 deletions(-)
> 
> --
> 2.5.0

Applied to dpdk-next-crypto.
Thanks,

Pablo




[dpdk-dev] mbuf changes

2016-10-25 Thread Morten Brørup
Comments at the end.

Med venlig hilsen / kind regards
- Morten Brørup

> -Original Message-
> From: Bruce Richardson [mailto:bruce.richardson at intel.com]
> Sent: Tuesday, October 25, 2016 2:20 PM
> To: Morten Brørup
> Cc: Adrien Mazarguil; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> Kuporosov
> Subject: Re: [dpdk-dev] mbuf changes
> 
> > On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > Comments inline.
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > Richardson
> > > Sent: Tuesday, October 25, 2016 1:14 PM
> > > To: Adrien Mazarguil
> > > Cc: Morten Br?rup; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> > > Kuporosov
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Br?rup wrote:
> > > > > Comments inline.
> > > > >
> > > > > Med venlig hilsen / kind regards
> > > > > - Morten Br?rup
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> > > > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > > To: Bruce Richardson
> > > > > > Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz;
> > > > > > Oleg Kuporosov
> > > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > > >
> > > > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> wrote:
> > > > > > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith
> wrote:
> > > > > > [...]
> > > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup
> > > > > >  wrote:
> > > > > > [...]
> >
> > > > > > > One other point I'll mention is that we need to have a
> > > > > > > discussion on how/where to add in a timestamp value into
> the
> > > > > > > mbuf. Personally, I think it can be in a union with the
> > > sequence
> > > > > > > number value, but I also suspect that 32-bits of a
> timestamp
> > > > > > > is not going to be enough for
> > > > > > many.
> > > > > > >
> > > > > > > Thoughts?
> > > > > >
> > > > > > If we consider that timestamp representation should use
> > > nanosecond
> > > > > > granularity, a 32-bit value may likely wrap around too
> quickly
> > > > > > to be useful. We can also assume that applications requesting
> > > > > > timestamps may care more about latency than throughput, Oleg
> > > found
> > > > > > that using the second cache line for this purpose had a
> > > noticeable impact [1].
> > > > > >
> > > > > >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > > > >
> > > > > I agree with Oleg about the latency vs. throughput importance
> > > > > for
> > > such applications.
> > > > >
> > > > > If you need high resolution timestamps, consider them to be
> > > generated by the NIC RX driver, possibly by the hardware itself
> > > (http://w3new.napatech.com/features/time-precision/hardware-time-
> > > stamp), so the timestamp belongs in the first cache line. And I am
> > > proposing that it should have the highest possible accuracy, which
> > > makes the value hardware dependent.
> > > > >
> > > > > Furthermore, I am arguing that we leave it up to the
> application
> > > > > to
> > > keep track of the slowly moving bits (i.e. counting whole seconds,
> > > hours and calendar date) out of band, so we don't use precious
> space
> > > in the mbuf. The application doesn't need the NIC RX driver's fast
> > > path to capture which date (or even which second) a packet was
> > > received. Yes, it adds complexity to the application, but we can't
> > > set aside 64 bit for a generic timestamp. Or as a weird tradeoff:
> > > Put the fast moving 32 bit in the first cache line and the slow
> > > moving 32 bit in the second cache line, as a placeholder for the
> application to fill out if needed.
> > > Yes, it means that the application needs to check the time and
> > > update its variable holding the slow moving time once every second
> > > or so; but that should be doable without significant effort.
> > > >
> > > > That's a good point, however without a 64 bit value, elapsed time
> > > > between two arbitrary mbufs cannot be measured reliably due to
> not
> > > > enough context, one way or another the low resolution value is
> > > > also
> > > needed.
> > > >
> > > > Obviously latency-sensitive applications are unlikely to perform
> > > > lengthy buffering and require this but I'm not sure about all the
> > > > possible use-cases. Considering many NICs expose 64 bit
> timestaps,
> > > > I suggest we do not truncate them.
> > > >
> > > > I'm not a fan of the weird tradeoff either, PMDs will be tempted
> > > > to fill the extra 32 bits whenever they can and negate the
> > > > performance improvement of the first cache line.
> > >
> > > I would tend to agree, and I don't really see any convenient way to
> > > avoid putting in a 64-bit field for the timestamp in cache-line 0.
> > > If we are ok with having this 

[dpdk-dev] mbuf changes

2016-10-25 Thread Ramia, Kannan Babu
I didn't get your question. The only information exchanged between the stages
is the mbuf pointer, so the information has to be in the mbuf. Whether it's part
of the metadata or in the private area in the packet buffer headroom: is that
the question you are asking? The private area is application specific, while I
am looking for the port information to be updated from the driver to the
application, and that is generic to multiple applications.

Regards
Kannan Babu


-Original Message-
From: Thomas Monjalon [mailto:thomas.monja...@6wind.com] 
Sent: Tuesday, October 25, 2016 6:54 PM
To: Ramia, Kannan Babu 
Cc: dev at dpdk.org; Olivier Matz ; Morten Brørup 
; Ananyev, Konstantin ; Richardson, Bruce ; Wiles, Keith 

Subject: Re: [dpdk-dev] mbuf changes

2016-10-25 13:04, Ramia, Kannan Babu:
> The port field is important meta information for application use, e.g. in
> CGNAT and vEPC functions.
> I strongly recommend keeping the field in the mbuf metadata.

Have you tried to move this field outside of the mbuf?
What is the performance degradation?
We need more information than some assumptions.



[dpdk-dev] mbuf changes

2016-10-25 Thread Morten Brørup
Comments inline.

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> Sent: Tuesday, October 25, 2016 1:14 PM
> To: Adrien Mazarguil
> Cc: Morten Br?rup; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> Kuporosov
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Br?rup wrote:
> > > Comments inline.
> > >
> > > Med venlig hilsen / kind regards
> > > - Morten Br?rup
> > >
> > >
> > > > -Original Message-
> > > > From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > To: Bruce Richardson
> > > > Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz; Oleg
> > > > Kuporosov
> > > > Subject: Re: [dpdk-dev] mbuf changes
> > > >
> > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> > > > [...]
> > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup
> > > >  wrote:
> > > > [...]
> > > > > > > 5.
> > > > > > >
> > > > > > > And here?s something new to think about:
> > > > > > >
> > > > > > > m->next already reveals if there are more segments to a
> packet.
> > > > Which purpose does m->nb_segs serve that is not already covered
> by
> > > > m-
> > > > >next?
> > > > >
> > > > > It is duplicate info, but nb_segs can be used to check the
> > > > > validity
> > > > of
> > > > > the next pointer without having to read the second mbuf
> cacheline.
> > > > >
> > > > > Whether it's worth having is something I'm happy enough to
> > > > > discuss, though.
> > > >
> > > > Although slower in some cases than a full blown "next packet"
> > > > pointer, nb_segs can also be conveniently abused to link several
> > > > packets and their segments in the same list without wasting
> space.
> > >
> > > I don't understand that; can you please elaborate? Are you abusing
> m->nb_segs as an index into an array in your application? If that is
> the case, and it is endorsed by the community, we should get rid of m-
> >nb_segs and add a member for application specific use instead.
> >
> > Well, that's just an idea, I'm not aware of any application using
> > this, however the ability to link several packets with segments seems
> > useful to me (e.g. buffering packets). Here's a diagram:
> >
> >  .---.   .---.   .---.   .---.   .---
> ---
> >  | pkt 0 |   | seg 1 |   | seg 2 |   | pkt 1 |   |
> pkt 2
> >  |  next --->|  next --->|  next --->|  next --->|
> ...
> >  | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
> >  `---'   `---'   `---'   `---'   `---
> ---

I see. It makes it possible to refer to a burst of packets (with segments or 
not) by a single mbuf reference, as an alternative to the current design 
pattern of using an array and length (struct rte_mbuf **mbufs, unsigned count).

This would require implementation in the PMDs etc.

And even in this case, m->nb_segs does not need to be an integer, but could be 
replaced by a single bit indicating if the segment is a continuation of a 
packet or the beginning (alternatively the end) of a packet, i.e. the bit can 
be set for either the first or the last segment in the packet.

It is an almost equivalent alternative to the fundamental design pattern of 
using an array of mbuf with count, which is widely implemented in DPDK. And 
m->next still lives in the second cache line, so I don't see any gain by this.

I still don't get how m->nb_segs can be abused without m->next.


> > > > > One other point I'll mention is that we need to have a
> > > > > discussion on how/where to add in a timestamp value into the
> > > > > mbuf. Personally, I think it can be in a union with the
> sequence
> > > > > number value, but I also suspect that 32-bits of a timestamp is
> > > > > not going to be enough for
> > > > many.
> > > > >
> > > > > Thoughts?
> > > >
> > > > If we consider that timestamp representation should use
> nanosecond
> > > > granularity, a 32-bit value may likely wrap around too quickly to
> > > > be useful. We can also assume that applications requesting
> > > > timestamps may care more about latency than throughput, Oleg
> found
> > > > that using the second cache line for this purpose had a
> noticeable impact [1].
> > > >
> > > >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > >
> > > I agree with Oleg about the latency vs. throughput importance for
> such applications.
> > >
> > > If you need high resolution timestamps, consider them to be
> generated by the NIC RX driver, possibly by the hardware itself
> (http://w3new.napatech.com/features/time-precision/hardware-time-
> stamp), so the timestamp belongs in the first cache line. And I am
> proposing that it should have the highest possible accuracy, which
> makes the 

[dpdk-dev] mbuf changes

2016-10-25 Thread Bruce Richardson
On Tue, Oct 25, 2016 at 03:00:39PM +0200, Olivier Matz wrote:
> Hi,
> 
> On 10/25/2016 12:22 PM, Morten Brørup wrote:
> > Comments inline (at the end).
> > 
> >> -Original Message-
> >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev,
> >> Konstantin
> >> Sent: Tuesday, October 25, 2016 12:03 PM
> >> To: Richardson, Bruce; Morten Br?rup
> >> Cc: Wiles, Keith; dev at dpdk.org; Olivier Matz
> >> Subject: Re: [dpdk-dev] mbuf changes
> >>
> >> Hi everyone,
> >>
> >>> -Original Message-
> >>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> >>> Sent: Tuesday, October 25, 2016 9:54 AM
> >>> To: Morten Br?rup 
> >>> Cc: Wiles, Keith ; dev at dpdk.org; Olivier Matz
> >>> 
> >>> Subject: Re: [dpdk-dev] mbuf changes
> >>>
> >>> On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Br?rup wrote:
>  Comments inline.
> 
>  Med venlig hilsen / kind regards
>  - Morten Br?rup
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > Richardson
> > Sent: Monday, October 24, 2016 6:26 PM
> > To: Wiles, Keith
> > Cc: Morten Br?rup; dev at dpdk.org; Olivier Matz
> > Subject: Re: [dpdk-dev] mbuf changes
> >
> > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> >>
> >>> On Oct 24, 2016, at 10:49 AM, Morten Br?rup
> >>> 
> > wrote:
> >>>
> >>> 2.
> >>>
> >>> There seemed to be consensus that the size of m->refcnt
> >> should
> >>> match
> > the size of m->port because a packet could be duplicated on all
> > physical ports for L3 multicast and L2 flooding.
> >>>
> >>> Furthermore, although a single physical machine (i.e. a
> >> single
> >>> server)
> > with 255 physical ports probably doesn't exist, it might contain
> > more than
> > 255 virtual machines with a virtual port each, so it makes sense
> > extending these mbuf fields from 8 to 16 bits.
> >>
> >> I thought we also talked about removing the m->port from the
> >> mbuf as it
> > is not really needed.
> >>
> > Yes, this was mentioned, and also the option of moving the port
> > value to the second cacheline, but it appears that NXP are using
> > the port value in their NIC drivers for passing in metadata, so
> > we'd need their agreement on any move (or removal).
> >
>  If a single driver instance services multiple ports, so the ports
>  are not polled individually, the m->port member will be required to
>  identify
> >>> the port. In that case, it might also used elsewhere in the ingress
> >> path, and thus appropriate having in the first cache line.
> >>
> >> Ok, but these days most of devices have multiple rx queues.
> >> So identify the RX source properly you need not only port, but
> >> port+queue (at least 3 bytes).
> >> But I suppose better to wait for NXP input here.
> >>
> >>> So yes, this needs
> >>> further discussion with NXP.
> > 
> > It seems that this concept might be somewhat similar to the Flow Director 
> > information, which already has plenty of bits in the first cache line. You 
> > should consider if this can be somehow folded into the m->hash union.
> 
> 
> I'll tend to agree with Morten.
> 
> First, I think that having more than 255 virtual links is possible,
> so increasing the port size to 16 bits is not a bad idea.
> 
> I think the port information is more useful for a network stack
> than the queue_id, but it may not be the case for all applications.
> On the other hand, the queue_id (or something else providing the same
> level of information) could led in the flow director structure.
> 
> Now, the question is: do we really need the port to be inside the mbuf?
> Indeed, the application can already associate a bulk of packet with a
> port id.
> 
> The reason why I'll prefer to keep it is because I'm pretty sure that
> some applications use it. At least in the examples/ directory, we can
> find distributor, dpdk_qat, ipv4_multicast, load_balancer,
> packet_ordering. I did not check these apps in detail, but it makes me
> feel that other external applications could make use of the port field.
> 

Yes, judging from the reaction at Userspace and here on list, it appears
that port needs to be kept in the mbuf, and since it's present it should
be filled in by the drivers on RX, so it needs to stay where it is, I
think.

/Bruce


[dpdk-dev] mbuf changes

2016-10-25 Thread Ananyev, Konstantin


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Adrien Mazarguil
> Sent: Tuesday, October 25, 2016 2:48 PM
> To: Morten Brørup 
> Cc: Richardson, Bruce ; Wiles, Keith 
> ; dev at dpdk.org; Olivier Matz
> ; Oleg Kuporosov 
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > Comments inline.
> 
> I'm only replying to the nb_segs bits here.
> 
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > > Sent: Tuesday, October 25, 2016 1:14 PM
> > > To: Adrien Mazarguil
> > > Cc: Morten Br?rup; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> > > Kuporosov
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Br?rup wrote:
> > > > > Comments inline.
> > > > >
> > > > > Med venlig hilsen / kind regards
> > > > > - Morten Br?rup
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> > > > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > > To: Bruce Richardson
> > > > > > Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz; Oleg
> > > > > > Kuporosov
> > > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > > >
> > > > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > > > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> > > > > > [...]
> > > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup
> > > > > >  wrote:
> > > > > > [...]
> > > > > > > > > 5.
> > > > > > > > >
> > > > > > > > > And here?s something new to think about:
> > > > > > > > >
> > > > > > > > > m->next already reveals if there are more segments to a
> > > packet.
> > > > > > Which purpose does m->nb_segs serve that is not already covered
> > > by
> > > > > > m-
> > > > > > >next?
> > > > > > >
> > > > > > > It is duplicate info, but nb_segs can be used to check the
> > > > > > > validity
> > > > > > of
> > > > > > > the next pointer without having to read the second mbuf
> > > cacheline.
> > > > > > >
> > > > > > > Whether it's worth having is something I'm happy enough to
> > > > > > > discuss, though.
> > > > > >
> > > > > > Although slower in some cases than a full blown "next packet"
> > > > > > pointer, nb_segs can also be conveniently abused to link several
> > > > > > packets and their segments in the same list without wasting
> > > space.
> > > > >
> > > > > I don't understand that; can you please elaborate? Are you abusing
> > > m->nb_segs as an index into an array in your application? If that is
> > > the case, and it is endorsed by the community, we should get rid of m-
> > > >nb_segs and add a member for application specific use instead.
> > > >
> > > > Well, that's just an idea, I'm not aware of any application using
> > > > this, however the ability to link several packets with segments seems
> > > > useful to me (e.g. buffering packets). Here's a diagram:
> > > >
> > > >  .---.   .---.   .---.   .---.   .---
> > > ---
> > > >  | pkt 0 |   | seg 1 |   | seg 2 |   | pkt 1 |   |
> > > pkt 2
> > > >  |  next --->|  next --->|  next --->|  next --->|
> > > ...
> > > >  | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
> > > >  `---'   `---'   `---'   `---'   `---
> > > ---
> >
> > I see. It makes it possible to refer to a burst of packets (with segments 
> > or not) by a single mbuf reference, as an alternative to the current
> design pattern of using an array and length (struct rte_mbuf **mbufs, 
> unsigned count).
> >
> > This would require implementation in the PMDs etc.
> >
> > And even in this case, m->nb_segs does not need to be an integer, but could 
> > be replaced by a single bit indicating if the segment is a
> continuation of a packet or the beginning (alternatively the end) of a 
> packet, i.e. the bit can be set for either the first or the last segment in
> the packet.

We do need nb_segs, at least for TX.
That's how the TX function calculates how many TXDs it needs to allocate (and fill).
Of course it can re-scan the whole chain of segments to count them, but I think
that would slow things down even more.
Though yes, I suppose it can be moved to the second cache line.
Konstantin
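
A minimal sketch of the two ways of sizing a TX burst that are being compared
here: summing the cached m->nb_segs versus re-walking each m->next chain
(function and loop names are illustrative, not PMD code):

#include <rte_mbuf.h>

/* One read per packet, no pointer chasing into the second cache line. */
static inline unsigned int
example_txd_count_fast(struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	unsigned int need = 0;
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		need += pkts[i]->nb_segs;
	return need;
}

/* The slower alternative nb_segs avoids: walk every segment chain. */
static inline unsigned int
example_txd_count_walk(struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	const struct rte_mbuf *m;
	unsigned int need = 0;
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		for (m = pkts[i]; m != NULL; m = m->next)
			need++;
	return need;
}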

> 
> Sure however if we keep the current definition, a single bit would not be
> enough as it must be nonzero for the buffer to be valid. I think an 8-bit
> field is not that expensive for a counter.
> 
> > It is an almost equivalent alternative to the fundamental design pattern of 
> > using an array of mbuf with count, which is widely implemented
> in DPDK. And m->next still lives in the second cache line, so I don't see any 
> gain by this.
> 
> That's right, it does not have to live in the first cache line, my only
> 

[dpdk-dev] [PATCH v2 1/2] net/i40e: fix link status change interrupt

2016-10-25 Thread Bruce Richardson
On Tue, Oct 25, 2016 at 12:19:05PM +0800, Qiming Yang wrote:
> Previously, link status interrupt in i40e is achieved by checking
> LINK_STAT_CHANGE_MASK in PFINT_ICR0 register which is provided only
> for diagnostic use. Instead, drivers need to get the link status
> change notification by using LSE (Link Status Event).
> 
> This patch enables LSE and calls LSC callback when the event is
> received. This patch also removes the processing on
> LINK_STAT_CHANGE_MASK.
> 
> Fixes: 4861cde46116 ("i40e: new poll mode driver")
> 
> Signed-off-by: Qiming Yang 
> ---
Thanks for the V2. Unfortunately, this conflicts with some other changes
to the driver, making the patch not apply cleanly, and a manual apply I
tried did not compile successfully. Can you please do a V3 rebased on
top of dpdk-next-net/rel_16_11.

Thanks,
/Bruce


[dpdk-dev] mbuf changes

2016-10-25 Thread Bruce Richardson
On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> Comments at the end.
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 
> > -Original Message-
> > From: Bruce Richardson [mailto:bruce.richardson at intel.com]
> > Sent: Tuesday, October 25, 2016 2:20 PM
> > To: Morten Brørup
> > Cc: Adrien Mazarguil; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> > Kuporosov
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Br?rup wrote:
> > > Comments inline.
> > >
> > > > -Original Message-
> > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > > Richardson
> > > > Sent: Tuesday, October 25, 2016 1:14 PM
> > > > To: Adrien Mazarguil
> > > > Cc: Morten Br?rup; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> > > > Kuporosov
> > > > Subject: Re: [dpdk-dev] mbuf changes
> > > >
> > > > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > > > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Br?rup wrote:
> > > > > > Comments inline.
> > > > > >
> > > > > > Med venlig hilsen / kind regards
> > > > > > - Morten Br?rup
> > > > > >
> > > > > >
> > > > > > > -Original Message-
> > > > > > > From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> > > > > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > > > To: Bruce Richardson
> > > > > > > Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz;
> > > > > > > Oleg Kuporosov
> > > > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > > > >
> > > > > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> > wrote:
> > > > > > > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith
> > wrote:
> > > > > > > [...]
> > > > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup
> > > > > > >  wrote:
> > > > > > > [...]
> > >
> > > > > > > > One other point I'll mention is that we need to have a
> > > > > > > > discussion on how/where to add in a timestamp value into
> > the
> > > > > > > > mbuf. Personally, I think it can be in a union with the
> > > > sequence
> > > > > > > > number value, but I also suspect that 32-bits of a
> > timestamp
> > > > > > > > is not going to be enough for
> > > > > > > many.
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > >
> > > > > > > If we consider that timestamp representation should use
> > > > nanosecond
> > > > > > > granularity, a 32-bit value may likely wrap around too
> > quickly
> > > > > > > to be useful. We can also assume that applications requesting
> > > > > > > timestamps may care more about latency than throughput, Oleg
> > > > found
> > > > > > > that using the second cache line for this purpose had a
> > > > noticeable impact [1].
> > > > > > >
> > > > > > >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > > > > >
> > > > > > I agree with Oleg about the latency vs. throughput importance
> > > > > > for
> > > > such applications.
> > > > > >
> > > > > > If you need high resolution timestamps, consider them to be
> > > > generated by the NIC RX driver, possibly by the hardware itself
> > > > (http://w3new.napatech.com/features/time-precision/hardware-time-
> > > > stamp), so the timestamp belongs in the first cache line. And I am
> > > > proposing that it should have the highest possible accuracy, which
> > > > makes the value hardware dependent.
> > > > > >
> > > > > > Furthermore, I am arguing that we leave it up to the
> > application
> > > > > > to
> > > > keep track of the slowly moving bits (i.e. counting whole seconds,
> > > > hours and calendar date) out of band, so we don't use precious
> > space
> > > > in the mbuf. The application doesn't need the NIC RX driver's fast
> > > > path to capture which date (or even which second) a packet was
> > > > received. Yes, it adds complexity to the application, but we can't
> > > > set aside 64 bit for a generic timestamp. Or as a weird tradeoff:
> > > > Put the fast moving 32 bit in the first cache line and the slow
> > > > moving 32 bit in the second cache line, as a placeholder for the
> > application to fill out if needed.
> > > > Yes, it means that the application needs to check the time and
> > > > update its variable holding the slow moving time once every second
> > > > or so; but that should be doable without significant effort.
> > > > >
> > > > > That's a good point, however without a 64 bit value, elapsed time
> > > > > between two arbitrary mbufs cannot be measured reliably due to
> > not
> > > > > enough context, one way or another the low resolution value is
> > > > > also
> > > > needed.
> > > > >
> > > > > Obviously latency-sensitive applications are unlikely to perform
> > > > > lengthy buffering and require this but I'm not sure about all the
> > > > > possible use-cases. Considering many NICs expose 64 bit
> > timestaps,
> > > > > I suggest we do not truncate them.
> > > > >
> > > > > I'm not a fan of the weird tradeoff either, PMDs will be tempted
> > > > > to fill the 

[dpdk-dev] mbuf changes

2016-10-25 Thread Ananyev, Konstantin


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> Sent: Tuesday, October 25, 2016 1:49 PM
> To: Richardson, Bruce ; Morten Brørup  smartsharesystems.com>
> Cc: Adrien Mazarguil ; Wiles, Keith 
> ; dev at dpdk.org; Oleg Kuporosov
> 
> Subject: Re: [dpdk-dev] mbuf changes
> 
> 
> 
> On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> >> Comments at the end.
> >>
> >> Med venlig hilsen / kind regards
> >> - Morten Br?rup
> >>
> >>> -Original Message-
> >>> From: Bruce Richardson [mailto:bruce.richardson at intel.com]
> >>> Sent: Tuesday, October 25, 2016 2:20 PM
> >>> To: Morten Br?rup
> >>> Cc: Adrien Mazarguil; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> >>> Kuporosov
> >>> Subject: Re: [dpdk-dev] mbuf changes
> >>>
> >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Br?rup wrote:
>  Comments inline.
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > Richardson
> > Sent: Tuesday, October 25, 2016 1:14 PM
> > To: Adrien Mazarguil
> > Cc: Morten Br?rup; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> > Kuporosov
> > Subject: Re: [dpdk-dev] mbuf changes
> >
> > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> >> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Br?rup wrote:
> >>> Comments inline.
> >>>
> >>> Med venlig hilsen / kind regards
> >>> - Morten Br?rup
> >>>
> >>>
>  -Original Message-
>  From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
>  Sent: Tuesday, October 25, 2016 11:39 AM
>  To: Bruce Richardson
>  Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz;
>  Oleg Kuporosov
>  Subject: Re: [dpdk-dev] mbuf changes
> 
>  On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> >>> wrote:
> > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith
> >>> wrote:
>  [...]
> >>> On Oct 24, 2016, at 10:49 AM, Morten Br?rup
>   wrote:
>  [...]
> 
> > One other point I'll mention is that we need to have a
> > discussion on how/where to add in a timestamp value into
> >>> the
> > mbuf. Personally, I think it can be in a union with the
> > sequence
> > number value, but I also suspect that 32-bits of a
> >>> timestamp
> > is not going to be enough for
>  many.
> >
> > Thoughts?
> 
>  If we consider that timestamp representation should use
> > nanosecond
>  granularity, a 32-bit value may likely wrap around too
> >>> quickly
>  to be useful. We can also assume that applications requesting
>  timestamps may care more about latency than throughput, Oleg
> > found
>  that using the second cache line for this purpose had a
> > noticeable impact [1].
> 
>   [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> >>>
> >>> I agree with Oleg about the latency vs. throughput importance
> >>> for
> > such applications.
> >>>
> >>> If you need high resolution timestamps, consider them to be
> > generated by the NIC RX driver, possibly by the hardware itself
> > (http://w3new.napatech.com/features/time-precision/hardware-time-
> > stamp), so the timestamp belongs in the first cache line. And I am
> > proposing that it should have the highest possible accuracy, which
> > makes the value hardware dependent.
> >>>
> >>> Furthermore, I am arguing that we leave it up to the
> >>> application
> >>> to
> > keep track of the slowly moving bits (i.e. counting whole seconds,
> > hours and calendar date) out of band, so we don't use precious
> >>> space
> > in the mbuf. The application doesn't need the NIC RX driver's fast
> > path to capture which date (or even which second) a packet was
> > received. Yes, it adds complexity to the application, but we can't
> > set aside 64 bit for a generic timestamp. Or as a weird tradeoff:
> > Put the fast moving 32 bit in the first cache line and the slow
> > moving 32 bit in the second cache line, as a placeholder for the
> >>> application to fill out if needed.
> > Yes, it means that the application needs to check the time and
> > update its variable holding the slow moving time once every second
> > or so; but that should be doable without significant effort.
> >>
> >> That's a good point, however without a 64 bit value, elapsed time
> >> between two arbitrary mbufs cannot be measured reliably due to
> >>> not
> >> enough context, one way or another the low resolution value is
> >> also
> > needed.
> >>
> >> Obviously latency-sensitive applications are unlikely to perform
> >> 

[dpdk-dev] [PATCH v4] net/i40e: fix hash filter invalid issue in X722

2016-10-25 Thread Bruce Richardson
On Tue, Oct 25, 2016 at 10:22:09AM +, Wu, Jingjing wrote:
> 
> 
> > -Original Message-
> > From: Guo, Jia
> > Sent: Tuesday, October 25, 2016 10:43 AM
> > To: Zhang, Helin ; Wu, Jingjing  > intel.com>
> > Cc: dev at dpdk.org; Guo, Jia 
> > Subject: [PATCH v4] net/i40e: fix hash filter invalid issue in X722
> > 
> > When verifying the Hash filtering on X722, we found a problem that
> > the hash value in the descriptor is incorrect. The root cause is that X722
> > uses a different way of hash key word selection compared with X710/XL710.
> > This patch fixes it by setting X722 specific key selection.
> > 
> > Signed-off-by: Jeff Guo 
> 
> If we use default configuration on HW, this issue will not exist. But
> our software sets the default value through i40e_filter_input_set_init.
> So I think the Fixes line needs to be the commit which introduced the input
> set changing feature.
> 
> Acked-by: Jingjing Wu  with additional line:
> 
> Fixes: commit 98f055707685 ("i40e: configure input fields for RSS or flow 
> director ")
> 
Applied, with fixes line, to dpdk-next-net/rel_16_11

/Bruce


[dpdk-dev] [PATCH] doc: release note for ixgbe PMD API's

2016-10-25 Thread Bernard Iremonger
Signed-off-by: Bernard Iremonger 
---
 doc/guides/rel_notes/release_16_11.rst | 4 
 1 file changed, 4 insertions(+)

diff --git a/doc/guides/rel_notes/release_16_11.rst 
b/doc/guides/rel_notes/release_16_11.rst
index 26cdd62..8253614 100644
--- a/doc/guides/rel_notes/release_16_11.rst
+++ b/doc/guides/rel_notes/release_16_11.rst
@@ -131,6 +131,10 @@ New Features
   The GCC 4.9 ``-march`` option supports the Intel processor code names.
   The config option ``RTE_MACHINE`` can be used to pass code names to the 
compiler as ``-march`` flag.

+* **Added APIs for VF management to the ixgbe PMD.**
+
+  Eight new APIs have been added to the ixgbe PMD for VF management from the 
PF.
+  The declarations for the APIs can be found in ``rte_pmd_ixgbe.h``.

 Resolved Issues
 ---
-- 
2.10.1



[dpdk-dev] [PATCH v2] net/i40e: fix fdir configure failed issue in X710

2016-10-25 Thread Bruce Richardson
On Tue, Oct 25, 2016 at 10:28:22AM +, Wu, Jingjing wrote:
> 
> 
> > -Original Message-
> > From: Guo, Jia
> > Sent: Tuesday, October 25, 2016 10:26 AM
> > To: Zhang, Helin ; Wu, Jingjing  > intel.com>
> > Cc: dev at dpdk.org; Guo, Jia 
> > Subject: [PATCH v2] net/i40e: fix fdir configure failed issue in X710
> > 
> > Because some registers are only supported by X722, such as
> > I40E_GLQF_FD_PCTYPES, the MAC type needs to be used to distinguish the
> > behavior of X722 from X710 and other NICs; otherwise X710 functionality
> > would fail.
> > 
> > Fixes: 8c5cb3c11513 ("net/i40e: add packet type translation for X722")
> > Signed-off-by: Jeff Guo 
> 
> Acked-by: Jingjing Wu 
> 
Applied to dpdk-next-net/rel_16_11

/Bruce


[dpdk-dev] mbuf changes

2016-10-25 Thread Bruce Richardson
On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Br?rup wrote:
> Comments inline.
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Tuesday, October 25, 2016 1:14 PM
> > To: Adrien Mazarguil
> > Cc: Morten Br?rup; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> > Kuporosov
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Br?rup wrote:
> > > > Comments inline.
> > > >
> > > > Med venlig hilsen / kind regards
> > > > - Morten Br?rup
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> > > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > To: Bruce Richardson
> > > > > Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz; Oleg
> > > > > Kuporosov
> > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > >
> > > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> > > > > [...]
> > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup
> > > > >  wrote:
> > > > > [...]
> > > > > > > > 5.
> > > > > > > >
> > > > > > > > And here?s something new to think about:
> > > > > > > >
> > > > > > > > m->next already reveals if there are more segments to a
> > packet.
> > > > > Which purpose does m->nb_segs serve that is not already covered
> > by
> > > > > m-
> > > > > >next?
> > > > > >
> > > > > > It is duplicate info, but nb_segs can be used to check the
> > > > > > validity
> > > > > of
> > > > > > the next pointer without having to read the second mbuf
> > cacheline.
> > > > > >
> > > > > > Whether it's worth having is something I'm happy enough to
> > > > > > discuss, though.
> > > > >
> > > > > Although slower in some cases than a full blown "next packet"
> > > > > pointer, nb_segs can also be conveniently abused to link several
> > > > > packets and their segments in the same list without wasting
> > space.
> > > >
> > > > I don?t understand that; can you please elaborate? Are you abusing
> > m->nb_segs as an index into an array in your application? If that is
> > the case, and it is endorsed by the community, we should get rid of m-
> > >nb_segs and add a member for application specific use instead.
> > >
> > > Well, that's just an idea, I'm not aware of any application using
> > > this, however the ability to link several packets with segments seems
> > > useful to me (e.g. buffering packets). Here's a diagram:
> > >
> > >  .---.   .---.   .---.   .---.   .---
> > ---
> > >  | pkt 0 |   | seg 1 |   | seg 2 |   | pkt 1 |   |
> > pkt 2
> > >  |  next --->|  next --->|  next --->|  next --->|
> > ...
> > >  | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
> > >  `---'   `---'   `---'   `---'   `---
> > ---
> 
> I see. It makes it possible to refer to a burst of packets (with segments or 
> not) by a single mbuf reference, as an alternative to the current design 
> pattern of using an array and length (struct rte_mbuf **mbufs, unsigned 
> count).
> 
> This would require implementation in the PMDs etc.
> 
> And even in this case, m->nb_segs does not need to be an integer, but could 
> be replaced by a single bit indicating if the segment is a continuation of a 
> packet or the beginning (alternatively the end) of a packet, i.e. the bit can 
> be set for either the first or the last segment in the packet.
> 
> It is an almost equivalent alternative to the fundamental design pattern of 
> using an array of mbuf with count, which is widely implemented in DPDK. And 
> m->next still lives in the second cache line, so I don't see any gain by this.
> 
> I still don't get how m->nb_segs can be abused without m->next.
> 
> 
> > > > > > One other point I'll mention is that we need to have a
> > > > > > discussion on how/where to add in a timestamp value into the
> > > > > > mbuf. Personally, I think it can be in a union with the
> > sequence
> > > > > > number value, but I also suspect that 32-bits of a timestamp is
> > > > > > not going to be enough for
> > > > > many.
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > If we consider that timestamp representation should use
> > nanosecond
> > > > > granularity, a 32-bit value may likely wrap around too quickly to
> > > > > be useful. We can also assume that applications requesting
> > > > > timestamps may care more about latency than throughput, Oleg
> > found
> > > > > that using the second cache line for this purpose had a
> > noticeable impact [1].
> > > > >
> > > > >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > > >
> > > > I agree with Oleg about the latency vs. throughput importance for
> > such applications.
> > > >
> > > > If you need high resolution timestamps, 

[dpdk-dev] mbuf changes

2016-10-25 Thread Bruce Richardson
On Tue, Oct 25, 2016 at 05:24:28PM +0530, Shreyansh Jain wrote:
> On Monday 24 October 2016 09:55 PM, Bruce Richardson wrote:
> > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> > > 
> > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup  > > > smartsharesystems.com> wrote:
> > > > 
> > > > First of all: Thanks for a great DPDK Userspace 2016!
> > > > 
> > > > 
> > > > 
> > > > Continuing the Userspace discussion about Olivier Matz's proposed mbuf 
> > > > changes...
> > 
> > Thanks for keeping the discussion going!
> > > > 
> > > > 
> > > > 
> > > > 1.
> > > > 
> > > > Stephen Hemminger had a noteworthy general comment about keeping 
> > > > metadata for the NIC in the appropriate section of the mbuf: Metadata 
> > > > generated by the NIC's RX handler belongs in the first cache line, and 
> > > > metadata required by the NIC's TX handler belongs in the second cache 
> > > > line. This also means that touching the second cache line on ingress 
> > > > should be avoided if possible; and Bruce Richardson mentioned that for 
> > > > this reason m->next was zeroed on free().
> > > > 
> > Thinking about it, I suspect there are more fields we can reset on free
> > to save time on alloc. Refcnt, as discussed below is one of them, but so
> > too could be the nb_segs field and possibly others.
> > 
> > > > 
> > > > 
> > > > 2.
> > > > 
> > > > There seemed to be consensus that the size of m->refcnt should match 
> > > > the size of m->port because a packet could be duplicated on all 
> > > > physical ports for L3 multicast and L2 flooding.
> > > > 
> > > > Furthermore, although a single physical machine (i.e. a single server) 
> > > > with 255 physical ports probably doesn't exist, it might contain more 
> > > > than 255 virtual machines with a virtual port each, so it makes sense 
> > > > extending these mbuf fields from 8 to 16 bits.
> > > 
> > > I thought we also talked about removing the m->port from the mbuf as it 
> > > is not really needed.
> > > 
> > Yes, this was mentioned, and also the option of moving the port value to
> > the second cacheline, but it appears that NXP are using the port value
> > in their NIC drivers for passing in metadata, so we'd need their
> > agreement on any move (or removal).
> 
> I am not sure where NXP's NIC came into picture on this, but now that it is
> highlighted, this field is required for libevent implementation [1].
> 
> A scheduler sending an event, which can be a packet, would only have
> information of a flow_id. From this matching it back to a port, without
> mbuf->port, would be very difficult (costly). There may be way around this
> but at least in current proposal I think port would be important to have -
> even if in second cache line.
> 
> But, off the top of my head, as of now it is not being used for any specific
> purpose in NXP's PMD implementation.
> 
> Even the SoC patches don't necessarily rely on it except using it because it
> is available.
> 
> @Bruce: where did you get the NXP context here from?
> 
Oh, I'm just mis-remembering. :-( It was someone else who was looking for
this - Netronome, perhaps?

CC'ing Alejandro in the hope I'm remembering correctly second time
round!

/Bruce


[dpdk-dev] mbuf changes

2016-10-25 Thread Adrien Mazarguil
On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Br?rup wrote:
> Comments inline.
> 
> Med venlig hilsen / kind regards
> - Morten Br?rup
> 
> 
> > -Original Message-
> > From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> > Sent: Tuesday, October 25, 2016 11:39 AM
> > To: Bruce Richardson
> > Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz; Oleg
> > Kuporosov
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> > [...]
> > > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup
> >  wrote:
> > [...]
> > > > > 5.
> > > > >
> > > > > And here?s something new to think about:
> > > > >
> > > > > m->next already reveals if there are more segments to a packet.
> > Which purpose does m->nb_segs serve that is not already covered by m-
> > >next?
> > >
> > > It is duplicate info, but nb_segs can be used to check the validity
> > of
> > > the next pointer without having to read the second mbuf cacheline.
> > >
> > > Whether it's worth having is something I'm happy enough to discuss,
> > > though.
> > 
> > Although slower in some cases than a full blown "next packet" pointer,
> > nb_segs can also be conveniently abused to link several packets and
> > their segments in the same list without wasting space.
> 
> I don?t understand that; can you please elaborate? Are you abusing m->nb_segs 
> as an index into an array in your application? If that is the case, and it is 
> endorsed by the community, we should get rid of m->nb_segs and add a member 
> for application specific use instead. 

Well, that's just an idea, I'm not aware of any application using this,
however the ability to link several packets with segments seems
useful to me (e.g. buffering packets). Here's a diagram:

 .---.   .---.   .---.   .---.   .--
 | pkt 0 |   | seg 1 |   | seg 2 |   | pkt 1 |   | pkt 2
 |  next --->|  next --->|  next --->|  next --->| ...
 | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
 `---'   `---'   `---'   `---'   `--
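A minimal sketch of walking such a combined list, using only the standard
next/nb_segs mbuf fields (the process_packet() callback is hypothetical):

#include <stddef.h>
#include <rte_mbuf.h>

/* Walk a list where packets and their segments are chained through m->next,
 * as in the diagram above: nb_segs of each packet head tells how many mbufs
 * belong to that packet, so the next packet head is nb_segs hops further
 * down the chain. Sketch only. */
static void
walk_packet_chain(struct rte_mbuf *head)
{
	while (head != NULL) {
		struct rte_mbuf *m = head;
		unsigned int n = head->nb_segs;

		while (n-- > 0 && m != NULL)
			m = m->next;    /* skip this packet's segments */

		/* process_packet(head); hypothetical per-packet handling */
		head = m;               /* head of the next packet, or NULL */
	}
}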

> > > One other point I'll mention is that we need to have a discussion on
> > > how/where to add in a timestamp value into the mbuf. Personally, I
> > > think it can be in a union with the sequence number value, but I also
> > > suspect that 32-bits of a timestamp is not going to be enough for
> > many.
> > >
> > > Thoughts?
> > 
> > If we consider that timestamp representation should use nanosecond
> > granularity, a 32-bit value may likely wrap around too quickly to be
> > useful. We can also assume that applications requesting timestamps may
> > care more about latency than throughput, Oleg found that using the
> > second cache line for this purpose had a noticeable impact [1].
> > 
> >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> 
> I agree with Oleg about the latency vs. throughput importance for such 
> applications.
> 
> If you need high resolution timestamps, consider them to be generated by the 
> NIC RX driver, possibly by the hardware itself 
> (http://w3new.napatech.com/features/time-precision/hardware-time-stamp), so 
> the timestamp belongs in the first cache line. And I am proposing that it 
> should have the highest possible accuracy, which makes the value hardware 
> dependent.
> 
> Furthermore, I am arguing that we leave it up to the application to keep 
> track of the slowly moving bits (i.e. counting whole seconds, hours and 
> calendar date) out of band, so we don't use precious space in the mbuf. The 
> application doesn't need the NIC RX driver's fast path to capture which date 
> (or even which second) a packet was received. Yes, it adds complexity to the 
> application, but we can't set aside 64 bit for a generic timestamp. Or as a 
> weird tradeoff: Put the fast moving 32 bit in the first cache line and the 
> slow moving 32 bit in the second cache line, as a placeholder for the 
> application to fill out if needed. Yes, it means that the application needs 
> to check the time and update its variable holding the slow moving time once 
> every second or so; but that should be doable without significant effort.

That's a good point, however without a 64 bit value, elapsed time between
two arbitrary mbufs cannot be measured reliably due to not enough context,
one way or another the low resolution value is also needed.

Obviously latency-sensitive applications are unlikely to perform lengthy
buffering and require this but I'm not sure about all the possible
use-cases. Considering many NICs expose 64 bit timestamps, I suggest we do
not truncate them.

I'm not a fan of the weird tradeoff either, PMDs will be tempted to fill the
extra 32 bits whenever they can and negate the performance improvement of
the first cache line.

-- 
Adrien Mazarguil
6WIND


[dpdk-dev] [PATCH 1/3] mbuf: embedding timestamp into the packet

2016-10-25 Thread Pattan, Reshma
Hi,

> -Original Message-
> From: Oleg Kuporosov [mailto:olegk at mellanox.com]
> Sent: Wednesday, October 19, 2016 6:40 PM
> To: Pattan, Reshma ; Olivier Matz
> 
> Cc: Richardson, Bruce ; Ananyev, Konstantin
> ; dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 1/3] mbuf: embedding timestamp into the
> packet
> 
> Hello Reshma
> 
> > I just read this mail chain, to make every one aware again, I am
> > emphasizing on the point that I am also adding new "time_arraived"
> > field to mbuf struct as part of  below 17.02 Road map item.
> 
> Thanks for your work for extending statistics support in DPDK.
> 
> "time_arrived" points here for mostly RX path, while more common term used
> "timestamp"
> Allows also using it for TX path as well in the future.
> 

I will take a note of this for next revision.

Thanks,
Reshma



[dpdk-dev] mbuf changes

2016-10-25 Thread Morten Brørup
Comments inline (at the end).

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev,
> Konstantin
> Sent: Tuesday, October 25, 2016 12:03 PM
> To: Richardson, Bruce; Morten Br?rup
> Cc: Wiles, Keith; dev at dpdk.org; Olivier Matz
> Subject: Re: [dpdk-dev] mbuf changes
> 
> Hi everyone,
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Tuesday, October 25, 2016 9:54 AM
> > To: Morten Br?rup 
> > Cc: Wiles, Keith ; dev at dpdk.org; Olivier Matz
> > 
> > Subject: Re: [dpdk-dev] mbuf changes
> >
> > On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Br?rup wrote:
> > > Comments inline.
> > >
> > > Med venlig hilsen / kind regards
> > > - Morten Br?rup
> > >
> > > > -Original Message-
> > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > > Richardson
> > > > Sent: Monday, October 24, 2016 6:26 PM
> > > > To: Wiles, Keith
> > > > Cc: Morten Br?rup; dev at dpdk.org; Olivier Matz
> > > > Subject: Re: [dpdk-dev] mbuf changes
> > > >
> > > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> > > > >
> > > > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup
> > > > > > 
> > > > wrote:
> > > > > >
> > > > > > 2.
> > > > > >
> > > > > > There seemed to be consensus that the size of m->refcnt
> should
> > > > > > match
> > > > the size of m->port because a packet could be duplicated on all
> > > > physical ports for L3 multicast and L2 flooding.
> > > > > >
> > > > > > Furthermore, although a single physical machine (i.e. a
> single
> > > > > > server)
> > > > with 255 physical ports probably doesn?t exist, it might contain
> > > > more than
> > > > 255 virtual machines with a virtual port each, so it makes sense
> > > > extending these mbuf fields from 8 to 16 bits.
> > > > >
> > > > > I thought we also talked about removing the m->port from the
> > > > > mbuf as it
> > > > is not really needed.
> > > > >
> > > > Yes, this was mentioned, and also the option of moving the port
> > > > value to the second cacheline, but it appears that NXP are using
> > > > the port value in their NIC drivers for passing in metadata, so
> > > > we'd need their agreement on any move (or removal).
> > > >
> > > If a single driver instance services multiple ports, so the ports
> > > are not polled individually, the m->port member will be required to
> > > identify
> > the port. In that case, it might also used elsewhere in the ingress
> path, and thus appropriate having in the first cache line.
> 
> Ok, but these days most devices have multiple RX queues.
> So to identify the RX source properly you need not only the port, but
> port+queue (at least 3 bytes).
> But I suppose better to wait for NXP input here.
> 
> >So yes, this needs
> > further discussion with NXP.

It seems that this concept might be somewhat similar to the Flow Director 
information, which already has plenty of bits in the first cache line. You 
should consider if this can be somehow folded into the m->hash union.

-Morten
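For illustration, a hypothetical way such an RX source identifier (port +
queue) could be packed into a 32-bit union member, e.g. alongside the
existing mbuf->hash fields; this is not the actual rte_mbuf layout:

#include <stdint.h>

/* Hypothetical packing of the RX source into one 32-bit word. */
union rx_source {
	uint32_t raw;
	struct {
		uint16_t port;  /* receiving port id */
		uint16_t queue; /* receiving RX queue id on that port */
	} id;
};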



[dpdk-dev] [PATCH v2 2/2] net/i40e: fix VF bonded device link down

2016-10-25 Thread Qiming Yang
If a VF device is used as a slave of a bond device, it will be polled
periodically through an alarm. An interrupt is involved here. The VF
will then send an I40E_VIRTCHNL_OP_GET_LINK_STAT message to the
PF to query the status. The response is handled by an interrupt
callback, so an interrupt is involved here again. That's why the bond
device cannot be brought up.

This patch removes the I40E_VIRTCHNL_OP_GET_LINK_STAT
message. The link status in the VF driver is updated when the PF driver
notifies it, and the VF stores this link status locally. The VF driver
just returns the local status when queried.

Fixes: 4861cde46116 ("i40e: new poll mode driver")

Signed-off-by: Qiming Yang 
---
 drivers/net/i40e/i40e_ethdev.c| 22 ++-
 drivers/net/i40e/i40e_ethdev.h|  4 +-
 drivers/net/i40e/i40e_ethdev_vf.c | 81 +--
 drivers/net/i40e/i40e_pf.c| 33 
 drivers/net/i40e/i40e_pf.h|  3 +-
 5 files changed, 67 insertions(+), 76 deletions(-)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 9c1f542..13060db 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -5418,6 +5418,24 @@ i40e_dev_handle_vfr_event(struct rte_eth_dev *dev)
 }

 static void
+i40e_notify_all_vfs_link_status(struct rte_eth_dev *dev)
+{
+   struct i40e_pf *pf = I40E_DEV_PRIVATE_TO_PF(dev->data->dev_private);
+   struct i40e_virtchnl_pf_event event;
+   int i;
+
+   event.event = I40E_VIRTCHNL_EVENT_LINK_CHANGE;
+   event.event_data.link_event.link_status =
+   dev->data->dev_link.link_status;
+   event.event_data.link_event.link_speed =
+   dev->data->dev_link.link_speed;
+
+   for (i = 0; i < pf->vf_num; i++)
+   i40e_pf_host_send_msg_to_vf(&pf->vfs[i], I40E_VIRTCHNL_OP_EVENT,
+   I40E_SUCCESS, (uint8_t *)&event, sizeof(event));
+}
+
+static void
 i40e_dev_handle_aq_msg(struct rte_eth_dev *dev)
 {
struct i40e_hw *hw = I40E_DEV_PRIVATE_TO_HW(dev->data->dev_private);
@@ -5455,9 +5473,11 @@ i40e_dev_handle_aq_msg(struct rte_eth_dev *dev)
break;
case i40e_aqc_opc_get_link_status:
ret = i40e_dev_link_update(dev, 0);
-   if (!ret)
+   if (!ret) {
+   i40e_notify_all_vfs_link_status(dev);
_rte_eth_dev_callback_process(dev,
RTE_ETH_EVENT_INTR_LSC);
+   }
break;
default:
PMD_DRV_LOG(ERR, "Request %u is not supported yet",
diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
index 92c8fad..61dfa93 100644
--- a/drivers/net/i40e/i40e_ethdev.h
+++ b/drivers/net/i40e/i40e_ethdev.h
@@ -599,7 +599,9 @@ int i40e_hash_filter_inset_select(struct i40e_hw *hw,
 struct rte_eth_input_set_conf *conf);
 int i40e_fdir_filter_inset_select(struct i40e_pf *pf,
 struct rte_eth_input_set_conf *conf);
-
+int i40e_pf_host_send_msg_to_vf(struct i40e_pf_vf *vf, uint32_t opcode,
+   uint32_t retval, uint8_t *msg,
+   uint16_t msglen);
 void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_rxq_info *qinfo);
 void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
diff --git a/drivers/net/i40e/i40e_ethdev_vf.c 
b/drivers/net/i40e/i40e_ethdev_vf.c
index a616ae0..ba63a7f 100644
--- a/drivers/net/i40e/i40e_ethdev_vf.c
+++ b/drivers/net/i40e/i40e_ethdev_vf.c
@@ -126,8 +126,6 @@ static void i40evf_dev_promiscuous_enable(struct 
rte_eth_dev *dev);
 static void i40evf_dev_promiscuous_disable(struct rte_eth_dev *dev);
 static void i40evf_dev_allmulticast_enable(struct rte_eth_dev *dev);
 static void i40evf_dev_allmulticast_disable(struct rte_eth_dev *dev);
-static int i40evf_get_link_status(struct rte_eth_dev *dev,
- struct rte_eth_link *link);
 static int i40evf_init_vlan(struct rte_eth_dev *dev);
 static int i40evf_dev_rx_queue_start(struct rte_eth_dev *dev,
 uint16_t rx_queue_id);
@@ -1084,31 +1082,6 @@ i40evf_del_vlan(struct rte_eth_dev *dev, uint16_t vlanid)
return err;
 }

-static int
-i40evf_get_link_status(struct rte_eth_dev *dev, struct rte_eth_link *link)
-{
-   struct i40e_vf *vf = I40EVF_DEV_PRIVATE_TO_VF(dev->data->dev_private);
-   int err;
-   struct vf_cmd_info args;
-   struct rte_eth_link *new_link;
-
-   args.ops = (enum i40e_virtchnl_ops)I40E_VIRTCHNL_OP_GET_LINK_STAT;
-   args.in_args = NULL;
-   args.in_args_size = 0;
-   args.out_buffer = vf->aq_resp;
-   args.out_size = I40E_AQ_BUF_SZ;
-   err = i40evf_execute_vf_cmd(dev, &args);
-   if (err) {
-   PMD_DRV_LOG(ERR, "fail to execute command 

[dpdk-dev] [PATCH v2 1/2] net/i40e: fix link status change interrupt

2016-10-25 Thread Qiming Yang
Previously, the link status interrupt in i40e was achieved by checking
LINK_STAT_CHANGE_MASK in the PFINT_ICR0 register, which is provided only
for diagnostic use. Instead, drivers need to get the link status
change notification by using LSE (Link Status Event).

This patch enables LSE and calls the LSC callback when the event is
received. This patch also removes the processing on
LINK_STAT_CHANGE_MASK.

Fixes: 4861cde46116 ("i40e: new poll mode driver")

Signed-off-by: Qiming Yang 
---
 drivers/net/i40e/i40e_ethdev.c | 97 +-
 1 file changed, 19 insertions(+), 78 deletions(-)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index d0aeb70..9c1f542 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -108,7 +108,6 @@
I40E_PFINT_ICR0_ENA_GRST_MASK | \
I40E_PFINT_ICR0_ENA_PCI_EXCEPTION_MASK | \
I40E_PFINT_ICR0_ENA_STORM_DETECT_MASK | \
-   I40E_PFINT_ICR0_ENA_LINK_STAT_CHANGE_MASK | \
I40E_PFINT_ICR0_ENA_HMC_ERR_MASK | \
I40E_PFINT_ICR0_ENA_PE_CRITERR_MASK | \
I40E_PFINT_ICR0_ENA_VFLR_MASK | \
@@ -1768,6 +1767,16 @@ i40e_dev_start(struct rte_eth_dev *dev)
if (dev->data->dev_conf.intr_conf.lsc != 0)
PMD_INIT_LOG(INFO, "lsc won't enable because of"
 " no intr multiplex\n");
+   } else if (dev->data->dev_conf.intr_conf.lsc != 0) {
+   ret = i40e_aq_set_phy_int_mask(hw,
+  ~(I40E_AQ_EVENT_LINK_UPDOWN |
+  I40E_AQ_EVENT_MODULE_QUAL_FAIL |
+  I40E_AQ_EVENT_MEDIA_NA), NULL);
+   if (ret != I40E_SUCCESS)
+   PMD_DRV_LOG(WARNING, "Fail to set phy mask");
+
+   /* Call get_link_info aq command to enable LSE */
+   i40e_dev_link_update(dev, 0);
}

/* enable uio intr after callback register */
@@ -1984,6 +1993,7 @@ i40e_dev_link_update(struct rte_eth_dev *dev,
struct rte_eth_link link, old;
int status;
unsigned rep_cnt = MAX_REPEAT_TIME;
+   bool enable_lse = dev->data->dev_conf.intr_conf.lsc ? true : false;

memset(&link, 0, sizeof(link));
memset(&old, 0, sizeof(old));
@@ -1992,7 +2002,8 @@ i40e_dev_link_update(struct rte_eth_dev *dev,

do {
/* Get link status information from hardware */
-   status = i40e_aq_get_link_info(hw, false, &link_status, NULL);
+   status = i40e_aq_get_link_info(hw, enable_lse,
+   &link_status, NULL);
if (status != I40E_SUCCESS) {
link.link_speed = ETH_SPEED_NUM_100M;
link.link_duplex = ETH_LINK_FULL_DUPLEX;
@@ -5442,6 +5453,12 @@ i40e_dev_handle_aq_msg(struct rte_eth_dev *dev)
info.msg_buf,
info.msg_len);
break;
+   case i40e_aqc_opc_get_link_status:
+   ret = i40e_dev_link_update(dev, 0);
+   if (!ret)
+   _rte_eth_dev_callback_process(dev,
+   RTE_ETH_EVENT_INTR_LSC);
+   break;
default:
PMD_DRV_LOG(ERR, "Request %u is not supported yet",
opcode);
@@ -5451,57 +5468,6 @@ i40e_dev_handle_aq_msg(struct rte_eth_dev *dev)
rte_free(info.msg_buf);
 }

-/*
- * Interrupt handler is registered as the alarm callback for handling LSC
- * interrupt in a definite of time, in order to wait the NIC into a stable
- * state. Currently it waits 1 sec in i40e for the link up interrupt, and
- * no need for link down interrupt.
- */
-static void
-i40e_dev_interrupt_delayed_handler(void *param)
-{
-   struct rte_eth_dev *dev = (struct rte_eth_dev *)param;
-   struct i40e_hw *hw = I40E_DEV_PRIVATE_TO_HW(dev->data->dev_private);
-   uint32_t icr0;
-
-   /* read interrupt causes again */
-   icr0 = I40E_READ_REG(hw, I40E_PFINT_ICR0);
-
-#ifdef RTE_LIBRTE_I40E_DEBUG_DRIVER
-   if (icr0 & I40E_PFINT_ICR0_ECC_ERR_MASK)
-   PMD_DRV_LOG(ERR, "ICR0: unrecoverable ECC error\n");
-   if (icr0 & I40E_PFINT_ICR0_MAL_DETECT_MASK)
-   PMD_DRV_LOG(ERR, "ICR0: malicious programming detected\n");
-   if (icr0 & I40E_PFINT_ICR0_GRST_MASK)
-   PMD_DRV_LOG(INFO, "ICR0: global reset requested\n");
-   if (icr0 & I40E_PFINT_ICR0_PCI_EXCEPTION_MASK)
-   PMD_DRV_LOG(INFO, "ICR0: PCI exception\n activated\n");
-   if (icr0 & I40E_PFINT_ICR0_STORM_DETECT_MASK)
-   PMD_DRV_LOG(INFO, "ICR0: a change in the storm control "
- 

[dpdk-dev] mbuf changes

2016-10-25 Thread Bruce Richardson
On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Br?rup wrote:
> > Comments inline.
> > 
> > Med venlig hilsen / kind regards
> > - Morten Br?rup
> > 
> > 
> > > -Original Message-
> > > From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > To: Bruce Richardson
> > > Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz; Oleg
> > > Kuporosov
> > > Subject: Re: [dpdk-dev] mbuf changes
> > > 
> > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> > > [...]
> > > > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup
> > >  wrote:
> > > [...]
> > > > > > 5.
> > > > > >
> > > > > > And here?s something new to think about:
> > > > > >
> > > > > > m->next already reveals if there are more segments to a packet.
> > > Which purpose does m->nb_segs serve that is not already covered by m-
> > > >next?
> > > >
> > > > It is duplicate info, but nb_segs can be used to check the validity
> > > of
> > > > the next pointer without having to read the second mbuf cacheline.
> > > >
> > > > Whether it's worth having is something I'm happy enough to discuss,
> > > > though.
> > > 
> > > Although slower in some cases than a full blown "next packet" pointer,
> > > nb_segs can also be conveniently abused to link several packets and
> > > their segments in the same list without wasting space.
> > 
> > I don?t understand that; can you please elaborate? Are you abusing 
> > m->nb_segs as an index into an array in your application? If that is the 
> > case, and it is endorsed by the community, we should get rid of m->nb_segs 
> > and add a member for application specific use instead. 
> 
> Well, that's just an idea, I'm not aware of any application using this,
> however the ability to link several packets with segments seems
> useful to me (e.g. buffering packets). Here's a diagram:
> 
>  .---.   .---.   .---.   .---.   .--
>  | pkt 0 |   | seg 1 |   | seg 2 |   | pkt 1 |   | pkt 2
>  |  next --->|  next --->|  next --->|  next --->| ...
>  | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
>  `---'   `---'   `---'   `---'   `--
> 
> > > > One other point I'll mention is that we need to have a discussion on
> > > > how/where to add in a timestamp value into the mbuf. Personally, I
> > > > think it can be in a union with the sequence number value, but I also
> > > > suspect that 32-bits of a timestamp is not going to be enough for
> > > many.
> > > >
> > > > Thoughts?
> > > 
> > > If we consider that timestamp representation should use nanosecond
> > > granularity, a 32-bit value may likely wrap around too quickly to be
> > > useful. We can also assume that applications requesting timestamps may
> > > care more about latency than throughput, Oleg found that using the
> > > second cache line for this purpose had a noticeable impact [1].
> > > 
> > >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > 
> > I agree with Oleg about the latency vs. throughput importance for such 
> > applications.
> > 
> > If you need high resolution timestamps, consider them to be generated by 
> > the NIC RX driver, possibly by the hardware itself 
> > (http://w3new.napatech.com/features/time-precision/hardware-time-stamp), so 
> > the timestamp belongs in the first cache line. And I am proposing that it 
> > should have the highest possible accuracy, which makes the value hardware 
> > dependent.
> > 
> > Furthermore, I am arguing that we leave it up to the application to keep 
> > track of the slowly moving bits (i.e. counting whole seconds, hours and 
> > calendar date) out of band, so we don't use precious space in the mbuf. The 
> > application doesn't need the NIC RX driver's fast path to capture which 
> > date (or even which second) a packet was received. Yes, it adds complexity 
> > to the application, but we can't set aside 64 bit for a generic timestamp. 
> > Or as a weird tradeoff: Put the fast moving 32 bit in the first cache line 
> > and the slow moving 32 bit in the second cache line, as a placeholder for 
> > the application to fill out if needed. Yes, it means that the application 
> > needs to check the time and update its variable holding the slow moving 
> > time once every second or so; but that should be doable without significant 
> > effort.
> 
> That's a good point, however without a 64 bit value, elapsed time between
> two arbitrary mbufs cannot be measured reliably due to not enough context,
> one way or another the low resolution value is also needed.
> 
> Obviously latency-sensitive applications are unlikely to perform lengthy
> buffering and require this but I'm not sure about all the possible
> use-cases. Considering many NICs expose 64 bit timestaps, I suggest we 

[dpdk-dev] mbuf changes

2016-10-25 Thread Morten Brørup
Comments inline.

Med venlig hilsen / kind regards
- Morten Brørup


> -Original Message-
> From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> Sent: Tuesday, October 25, 2016 11:39 AM
> To: Bruce Richardson
> Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz; Oleg
> Kuporosov
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> [...]
> > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup
>  wrote:
> [...]
> > > > 5.
> > > >
> > > > And here?s something new to think about:
> > > >
> > > > m->next already reveals if there are more segments to a packet.
> Which purpose does m->nb_segs serve that is not already covered by m-
> >next?
> >
> > It is duplicate info, but nb_segs can be used to check the validity
> of
> > the next pointer without having to read the second mbuf cacheline.
> >
> > Whether it's worth having is something I'm happy enough to discuss,
> > though.
> 
> Although slower in some cases than a full blown "next packet" pointer,
> nb_segs can also be conveniently abused to link several packets and
> their segments in the same list without wasting space.

I don't understand that; can you please elaborate? Are you abusing m->nb_segs 
as an index into an array in your application? If that is the case, and it is 
endorsed by the community, we should get rid of m->nb_segs and add a member for 
application specific use instead. 

> > One other point I'll mention is that we need to have a discussion on
> > how/where to add in a timestamp value into the mbuf. Personally, I
> > think it can be in a union with the sequence number value, but I also
> > suspect that 32-bits of a timestamp is not going to be enough for
> many.
> >
> > Thoughts?
> 
> If we consider that timestamp representation should use nanosecond
> granularity, a 32-bit value may likely wrap around too quickly to be
> useful. We can also assume that applications requesting timestamps may
> care more about latency than throughput, Oleg found that using the
> second cache line for this purpose had a noticeable impact [1].
> 
>  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html

I agree with Oleg about the latency vs. throughput importance for such 
applications.

If you need high resolution timestamps, consider them to be generated by the 
NIC RX driver, possibly by the hardware itself 
(http://w3new.napatech.com/features/time-precision/hardware-time-stamp), so the 
timestamp belongs in the first cache line. And I am proposing that it should 
have the highest possible accuracy, which makes the value hardware dependent.

Furthermore, I am arguing that we leave it up to the application to keep track 
of the slowly moving bits (i.e. counting whole seconds, hours and calendar 
date) out of band, so we don't use precious space in the mbuf. The application 
doesn't need the NIC RX driver's fast path to capture which date (or even which 
second) a packet was received. Yes, it adds complexity to the application, but 
we can't set aside 64 bit for a generic timestamp. Or as a weird tradeoff: Put 
the fast moving 32 bit in the first cache line and the slow moving 32 bit in 
the second cache line, as a placeholder for the application to fill out if 
needed. Yes, it means that the application needs to check the time and update 
its variable holding the slow moving time once every second or so; but that 
should be doable without significant effort.
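A rough sketch of that split, assuming a hypothetical 32-bit fast-moving
nanosecond field filled by the driver (field and type names are illustrative,
not existing mbuf members):

#include <stdint.h>

/* The application keeps the slow-moving upper 32 bits out of band and
 * detects wrap-around of the driver-provided low 32 bits. */
struct app_clock {
	uint64_t high;     /* upper bits, maintained by the application */
	uint32_t last_low; /* last low word seen, to detect wrap-around */
};

static uint64_t
app_full_timestamp(struct app_clock *clk, uint32_t ts_low)
{
	if (ts_low < clk->last_low)        /* 32-bit counter wrapped */
		clk->high += UINT64_C(1) << 32;
	clk->last_low = ts_low;
	return clk->high | ts_low;
}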


> 
> --
> Adrien Mazarguil
> 6WIND



[dpdk-dev] mbuf changes

2016-10-25 Thread Adrien Mazarguil
On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
[...]
> > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup  
> > > wrote:
[...]
> > > 5.
> > > 
> > > And here?s something new to think about:
> > > 
> > > m->next already reveals if there are more segments to a packet. Which 
> > > purpose does m->nb_segs serve that is not already covered by m->next?
> 
> It is duplicate info, but nb_segs can be used to check the validity of
> the next pointer without having to read the second mbuf cacheline.
> 
> Whether it's worth having is something I'm happy enough to discuss,
> though.

Although slower in some cases than a full blown "next packet" pointer,
nb_segs can also be conveniently abused to link several packets and their
segments in the same list without wasting space.

> One other point I'll mention is that we need to have a discussion on
> how/where to add in a timestamp value into the mbuf. Personally, I think
> it can be in a union with the sequence number value, but I also suspect
> that 32-bits of a timestamp is not going to be enough for many.
> 
> Thoughts?

If we consider that timestamp representation should use nanosecond
granularity, a 32-bit value would wrap around in roughly 4.3 seconds (2^32 ns),
likely too quickly to be useful. We can also assume that applications
requesting timestamps may care
more about latency than throughput, Oleg found that using the second cache
line for this purpose had a noticeable impact [1].

 [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html

-- 
Adrien Mazarguil
6WIND


[dpdk-dev] [PATCH v5 06/21] eal/soc: introduce very essential SoC infra definitions

2016-10-25 Thread Shreyansh Jain
Hello Jan,

On Monday 24 October 2016 09:51 PM, Jan Viktorin wrote:
> On Mon, 24 Oct 2016 17:29:25 +0530
> Shreyansh Jain  wrote:
>
>> From: Jan Viktorin 
>>
>> Define initial structures and functions for the SoC infrastructure.
>> This patch supports only a very minimal functions for now.
>> More features will be added in the following commits.
>>
>> Includes rte_device/rte_driver inheritance of
>> rte_soc_device/rte_soc_driver.
>>
>> Signed-off-by: Jan Viktorin 
>> Signed-off-by: Shreyansh Jain 
>> Signed-off-by: Hemant Agrawal 
>> ---
>>  app/test/Makefile   |   1 +
>>  app/test/test_soc.c |  90 +
>>  lib/librte_eal/common/Makefile  |   2 +-
>>  lib/librte_eal/common/eal_private.h |   4 +
>>  lib/librte_eal/common/include/rte_soc.h | 138 
>> 
>>  5 files changed, 234 insertions(+), 1 deletion(-)
>>  create mode 100644 app/test/test_soc.c
>>  create mode 100644 lib/librte_eal/common/include/rte_soc.h
>>
>> diff --git a/app/test/Makefile b/app/test/Makefile
>
> [...]
>
>> +++ b/lib/librte_eal/common/include/rte_soc.h
>> @@ -0,0 +1,138 @@
>
> [...]
>
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include 
>> +#include 
>> +
>> +struct rte_soc_id {
>> +const char *compatible; /**< OF compatible specification */
>> +uint64_t priv_data; /**< SoC Driver specific data */
>
> Do you expect this to be a pointer?

A 64 bit entry, which can be typecasted to pointer by implementations, 
if required. Or, it might as well remain as a 64bit entry as ID.

>
>> +};
>> +
>
> [...]
>
>> +
>> +/**
>> + * Initialization function for the driver called during SoC probing.
>> + */
>> +typedef int (soc_devinit_t)(struct rte_soc_driver *, struct rte_soc_device 
>> *);
>> +
>> +/**
>> + * Uninitialization function for the driver called during hotplugging.
>> + */
>> +typedef int (soc_devuninit_t)(struct rte_soc_device *);
>> +
>> +/**
>> + * A structure describing a SoC driver.
>> + */
>> +struct rte_soc_driver {
>> +TAILQ_ENTRY(rte_soc_driver) next;  /**< Next in list */
>> +struct rte_driver driver;  /**< Inherit core driver. */
>> +soc_devinit_t *devinit;/**< Device initialization */
>> +soc_devuninit_t *devuninit;/**< Device uninitialization */
>
> Shouldn't those functions be named probe/remove?

Indeed. I think there was a comment on v4 as well - I thought I had 
fixed it but it seems I have mixed up my patches. I will send v6 
immediately with this fixed. Thanks for pointing out.

>
>> +const struct rte_soc_id *id_table; /**< ID table, NULL terminated */
>> +};
>> +
>
> [...]
>
>> +#endif
>
>
>

-
Shreyansh


[dpdk-dev] [PATCH v2 2/2] net/i40e: fix VF bonded device link down

2016-10-25 Thread Wu, Jingjing


> -Original Message-
> From: Yang, Qiming
> Sent: Tuesday, October 25, 2016 12:19 PM
> To: dev at dpdk.org
> Cc: Wu, Jingjing ; Zhang, Helin  intel.com>; Richardson,
> Bruce ; Yang, Qiming 
> Subject: [PATCH v2 2/2] net/i40e: fix VF bonded device link down
> 
> If VF device is used as slave of a bond device, it will be polled
> periodically through alarm. Interrupt is involved here. And then
> VF will send I40E_VIRTCHNL_OP_GET_LINK_STAT message to
> PF to query the status. The response is handled by interrupt
> callback. Interrupt is involved here again. That's why bond
> device cannot bring up.
> 
> This patch removes I40E_VIRTCHNL_OP_GET_LINK_STAT
> message. Link status in VF driver will be updated when PF driver
> notify it, and VF stores this link status locally. VF driver just
> returns the local status when being required.
> 
> Fixes: 4861cde46116 ("i40e: new poll mode driver")
> 
> Signed-off-by: Qiming Yang 
Acked-by: Jingjing Wu 

Thanks.
BTW, next time, don't forget to note your changes compared with the previous
version.
It will make reviewers' lives easier. :)


/Jingjing


[dpdk-dev] [PATCH v2 1/2] net/i40e: fix link status change interrupt

2016-10-25 Thread Wu, Jingjing


> -Original Message-
> From: Yang, Qiming
> Sent: Tuesday, October 25, 2016 12:19 PM
> To: dev at dpdk.org
> Cc: Wu, Jingjing ; Zhang, Helin  intel.com>; Richardson,
> Bruce ; Yang, Qiming 
> Subject: [PATCH v2 1/2] net/i40e: fix link status change interrupt
> 
> Previously, link status interrupt in i40e is achieved by checking
> LINK_STAT_CHANGE_MASK in PFINT_ICR0 register which is provided only
> for diagnostic use. Instead, drivers need to get the link status
> change notification by using LSE (Link Status Event).
> 
> This patch enables LSE and calls LSC callback when the event is
> received. This patch also removes the processing on
> LINK_STAT_CHANGE_MASK.
> 
> Fixes: 4861cde46116 ("i40e: new poll mode driver")
> 
> Signed-off-by: Qiming Yang 
Acked-by: Jingjing Wu 


[dpdk-dev] [PATCH v2] net/i40e: fix fdir configure failed issue in X710

2016-10-25 Thread Wu, Jingjing


> -Original Message-
> From: Guo, Jia
> Sent: Tuesday, October 25, 2016 10:26 AM
> To: Zhang, Helin ; Wu, Jingjing  intel.com>
> Cc: dev at dpdk.org; Guo, Jia 
> Subject: [PATCH v2] net/i40e: fix fdir configure failed issue in X710
> 
> Because some registers are only supported by X722, such as
> I40E_GLQF_FD_PCTYPES, the MAC type needs to be used to distinguish the
> behavior of X722 from X710 and other NICs; otherwise X710 functionality
> would fail.
> 
> Fixes: 8c5cb3c11513 ("net/i40e: add packet type translation for X722")
> Signed-off-by: Jeff Guo 

Acked-by: Jingjing Wu 

Thanks
Jingjing


[dpdk-dev] [PATCH v4] net/i40e: fix hash filter invalid issue in X722

2016-10-25 Thread Wu, Jingjing


> -Original Message-
> From: Guo, Jia
> Sent: Tuesday, October 25, 2016 10:43 AM
> To: Zhang, Helin ; Wu, Jingjing  intel.com>
> Cc: dev at dpdk.org; Guo, Jia 
> Subject: [PATCH v4] net/i40e: fix hash filter invalid issue in X722
> 
> When verifying the Hash filtering on X722, we found a problem that
> the hash value in the descriptor is incorrect. The root cause is that X722
> uses a different way of hash key word selection compared with X710/XL710.
> This patch fixes it by setting X722 specific key selection.
> 
> Signed-off-by: Jeff Guo 

If we use default configuration on HW, this issue will not exist. But
our software sets the default value through i40e_filter_input_set_init.
So I think the Fixes line needs to be the commit which introduced the input
set changing feature.

Acked-by: Jingjing Wu  with additional line:

Fixes: commit 98f055707685 ("i40e: configure input fields for RSS or flow 
director ")


Thanks
Jingjing


[dpdk-dev] mbuf changes

2016-10-25 Thread Ananyev, Konstantin
Hi everyone,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> Sent: Tuesday, October 25, 2016 9:54 AM
> To: Morten Br?rup 
> Cc: Wiles, Keith ; dev at dpdk.org; Olivier Matz 
> 
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Br?rup wrote:
> > Comments inline.
> >
> > Med venlig hilsen / kind regards
> > - Morten Br?rup
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > > Sent: Monday, October 24, 2016 6:26 PM
> > > To: Wiles, Keith
> > > Cc: Morten Br?rup; dev at dpdk.org; Olivier Matz
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> > > >
> > > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup  > > > > smartsharesystems.com>
> > > wrote:
> > > > >
> > > > > First of all: Thanks for a great DPDK Userspace 2016!
> > > > >
> > > > >
> > > > >
> > > > > Continuing the Userspace discussion about Olivier Matz's proposed mbuf
> > > changes...
> > >
> > > Thanks for keeping the discussion going!
> > > > >
> > > > >
> > > > >
> > > > > 1.
> > > > >
> > > > > Stephen Hemminger had a noteworthy general comment about keeping
> > > metadata for the NIC in the appropriate section of the mbuf: Metadata
> > > generated by the NIC's RX handler belongs in the first cache line, and
> > > metadata required by the NIC's TX handler belongs in the second cache 
> > > line.
> > > This also means that touching the second cache line on ingress should be
> > > avoided if possible; and Bruce Richardson mentioned that for this reason 
> > > m-
> > > >next was zeroed on free().
> > > > >
> > > Thinking about it, I suspect there are more fields we can reset on free
> > > to save time on alloc. Refcnt, as discussed below is one of them, but so
> > > too could be the nb_segs field and possibly others.
> > Yes. Consider the use of rte_pktmbuf_reset() or add a 
> > rte_pktmbuf_prealloc() for this purpose.
> >
> Possibly. We just want to make sure that we don't reset too much either!
> :-)
> 
> >
> > > > > 2.
> > > > >
> > > > > There seemed to be consensus that the size of m->refcnt should match
> > > the size of m->port because a packet could be duplicated on all physical
> > > ports for L3 multicast and L2 flooding.
> > > > >
> > > > > Furthermore, although a single physical machine (i.e. a single server)
> > > with 255 physical ports probably doesn't exist, it might contain more than
> > > 255 virtual machines with a virtual port each, so it makes sense extending
> > > these mbuf fields from 8 to 16 bits.
> > > >
> > > > I thought we also talked about removing the m->port from the mbuf as it
> > > is not really needed.
> > > >
> > > Yes, this was mentioned, and also the option of moving the port value to
> > > the second cacheline, but it appears that NXP are using the port value
> > > in their NIC drivers for passing in metadata, so we'd need their
> > > agreement on any move (or removal).
> > >
> > If a single driver instance services multiple ports, so the ports are not 
> > polled individually, the m->port member will be required to identify
> the port. In that case, it might also used elsewhere in the ingress path, and 
> thus appropriate having in the first cache line. 

Ok, but these days most devices have multiple RX queues.
So to identify the RX source properly you need not only the port, but port+queue (at 
least 3 bytes).
But I suppose better to wait for NXP input here. 

>So yes, this needs
> further discussion with NXP.
> >
> >
> > > > > 3.
> > > > >
> > > > > Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to
> > > the second cache line, which then generated questions from the audience
> > > about the real life purpose of m->port, and if m->port could be removed
> > > from the mbuf structure.
> > > > >
> > > > >
> > > > >
> > > > > 4.
> > > > >
> > > > > I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on
> > > first allocation. This is based on the assumption that other mbuf fields
> > > must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper
> > > than setting it to 1.
> > > > >
> > > > > Furthermore (regardless of m->refcnt offset), I suggested that it is
> > > not required to modify m->refcnt when allocating and freeing the mbuf, 
> > > thus
> > > saving one write operation on both alloc() and free(). However, this
> > > assumes that m->refcnt debugging, e.g. underrun detection, is not 
> > > required.
> > >
> > > I don't think it really matters what sentinal value is used for the
> > > refcnt because it can't be blindly assigned on free like other fields.
> > > However, I think 0 as first reference value becomes more awkward
> > > than 1, because we need to deal with underflow. Consider the situation
> > > where we have two references to the mbuf, so refcnt is 1, and both are
> > > freed at the same time. Since the refcnt is 

[dpdk-dev] mbuf changes

2016-10-25 Thread Bruce Richardson
On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Br?rup wrote:
> Comments inline.
> 
> Med venlig hilsen / kind regards
> - Morten Br?rup
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Monday, October 24, 2016 6:26 PM
> > To: Wiles, Keith
> > Cc: Morten Br?rup; dev at dpdk.org; Olivier Matz
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith wrote:
> > >
> > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup  > > > smartsharesystems.com>
> > wrote:
> > > >
> > > > First of all: Thanks for a great DPDK Userspace 2016!
> > > >
> > > >
> > > >
> > > > Continuing the Userspace discussion about Olivier Matz's proposed mbuf
> > changes...
> > 
> > Thanks for keeping the discussion going!
> > > >
> > > >
> > > >
> > > > 1.
> > > >
> > > > Stephen Hemminger had a noteworthy general comment about keeping
> > metadata for the NIC in the appropriate section of the mbuf: Metadata
> > generated by the NIC's RX handler belongs in the first cache line, and
> > metadata required by the NIC's TX handler belongs in the second cache line.
> > This also means that touching the second cache line on ingress should be
> > avoided if possible; and Bruce Richardson mentioned that for this reason m-
> > >next was zeroed on free().
> > > >
> > Thinking about it, I suspect there are more fields we can reset on free
> > to save time on alloc. Refcnt, as discussed below is one of them, but so
> > too could be the nb_segs field and possibly others.
> Yes. Consider the use of rte_pktmbuf_reset() or add a rte_pktmbuf_prealloc() 
> for this purpose.
> 
Possibly. We just want to make sure that we don't reset too much either!
:-)
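As a sketch of the idea under discussion, a hypothetical reset-on-free helper
(not an existing DPDK function) could restore the fields the fast path expects
so that allocation does not have to touch them:

#include <rte_mbuf.h>

static inline void
pktmbuf_reset_on_free(struct rte_mbuf *m)
{
	m->next = NULL;             /* already done on free today */
	m->nb_segs = 1;             /* candidate mentioned in this thread */
	rte_mbuf_refcnt_set(m, 1);  /* keep refcnt at "one" across free/alloc */
}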

> 
> > > > 2.
> > > >
> > > > There seemed to be consensus that the size of m->refcnt should match
> > the size of m->port because a packet could be duplicated on all physical
> > ports for L3 multicast and L2 flooding.
> > > >
> > > > Furthermore, although a single physical machine (i.e. a single server)
> > with 255 physical ports probably doesn't exist, it might contain more than
> > 255 virtual machines with a virtual port each, so it makes sense extending
> > these mbuf fields from 8 to 16 bits.
> > >
> > > I thought we also talked about removing the m->port from the mbuf as it
> > is not really needed.
> > >
> > Yes, this was mentioned, and also the option of moving the port value to
> > the second cacheline, but it appears that NXP are using the port value
> > in their NIC drivers for passing in metadata, so we'd need their
> > agreement on any move (or removal).
> > 
> If a single driver instance services multiple ports, so the ports are not 
> polled individually, the m->port member will be required to identify the 
> port. In that case, it might also used elsewhere in the ingress path, and 
> thus appropriate having in the first cache line. So yes, this needs further 
> discussion with NXP.
> 
> 
> > > > 3.
> > > >
> > > > Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to
> > the second cache line, which then generated questions from the audience
> > about the real life purpose of m->port, and if m->port could be removed
> > from the mbuf structure.
> > > >
> > > >
> > > >
> > > > 4.
> > > >
> > > > I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on
> > first allocation. This is based on the assumption that other mbuf fields
> > must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper
> > than setting it to 1.
> > > >
> > > > Furthermore (regardless of m->refcnt offset), I suggested that it is
> > not required to modify m->refcnt when allocating and freeing the mbuf, thus
> > saving one write operation on both alloc() and free(). However, this
> > assumes that m->refcnt debugging, e.g. underrun detection, is not required.
> > 
> > I don't think it really matters what sentinal value is used for the
> > refcnt because it can't be blindly assigned on free like other fields.
> > However, I think 0 as first reference value becomes more awkward
> > than 1, because we need to deal with underflow. Consider the situation
> > where we have two references to the mbuf, so refcnt is 1, and both are
> > freed at the same time. Since the refcnt is not-zero, then both cores
> > will do an atomic decrement simultaneously giving a refcnt of -1. We can
> > then set this back to zero before freeing, however, I'd still prefer to
> > have refcnt be an accurate value so that it always stays positive, and
> > we can still set it to "one" on free to avoid having to set on alloc.
> > 
> Good example. It explains very clearly why my suggested optimization is 
> tricky. My optimization is targeting the most likely scenario where m->refcnt 
> is only "one", but at the expense of some added complexity for the unlikely 
> scenarios. (I only suggested 0 as "one" because I assumed the value could be 
> zeroed faster than set to 1, so forget about that, if you 

[dpdk-dev] [PATCH] Revert "bonding: use existing enslaved device queues"

2016-10-25 Thread Ilya Maximets
On 24.10.2016 17:54, Jan Blunck wrote:
> On Wed, Oct 19, 2016 at 5:47 AM, Ilya Maximets  
> wrote:
>> On 18.10.2016 18:19, Jan Blunck wrote:
>>> On Tue, Oct 18, 2016 at 2:49 PM, Ilya Maximets  
>>> wrote:
 On 18.10.2016 15:28, Jan Blunck wrote:
> If the application already configured queues the PMD should not
> silently claim ownership and reset them.
>
> What exactly is the problem when changing MTU? This works fine from
> what I can tell.

 Following scenario leads to APP PANIC:

 1. mempool_1 = rte_mempool_create()
 2. rte_eth_rx_queue_setup(bond0, ..., mempool_1);
 3. rte_eth_dev_start(bond0);
 4. mempool_2 = rte_mempool_create();
 5. rte_eth_dev_stop(bond0);
 6. rte_eth_rx_queue_setup(bond0, ..., mempool_2);
 7. rte_eth_dev_start(bond0);
 * RX queues still use 'mempool_1' because reconfiguration doesn't 
 affect them. *
 8. rte_mempool_free(mempool_1);
 9. On any rx operation we'll get PANIC because of using freed 
 'mempool_1':
  PANIC in rte_mempool_get_ops():
  assert "(ops_index >= 0) && (ops_index < 
 RTE_MEMPOOL_MAX_OPS_IDX)" failed

 You may just start OVS 2.6 with DPDK bonding device and attempt to change 
 MTU via 'mtu_request'.
 Bug is easily reproducible.

>>>
>>> I see. I'm not 100% sure that this is expected to work without leaking the
>>> driver's queues though. The driver is allowed to do allocations in
>>> its rx_queue_setup() function that are being freed via
>>> rx_queue_release() later. But rx_queue_release() is only called if you
>>> reconfigure the
>>> device with 0 queues.

That's not true. Drivers usually call 'rx_queue_release()' inside the
'rx_queue_setup()' function when reallocating an already allocated
queue (see the ixgbe driver, for example, and the sketch below). Also,
all HW queues are usually destroyed inside 'eth_dev_stop()' and
reallocated in 'eth_dev_start()' or '{rx,tx}_queue_setup()'.
So there are no leaks at all.
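For reference, a minimal sketch of that pattern (the callback signature matches the ethdev rx_queue_setup hook; the queue struct, the release helper and the "sketch_rxq" name are illustrative, not copied from any driver):

#include <errno.h>
#include <rte_ethdev.h>
#include <rte_malloc.h>
#include <rte_memory.h>
#include <rte_mempool.h>

struct sketch_rx_queue {
	struct rte_mempool *mp;
	uint16_t nb_desc;
};

/* Illustrative stand-in for e.g. ixgbe_rx_queue_release(). */
static void
pmd_rx_queue_release(void *rxq)
{
	rte_free(rxq);
}

/* Any queue already present at this index is released before the new one
 * (carrying the new mempool) is allocated, so running setup twice does not
 * leak the first queue. */
static int
pmd_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_idx,
		   uint16_t nb_desc, unsigned int socket_id,
		   const struct rte_eth_rxconf *rx_conf,
		   struct rte_mempool *mp)
{
	struct sketch_rx_queue *rxq;

	if (dev->data->rx_queues[queue_idx] != NULL) {
		pmd_rx_queue_release(dev->data->rx_queues[queue_idx]);
		dev->data->rx_queues[queue_idx] = NULL;
	}

	rxq = rte_zmalloc_socket("sketch_rxq", sizeof(*rxq),
				 RTE_CACHE_LINE_SIZE, socket_id);
	if (rxq == NULL)
		return -ENOMEM;

	rxq->mp = mp;		/* the new mempool takes effect here */
	rxq->nb_desc = nb_desc;
	(void)rx_conf;		/* a real driver would honour thresholds etc. */

	/* ... allocate the descriptor ring and program the hardware ... */
	dev->data->rx_queues[queue_idx] = rxq;
	return 0;
}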

>>> From what I understand there is no other way to
>>> reconfigure a device to use another mempool.
>>>
>>> But ... even that wouldn't work with the bonding driver right now: the
>>> bonding master only configures the slaves during startup. I can put
>>> that on my todo list.

No, bonding master reconfigures new slaves in 'rte_eth_bond_slave_add()'
if needed.

>>> Coming back to your original problem: changing the MTU for the bond
>>> does work through rte_eth_dev_set_mtu() for slaves supporting that. In
>>> any other case you could (re-)configure rxmode.max_rx_pkt_len (and
>>> jumbo_frame / enable_scatter accordingly). This does work without a
>>> call to rte_eth_rx_queue_setup().
>>
>> Thanks for the suggestion, but using rte_eth_dev_set_mtu() without
>> reconfiguration would require having mempools with huge mbufs (9KB)
>> for all ports from the start. This is unacceptable because it leads to
>> significant performance regressions due to fast cache exhaustion.
>> It would also require a lot of work to rewrite the OVS reconfiguration
>> code this way.
>> Anyway, it isn't only an MTU problem. The number of rx/tx descriptors
>> also can't be changed at runtime.
>>
>>
>> I don't fully understand what the use case for this 'reusing' code is.
>> Could you please describe a situation where this behaviour is necessary?
> 
> The device that is added to the bond was used before and therefore
> already has allocated queues. Therefore we reuse the existing queues
> of the devices instead of borrowing the queues of the bond device. If
> the slave is removed from the bond again there is no need to allocate
> the queues again.
> 
> Hope that clarifies the usecase

So, as I see it, this is just an optimization that leads to differently
configured queues on the same device and a possible application crash if
the old configuration is no longer valid.

With the revert applied, in your use case the device's queues will be
automatically reconfigured according to the configuration of the bond
device when the device is added to the bond. If the slave is removed
from the bond, all of its queues will remain as configured by the bond;
you can reconfigure them if needed. I guess that in your case the
configuration of the slave devices actually matches the configuration of
the bond device, in which case the slave will remain in the same state
after being removed from the bond as it was before being added.

Best regards, Ilya Maximets.
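For completeness, a compilable sketch of the nine-step scenario quoted earlier in this thread; the port id, pool names and sizes are hypothetical, and error checking plus the initial rte_eth_dev_configure() call are omitted for brevity, just as in the quoted steps:

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Reproduces the crash scenario described above on a bonded port. */
static void
bond_mempool_switch(uint8_t bond_port, unsigned int socket)
{
	struct rte_mempool *mp1, *mp2;

	mp1 = rte_pktmbuf_pool_create("mp1", 8192, 256, 0,
				      RTE_MBUF_DEFAULT_BUF_SIZE, socket);
	rte_eth_rx_queue_setup(bond_port, 0, 128, socket, NULL, mp1);
	rte_eth_dev_start(bond_port);

	/* Reconfigure with a second pool, e.g. one with room for a larger MTU. */
	mp2 = rte_pktmbuf_pool_create("mp2", 8192, 256, 0, 9216, socket);
	rte_eth_dev_stop(bond_port);
	rte_eth_rx_queue_setup(bond_port, 0, 128, socket, NULL, mp2);
	rte_eth_dev_start(bond_port);

	/* With commit 5b7bb2bda551 in place the slave queues silently keep
	 * using mp1, so freeing it leaves them pointing at freed memory and
	 * the next rx burst can hit the PANIC in rte_mempool_get_ops()
	 * quoted above. */
	rte_mempool_free(mp1);
}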

>
> On Wed, Sep 7, 2016 at 2:28 PM, Ilya Maximets  
> wrote:
>> This reverts commit 5b7bb2bda5519b7800f814df64d4e015282140e5.
>>
>> It is necessary to reconfigure all queues every time because the
>> configuration can have changed.
>>
>> For example, if we're reconfiguring a bonding device with a new memory pool,
>> the already configured queues will still use the old one. And if the old
>> mempool is freed, the application will likely panic when attempting to use
>> the freed mempool.
>>
>> This happens when 

[dpdk-dev] rte_kni_tx_burst() hangs because of no free descriptors

2016-10-25 Thread Ananyev, Konstantin
Hi Helin,

> 
> From: yingzhi [mailto:kaitoy at qq.com]
> Sent: Monday, October 24, 2016 6:39 PM
> To: Zhang, Helin
> Cc: dev at dpdk.org
> Subject: Re: RE: [dpdk-dev] rte_kni_tx_burst() hangs because of no 
> free descriptors
> 
> Hi Helin,
> 
> Thanks for your response, to answer your questions:
> 1. we send only one packet each time calling rte_kni_tx_burst(), which means 
> the last argument is 1.
> 2. it returns 0 because the free mbuf function inside tx_burst will not free 
> any mbuf:
> 
>   if (txq->nb_tx_free < txq->tx_free_thresh)
>   ixgbe_tx_free_bufs(txq);
> 
> after this operation, txq->nb_tx_free is not increased and eventually stays 
> at "0".
> 
> I did some tests today, I commented out this section of 
> ixgbe_rxtx_vec_common.h -> ixgbe_tx_free_bufs
> 
>   status = txq->tx_ring[txq->tx_next_dd].wb.status;
>   if (!(status & IXGBE_ADVTXD_STAT_DD))
>   return 0;
> 
> After ignoring the DD bit check, our app runs for about 6 hours without 
> issue. I suspect there is something wrong in my program around the DD bit 
> somewhere. One possible cause I currently suspect: as far as I know, 
> rte_pktmbuf_free(mbuf.array[k]) will free the mbuf of the packet and any 
> fragmented packets following it. But in our application, in a code snippet 
> such as the one below:
> 
> auto nb_tx = rte_eth_tx_burst(port, queue, mbuf.array, (uint16_t) 
> nb_rx);
> if (unlikely(nb_tx < nb_rx)) {
> for (unsigned k = nb_tx; k < nb_rx; k++) {
> rte_pktmbuf_free(mbuf.array[k]);
> }
> }
> [Zhang, Helin] it seems the above code piece has a memory leak if the buffer 
> is chained. Once all the memory has leaked, the issue appears. Please
> check whether this is the root cause!
> 
> In this case, if there are fragmented packets and a failed transmission, an 
> mbuf may be freed multiple times.

Yes, rte_pktmbuf_free() will free all chained segments of that packet.
The code above seems correct to me
(if, of course, you don't try to free these packets again somewhere later).
Can you clarify where you think the memory leak is here?
Konstantin
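To illustrate the point, a small self-contained version of that drop path (the helper name is illustrative): freeing only the head of each unsent packet is enough, because rte_pktmbuf_free() walks the whole segment chain.

#include <stdint.h>
#include <rte_mbuf.h>

/* Drop the packets that rte_eth_tx_burst() did not accept. Freeing the head
 * mbuf releases every chained segment exactly once, so this is not a leak.
 * Freeing any of these mbufs again later, however, would be a double free
 * and could corrupt the pool. */
static void
drop_unsent(struct rte_mbuf **pkts, uint16_t nb_tx, uint16_t nb_rx)
{
	uint16_t k;

	for (k = nb_tx; k < nb_rx; k++)
		rte_pktmbuf_free(pkts[k]);
}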

> 
> Above is just my suspect, need to do some tests later today or tomorrow.
> 
> Thanks
> -- Original --
> From:  "Zhang, Helin";;
> Date:  Mon, Oct 24, 2016 11:33 AM
> To:  "yingzhi";
> Cc:  "dev at dpdk.org";
> Subject:  RE: [dpdk-dev] rte_kni_tx_burst() hangs because of no 
> free descriptors
> 
> Hi Yingzhi
> 
> Thank you for reporting this! The description is not so clear, at least for me.
> Please help narrow down the issue yourself.
> How many packets would it have for calling TX function?
> Why would it return 0 after calling the TX function? No memory? Or a return 
> from somewhere else? Have you found anything?
> 
> Regards,
> Helin
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of yingzhi
> > Sent: Sunday, October 23, 2016 9:30 PM
> > To: users; dev at dpdk.org
> > Subject: [dpdk-dev] rte_kni_tx_burst() hangs because of no free descriptors
> >
> > -
> > Hi Experts,
> >
> > Background:
> >
> > We are using DPDK to develop a LoadBalancer following below logic: When
> > a new packet is received:
> >  1. if the dst_addr is the management IP, forward to KNI.
> >  2. if the dst_addr is in the VIP list, select a backend and forward
> > (modify the dst mac address).
> >  3. otherwise drop the packet.
> >
> > At this stage, we use one single thread for KNI forwarding and another for
> > VIP forwarding(forward to eth).
> >
> > DPDK version: 16.07
> >  NIC: 82599ES 10-Gigabit SFI/SFP+ Network Connection
> >  Linux: 14.04.1-Ubuntu x64
> >
> > Problem description:
> >
> > The program runs correctly for some time (around 2 hours at 400Mb of 
> > traffic), but then it hangs. When the problem happens, rte_eth_tx_burst() 
> > is not able to send out any packets (it always returns 0). We tracked into 
> > that function and noticed it is actually calling the ixgbe driver's 
> > ixgbe_xmit_pkts_vec() function in our environment, because we use the 
> > default tx queue configuration. After printing some info, we found that 
> > when the free function works fine:
> >  tx_rs_thresh: 32, tx_free_thresh: 32, nb_tx_free: 31
> >
> > it will trigger free and make 32 more free descriptors:
> >  tx_rs_thresh: 32, tx_free_thresh: 32, nb_tx_free: 62
> >
> > but when something goes wrong, it no longer frees anything:
> >  tx_rs_thresh: 32, tx_free_thresh: 32, nb_tx_free: 0 tx_rs_thresh: 32,
> > tx_free_thresh: 32, nb_tx_free: 0
> >
> > It may be related to the DD flag of the descriptor, but we are not quite sure.
> >
> > Our program logic:
> >
> > create two mbuf pools on socket 0, one for rx_queue and one for kni. (all
> > lcore threads run on socket 0)
> >
> > init kni interface with rte_kni_alloc()
> >
> >
> > init one NIC interface with
> >  rte_eth_dev_configure(); 
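Putting the two ixgbe snippets quoted in this thread together, the hang comes down to the Descriptor Done check gating descriptor recycling. A simplified, self-contained model of that path (field names follow the quoted code; the real driver frees a batch of mbufs where the comment indicates):

#include <stdint.h>

#define ADVTXD_STAT_DD 0x00000001	/* Descriptor Done bit in the status word */

/* Minimal model of the TX queue state involved; the real ixgbe_tx_queue has
 * many more fields. */
struct txq_model {
	volatile uint32_t *status;	/* write-back status of descriptor tx_next_dd */
	uint16_t nb_tx_free;
	uint16_t tx_rs_thresh;
	uint16_t tx_free_thresh;
	uint16_t tx_next_dd;
};

/* Modelled after the quoted ixgbe_tx_free_bufs(): if the NIC has not set the
 * DD bit on the tracked descriptor, nothing is recycled and nb_tx_free stays
 * where it is; otherwise a batch of tx_rs_thresh descriptors (and their
 * mbufs) is released back to the pool. */
static int
txq_model_free_bufs(struct txq_model *txq)
{
	if (!(*txq->status & ADVTXD_STAT_DD))
		return 0;

	/* ... free the mbufs attached to the completed descriptors here ... */
	txq->nb_tx_free += txq->tx_rs_thresh;
	txq->tx_next_dd += txq->tx_rs_thresh;
	return txq->tx_rs_thresh;
}

The transmit path only calls this routine once nb_tx_free drops below tx_free_thresh, so if the DD bit never appears, nb_tx_free stays pinned at 0 and every burst call returns 0, which is exactly the symptom in the log above.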

[dpdk-dev] 16.07.1 stable patches review and test

2016-10-25 Thread Xu, Qian Q
Yuanhan
So we will get the package for testing on Oct.26th, right? Thx. 

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Yuanhan Liu
Sent: Friday, October 14, 2016 5:27 PM
To: dpdk stable 
Cc: dev at dpdk.org
Subject: [dpdk-dev] 16.07.1 stable patches review and test

Hi,

I have applied most of the bug-fixing patches (listed below) to the 16.07 stable 
branch at

http://dpdk.org/browse/dpdk-stable/

Please help with reviewing and testing. The planned date for the final release 
is the 26th of Oct. Before that, please shout if anyone has objections to these 
patches being applied.

Thanks.

--yliu

---
Alejandro Lucero (1):
  net/nfp: fix copying MAC address

Aleksey Katargin (1):
  table: fix symbol exports

Alex Zelezniak (1):
  net/ixgbe: fix VF reset to apply to correct VF

Ali Volkan Atli (1):
  net/e1000: fix returned number of available Rx descriptors

Arek Kusztal (1):
  app/test: fix verification of digest for GCM

Beilei Xing (2):
  net/i40e: fix dropping packets with ethertype 0x88A8
  net/i40e: fix parsing QinQ packets type

Bruce Richardson (1):
  net/mlx: fix debug build with gcc 6.1

Christian Ehrhardt (1):
  examples/ip_pipeline: fix Python interpreter

Deepak Kumar Jain (2):
  crypto/null: fix key size increment value
  crypto/qat: fix FreeBSD build

Dror Birkman (1):
  net/pcap: fix memory leak in jumbo frames

Ferruh Yigit (2):
  app/testpmd: fix help of MTU set commmand
  pmdinfogen: fix clang build

Gary Mussar (1):
  tools: fix virtio interface name when binding

Gowrishankar Muthukrishnan (1):
  examples/ip_pipeline: fix lcore mapping for ppc64

Hiroyuki Mikita (1):
  sched: fix releasing enqueued packets

James Poole (1):
  app/testpmd: fix timeout in Rx queue flushing

Jianfeng Tan (3):
  net/virtio_user: fix first queue pair without multiqueue
  net/virtio_user: fix wrong sequence of messages
  net/virtio_user: fix error management during init

Jim Harris (1):
  contigmem: zero all pages during mmap

John Daley (1):
  net/enic: fix bad L4 checksum flag on ICMP packets

Karmarkar Suyash (1):
  timer: fix lag delay

Maciej Czekaj (1):
  mem: fix crash on hugepage mapping error

Nelson Escobar (1):
  net/enic: fix freeing memory for descriptor ring

Olivier Matz (4):
  app/testpmd: fix crash when mempool allocation fails
  tools: fix json output of pmdinfo
  mbuf: fix error handling on pool creation
  mem: fix build with -O1

Pablo de Lara (3):
  hash: fix ring size
  hash: fix false zero signature key hit lookup
  crypto: fix build with icc

Qi Zhang (1):
  net/i40e/base: fix UDP packet header

Rich Lane (1):
  net/i40e: fix null pointer dereferences when using VMDq+RSS

Weiliang Luo (1):
  mempool: fix corruption due to invalid handler

Xiao Wang (5):
  net/fm10k: fix MAC address removal from switch
  net/ixgbe/base: fix pointer check
  net/ixgbe/base: fix check for NACK
  net/ixgbe/base: fix possible corruption of shadow RAM
  net/ixgbe/base: fix skipping PHY config

Yangchao Zhou (1):
  pci: fix memory leak when detaching device

Yury Kylulin (2):
  net/ixgbe: fix mbuf leak during Rx queue release
  net/i40e: fix mbuf leak during Rx queue release

Zhiyong Yang (1):
  net/virtio: fix xstats name


[dpdk-dev] [PATCH] doc: announce ABI change for ethtool app enhance

2016-10-25 Thread Xing, Beilei

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Qiming Yang
> Sent: Sunday, October 9, 2016 11:17 AM
> To: dev at dpdk.org
> Cc: Yang, Qiming 
> Subject: [dpdk-dev] [PATCH] doc: announce ABI change for ethtool app
> enhance
> 
> This patch adds a notice that the ABI will change in the 17.02 release so 
> that the ethtool app can get the NIC firmware version.
> 
> Signed-off-by: Qiming Yang 
> ---
>  doc/guides/rel_notes/deprecation.rst | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst
> b/doc/guides/rel_notes/deprecation.rst
> index 845d2aa..60bd7ed 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -62,3 +62,7 @@ Deprecation Notices
>  * API will change for ``rte_port_source_params`` and
> ``rte_port_sink_params``
>structures. The member ``file_name`` data type will be changed from
>``char *`` to ``const char *``. This change targets release 16.11.
> +
> +* In 17.02 ABI change is planned: the ``rte_eth_dev_info`` structure
> +  will be extended with a new member ``fw_version`` in order to store
> +  the NIC firmware version.
> --
> 2.7.4

Acked-by: Beilei Xing




[dpdk-dev] [PATCH] net/i40e: fix the hash filter invalid calculation in X722

2016-10-25 Thread Guo, Jia
I will refine the commit log. On the other hand, since the issue is not 
directly related to a prior patch (it is just that a newly added NIC type 
requires some special handling), it is a fix for a reported issue and may not 
have a corresponding Fixes line. Thanks for Jingjing's review.

Best regards,
Jeff Guo


-Original Message-
From: Wu, Jingjing 
Sent: Monday, October 24, 2016 5:10 PM
To: Guo, Jia ; Zhang, Helin 
Cc: dev at dpdk.org; Yigit, Ferruh 
Subject: RE: [PATCH] net/i40e: fix the hash filter invalid calculation in X722


> -Original Message-
> From: Guo, Jia
> Sent: Thursday, October 20, 2016 10:49 AM
> To: Zhang, Helin ; Wu, Jingjing 
> 
> Cc: dev at dpdk.org; Guo, Jia ; Yigit, Ferruh 
> 
> Subject: [PATCH] net/i40e: fix the hash filter invalid calculation in 
> X722
> 
> As X722 extracts the IPv4 header into the Field Vector differently from 
> XL710/X710, the IPv4 header fields in the input set need to be modified 
> correspondingly to map to the different default Field Vector Tables of the 
> different NICs.
> Signed-off-by: Jeff Guo 
> ---
> v3:
> remove the x722 macro
> v2:
> fix compile error when x722 macro is not defined and simplify the code 
> to avoid duplication.
> ---
>  drivers/net/i40e/i40e_ethdev.c | 60 
> +-
>  1 file changed, 47 insertions(+), 13 deletions(-)
> 

How about change the commit log it like:

When verifying hash filtering on X722, we found the behavior was not as 
expected; for example, the hash value in the descriptor is incorrect.
That is because X722 uses a different way of selecting the hash key words 
compared with X710/XL710.
This patch fixes it by setting the X722-specific key selection.

And a few minor comments:

If this is not the first patch, please use [PATCH v3] instead of [PATCH].
And the Fixes line is missing.

Thanks
Jingjing


[dpdk-dev] [PATCH v2 1/3] drivers: add name alias registration for rte_driver

2016-10-25 Thread Yuanhan Liu
On Mon, Oct 24, 2016 at 12:22:21PM -0400, Jan Blunck wrote:
> This adds infrastructure for drivers to allow being requested by an alias
> so that a renamed driver can still get loaded by its legacy name.
> 
> Signed-off-by: Jan Blunck 
> Reviewed-by: Maxime Coquelin 
> Tested-by: Pablo de Lara 

Series Reviewed-by: Yuanhan Liu 

--yliu


[dpdk-dev] rte_kni_tx_burst() hangs because of no free descriptors

2016-10-25 Thread Zhang, Helin


From: yingzhi [mailto:kai...@qq.com]
Sent: Monday, October 24, 2016 6:39 PM
To: Zhang, Helin
Cc: dev at dpdk.org
Subject: Re: RE: [dpdk-dev] rte_kni_tx_burst() hangs because of no 
free descriptors

Hi Helin,

Thanks for your response, to answer your questions:
1. we send only one packet each time calling rte_kni_tx_burst(), which means 
the last argument is 1.
2. it returns 0 because the free mbuf function inside tx_burst will not free 
any mbuf:

  if (txq->nb_tx_free < txq->tx_free_thresh)
  ixgbe_tx_free_bufs(txq);

after this operation, txq->nb_tx_free is not increased and eventually stays at 
"0".

I did some tests today, I commented out this section of ixgbe_rxtx_vec_common.h 
-> ixgbe_tx_free_bufs

  status = txq->tx_ring[txq->tx_next_dd].wb.status;
  if (!(status & IXGBE_ADVTXD_STAT_DD))
  return 0;

After ignoring the DD bit check, our app runs for about 6 hours without issue. 
I suspect there is something wrong in my program around the DD bit somewhere. 
One possible cause I currently suspect: as far as I know, 
rte_pktmbuf_free(mbuf.array[k]) will free the mbuf of the packet and any 
fragmented packets following it. But in our application, in a code snippet such 
as the one below:

auto nb_tx = rte_eth_tx_burst(port, queue, mbuf.array, (uint16_t) 
nb_rx);
if (unlikely(nb_tx < nb_rx)) {
for (unsigned k = nb_tx; k < nb_rx; k++) {
rte_pktmbuf_free(mbuf.array[k]);
}
}
[Zhang, Helin] it seems the above code piece has a memory leak if the buffer is 
chained. Once all the memory has leaked, the issue appears. Please check whether 
this is the root cause!

In this case, if there are fragmented packets and a failed transmission, an mbuf 
may be freed multiple times.

Above is just my suspect, need to do some tests later today or tomorrow.

Thanks
-- Original --
From:  "Zhang, Helin";;
Date:  Mon, Oct 24, 2016 11:33 AM
To:  "yingzhi";
Cc:  "dev at dpdk.org";
Subject:  RE: [dpdk-dev] rte_kni_tx_burst() hangs because of no free descriptors

Hi Yingzhi

Thank you for reporting this! The description is not so clear, at least for me.
Please help narrow down the issue yourself.
How many packets would it have for calling TX function?
Why would it return 0 after calling the TX function? No memory? Or a return 
from somewhere else? Have you found anything?

Regards,
Helin

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of yingzhi
> Sent: Sunday, October 23, 2016 9:30 PM
> To: users; dev at dpdk.org
> Subject: [dpdk-dev] rte_kni_tx_burst() hangs because of no free descriptors
>
> -
> Hi Experts,
>
> Background:
>
> We are using DPDK to develop a LoadBalancer following below logic: When
> a new packet is received:
>  1. if the dst_addr is the management IP, forward to KNI.
>  2. if the dst_addr is in the VIP list, select a backend and forward
> (modify the dst mac address).
>  3. otherwise drop the packet.
>
> At this stage, we use one single thread for KNI forwarding and another for
> VIP forwarding(forward to eth).
>
> DPDK version: 16.07
>  NIC: 82599ES 10-Gigabit SFI/SFP+ Network Connection
>  Linux: 14.04.1-Ubuntu x64
>
> Problem description:
>
> The program runs correctly for some time (around 2 hours at 400Mb of traffic),
> but then it hangs. When the problem happens, rte_eth_tx_burst() is not able to
> send out any packets (it always returns 0). We tracked into that function and
> noticed it is actually calling the ixgbe driver's ixgbe_xmit_pkts_vec()
> function in our environment, because we use the default tx queue
> configuration. After printing some info, we found that when the free function
> works fine:
>  tx_rs_thresh: 32, tx_free_thresh: 32, nb_tx_free: 31
>
> it will trigger free and make 32 more free descriptors:
>  tx_rs_thresh: 32, tx_free_thresh: 32, nb_tx_free: 62
>
> but when something goes wrong, it no longer frees anything:
>  tx_rs_thresh: 32, tx_free_thresh: 32, nb_tx_free: 0 tx_rs_thresh: 32,
> tx_free_thresh: 32, nb_tx_free: 0
>
> It may be related to the DD flag of the descriptor, but we are not quite sure.
>
> Our program logic:
>
> create two mbuf pools on socket 0, one for rx_queue and one for kni. (all
> lcore threads run on socket 0)
>
> init kni interface with rte_kni_alloc()
>
>
> init one NIC interface with
>  rte_eth_dev_configure(); rte_eth_rx_queue_setup();
> rte_eth_tx_queue_setup(); rte_eth_dev_start();
>
>
>
> in the eth main loop: (code is simplified)
> while (1) {
>     n = rte_eth_rx_burst(packets);
>     for (i = 0; i < n; ++i) {
>         if (SEND_TO_KNI) {
>             m = rte_kni_tx_burst(packets[i]);
>             if (m != 1)
>                 rte_pktmbuf_free(packets[i]);
>         }
>         if (SEND_TO_ETH) {
>             // after modifying the packet
>             m = rte_eth_tx_burst(packets[i]);
>             if (m != 1)
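For completeness, a self-contained version of the loop sketched above (the quoted code is cut off by the archive). SEND_TO_KNI()/SEND_TO_ETH() are placeholder classifiers standing in for the real destination-address checks, and the port, queue and burst size are illustrative:

#include <rte_ethdev.h>
#include <rte_kni.h>
#include <rte_mbuf.h>

#define BURST_SZ 32

/* Placeholder classifiers; the real code checks dst_addr against the
 * management IP and the VIP list. */
#define SEND_TO_KNI(m) 0
#define SEND_TO_ETH(m) 1

static void
forward_loop(uint8_t port, uint16_t queue, struct rte_kni *kni)
{
	struct rte_mbuf *pkts[BURST_SZ];
	uint16_t i, n;

	for (;;) {
		n = rte_eth_rx_burst(port, queue, pkts, BURST_SZ);
		for (i = 0; i < n; i++) {
			if (SEND_TO_KNI(pkts[i])) {
				if (rte_kni_tx_burst(kni, &pkts[i], 1) != 1)
					rte_pktmbuf_free(pkts[i]);
			} else if (SEND_TO_ETH(pkts[i])) {
				/* rewrite the destination MAC here */
				if (rte_eth_tx_burst(port, queue, &pkts[i], 1) != 1)
					rte_pktmbuf_free(pkts[i]);
			} else {
				rte_pktmbuf_free(pkts[i]);	/* not for us: drop */
			}
		}
	}
}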

[dpdk-dev] [PATCH v5 0/2] net/ixgbe: VMDq DCB with SRIOV

2016-10-25 Thread Lu, Wenzhuo
Hi,


> -Original Message-
> From: Iremonger, Bernard
> Sent: Wednesday, October 19, 2016 6:21 PM
> To: dev at dpdk.org; Shah, Rahul R; Lu, Wenzhuo
> Cc: Iremonger, Bernard
> Subject: [PATCH v5 0/2] net/ixgbe: VMDq DCB with SRIOV
> 
> Changes in v5:
> fix enable/disable of the QDE bit in the PFQDE register.
> 
> Changes in v4:
> changes to ixgbe patch following comments.
> 
> Changes in v3:
> rebase to latest master.
> update commit message for ixgbe patch
> add testpmd patch.
> 
> Changes in v2:
> rebase to  latest master.
> 
> Bernard Iremonger (2):
>   net/ixgbe: support multiqueue mode VMDq DCB with SRIOV
>   app/test_pmd: fix DCB configuration
> 
>  app/test-pmd/testpmd.c   |  4 ++--
>  drivers/net/ixgbe/ixgbe_ethdev.c | 11 ++-
>  drivers/net/ixgbe/ixgbe_rxtx.c   | 35 ++-
>  3 files changed, 30 insertions(+), 20 deletions(-)
> 
> --
> 2.10.1
Series-Acked-by: Wenzhuo Lu