Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation

2016-12-02 Thread Olivier Matz
Hi Konstantin,

On Fri, 2 Dec 2016 01:06:30 +0000, "Ananyev, Konstantin" wrote:
> > 
> > 2016-11-23 18:36, Tomasz Kulasek:  
> > > +/**
> > > + * Process a burst of output packets on a transmit queue of an
> > > Ethernet device.
> > > + *
> > > + * The rte_eth_tx_prepare() function is invoked to prepare
> > > output packets to be
> > > + * transmitted on the output queue *queue_id* of the Ethernet
> > > device designated
> > > + * by its *port_id*.
> > > + * The *nb_pkts* parameter is the number of packets to be
> > > prepared which are
> > > + * supplied in the *tx_pkts* array of *rte_mbuf* structures,
> > > each of them
> > > + * allocated from a pool created with rte_pktmbuf_pool_create().
> > > + * For each packet to send, the rte_eth_tx_prepare() function
> > > performs
> > > + * the following operations:
> > > + *
> > > + * - Check if the packet meets the device's requirements for Tx offloads.
> > > + *
> > > + * - Check limitations about number of segments.
> > > + *
> > > + * - Check additional requirements when debug is enabled.
> > > + *
> > > + * - Update and/or reset required checksums when tx offload is set for the packet.
> > > + *
> > > + * Since this function can modify packet data, provided mbufs must be safely
> > > + * writable (e.g. modified data cannot be in a shared segment).  
> > 
> > I think we will have to remove this limitation in future releases.
> > As we don't know how it could affect the API, I suggest declaring
> > this API EXPERIMENTAL.  
> 
> While I don't really mind marking it as experimental, I don't really
> understand the reasoning: why does "this function can modify packet data,
> provided mbufs must be safely writable" suddenly become a problem?
> That seems like an obvious limitation to me, and let's say tx_burst()
> has the same one. Second, I don't see how you are going to remove it
> without introducing a heavy performance impact.
> Konstantin
> 

About tx_burst(), I don't think we should force the user to provide a
writable mbuf. There are many use cases where passing a clone already
works today, and it avoids duplicating the mbuf data. For instance:
traffic generators, multicast, bridging/tap, etc.

Moreover, this requirement would be inconsistent with the model you are
proposing in case of a pipeline (a minimal sketch follows):
 - tx_prepare() on core X, may update the data
 - tx_burst() on core Y, should not touch the data to avoid cache misses
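
For illustration, here is a minimal sketch of that pipeline, assuming an
rte_ring carries the prepared packets between the two cores; all names and
the burst size are illustrative only, and rte_eth_tx_prepare() is the API
proposed in this series:

	#include <rte_ethdev.h>
	#include <rte_ring.h>

	#define BURST 32

	/* core X: may fix up headers/checksums in the packet data */
	static void
	prepare_stage(uint8_t port, uint16_t queue, struct rte_ring *r,
		      struct rte_mbuf **pkts, uint16_t n)
	{
		uint16_t nb_prep;

		nb_prep = rte_eth_tx_prepare(port, queue, pkts, n);
		/* packets not enqueued should be freed or retried; omitted */
		rte_ring_enqueue_burst(r, (void **)pkts, nb_prep);
	}

	/* core Y: only posts descriptors, does not touch the packet data */
	static void
	tx_stage(uint8_t port, uint16_t queue, struct rte_ring *r)
	{
		struct rte_mbuf *pkts[BURST];
		unsigned int n;

		n = rte_ring_dequeue_burst(r, (void **)pkts, BURST);
		rte_eth_tx_burst(port, queue, pkts, (uint16_t)n);
	}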


Regards,
Olivier


[dpdk-dev] [PATCH] scripts: fix checkpatch from standard input

2016-11-28 Thread Olivier Matz
On Mon, 21 Nov 2016 23:42:41 +0100, Thomas Monjalon wrote:
> When checking a valid patch from standard input,
> the footer lines of the report are not filtered out.
> 
> The function check is called outside of any loop,
> so the statement continue has no effect and the footer is printed.
> 
> Fixes: 8005feef421d ("scripts: add standard input to checkpatch")
> 
> Signed-off-by: Thomas Monjalon 

The 'continue' statement is not always without effect. On my machine
(but it looks like it's not the same everywhere):
- with dash, the 'continue' acts like a return in that case
- with bash, it displays an error:
  "continue: only meaningful in a `for', `while', or `until' loop"
- with bash --posix, the 'continue' is ignored...

In my case, checkpatches.sh was displaying "0/1 valid" although there
was no error. This patch solves the issue, thanks.


Acked-by: Olivier Matz 


[dpdk-dev] [PATCH v2] mempool: remove a redundant word "for" in comment

2016-11-28 Thread Olivier Matz
Hi Wei,

On Mon, 28 Nov 2016 09:42:12 +0100
Olivier Matz  wrote:
> Hi Wenzhuo,

First, sorry for the mistake in your name in my previous mail.

Please find below some other comments about the form of the patch.

> On Sun, 27 Nov 2016 10:43:47 +0800
> Wei Zhao  wrote:
> 
> > From: zhao wei 
> > 
> > There is a redundant repetition word "for" in commnet line of the

commnet -> comment

> > file rte_mempool.h after the definition of RTE_MEMPOOL_OPS_NAMESIZE.
> > The word "for"appear twice in line 359 and 360.One of them is

Missing space after '"for"' and after '360.'


> > redundant, so delete it.
> > 
> > Fixes: 449c49b93a6b (" mempool: support handler operations")

We should have an empty line after the 'Fixes:' tag. The
check-git-log.sh script can help you notice these errors.

Also, it is important that no extra spaces are added in the title of the
commit (note the leading space inside the quotes above). You can get the
exact line with:
  git log -1 --abbrev=12 --format='Fixes: %h ("%s")'

> > Signed-off-by: zhao wei 

The name in your .gitconfig should be the same as in your mail:
Wei Zhao 

> > Acked-by: John McNamara   
> 
> Acked-by: Olivier Matz 
> 

Could you please also apply the same comments to the other patch?

Last thing: when doing another version of a patch, you should add a
changelog that describes what was modified. It goes after the three
dashes ('---') in the commit message.

Thank you for contributing.

Regards,
Olivier


[dpdk-dev] [PATCH v2] mempool: remove a redundant word "for" in comment

2016-11-28 Thread Olivier Matz
Hi Wenzhuo,

On Sun, 27 Nov 2016 10:43:47 +0800
Wei Zhao  wrote:

> From: zhao wei 
> 
> There is a redundant repetition word "for" in commnet line of the
> file rte_mempool.h after the definition of RTE_MEMPOOL_OPS_NAMESIZE.
> The word "for"appear twice in line 359 and 360.One of them is
> redundant, so delete it.
> 
> Fixes: 449c49b93a6b (" mempool: support handler operations")
> Signed-off-by: zhao wei 
> Acked-by: John McNamara 

Acked-by: Olivier Matz 



[dpdk-dev] [PATCH v2] ethdev: check number of queues less than RTE_ETHDEV_QUEUE_STAT_CNTRS

2016-11-24 Thread Olivier Matz
Hi,

On Mon, 2016-11-21 at 09:59 +0000, Alejandro Lucero wrote:
> From: Bert van Leeuwen 
> 
> Arrays inside rte_eth_stats have size=RTE_ETHDEV_QUEUE_STAT_CNTRS.
> Some devices report more queues than that, and this code blindly uses
> the number of queues reported by the device to fill those arrays up.
> This patch fixes the problem by taking the MIN of the reported number
> of queues and RTE_ETHDEV_QUEUE_STAT_CNTRS.
> 
> Signed-off-by: Alejandro Lucero 
> 

Reviewed-by: Olivier Matz 


As a next step, I'm wondering if it would be possible to remove
this limitation. We could replace the tables in struct rte_eth_stats
with a pointer to an array allocated dynamically at PMD setup.

It would break the API, so it should be announced first. I'm thinking
of something like:

struct rte_eth_generic_stats {
	uint64_t ipackets;
	uint64_t opackets;
	uint64_t ibytes;
	uint64_t obytes;
	uint64_t imissed;
	uint64_t ierrors;
	uint64_t oerrors;
	uint64_t rx_nombuf;
};

struct rte_eth_stats {
	struct rte_eth_generic_stats port_stats;
	struct rte_eth_generic_stats *queue_stats;
};

The queue_stats array would always be indexed by queue_id.
The xstats would continue to report the generic stats per-port and
per-queue.

About the mapping API, either we keep it as-is, or it could
become a driver-specific API.
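
As a rough sketch of the allocation side (the function name and the error
handling are assumptions for illustration, not part of the proposal):

	#include <errno.h>
	#include <rte_malloc.h>

	static int
	example_alloc_queue_stats(struct rte_eth_stats *stats,
				  uint16_t nb_queues)
	{
		/* one generic stats entry per queue, zeroed */
		stats->queue_stats = rte_zmalloc("queue_stats",
			nb_queues * sizeof(*stats->queue_stats), 0);
		if (stats->queue_stats == NULL)
			return -ENOMEM;
		return 0;
	}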


Thomas, what do you think?

Regards,
Olivier



[dpdk-dev] [RFC 2/9] ethdev: move queue id check in generic layer

2016-11-24 Thread Olivier Matz
Hi Ferruh,

On Thu, 2016-11-24 at 10:59 +0000, Ferruh Yigit wrote:
> On 11/24/2016 9:54 AM, Olivier Matz wrote:
> > The check of queue_id is done in all drivers implementing
> > rte_eth_rx_queue_count(). Factorize this check in the generic
> > function.
> > 
> > > Note that the nfp driver was doing the check differently, which
> > > could induce crashes if the queue index was too big.
> > 
> > > By the way, also move the is_supported test before the port valid
> > > and queue valid test.
> > 
> > PR=52423
> > Signed-off-by: Olivier Matz 
> > Acked-by: Ivan Boule 
> > ---
> 
> <...>
> 
> > diff --git a/lib/librte_ether/rte_ethdev.h
> > b/lib/librte_ether/rte_ethdev.h
> > index c3edc23..9551cfd 100644
> > --- a/lib/librte_ether/rte_ethdev.h
> > +++ b/lib/librte_ether/rte_ethdev.h
> > @@ -2693,7 +2693,7 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
> >   *  The queue id on the specific port.
> >   * @return
> >   *  The number of used descriptors in the specific queue, or:
> > - *  (-EINVAL) if *port_id* is invalid
> > + *  (-EINVAL) if *port_id* or *queue_id* is invalid
> >   *  (-ENOTSUP) if the device does not support this function
> >   */
> >  static inline int
> > @@ -2701,8 +2701,10 @@ rte_eth_rx_queue_count(uint8_t port_id, uint16_t queue_id)
> >  {
> >  	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> >  
> > -	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> >  	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_count, -ENOTSUP);
> > +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> 
> Doing the port validity check before accessing dev->dev_ops->rx_queue_count
> can be a good idea.
> 
> What about validating port_id even before accessing
> rte_eth_devices[port_id]?
> 

oops right, we should not move this line, it's stupid...

Thanks for the feedback,
Olivier



[dpdk-dev] [RFC 1/9] ethdev: clarify api comments of rx queue count

2016-11-24 Thread Olivier Matz
On Thu, 2016-11-24 at 10:52 +0000, Ferruh Yigit wrote:
> On 11/24/2016 9:54 AM, Olivier Matz wrote:
> > The API comments are not consistent with each other.
> > 
> > The function rte_eth_rx_queue_count() returns the number of used
> > descriptors on a receive queue.
> > 
> > PR=52423
> 
> What is this marker?
> 

Sorry, this is a mistake, it's an internal marker...
I hoped nobody would notice it ;)


> > Signed-off-by: Olivier Matz 
> > Acked-by: Ivan Boule 
> 
> Acked-by: Ferruh Yigit 
> 

Thanks for reviewing!

Regards,
Olivier



[dpdk-dev] [RFC 9/9] net/e1000: add handler for tx queue descriptor count

2016-11-24 Thread Olivier Matz
Like for Rx, use a binary search algorithm to get the number of used Tx
descriptors.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/e1000/e1000_ethdev.h |  5 +++-
 drivers/net/e1000/em_ethdev.c|  1 +
 drivers/net/e1000/em_rxtx.c  | 51 
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/net/e1000/e1000_ethdev.h b/drivers/net/e1000/e1000_ethdev.h
index ad9ddaf..8945916 100644
--- a/drivers/net/e1000/e1000_ethdev.h
+++ b/drivers/net/e1000/e1000_ethdev.h
@@ -364,7 +364,10 @@ int eth_em_rx_queue_setup(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 		struct rte_mempool *mb_pool);

 uint32_t eth_em_rx_queue_count(struct rte_eth_dev *dev,
-   uint16_t rx_queue_id);
+  uint16_t rx_queue_id);
+
+uint32_t eth_em_tx_queue_count(struct rte_eth_dev *dev,
+  uint16_t tx_queue_id);

 int eth_em_rx_descriptor_done(void *rx_queue, uint16_t offset);

diff --git a/drivers/net/e1000/em_ethdev.c b/drivers/net/e1000/em_ethdev.c
index 866a5cf..7fe5e3b 100644
--- a/drivers/net/e1000/em_ethdev.c
+++ b/drivers/net/e1000/em_ethdev.c
@@ -190,6 +190,7 @@ static const struct eth_dev_ops eth_em_ops = {
.rx_queue_setup   = eth_em_rx_queue_setup,
.rx_queue_release = eth_em_rx_queue_release,
.rx_queue_count   = eth_em_rx_queue_count,
+   .tx_queue_count   = eth_em_tx_queue_count,
.rx_descriptor_done   = eth_em_rx_descriptor_done,
.tx_queue_setup   = eth_em_tx_queue_setup,
.tx_queue_release = eth_em_tx_queue_release,
diff --git a/drivers/net/e1000/em_rxtx.c b/drivers/net/e1000/em_rxtx.c
index a469fd7..8afcfda 100644
--- a/drivers/net/e1000/em_rxtx.c
+++ b/drivers/net/e1000/em_rxtx.c
@@ -1432,6 +1432,57 @@ eth_em_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return offset;
 }

+uint32_t
+eth_em_tx_queue_count(struct rte_eth_dev *dev, uint16_t tx_queue_id)
+{
+   volatile uint8_t *status;
+   struct em_tx_queue *txq;
+   int32_t offset, interval, idx, resolution;
+
+   txq = dev->data->tx_queues[tx_queue_id];
+
+   /* check if ring empty */
+   idx = txq->tx_tail - 1;
+   if (idx < 0)
+   idx += txq->nb_tx_desc;
+   status = &txq->tx_ring[idx].upper.fields.status;
+   if (*status & E1000_TXD_STAT_DD)
+   return 0;
+
+   /* check if ring full */
+   idx = txq->tx_tail + 1;
+   if (idx >= txq->nb_tx_desc)
+   idx -= txq->nb_tx_desc;
+   status = &txq->tx_ring[idx].upper.fields.status;
+   if (!(*status & E1000_TXD_STAT_DD))
+   return txq->nb_tx_desc;
+
+   /* decrease the precision if ring is large */
+   if (txq->nb_tx_desc <= 256)
+   resolution = 4;
+   else
+   resolution = 16;
+
+   /* use a binary search */
+   offset = txq->nb_tx_desc >> 1;
+   interval = offset;
+
+   do {
+   idx = txq->tx_tail + offset;
+   if (idx >= txq->nb_tx_desc)
+   idx -= txq->nb_tx_desc;
+
+   interval >>= 1;
+   status = &txq->tx_ring[idx].upper.fields.status;
+   if (*status & E1000_TXD_STAT_DD)
+   offset += interval;
+   else
+   offset -= interval;
+   } while (interval >= resolution);
+
+   return txq->nb_tx_desc - offset;
+}
+
 int
 eth_em_rx_descriptor_done(void *rx_queue, uint16_t offset)
 {
-- 
2.8.1



[dpdk-dev] [RFC 8/9] net/e1000: optimize rx queue descriptor count

2016-11-24 Thread Olivier Matz
Use a binary search algorithm to find the first empty DD bit. The
ring-empty and ring-full cases are managed separately as they are more
likely to happen.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/e1000/em_rxtx.c | 55 +
 1 file changed, 41 insertions(+), 14 deletions(-)

diff --git a/drivers/net/e1000/em_rxtx.c b/drivers/net/e1000/em_rxtx.c
index c1c724b..a469fd7 100644
--- a/drivers/net/e1000/em_rxtx.c
+++ b/drivers/net/e1000/em_rxtx.c
@@ -1385,24 +1385,51 @@ eth_em_rx_queue_setup(struct rte_eth_dev *dev,
 uint32_t
 eth_em_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 {
-#define EM_RXQ_SCAN_INTERVAL 4
-   volatile struct e1000_rx_desc *rxdp;
+   volatile uint8_t *status;
struct em_rx_queue *rxq;
-   uint32_t desc = 0;
+   uint32_t offset, interval, resolution;
+   int32_t idx;

rxq = dev->data->rx_queues[rx_queue_id];
-   rxdp = &(rxq->rx_ring[rxq->rx_tail]);
-
-   while ((desc < rxq->nb_rx_desc) &&
-   (rxdp->status & E1000_RXD_STAT_DD)) {
-   desc += EM_RXQ_SCAN_INTERVAL;
-   rxdp += EM_RXQ_SCAN_INTERVAL;
-   if (rxq->rx_tail + desc >= rxq->nb_rx_desc)
-   rxdp = &(rxq->rx_ring[rxq->rx_tail +
-   desc - rxq->nb_rx_desc]);
-   }

-   return desc;
+   /* check if ring empty */
+   idx = rxq->rx_tail;
+   status = &rxq->rx_ring[idx].status;
+   if (!(*status & E1000_RXD_STAT_DD))
+   return 0;
+
+   /* decrease the precision if ring is large */
+   if (rxq->nb_rx_desc <= 256)
+   resolution = 4;
+   else
+   resolution = 16;
+
+   /* check if ring full */
+   idx = rxq->rx_tail - rxq->nb_rx_hold - resolution;
+   if (idx < 0)
+   idx += rxq->nb_rx_desc;
+   status = &rxq->rx_ring[idx].status;
+   if (*status & E1000_RXD_STAT_DD)
+   return rxq->nb_rx_desc;
+
+   /* use a binary search */
+   interval = (rxq->nb_rx_desc - rxq->nb_rx_hold) >> 1;
+   offset = interval;
+
+   do {
+   idx = rxq->rx_tail + offset;
+   if (idx >= rxq->nb_rx_desc)
+   idx -= rxq->nb_rx_desc;
+
+   interval >>= 1;
+   status = &rxq->rx_ring[idx].status;
+   if (*status & E1000_RXD_STAT_DD)
+   offset += interval;
+   else
+   offset -= interval;
+   } while (interval >= resolution);
+
+   return offset;
 }

 int
-- 
2.8.1



[dpdk-dev] [RFC 7/9] net/igb: add handler for tx queue descriptor count

2016-11-24 Thread Olivier Matz
Like for Rx, use a binary search algorithm to get the number of used Tx
descriptors.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/e1000/e1000_ethdev.h |  5 +++-
 drivers/net/e1000/igb_ethdev.c   |  1 +
 drivers/net/e1000/igb_rxtx.c | 51 
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/net/e1000/e1000_ethdev.h b/drivers/net/e1000/e1000_ethdev.h
index 6c25c8d..ad9ddaf 100644
--- a/drivers/net/e1000/e1000_ethdev.h
+++ b/drivers/net/e1000/e1000_ethdev.h
@@ -300,7 +300,10 @@ int eth_igb_rx_queue_setup(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 		struct rte_mempool *mb_pool);

 uint32_t eth_igb_rx_queue_count(struct rte_eth_dev *dev,
-   uint16_t rx_queue_id);
+   uint16_t rx_queue_id);
+
+uint32_t eth_igb_tx_queue_count(struct rte_eth_dev *dev,
+   uint16_t tx_queue_id);

 int eth_igb_rx_descriptor_done(void *rx_queue, uint16_t offset);

diff --git a/drivers/net/e1000/igb_ethdev.c b/drivers/net/e1000/igb_ethdev.c
index 08f2a68..a54d374 100644
--- a/drivers/net/e1000/igb_ethdev.c
+++ b/drivers/net/e1000/igb_ethdev.c
@@ -399,6 +399,7 @@ static const struct eth_dev_ops eth_igb_ops = {
.rx_queue_intr_disable = eth_igb_rx_queue_intr_disable,
.rx_queue_release = eth_igb_rx_queue_release,
.rx_queue_count   = eth_igb_rx_queue_count,
+   .tx_queue_count   = eth_igb_tx_queue_count,
.rx_descriptor_done   = eth_igb_rx_descriptor_done,
.tx_queue_setup   = eth_igb_tx_queue_setup,
.tx_queue_release = eth_igb_tx_queue_release,
diff --git a/drivers/net/e1000/igb_rxtx.c b/drivers/net/e1000/igb_rxtx.c
index 6b0111f..2ff2417 100644
--- a/drivers/net/e1000/igb_rxtx.c
+++ b/drivers/net/e1000/igb_rxtx.c
@@ -1554,6 +1554,57 @@ eth_igb_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return offset;
 }

+uint32_t
+eth_igb_tx_queue_count(struct rte_eth_dev *dev, uint16_t tx_queue_id)
+{
+   volatile uint32_t *status;
+   struct igb_tx_queue *txq;
+   int32_t offset, interval, idx, resolution;
+
+   txq = dev->data->tx_queues[tx_queue_id];
+
+   /* check if ring empty */
+   idx = txq->tx_tail - 1;
+   if (idx < 0)
+   idx += txq->nb_tx_desc;
+   status = &txq->tx_ring[idx].wb.status;
+   if (*status & rte_cpu_to_le_32(E1000_TXD_STAT_DD))
+   return 0;
+
+   /* check if ring full */
+   idx = txq->tx_tail + 1;
+   if (idx >= txq->nb_tx_desc)
+   idx -= txq->nb_tx_desc;
+   status = &txq->tx_ring[idx].wb.status;
+   if (!(*status & rte_cpu_to_le_32(E1000_TXD_STAT_DD)))
+   return txq->nb_tx_desc;
+
+   /* decrease the precision if ring is large */
+   if (txq->nb_tx_desc <= 256)
+   resolution = 4;
+   else
+   resolution = 16;
+
+   /* use a binary search */
+   interval = txq->nb_tx_desc >> 1;
+   offset = interval;
+
+   do {
+   interval >>= 1;
+   idx = txq->tx_tail + offset;
+   if (idx >= txq->nb_tx_desc)
+   idx -= txq->nb_tx_desc;
+
+   status = &txq->tx_ring[idx].wb.status;
+   if (*status & rte_cpu_to_le_32(E1000_TXD_STAT_DD))
+   offset += interval;
+   else
+   offset -= interval;
+   } while (interval >= resolution);
+
+   return txq->nb_tx_desc - offset;
+}
+
 int
 eth_igb_rx_descriptor_done(void *rx_queue, uint16_t offset)
 {
-- 
2.8.1



[dpdk-dev] [RFC 6/9] net/igb: optimize rx queue descriptor count

2016-11-24 Thread Olivier Matz
Use a binary search algorithm to find the first empty DD bit. The
ring-empty and ring-full cases are managed separately as they are more
likely to happen.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/e1000/igb_rxtx.c | 55 +---
 1 file changed, 41 insertions(+), 14 deletions(-)

diff --git a/drivers/net/e1000/igb_rxtx.c b/drivers/net/e1000/igb_rxtx.c
index e9aa356..6b0111f 100644
--- a/drivers/net/e1000/igb_rxtx.c
+++ b/drivers/net/e1000/igb_rxtx.c
@@ -1507,24 +1507,51 @@ eth_igb_rx_queue_setup(struct rte_eth_dev *dev,
 uint32_t
 eth_igb_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 {
-#define IGB_RXQ_SCAN_INTERVAL 4
-   volatile union e1000_adv_rx_desc *rxdp;
+   volatile uint32_t *status;
struct igb_rx_queue *rxq;
-   uint32_t desc = 0;
+   uint32_t offset, interval, resolution;
+   int32_t idx;

rxq = dev->data->rx_queues[rx_queue_id];
-   rxdp = &(rxq->rx_ring[rxq->rx_tail]);
-
-   while ((desc < rxq->nb_rx_desc) &&
-   (rxdp->wb.upper.status_error & E1000_RXD_STAT_DD)) {
-   desc += IGB_RXQ_SCAN_INTERVAL;
-   rxdp += IGB_RXQ_SCAN_INTERVAL;
-   if (rxq->rx_tail + desc >= rxq->nb_rx_desc)
-   rxdp = &(rxq->rx_ring[rxq->rx_tail +
-   desc - rxq->nb_rx_desc]);
-   }

-   return desc;
+   /* check if ring empty */
+   idx = rxq->rx_tail;
+   status = &rxq->rx_ring[idx].wb.upper.status_error;
+   if (!(*status & rte_cpu_to_le_32(E1000_RXD_STAT_DD)))
+   return 0;
+
+   /* decrease the precision if ring is large */
+   if (rxq->nb_rx_desc <= 256)
+   resolution = 4;
+   else
+   resolution = 16;
+
+   /* check if ring full */
+   idx = rxq->rx_tail - rxq->nb_rx_hold - resolution;
+   if (idx < 0)
+   idx += rxq->nb_rx_desc;
+   status = &rxq->rx_ring[idx].wb.upper.status_error;
+   if (*status & rte_cpu_to_le_32(E1000_RXD_STAT_DD))
+   return rxq->nb_rx_desc;
+
+   /* use a binary search */
+   interval = (rxq->nb_rx_desc - rxq->nb_rx_hold) >> 1;
+   offset = interval;
+
+   do {
+   idx = rxq->rx_tail + offset;
+   if (idx >= rxq->nb_rx_desc)
+   idx -= rxq->nb_rx_desc;
+
+   interval >>= 1;
+   status = &rxq->rx_ring[idx].wb.upper.status_error;
+   if (*status & rte_cpu_to_le_32(E1000_RXD_STAT_DD))
+   offset += interval;
+   else
+   offset -= interval;
+   } while (interval >= resolution);
+
+   return offset;
 }

 int
-- 
2.8.1



[dpdk-dev] [RFC 5/9] net/ixgbe: add handler for Tx queue descriptor count

2016-11-24 Thread Olivier Matz
Like for Rx, use a binary search algorithm to get the number of used Tx
descriptors.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  1 +
 drivers/net/ixgbe/ixgbe_ethdev.h |  4 ++-
 drivers/net/ixgbe/ixgbe_rxtx.c   | 57 
 drivers/net/ixgbe/ixgbe_rxtx.h   |  2 ++
 4 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index baffc71..0ba098a 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -553,6 +553,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
.rx_queue_intr_disable = ixgbe_dev_rx_queue_intr_disable,
.rx_queue_release = ixgbe_dev_rx_queue_release,
.rx_queue_count   = ixgbe_dev_rx_queue_count,
+   .tx_queue_count   = ixgbe_dev_tx_queue_count,
.rx_descriptor_done   = ixgbe_dev_rx_descriptor_done,
.tx_queue_setup   = ixgbe_dev_tx_queue_setup,
.tx_queue_release = ixgbe_dev_tx_queue_release,
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.h b/drivers/net/ixgbe/ixgbe_ethdev.h
index 4ff6338..e060c3d 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.h
+++ b/drivers/net/ixgbe/ixgbe_ethdev.h
@@ -348,7 +348,9 @@ int  ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev, uint16_t tx_queue_id,
 		const struct rte_eth_txconf *tx_conf);

 uint32_t ixgbe_dev_rx_queue_count(struct rte_eth_dev *dev,
-   uint16_t rx_queue_id);
+ uint16_t rx_queue_id);
+uint32_t ixgbe_dev_tx_queue_count(struct rte_eth_dev *dev,
+ uint16_t tx_queue_id);

 int ixgbe_dev_rx_descriptor_done(void *rx_queue, uint16_t offset);
 int ixgbevf_dev_rx_descriptor_done(void *rx_queue, uint16_t offset);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 07509b4..5bf6b1a 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2437,6 +2437,7 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,

txq->nb_tx_desc = nb_desc;
txq->tx_rs_thresh = tx_rs_thresh;
+   txq->tx_rs_thresh_div = nb_desc / tx_rs_thresh;
txq->tx_free_thresh = tx_free_thresh;
txq->pthresh = tx_conf->tx_thresh.pthresh;
txq->hthresh = tx_conf->tx_thresh.hthresh;
@@ -2906,6 +2907,62 @@ ixgbe_dev_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
return offset;
 }

+uint32_t
+ixgbe_dev_tx_queue_count(struct rte_eth_dev *dev, uint16_t tx_queue_id)
+{
+   struct ixgbe_tx_queue *txq;
+   uint32_t status;
+   int32_t offset, interval, idx = 0;
+   int32_t max_offset, used_desc;
+
+   txq = dev->data->tx_queues[tx_queue_id];
+
+   /* if DD on next threshold desc is not set, assume used packets
+* are pending.
+*/
+   status = txq->tx_ring[txq->tx_next_dd].wb.status;
+   if (!(status & rte_cpu_to_le_32(IXGBE_ADVTXD_STAT_DD)))
+   return txq->nb_tx_desc - txq->nb_tx_free - 1;
+
+   /* browse DD bits between tail starting from tx_next_dd: we have
+* to be careful since DD bits are only set every tx_rs_thresh
+* descriptor.
+*/
+   interval = txq->tx_rs_thresh_div >> 1;
+   offset = interval * txq->tx_rs_thresh;
+
+   /* don't go beyond tail */
+   max_offset = txq->tx_tail - txq->tx_next_dd;
+   if (max_offset < 0)
+   max_offset += txq->nb_tx_desc;
+
+   do {
+   interval >>= 1;
+
+   if (offset >= max_offset) {
+   offset -= (interval * txq->tx_rs_thresh);
+   continue;
+   }
+
+   idx = txq->tx_next_dd + offset;
+   if (idx >= txq->nb_tx_desc)
+   idx -= txq->nb_tx_desc;
+
+   status = txq->tx_ring[idx].wb.status;
+   if (status & rte_cpu_to_le_32(IXGBE_ADVTXD_STAT_DD))
+   offset += (interval * txq->tx_rs_thresh);
+   else
+   offset -= (interval * txq->tx_rs_thresh);
+   } while (interval > 0);
+
+   /* idx is now the index of the head */
+   used_desc = txq->tx_tail - idx;
+   if (used_desc < 0)
+   used_desc += txq->nb_tx_desc;
+
+   return used_desc;
+}
+
 int
 ixgbe_dev_rx_descriptor_done(void *rx_queue, uint16_t offset)
 {
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 2608b36..f69b5de 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -221,6 +221,8 @@ struct ixgbe_tx_queue {
 	uint16_t            tx_free_thresh;
 	/** Number of TX descriptors to use before RS bit is set. */
 	uint16_t            tx_rs_thresh;
+	/** Number of TX descriptors divided by tx_rs_thresh. */
+	uint16_t            tx_rs_thresh_div;

[dpdk-dev] [RFC 4/9] net/ixgbe: optimize Rx queue descriptor count

2016-11-24 Thread Olivier Matz
Use a binary search algorithm to find the first empty DD bit. The
ring-empty and ring-full cases are managed separately as they are more
likely to happen.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/ixgbe/ixgbe_rxtx.c | 63 --
 1 file changed, 48 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 1a8ea5f..07509b4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2852,25 +2852,58 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
 uint32_t
 ixgbe_dev_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 {
-#define IXGBE_RXQ_SCAN_INTERVAL 4
-   volatile union ixgbe_adv_rx_desc *rxdp;
+   volatile uint32_t *status;
struct ixgbe_rx_queue *rxq;
-   uint32_t desc = 0;
+   uint32_t offset, interval, nb_hold, resolution;
+   int32_t idx;

rxq = dev->data->rx_queues[rx_queue_id];
-   rxdp = &(rxq->rx_ring[rxq->rx_tail]);
-
-   while ((desc < rxq->nb_rx_desc) &&
-   (rxdp->wb.upper.status_error &
-   rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD))) {
-   desc += IXGBE_RXQ_SCAN_INTERVAL;
-   rxdp += IXGBE_RXQ_SCAN_INTERVAL;
-   if (rxq->rx_tail + desc >= rxq->nb_rx_desc)
-   rxdp = &(rxq->rx_ring[rxq->rx_tail +
-   desc - rxq->nb_rx_desc]);
-   }

-   return desc;
+   /* check if ring empty */
+   idx = rxq->rx_tail;
+   status = &rxq->rx_ring[idx].wb.upper.status_error;
+   if (!(*status & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
+   return 0;
+
+   /* decrease the precision if ring is large */
+   if (rxq->nb_rx_desc <= 256)
+   resolution = 4;
+   else
+   resolution = 16;
+
+   /* check if ring full */
+#ifdef RTE_IXGBE_INC_VECTOR
+   if (rxq->rx_using_sse)
+   nb_hold = rxq->rxrearm_nb;
+   else
+#endif
+   nb_hold = rxq->nb_rx_hold;
+
+   idx = rxq->rx_tail - nb_hold - resolution;
+   if (idx < 0)
+   idx += rxq->nb_rx_desc;
+   status = &rxq->rx_ring[idx].wb.upper.status_error;
+   if (*status & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD))
+   return rxq->nb_rx_desc;
+
+   /* use a binary search */
+   interval = (rxq->nb_rx_desc - nb_hold) >> 1;
+   offset = interval;
+
+   do {
+   idx = rxq->rx_tail + offset;
+   if (idx >= rxq->nb_rx_desc)
+   idx -= rxq->nb_rx_desc;
+
+   interval >>= 1;
+   status = &rxq->rx_ring[idx].wb.upper.status_error;
+   if (*status & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD))
+   offset += interval;
+   else
+   offset -= interval;
+   } while (interval >= resolution);
+
+   return offset;
 }

 int
-- 
2.8.1



[dpdk-dev] [RFC 3/9] ethdev: add handler for Tx queue descriptor count

2016-11-24 Thread Olivier Matz
Implement the Tx counterpart of rte_eth_rx_queue_count() in ethdev API,
which returns the number of used descriptors in a Tx queue.

It can help an application to detect that a link is too slow and cannot
send at the desired rate. In this case, the application can decide to
decrease the rate, or drop the packets with the lowest priority.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 lib/librte_ether/rte_ethdev.h | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 9551cfd..8244807 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -1147,6 +1147,10 @@ typedef uint32_t (*eth_rx_queue_count_t)(struct rte_eth_dev *dev,
 uint16_t rx_queue_id);
 /**< @internal Get number of used descriptors on a receive queue. */

+typedef uint32_t (*eth_tx_queue_count_t)(struct rte_eth_dev *dev,
+uint16_t tx_queue_id);
+/**< @internal Get number of used descriptors on a transmit queue */
+
 typedef int (*eth_rx_descriptor_done_t)(void *rxq, uint16_t offset);
 /**< @internal Check DD bit of specific RX descriptor */

@@ -1461,6 +1465,8 @@ struct eth_dev_ops {
	eth_queue_release_t    rx_queue_release; /**< Release RX queue.*/
eth_rx_queue_count_t   rx_queue_count;
/**< Get the number of used RX descriptors. */
+   eth_tx_queue_count_t   tx_queue_count;
+   /**< Get the number of used TX descriptors. */
eth_rx_descriptor_done_t   rx_descriptor_done;  /**< Check rxd DD bit */
/**< Enable Rx queue interrupt. */
eth_rx_enable_intr_t   rx_queue_intr_enable;
@@ -2710,6 +2716,31 @@ rte_eth_rx_queue_count(uint8_t port_id, uint16_t queue_id)
 }

 /**
+ * Get the number of used descriptors of a tx queue
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The queue id on the specific port.
+ * @return
+ *  - number of used descriptors if positive or zero
+ *  - (-EINVAL) if *port_id* or *queue_id* is invalid.
+ *  - (-ENOTSUP) if the device does not support this function
+ */
+static inline int
+rte_eth_tx_queue_count(uint8_t port_id, uint16_t queue_id)
+{
+   struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->tx_queue_count, -ENOTSUP);
+   RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+   if (queue_id >= dev->data->nb_tx_queues)
+   return -EINVAL;
+
+   return (*dev->dev_ops->tx_queue_count)(dev, queue_id);
+}
+
+/**
  * Check if the DD bit of the specific RX descriptor in the queue has been set
  *
  * @param port_id
-- 
2.8.1
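
As a usage illustration of the rte_eth_tx_queue_count() API added above
(the 3/4 threshold and the function name are only an example, not part of
the patch):

	/* back off when the Tx queue is more than 3/4 full */
	static int
	tx_queue_is_congested(uint8_t port, uint16_t queue, uint16_t nb_desc)
	{
		int used = rte_eth_tx_queue_count(port, queue);

		if (used < 0)
			return 0; /* -ENOTSUP or -EINVAL: no information */
		return (uint32_t)used > (3u * nb_desc) / 4;
	}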



[dpdk-dev] [RFC 2/9] ethdev: move queue id check in generic layer

2016-11-24 Thread Olivier Matz
The check of queue_id is done in all drivers implementing
rte_eth_rx_queue_count(). Factorize this check in the generic function.

Note that the nfp driver was doing the check differently, which could
induce crashes if the queue index was too big.

By the way, also move the is_supported test before the port valid and
queue valid test.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 drivers/net/e1000/em_rxtx.c| 5 -
 drivers/net/e1000/igb_rxtx.c   | 5 -
 drivers/net/i40e/i40e_rxtx.c   | 5 -
 drivers/net/ixgbe/ixgbe_rxtx.c | 5 -
 drivers/net/nfp/nfp_net.c  | 6 --
 lib/librte_ether/rte_ethdev.h  | 6 --
 6 files changed, 4 insertions(+), 28 deletions(-)

diff --git a/drivers/net/e1000/em_rxtx.c b/drivers/net/e1000/em_rxtx.c
index 41f51c0..c1c724b 100644
--- a/drivers/net/e1000/em_rxtx.c
+++ b/drivers/net/e1000/em_rxtx.c
@@ -1390,11 +1390,6 @@ eth_em_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
struct em_rx_queue *rxq;
uint32_t desc = 0;

-   if (rx_queue_id >= dev->data->nb_rx_queues) {
-   PMD_RX_LOG(DEBUG, "Invalid RX queue_id=%d", rx_queue_id);
-   return 0;
-   }
-
rxq = dev->data->rx_queues[rx_queue_id];
rxdp = &(rxq->rx_ring[rxq->rx_tail]);

diff --git a/drivers/net/e1000/igb_rxtx.c b/drivers/net/e1000/igb_rxtx.c
index dbd37ac..e9aa356 100644
--- a/drivers/net/e1000/igb_rxtx.c
+++ b/drivers/net/e1000/igb_rxtx.c
@@ -1512,11 +1512,6 @@ eth_igb_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
struct igb_rx_queue *rxq;
uint32_t desc = 0;

-   if (rx_queue_id >= dev->data->nb_rx_queues) {
-   PMD_RX_LOG(ERR, "Invalid RX queue id=%d", rx_queue_id);
-   return 0;
-   }
-
rxq = dev->data->rx_queues[rx_queue_id];
rxdp = &(rxq->rx_ring[rxq->rx_tail]);

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 7ae7d9f..79a72f0 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -1793,11 +1793,6 @@ i40e_dev_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
struct i40e_rx_queue *rxq;
uint16_t desc = 0;

-   if (unlikely(rx_queue_id >= dev->data->nb_rx_queues)) {
-   PMD_DRV_LOG(ERR, "Invalid RX queue id %u", rx_queue_id);
-   return 0;
-   }
-
rxq = dev->data->rx_queues[rx_queue_id];
rxdp = &(rxq->rx_ring[rxq->rx_tail]);
while ((desc < rxq->nb_rx_desc) &&
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index b2d9f45..1a8ea5f 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2857,11 +2857,6 @@ ixgbe_dev_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id)
struct ixgbe_rx_queue *rxq;
uint32_t desc = 0;

-   if (rx_queue_id >= dev->data->nb_rx_queues) {
-   PMD_RX_LOG(ERR, "Invalid RX queue id=%d", rx_queue_id);
-   return 0;
-   }
-
rxq = dev->data->rx_queues[rx_queue_id];
rxdp = &(rxq->rx_ring[rxq->rx_tail]);

diff --git a/drivers/net/nfp/nfp_net.c b/drivers/net/nfp/nfp_net.c
index e315dd8..f1d00fb 100644
--- a/drivers/net/nfp/nfp_net.c
+++ b/drivers/net/nfp/nfp_net.c
@@ -1084,12 +1084,6 @@ nfp_net_rx_queue_count(struct rte_eth_dev *dev, uint16_t queue_idx)
uint32_t count;

rxq = (struct nfp_net_rxq *)dev->data->rx_queues[queue_idx];
-
-   if (rxq == NULL) {
-   PMD_INIT_LOG(ERR, "Bad queue: %u\n", queue_idx);
-   return 0;
-   }
-
idx = rxq->rd_p % rxq->rx_count;
	rxds = &rxq->rxds[idx];

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index c3edc23..9551cfd 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -2693,7 +2693,7 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
  *  The queue id on the specific port.
  * @return
  *  The number of used descriptors in the specific queue, or:
- * (-EINVAL) if *port_id* is invalid
+ * (-EINVAL) if *port_id* or *queue_id* is invalid
  * (-ENOTSUP) if the device does not support this function
  */
 static inline int
@@ -2701,8 +2701,10 @@ rte_eth_rx_queue_count(uint8_t port_id, uint16_t queue_id)
 {
	struct rte_eth_dev *dev = &rte_eth_devices[port_id];

-   RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_count, -ENOTSUP);
+   RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+   if (queue_id >= dev->data->nb_rx_queues)
+   return -EINVAL;

return (*dev->dev_ops->rx_queue_count)(dev, queue_id);
 }
-- 
2.8.1



[dpdk-dev] [RFC 1/9] ethdev: clarify api comments of rx queue count

2016-11-24 Thread Olivier Matz
The API comments are not consistent with each other.

The function rte_eth_rx_queue_count() returns the number of used
descriptors on a receive queue.

PR=52423
Signed-off-by: Olivier Matz 
Acked-by: Ivan Boule 
---
 lib/librte_ether/rte_ethdev.h | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 9678179..c3edc23 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -1145,7 +1145,7 @@ typedef void (*eth_queue_release_t)(void *queue);

 typedef uint32_t (*eth_rx_queue_count_t)(struct rte_eth_dev *dev,
 uint16_t rx_queue_id);
-/**< @internal Get number of available descriptors on a receive queue of an Ethernet device. */
+/**< @internal Get number of used descriptors on a receive queue. */

 typedef int (*eth_rx_descriptor_done_t)(void *rxq, uint16_t offset);
 /**< @internal Check DD bit of specific RX descriptor */
@@ -1459,7 +1459,8 @@ struct eth_dev_ops {
eth_queue_stop_t   tx_queue_stop;/**< Stop TX for a queue.*/
eth_rx_queue_setup_t   rx_queue_setup;/**< Set up device RX queue.*/
	eth_queue_release_t    rx_queue_release; /**< Release RX queue.*/
-   eth_rx_queue_count_t   rx_queue_count; /**< Get Rx queue count. */
+   eth_rx_queue_count_t   rx_queue_count;
+   /**< Get the number of used RX descriptors. */
eth_rx_descriptor_done_t   rx_descriptor_done;  /**< Check rxd DD bit */
/**< Enable Rx queue interrupt. */
eth_rx_enable_intr_t   rx_queue_intr_enable;
@@ -2684,7 +2685,7 @@ rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
 }

 /**
- * Get the number of used descriptors in a specific queue
+ * Get the number of used descriptors of a rx queue
  *
  * @param port_id
  *  The port identifier of the Ethernet device.
@@ -2699,9 +2700,11 @@ static inline int
 rte_eth_rx_queue_count(uint8_t port_id, uint16_t queue_id)
 {
	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_count, -ENOTSUP);
-return (*dev->dev_ops->rx_queue_count)(dev, queue_id);
+
+   return (*dev->dev_ops->rx_queue_count)(dev, queue_id);
 }

 /**
-- 
2.8.1



[dpdk-dev] [RFC 0/9] get Rx and Tx used descriptors

2016-11-24 Thread Olivier Matz
Here is a reminder of how the ixgbe Tx ring is managed:

txq->tx_tail: sw value for tail register
txq->tx_free_thresh: free buffers if count(free descriptors) < this value
txq->tx_rs_thresh: RS bit is set every X descriptor
txq->tx_next_dd: next desc to scan for DD bit
txq->tx_next_rs: next desc to set RS bit
txq->last_desc_cleaned: last descriptor that has been cleaned
txq->nb_tx_free: number of free descriptors

Example, with rs_thresh=8:

  last_desc_cleaned = 8    last descriptor that has been cleaned
  next_dd           = 15   'D': descriptor with the DD + RS bits set
  hw_head           = 20   descriptor currently processed by the hardware
  tail              = 45   sw value for the tail register
  next_rs           = 47   'R': next descriptor to get the RS bit

In the original figure, 'x' marks packets in the txq (not sent yet),
'.' marks packets already sent but not yet freed by the sw, and
nb_used spans the used part of the ring. On the next call to
ixgbe_tx_free_bufs(), some buffers will be freed.

The new implementation does a binary search (checking for DD) between next_dd
and tail.
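
A condensed sketch of that search, in the style of the Rx handlers in
this series ('desc_is_done' stands in for the driver-specific DD-bit
check, and the resolution of 4 matches the small-ring case in the
patches):

	static uint32_t
	count_used_descs(uint16_t tail, uint16_t nb_desc,
			 int (*desc_is_done)(uint16_t idx))
	{
		uint32_t offset, interval;
		uint16_t idx;

		/* start in the middle of the searched area */
		interval = nb_desc >> 1;
		offset = interval;

		do {
			idx = (tail + offset) % nb_desc;
			interval >>= 1;
			if (desc_is_done(idx))
				offset += interval; /* head is further */
			else
				offset -= interval; /* head is closer */
		} while (interval >= 4);

		return offset;
	}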



Olivier Matz (9):
  ethdev: clarify api comments of rx queue count
  ethdev: move queue id check in generic layer
  ethdev: add handler for Tx queue descriptor count
  net/ixgbe: optimize Rx queue descriptor count
  net/ixgbe: add handler for Tx queue descriptor count
  net/igb: optimize rx queue descriptor count
  net/igb: add handler for tx queue descriptor count
  net/e1000: optimize rx queue descriptor count
  net/e1000: add handler for tx queue descriptor count

 drivers/net/e1000/e1000_ethdev.h |  10 +++-
 drivers/net/e1000/em_ethdev.c|   1 +
 drivers/net/e1000/em_rxtx.c  | 109 --
 drivers/net/e1000/igb_ethdev.c   |   1 +
 drivers/net/e1000/igb_rxtx.c | 109 --
 drivers/net/i40e/i40e_rxtx.c |   5 --
 drivers/net/ixgbe/ixgbe_ethdev.c |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.h |   4 +-
 drivers/net/ixgbe/ixgbe_rxtx.c   | 123 +--
 drivers/net/ixgbe/ixgbe_rxtx.h   |   2 +
 drivers/net/nfp/nfp_net.c|   6 --
 lib/librte_ether/rte_ethdev.h|  48 +--
 12 files changed, 344 insertions(+), 75 deletions(-)

-- 
2.8.1



[dpdk-dev] [PATCH 5/5] net/virtio: fix TSO when mbuf is shared

2016-11-24 Thread Olivier Matz
With virtio, doing TSO requires modifying the network
packet data:
- the DPDK API requires to set the L4 checksum to an
  Intel-NIC-like pseudo header checksum that does
  not include the IP length
- the virtio peer expects the L4 checksum to be
  a standard pseudo header checksum.

This is a problem with shared packets, because they
should not be modified.

This patch fixes this issue by copying the headers into
a linear buffer in that case. This buffer is located in
the virtio_tx_region, at the same place where the
virtio header is stored.

The size of this buffer is set to 256, which should
be enough in all cases:
  sizeof(ethernet) + sizeof(vlan) * 2 + sizeof(ip6) +
  sizeof(ip6-ext) + sizeof(tcp) + sizeof(tcp-opts)
  = 14 + 8 + 40 + sizeof(ip6-ext) + 40 + sizeof(tcp-opts)
  = 102 + sizeof(ip6-ext) + sizeof(tcp-opts)

Fixes: 696573046e9e ("net/virtio: support TSO")

Signed-off-by: Olivier Matz 
---
 drivers/net/virtio/virtio_rxtx.c | 119 +++
 drivers/net/virtio/virtqueue.h   |   2 +
 2 files changed, 85 insertions(+), 36 deletions(-)

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 22d97a4..577c775 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -211,43 +211,73 @@ virtqueue_enqueue_recv_refill(struct virtqueue *vq, struct rte_mbuf *cookie)

 /* When doing TSO, the IP length is not included in the pseudo header
  * checksum of the packet given to the PMD, but for virtio it is
- * expected.
+ * expected. Fix the mbuf or a copy if the mbuf is shared.
  */
-static void
-virtio_tso_fix_cksum(struct rte_mbuf *m)
+static unsigned int
+virtio_tso_fix_cksum(struct rte_mbuf *m, char *hdr, size_t hdr_sz)
 {
-   /* common case: header is not fragmented */
-   if (likely(rte_pktmbuf_data_len(m) >= m->l2_len + m->l3_len +
-   m->l4_len)) {
-   struct ipv4_hdr *iph;
-   struct ipv6_hdr *ip6h;
-   struct tcp_hdr *th;
-   uint16_t prev_cksum, new_cksum, ip_len, ip_paylen;
-   uint32_t tmp;
-
-   iph = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
-   th = RTE_PTR_ADD(iph, m->l3_len);
-   if ((iph->version_ihl >> 4) == 4) {
-   iph->hdr_checksum = 0;
-   iph->hdr_checksum = rte_ipv4_cksum(iph);
-   ip_len = iph->total_length;
-   ip_paylen = rte_cpu_to_be_16(rte_be_to_cpu_16(ip_len) -
-   m->l3_len);
-   } else {
-   ip6h = (struct ipv6_hdr *)iph;
-   ip_paylen = ip6h->payload_len;
+   struct ipv4_hdr *iph, iph_copy;
+   struct ipv6_hdr *ip6h = NULL, ip6h_copy;
+   struct tcp_hdr *th, th_copy;
+   size_t hdrlen = m->l2_len + m->l3_len + m->l4_len;
+   uint16_t prev_cksum, new_cksum, ip_len, ip_paylen;
+   uint32_t tmp;
+   int shared = 0;
+
+   /* mbuf is read-only, we need to copy the headers in a linear buffer */
+   if (unlikely(rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
+   shared = 1;
+
+   /* network headers are too big, there's nothing we can do */
+   if (hdrlen > hdr_sz)
+   return 0;
+
+   rte_pktmbuf_read_copy(m, 0, hdrlen, hdr);
+   iph = (struct ipv4_hdr *)(hdr + m->l2_len);
+   ip6h = (struct ipv6_hdr *)(hdr + m->l2_len);
+   th = (struct tcp_hdr *)(hdr + m->l2_len + m->l3_len);
+   } else {
+   iph = rte_pktmbuf_read(m, m->l2_len, sizeof(*iph), &iph_copy);
+   th = rte_pktmbuf_read(m, m->l2_len + m->l3_len, sizeof(*th),
+   &th_copy);
+   }
+
+   if ((iph->version_ihl >> 4) == 4) {
+   iph->hdr_checksum = 0;
+   iph->hdr_checksum = rte_ipv4_cksum(iph);
+   ip_len = iph->total_length;
+   ip_paylen = rte_cpu_to_be_16(rte_be_to_cpu_16(ip_len) -
+   m->l3_len);
+   } else {
+   if (!shared) {
+   ip6h = rte_pktmbuf_read(m, m->l2_len, sizeof(*ip6h),
+   &ip6h_copy);
}
+   ip_paylen = ip6h->payload_len;
+   }

-   /* calculate the new phdr checksum not including ip_paylen */
-   prev_cksum = th->cksum;
-   tmp = prev_cksum;
-   tmp += ip_paylen;
-   tmp = (tmp & 0x) + (tmp >> 16);
-   new_cksum = tmp;
+   /* calculate the new phdr checksum not including ip_paylen */
+   prev_cksum = th->cksum;
+   tmp = prev_cksum;
+   tmp += ip_paylen;
+   tmp = (tmp & 0x) + (tmp >> 16);
+   new_cksum = tmp;

-   /* replace

[dpdk-dev] [PATCH 4/5] mbuf: new helper to copy data from a mbuf

2016-11-24 Thread Olivier Matz
Signed-off-by: Olivier Matz 
---
 app/test/test_mbuf.c   |  7 +++
 lib/librte_mbuf/rte_mbuf.h | 32 +++-
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 5f1bc5d..73fd7df 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -451,6 +451,13 @@ testclone_testupdate_testdetach(void)
GOTO_FAIL("invalid data");
if (data != check_data)
GOTO_FAIL("data should have been copied");
+   if (rte_pktmbuf_read_copy(m2, 0, sizeof(uint32_t), check_data) < 0)
+   GOTO_FAIL("cannot copy data");
+   if (check_data[0] != MAGIC_DATA)
+   GOTO_FAIL("invalid data");
+   if (data != check_data)
+   GOTO_FAIL("data should have been copied");
+
/* free mbuf */
rte_pktmbuf_free(m);
m = NULL;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index e898d25..edae89f 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1643,7 +1643,7 @@ static inline int rte_pktmbuf_data_is_shared(const struct rte_mbuf *m,
 }

 /**
- * @internal used by rte_pktmbuf_read().
+ * @internal used by rte_pktmbuf_read() and rte_pktmbuf_read_copy().
  */
 void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
uint32_t len, void *buf);
@@ -1728,6 +1728,36 @@ static inline int rte_pktmbuf_write(const struct rte_mbuf *m,
 }

 /**
+ * Copy data from a mbuf into a linear buffer
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param off
+ *   The offset of the data in the mbuf.
+ * @param len
+ *   The amount of bytes to copy.
+ * @param buf
+ *   The buffer where data is copied, it should be at least
+ *   as large as len.
+ * @return
+ *   - (0) on success
+ *   - (-1) on error: mbuf is too small
+ */
+static inline int rte_pktmbuf_read_copy(const struct rte_mbuf *m,
+   uint32_t off, uint32_t len, void *buf)
+{
+   if (likely(off + len <= rte_pktmbuf_data_len(m))) {
+   rte_memcpy(buf, rte_pktmbuf_mtod_offset(m, char *, off), len);
+   return 0;
+   }
+
+   if (__rte_pktmbuf_read(m, off, len, buf) == NULL)
+   return -1;
+
+   return 0;
+}
+
+/**
  * Chain an mbuf to another, thereby creating a segmented packet.
  *
  * Note: The implementation will do a linear walk over the segments to find
-- 
2.8.1



[dpdk-dev] [PATCH 3/5] mbuf: new helper to write data in a mbuf chain

2016-11-24 Thread Olivier Matz
Introduce a new helper to write data in a chain of mbufs,
spreading it across the segments.

Signed-off-by: Olivier Matz 
---
 app/test/test_mbuf.c | 21 +++
 lib/librte_mbuf/rte_mbuf.c   | 44 +++
 lib/librte_mbuf/rte_mbuf.h   | 50 
 lib/librte_mbuf/rte_mbuf_version.map |  6 +
 4 files changed, 121 insertions(+)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 7656a4d..5f1bc5d 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -335,6 +335,10 @@ testclone_testupdate_testdetach(void)
struct rte_mbuf *clone2 = NULL;
struct rte_mbuf *m2 = NULL;
unaligned_uint32_t *data;
+   uint32_t magic = MAGIC_DATA;
+   uint32_t check_data[2];
+
+   memset(check_data, 0, sizeof(check_data));

/* alloc a mbuf */
m = rte_pktmbuf_alloc(pktmbuf_pool);
@@ -421,6 +425,8 @@ testclone_testupdate_testdetach(void)
if (m2 == NULL)
GOTO_FAIL("cannot allocate m2");
rte_pktmbuf_append(m2, sizeof(uint32_t));
+   if (rte_pktmbuf_write(m2, 0, sizeof(uint32_t), &magic) < 0)
+   GOTO_FAIL("cannot write data in m2");
rte_pktmbuf_chain(m2, clone);
clone = NULL;

@@ -430,6 +436,21 @@ testclone_testupdate_testdetach(void)
rte_pktmbuf_pkt_len(m2) - sizeof(uint32_t)) == 0)
GOTO_FAIL("m2 data should be marked as shared");

+   /* check data content */
+   data = rte_pktmbuf_read(m2, 0, sizeof(uint32_t), check_data);
+   if (data == NULL)
+   GOTO_FAIL("cannot read data");
+   if (*data != MAGIC_DATA)
+   GOTO_FAIL("invalid data");
+   if (data == check_data)
+   GOTO_FAIL("data should not have been copied");
+   data = rte_pktmbuf_read(m2, 0, sizeof(uint32_t) * 2, check_data);
+   if (data == NULL)
+   GOTO_FAIL("cannot read data");
+   if (data[0] != MAGIC_DATA || data[1] != MAGIC_DATA)
+   GOTO_FAIL("invalid data");
+   if (data != check_data)
+   GOTO_FAIL("data should have been copied");
/* free mbuf */
rte_pktmbuf_free(m);
m = NULL;
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index b31958e..ed56193 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -298,6 +298,50 @@ void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
return buf;
 }

+/* write len data bytes in a mbuf at specified offset (internal) */
+int
+__rte_pktmbuf_write(const struct rte_mbuf *m, uint32_t off,
+   uint32_t len, const void *buf)
+{
+   const struct rte_mbuf *seg = m;
+   uint32_t buf_off = 0, copy_len;
+   char *dst;
+
+   if (off + len > rte_pktmbuf_pkt_len(m))
+   return -1;
+
+   while (off >= rte_pktmbuf_data_len(seg)) {
+   off -= rte_pktmbuf_data_len(seg);
+   seg = seg->next;
+   }
+
+   dst = rte_pktmbuf_mtod_offset(seg, char *, off);
+   if (buf == dst)
+   return 0;
+
+   if (off + len <= rte_pktmbuf_data_len(seg)) {
+   RTE_ASSERT(!rte_pktmbuf_is_shared(seg));
+   rte_memcpy(dst, buf, len);
+   return 0;
+   }
+
+   /* copy data in several segments */
+   while (len > 0) {
+   RTE_ASSERT(!rte_pktmbuf_is_shared(seg));
+   copy_len = rte_pktmbuf_data_len(seg) - off;
+   if (copy_len > len)
+   copy_len = len;
+   dst = rte_pktmbuf_mtod_offset(seg, char *, off);
+   rte_memcpy(dst, (const char *)buf + buf_off, copy_len);
+   off = 0;
+   buf_off += copy_len;
+   len -= copy_len;
+   seg = seg->next;
+   }
+
+   return 0;
+}
+
 /*
  * Get the name of a RX offload flag. Must be kept synchronized with flag
  * definitions in rte_mbuf.h.
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index cd77a56..e898d25 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1678,6 +1678,56 @@ static inline void *rte_pktmbuf_read(const struct rte_mbuf *m,
 }

 /**
+ * @internal used by rte_pktmbuf_write().
+ */
+int __rte_pktmbuf_write(const struct rte_mbuf *m, uint32_t off,
+   uint32_t len, const void *buf);
+
+/**
+ * Write len data bytes in a mbuf at specified offset.
+ *
+ * If the mbuf is contiguous between off and off+len, rte_memcpy() is
+ * called. Otherwise, the data is split across the segments.
+ *
+ * The caller must ensure that all destination segments are writable
+ * (not shared).
+ *
+ * If the destination pointer in the mbuf is the same as the source
+ * buffer, the function does nothing and succeeds.
+ *
+ * If the mbuf is too small, the function fails.
+

[dpdk-dev] [PATCH 2/5] mbuf: new helper to check if a mbuf is shared

2016-11-24 Thread Olivier Matz
Introduce 2 new helpers rte_pktmbuf_seg_is_shared() and
rte_pktmbuf_data_is_shared() to check if the packet data inside
a mbuf is shared (and shall not be modified).

To avoid a "discards const qualifier" error, add a const to the argument
of rte_mbuf_from_indirect().

Signed-off-by: Olivier Matz 
---
 app/test/test_mbuf.c   | 34 +++---
 lib/librte_mbuf/rte_mbuf.h | 71 +-
 2 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index c0823ea..7656a4d 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -333,6 +333,7 @@ testclone_testupdate_testdetach(void)
struct rte_mbuf *m = NULL;
struct rte_mbuf *clone = NULL;
struct rte_mbuf *clone2 = NULL;
+   struct rte_mbuf *m2 = NULL;
unaligned_uint32_t *data;

/* alloc a mbuf */
@@ -384,6 +385,11 @@ testclone_testupdate_testdetach(void)
if (*data != MAGIC_DATA)
GOTO_FAIL("invalid data in clone->next\n");

+   if (rte_pktmbuf_seg_is_shared(m) == 0)
+   GOTO_FAIL("m should be marked as shared\n");
+   if (rte_pktmbuf_seg_is_shared(clone) == 0)
+   GOTO_FAIL("clone should be marked as shared\n");
+
if (rte_mbuf_refcnt_read(m) != 2)
GOTO_FAIL("invalid refcnt in m\n");

@@ -410,14 +416,32 @@ testclone_testupdate_testdetach(void)
if (rte_mbuf_refcnt_read(m->next) != 3)
GOTO_FAIL("invalid refcnt in m->next\n");

+   /* prepend data to one of the clone */
+   m2 = rte_pktmbuf_alloc(pktmbuf_pool);
+   if (m2 == NULL)
+   GOTO_FAIL("cannot allocate m2");
+   rte_pktmbuf_append(m2, sizeof(uint32_t));
+   rte_pktmbuf_chain(m2, clone);
+   clone = NULL;
+
+   if (rte_pktmbuf_data_is_shared(m2, 0, sizeof(uint32_t)))
+   GOTO_FAIL("m2 headers should not be marked as shared");
+   if (rte_pktmbuf_data_is_shared(m2, sizeof(uint32_t),
+   rte_pktmbuf_pkt_len(m2) - sizeof(uint32_t)) == 0)
+   GOTO_FAIL("m2 data should be marked as shared");
+
/* free mbuf */
rte_pktmbuf_free(m);
-   rte_pktmbuf_free(clone);
-   rte_pktmbuf_free(clone2);
-
m = NULL;
-   clone = NULL;
+   rte_pktmbuf_free(m2);
+   m2 = NULL;
+
+   if (rte_pktmbuf_seg_is_shared(clone2))
+   GOTO_FAIL("clone2 should not be marked as shared\n");
+
+   rte_pktmbuf_free(clone2);
clone2 = NULL;
+
printf("%s ok\n", __func__);
return 0;

@@ -428,6 +452,8 @@ testclone_testupdate_testdetach(void)
rte_pktmbuf_free(clone);
if (clone2)
rte_pktmbuf_free(clone2);
+   if (m2)
+   rte_pktmbuf_free(m2);
return -1;
 }

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 14956f6..cd77a56 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -576,7 +576,7 @@ rte_mbuf_data_dma_addr_default(const struct rte_mbuf *mb)
  *   The address of the direct mbuf corresponding to buffer_addr.
  */
 static inline struct rte_mbuf *
-rte_mbuf_from_indirect(struct rte_mbuf *mi)
+rte_mbuf_from_indirect(const struct rte_mbuf *mi)
 {
	return (struct rte_mbuf *)RTE_PTR_SUB(mi->buf_addr, sizeof(*mi) + mi->priv_size);
 }
@@ -1574,6 +1574,75 @@ static inline int rte_pktmbuf_is_contiguous(const struct rte_mbuf *m)
 }

 /**
+ * Test if a mbuf segment is shared
+ *
+ * Return true if the data embedded in this segment is shared by several
+ * mbufs. In this case, the mbuf data should be considered as read-only.
+ *
+ * @param m
+ *   The packet mbuf.
+ * @return
+ *   - (1), the mbuf segment is shared (read-only)
+ *   - (0), the mbuf segment is not shared (writable)
+ */
+static inline int rte_pktmbuf_seg_is_shared(const struct rte_mbuf *m)
+{
+   if (rte_mbuf_refcnt_read(m) > 1)
+   return 1;
+
+   if (RTE_MBUF_INDIRECT(m) &&
+   rte_mbuf_refcnt_read(rte_mbuf_from_indirect(m)) > 1)
+   return 1;
+
+   return 0;
+}
+
+/**
+ * Test if some data in an mbuf chain is shared
+ *
+ * Return true if the specified data area in the mbuf chain is shared by
+ * several mbufs. In this case, this data should be considered as
+ * read-only.
+ *
+ * If the area described by off and len exceeds the bounds of the mbuf
+ * chain (off + len > rte_pktmbuf_pkt_len()), the exceeding part of the
+ * area is ignored.
+ *
+ * @param m
+ *   The packet mbuf.
+ * @return
+ *   - (1), the mbuf data is shared (read-only)
+ *   - (0), the mbuf data is not shared (writable)
+ */
+static inline int rte_pktmbuf_data_is_shared(const struct rte_mbuf *m,
+   uint32_t off, uint32_t len)
+{
+   const struct rte_mbuf *seg = m;
+
+ 

[dpdk-dev] [PATCH 1/5] mbuf: remove const attribute in mbuf read function

2016-11-24 Thread Olivier Matz
There is no good reason to have this const attribute: rte_pktmbuf_read()
returns a pointer which is either in a private buffer, or in the mbuf.

In the first case, it is clearly not const. In the second case, it is up
to the user to check that the mbuf is not shared and that data can be
modified.

Signed-off-by: Olivier Matz 
---
 lib/librte_mbuf/rte_mbuf.c | 2 +-
 lib/librte_mbuf/rte_mbuf.h | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 63f43c8..b31958e 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -265,7 +265,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 }

 /* read len data bytes in a mbuf at specified offset (internal) */
-const void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
+void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
uint32_t len, void *buf)
 {
const struct rte_mbuf *seg = m;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index ead7c6e..14956f6 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1576,7 +1576,7 @@ static inline int rte_pktmbuf_is_contiguous(const struct rte_mbuf *m)
 /**
  * @internal used by rte_pktmbuf_read().
  */
-const void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
+void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
uint32_t len, void *buf);

 /**
@@ -1599,7 +1599,7 @@ const void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t off,
  *   The pointer to the data, either in the mbuf if it is contiguous,
  *   or in the user buffer. If mbuf is too small, NULL is returned.
  */
-static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
+static inline void *rte_pktmbuf_read(const struct rte_mbuf *m,
uint32_t off, uint32_t len, void *buf)
 {
if (likely(off + len <= rte_pktmbuf_data_len(m)))
-- 
2.8.1



[dpdk-dev] [PATCH 0/5] virtio/mbuf: fix virtio tso with shared mbufs

2016-11-24 Thread Olivier Matz
This patchset fixes the transmission of cloned mbufs when using
virtio + TSO. The problem is we need to fix the L4 checksum in the
packet, but it should be considered as read-only, as pointed-out
by Stephen here:
http://dpdk.org/ml/archives/dev/2016-October/048873.html

Unfortunatly the patchset is quite big, but I did not manage to
find a shorter solution. The first patches add some mbuf helpers
that are used in virtio in the last patch.

This last patch adds a zone for each tx ring entry where headers
can be copied, patched, and referenced by virtio descriptors in
case the mbuf is read-only. If it's not the case, the mbuf is
modified as before.
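
(Roughly, the idea sketched: the tso_hdr zone and its wiring are assumptions
made here for illustration; only virtio_tso_fix_cksum() and
rte_pktmbuf_data_is_shared() appear in the patches quoted below:)

	/* hypothetical per-entry header zone in the tx ring */
	char *hdr = txr[idx].tso_hdr;

	if (unlikely(rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
		/* read-only mbuf: copy the headers into the ring-private
		 * zone (assuming they are contiguous here), patch the
		 * checksum on the copy, and let the virtio descriptors
		 * reference the copy instead of the mbuf */
		memcpy(hdr, rte_pktmbuf_mtod(m, char *), hdrlen);
		virtio_tso_fix_cksum(m, hdr, hdrlen);
	} else {
		/* writable mbuf: patch the headers in place, as before */
		virtio_tso_fix_cksum(m, rte_pktmbuf_mtod(m, char *), hdrlen);
	}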

I tested with the same test plan as the one described in
http://dpdk.org/ml/archives/dev/2016-October/048092.html
(only the TSO test case).

I also replayed the test with the following patches to validate
the code path for:

- segmented packets (it forces a local copy in virtio_tso_fix_cksum)

--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -279,7 +279,7 @@ void *__rte_pktmbuf_read(const struct rte_mbuf *m, uint32_t 
off,
seg = seg->next;
}

-   if (off + len <= rte_pktmbuf_data_len(seg))
+   if (0 && off + len <= rte_pktmbuf_data_len(seg))
return rte_pktmbuf_mtod_offset(seg, char *, off);

/* rare case: header is split among several segments */
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 9dc6f10..5a4312a 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1671,7 +1671,7 @@ void *__rte_pktmbuf_read(const struct rte_mbuf *m, 
uint32_t off,
 static inline void *rte_pktmbuf_read(const struct rte_mbuf *m,
uint32_t off, uint32_t len, void *buf)
 {
-   if (likely(off + len <= rte_pktmbuf_data_len(m)))
+   if (likely(0 && off + len <= rte_pktmbuf_data_len(m)))
return rte_pktmbuf_mtod_offset(m, char *, off);
else
return __rte_pktmbuf_read(m, off, len, buf);

- and for shared mbuf (force to use the buffer in virtio tx ring)

--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -225,7 +225,7 @@ virtio_tso_fix_cksum(struct rte_mbuf *m, char *hdr, size_t 
hdr_sz)
int shared = 0;

/* mbuf is read-only, we need to copy the headers in a linear buffer */
-   if (unlikely(rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
+   if (unlikely(1 || rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
shared = 1;

/* network headers are too big, there's nothing we can do */


Olivier Matz (5):
  mbuf: remove const attribute in mbuf read function
  mbuf: new helper to check if a mbuf is shared
  mbuf: new helper to write data in a mbuf chain
  mbuf: new helper to copy data from a mbuf
  net/virtio: fix TSO when mbuf is shared

 app/test/test_mbuf.c |  62 +-
 drivers/net/virtio/virtio_rxtx.c | 119 ++
 drivers/net/virtio/virtqueue.h   |   2 +
 lib/librte_mbuf/rte_mbuf.c   |  46 +-
 lib/librte_mbuf/rte_mbuf.h   | 157 ++-
 lib/librte_mbuf/rte_mbuf_version.map |   6 ++
 6 files changed, 347 insertions(+), 45 deletions(-)

-- 
2.8.1



[dpdk-dev] [PATCH v2] log: do not drop debug logs at compile time

2016-11-23 Thread Olivier Matz
Today, all logs whose level is lower than INFO are dropped at
compile-time. This prevents from enabling debug logs at runtime using
--log-level=8.

The rationale was to remove debug logs from the data path at
compile-time, avoiding a test at run-time.

This patch changes the behavior of RTE_LOG() to avoid the compile-time
optimization, and introduces the RTE_LOG_DP() macro that has the same
behavior as the previous RTE_LOG(), for the rare cases where debug
logs are in the data path.

So it is now possible to enable debug logs at run-time by just
specifying --log-level=8. Some drivers still have special compile-time
options to enable more debug logs. Maintainers may consider
removing or reducing them.
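
(For reference, the data-path variant keeps the old compile-time gate, now
driven by the new CONFIG_RTE_LOG_DP_LEVEL knob -- a sketch reconstructed
from the description, since the rte_log.h hunk is not shown in this
excerpt:)

/* dropped at compile time if the level is above RTE_LOG_DP_LEVEL */
#define RTE_LOG_DP(l, t, ...)					\
	(void)((RTE_LOG_ ## l <= RTE_LOG_DP_LEVEL) ?		\
	rte_log(RTE_LOG_ ## l,					\
		RTE_LOGTYPE_ ## t, # t ": " __VA_ARGS__) :	\
	0)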

Signed-off-by: Olivier Matz 
---
v1 -> v2:
- fix test in RTE_LOG_DP() as pointed-out by David

 config/common_base  |  1 +
 doc/guides/faq/faq.rst  |  2 +-
 drivers/net/bnxt/bnxt_txr.c |  2 +-
 drivers/net/nfp/nfp_net.c   |  8 +++---
 examples/distributor/main.c |  4 +--
 examples/ipsec-secgw/esp.c  |  2 +-
 examples/ipsec-secgw/ipsec.c|  4 +--
 examples/packet_ordering/main.c |  6 ++--
 examples/quota_watermark/qw/main.c  |  2 +-
 examples/tep_termination/main.c |  4 +--
 examples/vhost/main.c   | 14 -
 examples/vhost_xen/main.c   | 20 ++---
 lib/librte_eal/common/include/rte_log.h | 51 +
 13 files changed, 68 insertions(+), 52 deletions(-)

diff --git a/config/common_base b/config/common_base
index 4bff83a..652a839 100644
--- a/config/common_base
+++ b/config/common_base
@@ -89,6 +89,7 @@ CONFIG_RTE_MAX_MEMSEG=256
 CONFIG_RTE_MAX_MEMZONE=2560
 CONFIG_RTE_MAX_TAILQ=32
 CONFIG_RTE_LOG_LEVEL=RTE_LOG_INFO
+CONFIG_RTE_LOG_DP_LEVEL=RTE_LOG_INFO
 CONFIG_RTE_LOG_HISTORY=256
 CONFIG_RTE_LIBEAL_USE_HPET=n
 CONFIG_RTE_EAL_ALLOW_INV_SOCKET_ID=n
diff --git a/doc/guides/faq/faq.rst b/doc/guides/faq/faq.rst
index 8d1ea6c..0adc549 100644
--- a/doc/guides/faq/faq.rst
+++ b/doc/guides/faq/faq.rst
@@ -101,7 +101,7 @@ Yes, the option ``--log-level=`` accepts one of these 
numbers:
 #define RTE_LOG_INFO 7U /* Informational. */
 #define RTE_LOG_DEBUG 8U/* Debug-level messages. */

-It is also possible to change the maximum (and default level) at compile time
+It is also possible to change the default level at compile time
 with ``CONFIG_RTE_LOG_LEVEL``.


diff --git a/drivers/net/bnxt/bnxt_txr.c b/drivers/net/bnxt/bnxt_txr.c
index 8bf8fee..0d15bb1 100644
--- a/drivers/net/bnxt/bnxt_txr.c
+++ b/drivers/net/bnxt/bnxt_txr.c
@@ -298,7 +298,7 @@ static int bnxt_handle_tx_cp(struct bnxt_tx_queue *txq)
if (CMP_TYPE(txcmp) == TX_CMPL_TYPE_TX_L2)
nb_tx_pkts++;
else
-   RTE_LOG(DEBUG, PMD,
+   RTE_LOG_DP(DEBUG, PMD,
"Unhandled CMP type %02x\n",
CMP_TYPE(txcmp));
raw_cons = NEXT_RAW_CMP(raw_cons);
diff --git a/drivers/net/nfp/nfp_net.c b/drivers/net/nfp/nfp_net.c
index 707be8b..e315dd8 100644
--- a/drivers/net/nfp/nfp_net.c
+++ b/drivers/net/nfp/nfp_net.c
@@ -1707,7 +1707,7 @@ nfp_net_recv_pkts(void *rx_queue, struct rte_mbuf 
**rx_pkts, uint16_t nb_pkts)
 * DPDK just checks the queue is lower than max queues
 * enabled. But the queue needs to be configured
 */
-   RTE_LOG(ERR, PMD, "RX Bad queue\n");
+   RTE_LOG_DP(ERR, PMD, "RX Bad queue\n");
return -EINVAL;
}

@@ -1720,7 +1720,7 @@ nfp_net_recv_pkts(void *rx_queue, struct rte_mbuf 
**rx_pkts, uint16_t nb_pkts)

rxb = &rxq->rxbufs[idx];
if (unlikely(rxb == NULL)) {
-   RTE_LOG(ERR, PMD, "rxb does not exist!\n");
+   RTE_LOG_DP(ERR, PMD, "rxb does not exist!\n");
break;
}

@@ -1740,7 +1740,7 @@ nfp_net_recv_pkts(void *rx_queue, struct rte_mbuf 
**rx_pkts, uint16_t nb_pkts)
 */
new_mb = rte_pktmbuf_alloc(rxq->mem_pool);
if (unlikely(new_mb == NULL)) {
-   RTE_LOG(DEBUG, PMD, "RX mbuf alloc failed port_id=%u "
+   RTE_LOG_DP(DEBUG, PMD, "RX mbuf alloc failed port_id=%u 
"
"queue_id=%u\n", (unsigned)rxq->port_id,
(unsigned)rxq->qidx);
nfp_net_mbuf_alloc_failed(rxq);
@@ -1771,7 +1771,7 @@ nfp_net_recv_pkts(void *rx_queue, struct rte_mbuf 
**rx_pkts, uint16_t nb_pkts)
 * responsibility of avoiding it. But we have
 * to give

[dpdk-dev] [PATCH 0/2] l2fwd/l3fwd: rework long options parsing

2016-11-22 Thread Olivier Matz
Hi,

On 11/22/2016 02:52 PM, Olivier Matz wrote:
> These 2 patches were part of this RFC, which will not be integrated:
> http://dpdk.org/ml/archives/dev/2016-September/046974.html
> 
> It does not bring any functional change, it just reworks the way long
> options are parsed in l2fwd and l3fwd to avoid unneeded strcmp() calls
> and to ease the addition of a new long option in the future.
> 
> I send them in case maintainers think it is better this way, but I have
> no real need.
> 
> Olivier Matz (2):
>   l3fwd: rework long options parsing
>   l2fwd: rework long options parsing
> 
>  examples/l2fwd/main.c |  30 +++--
>  examples/l3fwd/main.c | 169 
> ++
>  2 files changed, 111 insertions(+), 88 deletions(-)
> 

Sorry, I missed some checkpatch issues. I'll fix them in v2.
I'm waiting a bit for other comments, in case of.


Olivier


[dpdk-dev] [PATCH 1/2] l3fwd: rework long options parsing

2016-11-22 Thread Olivier Matz
Avoid the use of several strncmp() calls since getopt is able to
map a long option to an id, which can be matched in the
same switch/case as short options.
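
(With this scheme getopt_long() returns the enum value directly, so a long
option is handled like a short one; e.g. the config case, recomposed from
the strncmp() block removed by the diff below:)

	case CMD_LINE_OPT_CONFIG_NUM:
		ret = parse_config(optarg);
		if (ret) {
			printf("%s\n", str5);
			print_usage(prgname);
			return -1;
		}
		break;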

Signed-off-by: Olivier Matz 
---
 examples/l3fwd/main.c | 169 ++
 1 file changed, 87 insertions(+), 82 deletions(-)

diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c
index 7223e77..f84ef50 100644
--- a/examples/l3fwd/main.c
+++ b/examples/l3fwd/main.c
@@ -474,6 +474,13 @@ parse_eth_dest(const char *optarg)
 #define MAX_JUMBO_PKT_LEN  9600
 #define MEMPOOL_CACHE_SIZE 256

+static const char short_options[] =
+   "p:"  /* portmask */
+   "P"   /* promiscuous */
+   "L"   /* enable long prefix match */
+   "E"   /* enable exact match */
+   ;
+
 #define CMD_LINE_OPT_CONFIG "config"
 #define CMD_LINE_OPT_ETH_DEST "eth-dest"
 #define CMD_LINE_OPT_NO_NUMA "no-numa"
@@ -481,6 +488,31 @@ parse_eth_dest(const char *optarg)
 #define CMD_LINE_OPT_ENABLE_JUMBO "enable-jumbo"
 #define CMD_LINE_OPT_HASH_ENTRY_NUM "hash-entry-num"
 #define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype"
+enum {
+   /* long options mapped to a short option */
+
+   /* first long only option value must be >= 256, so that we won't
+* conflict with short options */
+   CMD_LINE_OPT_MIN_NUM = 256,
+   CMD_LINE_OPT_CONFIG_NUM,
+   CMD_LINE_OPT_ETH_DEST_NUM,
+   CMD_LINE_OPT_NO_NUMA_NUM,
+   CMD_LINE_OPT_IPV6_NUM,
+   CMD_LINE_OPT_ENABLE_JUMBO_NUM,
+   CMD_LINE_OPT_HASH_ENTRY_NUM_NUM,
+   CMD_LINE_OPT_PARSE_PTYPE_NUM,
+};
+
+static const struct option lgopts[] = {
+   {CMD_LINE_OPT_CONFIG, 1, 0, CMD_LINE_OPT_CONFIG_NUM},
+   {CMD_LINE_OPT_ETH_DEST, 1, 0, CMD_LINE_OPT_ETH_DEST_NUM},
+   {CMD_LINE_OPT_NO_NUMA, 0, 0, CMD_LINE_OPT_NO_NUMA_NUM},
+   {CMD_LINE_OPT_IPV6, 0, 0, CMD_LINE_OPT_IPV6_NUM},
+   {CMD_LINE_OPT_ENABLE_JUMBO, 0, 0, CMD_LINE_OPT_ENABLE_JUMBO_NUM},
+   {CMD_LINE_OPT_HASH_ENTRY_NUM, 1, 0, CMD_LINE_OPT_HASH_ENTRY_NUM_NUM},
+   {CMD_LINE_OPT_PARSE_PTYPE, 0, 0, CMD_LINE_OPT_PARSE_PTYPE_NUM},
+   {NULL, 0, 0, 0}
+};

 /*
  * This expression is used to calculate the number of mbufs needed
@@ -504,16 +536,6 @@ parse_args(int argc, char **argv)
char **argvopt;
int option_index;
char *prgname = argv[0];
-   static struct option lgopts[] = {
-   {CMD_LINE_OPT_CONFIG, 1, 0, 0},
-   {CMD_LINE_OPT_ETH_DEST, 1, 0, 0},
-   {CMD_LINE_OPT_NO_NUMA, 0, 0, 0},
-   {CMD_LINE_OPT_IPV6, 0, 0, 0},
-   {CMD_LINE_OPT_ENABLE_JUMBO, 0, 0, 0},
-   {CMD_LINE_OPT_HASH_ENTRY_NUM, 1, 0, 0},
-   {CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
-   {NULL, 0, 0, 0}
-   };

argvopt = argv;

@@ -534,7 +556,7 @@ parse_args(int argc, char **argv)
"L3FWD: LPM and EM are mutually exclusive, select only one";
const char *str13 = "L3FWD: LPM or EM none selected, default LPM on";

-   while ((opt = getopt_long(argc, argvopt, "p:PLE",
+   while ((opt = getopt_long(argc, argvopt, short_options,
lgopts, &option_index)) != EOF) {

switch (opt) {
@@ -547,6 +569,7 @@ parse_args(int argc, char **argv)
return -1;
}
break;
+
case 'P':
printf("%s\n", str2);
promiscuous_on = 1;
@@ -563,89 +586,71 @@ parse_args(int argc, char **argv)
break;

/* long options */
-   case 0:
-   if (!strncmp(lgopts[option_index].name,
-   CMD_LINE_OPT_CONFIG,
-   sizeof(CMD_LINE_OPT_CONFIG))) {
-
-   ret = parse_config(optarg);
-   if (ret) {
-   printf("%s\n", str5);
-   print_usage(prgname);
-   return -1;
-   }
-   }
-
-   if (!strncmp(lgopts[option_index].name,
-   CMD_LINE_OPT_ETH_DEST,
-   sizeof(CMD_LINE_OPT_ETH_DEST))) {
-   parse_eth_dest(optarg);
-   }
-
-   if (!strncmp(lgopts[option_index].name,
-   CMD_LINE_OPT_NO_NUMA,
-   sizeof(CMD_LINE_OPT_NO_NUMA))) {
-   printf("%s\n", str6);
-   numa_on = 0;
+   case 

[dpdk-dev] [PATCH 0/2] l2fwd/l3fwd: rework long options parsing

2016-11-22 Thread Olivier Matz
These 2 patches were part of this RFC, which will not be integrated:
http://dpdk.org/ml/archives/dev/2016-September/046974.html

It does not bring any functional change, it just reworks the way long
options are parsed in l2fwd and l3fwd to avoid unneeded strcmp() calls
and to ease the addition of a new long option in the future.

I send them in case maintainers think it is better this way, but I have
no real need.

Olivier Matz (2):
  l3fwd: rework long options parsing
  l2fwd: rework long options parsing

 examples/l2fwd/main.c |  30 +++--
 examples/l3fwd/main.c | 169 ++
 2 files changed, 111 insertions(+), 88 deletions(-)

-- 
2.8.1



[dpdk-dev] [PATCH v2] drivers: advertise kmod dependencies in pmdinfo

2016-11-22 Thread Olivier Matz
Hi Adrien,

On 11/22/2016 11:27 AM, Adrien Mazarguil wrote:
> Hi Olivier,
> 
> Neither mlx4 nor mlx5 depend on igb/uio/vfio modules, please see below.
> 
> On Tue, Nov 22, 2016 at 10:50:57AM +0100, Olivier Matz wrote:
>> Add a new macro RTE_PMD_REGISTER_KMOD_DEP() that allows a driver to
>> declare the list of kernel modules required to run properly.
>>
>> Today, most PCI drivers require uio/vfio.
>>
>> Signed-off-by: Olivier Matz 
>> Acked-by: Fiona Trahe 
>> ---
> [...]
>> diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
>> index da61a85..a0065bf 100644
>> --- a/drivers/net/mlx4/mlx4.c
>> +++ b/drivers/net/mlx4/mlx4.c
>> @@ -5937,3 +5937,4 @@ rte_mlx4_pmd_init(void)
>>  
>>  RTE_PMD_EXPORT_NAME(net_mlx4, __COUNTER__);
>>  RTE_PMD_REGISTER_PCI_TABLE(net_mlx4, mlx4_pci_id_map);
>> +RTE_PMD_REGISTER_KMOD_DEP(net_mlx4, "* igb_uio | uio_pci_generic | vfio");
> 
> RTE_PMD_REGISTER_KMOD_DEP(net_mlx4, "* ib_uverbs & mlx4_en & mlx4_core & 
> mlx4_ib");
> 
>> diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
>> index 90cc35e..b0343f3 100644
>> --- a/drivers/net/mlx5/mlx5.c
>> +++ b/drivers/net/mlx5/mlx5.c
>> @@ -759,3 +759,4 @@ rte_mlx5_pmd_init(void)
>>
>>  RTE_PMD_EXPORT_NAME(net_mlx5, __COUNTER__);
>>  RTE_PMD_REGISTER_PCI_TABLE(net_mlx5, mlx5_pci_id_map);
>> +RTE_PMD_REGISTER_KMOD_DEP(net_mlx5, "* igb_uio | uio_pci_generic | vfio");
> 
> RTE_PMD_REGISTER_KMOD_DEP(net_mlx5, "* ib_uverbs & mlx5_core & mlx5_ib");
> 

Thank you for reviewing. I messed up in the rebase, the v1 was
closer to what you suggest, sorry. I'll send an update.

Olivier


[dpdk-dev] [PATCH] mempool: fix API documentation

2016-11-22 Thread Olivier Matz
A previous commit changed the local_cache table into a
pointer, reducing the size of the rte_mempool structure.

Fix the API comment of rte_mempool_create() related to
this modification.

Fixes: 213af31e0960 ("mempool: reduce structure size if no cache needed")

Signed-off-by: Olivier Matz 
---
 lib/librte_mempool/rte_mempool.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 440f3b1..956ce04 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -610,9 +610,7 @@ typedef void (rte_mempool_ctor_t)(struct rte_mempool *, 
void *);
  *   never be used. The access to the per-lcore table is of course
  *   faster than the multi-producer/consumer pool. The cache can be
  *   disabled if the cache_size argument is set to 0; it can be useful to
- *   avoid losing objects in cache. Note that even if not used, the
- *   memory space for cache is always reserved in a mempool structure,
- *   except if CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE is set to 0.
+ *   avoid losing objects in cache.
  * @param private_data_size
  *   The size of the private data appended after the mempool
  *   structure. This is useful for storing some private data after the
-- 
2.8.1



[dpdk-dev] [PATCH v2] drivers: advertise kmod dependencies in pmdinfo

2016-11-22 Thread Olivier Matz
Add a new macro RTE_PMD_REGISTER_KMOD_DEP() that allows a driver to
declare the list of kernel modules required to run properly.

Today, most PCI drivers require uio/vfio.

Signed-off-by: Olivier Matz 
Acked-by: Fiona Trahe 
---

v1 -> v2:   
 
- do not advertise uio_pci_generic for vf drivers
- rebase on top of head: use new driver names and prefix
  macro with RTE_   


rfc -> v1:
- the kmod information can be per-device using a modalias-like
  pattern
- change syntax to use '&' and '|' instead of ',' and ':'
- remove useless prerequisites in kmod lis: no need to
  specify both uio and uio_pci_generic, only the latter is
  required
- update kmod list in szedata2 driver
- remove kmod list in qat driver: it requires more than just loading
  a kmod, which is described in documentation
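
(The rte_dev.h hunk itself is not visible in this excerpt; conceptually the
macro only needs to embed a tagged string in the binary for pmdinfogen to
extract, along the lines of this sketch:)

#define RTE_PMD_REGISTER_KMOD_DEP(name, str) \
static const char DRV_EXP_TAG(name, kmod_dep_export)[] \
__attribute__((used)) = str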


 buildtools/pmdinfogen/pmdinfogen.c  |  1 +
 buildtools/pmdinfogen/pmdinfogen.h  |  1 +
 drivers/net/bnx2x/bnx2x_ethdev.c|  2 ++
 drivers/net/bnxt/bnxt_ethdev.c  |  1 +
 drivers/net/cxgbe/cxgbe_ethdev.c|  1 +
 drivers/net/e1000/em_ethdev.c   |  1 +
 drivers/net/e1000/igb_ethdev.c  |  2 ++
 drivers/net/ena/ena_ethdev.c|  1 +
 drivers/net/enic/enic_ethdev.c  |  1 +
 drivers/net/fm10k/fm10k_ethdev.c|  1 +
 drivers/net/i40e/i40e_ethdev.c  |  1 +
 drivers/net/i40e/i40e_ethdev_vf.c   |  1 +
 drivers/net/ixgbe/ixgbe_ethdev.c|  2 ++
 drivers/net/mlx4/mlx4.c |  1 +
 drivers/net/mlx5/mlx5.c |  1 +
 drivers/net/nfp/nfp_net.c   |  1 +
 drivers/net/qede/qede_ethdev.c  |  2 ++
 drivers/net/szedata2/rte_eth_szedata2.c |  2 ++
 drivers/net/thunderx/nicvf_ethdev.c |  1 +
 drivers/net/virtio/virtio_ethdev.c  |  1 +
 drivers/net/vmxnet3/vmxnet3_ethdev.c|  1 +
 lib/librte_eal/common/include/rte_dev.h | 25 +
 tools/dpdk-pmdinfo.py   |  5 -
 23 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/buildtools/pmdinfogen/pmdinfogen.c 
b/buildtools/pmdinfogen/pmdinfogen.c
index 59ab956..5129c57 100644
--- a/buildtools/pmdinfogen/pmdinfogen.c
+++ b/buildtools/pmdinfogen/pmdinfogen.c
@@ -269,6 +269,7 @@ struct opt_tag {

 static const struct opt_tag opt_tags[] = {
{"_param_string_export", "params"},
+   {"_kmod_dep_export", "kmod"},
 };

 static int complete_pmd_entry(struct elf_info *info, struct pmd_driver *drv)
diff --git a/buildtools/pmdinfogen/pmdinfogen.h 
b/buildtools/pmdinfogen/pmdinfogen.h
index 1da2966..2fab2aa 100644
--- a/buildtools/pmdinfogen/pmdinfogen.h
+++ b/buildtools/pmdinfogen/pmdinfogen.h
@@ -85,6 +85,7 @@ else \

 enum opt_params {
PMD_PARAM_STRING = 0,
+   PMD_KMOD_DEP,
PMD_OPT_MAX
 };

diff --git a/drivers/net/bnx2x/bnx2x_ethdev.c b/drivers/net/bnx2x/bnx2x_ethdev.c
index 0eae433..0f1e4a2 100644
--- a/drivers/net/bnx2x/bnx2x_ethdev.c
+++ b/drivers/net/bnx2x/bnx2x_ethdev.c
@@ -643,5 +643,7 @@ static struct eth_driver rte_bnx2xvf_pmd = {

 RTE_PMD_REGISTER_PCI(net_bnx2x, rte_bnx2x_pmd.pci_drv);
 RTE_PMD_REGISTER_PCI_TABLE(net_bnx2x, pci_id_bnx2x_map);
+RTE_PMD_REGISTER_KMOD_DEP(net_bnx2x, "* igb_uio | uio_pci_generic | vfio");
 RTE_PMD_REGISTER_PCI(net_bnx2xvf, rte_bnx2xvf_pmd.pci_drv);
 RTE_PMD_REGISTER_PCI_TABLE(net_bnx2xvf, pci_id_bnx2xvf_map);
+RTE_PMD_REGISTER_KMOD_DEP(net_bnx2xvf, "* igb_uio | vfio");
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index 035fe07..a24e153 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -1173,3 +1173,4 @@ static struct eth_driver bnxt_rte_pmd = {

 RTE_PMD_REGISTER_PCI(net_bnxt, bnxt_rte_pmd.pci_drv);
 RTE_PMD_REGISTER_PCI_TABLE(net_bnxt, bnxt_pci_id_map);
+RTE_PMD_REGISTER_KMOD_DEP(net_bnxt, "* igb_uio | uio_pci_generic | vfio");
diff --git a/drivers/net/cxgbe/cxgbe_ethdev.c b/drivers/net/cxgbe/cxgbe_ethdev.c
index b7f28eb..317598d 100644
--- a/drivers/net/cxgbe/cxgbe_ethdev.c
+++ b/drivers/net/cxgbe/cxgbe_ethdev.c
@@ -1050,3 +1050,4 @@ static struct eth_driver rte_cxgbe_pmd = {

 RTE_PMD_REGISTER_PCI(net_cxgbe, rte_cxgbe_pmd.pci_drv);
 RTE_PMD_REGISTER_PCI_TABLE(net_cxgbe, cxgb4_pci_tbl);
+RTE_PMD_REGISTER_KMOD_DEP(net_cxgbe, "* igb_uio | uio_pci_generic | vfio");
diff --git a/drivers/net/e1000/em_ethdev.c b/drivers/net/e1000/em_ethdev.c
index aee3d34..866a5cf 100644
--- a/drivers/net/e1000/em_ethdev.c
+++ b/drivers/net/e1000/em_ethdev.c
@@ -1807,3 +1807,4 @@ eth_em_set_mc_addr_list(struct rte_eth_dev *dev,

 RTE_PMD_REGISTER_PCI(net_e1000_em, rte_em_pmd.pci_drv);
 RTE_PMD_REGISTER_PCI_TABLE(net_e1000_em, pci_id_em_map);
+RTE_PMD_REGISTER_KMOD_DEP(net_e1000_em, "* igb_uio | uio_pci_generic | vfio");
diff -

[dpdk-dev] [PATCH v3 2/2] mempool: pktmbuf pool default fallback for mempool ops error

2016-11-22 Thread Olivier Matz
Hi Hemant,

Back on this topic, please see some comments below.

On 11/07/2016 01:30 PM, Hemant Agrawal wrote:
> Hi Olivier,
>   
>> -Original Message-----
>> From: Olivier Matz [mailto:olivier.matz at 6wind.com]
>> Sent: Friday, October 14, 2016 5:41 PM
>>> On 9/22/2016 6:42 PM, Hemant Agrawal wrote:
>>>> Hi Olivier
>>>>
>>>> On 9/19/2016 7:27 PM, Olivier Matz wrote:
>>>>> Hi Hemant,
>>>>>
>>>>> On 09/16/2016 06:46 PM, Hemant Agrawal wrote:
>>>>>> In the rte_pktmbuf_pool_create, if the default external mempool is
>>>>>> not available, the implementation can default to "ring_mp_mc",
>>>>>> which is an software implementation.
>>>>>>
>>>>>> Signed-off-by: Hemant Agrawal 
>>>>>> ---
>>>>>> Changes in V3:
>>>>>> * adding warning message to say that falling back to default sw
>>>>>> pool
>>>>>> ---
>>>>>>  lib/librte_mbuf/rte_mbuf.c | 8 
>>>>>>  1 file changed, 8 insertions(+)
>>>>>>
>>>>>> diff --git a/lib/librte_mbuf/rte_mbuf.c
>>>>>> b/lib/librte_mbuf/rte_mbuf.c index 4846b89..8ab0eb1 100644
>>>>>> --- a/lib/librte_mbuf/rte_mbuf.c
>>>>>> +++ b/lib/librte_mbuf/rte_mbuf.c
>>>>>> @@ -176,6 +176,14 @@ rte_pktmbuf_pool_create(const char *name,
>>>>>> unsigned n,
>>>>>>
>>>>>>  rte_errno = rte_mempool_set_ops_byname(mp,
>>>>>>  RTE_MBUF_DEFAULT_MEMPOOL_OPS, NULL);
>>>>>> +
>>>>>> +/* on error, try falling back to the software based default
>>>>>> pool */
>>>>>> +if (rte_errno == -EOPNOTSUPP) {
>>>>>> +RTE_LOG(WARNING, MBUF, "Default HW Mempool not supported. "
>>>>>> +"falling back to sw mempool \"ring_mp_mc\"");
>>>>>> +rte_errno = rte_mempool_set_ops_byname(mp, "ring_mp_mc",
>>>>>> NULL);
>>>>>> +}
>>>>>> +
>>>>>>  if (rte_errno != 0) {
>>>>>>  RTE_LOG(ERR, MBUF, "error setting mempool handler\n");
>>>>>>  return NULL;
>>>>>>
>>>>>
>>>>> Without adding a new method ".supported()", the first call to
>>>>> rte_mempool_populate() could return the same error ENOTSUP. In this
>>>>> case, it is still possible to fallback.
>>>>>
>>>> It will be a bit late.
>>>>
>>>> On failure, then we have to set the default ops and do a goto before
>>>> rte_pktmbuf_pool_init(mp, &mbp_priv);
>>
>> I still think we can do the job without adding the .supported() method.
>> The following code is just an (untested) example:
>>
>> struct rte_mempool *
>> rte_pktmbuf_pool_create(const char *name, unsigned n,
>> unsigned cache_size, uint16_t priv_size, uint16_t data_room_size,
>> int socket_id)
>> {
>> struct rte_mempool *mp;
>> struct rte_pktmbuf_pool_private mbp_priv;
>> unsigned elt_size;
>> int ret;
>> const char *ops[] = {
>> RTE_MBUF_DEFAULT_MEMPOOL_OPS, "ring_mp_mc", NULL,
>> };
>> const char **op;
>>
>> if (RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) != priv_size) {
>> RTE_LOG(ERR, MBUF, "mbuf priv_size=%u is not aligned\n",
>> priv_size);
>> rte_errno = EINVAL;
>> return NULL;
>> }
>> elt_size = sizeof(struct rte_mbuf) + (unsigned)priv_size +
>> (unsigned)data_room_size;
>> mbp_priv.mbuf_data_room_size = data_room_size;
>> mbp_priv.mbuf_priv_size = priv_size;
>>
>> for (op = &ops[0]; *op != NULL; op++) {
>> mp = rte_mempool_create_empty(name, n, elt_size, cache_size,
>> sizeof(struct rte_pktmbuf_pool_private), socket_id, 0);
>> if (mp == NULL)
>> return NULL;
>>
>> ret = rte_mempool_set_ops_byname(mp, *op, NULL);
>> if (ret != 0) {
>> RTE_LOG(ERR, MBUF, "error setting mempool handler\n");
>> rte_mempool_free(mp);
>> if (ret == -ENOTSUP)
>> continue;
>> rte_errno = -ret;
>> return NULL;
>>
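
(The archive cuts the example here; presumably the loop breaks once
rte_mempool_set_ops_byname() succeeds, and the function then continues with
the usual rte_pktmbuf_pool_create() sequence, roughly:)

		}
		break;	/* ops handler set, stop trying fallbacks */
	}

	ret = rte_mempool_populate_default(mp);
	if (ret < 0) {
		rte_mempool_free(mp);
		rte_errno = -ret;
		return NULL;
	}

	rte_pktmbuf_pool_init(mp, &mbp_priv);
	rte_mempool_obj_iter(mp, rte_pktmbuf_init, NULL);

	return mp;
}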

[dpdk-dev] Adding API to force freeing consumed buffers in TX ring

2016-11-21 Thread Olivier Matz
Hi,

On 11/21/2016 03:33 PM, Wiles, Keith wrote:
> 
>> On Nov 21, 2016, at 4:48 AM, Damjan Marion (damarion) <damarion at cisco.com> wrote:
>>
>>
>> Hi,
>>
>> Currently in VPP we do a memcpy of the whole packet when we need to do
>> replication, as we cannot know if a specific buffer has been transmitted
>> from the tx ring before we update it again (i.e. l2 header rewrite).
>>
>> Unless there is already a way to address this issue in DPDK which I'm not
>> aware of, my proposal is that we provide a mechanism for polling the TX ring
>> for consumed buffers. This can be either a completely new API or an
>> extension of rte_eth_tx_burst (i.e. a special case when nb_pkts=0).
>>
>> This will allow us to start polling the tx ring when we expect some
>> mbufs back, instead of waiting for the next tx burst (which we don't know
>> when it will happen) and hoping that we will reach free_threshold soon.
> 
> +1
> 
> In Pktgen I have the problem of not being able to reclaim all of the TX mbufs 
> to update them for the next set of packets to send. I know this is not a 
> common case, but I do see the case where the application needs its mbufs 
> freed off the TX ring. Currently you need to have at least a TX ring size of 
> mbufs on hand to make sure you can send to a TX ring. If you allocate too few 
> you run into a deadlock case as the number of mbufs  on a TX ring does not 
> hit the flush mark. If you are sending to multiple TX rings on the same numa 
> node from the a single TX pool you have to understand the total number of 
> mbufs you need to have allocated to hit the TX flush on each ring. Not a 
> clean way to handle the problems as you may have limited memory or require 
> some logic to add more mbufs for dynamic ports.
> 
> Anyway it would be great to require a way to clean up the TX done ring, using 
> nb_pkts == 0 is the simplest way, but a new API is fine too.
>>
>> Any thoughts?

Yes, it looks useful to have such an API.

I would prefer another function instead of diverting the meaning of
nb_pkts. Maybe this?

  void rte_eth_tx_free_bufs(uint8_t port_id, uint16_t queue_id);
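
(A hypothetical sketch of how such an entry point could be wired, mirroring
the rte_eth_tx_burst() dispatch; the tx_done_cleanup callback name is
invented here for illustration:)

static inline void
rte_eth_tx_free_bufs(uint8_t port_id, uint16_t queue_id)
{
	struct rte_eth_dev *dev = &rte_eth_devices[port_id];

	/* hypothetical per-PMD callback reclaiming consumed TX mbufs */
	if (dev->tx_done_cleanup != NULL)
		(*dev->tx_done_cleanup)(dev->data->tx_queues[queue_id]);
}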


Regards,
Olivier


[dpdk-dev] [PATCH] lib/librte_mempool: a redundant word in comment

2016-11-18 Thread Olivier Matz
Hi Wei,

On 11/15/2016 07:54 AM, Zhao1, Wei wrote:
> Hi, john
> 
>> -Original Message-
>> From: Mcnamara, John
>> Sent: Monday, November 14, 2016 6:30 PM
>> To: Zhao1, Wei ; dev at dpdk.org
>> Cc: olivier.matz at 6wind.com; Zhao1, Wei 
>> Subject: RE: [dpdk-dev] [PATCH] lib/librte_mempool: a redundant word in
>> comment
>>
>>
>>
>>> -Original Message-
>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wei Zhao
>>> Sent: Monday, November 14, 2016 2:47 AM
>>> To: dev at dpdk.org
>>> Cc: olivier.matz at 6wind.com; Zhao1, Wei 
>>> Subject: [dpdk-dev] [PATCH] lib/librte_mempool: a redundant word in
>>> comment
>>>
>>> From: zhao wei 
>>
>> I think you need to add your name to gitconfig file on the sending machine to
>> avoid this "From:"
>>
>>>
>>> There is a redundant repetition word "for" in commnet line the file
>>> rte_mempool.h after the definition of RTE_MEMPOOL_OPS_NAMESIZE.
>>> The word "for"appear twice in line 359 and 360.One of them is
>>> redundant, so delete it.
>>>
>>> Fixes: 449c49b93a6b ("lib/librte_mempool: mempool: support handler
>>> operations")

The proper fixline should be:
  Fixes: 449c49b93a6b ("mempool: support handler operations")

(no need to add "lib/librte_mempool:")
This comment also applies to the other patch, I missed it.


>>>
>>> Signed-off-by: zhao wei 
>>
>> /commnet/comment/
>>
>> And same comment as before about the title. Apart from that:
>>
>> Acked-by: John McNamara 
>>
>>
> 
> Thank you for your suggestion,  I will change as your comment in following 
> patch!
> 

Also same comment about "mempool:" instead of "lib/librte_mempool: mempool:"


Thanks,
Olivier


[dpdk-dev] [PATCH] lib/librte_mempool: a redundant of socket_id assignment

2016-11-18 Thread Olivier Matz
Hi Wei,

On 11/14/2016 11:25 AM, Mcnamara, John wrote:
> 
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wei Zhao
>> Sent: Monday, November 14, 2016 2:16 AM
>> To: dev at dpdk.org
>> Cc: olivier.matz at 6wind.com; Zhao1, Wei 
>> Subject: [dpdk-dev] [PATCH] lib/librte_mempool: a redundant of socket_id
>> assignment
>>
>> From: zhao wei 
>>
>> There is a redundant repetition mempool socket_id assignment in the file
>> rte_mempool.c in function rte_mempool_create_empty.The statement
>> "mp->socket_id = socket_id;"appear twice in line 821 and 824.One of them is
>> redundant, so delete it.
>>
>> Fixes: 85226f9c526b ("lib/librte_mempool:  mempool:introduce a function to
>> create an empty pool")
>>
>> Signed-off-by: zhao wei 
> 
> Titles should generally start with a verb to indicate what is being done.
> Something like:
> 
> lib/librte_mempool: remove redundant socket_id assignment
> 
> Apart from that. 
> 
> Acked-by: John McNamara 

I would even say:
  mempool: remove redundant socket_id assignment

Acked-by: Olivier Matz 


[dpdk-dev] [PATCH v2] mempool: Free memzone if mempool populate phys fails

2016-11-11 Thread Olivier Matz
Hi Hemant,

On 11/11/2016 04:47 PM, Hemant Agrawal wrote:
> From: Nipun Gupta 
> 
> This patch fixes the issue of memzone not being freed in case the
> rte_mempool_populate_phys fails in the rte_mempool_populate_default
> 
> This issue was identified when testing with OVS ~2.6
> - configure the system with low memory (e.g. < 500 MB)
> - add bridge and dpdk interfaces
> - delete bridge
> - keep on repeating the above sequence.
> 
> Fixes: d1d914ebbc25 ("mempool: allocate in several memory chunks by default")
> 
> Signed-off-by: Nipun Gupta 
> ---
>  lib/librte_mempool/rte_mempool.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/librte_mempool/rte_mempool.c 
> b/lib/librte_mempool/rte_mempool.c
> index e94e56f..aa513b9 100644
> --- a/lib/librte_mempool/rte_mempool.c
> +++ b/lib/librte_mempool/rte_mempool.c
> @@ -578,8 +578,10 @@ rte_mempool_populate_default(struct rte_mempool *mp)
>   mz->len, pg_sz,
>   rte_mempool_memchunk_mz_free,
>   (void *)(uintptr_t)mz);
> - if (ret < 0)
> + if (ret < 0) {
> + rte_memzone_free(mz);
>   goto fail;
> + }
>   }
>  
>   return mp->size;
> 

Acked-by: Olivier Matz 


Thanks
Olivier


[dpdk-dev] disable hugepages

2016-11-10 Thread Olivier Matz


On 11/10/2016 02:10 PM, Wiles, Keith wrote:
> 
>> On Nov 10, 2016, at 6:32 AM, Keren Hochman  
>> wrote:
>>
>> I tried using the following dpdk options:
>> --no-huge --vdev eth_pcap0,rx_pcap=/t1,tx_pcap=/t2
>> It worked but the number of elements is limited, although the machine
>> has enough free memory. rte_mempool_create fails when I'm trying to
>> allocate more memory. Is there any limitation on the memory beside the
>> machine?
> 
> DPDK will just use the standard linux memory allocator, so no limitation in 
> DPDK. Now you could be hitting the limit as a user, need to check your system 
> to make sure you can allocate that much memory to a user. Try using the 
> command ulimit and see what it reports.
> 
> I do not remember exactly how to change limits except with ulimit command. I 
> may have modified /etc/security/limits.conf file.

I don't think it's a ulimit issue.
Actually, the memory is reserved once at startup. The -m EAL
option allows to specify the amount of memory allocated:

  -m MB   Memory to allocate (see also --socket-mem)

So I guess setting it to a higher value (256?) would do the job.
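
For instance, reusing the pcap vdev from the first message (paths are taken
from that message, the application name is generic):

  ./yourapp --no-huge -m 256 --vdev eth_pcap0,rx_pcap=/t1,tx_pcap=/t2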

Regards,
Olivier


[dpdk-dev] [PATCH] doc: postpone ABI changes for mbuf

2016-11-09 Thread Olivier Matz
Mbuf modifications are not ready for 16.11, postpone them to 17.02.

Signed-off-by: Olivier Matz 
---
 doc/guides/rel_notes/deprecation.rst | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst 
b/doc/guides/rel_notes/deprecation.rst
index 9f5fa55..1a9e1ae 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -15,16 +15,17 @@ Deprecation Notices
   ``nb_seg_max`` and ``nb_mtu_seg_max`` providing information about number of
   segments limit to be transmitted by device for TSO/non-TSO packets.

-* ABI changes are planned for 16.11 in the ``rte_mbuf`` structure: some fields
+* ABI changes are planned for 17.02 in the ``rte_mbuf`` structure: some fields
   may be reordered to facilitate the writing of ``data_off``, ``refcnt``, and
   ``nb_segs`` in one operation, because some platforms have an overhead if the
   store address is not naturally aligned. Other mbuf fields, such as the
-  ``port`` field, may be moved or removed as part of this mbuf work.
+  ``port`` field, may be moved or removed as part of this mbuf work. A
+  ``timestamp`` will also be added.

 * The mbuf flags PKT_RX_VLAN_PKT and PKT_RX_QINQ_PKT are deprecated and
   are respectively replaced by PKT_RX_VLAN_STRIPPED and
   PKT_RX_QINQ_STRIPPED, that are better described. The old flags and
-  their behavior will be kept in 16.07 and will be removed in 16.11.
+  their behavior will be kept until 16.11 and will be removed in 17.02.

 * mempool: The functions ``rte_mempool_count`` and ``rte_mempool_free_count``
   will be removed in 17.02.
-- 
2.8.1



[dpdk-dev] disable hugepages

2016-11-09 Thread Olivier Matz
Hi Keren,

On 11/09/2016 03:40 PM, Keren Hochman wrote:
> On Wed, Nov 9, 2016 at 3:40 PM, Christian Ehrhardt <
> christian.ehrhardt at canonical.com> wrote:
> 
>>
>> On Wed, Nov 9, 2016 at 1:55 PM, Keren Hochman <
>> keren.hochman at lightcyber.com> wrote:
>>
>>> how can I create a mempool without hugepages? My application is running on a
>>> pcap file so no huge pages are needed?
>>>
>>
>> Not sure if that is what you really want (Debug use only), but in general
>> no-huge is available as EAL arg
>>
>> From http://pktgen.readthedocs.io/en/latest/usage_eal.html :
>>
>> EAL options for DEBUG use only:
>>   --no-huge   : Use malloc instead of hugetlbfs
>>
> I need this option only for testing. How can I use rte_mempool_create if I
> use --no-huge?

When using --no-huge, the dpdk libraries (including mempool) allocate
their memory in standard memory. Just keep in mind the physical addresses
will be wrong, so this memory cannot be given to hw devices.

Regards,
Olivier


[dpdk-dev] [PATCH] app/test: fix crash of lpm test

2016-11-09 Thread Olivier Matz
The test recently added accesses to lpm->tbl8[ip >> 8] which is much
larger than the size of the table, causing a crash of the test
application.

Fix this typo by replacing tbl8 by tbl24.

Fixes: 231fa88ed522 ("app/test: verify LPM tbl8 recycle")

Signed-off-by: Olivier Matz 
---

Hi Wei,

I don't know lpm very well and I did not spend much time to understand
the test case. I guess that's the proper fix, but please check carefully
that I'm not doing something wrong :)

Thanks,
Olivier


 app/test/test_lpm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/app/test/test_lpm.c b/app/test/test_lpm.c
index 80e0efc..41ae80f 100644
--- a/app/test/test_lpm.c
+++ b/app/test/test_lpm.c
@@ -1256,7 +1256,7 @@ test18(void)
rte_lpm_add(lpm, ip, depth, next_hop);

TEST_LPM_ASSERT(lpm->tbl24[ip>>8].valid_group);
-   tbl8_group_index = lpm->tbl8[ip>>8].group_idx;
+   tbl8_group_index = lpm->tbl24[ip>>8].group_idx;

depth = 23;
next_hop = 2;
@@ -1272,7 +1272,7 @@ test18(void)
rte_lpm_add(lpm, ip, depth, next_hop);

TEST_LPM_ASSERT(lpm->tbl24[ip>>8].valid_group);
-   TEST_LPM_ASSERT(tbl8_group_index == lpm->tbl8[ip>>8].group_idx);
+   TEST_LPM_ASSERT(tbl8_group_index == lpm->tbl24[ip>>8].group_idx);

depth = 24;
next_hop = 4;
@@ -1288,7 +1288,7 @@ test18(void)
rte_lpm_add(lpm, ip, depth, next_hop);

TEST_LPM_ASSERT(lpm->tbl24[ip>>8].valid_group);
-   TEST_LPM_ASSERT(tbl8_group_index == lpm->tbl8[ip>>8].group_idx);
+   TEST_LPM_ASSERT(tbl8_group_index == lpm->tbl24[ip>>8].group_idx);

rte_lpm_free(lpm);
 #undef group_idx
-- 
2.8.1



[dpdk-dev] [PATCH v3 02/12] net/virtio: setup and start cq in configure callback

2016-11-08 Thread Olivier Matz
Hi Lei,

On 11/02/2016 02:38 AM, Yao, Lei A wrote:
> Hi, Olivier
> 
> During the validation work with v16.11-rc2, I found that this patch will cause 
> a VM crash if virtio bonding is enabled in the VM. Could you have a check at your side? 
> The following are the steps at my side. Thanks a lot
> 
> 1. bind PF port to igb_uio.
> modprobe uio
> insmod ./x86_64-native-linuxapp-gcc/kmod/igb_uio.ko
> ./tools/dpdk-devbind.py --bind=igb_uio 84:00.1
> 
> 2. start vhost switch.
> ./examples/vhost/build/vhost-switch -c 0x1c -n 4 --socket-mem 4096,4096 --
> -p 0x1 --mergeable 0 --vm2vm 0 --socket-file ./vhost-net
> 
> 3. bootup one vm with four virtio net device
> qemu-system-x86_64 \
> -name vm0 -enable-kvm -chardev 
> socket,path=/tmp/vm0_qga0.sock,server,nowait,id=vm0_qga0 \
> -device virtio-serial -device 
> virtserialport,chardev=vm0_qga0,name=org.qemu.guest_agent.0 \
> -daemonize -monitor unix:/tmp/vm0_monitor.sock,server,nowait \
> -net nic,vlan=0,macaddr=00:00:00:c7:56:64,addr=1f \
> -net user,vlan=0,hostfwd=tcp:10.239.129.127:6107:22 \
> -chardev socket,id=char0,path=./vhost-net \
> -netdev type=vhost-user,id=netdev0,chardev=char0,vhostforce \
> -device virtio-net-pci,netdev=netdev0,mac=52:54:00:00:00:01 \
> -chardev socket,id=char1,path=./vhost-net \
> -netdev type=vhost-user,id=netdev1,chardev=char1,vhostforce \
> -device virtio-net-pci,netdev=netdev1,mac=52:54:00:00:00:02 \
> -chardev socket,id=char2,path=./vhost-net \
> -netdev type=vhost-user,id=netdev2,chardev=char2,vhostforce \
> -device virtio-net-pci,netdev=netdev2,mac=52:54:00:00:00:03 \
> -chardev socket,id=char3,path=./vhost-net \
> -netdev type=vhost-user,id=netdev3,chardev=char3,vhostforce \
> -device virtio-net-pci,netdev=netdev3,mac=52:54:00:00:00:04 \
> -cpu host -smp 8 -m 4096 \
> -object memory-backend-file,id=mem,size=4096M,mem-path=/mnt/huge,share=on \
> -numa node,memdev=mem -mem-prealloc -drive file=/home/osimg/ubuntu16.img -vnc 
> :10
> 
> 4. on vm:
> bind virtio net device to igb_uio
> modprobe uio
> insmod ./x86_64-native-linuxapp-gcc/kmod/igb_uio.ko
> tools/dpdk-devbind.py --bind=igb_uio 00:04.0 00:05.0 00:06.0 00:07.0
> 5. startup test_pmd app
> ./x86_64-native-linuxapp-gcc/app/testpmd -c 0x1f -n 4 -- -i --txqflags=0xf00 
> --disable-hw-vlan-filter
> 6. create one bonding device (port 4)
> create bonded device 0 0 (the first 0: mode, the second: the socket number)
> show bonding config 4
> 7. bind port 0, 1, 2 to port 4
> add bonding slave 0 4
> add bonding slave 1 4
> add bonding slave 2 4
> port start 4
> Result: just after port start 4 (port 4 is the bonded port), the VM shut down 
> immediately.

Sorry for the late answer. I reproduced the issue on rc2, and I confirm
that Yuanhan's patchset fixes it in rc3.

Regards,
Olivier


[dpdk-dev] [PATCH v2 1/1] mempool: Add sanity check when secondary link in less mempools than primary

2016-11-08 Thread Olivier Matz
Hello Jean,

On 10/28/2016 08:37 PM, Jean Tourrilhes wrote:
> If the mempool ops the caller wants to use is not registered, the
> library will segfault in an obscure way when trying to use that
> mempool. It's better to catch it early and warn the user.
> 
> If the primary and secondary process were build using different build
> systems, the list of constructors included by the linker in each
> binary might be different. Mempools are registered via constructors, so
> the linker magic will directly impact which tailqs are registered with
> the primary and the secondary.
> DPDK currently assumes that the secondary has a superset of the
> mempools registered at the primary, and they are in the same order
> (same index in primary and secondary). In some build scenario, the
> secondary might not initialise any mempools at all.
> 
> This would also catch cases where there is a bug in the mempool
> registration, or some memory corruptions, but this has not been
> observed.

I still don't get how you can have different constructors in your
primary and secondary. As I said in my previous answer, my
understanding of how secondary process works in dpdk is that the
binaries have to be synchronized.

Your are just talking about linker magic but you don't explain how
to reproduce your issue easily. Please, can you provide a simple
example (let's say one .c and one Makefile) plus a simple screenshot
that highlights the issue?

We'll then check if this issue should be fixed in mempool or if
we should provide helpers in the build system to avoid this situation.

> 
> Signed-off-by: Jean Tourrilhes 
> ---
>  lib/librte_mempool/rte_mempool.c | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/lib/librte_mempool/rte_mempool.c 
> b/lib/librte_mempool/rte_mempool.c
> index 2e28e2e..82260cc 100644
> --- a/lib/librte_mempool/rte_mempool.c
> +++ b/lib/librte_mempool/rte_mempool.c
> @@ -1275,6 +1275,25 @@ rte_mempool_lookup(const char *name)
>   return NULL;
>   }
>  
> + /* Sanity check : secondary may have initialised less mempools
> +  * than primary due to linker and constructor magic. Or maybe
> +  * there is a mempool corruption or bug. In any case, we can't
> +  * go on, we will segfault in an obscure way.
> +  * This does not detect the case where the constructor order
> +  * is different between primary and secondary and where the
> +  * index points to the wrong ops. This would require more
> +  * extensive changes, and is much less likely.
> +  * Jean II */
> + if(mp->ops_index >= (int32_t) rte_mempool_ops_table.num_ops) {
> + unsigned i;
> + /* Dump list of mempool ops for further investigation. */
> + for (i = 0; i < rte_mempool_ops_table.num_ops; i++) {
> + RTE_LOG(ERR, EAL, "Registered mempool[%d] is %s\n", i, 
> rte_mempool_ops_table.ops[i].name);
> + }
> + /* Do not dump mempool list itself, it will segfault. */
> + rte_panic("Cannot find ops for mempool, ops_index %d, num_ops 
> %d - maybe due to build process or linker configuration\n", mp->ops_index, 
> rte_mempool_ops_table.num_ops);
> + }
> +

Also, please use checkpatch to ensure it matches the style.
See
http://dpdk.org/doc/guides/contributing/patches.html#checking-the-patches

I don't feel signing your comments is absolutely required. In addition
it does not give a lot of information about who wrote it, given the
large number of Jean II: https://fr.wikipedia.org/wiki/Jean_II

Regards,
Olivier


[dpdk-dev] [PATCH v2] net/virtio: cache Rx/Tx offload ability check

2016-11-08 Thread Olivier Matz
Hi Yuanhan,

On 11/04/2016 03:29 PM, Yuanhan Liu wrote:
> It's not a good idea to do the check of whether Rx/Tx offload is
> enabled at the data path. Instead, we could do the check at init
> stage and store the result, so that we could avoid the check again
> and again at the critical datapath.
> 
> Cc: Olivier Matz 
> Signed-off-by: Yuanhan Liu 
> ---
> v2: - rebase on top of the bug fix patches
> - define rx/tx_offload as uint8_t instead of int
> 
>  drivers/net/virtio/virtio_ethdev.c | 19 +++
>  drivers/net/virtio/virtio_pci.h|  2 ++
>  drivers/net/virtio/virtio_rxtx.c   | 31 +--
>  3 files changed, 26 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/net/virtio/virtio_ethdev.c 
> b/drivers/net/virtio/virtio_ethdev.c
> index 1505f67..2adae58 100644
> --- a/drivers/net/virtio/virtio_ethdev.c
> +++ b/drivers/net/virtio/virtio_ethdev.c
> @@ -1188,6 +1188,22 @@ rx_func_get(struct rte_eth_dev *eth_dev)
>   eth_dev->rx_pkt_burst = _recv_pkts;
>  }
>  
> +static inline int
> +rx_offload_enabled(struct virtio_hw *hw)
> +{
> + return vtpci_with_feature(hw, VIRTIO_NET_F_GUEST_CSUM) ||
> + vtpci_with_feature(hw, VIRTIO_NET_F_GUEST_TSO4) ||
> + vtpci_with_feature(hw, VIRTIO_NET_F_GUEST_TSO6);
> +}
> +
> +static inline int
> +tx_offload_enabled(struct virtio_hw *hw)
> +{
> + return vtpci_with_feature(hw, VIRTIO_NET_F_CSUM) ||
> + vtpci_with_feature(hw, VIRTIO_NET_F_HOST_TSO4) ||
> + vtpci_with_feature(hw, VIRTIO_NET_F_HOST_TSO6);
> +}

Do we need these functions to be inlined?

It looks better to do like this, but out of curiosity, do you see a
performance improvement?

Regards,
Olivier


[dpdk-dev] [PATCH v11 1/6] ethdev: add Tx preparation

2016-10-27 Thread Olivier Matz


On 10/26/2016 02:56 PM, Tomasz Kulasek wrote:
> Added API for `rte_eth_tx_prep`
> 
> [...]
> 
> Signed-off-by: Tomasz Kulasek 

Acked-by: Olivier Matz 


[dpdk-dev] [PATCH v2] mempool: fix search of maximum contiguous pages

2016-10-25 Thread Olivier Matz
From: Wei Dai <wei@intel.com>

paddr[i] + pg_sz always points to the start physical address of the
2nd page after paddr[i], so only up to 2 pages can be combined to
be used. With this revision, more than 2 pages can be used.
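
(For example, with pg_sz = 0x1000 and three contiguous pages at physical
addresses 0x1000, 0x2000 and 0x3000: the old test paddr[i] + pg_sz ==
paddr[i+n] compares 0x2000 against paddr[i+2] = 0x3000 and fails, so at most
two pages were ever merged; the fixed test paddr[i+n-1] + pg_sz ==
paddr[i+n] accepts the whole run.)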

Fixes: 84121f197187 ("mempool: store memory chunks in a list")

Signed-off-by: Wei Dai 
Signed-off-by: Olivier Matz 
---
 lib/librte_mempool/rte_mempool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index 71017e1..e94e56f 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -428,7 +428,7 @@ rte_mempool_populate_phys_tab(struct rte_mempool *mp, char 
*vaddr,

/* populate with the largest group of contiguous pages */
for (n = 1; (i + n) < pg_num &&
-paddr[i] + pg_sz == paddr[i+n]; n++)
+paddr[i + n - 1] + pg_sz == paddr[i + n]; n++)
;

ret = rte_mempool_populate_phys(mp, vaddr + i * pg_sz,
-- 
2.8.1



[dpdk-dev] [PATCH] mempool: fix search of maximum contiguous pages

2016-10-25 Thread Olivier Matz
Hi Thomas,

On 10/25/2016 04:37 PM, Thomas Monjalon wrote:
> 2016-10-13 17:05, Olivier MATZ:
>> Hi Wei,
>>
>> On 10/13/2016 02:31 PM, Ananyev, Konstantin wrote:
>>>
>>>>
>>>>>>> diff --git a/lib/librte_mempool/rte_mempool.c
>>>>>>> b/lib/librte_mempool/rte_mempool.c
>>>>>>> index 71017e1..e3e254a 100644
>>>>>>> --- a/lib/librte_mempool/rte_mempool.c
>>>>>>> +++ b/lib/librte_mempool/rte_mempool.c
>>>>>>> @@ -426,9 +426,12 @@ rte_mempool_populate_phys_tab(struct
>>>>>>> rte_mempool *mp, char *vaddr,
>>>>>>>
>>>>>>> for (i = 0; i < pg_num && mp->populated_size < mp->size; i += 
>>>>>>> n) {
>>>>>>>
>>>>>>> +   phys_addr_t paddr_next;
>>>>>>> +   paddr_next = paddr[i] + pg_sz;
>>>>>>> +
>>>>>>> /* populate with the largest group of contiguous pages 
>>>>>>> */
>>>>>>> for (n = 1; (i + n) < pg_num &&
>>>>>>> -paddr[i] + pg_sz == paddr[i+n]; n++)
>>>>>>> +paddr_next == paddr[i+n]; n++, paddr_next 
>>>>>>> += pg_sz)
>>>>>>> ;
>>>>>>
>>>>>> Good catch.
>>>>>> Why not just paddr[i + n - 1] != paddr[i + n]?
>>>>>
>>>>> Sorry, I meant 'paddr[i + n - 1] + pg_sz == paddr[i+n]' of course.
>>>>>
>>>>>> Then you don't need extra variable (paddr_next) here.
>>>>>> Konstantin
>>>>
>>>> Thank you, Konstantin
>>>> 'paddr[i + n - 1] + pg_sz = paddr[i + n]' also can fix it and have 
>>>> straight meaning.
>>>> But I assume that my revision with paddr_next += pg_sz may have a bit 
>>>> better performance.
>>>
>>> I don't think there would be any real difference, again it is not 
>>> performance critical code-path.
>>>
>>>> By the way, paddr[i] + n * pg_sz = paddr[i + n] can also resolve it.
>>>
>>> Yes, that's one seems even better for me - make things more clear.
>>
>> Thank you for fixing this.
>>
>> My vote would go for "paddr[i + n - 1] + pg_sz == paddr[i + n]"
>>
>> If you feel "paddr[i] + n * pg_sz = paddr[i + n]" is clearer, I have no 
>> problem with it either.
> 
> No answer from Wei Dai.
> Please Olivier advise what to do with this patch.
> Thanks
> 

I think it's good to have this fix in 16.11.
I'm sending a v2 based on Wei's patch.

Olivier



[dpdk-dev] mbuf changes

2016-10-25 Thread Olivier Matz


On 10/25/2016 04:25 PM, Morten Brørup wrote:
> It might also make sense documenting the mbuf fields in more detail 
> somewhere. E.g. the need for nb_segs in the NIC's TX handler.

Good point, I'll do it at the same time as the first rework
proposal.



[dpdk-dev] [PATCH v10 1/6] ethdev: add Tx preparation

2016-10-25 Thread Olivier Matz
Hi Tomasz,

On 10/24/2016 06:51 PM, Tomasz Kulasek wrote:
> Added API for `rte_eth_tx_prep`
> 
> [...]
> --- a/lib/librte_ether/rte_ethdev.h
> +++ b/lib/librte_ether/rte_ethdev.h
> @@ -182,6 +182,7 @@ extern "C" {
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "rte_ether.h"
>  #include "rte_eth_ctrl.h"
>  #include "rte_dev_info.h"
> @@ -699,6 +700,8 @@ struct rte_eth_desc_lim {
>   uint16_t nb_max;   /**< Max allowed number of descriptors. */
>   uint16_t nb_min;   /**< Min allowed number of descriptors. */
>   uint16_t nb_align; /**< Number of descriptors should be aligned to. */
> + uint16_t nb_seg_max; /**< Max number of segments per whole packet. 
> */
> + uint16_t nb_mtu_seg_max; /**< Max number of segments per one MTU */

Sorry if it was not clear in my previous review, but I think this should
be better explained here. You said that the "limitation of number
of segments may differ depend of TSO/non TSO".

As an application developer, I still find it difficult to
understand clearly what that means. Is it the maximum number
of mbuf-segments that contain payload for one tcp-segment sent by
the device?

In that case, it looks quite difficult to verify that in an application.
It looks like this field is not used by validate_offload(), so how
should it be used by an application?
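
(One plausible application-side pattern, assuming the new limits end up in
dev_info.tx_desc_lim like the existing descriptor limits:)

	struct rte_eth_dev_info dev_info;

	rte_eth_dev_info_get(port_id, &dev_info);
	/* e.g. linearize or drop packets exceeding the segment limit */
	if (m->nb_segs > dev_info.tx_desc_lim.nb_seg_max)
		handle_oversegmented(m);	/* app-specific fallback */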


>  };
>  
>  /**
> @@ -1188,6 +1191,11 @@ typedef uint16_t (*eth_tx_burst_t)(void *txq,
>  uint16_t nb_pkts);
>  /**< @internal Send output packets on a transmit queue of an Ethernet 
> device. */
>  
> +typedef uint16_t (*eth_tx_prep_t)(void *txq,
> +struct rte_mbuf **tx_pkts,
> +uint16_t nb_pkts);
> +/**< @internal Prepare output packets on a transmit queue of an Ethernet 
> device. */
> +
>  typedef int (*flow_ctrl_get_t)(struct rte_eth_dev *dev,
>  struct rte_eth_fc_conf *fc_conf);
>  /**< @internal Get current flow control parameter on an Ethernet device */
> @@ -1622,6 +1630,7 @@ struct rte_eth_rxtx_callback {
>  struct rte_eth_dev {
>   eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive function. */
>   eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit function. */
> + eth_tx_prep_t tx_pkt_prep; /**< Pointer to PMD transmit prepare 
> function. */
>   struct rte_eth_dev_data *data;  /**< Pointer to device data */
>   const struct eth_driver *driver;/**< Driver for this device */
>   const struct eth_dev_ops *dev_ops; /**< Functions exported by PMD */
> @@ -2816,6 +2825,93 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
>   return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, 
> nb_pkts);
>  }
>  
> +/**
> + * Process a burst of output packets on a transmit queue of an Ethernet 
> device.
> + *
> + * The rte_eth_tx_prep() function is invoked to prepare output packets to be
> + * transmitted on the output queue *queue_id* of the Ethernet device 
> designated
> + * by its *port_id*.
> + * The *nb_pkts* parameter is the number of packets to be prepared which are
> + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
> + * allocated from a pool created with rte_pktmbuf_pool_create().
> + * For each packet to send, the rte_eth_tx_prep() function performs
> + * the following operations:
> + *
> + * - Check if packet meets devices requirements for tx offloads.
> + *
> + * - Check limitations about number of segments.
> + *
> + * - Check additional requirements when debug is enabled.
> + *
> + * - Update and/or reset required checksums when tx offload is set for 
> packet.
> + *
> + * The rte_eth_tx_prep() function returns the number of packets ready to be
> + * sent. A return value equal to *nb_pkts* means that all packets are valid 
> and
> + * ready to be sent.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + *   The value must be a valid port id.
> + * @param queue_id
> + *   The index of the transmit queue through which output packets must be
> + *   sent.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param tx_pkts
> + *   The address of an array of *nb_pkts* pointers to *rte_mbuf* structures
> + *   which contain the output packets.
> + * @param nb_pkts
> + *   The maximum number of packets to process.
> + * @return
> + *   The number of packets correct and ready to be sent. The return value 
> can be
> + *   less than the value of the *tx_pkts* parameter when some packet doesn't
> + *   meet devices requirements with rte_errno set appropriately.
> + */
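
(For readers, the call sequence this comment implies would be roughly as
follows -- a usage sketch, not part of the patch:)

	uint16_t nb_prep, nb_tx;

	nb_prep = rte_eth_tx_prep(port_id, queue_id, pkts, nb_pkts);
	if (nb_prep < nb_pkts) {
		/* pkts[nb_prep] failed the checks and rte_errno tells why;
		 * a software fallback could handle it here */
	}
	nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);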

Inserting here the previous comment:

>> Can we add the constraint that invalid packets are left untouched?
>>
>> I think most of the time there will be a software fallback in that case,
>> so it would be good to ensure that this function does not change the flags
>> or the packet data.
> 
> In current 

[dpdk-dev] mbuf changes

2016-10-25 Thread Olivier Matz


On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
>> Comments at the end.
>>
>> Med venlig hilsen / kind regards
>> - Morten Brørup
>>
>>> -Original Message-
>>> From: Bruce Richardson [mailto:bruce.richardson at intel.com]
>>> Sent: Tuesday, October 25, 2016 2:20 PM
>>> To: Morten Brørup
>>> Cc: Adrien Mazarguil; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
>>> Kuporosov
>>> Subject: Re: [dpdk-dev] mbuf changes
>>>
>>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
>>>> Comments inline.
>>>>
>>>>> -Original Message-
>>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
>>>>> Richardson
>>>>> Sent: Tuesday, October 25, 2016 1:14 PM
>>>>> To: Adrien Mazarguil
>>>>> Cc: Morten Brørup; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
>>>>> Kuporosov
>>>>> Subject: Re: [dpdk-dev] mbuf changes
>>>>>
>>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
>>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
>>>>>>> Comments inline.
>>>>>>>
>>>>>>> Med venlig hilsen / kind regards
>>>>>>> - Morten Brørup
>>>>>>>
>>>>>>>
>>>>>>>> -Original Message-
>>>>>>>> From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
>>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
>>>>>>>> To: Bruce Richardson
>>>>>>>> Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz;
>>>>>>>> Oleg Kuporosov
>>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
>>>>>>>>
>>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
>>> wrote:
>>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +, Wiles, Keith
>>> wrote:
>>>>>>>> [...]
>>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Br?rup
>>>>>>>>  wrote:
>>>>>>>> [...]
>>>>
>>>>>>>>> One other point I'll mention is that we need to have a
>>>>>>>>> discussion on how/where to add in a timestamp value into
>>> the
>>>>>>>>> mbuf. Personally, I think it can be in a union with the
>>>>> sequence
>>>>>>>>> number value, but I also suspect that 32-bits of a
>>> timestamp
>>>>>>>>> is not going to be enough for
>>>>>>>> many.
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>> If we consider that timestamp representation should use
>>>>> nanosecond
>>>>>>>> granularity, a 32-bit value may likely wrap around too
>>> quickly
>>>>>>>> to be useful. We can also assume that applications requesting
>>>>>>>> timestamps may care more about latency than throughput, Oleg
>>>>> found
>>>>>>>> that using the second cache line for this purpose had a
>>>>> noticeable impact [1].
>>>>>>>>
>>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
>>>>>>>
>>>>>>> I agree with Oleg about the latency vs. throughput importance
>>>>>>> for
>>>>> such applications.
>>>>>>>
>>>>>>> If you need high resolution timestamps, consider them to be
>>>>> generated by the NIC RX driver, possibly by the hardware itself
>>>>> (http://w3new.napatech.com/features/time-precision/hardware-time-
>>>>> stamp), so the timestamp belongs in the first cache line. And I am
>>>>> proposing that it should have the highest possible accuracy, which
>>>>> makes the value hardware dependent.
>>>>>>>
>>>>>>> Furthermore, I am arguing that we leave it up to the
>>> application
>>>>>>> to
>>>>> keep track of the slowly moving bits (i.e. counting whole seconds,
>>>>> hours and calendar date) out of band, so 

[dpdk-dev] [PATCH 1/3] mbuf: embedding timestamp into the packet

2016-10-18 Thread Olivier Matz


On 10/13/2016 04:35 PM, Oleg Kuporosov wrote:
> The hard requirement of financial services industry is accurate
> timestamping aligned with the packet itself. This patch is to satisfy
> this requirement:
> 
> - include uint64_t timestamp field into rte_mbuf with minimal impact to
>   throughput/latency. Keep it just simple uint64_t in ns (more than 580
>   years) would be enough for immediate needs while using full
>   struct timespec with twice bigger size would have much stronger
>   performance impact as missed cacheline0.
> 
> - it is possible as there is 6-bytes gap in 1st cacheline (fast path)
>   and moving uint16_t vlan_tci_outer field to 2nd cacheline.
> 
> - such move will only impact for pretty rare usable VLAN RX stripping
>   mode for outer TCI (it used only for one NIC i40e from the whole set and
>   allows to keep minimal performance impact for RX/TX timestamps.

This argument is difficult to accept. One can say you are adding
a field for a pretty rare case used by only one NIC :)

Honestly, I'm not able to judge whether timestamp is more important than
vlan_tci_outer. As room is tight in the first cache line, your patch
submission is the occasion to raise the question: how to decide what
should be in the first part of the mbuf? There are also some other
candidates for moving: m->seqn is only used in librte_reorder and it
is not set in the RX part of a driver.

About the timestamp, it would be valuable to have other opinions,
not only about the placement of the field in the structure, but also
to check that this API is also usable for other NICs.

Have you measured the impact of having the timestamp in the second part
of the mbuf?

Changing the mbuf structure should happen as rarely as possible, and we
have to make sure we take the correct decisions. I think we will
discuss this at dpdk userland 2016.


Apart from that, I wonder if an ol_flag should be added to tell that
the timestamp field is valid in the mbuf.
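
For illustration, such a flag could be used as below. This is only a
sketch: PKT_RX_TIMESTAMP, its bit value and the m->timestamp field are
assumptions here, not existing API:

	#define PKT_RX_TIMESTAMP (1ULL << 17) /* assumed free RX flag bit */

	static inline int
	pkt_timestamp_get(const struct rte_mbuf *m, uint64_t *ts)
	{
		/* only read the timestamp if the driver marked it valid */
		if ((m->ol_flags & PKT_RX_TIMESTAMP) == 0)
			return -1;
		*ts = m->timestamp; /* hypothetical uint64_t field, in ns */
		return 0;
	}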

Regards,
Olivier


[dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation

2016-10-18 Thread Olivier Matz
Hi Tomasz,

I think the principle of tx_prep() is good, it may for instance help to
remove the function virtio_tso_fix_cksum() from the virtio, and maybe
even change the mbuf TSO/cksum API.

I have some questions/comments below, I'm sorry it comes very late.

On 10/14/2016 05:05 PM, Tomasz Kulasek wrote:
> Added API for `rte_eth_tx_prep`
> 
> uint16_t rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id,
>   struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> 
> Added fields to the `struct rte_eth_desc_lim`:
> 
>   uint16_t nb_seg_max;
>   /**< Max number of segments per whole packet. */
> 
>   uint16_t nb_mtu_seg_max;
>   /**< Max number of segments per one MTU */

Not sure I understand the second one. Is this the case for TSO?

Is it a usual limitation in different network hardware?
Can this info be retrieved/used by the application?
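
For instance, assuming the new fields land in struct rte_eth_desc_lim
and are filled by the drivers, an application could check them as in
the sketch below:

	struct rte_eth_dev_info dev_info;

	rte_eth_dev_info_get(port_id, &dev_info);
	/* nb_seg_max is the new field proposed in this patch */
	if (m->nb_segs > dev_info.tx_desc_lim.nb_seg_max)
		; /* packet exceeds the device segment limit */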

> 
> Created `rte_pkt.h` header with common used functions:
> 
> int rte_validate_tx_offload(struct rte_mbuf *m)
>   to validate general requirements for tx offload in a packet, such as
>   flag completeness. In the current implementation this function is called
>   optionally when RTE_LIBRTE_ETHDEV_DEBUG is enabled.
> 
> int rte_phdr_cksum_fix(struct rte_mbuf *m)
>   to fix pseudo header checksum for TSO and non-TSO tcp/udp packets
>   before hardware tx checksum offload.
>- for non-TSO tcp/udp packets full pseudo-header checksum is
>  counted and set.
>- for TSO the IP payload length is not included.

Why not in rte_net.h?


> [...]
>  
> @@ -2816,6 +2825,82 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
>   return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, 
> nb_pkts);
>  }
>  
> +/**
> + * Process a burst of output packets on a transmit queue of an Ethernet 
> device.
> + *
> + * The rte_eth_tx_prep() function is invoked to prepare output packets to be
> + * transmitted on the output queue *queue_id* of the Ethernet device 
> designated
> + * by its *port_id*.
> + * The *nb_pkts* parameter is the number of packets to be prepared which are
> + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
> + * allocated from a pool created with rte_pktmbuf_pool_create().
> + * For each packet to send, the rte_eth_tx_prep() function performs
> + * the following operations:
> + *
> + * - Check if packet meets devices requirements for tx offloads.

Do you mean hardware requirements?
Can the application be aware of these requirements? I mean capability
flags, or something in dev_infos?

Maybe the comment could be more precise?

> + * - Check limitations about number of segments.
> + *
> + * - Check additional requirements when debug is enabled.

What kind of additional requirements?

> + *
> + * - Update and/or reset required checksums when tx offload is set for 
> packet.
> + *

By reading this, I think it may not be clear to the user what should
be set in the mbuf. In the mbuf API, it is said:

 * TCP segmentation offload. To enable this offload feature for a
 * packet to be transmitted on hardware supporting TSO:
 *  - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
 *PKT_TX_TCP_CKSUM)
 *  - set the flag PKT_TX_IPV4 or PKT_TX_IPV6
 *  - if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP checksum
 *to 0 in the packet
 *  - fill the mbuf offload information: l2_len, l3_len, l4_len, tso_segsz
 *  - calculate the pseudo header checksum without taking ip_len in account,
 *and set it in the TCP header. Refer to rte_ipv4_phdr_cksum() and
 *rte_ipv6_phdr_cksum() that can be used as helpers.


If I understand well, using tx_prep(), the user will have to do the
same except writing the IP checksum to 0, and without setting the
TCP pseudo header checksum, right?
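
On the caller side, that would give something like the sketch below
(the header sizes and the absence of VLAN/IP/TCP options are
assumptions):

	uint16_t nb;

	m->ol_flags = PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_SEG;
	m->l2_len = sizeof(struct ether_hdr);
	m->l3_len = sizeof(struct ipv4_hdr);
	m->l4_len = sizeof(struct tcp_hdr);
	m->tso_segsz = 1460;
	/* no IP checksum reset and no rte_ipv4_phdr_cksum() call here:
	 * with tx_prep(), both would be done by the driver callback */
	nb = rte_eth_tx_prep(port_id, queue_id, &m, 1);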


> + * The rte_eth_tx_prep() function returns the number of packets ready to be
> + * sent. A return value equal to *nb_pkts* means that all packets are valid 
> and
> + * ready to be sent.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The index of the transmit queue through which output packets must be
> + *   sent.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param tx_pkts
> + *   The address of an array of *nb_pkts* pointers to *rte_mbuf* structures
> + *   which contain the output packets.
> + * @param nb_pkts
> + *   The maximum number of packets to process.
> + * @return
> + *   The number of packets correct and ready to be sent. The return value 
> can be
> + *   less than the value of the *tx_pkts* parameter when some packet doesn't
> + *   meet devices requirements with rte_errno set appropriately.
> + */

Can we add the constraint that invalid packets are left untouched?

I think most of the time there will be a software fallback in that
case, so it would be good to ensure that this function does not change
the flags or the 

[dpdk-dev] [PATCH v2 12/12] virtio: add Tso support

2016-10-18 Thread Olivier Matz
Hi Stephen,

On 10/14/2016 01:33 AM, Stephen Hemminger wrote:
> On Thu, 13 Oct 2016 16:18:39 +0800
> Yuanhan Liu  wrote:
> 
>> On Mon, Oct 03, 2016 at 11:00:23AM +0200, Olivier Matz wrote:
>>> +/* When doing TSO, the IP length is not included in the pseudo header
>>> + * checksum of the packet given to the PMD, but for virtio it is
>>> + * expected.
>>> + */
>>> +static void
>>> +virtio_tso_fix_cksum(struct rte_mbuf *m)
>>> +{
>>> +   /* common case: header is not fragmented */
>>> +   if (likely(rte_pktmbuf_data_len(m) >= m->l2_len + m->l3_len +
>>> +   m->l4_len)) {  
>> ...
>>> +   /* replace it in the packet */
>>> +   th->cksum = new_cksum;
>>> +   } else {  
>> ...
>>> +   /* replace it in the packet */
>>> +   *rte_pktmbuf_mtod_offset(m, uint8_t *,
>>> +   m->l2_len + m->l3_len + 16) = new_cksum.u8[0];
>>> +   *rte_pktmbuf_mtod_offset(m, uint8_t *,
>>> +   m->l2_len + m->l3_len + 17) = new_cksum.u8[1];
>>> +   }  
>>
>> The tcp header will always be in the mbuf, right? Otherwise, you can't
>> update the cksum field here. What's the point of introducing the "else
>> clause" then?
>>
>>  --yliu
> 
> You need to check the reference count before updating any data in mbuf.
> 

That's correct, I'll fix that.
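
For reference, a minimal writability check could look like the sketch
below (rejecting indirect mbufs as well is an assumption):

	/* refuse to patch the checksum in place if data may be shared */
	if (!RTE_MBUF_DIRECT(m) || rte_mbuf_refcnt_read(m) > 1)
		return; /* or fall back to a software copy */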

Thanks for the comment,
Olivier


[dpdk-dev] [PATCH v3 2/2] mempool: pktmbuf pool default fallback for mempool ops error

2016-10-14 Thread Olivier Matz
Hi Hemant,

Sorry for the late answer. Please see some comments inline.

On 10/13/2016 03:15 PM, Hemant Agrawal wrote:
> Hi Olivier,
> Any updates w.r.t this patch set?
> 
> Regards
> Hemant
> On 9/22/2016 6:42 PM, Hemant Agrawal wrote:
>> Hi Olivier
>>
>> On 9/19/2016 7:27 PM, Olivier Matz wrote:
>>> Hi Hemant,
>>>
>>> On 09/16/2016 06:46 PM, Hemant Agrawal wrote:
>>>> In the rte_pktmbuf_pool_create, if the default external mempool is
>>>> not available, the implementation can default to "ring_mp_mc", which
>>>> is an software implementation.
>>>>
>>>> Signed-off-by: Hemant Agrawal 
>>>> ---
>>>> Changes in V3:
>>>> * adding warning message to say that falling back to default sw pool
>>>> ---
>>>>  lib/librte_mbuf/rte_mbuf.c | 8 
>>>>  1 file changed, 8 insertions(+)
>>>>
>>>> diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
>>>> index 4846b89..8ab0eb1 100644
>>>> --- a/lib/librte_mbuf/rte_mbuf.c
>>>> +++ b/lib/librte_mbuf/rte_mbuf.c
>>>> @@ -176,6 +176,14 @@ rte_pktmbuf_pool_create(const char *name,
>>>> unsigned n,
>>>>
>>>>  rte_errno = rte_mempool_set_ops_byname(mp,
>>>>  RTE_MBUF_DEFAULT_MEMPOOL_OPS, NULL);
>>>> +
>>>> +/* on error, try falling back to the software based default
>>>> pool */
>>>> +if (rte_errno == -EOPNOTSUPP) {
>>>> +RTE_LOG(WARNING, MBUF, "Default HW Mempool not supported. "
>>>> +"falling back to sw mempool \"ring_mp_mc\"");
>>>> +rte_errno = rte_mempool_set_ops_byname(mp, "ring_mp_mc",
>>>> NULL);
>>>> +}
>>>> +
>>>>  if (rte_errno != 0) {
>>>>  RTE_LOG(ERR, MBUF, "error setting mempool handler\n");
>>>>  return NULL;
>>>>
>>>
>>> Without adding a new method ".supported()", the first call to
>>> rte_mempool_populate() could return the same error ENOTSUP. In this
>>> case, it is still possible to fallback.
>>>
>> It will be a bit late.
>>
>> On failure, then we have to set the default ops and do a goto before
>> rte_pktmbuf_pool_init(mp, &mbp_priv);

I still think we can do the job without adding the .supported() method.
The following code is just an (untested) example:

struct rte_mempool *
rte_pktmbuf_pool_create(const char *name, unsigned n,
unsigned cache_size, uint16_t priv_size, uint16_t data_room_size,
int socket_id)
{
struct rte_mempool *mp;
struct rte_pktmbuf_pool_private mbp_priv;
unsigned elt_size;
int ret;
const char *ops[] = {
RTE_MBUF_DEFAULT_MEMPOOL_OPS, "ring_mp_mc", NULL,
};
const char **op;

if (RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) != priv_size) {
RTE_LOG(ERR, MBUF, "mbuf priv_size=%u is not aligned\n",
priv_size);
rte_errno = EINVAL;
return NULL;
}
elt_size = sizeof(struct rte_mbuf) + (unsigned)priv_size +
(unsigned)data_room_size;
mbp_priv.mbuf_data_room_size = data_room_size;
mbp_priv.mbuf_priv_size = priv_size;

for (op = &ops[0]; *op != NULL; op++) {
mp = rte_mempool_create_empty(name, n, elt_size, cache_size,
sizeof(struct rte_pktmbuf_pool_private), socket_id, 0);
if (mp == NULL)
return NULL;

ret = rte_mempool_set_ops_byname(mp, *op, NULL);
if (ret != 0) {
RTE_LOG(ERR, MBUF, "error setting mempool handler\n");
rte_mempool_free(mp);
if (ret == -ENOTSUP)
continue;
rte_errno = -ret;
return NULL;
}
rte_pktmbuf_pool_init(mp, &mbp_priv);

ret = rte_mempool_populate_default(mp);
if (ret < 0) {
rte_mempool_free(mp);
if (ret == -ENOTSUP)
continue;
rte_errno = -ret;
return NULL;
}

/* on success, stop trying further mempool handlers */
break;
}

if (*op == NULL) {
/* no usable mempool handler was found */
rte_errno = ENOTSUP;
return NULL;
}

rte_mempool_obj_iter(mp, rte_pktmbuf_init, NULL);

return mp;
}


>>> I've just submitted an RFC, which I think is quite linked:
>>> http://dpdk.org/ml/archives/dev/2016-September/046974.html
>>> Assuming a new parameter "mempool_ops" is added to
>>> rte_pktmbuf_pool_create(), would it make sense to fallback to
>>> "ring_mp_mc"? What about just returning ENOTSUP? The application could
>>> do the job and decide which sw fallback to use.
>>
>> We ran into this issue when trying to run the standard DPDK

[dpdk-dev] [PATCH] mempool: Add sanity check when secondary link in less mempools than primary

2016-10-14 Thread Olivier Matz
Hi Jean,

On 10/12/2016 10:04 PM, Jean Tourrilhes wrote:
> mempool: Add sanity check when secondary link in less mempools than primary
> 
> If the primary and secondary process were built using different build
> systems, the list of constructors included by the linker in each
> binary might be different. Mempools are registered via constructors, so
> the linker magic will directly impact which tailqs are registered with
> the primary and the secondary.
> 
> DPDK currently assumes that the secondary has a superset of the
> mempools registered at the primary, and they are in the same order
> (same index in primary and secondary). In some build scenario, the
> secondary might not initialise any mempools at all. This would result
> in an obscure segfault when trying to use the mempool. Instead, fail
> early with a more explicit error message.
> 
> Signed-off-by: Jean Tourrilhes 
> ---
>  lib/librte_mempool/rte_mempool.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/lib/librte_mempool/rte_mempool.c 
> b/lib/librte_mempool/rte_mempool.c
> index 2e28e2e..4fe9158 100644
> --- a/lib/librte_mempool/rte_mempool.c
> +++ b/lib/librte_mempool/rte_mempool.c
> @@ -1275,6 +1275,16 @@ rte_mempool_lookup(const char *name)
>   return NULL;
>   }
>  
> + /* Sanity check : secondary may have initialised less mempools
> +  * than primary due to linker and constructor magic. Note that
> +  * this does not address the case where the constructor order
> +  * is different between primary and secondary and where the index
> +  * points to the wrong ops. Jean II */
> + if(mp->ops_index >= (int32_t) rte_mempool_ops_table.num_ops) {
> + /* Do not dump mempool list, it will segfault. */
> + rte_panic("Cannot find ops for mempool, ops_index %d, num_ops 
> %d - maybe due to build process or linker configuration\n", mp->ops_index, 
> rte_mempool_ops_table.num_ops);
> + }
> +
>   return mp;
>  }
>  
> 

I'm not really a fan of this. I think the configuration and build system
of primary and secondaries should be the same to avoid this kind of
issues. Some other issues may happen if the configuration is different,
for instance the size of structures may be different.

There is already a lot of mess due to primary/secondary at many places
in the code, I'm not sure adding more is really desirable.

Regards,
Olivier


[dpdk-dev] [PATCH v3 12/12] net/virtio: add Tso support

2016-10-13 Thread Olivier Matz


On 10/13/2016 08:50 PM, Thomas Monjalon wrote:
> 2016-10-14 00:05, Yuanhan Liu:
>> On Thu, Oct 13, 2016 at 04:16:11PM +0200, Olivier Matz wrote:
>>> +/* When doing TSO, the IP length is not included in the pseudo header
>>> + * checksum of the packet given to the PMD, but for virtio it is
>>> + * expected.
>>> + */
>>> +static void
>>> +virtio_tso_fix_cksum(struct rte_mbuf *m)
>>> +{
>>> +   /* common case: header is not fragmented */
>>> +   if (likely(rte_pktmbuf_data_len(m) >= m->l2_len + m->l3_len +
>>> +   m->l4_len)) {
>>> +   struct ipv4_hdr *iph;
>>> +   struct ipv6_hdr *ip6h;
>>> +   struct tcp_hdr *th;
>>> +   uint16_t prev_cksum, new_cksum, ip_len, ip_paylen;
>>> +   uint32_t tmp;
>> ...
>>> +   } else {
>>
>> As discussed just now, if you drop the else part, you could add my
>> ACK for the whole virtio changes, and Review-ed by for all mbuf and
>> other changes.
>>
>> Thoams, please pick them by youself directly: since it depends on
>> other patches and they will be picked (or already be picked?) by you.
> 
> Applied
>   - without TSO checksum on fragmented header
>   - with some release notes changes
>   - with Yuanhan acked/reviewed
> Thanks
> 

Thanks Thomas, and also to Xiao, Maxime and Yuanhan for the review!


[dpdk-dev] [PATCH v2 12/12] virtio: add Tso support

2016-10-13 Thread Olivier Matz


On 13 October 2016, 17:29:35 CEST, Yuanhan Liu wrote:
>On Thu, Oct 13, 2016 at 05:15:24PM +0200, Olivier MATZ wrote:
>> 
>> 
>> On 10/13/2016 05:01 PM, Yuanhan Liu wrote:
>> >On Thu, Oct 13, 2016 at 04:52:25PM +0200, Olivier MATZ wrote:
>> >>
>> >>
>> >>On 10/13/2016 04:16 PM, Yuanhan Liu wrote:
>> >>>On Thu, Oct 13, 2016 at 04:02:49PM +0200, Olivier MATZ wrote:
>> >>>>
>> >>>>
>> >>>>On 10/13/2016 10:18 AM, Yuanhan Liu wrote:
>> >>>>>On Mon, Oct 03, 2016 at 11:00:23AM +0200, Olivier Matz wrote:
>> >>>>>>+/* When doing TSO, the IP length is not included in the pseudo
>header
>> >>>>>>+ * checksum of the packet given to the PMD, but for virtio it
>is
>> >>>>>>+ * expected.
>> >>>>>>+ */
>> >>>>>>+static void
>> >>>>>>+virtio_tso_fix_cksum(struct rte_mbuf *m)
>> >>>>>>+{
>> >>>>>>+  /* common case: header is not fragmented */
>> >>>>>>+  if (likely(rte_pktmbuf_data_len(m) >= m->l2_len + m->l3_len +
>> >>>>>>+  m->l4_len)) {
>> >>>>>...
>> >>>>>>+  /* replace it in the packet */
>> >>>>>>+  th->cksum = new_cksum;
>> >>>>>>+  } else {
>> >>>>>...
>> >>>>>>+  /* replace it in the packet */
>> >>>>>>+  *rte_pktmbuf_mtod_offset(m, uint8_t *,
>> >>>>>>+  m->l2_len + m->l3_len + 16) = new_cksum.u8[0];
>> >>>>>>+  *rte_pktmbuf_mtod_offset(m, uint8_t *,
>> >>>>>>+  m->l2_len + m->l3_len + 17) = new_cksum.u8[1];
>> >>>>>>+  }
>> >>>>>
>> >>>>>The tcp header will always be in the mbuf, right? Otherwise, you
>can't
>> >>>>>update the cksum field here. What's the point of introducing the
>"else
>> >>>>>clause" then?
>> >>>>
>> >>>>Sorry, I don't see the problem you're pointing out here.
>> >>>>
>> >>>>What I want to solve here is to support the cases where the mbuf
>is
>> >>>>segmented in the middle of the network header (which is probably
>a rare
>> >>>>case).
>> >>>
>> >>>How it's gonna segmented?
>> >>
>> >>The mbuf is given by the application. So if the application
>generates a
>> >>segmented mbuf, it should work.
>> >>
>> >>This could happen for instance if the application uses mbuf clones
>to share
>> >>the IP/TCP/data part of the mbuf and prepend a specific
>Ethernet/vlan for
>> >>different destination.
>> >>
>> >>
>> >>>>In the "else" part, I only access the mbuf byte by byte using the
>> >>>>rte_pktmbuf_mtod_offset() accessor. An alternative would have
>been to copy
>> >>>>the header in a linear buffer, fix the checksum, then copy it
>again in the
>> >>>>packet, but there is no mbuf helpers to do these copies for now.
>> >>>
>> >>>In the "else" clause, the ip header is still in the mbuf, right?
>> >>>Why do you have to access it the way like:
>> >>>
>> >>>  ip_version = *rte_pktmbuf_mtod_offset(m, uint8_t *,
>> >>>  m->l2_len) >> 4;
>> >>>
>> >>>Why can't you just use
>> >>>
>> >>>  iph = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
>> >>>  iph->version_ihl ;
>> >>
>> >>AFAIK, there is no requirement that each network header has to be
>contiguous
>> >>in a mbuf segment.
>> >>
>> >>Of course, a split in the middle of a network header probably never
>> >>happens... but we never knows, as it is not forbidden. I think the
>code
>> >>should be robust enough to avoid accesses to wrong addresses.
>> >>
>> >>Hope it's clear enough :)
>> >
>> >Thanks, but not really. Maybe let me ask this way: what wrong would
>> >happen if we use
>> >iph = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
>> >to access the IP header? Is it about the endian?
>> 
>> If you have a packet split like this:
>> 
>> mbuf segment 1                 mbuf segment 2
>> ----------------------------   ------------------------
>> | Ethernet header |  IP hea|   |der | TCP header | data
>> ----------------------------   ------------------------
>>                      ^
>>                      iph
>
>Thanks, that's clear. How could you be able to access the tcp header
>from the first mbuf then? I mean, how is the following code supposed
>to work?
>
>prev_cksum.u8[0] = *rte_pktmbuf_mtod_offset(m, uint8_t *,
>   m->l2_len + m->l3_len + 16);
>

Oh I see... Sorry there was a confusion on my side with another (internal) 
macro that browses the segments if the offset is not in the first one.

If you agree, let's add the code without the else part, I'll fix it for the rc2.


>> The IP header is not contiguous. So accessing the end of the structure
>> will access a wrong location.
>> 
>> >One more question is do you have any case to trigger the "else"
>clause?
>> 
>> No, but I think it may happen.
>
>A piece of untested code is not trusted though ...
>
>   --yliu



[dpdk-dev] [PATCH v2 12/12] virtio: add Tso support

2016-10-13 Thread Olivier MATZ


On 10/13/2016 05:01 PM, Yuanhan Liu wrote:
> On Thu, Oct 13, 2016 at 04:52:25PM +0200, Olivier MATZ wrote:
>>
>>
>> On 10/13/2016 04:16 PM, Yuanhan Liu wrote:
>>> On Thu, Oct 13, 2016 at 04:02:49PM +0200, Olivier MATZ wrote:
>>>>
>>>>
>>>> On 10/13/2016 10:18 AM, Yuanhan Liu wrote:
>>>>> On Mon, Oct 03, 2016 at 11:00:23AM +0200, Olivier Matz wrote:
>>>>>> +/* When doing TSO, the IP length is not included in the pseudo header
>>>>>> + * checksum of the packet given to the PMD, but for virtio it is
>>>>>> + * expected.
>>>>>> + */
>>>>>> +static void
>>>>>> +virtio_tso_fix_cksum(struct rte_mbuf *m)
>>>>>> +{
>>>>>> +/* common case: header is not fragmented */
>>>>>> +if (likely(rte_pktmbuf_data_len(m) >= m->l2_len + m->l3_len +
>>>>>> +m->l4_len)) {
>>>>> ...
>>>>>> +/* replace it in the packet */
>>>>>> +th->cksum = new_cksum;
>>>>>> +} else {
>>>>> ...
>>>>>> +/* replace it in the packet */
>>>>>> +*rte_pktmbuf_mtod_offset(m, uint8_t *,
>>>>>> +m->l2_len + m->l3_len + 16) = new_cksum.u8[0];
>>>>>> +*rte_pktmbuf_mtod_offset(m, uint8_t *,
>>>>>> +m->l2_len + m->l3_len + 17) = new_cksum.u8[1];
>>>>>> +}
>>>>>
>>>>> The tcp header will always be in the mbuf, right? Otherwise, you can't
>>>>> update the cksum field here. What's the point of introducing the "else
>>>>> clause" then?
>>>>
>>>> Sorry, I don't see the problem you're pointing out here.
>>>>
>>>> What I want to solve here is to support the cases where the mbuf is
>>>> segmented in the middle of the network header (which is probably a rare
>>>> case).
>>>
>>> How it's gonna segmented?
>>
>> The mbuf is given by the application. So if the application generates a
>> segmented mbuf, it should work.
>>
>> This could happen for instance if the application uses mbuf clones to share
>> the IP/TCP/data part of the mbuf and prepend a specific Ethernet/vlan for
>> different destination.
>>
>>
>>>> In the "else" part, I only access the mbuf byte by byte using the
>>>> rte_pktmbuf_mtod_offset() accessor. An alternative would have been to copy
>>>> the header in a linear buffer, fix the checksum, then copy it again in the
>>>> packet, but there is no mbuf helpers to do these copies for now.
>>>
>>> In the "else" clause, the ip header is still in the mbuf, right?
>>> Why do you have to access it the way like:
>>>
>>> ip_version = *rte_pktmbuf_mtod_offset(m, uint8_t *,
>>> m->l2_len) >> 4;
>>>
>>> Why can't you just use
>>>
>>> iph = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
>>> iph->version_ihl ;
>>
>> AFAIK, there is no requirement that each network header has to be contiguous
>> in a mbuf segment.
>>
>> Of course, a split in the middle of a network header probably never
>> happens... but we never knows, as it is not forbidden. I think the code
>> should be robust enough to avoid accesses to wrong addresses.
>>
>> Hope it's clear enough :)
>
> Thanks, but not really. Maybe let me ask this way: what wrong would
> happen if we use
>   iph = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
> to access the IP header? Is it about the endian?

If you have a packet split like this:

mbuf segment 1                 mbuf segment 2
----------------------------   ------------------------
| Ethernet header |  IP hea|   |der | TCP header | data
----------------------------   ------------------------
                     ^
                     iph

The IP header is not contiguous. So accessing the end of the
structure will access a wrong location.
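
For the record, the usual way to cope with such a split is to linearize
the header into a stack buffer first, e.g. with rte_pktmbuf_read()
(sketch only):

	struct ipv4_hdr copy;
	const struct ipv4_hdr *iph;
	uint16_t ip_len;

	/* returns a pointer inside the mbuf if the header is contiguous,
	 * else copies the requested bytes into 'copy' and returns it */
	iph = rte_pktmbuf_read(m, m->l2_len, sizeof(copy), &copy);
	if (iph != NULL && (iph->version_ihl >> 4) == 4)
		ip_len = rte_be_to_cpu_16(iph->total_length); /* safe read */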

> One more question is do you have any case to trigger the "else" clause?

No, but I think it may happen.

Olivier


[dpdk-dev] [PATCH] mempool: fix search of maximum contiguous pages

2016-10-13 Thread Olivier MATZ
Hi Wei,

On 10/13/2016 02:31 PM, Ananyev, Konstantin wrote:
>
>>
> diff --git a/lib/librte_mempool/rte_mempool.c
> b/lib/librte_mempool/rte_mempool.c
> index 71017e1..e3e254a 100644
> --- a/lib/librte_mempool/rte_mempool.c
> +++ b/lib/librte_mempool/rte_mempool.c
> @@ -426,9 +426,12 @@ rte_mempool_populate_phys_tab(struct
> rte_mempool *mp, char *vaddr,
>
>   for (i = 0; i < pg_num && mp->populated_size < mp->size; i += 
> n) {
>
> + phys_addr_t paddr_next;
> + paddr_next = paddr[i] + pg_sz;
> +
>   /* populate with the largest group of contiguous pages 
> */
>   for (n = 1; (i + n) < pg_num &&
> -  paddr[i] + pg_sz == paddr[i+n]; n++)
> +  paddr_next == paddr[i+n]; n++, paddr_next += pg_sz)
>   ;

 Good catch.
 Why not just paddr[i + n - 1] != paddr[i + n]?
>>>
>>> Sorry, I meant 'paddr[i + n - 1] + pg_sz == paddr[i+n]' off course.
>>>
 Then you don't need extra variable (paddr_next) here.
 Konstantin
>>
>> Thank you, Konstantin
>> 'paddr[i + n - 1] + pg_sz == paddr[i + n]' can also fix it and has a
>> straightforward meaning.
>> But I assume that my revision with paddr_next += pg_sz may have a bit better 
>> performance.
>
> I don't think there would be any real difference, again it is not performance 
> critical code-path.
>
>> By the way, paddr[i] + n * pg_sz == paddr[i + n] can also resolve it.
>
> Yes, that one seems even better to me - it makes things more clear.

Thank you for fixing this.

My vote would go for "paddr[i + n - 1] + pg_sz == paddr[i + n]".

If you feel "paddr[i] + n * pg_sz == paddr[i + n]" is clearer, I have no
problem with it either.
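
With that condition, the fixed loop would read:

	/* populate with the largest group of contiguous pages */
	for (n = 1; (i + n) < pg_num &&
	     paddr[i + n - 1] + pg_sz == paddr[i + n]; n++)
		;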

Regards,
Olivier


[dpdk-dev] [PATCH v2 12/12] virtio: add Tso support

2016-10-13 Thread Olivier MATZ


On 10/13/2016 04:16 PM, Yuanhan Liu wrote:
> On Thu, Oct 13, 2016 at 04:02:49PM +0200, Olivier MATZ wrote:
>>
>>
>> On 10/13/2016 10:18 AM, Yuanhan Liu wrote:
>>> On Mon, Oct 03, 2016 at 11:00:23AM +0200, Olivier Matz wrote:
>>>> +/* When doing TSO, the IP length is not included in the pseudo header
>>>> + * checksum of the packet given to the PMD, but for virtio it is
>>>> + * expected.
>>>> + */
>>>> +static void
>>>> +virtio_tso_fix_cksum(struct rte_mbuf *m)
>>>> +{
>>>> +  /* common case: header is not fragmented */
>>>> +  if (likely(rte_pktmbuf_data_len(m) >= m->l2_len + m->l3_len +
>>>> +  m->l4_len)) {
>>> ...
>>>> +  /* replace it in the packet */
>>>> +  th->cksum = new_cksum;
>>>> +  } else {
>>> ...
>>>> +  /* replace it in the packet */
>>>> +  *rte_pktmbuf_mtod_offset(m, uint8_t *,
>>>> +  m->l2_len + m->l3_len + 16) = new_cksum.u8[0];
>>>> +  *rte_pktmbuf_mtod_offset(m, uint8_t *,
>>>> +  m->l2_len + m->l3_len + 17) = new_cksum.u8[1];
>>>> +  }
>>>
>>> The tcp header will always be in the mbuf, right? Otherwise, you can't
>>> update the cksum field here. What's the point of introducing the "else
>>> clause" then?
>>
>> Sorry, I don't see the problem you're pointing out here.
>>
>> What I want to solve here is to support the cases where the mbuf is
>> segmented in the middle of the network header (which is probably a rare
>> case).
>
> How it's gonna segmented?

The mbuf is given by the application. So if the application generates a 
segmented mbuf, it should work.

This could happen for instance if the application uses mbuf clones to 
share the IP/TCP/data part of the mbuf and prepend a specific 
Ethernet/vlan for different destination.


>> In the "else" part, I only access the mbuf byte by byte using the
>> rte_pktmbuf_mtod_offset() accessor. An alternative would have been to copy
>> the header in a linear buffer, fix the checksum, then copy it again in the
>> packet, but there is no mbuf helpers to do these copies for now.
>
> In the "else" clause, the ip header is still in the mbuf, right?
> Why do you have to access it the way like:
>
>   ip_version = *rte_pktmbuf_mtod_offset(m, uint8_t *,
>   m->l2_len) >> 4;
>
> Why can't you just use
>
>   iph = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
>   iph->version_ihl ;

AFAIK, there is no requirement that each network header has to be 
contiguous in a mbuf segment.

Of course, a split in the middle of a network header probably never 
happens... but we never know, as it is not forbidden. I think the code
should be robust enough to avoid accesses to wrong addresses.

Hope it's clear enough :)

Thanks
Olivier


[dpdk-dev] [PATCH v3 12/12] net/virtio: add Tso support

2016-10-13 Thread Olivier Matz
Signed-off-by: Olivier Matz 
---
 drivers/net/virtio/virtio_ethdev.c |   6 ++
 drivers/net/virtio/virtio_ethdev.h |   2 +
 drivers/net/virtio/virtio_rxtx.c   | 133 +++--
 3 files changed, 136 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c 
b/drivers/net/virtio/virtio_ethdev.c
index 109f855..969edb6 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1572,6 +1572,7 @@ virtio_dev_link_update(struct rte_eth_dev *dev, 
__rte_unused int wait_to_complet
 static void
 virtio_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 {
+   uint64_t tso_mask;
struct virtio_hw *hw = dev->data->dev_private;

if (dev->pci_dev)
@@ -1599,6 +1600,11 @@ virtio_dev_info_get(struct rte_eth_dev *dev, struct 
rte_eth_dev_info *dev_info)
DEV_TX_OFFLOAD_UDP_CKSUM |
DEV_TX_OFFLOAD_TCP_CKSUM;
}
+
+   tso_mask = (1ULL << VIRTIO_NET_F_HOST_TSO4) |
+   (1ULL << VIRTIO_NET_F_HOST_TSO6);
+   if ((hw->guest_features & tso_mask) == tso_mask)
+   dev_info->tx_offload_capa |= DEV_TX_OFFLOAD_TCP_TSO;
 }

 /*
diff --git a/drivers/net/virtio/virtio_ethdev.h 
b/drivers/net/virtio/virtio_ethdev.h
index d55e7ed..f77f618 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -63,6 +63,8 @@
 1u << VIRTIO_NET_F_CTRL_RX   | \
 1u << VIRTIO_NET_F_CTRL_VLAN | \
 1u << VIRTIO_NET_F_CSUM  | \
+1u << VIRTIO_NET_F_HOST_TSO4 | \
+1u << VIRTIO_NET_F_HOST_TSO6 | \
 1u << VIRTIO_NET_F_MRG_RXBUF | \
 1u << VIRTIO_RING_F_INDIRECT_DESC |\
 1ULL << VIRTIO_F_VERSION_1)
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 0fa635a..4b01ea3 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -209,10 +209,117 @@ virtqueue_enqueue_recv_refill(struct virtqueue *vq, 
struct rte_mbuf *cookie)
return 0;
 }

+/* When doing TSO, the IP length is not included in the pseudo header
+ * checksum of the packet given to the PMD, but for virtio it is
+ * expected.
+ */
+static void
+virtio_tso_fix_cksum(struct rte_mbuf *m)
+{
+   /* common case: header is not fragmented */
+   if (likely(rte_pktmbuf_data_len(m) >= m->l2_len + m->l3_len +
+   m->l4_len)) {
+   struct ipv4_hdr *iph;
+   struct ipv6_hdr *ip6h;
+   struct tcp_hdr *th;
+   uint16_t prev_cksum, new_cksum, ip_len, ip_paylen;
+   uint32_t tmp;
+
+   iph = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
+   th = RTE_PTR_ADD(iph, m->l3_len);
+   if ((iph->version_ihl >> 4) == 4) {
+   iph->hdr_checksum = 0;
+   iph->hdr_checksum = rte_ipv4_cksum(iph);
+   ip_len = iph->total_length;
+   ip_paylen = rte_cpu_to_be_16(rte_be_to_cpu_16(ip_len) -
+   m->l3_len);
+   } else {
+   ip6h = (struct ipv6_hdr *)iph;
+   ip_paylen = ip6h->payload_len;
+   }
+
+   /* calculate the new phdr checksum not including ip_paylen */
+   prev_cksum = th->cksum;
+   tmp = prev_cksum;
+   tmp += ip_paylen;
+   tmp = (tmp & 0xffff) + (tmp >> 16);
+   new_cksum = tmp;
+
+   /* replace it in the packet */
+   th->cksum = new_cksum;
+   } else {
+   const struct ipv4_hdr *iph;
+   struct ipv4_hdr iph_copy;
+   union {
+   uint16_t u16;
+   uint8_t u8[2];
+   } prev_cksum, new_cksum, ip_len, ip_paylen, ip_csum;
+   uint32_t tmp;
+
+   /* Same code as above, but we use rte_pktmbuf_read()
+* or we read/write in mbuf data one byte at a time to
+* avoid issues if the packet is multi segmented.
+*/
+
+   uint8_t ip_version;
+
+   ip_version = *rte_pktmbuf_mtod_offset(m, uint8_t *,
+   m->l2_len) >> 4;
+
+   /* calculate ip checksum (API imposes to set it to 0)
+* and get ip payload len */
+   if (ip_version == 4) {
+   *rte_pktmbuf_mtod_offset(m, uint8_t *,
+   m->l2_len + 10) = 0;
+   *rte_pktmbuf_mtod_offset(m, uint8_t *,
+   m->l2_len + 11) = 0;
+   iph = rte_pktmbuf_read(m, m->l2_len,
+   

[dpdk-dev] [PATCH v3 11/12] net/virtio: add Lro support

2016-10-13 Thread Olivier Matz
Signed-off-by: Olivier Matz 
Reviewed-by: Maxime Coquelin 
---
 drivers/net/virtio/virtio_ethdev.c | 15 ++-
 drivers/net/virtio/virtio_ethdev.h |  9 -
 drivers/net/virtio/virtio_rxtx.c   | 25 -
 3 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c 
b/drivers/net/virtio/virtio_ethdev.c
index c3c53be..109f855 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1348,6 +1348,10 @@ virtio_dev_configure(struct rte_eth_dev *dev)
req_features = VIRTIO_PMD_DEFAULT_GUEST_FEATURES;
if (rxmode->hw_ip_checksum)
req_features |= (1ULL << VIRTIO_NET_F_GUEST_CSUM);
+   if (rxmode->enable_lro)
+   req_features |=
+   (1ULL << VIRTIO_NET_F_GUEST_TSO4) |
+   (1ULL << VIRTIO_NET_F_GUEST_TSO6);

/* if request features changed, reinit the device */
if (req_features != hw->req_guest_features) {
@@ -1363,6 +1367,14 @@ virtio_dev_configure(struct rte_eth_dev *dev)
return -ENOTSUP;
}

+   if (rxmode->enable_lro &&
+   (!vtpci_with_feature(hw, VIRTIO_NET_F_GUEST_TSO4) ||
+   !vtpci_with_feature(hw, VIRTIO_NET_F_GUEST_TSO6))) {
+   PMD_DRV_LOG(NOTICE,
+   "lro not available on this host");
+   return -ENOTSUP;
+   }
+
/* Setup and start control queue */
if (vtpci_with_feature(hw, VIRTIO_NET_F_CTRL_VQ)) {
ret = virtio_dev_cq_queue_setup(dev,
@@ -1578,7 +1590,8 @@ virtio_dev_info_get(struct rte_eth_dev *dev, struct 
rte_eth_dev_info *dev_info)
};
dev_info->rx_offload_capa =
DEV_RX_OFFLOAD_TCP_CKSUM |
-   DEV_RX_OFFLOAD_UDP_CKSUM;
+   DEV_RX_OFFLOAD_UDP_CKSUM |
+   DEV_RX_OFFLOAD_TCP_LRO;
dev_info->tx_offload_capa = 0;

if (hw->guest_features & (1ULL << VIRTIO_NET_F_CSUM)) {
diff --git a/drivers/net/virtio/virtio_ethdev.h 
b/drivers/net/virtio/virtio_ethdev.h
index adca6ba..d55e7ed 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -117,13 +117,4 @@ uint16_t virtio_xmit_pkts_simple(void *tx_queue, struct 
rte_mbuf **tx_pkts,

 int eth_virtio_dev_init(struct rte_eth_dev *eth_dev);

-/*
- * The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
- * frames larger than 1514 bytes. We do not yet support software LRO
- * via tcp_lro_rx().
- */
-#define VTNET_LRO_FEATURES (VIRTIO_NET_F_GUEST_TSO4 | \
-   VIRTIO_NET_F_GUEST_TSO6 | VIRTIO_NET_F_GUEST_ECN)
-
-
 #endif /* _VIRTIO_ETHDEV_H_ */
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 675dc43..0fa635a 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -715,13 +715,36 @@ virtio_rx_offload(struct rte_mbuf *m, struct 
virtio_net_hdr *hdr)
m->ol_flags |= PKT_RX_L4_CKSUM_GOOD;
}

+   /* GSO request, save required information in mbuf */
+   if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+   /* Check unsupported modes */
+   if ((hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN) ||
+   (hdr->gso_size == 0)) {
+   return -EINVAL;
+   }
+
+   /* Update mss lengths in mbuf */
+   m->tso_segsz = hdr->gso_size;
+   switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+   case VIRTIO_NET_HDR_GSO_TCPV4:
+   case VIRTIO_NET_HDR_GSO_TCPV6:
+   m->ol_flags |= PKT_RX_LRO |
+   PKT_RX_L4_CKSUM_NONE;
+   break;
+   default:
+   return -EINVAL;
+   }
+   }
+
return 0;
 }

 static inline int
 rx_offload_enabled(struct virtio_hw *hw)
 {
-   return vtpci_with_feature(hw, VIRTIO_NET_F_GUEST_CSUM);
+   return vtpci_with_feature(hw, VIRTIO_NET_F_GUEST_CSUM) ||
+   vtpci_with_feature(hw, VIRTIO_NET_F_GUEST_TSO4) ||
+   vtpci_with_feature(hw, VIRTIO_NET_F_GUEST_TSO6);
 }

 #define VIRTIO_MBUF_BURST_SZ 64
-- 
2.8.1



[dpdk-dev] [PATCH v3 10/12] net/virtio: add Tx checksum offload support

2016-10-13 Thread Olivier Matz
Signed-off-by: Olivier Matz 
---
 drivers/net/virtio/virtio_ethdev.c |  7 
 drivers/net/virtio/virtio_ethdev.h |  1 +
 drivers/net/virtio/virtio_rxtx.c   | 73 +++---
 3 files changed, 61 insertions(+), 20 deletions(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c 
b/drivers/net/virtio/virtio_ethdev.c
index 00b4c38..c3c53be 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1579,6 +1579,13 @@ virtio_dev_info_get(struct rte_eth_dev *dev, struct 
rte_eth_dev_info *dev_info)
dev_info->rx_offload_capa =
DEV_RX_OFFLOAD_TCP_CKSUM |
DEV_RX_OFFLOAD_UDP_CKSUM;
+   dev_info->tx_offload_capa = 0;
+
+   if (hw->guest_features & (1ULL << VIRTIO_NET_F_CSUM)) {
+   dev_info->tx_offload_capa |=
+   DEV_TX_OFFLOAD_UDP_CKSUM |
+   DEV_TX_OFFLOAD_TCP_CKSUM;
+   }
 }

 /*
diff --git a/drivers/net/virtio/virtio_ethdev.h 
b/drivers/net/virtio/virtio_ethdev.h
index fd29a7f..adca6ba 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -62,6 +62,7 @@
 1u << VIRTIO_NET_F_CTRL_VQ   | \
 1u << VIRTIO_NET_F_CTRL_RX   | \
 1u << VIRTIO_NET_F_CTRL_VLAN | \
+1u << VIRTIO_NET_F_CSUM  | \
 1u << VIRTIO_NET_F_MRG_RXBUF | \
 1u << VIRTIO_RING_F_INDIRECT_DESC |\
 1ULL << VIRTIO_F_VERSION_1)
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index fc0d84b..675dc43 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -53,6 +53,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 #include "virtio_logs.h"
 #include "virtio_ethdev.h"
@@ -207,18 +209,27 @@ virtqueue_enqueue_recv_refill(struct virtqueue *vq, 
struct rte_mbuf *cookie)
return 0;
 }

+static inline int
+tx_offload_enabled(struct virtio_hw *hw)
+{
+   return vtpci_with_feature(hw, VIRTIO_NET_F_CSUM);
+}
+
 static inline void
 virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
   uint16_t needed, int use_indirect, int can_push)
 {
+   struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
struct vq_desc_extra *dxp;
struct virtqueue *vq = txvq->vq;
struct vring_desc *start_dp;
uint16_t seg_num = cookie->nb_segs;
uint16_t head_idx, idx;
uint16_t head_size = vq->hw->vtnet_hdr_size;
-   unsigned long offs;
+   struct virtio_net_hdr *hdr;
+   int offload;

+   offload = tx_offload_enabled(vq->hw);
head_idx = vq->vq_desc_head_idx;
idx = head_idx;
dxp = &vq->vq_descx[idx];
@@ -228,10 +239,12 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct 
rte_mbuf *cookie,
start_dp = vq->vq_ring.desc;

if (can_push) {
-   /* put on zero'd transmit header (no offloads) */
-   void *hdr = rte_pktmbuf_prepend(cookie, head_size);
-
-   memset(hdr, 0, head_size);
+   /* prepend cannot fail, checked by caller */
+   hdr = (struct virtio_net_hdr *)
+   rte_pktmbuf_prepend(cookie, head_size);
+   /* if offload disabled, it is not zeroed below, do it now */
+   if (offload == 0)
+   memset(hdr, 0, head_size);
} else if (use_indirect) {
/* setup tx ring slot to point to indirect
 * descriptor list stored in reserved region.
@@ -239,14 +252,11 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct 
rte_mbuf *cookie,
 * the first slot in indirect ring is already preset
 * to point to the header in reserved region
 */
-   struct virtio_tx_region *txr = txvq->virtio_net_hdr_mz->addr;
-
-   offs = idx * sizeof(struct virtio_tx_region)
-   + offsetof(struct virtio_tx_region, tx_indir);
-
-   start_dp[idx].addr  = txvq->virtio_net_hdr_mem + offs;
+   start_dp[idx].addr  = txvq->virtio_net_hdr_mem +
+   RTE_PTR_DIFF(&txr[idx].tx_indir, txr);
start_dp[idx].len   = (seg_num + 1) * sizeof(struct vring_desc);
start_dp[idx].flags = VRING_DESC_F_INDIRECT;
+   hdr = (struct virtio_net_hdr *)&txr[idx].tx_hdr;

/* loop below will fill in rest of the indirect elements */
start_dp = txr[idx].tx_indir;
@@ -255,15 +265,43 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct 
rte_mbuf *cookie,
/* setup first tx ring slot to point to header
 * stored in reserved region.
 */
-   offs = idx * sizeof(struct virti

[dpdk-dev] [PATCH v3 09/12] net/virtio: add Rx checksum offload support

2016-10-13 Thread Olivier Matz
Signed-off-by: Olivier Matz 
---
 drivers/net/virtio/virtio_ethdev.c | 21 ++
 drivers/net/virtio/virtio_ethdev.h |  2 +-
 drivers/net/virtio/virtio_rxtx.c   | 79 ++
 drivers/net/virtio/virtqueue.h |  1 +
 4 files changed, 95 insertions(+), 8 deletions(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c 
b/drivers/net/virtio/virtio_ethdev.c
index b5bc0ee..00b4c38 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1262,7 +1262,7 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
eth_dev->data->dev_flags = dev_flags;

/* reset device and negotiate default features */
-   ret = virtio_init_device(eth_dev, VIRTIO_PMD_GUEST_FEATURES);
+   ret = virtio_init_device(eth_dev, VIRTIO_PMD_DEFAULT_GUEST_FEATURES);
if (ret < 0)
return ret;

@@ -1345,13 +1345,10 @@ virtio_dev_configure(struct rte_eth_dev *dev)
int ret;

PMD_INIT_LOG(DEBUG, "configure");
+   req_features = VIRTIO_PMD_DEFAULT_GUEST_FEATURES;
+   if (rxmode->hw_ip_checksum)
+   req_features |= (1ULL << VIRTIO_NET_F_GUEST_CSUM);

-   if (rxmode->hw_ip_checksum) {
-   PMD_DRV_LOG(ERR, "HW IP checksum not supported");
-   return -EINVAL;
-   }
-
-   req_features = VIRTIO_PMD_GUEST_FEATURES;
/* if request features changed, reinit the device */
if (req_features != hw->req_guest_features) {
ret = virtio_init_device(dev, req_features);
@@ -1359,6 +1356,13 @@ virtio_dev_configure(struct rte_eth_dev *dev)
return ret;
}

+   if (rxmode->hw_ip_checksum &&
+   !vtpci_with_feature(hw, VIRTIO_NET_F_GUEST_CSUM)) {
+   PMD_DRV_LOG(NOTICE,
+   "rx ip checksum not available on this host");
+   return -ENOTSUP;
+   }
+
/* Setup and start control queue */
if (vtpci_with_feature(hw, VIRTIO_NET_F_CTRL_VQ)) {
ret = virtio_dev_cq_queue_setup(dev,
@@ -1572,6 +1576,9 @@ virtio_dev_info_get(struct rte_eth_dev *dev, struct 
rte_eth_dev_info *dev_info)
dev_info->default_txconf = (struct rte_eth_txconf) {
.txq_flags = ETH_TXQ_FLAGS_NOOFFLOADS
};
+   dev_info->rx_offload_capa =
+   DEV_RX_OFFLOAD_TCP_CKSUM |
+   DEV_RX_OFFLOAD_UDP_CKSUM;
 }

 /*
diff --git a/drivers/net/virtio/virtio_ethdev.h 
b/drivers/net/virtio/virtio_ethdev.h
index dc18341..fd29a7f 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -54,7 +54,7 @@
 #define VIRTIO_MAX_RX_PKTLEN  9728

 /* Features desired/implemented by this driver. */
-#define VIRTIO_PMD_GUEST_FEATURES  \
+#define VIRTIO_PMD_DEFAULT_GUEST_FEATURES  \
(1u << VIRTIO_NET_F_MAC   | \
 1u << VIRTIO_NET_F_STATUS| \
 1u << VIRTIO_NET_F_MQ| \
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 9ab441b..fc0d84b 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -51,6 +51,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 #include "virtio_logs.h"
 #include "virtio_ethdev.h"
@@ -632,6 +634,63 @@ virtio_update_packet_stats(struct virtnet_stats *stats, 
struct rte_mbuf *mbuf)
}
 }

+/* Optionally fill offload information in structure */
+static int
+virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
+{
+   struct rte_net_hdr_lens hdr_lens;
+   uint32_t hdrlen, ptype;
+   int l4_supported = 0;
+
+   /* nothing to do */
+   if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
+   return 0;
+
+   m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
+
+   ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+   m->packet_type = ptype;
+   if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+   (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+   (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+   l4_supported = 1;
+
+   if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+   hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+   if (hdr->csum_start <= hdrlen && l4_supported) {
+   m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
+   } else {
+   /* Unknown proto or tunnel, do sw cksum. We can assume
+* the cksum field is in the first segment since the
+* buffers we provided to the host are large enough.
+* In case of SCTP, this will be wrong since it's a CRC
+* b

[dpdk-dev] [PATCH v3 08/12] app/testpmd: display lro segment size

2016-10-13 Thread Olivier Matz
In the csumonly engine, display the LRO segment size if the
LRO flag is set.

Signed-off-by: Olivier Matz 
Reviewed-by: Maxime Coquelin 
---
 app/test-pmd/csumonly.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index da15185..57e6ae2 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -822,6 +822,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
"l4_proto=%d l4_len=%d flags=%s\n",
info.l2_len, rte_be_to_cpu_16(info.ethertype),
info.l3_len, info.l4_proto, info.l4_len, buf);
+   if (rx_ol_flags & PKT_RX_LRO)
+   printf("rx: m->lro_segsz=%u\n", m->tso_segsz);
if (info.is_tunnel == 1)
printf("rx: outer_l2_len=%d outer_ethertype=%x "
"outer_l3_len=%d\n", info.outer_l2_len,
-- 
2.8.1



[dpdk-dev] [PATCH v3 07/12] mbuf: new flag for LRO

2016-10-13 Thread Olivier Matz
When receiving coalesced packets in virtio, the original size of the
segments is provided. This is useful information because it allows to
resegment with the same size.

Add a new RX flag in mbuf that can be set when packets are coalesced by
a hardware or virtual driver; in that case the m->tso_segsz field is
valid and is set to the segment size of the original packets.

This flag is used in next commits in the virtio pmd.

Signed-off-by: Olivier Matz 
Reviewed-by: Maxime Coquelin 
---
 doc/guides/rel_notes/release_16_11.rst | 5 +
 lib/librte_mbuf/rte_mbuf.c | 2 ++
 lib/librte_mbuf/rte_mbuf.h | 7 +++
 3 files changed, 14 insertions(+)

diff --git a/doc/guides/rel_notes/release_16_11.rst 
b/doc/guides/rel_notes/release_16_11.rst
index 2ec63b2..c9fcfb9 100644
--- a/doc/guides/rel_notes/release_16_11.rst
+++ b/doc/guides/rel_notes/release_16_11.rst
@@ -115,6 +115,11 @@ New Features
   good, bad, or not present (useful for virtual drivers). This modification
   was done for IP and L4.

+* **Added a LRO mbuf flag.**
+
+  Added a new RX LRO mbuf flag, used when packets are coalesced. This
+  flag indicates that the segment size of original packets is known.
+
 Resolved Issues
 ---

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 8d9b875..63f43c8 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -319,6 +319,7 @@ const char *rte_get_rx_ol_flag_name(uint64_t mask)
case PKT_RX_IEEE1588_PTP: return "PKT_RX_IEEE1588_PTP";
case PKT_RX_IEEE1588_TMST: return "PKT_RX_IEEE1588_TMST";
case PKT_RX_QINQ_STRIPPED: return "PKT_RX_QINQ_STRIPPED";
+   case PKT_RX_LRO: return "PKT_RX_LRO";
default: return NULL;
}
 }
@@ -352,6 +353,7 @@ rte_get_rx_ol_flag_list(uint64_t mask, char *buf, size_t 
buflen)
{ PKT_RX_IEEE1588_PTP, PKT_RX_IEEE1588_PTP, NULL },
{ PKT_RX_IEEE1588_TMST, PKT_RX_IEEE1588_TMST, NULL },
{ PKT_RX_QINQ_STRIPPED, PKT_RX_QINQ_STRIPPED, NULL },
+   { PKT_RX_LRO, PKT_RX_LRO, NULL },
};
const char *name;
unsigned int i;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 38022a3..f5eedda 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -170,6 +170,13 @@ extern "C" {
  */
 #define PKT_RX_QINQ_PKT  PKT_RX_QINQ_STRIPPED

+/**
+ * When packets are coalesced by a hardware or virtual driver, this flag
+ * can be set in the RX mbuf, meaning that the m->tso_segsz field is
+ * valid and is set to the segment size of original packets.
+ */
+#define PKT_RX_LRO   (1ULL << 16)
+
 /* add new RX flags here */

 /* add new TX flags here */
-- 
2.8.1



[dpdk-dev] [PATCH v3 06/12] app/testpmd: adapt checksum stats in csum engine

2016-10-13 Thread Olivier Matz
Reviewed-by: Maxime Coquelin 
---
 app/test-pmd/csumonly.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 27d0f08..da15185 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -697,8 +697,10 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
rx_ol_flags = m->ol_flags;

/* Update the L3/L4 checksum error packet statistics */
-   rx_bad_ip_csum += ((rx_ol_flags & PKT_RX_IP_CKSUM_BAD) != 0);
-   rx_bad_l4_csum += ((rx_ol_flags & PKT_RX_L4_CKSUM_BAD) != 0);
+   if ((rx_ol_flags & PKT_RX_IP_CKSUM_MASK) == PKT_RX_IP_CKSUM_BAD)
+   rx_bad_ip_csum += 1;
+   if ((rx_ol_flags & PKT_RX_L4_CKSUM_MASK) == PKT_RX_L4_CKSUM_BAD)
+   rx_bad_l4_csum += 1;

/* step 1: dissect packet, parsing optional vlan, ip4/ip6, vxlan
 * and inner headers */
-- 
2.8.1



[dpdk-dev] [PATCH v3 05/12] mbuf: add new Rx checksum mbuf flags

2016-10-13 Thread Olivier Matz
Following discussions in [1] and [2], introduce a new bit to
describe the Rx checksum status in mbuf.

Before this patch, only one flag was available:
  PKT_RX_L4_CKSUM_BAD: L4 cksum of RX pkt. is not OK.

And same for L3:
  PKT_RX_IP_CKSUM_BAD: IP cksum of RX pkt. is not OK.

This had 2 issues:
- it was not possible to differentiate "checksum good" from
  "checksum unknown".
- it was not possible for a virtual driver to say "the checksum
  in the packet may be wrong, but data integrity is valid".

This patch tries to solve this issue by having 4 states (2 bits)
for the IP and L4 Rx checksums. New values are:

 - PKT_RX_L4_CKSUM_UNKNOWN: no information about the RX L4 checksum
   -> the application should verify the checksum by sw
 - PKT_RX_L4_CKSUM_BAD: the L4 checksum in the packet is wrong
   -> the application can drop the packet without additional check
 - PKT_RX_L4_CKSUM_GOOD: the L4 checksum in the packet is valid
   -> the application can accept the packet without verifying the
  checksum by sw
 - PKT_RX_L4_CKSUM_NONE: the L4 checksum is not correct in the packet
   data, but the integrity of the L4 data is verified.
   -> the application can process the packet but must not verify the
  checksum by sw. It has to take care to recalculate the cksum
  if the packet is transmitted (either by sw or using tx offload)

  And same for L3 (replace L4 by IP in description above).
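
As an illustration, an application reacting to the new L4 states could
look like the sketch below (verify_l4_cksum_sw() is a hypothetical
helper):

	switch (m->ol_flags & PKT_RX_L4_CKSUM_MASK) {
	case PKT_RX_L4_CKSUM_GOOD:
		break;                 /* accept, no sw check needed */
	case PKT_RX_L4_CKSUM_BAD:
		rte_pktmbuf_free(m);   /* drop without additional check */
		return;
	case PKT_RX_L4_CKSUM_NONE:
		break;                 /* data valid, do not verify by sw */
	case PKT_RX_L4_CKSUM_UNKNOWN:
	default:
		verify_l4_cksum_sw(m); /* hypothetical sw verification */
		break;
	}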

This commit tries to be compatible with existing applications that
only check the existing flag (CKSUM_BAD).

[1] http://dpdk.org/ml/archives/dev/2016-May/039920.html
[2] http://dpdk.org/ml/archives/dev/2016-June/040007.html

Signed-off-by: Olivier Matz 
Reviewed-by: Maxime Coquelin 
---
 doc/guides/rel_notes/release_16_11.rst |  6 
 lib/librte_mbuf/rte_mbuf.c | 16 +--
 lib/librte_mbuf/rte_mbuf.h | 51 --
 3 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/doc/guides/rel_notes/release_16_11.rst 
b/doc/guides/rel_notes/release_16_11.rst
index fbc0cbd..2ec63b2 100644
--- a/doc/guides/rel_notes/release_16_11.rst
+++ b/doc/guides/rel_notes/release_16_11.rst
@@ -109,6 +109,12 @@ New Features
   Added a new function ``rte_raw_cksum_mbuf()`` to process the checksum of
   data embedded in an mbuf chain.

+* **Added new Rx checksum mbuf flags.**
+
+  Added new Rx checksum flags in mbufs to describe more states: unknown,
+  good, bad, or not present (useful for virtual drivers). This modification
+  was done for IP and L4.
+
 Resolved Issues
 ---

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 4e1fdd1..8d9b875 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -309,7 +309,11 @@ const char *rte_get_rx_ol_flag_name(uint64_t mask)
case PKT_RX_RSS_HASH: return "PKT_RX_RSS_HASH";
case PKT_RX_FDIR: return "PKT_RX_FDIR";
case PKT_RX_L4_CKSUM_BAD: return "PKT_RX_L4_CKSUM_BAD";
+   case PKT_RX_L4_CKSUM_GOOD: return "PKT_RX_L4_CKSUM_GOOD";
+   case PKT_RX_L4_CKSUM_NONE: return "PKT_RX_L4_CKSUM_NONE";
case PKT_RX_IP_CKSUM_BAD: return "PKT_RX_IP_CKSUM_BAD";
+   case PKT_RX_IP_CKSUM_GOOD: return "PKT_RX_IP_CKSUM_GOOD";
+   case PKT_RX_IP_CKSUM_NONE: return "PKT_RX_IP_CKSUM_NONE";
case PKT_RX_EIP_CKSUM_BAD: return "PKT_RX_EIP_CKSUM_BAD";
case PKT_RX_VLAN_STRIPPED: return "PKT_RX_VLAN_STRIPPED";
case PKT_RX_IEEE1588_PTP: return "PKT_RX_IEEE1588_PTP";
@@ -333,8 +337,16 @@ rte_get_rx_ol_flag_list(uint64_t mask, char *buf, size_t 
buflen)
{ PKT_RX_VLAN_PKT, PKT_RX_VLAN_PKT, NULL },
{ PKT_RX_RSS_HASH, PKT_RX_RSS_HASH, NULL },
{ PKT_RX_FDIR, PKT_RX_FDIR, NULL },
-   { PKT_RX_L4_CKSUM_BAD, PKT_RX_L4_CKSUM_BAD, NULL },
-   { PKT_RX_IP_CKSUM_BAD, PKT_RX_IP_CKSUM_BAD, NULL },
+   { PKT_RX_L4_CKSUM_BAD, PKT_RX_L4_CKSUM_MASK, NULL },
+   { PKT_RX_L4_CKSUM_GOOD, PKT_RX_L4_CKSUM_MASK, NULL },
+   { PKT_RX_L4_CKSUM_NONE, PKT_RX_L4_CKSUM_MASK, NULL },
+   { PKT_RX_L4_CKSUM_UNKNOWN, PKT_RX_L4_CKSUM_MASK,
+ "PKT_RX_L4_CKSUM_UNKNOWN" },
+   { PKT_RX_IP_CKSUM_BAD, PKT_RX_IP_CKSUM_MASK, NULL },
+   { PKT_RX_IP_CKSUM_GOOD, PKT_RX_IP_CKSUM_MASK, NULL },
+   { PKT_RX_IP_CKSUM_NONE, PKT_RX_IP_CKSUM_MASK, NULL },
+   { PKT_RX_IP_CKSUM_UNKNOWN, PKT_RX_IP_CKSUM_MASK,
+ "PKT_RX_IP_CKSUM_UNKNOWN" },
{ PKT_RX_EIP_CKSUM_BAD, PKT_RX_EIP_CKSUM_BAD, NULL },
{ PKT_RX_VLAN_STRIPPED, PKT_RX_VLAN_STRIPPED, NULL },
{ PKT_RX_IEEE1588_PTP, PKT_RX_IEEE1588_PTP, NULL },
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 7541070..38022a3 

[dpdk-dev] [PATCH v3 04/12] net: add function to calculate a checksum in a mbuf

2016-10-13 Thread Olivier Matz
This function can be used to calculate the checksum of data embedded in
an mbuf, which can be composed of several segments.

This function will be used by the virtio pmd in the next commits to
calculate the checksum in software in case the protocol is not recognized.
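
Usage sketch (not part of the patch): computing the raw checksum of the
L4 part of a possibly multi-segment packet:

	uint16_t cksum;
	uint32_t off = m->l2_len + m->l3_len;

	if (rte_raw_cksum_mbuf(m, off,
			rte_pktmbuf_pkt_len(m) - off, &cksum) == 0)
		cksum = (uint16_t)~cksum; /* complement for final checksum */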

Signed-off-by: Olivier Matz 
---
 doc/guides/rel_notes/release_16_11.rst |  5 +++
 lib/librte_net/rte_ip.h| 71 ++
 2 files changed, 76 insertions(+)

diff --git a/doc/guides/rel_notes/release_16_11.rst 
b/doc/guides/rel_notes/release_16_11.rst
index 51fc707..fbc0cbd 100644
--- a/doc/guides/rel_notes/release_16_11.rst
+++ b/doc/guides/rel_notes/release_16_11.rst
@@ -104,6 +104,11 @@ New Features
   The config option ``RTE_MACHINE`` can be used to pass code names to the 
compiler as ``-march`` flag.


+* **Added a function to calculate the checksum of data in an mbuf.**
+
+  Added a new function ``rte_raw_cksum_mbuf()`` to process the checksum of
+  data embedded in an mbuf chain.
+
 Resolved Issues
 ---

diff --git a/lib/librte_net/rte_ip.h b/lib/librte_net/rte_ip.h
index 5b7554a..4491b86 100644
--- a/lib/librte_net/rte_ip.h
+++ b/lib/librte_net/rte_ip.h
@@ -230,6 +230,77 @@ rte_raw_cksum(const void *buf, size_t len)
 }

 /**
+ * Compute the raw (non complemented) checksum of a packet.
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param off
+ *   The offset in bytes to start the checksum.
+ * @param len
+ *   The length in bytes of the data to checksum.
+ * @param cksum
+ *   A pointer to the checksum, filled on success.
+ * @return
+ *   0 on success, -1 on error (bad length or offset).
+ */
+static inline int
+rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
+   uint16_t *cksum)
+{
+   const struct rte_mbuf *seg;
+   const char *buf;
+   uint32_t sum, tmp;
+   uint32_t seglen, done;
+
+   /* easy case: all data in the first segment */
+   if (off + len <= rte_pktmbuf_data_len(m)) {
+   *cksum = rte_raw_cksum(rte_pktmbuf_mtod_offset(m,
+   const char *, off), len);
+   return 0;
+   }
+
+   if (unlikely(off + len > rte_pktmbuf_pkt_len(m)))
+   return -1; /* invalid params, return a dummy value */
+
+   /* else browse the segment to find offset */
+   seglen = 0;
+   for (seg = m; seg != NULL; seg = seg->next) {
+   seglen = rte_pktmbuf_data_len(seg);
+   if (off < seglen)
+   break;
+   off -= seglen;
+   }
+   seglen -= off;
+   buf = rte_pktmbuf_mtod_offset(seg, const char *, off);
+   if (seglen >= len) {
+   /* all in one segment */
+   *cksum = rte_raw_cksum(buf, len);
+   return 0;
+   }
+
+   /* hard case: process checksum of several segments */
+   sum = 0;
+   done = 0;
+   for (;;) {
+   tmp = __rte_raw_cksum(buf, seglen, 0);
+   if (done & 1)
+   tmp = rte_bswap16(tmp);
+   sum += tmp;
+   done += seglen;
+   if (done == len)
+   break;
+   seg = seg->next;
+   buf = rte_pktmbuf_mtod(seg, const char *);
+   seglen = rte_pktmbuf_data_len(seg);
+   if (seglen > len - done)
+   seglen = len - done;
+   }
+
+   *cksum = __rte_raw_cksum_reduce(sum);
+   return 0;
+}
+
+/**
  * Process the IPv4 checksum of an IPv4 header.
  *
  * The checksum field must be set to 0 by the caller.
-- 
2.8.1



[dpdk-dev] [PATCH v3 03/12] net/virtio: reinitialize the device in configure callback

2016-10-13 Thread Olivier Matz
Add the ability to reset the virtio device in the configure callback
if the feature flags changed since the previous reset. This will be needed
with the introduction of offload support in the next commits.

Signed-off-by: Olivier Matz 
Reviewed-by: Maxime Coquelin 
---
 drivers/net/virtio/virtio_ethdev.c | 26 +++---
 drivers/net/virtio/virtio_pci.h|  1 +
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c 
b/drivers/net/virtio/virtio_ethdev.c
index f3921ac..b5bc0ee 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1045,14 +1045,13 @@ virtio_vlan_filter_set(struct rte_eth_dev *dev, 
uint16_t vlan_id, int on)
 }

 static int
-virtio_negotiate_features(struct virtio_hw *hw)
+virtio_negotiate_features(struct virtio_hw *hw, uint64_t req_features)
 {
uint64_t host_features;

/* Prepare guest_features: feature that driver wants to support */
-   hw->guest_features = VIRTIO_PMD_GUEST_FEATURES;
PMD_INIT_LOG(DEBUG, "guest_features before negotiate = %" PRIx64,
-   hw->guest_features);
+   req_features);

/* Read device(host) feature bits */
host_features = hw->vtpci_ops->get_features(hw);
@@ -1063,6 +1062,7 @@ virtio_negotiate_features(struct virtio_hw *hw)
 * Negotiate features: Subset of device feature bits are written back
 * guest feature bits.
 */
+   hw->guest_features = req_features;
hw->guest_features = vtpci_negotiate_features(hw, host_features);
PMD_INIT_LOG(DEBUG, "features after negotiate = %" PRIx64,
hw->guest_features);
@@ -1081,6 +1081,8 @@ virtio_negotiate_features(struct virtio_hw *hw)
}
}

+   hw->req_guest_features = req_features;
+
return 0;
 }

@@ -1121,8 +1123,9 @@ rx_func_get(struct rte_eth_dev *eth_dev)
eth_dev->rx_pkt_burst = &virtio_recv_pkts;
 }

+/* reset device and renegotiate features if needed */
 static int
-virtio_init_device(struct rte_eth_dev *eth_dev)
+virtio_init_device(struct rte_eth_dev *eth_dev, uint64_t req_features)
 {
struct virtio_hw *hw = eth_dev->data->dev_private;
struct virtio_net_config *config;
@@ -1137,7 +1140,7 @@ virtio_init_device(struct rte_eth_dev *eth_dev)

/* Tell the host we've known how to drive the device. */
vtpci_set_status(hw, VIRTIO_CONFIG_STATUS_DRIVER);
-   if (virtio_negotiate_features(hw) < 0)
+   if (virtio_negotiate_features(hw, req_features) < 0)
return -1;

/* If host does not support status then disable LSC */
@@ -1258,8 +1261,8 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)

eth_dev->data->dev_flags = dev_flags;

-   /* reset device and negotiate features */
-   ret = virtio_init_device(eth_dev);
+   /* reset device and negotiate default features */
+   ret = virtio_init_device(eth_dev, VIRTIO_PMD_GUEST_FEATURES);
if (ret < 0)
return ret;

@@ -1338,6 +1341,7 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 {
const struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
struct virtio_hw *hw = dev->data->dev_private;
+   uint64_t req_features;
int ret;

PMD_INIT_LOG(DEBUG, "configure");
@@ -1347,6 +1351,14 @@ virtio_dev_configure(struct rte_eth_dev *dev)
return -EINVAL;
}

+   req_features = VIRTIO_PMD_GUEST_FEATURES;
+   /* if request features changed, reinit the device */
+   if (req_features != hw->req_guest_features) {
+   ret = virtio_init_device(dev, req_features);
+   if (ret < 0)
+   return ret;
+   }
+
/* Setup and start control queue */
if (vtpci_with_feature(hw, VIRTIO_NET_F_CTRL_VQ)) {
ret = virtio_dev_cq_queue_setup(dev,
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index 6930cd6..bbf06ec 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -245,6 +245,7 @@ struct virtio_net_config;
 struct virtio_hw {
struct virtnet_ctl *cvq;
struct rte_pci_ioport io;
+   uint64_treq_guest_features;
uint64_tguest_features;
uint32_tmax_queue_pairs;
uint16_tvtnet_hdr_size;
-- 
2.8.1



[dpdk-dev] [PATCH v3 02/12] net/virtio: setup and start cq in configure callback

2016-10-13 Thread Olivier Matz
Move the configuration of the control queue into the configure callback.
This is needed by next commit, which introduces the reinitialization
of the device in the configure callback to change the feature flags.
Therefore, the control queue will have to be restarted at the same
place.

As virtio_dev_cq_queue_setup() is called from a place where
config->max_virtqueue_pairs is not available, we need to store this in
the private structure. It replaces max_rx_queues and max_tx_queues which
have the same value. The log showing the value of max_rx_queues and
max_tx_queues is also removed since config->max_virtqueue_pairs is
already displayed above.

Signed-off-by: Olivier Matz 
Reviewed-by: Maxime Coquelin 
---
 drivers/net/virtio/virtio_ethdev.c | 43 +++---
 drivers/net/virtio/virtio_ethdev.h |  4 ++--
 drivers/net/virtio/virtio_pci.h|  3 +--
 3 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c 
b/drivers/net/virtio/virtio_ethdev.c
index 77ca569..f3921ac 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -552,6 +552,9 @@ virtio_dev_close(struct rte_eth_dev *dev)
if (hw->started == 1)
virtio_dev_stop(dev);

+   if (hw->cvq)
+   virtio_dev_queue_release(hw->cvq->vq);
+
/* reset the NIC */
if (dev->data->dev_flags & RTE_ETH_DEV_INTR_LSC)
vtpci_irq_config(hw, VIRTIO_MSI_NO_VECTOR);
@@ -1191,16 +1194,7 @@ virtio_init_device(struct rte_eth_dev *eth_dev)
config->max_virtqueue_pairs = 1;
}

-   hw->max_rx_queues =
-   (VIRTIO_MAX_RX_QUEUES < config->max_virtqueue_pairs) ?
-   VIRTIO_MAX_RX_QUEUES : config->max_virtqueue_pairs;
-   hw->max_tx_queues =
-   (VIRTIO_MAX_TX_QUEUES < config->max_virtqueue_pairs) ?
-   VIRTIO_MAX_TX_QUEUES : config->max_virtqueue_pairs;
-
-   virtio_dev_cq_queue_setup(eth_dev,
-   config->max_virtqueue_pairs * 2,
-   SOCKET_ID_ANY);
+   hw->max_queue_pairs = config->max_virtqueue_pairs;

PMD_INIT_LOG(DEBUG, "config->max_virtqueue_pairs=%d",
config->max_virtqueue_pairs);
@@ -1211,19 +1205,15 @@ virtio_init_device(struct rte_eth_dev *eth_dev)
config->mac[2], config->mac[3],
config->mac[4], config->mac[5]);
} else {
-   hw->max_rx_queues = 1;
-   hw->max_tx_queues = 1;
+   PMD_INIT_LOG(DEBUG, "config->max_virtqueue_pairs=1");
+   hw->max_queue_pairs = 1;
}

-   PMD_INIT_LOG(DEBUG, "hw->max_rx_queues=%d   hw->max_tx_queues=%d",
-   hw->max_rx_queues, hw->max_tx_queues);
if (pci_dev)
PMD_INIT_LOG(DEBUG, "port %d vendorID=0x%x deviceID=0x%x",
eth_dev->data->port_id, pci_dev->id.vendor_id,
pci_dev->id.device_id);

-   virtio_dev_cq_start(eth_dev);
-
return 0;
 }

@@ -1285,7 +1275,6 @@ static int
 eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
 {
struct rte_pci_device *pci_dev;
-   struct virtio_hw *hw = eth_dev->data->dev_private;

PMD_INIT_FUNC_TRACE();

@@ -1301,9 +1290,6 @@ eth_virtio_dev_uninit(struct rte_eth_dev *eth_dev)
eth_dev->tx_pkt_burst = NULL;
eth_dev->rx_pkt_burst = NULL;

-   if (hw->cvq)
-   virtio_dev_queue_release(hw->cvq->vq);
-
rte_free(eth_dev->data->mac_addrs);
eth_dev->data->mac_addrs = NULL;

@@ -1352,6 +1338,7 @@ virtio_dev_configure(struct rte_eth_dev *dev)
 {
const struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
struct virtio_hw *hw = dev->data->dev_private;
+   int ret;

PMD_INIT_LOG(DEBUG, "configure");

@@ -1360,6 +1347,16 @@ virtio_dev_configure(struct rte_eth_dev *dev)
return -EINVAL;
}

+   /* Setup and start control queue */
+   if (vtpci_with_feature(hw, VIRTIO_NET_F_CTRL_VQ)) {
+   ret = virtio_dev_cq_queue_setup(dev,
+   hw->max_queue_pairs * 2,
+   SOCKET_ID_ANY);
+   if (ret < 0)
+   return ret;
+   virtio_dev_cq_start(dev);
+   }
+
hw->vlan_strip = rxmode->hw_vlan_strip;

if (rxmode->hw_vlan_filter
@@ -1553,8 +1550,10 @@ virtio_dev_info_get(struct rte_eth_dev *dev, struct 
rte_eth_dev_info *dev_info)
dev_info->driver_name = dev->driver->pci_drv.drive

[dpdk-dev] [PATCH v3 01/12] net/virtio: move device initialization in a function

2016-10-13 Thread Olivier Matz
Move all code related to device initialization into a new function,
virtio_init_device().

This commit brings no functional change, it prepares the next commits
that will add the offload support. For that, it will be needed to
reinitialize the device from ethdev->configure(), using this new
function.

Signed-off-by: Olivier Matz 
Reviewed-by: Maxime Coquelin 
---
 drivers/net/virtio/virtio_ethdev.c | 99 ++
 1 file changed, 58 insertions(+), 41 deletions(-)

diff --git a/drivers/net/virtio/virtio_ethdev.c 
b/drivers/net/virtio/virtio_ethdev.c
index b4dfc0a..77ca569 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1118,46 +1118,13 @@ rx_func_get(struct rte_eth_dev *eth_dev)
eth_dev->rx_pkt_burst = &virtio_recv_pkts;
 }

-/*
- * This function is based on probe() function in virtio_pci.c
- * It returns 0 on success.
- */
-int
-eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
+static int
+virtio_init_device(struct rte_eth_dev *eth_dev)
 {
struct virtio_hw *hw = eth_dev->data->dev_private;
struct virtio_net_config *config;
struct virtio_net_config local_config;
-   struct rte_pci_device *pci_dev;
-   uint32_t dev_flags = RTE_ETH_DEV_DETACHABLE;
-   int ret;
-
-   RTE_BUILD_BUG_ON(RTE_PKTMBUF_HEADROOM < sizeof(struct 
virtio_net_hdr_mrg_rxbuf));
-
-   eth_dev->dev_ops = &virtio_eth_dev_ops;
-   eth_dev->tx_pkt_burst = &virtio_xmit_pkts;
-
-   if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
-   rx_func_get(eth_dev);
-   return 0;
-   }
-
-   /* Allocate memory for storing MAC addresses */
-   eth_dev->data->mac_addrs = rte_zmalloc("virtio", VIRTIO_MAX_MAC_ADDRS * 
ETHER_ADDR_LEN, 0);
-   if (eth_dev->data->mac_addrs == NULL) {
-   PMD_INIT_LOG(ERR,
-   "Failed to allocate %d bytes needed to store MAC 
addresses",
-   VIRTIO_MAX_MAC_ADDRS * ETHER_ADDR_LEN);
-   return -ENOMEM;
-   }
-
-   pci_dev = eth_dev->pci_dev;
-
-   if (pci_dev) {
-   ret = vtpci_init(pci_dev, hw, &dev_flags);
-   if (ret)
-   return ret;
-   }
+   struct rte_pci_device *pci_dev = eth_dev->pci_dev;

/* Reset the device although not necessary at startup */
vtpci_reset(hw);
@@ -1172,10 +1139,11 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)

/* If host does not support status then disable LSC */
if (!vtpci_with_feature(hw, VIRTIO_NET_F_STATUS))
-   dev_flags &= ~RTE_ETH_DEV_INTR_LSC;
+   eth_dev->data->dev_flags &= ~RTE_ETH_DEV_INTR_LSC;
+   else
+   eth_dev->data->dev_flags |= RTE_ETH_DEV_INTR_LSC;

rte_eth_copy_pci_info(eth_dev, pci_dev);
-   eth_dev->data->dev_flags = dev_flags;

rx_func_get(eth_dev);

@@ -1254,12 +1222,61 @@ eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
eth_dev->data->port_id, pci_dev->id.vendor_id,
pci_dev->id.device_id);

+   virtio_dev_cq_start(eth_dev);
+
+   return 0;
+}
+
+/*
+ * This function is based on probe() function in virtio_pci.c
+ * It returns 0 on success.
+ */
+int
+eth_virtio_dev_init(struct rte_eth_dev *eth_dev)
+{
+   struct virtio_hw *hw = eth_dev->data->dev_private;
+   struct rte_pci_device *pci_dev;
+   uint32_t dev_flags = RTE_ETH_DEV_DETACHABLE;
+   int ret;
+
+   RTE_BUILD_BUG_ON(RTE_PKTMBUF_HEADROOM < sizeof(struct 
virtio_net_hdr_mrg_rxbuf));
+
+   eth_dev->dev_ops = &virtio_eth_dev_ops;
+   eth_dev->tx_pkt_burst = &virtio_xmit_pkts;
+
+   if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+   rx_func_get(eth_dev);
+   return 0;
+   }
+
+   /* Allocate memory for storing MAC addresses */
+   eth_dev->data->mac_addrs = rte_zmalloc("virtio", VIRTIO_MAX_MAC_ADDRS * 
ETHER_ADDR_LEN, 0);
+   if (eth_dev->data->mac_addrs == NULL) {
+   PMD_INIT_LOG(ERR,
+   "Failed to allocate %d bytes needed to store MAC 
addresses",
+   VIRTIO_MAX_MAC_ADDRS * ETHER_ADDR_LEN);
+   return -ENOMEM;
+   }
+
+   pci_dev = eth_dev->pci_dev;
+
+   if (pci_dev) {
+   ret = vtpci_init(pci_dev, hw, &dev_flags);
+   if (ret)
+   return ret;
+   }
+
+   eth_dev->data->dev_flags = dev_flags;
+
+   /* reset device and negotiate features */
+   ret = virtio_init_device(eth_dev);
+   if (ret < 0)
+   return ret;
+
/* Setup interrupt callback  */
if (eth_dev->data->dev_flags & RTE_ETH_DEV_INTR_LSC)
rte_intr_callback_register(&pci_dev->intr_handle,
-  virtio_interrupt_handler, eth_dev);
-
-   virtio_dev_cq_start(eth_dev);
+   virtio_interrupt_handler, eth_dev);

return 0;
 }
-- 
2.8.1



[dpdk-dev] [PATCH v3 00/12] net/virtio: add offload support

2016-10-13 Thread Olivier Matz
This patchset, targeted for 16.11, introduces support for rx and tx
offload in the virtio pmd.  To achieve this, some new mbuf flags must be
introduced, as discussed in [1].

It applies on master + a patch fixing the testpmd csum engine:
http://dpdk.org/dev/patchwork/patch/16538/

The new mbuf checksum flags are backward compatible for current
applications that assume that unknown_csum = good_csum (since there
was only a bad_csum flag). But if the patchset is integrated, we
should consider updating the PMDs to match the new API for 16.11.

[1] http://dpdk.org/ml/archives/dev/2016-May/039920.html
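
For illustration, an application consuming the new flags could dispatch on
the L4 checksum status as sketched below, inside its rx loop (process()
and verify_in_software() are hypothetical handlers; the flag names are
the ones introduced by patch 05/12 of this series):

    switch (m->ol_flags & PKT_RX_L4_CKSUM_MASK) {
    case PKT_RX_L4_CKSUM_BAD:
        rte_pktmbuf_free(m);       /* checksum verified and wrong */
        break;
    case PKT_RX_L4_CKSUM_GOOD:
    case PKT_RX_L4_CKSUM_NONE:     /* data is valid, csum field is not */
        process(m);
        break;
    default:                       /* PKT_RX_L4_CKSUM_UNKNOWN */
        verify_in_software(m);
        break;
    }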

changes v2 -> v3

- fix typo in release note
- add unlikely() in cksum calculation error case
- add likely() in virtio rx function when cksum != 0xffff
- return an error code instead of the cksum in rte_raw_cksum_mbuf()
- do not access the virtio header if no offload is negotiated (rx and tx)
- return an error if offload cannot be negotiated
- use offsetof() instead of magic hardcoded values for cksum offsets
- change/fix some commit titles

changes v1 -> v2
- make mbuf checksum calculation static inline
- fix checksum calculation for protocol where csum=0 means no csum
- move mbuf checksum calculation in librte_net
- use RTE_MIN() to set max rx/tx queue
- rebase on top of head

Olivier Matz (12):
  virtio: move device initialization in a function
  virtio: setup and start cq in configure callback
  virtio: reinitialize the device in configure callback
  net: add function to calculate a checksum in a mbuf
  mbuf: add new Rx checksum mbuf flags
  app/testpmd: fix checksum stats in csum engine
  mbuf: new flag for LRO
  app/testpmd: display lro segment size
  virtio: add Rx checksum offload support
  virtio: add Tx checksum offload support
  virtio: add Lro support
  virtio: add Tso support

 app/test-pmd/csumonly.c|   8 +-
 doc/guides/rel_notes/release_16_11.rst |  16 ++
 drivers/net/virtio/virtio_ethdev.c | 197 ++
 drivers/net/virtio/virtio_ethdev.h |  18 +-
 drivers/net/virtio/virtio_pci.h|   4 +-
 drivers/net/virtio/virtio_rxtx.c   | 298 ++---
 drivers/net/virtio/virtqueue.h |   1 +
 lib/librte_mbuf/rte_mbuf.c |  18 +-
 lib/librte_mbuf/rte_mbuf.h |  58 ++-
 lib/librte_net/rte_ip.h|  71 
 10 files changed, 580 insertions(+), 109 deletions(-)

Test plan
=

(replayed on v3)

Platform description


  guest (dpdk)
  ++
  ||
  ||
  | port0  +-<---+
  |   ixgbe /  | |
  |   directio | |
  || |
  |port1   | ^ flow1
  ++ | (flow2 is the reverse)
 |   |
 | virtio|
 v   |
  ++ |
  | tap0   /   | |
  |1.1.1.1   / | |
  |ns-tap  /   | |
  |  / | |
  |/   ixgbe2  +-->--+
  |  /1.1.1.2  |
  |/  ns-ixgbe |
  ++
  host (linux, vhost-net)


flow1:
  host -(ixgbe)-> guest -(virtio)-> host
  1.1.1.2 -> 1.1.1.1

flow2:
  host -(virtio)-> guest -(ixgbe)-> host
  1.1.1.2 -> 1.1.1.1

Host configuration
--

Start qemu with:

- a ne2k management interface to avoid any conflict with dpdk
- 2 ixgbe interfaces given to the vm through vfio
- a virtio net device, connected to a tap interface through vhost-net

  /usr/bin/qemu-system-x86_64 -k fr -daemonize --enable-kvm -m 1G -cpu host \
-smp 3 -serial telnet::40564,server,nowait -serial null \
-qmp tcp::44340,server,nowait -monitor telnet::49229,server,nowait \
-device ne2k_pci,mac=de:ad:de:01:02:03,netdev=user.0,addr=03 \
-netdev user,id=user.0,hostfwd=tcp::34965-:22 \
-device vfio-pci,host=:04:00.0 -device vfio-pci,host=:04:00.1 \
-netdev type=tap,id=vhostnet0,script=no,vhost=on,queues=8 \
-device virtio-net-pci,netdev=vhostnet0,ioeventfd=on,mq=on,vectors=17 \
-hda "/path/to/ubuntu-14.04-template.qcow2" \
-snapshot -vga none -display none

Move the tap interface in a netns, and configure it:

  ip netns add ns-tap
  ip netns exec ns-tap ip l set lo up
  ip link set tap0 netns ns-tap
  ip netns exec ns-tap ip l set tap0 down
  ip netns exec ns-tap ip l set addr 02:00:00:00:00:01 dev tap0
  ip netns exec ns-tap ip l set tap0 up
  ip netns exec ns-tap ip a a 1.1.1.1/24 dev tap0
  ip netns exec ns-tap arp -s 1.1.1.2 02:00:00:00:00:00
  ip netns exec ns-tap ip a

Move the ixgbe interface in a netns, and configure it:

  IXGBE=ixgbe2
  ip netns add ns-ixgbe
  ip netns exec ns-ixgbe ip l set lo up
  ip link set ${IXGBE} netns ns-ixgbe
  ip netns exec ns-ixgbe ip l set ${IXGBE} down
  ip netns exec ns-ixgbe ip l set addr 02:00:00:00:00:00 dev ${IXGBE}
  ip netns exec ns-ixgbe ip l set ${IXGBE} up
  ip n

[dpdk-dev] [PATCH v2 12/12] virtio: add Tso support

2016-10-13 Thread Olivier MATZ


On 10/13/2016 10:18 AM, Yuanhan Liu wrote:
> On Mon, Oct 03, 2016 at 11:00:23AM +0200, Olivier Matz wrote:
>> +/* When doing TSO, the IP length is not included in the pseudo header
>> + * checksum of the packet given to the PMD, but for virtio it is
>> + * expected.
>> + */
>> +static void
>> +virtio_tso_fix_cksum(struct rte_mbuf *m)
>> +{
>> +/* common case: header is not fragmented */
>> +if (likely(rte_pktmbuf_data_len(m) >= m->l2_len + m->l3_len +
>> +m->l4_len)) {
> ...
>> +/* replace it in the packet */
>> +th->cksum = new_cksum;
>> +} else {
> ...
>> +/* replace it in the packet */
>> +*rte_pktmbuf_mtod_offset(m, uint8_t *,
>> +m->l2_len + m->l3_len + 16) = new_cksum.u8[0];
>> +*rte_pktmbuf_mtod_offset(m, uint8_t *,
>> +m->l2_len + m->l3_len + 17) = new_cksum.u8[1];
>> +}
>
> The tcp header will always be in the mbuf, right? Otherwise, you can't
> update the cksum field here. What's the point of introducing the "else
> clause" then?

Sorry, I don't see the problem you're pointing out here.

What I want to solve here is to support the cases where the mbuf is 
segmented in the middle of the network header (which is probably a rare 
case).

In the "else" part, I only access the mbuf byte by byte using the 
rte_pktmbuf_mtod_offset() accessor. An alternative would have been to 
copy the header in a linear buffer, fix the checksum, then copy it again 
in the packet, but there is no mbuf helpers to do these copies for now.

Regards,
Olivier
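
For reference, the byte-by-byte technique defended above can be sketched
as follows: a minimal, hypothetical helper (not the patch code) that
writes one byte at an arbitrary packet offset. Since
rte_pktmbuf_mtod_offset() does not itself cross segment boundaries, the
helper walks the chain so the write lands in the correct mbuf even when
the checksum field straddles two segments.

    /* hypothetical helper: write one byte at offset 'off' in the packet */
    static int
    mbuf_write_byte(struct rte_mbuf *m, uint32_t off, uint8_t byte)
    {
        struct rte_mbuf *seg;

        for (seg = m; seg != NULL; seg = seg->next) {
            if (off < rte_pktmbuf_data_len(seg)) {
                *rte_pktmbuf_mtod_offset(seg, uint8_t *, off) = byte;
                return 0;
            }
            off -= rte_pktmbuf_data_len(seg);
        }
        return -1; /* offset is past the end of the packet */
    }

    /* usage: patch the two bytes of the TCP checksum one at a time */
    mbuf_write_byte(m, m->l2_len + m->l3_len + 16, new_cksum.u8[0]);
    mbuf_write_byte(m, m->l2_len + m->l3_len + 17, new_cksum.u8[1]);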


[dpdk-dev] [PATCH v2 10/12] virtio: add Tx checksum offload support

2016-10-13 Thread Olivier MATZ


On 10/13/2016 10:38 AM, Yuanhan Liu wrote:
> On Mon, Oct 03, 2016 at 11:00:21AM +0200, Olivier Matz wrote:
>> +/* Checksum Offload */
>> +switch (cookie->ol_flags & PKT_TX_L4_MASK) {
>> +case PKT_TX_UDP_CKSUM:
>> +hdr->csum_start = cookie->l2_len + cookie->l3_len;
>> +hdr->csum_offset = 6;
>> +hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
>> +break;
>> +
>> +case PKT_TX_TCP_CKSUM:
>> +hdr->csum_start = cookie->l2_len + cookie->l3_len;
>> +hdr->csum_offset = 16;
>
> I would suggest to use "offsetof(...)" here, instead of some magic
> number like 16.

Will do, it's actually clearer.

Olivier


[dpdk-dev] [PATCH v2 03/12] virtio: reinitialize the device in configure callback

2016-10-13 Thread Olivier MATZ


On 10/13/2016 09:54 AM, Yuanhan Liu wrote:
> On Wed, Oct 12, 2016 at 06:01:25PM +0200, Olivier MATZ wrote:
>> Hello Yuanhan,
>>
>> On 10/12/2016 04:41 PM, Yuanhan Liu wrote:
>>> On Mon, Oct 03, 2016 at 11:00:14AM +0200, Olivier Matz wrote:
>>>> @@ -1344,6 +1347,7 @@ virtio_dev_configure(struct rte_eth_dev *dev)
>>>>   {
>>>>const struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
>>>>struct virtio_hw *hw = dev->data->dev_private;
>>>> +  uint64_t req_features;
>>>>int ret;
>>>>
>>>>PMD_INIT_LOG(DEBUG, "configure");
>>>> @@ -1353,6 +1357,14 @@ virtio_dev_configure(struct rte_eth_dev *dev)
>>>>return -EINVAL;
>>>>}
>>>>
>>>> +  req_features = VIRTIO_PMD_GUEST_FEATURES;
>>>> +  /* if request features changed, reinit the device */
>>>> +  if (req_features != hw->req_guest_features) {
>>>> +  ret = virtio_init_device(dev, req_features);
>>>> +  if (ret < 0)
>>>> +  return ret;
>>>> +  }
>>>
>>> Why do you have to reset virtio here? This doesn't make too much sense
>>> to me.
>>>
>>> IIUC, you want to make sure those TSO related features being unset at
>>> init time, and enable it (by doing reset) when it's asked to be enabled
>>> (by rte_eth_dev_configure)?
>>>
>>> Why not always setting those features? We could do the actual offloads
>>> when:
>>>
>>> - those features have been negotiated
>>>
>>> - they are enabled through rte_eth_dev_configure
>>>
>>> With that, I think we could avoid the reset here?
>>
>> It would work for TX, since you decide whether or not to use the feature. But I
>> think this won't work for RX: if you negotiate LRO at init, the host may
>> send you large packets, even if LRO is disabled in dev_configure.
>
> I see. Thanks.
>
> Besides, I think you should return an error when LRO is not negotiated
> after the reset (say, when it's disabled through the qemu command line)?

Good idea, I now return an error if offload cannot be negotiated.

Olivier
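
A hedged sketch of that check (not the exact patch code, but using helpers
that already appear in the driver): in virtio_dev_configure(), after the
reset and renegotiation done by virtio_init_device(), the configuration
fails if a requested offload was not granted by the host.

    if (rxmode->enable_lro &&
        !vtpci_with_feature(hw, VIRTIO_NET_F_GUEST_TSO4)) {
        PMD_INIT_LOG(ERR, "lro not available on this host");
        return -ENOTSUP;
    }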


[dpdk-dev] [PATCH] testpmd: fix tso with csum engine

2016-10-13 Thread Olivier Matz
The commit that disabled tso for small packets was broken during the
rebase. The problem is the IP checksum is not calculated in software if:
- TX IP checksum is disabled
- TSO is enabled
- the current packet is smaller than tso segment size

When checking if the PKT_TX_IP_CKSUM flag should be set (in case
of tso), use the local tso_segsz variable, which is set to 0 when the
packet is too small to require tso. Therefore the IP checksum will be
correctly calculated in software.

Moreover, we should not use tunnel segment size for non-tunnel tso, else
TSO will stay disabled for all packets.

Fixes: 97c21329d42b ("app/testpmd: do not use TSO for small packets")

Signed-off-by: Olivier Matz 
---
 app/test-pmd/csumonly.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index f9e65b6..27d0f08 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -336,7 +336,7 @@ process_inner_cksums(void *l3_hdr, const struct 
testpmd_offload_info *info,
if (!info->is_tunnel) {
max_pkt_len = info->l2_len + info->l3_len + info->l4_len +
info->tso_segsz;
-   if (info->tunnel_tso_segsz != 0 && info->pkt_len > max_pkt_len)
+   if (info->tso_segsz != 0 && info->pkt_len > max_pkt_len)
tso_segsz = info->tso_segsz;
} else {
max_pkt_len = info->outer_l2_len + info->outer_l3_len +
@@ -351,9 +351,7 @@ process_inner_cksums(void *l3_hdr, const struct 
testpmd_offload_info *info,
ipv4_hdr->hdr_checksum = 0;

ol_flags |= PKT_TX_IPV4;
-   if (info->l4_proto == IPPROTO_TCP &&
-   ((info->is_tunnel && info->tunnel_tso_segsz != 0) ||
-(!info->is_tunnel && info->tso_segsz != 0))) {
+   if (info->l4_proto == IPPROTO_TCP && tso_segsz) {
ol_flags |= PKT_TX_IP_CKSUM;
} else {
if (testpmd_ol_flags & TESTPMD_TX_OFFLOAD_IP_CKSUM)
-- 
2.8.1
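
To illustrate the corrected threshold with concrete (hypothetical)
numbers, inside process_inner_cksums(): for an Ethernet/IPv4/TCP packet
with an MSS of 1460,

    max_pkt_len = 14 + 20 + 20 + 1460;    /* = 1514 bytes */
    if (info->tso_segsz != 0 && info->pkt_len > max_pkt_len)
        tso_segsz = info->tso_segsz;

a 500-byte packet stays below max_pkt_len, so tso_segsz remains 0,
PKT_TX_TCP_SEG is not set, and the IP checksum is computed in software
when TX IP checksum offload is disabled.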



[dpdk-dev] [PATCH v2 03/12] virtio: reinitialize the device in configure callback

2016-10-12 Thread Olivier MATZ
Hello Yuanhan,

On 10/12/2016 04:41 PM, Yuanhan Liu wrote:
> On Mon, Oct 03, 2016 at 11:00:14AM +0200, Olivier Matz wrote:
>> @@ -1344,6 +1347,7 @@ virtio_dev_configure(struct rte_eth_dev *dev)
>>   {
>>  const struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
>>  struct virtio_hw *hw = dev->data->dev_private;
>> +uint64_t req_features;
>>  int ret;
>>
>>  PMD_INIT_LOG(DEBUG, "configure");
>> @@ -1353,6 +1357,14 @@ virtio_dev_configure(struct rte_eth_dev *dev)
>>  return -EINVAL;
>>  }
>>
>> +req_features = VIRTIO_PMD_GUEST_FEATURES;
>> +/* if request features changed, reinit the device */
>> +if (req_features != hw->req_guest_features) {
>> +ret = virtio_init_device(dev, req_features);
>> +if (ret < 0)
>> +return ret;
>> +}
>
> Why do you have to reset virtio here? This doesn't make too much sense
> to me.
>
> IIUC, you want to make sure those TSO related features being unset at
> init time, and enable it (by doing reset) when it's asked to be enabled
> (by rte_eth_dev_configure)?
>
> Why not always setting those features? We could do the actual offloads
> when:
>
> - those features have been negotiated
>
> - they are enabled through rte_eth_dev_configure
>
> With that, I think we could avoid the reset here?

It would work for TX, since you decide whether or not to use the feature. But I
think this won't work for RX: if you negotiate LRO at init, the host may
send you large packets, even if LRO is disabled in dev_configure.

Regards,
Olivier



[dpdk-dev] [PATCH v2 09/12] virtio: add Rx checksum offload support

2016-10-12 Thread Olivier MATZ


On 10/12/2016 03:02 PM, Yuanhan Liu wrote:
> On Wed, Oct 05, 2016 at 03:27:47PM +0200, Maxime Coquelin wrote:
>>> /* Update offload features */
>>> -   if (virtio_rx_offload(rxm, hdr) < 0) {
>>> +   if ((features & VIRTIO_NET_F_GUEST_CSUM) &&
>> s/VIRTIO_NET_F_GUEST_CSUM/(1u << VIRTIO_NET_F_GUEST_CSUM)/
>
> There is a helper function for that: vtpci_with_feature.

Ok, will use it.

Thanks,
Olivier



[dpdk-dev] [PATCH v6 8/8] app/testpmd: hide segsize when irrelevant in csum engine

2016-10-12 Thread Olivier Matz
When TSO is not requested, hide the segment size.

Signed-off-by: Olivier Matz 
Acked-by: Pablo de Lara 
---
 app/test-pmd/csumonly.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index d51d85a..f9e65b6 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -843,10 +843,12 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
"m->outer_l3_len=%d\n",
m->outer_l2_len,
m->outer_l3_len);
-   if (info.tunnel_tso_segsz != 0)
+   if (info.tunnel_tso_segsz != 0 &&
+   (m->ol_flags & PKT_TX_TCP_SEG))
printf("tx: m->tso_segsz=%d\n",
m->tso_segsz);
-   } else if (info.tso_segsz != 0)
+   } else if (info.tso_segsz != 0 &&
+   (m->ol_flags & PKT_TX_TCP_SEG))
printf("tx: m->tso_segsz=%d\n", m->tso_segsz);
rte_get_tx_ol_flag_list(m->ol_flags, buf, sizeof(buf));
printf("tx: flags=%s", buf);
-- 
2.8.1



[dpdk-dev] [PATCH v6 7/8] app/testpmd: don't use tso if packet is too small

2016-10-12 Thread Olivier Matz
Asking for TSO (TCP Segmentation Offload) on packets that are already
smaller than (headers + MSS) does not work, for instance on ixgbe.

Fix the csumonly engine to only set the TSO flag when a segmentation
offload is really required, i.e. when packet is large enough.

Signed-off-by: Olivier Matz 
Acked-by: Pablo de Lara 
---
 app/test-pmd/csumonly.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index dbd8dc4..d51d85a 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -102,6 +102,7 @@ struct testpmd_offload_info {
uint8_t outer_l4_proto;
uint16_t tso_segsz;
uint16_t tunnel_tso_segsz;
+   uint32_t pkt_len;
 };

 /* simplified GRE header */
@@ -329,6 +330,21 @@ process_inner_cksums(void *l3_hdr, const struct 
testpmd_offload_info *info,
struct tcp_hdr *tcp_hdr;
struct sctp_hdr *sctp_hdr;
uint64_t ol_flags = 0;
+   uint32_t max_pkt_len, tso_segsz = 0;
+
+   /* ensure packet is large enough to require tso */
+   if (!info->is_tunnel) {
+   max_pkt_len = info->l2_len + info->l3_len + info->l4_len +
+   info->tso_segsz;
+   if (info->tunnel_tso_segsz != 0 && info->pkt_len > max_pkt_len)
+   tso_segsz = info->tso_segsz;
+   } else {
+   max_pkt_len = info->outer_l2_len + info->outer_l3_len +
+   info->l2_len + info->l3_len + info->l4_len +
+   info->tunnel_tso_segsz;
+   if (info->tunnel_tso_segsz != 0 && info->pkt_len > max_pkt_len)
+   tso_segsz = info->tunnel_tso_segsz;
+   }

if (info->ethertype == _htons(ETHER_TYPE_IPv4)) {
ipv4_hdr = l3_hdr;
@@ -369,8 +385,7 @@ process_inner_cksums(void *l3_hdr, const struct 
testpmd_offload_info *info,
} else if (info->l4_proto == IPPROTO_TCP) {
tcp_hdr = (struct tcp_hdr *)((char *)l3_hdr + info->l3_len);
tcp_hdr->cksum = 0;
-   if ((info->is_tunnel && info->tunnel_tso_segsz != 0) ||
-   (!info->is_tunnel && info->tso_segsz != 0)) {
+   if (tso_segsz) {
ol_flags |= PKT_TX_TCP_SEG;
tcp_hdr->cksum = get_psd_sum(l3_hdr, info->ethertype,
ol_flags);
@@ -679,6 +694,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)

m = pkts_burst[i];
info.is_tunnel = 0;
+   info.pkt_len = rte_pktmbuf_pkt_len(m);
tx_ol_flags = 0;
rx_ol_flags = m->ol_flags;

-- 
2.8.1



[dpdk-dev] [PATCH v6 6/8] app/testpmd: display Rx port in csum engine

2016-10-12 Thread Olivier Matz
This information is useful when debugging, especially with
bidirectional traffic.

Signed-off-by: Olivier Matz 
Acked-by: Pablo de Lara 
---
 app/test-pmd/csumonly.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 34d4b11..dbd8dc4 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -798,8 +798,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
char buf[256];

printf("-\n");
-   printf("mbuf=%p, pkt_len=%u, nb_segs=%hhu:\n",
-   m, m->pkt_len, m->nb_segs);
+   printf("port=%u, mbuf=%p, pkt_len=%u, nb_segs=%hhu:\n",
+   fs->rx_port, m, m->pkt_len, m->nb_segs);
/* dump rx parsed packet info */
rte_get_rx_ol_flag_list(rx_ol_flags, buf, sizeof(buf));
printf("rx: l2_len=%d ethertype=%x l3_len=%d "
-- 
2.8.1



[dpdk-dev] [PATCH v6 5/8] app/testpmd: do not change ip addrs in csum engine

2016-10-12 Thread Olivier Matz
The csum forward engine was updated to change the IP addresses in the
packet data in
commit 51f694dd40f5 ("app/testpmd: rework checksum forward engine")

This was done to ensure that the checksum is correctly reprocessed when
using hardware checksum offload. But the functions
process_inner_cksums() and process_outer_cksums() already reset the
checksum field to 0, so this is not necessary.

Moreover, this makes the engine more complex than needed, and prevents
easily using it to forward traffic (e.g. with iperf) since it modifies
the packets.

This patch drops this behavior.

Signed-off-by: Olivier Matz 
Acked-by: Pablo de Lara 
---
 app/test-pmd/csumonly.c | 27 ++-
 1 file changed, 2 insertions(+), 25 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 42974d5..34d4b11 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -318,21 +318,6 @@ parse_encap_ip(void *encap_ip, struct testpmd_offload_info 
*info)
info->l2_len = 0;
 }

-/* modify the IPv4 or IPv4 source address of a packet */
-static void
-change_ip_addresses(void *l3_hdr, uint16_t ethertype)
-{
-   struct ipv4_hdr *ipv4_hdr = l3_hdr;
-   struct ipv6_hdr *ipv6_hdr = l3_hdr;
-
-   if (ethertype == _htons(ETHER_TYPE_IPv4)) {
-   ipv4_hdr->src_addr =
-   rte_cpu_to_be_32(rte_be_to_cpu_32(ipv4_hdr->src_addr) + 
1);
-   } else if (ethertype == _htons(ETHER_TYPE_IPv6)) {
-   ipv6_hdr->src_addr[15] = ipv6_hdr->src_addr[15] + 1;
-   }
-}
-
 /* if possible, calculate the checksum of a packet in hw or sw,
  * depending on the testpmd command line configuration */
 static uint64_t
@@ -620,7 +605,6 @@ pkt_copy_split(const struct rte_mbuf *pkt)
  * Receive a burst of packets, and for each packet:
  *  - parse packet, and try to recognize a supported packet type (1)
  *  - if it's not a supported packet type, don't touch the packet, else:
- *  - modify the IPs in inner headers and in outer headers if any
  *  - reprocess the checksum of all supported layers. This is done in SW
  *or HW, depending on testpmd command line configuration
  *  - if TSO is enabled in testpmd command line, also flag the mbuf for TCP
@@ -747,14 +731,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
l3_hdr = (char *)l3_hdr + info.outer_l3_len + 
info.l2_len;
}

-   /* step 2: change all source IPs (v4 or v6) so we need
-* to recompute the chksums even if they were correct */
-
-   change_ip_addresses(l3_hdr, info.ethertype);
-   if (info.is_tunnel == 1)
-   change_ip_addresses(outer_l3_hdr, info.outer_ethertype);
-
-   /* step 3: depending on user command line configuration,
+   /* step 2: depending on user command line configuration,
 * recompute checksum either in software or flag the
 * mbuf to offload the calculation to the NIC. If TSO
 * is configured, prepare the mbuf for TCP segmentation. */
@@ -772,7 +749,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
!!(tx_ol_flags & PKT_TX_TCP_SEG));
}

-   /* step 4: fill the mbuf meta data (flags and header lengths) */
+   /* step 3: fill the mbuf meta data (flags and header lengths) */

if (info.is_tunnel == 1) {
if (info.tunnel_tso_segsz ||
-- 
2.8.1



[dpdk-dev] [PATCH v6 4/8] app/testpmd: add option to enable lro

2016-10-12 Thread Olivier Matz
Introduce a new argument '--enable-lro' to ask testpmd to enable the LRO
feature on enabled ports, as is done for '--enable-rx-cksum' for
instance.

Signed-off-by: Olivier Matz 
Acked-by: Pablo de Lara 
---
 app/test-pmd/parameters.c | 4 
 doc/guides/testpmd_app_ug/run_app.rst | 4 
 2 files changed, 8 insertions(+)

diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 6a6a07e..c45f78a 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -149,6 +149,7 @@ usage(char* progname)
   "If the drop-queue doesn't exist, the packet is dropped. "
   "By default drop-queue=127.\n");
printf("  --crc-strip: enable CRC stripping by hardware.\n");
+   printf("  --enable-lro: enable large receive offload.\n");
printf("  --enable-rx-cksum: enable rx hardware checksum offload.\n");
printf("  --disable-hw-vlan: disable hardware vlan.\n");
printf("  --disable-hw-vlan-filter: disable hardware vlan filter.\n");
@@ -524,6 +525,7 @@ launch_args_parse(int argc, char** argv)
{ "pkt-filter-size",1, 0, 0 },
{ "pkt-filter-drop-queue",  1, 0, 0 },
{ "crc-strip",  0, 0, 0 },
+   { "enable-lro", 0, 0, 0 },
{ "enable-rx-cksum",0, 0, 0 },
{ "enable-scatter", 0, 0, 0 },
{ "disable-hw-vlan",0, 0, 0 },
@@ -764,6 +766,8 @@ launch_args_parse(int argc, char** argv)
}
if (!strcmp(lgopts[opt_idx].name, "crc-strip"))
rx_mode.hw_strip_crc = 1;
+   if (!strcmp(lgopts[opt_idx].name, "enable-lro"))
+   rx_mode.enable_lro = 1;
if (!strcmp(lgopts[opt_idx].name, "enable-scatter"))
rx_mode.enable_scatter = 1;
if (!strcmp(lgopts[opt_idx].name, "enable-rx-cksum"))
diff --git a/doc/guides/testpmd_app_ug/run_app.rst 
b/doc/guides/testpmd_app_ug/run_app.rst
index 7712bd2..55c7ac0 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -285,6 +285,10 @@ The commandline options are:

 Enable hardware CRC stripping.

+*   ``--enable-lro``
+
+Enable large receive offload.
+
 *   ``--enable-rx-cksum``

 Enable hardware RX checksum offload.
-- 
2.8.1
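
A hypothetical invocation using the new option (the EAL arguments are only
illustrative):

    ./testpmd -c 0x3 -n 4 -- -i --enable-lro --enable-rx-cksum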



[dpdk-dev] [PATCH v6 3/8] app/testpmd: dump Rx flags in csum engine

2016-10-12 Thread Olivier Matz
Signed-off-by: Olivier Matz 
Acked-by: Pablo de Lara 
---
 app/test-pmd/csumonly.c | 31 +--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 2ecd6b8..42974d5 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -652,7 +652,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
uint16_t nb_rx;
uint16_t nb_tx;
uint16_t i;
-   uint64_t ol_flags;
+   uint64_t rx_ol_flags, tx_ol_flags;
uint16_t testpmd_ol_flags;
uint32_t retry;
uint32_t rx_bad_ip_csum;
@@ -693,13 +693,14 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i + 1],
   void *));

-   ol_flags = 0;
-   info.is_tunnel = 0;
m = pkts_burst[i];
+   info.is_tunnel = 0;
+   tx_ol_flags = 0;
+   rx_ol_flags = m->ol_flags;

/* Update the L3/L4 checksum error packet statistics */
-   rx_bad_ip_csum += ((m->ol_flags & PKT_RX_IP_CKSUM_BAD) != 0);
-   rx_bad_l4_csum += ((m->ol_flags & PKT_RX_L4_CKSUM_BAD) != 0);
+   rx_bad_ip_csum += ((rx_ol_flags & PKT_RX_IP_CKSUM_BAD) != 0);
+   rx_bad_l4_csum += ((rx_ol_flags & PKT_RX_L4_CKSUM_BAD) != 0);

/* step 1: dissect packet, parsing optional vlan, ip4/ip6, vxlan
 * and inner headers */
@@ -721,7 +722,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
info.l3_len);
parse_vxlan(udp_hdr, &info, m->packet_type);
if (info.is_tunnel)
-   ol_flags |= PKT_TX_TUNNEL_VXLAN;
+   tx_ol_flags |= PKT_TX_TUNNEL_VXLAN;
} else if (info.l4_proto == IPPROTO_GRE) {
struct simple_gre_hdr *gre_hdr;

@@ -729,14 +730,14 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
((char *)l3_hdr + info.l3_len);
parse_gre(gre_hdr, &info);
if (info.is_tunnel)
-   ol_flags |= PKT_TX_TUNNEL_GRE;
+   tx_ol_flags |= PKT_TX_TUNNEL_GRE;
} else if (info.l4_proto == IPPROTO_IPIP) {
void *encap_ip_hdr;

encap_ip_hdr = (char *)l3_hdr + info.l3_len;
parse_encap_ip(encap_ip_hdr, &info);
if (info.is_tunnel)
-   ol_flags |= PKT_TX_TUNNEL_IPIP;
+   tx_ol_flags |= PKT_TX_TUNNEL_IPIP;
}
}

@@ -759,15 +760,16 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 * is configured, prepare the mbuf for TCP segmentation. */

/* process checksums of inner headers first */
-   ol_flags |= process_inner_cksums(l3_hdr, &info,
testpmd_ol_flags);
+   tx_ol_flags |= process_inner_cksums(l3_hdr, &info,
+   testpmd_ol_flags);

/* Then process outer headers if any. Note that the software
 * checksum will be wrong if one of the inner checksums is
 * processed in hardware. */
if (info.is_tunnel == 1) {
-   ol_flags |= process_outer_cksums(outer_l3_hdr, ,
+   tx_ol_flags |= process_outer_cksums(outer_l3_hdr, ,
testpmd_ol_flags,
-   !!(ol_flags & PKT_TX_TCP_SEG));
+   !!(tx_ol_flags & PKT_TX_TCP_SEG));
}

/* step 4: fill the mbuf meta data (flags and header lengths) */
@@ -802,7 +804,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
m->l4_len = info.l4_len;
m->tso_segsz = info.tso_segsz;
}
-   m->ol_flags = ol_flags;
+   m->ol_flags = tx_ol_flags;

/* Do split & copy for the packet. */
if (tx_pkt_split != TX_PKT_SPLIT_OFF) {
@@ -822,10 +824,11 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
printf("mbuf=%p, pkt_len=%u, nb_segs=%hhu:\n",
m, m->pkt_len, m->nb_segs);
/* dump rx parsed packet info */
+   rte_get_rx_ol_flag_list(rx_ol_flags, buf, sizeof(buf));
printf("rx: l2_len=%d ethertype=%x l3_len=%d "
-   "l4_pro

[dpdk-dev] [PATCH v6 2/8] app/testpmd: use new function to dump offload flags

2016-10-12 Thread Olivier Matz
Use the functions introduced in the previous commit to dump the offload
flags.

Signed-off-by: Olivier Matz 
Acked-by: Pablo de Lara 
---
 app/test-pmd/csumonly.c | 31 +++
 app/test-pmd/rxonly.c   | 15 ++-
 2 files changed, 5 insertions(+), 41 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 4fe038d..2ecd6b8 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -816,27 +816,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)

/* if verbose mode is enabled, dump debug info */
if (verbose_level > 0) {
-   struct {
-   uint64_t flag;
-   uint64_t mask;
-   } tx_flags[] = {
-   { PKT_TX_IP_CKSUM, PKT_TX_IP_CKSUM },
-   { PKT_TX_UDP_CKSUM, PKT_TX_L4_MASK },
-   { PKT_TX_TCP_CKSUM, PKT_TX_L4_MASK },
-   { PKT_TX_SCTP_CKSUM, PKT_TX_L4_MASK },
-   { PKT_TX_IPV4, PKT_TX_IPV4 },
-   { PKT_TX_IPV6, PKT_TX_IPV6 },
-   { PKT_TX_OUTER_IP_CKSUM, PKT_TX_OUTER_IP_CKSUM 
},
-   { PKT_TX_OUTER_IPV4, PKT_TX_OUTER_IPV4 },
-   { PKT_TX_OUTER_IPV6, PKT_TX_OUTER_IPV6 },
-   { PKT_TX_TCP_SEG, PKT_TX_TCP_SEG },
-   { PKT_TX_TUNNEL_VXLAN, PKT_TX_TUNNEL_MASK },
-   { PKT_TX_TUNNEL_GRE, PKT_TX_TUNNEL_MASK },
-   { PKT_TX_TUNNEL_IPIP, PKT_TX_TUNNEL_MASK },
-   { PKT_TX_TUNNEL_GENEVE, PKT_TX_TUNNEL_MASK },
-   };
-   unsigned j;
-   const char *name;
+   char buf[256];

printf("-\n");
printf("mbuf=%p, pkt_len=%u, nb_segs=%hhu:\n",
@@ -872,13 +852,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
m->tso_segsz);
} else if (info.tso_segsz != 0)
printf("tx: m->tso_segsz=%d\n", m->tso_segsz);
-   printf("tx: flags=");
-   for (j = 0; j < sizeof(tx_flags)/sizeof(*tx_flags); 
j++) {
-   name = 
rte_get_tx_ol_flag_name(tx_flags[j].flag);
-   if ((m->ol_flags & tx_flags[j].mask) ==
-   tx_flags[j].flag)
-   printf("%s ", name);
-   }
+   rte_get_tx_ol_flag_list(m->ol_flags, buf, sizeof(buf));
+   printf("tx: flags=%s", buf);
printf("\n");
}
}
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index 9acc4c6..fff815c 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -229,19 +229,8 @@ pkt_burst_receive(struct fwd_stream *fs)
}
printf(" - Receive queue=0x%x", (unsigned) fs->rx_queue);
printf("\n");
-   if (ol_flags != 0) {
-   unsigned rxf;
-   const char *name;
-
-   for (rxf = 0; rxf < sizeof(mb->ol_flags) * 8; rxf++) {
-   if ((ol_flags & (1ULL << rxf)) == 0)
-   continue;
-   name = rte_get_rx_ol_flag_name(1ULL << rxf);
-   if (name == NULL)
-   continue;
-   printf("  %s\n", name);
-   }
-   }
+   rte_get_rx_ol_flag_list(mb->ol_flags, buf, sizeof(buf));
+   printf("  ol_flags: %s\n", buf);
rte_pktmbuf_free(mb);
}

-- 
2.8.1



[dpdk-dev] [PATCH v6 1/8] mbuf: add function to dump ol flag list

2016-10-12 Thread Olivier Matz
The functions rte_get_rx_ol_flag_name() and rte_get_tx_ol_flag_name()
can dump one flag, or a set of flags that are part of the same mask (ex:
PKT_TX_UDP_CKSUM, part of PKT_TX_L4_MASK). But they are not designed to
dump the list of flags contained in mbuf->ol_flags.

This commit introduces new functions to do that. Similarly to the packet
type dump functions, the goal is to factorize the code that could be
used in several applications and reduce the risk of desynchronization
between the flags and the dump functions.

Signed-off-by: Olivier Matz 
Acked-by: Pablo de Lara 
---
 doc/guides/rel_notes/release_16_11.rst |   5 ++
 lib/librte_mbuf/rte_mbuf.c | 101 +
 lib/librte_mbuf/rte_mbuf.h |  28 +
 lib/librte_mbuf/rte_mbuf_version.map   |   2 +
 4 files changed, 136 insertions(+)

diff --git a/doc/guides/rel_notes/release_16_11.rst 
b/doc/guides/rel_notes/release_16_11.rst
index 25c447d..0f04a39 100644
--- a/doc/guides/rel_notes/release_16_11.rst
+++ b/doc/guides/rel_notes/release_16_11.rst
@@ -99,6 +99,11 @@ New Features
   * AES GCM/CTR mode


+* **Added functions to dump the offload flags as a string.**
+
+  Added two new functions ``rte_get_rx_ol_flag_list()`` and
+  ``rte_get_tx_ol_flag_list()`` to dump offload flags as a string.
+
 Resolved Issues
 ---

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 04f9ed3..4e1fdd1 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -319,6 +319,54 @@ const char *rte_get_rx_ol_flag_name(uint64_t mask)
}
 }

+struct flag_mask {
+   uint64_t flag;
+   uint64_t mask;
+   const char *default_name;
+};
+
+/* write the list of rx ol flags in buffer buf */
+int
+rte_get_rx_ol_flag_list(uint64_t mask, char *buf, size_t buflen)
+{
+   const struct flag_mask rx_flags[] = {
+   { PKT_RX_VLAN_PKT, PKT_RX_VLAN_PKT, NULL },
+   { PKT_RX_RSS_HASH, PKT_RX_RSS_HASH, NULL },
+   { PKT_RX_FDIR, PKT_RX_FDIR, NULL },
+   { PKT_RX_L4_CKSUM_BAD, PKT_RX_L4_CKSUM_BAD, NULL },
+   { PKT_RX_IP_CKSUM_BAD, PKT_RX_IP_CKSUM_BAD, NULL },
+   { PKT_RX_EIP_CKSUM_BAD, PKT_RX_EIP_CKSUM_BAD, NULL },
+   { PKT_RX_VLAN_STRIPPED, PKT_RX_VLAN_STRIPPED, NULL },
+   { PKT_RX_IEEE1588_PTP, PKT_RX_IEEE1588_PTP, NULL },
+   { PKT_RX_IEEE1588_TMST, PKT_RX_IEEE1588_TMST, NULL },
+   { PKT_RX_QINQ_STRIPPED, PKT_RX_QINQ_STRIPPED, NULL },
+   };
+   const char *name;
+   unsigned int i;
+   int ret;
+
+   if (buflen == 0)
+   return -1;
+
+   buf[0] = '\0';
+   for (i = 0; i < RTE_DIM(rx_flags); i++) {
+   if ((mask & rx_flags[i].mask) != rx_flags[i].flag)
+   continue;
+   name = rte_get_rx_ol_flag_name(rx_flags[i].flag);
+   if (name == NULL)
+   name = rx_flags[i].default_name;
+   ret = snprintf(buf, buflen, "%s ", name);
+   if (ret < 0)
+   return -1;
+   if ((size_t)ret >= buflen)
+   return -1;
+   buf += ret;
+   buflen -= ret;
+   }
+
+   return 0;
+}
+
 /*
  * Get the name of a TX offload flag. Must be kept synchronized with flag
  * definitions in rte_mbuf.h.
@@ -345,3 +393,56 @@ const char *rte_get_tx_ol_flag_name(uint64_t mask)
default: return NULL;
}
 }
+
+/* write the list of tx ol flags in buffer buf */
+int
+rte_get_tx_ol_flag_list(uint64_t mask, char *buf, size_t buflen)
+{
+   const struct flag_mask tx_flags[] = {
+   { PKT_TX_VLAN_PKT, PKT_TX_VLAN_PKT, NULL },
+   { PKT_TX_IP_CKSUM, PKT_TX_IP_CKSUM, NULL },
+   { PKT_TX_TCP_CKSUM, PKT_TX_L4_MASK, NULL },
+   { PKT_TX_SCTP_CKSUM, PKT_TX_L4_MASK, NULL },
+   { PKT_TX_UDP_CKSUM, PKT_TX_L4_MASK, NULL },
+   { PKT_TX_L4_NO_CKSUM, PKT_TX_L4_MASK, "PKT_TX_L4_NO_CKSUM" },
+   { PKT_TX_IEEE1588_TMST, PKT_TX_IEEE1588_TMST, NULL },
+   { PKT_TX_TCP_SEG, PKT_TX_TCP_SEG, NULL },
+   { PKT_TX_IPV4, PKT_TX_IPV4, NULL },
+   { PKT_TX_IPV6, PKT_TX_IPV6, NULL },
+   { PKT_TX_OUTER_IP_CKSUM, PKT_TX_OUTER_IP_CKSUM, NULL },
+   { PKT_TX_OUTER_IPV4, PKT_TX_OUTER_IPV4, NULL },
+   { PKT_TX_OUTER_IPV6, PKT_TX_OUTER_IPV6, NULL },
+   { PKT_TX_TUNNEL_VXLAN, PKT_TX_TUNNEL_MASK,
+ "PKT_TX_TUNNEL_NONE" },
+   { PKT_TX_TUNNEL_GRE, PKT_TX_TUNNEL_MASK,
+ "PKT_TX_TUNNEL_NONE" },
+   { PKT_TX_TUNNEL_IPIP, PKT_TX_TUNNEL_MASK,
+ "PKT_TX_TUNNEL_NONE" },
+   { PKT_TX_TUNNEL_GENEVE, PKT_TX_TUNNEL_MASK,
+ "PKT_TX_TUNNEL_NONE" },

[dpdk-dev] [PATCH v6 0/8] Misc enhancements in testpmd

2016-10-12 Thread Olivier Matz
This patchset introduces several enhancements or minor fixes
in testpmd. It is targeted for v16.11, and applies on top of the
software ptype v2 patchset [1].

These patches are useful to validate the virtio offload
patchset [2] (to be rebased).

[1] http://dpdk.org/ml/archives/dev/2016-August/045876.html
[2] http://dpdk.org/ml/archives/dev/2016-July/044404.html

v5 -> v6:
- rebase against head

v4 -> v5:
- fix headline lowercase for "Rx"
- fix typo in API comment: "ouput" -> "output"

v3 -> v4:
- fix typo in documentation

v2 -> v3:
- move return type on a separate line in function definitions
- add documentation for the new --enable-lro option

v1 -> v2:
- rebase on top of sw ptype v2 patch

Olivier Matz (8):
  mbuf: add function to dump ol flag list
  app/testpmd: use new function to dump offload flags
  app/testpmd: dump Rx flags in csum engine
  app/testpmd: add option to enable lro
  app/testpmd: do not change ip addrs in csum engine
  app/testpmd: display Rx port in csum engine
  app/testpmd: don't use tso if packet is too small
  app/testpmd: hide segsize when irrelevant in csum engine

 app/test-pmd/csumonly.c| 119 +
 app/test-pmd/parameters.c  |   4 ++
 app/test-pmd/rxonly.c  |  15 +
 doc/guides/rel_notes/release_16_11.rst |   5 ++
 doc/guides/testpmd_app_ug/run_app.rst  |   4 ++
 lib/librte_mbuf/rte_mbuf.c | 101 
 lib/librte_mbuf/rte_mbuf.h |  28 
 lib/librte_mbuf/rte_mbuf_version.map   |   2 +
 8 files changed, 192 insertions(+), 86 deletions(-)

-- 
2.8.1



[dpdk-dev] [PATCH v2 09/12] virtio: add Rx checksum offload support

2016-10-11 Thread Olivier MATZ


On 10/11/2016 04:36 PM, Maxime Coquelin wrote:
>
>
> On 10/11/2016 04:29 PM, Olivier MATZ wrote:
>>
>>
>> On 10/11/2016 04:04 PM, Maxime Coquelin wrote:
>>>> +/* Optionally fill offload information in structure */
>>>> +static int
>>>> +virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
>>>> +{
>>>> +struct rte_net_hdr_lens hdr_lens;
>>>> +uint32_t hdrlen, ptype;
>>>> +int l4_supported = 0;
>>>> +
>>>> +/* nothing to do */
>>>> +if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
>>>> +return 0;
>>>> +
>>>> +m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
>>>> +
>>>> +ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
>>>> +m->packet_type = ptype;
>>>> +if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
>>>> +(ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
>>>> +(ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
>>>> +l4_supported = 1;
>>>> +
>>>> +if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
>>>> +hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
>>>> +if (hdr->csum_start <= hdrlen && l4_supported) {
>>>> +m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
>>>> +} else {
>>>> +/* Unknown proto or tunnel, do sw cksum. We can assume
>>>> + * the cksum field is in the first segment since the
>>>> + * buffers we provided to the host are large enough.
>>>> + * In case of SCTP, this will be wrong since it's a CRC
>>>> + * but there's nothing we can do.
>>>> + */
>>>> +uint16_t csum, off;
>>>> +
>>>> +csum = rte_raw_cksum_mbuf(m, hdr->csum_start,
>>>> +rte_pktmbuf_pkt_len(m) - hdr->csum_start);
>>>> +if (csum != 0xffff)
>>> Why don't we do the 1-complement if 0xffff?
>>
>> This was modified after a comment from Xiao.
>>
>> In checksum arithmetic (ones' complement), there are 2 equivalent ways
>> to say the checksum is 0: 0xffff (0-), and 0x0000 (0+).
>> Some protocols like UDP use this to differentiate between 0xffff (packet
>> checksum is 0) and 0x0000 (packet checksum is not calculated).
>>
>> Here, we want to avoid setting a checksum to 0, in case it would mean no
>> checksum for UDP packets. Instead, it is set to 0xffff, which is also a
>> valid checksum for this packet.
>
> Ha ok, I wasn't aware of this.
> Thanks for the explanation!
>
> Maybe not a big deal, but we could add likely around the test?

Yep, good idea.

Thanks!
Olivier


[dpdk-dev] [PATCH v2 09/12] virtio: add Rx checksum offload support

2016-10-11 Thread Olivier MATZ


On 10/11/2016 04:04 PM, Maxime Coquelin wrote:
>> +/* Optionally fill offload information in structure */
>> +static int
>> +virtio_rx_offload(struct rte_mbuf *m, struct virtio_net_hdr *hdr)
>> +{
>> +struct rte_net_hdr_lens hdr_lens;
>> +uint32_t hdrlen, ptype;
>> +int l4_supported = 0;
>> +
>> +/* nothing to do */
>> +if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
>> +return 0;
>> +
>> +m->ol_flags |= PKT_RX_IP_CKSUM_UNKNOWN;
>> +
>> +ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
>> +m->packet_type = ptype;
>> +if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
>> +(ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
>> +(ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
>> +l4_supported = 1;
>> +
>> +if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
>> +hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
>> +if (hdr->csum_start <= hdrlen && l4_supported) {
>> +m->ol_flags |= PKT_RX_L4_CKSUM_NONE;
>> +} else {
>> +/* Unknown proto or tunnel, do sw cksum. We can assume
>> + * the cksum field is in the first segment since the
>> + * buffers we provided to the host are large enough.
>> + * In case of SCTP, this will be wrong since it's a CRC
>> + * but there's nothing we can do.
>> + */
>> +uint16_t csum, off;
>> +
>> +csum = rte_raw_cksum_mbuf(m, hdr->csum_start,
>> +rte_pktmbuf_pkt_len(m) - hdr->csum_start);
>> +if (csum != 0xffff)
> Why don't we do the 1-complement if 0xffff?

This was modified after a comment from Xiao.

In checksum arithmetic (ones' complement), there are 2 equivalent ways
to say the checksum is 0: 0xffff (0-), and 0x0000 (0+).
Some protocols like UDP use this to differentiate between 0xffff (packet
checksum is 0) and 0x0000 (packet checksum is not calculated).

Here, we want to avoid setting a checksum to 0, in case it would mean no
checksum for UDP packets. Instead, it is set to 0xffff, which is also a
valid checksum for this packet.

Regards,
Olivier
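
A one-line sketch of the convention described above, as it would apply to
finalizing a UDP checksum (a hypothetical helper):

    /* a computed checksum of 0 must be sent as 0xffff: both encode 0 in
     * ones' complement, and 0x0000 on the wire means "no checksum" */
    static uint16_t
    udp_finalize_cksum(uint16_t cksum)
    {
        return (cksum == 0) ? 0xffff : cksum;
    }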


[dpdk-dev] [PATCH v2 04/12] net: add function to calculate a checksum in a mbuf

2016-10-11 Thread Olivier MATZ
Hi Maxime,

On 10/11/2016 03:25 PM, Maxime Coquelin wrote:
>>  /**
>> + * Compute the raw (non complemented) checksum of a packet.
>> + *
>> + * @param m
>> + *   The pointer to the mbuf.
>> + * @param off
>> + *   The offset in bytes to start the checksum.
>> + * @param len
>> + *   The length in bytes of the data to checksum.
>> + */
>> +static inline uint16_t
>> +rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len)
>> +{
>> +const struct rte_mbuf *seg;
>> +const char *buf;
>> +uint32_t sum, tmp;
>> +uint32_t seglen, done;
>> +
>> +/* easy case: all data in the first segment */
>> +if (off + len <= rte_pktmbuf_data_len(m))
>> +return rte_raw_cksum(rte_pktmbuf_mtod_offset(m,
>> +const char *, off), len);
>> +
>> +if (off + len > rte_pktmbuf_pkt_len(m))
> unlikely?

Yes, will add it.

>> +return 0; /* invalid params, return a dummy value */
> Couldn't be better to return an error, so that the caller has a chance
> to see it is passing wrong arguments?
> The csum would be passed as an arg.

Looks much better indeed. I'll change it for next revision.


Thanks,
Olivier




[dpdk-dev] [PATCH v2 00/12] net/virtio: add offload support

2016-10-11 Thread Olivier MATZ
Hi Yuanhan,

On 10/11/2016 01:35 PM, Yuanhan Liu wrote:
> Hi,
>
> Firstly, apologize for so late review. It's been forgotten :(
>
> BTW, please feel free to ping me in future if I made no response
> in one or two weeks!
>
> I haven't reviewed it carefully yet (something I will do tomorrow).
> Before that, few quick questions.
>
> Firstly, would you write down some test steps? Honestly, I'm not
> quite sure how that works without the TCP/IP stack.

Not sure I'm getting your question.
The test plan described in the cover letter works without any dpdk 
tcp/ip stack. It uses testpmd, which is able to bridge packets and ask 
for TCP segmentation.


> On Mon, Oct 03, 2016 at 11:00:11AM +0200, Olivier Matz wrote:
>> This patchset, targetted for 16.11, introduces the support of rx and tx
>> offload in virtio pmd.  To achieve this, some new mbuf flags must be
>> introduced, as discussed in [1].
>>
>> It applies on top of:
>> - software packet type [2]
>> - testpmd enhancements [3]
>
> I didn't do the search. Have the two got merged?

As of now, it's not merged yet. I think Thomas is on it.

Regards,
Olivier


[dpdk-dev] [PATCH v3 03/16] mbuf: move packet type definitions in a new file

2016-10-11 Thread Olivier MATZ
Hi Thomas,

On 10/10/2016 04:52 PM, Thomas Monjalon wrote:
> 2016-10-03 10:38, Olivier Matz:
>> The file rte_mbuf.h starts to be quite big, and next commits
>> will introduce more functions related to packet types. Let's
>> move them in a new file.
>>
>> Signed-off-by: Olivier Matz 
>> ---
>>   lib/librte_mbuf/Makefile |   2 +-
>>   lib/librte_mbuf/rte_mbuf.h   | 495 +--
>>   lib/librte_mbuf/rte_mbuf_ptype.h | 552 
>> +++
>
> Why not moving packet type and other packet flags in librte_net?
>

These are mbuf features.
I can reverse the question: why moving them in librte_net? :)



[dpdk-dev] [PATCH v5 8/8] app/testpmd: hide segsize when irrelevant in csum engine

2016-10-07 Thread Olivier Matz
When TSO is not requested, hide the segment size.

Signed-off-by: Olivier Matz 
---
 app/test-pmd/csumonly.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 2057633..d5eb260 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -808,7 +808,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
(testpmd_ol_flags & 
TESTPMD_TX_OFFLOAD_OUTER_IP_CKSUM))
printf("tx: m->outer_l2_len=%d 
m->outer_l3_len=%d\n",
m->outer_l2_len, m->outer_l3_len);
-   if (info.tso_segsz != 0)
+   if (info.tso_segsz != 0 && (m->ol_flags & 
PKT_TX_TCP_SEG))
printf("tx: m->tso_segsz=%d\n", m->tso_segsz);
rte_get_tx_ol_flag_list(m->ol_flags, buf, sizeof(buf));
printf("tx: flags=%s", buf);
-- 
2.8.1



[dpdk-dev] [PATCH v5 7/8] app/testpmd: don't use tso if packet is too small

2016-10-07 Thread Olivier Matz
Asking for TSO (TCP Segmentation Offload) on packets that are already
smaller than (headers + MSS) does not work, for instance on ixgbe.

Fix the csumonly engine to only set the TSO flag when a segmentation
offload is really required, i.e. when packet is large enough.

Signed-off-by: Olivier Matz 
---
 app/test-pmd/csumonly.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 19c8099..2057633 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -101,6 +101,7 @@ struct testpmd_offload_info {
uint16_t outer_l3_len;
uint8_t outer_l4_proto;
uint16_t tso_segsz;
+   uint32_t pkt_len;
 };

 /* simplified GRE header */
@@ -328,13 +329,20 @@ process_inner_cksums(void *l3_hdr, const struct 
testpmd_offload_info *info,
struct tcp_hdr *tcp_hdr;
struct sctp_hdr *sctp_hdr;
uint64_t ol_flags = 0;
+   uint32_t max_pkt_len, tso_segsz = 0;
+
+   /* ensure packet is large enough to require tso */
+   max_pkt_len = info->l2_len + info->l3_len + info->l4_len +
+   info->tso_segsz;
+   if (info->tso_segsz != 0 && info->pkt_len > max_pkt_len)
+   tso_segsz = info->tso_segsz;

if (info->ethertype == _htons(ETHER_TYPE_IPv4)) {
ipv4_hdr = l3_hdr;
ipv4_hdr->hdr_checksum = 0;

ol_flags |= PKT_TX_IPV4;
-   if (info->tso_segsz != 0 && info->l4_proto == IPPROTO_TCP) {
+   if (tso_segsz != 0 && info->l4_proto == IPPROTO_TCP) {
ol_flags |= PKT_TX_IP_CKSUM;
} else {
if (testpmd_ol_flags & TESTPMD_TX_OFFLOAD_IP_CKSUM)
@@ -366,7 +374,7 @@ process_inner_cksums(void *l3_hdr, const struct 
testpmd_offload_info *info,
} else if (info->l4_proto == IPPROTO_TCP) {
tcp_hdr = (struct tcp_hdr *)((char *)l3_hdr + info->l3_len);
tcp_hdr->cksum = 0;
-   if (info->tso_segsz != 0) {
+   if (tso_segsz != 0) {
ol_flags |= PKT_TX_TCP_SEG;
tcp_hdr->cksum = get_psd_sum(l3_hdr, info->ethertype,
ol_flags);
@@ -666,6 +674,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)

m = pkts_burst[i];
info.is_tunnel = 0;
+   info.pkt_len = rte_pktmbuf_pkt_len(m);
tx_ol_flags = 0;
rx_ol_flags = m->ol_flags;

-- 
2.8.1



[dpdk-dev] [PATCH v5 6/8] app/testpmd: display Rx port in csum engine

2016-10-07 Thread Olivier Matz
This information is useful when debugging, especially with
bidirectional traffic.

Signed-off-by: Olivier Matz 
---
 app/test-pmd/csumonly.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index eeb67db..19c8099 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -773,8 +773,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
char buf[256];

printf("-\n");
-   printf("mbuf=%p, pkt_len=%u, nb_segs=%hhu:\n",
-   m, m->pkt_len, m->nb_segs);
+   printf("port=%u, mbuf=%p, pkt_len=%u, nb_segs=%hhu:\n",
+   fs->rx_port, m, m->pkt_len, m->nb_segs);
/* dump rx parsed packet info */
rte_get_rx_ol_flag_list(rx_ol_flags, buf, sizeof(buf));
printf("rx: l2_len=%d ethertype=%x l3_len=%d "
-- 
2.8.1



[dpdk-dev] [PATCH v5 5/8] app/testpmd: do not change ip addrs in csum engine

2016-10-07 Thread Olivier Matz
The csum forward engine was updated to change the IP addresses in the
packet data in
commit 51f694dd40f5 ("app/testpmd: rework checksum forward engine")

This was done to ensure that the checksum is correctly reprocessed when
using hardware checksum offload. But the functions
process_inner_cksums() and process_outer_cksums() already reset the
checksum field to 0, so this is not necessary.

Moreover, this makes the engine more complex than needed, and prevents
easily using it to forward traffic (e.g. with iperf) since it modifies
the packets.

This patch drops this behavior.

Signed-off-by: Olivier Matz 
---
 app/test-pmd/csumonly.c | 30 --
 1 file changed, 4 insertions(+), 26 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index e7ee0b3..eeb67db 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -317,21 +317,6 @@ parse_encap_ip(void *encap_ip, struct testpmd_offload_info *info)
info->l2_len = 0;
 }

-/* modify the IPv4 or IPv4 source address of a packet */
-static void
-change_ip_addresses(void *l3_hdr, uint16_t ethertype)
-{
-   struct ipv4_hdr *ipv4_hdr = l3_hdr;
-   struct ipv6_hdr *ipv6_hdr = l3_hdr;
-
-   if (ethertype == _htons(ETHER_TYPE_IPv4)) {
-   ipv4_hdr->src_addr =
-   rte_cpu_to_be_32(rte_be_to_cpu_32(ipv4_hdr->src_addr) + 1);
-   } else if (ethertype == _htons(ETHER_TYPE_IPv6)) {
-   ipv6_hdr->src_addr[15] = ipv6_hdr->src_addr[15] + 1;
-   }
-}
-
 /* if possible, calculate the checksum of a packet in hw or sw,
  * depending on the testpmd command line configuration */
 static uint64_t
@@ -608,7 +593,6 @@ pkt_copy_split(const struct rte_mbuf *pkt)
  * Receive a burst of packets, and for each packet:
  *  - parse packet, and try to recognize a supported packet type (1)
  *  - if it's not a supported packet type, don't touch the packet, else:
- *  - modify the IPs in inner headers and in outer headers if any
  *  - reprocess the checksum of all supported layers. This is done in SW
  *or HW, depending on testpmd command line configuration
  *  - if TSO is enabled in testpmd command line, also flag the mbuf for TCP
@@ -725,20 +709,14 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
l3_hdr = (char *)l3_hdr + info.outer_l3_len + info.l2_len;
}

-   /* step 2: change all source IPs (v4 or v6) so we need
-* to recompute the chksums even if they were correct */
-
-   change_ip_addresses(l3_hdr, info.ethertype);
-   if (info.is_tunnel == 1)
-   change_ip_addresses(outer_l3_hdr, info.outer_ethertype);
-
-   /* step 3: depending on user command line configuration,
+   /* step 2: depending on user command line configuration,
 * recompute checksum either in software or flag the
 * mbuf to offload the calculation to the NIC. If TSO
 * is configured, prepare the mbuf for TCP segmentation. */

/* process checksums of inner headers first */
-   tx_ol_flags |= process_inner_cksums(l3_hdr, &info, testpmd_ol_flags);
+   tx_ol_flags |= process_inner_cksums(l3_hdr, &info,
+   testpmd_ol_flags);

/* Then process outer headers if any. Note that the software
 * checksum will be wrong if one of the inner checksums is
@@ -748,7 +726,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
testpmd_ol_flags);
}

-   /* step 4: fill the mbuf meta data (flags and header lengths) */
+   /* step 3: fill the mbuf meta data (flags and header lengths) */

if (info.is_tunnel == 1) {
if (testpmd_ol_flags & TESTPMD_TX_OFFLOAD_OUTER_IP_CKSUM) {
-- 
2.8.1
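
For background on the simplification, a minimal sketch (using
rte_ipv4_cksum() from rte_ip.h) of why mutating addresses is not
needed to force a recomputation: the checksum field is zeroed before
being recomputed, so the previous value can never leak into the
result.

  #include <rte_ip.h>

  /* Software IPv4 header checksum recomputation: zeroing
   * hdr_checksum first guarantees a fresh value, regardless of the
   * packet's addresses or of the previous checksum. */
  static void
  recompute_ipv4_cksum(struct ipv4_hdr *ipv4_hdr)
  {
          ipv4_hdr->hdr_checksum = 0;
          ipv4_hdr->hdr_checksum = rte_ipv4_cksum(ipv4_hdr);
  }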



[dpdk-dev] [PATCH v5 4/8] app/testpmd: add option to enable lro

2016-10-07 Thread Olivier Matz
Introduce a new argument '--enable-lro' to ask testpmd to enable the
LRO feature on enabled ports, as is already done for
'--enable-rx-cksum', for instance.

Signed-off-by: Olivier Matz 
---
 app/test-pmd/parameters.c | 4 
 doc/guides/testpmd_app_ug/run_app.rst | 4 
 2 files changed, 8 insertions(+)

diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 6a6a07e..c45f78a 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -149,6 +149,7 @@ usage(char* progname)
   "If the drop-queue doesn't exist, the packet is dropped. "
   "By default drop-queue=127.\n");
printf("  --crc-strip: enable CRC stripping by hardware.\n");
+   printf("  --enable-lro: enable large receive offload.\n");
printf("  --enable-rx-cksum: enable rx hardware checksum offload.\n");
printf("  --disable-hw-vlan: disable hardware vlan.\n");
printf("  --disable-hw-vlan-filter: disable hardware vlan filter.\n");
@@ -524,6 +525,7 @@ launch_args_parse(int argc, char** argv)
{ "pkt-filter-size",1, 0, 0 },
{ "pkt-filter-drop-queue",  1, 0, 0 },
{ "crc-strip",  0, 0, 0 },
+   { "enable-lro", 0, 0, 0 },
{ "enable-rx-cksum",0, 0, 0 },
{ "enable-scatter", 0, 0, 0 },
{ "disable-hw-vlan",0, 0, 0 },
@@ -764,6 +766,8 @@ launch_args_parse(int argc, char** argv)
}
if (!strcmp(lgopts[opt_idx].name, "crc-strip"))
rx_mode.hw_strip_crc = 1;
+   if (!strcmp(lgopts[opt_idx].name, "enable-lro"))
+   rx_mode.enable_lro = 1;
if (!strcmp(lgopts[opt_idx].name, "enable-scatter"))
rx_mode.enable_scatter = 1;
if (!strcmp(lgopts[opt_idx].name, "enable-rx-cksum"))
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 7712bd2..55c7ac0 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -285,6 +285,10 @@ The commandline options are:

 Enable hardware CRC stripping.

+*   ``--enable-lro``
+
+Enable large receive offload.
+
 *   ``--enable-rx-cksum``

 Enable hardware RX checksum offload.
-- 
2.8.1
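
An illustrative invocation (the EAL arguments are only examples):

  ./testpmd -c 0x3 -n 4 -- -i --enable-lro --enable-rx-cksum

Ports initialized afterwards get rx_mode.enable_lro = 1, provided the
PMD supports LRO.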



[dpdk-dev] [PATCH v5 3/8] app/testpmd: dump Rx flags in csum engine

2016-10-07 Thread Olivier Matz
Signed-off-by: Olivier Matz 
---
 app/test-pmd/csumonly.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 5ca5702..e7ee0b3 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -640,7 +640,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
uint16_t nb_rx;
uint16_t nb_tx;
uint16_t i;
-   uint64_t ol_flags;
+   uint64_t rx_ol_flags, tx_ol_flags;
uint16_t testpmd_ol_flags;
uint32_t retry;
uint32_t rx_bad_ip_csum;
@@ -680,13 +680,14 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i + 1],
   void *));

-   ol_flags = 0;
-   info.is_tunnel = 0;
m = pkts_burst[i];
+   info.is_tunnel = 0;
+   tx_ol_flags = 0;
+   rx_ol_flags = m->ol_flags;

/* Update the L3/L4 checksum error packet statistics */
-   rx_bad_ip_csum += ((m->ol_flags & PKT_RX_IP_CKSUM_BAD) != 0);
-   rx_bad_l4_csum += ((m->ol_flags & PKT_RX_L4_CKSUM_BAD) != 0);
+   rx_bad_ip_csum += ((rx_ol_flags & PKT_RX_IP_CKSUM_BAD) != 0);
+   rx_bad_l4_csum += ((rx_ol_flags & PKT_RX_L4_CKSUM_BAD) != 0);

/* step 1: dissect packet, parsing optional vlan, ip4/ip6, vxlan
 * and inner headers */
@@ -737,13 +738,13 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 * is configured, prepare the mbuf for TCP segmentation. */

/* process checksums of inner headers first */
-   ol_flags |= process_inner_cksums(l3_hdr, &info, testpmd_ol_flags);
+   tx_ol_flags |= process_inner_cksums(l3_hdr, &info, testpmd_ol_flags);

/* Then process outer headers if any. Note that the software
 * checksum will be wrong if one of the inner checksums is
 * processed in hardware. */
if (info.is_tunnel == 1) {
-   ol_flags |= process_outer_cksums(outer_l3_hdr, &info,
+   tx_ol_flags |= process_outer_cksums(outer_l3_hdr, &info,
testpmd_ol_flags);
}

@@ -777,7 +778,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
m->l4_len = info.l4_len;
}
m->tso_segsz = info.tso_segsz;
-   m->ol_flags = ol_flags;
+   m->ol_flags = tx_ol_flags;

/* Do split & copy for the packet. */
if (tx_pkt_split != TX_PKT_SPLIT_OFF) {
@@ -797,10 +798,11 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
printf("mbuf=%p, pkt_len=%u, nb_segs=%hhu:\n",
m, m->pkt_len, m->nb_segs);
/* dump rx parsed packet info */
+   rte_get_rx_ol_flag_list(rx_ol_flags, buf, sizeof(buf));
printf("rx: l2_len=%d ethertype=%x l3_len=%d "
-   "l4_proto=%d l4_len=%d\n",
+   "l4_proto=%d l4_len=%d flags=%s\n",
info.l2_len, rte_be_to_cpu_16(info.ethertype),
-   info.l3_len, info.l4_proto, info.l4_len);
+   info.l3_len, info.l4_proto, info.l4_len, buf);
if (info.is_tunnel == 1)
printf("rx: outer_l2_len=%d outer_ethertype=%x "
"outer_l3_len=%d\n", info.outer_l2_len,
-- 
2.8.1
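
Taking the rx_ol_flags snapshot up front matters because the engine
later overwrites m->ol_flags with the computed TX flags, so the RX
information would be gone by the time the verbose dump runs. A
condensed sketch of the pattern (illustrative, not the full loop):

  uint64_t rx_ol_flags, tx_ol_flags = 0;

  rx_ol_flags = m->ol_flags;      /* snapshot RX flags first */
  /* ... checksum processing accumulates bits in tx_ol_flags ... */
  m->ol_flags = tx_ol_flags;      /* RX flags no longer in the mbuf */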



[dpdk-dev] [PATCH v5 2/8] app/testpmd: use new function to dump offload flags

2016-10-07 Thread Olivier Matz
Use the functions introduced in the previous commit to dump the offload
flags.

Signed-off-by: Olivier Matz 
---
 app/test-pmd/csumonly.c | 27 +++
 app/test-pmd/rxonly.c   | 15 ++-
 2 files changed, 5 insertions(+), 37 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 21cb78f..5ca5702 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -791,23 +791,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)

/* if verbose mode is enabled, dump debug info */
if (verbose_level > 0) {
-   struct {
-   uint64_t flag;
-   uint64_t mask;
-   } tx_flags[] = {
-   { PKT_TX_IP_CKSUM, PKT_TX_IP_CKSUM },
-   { PKT_TX_UDP_CKSUM, PKT_TX_L4_MASK },
-   { PKT_TX_TCP_CKSUM, PKT_TX_L4_MASK },
-   { PKT_TX_SCTP_CKSUM, PKT_TX_L4_MASK },
-   { PKT_TX_IPV4, PKT_TX_IPV4 },
-   { PKT_TX_IPV6, PKT_TX_IPV6 },
-   { PKT_TX_OUTER_IP_CKSUM, PKT_TX_OUTER_IP_CKSUM },
-   { PKT_TX_OUTER_IPV4, PKT_TX_OUTER_IPV4 },
-   { PKT_TX_OUTER_IPV6, PKT_TX_OUTER_IPV6 },
-   { PKT_TX_TCP_SEG, PKT_TX_TCP_SEG },
-   };
-   unsigned j;
-   const char *name;
+   char buf[256];

printf("-\n");
printf("mbuf=%p, pkt_len=%u, nb_segs=%hhu:\n",
@@ -837,13 +821,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
m->outer_l2_len, m->outer_l3_len);
if (info.tso_segsz != 0)
printf("tx: m->tso_segsz=%d\n", m->tso_segsz);
-   printf("tx: flags=");
-   for (j = 0; j < sizeof(tx_flags)/sizeof(*tx_flags); j++) {
-   name = rte_get_tx_ol_flag_name(tx_flags[j].flag);
-   if ((m->ol_flags & tx_flags[j].mask) ==
-   tx_flags[j].flag)
-   printf("%s ", name);
-   }
+   rte_get_tx_ol_flag_list(m->ol_flags, buf, sizeof(buf));
+   printf("tx: flags=%s", buf);
printf("\n");
}
}
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index 9acc4c6..fff815c 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -229,19 +229,8 @@ pkt_burst_receive(struct fwd_stream *fs)
}
printf(" - Receive queue=0x%x", (unsigned) fs->rx_queue);
printf("\n");
-   if (ol_flags != 0) {
-   unsigned rxf;
-   const char *name;
-
-   for (rxf = 0; rxf < sizeof(mb->ol_flags) * 8; rxf++) {
-   if ((ol_flags & (1ULL << rxf)) == 0)
-   continue;
-   name = rte_get_rx_ol_flag_name(1ULL << rxf);
-   if (name == NULL)
-   continue;
-   printf("  %s\n", name);
-   }
-   }
+   rte_get_rx_ol_flag_list(mb->ol_flags, buf, sizeof(buf));
+   printf("  ol_flags: %s\n", buf);
rte_pktmbuf_free(mb);
}

-- 
2.8.1
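
The flag/mask pairs in the removed table (and in the new helpers)
exist because some TX flags are multi-bit fields: PKT_TX_TCP_CKSUM,
PKT_TX_UDP_CKSUM and PKT_TX_SCTP_CKSUM are values within
PKT_TX_L4_MASK, so a plain bit test would mis-report them. The
matching rule, as a one-line sketch:

  /* a flag "matches" when the masked field equals the flag value */
  int is_tcp_cksum = ((m->ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM);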



[dpdk-dev] [PATCH v5 1/8] mbuf: add function to dump ol flag list

2016-10-07 Thread Olivier Matz
The functions rte_get_rx_ol_flag_name() and rte_get_tx_ol_flag_name()
can dump one flag, or a set of flags that are part of the same mask (ex:
PKT_TX_UDP_CKSUM, part of PKT_TX_L4_MASK). But they are not designed to
dump the list of flags contained in mbuf->ol_flags.

This commit introduces new functions to do that. Similarly to the packet
type dump functions, the goal is to factorize the code that could be
used in several applications and to reduce the risk of desynchronization
between the flags and the dump functions.

Signed-off-by: Olivier Matz 
---
 doc/guides/rel_notes/release_16_11.rst |  5 ++
 lib/librte_mbuf/rte_mbuf.c | 93 ++
 lib/librte_mbuf/rte_mbuf.h | 28 ++
 lib/librte_mbuf/rte_mbuf_version.map   |  2 +
 4 files changed, 128 insertions(+)

diff --git a/doc/guides/rel_notes/release_16_11.rst b/doc/guides/rel_notes/release_16_11.rst
index e9dc797..a0dc9d4 100644
--- a/doc/guides/rel_notes/release_16_11.rst
+++ b/doc/guides/rel_notes/release_16_11.rst
@@ -78,6 +78,11 @@ New Features

   Added new functions ``rte_get_ptype_*()`` to dump a packet type as a string.

+* **Added functions to dump the offload flags as a string.**
+
+  Added two new functions ``rte_get_rx_ol_flag_list()`` and
+  ``rte_get_tx_ol_flag_list()`` to dump offload flags as a string.
+
 Resolved Issues
 ---

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index cd95e6f..d58a701 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -319,6 +319,54 @@ const char *rte_get_rx_ol_flag_name(uint64_t mask)
}
 }

+struct flag_mask {
+   uint64_t flag;
+   uint64_t mask;
+   const char *default_name;
+};
+
+/* write the list of rx ol flags in buffer buf */
+int
+rte_get_rx_ol_flag_list(uint64_t mask, char *buf, size_t buflen)
+{
+   const struct flag_mask rx_flags[] = {
+   { PKT_RX_VLAN_PKT, PKT_RX_VLAN_PKT, NULL },
+   { PKT_RX_RSS_HASH, PKT_RX_RSS_HASH, NULL },
+   { PKT_RX_FDIR, PKT_RX_FDIR, NULL },
+   { PKT_RX_L4_CKSUM_BAD, PKT_RX_L4_CKSUM_BAD, NULL },
+   { PKT_RX_IP_CKSUM_BAD, PKT_RX_IP_CKSUM_BAD, NULL },
+   { PKT_RX_EIP_CKSUM_BAD, PKT_RX_EIP_CKSUM_BAD, NULL },
+   { PKT_RX_VLAN_STRIPPED, PKT_RX_VLAN_STRIPPED, NULL },
+   { PKT_RX_IEEE1588_PTP, PKT_RX_IEEE1588_PTP, NULL },
+   { PKT_RX_IEEE1588_TMST, PKT_RX_IEEE1588_TMST, NULL },
+   { PKT_RX_QINQ_STRIPPED, PKT_RX_QINQ_STRIPPED, NULL },
+   };
+   const char *name;
+   unsigned int i;
+   int ret;
+
+   if (buflen == 0)
+   return -1;
+
+   buf[0] = '\0';
+   for (i = 0; i < RTE_DIM(rx_flags); i++) {
+   if ((mask & rx_flags[i].mask) != rx_flags[i].flag)
+   continue;
+   name = rte_get_rx_ol_flag_name(rx_flags[i].flag);
+   if (name == NULL)
+   name = rx_flags[i].default_name;
+   ret = snprintf(buf, buflen, "%s ", name);
+   if (ret < 0)
+   return -1;
+   if ((size_t)ret >= buflen)
+   return -1;
+   buf += ret;
+   buflen -= ret;
+   }
+
+   return 0;
+}
+
 /*
  * Get the name of a TX offload flag. Must be kept synchronized with flag
  * definitions in rte_mbuf.h.
@@ -341,3 +389,48 @@ const char *rte_get_tx_ol_flag_name(uint64_t mask)
default: return NULL;
}
 }
+
+/* write the list of tx ol flags in buffer buf */
+int
+rte_get_tx_ol_flag_list(uint64_t mask, char *buf, size_t buflen)
+{
+   const struct flag_mask tx_flags[] = {
+   { PKT_TX_VLAN_PKT, PKT_TX_VLAN_PKT, NULL },
+   { PKT_TX_IP_CKSUM, PKT_TX_IP_CKSUM, NULL },
+   { PKT_TX_TCP_CKSUM, PKT_TX_L4_MASK, NULL },
+   { PKT_TX_SCTP_CKSUM, PKT_TX_L4_MASK, NULL },
+   { PKT_TX_UDP_CKSUM, PKT_TX_L4_MASK, NULL },
+   { PKT_TX_L4_NO_CKSUM, PKT_TX_L4_MASK, "PKT_TX_L4_NO_CKSUM" },
+   { PKT_TX_IEEE1588_TMST, PKT_TX_IEEE1588_TMST, NULL },
+   { PKT_TX_TCP_SEG, PKT_TX_TCP_SEG, NULL },
+   { PKT_TX_IPV4, PKT_TX_IPV4, NULL },
+   { PKT_TX_IPV6, PKT_TX_IPV6, NULL },
+   { PKT_TX_OUTER_IP_CKSUM, PKT_TX_OUTER_IP_CKSUM, NULL },
+   { PKT_TX_OUTER_IPV4, PKT_TX_OUTER_IPV4, NULL },
+   { PKT_TX_OUTER_IPV6, PKT_TX_OUTER_IPV6, NULL },
+   };
+   const char *name;
+   unsigned int i;
+   int ret;
+
+   if (buflen == 0)
+   return -1;
+
+   buf[0] = '\0';
+   for (i = 0; i < RTE_DIM(tx_flags); i++) {
+   if ((mask & tx_flags[i].mask) != tx_flags[i].flag)
+   continue;
+   name = rte_get_tx_ol_flag_name(tx_flags[i].flag);
+   if (name == NULL)
+   name = tx_flags[i].default_name;
+   ret = snprintf(buf, buflen, "%s ", name);
+   if (ret < 0)
+   return -1;
+   if ((size_t)ret >= buflen)
+   return -1;
+   buf += ret;
+   buflen -= ret;
+   }
+
+   return 0;
+}

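A minimal usage sketch of the new helpers (the buffer size is
arbitrary; as the code above shows, both functions return 0 on
success and -1 when the buffer is too small):

  char buf[256];

  if (rte_get_rx_ol_flag_list(m->ol_flags, buf, sizeof(buf)) == 0)
          printf("rx flags: %s\n", buf);
  if (rte_get_tx_ol_flag_list(m->ol_flags, buf, sizeof(buf)) == 0)
          printf("tx flags: %s\n", buf);
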
[dpdk-dev] [PATCH v5 0/8] Misc enhancements in testpmd

2016-10-07 Thread Olivier Matz
This patchset introduces several enhancements and minor fixes
in testpmd. It is targeted at v16.11, and applies on top of the
software ptype v2 patchset [1].

These patches are useful to validate the virtio offload
patchset [2] (to be rebased).

[1] http://dpdk.org/ml/archives/dev/2016-August/045876.html
[2] http://dpdk.org/ml/archives/dev/2016-July/044404.html

v4 -> v5:
- fix headline lowercase for "Rx"
- fix typo in API comment: "ouput" -> "output"

v3 -> v4:
- fix typo in documentation

v2 -> v3:
- move return type on a separate line in function definitions
- add documentation for the new --enable-lro option

v1 -> v2:
- rebase on top of sw ptype v2 patch

Olivier Matz (8):
  mbuf: add function to dump ol flag list
  app/testpmd: use new function to dump offload flags
  app/testpmd: dump Rx flags in csum engine
  app/testpmd: add option to enable lro
  app/testpmd: do not change ip addrs in csum engine
  app/testpmd: display Rx port in csum engine
  app/testpmd: don't use tso if packet is too small
  app/testpmd: hide segsize when irrelevant in csum engine

 app/test-pmd/csumonly.c| 96 --
 app/test-pmd/parameters.c  |  4 ++
 app/test-pmd/rxonly.c  | 15 +-
 doc/guides/rel_notes/release_16_11.rst |  5 ++
 doc/guides/testpmd_app_ug/run_app.rst  |  4 ++
 lib/librte_mbuf/rte_mbuf.c | 93 
 lib/librte_mbuf/rte_mbuf.h | 28 ++
 lib/librte_mbuf/rte_mbuf_version.map   |  2 +
 8 files changed, 170 insertions(+), 77 deletions(-)

-- 
2.8.1


