[dpdk-dev] [PATCH v4 0/2] ethdev: add port speed capability bitmap

2015-09-07 Thread Marc Sune
2015-08-29 2:16 GMT+02:00 Marc Sune :

> The current rte_eth_dev_info abstraction does not provide any mechanism to
> get the supported speed(s) of an ethdev.
>
> For some drivers (e.g. ixgbe), an educated guess can be done based on the
> driver's name (driver_name in rte_eth_dev_info), see:
>
> http://dpdk.org/ml/archives/dev/2013-August/000412.html
>
> However, i) doing string comparisons is annoying, and can silently
> break existing applications if PMDs change their names, ii) it does not
> provide all the supported capabilities of the ethdev, and iii) for some
> drivers it is impossible for the application to correctly determine the
> (max) speed (e.g. for i40e, distinguishing between XL710 and X710).
>
> This small patch adds a speed_capa bitmap to rte_eth_dev_info, which is
> filled in by the PMDs according to the physical device capabilities.
>
> v2: rebase, converted speed_capa into 32 bits bitmap, fixed alignment
> (checkpatch).
>
> v3: rebase to v2.1. Unified ETH_LINK_SPEED and ETH_SPEED_CAP into ETH_SPEED.
> Converted the field speed in struct rte_eth_conf to speeds, to allow a
> bitmap for defining the announced speeds, as suggested by M. Brorup.
> Fixed spelling issues.
>
> v4: fixed errata in the documentation of field speeds of rte_eth_conf, and
> commit 1/2 message. rebased to v2.1.0. v3 was incorrectly based on
> ~2.1.0-rc1.
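
For illustration, an application would consume such a capability bitmap
roughly as below (minimal sketch; rte_eth_dev_info_get() is the existing
API, while the ETH_SPEED_* flag names are placeholders for whatever the
final macros end up being):

static void
print_speed_capa(uint8_t port_id)
{
        struct rte_eth_dev_info dev_info;

        rte_eth_dev_info_get(port_id, &dev_info);

        /* placeholder flag names, not the final macros */
        if (dev_info.speed_capa & ETH_SPEED_40G)
                printf("port %u supports 40G\n", (unsigned)port_id);
        if (dev_info.speed_capa & ETH_SPEED_10G)
                printf("port %u supports 10G\n", (unsigned)port_id);
}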
>

Thomas,

Since you did most of the reviewing on v1 and v2, do you have any opinion on this one?

Regards
marc


>
> Marc Sune (2):
>   Added ETH_SPEED_ bitmap in rte_eth_dev_info
>   Filling speed capability bitmaps in the PMDs
>
>  app/test-pmd/cmdline.c                    | 32 +++
>  app/test/virtual_pmd.c                    |  2 +-
>  drivers/net/bonding/rte_eth_bond_8023ad.c | 14 +-
>  drivers/net/cxgbe/base/t4_hw.c            |  8 +++---
>  drivers/net/e1000/em_ethdev.c             | 20 +-
>  drivers/net/e1000/igb_ethdev.c            | 20 +-
>  drivers/net/fm10k/fm10k_ethdev.c          |  3 +++
>  drivers/net/i40e/i40e_ethdev.c            | 43 +++
>  drivers/net/i40e/i40e_ethdev_vf.c         |  2 +-
>  drivers/net/ixgbe/ixgbe_ethdev.c          | 32 +++
>  drivers/net/mlx4/mlx4.c                   |  6 +
>  drivers/net/vmxnet3/vmxnet3_ethdev.c      |  2 +-
>  lib/librte_ether/rte_ethdev.h             | 42 +-
>  13 files changed, 142 insertions(+), 84 deletions(-)
>
> --
> 2.1.4
>
>


[dpdk-dev] DPDK v2.1.0 with tilegx problem

2015-09-07 Thread Tony Lu
>-Original Message-
>From: Mcnamara, John [mailto:john.mcnamara at intel.com]
>Sent: Monday, September 07, 2015 4:52 PM
>To: Arthas; dev
>Cc: Zhigang Lu; Cyril Chemparathy; Zhigang Lu
>Subject: RE: [dpdk-dev] DPDK v2.1.0 with tilegx problem
>
>CCing the Tile-GX maintainers.
>
>> -Original Message-
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Arthas
>> Sent: Saturday, September 5, 2015 2:49 AM
>> To: dev
>> Subject: [dpdk-dev] DPDK v2.1.0 with tilegx problem
>>
>> Hi, I'm working on the tilegx platform, but when I compiled DPDK 2.1.0
>> for the tilegx target I got an error.
>>
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:182: error:
>> 'GXIO_MPIPE_IQUEUE_ENTRY_128' undeclared here (not in a function)
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:182: error: array
>> index in initializer not of integer type
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:182: error: (near
>> initialization for 'mpipe_iqueue_sizes')
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:183: error:
>> 'GXIO_MPIPE_IQUEUE_ENTRY_512' undeclared here (not in a function)
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:183: error: array
>> index in initializer not of integer type
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:183: error: (near
>> initialization for 'mpipe_iqueue_sizes')
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:184: error:
>> 'GXIO_MPIPE_IQUEUE_ENTRY_2K' undeclared here (not in a function)
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:184: error: array
>> index in initializer not of integer type
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:184: error: (near
>> initialization for 'mpipe_iqueue_sizes')
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:185: error:
>> 'GXIO_MPIPE_IQUEUE_ENTRY_64K' undeclared here (not in a function)
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:185: error: array
>> index in initializer not of integer type
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:185: error: (near
>> initialization for 'mpipe_iqueue_sizes')
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c: In function
>> 'mpipe_link_mac':
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:1545: warning:
>> implicit declaration of function 'gxio_mpipe_link_enumerate_mac'
>> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:1545: warning:
>> nested extern declaration of 'gxio_mpipe_link_enumerate_mac'
>>
>> My tilegx MDE version is TileraMDE-4.2.4.174600, but I can't find any
>> definition of GXIO_MPIPE_IQUEUE_ENTRY_128/512/2K/64K; I only found that
>> gxio_mpipe_link_enumerate_mac is a kernel global symbol.
>>

We need to use TileraMDE-4.3.3 or a later version, which includes the
definition of gxio_mpipe_link_enumerate_mac.

Thanks
-Zhigang

>> Did someone else hit this problem?
>>
>> Regards,
>> Arthas



[dpdk-dev] DPDK v2.1.0 with tilegx problem

2015-09-07 Thread Arthas
This problem has been solved! My tilegx libgxio doesn't provide
'gxio_mpipe_link_enumerate_mac', only the API 'gxio_mpipe_link_enumerate',
so I added a new 'gxio_mpipe_link_enumerate_mac' to libgxio and it now compiles OK! :)

/*
 * From /tilegx/src/sys/hv/include/hv/drv_mpipe_intf.h:
 *
 * typedef struct
 * {
 *   uint8_t mac[6];  // The address.
 * }
 * _gxio_mpipe_link_mac_t;
 */
int
gxio_mpipe_link_enumerate_mac(int idx, char *link_name, uint8_t *hw)
{
  int rv;
  _gxio_mpipe_link_name_t name;
  _gxio_mpipe_link_mac_t mac;
  gxio_mpipe_context_t *context = _gxio_get_link_context();

  if (!context)
    return GXIO_ERR_NO_DEVICE;

  rv = gxio_mpipe_info_enumerate_aux(context, idx, &name, &mac);
  if (rv >= 0) {
    strncpy(link_name, name.name, sizeof(name.name));
    memcpy(hw, &mac, sizeof(_gxio_mpipe_link_mac_t));
  }
  return rv;
}


  Regards,
Arthas



 -- Original --
  From:  "Arthas";;
 Date:  Sat, Sep 5, 2015 09:48 AM
 To:  "dev"; 

 Subject:  [dpdk-dev] DPDK v2.1.0 with tilegx problem



Hi, I'm working on the tilegx platform, but when I compiled DPDK 2.1.0 with the
tilegx target I got an error.

/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:182: error: 
'GXIO_MPIPE_IQUEUE_ENTRY_128' undeclared here (not in a function)
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:182: error: array index in 
initializer not of integer type
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:182: error: (near 
initialization for 'mpipe_iqueue_sizes')
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:183: error: 
'GXIO_MPIPE_IQUEUE_ENTRY_512' undeclared here (not in a function)
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:183: error: array index in 
initializer not of integer type
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:183: error: (near 
initialization for 'mpipe_iqueue_sizes')
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:184: error: 
'GXIO_MPIPE_IQUEUE_ENTRY_2K' undeclared here (not in a function)
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:184: error: array index in 
initializer not of integer type
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:184: error: (near 
initialization for 'mpipe_iqueue_sizes')
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:185: error: 
'GXIO_MPIPE_IQUEUE_ENTRY_64K' undeclared here (not in a function)
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:185: error: array index in 
initializer not of integer type
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:185: error: (near 
initialization for 'mpipe_iqueue_sizes')
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c: In function 
'mpipe_link_mac':
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:1545: warning: implicit 
declaration of function 'gxio_mpipe_link_enumerate_mac'
/DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:1545: warning: nested extern 
declaration of 'gxio_mpipe_link_enumerate_mac'

My tilegx MDE version is TileraMDE-4.2.4.174600, but I can't find any
definition of GXIO_MPIPE_IQUEUE_ENTRY_128/512/2K/64K; I only found that
gxio_mpipe_link_enumerate_mac is a kernel global symbol.

Did someone else hit this problem?

Regards,
Arthas


[dpdk-dev] [PATCH] doc: add missing field in ethertype_filter example in testpmd doc

2015-09-07 Thread Thomas Monjalon
2015-08-26 08:43, Pablo de Lara:
> The two examples of ethertype_filter in the testpmd documentation
> were missing the MAC address field, so the examples were incorrect.
> 
> Signed-off-by: Pablo de Lara 

Applied, thanks


[dpdk-dev] [PATCH] librte_eal: Fix wrong header file for old gcc version

2015-09-07 Thread Thomas Monjalon
2015-08-24 17:22, Michael Qiu:
> For __SSE3__, the corresponding header file should be pmmintrin.h,
> tmmintrin.h works for __SSSE3__.

Could you please explain the difference in more detail, and what exactly
is the bug being fixed?
Thanks



[dpdk-dev] [PATCH v2 0/3] clean deprecated code in hash library

2015-09-07 Thread Thomas Monjalon
2015-09-04 15:56, Thomas Monjalon:
> This patchset removes all deprecated macros and functions
> from the hash library.
> Then the DPDK version can be changed to 2.2.0-rc0.
> 
> Changes in v2:
> - increment hash library version
> - merge hash patches
> - increment DPDK version
> 
> Pablo de Lara (2):
>   enic: use appropriate key length in hash table
>   hash: remove deprecated function and macros
> 
> Thomas Monjalon (1):
>   version: 2.2.0-rc0

Applied, thanks


[dpdk-dev] [PATCH] ixgbe: prefetch packet headers in vector PMD receive function

2015-09-07 Thread Zoltan Kiss


On 07/09/15 13:57, Richardson, Bruce wrote:
>
>
>> -Original Message-
>> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
>> Sent: Monday, September 7, 2015 1:26 PM
>> To: dev at dpdk.org
>> Cc: Ananyev, Konstantin; Richardson, Bruce
>> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive
>> function
>>
>> Hi,
>>
>> I just realized I've missed the "[PATCH]" tag from the subject. Did anyone
>> have time to review this?
>>
>
> Hi Zoltan,
>
> the big thing that concerns me with this is the addition of new instructions 
> for
> each packet in the fast path. Ideally, this prefetching would be better 
> handled
> in the application itself, as for some apps, e.g. those using pipelining, the
> core doing the RX from the NIC may not touch the packet data at all, and the
> prefetches will instead cause a performance slowdown.
>
> Is it possible to get the same performance increase - or something close to 
> it -
> by making changes in OVS?

OVS already does a prefetch when it's processing the previous packet,
but apparently it's not early enough, at least for my test scenario,
where I'm forwarding UDP packets with the least possible overhead. I
guess in tests where OVS does more complex processing it should be fine.
I'll try to move the prefetch earlier in the OVS codebase, but I'm not
sure if it'll help.
Also, I've checked the PMD receive functions, and generally it's quite 
mixed whether they prefetch the header or not. All the other 3 ixgbe 
receive functions do that for example, as well as the following drivers:

bnx2x
e1000
fm10k (scattered)
i40e
igb
virtio

While these drivers don't do that:

cxgbe
enic
fm10k (non-scattered)
mlx4

I think it would be better to add rte_packet_prefetch() everywhere, 
because then applications can turn that off with 
CONFIG_RTE_PMD_PACKET_PREFETCH.
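
For reference, rte_packet_prefetch() is typically just a thin wrapper gated
on that config option; the usual pattern in the PMDs looks roughly like this
(sketch, not a quote of any particular driver):

#ifdef RTE_PMD_PACKET_PREFETCH
#define rte_packet_prefetch(p)  rte_prefetch1(p)
#else
#define rte_packet_prefetch(p)  do { } while (0)
#endif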

>
> Regards,
> /Bruce
>
>> Regards,
>>
>> Zoltan
>>
>> On 01/09/15 20:17, Zoltan Kiss wrote:
>>> The lack of this prefetch causes a significant performance drop in
>>> OVS-DPDK: 13.3 Mpps instead of 14 when forwarding 64 byte packets.
>>> Even though OVS prefetches the next packet's header before it starts
>>> processing the current one, it doesn't get there fast enough. This
>>> aligns with the behaviour of other receive functions.
>>>
>>> Signed-off-by: Zoltan Kiss 
>>> ---
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> index cf25a53..51299fa 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
>>> @@ -502,6 +502,15 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
>>>   _mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
>>>   pkt_mb1);
>>>
>>> +   rte_packet_prefetch((char*)(rx_pkts[pos]->buf_addr) +
>>> +   RTE_PKTMBUF_HEADROOM);
>>> +   rte_packet_prefetch((char*)(rx_pkts[pos + 1]->buf_addr) +
>>> +   RTE_PKTMBUF_HEADROOM);
>>> +   rte_packet_prefetch((char*)(rx_pkts[pos + 2]->buf_addr) +
>>> +   RTE_PKTMBUF_HEADROOM);
>>> +   rte_packet_prefetch((char*)(rx_pkts[pos + 3]->buf_addr) +
>>> +   RTE_PKTMBUF_HEADROOM);
>>> +
>>>   /* C.4 calc avaialbe number of desc */
>>>   var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
>>>   nb_pkts_recd += var;
>>>


[dpdk-dev] [PATCH v3] mbuf/ip_frag: Move mbuf chaining to common code

2015-09-07 Thread Simon Kagstrom
Chaining/segmenting mbufs can be useful in many places, so make it
global.

Signed-off-by: Simon Kagstrom 
Signed-off-by: Johan Faltstrom 
---
ChangeLog:
v2:
  * Check for nb_segs byte overflow (Olivier MATZ)
  * Don't reset nb_segs in tail (Olivier MATZ)
v3:
  * Describe performance implications of linear search
  * Correct check-for-out-of-bounds (Konstantin Ananyev)

 lib/librte_ip_frag/ip_frag_common.h  | 23 -
 lib/librte_ip_frag/rte_ipv4_reassembly.c |  7 +--
 lib/librte_ip_frag/rte_ipv6_reassembly.c |  7 +--
 lib/librte_mbuf/rte_mbuf.h   | 34 
 4 files changed, 44 insertions(+), 27 deletions(-)

diff --git a/lib/librte_ip_frag/ip_frag_common.h 
b/lib/librte_ip_frag/ip_frag_common.h
index 6b2acee..cde6ed4 100644
--- a/lib/librte_ip_frag/ip_frag_common.h
+++ b/lib/librte_ip_frag/ip_frag_common.h
@@ -166,27 +166,4 @@ ip_frag_reset(struct ip_frag_pkt *fp, uint64_t tms)
fp->frags[IP_FIRST_FRAG_IDX] = zero_frag;
 }

-/* chain two mbufs */
-static inline void
-ip_frag_chain(struct rte_mbuf *mn, struct rte_mbuf *mp)
-{
-   struct rte_mbuf *ms;
-
-   /* adjust start of the last fragment data. */
-   rte_pktmbuf_adj(mp, (uint16_t)(mp->l2_len + mp->l3_len));
-
-   /* chain two fragments. */
-   ms = rte_pktmbuf_lastseg(mn);
-   ms->next = mp;
-
-   /* accumulate number of segments and total length. */
-   mn->nb_segs = (uint8_t)(mn->nb_segs + mp->nb_segs);
-   mn->pkt_len += mp->pkt_len;
-
-   /* reset pkt_len and nb_segs for chained fragment. */
-   mp->pkt_len = mp->data_len;
-   mp->nb_segs = 1;
-}
-
-
 #endif /* _IP_FRAG_COMMON_H_ */
diff --git a/lib/librte_ip_frag/rte_ipv4_reassembly.c 
b/lib/librte_ip_frag/rte_ipv4_reassembly.c
index 5d24843..26d07f9 100644
--- a/lib/librte_ip_frag/rte_ipv4_reassembly.c
+++ b/lib/librte_ip_frag/rte_ipv4_reassembly.c
@@ -63,7 +63,9 @@ ipv4_frag_reassemble(const struct ip_frag_pkt *fp)
/* previous fragment found. */
if(fp->frags[i].ofs + fp->frags[i].len == ofs) {

-   ip_frag_chain(fp->frags[i].mb, m);
+   /* adjust start of the last fragment data. */
+   rte_pktmbuf_adj(m, (uint16_t)(m->l2_len + m->l3_len));
+   rte_pktmbuf_chain(fp->frags[i].mb, m);

/* update our last fragment and offset. */
m = fp->frags[i].mb;
@@ -78,7 +80,8 @@ ipv4_frag_reassemble(const struct ip_frag_pkt *fp)
}

/* chain with the first fragment. */
-   ip_frag_chain(fp->frags[IP_FIRST_FRAG_IDX].mb, m);
+   rte_pktmbuf_adj(m, (uint16_t)(m->l2_len + m->l3_len));
+   rte_pktmbuf_chain(fp->frags[IP_FIRST_FRAG_IDX].mb, m);
m = fp->frags[IP_FIRST_FRAG_IDX].mb;

/* update mbuf fields for reassembled packet. */
diff --git a/lib/librte_ip_frag/rte_ipv6_reassembly.c 
b/lib/librte_ip_frag/rte_ipv6_reassembly.c
index 1f1c172..5969b4a 100644
--- a/lib/librte_ip_frag/rte_ipv6_reassembly.c
+++ b/lib/librte_ip_frag/rte_ipv6_reassembly.c
@@ -86,7 +86,9 @@ ipv6_frag_reassemble(const struct ip_frag_pkt *fp)
/* previous fragment found. */
if (fp->frags[i].ofs + fp->frags[i].len == ofs) {

-   ip_frag_chain(fp->frags[i].mb, m);
+   /* adjust start of the last fragment data. */
+   rte_pktmbuf_adj(m, (uint16_t)(m->l2_len + m->l3_len));
+   rte_pktmbuf_chain(fp->frags[i].mb, m);

/* update our last fragment and offset. */
m = fp->frags[i].mb;
@@ -101,7 +103,8 @@ ipv6_frag_reassemble(const struct ip_frag_pkt *fp)
}

/* chain with the first fragment. */
-   ip_frag_chain(fp->frags[IP_FIRST_FRAG_IDX].mb, m);
+   rte_pktmbuf_adj(m, (uint16_t)(m->l2_len + m->l3_len));
+   rte_pktmbuf_chain(fp->frags[IP_FIRST_FRAG_IDX].mb, m);
m = fp->frags[IP_FIRST_FRAG_IDX].mb;

/* update mbuf fields for reassembled packet. */
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index d7c9030..f1f1400 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1775,6 +1775,40 @@ static inline int rte_pktmbuf_is_contiguous(const struct 
rte_mbuf *m)
 }

 /**
+ * Chain an mbuf to another, thereby creating a segmented packet.
+ *
+ * Note: The implementation will do a linear walk over the segments to find
+ * the tail entry. For cases when there are many segments, it's better to
+ * chain the entries manually.
+ *
+ * @param head the head of the mbuf chain (the first packet)
+ * @param tail the mbuf to put last in the chain
+ *
+ * @return 0 on success, -EOVERFLOW if the chain is full (256 entries)
+ */
+static inline int rte_pktmbuf_chain(struct rte_mbuf 

[dpdk-dev] [PATCH v2] mbuf/ip_frag: Move mbuf chaining to common code

2015-09-07 Thread Simon Kågström
On 2015-09-07 14:32, Ananyev, Konstantin wrote:
>> +static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf 
>> *tail)
>> +{
>> +struct rte_mbuf *cur_tail;
>> +
>> +/* Check for number-of-segments-overflow */
>> +if (head->nb_segs + tail->nb_segs >= sizeof(head->nb_segs) << 8)
>> +return -EOVERFLOW;
> 
> Would probably be better 'sizeof(head->nb_segs) << CHAR_BIT', or even just:
> 'head->nb_segs + tail->nb_segs > UINT8_MAX'.
> Konstantin

Thanks. I got it wrong anyway: what I wanted was to be able to handle
the day when nb_segs changes to a 16-bit number, but then it should
really be

  ... >= 1 << (sizeof(head->nb_segs) * 8)

anyway. I'll fix that and also add a warning that the implementation
will do a linear search to find the tail entry.
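
In other words, something like this (sketch of the intended check, with
CHAR_BIT from <limits.h> as Konstantin suggested):

/* reject the chain if the combined nb_segs no longer fits in its field */
if (head->nb_segs + tail->nb_segs >=
                1 << (sizeof(head->nb_segs) * CHAR_BIT))
        return -EOVERFLOW;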

// Simon



[dpdk-dev] [PATCH] ixgbe: prefetch packet headers in vector PMD receive function

2015-09-07 Thread Richardson, Bruce


> -Original Message-
> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
> Sent: Monday, September 7, 2015 3:15 PM
> To: Richardson, Bruce; dev at dpdk.org
> Cc: Ananyev, Konstantin
> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive
> function
> 
> 
> 
> On 07/09/15 13:57, Richardson, Bruce wrote:
> >
> >
> >> -Original Message-
> >> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
> >> Sent: Monday, September 7, 2015 1:26 PM
> >> To: dev at dpdk.org
> >> Cc: Ananyev, Konstantin; Richardson, Bruce
> >> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD
> >> receive function
> >>
> >> Hi,
> >>
> >> I just realized I've missed the "[PATCH]" tag from the subject. Did
> >> anyone have time to review this?
> >>
> >
> > Hi Zoltan,
> >
> > the big thing that concerns me with this is the addition of new
> > instructions for each packet in the fast path. Ideally, this
> > prefetching would be better handled in the application itself, as for
> > some apps, e.g. those using pipelining, the core doing the RX from the
> > NIC may not touch the packet data at all, and the prefetches will
> instead cause a performance slowdown.
> >
> > Is it possible to get the same performance increase - or something
> > close to it - by making changes in OVS?
> 
> OVS already does a prefetch when it's processing the previous packet, but
> apparently it's not early enough, at least for my test scenario, where I'm
> forwarding UDP packets with the least possible overhead. I guess in tests
> where OVS does more complex processing it should be fine.
> I'll try to move the prefetch earlier in the OVS codebase, but I'm not sure if
> it'll help.

I would suggest trying to prefetch more than one packet ahead. Prefetching 4 or
8 ahead might work better, depending on the processing being done.
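
Something along these lines in the application's receive loop, for example
(illustrative sketch only; PREFETCH_AHEAD and process_packet() are made-up
names, not OVS code; needs rte_ethdev.h, rte_mbuf.h and rte_prefetch.h):

#define PREFETCH_AHEAD 8   /* hypothetical look-ahead depth */

static void
rx_and_process(uint8_t port, uint16_t queue, struct rte_mbuf **pkts,
               uint16_t burst_size)
{
        uint16_t nb_rx, i;

        nb_rx = rte_eth_rx_burst(port, queue, pkts, burst_size);
        for (i = 0; i < nb_rx; i++) {
                /* prefetch the header of a packet several slots ahead */
                if (i + PREFETCH_AHEAD < nb_rx)
                        rte_prefetch0(rte_pktmbuf_mtod(pkts[i + PREFETCH_AHEAD],
                                                       void *));
                process_packet(pkts[i]);   /* application-specific work */
        }
}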

/Bruce


[dpdk-dev] [PATCH] ip_pipeline: enable promiscuous mode configuration

2015-09-07 Thread Jasvinder Singh
This patch allows the parser to read a promisc entry from
the LINK section defined in the configuration file. It
is an optional parameter: if present, its value is
read (yes/no, on/off); otherwise the default value
is used (i.e. 1 = promiscuous mode on).

Example of config file:
[LINK0]
promisc = no; optional parameter, default value is "yes"
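
For reference, the application side then only needs to act on the parsed
flag, roughly as below (sketch, not part of this patch; it assumes p is the
link's struct app_link_params and that pmd_id is its ethdev port id):

if (p->promisc)
        rte_eth_promiscuous_enable(p->pmd_id);
else
        rte_eth_promiscuous_disable(p->pmd_id);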

Signed-off-by: Jasvinder Singh 
---
 examples/ip_pipeline/config_parse.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/examples/ip_pipeline/config_parse.c 
b/examples/ip_pipeline/config_parse.c
index c9b78f9..e40c4f2 100644
--- a/examples/ip_pipeline/config_parse.c
+++ b/examples/ip_pipeline/config_parse.c
@@ -1240,7 +1240,13 @@ parse_link(struct app_params *app,
struct rte_cfgfile_entry *ent = [i];

ret = -ESRCH;
-   if (strcmp(ent->name, "arp_q") == 0)
+   if (strcmp(ent->name, "promisc") == 0) {
+   ret = parser_read_arg_bool(ent->value);
+   if (ret >= 0) {
+   param->promisc = ret;
+   ret = 0;
+   }
+   } else if (strcmp(ent->name, "arp_q") == 0)
   ret = parser_read_uint32(&param->arp_q,
ent->value);
else if (strcmp(ent->name, "tcp_syn_q") == 0)
@@ -1991,6 +1997,7 @@ save_links_params(struct app_params *app, FILE *f)

fprintf(f, "[%s]\n", p->name);
fprintf(f, "; %s = %" PRIu32 "\n", "pmd_id", p->pmd_id);
+   fprintf(f, "%s = %s\n", "promisc", p->promisc ? "yes" : "no");
fprintf(f, "%s = %" PRIu32 "\n", "arp_q", p->arp_q);
fprintf(f, "%s = %" PRIu32 "\n", "tcp_syn_local_q",
p->tcp_syn_local_q);
-- 
2.1.0



[dpdk-dev] [PATCH] ip_pipeline: add check on nic's rxq and txq

2015-09-07 Thread Jasvinder Singh
This patch checks that at least one RX queue and one TX queue are in use
for each link specified in the IP pipeline configuration file.

Signed-off-by: Jasvinder Singh 
---
 examples/ip_pipeline/config_check.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/examples/ip_pipeline/config_check.c 
b/examples/ip_pipeline/config_check.c
index 07f4c8b..b843926 100644
--- a/examples/ip_pipeline/config_check.c
+++ b/examples/ip_pipeline/config_check.c
@@ -95,6 +95,8 @@ check_links(struct app_params *app)
"%s RXQs are not contiguous (A)\n", link->name);

n_rxq = app_link_get_n_rxq(app, link);
+   
+   APP_CHECK((n_rxq),  "%s does not have any RXQ \n", link->name);

APP_CHECK((n_rxq == rxq_max + 1),
"%s RXQs are not contiguous (B)\n", link->name);
@@ -112,7 +114,9 @@ check_links(struct app_params *app)

/* Check that link RXQs are contiguous */
n_txq = app_link_get_n_txq(app, link);
-
+   
+   APP_CHECK((n_txq),  "%s does not have any TXQ \n", link->name);
+   
for (i = 0; i < n_txq; i++) {
char name[APP_PARAM_NAME_SIZE];
int pos;
-- 
2.1.0



[dpdk-dev] [PATCH v2] mbuf/ip_frag: Move mbuf chaining to common code

2015-09-07 Thread Simon Kagstrom
Chaining/segmenting mbufs can be useful in many places, so make it
global.

Signed-off-by: Simon Kagstrom 
Signed-off-by: Johan Faltstrom 
---
ChangeLog:
v2:
  * Check for nb_segs byte overflow (Olivier MATZ)
  * Don't reset nb_segs in tail (Olivier MATZ)

 lib/librte_ip_frag/ip_frag_common.h  | 23 ---
 lib/librte_ip_frag/rte_ipv4_reassembly.c |  7 +--
 lib/librte_ip_frag/rte_ipv6_reassembly.c |  7 +--
 lib/librte_mbuf/rte_mbuf.h   | 30 ++
 4 files changed, 40 insertions(+), 27 deletions(-)

diff --git a/lib/librte_ip_frag/ip_frag_common.h 
b/lib/librte_ip_frag/ip_frag_common.h
index 6b2acee..cde6ed4 100644
--- a/lib/librte_ip_frag/ip_frag_common.h
+++ b/lib/librte_ip_frag/ip_frag_common.h
@@ -166,27 +166,4 @@ ip_frag_reset(struct ip_frag_pkt *fp, uint64_t tms)
fp->frags[IP_FIRST_FRAG_IDX] = zero_frag;
 }

-/* chain two mbufs */
-static inline void
-ip_frag_chain(struct rte_mbuf *mn, struct rte_mbuf *mp)
-{
-   struct rte_mbuf *ms;
-
-   /* adjust start of the last fragment data. */
-   rte_pktmbuf_adj(mp, (uint16_t)(mp->l2_len + mp->l3_len));
-
-   /* chain two fragments. */
-   ms = rte_pktmbuf_lastseg(mn);
-   ms->next = mp;
-
-   /* accumulate number of segments and total length. */
-   mn->nb_segs = (uint8_t)(mn->nb_segs + mp->nb_segs);
-   mn->pkt_len += mp->pkt_len;
-
-   /* reset pkt_len and nb_segs for chained fragment. */
-   mp->pkt_len = mp->data_len;
-   mp->nb_segs = 1;
-}
-
-
 #endif /* _IP_FRAG_COMMON_H_ */
diff --git a/lib/librte_ip_frag/rte_ipv4_reassembly.c 
b/lib/librte_ip_frag/rte_ipv4_reassembly.c
index 5d24843..26d07f9 100644
--- a/lib/librte_ip_frag/rte_ipv4_reassembly.c
+++ b/lib/librte_ip_frag/rte_ipv4_reassembly.c
@@ -63,7 +63,9 @@ ipv4_frag_reassemble(const struct ip_frag_pkt *fp)
/* previous fragment found. */
if(fp->frags[i].ofs + fp->frags[i].len == ofs) {

-   ip_frag_chain(fp->frags[i].mb, m);
+   /* adjust start of the last fragment data. */
+   rte_pktmbuf_adj(m, (uint16_t)(m->l2_len + m->l3_len));
+   rte_pktmbuf_chain(fp->frags[i].mb, m);

/* update our last fragment and offset. */
m = fp->frags[i].mb;
@@ -78,7 +80,8 @@ ipv4_frag_reassemble(const struct ip_frag_pkt *fp)
}

/* chain with the first fragment. */
-   ip_frag_chain(fp->frags[IP_FIRST_FRAG_IDX].mb, m);
+   rte_pktmbuf_adj(m, (uint16_t)(m->l2_len + m->l3_len));
+   rte_pktmbuf_chain(fp->frags[IP_FIRST_FRAG_IDX].mb, m);
m = fp->frags[IP_FIRST_FRAG_IDX].mb;

/* update mbuf fields for reassembled packet. */
diff --git a/lib/librte_ip_frag/rte_ipv6_reassembly.c 
b/lib/librte_ip_frag/rte_ipv6_reassembly.c
index 1f1c172..5969b4a 100644
--- a/lib/librte_ip_frag/rte_ipv6_reassembly.c
+++ b/lib/librte_ip_frag/rte_ipv6_reassembly.c
@@ -86,7 +86,9 @@ ipv6_frag_reassemble(const struct ip_frag_pkt *fp)
/* previous fragment found. */
if (fp->frags[i].ofs + fp->frags[i].len == ofs) {

-   ip_frag_chain(fp->frags[i].mb, m);
+   /* adjust start of the last fragment data. */
+   rte_pktmbuf_adj(m, (uint16_t)(m->l2_len + m->l3_len));
+   rte_pktmbuf_chain(fp->frags[i].mb, m);

/* update our last fragment and offset. */
m = fp->frags[i].mb;
@@ -101,7 +103,8 @@ ipv6_frag_reassemble(const struct ip_frag_pkt *fp)
}

/* chain with the first fragment. */
-   ip_frag_chain(fp->frags[IP_FIRST_FRAG_IDX].mb, m);
+   rte_pktmbuf_adj(m, (uint16_t)(m->l2_len + m->l3_len));
+   rte_pktmbuf_chain(fp->frags[IP_FIRST_FRAG_IDX].mb, m);
m = fp->frags[IP_FIRST_FRAG_IDX].mb;

/* update mbuf fields for reassembled packet. */
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index d7c9030..19a4bb5 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1775,6 +1775,36 @@ static inline int rte_pktmbuf_is_contiguous(const struct 
rte_mbuf *m)
 }

 /**
+ * Chain an mbuf to another, thereby creating a segmented packet.
+ *
+ * @param head the head of the mbuf chain (the first packet)
+ * @param tail the mbuf to put last in the chain
+ *
+ * @return 0 on success, -EOVERFLOW if the chain is full (256 entries)
+ */
+static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail)
+{
+   struct rte_mbuf *cur_tail;
+
+   /* Check for number-of-segments-overflow */
+   if (head->nb_segs + tail->nb_segs >= sizeof(head->nb_segs) << 8)
+   return -EOVERFLOW;
+
+   /* Chain 'tail' onto the old tail */
+   cur_tail = 

[dpdk-dev] [PATCH] ip_pipeline: enable promiscuous mode configuration

2015-09-07 Thread Dumitrescu, Cristian


> -Original Message-
> From: Singh, Jasvinder
> Sent: Monday, September 7, 2015 3:58 PM
> To: dev at dpdk.org
> Cc: Dumitrescu, Cristian
> Subject: [PATCH] ip_pipeline: enable promiscuous mode configuration
> 
> This patch allows the parser to read a promisc entry from
> the LINK section defined in the configuration file. It
> is an optional parameter: if present, its value is
> read (yes/no, on/off); otherwise the default value
> is used (i.e. 1 = promiscuous mode on).
> 
> Example of config file:
> [LINK0]
> promisc = no; optional parameter, default value is "yes"
> 
> Signed-off-by: Jasvinder Singh 
> ---

Acked-by: Cristian Dumitrescu 



[dpdk-dev] [PATCH] ip_pipeline: add check on nic's rxq and txq

2015-09-07 Thread Dumitrescu, Cristian


> -Original Message-
> From: Singh, Jasvinder
> Sent: Monday, September 7, 2015 3:54 PM
> To: dev at dpdk.org
> Cc: Dumitrescu, Cristian
> Subject: [PATCH] ip_pipeline: add check on nic's rxq and txq
> 
> This patch checks that at least one RX queue and one TX queue are in use
> for each link specified in the IP pipeline configuration file.
> 
> Signed-off-by: Jasvinder Singh 
> ---

Acked-by: Cristian Dumitrescu 



[dpdk-dev] [PATCH] ixgbe: prefetch packet headers in vector PMD receive function

2015-09-07 Thread Zoltan Kiss
Hi,

I just realized I've missed the "[PATCH]" tag from the subject. Did
anyone have time to review this?

Regards,

Zoltan

On 01/09/15 20:17, Zoltan Kiss wrote:
> The lack of this prefetch causes a significant performance drop in
> OVS-DPDK: 13.3 Mpps instead of 14 when forwarding 64 byte packets. Even
> though OVS prefetches the next packet's header before it starts processing
> the current one, it doesn't get there fast enough. This aligns with the
> behaviour of other receive functions.
>
> Signed-off-by: Zoltan Kiss 
> ---
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c 
> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> index cf25a53..51299fa 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> @@ -502,6 +502,15 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
>  _mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
>  pkt_mb1);
>
> +   rte_packet_prefetch((char*)(rx_pkts[pos]->buf_addr) +
> +   RTE_PKTMBUF_HEADROOM);
> +   rte_packet_prefetch((char*)(rx_pkts[pos + 1]->buf_addr) +
> +   RTE_PKTMBUF_HEADROOM);
> +   rte_packet_prefetch((char*)(rx_pkts[pos + 2]->buf_addr) +
> +   RTE_PKTMBUF_HEADROOM);
> +   rte_packet_prefetch((char*)(rx_pkts[pos + 3]->buf_addr) +
> +   RTE_PKTMBUF_HEADROOM);
> +
>  /* C.4 calc avaialbe number of desc */
>  var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
>  nb_pkts_recd += var;
>


[dpdk-dev] [PATCH] ixgbe: prefetch packet headers in vector PMD receive function

2015-09-07 Thread Richardson, Bruce


> -Original Message-
> From: Zoltan Kiss [mailto:zoltan.kiss at linaro.org]
> Sent: Monday, September 7, 2015 1:26 PM
> To: dev at dpdk.org
> Cc: Ananyev, Konstantin; Richardson, Bruce
> Subject: Re: [PATCH] ixgbe: prefetch packet headers in vector PMD receive
> function
> 
> Hi,
> 
> I just realized I've missed the "[PATCH]" tag from the subject. Did anyone
> have time to review this?
> 

Hi Zoltan,

the big thing that concerns me with this is the addition of new instructions for
each packet in the fast path. Ideally, this prefetching would be better handled
in the application itself, as for some apps, e.g. those using pipelining, the
core doing the RX from the NIC may not touch the packet data at all, and the
prefetches will instead cause a performance slowdown.

Is it possible to get the same performance increase - or something close to it -
by making changes in OVS?

Regards,
/Bruce

> Regards,
> 
> Zoltan
> 
> On 01/09/15 20:17, Zoltan Kiss wrote:
> > The lack of this prefetch causes a significant performance drop in
> > OVS-DPDK: 13.3 Mpps instead of 14 when forwarding 64 byte packets.
> > Even though OVS prefetches the next packet's header before it starts
> > processing the current one, it doesn't get there fast enough. This
> > aligns with the behaviour of other receive functions.
> >
> > Signed-off-by: Zoltan Kiss 
> > ---
> > diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> > b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> > index cf25a53..51299fa 100644
> > --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> > +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
> > @@ -502,6 +502,15 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
> >  _mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
> >  pkt_mb1);
> >
> > +   rte_packet_prefetch((char*)(rx_pkts[pos]->buf_addr) +
> > +   RTE_PKTMBUF_HEADROOM);
> > +   rte_packet_prefetch((char*)(rx_pkts[pos + 1]->buf_addr) +
> > +   RTE_PKTMBUF_HEADROOM);
> > +   rte_packet_prefetch((char*)(rx_pkts[pos + 2]->buf_addr) +
> > +   RTE_PKTMBUF_HEADROOM);
> > +   rte_packet_prefetch((char*)(rx_pkts[pos + 3]->buf_addr) +
> > +   RTE_PKTMBUF_HEADROOM);
> > +
> >  /* C.4 calc avaialbe number of desc */
> >  var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
> >  nb_pkts_recd += var;
> >


[dpdk-dev] [PATCH RFC] mbuf/ip_frag: Move mbuf chaining to common code

2015-09-07 Thread Simon Kågström
On 2015-09-07 11:35, Olivier MATZ wrote:

>> Wonder why we need to do that?
>> Probably the head mbuf is out of space and we want to expand it using
>> pktmbuf_chain()?
>> So in that case it seems logical to:
>> 1) allocate a new mbuf (its pkt_len will be 0)
>> 2) call pktmbuf_chain()
> 
> From experience, having empty segments in the middle of an mbuf
> chain is problematic (functions getting a pointer at an offset, some PMDs
> or hardware may behave badly), so I wanted to avoid that risk.
> 
> Now, the use-case you described is legitimate. Another option would
> be to have another function pktmbuf_append_new(m) that returns a new
> mbuf that is already chained to the other.

The issue I see with that method is that you have to remember to actually
update pkt_len in the head buffer when chaining an empty mbuf. Anyway,
disallowing this behavior should probably not be the responsibility of
rte_pktmbuf_chain(), so I'm fine with leaving the check out.
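
For reference, the caller pattern Konstantin describes would look roughly
like this (sketch; 'pool' and the error handling are illustrative):

/* sketch: grow 'head' by appending a freshly allocated (empty) segment */
static int
append_new_segment(struct rte_mempool *pool, struct rte_mbuf *head)
{
        struct rte_mbuf *seg = rte_pktmbuf_alloc(pool);   /* pkt_len == 0 */

        if (seg == NULL)
                return -ENOMEM;
        if (rte_pktmbuf_chain(head, seg) != 0) {
                rte_pktmbuf_free(seg);
                return -EOVERFLOW;
        }
        /* data is then written into the new tail segment; the caller must
         * also add the appended length to head->pkt_len */
        return 0;
}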

// Simon



[dpdk-dev] [PATCH v2] mbuf/ip_frag: Move mbuf chaining to common code

2015-09-07 Thread Ananyev, Konstantin


Hi Simon,
Looks good to me, just one nit, see below.
Konstantin 

>  /**
> + * Chain an mbuf to another, thereby creating a segmented packet.
> + *
> + * @param head the head of the mbuf chain (the first packet)
> + * @param tail the mbuf to put last in the chain
> + *
> + * @return 0 on success, -EOVERFLOW if the chain is full (256 entries)
> + */
> +static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf 
> *tail)
> +{
> + struct rte_mbuf *cur_tail;
> +
> + /* Check for number-of-segments-overflow */
> + if (head->nb_segs + tail->nb_segs >= sizeof(head->nb_segs) << 8)
> + return -EOVERFLOW;

Would probably be better 'sizeof(head->nb_segs) << CHAR_BIT', or even just:
'head->nb_segs + tail->nb_segs > UINT8_MAX'.
Konstantin

> +
> + /* Chain 'tail' onto the old tail */
> + cur_tail = rte_pktmbuf_lastseg(head);
> + cur_tail->next = tail;
> +
> + /* accumulate number of segments and total length. */
> + head->nb_segs = (uint8_t)(head->nb_segs + tail->nb_segs);
> + head->pkt_len += tail->pkt_len;
> +
> + /* pkt_len is only set in the head */
> + tail->pkt_len = tail->data_len;
> +
> + return 0;
> +}
> +
> +/**
>   * Dump an mbuf structure to the console.
>   *
>   * Dump all fields for the given packet mbuf and all its associated
> --
> 1.9.1



[dpdk-dev] ixgbe: account more Rx errors Issue

2015-09-07 Thread Tahhan, Maryam
> From: Olivier MATZ [mailto:olivier.matz at 6wind.com]
> Sent: Monday, September 7, 2015 9:30 AM
> To: Tahhan, Maryam; Andriy Berestovskyy
> Cc: dev at dpdk.org
> Subject: Re: ixgbe: account more Rx errors Issue
> 
> Hi,
> 
> On 09/06/2015 07:15 PM, Tahhan, Maryam wrote:
> >> From: Andriy Berestovskyy [mailto:aber at semihalf.com]
> >> Sent: Friday, September 4, 2015 5:59 PM
> >> To: Tahhan, Maryam
> >> Cc: dev at dpdk.org; Olivier MATZ
> >> Subject: Re: ixgbe: account more Rx errors Issue
> >>
> >> Hi Maryam,
> >> Please see below.
> >>
> >>> XEC counts the Number of receive IPv4, TCP, UDP or SCTP XSUM errors
> >>
> >> Please note than UDP checksum is optional for IPv4, but UDP packets
> >> with zero checksum hit XEC.
> >>
> >
> > I understand, but this is what the hardware register is picking up and what 
> > I
> included previously is the definitions of the registers from the datasheet.
> >
> >>> And general crc errors counts Counts the number of receive packets
> >>> with
> >> CRC errors.
> >>
> >> Let me explain you with an example.
> >>
> >> DPDK 2.0 behavior:
> >> host A sends 10M IPv4 UDP packets (no checksum) to host B host B
> >> stats: 9M ipackets + 1M ierrors (missed) = 10M
> >>
> >> DPDK 2.1 behavior:
> >> host A sends 10M IPv4 UDP packets (no checksum) to host B host B
> >> stats: 9M ipackets + 11M in ierrors (1M missed + 10M XEC) = 20M?
> >
> > Because it's hitting the 2 error registers. If you had packets with multiple
> errors that are added up as part of ierrors you'll still be getting more than
> 10M errors which is why I asked for feedback on the 3 suggestions below.
> What I'm saying is the number of errors being > the number of received
> packets will be seen if you hit multiple error registers on the NIC.
> >
> >>
> >>> So our options are we can:
> >>> 1. Add only one of these into the error stats.
> >>> 2. We can introduce some cooking of stats in this scenario, so only
> >>> add
> >> either or if they are equal or one is higher than the other.
> >>> 3. Add them all which means you can have more errors than the number
> >>> of
> >> received packets, but TBH this is going to be the case if your
> >> packets have multiple errors anyway.
> >>
> >> 4. ierrors should reflect NIC drops only.
> >
> > I may have misinterpreted this, but ierrors in rte_ethdev.h is defined
> > as the Total number of erroneous received packets.
> > Maybe we need a clear definition or a separate drop counter as I see
> uint64_t q_errors defined as: Total number of queue packets received that
> are dropped.
> >
> >> XEC does not count drops, so IMO it should be removed from ierrors.
> >
> > While it's picking up the 0 checksum as an error (which it shouldn't
> > necessarily be doing), removing it could mean missing other valid
> > L3/L4 checksum errors... Let me experiment some more with L3/L4
> > checksum errors and crcerrs to see if we can cook the stats around
> > this register in particular. I would hate to remove it and miss
> > genuine errors
> 
> For me, the definition that looks the most straightforward is:
> 
> ipackets = packets successfully received by hardware imissed = packets
> dropped by hardware because the software does
>   not poll fast enough (= queue full)
> ierrors = packets dropped by hardware (malformed packets, ...)
> 
> These 3 stats never count twice the same packet.
> 
> If we want more statistics, they could go in xstats. For instance, a counter 
> for
> invalid checksum. The definition of these stats would be pmd-specific.
> 
> I agree we should clarify and have a consensus on the definitions before going
> further.
> 
> 
> Regards,
> Olivier

Hi Olivier
I think it's important to distinguish between errors and drops and provide a 
statistics API that exposes both. This way people have access to as much 
information as possible when things do go wrong and nothing is missed in terms 
of errors.

My suggestion for the high level registers would be:
ipackets = Total number of packets successfully received by hardware.
imissed = Total number of packets dropped by hardware because the software
does not poll fast enough (= queue full).
idrops = Total number of packets dropped by hardware (malformed packets, ...),
where the # of drops can ONLY be <= the packets received (no overlap between
registers).
ierrors = Total number of erroneous received packets, where the # of errors
can be >= the packets received (no overlap between registers), because there
may be multiple errors associated with a packet.

This way people can see how many packets were dropped and why at a high level 
as well as through the extended stats API rather than using one API or the 
other. What do you think?
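
For completeness, applications already have both interfaces available, so
the split would look roughly like this (sketch; the xstats struct/field
names are from memory and may differ slightly):

static void
dump_port_stats(uint8_t port_id)
{
        struct rte_eth_stats stats;
        struct rte_eth_xstats xstats[64];
        int i, n;

        rte_eth_stats_get(port_id, &stats);          /* ipackets, imissed, ierrors, ... */
        printf("ierrors: %" PRIu64 "\n", stats.ierrors);

        n = rte_eth_xstats_get(port_id, xstats, 64); /* per-register detail, e.g. XEC */
        for (i = 0; i < n; i++)
                printf("%s: %" PRIu64 "\n", xstats[i].name, xstats[i].value);
}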

Best Regards
Maryam
> 
> 
> 
> >
> >>
> >> Please note that we still can access the XEC using
> >> rte_eth_xstats_get()
> >>
> >>
> >> Regards,
> >> Andriy


[dpdk-dev] [PATCH RFC] mbuf/ip_frag: Move mbuf chaining to common code

2015-09-07 Thread Olivier MATZ
Hi,

>>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
>>> index 8c2db1b..ef47256 100644
>>> --- a/lib/librte_mbuf/rte_mbuf.h
>>> +++ b/lib/librte_mbuf/rte_mbuf.h
>>> @@ -1801,6 +1801,29 @@ static inline int rte_pktmbuf_is_contiguous(const 
>>> struct rte_mbuf *m)
>>>  }
>>>
>>>  /**
>>> + * Chain an mbuf to another, thereby creating a segmented packet.
>>> + *
>>> + * @param head the head of the mbuf chain (the first packet)
>>> + * @param tail the mbuf to put last in the chain
>>> + */
>>> +static inline void rte_pktmbuf_chain(struct rte_mbuf *head, struct 
>>> rte_mbuf *tail)
>>> +{
>>> +   struct rte_mbuf *cur_tail;
>>> +
>>
>> Here, we could check if the pkt_len of the tail mbuf is 0. If
>> that's the case, we can just free it and return. It would avoid
>> having an empty segment inside the mbuf chain, which can be
>> annoying.
>>
>> if (unlikely(rte_pktmbuf_pkt_len(tail) == 0)) {
>>  rte_pktmbuf_free(tail);
>>  return;
>> }
> 
> Wonder why we need to do that?
> Probably the head mbuf is out of space and we want to expand it using
> pktmbuf_chain()?
> So in that case it seems logical to:
> 1) allocate a new mbuf (its pkt_len will be 0)
> 2) call pktmbuf_chain()

From experience, having empty segments in the middle of an mbuf
chain is problematic (functions getting a pointer at an offset, some PMDs
or hardware may behave badly), so I wanted to avoid that risk.

Now, the use-case you described is legitimate. Another option would
be to have another function pktmbuf_append_new(m) that returns a new
mbuf that is already chained to the other.

But I'm also fine with removing the test, it's maybe simpler.

Regards,
Olivier


[dpdk-dev] [PATCH 1/1] ip_frag: fix creating ipv6 fragment extension header

2015-09-07 Thread Dumitrescu, Cristian


> -Original Message-
> From: Ananyev, Konstantin
> Sent: Monday, September 7, 2015 2:23 PM
> To: Dumitrescu, Cristian; Azarewicz, PiotrX T; dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 1/1] ip_frag: fix creating ipv6 fragment
> extension header
> 
> 
> 
> > -Original Message-
> > From: Dumitrescu, Cristian
> > Sent: Monday, September 07, 2015 12:22 PM
> > To: Ananyev, Konstantin; Azarewicz, PiotrX T; dev at dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH 1/1] ip_frag: fix creating ipv6 fragment
> extension header
> >
> >
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev,
> > > Konstantin
> > > Sent: Friday, September 4, 2015 6:51 PM
> > > To: Azarewicz, PiotrX T; dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 1/1] ip_frag: fix creating ipv6 fragment
> > > extension header
> > >
> > > Hi Piotr,
> > >
> > > > -Original Message-
> > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Piotr
> > > > Sent: Wednesday, September 02, 2015 3:13 PM
> > > > To: dev at dpdk.org
> > > > Subject: [dpdk-dev] [PATCH 1/1] ip_frag: fix creating ipv6 fragment
> > > extension header
> > > >
> > > > From: Piotr Azarewicz 
> > > >
> > > > The previous implementation won't work in every environment. The order of
> > > > allocation of bit-fields within a unit (high-order to low-order or
> > > > low-order to high-order) is implementation-defined.
> > > > Solution: use bytes instead of bit fields.
> > >
> > > Seems like right thing to do to me.
> > > Though I think we also should replace:
> > > union {
> > > struct {
> > > uint16_t frag_offset:13; /**< Offset from the 
> > > start of the
> packet
> > > */
> > > uint16_t reserved2:2; /**< Reserved */
> > > uint16_t more_frags:1;
> > > /**< 1 if more fragments left, 0 if last fragment 
> > > */
> > > };
> > > uint16_t frag_data;
> > > /**< union of all fragmentation data */
> > > };
> > >
> > > With just:
> > > uint16_t frag_data;
> > >  and probably provide macros to read/set fragment_offset and
> more_flags
> > > values.
> > > Otherwise people might keep using the wrong layout.
> > > Konstantin
> > >
> >
> > I agree with your proposal, but wouldn't this be an ABI change? To avoid an
> ABI change, we should probably leave the union?
> 
> 
> No, I don't think it would - the size of the field will remain the same:
> uint16_t.
> Also, if the bit-field is invalid, why keep it?
> Konstantin
> 

Excellent then.
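
For illustration, the accessors Konstantin suggests could look roughly like
this (macro names and exact mask values are placeholders, not a committed
API; 'fofs' is the fragment offset in bytes, as used in __fill_ipv6hdr_frag;
needs rte_byteorder.h):

/* frag_data is kept in network byte order; low 3 bits = reserved + MF flag */
#define RTE_IPV6_FRAG_OFS(fh)   (rte_be_to_cpu_16((fh)->frag_data) & ~0x7)
#define RTE_IPV6_FRAG_MF(fh)    (rte_be_to_cpu_16((fh)->frag_data) & 0x1)
#define RTE_IPV6_FRAG_DATA(fofs, mf) \
        rte_cpu_to_be_16((uint16_t)(((fofs) & ~0x7) | ((mf) & 0x1)))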

> >
> > > >
> > > > Signed-off-by: Piotr Azarewicz 
> > > > ---
> > > >  lib/librte_ip_frag/rte_ipv6_fragmentation.c |6 ++
> > > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/lib/librte_ip_frag/rte_ipv6_fragmentation.c
> > > b/lib/librte_ip_frag/rte_ipv6_fragmentation.c
> > > > index 0e32aa8..7342421 100644
> > > > --- a/lib/librte_ip_frag/rte_ipv6_fragmentation.c
> > > > +++ b/lib/librte_ip_frag/rte_ipv6_fragmentation.c
> > > > @@ -65,10 +65,8 @@ __fill_ipv6hdr_frag(struct ipv6_hdr *dst,
> > > >
> > > > fh = (struct ipv6_extension_fragment *) ++dst;
> > > > fh->next_header = src->proto;
> > > > -   fh->reserved1   = 0;
> > > > -   fh->frag_offset = rte_cpu_to_be_16(fofs);
> > > > -   fh->reserved2   = 0;
> > > > -   fh->more_frags  = rte_cpu_to_be_16(mf);
> > > > +   fh->reserved1 = 0;
> > > > +   fh->frag_data = rte_cpu_to_be_16((fofs & ~IPV6_HDR_FO_MASK) |
> > > mf);
> > > > fh->id = 0;
> > > >  }
> > > >
> > > > --
> > > > 1.7.9.5



[dpdk-dev] DPDK v2.1.0 with tilegx problem

2015-09-07 Thread Thomas Monjalon
2015-09-07 08:51, Mcnamara, John:
> CCing the Tile-GX maintainers.

The TILE-Gx maintainer is Zhigang Lu.



[dpdk-dev] [PATCH v3] librte_cfgfile(rte_cfgfile.h): modify the macros values

2015-09-07 Thread Dumitrescu, Cristian


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Jasvinder Singh
> Sent: Friday, September 4, 2015 1:59 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH v3] librte_cfgfile(rte_cfgfile.h): modify the
> macros values
> 
> This patch refers to the ABI change proposed for librte_cfgfile
> (rte_cfgfile.h). In order to allow for longer names and values,
> the new values of macros CFG_NAME_LEN and CFG_NAME_VAL are set.
> 
> Signed-off-by: Jasvinder Singh 
> ---

Acked-by: Cristian Dumitrescu 



[dpdk-dev] [PATCH 1/1] ip_frag: fix creating ipv6 fragment extension header

2015-09-07 Thread Ananyev, Konstantin


> -Original Message-
> From: Dumitrescu, Cristian
> Sent: Monday, September 07, 2015 12:22 PM
> To: Ananyev, Konstantin; Azarewicz, PiotrX T; dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 1/1] ip_frag: fix creating ipv6 fragment 
> extension header
> 
> 
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev,
> > Konstantin
> > Sent: Friday, September 4, 2015 6:51 PM
> > To: Azarewicz, PiotrX T; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 1/1] ip_frag: fix creating ipv6 fragment
> > extension header
> >
> > Hi Piotr,
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Piotr
> > > Sent: Wednesday, September 02, 2015 3:13 PM
> > > To: dev at dpdk.org
> > > Subject: [dpdk-dev] [PATCH 1/1] ip_frag: fix creating ipv6 fragment
> > extension header
> > >
> > > From: Piotr Azarewicz 
> > >
> > > The previous implementation won't work in every environment. The order of
> > > allocation of bit-fields within a unit (high-order to low-order or
> > > low-order to high-order) is implementation-defined.
> > > Solution: use bytes instead of bit fields.
> >
> > Seems like right thing to do to me.
> > Though I think we also should replace:
> > union {
> > struct {
> > uint16_t frag_offset:13; /**< Offset from the start 
> > of the packet
> > */
> > uint16_t reserved2:2; /**< Reserved */
> > uint16_t more_frags:1;
> > /**< 1 if more fragments left, 0 if last fragment */
> > };
> > uint16_t frag_data;
> > /**< union of all fragmentation data */
> > };
> >
> > With just:
> > uint16_t frag_data;
> >  and probably provide macros to read/set fragment_offset and more_flags
> > values.
> > Otherwise people might keep using the wrong layout.
> > Konstantin
> >
> 
> I agree with your proposal, but wouldn't this be an ABI change? To avoid an 
> ABI change, we should probably leave the union?


No, I don't think it would - the size of the field will remain the same:
uint16_t.
Also, if the bit-field is invalid, why keep it?
Konstantin

> 
> > >
> > > Signed-off-by: Piotr Azarewicz 
> > > ---
> > >  lib/librte_ip_frag/rte_ipv6_fragmentation.c |6 ++
> > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/lib/librte_ip_frag/rte_ipv6_fragmentation.c
> > b/lib/librte_ip_frag/rte_ipv6_fragmentation.c
> > > index 0e32aa8..7342421 100644
> > > --- a/lib/librte_ip_frag/rte_ipv6_fragmentation.c
> > > +++ b/lib/librte_ip_frag/rte_ipv6_fragmentation.c
> > > @@ -65,10 +65,8 @@ __fill_ipv6hdr_frag(struct ipv6_hdr *dst,
> > >
> > >   fh = (struct ipv6_extension_fragment *) ++dst;
> > >   fh->next_header = src->proto;
> > > - fh->reserved1   = 0;
> > > - fh->frag_offset = rte_cpu_to_be_16(fofs);
> > > - fh->reserved2   = 0;
> > > - fh->more_frags  = rte_cpu_to_be_16(mf);
> > > + fh->reserved1 = 0;
> > > + fh->frag_data = rte_cpu_to_be_16((fofs & ~IPV6_HDR_FO_MASK) |
> > mf);
> > >   fh->id = 0;
> > >  }
> > >
> > > --
> > > 1.7.9.5



[dpdk-dev] [PATCH 1/1] ip_frag: fix creating ipv6 fragment extension header

2015-09-07 Thread Dumitrescu, Cristian


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev,
> Konstantin
> Sent: Friday, September 4, 2015 6:51 PM
> To: Azarewicz, PiotrX T; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 1/1] ip_frag: fix creating ipv6 fragment
> extension header
> 
> Hi Piotr,
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Piotr
> > Sent: Wednesday, September 02, 2015 3:13 PM
> > To: dev at dpdk.org
> > Subject: [dpdk-dev] [PATCH 1/1] ip_frag: fix creating ipv6 fragment
> extension header
> >
> > From: Piotr Azarewicz 
> >
> > The previous implementation won't work in every environment. The order of
> > allocation of bit-fields within a unit (high-order to low-order or
> > low-order to high-order) is implementation-defined.
> > Solution: use bytes instead of bit fields.
> 
> Seems like right thing to do to me.
> Though I think we also should replace:
> union {
> struct {
> uint16_t frag_offset:13; /**< Offset from the start 
> of the packet
> */
> uint16_t reserved2:2; /**< Reserved */
> uint16_t more_frags:1;
> /**< 1 if more fragments left, 0 if last fragment */
> };
> uint16_t frag_data;
> /**< union of all fragmentation data */
> };
> 
> With just:
> uint16_t frag_data;
>  and probably provide macros to read/set fragment_offset and more_flags
> values.
> Otherwise people might keep using the wrong layout.
> Konstantin
> 

I agree with your proposal, but wouldn't this be an ABI change? To avoid an ABI 
change, we should probably leave the union?

> >
> > Signed-off-by: Piotr Azarewicz 
> > ---
> >  lib/librte_ip_frag/rte_ipv6_fragmentation.c |6 ++
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/librte_ip_frag/rte_ipv6_fragmentation.c
> b/lib/librte_ip_frag/rte_ipv6_fragmentation.c
> > index 0e32aa8..7342421 100644
> > --- a/lib/librte_ip_frag/rte_ipv6_fragmentation.c
> > +++ b/lib/librte_ip_frag/rte_ipv6_fragmentation.c
> > @@ -65,10 +65,8 @@ __fill_ipv6hdr_frag(struct ipv6_hdr *dst,
> >
> > fh = (struct ipv6_extension_fragment *) ++dst;
> > fh->next_header = src->proto;
> > -   fh->reserved1   = 0;
> > -   fh->frag_offset = rte_cpu_to_be_16(fofs);
> > -   fh->reserved2   = 0;
> > -   fh->more_frags  = rte_cpu_to_be_16(mf);
> > +   fh->reserved1 = 0;
> > +   fh->frag_data = rte_cpu_to_be_16((fofs & ~IPV6_HDR_FO_MASK) |
> mf);
> > fh->id = 0;
> >  }
> >
> > --
> > 1.7.9.5



[dpdk-dev] [PATCH] ring: add function to free a ring

2015-09-07 Thread Olivier MATZ
Hi Pablo,

Please find some comments below.

On 08/18/2015 04:00 PM, Pablo de Lara wrote:
> When creating a ring, a memzone is created to allocate it in memory,
> but until now the ring could not be freed, as memzones could not be.
> 
> Since memzones can now be freed, rings can be freed as well, taking into
> account whether they were initialized with pre-allocated memory (in which
> case the memory should be freed externally) or with rte_memzone_reserve
> (via rte_ring_create), in which case the memory is freed with
> rte_memzone_free.
> 
> Signed-off-by: Pablo de Lara 
> ---
>  lib/librte_ring/rte_ring.c   | 43 
> 
>  lib/librte_ring/rte_ring.h   |  7 ++
>  lib/librte_ring/rte_ring_version.map |  7 ++
>  3 files changed, 57 insertions(+)
> 
> diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> index c9e59d4..83ce6d3 100644
> --- a/lib/librte_ring/rte_ring.c
> +++ b/lib/librte_ring/rte_ring.c
> @@ -208,6 +208,49 @@ rte_ring_create(const char *name, unsigned count, int 
> socket_id,
>   return r;
>  }
>  
> +/* free the ring */
> +void
> +rte_ring_free(struct rte_ring *r)
> +{
> + struct rte_ring_list *ring_list = NULL;
> + char mz_name[RTE_MEMZONE_NAMESIZE];
> + struct rte_tailq_entry *te;
> + const struct rte_memzone *mz;
> +
> + if (r == NULL)
> + return;
> +
> + snprintf(mz_name, sizeof(mz_name), "%s%s", RTE_RING_MZ_PREFIX, r->name);
> + mz = rte_memzone_lookup(mz_name);
> +
> + /*
> +  * Free ring memory if it was allocated with rte_memzone_reserve,
> +  * otherwise it should be freed externally
> +  */
> + if (rte_memzone_free(mz) != 0)
> + return;

Should we have a log here? I think it may hide important
bugs if we just return silently here.

> +
> + ring_list = RTE_TAILQ_CAST(rte_ring_tailq.head, rte_ring_list);
> + rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
> +
> + /* find out tailq entry */
> + TAILQ_FOREACH(te, ring_list, next) {
> + if (te->data == (void *) r)
> + break;
> + }
> +
> + if (te == NULL) {
> + rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
> + return;
> + }

If I understand correctly, a ring is in the tailq only if it was
created with rte_ring_create(). A ring that is statically created
in memory is not in the tailq. But we already returned in that
case. So (te == NULL) should not happen here, right? We could
also add an error log then.

I'm not sure we should handle the case where the ring is not allocated
with rte_ring_create() in this function. If the ring is allocated by
other means (either as a global variable, or with another dynamic
memory allocator), this function should not be called.

What do you think?
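
For reference, the intended pairing would then be (minimal sketch):

/* only rings obtained from rte_ring_create() would be freed this way */
static int
ring_create_use_free(void)
{
        struct rte_ring *r = rte_ring_create("test_ring", 1024,
                                             rte_socket_id(), 0);

        if (r == NULL)
                return -1;
        /* ... use the ring ... */
        rte_ring_free(r);   /* proposed API: releases memzone and tailq entry */
        return 0;
}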


Regards,
Olivier


> +
> + TAILQ_REMOVE(ring_list, te, next);
> +
> + rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
> +
> + rte_free(te);
> +}
> +
>  /*
>   * change the high water mark. If *count* is 0, water marking is
>   * disabled
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index af6..e75566f 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -300,6 +300,13 @@ int rte_ring_init(struct rte_ring *r, const char *name, 
> unsigned count,
>   */
>  struct rte_ring *rte_ring_create(const char *name, unsigned count,
>int socket_id, unsigned flags);
> +/**
> + * De-allocate all memory used by the ring.
> + *
> + * @param r
> + *   Ring to free
> + */
> +void rte_ring_free(struct rte_ring *r);
>  
>  /**
>   * Change the high water mark.
> diff --git a/lib/librte_ring/rte_ring_version.map 
> b/lib/librte_ring/rte_ring_version.map
> index 982fdd1..5474b98 100644
> --- a/lib/librte_ring/rte_ring_version.map
> +++ b/lib/librte_ring/rte_ring_version.map
> @@ -11,3 +11,10 @@ DPDK_2.0 {
>  
>   local: *;
>  };
> +
> +DPDK_2.2 {
> + global:
> +
> + rte_ring_free;
> +
> +} DPDK_2.0;
> 


[dpdk-dev] ixgbe: account more Rx errors Issue

2015-09-07 Thread Olivier MATZ
Hi,

On 09/06/2015 07:15 PM, Tahhan, Maryam wrote:
>> From: Andriy Berestovskyy [mailto:aber at semihalf.com]
>> Sent: Friday, September 4, 2015 5:59 PM
>> To: Tahhan, Maryam
>> Cc: dev at dpdk.org; Olivier MATZ
>> Subject: Re: ixgbe: account more Rx errors Issue
>>
>> Hi Maryam,
>> Please see below.
>>
>>> XEC counts the Number of receive IPv4, TCP, UDP or SCTP XSUM errors
>>
>> Please note than UDP checksum is optional for IPv4, but UDP packets with
>> zero checksum hit XEC.
>>
> 
> I understand, but this is what the hardware register is picking up and what I 
> included previously is the definitions of the registers from the datasheet.
> 
>>> And general crc errors counts Counts the number of receive packets with
>> CRC errors.
>>
>> Let me explain you with an example.
>>
>> DPDK 2.0 behavior:
>> host A sends 10M IPv4 UDP packets (no checksum) to host B host B stats: 9M
>> ipackets + 1M ierrors (missed) = 10M
>>
>> DPDK 2.1 behavior:
>> host A sends 10M IPv4 UDP packets (no checksum) to host B host B stats: 9M
>> ipackets + 11M in ierrors (1M missed + 10M XEC) = 20M?
> 
> Because it's hitting the 2 error registers. If you had packets with multiple 
> errors that are added up as part of ierrors, you'll still get more than 
> 10M errors, which is why I asked for feedback on the 3 suggestions below. What 
> I'm saying is that the number of errors being > the number of received packets 
> will be seen whenever you hit multiple error registers on the NIC.
> 
>>
>>> So our options are we can:
>>> 1. Add only one of these into the error stats.
>>> 2. We can introduce some cooking of stats in this scenario, so only add
>> either or if they are equal or one is higher than the other.
>>> 3. Add them all which means you can have more errors than the number of
>> received packets, but TBH this is going to be the case if your packets have
>> multiple errors anyway.
>>
>> 4. ierrors should reflect NIC drops only.
> 
> I may have misinterpreted this, but in rte_ethdev.h ierrors is defined as 
> the total number of erroneous received packets.
> Maybe we need a clear definition, or a separate drop counter, as I see uint64_t 
> q_errors defined as: Total number of queue packets received that are dropped.
> 
>> XEC does not count drops, so IMO it should be removed from ierrors.
> 
> While it's picking up the 0 checksum as an error (which it shouldn't 
> necessarily be doing), removing it could mean missing other valid L3/L4 
> checksum errors... Let me experiment some more with L3/L4 checksum errors and 
> crcerrs to see if we can cook the stats around this register in particular. I 
> would hate to remove it and miss genuine errors.

For me, the definition that looks the most straightforward is:

ipackets = packets successfully received by hardware
imissed = packets dropped by hardware because the software does
  not poll fast enough (= queue full)
ierrors = packets dropped by hardware (malformed packets, ...)

These 3 stats never count the same packet twice.

If we want more statistics, they could go in xstats. For instance,
a counter for invalid checksums. The definition of these stats would
be PMD-specific.
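For reference, an application can already dump such driver-specific
counters with rte_eth_xstats_get(), roughly like this (a sketch from
memory of the xstats API, not tested; port_id is the application's port):

	/* rough sketch: dump the extended (driver-specific) stats of a port */
	struct rte_eth_xstats xstats[128];
	int i, n;

	n = rte_eth_xstats_get(port_id, xstats, 128);
	for (i = 0; i < n && i < 128; i++)
		printf("%s: %"PRIu64"\n", xstats[i].name, xstats[i].value);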

I agree we should clarify and have a consensus on the definitions
before going further.


Regards,
Olivier



> 
>>
>> Please note that we still can access the XEC using rte_eth_xstats_get()
>>
>>
>> Regards,
>> Andriy


[dpdk-dev] [PATCH RFC] mbuf/ip_frag: Move mbuf chaining to common code

2015-09-07 Thread Olivier MATZ
Hi Simon,

I think it's a good idea. Please see some minor comments below.

On 08/31/2015 02:41 PM, Simon Kagstrom wrote:
> Chaining/segmenting mbufs can be useful in many places, so make it
> global.
> 
> Signed-off-by: Simon Kagstrom 
> Signed-off-by: Johan Faltstrom 
> ---
> NOTE! Only compile-tested.
> 
> We were looking for packet segmenting functionality in the MBUF API but
> didn't find it. This patch moves the implementation, apart from the
> things which look ip_frag-specific.
> 
>  lib/librte_ip_frag/ip_frag_common.h  | 23 ---
>  lib/librte_ip_frag/rte_ipv4_reassembly.c |  7 +--
>  lib/librte_ip_frag/rte_ipv6_reassembly.c |  7 +--
>  lib/librte_mbuf/rte_mbuf.h   | 23 +++
>  4 files changed, 33 insertions(+), 27 deletions(-)
> 
> diff --git a/lib/librte_ip_frag/ip_frag_common.h 
> b/lib/librte_ip_frag/ip_frag_common.h
> index 6b2acee..cde6ed4 100644

> [...]

> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 8c2db1b..ef47256 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -1801,6 +1801,29 @@ static inline int rte_pktmbuf_is_contiguous(const 
> struct rte_mbuf *m)
>  }
>  
>  /**
> + * Chain an mbuf to another, thereby creating a segmented packet.
> + *
> + * @param head the head of the mbuf chain (the first packet)
> + * @param tail the mbuf to put last in the chain
> + */
> +static inline void rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf 
> *tail)
> +{
> + struct rte_mbuf *cur_tail;
> +

Here, we could check whether the pkt_len of the tail mbuf is 0. If
that's the case, we can just free it and return. That would avoid
having an empty segment inside the mbuf chain, which can be
annoying.

if (unlikely(rte_pktmbuf_pkt_len(tail) == 0)) {
rte_pktmbuf_free(tail);
return;
}

> + /* Chain 'tail' onto the old tail */
> + cur_tail = rte_pktmbuf_lastseg(head);
> + cur_tail->next = tail;
> +
> + /* accumulate number of segments and total length. */
> + head->nb_segs = (uint8_t)(head->nb_segs + tail->nb_segs);

I'm wondering if we shouldn't check for overflow here (nb_segs is
a uint8_t). In that case we would need a return value to report
the failure.

> + head->pkt_len += tail->pkt_len;
> +
> + /* reset pkt_len and nb_segs for chained fragment. */
> + tail->pkt_len = tail->data_len;
> + tail->nb_segs = 1;

I don't think it's required to reset these fields in the tail mbuf.
In any case, they will be reset again.
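
To summarize, the whole helper could look like this (an untested sketch
combining the comments above; the exact overflow limit and error code are
only assumptions on my side):

	static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail)
	{
		struct rte_mbuf *cur_tail;

		/* avoid chaining an empty segment */
		if (unlikely(rte_pktmbuf_pkt_len(tail) == 0)) {
			rte_pktmbuf_free(tail);
			return 0;
		}

		/* nb_segs is a uint8_t: refuse to overflow it */
		if (unlikely(head->nb_segs + tail->nb_segs > UINT8_MAX))
			return -EOVERFLOW; /* or any agreed error code */

		/* chain 'tail' onto the old tail */
		cur_tail = rte_pktmbuf_lastseg(head);
		cur_tail->next = tail;

		/* accumulate number of segments and total length */
		head->nb_segs = (uint8_t)(head->nb_segs + tail->nb_segs);
		head->pkt_len += tail->pkt_len;

		return 0;
	}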

> +}
> +
> +/**
>   * Dump an mbuf structure to the console.
>   *
>   * Dump all fields for the given packet mbuf and all its associated
> 


Regards,
Olivier


[dpdk-dev] [PATCH 3/4] virtio: use indirect ring elements

2015-09-07 Thread Thomas Monjalon
General comment after reading this page:
http://dpdk.org/dev/patchwork/patch/6905/
Please remove useless context when replying to make answers shorter
and easier to read.
Thanks



[dpdk-dev] [RFC PATCH 0/6] remove pci driver from vdevs

2015-09-07 Thread Zende, Amruta S
Certain functions like "rte_eth_dev_socket_id" assume the device to be a PCI 
device and access pointers like rte_eth_devices[port_id].pci_dev->numa_node. 
Any such assumptions and dependencies should also be removed.

Thanks & regards,
Amruta Zende
+91-20-4305-2969

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Iremonger, Bernard
Sent: Thursday, September 3, 2015 7:33 PM
To: Thomas Monjalon; Neil Horman
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] [RFC PATCH 0/6] remove pci driver from vdevs

Hi Neil, Thomas,

> > 
> >
> > > > You didn't remove the relationship of the ethdev to the pci 
> > > > driver though, which is really the problem, An ethdev may reside 
> > > > on any number of bus types (pci/usb/vmbus/virt/none).
> >
> > 
> >
> > > >  Whats really needed is a way to associate an ethdev with an 
> > > > arbitrary bus
> >
> > 
> >
> > > > > Please see email below from 6Wind
> > > > >
> > > > > http://dpdk.org/ml/archives/dev/2015-July/022107.html
> > > > >
> > > > I think you misread that.  I think all Thomas is asking for 
> > > > (correct me if I'm wrong Thomas), is for someone to start 
> > > > refactoring the ethdev registration code so that we can have a 
> > > > single init path without the need for wierd typing and 
> > > > differentiation at
> init time.
> >
> > 
> >
> > > >  We just need to
> > > > start thinking about how to make bus registration independent of 
> > > > ethernet device registration, and while your patch series sort 
> > > > of eliminates some of that, its really not a proper refactoring 
> > > > of the sort I think Thomas is asking for.
> >
> > I am just trying to distill what the actual requirement is from the 
> > above
> discussion.
> >
> > (1) Remove relationship of the ethdev to the pci driver.
> Correct

I plan to continue work on the RFC to address this.

> 
> > (2) Refactor ethdev registration code so that there is a single init  path.
> Correct (or mostly correct, in rereading my initial post, there may be 
> some ambiguitiy here)
> 
> > (3) Make bus registration independent of ethdev registration.
> Correct
> 
> > (4) Change all (17) PMD's to use the  modified structures.
> >
> Correct (I assume the 17 number is accurate here)

There are 17 PMD directories, some of which contain multiple PMDs.
The total number of PMDs is 23 at present.


> > The rte_eal_driver_registration() code is  in librte_eal,  untouched 
> > as yet by
> this patch set.
> >
> Correct, the code that registers an init function is a single path, which is 
> great.
> However, that path requires that you specify a device type (in this 
> case PMD_PDEV or PMD_VDEV to differentiate pci vs virtual devices (it 
> uses this for ordering of initialization in rte_eal_dev_init, which is 
> a hack).  What would be preferred would be if the structure that was 
> registered via that macro only held a name and an init routine to 
> initialize the driver itself. Inside that init routine, the driver 
> would then be responsible for registering with the appropriate bus 
> type (virtual/pci/usb/whatever).  A secondary function would be 
> registered as part of that process (akin to the linux/bsd probe()
> method) which was called once for each instance of the device found on 
> that bus.
> 

I will send a second RFC to address the eal driver registration code issues in 
librte_eal.

> > The rte_eth_driver_registration() code is in librte_ether.
> > Should the pci fields be removed from the struct rte_eth_dev{} and 
> > struct eth_driver{},
> IMO, yes, they should, as the driver can store pointers to those 
> devices in their private per-device data area.  That said, there may 
> be value in including a union of all bus types in the ethdev itself, 
> if there are higher layer functions that need to be aware of a given 
> ethdevs bus type

I plan to park the issue of multiple bus types for now.
More consensus is needed on the best way forward. 

> 
> > and put somewhere else or replaced with a union of bus  types and
> drivers?

Regards,

Bernard.




[dpdk-dev] CloudNetEngine vSwitch boosts performance

2015-09-07 Thread Thomas Monjalon
Hi Jun Xiao,

2015-09-05 10:01, Jun Xiao:
> CloudNetEngine vSwitch boosts performance for both NFV and cloud use cases in 
> technical preview update 3. For the details, please refer to 
> http://cloudnetengine.com/en/blog/2015/09/05/cloudnetengine-vswitch-boosts-performance-cloud-an/
> One key update: CNE vSwitch is over 65% better than OVS-DPDK for small 
> packet size traffic. (It was about a 10-20% advantage in technical preview 
> update 2.)
> Again, thanks for all the contributions that make DPDK such a great 
> foundation!
> Jun Xiao
> www.cloudnetengine.com

Closed and Open Source products based on DPDK are encouraged.
It is nice to announce new stuff, but please do it only once.
This mailing list focuses on DPDK development discussions, i.e. drivers and 
algorithms.
Thanks for understanding.


[dpdk-dev] [PATCH RFC] mbuf/ip_frag: Move mbuf chaining to common code

2015-09-07 Thread Ananyev, Konstantin
Hi lads,

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier MATZ
> Sent: Monday, September 07, 2015 8:33 AM
> To: Simon Kagstrom; dev at dpdk.org; Zhang, Helin; Gonzalez Monroy, Sergio; 
> Burakov, Anatoly
> Subject: Re: [dpdk-dev] [PATCH RFC] mbuf/ip_frag: Move mbuf chaining to 
> common code
> 
> Hi Simon,
> 
> I think it's a good idea. Please see some minor comments below.
> 
> On 08/31/2015 02:41 PM, Simon Kagstrom wrote:
> > Chaining/segmenting mbufs can be useful in many places, so make it
> > global.
> >
> > Signed-off-by: Simon Kagstrom 
> > Signed-off-by: Johan Faltstrom 
> > ---
> > NOTE! Only compile-tested.
> >
> > We were looking for packet segmenting functionality in the MBUF API but
> > didn't find it. This patch moves the implementation, apart from the
> > things which look ip_frag-specific.
> >
> >  lib/librte_ip_frag/ip_frag_common.h  | 23 ---
> >  lib/librte_ip_frag/rte_ipv4_reassembly.c |  7 +--
> >  lib/librte_ip_frag/rte_ipv6_reassembly.c |  7 +--
> >  lib/librte_mbuf/rte_mbuf.h   | 23 +++
> >  4 files changed, 33 insertions(+), 27 deletions(-)
> >
> > diff --git a/lib/librte_ip_frag/ip_frag_common.h 
> > b/lib/librte_ip_frag/ip_frag_common.h
> > index 6b2acee..cde6ed4 100644
> 
> > [...]
> 
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 8c2db1b..ef47256 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -1801,6 +1801,29 @@ static inline int rte_pktmbuf_is_contiguous(const 
> > struct rte_mbuf *m)
> >  }
> >
> >  /**
> > + * Chain an mbuf to another, thereby creating a segmented packet.
> > + *
> > + * @param head the head of the mbuf chain (the first packet)
> > + * @param tail the mbuf to put last in the chain
> > + */
> > +static inline void rte_pktmbuf_chain(struct rte_mbuf *head, struct 
> > rte_mbuf *tail)
> > +{
> > +   struct rte_mbuf *cur_tail;
> > +
> 
> Here, we could check if the pkt_len of tail mbuf is 0. If
> it's the case, we can just free it and return. It would avoid
> to have an empty segment inside the mbuf chain, which can be
> annoying.
> 
> if (unlikely(rte_pktmbuf_pkt_len(tail) == 0)) {
>   rte_pktmbuf_free(tail);
>   return;
> }

I wonder why we need to do that.
Probably the head mbuf is out of space and we want to expand it using pktmbuf_chain()?
So in that case it seems logical to:
1) allocate a new mbuf (its pkt_len will be 0)
2) call pktmbuf_chain()
Konstantin

> 
> > +   /* Chain 'tail' onto the old tail */
> > +   cur_tail = rte_pktmbuf_lastseg(head);
> > +   cur_tail->next = tail;
> > +
> > +   /* accumulate number of segments and total length. */
> > +   head->nb_segs = (uint8_t)(head->nb_segs + tail->nb_segs);
> 
> I'm wondering if we shouldn't check the overflow here. In
> this case we would need to have a return value in case of
> failure.
> 
> > +   head->pkt_len += tail->pkt_len;
> > +
> > +   /* reset pkt_len and nb_segs for chained fragment. */
> > +   tail->pkt_len = tail->data_len;
> > +   tail->nb_segs = 1;
> 
> I don't think it's required to reset this fields in the tail mbuf.
> In any case, they will be reset again.
> 
> > +}
> > +
> > +/**
> >   * Dump an mbuf structure to the console.
> >   *
> >   * Dump all fields for the given packet mbuf and all its associated
> >
> 
> 
> Regards,
> Olivier


[dpdk-dev] [PATCH 0/4] RFC virtio performance enhancement and cleanups

2015-09-07 Thread Thomas Monjalon
Hi Stephen,

2015-09-04 13:58, Stephen Hemminger:
> These are compile tested only, haven't debugged or checked out the corner
> case. Submitted for discussion and future planning.
> 
> Stephen Hemminger (4):
>   virtio: clean up space checks on xmit
>   virtio: don't use unlikely for normal tx stuff
>   virtio: use indirect ring elements
>   virtio: use any layout on transmit

Thanks for the nice optimizations.
Do you have some performance numbers to share?


[dpdk-dev] DPDK v2.1.0 with tilegx problem

2015-09-07 Thread Mcnamara, John
CCing the Tile-GX maintainers.

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Arthas
> Sent: Saturday, September 5, 2015 2:49 AM
> To: dev
> Subject: [dpdk-dev] DPDK v2.1.0 with tilegx problem
> 
> Hi, I'm working on tilegx platform. but when I compiled DPDK 2.1.0 with
> tilegx target. I got a error.
> 
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:182: error:
> 'GXIO_MPIPE_IQUEUE_ENTRY_128' undeclared here (not in a function)
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:182: error: array index
> in initializer not of integer type
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:182: error: (near
> initialization for 'mpipe_iqueue_sizes')
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:183: error:
> 'GXIO_MPIPE_IQUEUE_ENTRY_512' undeclared here (not in a function)
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:183: error: array index
> in initializer not of integer type
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:183: error: (near
> initialization for 'mpipe_iqueue_sizes')
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:184: error:
> 'GXIO_MPIPE_IQUEUE_ENTRY_2K' undeclared here (not in a function)
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:184: error: array index
> in initializer not of integer type
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:184: error: (near
> initialization for 'mpipe_iqueue_sizes')
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:185: error:
> 'GXIO_MPIPE_IQUEUE_ENTRY_64K' undeclared here (not in a function)
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:185: error: array index
> in initializer not of integer type
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:185: error: (near
> initialization for 'mpipe_iqueue_sizes')
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c: In function
> 'mpipe_link_mac':
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:1545: warning: implicit
> declaration of function 'gxio_mpipe_link_enumerate_mac'
> /DPDK/dpdk-v2.1.0/drivers/net/mpipe/mpipe_tilegx.c:1545: warning: nested
> extern declaration of 'gxio_mpipe_link_enumerate_mac'
> 
> My tilegx MDE version is TileraMDE-4.2.4.174600, but I can't found any
> definition of GXIO_MPIPE_IQUEUE_ENTRY_128/512/2K/64K.
> only found gxio_mpipe_link_enumerate_mac is a kernel global symbol.
> 
> Did someon got this problem?
> 
> Regards,
> Arthas


[dpdk-dev] pcap->eth low TX performance

2015-09-07 Thread Yerden Zhumabekov
tx burst sends, say, 10-15% of the supplied array. The tail is
being ignored, so I have to drop it to avoid overflow.
The Ethernet device is an 82599.

In my app, I transmit all traffic through a ring and then feed it to the
eth device. That leads to overflow as well.
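
In code terms the send path is roughly this (a sketch; portid, queueid,
pkts and nb are my app's variables):

	uint16_t sent = 0, n;

	do {
		n = rte_eth_tx_burst(portid, queueid, &pkts[sent], nb - sent);
		sent += n;
	} while (sent < nb && n > 0);

	/* whatever was not accepted has to be freed - this is the part I drop */
	for (n = sent; n < nb; n++)
		rte_pktmbuf_free(pkts[n]);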

04.09.2015 20:03, Kyle Larose wrote:
> Are you reading from the pcap faster than the device can transmit?
> Does the app hold off reading from the pcap when the ethdev is pushing
> back, or does it just tail drop?
>
> On Fri, Sep 4, 2015 at 12:14 AM, Yerden Zhumabekov
> <e_zhumabekov at sts.kz> wrote:
>
> Hello,
>
> Did anyone try to work with pcap PMD recently? We're testing our app
> with this setup:
>
> PCAP --- rte_eth_rx_burst--> APP-> rte_eth_tx_burst -> ethdev
>
> I'm experiencing very low TX performance leading to massive mbuf drops
> while trying to send those packets over the Ethernet device. I tried
> running ordinary l2fwd and got the same issue, with over 80-90% of
> packets dropped. When I substitute PCAP with another ordinary Ethernet
> device, everything works fine. Can anyone share an idea?
>
> --
> Sincerely,
>
> Yerden Zhumabekov
> State Technical Service
> Astana, KZ
>
>

-- 
Sincerely,

Yerden Zhumabekov
State Technical Service
Astana, KZ



[dpdk-dev] X710 : ntuple or flow director filtering problems

2015-09-07 Thread VERDOUX, Sylvain
Hello all,

I'm currently trying hardware filtering on an Intel X710 NIC and I am facing several 
problems. I will briefly present what I want to achieve: I have 2 
multicast streams coming in on one port of my NIC with 2 different multicast 
groups/ports, and I want to filter and redirect them to 2 RX queues so that I can 
retrieve the packets in 2 different processes.
To do that, I first tried a simple 2-tuple filter with IP/port. But when calling 
rte_eth_dev_filter_supported(portid, RTE_ETH_FILTER_NTUPLE) I get an error 
(error=-22, invalid argument?), which I find strange, because ethtool -k 
on the interface reports ntuple-filters: on, so I assumed it was supported 
on this NIC.
Then I wanted to try flow director filters (which should be supported, as stated by 
the Intel documentation for this NIC). I tried the testpmd app to see what is 
supported, and it seems signature mode is not. Perfect mode is accepted, but I 
can't apply any mask on the filter fields. In my own program, I configure the 
port's fdir_conf as follows:
struct rte_fdir_conf fdir_conf = {
    .mode = RTE_FDIR_MODE_PERFECT,
    .pballoc = RTE_FDIR_PBALLOC_64K,
    .status = RTE_FDIR_REPORT_STATUS,
    .drop_queue = 127,
    .mask = {
        .vlan_tci_mask = 0x0,
        .ipv4_mask = {
            .src_ip = 0xFFFFFFFF,
            .dst_ip = 0xFFFFFFFF,
        },
        .ipv6_mask = {
            .src_ip = {0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF},
            .dst_ip = {0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF},
        },
        .src_port_mask = 0xFFFF,
        .dst_port_mask = 0xFFFF,
    },
};

When I then use the rte_eth_dev_fdir_set_masks or 
rte_eth_dev_fdir_add_perfect_filter functions, they return a -95 error 
(unsupported by hardware?). So I looked into the testpmd code and found that 
it uses the rte_eth_dev_filter_ctrl function. A call to that with a fdir filter is 
accepted, but I'm unable to set the filter properly. Indeed, I just want to 
filter on dst_ip and dst_port, so I don't really know what to set in the filter's 
other fields (flexbytes, src_ip ...) as I can't apply a mask.
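
For reference, this is roughly the kind of entry I am trying to add through 
rte_eth_dev_filter_ctrl (a sketch written from memory, so the exact 
rte_eth_fdir_filter field names may be off, and the group/port values are 
only placeholders):

	struct rte_eth_fdir_filter entry;
	int ret;

	memset(&entry, 0, sizeof(entry));
	entry.soft_id = 1;
	entry.input.flow_type = RTE_ETH_FLOW_NONFRAG_IPV4_UDP;
	/* values are expected in network byte order */
	entry.input.flow.udp4_flow.ip.dst_ip = rte_cpu_to_be_32(IPv4(239, 0, 0, 1));
	entry.input.flow.udp4_flow.dst_port = rte_cpu_to_be_16(5004);
	entry.action.rx_queue = 1;
	entry.action.behavior = RTE_ETH_FDIR_ACCEPT;
	entry.action.report_status = RTE_ETH_FDIR_REPORT_ID;
	ret = rte_eth_dev_filter_ctrl(portid, RTE_ETH_FILTER_FDIR,
			RTE_ETH_FILTER_ADD, &entry);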

I hope someone can help me on this topic (I know I must be missing something, 
but what?).

Thanks a lot,
Best regards,

Sylvain







[dpdk-dev] [RFC PATCH 5/8] lib/librte_vhost:dequeue vhost TSO offload

2015-09-07 Thread Liu, Jijiang


> -Original Message-
> From: Ouyang, Changchun
> Sent: Monday, August 31, 2015 8:40 PM
> To: Liu, Jijiang; dev at dpdk.org
> Cc: Ouyang, Changchun
> Subject: RE: [dpdk-dev] [RFC PATCH 5/8] lib/librte_vhost:dequeue vhost TSO
> offload
> 
> 
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Jijiang Liu
> > Sent: Monday, August 31, 2015 5:42 PM
> > To: dev at dpdk.org
> > Subject: [dpdk-dev] [RFC PATCH 5/8] lib/librte_vhost:dequeue vhost TSO
> > offload
> >
> > Dequeue vhost TSO offload
> >
> > Signed-off-by: Jijiang Liu 
> > ---
> >  lib/librte_vhost/vhost_rxtx.c |   29 -
> >  1 files changed, 28 insertions(+), 1 deletions(-)
> >
> > diff --git a/lib/librte_vhost/vhost_rxtx.c
> > b/lib/librte_vhost/vhost_rxtx.c index 0d07338..9adfdb1 100644
> > --- a/lib/librte_vhost/vhost_rxtx.c
> > +++ b/lib/librte_vhost/vhost_rxtx.c
> > @@ -545,6 +545,30 @@ rte_vhost_enqueue_burst(struct virtio_net *dev,
> > uint16_t queue_id,
> > return virtio_dev_rx(dev, queue_id, pkts, count);  }
> >
> > +static inline void __attribute__((always_inline))
> > +vhost_dequeue_offload(uint64_t addr, struct rte_mbuf *m) {
> > +   struct virtio_net_hdr *hdr =
> > +   (struct virtio_net_hdr *)((uintptr_t)addr);
> > +
> > +   if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> > +   switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
> > +   case VIRTIO_NET_HDR_GSO_TCPV4:
> > +   m->ol_flags |= (PKT_TX_IPV4 | PKT_TX_TCP_SEG);
> > +   m->tso_segsz = hdr->gso_size;
> > +   break;
> > +   case VIRTIO_NET_HDR_GSO_TCPV6:
> > +   m->ol_flags |= (PKT_TX_IPV6 | PKT_TX_TCP_SEG);
> > +   m->tso_segsz = hdr->gso_size;
> > +   break;
> > +   default:
> > +   RTE_LOG(ERR, VHOST_DATA,
> > +   "bad gso type %u.\n", hdr->gso_type);
> > +   break;
> 
> Do we need special handling for the bad gso type?
Yes, we need to return an error, log it, and abort this operation.

I will change it in next version.
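
Roughly like this (just a sketch of the planned change, not the final code):

	static inline int __attribute__((always_inline))
	vhost_dequeue_offload(uint64_t addr, struct rte_mbuf *m)
	{
		struct virtio_net_hdr *hdr =
			(struct virtio_net_hdr *)((uintptr_t)addr);

		if (hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
			return 0;

		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
		case VIRTIO_NET_HDR_GSO_TCPV4:
			m->ol_flags |= (PKT_TX_IPV4 | PKT_TX_TCP_SEG);
			m->tso_segsz = hdr->gso_size;
			break;
		case VIRTIO_NET_HDR_GSO_TCPV6:
			m->ol_flags |= (PKT_TX_IPV6 | PKT_TX_TCP_SEG);
			m->tso_segsz = hdr->gso_size;
			break;
		default:
			/* unknown gso type: log it and let the caller drop the packet */
			RTE_LOG(ERR, VHOST_DATA,
				"bad gso type %u.\n", hdr->gso_type);
			return -1;
		}

		return 0;
	}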

> 
> > +   }
> > +   }
> > +}
> > +
> >  uint16_t
> >  rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
> > struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t
> > count) @@ -553,6 +577,7 @@ rte_vhost_dequeue_burst(struct virtio_net
> > *dev, uint16_t queue_id,
> > struct vhost_virtqueue *vq;
> > struct vring_desc *desc;
> > uint64_t vb_addr = 0;
> > +   uint64_t vb_net_hdr_addr = 0;
> > uint32_t head[MAX_PKT_BURST];
> > uint32_t used_idx;
> > uint32_t i;
> > @@ -604,6 +629,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev,
> > uint16_t queue_id,
> >
> > desc = &vq->desc[head[entry_success]];
> >
> > +   vb_net_hdr_addr = gpa_to_vva(dev, desc->addr);
> > +
> > /* Discard first buffer as it is the virtio header */
> > if (desc->flags & VRING_DESC_F_NEXT) {
> > desc = &vq->desc[desc->next];
> > @@ -742,7 +769,7 @@ rte_vhost_dequeue_burst(struct virtio_net *dev,
> > uint16_t queue_id,
> > break;
> >
> > m->nb_segs = seg_num;
> > -
> > +   vhost_dequeue_offload(vb_net_hdr_addr, m);
> > pkts[entry_success] = m;
> > vq->last_used_idx++;
> > entry_success++;
> > --
> > 1.7.7.6



[dpdk-dev] [RFC PATCH 4/8] driver/virtio:enqueue TSO offload

2015-09-07 Thread Liu, Jijiang


> -Original Message-
> From: Ouyang, Changchun
> Sent: Monday, August 31, 2015 8:29 PM
> To: Liu, Jijiang; dev at dpdk.org
> Cc: Ouyang, Changchun
> Subject: RE: [dpdk-dev] [RFC PATCH 4/8] driver/virtio:enqueue TSO offload
> 
> 
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Jijiang Liu
> > Sent: Monday, August 31, 2015 5:42 PM
> > To: dev at dpdk.org
> > Subject: [dpdk-dev] [RFC PATCH 4/8] driver/virtio:enqueue TSO offload
> >
> > Enqueue TSO4/6 offload.
> >
> > Signed-off-by: Jijiang Liu 
> > ---
> >  drivers/net/virtio/virtio_rxtx.c |   23 +++
> >  1 files changed, 23 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/net/virtio/virtio_rxtx.c
> > b/drivers/net/virtio/virtio_rxtx.c
> > index c5b53bb..4c2d838 100644
> > --- a/drivers/net/virtio/virtio_rxtx.c
> > +++ b/drivers/net/virtio/virtio_rxtx.c
> > @@ -198,6 +198,28 @@ virtqueue_enqueue_recv_refill(struct virtqueue
> > *vq, struct rte_mbuf *cookie)
> > return 0;
> >  }
> >
> > +static void
> > +virtqueue_enqueue_offload(struct virtqueue *txvq, struct rte_mbuf *m,
> > +   uint16_t idx, uint16_t hdr_sz)
> > +{
> > +   struct virtio_net_hdr *hdr = (struct virtio_net_hdr *)(uintptr_t)
> > +   (txvq->virtio_net_hdr_addr + idx * hdr_sz);
> > +
> > +   if (m->tso_segsz != 0 && m->ol_flags & PKT_TX_TCP_SEG) {
> > +   if (m->ol_flags & PKT_TX_IPV4) {
> > +   if (!vtpci_with_feature(txvq->hw,
> > VIRTIO_NET_F_HOST_TSO4))
> > +   return;
> 
> Do we need return error if host can't handle tso for the packet?
> 
> > +   hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> > +   } else if (m->ol_flags & PKT_TX_IPV6) {
> > +   if (!vtpci_with_feature(txvq->hw,
> > VIRTIO_NET_F_HOST_TSO6))
> > +   return;
> 
> Same as above
> 
> > +   hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> > +   }
> 
> Do we need else branch for the case of neither tcpv4 nor tcpv6?
> 
> > +   hdr->gso_size = m->tso_segsz;
> > +   hdr->hdr_len = m->l2_len + m->l3_len + m->l4_len;
> > +   }
> > +}
> > +
> >  static int
> >  virtqueue_enqueue_xmit(struct virtqueue *txvq, struct rte_mbuf
> > *cookie) { @@ -221,6 +243,7 @@ virtqueue_enqueue_xmit(struct virtqueue
> > *txvq, struct rte_mbuf *cookie)
> > dxp->cookie = (void *)cookie;
> > dxp->ndescs = needed;
> >
> > +   virtqueue_enqueue_offload(txvq, cookie, idx, head_size);
> 
> If TSO is not enabled in the feature bit, how to resolve here?

The TSO enablement check is done inside the function.

If TSO is not enabled, we don't need to fill the virtio_net_hdr.

> > start_dp = txvq->vq_ring.desc;
> > start_dp[idx].addr =
> > txvq->virtio_net_hdr_mem + idx * head_size;
> > --
> > 1.7.7.6



[dpdk-dev] vhost compliant virtio based networking interface in container

2015-09-07 Thread Xie, Huawei
On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
> On 2015/08/25 18:56, Xie, Huawei wrote:
>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
>>> Hi Xie and Yanping,
>>>
>>>
>>> May I ask you some questions?
>>> It seems we are also developing an almost same one.
>> Good to know that we are tackling the same problem and have a similar
>> idea.
>> What is your status now? We have the POC running, and it is compliant with
>> dpdkvhost.
>> Interrupt-like notification isn't supported.
> We implemented vhost PMD first, so we just start implementing it.
>
>>> On 2015/08/20 19:14, Xie, Huawei wrote:
 Added dev at dpdk.org

 On 8/20/2015 6:04 PM, Xie, Huawei wrote:
> Yanping:
> I read your mail, seems what we did are quite similar. Here i wrote a
> quick mail to describe our design. Let me know if it is the same thing.
>
> Problem Statement:
> We don't have a high-performance networking interface in containers for
> NFV. The current veth-pair-based interface can't easily be accelerated.
>
> The key components involved:
> 1.DPDK based virtio PMD driver in container.
> 2.device simulation framework in container.
> 3.dpdk(or kernel) vhost running in host.
>
> How virtio is created?
> A:  There is no "real" virtio-pci device in the container environment.
> 1) The host maintains pools of memory and shares memory with the container.
> This could be accomplished by the host sharing a hugepage file with the
> container.
> 2) The container creates virtio rings based on the shared memory.
> 3) The container creates mbuf memory pools on the shared memory.
> 4) The container sends the memory and vring information to vhost through
> vhost messages. This could be done either through an ioctl call or a
> vhost-user message.
>
> How vhost message is sent?
> A: There are two alternative ways to do this.
> 1) The customized virtio PMD is responsible for all the vring creation,
> and vhost message sending.
>>> Above is our approach so far.
>>> It seems Yanping also takes this kind of approach.
>>> We are using vhost-user functionality instead of using the vhost-net
>>> kernel module.
>>> Probably this is the difference between Yanping and us.
>> In my current implementation, the device simulation layer talks to "user
>> space" vhost through the cuse interface. It could also be done through a
>> vhost-user socket. This isn't the key point.
>> Here "vhost-user" is kind of confusing; maybe "user space vhost" is more
>> accurate, whether via cuse or a unix domain socket. :)
>>
>> As for Yanping, they are now connecting to the vhost-net kernel module, but
>> they are also trying to connect to "user space" vhost.  Correct me if I'm wrong.
>> Yes, there is some difference between these two. The vhost-net kernel module
>> can directly access another process's memory, while with
>> vhost-user (cuse/user) we need to do the memory mapping.
>>> BTW, we are going to submit a vhost PMD for DPDK-2.2.
>>> This PMD is implemented on librte_vhost.
>>> It allows DPDK application to handle a vhost-user(cuse) backend as a
>>> normal NIC port.
>>> This PMD should work with both Xie and Yanping approach.
>>> (In the case of Yanping approach, we may need vhost-cuse)
>>>
> 2) We could do this through a lightweight device simulation framework.
> The device simulation creates a simple PCI bus. On the PCI bus,
> virtio-net PCI devices are created. The device simulation provides
> an IOAPI for MMIO/IO access.
>>> Does it mean you implemented a kernel module?
>>> If so, do you still need vhost-cuse functionality to handle vhost
>>> messages in userspace?
>> The device simulation is a library running in user space in the container. 
>> It is linked with the DPDK app. It creates pseudo buses and virtio-net PCI
>> devices.
>> The virtio-container-PMD configures the virtio-net pseudo devices
>> through the IOAPI provided by the device simulation rather than through IO
>> instructions as in KVM.
>> Why do we use device simulation?
>> We could create other virtio devices in the container, and provide a common
>> way to talk to the vhost-xx module.
> Thanks for the explanation.
> At first reading, I thought the difference between approach 1 and
> approach 2 was whether or not we need to implement a new kernel module.
> But now I understand how you implemented it.
>
> Please let me explain our design a bit more.
> We might use a somewhat similar approach to handle a pseudo virtio-net
> device in DPDK.
> (Anyway, we haven't finished implementing it yet, so this overview might have
> some technical problems.)
>
> Step 1. Separate the virtio-net and vhost-user socket related code from QEMU,
> then implement it as a separate program.
> The program also has the features below.
>  - Create a directory that contains almost the same files as
> /sys/bus/pci/device//*
>(To scan these files, located outside sysfs, we need to fix the EAL)
>  - This dummy device is driven by dummy-virtio-net-driver. This name is
> specified by the '/driver' file.
>  - Create a