[dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf

2014-09-18 Thread Bruce Richardson
On Thu, Sep 18, 2014 at 11:29:30AM -0400, Neil Horman wrote:
> On Thu, Sep 18, 2014 at 02:36:13PM +0100, Bruce Richardson wrote:
> > On Wed, Sep 17, 2014 at 01:59:36PM -0400, Neil Horman wrote:
> > > On Wed, Sep 17, 2014 at 03:35:19PM +, Richardson, Bruce wrote:
> > > > 
> > > > > -Original Message-
> > > > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > Sent: Wednesday, September 17, 2014 4:21 PM
> > > > > To: Richardson, Bruce
> > > > > Cc: dev at dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve 
> > > > > slow-path tx
> > > > > perf
> > > > > 
> > > > > On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> > > > > > Make a small improvement to slow path TX performance by adding in a
> > > > > > prefetch for the second mbuf cache line.
> > > > > > Also move assignment of l2/l3 length values only when needed.
> > > > > >
> > > > > > Signed-off-by: Bruce Richardson 
> > > > > > ---
> > > > > >  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++-
> > > > > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > index 6f702b3..c0bb49f 100644
> > > > > > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct 
> > > > > > rte_mbuf
> > > > > **tx_pkts,
> > > > > > ixgbe_xmit_cleanup(txq);
> > > > > > }
> > > > > >
> > > > > > +   rte_prefetch0(&txe->mbuf->pool);
> > > > > > +
> > > > > 
> > > > > Can you explain what all of these prefetches are doing?  It looks to 
> > > > > me like
> > > > > they're just fetching the first cacheline of the mempool structure, 
> > > > > which it
> > > > > appears amounts to the pool's name.  I don't see that having any use 
> > > > > here.
> > > > > 
> > > > This does make a decent enough performance difference in my tests (the 
> > > > amount varies depending on the RX path being used by testpmd). 
> > > > 
> > > > What I've done with the prefetches is two-fold:
> > > > 1) changed it from prefetching the mbuf (first cache line) to 
> > > > prefetching the mbuf pool pointer (second cache line) so that when we 
> > > > go to access the pool pointer to free transmitted mbufs we don't get a 
> > > > cache miss. When clearing the ring and freeing mbufs, the pool pointer 
> > > > is the only mbuf field used, so we don't need that first cache line.
> > > ok, this makes some sense, but you're not guaranteed to either have that
> > > prefetch be needed, nor are you certain it will still be in cache by the 
> > > time
> > > you get to the free call.  Seems like it might be preferable to prefetch 
> > > the
> > > data pointed to by tx_pkt, as you're sure to use that every loop 
> > > iteration.
> > 
> > The vast majority of the times the prefetch is necessary, and it does help 
> > performance doing things this way. If the prefetch is not necessary, it's 
> > just one extra instruction, while, if it is needed, having the prefetch 
> > occur 20 cycles before access (picking an arbitrary value) means that we 
> > have cut down the time it takes to pull the data from cache when it is 
> > needed by 20 cycles.
> I understand how prefetch works. What I'm concerned about is its overuse, and
> its tendency to frequently need re-calibration (though I admit I missed the &
> operator in the patch, and thought you were prefetching the contents of the
> struct, not the pointer value itself).  As you say, if the pool pointer is
> almost certain to be used, then it may well make sense to prefetch the data, 
> but
> in doing so, you potentially evict something that you were about to use, so
> you're not doing yourself any favors.  I understand that you've validated this
> experimentally, and so it works, right now.  I just like to be very careful
> about how prefetch happens, as it can easily (and silently) start hurting far
> more than it helps.
> 
>

[dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf

2014-09-18 Thread Neil Horman
On Thu, Sep 18, 2014 at 04:42:36PM +0100, Bruce Richardson wrote:
> On Thu, Sep 18, 2014 at 11:29:30AM -0400, Neil Horman wrote:
> > On Thu, Sep 18, 2014 at 02:36:13PM +0100, Bruce Richardson wrote:
> > > On Wed, Sep 17, 2014 at 01:59:36PM -0400, Neil Horman wrote:
> > > > On Wed, Sep 17, 2014 at 03:35:19PM +, Richardson, Bruce wrote:
> > > > > 
> > > > > > -Original Message-
> > > > > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > > Sent: Wednesday, September 17, 2014 4:21 PM
> > > > > > To: Richardson, Bruce
> > > > > > Cc: dev at dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve 
> > > > > > slow-path tx
> > > > > > perf
> > > > > > 
> > > > > > On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> > > > > > > Make a small improvement to slow path TX performance by adding in 
> > > > > > > a
> > > > > > > prefetch for the second mbuf cache line.
> > > > > > > Also move assignment of l2/l3 length values only when needed.
> > > > > > >
> > > > > > > Signed-off-by: Bruce Richardson 
> > > > > > > ---
> > > > > > >  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++-
> > > > > > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > > > > > >
> > > > > > > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > > index 6f702b3..c0bb49f 100644
> > > > > > > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > > @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct 
> > > > > > > rte_mbuf
> > > > > > **tx_pkts,
> > > > > > >   ixgbe_xmit_cleanup(txq);
> > > > > > >   }
> > > > > > >
> > > > > > > + rte_prefetch0(&txe->mbuf->pool);
> > > > > > > +
> > > > > > 
> > > > > > Can you explain what all of these prefetches are doing?  It looks 
> > > > > > to me like
> > > > > > they're just fetching the first cacheline of the mempool structure, 
> > > > > > which it
> > > > > > appears amounts to the pool's name.  I don't see that having any use 
> > > > > > here.
> > > > > > 
> > > > > This does make a decent enough performance difference in my tests 
> > > > > (the amount varies depending on the RX path being used by testpmd). 
> > > > > 
> > > > > What I've done with the prefetches is two-fold:
> > > > > 1) changed it from prefetching the mbuf (first cache line) to 
> > > > > prefetching the mbuf pool pointer (second cache line) so that when we 
> > > > > go to access the pool pointer to free transmitted mbufs we don't get 
> > > > > a cache miss. When clearing the ring and freeing mbufs, the pool 
> > > > > pointer is the only mbuf field used, so we don't need that first 
> > > > > cache line.
> > > > ok, this makes some sense, but you're not guaranteed to either have that
> > > > prefetch be needed, nor are you certain it will still be in cache by 
> > > > the time
> > > > you get to the free call.  Seems like it might be preferable to 
> > > > prefetch the
> > > > data pointed to by tx_pkt, as you're sure to use that every loop 
> > > > iteration.
> > > 
> > > The vast majority of the times the prefetch is necessary, and it does 
> > > help 
> > > performance doing things this way. If the prefetch is not necessary, it's 
> > > just one extra instruction, while, if it is needed, having the prefetch 
> > > occur 20 cycles before access (picking an arbitrary value) means that we 
> > > have cut down the time it takes to pull the data from cache when it is 
> > > needed by 20 cycles.
> > I understand how prefetch works. What I'm concerned about is its overuse, 
> > and
> > its tendency to frequently need re-calibration (though I admit I missed the 
> > &
> > operator in the patch, and thought you were prefetching the contents of the
> > struct, n

[dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf

2014-09-18 Thread Neil Horman
On Thu, Sep 18, 2014 at 02:36:13PM +0100, Bruce Richardson wrote:
> On Wed, Sep 17, 2014 at 01:59:36PM -0400, Neil Horman wrote:
> > On Wed, Sep 17, 2014 at 03:35:19PM +, Richardson, Bruce wrote:
> > > 
> > > > -Original Message-
> > > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > Sent: Wednesday, September 17, 2014 4:21 PM
> > > > To: Richardson, Bruce
> > > > Cc: dev at dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve 
> > > > slow-path tx
> > > > perf
> > > > 
> > > > On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> > > > > Make a small improvement to slow path TX performance by adding in a
> > > > > prefetch for the second mbuf cache line.
> > > > > Also move assignment of l2/l3 length values only when needed.
> > > > >
> > > > > Signed-off-by: Bruce Richardson 
> > > > > ---
> > > > >  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++-
> > > > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > index 6f702b3..c0bb49f 100644
> > > > > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf
> > > > **tx_pkts,
> > > > >   ixgbe_xmit_cleanup(txq);
> > > > >   }
> > > > >
> > > > > + rte_prefetch0(&txe->mbuf->pool);
> > > > > +
> > > > 
> > > > Can you explain what all of these prefetches are doing?  It looks to me 
> > > > like
> > > > they're just fetching the first cacheline of the mempool structure, 
> > > > which it
> > > > appears amounts to the pool's name.  I don't see that having any use 
> > > > here.
> > > > 
> > > This does make a decent enough performance difference in my tests (the 
> > > amount varies depending on the RX path being used by testpmd). 
> > > 
> > > What I've done with the prefetches is two-fold:
> > > 1) changed it from prefetching the mbuf (first cache line) to prefetching 
> > > the mbuf pool pointer (second cache line) so that when we go to access 
> > > the pool pointer to free transmitted mbufs we don't get a cache miss. 
> > > When clearing the ring and freeing mbufs, the pool pointer is the only 
> > > mbuf field used, so we don't need that first cache line.
> > ok, this makes some sense, but you're not guaranteed to either have that
> > prefetch be needed, nor are you certain it will still be in cache by the 
> > time
> > you get to the free call.  Seems like it might be preferable to prefetch the
> > data pointed to by tx_pkt, as you're sure to use that every loop iteration.
> 
> The vast majority of the times the prefetch is necessary, and it does help 
> performance doing things this way. If the prefetch is not necessary, it's 
> just one extra instruction, while, if it is needed, having the prefetch 
> occur 20 cycles before access (picking an arbitrary value) means that we 
> have cut down the time it takes to pull the data from cache when it is 
> needed by 20 cycles.
I understand how prefetch works. What I'm concerned about is its overuse, and
its tendency to frequently need re-calibration (though I admit I missed the &
operator in the patch, and thought you were prefetching the contents of the
struct, not the pointer value itself).  As you say, if the pool pointer is
almost certain to be used, then it may well make sense to prefetch the data, but
in doing so, you potentially evict something that you were about to use, so
you're not doing yourself any favors.  I understand that you've validated this
experimentally, and so it works, right now.  I just like to be very careful
about how prefetch happens, as it can easily (and silently) start hurting far
more than it helps.
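
To spell that distinction out: with the '&', the prefetch targets the cache line
of the mbuf that holds the 'pool' field (its second cache line); without it, the
prefetch would target the first cache line of the mempool structure itself (the
part noted above as being largely the pool's name). A minimal illustration only --
the helper name is made up, and the field layout is as assumed in this thread:

#include <rte_mbuf.h>
#include <rte_prefetch.h>

static inline void
prefetch_readings(struct rte_mbuf *m)
{
        rte_prefetch0(&m->pool);  /* what the patch does: the mbuf's own
                                   * cache line holding the pool pointer */
        rte_prefetch0(m->pool);   /* what the original question assumed:
                                   * the mempool structure itself */
}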

> As for the value pointed to by tx_pkt, since this is a 
> packet the app has just been working on, it's almost certainly already in 
> l1/l2 cache.  
> 
Not sure I follow you here.  tx_pkts is an array of mbufs passed to the pmd from
rte_eth_tx_burst, which in turn is called by the application.  I don't see any
reasonable guarantee that any of those packets have been touched in sufficiently
recent history that they are likely to be in cache.  It seems like, if you do
want to do prefetching, interrogating nb_tx and doing a prefetch of an
appropriate stride to fill multiple cachelines with successive mbuf headers might
provide superior performance.
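
A rough sketch of that alternative -- prefetching the caller-supplied mbuf
headers a few entries ahead of where the TX loop is working.  The LOOKAHEAD
distance, the helper name and the loop shape are illustrative assumptions, not
code from the patch or the driver:

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define LOOKAHEAD 4     /* arbitrary stride -- would need tuning */

/* Pull in the first cache line of mbuf headers a few entries ahead of
 * the one currently being processed. */
static inline void
tx_loop_sketch(struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
{
        uint16_t i;

        for (i = 0; i < nb_pkts; i++) {
                if (i + LOOKAHEAD < nb_pkts)
                        rte_prefetch0(tx_pkts[i + LOOKAHEAD]);

                /* ... descriptor setup for tx_pkts[i] would go here ... */
        }
}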
Neil



[dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf

2014-09-17 Thread Richardson, Bruce

> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Wednesday, September 17, 2014 4:21 PM
> To: Richardson, Bruce
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path 
> tx
> perf
> 
> On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> > Make a small improvement to slow path TX performance by adding in a
> > prefetch for the second mbuf cache line.
> > Also move assignment of l2/l3 length values only when needed.
> >
> > Signed-off-by: Bruce Richardson 
> > ---
> >  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++-
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > index 6f702b3..c0bb49f 100644
> > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf
> **tx_pkts,
> > ixgbe_xmit_cleanup(txq);
> > }
> >
> > +   rte_prefetch0(&txe->mbuf->pool);
> > +
> 
> Can you explain what all of these prefetches are doing?  It looks to me like
> they're just fetching the first cacheline of the mempool structure, which it
> appears amounts to the pool's name.  I don't see that having any use here.
> 
This does make a decent enough performance difference in my tests (the amount 
varies depending on the RX path being used by testpmd). 

What I've done with the prefetches is two-fold:
1) changed it from prefetching the mbuf (first cache line) to prefetching the 
mbuf pool pointer (second cache line) so that when we go to access the pool 
pointer to free transmitted mbufs we don't get a cache miss. When clearing the 
ring and freeing mbufs, the pool pointer is the only mbuf field used, so we 
don't need that first cache line.
2) changed the code to prefetch earlier - in effect to prefetch one mbuf ahead. 
The original code prefetched the mbuf to be freed as soon as it started 
processing the mbuf to replace it. Instead now, every time we calculate what 
the next mbuf position is going to be we prefetch the mbuf in that position 
(i.e. the mbuf pool pointer we are going to free the mbuf to), even while we 
are still updating the previous mbuf slot on the ring. This gives the prefetch 
much more time to resolve and get the data we need in the cache before we need 
it.
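
As a minimal sketch of that ordering, for illustration only -- struct tx_entry
and reclaim_slot below are stand-ins for the driver's sw_ring handling, not the
actual ixgbe code:

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_prefetch.h>

/* Stand-in for the driver's sw_ring entry; not the real ixgbe_tx_entry. */
struct tx_entry {
        struct rte_mbuf *mbuf;
        uint16_t next_id;
};

static inline void
reclaim_slot(struct tx_entry *sw_ring, uint16_t cur_id)
{
        struct tx_entry *txe = &sw_ring[cur_id];
        struct tx_entry *txn = &sw_ring[txe->next_id];

        /*
         * Prefetch the *next* entry's mbuf pool pointer (second mbuf cache
         * line) while still working on the current entry, so the data has
         * time to arrive before rte_pktmbuf_free_seg() needs it.
         */
        rte_prefetch0(&txn->mbuf->pool);

        if (txe->mbuf != NULL) {
                rte_pktmbuf_free_seg(txe->mbuf);
                txe->mbuf = NULL;
        }
}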

Hope this clarifies things.

/Bruce


[dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf

2014-09-17 Thread Neil Horman
On Wed, Sep 17, 2014 at 03:35:19PM +, Richardson, Bruce wrote:
> 
> > -Original Message-
> > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > Sent: Wednesday, September 17, 2014 4:21 PM
> > To: Richardson, Bruce
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve 
> > slow-path tx
> > perf
> > 
> > On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> > > Make a small improvement to slow path TX performance by adding in a
> > > prefetch for the second mbuf cache line.
> > > Also move assignment of l2/l3 length values only when needed.
> > >
> > > Signed-off-by: Bruce Richardson 
> > > ---
> > >  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++-
> > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > index 6f702b3..c0bb49f 100644
> > > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf
> > **tx_pkts,
> > >   ixgbe_xmit_cleanup(txq);
> > >   }
> > >
> > > + rte_prefetch0(&txe->mbuf->pool);
> > > +
> > 
> > Can you explain what all of these prefetches are doing?  It looks to me like
> > they're just fetching the first cacheline of the mempool structure, which it
> > appears amounts to the pool's name.  I don't see that having any use here.
> > 
> This does make a decent enough performance difference in my tests (the amount 
> varies depending on the RX path being used by testpmd). 
> 
> What I've done with the prefetches is two-fold:
> 1) changed it from prefetching the mbuf (first cache line) to prefetching the 
> mbuf pool pointer (second cache line) so that when we go to access the pool 
> pointer to free transmitted mbufs we don't get a cache miss. When clearing 
> the ring and freeing mbufs, the pool pointer is the only mbuf field used, so 
> we don't need that first cache line.
ok, this makes some sense, but you're not guaranteed to either have that
prefetch be needed, nor are you certain it will still be in cache by the time
you get to the free call.  Seems like it might be preferable to prefetch the
data pointed to by tx_pkt, as you're sure to use that every loop iteration.

> 2) changed the code to prefetch earlier - in effect to prefetch one mbuf 
> ahead. The original code prefetched the mbuf to be freed as soon as it 
> started processing the mbuf to replace it. Instead now, every time we 
> calculate what the next mbuf position is going to be we prefetch the mbuf in 
> that position (i.e. the mbuf pool pointer we are going to free the mbuf to), 
> even while we are still updating the previous mbuf slot on the ring. This 
> gives the prefetch much more time to resolve and get the data we need in the 
> cache before we need it.
> 
Again, early isn't necessarily better, as it just means more time for the data
in cache to get victimized. It seems like it would be better to prefetch the
tx_pkts data a few cache lines ahead.

Neil

> Hope this clarifies things.
> 
> /Bruce
> 


[dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf

2014-09-17 Thread Neil Horman
On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> Make a small improvement to slow path TX performance by adding in a
> prefetch for the second mbuf cache line.
> Also move assignment of l2/l3 length values only when needed.
> 
> Signed-off-by: Bruce Richardson 
> ---
>  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c 
> b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> index 6f702b3..c0bb49f 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf 
> **tx_pkts,
>   ixgbe_xmit_cleanup(txq);
>   }
>  
> + rte_prefetch0(&txe->mbuf->pool);
> +

Can you explain what all of these prefetches are doing?  It looks to me like
they're just fetching the first cacheline of the mempool structure, which it
appears amounts to the pool's name.  I don't see that having any use here.

>   /* TX loop */
>   for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
>   new_ctx = 0;
>   tx_pkt = *tx_pkts++;
>   pkt_len = tx_pkt->pkt_len;
>  
> - RTE_MBUF_PREFETCH_TO_FREE(txe->mbuf);
> -
>   /*
>* Determine how many (if any) context descriptors
>* are needed for offload functionality.
>*/
>   ol_flags = tx_pkt->ol_flags;
> - vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
> - vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
>  
>   /* If hardware offload required */
>   tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
>   if (tx_ol_req) {
> + vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
> + vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
> +
>   /* If new context need be built or reuse the exist ctx. 
> */
>   ctx = what_advctx_update(txq, tx_ol_req,
>   vlan_macip_lens.data);
> @@ -720,7 +721,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>   &txr[tx_id];
>  
>   txn = &sw_ring[txe->next_id];
> - RTE_MBUF_PREFETCH_TO_FREE(txn->mbuf);
> + rte_prefetch0(&txn->mbuf->pool);
>  
>   if (txe->mbuf != NULL) {
>   rte_pktmbuf_free_seg(txe->mbuf);
> @@ -749,6 +750,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>   do {
>   txd = &txr[tx_id];
>   txn = &sw_ring[txe->next_id];
> + rte_prefetch0(&txn->mbuf->pool);
>  
>   if (txe->mbuf != NULL)
>   rte_pktmbuf_free_seg(txe->mbuf);
> -- 
> 1.9.3
> 
> 


[dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf

2014-09-17 Thread Bruce Richardson
Make a small improvement to slow path TX performance by adding in a
prefetch for the second mbuf cache line.
Also move assignment of l2/l3 length values only when needed.

Signed-off-by: Bruce Richardson 
---
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c 
b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 6f702b3..c0bb49f 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
ixgbe_xmit_cleanup(txq);
}

+   rte_prefetch0(&txe->mbuf->pool);
+
/* TX loop */
for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
new_ctx = 0;
tx_pkt = *tx_pkts++;
pkt_len = tx_pkt->pkt_len;

-   RTE_MBUF_PREFETCH_TO_FREE(txe->mbuf);
-
/*
 * Determine how many (if any) context descriptors
 * are needed for offload functionality.
 */
ol_flags = tx_pkt->ol_flags;
-   vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
-   vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;

/* If hardware offload required */
tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
if (tx_ol_req) {
+   vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
+   vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
+
/* If new context need be built or reuse the exist ctx. 
*/
ctx = what_advctx_update(txq, tx_ol_req,
vlan_macip_lens.data);
@@ -720,7 +721,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
&txr[tx_id];

txn = &sw_ring[txe->next_id];
-   RTE_MBUF_PREFETCH_TO_FREE(txn->mbuf);
+   rte_prefetch0(&txn->mbuf->pool);

if (txe->mbuf != NULL) {
rte_pktmbuf_free_seg(txe->mbuf);
@@ -749,6 +750,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
do {
txd = &txr[tx_id];
txn = &sw_ring[txe->next_id];
+   rte_prefetch0(&txn->mbuf->pool);

if (txe->mbuf != NULL)
rte_pktmbuf_free_seg(txe->mbuf);
-- 
1.9.3