[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-30 Thread Thomas Graf
On 01/29/2014 09:47 PM, François-Frédéric Ozog wrote:
> In the telecom world, if you fix the underlying framework of an app, you
> will still have to validate the solution, i.e. app/framework. In addition, the
> idea of shared libraries introduces the implied requirement to validate apps
> against diverse versions of DPDK shared libraries. This translates into
> development and support costs.
>
> I also expect many DPDK applications to tackle core networking features,
> with sub-microsecond packet handling delays, even lower than 200 ns
> (NAT64...). The lazy binding based on the ELF PLT represents quite a cost,
> not to mention that optimization stops at shared library boundaries (gcc
> whole-program optimization can be very effective...). Microsoft DLL linkage
> is an order of magnitude faster. If Linux were to provide that, I would
> probably revise my judgment. (I haven't checked the Linux dynamic linking
> implementation for some time, so my understanding of it
> may be outdated.)

All very valid points and I am not suggesting to stop offering the
static linking option in any way. Dynamic linking will by design result
in more cycles. My sole point is that for a core platform component
like OVS, the shared library benefits _might_ outweigh the performance
difference. In order for a shared library to be effective, some form of
ABI compatibility must be guaranteed though.

> I don't think it is so straightforward. Many recent cards such as Chelsio
> and Myricom have a very different "packet memory layout" that does not fit
> so easily into the current DPDK architecture.
>
> 1) "traditional" architecture: the driver reserves X buffers and provide the
> card with descriptors of those buffers. Each packet is DMA'ed into exactly
> one buffer. Typically you have 2K buffers, a 64 byte packet consumes exactly
> one buffer
>
> 2) "alternative" new architecture: the driver reserves a memory zone, say
> 4MB, without any structure, and provide a a single zone description and a
> ring buffer to the card. (there no individual buffer descriptors any more).
> The card fills the memory zone with packets, one next to the other and
> specifies where the packets are by updating the supplied ring. Out of the
> many issues fitting this scheme into DPDK, you cannot free a single mbuf:
> you have to maintain a ref count to the memory zone so that, when all mbufs
> have been "released", the memory zone can be freed.
> That's quite a stretch from actual paradigm.
>
> Apart from this aspect, managing RSS is too tied to Intel's Flow Director
> concepts and cannot directly accommodate smarter or dumber RSS mechanisms.
>
> That said, I fully agree PMD API should be revisited.

Fair enough. I don't see a reason why multiple interfaces could not
coexist in order to support multiple memory layouts. What I'm hearing
so far is that while there is no objection to bringing stability to the
APIs, it should not result in performance side effects, and it is still
too early to nail down the still-fluid APIs.


[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread François-Frédéric Ozog
> > First and easy answer: it is open source, so anyone can recompile. So,
> > what's the issue?
> 
> I'm talking from a pure distribution perspective here: Requiring all
> DPDK-based applications to be recompiled in order to distribute a bugfix or
> to add support for a new PMD is not ideal.

> 
> So ideally OVS would have the possibility to link against the shared
> library long term.

I agree that distribution of DPDK apps is not covered properly at present.
Identifying the proper scheme requires a specific analysis based on the
constraints of the Telecom/Cloud/Networking markets.

In the telecom world, if you fix the underlying framework of an app, you
will still have to validate the solution, i.e. app/framework. In addition, the
idea of shared libraries introduces the implied requirement to validate apps
against diverse versions of DPDK shared libraries. This translates into
development and support costs.

I also expect many DPDK applications to tackle core networking features,
with sub-microsecond packet handling delays, even lower than 200 ns
(NAT64...). The lazy binding based on the ELF PLT represents quite a cost,
not to mention that optimization stops at shared library boundaries (gcc
whole-program optimization can be very effective...). Microsoft DLL linkage
is an order of magnitude faster. If Linux were to provide that, I would
probably revise my judgment. (I haven't checked the Linux dynamic linking
implementation for some time, so my understanding of it
may be outdated.)


> 
> > I get lost: do you mean ABI + API toward the PMDs or towards the
> > applications using the librte ?
> 
> Towards the PMDs is more straightforward at first so it seems logical to
> focus on that first.

I don't think it is so straightforward. Many recent cards such as Chelsio
and Myricom have a very different "packet memory layout" that does not fit
so easily into the current DPDK architecture.

1) "traditional" architecture: the driver reserves X buffers and provide the
card with descriptors of those buffers. Each packet is DMA'ed into exactly
one buffer. Typically you have 2K buffers, a 64 byte packet consumes exactly
one buffer

2) "alternative" new architecture: the driver reserves a memory zone, say
4MB, without any structure, and provide a a single zone description and a
ring buffer to the card. (there no individual buffer descriptors any more).
The card fills the memory zone with packets, one next to the other and
specifies where the packets are by updating the supplied ring. Out of the
many issues fitting this scheme into DPDK, you cannot free a single mbuf:
you have to maintain a ref count to the memory zone so that, when all mbufs
have been "released", the memory zone can be freed.
That's quite a stretch from actual paradigm.
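
For illustration only, a minimal C sketch of such a ref-counted zone; the
names and layout are hypothetical and not an existing DPDK API:

    #include <stdatomic.h>
    #include <stdlib.h>

    struct pkt_zone {
        void *base;             /* start of the large DMA region, e.g. 4MB */
        size_t len;
        atomic_uint refcnt;     /* one reference per outstanding packet */
    };

    struct zone_mbuf {
        struct pkt_zone *zone;  /* zone this packet lives in */
        void *data;             /* points somewhere inside zone->base */
        size_t pkt_len;
    };

    static void zone_mbuf_free(struct zone_mbuf *m)
    {
        struct pkt_zone *zone = m->zone;

        /* Releasing one packet only frees the zone once the last
         * reference to it is gone. */
        if (atomic_fetch_sub(&zone->refcnt, 1) == 1) {
            free(zone->base);
            free(zone);
        }
        free(m);
    }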

Apart from this aspect, managing RSS is too tied to Intel's Flow Director
concepts and cannot directly accommodate smarter or dumber RSS mechanisms.

That said, I fully agree PMD API should be revisited.


Cordially,

François-Frédéric



[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Thomas Graf
On 01/29/2014 05:34 PM, Vincent JARDIN wrote:
> Thomas,
>
> First and easy answer: it is open source, so anyone can recompile. So,
> what's the issue?

I'm talking from a pure distribution perspective here: Requiring all
DPDK-based applications to be recompiled in order to distribute a bugfix or
to add support for a new PMD is not ideal.

So ideally OVS would have the possibility to link against the shared
library long term.

> I get lost: do you mean ABI + API toward the PMDs or towards the
> applications using the librte ?

Towards the PMDs is more straightforward at first so it seems logical
to focus on that first.

A stable API and ABI for librte seem required as well in the long term, as
DPDK does offer shared libraries, but I realize that this is a stretch
goal in the initial phase.


[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Vincent JARDIN
Thomas,

First and easy answer: it is open source, so anyone can recompile. So, 
what's the issue?

> Without a concept of stable interfaces, it will be difficult to
> package and distribute RTE libraries, PMD, and DPDK applications. Right
> now, the obvious path would include packaging the PMD bits together
> with each DPDK application depending on the version of DPDK the binary
> was compiled against. This is clearly not ideal.

>
>> I agree that some areas could be improved since they are not in the
>> critical datapath of packets, but still other areas remain very CPU
>> constrained. For instance:
>> http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
>>
>> is bad:
>> struct eth_dev_ops
>> is churned with no comment, and an #ifdef that changes the structure
>> according to compilation options!
>
> This is a very good example as it outlines the difference between
> control structures and the fast path. We have this same exact trade off
> in the kernel a lot where we have highly optimized internal APIs
> towards modules and drivers but want to provide binary compatibility to
> a certain extent.

As long as we agree on this limited scope, we'll think about it and 
provide a proposal on the dev at dpdk.org mailing list.

> As for the specific example you mention, it is relatively trivial to
> make eth_dev_ops backwards compatible by appending appropriate padding
> to the struct before a new major release and ensuring that new members
> are added by replacing the padding accordingly. Obviously no ifdefs
> would be allowed anymore.

Of course, it is basic C!
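
For illustration, a minimal sketch of the padding approach described above;
the struct and member names are made up, not the real eth_dev_ops:

    /* Before a major release: reserve trailing slots so new callbacks can
     * be added in a minor release without changing the struct size or the
     * offsets of existing members. */
    struct example_dev_ops {
        int (*dev_start)(void *dev);
        int (*dev_stop)(void *dev);
        /* ... existing callbacks ... */
        void *reserved[8];           /* padding reserved up front */
    };

    /* In a later minor release, a new member consumes one padding slot: */
    struct example_dev_ops_v2 {
        int (*dev_start)(void *dev);
        int (*dev_stop)(void *dev);
        /* ... existing callbacks ... */
        int (*dev_reset)(void *dev); /* new callback */
        void *reserved[7];           /* padding shrinks accordingly */
    };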

>> Should an application use the librte libraries of the DPDK:
>>- you can use RTE_VERSION and RTE_VERSION_NUM :
>> http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9
>
> Right. This would be more or less identical to requiring a specific
> DPDK version in OVS_CHECK_DPDK. It's not ideal to require applications to
> clutter their code with #ifdefs all over for every new minor release
> though.
>
>>- you can write your own wrapper (with CPU overhead) in order to have
>> a stable ABI; that wrapper should be tied to the versions of the librte
>> => the overhead is part of your application instead of the DPDK,
>>- *otherwise recompile your software, it is open source, what's the
>> issue?*
>>
>> We are open to any suggestion to have a stable ABI, but it should never
>> remove the options to have fast/efficient/compilation/CPU execution
>> processing.
>
> Absolutely agreed. We also don't want to add tons of abstraction and
> overcomplicate everything. Still, I strongly believe that the definition
> of stable interfaces towards applications and especially PMD is
> essential.
>
> I'm not proposing to standardize all the APIs towards applications on
> the level of POSIX. DPDK is in early stages and disruptive changes will
> come along. What I would propose on an abstract level is:
>
> 1. Extend but not break API between minor releases. Postpone API
> breakages to the next major release. High cadence of major
> releases initially, lower cadence as DPDK matures.
>
> 2. Define ABI stability towards PMD for minor releases to allow
> isolated packaging of PMD by padding control structures and keeping
> functions ABI stable.

I get lost: do you mean ABI + API toward the PMDs or towards the 
applications using the librte ?

Best regards,
   Vincent



[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Pravin Shelar
On Wed, Jan 29, 2014 at 2:01 AM, Thomas Graf  wrote:
> On 01/28/2014 02:48 AM, pshelar at nicira.com wrote:
>>
>> From: Pravin B Shelar 
>>
>> Following patch adds DPDK netdev-class to userspace datapath.
>> Approach taken in this patch differs from Intel® DPDK vSwitch
>> where DPDK datapath switching is done in a separate process.  This
>> patch adds support for DPDK type port and uses OVS userspace
>> datapath for switching.  Therefore all DPDK processing and flow
>> miss handling is done in a single process.  This also avoids code
>> duplication by reusing OVS userspace datapath switching and
>> therefore it supports all flow matching and actions that
>> user-space datapath supports.  Refer to INSTALL.DPDK doc for
>> further info.
>>
>> With this patch I got similar performance for netperf TCP_STREAM
>> tests compared to kernel datapath.
>>
>> This is based on a patch from Gerald Rogers.
>>
>> Signed-off-by: Pravin B Shelar 
>> CC: "Gerald Rogers" 
>
>
> Pravin,
>
> Some initial comments below. I will provide more after deeper
> digging.
>

> Do you have any ideas on how to implement the TX batching yet?
>
We can batch packets for some interval, then do a tx-burst. But I did
not see any performance improvements, as we have other bottlenecks in the
userspace datapath. Ben's RCU patch should help there.
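
For illustration, a minimal sketch of the kind of per-queue batching being
discussed; the struct and helper names are hypothetical, and only
rte_eth_tx_burst()/rte_pktmbuf_free() are real DPDK calls:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define TX_BATCH_SIZE 32

    struct tx_batch {
        struct rte_mbuf *pkts[TX_BATCH_SIZE];
        uint16_t count;
    };

    static void
    tx_batch_flush(struct tx_batch *b, uint8_t port_id, uint16_t queue_id)
    {
        uint16_t sent = rte_eth_tx_burst(port_id, queue_id, b->pkts, b->count);

        /* Free whatever the NIC did not accept. */
        while (sent < b->count) {
            rte_pktmbuf_free(b->pkts[sent++]);
        }
        b->count = 0;
    }

    static void
    tx_batch_add(struct tx_batch *b, struct rte_mbuf *pkt,
                 uint8_t port_id, uint16_t queue_id)
    {
        b->pkts[b->count++] = pkt;
        if (b->count == TX_BATCH_SIZE) {
            tx_batch_flush(b, port_id, queue_id);
        }
    }

A periodic flush (e.g. from the polling loop) would still be needed so that
a partially filled batch does not sit in the queue indefinitely.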

>
>> +
>> +static int
>> +netdev_dpdk_rx_drain(struct netdev_rx *rx_)
>> +{
>> +struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
>> +int pending;
>> +int i;
>> +
>> +pending = rx->ofpbuf_cnt;
>> +if (pending) {
>
>
> This conditional seems unneeded.
>
Right.

>
>> +for (i = 0; i < pending; i++) {
>> + build_ofpbuf(rx, &rx->ofpbuf[i], NULL);
>> +}
>> +rx->ofpbuf_cnt = 0;
>> +return 0;
>> +}
>> +
>> +return 0;
>> +}
>> +
>> +/* Tx function. Transmit packets indefinitely */
>> +static int
>> +dpdk_do_tx_copy(struct netdev *netdev, char *buf, int size)
>> +{
>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +struct rte_mbuf *pkt;
>> +uint32_t nb_tx = 0;
>> +
>> +pkt = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>> +if (!pkt) {
>> +return 0;
>
>
> Silent drop? ;-) Shouldn't these drops be accounted for somehow?
>
ahh, I will keep it in netdev-dpdk.

>
>> +}
>> +
>> +/* We have to do a copy for now */
>> +memcpy(pkt->pkt.data, buf, size);
>> +
>> +rte_pktmbuf_data_len(pkt) = size;
>> +rte_pktmbuf_pkt_len(pkt) = size;
>> +
>> +rte_spinlock_lock(&dev->tx_lock);
>
>
> What is the purpose of tx_lock here? Multiple threads writing to
> the same Q? The lock is not acquired for the zerocopy path below.
>
There are PMD threads which have their own queues, so TX in these
threads is lockless. But vswitchd can send packets from another thread;
all other threads send packets from a single queue, which is locked.

>
>> +nb_tx = rte_eth_tx_burst(dev->port_id, NR_QUEUE, &pkt, 1);
>> +rte_spinlock_unlock(&dev->tx_lock);
>> +
>> +if (nb_tx != 1) {
>> +/* free buffers if we couldn't transmit packets */
>> +rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **) &pkt, 1);
>> +}
>> +return nb_tx;
>> +}
>> +
>> +static int
>> +netdev_dpdk_send(struct netdev *netdev,
>> + struct ofpbuf *ofpbuf, bool may_steal)
>> +{
>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +
>> +if (ofpbuf->size > dev->max_packet_len) {
>> +VLOG_ERR("2big size %d max_packet_len %d",
>> +  (int)ofpbuf->size , dev->max_packet_len);
>
>
> Should probably use VLOG_RATE_LIMIT_INIT
>
ok.
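
For reference, a minimal sketch of the rate-limited logging pattern being
suggested, assuming the usual OVS vlog setup (lib/vlog.h included and
VLOG_DEFINE_THIS_MODULE present in the file); the helper name and the
5 msgs/s, burst-of-20 values are arbitrary:

    static struct vlog_rate_limit err_rl = VLOG_RATE_LIMIT_INIT(5, 20);

    static void
    log_oversized_packet(int size, int max_packet_len)
    {
        /* Emits at most 5 messages per second (bursts of up to 20), then
         * suppresses further messages instead of flooding the log. */
        VLOG_ERR_RL(&err_rl, "%d bytes exceeds max_packet_len %d",
                    size, max_packet_len);
    }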

>
>> +return E2BIG;
>> +}
>> +
>> +rte_prefetch0(&ofpbuf->private_p);
>> +if (!may_steal ||
>> +!ofpbuf->private_p || ofpbuf->source != OFPBUF_DPDK) {
>> +dpdk_do_tx_copy(netdev, (char *) ofpbuf->data, ofpbuf->size);
>> +} else {
>> +struct rte_mbuf *pkt;
>> +uint32_t nb_tx;
>> +int qid;
>> +
>> +pkt = ofpbuf->private_p;
>> +ofpbuf->private_p = NULL;
>> +rte_pktmbuf_data_len(pkt) = ofpbuf->size;
>> +rte_pktmbuf_pkt_len(pkt) = ofpbuf->size;
>> +
>> +/* TODO: TX batching. */
>> +qid = rte_lcore_id() % NR_QUEUE;
>> +nb_tx = rte_eth_tx_burst(dev->port_id, qid, &pkt, 1);
>> +if (nb_tx != 1) {
>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +
>> +rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **) &pkt, 1);
>> +VLOG_ERR("TX error, zero packets sent");
>
>
> Same here
>
ok

>
>> +   }
>> +}
>> +return 0;
>> +}
>
>
>> +static int
>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>> +{
>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +int old_mtu, err;
>> +struct dpdk_mp *old_mp;
>> +struct dpdk_mp *mp;
>> +
>> +ovs_mutex_lock(&dpdk_mutex);
>> +ovs_mutex_lock(&dev->mutex);
>> +if (dev->mtu == mtu) {
>> +err = 0;
>> +goto out;
>> +}
>> +
>> +mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>> +   

[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Vincent JARDIN
Hi Thomas,

On 29/01/2014 09:15, Thomas Graf wrote:

 > The obvious and usual best practise would be for DPDK to guarantee
 > ABI stability between minor releases.
 >
 > Since dpdk-dev is copied as well, any comments?

DPDK's ABIs are not Kernel's ABIs, they are not POSIX, there is no 
standard. Currently, there is no such plan to have a stable ABI since we 
need to keep freedom to chase CPU cycles over having a stable ABI. For 
instance, some applications on top of the DPDK process the packets in 
less than 150 CPU cycles (have a look at testpmd:
   http://dpdk.org/browse/dpdk/tree/app/test-pmd )

I agree that some areas could be improved since they are not in the 
critical datapath of packets, but still other areas remain very CPU 
constrained. For instance:
http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
is bad:
struct eth_dev_ops
is churned with no comment, and an #ifdef that changes the structure 
according to compilation options!

Should an application use the librte libraries of the DPDK:
   - you can use RTE_VERSION and RTE_VERSION_NUM (see the sketch after 
this list):
http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9
   - you can write your own wrapper (with CPU overhead) in order to have 
a stable ABI; that wrapper should be tied to the versions of the librte 
=> the overhead is part of your application instead of the DPDK,
   - *otherwise recompile your software, it is open source, what's the 
issue?*
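
Sketch for the RTE_VERSION point above (the version numbers are purely
illustrative):

    #include <rte_version.h>

    #if RTE_VERSION >= RTE_VERSION_NUM(1, 5, 2, 0)
    /* build against the newer variant of an interface */
    #else
    /* fall back to the older variant */
    #endif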

We are open to any suggestion to have a stable ABI, but it should never 
remove the options to have fast/efficient compilation/CPU execution 
processing.

Best regards,
   Vincent


[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Thomas Graf
On 01/28/2014 02:48 AM, pshelar at nicira.com wrote:
> From: Pravin B Shelar 
>
> Following patch adds DPDK netdev-class to userspace datapath.
> Approach taken in this patch differs from Intel® DPDK vSwitch
> where DPDK datapath switching is done in a separate process.  This
> patch adds support for DPDK type port and uses OVS userspace
> datapath for switching.  Therefore all DPDK processing and flow
> miss handling is done in a single process.  This also avoids code
> duplication by reusing OVS userspace datapath switching and
> therefore it supports all flow matching and actions that
> user-space datapath supports.  Refer to INSTALL.DPDK doc for
> further info.
>
> With this patch I got similar performance for netperf TCP_STREAM
> tests compared to kernel datapath.
>
> This is based on a patch from Gerald Rogers.
>
> Signed-off-by: Pravin B Shelar 
> CC: "Gerald Rogers" 

Pravin,

Some initial comments below. I will provide more after deeper
digging.

Do you have any ideas on how to implement the TX batching yet?

> +
> +static int
> +netdev_dpdk_rx_drain(struct netdev_rx *rx_)
> +{
> +struct netdev_rx_dpdk *rx = netdev_rx_dpdk_cast(rx_);
> +int pending;
> +int i;
> +
> +pending = rx->ofpbuf_cnt;
> +if (pending) {

This conditional seems unneeded.

> +for (i = 0; i < pending; i++) {
> + build_ofpbuf(rx, &rx->ofpbuf[i], NULL);
> +}
> +rx->ofpbuf_cnt = 0;
> +return 0;
> +}
> +
> +return 0;
> +}
> +
> +/* Tx function. Transmit packets indefinitely */
> +static int
> +dpdk_do_tx_copy(struct netdev *netdev, char *buf, int size)
> +{
> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +struct rte_mbuf *pkt;
> +uint32_t nb_tx = 0;
> +
> +pkt = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> +if (!pkt) {
> +return 0;

Silent drop? ;-) Shouldn't these drops be accounted for somehow?

> +}
> +
> +/* We have to do a copy for now */
> +memcpy(pkt->pkt.data, buf, size);
> +
> +rte_pktmbuf_data_len(pkt) = size;
> +rte_pktmbuf_pkt_len(pkt) = size;
> +
> +rte_spinlock_lock(&dev->tx_lock);

What is the purpose of tx_lock here? Multiple threads writing to
the same Q? The lock is not acquired for the zerocopy path below.

> +nb_tx = rte_eth_tx_burst(dev->port_id, NR_QUEUE, &pkt, 1);
> +rte_spinlock_unlock(&dev->tx_lock);
> +
> +if (nb_tx != 1) {
> +/* free buffers if we couldn't transmit packets */
> +rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **) &pkt, 1);
> +}
> +return nb_tx;
> +}
> +
> +static int
> +netdev_dpdk_send(struct netdev *netdev,
> + struct ofpbuf *ofpbuf, bool may_steal)
> +{
> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +
> +if (ofpbuf->size > dev->max_packet_len) {
> +VLOG_ERR("2big size %d max_packet_len %d",
> +  (int)ofpbuf->size , dev->max_packet_len);

Should probably use VLOG_RATE_LIMIT_INIT

> +return E2BIG;
> +}
> +
> +rte_prefetch0(&ofpbuf->private_p);
> +if (!may_steal ||
> +!ofpbuf->private_p || ofpbuf->source != OFPBUF_DPDK) {
> +dpdk_do_tx_copy(netdev, (char *) ofpbuf->data, ofpbuf->size);
> +} else {
> +struct rte_mbuf *pkt;
> +uint32_t nb_tx;
> +int qid;
> +
> +pkt = ofpbuf->private_p;
> +ofpbuf->private_p = NULL;
> +rte_pktmbuf_data_len(pkt) = ofpbuf->size;
> +rte_pktmbuf_pkt_len(pkt) = ofpbuf->size;
> +
> +/* TODO: TX batching. */
> +qid = rte_lcore_id() % NR_QUEUE;
> +nb_tx = rte_eth_tx_burst(dev->port_id, qid, &pkt, 1);
> +if (nb_tx != 1) {
> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +
> +rte_mempool_put_bulk(dev->dpdk_mp->mp, (void **) &pkt, 1);
> +VLOG_ERR("TX error, zero packets sent");

Same here

> +   }
> +}
> +return 0;
> +}

> +static int
> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
> +{
> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +int old_mtu, err;
> +struct dpdk_mp *old_mp;
> +struct dpdk_mp *mp;
> +
> +ovs_mutex_lock(&dpdk_mutex);
> +ovs_mutex_lock(&dev->mutex);
> +if (dev->mtu == mtu) {
> +err = 0;
> +goto out;
> +}
> +
> +mp = dpdk_mp_get(dev->socket_id, dev->mtu);
> +if (!mp) {
> +err = ENOMEM;
> +goto out;
> +}
> +
> +rte_eth_dev_stop(dev->port_id);
> +
> +old_mtu = dev->mtu;
> +old_mp = dev->dpdk_mp;
> +dev->dpdk_mp = mp;
> +dev->mtu = mtu;
> +dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
> +
> +err = dpdk_eth_dev_init(dev);
> +if (err) {
> +
> +dpdk_mp_put(mp);
> +dev->mtu = old_mtu;
> +dev->dpdk_mp = old_mp;
> +dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
> +dpdk_eth_dev_init(dev);

It would be nice if we didn't need these constructs and DPDK
provided an all-or-nothing init method.

> +goto out;
> +}
> +
> +dpdk_mp_put(old_mp);
> 

[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Stephen Hemminger
On Wed, 29 Jan 2014 18:14:01 +0100
Thomas Graf  wrote:

> On 01/29/2014 05:34 PM, Vincent JARDIN wrote:
> > Thomas,
> >
> > First and easy answer: it is open source, so anyone can recompile. So,
> > what's the issue?
> 
> I'm talking from a pure distribution perspective here: Requiring all
> DPDK-based applications to be recompiled in order to distribute a bugfix or
> to add support for a new PMD is not ideal.
> 
> So ideally OVS would have the possibility to link against the shared
> library long term.
> 
> > I get lost: do you mean ABI + API toward the PMDs or towards the
> > applications using the librte ?
> 
> Towards the PMDs is more straightforward at first so it seems logical
> to focus on that first.
> 
> A stable API and ABI for librte seem required as well in the long term, as
> DPDK does offer shared libraries, but I realize that this is a stretch
> goal in the initial phase.

I would hate to see the API/ABI nailed down. We have lots of bug fixes
and new drivers that are ready to contribute, but most of them have some
change to the existing ABI.


[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-29 Thread Thomas Graf
On 01/28/2014 07:17 PM, Pravin Shelar wrote:
> Right, version mismatch will not work. APIs provided by DPDK are not
> stable, so OVS has to be built for different releases for now.
>
> I do not see how we can fix it from the OVS side. DPDK needs to
> standardize its API. Actually OVS also needs more APIs, like DPDK
> initialization, mempool destroy, etc.

Agreed. It's not fixable from the OVS side. I also don't want to
object to including this. I'm just raising awareness of the issue
as this will become essential for distribution.

The obvious and usual best practise would be for DPDK to guarantee
ABI stability between minor releases.

Since dpdk-dev is copied as well, any comments?


[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-28 Thread Pravin Shelar
On Tue, Jan 28, 2014 at 7:48 AM, Thomas Graf  wrote:
> On 01/28/2014 02:48 AM, pshelar at nicira.com wrote:
>>
>> From: Pravin B Shelar 
>>
>> Following patch adds DPDK netdev-class to userspace datapath.
>> Approach taken in this patch differs from Intel® DPDK vSwitch
>> where DPDK datapath switching is done in a separate process.  This
>> patch adds support for DPDK type port and uses OVS userspace
>> datapath for switching.  Therefore all DPDK processing and flow
>> miss handling is done in a single process.  This also avoids code
>> duplication by reusing OVS userspace datapath switching and
>> therefore it supports all flow matching and actions that
>> user-space datapath supports.  Refer to INSTALL.DPDK doc for
>> further info.
>>
>> With this patch I got similar performance for netperf TCP_STREAM
>> tests compared to kernel datapath.
>
>
> I'm happy to see this happen!
>
>
>
>> +static const struct rte_eth_conf port_conf = {
>> +.rxmode = {
>> +.mq_mode = ETH_MQ_RX_RSS,
>> +.split_hdr_size = 0,
>> +.header_split   = 0, /* Header Split disabled */
>> +.hw_ip_checksum = 0, /* IP checksum offload disabled */
>> +.hw_vlan_filter = 0, /* VLAN filtering disabled */
>> +.jumbo_frame= 0, /* Jumbo Frame Support disabled */
>> +.hw_strip_crc   = 0, /* CRC stripping by hardware disabled */
>> +},
>> +.rx_adv_conf = {
>> +.rss_conf = {
>> +.rss_key = NULL,
>> +.rss_hf = ETH_RSS_IPV4_TCP | ETH_RSS_IPV4 | ETH_RSS_IPV6,
>> +},
>> +},
>> +.txmode = {
>> +.mq_mode = ETH_MQ_TX_NONE,
>> +},
>> +};
>
>
> I realize this is an RFC patch but I will ask anyway:
>
> What are the plans on managing runtime dependencies of a DPDK-enabled OVS
> and DPDK itself? Will an OVS built against DPDK 1.5.2 work with
> drivers written for 1.5.3?
>
> Based on the above use of struct rte_eth_conf it would seem that once
> released, rte_eth_conf cannot be extended anymore without breaking
> ABI compatibility. The same applies to many of the other user
> structures. I see various structure changes between minor releases,
> for example dpdk.org ed2c69c3ef7 between 1.5.1 and 1.5.2.
>

Right, version mismatch will not work. APIs provided by DPDK are not
stable, so OVS has to be built for different releases for now.

I do not see how we can fix it from the OVS side. DPDK needs to
standardize its API. Actually OVS also needs more APIs, like DPDK
initialization, mempool destroy, etc.


[dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.

2014-01-27 Thread Pravin Shelar
On Mon, Jan 27, 2014 at 8:49 PM, Ben Pfaff  wrote:
> On Mon, Jan 27, 2014 at 05:48:35PM -0800, pshelar at nicira.com wrote:
>> From: Pravin B Shelar 
>>
>> Following patch adds DPDK netdev-class to userspace datapath.
>> Approach taken in this patch differs from Intel® DPDK vSwitch
>> where DPDK datapath switching is done in a separate process.  This
>> patch adds support for DPDK type port and uses OVS userspace
>> datapath for switching.  Therefore all DPDK processing and flow
>> miss handling is done in a single process.  This also avoids code
>> duplication by reusing OVS userspace datapath switching and
>> therefore it supports all flow matching and actions that
>> user-space datapath supports.  Refer to INSTALL.DPDK doc for
>> further info.
>>
>> With this patch I got similar performance for netperf TCP_STREAM
>> tests compared to kernel datapath.
>>
>> This is based on a patch from Gerald Rogers.
>>
>> Signed-off-by: Pravin B Shelar 
>> CC: "Gerald Rogers" 
>
> I haven't looked at the patch yet (it does sound awesome) but if it's
> based on Gerald's code then I'd expect to get his Signed-off-by too.

Right.

Gerald's patch did not have a Signed-off-by. So if he sends me a
Signed-off-by now, I will update the commit msg.