[dpdk-dev] [PATCH 4/4] net/bonding: fix configuration of LACP slaves

2016-08-01 Thread Robert Sanford
Problem: When adding a slave or starting a bond device, the bond
device configures slave devices via function slave_configure().
However, settings configured in the bond device's rte_eth_conf are
not propagated to the slaves. For example, VLAN and CRC stripping
are not working as expected.

The problem is that we pass the wrong argument when we invoke
rte_eth_dev_configure(). We pass the slave's currently configured
rte_eth_conf (as a source arg!), when we should pass a copy of the
(master) bond device's rte_eth_conf.

Solution: Make a local copy of the bond device's rte_eth_conf, adjust
the LSC flag based on the slave, and then pass that rte_eth_conf to
rte_eth_dev_configure().

Also, remove code that directly pokes RSS data into the slave's
rte_eth_conf, as that is also contained in the proper rte_eth_conf
that we will pass to rte_eth_dev_configure().

Signed-off-by: Robert Sanford 
---
 drivers/net/bonding/rte_eth_bond_pmd.c |   28 +++-
 1 files changed, 7 insertions(+), 21 deletions(-)

diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index b20a272..486582f 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -1302,6 +1302,7 @@ int
 slave_configure(struct rte_eth_dev *bonded_eth_dev,
struct rte_eth_dev *slave_eth_dev)
 {
+   struct rte_eth_conf slave_eth_conf;
struct bond_rx_queue *bd_rx_q;
struct bond_tx_queue *bd_tx_q;

@@ -1313,33 +1314,18 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
/* Stop slave */
rte_eth_dev_stop(slave_eth_dev->data->port_id);

-   /* Enable interrupts on slave device if supported */
-   if (slave_eth_dev->data->dev_flags & RTE_ETH_DEV_INTR_LSC)
-   slave_eth_dev->data->dev_conf.intr_conf.lsc = 1;
-
-   /* If RSS is enabled for bonding, try to enable it for slaves  */
-   if (bonded_eth_dev->data->dev_conf.rxmode.mq_mode & ETH_MQ_RX_RSS_FLAG) {
-   if (bonded_eth_dev->data->dev_conf.rx_adv_conf.rss_conf.rss_key_len
-   != 0) {
-   slave_eth_dev->data->dev_conf.rx_adv_conf.rss_conf.rss_key_len =
-   bonded_eth_dev->data->dev_conf.rx_adv_conf.rss_conf.rss_key_len;
-   slave_eth_dev->data->dev_conf.rx_adv_conf.rss_conf.rss_key =
-   bonded_eth_dev->data->dev_conf.rx_adv_conf.rss_conf.rss_key;
-   } else {
-   slave_eth_dev->data->dev_conf.rx_adv_conf.rss_conf.rss_key = NULL;
-   }
+   /* Build slave rte_eth_conf, starting from bonded's conf */
+   slave_eth_conf = bonded_eth_dev->data->dev_conf;

-   slave_eth_dev->data->dev_conf.rx_adv_conf.rss_conf.rss_hf =
-   bonded_eth_dev->data->dev_conf.rx_adv_conf.rss_conf.rss_hf;
-   slave_eth_dev->data->dev_conf.rxmode.mq_mode =
-   bonded_eth_dev->data->dev_conf.rxmode.mq_mode;
-   }
+   /* Enable interrupts on slave device if supported */
+   slave_eth_conf.intr_conf.lsc =
+   !!(slave_eth_dev->data->dev_flags & RTE_ETH_DEV_INTR_LSC);

/* Configure device */
errval = rte_eth_dev_configure(slave_eth_dev->data->port_id,
bonded_eth_dev->data->nb_rx_queues,
bonded_eth_dev->data->nb_tx_queues,
-   &(slave_eth_dev->data->dev_conf));
+   &slave_eth_conf);
if (errval != 0) {
RTE_BOND_LOG(ERR, "Cannot configure slave device: port %u , err (%d)",
slave_eth_dev->data->port_id, errval);
-- 
1.7.1



[dpdk-dev] [PATCH 3/4] net/bonding: another fix to LACP mempool size

2016-08-01 Thread Robert Sanford
The following log message may appear after a slave is idle (or nearly
idle) for a few minutes: "PMD: Failed to allocate LACP packet from
pool".

Problem: All mbufs from a slave's private pool (used exclusively for
transmitting LACPDUs) have been allocated and are still sitting in
the device's tx descriptor ring and other cores' mempool caches.

Solution: Ensure that each slave's tx (LACPDU) mempool owns more than
n-tx-queues * (n-tx-descriptors + per-core-mempool-flush-threshold)
mbufs.

Note that the LACP tx machine function is the only code that allocates
from a slave's private pool. It runs in the context of the interrupt
thread, and thus it has no mempool cache of its own.
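
Rough illustration of the bound (a sketch with hypothetical numbers,
not values taken from this patch): one tx queue of 512 descriptors
and a cache size of 32 gives a flush threshold of 32 * 1.5 = 48, so
the pool must own more than 1 * (512 + 48) mbufs plus the LACPDU
budget:

#include <stdio.h>

/* Stand-in for BOND_MODE_8023AX_SLAVE_TX_PKTS; value is illustrative. */
#define SLAVE_TX_PKTS 8U

int main(void)
{
    unsigned nb_tx_queues = 1;
    unsigned nb_tx_desc   = 512;
    unsigned cache_size   = 32;                 /* per-core mempool cache */
    unsigned flush_thresh = cache_size * 3 / 2; /* 1.5 multiplier -> 48 */

    /* Worst case: every tx ring is full and every core's cache sits at
     * its flush threshold; the pool must own more mbufs than that. */
    unsigned min_pool = SLAVE_TX_PKTS
            + nb_tx_queues * (nb_tx_desc + flush_thresh);

    printf("pool must own > %u mbufs\n", min_pool); /* 568 */
    return 0;
}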

Signed-off-by: Robert Sanford 
---
 drivers/net/bonding/rte_eth_bond_8023ad.c |   10 +++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
index 2f7ae70..1207896 100644
--- a/drivers/net/bonding/rte_eth_bond_8023ad.c
+++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
@@ -854,6 +854,8 @@ bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev, uint8_t slave_id)
char mem_name[RTE_ETH_NAME_MAX_LEN];
int socket_id;
unsigned element_size;
+   unsigned cache_size;
+   unsigned cache_flushthresh;
uint32_t total_tx_desc;
struct bond_tx_queue *bd_tx_q;
uint16_t q_id;
@@ -890,19 +892,21 @@ bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev, uint8_t slave_id)

element_size = sizeof(struct slow_protocol_frame) + sizeof(struct rte_mbuf)
+ RTE_PKTMBUF_HEADROOM;
+   cache_size = RTE_MEMPOOL_CACHE_MAX_SIZE >= 32 ?
+   32 : RTE_MEMPOOL_CACHE_MAX_SIZE;
+   cache_flushthresh = RTE_MEMPOOL_CALC_CACHE_FLUSHTHRESH(cache_size);

/* The size of the mempool should be at least:
 * the sum of the TX descriptors + BOND_MODE_8023AX_SLAVE_TX_PKTS */
total_tx_desc = BOND_MODE_8023AX_SLAVE_TX_PKTS;
for (q_id = 0; q_id < bond_dev->data->nb_tx_queues; q_id++) {
bd_tx_q = (struct bond_tx_queue *)bond_dev->data->tx_queues[q_id];
-   total_tx_desc += bd_tx_q->nb_tx_desc;
+   total_tx_desc += bd_tx_q->nb_tx_desc + cache_flushthresh;
}

snprintf(mem_name, RTE_DIM(mem_name), "slave_port%u_pool", slave_id);
port->mbuf_pool = rte_mempool_create(mem_name,
-   total_tx_desc, element_size,
-   RTE_MEMPOOL_CACHE_MAX_SIZE >= 32 ? 32 : RTE_MEMPOOL_CACHE_MAX_SIZE,
+   total_tx_desc, element_size, cache_size,
sizeof(struct rte_pktmbuf_pool_private), rte_pktmbuf_pool_init,
NULL, rte_pktmbuf_init, NULL, socket_id, MEMPOOL_F_NO_SPREAD);

-- 
1.7.1



[dpdk-dev] [PATCH 2/4] mempool: make cache flush threshold macro public

2016-08-01 Thread Robert Sanford
Rename macros that calculate a mempool cache flush threshold, and
move them from rte_mempool.c to rte_mempool.h, so that the bonding
driver can accurately calculate its mbuf requirements.
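
For example (a self-contained sketch that copies the macros this
patch makes public): a cache size of 32 yields a flush threshold of
48, the most objects a per-core cache can hold before flushing:

#include <stdio.h>

#define RTE_MEMPOOL_CACHE_FLUSHTHRESH_MULTIPLIER 1.5
#define RTE_MEMPOOL_CALC_CACHE_FLUSHTHRESH(c)  \
    ((typeof(c))((c) * RTE_MEMPOOL_CACHE_FLUSHTHRESH_MULTIPLIER))

int main(void)
{
    unsigned cache_size = 32;

    /* 32 * 1.5 = 48 */
    printf("flushthresh = %u\n",
           RTE_MEMPOOL_CALC_CACHE_FLUSHTHRESH(cache_size));
    return 0;
}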

Signed-off-by: Robert Sanford 
---
 lib/librte_mempool/rte_mempool.c |8 ++--
 lib/librte_mempool/rte_mempool.h |7 +++
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index 2e28e2e..cca4843 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -69,10 +69,6 @@ static struct rte_tailq_elem rte_mempool_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_mempool_tailq)

-#define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
-#define CALC_CACHE_FLUSHTHRESH(c)  \
-   ((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
-
 /*
  * return the greatest common divisor between a and b (fast algorithm)
  *
@@ -686,7 +682,7 @@ static void
 mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
 {
cache->size = size;
-   cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
+   cache->flushthresh = RTE_MEMPOOL_CALC_CACHE_FLUSHTHRESH(size);
cache->len = 0;
 }

@@ -762,7 +758,7 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,

/* asked cache too big */
if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
-   CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
+   RTE_MEMPOOL_CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
rte_errno = EINVAL;
return NULL;
}
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 059ad9e..4323c1b 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -263,6 +263,13 @@ struct rte_mempool {
 #define MEMPOOL_F_NO_PHYS_CONTIG 0x0020 /**< Don't need physically contiguous objs. */

 /**
+ * Calculate the threshold before we flush excess elements.
+ */
+#define RTE_MEMPOOL_CACHE_FLUSHTHRESH_MULTIPLIER 1.5
+#define RTE_MEMPOOL_CALC_CACHE_FLUSHTHRESH(c)  \
+   ((typeof(c))((c) * RTE_MEMPOOL_CACHE_FLUSHTHRESH_MULTIPLIER))
+
+/**
  * @internal When debug is enabled, store some statistics.
  *
  * @param mp
-- 
1.7.1



[dpdk-dev] [PATCH 1/4] testpmd: fix LACP ports to work with idle links

2016-08-01 Thread Robert Sanford
Problem: When there is little or no TX traffic on an LACP port
(bonding mode 4), we don't call its tx burst function in a timely
manner, and thus we don't transmit LACPDUs when we should.

Solution: Add and maintain an "lacp_master" flag in the rte_port
struct. In the main packet forwarding loop, if a port is an LACP
master, force an empty call to the tx burst API in one out of every
N iterations.

Signed-off-by: Robert Sanford 
---
 app/test-pmd/cmdline.c |9 +
 app/test-pmd/testpmd.c |   37 +
 app/test-pmd/testpmd.h |4 
 3 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index f90befc..2a629ee 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -3975,6 +3975,10 @@ static void cmd_set_bonding_mode_parsed(void *parsed_result,
/* Set the bonding mode for the relevant port. */
if (0 != rte_eth_bond_mode_set(port_id, res->value))
printf("\t Failed to set bonding mode for port = %d.\n", 
port_id);
+   else if (res->value == BONDING_MODE_8023AD)
+   set_port_lacp_master(port_id);
+   else
+   clear_port_lacp_master(port_id);
 }

 cmdline_parse_token_string_t cmd_setbonding_mode_set =
@@ -4408,6 +4412,11 @@ static void cmd_create_bonded_device_parsed(void *parsed_result,
reconfig(port_id, res->socket);
rte_eth_promiscuous_enable(port_id);
ports[port_id].enabled = 1;
+
+   if (res->mode == BONDING_MODE_8023AD)
+   set_port_lacp_master(port_id);
+   else
+   clear_port_lacp_master(port_id);
}

 }
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 1428974..806667e 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -920,12 +920,28 @@ run_pkt_fwd_on_lcore(struct fwd_lcore *fc, packet_fwd_t pkt_fwd)
struct fwd_stream **fsm;
streamid_t nb_fs;
streamid_t sm_id;
+   unsigned int loop_count = 0;

fsm = &fwd_streams[fc->stream_idx];
nb_fs = fc->stream_nb;
do {
for (sm_id = 0; sm_id < nb_fs; sm_id++)
(*pkt_fwd)(fsm[sm_id]);
+
+   /*
+    * Per rte_eth_bond.h, we must invoke LACP master's tx
+    * burst function at least once every 100 ms.
+    */
+   loop_count++;
+   if (likely(loop_count % 1024 != 0))
+   continue;
+   for (sm_id = 0; sm_id < nb_fs; sm_id++) {
+   struct fwd_stream *fs = fsm[sm_id];
+
+   if (port_is_lacp_master(fs->tx_port))
+   rte_eth_tx_burst(fs->tx_port, fs->tx_queue,
+   NULL, 0);
+   }
} while (! fc->stopped);
 }

@@ -1881,6 +1897,27 @@ uint8_t port_is_bonding_slave(portid_t slave_pid)
return port->slave_flag;
 }

+void set_port_lacp_master(portid_t pid)
+{
+   struct rte_port *port = &ports[pid];
+
+   port->lacp_master = 1;
+}
+
+void clear_port_lacp_master(portid_t pid)
+{
+   struct rte_port *port = &ports[pid];
+
+   port->lacp_master = 0;
+}
+
+uint8_t port_is_lacp_master(portid_t pid)
+{
+   struct rte_port *port = &ports[pid];
+
+   return port->lacp_master;
+}
+
 const uint16_t vlan_tags[] = {
0,  1,  2,  3,  4,  5,  6,  7,
8,  9, 10, 11,  12, 13, 14, 15,
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 2b281cc..0898194 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -170,6 +170,7 @@ struct rte_port {
struct ether_addr   *mc_addr_pool; /**< pool of multicast addrs */
uint32_tmc_addr_nb; /**< nb. of addr. in mc_addr_pool */
uint8_t slave_flag; /**< bonding slave port */
+   uint8_t lacp_master;/**< bonding LACP master */
 };

 extern portid_t __rte_unused
@@ -542,6 +543,9 @@ void init_port_config(void);
 void set_port_slave_flag(portid_t slave_pid);
 void clear_port_slave_flag(portid_t slave_pid);
 uint8_t port_is_bonding_slave(portid_t slave_pid);
+void set_port_lacp_master(portid_t pid);
+void clear_port_lacp_master(portid_t pid);
+uint8_t port_is_lacp_master(portid_t pid);

 int init_port_dcb_config(portid_t pid, enum dcb_mode_enable dcb_mode,
 enum rte_eth_nb_tcs num_tcs,
-- 
1.7.1



[dpdk-dev] [PATCH 0/4] net/bonding: bonding and LACP fixes

2016-08-01 Thread Robert Sanford
In this patch series, we fix two bonding driver bugs and
enhance testpmd so that bonding mode 4 (LACP) ports remain
operational even when idle.

1. Problem: testpmd does not call bonding mode 4 (LACP) ports' tx
   burst function at least every 100 ms, as mandated.
   Solution: Enhance testpmd's packet forwarding loop to infrequently
   invoke the tx burst API for bonding ports in mode 4, to transmit
   LACPDUs to the partner in a timely manner.
2. Problem: Bonding driver (item 3 below) needs to know how many
   objects may become cached in a memory pool. Solution: Rename
   macros that calculate a mempool cache flush threshold, and move
   them from rte_mempool.c to rte_mempool.h.
3. Problem: With little or no tx traffic, LACP tx machine may run out
   of mbufs. Solution: When calculating the minimum number of mbufs
   required in a bonding mode 4 slave's private (tx LACPDU) pool,
   include the maximum number of mbufs that may be cached in the
   pool's per-core caches.
4. Problem: When configuring a bonding device, we don't properly
   propagate most of the settings from the master to the slaves.
   Solution: Fix slave_configure() to correctly pass configuration
   data to rte_eth_dev_configure() on behalf of the slaves.

Notes for configuring and running testpmd: We specify four ethernet
devices in the arguments, because testpmd expects an even number.
We configure two devices to be slaves under one bonded device, one
device to be the other side of the forwarding bridge, and we ignore
the fourth eth dev.

 +-------------+    +-------+     +--------+
 |client A     |<==>|DPDK   |     |        |
 |bonded device|    |testpmd|<===>|client B|
 |with 2 slaves|<==>|       |     |        |
 +-------------+    +-------+     +--------+

To reproduce the out of buffers problem (#3), apply patch 1/4, run
testpmd (with example args and commands shown below), and run ping
from client A, like this: "ping -i18 -c10 clientB". After about five
minutes, one of the slaves will run out of LACPDU mbufs.

Example testpmd args:

  ./testpmd -c 0x0555 -n 2 \
--log-level 7 \
--pci-whitelist "01:00.0" \
--pci-whitelist "01:00.1" \
--pci-whitelist "05:00.0" \
--pci-whitelist "84:00.0" \
--master-lcore 0 -- \
--interactive --portmask=0xf --numa --socket-num=0 --auto-start \
--coremask=0x0554 --rxd=512 --txd=256 \
--burst=32 --mbcache=64 \
--nb-cores=2 --rxq=1 --txq=1

Example testpmd commands to reconfigure into bonding mode 4:

  stop
  port stop all
  create bonded device 4 0
  add bonding slave 2 4
  add bonding slave 3 4
  port start 0
  port start 1
  port start 4
  set portlist 4,0
  start

Robert Sanford (4):
  testpmd: fix LACP ports to work with idle links
  mempool: make cache flush threshold macro public
  net/bonding: another fix to LACP mempool size
  net/bonding: fix configuration of LACP slaves

 app/test-pmd/cmdline.c|9 +++
 app/test-pmd/testpmd.c|   37 +
 app/test-pmd/testpmd.h|4 +++
 drivers/net/bonding/rte_eth_bond_8023ad.c |   10 +--
 drivers/net/bonding/rte_eth_bond_pmd.c|   28 +
 lib/librte_mempool/rte_mempool.c  |8 +
 lib/librte_mempool/rte_mempool.h  |7 +
 7 files changed, 73 insertions(+), 30 deletions(-)



[dpdk-dev] [PATCH 4/4] port: fix ethdev writer burst too big

2016-03-28 Thread Robert Sanford
For f_tx_bulk functions in rte_port_ethdev.c, we may unintentionally
send bursts larger than tx_burst_sz to the underlying ethdev.
Some PMDs (e.g., ixgbe) may truncate this request to their maximum
burst size, resulting in unnecessary enqueuing failures or ethdev
writer retries.

We propose to fix this by moving the tx buffer flushing logic from
*after* the loop that puts all packets into the tx buffer, to *inside*
the loop, testing for a full burst when adding each packet.

Signed-off-by: Robert Sanford 
---
 lib/librte_port/rte_port_ethdev.c |   20 ++--
 1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/lib/librte_port/rte_port_ethdev.c b/lib/librte_port/rte_port_ethdev.c
index 3fb4947..1283338 100644
--- a/lib/librte_port/rte_port_ethdev.c
+++ b/lib/librte_port/rte_port_ethdev.c
@@ -151,7 +151,7 @@ static int rte_port_ethdev_reader_stats_read(void *port,
 struct rte_port_ethdev_writer {
struct rte_port_out_stats stats;

-   struct rte_mbuf *tx_buf[2 * RTE_PORT_IN_BURST_SIZE_MAX];
+   struct rte_mbuf *tx_buf[RTE_PORT_IN_BURST_SIZE_MAX];
uint32_t tx_burst_sz;
uint16_t tx_buf_count;
uint64_t bsz_mask;
@@ -257,11 +257,11 @@ rte_port_ethdev_writer_tx_bulk(void *port,
p->tx_buf[tx_buf_count++] = pkt;
RTE_PORT_ETHDEV_WRITER_STATS_PKTS_IN_ADD(p, 1);
pkts_mask &= ~pkt_mask;
-   }

-   p->tx_buf_count = tx_buf_count;
-   if (tx_buf_count >= p->tx_burst_sz)
-   send_burst(p);
+   p->tx_buf_count = tx_buf_count;
+   if (tx_buf_count >= p->tx_burst_sz)
+   send_burst(p);
+   }
}

return 0;
@@ -328,7 +328,7 @@ static int rte_port_ethdev_writer_stats_read(void *port,
 struct rte_port_ethdev_writer_nodrop {
struct rte_port_out_stats stats;

-   struct rte_mbuf *tx_buf[2 * RTE_PORT_IN_BURST_SIZE_MAX];
+   struct rte_mbuf *tx_buf[RTE_PORT_IN_BURST_SIZE_MAX];
uint32_t tx_burst_sz;
uint16_t tx_buf_count;
uint64_t bsz_mask;
@@ -466,11 +466,11 @@ rte_port_ethdev_writer_nodrop_tx_bulk(void *port,
p->tx_buf[tx_buf_count++] = pkt;
RTE_PORT_ETHDEV_WRITER_NODROP_STATS_PKTS_IN_ADD(p, 1);
pkts_mask &= ~pkt_mask;
-   }

-   p->tx_buf_count = tx_buf_count;
-   if (tx_buf_count >= p->tx_burst_sz)
-   send_burst_nodrop(p);
+   p->tx_buf_count = tx_buf_count;
+   if (tx_buf_count >= p->tx_burst_sz)
+   send_burst_nodrop(p);
+   }
}

return 0;
-- 
1.7.1



[dpdk-dev] [PATCH 3/4] port: fix full burst checks in f_tx_bulk ops

2016-03-28 Thread Robert Sanford
For several f_tx_bulk functions in rte_port_{ethdev,ring,sched}.c,
it appears that the intent of the bsz_mask logic is to test whether
pkts_mask contains a full burst (i.e., whether the tx_burst_sz least
significant bits are set).

There are two problems with the bsz_mask code: 1) It truncates the
mask by declaring the local variable bsz_mask as uint32_t, and
2) We may pass oversized bursts to the underlying ethdev/ring/sched,
e.g., tx_burst_sz=16, bsz_mask=0x8000, and pkts_mask=0x1ffff
(17 packets) results in expr==0, and we send a burst larger than
desired (and non-power-of-2) to the underlying tx burst interface.

We propose to effectively set bsz_mask = (1 << tx_burst_sz) - 1
(while avoiding truncation for tx_burst_sz=64), to cache the mask
value of a full burst, and then do a simple compare with pkts_mask
in each f_tx_bulk.
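
To make the arithmetic concrete (a standalone sketch, not code from
the patch itself):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t tx_burst_sz = 16;

    /* Low tx_burst_sz bits set. UINT64_MAX >> (64 - sz) stays
     * defined at sz == 64, where 1ULL << 64 would not be. */
    uint64_t bsz_mask = UINT64_MAX >> (64 - tx_burst_sz);

    printf("bsz_mask = %#" PRIx64 "\n", bsz_mask);           /* 0xffff */
    printf("16 pkts:  %d\n", UINT64_C(0xffff)  == bsz_mask); /* 1 */
    printf("17 pkts:  %d\n", UINT64_C(0x1ffff) == bsz_mask); /* 0 */
    return 0;
}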

Signed-off-by: Robert Sanford 
---
 lib/librte_port/rte_port_ethdev.c |   15 ---
 lib/librte_port/rte_port_ring.c   |   16 
 lib/librte_port/rte_port_sched.c  |7 ++-
 3 files changed, 10 insertions(+), 28 deletions(-)

diff --git a/lib/librte_port/rte_port_ethdev.c b/lib/librte_port/rte_port_ethdev.c
index 1c34602..3fb4947 100644
--- a/lib/librte_port/rte_port_ethdev.c
+++ b/lib/librte_port/rte_port_ethdev.c
@@ -188,7 +188,7 @@ rte_port_ethdev_writer_create(void *params, int socket_id)
port->queue_id = conf->queue_id;
port->tx_burst_sz = conf->tx_burst_sz;
port->tx_buf_count = 0;
-   port->bsz_mask = 1LLU << (conf->tx_burst_sz - 1);
+   port->bsz_mask = UINT64_MAX >> (64 - conf->tx_burst_sz);

return port;
 }
@@ -229,12 +229,9 @@ rte_port_ethdev_writer_tx_bulk(void *port,
 {
struct rte_port_ethdev_writer *p =
(struct rte_port_ethdev_writer *) port;
-   uint32_t bsz_mask = p->bsz_mask;
uint32_t tx_buf_count = p->tx_buf_count;
-   uint64_t expr = (pkts_mask & (pkts_mask + 1)) |
-   ((pkts_mask & bsz_mask) ^ bsz_mask);

-   if (expr == 0) {
+   if (pkts_mask == p->bsz_mask) {
uint64_t n_pkts = __builtin_popcountll(pkts_mask);
uint32_t n_pkts_ok;

@@ -369,7 +366,7 @@ rte_port_ethdev_writer_nodrop_create(void *params, int socket_id)
port->queue_id = conf->queue_id;
port->tx_burst_sz = conf->tx_burst_sz;
port->tx_buf_count = 0;
-   port->bsz_mask = 1LLU << (conf->tx_burst_sz - 1);
+   port->bsz_mask = UINT64_MAX >> (64 - conf->tx_burst_sz);

/*
 * When n_retries is 0 it means that we should wait for every packet to
@@ -435,13 +432,9 @@ rte_port_ethdev_writer_nodrop_tx_bulk(void *port,
 {
struct rte_port_ethdev_writer_nodrop *p =
(struct rte_port_ethdev_writer_nodrop *) port;
-
-   uint32_t bsz_mask = p->bsz_mask;
uint32_t tx_buf_count = p->tx_buf_count;
-   uint64_t expr = (pkts_mask & (pkts_mask + 1)) |
-   ((pkts_mask & bsz_mask) ^ bsz_mask);

-   if (expr == 0) {
+   if (pkts_mask == p->bsz_mask) {
uint64_t n_pkts = __builtin_popcountll(pkts_mask);
uint32_t n_pkts_ok;

diff --git a/lib/librte_port/rte_port_ring.c b/lib/librte_port/rte_port_ring.c
index 765ecc5..b36e4ce 100644
--- a/lib/librte_port/rte_port_ring.c
+++ b/lib/librte_port/rte_port_ring.c
@@ -217,7 +217,7 @@ rte_port_ring_writer_create_internal(void *params, int socket_id,
port->ring = conf->ring;
port->tx_burst_sz = conf->tx_burst_sz;
port->tx_buf_count = 0;
-   port->bsz_mask = 1LLU << (conf->tx_burst_sz - 1);
+   port->bsz_mask = UINT64_MAX >> (64 - conf->tx_burst_sz);
port->is_multi = is_multi;

return port;
@@ -299,13 +299,9 @@ rte_port_ring_writer_tx_bulk_internal(void *port,
 {
struct rte_port_ring_writer *p =
(struct rte_port_ring_writer *) port;
-
-   uint32_t bsz_mask = p->bsz_mask;
uint32_t tx_buf_count = p->tx_buf_count;
-   uint64_t expr = (pkts_mask & (pkts_mask + 1)) |
-   ((pkts_mask & bsz_mask) ^ bsz_mask);

-   if (expr == 0) {
+   if (pkts_mask == p->bsz_mask) {
uint64_t n_pkts = __builtin_popcountll(pkts_mask);
uint32_t n_pkts_ok;

@@ -486,7 +482,7 @@ rte_port_ring_writer_nodrop_create_internal(void *params, int socket_id,
port->ring = conf->ring;
port->tx_burst_sz = conf->tx_burst_sz;
port->tx_buf_count = 0;
-   port->bsz_mask = 1LLU << (conf->tx_burst_sz - 1);
+   port->bsz_mask = UINT64_MAX >> (64 - conf->tx_burst_sz);
port->is_multi = is_multi;

/*
@@ -613,13 +609,9 @@ rte_port_ring_writer_nodrop_tx_bulk_internal(void *port,
 {
struct rte_port_ring_writ

[dpdk-dev] [PATCH 2/4] port: fix ring writer buffer overflow

2016-03-28 Thread Robert Sanford
Ring writer tx_bulk functions may write past the end of tx_buf[].
The solution is to double the size of tx_buf[]; the worst-case
arithmetic is sketched below.
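
Why doubling suffices (a sketch of the worst-case arithmetic,
assuming the flush check runs only after the append loop):

#include <stdio.h>

int main(void)
{
    unsigned burst_max   = 64; /* RTE_PORT_IN_BURST_SIZE_MAX */
    unsigned tx_burst_sz = 64; /* worst case */

    /* tx_buf may already hold tx_burst_sz - 1 leftover packets, and
     * one f_tx_bulk call can append a whole 64-packet mask before
     * the flush check runs. */
    unsigned peak = (tx_burst_sz - 1) + burst_max;

    printf("peak %u vs old size %u, new size %u\n",
           peak, burst_max, 2 * burst_max); /* 127 vs 64, 128 */
    return 0;
}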

Signed-off-by: Robert Sanford 
---
 lib/librte_port/rte_port_ring.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_port/rte_port_ring.c b/lib/librte_port/rte_port_ring.c
index b847fea..765ecc5 100644
--- a/lib/librte_port/rte_port_ring.c
+++ b/lib/librte_port/rte_port_ring.c
@@ -179,7 +179,7 @@ rte_port_ring_reader_stats_read(void *port,
 struct rte_port_ring_writer {
struct rte_port_out_stats stats;

-   struct rte_mbuf *tx_buf[RTE_PORT_IN_BURST_SIZE_MAX];
+   struct rte_mbuf *tx_buf[2 * RTE_PORT_IN_BURST_SIZE_MAX];
struct rte_ring *ring;
uint32_t tx_burst_sz;
uint32_t tx_buf_count;
@@ -447,7 +447,7 @@ rte_port_ring_writer_stats_read(void *port,
 struct rte_port_ring_writer_nodrop {
struct rte_port_out_stats stats;

-   struct rte_mbuf *tx_buf[RTE_PORT_IN_BURST_SIZE_MAX];
+   struct rte_mbuf *tx_buf[2 * RTE_PORT_IN_BURST_SIZE_MAX];
struct rte_ring *ring;
uint32_t tx_burst_sz;
uint32_t tx_buf_count;
-- 
1.7.1



[dpdk-dev] [PATCH 1/4] app/test: enhance test_port_ring_writer

2016-03-28 Thread Robert Sanford
Add code to send two 60-packet bursts to a ring port_out.
This tests a ring writer buffer overflow problem and fix
(in patch 2/4).

Signed-off-by: Robert Sanford 
---
 app/test/test_table_ports.c |   27 +--
 1 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/app/test/test_table_ports.c b/app/test/test_table_ports.c
index 2532367..0c0ec0a 100644
--- a/app/test/test_table_ports.c
+++ b/app/test/test_table_ports.c
@@ -149,8 +149,8 @@ test_port_ring_writer(void)

/* -- Traffic TX -- */
int expected_pkts, received_pkts;
-   struct rte_mbuf *mbuf[RTE_PORT_IN_BURST_SIZE_MAX];
-   struct rte_mbuf *res_mbuf[RTE_PORT_IN_BURST_SIZE_MAX];
+   struct rte_mbuf *mbuf[2 * RTE_PORT_IN_BURST_SIZE_MAX];
+   struct rte_mbuf *res_mbuf[2 * RTE_PORT_IN_BURST_SIZE_MAX];

port_ring_writer_params.ring = RING_TX;
port_ring_writer_params.tx_burst_sz = RTE_PORT_IN_BURST_SIZE_MAX;
@@ -216,5 +216,28 @@ test_port_ring_writer(void)
for (i = 0; i < RTE_PORT_IN_BURST_SIZE_MAX; i++)
rte_pktmbuf_free(res_mbuf[i]);

+   /* TX Bulk - send two 60-packet bursts */
+   uint64_t pkt_mask = 0xfffffffffffffff0ULL;
+
+   for (i = 0; i < 4; i++)
+   mbuf[i] = NULL;
+   for (i = 4; i < 64; i++)
+   mbuf[i] = rte_pktmbuf_alloc(pool);
+   rte_port_ring_writer_ops.f_tx_bulk(port, mbuf, pkt_mask);
+   for (i = 4; i < 64; i++)
+   mbuf[i] = rte_pktmbuf_alloc(pool);
+   rte_port_ring_writer_ops.f_tx_bulk(port, mbuf, pkt_mask);
+   rte_port_ring_writer_ops.f_flush(port);
+
+   expected_pkts = 2 * 60;
+   received_pkts = rte_ring_sc_dequeue_burst(port_ring_writer_params.ring,
+   (void **)res_mbuf, 2 * RTE_PORT_IN_BURST_SIZE_MAX);
+
+   if (received_pkts != expected_pkts)
+   return -10;
+
+   for (i = 0; i < received_pkts; i++)
+   rte_pktmbuf_free(res_mbuf[i]);
+
return 0;
 }
-- 
1.7.1



[dpdk-dev] [PATCH 0/4] port: fix and test bugs in tx_bulk ops

2016-03-28 Thread Robert Sanford
This patch series does the following:

* enhances the port ring writer test to send two large, but not full,
  bursts; this exposes the ring writer buffer overflow
* fixes ring writer buffer overflow
* fixes full burst checks in ethdev, ring, and sched f_tx_bulk ops
* fixes ethdev writer, to send bursts no larger than specified max



Robert Sanford (4):
  app/test: enhance test_port_ring_writer
  port: fix ring writer buffer overflow
  port: fix full burst checks in f_tx_bulk ops
  port: fix ethdev writer burst too big

 app/test/test_table_ports.c   |   27 +--
 lib/librte_port/rte_port_ethdev.c |   35 ++-
 lib/librte_port/rte_port_ring.c   |   20 ++--
 lib/librte_port/rte_port_sched.c  |7 ++-
 4 files changed, 47 insertions(+), 42 deletions(-)



[dpdk-dev] [PATCH] eal/linux: fix rte_epoll_wait

2015-08-18 Thread Robert Sanford
Function rte_epoll_wait should return when underlying call
to epoll_wait times out.
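
For context, a caller sketch (hypothetical values; with the bug, the
timeout below never returned control to the caller; header location
per the Linux EAL):

#include <rte_interrupts.h> /* rte_epoll_wait(), struct rte_epoll_event */

void poll_events(int epfd)
{
    struct rte_epoll_event ev[8];

    /* Returns 0 on timeout (after the fix), >0 events, -1 on error. */
    int n = rte_epoll_wait(epfd, ev, 8, 100 /* ms */);
    (void)n;
}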

Signed-off-by: Robert Sanford 
---
 lib/librte_eal/linuxapp/eal/eal_interrupts.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal_interrupts.c b/lib/librte_eal/linuxapp/eal/eal_interrupts.c
index 3f87875..25cae6a 100644
--- a/lib/librte_eal/linuxapp/eal/eal_interrupts.c
+++ b/lib/librte_eal/linuxapp/eal/eal_interrupts.c
@@ -1012,6 +1012,9 @@ rte_epoll_wait(int epfd, struct rte_epoll_event *events,
strerror(errno));
rc = -1;
break;
+   } else {
+   /* rc == 0, epoll_wait timed out */
+   break;
}
}

-- 
1.7.1



[dpdk-dev] [RFC PATCH] eal: rte_rand yields only 62 random bits

2015-03-30 Thread Robert Sanford
Yes, applications have many choices for PRNGs. But, we still need one
internally for the following libs: PMDs (e1000, fm10k, i40e, ixgbe, virtio,
xenvirt), sched, and timer.


On Fri, Mar 27, 2015 at 8:03 PM, Stephen Hemminger <stephen at networkplumber.org> wrote:
> I would argue remove rte_rand from DPDK.



On Mon, Mar 30, 2015 at 1:28 AM, Stephen Hemminger <stephen at networkplumber.org> wrote:

> if some one needs PRNG, th GNU scientific library has lots of them
>
> https://www.gnu.org/software/gsl/manual/html_node/Random-number-generator-algorithms.html
>
> > On Fri, Mar 27, 2015 at 5:38 PM, Matthew Hall wrote:
>
> > On Fri, Mar 27, 2015 at 05:03:02PM -0700, Stephen Hemminger wrote:
> > > I would argue remove rte_rand from DPDK.
> >
> > +1
> >
> > To paraphrase Donald Knuth, "Random numbers should not be generated
> [using
> > a
> > function coded] at random."
> >
> > It'd be better to fix libc, or considering that has a slow dev cycle and
> > platform compatibility limits, use some simple, semi-random,
> > high-performance
> > BSD licensed routine from a known-good library.
> >
> > Matthew.
> >
>


[dpdk-dev] [RFC PATCH] eal: rte_rand yields only 62 random bits

2015-03-17 Thread Robert Sanford
The implementation of rte_rand() returns only 62 bits of 
pseudo-randomness, because the underlying calls to lrand48()
"return non-negative long integers uniformly distributed
between 0 and 2^31."
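
Concretely (a small sketch of the old behavior): lrand48() yields 31
bits per call, so in the old two-call construction bits 63 and 31 of
the result can never be set, leaving 62 random bits:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Old rte_rand() shape: two 31-bit draws. */
    uint64_t val = (uint64_t)lrand48() << 32;
    val += (uint64_t)lrand48();

    /* Both prints are always 0. */
    printf("bit63=%d bit31=%d\n",
           (int)(val >> 63), (int)((val >> 31) & 1));
    return 0;
}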

We have written a potential fix, but before we spend more
time testing and refining it, I wanted to check with you
guys.

We switched to using the reentrant versions of [ls]rand48,
and maintain per-lcore state. We need ~2.06 calls to
lrand48_r() per call to rte_rand().
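
The ~2.06 figure is plain bit accounting (a sketch, not code from the
patch): each lrand48_r() call yields 31 random bits and rte_rand()
must return 64, so on average it needs 64/31 ~= 2.06 draws; banking
the leftover bits in _hi_bits keeps most calls to exactly two draws.

#include <stdio.h>

int main(void)
{
    /* 64 bits needed, 31 random bits per lrand48_r() call. */
    printf("avg calls per rte_rand() = %f\n", 64.0 / 31.0);
    return 0;
}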

Do you agree with the approach we've taken in this patch?


Thanks,
Robert

---
 lib/librte_eal/common/include/rte_random.h |   33 +--
 lib/librte_eal/linuxapp/eal/eal_thread.c   |7 ++
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/lib/librte_eal/common/include/rte_random.h b/lib/librte_eal/common/include/rte_random.h
index 24ae836..b9248cd 100644
--- a/lib/librte_eal/common/include/rte_random.h
+++ b/lib/librte_eal/common/include/rte_random.h
@@ -46,6 +46,17 @@ extern "C" {

 #include <stdint.h>
 #include <stdlib.h>
+#include <rte_per_lcore.h>
+#include <rte_branch_prediction.h>
+
+struct rte_rand_data {
+   struct drand48_data _dr48;
+   uint32_t _hi_bits;
+   uint8_t _bits_left;
+};
+
+RTE_DECLARE_PER_LCORE(struct rte_rand_data, _rand_data);
+

 /**
  * Seed the pseudo-random generator.
@@ -60,7 +71,7 @@ extern "C" {
 static inline void
 rte_srand(uint64_t seedval)
 {
-   srand48((long unsigned int)seedval);
+   srand48_r((long unsigned int)seedval, &RTE_PER_LCORE(_rand_data)._dr48);
 }

 /**
@@ -76,10 +87,26 @@ rte_srand(uint64_t seedval)
 static inline uint64_t
 rte_rand(void)
 {
+   struct rte_rand_data *rd = &RTE_PER_LCORE(_rand_data);
uint64_t val;
-   val = lrand48();
+   uint32_t hi_bits;
+   long int result;
+
+   if (unlikely(rd->_bits_left < 2)) {
+   lrand48_r(&rd->_dr48, &result);
+   rd->_hi_bits |= (uint32_t)result << (1 - rd->_bits_left);
+   rd->_bits_left += 31;
+   }
+
+   hi_bits = rd->_hi_bits;
+   lrand48_r(&rd->_dr48, &result);
+   val = (uint32_t)result | (hi_bits & 0x80000000);
val <<= 32;
-   val += lrand48();
+   hi_bits <<= 1;
+   lrand48_r(>_dr48, );
+   val |= (uint32_t)result | (hi_bits & 0x800);
+   rd->_hi_bits = hi_bits << 1;
+   rd->_bits_left -= 2;
return val;
 }

diff --git a/lib/librte_eal/linuxapp/eal/eal_thread.c b/lib/librte_eal/linuxapp/eal/eal_thread.c
index 5635c7d..08e7f72 100644
--- a/lib/librte_eal/linuxapp/eal/eal_thread.c
+++ b/lib/librte_eal/linuxapp/eal/eal_thread.c
@@ -52,6 +52,8 @@
 #include 
 #include 
 #include 
+#include <rte_random.h>
+#include <rte_cycles.h>

 #include "eal_private.h"
 #include "eal_thread.h"
@@ -59,6 +61,8 @@
 RTE_DEFINE_PER_LCORE(unsigned, _lcore_id) = LCORE_ID_ANY;
 RTE_DEFINE_PER_LCORE(unsigned, _socket_id) = (unsigned)SOCKET_ID_ANY;
 RTE_DEFINE_PER_LCORE(rte_cpuset_t, _cpuset);
+RTE_DEFINE_PER_LCORE(struct rte_rand_data, _rand_data);
+

 /*
  * Send a message to a slave lcore identified by slave_id to call a
@@ -147,6 +151,9 @@ eal_thread_loop(__attribute__((unused)) void *arg)
/* set the lcore ID in per-lcore memory area */
RTE_PER_LCORE(_lcore_id) = lcore_id;

+   /* seed per-lcore PRNG */
+   rte_srand(rte_rdtsc());
+
/* set CPU affinity */
if (eal_thread_set_affinity() < 0)
rte_panic("cannot set affinity\n");
-- 
1.7.1



[dpdk-dev] [PATCH v2 0/3] timer: fix rte_timer_reset

2015-02-25 Thread Robert Sanford
On Wed, Feb 25, 2015 at 6:16 AM, Bruce Richardson <bruce.richardson at intel.com> wrote:

> On Wed, Feb 25, 2015 at 06:02:24AM -0500, Robert Sanford wrote:
>
> >
> > One question about lib rte_timer that's been troubling me for a while:
> How
> > are skip lists better than BSD-style timer wheels?
> >
> > --
>
> The skip list may not be any better than a timer wheel - it's just what is
> used
> now, and it does give pretty good performance (insert O(log n) [up to a few
> million timers per core], expiry O(1)).
> Originally in DPDK, the timers were maintained in a regular sorted linked
> list,
> but that suffered from scalability issues when starting timers, or stopped
> before
> expiry. The skip-list was therefore a big improvement on that, and gave us
> much greater scalability in timers, without any regressions in
> performance. I
> don't know if anyone has tried to implement and benchmark a timer-wheel
> based
> rte_timer library replacement. I'd be interested to see a performance
> comparison
> between the two implementations! :-)
>
> Regards,
> /Bruce
>
>
I've wanted to try out the timer-wheels since before the skip-list version,
but it just hasn't made it to the top of my priority list.
The other thing that concerns me about the skip-list implementation is the
extra cache line that all those pointers consume.

--
Thanks,
Robert


[dpdk-dev] [PATCH v2 0/3] timer: fix rte_timer_reset

2015-02-25 Thread Robert Sanford
Hi Thomas,

Yes, I'm interested in becoming a maintainer of rte_timer. What are the
responsibilities?


One question about lib rte_timer that's been troubling me for a while: How
are skip lists better than BSD-style timer wheels?

--
Regards,
Robert


On Wed, Feb 25, 2015 at 4:46 AM, Thomas Monjalon wrote:

> > > Changes in v2:
> > > - split into multiple patches
> > > - minor coding-style changes
> > >
> > > Robert Sanford (3):
> > >timer: fix return value of rte_timer_reset(),
> > >  insert rte_pause() into rte_timer_reset_sync() wait-loop
> > >app/test: fix timer stress test to succeed on multiple runs,
> > >  display number of times that rte_timer_reset() fails
> > >  (expected) due to races with other cores
> >
> > Series:
> > Acked-by: Olivier Matz 
>
> Applied, thanks
>
> Robert, as you well know rte_timer and you work on it,
> maybe you are interested in becoming maintainer?
>


[dpdk-dev] [PATCH v2 3/3] timer: fix rte_timer_reset return value

2015-02-24 Thread Robert Sanford
- API rte_timer_reset() should return -1 when the timer is in the
RUNNING or CONFIG state. Instead, it ignores the return value of
internal function __rte_timer_reset() and always returns 0.
We change rte_timer_reset() to return the value returned by
__rte_timer_reset().

- Enhance timer stress test 2 to report how many timer reset
collisions occur, i.e., how many times rte_timer_reset() fails
due to a timer being in the CONFIG state.

Signed-off-by: Robert Sanford 

---
app/test/test_timer.c|   19 ---
lib/librte_timer/rte_timer.c |4 +---
2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/app/test/test_timer.c b/app/test/test_timer.c
index 070437a..79fa955 100644
--- a/app/test/test_timer.c
+++ b/app/test/test_timer.c
@@ -247,10 +247,12 @@ static int
 timer_stress2_main_loop(__attribute__((unused)) void *arg)
 {
static struct rte_timer *timers;
-   int i;
+   int i, ret;
static volatile int ready = 0;
uint64_t delay = rte_get_timer_hz() / 4;
unsigned lcore_id = rte_lcore_id();
+   int32_t my_collisions = 0;
+   static rte_atomic32_t collisions = RTE_ATOMIC32_INIT(0);

if (lcore_id == rte_get_master_lcore()) {
cb_count = 0;
@@ -269,15 +271,25 @@ timer_stress2_main_loop(__attribute__((unused)) void *arg)
}

/* have all cores schedule all timers on master lcore */
-   for (i = 0; i < NB_STRESS2_TIMERS; i++)
-   rte_timer_reset(&timers[i], delay, SINGLE, rte_get_master_lcore(),
+   for (i = 0; i < NB_STRESS2_TIMERS; i++) {
+   ret = rte_timer_reset(&timers[i], delay, SINGLE, rte_get_master_lcore(),
timer_stress2_cb, NULL);
+   /* there will be collisions when multiple cores simultaneously
+* configure the same timers */
+   if (ret != 0)
+   my_collisions++;
+   }
+   if (my_collisions != 0)
+   rte_atomic32_add(&collisions, my_collisions);

ready = 0;
rte_delay_ms(500);

/* now check that we get the right number of callbacks */
if (lcore_id == rte_get_master_lcore()) {
+   my_collisions = rte_atomic32_read(&collisions);
+   if (my_collisions != 0)
+   printf("- %d timer reset collisions (OK)\n", my_collisions);
rte_timer_manage();
if (cb_count != NB_STRESS2_TIMERS) {
printf("Test Failed\n");
@@ -317,6 +329,7 @@ timer_stress2_main_loop(__attribute__((unused)) void *arg)
rte_free(timers);
timers = NULL;
ready = 0;
+   rte_atomic32_set(&collisions, 0);

if (cb_count != NB_STRESS2_TIMERS) {
printf("Test Failed\n");
diff --git a/lib/librte_timer/rte_timer.c b/lib/librte_timer/rte_timer.c
index dae76cc..d18abf5 100644
--- a/lib/librte_timer/rte_timer.c
+++ b/lib/librte_timer/rte_timer.c
@@ -424,10 +424,8 @@ rte_timer_reset(struct rte_timer *tim, uint64_t ticks,
else
period = 0;

-   __rte_timer_reset(tim,  cur_time + ticks, period, tim_lcore,
+   return __rte_timer_reset(tim,  cur_time + ticks, period, tim_lcore,
  fct, arg, 0);
-
-   return 0;
 }

 /* loop until rte_timer_reset() succeed */
-- 
1.7.1



[dpdk-dev] [PATCH v2 2/3] timer: fix stress test on multiple runs

2015-02-24 Thread Robert Sanford
Fix timer stress test to succeed on multiple runs.

Signed-off-by: Robert Sanford 

---
app/test/test_timer.c |7 +++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/app/test/test_timer.c b/app/test/test_timer.c
index 4b4800b..070437a 100644
--- a/app/test/test_timer.c
+++ b/app/test/test_timer.c
@@ -253,6 +253,7 @@ timer_stress2_main_loop(__attribute__((unused)) void *arg)
unsigned lcore_id = rte_lcore_id();

if (lcore_id == rte_get_master_lcore()) {
+   cb_count = 0;
timers = rte_malloc(NULL, sizeof(*timers) * NB_STRESS2_TIMERS, 0);
if (timers == NULL) {
printf("Test Failed\n");
@@ -311,6 +312,12 @@ timer_stress2_main_loop(__attribute__((unused)) void *arg)
/* now check that we get the right number of callbacks */
if (lcore_id == rte_get_master_lcore()) {
rte_timer_manage();
+
+   /* clean up statics, in case we run again */
+   rte_free(timers);
+   timers = NULL;
+   ready = 0;
+
if (cb_count != NB_STRESS2_TIMERS) {
printf("Test Failed\n");
printf("- Stress test 2, part 2 failed\n");
-- 
1.7.1



[dpdk-dev] [PATCH v2 0/3] timer: fix rte_timer_reset

2015-02-24 Thread Robert Sanford
Changes in v2:
- split into multiple patches
- minor coding-style changes

Robert Sanford (3):
  timer: fix return value of rte_timer_reset(),
insert rte_pause() into rte_timer_reset_sync() wait-loop
  app/test: fix timer stress test to succeed on multiple runs,
display number of times that rte_timer_reset() fails
(expected) due to races with other cores

app/test/test_timer.c|   26 +++---
lib/librte_timer/rte_timer.c |7 +++
2 files changed, 26 insertions(+), 7 deletions(-)



[dpdk-dev] [PATCH v4 00/17] support multi-pthread per core

2015-02-06 Thread Robert Sanford
On Fri, Feb 6, 2015 at 10:47 AM, Olivier MATZ wrote:

> Hi,
>
> On 02/02/2015 03:02 AM, Cunming Liang wrote:
> > v4 changes:
> >   new patch fixing strnlen() invalid return in 32bit icc [03/17]
> >   update and add more comments on sched_yield() [16/17]
> >
> > v3 changes:
> >   new patch adding sched_yield() in rte_ring to avoid long spin [16/17]
> >
> > v2 changes:
> >   add '-' support for EAL option '--lcores' [02/17]
> >
> > The patch series contain the enhancements of EAL and fixes for libraries
> > to run multi-pthreads(either EAL or non-EAL thread) per physical core.
> > Two major changes list as below:
> > - Extend the core affinity of each EAL thread to 1:n.
> >   Each lcore stands for a EAL thread rather than a logical core.
> >   The change adds new EAL option to allow static lcore to cpuset
> assginment.
> >   Then a lcore(EAL thread) affinity to a cpuset, original 1:1 mapping is
> the special case.
> > - Fix the libraries to allow running on any non-EAL thread.
> >   It fix the gaps running libraries in non-EAL thread(dynamic created by
> user).
> >   Each fix libraries take care the case of rte_lcore_id() >=
> RTE_MAX_LCORE.
>
> Sorry if I missed something, but after reading the mailing list threads
> about this subject, I cannot find an explanation about what problem
> this series try to solve.
>
> Can you give some details about which use-case require to have multiple
> pthreads per cpu? What are the advantage of doing so?
>
> Regards,
> Olivier
>


http://dpdk.org/ml/archives/dev/2014-December/009838.html


[dpdk-dev] [RFC PATCH] rte_timer: Fix rte_timer_reset return value

2015-02-06 Thread Robert Sanford
Hi Olivier,

Thanks for reviewing this patch.
Please see my responses to your comments, below.

I also have one request for you. You probably use git almost every day. For
people who only use git maybe once per year, could you please show us the
exact sequence of commands that you run to prepare a patch series? We know
there are man pages and online documents, etc, but it would be an extremely
valuable jumpstart if you just give us a snippet of your shell history
showing the exact commands that you run to prepare and email a patch
series. I would much rather spend time getting the code right, and less
time learning (by trial and error) the nuances of git apply, add, commit,
format-patch, send-email, etc.


>   /* now check that we get the right number of callbacks */
> >   if (lcore_id == rte_get_master_lcore()) {
> > + if ((my_collisions = rte_atomic32_read(&collisions)) != 0)
> > + printf("- %d timer reset collisions (OK)\n",
> my_collisions);
>
> That's not very important, but I think avoiding affectation + comparison
> at the same time is clearer:
>
>   my_collisions = rte_atomic32_read();
>   if (my_collisions != 0) {
> ...
>

Yes, I will change this.


> > @@ -311,6 +323,13 @@ timer_stress2_main_loop(__attribute__((unused))
> void *arg)
> >   /* now check that we get the right number of callbacks */
> >   if (lcore_id == rte_get_master_lcore()) {
> >   rte_timer_manage();
> > +
> > + /* clean up statics, in case we run again */
> > + rte_free(timers);
> > + timers = 0;
>
> timers = NULL is better than timers = 0 as it's a pointer.
>

Yes, I will change this.


> > + ready = 0;
>
> The lines above should go in another patch as it fixes another problem
> (+ a memory leek).
> "testpmd: allow to restart timer stress tests"
>

Yes, I will split it into two patches: rte_timer and test_timer. But, I
don't see much benefit in splitting test_timer.c changes into separate
patches for each bug discovered.


> > diff --git a/lib/librte_timer/rte_timer.c b/lib/librte_timer/rte_timer.c
> > index 269a992..d18abf5 100644
> > --- a/lib/librte_timer/rte_timer.c
> > +++ b/lib/librte_timer/rte_timer.c
> > @@ -424,10 +424,8 @@ rte_timer_reset(struct rte_timer *tim, uint64_t
> ticks,
> >   else
> >   period = 0;
> >
> > - __rte_timer_reset(tim,  cur_time + ticks, period, tim_lcore,
> > + return __rte_timer_reset(tim,  cur_time + ticks, period, tim_lcore,
> > fct, arg, 0);
> > -
> > - return 0;
> >  }
> >
> >  /* loop until rte_timer_reset() succeed */
> > @@ -437,7 +435,8 @@ rte_timer_reset_sync(struct rte_timer *tim, uint64_t
> ticks,
> >rte_timer_cb_t fct, void *arg)
> >  {
> >   while (rte_timer_reset(tim, ticks, type, tim_lcore,
> > -fct, arg) != 0);
> > +fct, arg) != 0)
> > + rte_pause();
> >  }
>
> Maybe the lines above could go to another patch too.
> "timers: relax cpu in rte_timer_reset_sync()"
>
>
If you mean that we should have one patch for rte_timer_reset() and one for
rte_timer_reset_sync(), my response is: Come on, these are two one-line
fixes in a pair of related and adjacent functions. Let's not go overboard
by splitting them into two patches. :-)

Also, I think the commit log should highlight the fact that
> your patch also fixes rte_timer_reset_sync() that was not
> working at all.
>
>
We said something to that effect: "Change API rte_timer_reset_sync() to
invoke rte_pause() while spin-waiting for rte_timer_reset() to succeed." I
can use different wording if you like.


>
> Thanks!
> Olivier
>

Thank you,
Robert


[dpdk-dev] Enhance KNI DPDK-app-side to be Multi-Producer/Consumer

2014-11-19 Thread Robert Sanford
Hi Danny,

On Fri, Nov 14, 2014 at 7:04 PM, Zhou, Danny  wrote:

> It will be always good if you can submit the RFC patch in terms of KNI
> optimization.
>
> On the other hand, do you have any perf. data to prove that your patchset
> could improve
> KNI performance which is the concern that most customers care about? We
> introduced
> multiple-threaded KNI kernel support last year, if I remember correctly,
> the key perform
> bottle-neck we found is the skb alloc/free and memcpy between skb and
> mbuf. Would be
> very happy if your patchset can approve I am wrong.



This is not an attempt to improve raw performance. Our modest goal is to
make librte_kni's RX/TX burst APIs multithreaded, without changing
rte_kni.ko. In this RFC patch, we make it possible for multiple cores to
concurrently invoke rte_kni_tx_burst (or rte_kni_rx_burst) for the same KNI
device.

At the moment, multiple cores invoking rte_kni_tx_burst for the same device
cannot function correctly, because the rte_kni_fifo structures (memory
shared between app and kernel driver) are single-producer, single-consumer.
The following patch supplements the rte_kni_fifo structure with an
additional structure that is private to the application, and we borrow
librte_ring's MP/MC enqueue/dequeue logic.
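
The borrowed scheme in sketch form (names and shapes here are
hypothetical; the real structures are in the diff below): producers
reserve slots by advancing a private head with compare-and-swap, copy
their pointers in, then publish in order by moving the shared tail.

#include <stdint.h>

struct fifo_multi {
    volatile uint32_t head; /* next slot producers will claim */
    volatile uint32_t tail; /* last published slot */
    uint32_t mask;          /* ring size - 1; size is a power of two */
};

static unsigned
fifo_put_mp(void **ring, struct fifo_multi *m, void **objs, unsigned n)
{
    uint32_t old_head, new_head;
    unsigned i;

    /* 1) Claim n slots by CASing head forward (free-space check
     *    omitted for brevity). */
    do {
        old_head = m->head;
        new_head = old_head + n;
    } while (!__sync_bool_compare_and_swap(&m->head, old_head, new_head));

    /* 2) Fill the claimed slots. */
    for (i = 0; i < n; i++)
        ring[(old_head + i) & m->mask] = objs[i];

    /* 3) Wait for earlier producers, then publish ours in order. */
    while (m->tail != old_head)
        ;
    m->tail = new_head;
    return n;
}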

Here is a patch for 1.8. We have only tested a 1.7.1 version. Please have a
look and let us know whether you think something like this would be useful.

--
Thanks,
Robert


Signed-off-by: Robert Sanford

---
 lib/librte_kni/rte_kni.c  |   21 +-
 lib/librte_kni/rte_kni_fifo.h |  131
+
 2 files changed, 148 insertions(+), 4 deletions(-)

diff --git a/lib/librte_kni/rte_kni.c b/lib/librte_kni/rte_kni.c
index fdb7509..8009173 100644
--- a/lib/librte_kni/rte_kni.c
+++ b/lib/librte_kni/rte_kni.c
@@ -76,6 +76,11 @@ struct rte_kni {
  struct rte_kni_fifo *alloc_q;   /**< Allocated mbufs queue */
  struct rte_kni_fifo *free_q;/**< To be freed mbufs queue */

+ struct rte_kni_fifo_multi tx_q_mc;  /**< Make tx_q multi-consumer */
+ struct rte_kni_fifo_multi alloc_q_mp;/**< Make alloc_q multi-producer */
+ struct rte_kni_fifo_multi rx_q_mp;  /**< Make rx_q multi-producer */
+ struct rte_kni_fifo_multi free_q_mc;/**< Make free_q multi-consumer */
+
  /* For request & response */
  struct rte_kni_fifo *req_q; /**< Request queue */
  struct rte_kni_fifo *resp_q;/**< Response queue */
@@ -414,6 +419,11 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
  kni_fifo_init(ctx->free_q, KNI_FIFO_COUNT_MAX);
  dev_info.free_phys = mz->phys_addr;

+ kni_fifo_multi_init(&ctx->tx_q_mc, KNI_FIFO_COUNT_MAX);
+ kni_fifo_multi_init(&ctx->alloc_q_mp, KNI_FIFO_COUNT_MAX);
+ kni_fifo_multi_init(&ctx->rx_q_mp, KNI_FIFO_COUNT_MAX);
+ kni_fifo_multi_init(&ctx->free_q_mc, KNI_FIFO_COUNT_MAX);
+
  /* Request RING */
  mz = slot->m_req_q;
  ctx->req_q = mz->addr;
@@ -557,7 +567,8 @@ rte_kni_handle_request(struct rte_kni *kni)
 unsigned
 rte_kni_tx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned
num)
 {
- unsigned ret = kni_fifo_put(kni->rx_q, (void **)mbufs, num);
+ unsigned ret = kni_fifo_put_mp(kni->rx_q, &kni->rx_q_mp, (void **)mbufs,
+ num);

  /* Get mbufs from free_q and then free them */
  kni_free_mbufs(kni);
@@ -568,7 +579,8 @@ rte_kni_tx_burst(struct rte_kni *kni, struct rte_mbuf
**mbufs, unsigned num)
 unsigned
 rte_kni_rx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned
num)
 {
- unsigned ret = kni_fifo_get(kni->tx_q, (void **)mbufs, num);
+ unsigned ret = kni_fifo_get_mc(kni->tx_q, &kni->tx_q_mc,
+ (void **)mbufs, num);

  /* Allocate mbufs and then put them into alloc_q */
  kni_allocate_mbufs(kni);
@@ -582,7 +594,8 @@ kni_free_mbufs(struct rte_kni *kni)
  int i, ret;
  struct rte_mbuf *pkts[MAX_MBUF_BURST_NUM];

- ret = kni_fifo_get(kni->free_q, (void **)pkts, MAX_MBUF_BURST_NUM);
+ ret = kni_fifo_get_mc(kni->free_q, &kni->free_q_mc, (void **)pkts,
+ MAX_MBUF_BURST_NUM);
  if (likely(ret > 0)) {
  for (i = 0; i < ret; i++)
  rte_pktmbuf_free(pkts[i]);
@@ -629,7 +642,7 @@ kni_allocate_mbufs(struct rte_kni *kni)
  if (i <= 0)
  return;

- ret = kni_fifo_put(kni->alloc_q, (void **)pkts, i);
+ ret = kni_fifo_put_mp(kni->alloc_q, &kni->alloc_q_mp, (void **)pkts, i);

  /* Check if any mbufs not put into alloc_q, and then free them */
  if (ret >= 0 && ret < i && ret < MAX_MBUF_BURST_NUM) {
diff --git a/lib/librte_kni/rte_kni_fifo.h b/lib/librte_kni/rte_kni_fifo.h
index 8cb8587..7dccba2 100644
--- a/lib/librte_kni/rte_kni_fifo.h
+++ b/lib/librte_kni/rte_kni_fifo.h
@@ -91,3 +91,134 @@ kni_fifo_get(struct rte_kni_fifo *fifo, void **data,
unsigned num)
  fifo->read = new_read;
  return i;
 }
+
+
+/**
+ * Suplemental members to facilitate MP/MC access to KNI FIFOs.
+ */
+struct rte_kni_fifo_multi {
+ volatile uint32_t head;
+ volatile uint32_t tail;
+ ui

[dpdk-dev] [PATCH 1/2] igb_uio: fix compatibility on old kernel

2014-08-22 Thread Robert Sanford
This is what we came up with. It works for us. In our kernel headers'
linux/pci.h, pci_num_vf is enclosed within "#ifdef CONFIG_PCI_IOV/#endif";
pci_intx_mask_supported and pci_check_and_mask_intx are enclosed within
"#ifdef HAVE_PCI_SET_MWI/#endif".

What do you think?

--
Thanks,
Robert


---
 lib/librte_eal/linuxapp/igb_uio/compat.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/linuxapp/igb_uio/compat.h b/lib/librte_eal/linuxapp/igb_uio/compat.h
index 2a16540..f7404d8 100644
--- a/lib/librte_eal/linuxapp/igb_uio/compat.h
+++ b/lib/librte_eal/linuxapp/igb_uio/compat.h
@@ -17,7 +17,7 @@
 #define   PCI_MSIX_ENTRY_CTRL_MASKBIT   1
 #endif

-#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 34)
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 34) && !defined(CONFIG_PCI_IOV)
 static int pci_num_vf(struct pci_dev *dev)
 {
struct iov {
@@ -38,7 +38,7 @@ static int pci_num_vf(struct pci_dev *dev)
 #endif


-#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 3, 0)
+#if LINUX_VERSION_CODE < KERNEL_VERSION(3, 3, 0) && !defined(HAVE_PCI_SET_MWI)

 /* Check if INTX works to control irq's.
  * Set's INTX_DISABLE flag and reads it back
-- 
1.7.1



On Fri, Aug 22, 2014 at 1:29 PM, Sanford, Robert wrote:
>
> Hi Thomas,
>
> Not that I would *like* to fix this, but I *need* to fix it. We are using
> CentOS 6.5, which I believe is based on RHEL. We have kernel
> 2.6.32-431.3.1.el6.x86_64.
>
> I realize that we need to add/change ifdefs around pci_num_vf,
> pci_intx_mask_supported, and pci_check_and_mask_intx in igb_uio/compat.h.
> Any more specific suggestions on how to (elegantly) fix it for us, but not
> break it for anyone else?
>
> --
> Regards,
> Robert
>


[dpdk-dev] [PATCH v11 1/5] bond: new link bonding library

2014-06-30 Thread Robert Sanford
Hi Declan,

On Sun, Jun 29, 2014 at 1:49 PM, Declan Doherty wrote:

> Initial release with support for
>  Mode 0 - Round Robin
>  Mode 1 - Active Backup
>  Mode 2 - Balance -> Supports 3 transmit polices (layer 2, layer 2+3,
> layer 3+4)
>  Mode 3 - Broadcast
>
> Signed-off-by: Declan Doherty 
> ---
>  config/common_bsdapp   |5 +
>  config/common_linuxapp |5 +
>  doc/doxy-api-index.md  |1 +
>  doc/doxy-api.conf  |1 +
>  lib/Makefile   |1 +
>  lib/librte_pmd_bond/Makefile   |   61 ++
>  lib/librte_pmd_bond/rte_eth_bond.h |  255 ++
>  lib/librte_pmd_bond/rte_eth_bond_api.c |  662 +++
>  lib/librte_pmd_bond/rte_eth_bond_args.c|  252 ++
>  lib/librte_pmd_bond/rte_eth_bond_pmd.c | 1212
> 
>  lib/librte_pmd_bond/rte_eth_bond_private.h |  215 +
>  mk/rte.app.mk  |4 +
>  12 files changed, 2674 insertions(+), 0 deletions(-)
>  create mode 100644 lib/librte_pmd_bond/Makefile
>  create mode 100644 lib/librte_pmd_bond/rte_eth_bond.h
>  create mode 100644 lib/librte_pmd_bond/rte_eth_bond_api.c
>  create mode 100644 lib/librte_pmd_bond/rte_eth_bond_args.c
>  create mode 100644 lib/librte_pmd_bond/rte_eth_bond_pmd.c
>  create mode 100644 lib/librte_pmd_bond/rte_eth_bond_private.h
>
>
>
I see a potential problem with bond_ethdev_rx_burst( ).
We could receive more packets than the caller asked for, and overrun the
caller's rte_mbuf * array.
The fix could be something like this:

--- a/lib/librte_pmd_bond/rte_eth_bond_pmd.c
+++ b/lib/librte_pmd_bond/rte_eth_bond_pmd.c
@@ -73,13 +73,15 @@ bond_ethdev_rx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
case BONDING_MODE_ROUND_ROBIN:
case BONDING_MODE_BROADCAST:
case BONDING_MODE_BALANCE:
-   for (i = 0; i < internals->active_slave_count; i++) {
+   for (i = 0; i < internals->active_slave_count && nb_pkts; i++) {
/* Offset of pointer to *bufs increases as packets are received
 * from other slaves */
num_rx_slave = rte_eth_rx_burst(internals->active_slaves[i],
bd_rx_q->queue_id, bufs + num_rx_total, nb_pkts);
-   if (num_rx_slave)
+   if (num_rx_slave) {
num_rx_total += num_rx_slave;
+   nb_pkts -= num_rx_slave;
+   }
}
break;
case BONDING_MODE_ACTIVE_BACKUP:

--
Regards,
Robert


[dpdk-dev] [PATCH v3 2/2] malloc: fix malloc and free linear complexity

2014-06-23 Thread Robert Sanford
Fix typos and false assumptions in malloc unit tests.

Without enhancements to lib rte_malloc, malloc autotest fails every
second (2nd) run. With enhancements, malloc autotest fails in
function test_multi_alloc_statistics, because we compare the wrong
sets of statistics.

Signed-off-by: Robert Sanford 

---
 app/test/test_malloc.c |   17 +
 1 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/app/test/test_malloc.c b/app/test/test_malloc.c
index 3c38383..bf27eff 100644
--- a/app/test/test_malloc.c
+++ b/app/test/test_malloc.c
@@ -66,7 +66,7 @@
  * ==
  *
  * Allocate some dynamic memory from heap (3 areas). Check that areas
- * don't overlap an that alignment constraints match. This test is
+ * don't overlap and that alignment constraints match. This test is
  * done many times on different lcores simultaneously.
  */

@@ -313,9 +313,9 @@ test_big_alloc(void)
rte_malloc_get_socket_stats(socket, &post_stats);

/* Check statistics reported are correct */
-   /* Allocation increase, cannot be the same as before big allocation */
-   if (post_stats.heap_totalsz_bytes == pre_stats.heap_totalsz_bytes) {
-   printf("Malloc statistics are incorrect - heap_totalz_bytes\n");
+   /* Allocation may increase, or may be the same as before big allocation */
+   if (post_stats.heap_totalsz_bytes < pre_stats.heap_totalsz_bytes) {
+   printf("Malloc statistics are incorrect - 
heap_totalsz_bytes\n");
return -1;
}
/* Check that allocated size adds up correctly */
@@ -336,7 +336,8 @@ test_big_alloc(void)
return -1;
}
/* New blocks now available - just allocated 1 but also 1 new free */
-   if(post_stats.free_count != pre_stats.free_count ) {
+   if (post_stats.free_count != pre_stats.free_count &&
+   post_stats.free_count != pre_stats.free_count - 1) {
printf("Malloc statistics are incorrect - free_count\n");
return -1;
}
@@ -425,7 +426,7 @@ test_multi_alloc_statistics(void)
}

/* Make sure that we didn't touch our greatest chunk: 2 * 11M)  */
-   if (second_stats.greatest_free_size != pre_stats.greatest_free_size) {
+   if (post_stats.greatest_free_size != pre_stats.greatest_free_size) {
printf("Incorrect heap statistics: Greatest free size \n");
return -1;
}
@@ -1036,11 +1037,11 @@ test_malloc(void)

ret = test_multi_alloc_statistics();
if (ret < 0) {
-   printf("test_muti_alloc_statistics() failed\n");
+   printf("test_multi_alloc_statistics() failed\n");
return ret;
}
else
-   printf("test_muti_alloc_statistics() passed\n");
+   printf("test_multi_alloc_statistics() passed\n");

return 0;
 }
-- 
1.7.1



[dpdk-dev] [PATCH v3 1/2] malloc: fix malloc and free linear complexity

2014-06-23 Thread Robert Sanford
Problems with lib rte_malloc:
1. Rte_malloc searches a heap's entire free list looking for the best
   fit, resulting in linear complexity.
2. Heaps store free blocks in a singly-linked list, resulting in
   linear complexity when rte_free needs to remove an adjacent block.
3. The library inserts and removes free blocks with ad hoc, in-line
   code, rather than using linked-list functions or macros.
4. The library wastes potential small blocks of size 64 and 128 bytes
   (plus overhead of 64 bytes) as padding when reusing free blocks or
   resizing allocated blocks.

This patch addresses those problems as follows:
1. Replace single free list with a handful of free lists. Each free
   list contains blocks of a specified size range, for example:
 list[0]: (0   , 2^8]
 list[1]: (2^8 , 2^10]
 list[2]: (2^10, 2^12]
 list[3]: (2^12, 2^14]
 list[4]: (2^14, MAX_SIZE]

   When allocating a block, start at the first list that can contain
   a big enough block. Search subsequent lists, if necessary.
   Terminate the search as soon as we find a block that is big enough.
   (A worked example of this index computation follows the list.)
2. Use doubly-linked lists, so that we can remove free blocks in
   constant time.
3. Use BSD LIST macros, as defined in sys/queue.h and the QUEUE(3)
   man page.
4. Change code to utilize small blocks of data size 64 and 128, when
   splitting larger blocks.
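
A worked pass through the index math (a standalone sketch mirroring
malloc_elem_free_list_index() with this patch's constants):

#include <stdio.h>

#define MINSIZE_LOG2   8
#define LOG2_INCREMENT 2
#define NUM_FREELISTS  5

static size_t free_list_index(size_t size)
{
    size_t log2, idx;

    if (size <= (1UL << MINSIZE_LOG2))
        return 0;
    /* log2 of the next power of two >= size */
    log2 = sizeof(size) * 8 - __builtin_clzl(size - 1);
    idx = (log2 - MINSIZE_LOG2 + LOG2_INCREMENT - 1) / LOG2_INCREMENT;
    return idx < NUM_FREELISTS - 1 ? idx : NUM_FREELISTS - 1;
}

int main(void)
{
    /* 300 -> next power of two is 512 = 2^9 -> list[1]: (2^8, 2^10] */
    printf("index(300)     = %zu\n", free_list_index(300));     /* 1 */
    printf("index(64)      = %zu\n", free_list_index(64));      /* 0 */
    printf("index(1 << 20) = %zu\n", free_list_index(1 << 20)); /* 4 */
    return 0;
}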

Signed-off-by: Robert Sanford 

---
 lib/librte_eal/common/include/rte_malloc_heap.h |6 +-
 lib/librte_malloc/malloc_elem.c |  127 +++
 lib/librte_malloc/malloc_elem.h |   17 +++-
 lib/librte_malloc/malloc_heap.c |   67 +---
 4 files changed, 131 insertions(+), 86 deletions(-)

diff --git a/lib/librte_eal/common/include/rte_malloc_heap.h b/lib/librte_eal/common/include/rte_malloc_heap.h
index fc4fd0a..f727b7a 100644
--- a/lib/librte_eal/common/include/rte_malloc_heap.h
+++ b/lib/librte_eal/common/include/rte_malloc_heap.h
@@ -35,14 +35,18 @@
 #define _RTE_MALLOC_HEAP_H_

 #include 
+#include <sys/queue.h>
 #include 

+/* Number of free lists per heap, grouped by size. */
+#define RTE_HEAP_NUM_FREELISTS  5
+
 /**
  * Structure to hold malloc heap
  */
 struct malloc_heap {
rte_spinlock_t lock;
-   struct malloc_elem * volatile free_head;
+   LIST_HEAD(, malloc_elem) free_head[RTE_HEAP_NUM_FREELISTS];
unsigned mz_count;
unsigned alloc_count;
size_t total_size;
diff --git a/lib/librte_malloc/malloc_elem.c b/lib/librte_malloc/malloc_elem.c
index 172da69..75a94d0 100644
--- a/lib/librte_malloc/malloc_elem.c
+++ b/lib/librte_malloc/malloc_elem.c
@@ -33,6 +33,7 @@
 #include <stdint.h>
 #include <stddef.h>
 #include <stdio.h>
+#include <sys/queue.h>
 #include <string.h>

 #include <rte_memory.h>
@@ -49,7 +50,7 @@
 #include "malloc_elem.h"
 #include "malloc_heap.h"

-#define MIN_DATA_SIZE (CACHE_LINE_SIZE * 2)
+#define MIN_DATA_SIZE (CACHE_LINE_SIZE)

 /*
  * initialise a general malloc_elem header structure
@@ -60,7 +61,8 @@ malloc_elem_init(struct malloc_elem *elem,
 {
elem->heap = heap;
elem->mz = mz;
-   elem->prev = elem->next_free = NULL;
+   elem->prev = NULL;
+   memset(&elem->free_list, 0, sizeof(elem->free_list));
elem->state = ELEM_FREE;
elem->size = size;
elem->pad = 0;
@@ -125,19 +127,76 @@ split_elem(struct malloc_elem *elem, struct malloc_elem 
*split_pt)
 }

 /*
+ * Given an element size, compute its freelist index.
+ * We free an element into the freelist containing similarly-sized elements.
+ * We try to allocate elements starting with the freelist containing
+ * similarly-sized elements, and if necessary, we search freelists
+ * containing larger elements.
+ *
+ * Example element size ranges for a heap with five free lists:
+ *   heap->free_head[0] - (0   , 2^8]
+ *   heap->free_head[1] - (2^8 , 2^10]
+ *   heap->free_head[2] - (2^10, 2^12]
+ *   heap->free_head[3] - (2^12, 2^14]
+ *   heap->free_head[4] - (2^14, MAX_SIZE]
+ */
+size_t
+malloc_elem_free_list_index(size_t size)
+{
+#define MALLOC_MINSIZE_LOG2   8
+#define MALLOC_LOG2_INCREMENT 2
+
+   size_t log2;
+   size_t index;
+
+   if (size <= (1UL << MALLOC_MINSIZE_LOG2))
+   return 0;
+
+   /* Find next power of 2 >= size. */
+   log2 = sizeof(size) * 8 - __builtin_clzl(size-1);
+
+   /* Compute freelist index, based on log2(size). */
+   index = (log2 - MALLOC_MINSIZE_LOG2 + MALLOC_LOG2_INCREMENT - 1) /
+   MALLOC_LOG2_INCREMENT;
+
+   return (index <= RTE_HEAP_NUM_FREELISTS-1?
+   index: RTE_HEAP_NUM_FREELISTS-1);
+}
+
+/*
+ * Add the specified element to its heap's free list.
+ */
+void
+malloc_elem_free_list_insert(struct malloc_elem *elem)
+{
+   size_t idx = malloc_elem_free_list_index(elem->size - MALLOC_ELEM_HEADER_LEN);
+
+   elem->state = ELEM_FREE;
+   LIST_INSERT_HEAD(&elem->heap->free_head[idx], elem, free_list);
+}
+
+/*
+ * Remove the 
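
As a quick check of the index arithmetic in malloc_elem_free_list_index()
above, here is an editorial restatement (not part of the patch) that prints
the bucket chosen for a few sample sizes:

#include <stdio.h>
#include <stddef.h>

#define MALLOC_MINSIZE_LOG2   8
#define MALLOC_LOG2_INCREMENT 2
#define NUM_FREELISTS         5

static size_t free_list_index(size_t size)
{
	size_t log2, index;

	if (size <= (1UL << MALLOC_MINSIZE_LOG2))
		return 0;

	/* Find next power of 2 >= size (GCC builtin, as in the patch). */
	log2 = sizeof(size) * 8 - __builtin_clzl(size - 1);

	index = (log2 - MALLOC_MINSIZE_LOG2 + MALLOC_LOG2_INCREMENT - 1) /
			MALLOC_LOG2_INCREMENT;
	return index < NUM_FREELISTS - 1 ? index : NUM_FREELISTS - 1;
}

int main(void)
{
	size_t samples[] = { 64, 256, 257, 4096, 4097, 1UL << 20 };
	unsigned i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("size %7zu -> free_head[%zu]\n",
		       samples[i], free_list_index(samples[i]));
	/* Prints buckets 0, 0, 1, 2, 3, 4, matching the documented ranges. */
	return 0;
}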

[dpdk-dev] [PATCH v3 0/2] malloc: fix malloc and free linear complexity

2014-06-23 Thread Robert Sanford
Comments on previous versions of this patch:
http://dpdk.org/ml/archives/dev/2014-May/002297.html
http://dpdk.org/ml/archives/dev/2014-June/003518.html

Additional changes from original to v3:
* Reduce the minimum-sized block that we put on a free list when
  splitting a larger block, from 192 to 64. Although memory is
  plentiful, why waste 64 and 128-byte (plus overhead) blocks?

-#define MIN_DATA_SIZE (CACHE_LINE_SIZE * 2)
+#define MIN_DATA_SIZE (CACHE_LINE_SIZE)

-   if (old_elem_size <= MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE){
+   if (old_elem_size < MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE){

-   if (elem->size - new_size > MIN_DATA_SIZE + MALLOC_ELEM_OVERHEAD){
+   if (elem->size - new_size >= MIN_DATA_SIZE + MALLOC_ELEM_OVERHEAD){

Changes from v2 to v3:
* Change the size ranges of the five free lists per heap. The first
  list will effectively contain blocks of size [64,256].

-  *   heap->free_head[0] - (0   , 2^7]
-  *   heap->free_head[1] - (2^7 , 2^9]
-  *   heap->free_head[2] - (2^9 , 2^11]
-  *   heap->free_head[3] - (2^11, 2^13]
-  *   heap->free_head[4] - (2^13, MAX_SIZE]
+  *   heap->free_head[0] - (0   , 2^8]
+  *   heap->free_head[1] - (2^8 , 2^10]
+  *   heap->free_head[2] - (2^10, 2^12]
+  *   heap->free_head[3] - (2^12, 2^14]
+  *   heap->free_head[4] - (2^14, MAX_SIZE]

- #define MALLOC_MINSIZE_LOG2   7
+ #define MALLOC_MINSIZE_LOG2   8

* Fix typos and false assumptions in malloc unit tests. Even without
  linked-list enhancements to lib rte_malloc, malloc autotest fails
  every second (2nd) run.


[dpdk-dev] [PATCH v2] malloc: fix malloc and free linear complexity

2014-06-17 Thread Robert Sanford
Hi Pablo,

> Overall patch looks OK, but malloc unit tests fail on the last test
> (test_multi_alloc_statistics).
> Apparently, the biggest free chunk size changes as it allocates some
> memory, whereas in the previous implementation, this size did not change.
> I wonder if unit test is wrong or if there is actually an issue here.
> Could you look at this as well?

Thanks for your comments. Yes, I will investigate problems with the malloc
unit tests.

BTW, I intend to make one other adjustment to the previous (v2) set of
changes:

--- a/lib/librte_malloc/malloc_elem.c
+++ b/lib/librte_malloc/malloc_elem.c
@@ -143,7 +143,7 @@ split_elem(struct malloc_elem *elem, struct malloc_elem
*split_pt)
 size_t
 malloc_elem_free_list_index(size_t size)
 {
-#define MALLOC_MINSIZE_LOG2   7
+#define MALLOC_MINSIZE_LOG2   8
 #define MALLOC_LOG2_INCREMENT 2

--
Robert


[dpdk-dev] Question: Can't make pcap and refcnt to match

2013-11-26 Thread Robert Sanford
Hi Bruce,

We also found buffer overflow problems with the pcap driver: 1) the frame
may be longer than the mbuf; 2) the capture length (caplen) may be less
than the original packet length.

I've been meaning to submit a change, but I'm not familiar with the process.

Here is a diff of the relevant code in rte_eth_pcap.c:



  if (unlikely(mbuf == NULL))
  break;
- rte_memcpy(mbuf->pkt.data, packet, header.len);
- mbuf->pkt.data_len = (uint16_t)header.len;
- mbuf->pkt.pkt_len = mbuf->pkt.data_len;
+
+ /*
+ * Fix buffer overflow problems.
+ * 1. Frame may be longer than mbuf.
+ * 2. Capture length (caplen) may be less than original packet length.
+ */
+ uint16_t len = (uint16_t)header.caplen;
+ uint16_t tailroom = rte_pktmbuf_tailroom(mbuf);
+ if (len > tailroom)
+ len = tailroom;
+
+ /*
+ RTE_LOG(INFO, PMD, "eth_pcap_rx: i=%u caplen=%u framelen=%u tail=%u len=%u\n",
+ i, header.caplen, header.len, tailroom, len);
+ */
+
+ rte_memcpy(mbuf->pkt.data, packet, len);
+ mbuf->pkt.data_len = len;
+ mbuf->pkt.pkt_len = len;
+
  bufs[i] = mbuf;
  num_rx++;
  }



Regards,
Robert




On Tue, Nov 26, 2013 at 8:46 AM, Richardson, Bruce <
bruce.richardson at intel.com> wrote:

> Hi Mats,
>
> yes, you are right, there is an issue in the pcap driver that it is not
> allocating mbufs correctly. We are working on a fix.
>
> Regards,
> /Bruce
>
> > -----Original Message-----
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Mats Liljegren
> > Sent: Tuesday, November 26, 2013 1:07 PM
> > To: dev at dpdk.org
> > Subject: [dpdk-dev] Question: Can't make pcap and refcnt to match
> >
> > I have had stability problems when using pcap in my little application.
> > My application is a simple benchmark application that is trying to see
> > how much data I can send and receive.
> >
> > It has one lcore per NIC, where each lcore handles transmit and receive.
> > On the hardware, I make a loopback between two NICs, so the NICs are in
> > practice paired. I currently use 4 NICs and therefore 4 lcores. Port 0
> > sends to port 1 and vice versa. Port 2 sends to port 3 and vice versa.
> > One pair is using the DPDK hardware driver against a dual i350 NIC. The
> > other pair is using pcap against two of the four on-board NICs.
> >
> > When enabling everything saying "DEBUG" in its name in the .config file,
> > I get the following error:
> >
> > PMD: rte_eth_dev_config_restore: port 1: MAC address array not
> > supported
> > PMD: rte_eth_promiscuous_disable: Function not supported
> > PMD: rte_eth_allmulticast_disable: Function not supported
> > Speed: 1 Mbps, full duplex
> > Port 1 up and running.
> > PMD: e1000_put_hw_semaphore_generic():
> > e1000_put_hw_semaphore_generic PANIC in rte_mbuf_sanity_check():
> > bad ref cnt
> > PANIC in rte_mbuf_sanity_check():
> > bad ref cnt
> > PMD: e1000_release_phy_82575(): e1000_release_phy_82575
> > PMD: e1000_release_swfw_sync_82575():
> > e1000_release_swfw_sync_82575
> > PMD: e1000_get_hw_semaphore_generic():
> > e1000_get_hw_semaphore_generic
> > PMD: eth_igb_rx_queue_setup(): sw_ring=0x7fff776eefc0
> > hw_ring=0x7fff76830480 dma_addr=0x464630480
> >
> > PMD: e1000_put_hw_semaphore_generic():
> > e1000_put_hw_semaphore_generic
> > PMD: To improve 1G driver performance, consider setting the TX WTHRESH
> > value to 4, 8, or 16.
> > PMD: eth_igb_tx_queue_setup(): sw_ring=0x7fff776ece40
> > hw_ring=0x7fff76840500 dma_addr=0x464640500
> >
> > PMD: eth_igb_start(): >>
> > PMD: e1000_read_phy_reg_82580(): e1000_read_phy_reg_82580
> > PMD: e1000_acquire_phy_82575(): e1000_acquire_phy_82575
> > PMD: e1000_acquire_swfw_sync_82575():
> > e1000_acquire_swfw_sync_82575
> > PMD: e1000_get_hw_semaphore_generic():
> > e1000_get_hw_semaphore_generic
> > PMD: e1000_get_cfg_done_82575(): e1000_get_cfg_done_82575
> > PMD: e1000_put_hw_semaphore_generic():
> > e1000_put_hw_semaphore_generic
> > PMD: e1000_read_phy_reg_mdic(): e1000_read_phy_reg_mdic
> > 9: [/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x772a89cd]]
> > 8: [/lib/x86_64-linux-gnu/libpthread.so.0(+0x7f6e) [0x7757df6e]]
> > 7: [/home/mlil/dpdk-demo/build/enea-demo(eal_thread_loop+0x1b9)
> > [0x492669]]
> > 6: [/home/mlil/dpdk-demo/build/enea-demo() [0x4150bc]]
> > 5: [/home/mlil/dpdk-demo/build/enea-demo() [0x414d0b]]
> > 4: [/home/mlil/dpdk-demo/build/enea-demo() [0x4116ef]]
> > 3: [/home/mlil/dpdk-demo/build/enea-
> > demo(rte_mbuf_sanity_check+0xa7) [0x484707]]
> > 2: [/home/mlil/dpdk-demo/build/enea-demo(__rte_panic+0xc1)
> > [0x40f788]]
> > 1: [/home/mlil/dpdk-demo/build/enea-demo(rte_dump_stack+0x18)
> > [0x493f68]]
> > PMD: e1000_release_phy_82575(): e1000_release_phy_82575
> > PMD: e1000_release_swfw_sync_82575():
> > e1000_release_swfw_sync_82575
> > PMD: e1000_get_hw_semaphore_generic():
> > e1000_get_hw_semaphore_generic
> >
> > I checked the source code for pcap, and in the file rte_eth_pcap.c,
> > function eth_pcap_rx(), I make the following observation:
> >
> > It pre-allocates a number of mbufs (64 to be exact). It then fills these
> 

[dpdk-dev] How to fight forwarding performance regression on large mempool sizes.

2013-09-20 Thread Robert Sanford
One more point, if you're not doing this already: Allocate 2^N-1 mbufs, not
2^N. According to the code and comments: "The optimum size (in terms of
memory usage) for a mempool is when n is a power of two minus one: n = (2^q
- 1)."

The reason: rte_mempool_create(... n ...) invokes rte_ring_create(...
rte_align32power(n+1) ...), i.e., it wants to allocate n+1 slots in the
ring, but it has to round that up to the next power of 2. So, when you
allocate 2^N bufs, its ring comes with twice as many slots (2^(N+1)).
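
A quick sketch of that arithmetic (align32pow2() below re-implements the
same next-power-of-2 rounding locally; it is not the DPDK function itself):

#include <stdio.h>
#include <stdint.h>

/* Round x up to the next power of 2 >= x. */
static uint32_t align32pow2(uint32_t x)
{
	x--;
	x |= x >> 1;
	x |= x >> 2;
	x |= x >> 4;
	x |= x >> 8;
	x |= x >> 16;
	return x + 1;
}

int main(void)
{
	/* Ring slots for a mempool of n mbufs: next power of 2 >= n + 1. */
	printf("n = 2^18     -> %u ring slots\n", align32pow2((1u << 18) + 1));
	printf("n = 2^18 - 1 -> %u ring slots\n", align32pow2(1u << 18));
	/* 524288 vs. 262144: the 2^N pool pays for twice the ring. */
	return 0;
}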

--
Robert



On Fri, Sep 20, 2013 at 2:48 AM, Dmitry Vyal  wrote:

> On 09/19/2013 11:39 PM, Robert Sanford wrote:
>
>> Hi Dmitry,
>>
>> The biggest drop-off seems to be from size 128K to 256K. Are you using
>> 1GB huge pages already (rather than 2MB)?
>>
>> I would think that it would not use over 1GB until you ask for 512K mbufs
>> or more.
>>
>>
> Hi Robert,
>
> Yes, I've been using 1GB pages for a while. My L3 cache is 20MB and mbufs
> are 2240 bytes in size. So something strange indeed happens when we move
> from ~200MB to ~400MB. Any ideas?
>
> Regards,
> Dmitry
>
>


[dpdk-dev] How to fight forwarding performance regression on large mempool sizes.

2013-09-20 Thread Robert Sanford
It might be interesting to see the start/end addresses of the 256K-item (256
* 1024 * 2240 = 560 MB) mbuf memory pool. Maybe it's the first size that
straddles two 1GB pages.

Perhaps you should try a tool that reports cache misses, TLB misses, and
related statistics. I don't know much about this area, yet, but this looks
like a good starting place:

http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
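
As a first sanity check, one could test whether the pool's address range
crosses a 1GB hugepage boundary. A minimal sketch; the base address below
is made up for illustration:

#include <stdio.h>
#include <stdint.h>

#define HUGEPAGE_1G (1ULL << 30)

/* Does [base, base + len) cross a 1GB boundary? */
static int straddles_1g(uint64_t base, uint64_t len)
{
	return (base / HUGEPAGE_1G) != ((base + len - 1) / HUGEPAGE_1G);
}

int main(void)
{
	uint64_t len = 256ULL * 1024 * 2240;	/* 256K mbufs of 2240 B, ~560 MB */
	uint64_t base = 0x2ff00000ULL;		/* hypothetical pool start */

	printf("pool straddles a 1GB page: %s\n",
	       straddles_1g(base, len) ? "yes" : "no");
	return 0;
}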

--
Regards,
Robert



On Fri, Sep 20, 2013 at 2:48 AM, Dmitry Vyal  wrote:

> On 09/19/2013 11:39 PM, Robert Sanford wrote:
>
>> Hi Dmitry,
>>
>> The biggest drop-off seems to be from size 128K to 256K. Are you using
>> 1GB huge pages already (rather than 2MB)?
>>
>> I would think that it would not use over 1GB until you ask for 512K mbufs
>> or more.
>>
>>
> Hi Robert,
>
> Yes, I've been using 1GB pages for a while. My L3 cache is 20MB and mbufs
> are 2240 bytes in size. So something strange indeed happens when we move
> from ~200MB to ~400MB. Any ideas?
>
> Regards,
> Dmitry
>
>