Re: [PATCH] scm: fold __scm_send() into scm_send()

2006-03-21 Thread Stephen Smalley
On Tue, 2006-03-21 at 08:32 -0500, Stephen Smalley wrote:
> > I don't expect security_sk_sid() to be terribly expensive.  It's not
> > an AVC check, it's just propagating a label.  But I've not done any
> > benchmarking on that.
>
> No permission check there, but it looks like it does read lock
> sk_callback_lock.  Not sure if that is truly justified here.

Ah, that is because it is also called from the xfrm code, introduced by
Trent's patches.  But that locking shouldn't be necessary from scm_send,
right?  So she likely wants a separate hook for it to avoid that
overhead, or even just a direct SELinux interface?
  
-- 
Stephen Smalley
National Security Agency

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Results WAS(Re: [PATCH] TC: bug fixes to the sample clause

2006-03-21 Thread jamal
On Tue, 2006-21-03 at 09:35 +1000, Russell Stuart wrote:

> Jeezz, that pisses me off.  What is it with the bloody
> internet?  This isn't the first time this has happened.
> The page you are accessing is in the US for gods sake.
> It seems like the internet has walled off islands on
> occasions.  I have mirrored it:

Thanks - I accessed it. 

[..]

> By the way, with the analysis I didn't go out of my
> way to find a dataset where 2.4 ran faster - it was
> just the second one I tried.  There are pathological
> fake datasets that perform much worse than the one
> in the analysis, and presumably real ones too.

Sorry Russell - that still doesn't cut it. When you design
something like a route lookup algorithm, for example, you don't
pick one over another based on a set of IP addresses you happen to have.
Designing for the worst-case scenario is acceptable, always. An
observation such as "this is going to run on the edge/core of a network,
therefore I will optimize for that case" is also acceptable. Yours doesn't
fit either. I haven't run your test data, but I am willing to bet (though
unenthusiastic to try, since we've spent too much time on this) that the
better results you are getting are due to biasing, so that the better
algorithm puts things in some buckets more than others; i.e. it has
nothing to do with the hash selection. When the environment changes, such
results will change as well. Nothing is ever going to save you from 25-75%
of your buckets never being used in the case of 2.4. At the risk of
sounding like a broken record: I don't see anything of value that the 2.4
algorithm brings, other than in the case of 256 buckets with masks which
ensure all 256 buckets get used. So, as a performance bigot, I equally
don't value adding those extra computations; trust me, if I was
semi-convinced I would have supported the change.
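
A claim about unused buckets can be checked against any key set by counting
how many buckets a hash actually reaches. The following is a minimal sketch
only: mask_shift_hash() is a hypothetical stand-in and is not taken from
either the 2.4 or the 2.6 classifier code, and the 256-bucket table size is
an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 256

/* Hypothetical hash: mask the key, then shift the masked bits down. */
static unsigned mask_shift_hash(uint32_t key, uint32_t mask, unsigned shift)
{
	return ((key & mask) >> shift) % NBUCKETS;
}

/* Count how many of the NBUCKETS buckets receive at least one key. */
static unsigned buckets_used(const uint32_t *keys, size_t n,
			     uint32_t mask, unsigned shift)
{
	unsigned char seen[NBUCKETS] = { 0 };
	unsigned used = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned b = mask_shift_hash(keys[i], mask, shift);

		if (!seen[b]) {
			seen[b] = 1;
			used++;
		}
	}
	return used;
}
```

Swapping in another hash function and running both over the same data set
gives the bucket utilization for each, independent of lookup timing.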

The impression I have is that you are an energetic, resourceful person -
let's move on (drop this) to that other thing you said you wanted to talk
about. I could look at the way you have arranged your tables
and offer an opinion.

cheers,
jamal



[Comment] sizeof(struct tcp_sock) is above 1024 on x86 since linux-2.6.15

2006-03-21 Thread Eric Dumazet

Hi all

I would like to point out that struct tcp_sock was enlarged in 2.6.15, and the
'TCP' kmem_cache now needs order-1 allocations instead of order-0.



In 2.6.14 :

# grep ^TCP /proc/slabinfo
TCP   64   76   960   4   1 : tunables   54   27   0 : slabdata   19   19   0


In 2.6.16 / 2.6.15 :

# grep ^TCP /proc/slabinfo
TCP   16   28   1152   7   2 : tunables   24   12   8 : slabdata   4   4   0



This is a new point of failure for x86 machines that use a lot of TCP sockets.
I learned it the hard way and had to revert some servers to 2.6.14, because
they cannot run stock 2.6.15/2.6.16 for long with this problem.
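
Reading the slabinfo figures above as object sizes of 960 bytes (2.6.14) and
1152 bytes (2.6.15/2.6.16), the jump to two pages per slab can be reproduced
with a simplified model of a slab sizing rule. This is a sketch under stated
assumptions, not the kernel's exact code: it assumes 4096-byte pages, accepts
the first order whose wasted space is at most 1/8 of the slab, and ignores
per-slab management overhead and alignment.

```c
#include <stddef.h>

/*
 * Simplified model (an assumption, not the kernel's exact code): pick the
 * smallest page order whose leftover space is at most 1/8 of the slab.
 * Per-slab management overhead and object alignment are ignored.
 */
static int slab_order(size_t obj_size, size_t page_size, int max_order)
{
	int order;

	for (order = 0; order <= max_order; order++) {
		size_t slab_size = page_size << order;
		size_t objs = slab_size / obj_size;
		size_t left_over = slab_size - objs * obj_size;

		if (objs > 0 && left_over * 8 <= slab_size)
			return order;
	}
	return max_order;
}
```

With these assumptions, a 960-byte object packs four to a page with 256 bytes
left over, while a 1152-byte object leaves 640 bytes unused in a single page,
which pushes the model to a two-page slab holding seven objects.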


Of course, we might argue the problem comes from Linux memory management...
Oh well...

Thank you
Eric Dumazet




MPLS extension for pktgen

2006-03-21 Thread Steven Whitehouse
Hi,

I've been looking into MPLS recently and so one of the first things that
would be useful is a testbed to generate test traffic, and hence the
attached patch to pktgen.

If you have a moment to look over it, then please let me know if you
would give it your blessing. The patch is against davem's current
net-2.6.17 tree.
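
For readers decoding the label values used in the patch's documentation
(where 0001000a is described as label 16 and 0002000a as label 32), here is a
minimal sketch of the 32-bit label stack entry layout from RFC 3032: 20 bits
of label, 3 bits of traffic class, 1 bottom-of-stack bit and 8 bits of TTL.
The helper names mpls_entry() and mpls_label() are illustrative and are not
part of pktgen; values are in host byte order.

```c
#include <stdint.h>

#define MPLS_BOS 0x00000100u	/* bottom-of-stack (S) bit, host byte order */

/* Build one 32-bit label stack entry in host byte order:
 * label (20 bits) | traffic class (3 bits) | S (1 bit) | TTL (8 bits). */
static uint32_t mpls_entry(uint32_t label, unsigned tc, int bottom, unsigned ttl)
{
	return ((label & 0xfffffu) << 12) |
	       ((tc & 0x7u) << 9) |
	       (bottom ? MPLS_BOS : 0u) |
	       (ttl & 0xffu);
}

/* Extract the 20-bit label from an entry. */
static uint32_t mpls_label(uint32_t entry)
{
	return entry >> 12;
}
```

In this layout the trailing 0a in each example value is a TTL of 10.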


Steve.


--

diff --git a/Documentation/networking/pktgen.txt 
b/Documentation/networking/pktgen.txt
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.txt
@@ -109,6 +109,22 @@ Examples:
  cycle through the port range.
  pgset udp_dst_max 9   set UDP destination port max.
 
+ pgset mpls 0001000a,0002000a,000a set MPLS labels (in this example
+ outer label=16,middle label=32,
+inner label=0 (IPv4 NULL)) Note that
+there must be no spaces between the
+arguments. Leading zeros are required.
+Do not set the bottom of stack bit,
+thats done automatically. If you do
+set the bottom of stack bit, that
+indicates that you want to randomly
+generate that address and the flag
+MPLS_RND will be turned on. You
+can have any mix of random and fixed
+labels in the label stack.
+
+ pgset mpls 0  turn off mpls (or any invalid argument works 
too!)
+
  pgset stop  aborts injection. Also, ^C aborts generator.
 
 
@@ -167,6 +183,8 @@ pkt_size 
 min_pkt_size
 max_pkt_size
 
+mpls
+
 udp_src_min
 udp_src_max
 
@@ -211,4 +229,4 @@ Grant Grundler for testing on IA-64 and 
 Stephen Hemminger, Andi Kleen, Dave Miller and many others.
 
 
-Good luck with the linux net-development.
\ No newline at end of file
+Good luck with the linux net-development.
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -106,6 +106,9 @@
  *
  * interruptible_sleep_on_timeout() replaced Nishanth Aravamudan [EMAIL 
PROTECTED] 
  * 050103
+ *
+ * MPLS support by Steven Whitehouse [EMAIL PROTECTED]
+ *
  */
 #include linux/sys.h
 #include linux/types.h
@@ -154,7 +157,7 @@
 #include asm/div64.h /* do_div */
 #include asm/timex.h
 
-#define VERSION  pktgen v2.66: Packet Generator for packet performance 
testing.\n
+#define VERSION  pktgen v2.67: Packet Generator for packet performance 
testing.\n
 
 /* #define PG_DEBUG(a) a */
 #define PG_DEBUG(a)
@@ -162,6 +165,8 @@
 /* The buckets are exponential in 'width' */
 #define LAT_BUCKETS_MAX 32
 #define IP_NAME_SZ 32
+#define MAX_MPLS_LABELS 16 /* This is the max label stack depth */
+#define MPLS_STACK_BOTTOM __constant_htonl(0x0100)
 
 /* Device flag bits */
 #define F_IPSRC_RND   (10)   /* IP-Src Random  */
@@ -172,6 +177,7 @@
 #define F_MACDST_RND  (15)   /* MAC-Dst Random */
 #define F_TXSIZE_RND  (16)   /* Transmit size is random */
 #define F_IPV6(17)   /* Interface in IPV6 Mode */
+#define F_MPLS_RND(18)   /* Random MPLS labels */
 
 /* Thread control flag bits */
 #define T_TERMINATE   (10)
@@ -278,6 +284,10 @@ struct pktgen_dev {
__u16 udp_dst_min;  /* inclusive, dest UDP port */
__u16 udp_dst_max;  /* exclusive, dest UDP port */
 
+   /* MPLS */
+   unsigned nr_labels; /* Depth of stack, 0 = no MPLS */
+   __be32 labels[MAX_MPLS_LABELS];
+
__u32 src_mac_count;/* How many MACs to iterate through */
__u32 dst_mac_count;/* How many MACs to iterate through */
 
@@ -623,9 +633,19 @@ static int pktgen_if_show(struct seq_fil
   pkt_dev-udp_dst_min, pkt_dev-udp_dst_max);
 
seq_printf(seq,
-   src_mac_count: %d  dst_mac_count: %d \n Flags: ,
+   src_mac_count: %d  dst_mac_count: %d\n,
   pkt_dev-src_mac_count, pkt_dev-dst_mac_count);
 
+   if (pkt_dev-nr_labels) {
+   unsigned i;
+   seq_printf(seq,  mpls: );
+   for(i = 0; i  pkt_dev-nr_labels; i++)
+   seq_printf(seq, %08x%s, ntohl(pkt_dev-labels[i]),
+  i == pkt_dev-nr_labels-1 ? \n : , );
+   }
+
+   seq_printf(seq,  Flags: );
+
if (pkt_dev-flags  F_IPV6)
seq_printf(seq, IPV6  );
 
@@ -644,6 +664,9 @@ static int pktgen_if_show(struct seq_fil
if (pkt_dev-flags  F_UDPDST_RND)
seq_printf(seq, UDPDST_RND  );
 
+   if (pkt_dev-flags  F_MPLS_RND)
+   seq_printf(seq,  MPLS_RND  );
+
if (pkt_dev-flags  

Question about TCP behavior

2006-03-21 Thread Patrick Klos
Hello all,

I am trying to figure out what is causing a change in behavior of the TCP
stack on Linux.  I have a very simple test setup:

1)  Windows machine running a test app to request data from the server
2)  Linux (2.6.10 - yeah, I know... upgrade...) machine running test server
3)  Gigabit ethernet between the two machines via a Cisco switch

The Windows machine sends a request asking the Linux machine to send a
block of data containing 502,132 bytes.  The server on Linux makes a single
send() call with the entire buffer (this is to reduce the user-to-kernel
mode overhead of multiple calls).

If the Linux machine has just recently been booted, the transfer takes around
8 or 9 milliseconds.  If the Linux machine has been up for a while (but still
primarily idle), the transfer starts to take anywhere from 32 to 70
milliseconds.  Both the Windows machine and the Linux machine are for all
practical purposes idle and dedicated to this test process.  It seems the
Linux TCP stack is getting into a state where it decides to slow down the
pace of the transfer to the Windows machine?!?

When the transfer is fast, the time between frame sends is usually about
8 to 40 microseconds (with some variation).

When the transfer is slow, the time between frame sends starts off at a high
130 microseconds, then tapers down to 1/2 and/or 1/4 of that in a pattern
that looks too consistent to be random.  Here's the basic pattern:

(time between packet sends in microseconds)
130, 130, 130, 130, 130, 130, 130, 130, 68, 32, 32, 32, 32, 32

[   it's this pattern I'm hoping someone recognizes!  :o)  ]

After a group of packets are sent, the pattern starts again with a large
number then tapers down again and again until the entire transfer is done.

Questions going through my head:

1)  Is some metric on the interface being used to determine the initial
TCP transfer rate?
2)  Is this some form of slow start? (doesn't sound like it to me, but who
knows?).  If so, can I verify that?  Then turn it off (or not do
whatever is triggering it)??
3)  What mechanism of TCP might account for such a pattern of behavior?

I have a dump of the delta times between packets for the fast and slow case
with some packet information (frame size, TCP flags, start of TCP data).
Not to take advantage of this mailing list, I've put the verbose information
on the following web page:

http://www.klos.com/~patrick/TCPQuestion.html

Thanks for looking!

Patrick
= For LAN/WAN Protocol Analysis, check out PacketView Pro! =
Patrick Klos   Email: [EMAIL PROTECTED]
Network/Embedded Software Engineer Web:   http://www.klos.com/
Klos Technologies, Inc.Phone: 603-471-2547
 http://www.loving-long-island.com/ 


[Patch] mv643xx_eth: Cache align skb->data if CONFIG_NOT_COHERENT_CACHE

2006-03-21 Thread Dale Farnsworth
From: Dale Farnsworth [EMAIL PROTECTED]

When I/O is non-cache-coherent, we need to ensure that the I/O buffers
we use don't share cache lines with other data.

Signed-off-by: Dale Farnsworth [EMAIL PROTECTED]

---

This patch fixes red zone error messages that appear when CONFIG_SLAB_DEBUG=y.
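
The arithmetic behind aligning the start of a receive buffer to a cache line
can be sketched as follows. The helper names are hypothetical and this is not
the driver's code; the alignment is assumed to be a power of two, and the
size calculation mirrors the patch's approach of adding one full alignment
unit of slack.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes to skip so that addr lands on an 'align'-byte boundary
 * (align must be a power of two). */
static size_t dma_align_pad(uintptr_t addr, size_t align)
{
	size_t misalign = (size_t)(addr & (uintptr_t)(align - 1));

	return (align - misalign) & (align - 1);
}

/* Buffer size that leaves room for the worst-case padding above. */
static size_t rx_buf_size(size_t mtu, size_t wrapper_len, size_t align)
{
	return mtu + wrapper_len + align;
}
```

For example, with a 32-byte cache line a buffer starting at 0x1008 needs 24
bytes of padding, while one starting at 0x1000 needs none.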

 drivers/net/mv643xx_eth.h |   18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

Index: linux-2.6-mv643xx_enet/drivers/net/mv643xx_eth.h
===
--- linux-2.6-mv643xx_enet.orig/drivers/net/mv643xx_eth.h
+++ linux-2.6-mv643xx_enet/drivers/net/mv643xx_eth.h
@@ -42,13 +42,23 @@
 #define MAX_DESCS_PER_SKB  1
 #endif
 
+/*
+ * The MV643XX HW requires 8-byte alignment.  However, when I/O
+ * is non-cache-coherent, we need to ensure that the I/O buffers
+ * we use don't share cache lines with other data.
+ */
+#if defined(CONFIG_DMA_NONCOHERENT) || defined(CONFIG_NOT_COHERENT_CACHE)
+#define ETH_DMA_ALIGN  L1_CACHE_BYTES
+#else
+#define ETH_DMA_ALIGN  8
+#endif
+
 #define ETH_VLAN_HLEN          4
 #define ETH_FCS_LEN            4
-#define ETH_DMA_ALIGN          8       /* hw requires 8-byte alignment */
-#define ETH_HW_IP_ALIGN        2       /* hw aligns IP header */
+#define ETH_HW_IP_ALIGN        2       /* hw aligns IP header */
 #define ETH_WRAPPER_LEN        (ETH_HW_IP_ALIGN + ETH_HLEN + \
-                               ETH_VLAN_HLEN + ETH_FCS_LEN)
-#define ETH_RX_SKB_SIZE        ((dev->mtu + ETH_WRAPPER_LEN + 7) & ~0x7)
+                               ETH_VLAN_HLEN + ETH_FCS_LEN)
+#define ETH_RX_SKB_SIZE        (dev->mtu + ETH_WRAPPER_LEN + ETH_DMA_ALIGN)
 
 #define ETH_RX_QUEUES_ENABLED  (1 << 0)        /* use only Q0 for receive */
 #define ETH_TX_QUEUES_ENABLED  (1 << 0)        /* use only Q0 for transmit */


MPLS extension for pktgen

2006-03-21 Thread Robert Olsson

Steven Whitehouse writes:

 > I've been looking into MPLS recently and so one of the first things that
 > would be useful is a testbed to generate test traffic, and hence the
 > attached patch to pktgen.
 >
 > If you have a moment to look over it, then please let me know if you
 > would give it your blessing. The patch is against davem's current
 > net-2.6.17 tree,

 Nice. Well, I never thought about MPLS, but it seems possible too. With MPLS
 enabled it seems to send something my tcpdump does not understand, so I trust
 you there. And it does not seem to break standard IPv4 sending. So it
 should be OK.

 But I guess the mpls result code is not what you expected...

 echo "mpls 0001000a,0002000a,0000000a" > /proc/net/pktgen/eth1
 cat /proc/net/pktgen/eth1 | grep Res
 Result: 0000000a

  sprintf(pg_result, "OK: mpls=");
  for(n = 0; n < pkt_dev->nr_labels; n++)
   sprintf(pg_result, "%08x%s", ntohl(pkt_dev->labels[n]),
    n == pkt_dev->nr_labels-1 ? "" : ",");
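
 One likely reading of the result above: every sprintf() call in the quoted
 loop writes at the start of pg_result, so each label overwrites the
 previous output and only the last one survives. A minimal userspace
 sketch of accumulating into the buffer instead; format_labels() is an
 illustrative name, not a pktgen function, and the labels are taken in
 host byte order:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format labels as "OK: mpls=aaaaaaaa,bbbbbbbb" into buf.
 * Returns the length written, or -1 if buf is too small. */
static int format_labels(char *buf, size_t size,
			 const uint32_t *labels, unsigned nr_labels)
{
	size_t off;
	unsigned n;
	int ret;

	ret = snprintf(buf, size, "OK: mpls=");
	if (ret < 0 || (size_t)ret >= size)
		return -1;
	off = (size_t)ret;

	for (n = 0; n < nr_labels; n++) {
		/* append at the current offset instead of the buffer start */
		ret = snprintf(buf + off, size - off, "%08x%s",
			       (unsigned)labels[n],
			       n == nr_labels - 1 ? "" : ",");
		if (ret < 0 || (size_t)ret >= size - off)
			return -1;
		off += (size_t)ret;
	}
	return (int)off;
}
```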

 Cheers.
--ro



 


[PATCH 8/9] skge: handle pci errors better

2006-03-21 Thread Stephen Hemminger
When a PCI error occurs, try and report more info.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -2764,17 +2764,6 @@ static void skge_mac_parity(struct skge_
? GMF_CLI_TX_FC : GMF_CLI_TX_PE);
 }
 
-static void skge_pci_clear(struct skge_hw *hw)
-{
-   u16 status;
-
-   pci_read_config_word(hw-pdev, PCI_STATUS, status);
-   skge_write8(hw, B2_TST_CTRL1, TST_CFG_WRITE_ON);
-   pci_write_config_word(hw-pdev, PCI_STATUS,
- status | PCI_STATUS_ERROR_BITS);
-   skge_write8(hw, B2_TST_CTRL1, TST_CFG_WRITE_OFF);
-}
-
 static void skge_mac_intr(struct skge_hw *hw, int port)
 {
if (hw-chip_id == CHIP_ID_GENESIS)
@@ -2816,23 +2805,39 @@ static void skge_error_irq(struct skge_h
if (hwstatus  IS_M2_PAR_ERR)
skge_mac_parity(hw, 1);
 
-   if (hwstatus  IS_R1_PAR_ERR)
+   if (hwstatus  IS_R1_PAR_ERR) {
+   printk(KERN_ERR PFX %s: receive queue parity error\n,
+  hw-dev[0]-name);
skge_write32(hw, B0_R1_CSR, CSR_IRQ_CL_P);
+   }
 
-   if (hwstatus  IS_R2_PAR_ERR)
+   if (hwstatus  IS_R2_PAR_ERR) {
+   printk(KERN_ERR PFX %s: receive queue parity error\n,
+  hw-dev[1]-name);
skge_write32(hw, B0_R2_CSR, CSR_IRQ_CL_P);
+   }
 
if (hwstatus  (IS_IRQ_MST_ERR|IS_IRQ_STAT)) {
-   printk(KERN_ERR PFX hardware error detected (status 0x%x)\n,
-  hwstatus);
+   u16 pci_status, pci_cmd;
+
+   pci_read_config_word(hw-pdev, PCI_COMMAND, pci_cmd);
+   pci_read_config_word(hw-pdev, PCI_STATUS, pci_status);
 
-   skge_pci_clear(hw);
+   printk(KERN_ERR PFX %s: PCI error cmd=%#x status=%#x\n,
+  pci_name(hw-pdev), pci_cmd, pci_status);
+
+   /* Write the error bits back to clear them. */
+   pci_status = PCI_STATUS_ERROR_BITS;
+   skge_write8(hw, B2_TST_CTRL1, TST_CFG_WRITE_ON);
+   pci_write_config_word(hw-pdev, PCI_COMMAND,
+ pci_cmd | PCI_COMMAND_SERR | 
PCI_COMMAND_PARITY);
+   pci_write_config_word(hw-pdev, PCI_STATUS, pci_status);
+   skge_write8(hw, B2_TST_CTRL1, TST_CFG_WRITE_OFF);
 
/* if error still set then just ignore it */
hwstatus = skge_read32(hw, B0_HWE_ISRC);
if (hwstatus  IS_IRQ_STAT) {
-   pr_debug(IRQ status %x: still set ignoring hardware 
errors\n,
-  hwstatus);
+   printk(KERN_INFO PFX unable to clear error (so 
ignoring them)\n);
hw-intr_mask = ~IS_HW_ERR;
}
}
@@ -2998,7 +3003,7 @@ static const char *skge_board_name(const
 static int skge_reset(struct skge_hw *hw)
 {
u32 reg;
-   u16 ctst;
+   u16 ctst, pci_status;
u8 t8, mac_cfg, pmd_type, phy_type;
int i;
 
@@ -3009,8 +3014,13 @@ static int skge_reset(struct skge_hw *hw
skge_write8(hw, B0_CTST, CS_RST_CLR);
 
/* clear PCI errors, if any */
-   skge_pci_clear(hw);
+   skge_write8(hw, B2_TST_CTRL1, TST_CFG_WRITE_ON);
+   skge_write8(hw, B2_TST_CTRL2, 0);
 
+   pci_read_config_word(hw-pdev, PCI_STATUS, pci_status);
+   pci_write_config_word(hw-pdev, PCI_STATUS,
+ pci_status | PCI_STATUS_ERROR_BITS);
+   skge_write8(hw, B2_TST_CTRL1, TST_CFG_WRITE_OFF);
skge_write8(hw, B0_CTST, CS_MRST_CLR);
 
/* restore CLK_RUN bits (for Yukon-Lite) */
@@ -3377,7 +3387,6 @@ static void __devexit skge_remove(struct
 
skge_write32(hw, B0_IMSK, 0);
skge_write16(hw, B0_LED, LED_STAT_OFF);
-   skge_pci_clear(hw);
skge_write8(hw, B0_CTST, CS_RST_SET);
 
tasklet_kill(hw-ext_tasklet);

--
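
The "write the error bits back to clear them" step in the skge PCI error
patch relies on the PCI status register's error bits being
write-one-to-clear. A small userspace model of that behavior follows; the
0xf100 mask and the function names are assumptions made for this sketch, and
the handler is assumed to mask the status it read before writing it back.

```c
#include <stdint.h>

/* Assumed mask of write-one-to-clear error bits in a 16-bit status register. */
#define STATUS_ERROR_BITS 0xf100u

/* Model a write to a register whose error bits are write-one-to-clear:
 * every error bit written as 1 is cleared; other bits are unchanged here. */
static uint16_t w1c_write(uint16_t reg, uint16_t value)
{
	return (uint16_t)(reg & ~(value & STATUS_ERROR_BITS));
}

/* Clear only the error bits that are currently set by writing back the
 * masked status value. */
static uint16_t clear_errors(uint16_t reg)
{
	uint16_t status = (uint16_t)(reg & STATUS_ERROR_BITS);

	return w1c_write(reg, status);
}
```

Writing zeros leaves the register untouched, which is why the status must be
read first and its set bits written back.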



[PATCH 4/9] skge: dma configuration cleanup

2006-03-21 Thread Stephen Hemminger
Cleanup of the part of the code that sets up DMA configuration.
Should cause no real change in operation, just clearer.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -3251,22 +3251,18 @@ static int __devinit skge_probe(struct p
 
 	pci_set_master(pdev);
 
-	if (sizeof(dma_addr_t) > sizeof(u32) &&
-	    !(err = pci_set_dma_mask(pdev, DMA_64BIT_MASK))) {
+	if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) {
 		using_dac = 1;
 		err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK);
-		if (err < 0) {
-			printk(KERN_ERR PFX "%s unable to obtain 64 bit DMA "
-			       "for consistent allocations\n", pci_name(pdev));
-			goto err_out_free_regions;
-		}
-	} else {
-		err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
-		if (err) {
-			printk(KERN_ERR PFX "%s no usable DMA configuration\n",
-			       pci_name(pdev));
-			goto err_out_free_regions;
-		}
+	} else if (!(err = pci_set_dma_mask(pdev, DMA_32BIT_MASK))) {
+		using_dac = 0;
+		err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK);
+	}
+
+	if (err) {
+		printk(KERN_ERR PFX "%s no usable DMA configuration\n",
+		       pci_name(pdev));
+		goto err_out_free_regions;
 	}
 
 #ifdef __BIG_ENDIAN

--
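
The fallback order in the skge DMA configuration patch (try a 64-bit mask
first, then 32-bit, and fail only if neither is usable) can be modeled
outside the kernel. A minimal sketch, assuming hypothetical set_mask_fn
callbacks in place of pci_set_dma_mask() and pci_set_consistent_dma_mask();
the accept_* functions are made-up platforms for illustration.

```c
#include <stdint.h>

#define MASK_64BIT 0xffffffffffffffffULL
#define MASK_32BIT 0x00000000ffffffffULL

/* Hypothetical callback: returns 0 if the platform accepts the mask. */
typedef int (*set_mask_fn)(uint64_t mask);

/* Try 64-bit DMA first, then fall back to 32-bit.
 * Returns 0 on success and stores 1 in *using_dac when 64-bit was chosen. */
static int choose_dma_mask(set_mask_fn set_streaming,
			   set_mask_fn set_consistent, int *using_dac)
{
	int err;

	if (set_streaming(MASK_64BIT) == 0) {
		*using_dac = 1;
		err = set_consistent(MASK_64BIT);
	} else if ((err = set_streaming(MASK_32BIT)) == 0) {
		*using_dac = 0;
		err = set_consistent(MASK_32BIT);
	}
	return err;
}

/* Example platforms for experimenting with the fallback. */
static int accept_any(uint64_t mask) { (void)mask; return 0; }
static int accept_32bit_only(uint64_t mask) { return mask <= MASK_32BIT ? 0 : -1; }
static int accept_none(uint64_t mask) { (void)mask; return -1; }
```

A single error check after the if/else chain covers every failure path, which
is the simplification the patch description refers to.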



[PATCH 1/9] skge: use NAPI for tx cleanup.

2006-03-21 Thread Stephen Hemminger
Clean up transmit buffers using NAPI.  This allows the transmit routine
to leave interrupts enabled, and that improves performance.
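
The reclaim loop that this patch moves into the poll routine can be modeled
in userspace. The sketch below is single-threaded and leaves out locking,
interrupt acknowledgment and queue wakeup; the struct layout, OWN_BIT value
and names are assumptions for illustration and are not the driver's
definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define OWN_BIT 0x80000000u	/* assumed: set while the NIC owns the slot */

struct tx_slot {
	uint32_t control;	/* descriptor control word written by hw */
	int has_skb;		/* stand-in for a pending packet buffer */
};

struct tx_ring {
	struct tx_slot *slots;
	size_t count;
	size_t to_clean;	/* oldest slot not yet reclaimed */
	size_t to_use;		/* next slot the transmit path will fill */
	size_t avail;		/* free slots */
};

/* Reclaim finished slots from to_clean up to to_use; stop at the first
 * slot the hardware still owns. Returns the number of slots reclaimed. */
static size_t tx_done(struct tx_ring *r)
{
	size_t freed = 0;

	while (r->to_clean != r->to_use) {
		struct tx_slot *s = &r->slots[r->to_clean];

		if (s->control & OWN_BIT)
			break;
		s->has_skb = 0;
		r->avail++;
		freed++;
		r->to_clean = (r->to_clean + 1) % r->count;
	}
	return freed;
}
```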

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -2307,16 +2307,13 @@ static int skge_xmit_frame(struct sk_buf
int i;
u32 control, len;
u64 map;
-   unsigned long flags;
 
skb = skb_padto(skb, ETH_ZLEN);
if (!skb)
return NETDEV_TX_OK;
 
-   local_irq_save(flags);
if (!spin_trylock(skge-tx_lock)) {
/* Collision - tell upper layer to requeue */
-   local_irq_restore(flags);
return NETDEV_TX_LOCKED;
}
 
@@ -2327,7 +2324,7 @@ static int skge_xmit_frame(struct sk_buf
printk(KERN_WARNING PFX %s: ring full when queue 
awake!\n,
   dev-name);
}
-   spin_unlock_irqrestore(skge-tx_lock, flags);
+   spin_unlock(skge-tx_lock);
return NETDEV_TX_BUSY;
}
 
@@ -2403,7 +2400,7 @@ static int skge_xmit_frame(struct sk_buf
}
 
dev-trans_start = jiffies;
-   spin_unlock_irqrestore(skge-tx_lock, flags);
+   spin_unlock(skge-tx_lock);
 
return NETDEV_TX_OK;
 }
@@ -2416,7 +2413,7 @@ static inline void skge_tx_free(struct s
   pci_unmap_addr(e, mapaddr),
   pci_unmap_len(e, maplen),
   PCI_DMA_TODEVICE);
-   dev_kfree_skb_any(e-skb);
+   dev_kfree_skb(e-skb);
e-skb = NULL;
} else {
pci_unmap_page(hw-pdev,
@@ -2430,15 +2427,14 @@ static void skge_tx_clean(struct skge_po
 {
struct skge_ring *ring = skge-tx_ring;
struct skge_element *e;
-   unsigned long flags;
 
-   spin_lock_irqsave(skge-tx_lock, flags);
+   spin_lock_bh(skge-tx_lock);
for (e = ring-to_clean; e != ring-to_use; e = e-next) {
++skge-tx_avail;
skge_tx_free(skge-hw, e);
}
ring-to_clean = e;
-   spin_unlock_irqrestore(skge-tx_lock, flags);
+   spin_unlock_bh(skge-tx_lock);
 }
 
 static void skge_tx_timeout(struct net_device *dev)
@@ -2663,6 +2659,37 @@ resubmit:
return NULL;
 }
 
+static void skge_tx_done(struct skge_port *skge)
+{
+   struct skge_ring *ring = skge-tx_ring;
+   struct skge_element *e;
+
+   spin_lock(skge-tx_lock);
+   for (e = ring-to_clean; prefetch(e-next), e != ring-to_use; e = 
e-next) {
+   struct skge_tx_desc *td = e-desc;
+   u32 control;
+
+   rmb();
+   control = td-control;
+   if (control  BMU_OWN)
+   break;
+
+   if (unlikely(netif_msg_tx_done(skge)))
+   printk(KERN_DEBUG PFX %s: tx done slot %td status 
0x%x\n,
+  skge-netdev-name, e - ring-start, td-status);
+
+   skge_tx_free(skge-hw, e);
+   e-skb = NULL;
+   ++skge-tx_avail;
+   }
+   ring-to_clean = e;
+   skge_write8(skge-hw, Q_ADDR(txqaddr[skge-port], Q_CSR), CSR_IRQ_CL_F);
+
+   if (skge-tx_avail  MAX_SKB_FRAGS + 1)
+   netif_wake_queue(skge-netdev);
+
+   spin_unlock(skge-tx_lock);
+}
 
 static int skge_poll(struct net_device *dev, int *budget)
 {
@@ -2670,8 +2697,10 @@ static int skge_poll(struct net_device *
struct skge_hw *hw = skge-hw;
struct skge_ring *ring = skge-rx_ring;
struct skge_element *e;
-   unsigned int to_do = min(dev-quota, *budget);
-   unsigned int work_done = 0;
+   int to_do = min(dev-quota, *budget);
+   int work_done = 0;
+
+   skge_tx_done(skge);
 
for (e = ring-to_clean; prefetch(e-next), work_done  to_do; e = 
e-next) {
struct skge_rx_desc *rd = e-desc;
@@ -2714,40 +2743,6 @@ static int skge_poll(struct net_device *
return 0;
 }
 
-static inline void skge_tx_intr(struct net_device *dev)
-{
-   struct skge_port *skge = netdev_priv(dev);
-   struct skge_hw *hw = skge-hw;
-   struct skge_ring *ring = skge-tx_ring;
-   struct skge_element *e;
-
-   spin_lock(skge-tx_lock);
-   for (e = ring-to_clean; prefetch(e-next), e != ring-to_use; e = 
e-next) {
-   struct skge_tx_desc *td = e-desc;
-   u32 control;
-
-   rmb();
-   control = td-control;
-   if (control  BMU_OWN)
-   break;
-
-   if (unlikely(netif_msg_tx_done(skge)))
-   printk(KERN_DEBUG PFX %s: tx done slot %td status 
0x%x\n,
-  dev-name, e - ring-start, td-status);
-
-   skge_tx_free(hw, e);
-   e-skb = NULL;
-   ++skge-tx_avail;
-   }
-   ring-to_clean = e;
-   skge_write8(hw, 

[PATCH 9/9] skge: version 1.4

2006-03-21 Thread Stephen Hemminger
Update version number

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -44,7 +44,7 @@
 #include "skge.h"
 
 #define DRV_NAME               "skge"
-#define DRV_VERSION            "1.3"
+#define DRV_VERSION            "1.4"
 #define PFX                    DRV_NAME " "
 
 #define DEFAULT_TX_RING_SIZE   128

--



[PATCH 2/9] skge: use auto masking of irqs

2006-03-21 Thread Stephen Hemminger
Improve performance of skge driver by not touching irq mask
register as much. Since the interrupt source auto-masks, the driver
can just leave it disabled until the end of the soft irq.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -104,7 +104,6 @@ static const int txqaddr[] = { Q_XA1, Q_
 static const int rxqaddr[] = { Q_R1, Q_R2 };
 static const u32 rxirqmask[] = { IS_R1_F, IS_R2_F };
 static const u32 txirqmask[] = { IS_XA1_F, IS_XA2_F };
-static const u32 portirqmask[] = { IS_PORT_1, IS_PORT_2 };
 
 static int skge_get_regs_len(struct net_device *dev)
 {
@@ -2184,12 +2183,6 @@ static int skge_up(struct net_device *de
 
skge-tx_avail = skge-tx_ring.count - 1;
 
-   /* Enable IRQ from port */
-   spin_lock_irq(hw-hw_lock);
-   hw-intr_mask |= portirqmask[port];
-   skge_write32(hw, B0_IMSK, hw-intr_mask);
-   spin_unlock_irq(hw-hw_lock);
-
/* Initialize MAC */
spin_lock_bh(hw-phy_lock);
if (hw-chip_id == CHIP_ID_GENESIS)
@@ -2246,11 +2239,6 @@ static int skge_down(struct net_device *
else
yukon_stop(skge);
 
-   spin_lock_irq(hw-hw_lock);
-   hw-intr_mask = ~portirqmask[skge-port];
-   skge_write32(hw, B0_IMSK, hw-intr_mask);
-   spin_unlock_irq(hw-hw_lock);
-
/* Stop transmitter */
skge_write8(hw, Q_ADDR(txqaddr[port], Q_CSR), CSR_STOP);
skge_write32(hw, RB_ADDR(txqaddr[port], RB_CTRL),
@@ -2734,11 +2722,9 @@ static int skge_poll(struct net_device *
if (work_done =  to_do)
return 1; /* not done */
 
-   spin_lock_irq(hw-hw_lock);
-   __netif_rx_complete(dev);
-   hw-intr_mask |= portirqmask[skge-port];
+   netif_rx_complete(dev);
+   hw-intr_mask |= skge-port == 0 ? (IS_R1_F|IS_XA1_F) : 
(IS_R2_F|IS_XA2_F);
skge_write32(hw, B0_IMSK, hw-intr_mask);
-   spin_unlock_irq(hw-hw_lock);
 
return 0;
 }
@@ -2850,12 +2836,11 @@ static void skge_extirq(unsigned long da
int port;
 
spin_lock(hw-phy_lock);
-   for (port = 0; port  2; port++) {
+   for (port = 0; port  hw-ports; port++) {
struct net_device *dev = hw-dev[port];
+   struct skge_port *skge = netdev_priv(dev);
 
-   if (dev  netif_running(dev)) {
-   struct skge_port *skge = netdev_priv(dev);
-
+   if (netif_running(dev)) {
if (hw-chip_id != CHIP_ID_GENESIS)
yukon_phy_intr(skge);
else
@@ -2864,21 +2849,25 @@ static void skge_extirq(unsigned long da
}
spin_unlock(hw-phy_lock);
 
-   spin_lock_irq(hw-hw_lock);
hw-intr_mask |= IS_EXT_REG;
skge_write32(hw, B0_IMSK, hw-intr_mask);
-   spin_unlock_irq(hw-hw_lock);
 }
 
 static irqreturn_t skge_intr(int irq, void *dev_id, struct pt_regs *regs)
 {
struct skge_hw *hw = dev_id;
-   u32 status = skge_read32(hw, B0_SP_ISRC);
+   u32 status;
 
-   if (status == 0 || status == ~0) /* hotplug or shared irq */
+   /* Reading this register masks IRQ */
+   status = skge_read32(hw, B0_SP_ISRC);
+   if (status == 0)
return IRQ_NONE;
 
-   spin_lock(hw-hw_lock);
+   if (status  IS_EXT_REG) {
+   hw-intr_mask = ~IS_EXT_REG;
+   tasklet_schedule(hw-ext_tasklet);
+   }
+
if (status  (IS_R1_F|IS_XA1_F)) {
skge_write8(hw, Q_ADDR(Q_R1, Q_CSR), CSR_IRQ_CL_F);
hw-intr_mask = ~(IS_R1_F|IS_XA1_F);
@@ -2891,6 +2880,9 @@ static irqreturn_t skge_intr(int irq, vo
netif_rx_schedule(hw-dev[1]);
}
 
+   if (likely((status  hw-intr_mask) == 0))
+   return IRQ_HANDLED;
+
if (status  IS_PA_TO_RX1) {
struct skge_port *skge = netdev_priv(hw-dev[0]);
++skge-net_stats.rx_over_errors;
@@ -2918,13 +2910,7 @@ static irqreturn_t skge_intr(int irq, vo
if (status  IS_HW_ERR)
skge_error_irq(hw);
 
-   if (status  IS_EXT_REG) {
-   hw-intr_mask = ~IS_EXT_REG;
-   tasklet_schedule(hw-ext_tasklet);
-   }
-
skge_write32(hw, B0_IMSK, hw-intr_mask);
-   spin_unlock(hw-hw_lock);
 
return IRQ_HANDLED;
 }
@@ -3070,7 +3056,10 @@ static int skge_reset(struct skge_hw *hw
else
hw-ram_size = t8 * 4096;
 
-   hw-intr_mask = IS_HW_ERR | IS_EXT_REG;
+   hw-intr_mask = IS_HW_ERR | IS_EXT_REG | IS_PORT_1;
+   if (hw-ports  1)
+   hw-intr_mask |= IS_PORT_2;
+
if (hw-chip_id == CHIP_ID_GENESIS)
genesis_init(hw);
else {
@@ -3293,7 +3282,6 @@ static int __devinit skge_probe(struct p
 
hw-pdev = pdev;
spin_lock_init(hw-phy_lock);
-   spin_lock_init(hw-hw_lock);
tasklet_init(hw-ext_tasklet, skge_extirq, (unsigned long) hw);
 
  

[PATCH 7/9] skge: formatting and whitespace cleanup

2006-03-21 Thread Stephen Hemminger
Reformat some code to make it easier to read, and fix whitespace.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -2177,15 +2177,17 @@ static int skge_up(struct net_device *de
 
memset(skge-mem, 0, skge-mem_size);
 
-   if ((err = skge_ring_alloc(skge-rx_ring, skge-mem, skge-dma)))
+   err = skge_ring_alloc(skge-rx_ring, skge-mem, skge-dma);
+   if (err)
goto free_pci_mem;
 
err = skge_rx_fill(skge);
if (err)
goto free_rx_ring;
 
-   if ((err = skge_ring_alloc(skge-tx_ring, skge-mem + rx_size,
-  skge-dma + rx_size)))
+   err = skge_ring_alloc(skge-tx_ring, skge-mem + rx_size,
+ skge-dma + rx_size);
+   if (err)
goto free_rx_ring;
 
skge-tx_avail = skge-tx_ring.count - 1;
@@ -2308,9 +2310,9 @@ static int skge_xmit_frame(struct sk_buf
return NETDEV_TX_OK;
 
if (!spin_trylock(skge-tx_lock)) {
-   /* Collision - tell upper layer to requeue */
-   return NETDEV_TX_LOCKED;
-   }
+   /* Collision - tell upper layer to requeue */
+   return NETDEV_TX_LOCKED;
+   }
 
if (unlikely(skge-tx_avail  skb_shinfo(skb)-nr_frags +1)) {
if (!netif_queue_stopped(dev)) {
@@ -2709,8 +2711,8 @@ static int skge_poll(struct net_device *
if (control  BMU_OWN)
break;
 
-   skb = skge_rx_get(skge, e, control, rd-status,
- le16_to_cpu(rd-csum2));
+   skb = skge_rx_get(skge, e, control, rd-status,
+ le16_to_cpu(rd-csum2));
if (likely(skb)) {
dev-last_rx = jiffies;
netif_receive_skb(skb);
@@ -3240,13 +3242,15 @@ static int __devinit skge_probe(struct p
struct skge_hw *hw;
int err, using_dac = 0;
 
-   if ((err = pci_enable_device(pdev))) {
+   err = pci_enable_device(pdev);
+   if (err) {
printk(KERN_ERR PFX %s cannot enable PCI device\n,
   pci_name(pdev));
goto err_out;
}
 
-   if ((err = pci_request_regions(pdev, DRV_NAME))) {
+   err = pci_request_regions(pdev, DRV_NAME);
+   if (err) {
printk(KERN_ERR PFX %s cannot obtain PCI resources\n,
   pci_name(pdev));
goto err_out_disable_pdev;
@@ -3298,7 +3302,8 @@ static int __devinit skge_probe(struct p
goto err_out_free_hw;
}
 
-   if ((err = request_irq(pdev-irq, skge_intr, SA_SHIRQ, DRV_NAME, hw))) {
+   err = request_irq(pdev-irq, skge_intr, SA_SHIRQ, DRV_NAME, hw);
+   if (err) {
printk(KERN_ERR PFX %s: cannot assign irq %d\n,
   pci_name(pdev), pdev-irq);
goto err_out_iounmap;
@@ -3316,7 +3321,8 @@ static int __devinit skge_probe(struct p
if ((dev = skge_devinit(hw, 0, using_dac)) == NULL)
goto err_out_led_off;
 
-   if ((err = register_netdev(dev))) {
+   err = register_netdev(dev);
+   if (err) {
printk(KERN_ERR PFX %s: cannot register net device\n,
   pci_name(pdev));
goto err_out_free_netdev;

--



[PATCH 3/9] skge: check the allocation of ring buffer

2006-03-21 Thread Stephen Hemminger
The SysKonnect Genesis and Yukon chip sets have restrictions on the possible
control block area.  The memory must not cross a 4 GB boundary, and it must
be 8-byte aligned.  This patch checks both and refuses to bring the device up
if the region is unacceptable.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -727,7 +727,7 @@ static struct ethtool_ops skge_ethtool_o
  * Allocate ring elements and chain them together
  * One-to-one association of board descriptors with ring elements
  */
-static int skge_ring_alloc(struct skge_ring *ring, void *vaddr, u64 base)
+static int skge_ring_alloc(struct skge_ring *ring, void *vaddr, u32 base)
 {
struct skge_tx_desc *d;
struct skge_element *e;
@@ -2168,6 +2168,14 @@ static int skge_up(struct net_device *de
if (!skge->mem)
return -ENOMEM;
 
+   BUG_ON(skge->dma & 7);
+
+   if ((u64)skge->dma >> 32 != ((u64) skge->dma + skge->mem_size) >> 32) {
+   printk(KERN_ERR PFX "pci_alloc_consistent region crosses 4G boundary\n");
+   err = -EINVAL;
+   goto free_pci_mem;
+   }
+
memset(skge->mem, 0, skge->mem_size);
 
if ((err = skge_ring_alloc(&skge->rx_ring, skge->mem, skge->dma)))
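
For illustration only (not part of the patch): a minimal userspace sketch of
the two constraints the description names, 8 byte alignment and not crossing
a 4 Gig boundary.  The name region_ok() and its parameters are hypothetical.
The sketch compares the upper 32 bits of the first and last byte, whereas the
patch compares against dma + mem_size (one past the end), so the patch is
slightly stricter for a region that ends exactly on the boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace sketch, not driver code: returns true when a DMA region
 * satisfies both constraints.  region_ok() is a hypothetical name.
 */
static bool region_ok(uint64_t dma, uint64_t size)
{
	assert(size > 0);

	/* The base address must be 8 byte aligned. */
	if (dma & 7)
		return false;

	/* First and last byte must share the same upper 32 address bits. */
	return (dma >> 32) == ((dma + size - 1) >> 32);
}
```

A caller would test the region once, right after allocation, and release it
again if the check fails.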

--



[PATCH 6/9] skge: use mmiowb

2006-03-21 Thread Stephen Hemminger
Add mmio barriers at the appropriate places.  I don't have a platform
that needs them, but this is where the documentation of the patch
says to add them.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -2394,9 +2394,11 @@ static int skge_xmit_frame(struct sk_buf
netif_stop_queue(dev);
}
 
-   dev->trans_start = jiffies;
+   mmiowb();
spin_unlock(&skge->tx_lock);
 
+   dev->trans_start = jiffies;
+
return NETDEV_TX_OK;
 }
 
@@ -2730,6 +2732,8 @@ static int skge_poll(struct net_device *
return 1; /* not done */
 
netif_rx_complete(dev);
+   mmiowb();
+
hw->intr_mask |= skge->port == 0 ? (IS_R1_F|IS_XA1_F) : (IS_R2_F|IS_XA2_F);
skge_write32(hw, B0_IMSK, hw->intr_mask);
 

--



[PATCH 5/9] skge: use kcalloc

2006-03-21 Thread Stephen Hemminger
Use kcalloc when allocating ring data structure.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -733,13 +733,12 @@ static int skge_ring_alloc(struct skge_r
struct skge_element *e;
int i;
 
-   ring->start = kmalloc(sizeof(*e)*ring->count, GFP_KERNEL);
+   ring->start = kcalloc(sizeof(*e), ring->count, GFP_KERNEL);
if (!ring->start)
return -ENOMEM;
 
for (i = 0, e = ring->start, d = vaddr; i < ring->count; i++, e++, d++) {
e->desc = d;
-   e->skb = NULL;
if (i == ring->count - 1) {
e->next = ring->start;
d->next_offset = base;
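
For illustration only (not part of the patch): a userspace sketch of the same
idea using calloc(), which zero-fills the array so the per-element NULL
assignment can be dropped.  struct elem and ring_alloc() are hypothetical
stand-ins, not the driver's types.  As far as I recall, kcalloc() takes its
arguments as (n, size, flags); the patch passes them the other way round,
which yields the same product.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a ring element; not the driver's struct. */
struct elem {
	struct elem *next;
	void *buf;	/* plays the role of e->skb */
};

/*
 * Allocate count zeroed elements and chain them into a circle.
 * calloc() zero-fills, so buf needs no explicit NULL assignment
 * (this relies on NULL being all-bits-zero, as the kernel does).
 * Returns NULL on allocation failure.
 */
static struct elem *ring_alloc(size_t count)
{
	struct elem *start;
	size_t i;

	assert(count > 0);
	start = calloc(count, sizeof(*start));	/* (nmemb, size) order */
	if (!start)
		return NULL;

	/* The last element points back at the first one. */
	for (i = 0; i < count; i++)
		start[i].next = &start[(i + 1) % count];
	return start;
}
```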

--



Re: [PATCH RESEND] RT2x00 update: trivial fixes

2006-03-21 Thread John W. Linville
On Tue, Feb 28, 2006 at 08:46:54PM +0100, Ivo van Doorn wrote:
 ieee80211_rx has been renamed __ieee80211_rx.
 Use DRV_NAME as much as possible instead of a separate name string.
 Add new USB device ID.

Merged to dscape branch of wireless-2.6...thanks!

John
-- 
John W. Linville
[EMAIL PROTECTED]


Re: [PATCH wireless-2.6 0/2] d80211: Devicescape 802.11 update

2006-03-21 Thread John W. Linville
On Fri, Mar 03, 2006 at 06:54:23PM -0800, Jouni Malinen wrote:
 Here's a couple of patches to the Devicescape 802.11 implementation.
 Please consider applying to the dscape branch of wireless-2.6 tree.

Merged to dscape branch of wireless-2.6...thanks!

John
-- 
John W. Linville
[EMAIL PROTECTED]


Re: [PATCH] wireless.git: update acxsm to 0.4.7

2006-03-21 Thread John W. Linville
On Wed, Mar 01, 2006 at 03:58:14PM +0200, Denis Vlasenko wrote:
 On Tuesday 28 February 2006 03:34, John W. Linville wrote:
  On Mon, Feb 27, 2006 at 11:44:38AM +0100, Carlos Martín wrote:
   On Monday 27 February 2006 11:20, Denis Vlasenko wrote:
 Comments are welcome and I'll split the patch if needed.
  
  Denis are you applying this patch to your tree?  If so, I'll rely on
  you to push it to me when you are ready.
  
  If not, then I will need Carlos to generate the diffs so that they
  can be applied to the top of the tree with -p1.
  
  http://linux.yyz.us/patch-format.html
 
 Changelog:

Merged to softmac branch of wireless-2.6...thanks!

John
-- 
John W. Linville
[EMAIL PROTECTED]


Re: [Patch] mv643xx_eth: Cache align skb-data if CONFIG_NOT_COHERENT_CACHE

2006-03-21 Thread Jeff Garzik

Dale Farnsworth wrote:

From: Dale Farnsworth [EMAIL PROTECTED]

When I/O is non-cache-coherent, we need to ensure that the I/O buffers
we use don't share cache lines with other data.

Signed-off-by: Dale Farnsworth [EMAIL PROTECTED]


applied




Re: [PATCH 1/9] skge: use NAPI for tx cleanup.

2006-03-21 Thread Jeff Garzik

Stephen Hemminger wrote:

Cleanup transmit buffers using NAPI.  This allows the transmit routine
to leave interrupts enabled, and that improves performance.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


applied 1-9




Re: [PATCH 1/9] sky2: remove support for untested Yukon EC/rev 0

2006-03-21 Thread Jeff Garzik

Stephen Hemminger wrote:

The Yukon EC/rev0 (A1) chipset requires a bunch of workarounds, which I copied
from sk98lin.  They never got tested and add more cruft to the code, and any
attempt at using the driver as is on this version will probably fail.

It looks like this was an early engineering sample chip revision.  If it ever
shows up on a real system, produce an error message.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


applied 1-9




Re: [PATCH] smc91x: allow for dynamic bus access configs

2006-03-21 Thread Jeff Garzik

Nicolas Pitre wrote:

The accessors' different methods are now selected with C code, and unused
ones are statically optimized away at compile time instead of being selected
with #if's and #ifdef's.  This has many advantages such as allowing the
compiler to validate the syntax of the whole code, making it cleaner and
easier to understand, and ultimately allowing people to define
configuration symbols in terms of variables if they really want to
dynamically support multiple bus configurations at the same time (with
the unavoidable performance cost).

Signed-off-by: Nicolas Pitre [EMAIL PROTECTED]


applied





[PATCH] sis900 adm7001 PHY support

2006-03-21 Thread Artur Skawina

This patch is required to get the ethernet on a SIS964 based motherboard
(FSC D1875) working.  (Picking the #1 transceiver instead of the last one when
no known ones were found might be a better default, and would have worked in
this case too.)

Signed-off-by: Artur Skawina [EMAIL PROTECTED]

--- v2.6.16/drivers/net/sis900.c 2006-03-21 21:14:37.0 +0100
+++ v2.6.16-dtnode/drivers/net/sis900.c 2006-03-21 02:53:54.0 +0100
@@ -128,6 +128,7 @@ static struct mii_chip_info {
{ "SiS 900 Internal MII PHY",   0x001d, 0x8000, LAN },
{ "SiS 7014 Physical Layer Solution",   0x0016, 0xf830, LAN },
{ "Altimata AC101LF PHY",   0x0022, 0x5520, LAN },
+   { "ADM 7001 LAN PHY",   0x002e, 0xcc60, LAN },
{ "AMD 79C901 10BASE-T PHY",0x, 0x6B70, LAN },
{ "AMD 79C901 HomePNA PHY", 0x, 0x6B90, HOME},
{ "ICS LAN PHY",0x0015, 0xF440, LAN },


Re: [PATCH 2.6.16-rc6 0/3] MAINTAINERS, e100 and e1000 text file updates

2006-03-21 Thread Jeff Garzik

Jesse Brandeburg wrote:

okay, here goes... these patches are against Linus's current tree.  They only
update text files, no code updates.  The large change to e1000.txt includes
whitespace changes, and some content.  They could be included with 2.6.16
as they are for the drivers that are already merged.

Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED]

---

The following changes since commit a488edc914aa1d766a4e2c982b5ae03d5657ec1b:
are found in the git repository at:

  git://198.78.49.142/~jbrandeb/linux-2.6 e1000-fixes


pulled




Re: Please pull bcm43xx softmac-upstream and dscape-upstream branches

2006-03-21 Thread John W. Linville
On Sun, Mar 05, 2006 at 09:47:55PM +0100, Michael Buesch wrote:

 Please pull branches softmac-upstream and dscape-upstream
 from my repository at:
 git://bu3sch.de/wireless-2.6.git

Merged to softmac and dscape branches of wireless-2.6...thanks!

John
-- 
John W. Linville
[EMAIL PROTECTED]


Re: [PATCH 4/6 v2] IB: address translation to map IP toIB addresses (GIDs)

2006-03-21 Thread Roland Dreier
  +struct workqueue_struct *rdma_wq;
  +EXPORT_SYMBOL(rdma_wq);

Sean, I don't think I saw an answer when I asked you this before.  Why
is ib_addr exporting a workqueue?  Is there some sort of ordering
constraint that is forcing other modules to go through the same
workqueue for things?

This seems like a very fragile internal thing to be exposing, and I'm
wondering if there's a better way to handle it.

 - R.


Re: [openib-general] Re: [PATCH 4/6 v2] IB: address translation to map IP toIB addresses (GIDs)

2006-03-21 Thread Sean Hefty

Roland Dreier wrote:

  +struct workqueue_struct *rdma_wq;
  +EXPORT_SYMBOL(rdma_wq);

Sean, I don't think I saw an answer when I asked you this before.  Why
is ib_addr exporting a workqueue?  Is there some sort of ordering
constraint that is forcing other modules to go through the same
workqueue for things?

This seems like a very fragile internal thing to be exposing, and I'm
wondering if there's a better way to handle it.


I responded in a different thread, but here's what I wrote:

This is simply an attempt to reduce/combine work queues used by the Infiniband 
code.  This keeps the threading a little simpler in the rdma_cm, since all 
callbacks are invoked using the same work queue.  (I'm also using this with the 
local SA/multicast code, but that's not ready for merging.)


There's no specific ordering constraint that's required.  We're just ending up 
with several Infiniband modules creating their own work queues (ib_mad, ib_cm, 
ib_addr, rdma_cm, plus a couple more in modules under development), and this is 
an attempt to reduce that.  If having separate work queues would work better, 
there shouldn't be anything that prevents this.


- Sean


Re: ES-API

2006-03-21 Thread Mark Butler

On Mon Mar 14, 2006, Christoph Hellwig wrote:
On Mon, Mar 13, 2006 at 02:25:08PM -0800, Zach Brown wrote:
 Hi guys,

 I'm hearing noise about the 'Extended Sockets' API in Oracle.  It's an
 extension to the socket API put together by an industry group that calls
 itself the Interconnect Software Consortium and is working in
 partnership with the open group.  The API adds support for things like
 memory registration, async operations completed through event queues,
 standard sendfile() and async poll(), etc.

It's a new bullshit standard from the crackmonkeyes at the opengroups
interconnect working group that already tried to push idiocies like RNICPI
onto us.  I already told them that they're on crack but they don't care.
It's never going to appear in Linux.

ES-API has relatively little to do with memory registration or the RDMA 
world view per se.  It is primarily a generic API for performing 
asynchronous socket I/O with completion notifications.   Considering 
there are no other cross platform standards for asynchronous socket 
operations, ES-API is rather unlikely to go away. 

Of course ES-API is a user level API, not a kernel level API. Linux
does not have to implement it at all for there to be working, generic
(no hardware required) Linux ES-API implementations.  All that is needed
is a working syscall interface for asynchronous socket operations, such
as an extension of io_submit / io_getevents to do asynchronous
connect(), shutdown(), sendmsg(), recvmsg(), setsockopt(), and
getsockopt() operations.  A library could easily translate ES-API calls
in the same manner as glibc translates POSIX API calls.


- Mark B.




Re: Results WAS(Re: [PATCH] TC: bug fixes to the sample clause

2006-03-21 Thread Stephen Hemminger
Back to the original question... 

What should the iproute2 utilities contain? 

Does it have to have the utsname hack to work?


Re: Results WAS(Re: [PATCH] TC: bug fixes to the sample clause

2006-03-21 Thread Russell Stuart
On Tue, 2006-03-21 at 09:57 -0500, jamal wrote:
 I accessed them - unfortunately, though i am trying to, I dont
 see anything outstanding that would justify any changes to the
 hash. Lets just drop this. We can talk about other things if you want.

If you still are not convinced, then I don't see that
I can convince you.  Fair enough.  Yes - I would like
to discuss other things.  It will take me some days to
prepare them so you will have a little peace and quiet
(from me anyway) for a short while.

I would like to take the opportunity to thank you for 
giving me such a fair hearing.  You have been polite
throughout, despite my persistence.  And you have 
always responded quickly.



Re: Results WAS(Re: [PATCH] TC: bug fixes to the sample clause

2006-03-21 Thread Russell Stuart
On Tue, 2006-03-21 at 14:39 -0800, Stephen Hemminger wrote:
 Back to the original question... 
 
 What should the iproute2 utilities contain? 
 
 Does it have to have the utsname hack to work?

Hi Stephen,

I think the resolution was:

  - No to the utsname hack.  Ergo the tc sample clause
won't work on 2.4.

Maybe tc using utsname to check the kernel version and
print out a warning / error if sample is used on 2.4
is acceptable?  I regard failing silently and producing
incorrect results as a terrible thing to do.  I could
produce another patch if this is OK.

  - Put the 2.6 hash algorithm in tc.  That is what my
previous patch did.  Jamal didn't like the patch
description though.  Perhaps he would prefer something
along the lines of "Changed u32 hashing algorithm used
by the 'sample clause' to the 2.6 kernel algorithm.
Currently it uses the 2.4 algorithm, which computes the
wrong result under some circumstances on 2.6 kernels.
This means tc sample will no longer work on 2.4."
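
For illustration only: a userspace sketch of how the two hash folds can
disagree.  The fold bodies below are written from my recollection of the 2.4
and 2.6 cls_u32 code and should be treated as an assumption to be checked
against the kernel source.  The names fold_24(), fold_26() and bucket() are
hypothetical, and byte order handling is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* 2.4-style fold, as recalled: XOR the upper bytes down into the low byte. */
static unsigned int fold_24(uint32_t key, uint32_t hmask)
{
	uint32_t h = key & hmask;

	h ^= h >> 16;
	h ^= h >> 8;
	return h;
}

/* 2.6-style fold, as recalled: shift the masked key down to bit 0. */
static unsigned int fold_26(uint32_t key, uint32_t hmask)
{
	unsigned int shift = 0;

	if (hmask == 0)
		return 0;
	/* Find the position of the lowest set bit in the mask. */
	while (!((hmask >> shift) & 1))
		shift++;
	return (key & hmask) >> shift;
}

/* Bucket index for a table with divisor buckets (a power of two). */
static unsigned int bucket(unsigned int fold, unsigned int divisor)
{
	assert(divisor && (divisor & (divisor - 1)) == 0);
	return fold & (divisor - 1);
}
```

With a mask that selects one aligned byte, such as 0x00ff0000, both folds pick
the same bucket.  With a mask that straddles a byte boundary, such as
0x00000ff0, they differ, which is the kind of case where a userspace tool using
one fold would compute a bucket the kernel never looks in.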




RE: [RFC/PATCH 6/13] d80211: remove obsolete stuff

2006-03-21 Thread Simon Barber
Yes, fully agreed - and the hardware's pre-beacon interrupt would cause
the beacon function to create a beacon frame and put it into the queue
(dev_queue_xmit on the master device). The beacon frame would then be
passed to the hardware through the normal run_queue that follows.

Simon 

-Original Message-
From: Jouni Malinen 
Sent: Wednesday, March 15, 2006 4:48 PM
To: Simon Barber
Cc: Jiri Benc; netdev@vger.kernel.org
Subject: Re: [RFC/PATCH 6/13] d80211: remove obsolete stuff

On Wed, Mar 15, 2006 at 04:41:56PM -0800, Simon Barber wrote:
 The more natural way for beacons to flow from the 80211.o to the low 
 level driver would be for beacons to be passed down just like any other
 802.11 frame is passed down - rather than having a special case for 
 beacons and buffered MC data, where they are pulled. I would suggest 
 making the qdisc aware of beacons, and then there is no special 
 interface for passing beacons down - they are passed down just like 
 other frames, with a special queue ID reserved for beacons and 
 buffered multicast.
 
 This would simplify the 80211.o/low level interface.

Sure, but it would also require good synchronization for sending the
beacons just before they are needed for transmission.. If the wlan
hardware implementation provides support for interrupts that request
beacons at proper times, being able to use them for this is quite
convenient.

-- 
Jouni Malinen                                            PGP id EFC895FA


Re: [iproute2] IPoIB link layer address bug

2006-03-21 Thread Stephen Hemminger
On Thu, 16 Mar 2006 17:24:41 -0500 (EST)
James Lentini [EMAIL PROTECTED] wrote:

 
 The ip(8) command has a bug when dealing with IPoIB link layer 
 addresses. Specifically it does not correctly handle the addition of 
 new entries in the neighbor/arp table. For example, this command will 
 fail:
 
 ip neigh add 192.168.0.138 lladdr 00:00:04:04:fe:80:00:00:00:00:00:00:00:01:73:00:00:00:8a:91 nud permanent dev ib0
 
 An IPoIB link layer address is 20 bytes (see
 http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-09.txt,
 section 9.1.1).
 
 The command line parsing code expects link layer addresses to be a 
 maximum of 16-bytes. Addresses over 16-bytes are truncated.
 
 This patch (against the iproute2 cvs repository) fixes the problem:
 

Okay, but there are a number of other places in iproute2 that call ll_addr_a2n()
with ifr.ifr_hwaddr.sa_data, and that is 14 bytes.  If you want to fix those
it will be harder since it would increase the sizeof(struct sockaddr) and
potentially break compatibility.
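
For illustration only: a minimal userspace sketch of parsing a colon
separated link layer address into a caller supplied buffer, rejecting input
that does not fit instead of truncating it.  parse_lladdr() and hexval() are
hypothetical helpers, not the iproute2 functions, and the sketch assumes
exactly two hex digits per byte.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(int c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse "aa:bb:cc..." into buf.  Returns the number of bytes written,
 * or -1 on a syntax error or when the address needs more than buflen
 * bytes (no silent truncation).
 */
static int parse_lladdr(unsigned char *buf, size_t buflen, const char *s)
{
	size_t n = 0;

	assert(buf && s);
	for (;;) {
		int hi = hexval((unsigned char)s[0]);
		int lo;

		if (hi < 0)
			return -1;
		lo = hexval((unsigned char)s[1]);
		if (lo < 0)
			return -1;	/* require two digits per byte */
		if (n == buflen)
			return -1;	/* address longer than the buffer */
		buf[n++] = (unsigned char)(hi << 4 | lo);
		s += 2;
		if (*s == '\0')
			return (int)n;
		if (*s != ':')
			return -1;
		s++;
	}
}
```

With a 20 byte buffer the IPoIB address from the report parses in full; with
a 16 byte buffer the same input is rejected rather than cut short.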


Re: shared abstractions (was Writing a rate based transport protocol)

2006-03-21 Thread Mark Butler

On Tue, 14 Mar 2006 11:37:38 -0300, Arnaldo Carvalho de Melo wrote:

   On 3/13/06, Saurabh Jain [EMAIL PROTECTED] wrote:
 Hi All,

 I am trying to write a new rate based transport protocol in linux
 kernel (either as a module or directly within the kernel). Basically
 it would be similar to UDP but with features like dynamic rate
 control, connection and state management, error control like TCP. Is
 there any established framework which i can use? I know there is one
 for window based protocols like TCP where one can dynamically register
 different congestion control mechanisms. I would appreciate if
 somebody can give me some direction in this regard.

   Look at how DCCP and TCP share code, using abstractions such as:
   struct inet_connection_sock
   struct inet_request_sock
   struct inet_timewait_sock
   struct inet_hashinfo

   I suggest too that you read my OLS 2004 paper:

One of the limitations of those abstractions is that they are not 
generic enough for SCTP to use them.  It is probably asking too much to 
generalize everything, but it would be nice if everything weren't bound 
so tightly to the idea of a one-to-one, single address, single path 
socket.  XFRM, the IP layer, the sock layer, the inet sock layer, and 
the inet connection sock layer all have that assumption hard coded into 
them in various ways.


For example, a one-to-many style SCTP socket is equivalent to a group of 
inet_connection_socks presenting a UDP style interface, one 
inet_connection_sock per SCTP association.  But since 
inet_connection_sock is a _sock_, it cannot be used as the base 
implementation for an SCTP association.


Similarly, struct sk_buff carries a destructor pointer, that is 
typically used to release memory to sk_wmem_alloc, but the destructor is 
called with a struct sock * argument, from skb-sk.  A more general 
implementation would replace skb-sk with a pointer to an intermediate 
abstraction or a void *, or add a destructor context argument. Currently 
in order to work correctly SCTP has to consider memory reclaimed as soon 
as it hits the IP layer, because flow control is done at the association 
level, not the socket level.


XFRM has the same problem - it only allows security policies to be 
overridden at the socket level, where an SCTP socket may handle 
thousands of associations, with independent security policies.  It would 
also be nice to share congestion control implementations - SCTP does 
congestion control on a path (transport) basis, not a per socket 
basis, and the congestion control interface would have to be similarly 
general purpose.


There are dozens of fields in struct sock, struct inet_sock, and struct 
inet6_sock that are superfluous overhead in the  SCTP case.  It would be 
better if struct sock etc were one level higher in abstraction, rather 
than carrying so much baggage from the TCP view of the world.  Perhaps 
two thirds of the current fields belong at lower levels of abstraction.  

In view of the goal of reducing the kernel footprint, such a 
re-factoring might be worth considering.


- Mark B.



Re: [openib-general] Re: [iproute2] IPoIB link layer address bug

2006-03-21 Thread Jason Gunthorpe
On Tue, Mar 21, 2006 at 03:56:17PM -0800, Stephen Hemminger wrote:

 Okay, but there are number of other places in iproute2 that call ll_addr_a2n()
 with ifr.ifr_hwaddr.sa_data. And that is 14 bytes.  If you want to fix those
 it will be harder since it would increase the sizeof(struct sockaddr) and
 potentially break compatibility.

Maybe the best thing is to upgrade ip (and/or netlink?) to use netlink
messages instead of ioctls for the remaining problematic operations.
Since netlink already supports an arbitrary length hwaddr there should
be no compatibility problem.

Just browsing I see usages of SIOCSIFHWBROADCAST, SIOCSIFHWADDR,
SIOCADDMULTI, SIOCDELMULTI and SIOCGIFHWADDR that use a struct ifreq..

I know SIOCGIFHWADDR can be done over netlink, but I'm not too
familiar with the others..

Jason


Re: [PATCH 6/6 v2] IB: userspace support for RDMA connection manager

2006-03-21 Thread Roland Dreier
I added this patch to the rdma_cm branch in my git tree.  When I was
doing that, I noticed that it builds rdma_ucm.ko unconditionally.  It
seems that we want this to depend on CONFIG_INFINIBAND_USER_ACCESS,
since that controls ib_uverbs.ko and ib_ucm.ko.

To do this I rejiggered the Kconfig and Makefile changes I made
before.  I made CONFIG_INFINIBAND_ADDR_TRANS into a bool (instead of a
tristate), so that it's 'y' if INFINIBAND and INET are on, and made
the top of the Makefile look like:

infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS)  := ib_addr.o rdma_cm.o
user_access-$(CONFIG_INFINIBAND_ADDR_TRANS) := rdma_ucm.o

obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \
ib_cm.o $(infiniband-y)
obj-$(CONFIG_INFINIBAND_USER_MAD) +=ib_umad.o
obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o $(user_access-y)

I'm pretty sure this does exactly what we want.

 - R.


Re: [PATCH] Broadcom Sibyte SB1xxx NAPI ethernet support

2006-03-21 Thread Tom Rix

Yes.
They are soon to follow.
Tom

On Sun, 19 Mar 2006 17:08:23 -0600, Lennert Buytenhek  
[EMAIL PROTECTED] wrote:



On Sun, Mar 19, 2006 at 05:12:32PM -0600, Tom Rix wrote:

This patch also has a fix to drivers/net/sb1250-mac.c, the dma descriptor
table ptr is allocated, aligned and the aligned ptr is freed.  If the ptr
was not already aligned (usually is) then the free would not work of what
was returned by the kmalloc. A variable was added to store the unaligned
pointer so that it could be properly freed.


Can you submit that as a separate patch?




--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/


[PATCH] Broadcom Sibyte SB1xxx save unaligned dma descriptor pointer fix

2006-03-21 Thread Tom Rix

This patch has a fix to drivers/net/sb1250-mac.c.  The dma descriptor
table ptr is allocated and aligned, and then the aligned ptr is freed.  If
the ptr was not already aligned (it usually is), the free would not be
passed what was returned by kmalloc.  A variable was added to store the
unaligned pointer so that it can be properly freed.
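
For illustration only (this is not the attached patch): a userspace sketch of
the pattern described above, keeping the pointer the allocator returned next
to the aligned pointer that is actually used.  struct dscr_table and the
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical descriptor-table holder; field names are illustrative. */
struct dscr_table {
	void *unaligned;	/* exactly what malloc() returned */
	void *aligned;		/* first align-boundary address in the block */
};

/* align must be a power of two.  Returns 0 on success, -1 on failure. */
static int table_alloc(struct dscr_table *t, size_t size, size_t align)
{
	uintptr_t p;

	assert(align && (align & (align - 1)) == 0);
	/* Over-allocate so an aligned start always fits in the block. */
	t->unaligned = malloc(size + align - 1);
	if (!t->unaligned)
		return -1;
	p = ((uintptr_t)t->unaligned + align - 1) & ~(uintptr_t)(align - 1);
	t->aligned = (void *)p;
	return 0;
}

static void table_free(struct dscr_table *t)
{
	free(t->unaligned);	/* never free(t->aligned) */
	t->unaligned = t->aligned = NULL;
}
```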

Tom

On Sun, 19 Mar 2006 17:08:23 -0600, Lennert Buytenhek  
[EMAIL PROTECTED] wrote:



On Sun, Mar 19, 2006 at 05:12:32PM -0600, Tom Rix wrote:

This patch also has a fix to drivers/net/sb1250-mac.c, the dma descriptor
table ptr is allocated, aligned and the aligned ptr is freed.  If the ptr
was not already aligned (usually is) then the free would not work of what
was returned by the kmalloc. A variable was added to store the unaligned
pointer so that it could be properly freed.


Can you submit that as a separate patch?




--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

mips-sb1250-mac-savedmaptr-1.patch
Description: Binary data


Re: [PATCH 2.6.16-rc6 1/1] ipw2200: Add Kconfig entries for QOS and Monitor mode

2006-03-21 Thread Zhu Yi
On Sat, 2006-03-18 at 18:47 +0100, Andreas Happe wrote:
 Adds Kconfig entries for enabling Monitor mode and Quality of service
 to the ipw2200 driver. It also renames the IPW_QOS define to
 IPW2200_QOS.
 
 As Monitor mode generates lots of firmware errors it depends upon
 BROKEN. QOS is under development, so it depends upon EXPERIMENTAL.

Ack the rename and QoS description changes.

The IPW2200_MONITOR and monitor mode firmware error are already fixed in
wireless-2.6 GIT
http://kernel.org/git/?p=linux/kernel/git/linville/wireless-2.6.git;a=summary

Wireless related development happens there. I'd suggest you create
patches against that tree.

Thanks,
-yi



Re: Writing a rate based transport protocol

2006-03-21 Thread Mark Butler

On Mon, 13 Mar 2006 18:20:26 -0600, Saurabh Jain wrote:


   Hi All, I am trying to write a new rate based transport protocol in
   linux kernel (either as a module or directly within the kernel).
   Basically it would be similar to UDP but with features like dynamic
   rate control, connection and state management, error control like
   TCP. Is there any established framework which i can use? I know
   there is one for window based protocols like TCP where one can
   dynamically register different congestion control mechanisms. I
   would appreciate if somebody can give me some direction in this regard.


I do not know what you have in mind,  but a general facility to transmit 
a series of packets at spaced intervals would be very useful to 
compensate for ack compression, etc.  Preferably a facility simple 
enough to be trivially offloaded to hardware.  TSO/LSO hardware could 
certainly use something similar for spacing segments, so breaking sends 
over a size (c.f. sysctl_tcp_tso_win_divisor) manually would not be 
necessary. 

In software one might implement this as an alternative queueing 
discipline at layer two. The minimum spacing interval could be obtained 
from a route attribute similar to  RTAX_ADVMSS.  Alternatively, a 
transport protocol might calculate the nominal transmission spacing as 
the RTT divided by the congestion window size in packets and run or 
share a similar transmission scheduler at layer 4.
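
For illustration only: a small sketch of the spacing calculation mentioned
above, the RTT divided by the congestion window size in packets.  The
function names and the choice of microseconds are assumptions made for the
example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Nominal gap between packet transmissions, in microseconds:
 * one round-trip time spread evenly over the congestion window.
 */
static uint32_t pacing_interval_us(uint32_t rtt_us, uint32_t cwnd_pkts)
{
	assert(cwnd_pkts > 0);
	return rtt_us / cwnd_pkts;
}

/* Departure time of the n-th packet (n = 0, 1, ...) of a burst starting at t0. */
static uint64_t departure_us(uint64_t t0_us, uint32_t n,
			     uint32_t rtt_us, uint32_t cwnd_pkts)
{
	assert(cwnd_pkts > 0);
	/* Multiply before dividing so rounding error does not accumulate. */
	return t0_us + (uint64_t)n * rtt_us / cwnd_pkts;
}
```

With a 100 ms RTT and a window of 20 packets the nominal gap is 5 ms, so the
fourth packet of a burst would leave 15 ms after the first.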


- Mark B.






[patch 1/2] net: Node aware multipath device round robin

2006-03-21 Thread Ravikiran G Thirumalai
The following patch adds node aware, device round robin IP multipathing.
It is derived from multipath_drr.c, the multipath device round robin algorithm.
This implementation maintains a per node state table and round robins between
interfaces on the same node.  The implementation needs to be aware of the NIC
proximity to a node, so we have added a node id field to struct net_device.
NIC device drivers can initialize it with the id of the node the NIC belongs
to.  Like the regular multipath_drr, this patch uses the IP_MP_ALG_DRR slot.
So either SMP multipath_drr or node aware multipath_node_drr should be used
for device round robin, depending on whether the system has
proximity information for the NICs.
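
For illustration only (not the code in this patch): a userspace sketch of
per node round robin selection, with one cursor per node so that each node
cycles through its local NICs independently.  struct nic, MAX_NODES and
pick_nic() are hypothetical names, and what to do when a node has no local
NIC is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4

struct nic {
	int node;	/* node the NIC is attached to */
};

/* One cursor per node: index of the NIC that node picked last. */
static size_t last_pick[MAX_NODES];

/*
 * Return the index of the next NIC on the given node in round-robin
 * order, or -1 if no NIC is local to that node.
 */
static int pick_nic(const struct nic *nics, size_t n, int node)
{
	size_t i;

	assert(node >= 0 && node < MAX_NODES);
	/* Scan at most n candidates, starting just after the last pick. */
	for (i = 1; i <= n; i++) {
		size_t idx = (last_pick[node] + i) % n;

		if (nics[idx].node == node) {
			last_pick[node] = idx;
			return (int)idx;
		}
	}
	return -1;
}
```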

Performance results:
1. Single NIC test -- 1 client targets 1 nic on the server with 300 concurrent
   requests.
2. 4 NIC test -- 1 client targets 4 nics, all on different nodes on the server,
   with 300 concurrent requests.

We see about 135% improvement on AB requests per second with this patch and
the device_locality_check patch on single NIC test, on the Rackable c5100
machine (server).  We see about 64% improvement when all 4 NICs are targeted.

Credits:  This work was originally done by Justin Forbes 

Comments?

Signed-off-by: Pravin B. Shelar [EMAIL PROTECTED]
Signed-off-by: Shobhit Dayal [EMAIL PROTECTED]
Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.16/drivers/net/e1000/e1000_main.c
===
--- linux-2.6.16.orig/drivers/net/e1000/e1000_main.c 2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/drivers/net/e1000/e1000_main.c 2006-03-20 14:52:23.0 -0800
@@ -692,6 +692,7 @@ e1000_probe(struct pci_dev *pdev,
 
SET_MODULE_OWNER(netdev);
SET_NETDEV_DEV(netdev, &pdev->dev);
+   SET_NETDEV_NODE(netdev, pcibus_to_node(pdev->bus));
 
pci_set_drvdata(pdev, netdev);
adapter = netdev_priv(netdev);
Index: linux-2.6.16/drivers/net/tg3.c
===
--- linux-2.6.16.orig/drivers/net/tg3.c 2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/drivers/net/tg3.c  2006-03-20 14:52:23.0 -0800
@@ -10705,6 +10705,7 @@ static int __devinit tg3_init_one(struct
 
SET_MODULE_OWNER(dev);
SET_NETDEV_DEV(dev, &pdev->dev);
+   SET_NETDEV_NODE(dev, pcibus_to_node(pdev->bus));
 
dev->features |= NETIF_F_LLTX;
 #if TG3_VLAN_TAG_USED
Index: linux-2.6.16/include/linux/netdevice.h
===
--- linux-2.6.16.orig/include/linux/netdevice.h 2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/include/linux/netdevice.h  2006-03-20 14:52:23.0 -0800
@@ -315,7 +315,9 @@ struct net_device
/* Interface index. Unique device identifier*/
int ifindex;
int iflink;
-
+#ifdef CONFIG_NUMA
+   int node;   /* NUMA node this IF is close to */
+#endif
 
struct net_device_stats* (*get_stats)(struct net_device *dev);
struct iw_statistics*   (*get_wireless_stats)(struct net_device *dev);
@@ -520,6 +522,14 @@ static inline void *netdev_priv(struct n
  */
 #define SET_NETDEV_DEV(net, pdev)  ((net)->class_dev.dev = (pdev))
 
+#ifdef CONFIG_NUMA
+#define SET_NETDEV_NODE(dev, nodeid)   ((dev)->node = (nodeid))
+#define netdev_node(dev)   ((dev)->node)
+#else
+#define SET_NETDEV_NODE(dev, nodeid)   do {} while (0)
+#define netdev_node(dev)   (-1)
+#endif
+
 struct packet_type {
__be16  type;   /* This is really htons(ether_type). */
struct net_device   *dev;   /* NULL is wildcarded here   */
Index: linux-2.6.16/net/core/dev.c
===
--- linux-2.6.16.orig/net/core/dev.c2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/net/core/dev.c 2006-03-20 14:52:23.0 -0800
@@ -3003,7 +3003,8 @@ struct net_device *alloc_netdev(int size
 
if (sizeof_priv)
dev->priv = netdev_priv(dev);
-
+   
+   SET_NETDEV_NODE(dev, -1);
setup(dev);
strcpy(dev->name, name);
return dev;
Index: linux-2.6.16/net/ipv4/Kconfig
===
--- linux-2.6.16.orig/net/ipv4/Kconfig  2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/net/ipv4/Kconfig   2006-03-20 14:52:23.0 -0800
@@ -164,6 +164,15 @@ config IP_ROUTE_MULTIPATH_DRR
  available interfaces. This policy makes sense if the connections 
  should be primarily distributed on interfaces and not on routes. 
 
+config IP_ROUTE_MULTIPATH_NODE
+   tristate "MULTIPATH: interface RR algorithm with node affinity"
+   depends on IP_ROUTE_MULTIPATH_CACHED && NUMA && !IP_ROUTE_MULTIPATH_DRR
+   help
+ This 

[patch 2/2] net: Node aware multipath device round robin -- device locality check

2006-03-21 Thread Ravikiran G Thirumalai
This patch checks device locality on every IP packet xmit.
In a multipath configuration, the TCP connection-to-route association is made
at session startup time.  The process owning the TCP session may be migrated
to a different node after this association.  This would mean a remote NIC is
chosen for xmit, although a local NIC could be available.  This leads to
remote NIC transfers in some TCP workloads such as AB.  The following patch
checks if a local NIC is available for the destination, and recalculates
routes if so.

Downside: adds a bitmap to struct rtable.  But only if 
CONFIG_IP_ROUTE_MULTIPATH_NODE is enabled.

Comments, suggestions welcome. 

Signed-off-by: Pravin B. Shelar [EMAIL PROTECTED]
Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.16/include/net/route.h
===
--- linux-2.6.16.orig/include/net/route.h   2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/include/net/route.h2006-03-20 14:52:24.0 -0800
@@ -75,6 +75,13 @@ struct rtable
/* Miscellaneous cached information */
__u32   rt_spec_dst; /* RFC1122 specific destination */
struct inet_peer*peer; /* long-living peer info */
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+   /*  bitmap bit is set if current node has a local multi-path device for
+*  this route.
+*/
+   DECLARE_BITMAP  (mp_if_bitmap, MAX_NUMNODES);
+#endif
+
 };
 
 struct ip_rt_acct
@@ -201,4 +208,21 @@ static inline struct inet_peer *rt_get_p
 
 extern ctl_table ipv4_route_table[];
 
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+
+#include <linux/ip_mp_alg.h>
+
+static inline int dst_dev_node_check(struct rtable *rt)
+{
+   int cnode = numa_node_id();
+   if (unlikely(netdev_node(rt->u.dst.dev) != cnode)) {
+   if (test_bit(cnode, rt->mp_if_bitmap))
+   return 1;
+   }
+   return 0;
+}
+#else
+#define dst_dev_node_check(rt) 0
+#endif
+
 #endif /* _ROUTE_H */
Index: linux-2.6.16/net/ipv4/ip_output.c
===
--- linux-2.6.16.orig/net/ipv4/ip_output.c  2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/net/ipv4/ip_output.c   2006-03-20 14:52:24.0 -0800
@@ -309,7 +309,7 @@ int ip_queue_xmit(struct sk_buff *skb, i
 
/* Make sure we can route this packet. */
rt = (struct rtable *)__sk_dst_check(sk, 0);
-   if (rt == NULL) {
+   if ((rt == NULL ) || dst_dev_node_check(rt)) {
u32 daddr;
 
/* Use correct destination address if we have options. */
Index: linux-2.6.16/net/ipv4/route.c
===
--- linux-2.6.16.orig/net/ipv4/route.c  2006-03-19 21:53:29.0 -0800
+++ linux-2.6.16/net/ipv4/route.c   2006-03-20 14:52:24.0 -0800
@@ -2313,6 +2313,22 @@ static inline int ip_mkroute_output(stru
if (res->fi && res->fi->fib_nhs > 1) {
unsigned char hopcount = res->fi->fib_nhs;
 
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+   DECLARE_BITMAP (mp_if_bitmap, MAX_NUMNODES);
+   bitmap_zero(mp_if_bitmap, MAX_NUMNODES);
+   /* Calculating device bitmap for this multipath route */
+   if (res->fi->fib_mp_alg == IP_MP_ALG_DRR) {
+   for (hop = 0; hop < hopcount; hop++) {
+   struct net_device *dev2nexthop;
+
+   res->nh_sel = hop;
+   dev2nexthop = FIB_RES_DEV(*res);
+   dev_hold(dev2nexthop);
+   set_bit(netdev_node(dev2nexthop), mp_if_bitmap);
+   dev_put(dev2nexthop);
+   }
+   }
+#endif
for (hop = 0; hop < hopcount; hop++) {
struct net_device *dev2nexthop;
 
@@ -2343,6 +2359,10 @@ static inline int ip_mkroute_output(stru
 FIB_RES_NETMASK(*res),
 res->prefixlen,
 FIB_RES_NH(*res));
+
+#ifdef CONFIG_IP_ROUTE_MULTIPATH_NODE
+   bitmap_copy(rth->mp_if_bitmap, mp_if_bitmap, MAX_NUMNODES);
+#endif
cleanup:
/* release work reference to output device */
dev_put(dev2nexthop);


Re: Writing a rate based transport protocol

2006-03-21 Thread Stephen Hemminger
On Tue, 21 Mar 2006 20:26:55 -0700
Mark Butler [EMAIL PROTECTED] wrote:

 On Mon, 13 Mar 2006 18:20:26 -0600, Saurabh Jain wrote:
 
 
 Hi All, I am trying to write a new rate based transport protocol in
 linux kernel (either as a module or directly within the kernel).
 Basically it would be similar to UDP but with features like dynamic
 rate control, connection and state management, error control like
 TCP. Is there any established framework which i can use? I know
 there is one for window based protocols like TCP where one can
 dynamically register different congestion control mechanisms. I
 would appreciate if somebody can give me some direction in this regard.
 
 
 I do not know what you have in mind,  but a general facility to transmit 
 a series of packets at spaced intervals would be very useful to 
 compensate for ack compression, etc.  Preferably a facility simple 
 enough to be trivially offloaded to hardware.  TSO/LSO hardware could 
 certainly use something similar for spacing segments, so breaking sends 
 over a size (c.f. sysctl_tcp_tso_win_divisor) manually would not be 
 necessary. 
 
 In software one might implement this as an alternative queueing 
 discipline at layer two. The minimum spacing interval could be obtained 
 from a route attribute similar to  RTAX_ADVMSS.  Alternatively, a 
 transport protocol might calculate the nominal transmission spacing as 
 the RTT divided by the congestion window size in packets and run or 
 share a similar transmission scheduler at layer 4.
 

The bigger problem is that, to be effective, rate control needs accurate
real time. Linux is getting better at real time, but providing useful
high-speed inter-packet spacing is still beyond its current capabilities. To
get around this, I think most high-speed 10G cards provide some form of rate
control in firmware.
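
To put numbers on this, here is a small sketch of the arithmetic.  The first
helper is the nominal spacing mentioned in the quoted message (RTT divided by
the congestion window in packets); the second is the time one frame occupies
the wire at a given link speed.  The function names and the sample values are
made up for illustration and are not taken from any existing kernel interface.

```c
#include <assert.h>
#include <stdint.h>

/* Nominal gap between packets: the RTT spread evenly over cwnd packets. */
static uint32_t pacing_interval_us(uint32_t rtt_us, uint32_t cwnd_pkts)
{
	assert(cwnd_pkts > 0);
	return rtt_us / cwnd_pkts;
}

/* Time one frame of 'bytes' bytes occupies the wire, in nanoseconds. */
static uint64_t wire_time_ns(uint16_t bytes, uint64_t bits_per_sec)
{
	assert(bits_per_sec > 0);
	return (uint64_t)bytes * 8u * 1000000000ULL / bits_per_sec;
}
```

With a 100 ms RTT and a window of 50 packets, pacing_interval_us(100000, 50)
gives 2000 us, which a software timer can plausibly hit.  At 10 Gbit/s,
wire_time_ns(1500, 10000000000ULL) gives 1200 ns per 1500-byte frame, so
spacing packets at anything close to line rate needs sub-microsecond timing.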


[RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting

2006-03-21 Thread Balbir Singh
On Mon, Mar 13, 2006 at 09:48:26PM -0500, jamal wrote:
 On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
  On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
 
 
 I had a long description in an earlier email feedback; but the summary
 of it is the GET command is generic like TASKSTATS_CMD_GET; the message
 itself carries TLVs of what needs to be gotten which are 
 either PID and/or TGID etc. Anyways, theres a long spill of what i am
 saying in that earlier email. Perhaps the current patch is a transition
 towards that?
 

Hi, Jamal,

Please find the updated version of delayacct-genetlink.patch. We hope
this iteration is closer to your expectation. I have copied the enums
you suggested in your previous review comments and used them.

Comments addressed (in this patch)

- Changed the code to use TLVs for data exchange between kernel and
  user space

Thanks,
Balbir


Documentation for the patch

Create a generic netlink interface (NETLINK_GENERIC family),
called taskstats, for getting delay and cpu statistics of
tasks and thread groups during their lifetime and when they exit.

More changes expected. Following comments will go into a
Documentation file:

When a task is alive, userspace can get its stats by sending a
command containing its pid. Sending a tgid returns the sum of stats
of the tasks belonging to that tgid (where such a sum makes sense).
Together, the command interface allows stats for a large number of
tasks to be collected more efficiently than would be possible
through /proc or any per-pid interface.
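
For readers unfamiliar with TLVs, the following user-space sketch shows the
general idea of carrying a pid or tgid as a type-length-value attribute.  It
mimics the netlink attribute layout (16-bit length, 16-bit type, payload padded
to 4 bytes) but it is not the taskstats interface from this patch: the type
values DEMO_TYPE_PID and DEMO_TYPE_TGID and the helpers tlv_put and tlv_get_u32
are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute types, for this sketch only. */
enum { DEMO_TYPE_PID = 1, DEMO_TYPE_TGID = 2 };

struct tlv_hdr {
	uint16_t len;	/* header + payload, excluding padding */
	uint16_t type;
};

/* Round a length up to the next multiple of 4. */
static size_t tlv_align(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/* Append one TLV at 'off'; returns the new offset, or 0 if it does not fit. */
static size_t tlv_put(uint8_t *buf, size_t off, size_t cap,
		      uint16_t type, const void *data, uint16_t dlen)
{
	struct tlv_hdr h;
	size_t total;

	assert(buf != NULL && data != NULL);
	if (dlen > UINT16_MAX - sizeof(struct tlv_hdr))
		return 0;
	h.len = (uint16_t)(sizeof(struct tlv_hdr) + dlen);
	h.type = type;
	total = tlv_align(h.len);
	if (off > cap || total > cap - off)
		return 0;
	memset(buf + off, 0, total);	/* zero the padding bytes */
	memcpy(buf + off, &h, sizeof(h));
	memcpy(buf + off + sizeof(h), data, dlen);
	return off + total;
}

/* Find the first 32-bit TLV of 'type'; returns 0 on success, -1 otherwise. */
static int tlv_get_u32(const uint8_t *buf, size_t len, uint16_t type,
		       uint32_t *out)
{
	size_t off = 0;

	while (off + sizeof(struct tlv_hdr) <= len) {
		struct tlv_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < sizeof(h) || h.len > len - off)
			return -1;	/* malformed attribute */
		if (h.type == type && h.len == sizeof(h) + sizeof(*out)) {
			memcpy(out, buf + off + sizeof(h), sizeof(*out));
			return 0;
		}
		off += tlv_align(h.len);
	}
	return -1;
}
```

The point of the layout is that a receiver can skip attribute types it does
not understand, which is what lets a single generic GET command carry a pid, a
tgid, or both.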

The netlink interface also sends the stats for each task to userspace
when the task is exiting. This permits fine-grain accounting for
short-lived tasks, which is important if userspace is doing its own
aggregation of statistics based on some grouping of tasks
(e.g. CSA jobs, ELSA banks or CKRM classes).

If the exiting task belongs to a thread group (with more members than itself),
the latter's delay stats are also sent out on the task's exit. This allows
userspace to get accurate data at a per-tgid level while the tids of a tgid
are exiting one by one.

The interface has been deliberately kept distinct from the delay
accounting code since it is potentially usable by other kernel components
that need to export per-pid/tgid data. The format of data returned to
userspace is versioned and the command interface is easily extensible to
facilitate reuse.

If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay
accounting alone.

Signed-off-by: Shailabh Nagar [EMAIL PROTECTED]
Signed-off-by: Balbir Singh [EMAIL PROTECTED]

---

 include/linux/delayacct.h |   11 ++
 include/linux/taskstats.h |  112 
 init/Kconfig  |   16 ++
 kernel/Makefile   |1 
 kernel/delayacct.c|   44 
 kernel/taskstats.c|  251 ++
 6 files changed, 432 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink  2006-03-22 11:56:03.0 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h   2006-03-22 11:56:03.0 +0530
@@ -15,6 +15,7 @@
 #define _LINUX_TASKDELAYS_H
 
 #include <linux/sched.h>
+#include <linux/taskstats.h>
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;   /* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct 
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
 extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
return 0;
 }
 #endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+   struct task_struct *tsk)
+{
+   if (!tsk->delays)
+   return -EINVAL;
+   return __delayacct_add_tsk(d, tsk);
+}
+#endif
 #endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null   2004-06-24 23:34:38.0 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h   2006-03-22 13:12:01.0 +0530
@@ -0,0 +1,112 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *   (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY