date:20061016

Re: Bluetooth update for 2.6

2006-10-16 Thread David Miller

From: Marcel Holtmann [EMAIL PROTECTED]
Date: Sun, 15 Oct 2006 18:10:24 +0200

 Please pull from

 git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6.git

Pulled, thanks a lot Marcel.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: poll problem with PF_PACKET when using PACKET_RX_RING

2006-10-16 Thread Patrick McHardy

Joan Raventos wrote:
Is this a bug in PF_PACKET? Should the socket queue be
emptied by packet_set_ring (called via setsockopt when
PACKET_RX_RING is used) so the above cannot happen?
Should the user-space app drain the socket queue with
recvfrom prior to (4) -quite unlikely in practice-?
 

I guess the best way is not to bind the socket before having
completed setup. We could still flush the queue to make life
easier for userspace, not sure about that ..
 
 
 Even w/o bind, packet_create is doing a dev_add_pack, which I think will make 
 pkts arrive to that socket (ie. in netif_receive_skb one can see the loops 
 over the rcu for both ptype_all and type-specific which seem to match whenever 
 !ptype->dev || ptype->dev == skb->dev).
 
 Also the packet_mmap.txt doc does not mention bind, which probably is more a 
 mechanism to closely specify a dev than to signal socket readiness.

packet_create only calls dev_add_pack if a protocol is given.
You can use a protocol number of 0 and then bind the socket
after setting it up properly.

According to your description, you first used setsockopt(...,
PACKET_RX_RING), then mmap. In that case the receive queue
should already get flushed by packet_set_ring (about line 1710).
How did you verify that the receive queue still contains packets?
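
A minimal sketch of the setup order suggested above, assuming a Linux system that provides <linux/if_packet.h>. The helper names ring_geometry() and open_ring_socket() and the ring sizes are illustrative assumptions, not taken from the thread. The socket is created with protocol 0, the ring is configured and mapped, and only then is the socket bound to a protocol and interface.

```c
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill a tpacket_req so that tp_frame_nr agrees with the block geometry
 * (frames per block times number of blocks). Sizes are illustrative. */
static void ring_geometry(struct tpacket_req *req, unsigned block_size,
                          unsigned block_nr, unsigned frame_size)
{
    memset(req, 0, sizeof(*req));
    req->tp_block_size = block_size;
    req->tp_block_nr   = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr   = (block_size / frame_size) * block_nr;
}

/* Returns the socket fd, or -1 on error. Needs CAP_NET_RAW to succeed. */
static int open_ring_socket(const char *ifname, void **ring, size_t *ring_len)
{
    struct tpacket_req req;
    struct sockaddr_ll sll;
    int fd = socket(PF_PACKET, SOCK_RAW, 0);  /* protocol 0: not hooked up yet */

    if (fd < 0)
        return -1;

    /* 1. Configure the receive ring while nothing can be queued. */
    ring_geometry(&req, 4096, 64, 2048);
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0)
        goto fail;

    /* 2. Map the ring into user space. */
    *ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
    *ring = mmap(NULL, *ring_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (*ring == MAP_FAILED)
        goto fail;

    /* 3. Only now select a protocol and interface via bind(). */
    memset(&sll, 0, sizeof(sll));
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex  = (int)if_nametoindex(ifname);  /* 0 means any interface */
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        munmap(*ring, *ring_len);
        goto fail;
    }
    return fd;

fail:
    close(fd);
    return -1;
}
```

With this ordering poll() should only ever report frames that arrived through the ring, because no packet can reach the socket before step 3.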


RE: Suppress / delay SYN-ACK

2006-10-16 Thread Martin Schiller

On Friday, October 13, 2006 10:14 PM, Eric Dumazet wrote:

 Martin, I played with libnetfilter_queue
 (http://www.netfilter.org/projects/libnetfilter_queue/index.html)
 
 With this single iptables rules, I was able to do what you want :
 transmit the SYN message to a user application, that may DROP this
 packet or let it pass normal TCP stack.  
 
 iptables -A INPUT -p tcp --dport 333 --syn -j QUEUE
 
 Then hack nfqnl_test.c to meet your needs (see nfq_set_verdict(),
 nfq_get_payload())
 
 Be prepared to receive the 'same SYN' several time if your X.25 call
 attempt is too long. 
 
 (You have to be root unfortunatly)
 
 Eric

Thanks, this sounds very interesting. I will have a closer look at that.

Martin
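
A rough sketch of the payload check that a modified nfqnl_test.c callback could apply before choosing a verdict, assuming the queued packet is an IPv4 packet as handed back by nfq_get_payload(). The function name is_initial_syn() and its port argument are illustrative assumptions; the libnetfilter_queue calls themselves are left out so the block compiles without the library.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if pkt is an IPv4 TCP segment addressed to dport with SYN set
 * and FIN, RST and ACK clear (what iptables --syn matches), else 0. */
static int is_initial_syn(const uint8_t *pkt, size_t len, uint16_t dport)
{
    size_t ihl;
    const uint8_t *tcp;

    if (len < 20 || (pkt[0] >> 4) != 4)           /* not IPv4 */
        return 0;
    ihl = (size_t)(pkt[0] & 0x0f) * 4;            /* IP header length in bytes */
    if (ihl < 20 || len < ihl + 20 || pkt[9] != 6) /* 6 = TCP */
        return 0;

    tcp = pkt + ihl;
    if ((uint16_t)(((uint16_t)tcp[2] << 8) | tcp[3]) != dport)
        return 0;

    /* Flags byte: FIN 0x01, SYN 0x02, RST 0x04, ACK 0x10. */
    return (tcp[13] & 0x17) == 0x02;
}
```

A callback would then pass NF_ACCEPT or NF_DROP to nfq_set_verdict() depending on this check and on the outcome of the X.25 call attempt. A retransmitted SYN passes the same check again, which matches the warning above about receiving the 'same SYN' several times.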



Patch to remove masq/NAT in description of IP6_NF_IPTABLES in ipv6/netfilter/Kconfig

2006-10-16 Thread Peter Bieringer

Hi,

afaik, NAT (and therefore masquerading also) is left out by design in
IPv6, looks like a copy-and-paste issue.

Patch attached to fix this.

Peter
-- 
Dr. Peter Bieringer http://www.bieringer.de/pb/
GPG/PGP Key 0x958F422D   mailto:[EMAIL PROTECTED]
Deep Space 6 Co-Founder and Core Member  http://www.deepspace6.net/
--- linux-2.6.18.1/net/ipv6/netfilter/Kconfig.orig	2006-10-16 08:56:43.0 +0200
+++ linux-2.6.18.1/net/ipv6/netfilter/Kconfig	2006-10-16 08:56:55.0 +0200
@@ -40,7 +40,7 @@
 	  To compile it as a module, choose M here.  If unsure, say N.
 
 config IP6_NF_IPTABLES
-	tristate "IP6 tables support (required for filtering/masq/NAT)"
+	tristate "IP6 tables support (required for filtering)"
 	depends on NETFILTER_XTABLES
 	help
 	  ip6tables is a general, extensible packet identification framework.

Re: Suppress / delay SYN-ACK

2006-10-16 Thread Lennert Buytenhek

On Thu, Oct 12, 2006 at 10:08:53AM +0200, Martin Schiller wrote:

 I'm searching for a solution to suppress / delay the SYN-ACK packet of a
 listening server (-application) until he has decided (e.g. analysed the
 requesting ip-address or checked if the corresponding other end of a
 connection is available) if he wants to accept the connect request of the
 client. If not, it should be possible to reject the connect request.

I wrote something like this a couple of years ago:

http://marc.theaimsgroup.com/?l=linux-netdev&m=103666165629419&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=106089519611631&w=2

There wasn't a whole lot of external interest, and my need for it
disappeared, so I never really finished it, and there's a couple of
unfixed bugs,


cheers,
Lennert

Re: Patch to remove masq/NAT in description of IP6_NF_IPTABLES in ipv6/netfilter/Kconfig

2006-10-16 Thread Patrick McHardy

Peter Bieringer wrote:
 afaik, NAT (and therefore masquerading also) is left out by design in
 IPv6, looks like a copy-and-paste issue.
 
 Patch attached to fix this.

Applied, thanks. But please sign off future patches.

Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Evgeniy Polyakov

On Sun, Oct 15, 2006 at 04:22:45PM -0700, Ulrich Drepper ([EMAIL PROTECTED]) 
wrote:
 Evgeniy Polyakov wrote:
 Existing design does not allow overflow.
 
 And I've pointed out a number of times that this is not practical at 
 best.  There are event sources which can create events which cannot be 
 coalesced into one single event as it would be required with your design.
 
 Signals are one example, specifically realtime signals.  If we do not 
 want the design to be limited from the start this approach has to be 
 thought over.

The whole idea of mmap buffer seems to be broken, since those who asked
for creation do not like existing design and do not show theirs...

Regarding signals and the possibility of overflow in the existing ring buffer
implementation:
You seem not to have checked the code - each event can be marked as ready 
only one time, which means only one copy and so on.
It was done _specially_. And it is not a limitation, but a new approach.
A queue of the same signals or any other events has a fundamental flaw
(as does any other ring buffer implementation which has a queue size) -
the size of the queue and the extremely bad case of overflow.
So, the same event may not be ready several times. Any design which
allows an infinite number of events to be generated for the same case
is broken, since the consumer can be in a situation where it cannot handle
that flow. That is why poll() returns only POLLIN when data is ready in
the network stack, but does not try to generate some kind of signal for 
each byte/packet/MTU/MSS received.
RT signals have design problems, and I will not repeat the same error
with similar limits in kevent.
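
A small sketch of the poll() behaviour referred to above, assuming an AF_UNIX datagram socketpair as the data source; the helper name poll_after_sends() is an illustrative assumption. However many datagrams are waiting, a single poll() call reports one ready descriptor with POLLIN set, so readiness is a state rather than a per-packet count.

```c
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send n one-byte datagrams over a socketpair, then poll the receiving end
 * once without blocking. Returns poll()'s result (number of ready
 * descriptors), or -1 on any error. */
static int poll_after_sends(int n)
{
    int sv[2], i, ready;
    struct pollfd pfd;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    for (i = 0; i < n; i++) {
        if (send(sv[0], "x", 1, 0) != 1) {
            close(sv[0]);
            close(sv[1]);
            return -1;
        }
    }

    pfd.fd = sv[1];
    pfd.events = POLLIN;
    pfd.revents = 0;
    ready = poll(&pfd, 1, 0);
    if (ready == 1 && !(pfd.revents & POLLIN))
        ready = -1;

    close(sv[0]);
    close(sv[1]);
    return ready;
}
```

The caller still has to read until the socket is drained to learn how many datagrams were queued.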

 So zap mmap() support completely, since it is not usable at all. We wont 
 discuss on it.
 
 Initial implementation did not have it.
 But I was requested to do it, and it is ready now.
 No one likes it, but no one provides an alternative implementation.
 We are stuck.
 
 We need the mapped ring buffer.  The current design (before it was 
 removed) was broken but this does not mean it shouldn't be implemented. 
  We just need more time to figure out how to implement it correctly.

In the latest patchset it was removed. I'm waiting for your code.

Mmap implementation can be added separately, since it does not affect
kevent core.

 -- 
 ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, 
 CA ❖

-- 
Evgeniy Polyakov

Re: [PATCH 9/14] [TIPC] Name publication events now delivered in chronological order

2006-10-16 Thread Per Liden

I'm fairly sure this is a problem on your side. I received patch 10/14 
from the netdev list and the two list archives I checked also had it.

/Per

On Fri, 13 Oct 2006, Bill Fink wrote:

 FYI,
 
 At least here, I received two copies of patch 9/14 and no copy
 of patch 10/14.
 
   -Bill
 
 
 
 On Fri, 13 Oct 2006 13:37:50 +0200, Per Liden wrote:
 
  From: Allan Stephens [EMAIL PROTECTED]
  
  This patch trivially re-orders the entries in TIPC's list of local
  publications so that applications will receive publication events
  in the order they were published.
  
  Signed-off-by: Allan Stephens [EMAIL PROTECTED]
  Signed-off-by: Per Liden [EMAIL PROTECTED]
  ---
   net/tipc/name_distr.c |2 +-
   1 files changed, 1 insertions(+), 1 deletions(-)
  
  diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
  index f0b063b..03bd659 100644
  --- a/net/tipc/name_distr.c
  +++ b/net/tipc/name_distr.c
  @@ -122,7 +122,7 @@ void tipc_named_publish(struct publicati
   	struct sk_buff *buf;
   	struct distr_item *item;
   
  -	list_add(&publ->local_list, &publ_root);
  +	list_add_tail(&publ->local_list, &publ_root);
   	publ_cnt++;
   
   	buf = named_prepare_buf(PUBLICATION, ITEM_SIZE, 0);
  -- 
  1.4.1
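
A userspace sketch of why the one-line change in the quoted patch alters delivery order, assuming a simplified re-implementation of the kernel's circular doubly linked list. The struct publ type and the helper publish_order() are illustrative stand-ins, not TIPC code. Adding at the head makes a front-to-back walk see the newest entry first, while adding at the tail preserves publication order.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void insert_between(struct list_head *n, struct list_head *prev,
                           struct list_head *next)
{
    next->prev = n;
    n->next = next;
    n->prev = prev;
    prev->next = n;
}

/* Insert right after the head: newest entry is walked first. */
static void list_add(struct list_head *n, struct list_head *head)
{
    insert_between(n, head, head->next);
}

/* Insert right before the head: newest entry is walked last. */
static void list_add_tail(struct list_head *n, struct list_head *head)
{
    insert_between(n, head->prev, head);
}

struct publ { int id; struct list_head local_list; };

/* Publish ids 1..n (n <= 8) and record the order a forward walk sees. */
static void publish_order(int n, int use_tail, int *out)
{
    struct list_head root, *pos;
    struct publ items[8];
    int i;

    list_init(&root);
    for (i = 0; i < n && i < 8; i++) {
        items[i].id = i + 1;
        if (use_tail)
            list_add_tail(&items[i].local_list, &root);
        else
            list_add(&items[i].local_list, &root);
    }

    i = 0;
    for (pos = root.next; pos != &root; pos = pos->next) {
        struct publ *p = (struct publ *)
            ((char *)pos - offsetof(struct publ, local_list));
        out[i++] = p->id;
    }
}
```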
 

[PATCH] NET : Suspicious locking in reqsk_queue_hash_req()

2006-10-16 Thread Eric Dumazet

Hi David

While browsing include/net/request_sock.h I found this suspicious locking 
protecting the SYN table hash table. I think this patch is necessary.

Thank you

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]
--- linux-2.6.18/include/net/request_sock.h.orig	2006-10-16 10:53:11.0 +0200
+++ linux-2.6.18-ed/include/net/request_sock.h	2006-10-16 10:53:24.0 +0200
@@ -251,9 +251,9 @@
 	req->expires = jiffies + timeout;
 	req->retrans = 0;
 	req->sk = NULL;
-	req->dl_next = lopt->syn_table[hash];
 
 	write_lock(&queue->syn_wait_lock);
+	req->dl_next = lopt->syn_table[hash];
 	lopt->syn_table[hash] = req;
 	write_unlock(&queue->syn_wait_lock);
 }

[PATCH] NET : Suspicious locking in reqsk_queue_hash_req()

2006-10-16 Thread Eric Dumazet

(Sorry, patch inlined this time)

Hi David

While browsing include/net/request_sock.h I found this suspicious locking
protecting the SYN table hash table. I think this patch is necessary.

Thank you

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]
--- linux-2.6.18/include/net/request_sock.h.orig	2006-10-16 10:53:11.0 +0200
+++ linux-2.6.18-ed/include/net/request_sock.h	2006-10-16 10:53:24.0 +0200
@@ -251,9 +251,9 @@
 	req->expires = jiffies + timeout;
 	req->retrans = 0;
 	req->sk = NULL;
-	req->dl_next = lopt->syn_table[hash];
 
 	write_lock(&queue->syn_wait_lock);
+	req->dl_next = lopt->syn_table[hash];
 	lopt->syn_table[hash] = req;
 	write_unlock(&queue->syn_wait_lock);
 }
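
A userspace sketch of the pattern the patch moves to, assuming a pthread mutex as a stand-in for the kernel's rwlock; struct req, syn_table and the helpers below are simplified illustrations, not the kernel definitions. The current chain head is read inside the critical section, so the read and the store cannot be separated by another writer. Whether the kernel change is strictly required depends on what else serializes writers, which the message does not state.

```c
#include <pthread.h>
#include <stddef.h>

#define TABLE_SIZE 16

struct req { struct req *dl_next; int id; };

static struct req *syn_table[TABLE_SIZE];
static pthread_mutex_t syn_wait_lock = PTHREAD_MUTEX_INITIALIZER;

/* Link r at the head of its hash chain. Both the read of the old head and
 * the store of the new head happen while the lock is held. */
static void hash_req(struct req *r, unsigned hash)
{
    pthread_mutex_lock(&syn_wait_lock);
    r->dl_next = syn_table[hash % TABLE_SIZE];
    syn_table[hash % TABLE_SIZE] = r;
    pthread_mutex_unlock(&syn_wait_lock);
}

/* Count the entries on one chain, also under the lock. */
static int chain_length(unsigned hash)
{
    const struct req *r;
    int n = 0;

    pthread_mutex_lock(&syn_wait_lock);
    for (r = syn_table[hash % TABLE_SIZE]; r != NULL; r = r->dl_next)
        n++;
    pthread_mutex_unlock(&syn_wait_lock);
    return n;
}
```

If the read of the old head happened before taking the lock, two unserialized writers could both observe the same head and one insertion would be lost.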

Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-16 Thread Ulrich Drepper


Evgeniy Polyakov wrote:

One can set number of events before the syscall and do not remove them
after syscall. It can be updated if there is need for that.


Nobody doubts that it is possible.  But it is

a) potentially much expensive

and

b) an alien concept

to have the signal mask to set during the wait call implicitly. 
Conceptually it doesn't even make sense.  This is no event to wait for. 
 It is a parameter for the specific wait call, just like the timeout.  And 
I fortunately haven't seen you proposing to pass the timeout value 
implicitly.



Not good enough?  It does exactly what it is supposed to do.  What can 
there be not good enough?


Not to move signals into special case of events. If poll() can not work
with them it does not mean, that they need to be specified as additional
syscall parameter, instead change poll() to work with them, which can be
easily done with kevents.


You still seem to be completely missing the point.  The signal mask is 
no event to wait for.  It has nothing to do with this that ppoll() takes 
the signal mask as a parameter.  The signal mask is a parameter for the 
wait call just like the timeout, not more and not less.




Do not mix warm and soft - waiting for some period is not equal to
syscall timeout. Waiting is possible with timer kevent user (although
only relative timeout, can be changed to support both, not a big
problem).


That's what I'm saying all the time.  Of course it can be supported. 
But for this the timeout parameter must be a timespec pointer.  Whatever 
you could possibly mean by "do not mix warm and soft" I cannot possibly 
imagine.  Fact is that both relative and absolute timeouts are useful. 
And that for absolute timeouts the change of the clock has to be taken 
into account.




I'm quite sure that absolute timeouts are very useful, but not as in
the case of waiting for syscall completion. In any case, kevent can be
extended to support absolute timeouts in its timer notifications.


That's not the same.  If you argue that then the syscall should have no 
timeout parameter at all.  Fact is that setting up a timer is not for 
free.  Since the timeout is used all the time having a timeout parameter 
is the right answer.  And if you do this then do it right just like 
every other syscall other than poll: use a timespec object.  This gives 
flexibility without measurable cost.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Ulrich Drepper


Evgeniy Polyakov wrote:

The whole idea of mmap buffer seems to be broken, since those who asked
for creation do not like existing design and do not show theirs...


What kind of argumentation is that?

   "Because my attempt to implement it doesn't work and nobody right
away has a better suggestion this means the idea is broken."

Nonsense.

It just means that time should be spent on thinking about this.  You cut 
all this short by rushing out your attempt without any discussions. 
Unfortunately nobody else really looked at the approach so it lingered 
around for some weeks.  Well, now it is clear that it is not the right 
approach and we can start thinking about it again.



You seem not to have checked the code - each event can be marked as ready 
only one time, which means only one copy and so on.

It was done _specially_. And it is not a limitation, but a new approach.


I know that it is done deliberately and I tell you that this is wrong 
and unacceptable.  Realtime signals are one event which need to have 
more than one event queued.  This is no description of what you have 
implemented, it's a description of the reality of realtime signals.


RT signals are queued.  They carry a data value (the sigval_t object) 
which can be unique for each signal delivery.  Coalescing the signal 
events therefore leads to information loss.


Therefore, at the very least for signal we need to have the ability to 
queue more than one event for each event source.  Not having this 
functionality means that signals and likely other types of events cannot 
be implemented using kevent queues.
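
A short sketch of the queueing behaviour described above, assuming Linux semantics; the helper name queue_and_drain() is an illustrative assumption. It blocks a signal, queues several instances to the current process with different sigval payloads, then drains them with sigtimedwait(). For a realtime signal every instance and its payload should come back; for a standard signal such as SIGUSR1 Linux keeps only one pending instance.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Block signo, queue n instances to ourselves carrying payloads 1..n, then
 * collect whatever is pending. Payloads are stored in out (capacity n).
 * Returns the number of instances actually delivered, or -1 on error. */
static int queue_and_drain(int signo, int n, int *out)
{
    sigset_t set, old;
    struct timespec zero = { 0, 0 };
    siginfo_t info;
    union sigval v;
    int i, got = 0;

    sigemptyset(&set);
    sigaddset(&set, signo);
    if (sigprocmask(SIG_BLOCK, &set, &old) < 0)
        return -1;

    for (i = 1; i <= n; i++) {
        v.sival_int = i;
        if (sigqueue(getpid(), signo, v) < 0)
            break;
    }

    /* Zero timeout: return immediately once nothing is pending. */
    while (got < n && sigtimedwait(&set, &info, &zero) == signo)
        out[got++] = info.si_value.sival_int;

    sigprocmask(SIG_SETMASK, &old, NULL);
    return got;
}
```

Coalescing three queued realtime signals into one ready event would drop two of the three payloads, which is the information loss the paragraph above refers to.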




A queue of the same signals or any other events has a fundamental flaw
(as does any other ring buffer implementation which has a queue size) -
the size of the queue and the extremely bad case of overflow.


Of course there are additional problems.  Overflows need to be handled. 
 But this is nothing which is unsolvable.




So, the same event may not be ready several times. Any design which
allows an infinite number of events to be generated for the same case
is broken, since the consumer can be in a situation where it cannot handle
that flow.


That's complete nonsense.  Again, for RT signals it is very reasonable 
and not broken to have multiple outstanding signals.




That is why poll() returns only POLLIN when data is ready in
network stack, but is not trying to generate some kind of a signal for 
each byte/packet/MTU/MSS received.


It makes no sense to drag poll() into this discussion.  poll() is a very 
limited interface.  The new event handling is supposed to be the 
opposite, namely, usable for all kinds of events.  Arguing that because 
poll() does it like this just means you don't see what big step is 
needed to get to the goal of a unified event handling.  The shackles of 
poll() must be left behind.




RT signals have design problems, and I will not repeat the same error
with similar limits in kevent.


I don't know what to say.  You claim to be the source of all wisdom in 
OS design.  Maybe you should design your own OS, from ground up.  I 
wonder how many people would like that since all your arguments are 
squarely geared towards optimizing the implementation.  But: the 
implementation is irrelevant without users.  The functionality users (= 
programmers) want and need is what must drive the implementation.  And 
RT signals are definitely heavily used and liked by programmers.  You 
have to accept that you try to modify an OS which has that functionality 
regardless of how much you hate it and want to fight it.




Mmap implementation can be added separately, since it does not affect
kevent core.


That I doubt very much and it is why I would not want the kevent stuff 
go into any released kernel until that detail is resolved.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

Re: Hardware bug or kernel bug?

2006-10-16 Thread Jarek Poplawski

On Fri, Oct 13, 2006 at 05:24:39PM +0100, David Johnson wrote:
 On Friday 13 October 2006 14:06, Jarek Poplawski wrote:
 
  Probably - but only with networking. So I'd try with this debugging
  like in my first reply plus maybe 2.6.19-rc1 (e1000 - btw. I hope
  this other tested card was different model - and locking improved)
  and resend conclusions to [EMAIL PROTECTED]
 
 
 OK I built a 2.6.19-rc1 kernel with a minimal config as you describe and I 
 cannot reproduce the reboots with this kernel. My .config:
 http://www.david-web.co.uk/download/config

I've seen more minimal minimal configs but if it works
it is 50% of success. 

 The other NIC I tried was a D-Link DL10050-based card which I think uses the 
 dl2k module.
 
 I tried to reproduce the problem under Windows (2k), which didn't reboot but 
 did still suffer from it I believe. Randomly during an scp transfer (using 
 the PuTTY scp client) Windows will lock-up for about 30 seconds, making an 
 entry in the event log indicating that there was a time-out talking to the 
 IDE controller, then continuing. Could the same thing be happening in Linux? 
 If Linux can't talk to the IDE controller when trying to write to disk, how 
 does it handle that?

Was this lock-up effect visible during the above 2.6.19-rc1 tests?
If not, I'd try to continue linux debugging:
- is 2.6.19-rc1 working with a normal config (use make oldconfig
to upgrade .config),
- is 2.6.17 working with a minimal config (use make oldconfig),
- changing one or two options at a time, try to find which one makes
the effect return (acpi, smp...). 

Regards,
Jarek P.

PS: Sorry for late reply - I was offline.

Re: [PATCH 1/2] [PCI] Check that MWI bit really did get set

2006-10-16 Thread Alan Cox

Ar Sul, 2006-10-15 am 18:10 -0700, ysgrifennodd Andrew Morton:
 Question is, should pci_set_mwi() ever return -EFOO?  I guess it should, in
 the case where setting the line size didn't work out.

It does no harm, no driver will ever check anything but 0 v !0 because
the handling is no different in either case.


Re: [PATCH 1/2] [PCI] Check that MWI bit really did get set

2006-10-16 Thread Alan Cox

Ar Sul, 2006-10-15 am 16:44 -0700, ysgrifennodd Andrew Morton:
 Let me restore the words from my earlier email which you removed so that
 you could say that:
 
   For you the driver author to make assumptions about what's happening
   inside pci_set_mwi() is a layering violation.  Maybe the bridge got
   hot-unplugged.  Maybe the attempt to set MWI caused some synchronous PCI
   error.  For example, take a look at the various implementations of
   pci_ops.read() around the place - various of them can fail for various
   reasons.  

Let me repeat what I said before. As a driver author I do not care. It
doesn't matter if it failed because it is not supported or because a
pink elephant went for a dance on the PCI bus.

   Now it could be that an appropriate solution is to make pci_set_mwi()
   return only 0 or 1, and to generate a warning from within pci_set_mwi()
   if some unexpected error happens.  In which case it is legitimate for
   callers to not check for errors.

That would be my belief, and ditto for a lot of these other functions -
even the correctly __must_check ones like pci_set_master should do the
error reporting in the set_master() function etc not in every driver.
That gives us a single consistent printk and avoids missing them out or
bloat.

Alan


Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-16 Thread Evgeniy Polyakov

On Mon, Oct 16, 2006 at 02:59:48AM -0700, Ulrich Drepper ([EMAIL PROTECTED]) 
wrote:
 Evgeniy Polyakov wrote:
 One can set number of events before the syscall and do not remove them
 after syscall. It can be updated if there is need for that.
 
 Nobody doubts that it is possible.  But it is
 
 a) potentially much expensive
 
 and
 
 b) an alien concept
 
 to have the signal mask to set during the wait call implicitly. 
 Conceptually it doesn't even make sense.  This is no event to wait for. 
  It is a parameter for the specific wait call, just like the timeout.  And 
 I fortunately haven't seen you proposing to pass the timeout value 
 implicitly.

Because the timeout has its meaning for syscall processing, but signals are
completely separate objects. Why do you want to allow queueing signals
_and_ add a 'temporary' signal mask for the syscall? Just use one way - queue
them all.
 
 Not good enough?  It does exactly what it is supposed to do.  What can 
 there be not good enough?
 
 Not to move signals into special case of events. If poll() can not work
 with them it does not mean, that they need to be specified as additional
 syscall parameter, instead change poll() to work with them, which can be
 easily done with kevents.
 
 You still seem to be completely missing the point.  The signal mask is 
 no event to wait for.  It has nothing to do with this that ppoll() takes 
 the signal mask as a parameter.  The signal mask is a parameter for the 
 wait call just like the timeout, not more and not less.

That's where we have different opinions (among other places :) - I do
not agree that signals are parameters for the syscall; I insist that they are
usual events. ppoll() shows us that there is no difference from a
signal reported as a usual user - the syscall returns and we can check if
something was changed (a signal was delivered or even was fired); it does
not differ from the case when the syscall returns and we check what event it
reports first - a ready signal or some other event.
 
 Do not mix warm and soft - waiting for some period is not equal to
 syscall timeout. Waiting is possible with timer kevent user (although
 only relative timeout, can be changed to support both, not a big
 problem).
 
 That's what I'm saying all the time.  Of course it can be supported. 
 But for this the timeout parameter must be a timespec pointer.  Whatever 
 you could possibly mean by "do not mix warm and soft" I cannot possibly 
 imagine.  Fact is that both relative and absolute timeouts are useful. 
 And that for absolute timeouts the change of the clock has to be taken 
 into account.

They are useful for special waiting, but not for waiting when the syscall
is called. The former is supported by timer notifications, the latter -
by the syscall parameter. We can add support for absolute timer
notifications as an addon to relative ones. But using the timeval
structure there is not acceptable, since it has different sizes on different
arches, so there will be problems with 32/64 arches like x86_64.
Instead it is possible to use a u32/u32 structure for sec/nsec, like what
is used for relative timeouts.
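
A minimal sketch of the fixed-width alternative mentioned above, assuming a hypothetical struct fixed_timeout with two 32-bit fields; neither the struct nor fixed_to_timespec() is part of the kevent patches. Its layout is identical for 32-bit and 64-bit callers, whereas the members of struct timespec and struct timeval change size with the ABI.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical fixed-width timeout: 8 bytes on every architecture. */
struct fixed_timeout {
    uint32_t sec;
    uint32_t nsec;
};

/* Convert to the native struct timespec, folding any nanosecond value of
 * one second or more into the seconds field. */
static struct timespec fixed_to_timespec(struct fixed_timeout t)
{
    struct timespec ts;

    ts.tv_sec  = (time_t)t.sec + (time_t)(t.nsec / 1000000000u);
    ts.tv_nsec = (long)(t.nsec % 1000000000u);
    return ts;
}
```

The trade-off is range: an unsigned 32-bit seconds field is ample for relative timeouts but covers absolute times only up to the year 2106.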
 
 I'm quite sure that absolute timeouts are very useful, but not as in
 the case of waiting for syscall completion. In any case, kevent can be
 extended to support absolute timeouts in its timer notifications.
 
 That's not the same.  If you argue that then the syscall should have no 
 timeout parameter at all.  Fact is that setting up a timer is not for 
 free.  Since the timeout is used all the time having a timeout parameter 
 is the right answer.  And if you do this then do it right just like 
 every other syscall other than poll: use a timespec object.  This gives 
 flexibility without measurable cost.

It does not introduce any flexibility, since the syscall does not have a
parameter to specify whether an absolute or relative timeout has been provided.
That's one.
I do argue that the syscall must have a timeout parameter, since it is related
to syscall behaviour but not to the events the syscall is working with - which are
completely different things: the syscall must be interrupted after some time
to allow it to fail the operation or perform other tasks, but a timer event can
be fired at any time in the future; the syscall should not care about
underlying events. That's two.
You say "every other syscall other than poll" - but even aio_suspend()
and friends use relative timeouts (although glibc converts them into 
absolute to be used with pthread_cond_timedwait), so why do you propose 
to use a variable-sized structure (even if it is transferred almost for 
free in the syscall) instead of a usual timeout specified in 
seconds/nanoseconds/anything? That's three.
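
A sketch of the relative-to-absolute conversion mentioned above, assuming the caller supplies the current time; the helper name deadline_from() is an illustrative assumption and this is not glibc's actual code. It adds a relative timeout to a "now" value and normalizes the nanosecond field.

```c
#include <time.h>

/* Add a relative timeout to a point in time and keep tv_nsec within
 * [0, 1e9). Both inputs are assumed to be normalized already. */
static struct timespec deadline_from(struct timespec now, struct timespec rel)
{
    struct timespec dl;

    dl.tv_sec  = now.tv_sec + rel.tv_sec;
    dl.tv_nsec = now.tv_nsec + rel.tv_nsec;
    if (dl.tv_nsec >= 1000000000L) {
        dl.tv_sec++;
        dl.tv_nsec -= 1000000000L;
    }
    return dl;
}
```

Taking "now" from clock_gettime(CLOCK_MONOTONIC, ...) gives a deadline that is unaffected by wall-clock changes, while a CLOCK_REALTIME deadline moves with them; that is the clock-change issue raised for absolute timeouts earlier in the thread.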

 -- 
 ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, 
 CA ❖

-- 
Evgeniy Polyakov

Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Evgeniy Polyakov

On Mon, Oct 16, 2006 at 03:16:15AM -0700, Ulrich Drepper ([EMAIL PROTECTED]) 
wrote:
 Evgeniy Polyakov wrote:
 The whole idea of mmap buffer seems to be broken, since those who asked
 for creation do not like existing design and do not show theirs...
 
 What kind of argumentation is that?
 
 "Because my attempt to implement it doesn't work and nobody right
 away has a better suggestion this means the idea is broken."
 
 Nonsense.

Ok, let's reformulate:
My attempt works, but nobody around likes it, so I remove it and wait until
someone else implements it.

 It just means that time should be spent on thinking about this.  You cut 
 all this short by rushing out your attempt without any discussions. 
 Unfortunately nobody else really looked at the approach so it lingered 
 around for some weeks.  Well, now it is clear that it is not the right 
 approach and we can start thinking about it again.

I talked about it in the last 13 releases of kevent, and _no one_
made any comments. And now I get - 'it is broken, it does not
work, there are problems, we do not want it' and the like. I tried
hard to show that it does work and that the problems shown cannot happen, but
still no one hears me. Since I think it is not an interface which is
100% required for correct functionality, I removed it. When there are
better suggestions and implementations we can return to them of course.

 You seem not to have checked the code - each event can be marked as ready 
 only one time, which means only one copy and so on.
 It was done _specially_. And it is not a limitation, but a new approach.
 
 I know that it is done deliberately and I tell you that this is wrong 
 and unacceptable.  Realtime signals are one event which need to have 
 more than one event queued.  This is no description of what you have 
 implemented, it's a description of the reality of realtime signals.
 
 RT signals are queued.  They carry a data value (the sigval_t object) 
 which can be unique for each signal delivery.  Coalescing the signal 
 events therefore leads to information loss.
 
 Therefore, at the very least for signal we need to have the ability to 
 queue more than one event for each event source.  Not having this 
 functionality means that signals and likely other types of events cannot 
 be implemented using kevent queues.

Well, my point about rt-signals is that they do not deserve to be
resurrected, but it is only my point :)
In case it is still used, each signal setup should create event - many
signals means many events, each signal can be sent with different
parameters - each event should correspond to one unique case.

 A queue of the same signals or any other events has a fundamental flaw
 (as does any other ring buffer implementation which has a queue size) -
 the size of the queue and the extremely bad case of overflow.
 
 Of course there are additional problems.  Overflows need to be handled. 
  But this is nothing which is unsolvable.

I strongly disagree that having a design which allows overflows is
acceptable - do we really want rt-signal queue overflow problems in a new
place? Instead some complex allocation scheme can be created.

 So, the same event may not be ready several times. Any design which
 allows an infinite number of events to be generated for the same case
 is broken, since the consumer can be in a situation where it cannot handle
 that flow.
 
 That's complete nonsense.  Again, for RT signals it is very reasonable 
 and not broken to have multiple outstanding signals.

The same signal with a different payload is acceptable, but when the number of
them exceeds the ulimit and they start to be dropped - that's what
I call broken design.

 That is why poll() returns only POLLIN when data is ready in
 network stack, but is not trying to generate some kind of a signal for 
 each byte/packet/MTU/MSS received.
 
 It makes no sense to drag poll() into this discussion.  poll() is a very 
 limited interface.  The new event handling is supposed to be the 
 opposite, namely, usable for all kinds of events.  Arguing that because 
 poll() does it like this just means you don't see what big step is 
 needed to get to the goal of a unified event handling.  The shackles of 
 poll() must be left behind.

Kevent is that subsystem, and for now it works quite well.

 RT signals have design problems, and I will not repeat the same error
 with similar limits in kevent.
 
 I don't know what to say.  You claim to be the source of all wisdom in 
 OS design.  Maybe you should design your own OS, from ground up.  I 
 wonder how many people would like that since all your arguments are 
 squarely geared towards optimizing the implementation.  But: the 
 implementation is irrelevant without users.  The functionality users (= 
 programmers) want and need is what must drive the implementation.  And 
 RT signals are definitely heavily used and liked by programmers.  You 
 have to accept that you try to modify an OS which has that functionality 
 regardless of how

Re: [Bugme-new] [Bug 7366] New: BUG: unable to handle kernel paging request at virtual address d0cb03e0

2006-10-16 Thread Patrick McHardy

Please use reply to _all_. Quoting manually ..

Patrick McHardy wrote:
 Does it also happen without external patches like ipp2p? Did you
 load/unload any netfilter modules before?

 This happens after loading all the specific ip_conntrack modules, flushing
 all iptables rules, resetting counters, flushing all tables, unloading all
 ip_conntrack modules and then running the command iptables -A INPUT -i eth1
 -j ACCEPT . I also tested with kernel 2.6.18.1 and it works OK. I do not
 think this has anything to do with the ipp2p module, since it is not even
 used, and none of the commands I ran refer to this module.


Any chance you're also unloading iptables modules? If so this patch
(already in Dave's queue) should fix it ..

[NETFILTER]: fix cut-and-paste error in exit functions

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit c7b1507f3c040c02efa1b955f7180a33a232c4d9
tree fd21258deca0e5d8859271bb2c745302ce5a1e2a
parent 26da6cf44bc574d528d715a17e48f54da061c151
author Patrick McHardy [EMAIL PROTECTED] Wed, 11 Oct 2006 08:35:50 +0200
committer Patrick McHardy [EMAIL PROTECTED] Wed, 11 Oct 2006 08:35:50 +0200

 net/netfilter/xt_NFQUEUE.c  |2 +-
 net/netfilter/xt_connmark.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/xt_NFQUEUE.c b/net/netfilter/xt_NFQUEUE.c
index db9b896..39e1175 100644
--- a/net/netfilter/xt_NFQUEUE.c
+++ b/net/netfilter/xt_NFQUEUE.c
@@ -68,7 +68,7 @@ static int __init xt_nfqueue_init(void)
 
 static void __exit xt_nfqueue_fini(void)
 {
-   xt_register_targets(xt_nfqueue_target, ARRAY_SIZE(xt_nfqueue_target));
+   xt_unregister_targets(xt_nfqueue_target, ARRAY_SIZE(xt_nfqueue_target));
 }
 
 module_init(xt_nfqueue_init);
diff --git a/net/netfilter/xt_connmark.c b/net/netfilter/xt_connmark.c
index 92a5726..a8f0305 100644
--- a/net/netfilter/xt_connmark.c
+++ b/net/netfilter/xt_connmark.c
@@ -147,7 +147,7 @@ static int __init xt_connmark_init(void)
 
 static void __exit xt_connmark_fini(void)
 {
-   xt_register_matches(xt_connmark_match, ARRAY_SIZE(xt_connmark_match));
+   xt_unregister_matches(xt_connmark_match, ARRAY_SIZE(xt_connmark_match));
 }
 
 module_init(xt_connmark_init);

Re: [patch 3/6] 2.6.18: sb1250-mac: Phylib IRQ handling fixes

2006-10-16 Thread Maciej W. Rozycki

Andrew,

 I don't get it.  If some code does
 
   rtnl_lock();
   flush_scheduled_work();
 
 and there's some work scheduled which does rtnl_lock() then it'll deadlock.
 
 But it'll deadlock whether or not the caller of flush_scheduled_work() is
 keventd.
 
 Calling flush_scheduled_work() under locks is generally a bad idea.

 Indeed -- this is why I avoid it.

 I'd have thought that was still deadlockable.  Could you describe the
 deadlock more completely please?

 The simplest sequence of calls that prevents races here is as follows:

unregister_netdev();
  rtnl_lock();
  unregister_netdevice();
    dev_close();
      sbmac_close();
        phy_stop();
        phy_disconnect();
          phy_stop_interrupts();
            phy_disable_interrupts();
            flush_scheduled_work();
            free_irq();
          phy_detach();
        mdiobus_unregister();
  rtnl_unlock();

We want to call flush_scheduled_work() from phy_stop_interrupts(), because 
there may still be calls to phy_change() waiting for the keventd to 
process and mdiobus_unregister() frees structures needed by phy_change().  
This may deadlock because of the call to rtnl_lock() though.

 So the modified sequence I have implemented is as follows:

unregister_netdev();
  rtnl_lock();
  unregister_netdevice();
    dev_close();
      sbmac_close();
        phy_stop();
        schedule_work(); [sbmac_phy_disconnect()]
  rtnl_unlock();

and then:

sbmac_phy_disconnect();
  phy_disconnect();
    phy_stop_interrupts();
      phy_disable_interrupts();
      free_irq();
    phy_detach();
  mdiobus_unregister();

This guarantees calls to phy_change() for this PHY unit will not be run 
after mdiobus_unregister(), because any such calls have been placed in the 
queue before sbmac_phy_disconnect() (phy_stop() prevents the interrupt 
handler from issuing new calls to phy_change()).

 We still need flush_scheduled_work() to be called from 
phy_stop_interrupts() though, in case some other MAC driver calls 
phy_disconnect() (or phy_stop_interrupts(), depending on the abstraction 
layer of phylib used) directly rather than using keventd.  This is 
possible if phy_disconnect() is called from the driver's module_exit() 
call, which may make sense for devices that are known not to have their 
MII interface available as an external connector.  Hence the:

if (!current_is_keventd())
  flush_scheduled_work();

sequence in phy_stop_interrupts().  Of course, we can force all drivers 
using phylib (in the interrupt mode) to call phy_disconnect() through 
keventd instead.
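
To make the ordering argument concrete, here is a minimal single-threaded userspace model. It is not the phylib code: it assumes a strict FIFO work queue standing in for keventd, and the names wq_schedule(), wq_run(), in_worker, replay() and the *_sim() helpers are invented for illustration. It does not model rtnl_lock() at all; it only shows that a disconnect queued after a phy_change() runs after it, and that a direct caller gets the same ordering through the flush.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WQ_MAX 16

typedef void (*work_fn)(void);

static work_fn wq[WQ_MAX];      /* FIFO of pending work, stands in for keventd's queue */
static size_t wq_head, wq_tail;
static int in_worker;           /* stands in for current_is_keventd() */
static char trace[32];          /* order in which the simulated steps ran */
static size_t trace_len;

static void trace_add(char c)
{
	if (trace_len < sizeof(trace) - 1)
		trace[trace_len++] = c;
}

/* Queue a work item; returns -1 if the model's queue is full. */
static int wq_schedule(work_fn fn)
{
	if (wq_tail == WQ_MAX)
		return -1;
	wq[wq_tail++] = fn;
	return 0;
}

/* Run every pending item in FIFO order, as the worker thread would. */
static void wq_run(void)
{
	in_worker = 1;
	while (wq_head < wq_tail)
		wq[wq_head++]();
	in_worker = 0;
}

static void phy_change_sim(void)
{
	trace_add('c');
}

static void phy_stop_interrupts_sim(void)
{
	/* A direct caller must drain pending phy_change() work first.  Inside
	 * the worker, FIFO order has already run anything queued earlier. */
	if (!in_worker)
		wq_run();
	trace_add('i');         /* free_irq() would happen here */
}

static void phy_disconnect_sim(void)
{
	phy_stop_interrupts_sim();
	trace_add('u');         /* mdiobus_unregister() would happen here */
}

/* Reset the model and replay one scenario; returns the resulting trace. */
static const char *replay(int via_worker)
{
	wq_head = wq_tail = trace_len = 0;
	memset(trace, 0, sizeof(trace));
	wq_schedule(phy_change_sim);             /* an interrupt arrived earlier */
	if (via_worker) {
		wq_schedule(phy_disconnect_sim); /* deferred disconnect */
		wq_run();
	} else {
		phy_disconnect_sim();            /* e.g. called from module_exit() */
	}
	return trace;
}
```

In both scenarios the trace comes out as change, then interrupt teardown, then unregister, which is the property the modified sequence relies on.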

 Does it sound clearer?

  Maciej

Re: Hardware bug or kernel bug?

2006-10-16 Thread David Johnson

On Monday 16 October 2006 11:25, Jarek Poplawski wrote:

 Was this lock-up effect visible during above 2.6.19-rc1 tests?

No, I've not seen anything in Linux other than the reboots, which are instant 
without any preceding lock-up.

 If not I'd try to continue linux debugging:
 - is 2.6.19-rc1 working with normal config (use make oldconfig
 to upgrade .config),

With 2.6.19-rc1 and a normal config, I get the reboots as usual.

 - is 2.6.17 working with minimal config (use make oldconfig),

Yes.

 - changing one or two options at a time, try to find which one makes
 the effect return (acpi, smp...).

I've found the culprit: CPU Frequency Scaling.
With it enabled I get the reboots; with it disabled I don't. That's the same 
with every kernel version I've tried (2.6.19-rc1+rc2, 2.6.17.13 and CentOS' 
2.6.9). The system was using the p4-clockmod driver and the ondemand governor.

I'm still not sure exactly what the problem is - the reboots only happen in 
the circumstances I've mentioned and are not triggered by changes in clock 
speed alone - but disabling cpufreq seems to make it go away...

Thanks for your help,
David.

Re: [patch 1/5] d80211: remove bitfields from ieee80211_tx_control

2006-10-16 Thread Michael Buesch

On Friday 13 October 2006 21:20, David Kimdon wrote:
 All one-bit bitfields have been subsumed into the new 'flags'
 structure member and the new IEEE80211_TXCTL_* definitions.  The
 multiple bit members were converted to u8, s8 or u16 as appropriate.

And, eh, did this increase or decrease the struct size?
Does this generate better or worse code?
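
One way to answer the size question empirically is to compile both layouts and compare sizeof. The sketch below is a hypothetical, cut-down model: the four members appear in the d80211.h hunk elsewhere in this archive, but the TXCTL_* names, the member types and the trimmed structs are assumptions, not the real ieee80211_tx_control. The actual numbers depend on compiler and ABI, so none are claimed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old style: every flag is a one-bit bitfield (hypothetical subset). */
struct txctl_bitfields {
	unsigned int requeue:1;
	unsigned int first_fragment:1;
	unsigned int power_level:8;
	unsigned int antenna_sel:4;
};

/* New style: one flags word plus plain integer members.
 * The flag names below are invented for this sketch. */
#define TXCTL_REQUEUE        (1u << 0)
#define TXCTL_FIRST_FRAGMENT (1u << 1)

struct txctl_flags {
	uint32_t flags;
	uint8_t power_level;
	uint8_t antenna_sel;
};

/* Sizes as seen by this compiler and ABI; print or compare them as needed. */
static size_t size_bitfields(void)
{
	return sizeof(struct txctl_bitfields);
}

static size_t size_flags(void)
{
	return sizeof(struct txctl_flags);
}

/* Testing a flag is a single mask operation on the flags word. */
static int txctl_test(const struct txctl_flags *c, uint32_t flag)
{
	return (c->flags & flag) != 0;
}
```

Comparing the generated code would need the real structure and a look at the compiler output for the drivers that touch it.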

-- 
Greetings Michael.

Re: [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()

2006-10-16 Thread Arnaldo Carvalho de Melo


On 10/16/06, Eric Dumazet [EMAIL PROTECTED] wrote:

(Sorry, patch inlined this time)

Hi David

While browsing include/net/request_sock.h I found this suspicious locking
protecting the SYN table hash table. I think this patch is necessary.

Thank you


Interesting, just checked and it was there before I moved this out of tcp land:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0e87506fcc734647c7b2497eee4eb81e785c857a

@@ -898,18 +898,10 @@ static struct request_sock *tcp_v4_searc
 static void tcp_v4_synq_add(struct sock *sk, struct request_sock *req)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct tcp_listen_opt *lopt = tp->listen_opt;
+	struct tcp_listen_opt *lopt = tp->accept_queue.listen_opt;
 	u32 h = tcp_v4_synq_hash(inet_rsk(req)->rmt_addr,
 				 inet_rsk(req)->rmt_port, lopt->hash_rnd);
-	req->expires = jiffies + TCP_TIMEOUT_INIT;
-	req->retrans = 0;
-	req->sk = NULL;
-	req->dl_next = lopt->syn_table[h];
-
-	write_lock(&tp->syn_wait_lock);
-	lopt->syn_table[h] = req;
-	write_unlock(&tp->syn_wait_lock);
-
+	reqsk_queue_hash_req(&tp->accept_queue, h, req, TCP_TIMEOUT_INIT);
 	tcp_synq_added(sk);
 }



Signed-off-by: Eric Dumazet [EMAIL PROTECTED]


--- linux-2.6.18/include/net/request_sock.h.orig	2006-10-16 10:53:11.0 +0200
+++ linux-2.6.18-ed/include/net/request_sock.h	2006-10-16 10:53:24.0 +0200
@@ -251,9 +251,9 @@
 	req->expires = jiffies + timeout;
 	req->retrans = 0;
 	req->sk = NULL;
-	req->dl_next = lopt->syn_table[hash];
 
 	write_lock(&queue->syn_wait_lock);
+	req->dl_next = lopt->syn_table[hash];
 	lopt->syn_table[hash] = req;
 	write_unlock(&queue->syn_wait_lock);
 }


Re: [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()

2006-10-16 Thread Eric Dumazet

On Monday 16 October 2006 18:16, Arnaldo Carvalho de Melo wrote:
 On 10/16/06, Eric Dumazet [EMAIL PROTECTED] wrote:
  (Sorry, patch inlined this time)
 
  Hi David
 
  While browsing include/net/request_sock.h I found this suspicious locking
  protecting the SYN table hash table. I think this patch is necessary.
 
  Thank you

 Interesting, just checked and it was there before I moved this out of tcp
 land:

Well, the bug was there before you put your hands on the code (I checked 
linux-2.4.33 and linux-2.4.1; the bug is present in both versions)

:)

Eric

PATCH zero-copy send completion callback

2006-10-16 Thread Eric Barton


This patch has been used with the lustre cluster file system (www.lustre.org)
to give notification when page buffers used to send bulk data via TCP/IP may be
overwritten.  It implements...

  a) A general-purpose callback to inform higher-level protocols when a
 zero-copy send of a set of pages has completed.

  b) tcp_sendpage_zccd(), a variation on tcp_sendpage() that includes a
 completion callback parameter.

How to use it (you are a higher-level protocol driver)...

  a) Initialise a zero-copy descriptor with your callback procedure.

  b) Pass this descriptor in all zero-copy sends for an arbitrary set of pages.
 Skbuffs that reference your pages also take a reference on your zero-copy
 callback descriptor.  They release this reference when they release their
 page references.

  c) Release your own reference when you've posted all your pages and you're
 ready for the callback.

  d) The callback occurs when the last reference is dropped.
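
Steps a) to d) can be walked through with a small userspace sketch. It mirrors the shape of the descriptor in the patch below but uses C11 atomics instead of the kernel's atomic_t, and struct bulk_send, bulk_send_done() and simulate_send() are hypothetical names standing in for a higher-level protocol driver and its skbuffs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace stand-in for the zero-copy callback descriptor. */
struct zccd {
	atomic_int refcount;
	void (*callback)(struct zccd *);
};

/* Step a: the initialiser holds the first reference. */
static void zccd_init(struct zccd *d, void (*callback)(struct zccd *))
{
	atomic_init(&d->refcount, 1);
	d->callback = callback;
}

/* Step b: each buffer that references the pages takes a reference. */
static void zccd_incref(struct zccd *d)
{
	atomic_fetch_add(&d->refcount, 1);
}

/* Step d: whoever drops the last reference triggers the callback. */
static void zccd_decref(struct zccd *d)
{
	if (atomic_fetch_sub(&d->refcount, 1) == 1)
		d->callback(d);
}

/* Hypothetical caller state with the descriptor embedded as first member. */
struct bulk_send {
	struct zccd zccd;
	int completed;
};

static void bulk_send_done(struct zccd *d)
{
	/* zccd is the first member, so this cast recovers the container. */
	struct bulk_send *bs = (struct bulk_send *)d;

	bs->completed++;
}

/* Simulate a send split across nbufs buffers.  *fired_early reports whether
 * the callback had already run when the caller dropped its own reference.
 * Returns the total number of callbacks made. */
static int simulate_send(int nbufs, int *fired_early)
{
	struct bulk_send bs = { .completed = 0 };
	int i;

	zccd_init(&bs.zccd, bulk_send_done);    /* step a */
	for (i = 0; i < nbufs; i++)
		zccd_incref(&bs.zccd);          /* step b: buffers take refs */
	zccd_decref(&bs.zccd);                  /* step c: drop our own ref */
	*fired_early = bs.completed;
	for (i = 0; i < nbufs; i++)
		zccd_decref(&bs.zccd);          /* buffers release their pages */
	return bs.completed;
}
```

With three buffers outstanding the callback is held back until the last buffer lets go; with none, it fires as soon as the caller drops its own reference.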


This patch applies on branch 'master' of
git://kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 85577a4..4afaef1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -129,6 +129,36 @@ struct skb_frag_struct {
__u16 size;
 };
 
+/* Zero Copy Callback Descriptor
+ * This struct supports receiving notification when zero-copy network I/O has
+ * completed.  The ZCCD can be embedded in a struct containing the state of a
+ * zero-copy network send.  Every skbuff that references that send's pages also
+ * keeps a reference on the ZCCD.  When they have all been disposed of, the
+ * reference count on the ZCCD drops to zero and the callback is made, telling
+ * the original caller that the pages may now be overwritten. */
+struct zccd 
+{
+   atomic_t zccd_refcount;
+   void   (*zccd_callback)(struct zccd *); 
+};
+
+static inline void zccd_init (struct zccd *d, void (*callback)(struct zccd *))
+{
+	atomic_set (&d->zccd_refcount, 1);
+	d->zccd_callback = callback;
+}
+
+static inline void zccd_incref (struct zccd *d)	/* take a reference */
+{
+	atomic_inc (&d->zccd_refcount);
+}
+
+static inline void zccd_decref (struct zccd *d)	/* release a reference */
+{
+	if (atomic_dec_and_test (&d->zccd_refcount))
+		(d->zccd_callback)(d);
+}
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -141,6 +171,11 @@ struct skb_shared_info {
 	unsigned short	gso_type;
 	unsigned int	ip6_frag_id;
 	struct sk_buff	*frag_list;
+	struct zccd	*zccd1;
+	struct zccd	*zccd2;
+	/* NB zero-copy data is normally whole pages.  We have 2 zccds in an
+	 * skbuff so we don't unnecessarily split the packet where pages fall
+	 * into the same packet. */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
@@ -1311,6 +1346,23 @@ #ifdef CONFIG_HIGHMEM
 #endif
 }
 
+/* This skbuf has dropped its pages: drop refs on any zero-copy callback
+ * descriptors it has. */
+static inline void skb_complete_zccd (struct sk_buff *skb)
+{
+	struct skb_shared_info *info = skb_shinfo(skb);
+
+	if (info->zccd1 != NULL) {
+		zccd_decref(info->zccd1);
+		info->zccd1 = NULL;
+	}
+
+	if (info->zccd2 != NULL) {
+		zccd_decref(info->zccd2);
+		info->zccd2 = NULL;
+	}
+}
+
 #define skb_queue_walk(queue, skb) \
 		for (skb = (queue)->next;					\
 		     prefetch(skb->next), (skb != (struct sk_buff *)(queue));	\
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..e02b55f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -278,6 +278,8 @@ extern int  tcp_v4_tw_remember_stam
 extern int tcp_sendmsg(struct kiocb *iocb, struct sock *sk,
struct msghdr *msg, size_t size);
 extern ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags);
+extern ssize_t tcp_sendpage_zccd(struct socket *sock, struct page *page, int offset, size_t size,
+				 int flags, struct zccd *zccd);
 
 extern int tcp_ioctl(struct sock *sk, 
  int cmd, 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3c23760..a1d2ed0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -177,6 +177,8 @@ struct sk_buff *__alloc_skb(unsigned int
 	shinfo->gso_type = 0;
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
+	shinfo->zccd1 = NULL;
+	shinfo->zccd2 = NULL;
 
if (fclone) {
struct sk_buff *child = skb + 1;
@@

Re: [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()

2006-10-16 Thread Eric Dumazet

On Monday 16 October 2006 18:56, Eric Dumazet wrote:
 On Monday 16 October 2006 18:16, Arnaldo Carvalho de Melo wrote:
  On 10/16/06, Eric Dumazet [EMAIL PROTECTED] wrote:
   (Sorry, patch inlined this time)
  
   Hi David
  
   While browsing include/net/request_sock.h I found this suspicious
   locking protecting the SYN table hash table. I think this patch is
   necessary.
  
   Thank you
 
  Interesting, just checked and it was there before I moved this out of tcp
  land:

 Well, the bug was there before you put your hands on the code (I checked
 linux-2.4.33 and linux-2.4.1; the bug is present in both versions)

Well, 'bug' is not the right word, in fact. Overkill, maybe?

The comment from include/net/request_sock.h explains it...

 * %syn_wait_lock is necessary only to avoid proc interface having to grab the main
 * lock sock while browsing the listening hash (otherwise it's deadlock prone).
 *
 * This lock is acquired in read mode only from listening_get_next() seq_file
 * op and it's acquired in write mode _only_ from code that is actively
 * changing rskq_accept_head. All readers that are holding the master sock lock
 * don't need to grab this lock in read mode too as rskq_accept_head. writes
 * are always protected from the main sock lock.

I bet more appropriate code (and less prone to misreading by kernel 
gurus and newbies alike) would be:

What do you think?

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]
--- linux-2.6.19-rc2/include/net/request_sock.h	2006-10-13 18:25:04.0 +0200
+++ linux-2.6.19-rc2-ed/include/net/request_sock.h	2006-10-16 19:34:19.0 +0200
@@ -254,9 +254,13 @@
 	req->sk = NULL;
 	req->dl_next = lopt->syn_table[hash];
 
-	write_lock(&queue->syn_wait_lock);
+	/*
+	 * We want previous writes being committed before doing this change,
+	 * so that readers of the chain are not confused.
+	 */
+	smp_mb();
+
 	lopt->syn_table[hash] = req;
-	write_unlock(&queue->syn_wait_lock);
 }
 
 #endif /* _REQUEST_SOCK_H */
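
A userspace analogue of the idea in the patch above, for illustration only: fill in the new entry completely, then publish it with a store that orders the earlier writes. The sketch uses C11 release/acquire atomics rather than smp_mb(), and struct req, struct bucket and the bucket_*() helpers are invented names. It assumes writers are already serialised by some other lock (the main sock lock in the thread) and covers insertion only; removing entries while lockless readers walk the chain is a separate problem that this does not address.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct req {
	int id;
	struct req *dl_next;
};

/* One hash bucket: the head pointer is the only shared, published field. */
struct bucket {
	_Atomic(struct req *) head;
};

static void bucket_init(struct bucket *b)
{
	atomic_init(&b->head, (struct req *)NULL);
}

/* Writers are assumed to be serialised externally. */
static void bucket_add(struct bucket *b, struct req *r)
{
	/* Link the new entry to the current chain before anyone can see it. */
	r->dl_next = atomic_load_explicit(&b->head, memory_order_relaxed);
	/* The release store makes the writes to *r visible before the pointer. */
	atomic_store_explicit(&b->head, r, memory_order_release);
}

/* Lockless reader: the acquire load pairs with the release store above. */
static int bucket_count(struct bucket *b)
{
	struct req *r = atomic_load_explicit(&b->head, memory_order_acquire);
	int n = 0;

	for (; r != NULL; r = r->dl_next)
		n++;
	return n;
}
```

A reader that observes the new head is then guaranteed to see a fully initialised dl_next, which is the confusion the comment in the patch is trying to rule out.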

[RFC] wrr (weighted round-robin) bonding

2006-10-16 Thread Dawid Ciezarkiewicz

This patch is a little thinner than the previous one.


Re: [RFC] wrr (weighted round-robin) bonding

2006-10-16 Thread Dawid Ciezarkiewicz

On Monday, 16 October 2006 20:21, Dawid Ciezarkiewicz wrote:
 This patch is a little thinner than the previous one.

I'm sorry for that. I've just ... never mind. Here goes the patch.

Should I post the patch for ifenslave here, too?



diff -Nur linux-2.6.17.orig/Documentation/networking/bonding.txt linux-2.6.17/Documentation/networking/bonding.txt
--- linux-2.6.17.orig/Documentation/networking/bonding.txt	2006-06-18 03:49:35.0 +0200
+++ linux-2.6.17/Documentation/networking/bonding.txt	2006-07-28 15:47:55.0 +0200
@@ -398,6 +398,19 @@
swapped with the new curr_active_slave that was
chosen.
 
+   weighted-rr or 7
+
+		Weighted round-robin bonding. In this mode the bonding
+		interface will use the weights assigned to its slaves.
+
+		Each slave can have a weight assigned via ioctl (ifenslave).
+		These values will be used at the start of each cycle.
+		Each slave will have its token counter restored to its weight.
+		Then, using a round-robin mechanism, those tokens are used
+		to pay for emitted frames. When all token counters are
+		zeroed, a new cycle begins.
+   
+
 primary
 
A string (eth0, eth2, etc) specifying which slave is the
diff -Nur linux-2.6.17.orig/drivers/net/bonding/bond_main.c linux-2.6.17/drivers/net/bonding/bond_main.c
--- linux-2.6.17.orig/drivers/net/bonding/bond_main.c	2006-06-18 03:49:35.0 +0200
+++ linux-2.6.17/drivers/net/bonding/bond_main.c	2006-07-28 15:31:44.0 +0200
@@ -115,7 +115,7 @@
 MODULE_PARM_DESC(mode, "Mode of operation : 0 for balance-rr, "
 		       "1 for active-backup, 2 for balance-xor, "
 		       "3 for broadcast, 4 for 802.3ad, 5 for balance-tlb, "
-		       "6 for balance-alb");
+		       "6 for balance-alb, 7 for weighted-rr");
 module_param(primary, charp, 0);
 MODULE_PARM_DESC(primary, "Primary network device to use");
 module_param(lacp_rate, charp, 0);
@@ -162,6 +162,7 @@
 {	"802.3ad",		BOND_MODE_8023AD},
 {	"balance-tlb",		BOND_MODE_TLB},
 {	"balance-alb",		BOND_MODE_ALB},
+{	"weighted-rr",		BOND_MODE_WEIGHTED_RR},
 {	NULL,			-1},
 };
 
@@ -194,6 +195,8 @@
 		return "transmit load balancing";
 	case BOND_MODE_ALB:
 		return "adaptive load balancing";
+	case BOND_MODE_WEIGHTED_RR:
+		return "weighted round robin (weighted-rr)";
 	default:
 		return "unknown";
 	}
@@ -1198,6 +1201,24 @@
return 0;
 }
 
+int bond_set_weight(struct net_device *bond_dev, struct net_device *slave_dev,
+		u16 weight)
+{
+	struct slave* slave;
+	slave = bond_get_slave_by_dev(bond_dev->priv, slave_dev);
+	if (!slave) {
+		return -EINVAL;
+	}
+
+	slave->weight = weight;
+
+	if (weight) {
+		slave->link = BOND_LINK_UP;
+		slave->state = BOND_STATE_ACTIVE;
+	}
+	return 0;
+}
+
 #define BOND_INTERSECT_FEATURES \
(NETIF_F_SG|NETIF_F_IP_CSUM|NETIF_F_NO_CSUM|NETIF_F_HW_CSUM|\
NETIF_F_TSO|NETIF_F_UFO)
@@ -1336,6 +1352,9 @@
 */
 	new_slave->original_flags = slave_dev->flags;
 
+	/* slave default weight = 1 */
+	new_slave->weight = 1;
+
 	/*
 	 * Save slave's original (permanent) mac address for modes
 	 * that need it, and for restoring it upon release, and then
@@ -3601,7 +3620,10 @@
 	}
 
 	down_write(&(bonding_rwsem));
-	slave_dev = dev_get_by_name(ifr->ifr_slave);
+	if (cmd != SIOCBONDSETWEIGHT)
+		slave_dev = dev_get_by_name(ifr->ifr_slave);
+	else
+		slave_dev = dev_get_by_name(ifr->ifr_weight_slave);
 
 	dprintk("slave_dev=%p: \n", slave_dev);
 
@@ -3626,6 +3648,9 @@
 	case SIOCBONDCHANGEACTIVE:
 		res = bond_ioctl_change_active(bond_dev, slave_dev);
 		break;
+	case SIOCBONDSETWEIGHT:
+		res = bond_set_weight(bond_dev, slave_dev, ifr->ifr_weight_weight);
+		break;
 	default:
 		res = -EOPNOTSUPP;
 	}
@@ -3881,6 +3906,67 @@
return 0;
 }
 
+static int bond_xmit_weighted_rr(struct sk_buff *skb, struct net_device *bond_dev)
+{
+	struct bonding *bond = bond_dev->priv;
+	struct slave *slave, *start_at;
+	int i;
+	int res = 1;
+	int were_weight_tokens_recharged = 0;
+
+	read_lock(&bond->lock);
+
+	if (!BOND_IS_OK(bond)) {
+		goto out;
+	}
+
+	read_lock(&bond->curr_slave_lock);
+	slave = start_at = bond->curr_active_slave;
+	read_unlock(&bond->curr_slave_lock);
+
+	if (!slave) {
+		goto out;
+	}
+
+try_send:
+   bond_for_each_slave_from(bond, slave, i,

[PATCH] d80211: remove unused Super AG definitions, purge comment

2006-10-16 Thread David Kimdon

Remove unused Super AG structure members, enums.

In struct ieee80211_tx_status the queue_length and queue_number could
be useful outside the context of Super AG, so remove the comment and
leave the members.

Signed-off-by: David Kimdon [EMAIL PROTECTED]

Index: wireless-dev/include/net/d80211.h
===
--- wireless-dev.orig/include/net/d80211.h
+++ wireless-dev/include/net/d80211.h
@@ -159,12 +159,6 @@ struct ieee80211_tx_control {
unsigned int requeue:1;
unsigned int first_fragment:1;  /* This is a first fragment of the
 * frame */
-/* following three flags are only used with Atheros Super A/G */
-   unsigned int compress:1;
-   unsigned int turbo_prime_notify:1; /* notify HostAPd after frame
-   * transmission */
-   unsigned int fast_frame:1;
-
 unsigned int power_level:8; /* per-packet transmit power level, in dBm
 */
unsigned int antenna_sel:4; /* 0 = default/diversity,
@@ -219,7 +213,6 @@ struct ieee80211_tx_status {
int excessive_retries;
int retry_count;
 
-   /* following two fields are only used with Atheros Super A/G */
int queue_length;  /* information about TX queue */
int queue_number;
 };
@@ -265,13 +258,6 @@ struct ieee80211_conf {
 int antenna_def;
 int antenna_mode;
 
-   int atheros_super_ag_compression;
-   int atheros_super_ag_fast_frame;
-   int atheros_super_ag_burst;
-   int atheros_super_ag_wme_ele;
-   int atheros_super_ag_turbo_g;
-   int atheros_super_ag_turbo_prime;
-
/* Following five fields are used for IEEE 802.11H */
unsigned int radar_detect;
unsigned int spect_mgmt;
Index: wireless-dev/net/d80211/hostapd_ioctl.h
===
--- wireless-dev.orig/net/d80211/hostapd_ioctl.h
+++ wireless-dev/net/d80211/hostapd_ioctl.h
@@ -182,10 +182,6 @@ struct prism2_hostapd_param {
u16 aid;
u16 capability;
u8 supp_rates[32];
-   /* atheros_super_ag and enc_flags are only used with
-* IEEE80211_ATHEROS_SUPER_AG
-*/
-   u8 atheros_super_ag;
u8 wds_flags;
 #define IEEE80211_STA_DYNAMIC_ENC BIT(0)
u8 enc_flags;
Index: wireless-dev/include/net/d80211_shared.h
===
--- wireless-dev.orig/include/net/d80211_shared.h
+++ wireless-dev/include/net/d80211_shared.h
@@ -19,8 +19,6 @@ enum {
MODE_ATHEROS_TURBO = 2 /* Atheros Turbo mode (2x.11a at 5 GHz) */,
MODE_IEEE80211G = 3 /* IEEE 802.11g (and 802.11b compatibility) */,
MODE_ATHEROS_TURBOG = 4 /* Atheros Turbo mode (2x.11g at 2.4 GHz) */,
-   MODE_ATHEROS_PRIME = 5 /* Atheros Dynamic Turbo mode */,
-   MODE_ATHEROS_PRIMEG = 6 /* Atheros Dynamic Turbo mode G */,
NUM_IEEE80211_MODES = 7
 };
 

--

Re: [RFC] wrr (weighted round-robin) bonding

2006-10-16 Thread Jay Vosburgh


Dawid Ciezarkiewicz [EMAIL PROTECTED] wrote:
[...]
+  weighted-rr or 7
+
+  Weighted round-robin bonding. In this mode the bonding
+  interface will use the weights assigned to its slaves.
+
+  Each slave can have a weight assigned via ioctl (ifenslave).
+  These values will be used at the start of each cycle.
+  Each slave will have its token counter restored to its weight.
+  Then, using a round-robin mechanism, those tokens are used
+  to pay for emitted frames. When all token counters are
+  zeroed, a new cycle begins.

Before getting into the technical bits of the patch, what's the
reason for wanting to do this, and why is this rather complex manual
weight assignment better than an automatic system based on, e.g., link
speed of the slaves?

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]

Re: [RFC] wrr (weighted round-robin) bonding

2006-10-16 Thread Dawid Ciezarkiewicz

On Monday, 16 October 2006 20:50, you wrote:
 
 Dawid Ciezarkiewicz [EMAIL PROTECTED] wrote:
 [...]
 +weighted-rr or 7
 +
 +Weighted round-robin bonding. In this mode the bonding
 +interface will use the weights assigned to its slaves.
 +
 +Each slave can have a weight assigned via ioctl (ifenslave).
 +These values will be used at the start of each cycle.
 +Each slave will have its token counter restored to its weight.
 +Then, using a round-robin mechanism, those tokens are used
 +to pay for emitted frames. When all token counters are
 +zeroed, a new cycle begins.
 
   Before getting into the technical bits of the patch, what's the
 reason for wanting to do this, and why is this rather complex manual
 weight assignment better than an automatic system based on, e.g., link
 speed of the slaves?

In short:
It was designed as a solution for bonding wireless links, where link quality 
can change rather quickly over time. With wrr bonding, userspace tools can 
measure the current bandwidth and change the bonding slave weights in real time.

It was written for Lintrack, and you can read about its usage here:
http://lintrack.org/index.php/about/advantage
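
For readers who want to see the token scheme from the quoted documentation in isolation, here is a small userspace sketch. It is not the code from the patch: struct wrr_bond, struct wrr_slave and the wrr_*() helpers are invented for illustration, weights are fixed at init time instead of being set through an ioctl, and link state is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define WRR_MAX_SLAVES 8

struct wrr_slave {
	unsigned int weight;   /* frames allowed per cycle */
	unsigned int tokens;   /* frames still allowed in the current cycle */
};

struct wrr_bond {
	struct wrr_slave slave[WRR_MAX_SLAVES];
	size_t nslaves;
	size_t next;           /* where the round-robin scan resumes */
};

static void wrr_init(struct wrr_bond *b, const unsigned int *weights, size_t n)
{
	size_t i;

	assert(n <= WRR_MAX_SLAVES);
	b->nslaves = n;
	b->next = 0;
	for (i = 0; i < n; i++) {
		b->slave[i].weight = weights[i];
		b->slave[i].tokens = 0;   /* the first pick starts a cycle */
	}
}

/* Start a new cycle: every token counter is restored to its weight. */
static void wrr_recharge(struct wrr_bond *b)
{
	size_t i;

	for (i = 0; i < b->nslaves; i++)
		b->slave[i].tokens = b->slave[i].weight;
}

/* Pick the slave for the next frame, or -1 if every weight is zero. */
static int wrr_pick(struct wrr_bond *b)
{
	int pass;
	size_t i;

	for (pass = 0; pass < 2; pass++) {
		for (i = 0; i < b->nslaves; i++) {
			size_t idx = (b->next + i) % b->nslaves;

			if (b->slave[idx].tokens > 0) {
				b->slave[idx].tokens--;   /* pay for the frame */
				b->next = (idx + 1) % b->nslaves;
				return (int)idx;
			}
		}
		wrr_recharge(b);   /* all counters are zero: new cycle */
	}
	return -1;
}

/* Send nframes frames and count how many each slave carried. */
static void wrr_distribute(struct wrr_bond *b, unsigned int nframes,
			   unsigned int *counts)
{
	while (nframes--) {
		int idx = wrr_pick(b);

		if (idx < 0)
			return;
		counts[idx]++;
	}
}
```

With weights 2 and 1, each cycle of three frames puts two on the first slave and one on the second, so changing a weight from userspace shifts the split from the next cycle on.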

Re: [patch 3/6] 2.6.18: sb1250-mac: Phylib IRQ handling fixes

2006-10-16 Thread Andrew Morton

On Mon, 16 Oct 2006 15:50:55 +0100 (BST)
Maciej W. Rozycki [EMAIL PROTECTED] wrote:

 Andrew,
 
  I don't get it.  If some code does
  
  rtnl_lock();
  flush_scheduled_work();
  
  and there's some work scheduled which does rtnl_lock() then it'll deadlock.
  
  But it'll deadlock whether or not the caller of flush_scheduled_work() is
  keventd.
  
  Calling flush_scheduled_work() under locks is generally a bad idea.
 
  Indeed -- this is why I avoid it.
 
  I'd have thought that was still deadlockable.  Could you describe the
  deadlock more completely please?
 
  The simplest sequence of calls that prevents races here is as follows:
 
 unregister_netdev();
   rtnl_lock();
   unregister_netdevice();
     dev_close();
       sbmac_close();
         phy_stop();
         phy_disconnect();
           phy_stop_interrupts();
             phy_disable_interrupts();
             flush_scheduled_work();
             free_irq();
           phy_detach();
         mdiobus_unregister();
   rtnl_unlock();
 
 We want to call flush_scheduled_work() from phy_stop_interrupts(), because 
 there may still be calls to phy_change() waiting for the keventd to 
 process and mdiobus_unregister() frees structures needed by phy_change().  
 This may deadlock because of the call to rtnl_lock() though.
 
  So the modified sequence I have implemented is as follows:
 
 unregister_netdev();
   rtnl_lock();
   unregister_netdevice();
     dev_close();
       sbmac_close();
         phy_stop();
         schedule_work(); [sbmac_phy_disconnect()]
   rtnl_unlock();
 
 and then:
 
 sbmac_phy_disconnect();
   phy_disconnect();
     phy_stop_interrupts();
       phy_disable_interrupts();
       free_irq();
     phy_detach();
   mdiobus_unregister();
 
 This guarantees calls to phy_change() for this PHY unit will not be run 
 after mdiobus_unregister(), because any such calls have been placed in the 
 queue before sbmac_phy_disconnect() (phy_stop() prevents the interrupt 
 handler from issuing new calls to phy_change()).
 
  We still need flush_scheduled_work() to be called from 
 phy_stop_interrupts() though, in case some other MAC driver calls 
 phy_disconnect() (or phy_stop_interrupts(), depending on the abstraction 
 layer of phylib used) directly rather than using keventd.  This is 
 possible if phy_disconnect() is called from the driver's module_exit() 
 call, which may make sense for devices that are known not to have their 
 MII interface available as an external connector.  Hence the:
 
 if (!current_is_keventd())
   flush_scheduled_work();
 
 sequence in phy_stop_interrupts().  Of course, we can force all drivers 
 using phylib (in the interrupt mode) to call phy_disconnect() through 
 keventd instead.
 
  Does it sound clearer?
 

Vaguely.  Why doesn't it deadlock if !current_is_keventd()?  I mean,
whether or not the caller is keventd, the flush_scheduled_work() caller
will still be dependent upon rtnl_lock() being acquirable.
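
A rough userspace sketch of the pattern in question, for illustration only.
It is not kernel code: toy_lock stands in for a non-recursive lock such as
the RTNL, toy_flush() for flush_scheduled_work(), and every toy_* name is
hypothetical. A trylock is used so the "would block forever" case shows up
as a return value instead of an actual hang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a non-recursive lock such as the RTNL. */
struct toy_lock {
	bool held;
};

bool toy_trylock(struct toy_lock *l)
{
	assert(l != NULL);
	if (l->held)
		return false;
	l->held = true;
	return true;
}

void toy_unlock(struct toy_lock *l)
{
	assert(l != NULL);
	l->held = false;
}

/* A queued work item that needs the lock, like a handler that calls
 * rtnl_lock().  Returns false where a blocking lock would sleep forever. */
bool toy_work_needing_lock(struct toy_lock *l)
{
	if (!toy_trylock(l))
		return false;
	toy_unlock(l);
	return true;
}

/* Stand-in for flush_scheduled_work(): run the pending item to completion.
 * Returns false if the item could never complete. */
bool toy_flush(bool (*pending)(struct toy_lock *), struct toy_lock *l)
{
	return pending(l);
}

/* Flushing with the lock held: the pending work cannot make progress,
 * regardless of which thread the flusher happens to be. */
bool toy_flush_under_lock(struct toy_lock *l)
{
	bool done;

	if (!toy_trylock(l))
		return false;
	done = toy_flush(toy_work_needing_lock, l);
	toy_unlock(l);
	return done;
}
```

In this model the flush only completes when the caller does not hold the
lock, which is the dependency the question above is about.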



Re: 2.6.18-mm2 boot failure on x86-64

2006-10-16 Thread Vivek Goyal

On Mon, Oct 09, 2006 at 10:53:58AM +0100, Mel Gorman wrote:
 On Fri, 6 Oct 2006, Vivek Goyal wrote:
 
 On Fri, Oct 06, 2006 at 01:03:50PM -0500, Steve Fox wrote:
 On Fri, 2006-10-06 at 18:11 +0100, Mel Gorman wrote:
 On (06/10/06 11:36), Vivek Goyal didst pronounce:
 Where is bss placed in physical memory? I guess bss_start and bss_stop
 from System.map will tell us. That will confirm that above memset step
 is stomping over bss. Then we have to just find that somewhere probably
 we allocated wrong physical memory area for bootmem allocator map.
 
 
 BSS is at 0x643000 - 0x777BC4
 init_bootmem wipes from 0x777000 - 0x8F7000
 
 So the BSS bytes from 0x777000 -> 0x777BC4 (which looks very suspiciously
 like a page alignment of addr & PAGE_MASK) get set to 0xFF. One possible
 fix is below. It adds a check in bad_addr() to see if the BSS section is
 about to be used for bootmap. It Seems To Work For Me (tm) and illustrates
 the source of the problem even if it's not the 100% correct fix.
 
 I was able to boot the machine with Mel's patch applied on top of
 -git22.
 
 
 Please have a look at the attached patch. Does it make sense?
 
 
 It makes some sense. As you state, it wastes memory but that is better 
 than breaking.
 
 Steve, can you please give this patch a try if it fixes the problem?
 
 
 I boottested the patch on the same machine as Steve was using and it 
 completed successfully.


Hi Andrew,

Can you please have a look at the attached patch and include it in -mm?
It fixes the issue for Steve. The issue also appears on Adrian Bunk's list
of known regressions:

Subject: oops in xfrm_register_mode
References : http://lkml.org/lkml/2006/10/4/170
Submitter  : Steve Fox [EMAIL PROTECTED]
Handled-By : Vivek Goyal [EMAIL PROTECTED]
Status : patch available



o Currently some code assumes that the addresses returned by find_e820_area()
  are page aligned. But it looks like find_e820_area() had no such intention,
  so callers can end up stomping over unrelated data. One such case is the
  bootmem allocator initialization code, which stomped over bss.

o This patch modifies find_e820_area() to return a page-aligned address. This
  might waste a little memory, but page-aligned memory is probably easier
  to handle.

Signed-off-by: Vivek Goyal [EMAIL PROTECTED]
---

 arch/x86_64/kernel/e820.c |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff -puN arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area arch/x86_64/kernel/e820.c
--- linux-2.6.19-rc1-1M/arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area	2006-10-06 15:28:13.0 -0400
+++ linux-2.6.19-rc1-1M-root/arch/x86_64/kernel/e820.c	2006-10-06 15:44:45.0 -0400
@@ -54,13 +54,13 @@ static inline int bad_addr(unsigned long
 
 	/* various gunk below that needed for SMP startup */
 	if (addr < 0x8000) {
-		*addrp = 0x8000;
+		*addrp = PAGE_ALIGN(0x8000);
 		return 1;
 	}
 
 	/* direct mapping tables of the kernel */
 	if (last >= table_start<<PAGE_SHIFT && addr < table_end<<PAGE_SHIFT) {
-		*addrp = table_end << PAGE_SHIFT;
+		*addrp = PAGE_ALIGN(table_end << PAGE_SHIFT);
 		return 1;
 	}
 
@@ -68,18 +68,18 @@ static inline int bad_addr(unsigned long
 #ifdef CONFIG_BLK_DEV_INITRD
 	if (LOADER_TYPE && INITRD_START && last >= INITRD_START &&
 	    addr < INITRD_START+INITRD_SIZE) {
-		*addrp = INITRD_START + INITRD_SIZE;
+		*addrp = PAGE_ALIGN(INITRD_START + INITRD_SIZE);
 		return 1;
 	}
 #endif
 	/* kernel code */
-	if (last >= __pa_symbol(&_text) && last < __pa_symbol(&_end)) {
-		*addrp = __pa_symbol(&_end);
+	if (last >= __pa_symbol(&_text) && addr < __pa_symbol(&_end)) {
+		*addrp = PAGE_ALIGN(__pa_symbol(&_end));
 		return 1;
 	}
 
 	if (last >= ebda_addr && addr < ebda_addr + ebda_size) {
-		*addrp = ebda_addr + ebda_size;
+		*addrp = PAGE_ALIGN(ebda_addr + ebda_size);
 		return 1;
 	}
 
@@ -152,7 +152,7 @@ unsigned long __init find_e820_area(unsi
 			continue;
 		while (bad_addr(&addr, size) && addr+size <= ei->addr+ei->size)
 			;
-		last = addr + size;
+		last = PAGE_ALIGN(addr) + size;
 		if (last > ei->addr + ei->size)
 			continue;
 		if (last > end)
_
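
To make the alignment issue concrete, here is a small userspace sketch using
the addresses quoted earlier in this thread (BSS at 0x643000 - 0x777BC4,
init_bootmem wiping from 0x777000). It assumes 4 KiB pages; the TOY_* macros
and toy_* helpers are local stand-ins, not the kernel's PAGE_MASK and
PAGE_ALIGN() definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry: 4 KiB pages. */
#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  ((uint64_t)1 << TOY_PAGE_SHIFT)
#define TOY_PAGE_MASK  (~(TOY_PAGE_SIZE - 1))

/* Round an address down to its page start (addr & PAGE_MASK). */
uint64_t toy_page_floor(uint64_t addr)
{
	return addr & TOY_PAGE_MASK;
}

/* Round an address up to the next page boundary, in the manner of
 * PAGE_ALIGN(). An already aligned address is returned unchanged. */
uint64_t toy_page_align(uint64_t addr)
{
	return (addr + TOY_PAGE_SIZE - 1) & TOY_PAGE_MASK;
}

/* Do the half-open ranges [a_start, a_end) and [b_start, b_end) overlap? */
int toy_overlaps(uint64_t a_start, uint64_t a_end,
		 uint64_t b_start, uint64_t b_end)
{
	assert(a_start <= a_end && b_start <= b_end);
	return a_start < b_end && b_start < a_end;
}
```

With these helpers, rounding the end of BSS down gives 0x777000, which
overlaps the BSS itself, while rounding it up gives 0x778000, which does not.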

Re: [PATCH] bcm43xx-softmac: add PCI-E code

2006-10-16 Thread Michael Buesch

On Monday 16 October 2006 06:18, Larry Finger wrote:
 From: Stefano Brivio [EMAIL PROTECTED]
 
 The current bcm43xx driver does not contain code to handle PCI-E interfaces
 such as the BCM4311 and BCM4312. This patch, originally written by Stefano
 Brivio, adds the necessary code to enable these interfaces.
 
 Signed-off-by: Stefano Brivio [EMAIL PROTECTED]
 Signed-off-by: Larry Finger [EMAIL PROTECTED]

This patch should be OK. Please merge for 2.6.19.

-- 
Greetings Michael.

RE: [patch 1/5] d80211: remove bitfields from ieee80211_tx_control

2006-10-16 Thread Simon Barber

Removing the bitfields makes the code much harder to read and maintain.
Here we are working around a problem with the compiler by making the
code ugly - rather than fixing the compiler. The compilers are getting
better and better (GCC 4 has much better handling of this type of
optimization) but the code will remain ugly forever.

Simon

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Michael Buesch
Sent: Monday, October 16, 2006 9:07 AM
To: David Kimdon
Cc: netdev@vger.kernel.org; John W. Linville; Jiri Benc
Subject: Re: [patch 1/5] d80211: remove bitfields from
ieee80211_tx_control

On Friday 13 October 2006 21:20, David Kimdon wrote:
 All one-bit bitfields have been subsumed into the new 'flags'
 structure member and the new IEEE80211_TXCTL_* definitions.  The 
 multiple bit members were converted to u8, s8 or u16 as appropriate.

And, eh, did this increase or decrease the struct size?
Does this generate better or worse code?

--
Greetings Michael.

Re: [patch 1/5] d80211: remove bitfields from ieee80211_tx_control

2006-10-16 Thread Michael Buesch

On Monday 16 October 2006 21:34, Simon Barber wrote:
 Removing the bitfields makes the code much harder to read and maintain.
 Here we are working around a problem with the compiler by making the
 code ugly - rather than fixing the compiler. The compilers are getting
 better and better (GCC 4 has much better handling of this type of
 optimization) but the code will remain ugly forever.

Yeah, that's my opinion on this, too.

But I still like the "unsigned int foo:16;" => "u16 foo;" type of conversions.
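
For readers who have not seen the patch, a minimal sketch of the two styles
being compared. The IEEE80211_TXCTL_* prefix comes from the patch description;
the specific flag and field names below (and the toy_ prefix) are made up for
illustration and may not match the real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits in the style of IEEE80211_TXCTL_*. */
#define TOY_TXCTL_REQ_TX_STATUS  (1u << 0)
#define TOY_TXCTL_DO_NOT_ENCRYPT (1u << 1)
#define TOY_TXCTL_USE_RTS_CTS    (1u << 2)

/* Before: one-bit and multi-bit bitfields. */
struct toy_tx_control_bitfields {
	unsigned int req_tx_status:1;
	unsigned int do_not_encrypt:1;
	unsigned int use_rts_cts:1;
	unsigned int retry_limit:8;
};

/* After: one-bit members folded into 'flags', wider members given a
 * fixed-width integer type. */
struct toy_tx_control_flags {
	uint32_t flags;
	uint8_t retry_limit;
};

/* Translate the bitfield layout into the flags layout. */
struct toy_tx_control_flags
toy_convert(const struct toy_tx_control_bitfields *old)
{
	struct toy_tx_control_flags ctl = { 0, 0 };

	assert(old != NULL);
	if (old->req_tx_status)
		ctl.flags |= TOY_TXCTL_REQ_TX_STATUS;
	if (old->do_not_encrypt)
		ctl.flags |= TOY_TXCTL_DO_NOT_ENCRYPT;
	if (old->use_rts_cts)
		ctl.flags |= TOY_TXCTL_USE_RTS_CTS;
	ctl.retry_limit = (uint8_t)old->retry_limit;
	return ctl;
}
```

A caller then tests a bit with "ctl.flags & TOY_TXCTL_USE_RTS_CTS" where it
previously read "ctl.use_rts_cts", which is the readability trade-off under
discussion. The resulting struct sizes depend on the compiler and ABI.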

-- 
Greetings Michael.

Re: [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()

2006-10-16 Thread David Miller

From: Eric Dumazet [EMAIL PROTECTED]
Date: Mon, 16 Oct 2006 11:00:22 +0200

 While browsing include/net/request_sock.h I found this suspicious locking 
 protecting the SYN table hash table. I think this patch is necessary.

 Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

People get tripped up by this one all the time.

We hold a higher-level lock, namely the listening socket
lock, which prevents other inserts from happening; it
works here like the RTNL semaphore does.

We only need to protect the actual change of the hash
head, as lookups can occur asynchronously and we want
linkage seen by lookups to be consistent.

Alexey likes to do this locking trick a lot.

Feel free to add a comment. :-)
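
A userspace sketch of the locking trick, for illustration only. It assumes
the caller already holds some outer lock that serializes all inserts (the
listening socket lock in the description above). A tiny C11 spinlock stands
in for the kernel's syn_wait_lock, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_HASH_SIZE 8

struct toy_req {
	int id;
	struct toy_req *next;
};

struct toy_queue {
	atomic_flag head_lock;	/* stand-in for the hash table lock */
	struct toy_req *table[TOY_HASH_SIZE];
};

void toy_queue_init(struct toy_queue *q)
{
	size_t i;

	atomic_flag_clear(&q->head_lock);
	for (i = 0; i < TOY_HASH_SIZE; i++)
		q->table[i] = NULL;
}

static void toy_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin */
}

static void toy_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Caller must hold the outer lock that excludes all other inserts. */
void toy_hash_req(struct toy_queue *q, unsigned int hash, struct toy_req *req)
{
	assert(hash < TOY_HASH_SIZE);

	/* No other writer can run, so the head can be read unlocked and
	 * the new entry fully linked before anyone can see it. */
	req->next = q->table[hash];

	/* Only publishing the new head needs the lock that lookups take. */
	toy_lock(&q->head_lock);
	q->table[hash] = req;
	toy_unlock(&q->head_lock);
}

/* Lookups may run at any time; they always see a consistent chain. */
struct toy_req *toy_lookup(struct toy_queue *q, unsigned int hash, int id)
{
	struct toy_req *r;

	assert(hash < TOY_HASH_SIZE);
	toy_lock(&q->head_lock);
	for (r = q->table[hash]; r != NULL && r->id != id; r = r->next)
		;
	toy_unlock(&q->head_lock);
	return r;
}
```

The point of the sketch is the ordering inside toy_hash_req(): the chain
pointer is set up before the locked head update, so a lookup never observes
a half-linked entry.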


Re: [PATCH 9/14] [TIPC] Name publication events now delivered in chronological order

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Mon, 16 Oct 2006 10:50:40 +0200 (CEST)

 I'm fairly sure this is a problem on your side. I received patch 10/14 
 from the netdev list and the two list archives I checked also had it.

I also got 2 copies which means it hit netdev for me too.

[PATCH] Fixed a number of bugs in the PHY Layer

2006-10-16 Thread Andy Fleming


* genphy_update_link is now exported
* Added a fix from [EMAIL PROTECTED] which changes forcing so it
  only updates the link.  Otherwise, it never tries the lower
  values, since it is always overwriting the speed/duplex values
  with the current ones, rather than the intended ones.
* Fixed a bug where bringing up a PHY with no link caused it to
  time out and enter forcing mode.  Once in forcing mode,
  plugging in the link didn't autonegotiate.  Now the AN state
  detects the lack of link, and enters the NO_LINK state.  AN
  only times out if the link is up and AN fails.
* Cleaned up the PHY_AN case, reducing one level of indentation
  for the timeout code.
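
The reworked AN handling described in the list above can be sketched as a
small standalone state step. This is a simplified model, not the driver code:
the toy_* types are hypothetical, the status read and aneg-done check are
reduced to plain fields, and the "magic aneg" retry is modelled as simply
staying in the AN state.

```c
#include <assert.h>
#include <stddef.h>

enum toy_phy_state {
	TOY_PHY_AN,
	TOY_PHY_NOLINK,
	TOY_PHY_RUNNING,
	TOY_PHY_FORCING
};

struct toy_phy {
	enum toy_phy_state state;
	int link;		/* 1 if the last status read saw link up */
	int aneg_done;		/* 1 if autonegotiation has completed */
	int link_timeout;	/* timer ticks left before giving up on AN */
	int has_magic_aneg;	/* 1 if the PHY should retry AN, not force */
};

/* One timer tick while in the AN state. */
void toy_phy_an_tick(struct toy_phy *phy)
{
	assert(phy != NULL && phy->state == TOY_PHY_AN);

	/* Link down: give up on negotiation for now instead of timing out. */
	if (!phy->link) {
		phy->state = TOY_PHY_NOLINK;
		return;
	}

	/* Link up and AN finished: we are running. */
	if (phy->aneg_done) {
		phy->state = TOY_PHY_RUNNING;
		return;
	}

	/* Link up but AN still not done: count down, then fall back. */
	if (phy->link_timeout-- == 0) {
		if (phy->has_magic_aneg)
			return;		/* retry AN, stay in this state */
		phy->state = TOY_PHY_FORCING;
	}
}
```

The ordering matters: because the link check comes first, the timeout branch
is only reachable while the link is up, which matches the last sentence of
the third item above.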
---
 drivers/net/phy/phy.c|   81 --
 drivers/net/phy/phy_device.c |1 +
 2 files changed, 40 insertions(+), 42 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 3af9fcf..c81536d 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -693,60 +693,57 @@ static void phy_timer(unsigned long data
 
 			break;
 		case PHY_AN:
+			err = phy_read_status(phydev);
+
+			if (err < 0)
+				break;
+
+			/* If the link is down, give up on
+			 * negotiation for now */
+			if (!phydev->link) {
+				phydev->state = PHY_NOLINK;
+				netif_carrier_off(phydev->attached_dev);
+				phydev->adjust_link(phydev->attached_dev);
+				break;
+			}
+
 			/* Check if negotiation is done.  Break
 			 * if there's an error */
 			err = phy_aneg_done(phydev);
 			if (err < 0)
 				break;
 
-			/* If auto-negotiation is done, we change to
-			 * either RUNNING, or NOLINK */
+			/* If AN is done, we're running */
 			if (err > 0) {
-				err = phy_read_status(phydev);
+				phydev->state = PHY_RUNNING;
+				netif_carrier_on(phydev->attached_dev);
+				phydev->adjust_link(phydev->attached_dev);
+
+			} else if (0 == phydev->link_timeout--) {
+				int idx;
 
-				if (err)
+				needs_aneg = 1;
+				/* If we have the magic_aneg bit,
+				 * we try again */
+				if (phydev->drv->flags & PHY_HAS_MAGICANEG)
 					break;
 
-				if (phydev->link) {
-					phydev->state = PHY_RUNNING;
-					netif_carrier_on(phydev->attached_dev);
-				} else {
-					phydev->state = PHY_NOLINK;
-					netif_carrier_off(phydev->attached_dev);
-				}
+				/* The timer expired, and we still
+				 * don't have a setting, so we try
+				 * forcing it until we find one that
+				 * works, starting from the fastest speed,
+				 * and working our way down */
+				idx = phy_find_valid(0, phydev->supported);
 
-				phydev->adjust_link(phydev->attached_dev);
+				phydev->speed = settings[idx].speed;
+				phydev->duplex = settings[idx].duplex;
 
-			} else if (0 == phydev->link_timeout--) {
-				/* The counter expired, so either we
-				 * switch to forced mode, or the
-				 * magic_aneg bit exists, and we try aneg
-				 * again */
-				if (!(phydev->drv->flags & PHY_HAS_MAGICANEG)) {
-					int idx;
-
-					/* We'll start from the
-					 * fastest speed, and work
-					 * our way down */
-					idx = phy_find_valid(0,
-							phydev->supported);
-
-					phydev->speed = settings[idx].speed;
-					phydev->duplex = settings[idx].duplex;
-
-					phydev->autoneg = AUTONEG_DISABLE;

Re: [RFC] wrr (weighted round-robin) bonding

2006-10-16 Thread Andy Gospodarek

On Mon, Oct 16, 2006 at 09:07:57PM +0200, Dawid Ciezarkiewicz wrote:
  
  Before getting into the technical bits of the patch, what's the
  reason for wanting to do this, and why is this rather complex manual
  weight assignment better than an automatic system based on, e.g., link
  speed of the slaves?
 
 In short:
 It was designed as a solution for wireless links bonding - where link quality 
 can change rather quickly in time. By using wrr bonding, userspace tools can 
 measure current bandwidth and change bonding slave weights in realtime.

Since this is so similar to mode 0, it would seem there would be a way
to extend it rather than creating yet another mode that is so similar.
What would be the reason not to enhance that mode?
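
For context, a minimal sketch of what weighted round-robin slave selection
can look like. This is not taken from the wrr patch, whose implementation is
not shown in this thread; it is one simple credit-based scheme, and the
toy_* names and the interface names used with it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_slave {
	const char *name;
	unsigned int weight;	/* packets per round, adjustable at runtime */
	unsigned int credit;	/* packets left in the current round */
};

/* Pick the slave for the next packet. Returns its index, or -1 if every
 * weight is zero. *cur remembers the slave used last time. */
int toy_wrr_next(struct toy_slave *s, size_t n, size_t *cur)
{
	int refilled = 0;
	size_t k, i;

	assert(s != NULL && cur != NULL && n > 0);
	for (;;) {
		/* Keep using the current slave while it has credit, then
		 * move on to the next one that still has some. */
		for (k = 0; k < n; k++) {
			i = (*cur + k) % n;
			if (s[i].credit > 0) {
				s[i].credit--;
				*cur = i;
				return (int)i;
			}
		}
		if (refilled)
			return -1;

		/* Round finished: refill every slave from its weight. */
		for (i = 0; i < n; i++)
			s[i].credit = s[i].weight;
		refilled = 1;
		*cur = 0;
	}
}
```

With weights 2 and 1, the selection order is slave 0, 0, 1 and then repeats.
Setting all weights equal degenerates into plain round-robin, which is the
similarity to mode 0 raised above.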


Re: poll problem with PF_PACKET when using PACKET_RX_RING

2006-10-16 Thread Joan Raventos

Is this a bug in PF_PACKET? Should the socket queue be
emptied by packet_set_ring (called via setsockopt when
PACKET_RX_RING is used) so the above cannot happen?
Should the user-space app drain the socket queue with
recvfrom prior to (4) -quite unlikely in practice-?
 

I guess the best way is not to bind the socket before having
completed setup. We could still flush the queue to make life
easier for userspace, not sure about that ..
 
 
 Even w/o bind, packet_create is doing a dev_add_pack, which I think will
 make pkts arrive at that socket (i.e. in netif_receive_skb one can see the
 loops over the rcu for both ptype_all and type-specific, which seem to match
 whenever !ptype->dev || ptype->dev == skb->dev).
 
 Also the packet_mmap.txt doc does not mention bind, which probably is more a 
 mechanism to closely specify a dev than to signal socket readiness.

 packet_create only calls dev_add_pack if a protocol is given.
 You can use a protocol number of 0 and then bind the socket
 after setting it up properly.

Currently I'm using ETH_P_ALL on the socket call. If I understand your proposal
correctly, you suggest passing 0 on the socket call, so that dev_add_pack is not
called, and afterwards using bind with a sockaddr_ll to set sll_protocol to
whatever value is needed (ETH_P_ALL in my case). Correct?
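
A sketch of that ordering in user space, assuming a Linux system with the
packet socket headers available. The toy_* function names are made up, error
handling is minimal, and actually opening the socket needs CAP_NET_RAW, so
treat this as an outline of the call sequence and not a tested capture tool.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* Build the address for the late bind(); this is where the protocol is
 * finally attached to the socket. ifindex 0 means any interface. */
struct sockaddr_ll toy_bind_addr(int ifindex, unsigned short proto)
{
	struct sockaddr_ll sll;

	memset(&sll, 0, sizeof(sll));
	sll.sll_family = AF_PACKET;
	sll.sll_protocol = htons(proto);
	sll.sll_ifindex = ifindex;
	return sll;
}

/* Returns the socket fd with *ring mapped, or -1 on failure. */
int toy_open_ring(int ifindex, const struct tpacket_req *req, void **ring)
{
	struct sockaddr_ll sll = toy_bind_addr(ifindex, ETH_P_ALL);
	size_t len;
	int fd;

	assert(req != NULL && ring != NULL);
	len = (size_t)req->tp_block_size * req->tp_block_nr;

	/* 1. Protocol 0: no packets are delivered to the socket yet. */
	fd = socket(PF_PACKET, SOCK_RAW, 0);
	if (fd < 0)
		return -1;

	/* 2. Set up the receive ring and map it. */
	if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0)
		goto fail;
	*ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (*ring == MAP_FAILED)
		goto fail;

	/* 3. Only now start receiving, by binding with the real protocol. */
	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
		munmap(*ring, len);
		goto fail;
	}
	return fd;

fail:
	close(fd);
	return -1;
}
```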

 According to your description, you first used setsockopt(...,
 PACKET_RX_RING), then mmap. In that case the receive queue
 should already get flushed by packet_set_ring (about line 1710).

Ok, I see... I guess if mmap has not been issued by the time setsockopt is
called, then po->mapped == 0 and the code you point out is triggered,
specifically skb_queue_purge.

 How did you verify that the receive queue still contains packets?

You are totally right! A non-blocking recv on the descriptor returns EAGAIN, so
the queues are empty. After further instrumentation of the ring code, I now
suspect an issue with the ring read index in the user-space app...

Nevertheless the whole ring communication between kernel and user-space seems 
to be based on marking pkts via a flag in each pkt slot in the ring 
(tp_status). I guess it works fine because the assignments are atomic (like the 
one on af_packet.c:671). Correct?
BTW what's the purpose of mb() and why is it exactly needed in that position in 
the code?

Thx again!

Salu2,
J.




[PATCH 0/13] [RFC] Fix problems with IPv6 routing subtrees and source address selection

2006-10-16 Thread Ville Nuorvala

Hi,

here are a bunch of more or less related patches that fix the IPv6 routing
subtrees and source address selection. Most of the code is a cleaned-up
version of what I wrote earlier for MIPL 2, where it has worked pretty well
for a couple of years now.

The SCTP code, however, turned out to be messier and more difficult to fix
than I had originally thought. As I'm not that familiar with SCTP and don't
really have an opportunity to test the code, I'm especially grateful for any
comments regarding those parts of the code.

I've tried to split up the changes into logical parts to help digest them.
Please comment!

Regards,
Ville

[PATCH 1/13] [IPV6] Remove struct pol_chain.

2006-10-16 Thread Ville Nuorvala

Struct pol_chain has existed since at least the 2.2 kernel, but isn't used
anymore. As the IPv6 policy routing is implemented in a totally different
way in the current kernel, just get rid of it.

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 include/net/ip6_route.h |7 ---
 1 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 6ca6b71..c14b70e 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -36,13 +36,6 @@ #define RT6_LOOKUP_F_IFACE   0x1
 #define RT6_LOOKUP_F_REACHABLE 0x2
 #define RT6_LOOKUP_F_HAS_SADDR 0x4

-struct pol_chain {
-	int			type;
-	int			priority;
-	struct fib6_node	*rules;
-	struct pol_chain	*next;
-};
-
 extern struct rt6_info ip6_null_entry;

 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
-- 
1.4.2.3


[PATCH 2/13] [SCTP] Fix minor typo

2006-10-16 Thread Ville Nuorvala


Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 net/sctp/socket.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 79c3e07..185d480 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -821,7 +821,7 @@ out:
  * addrs is a pointer to an array of one or more socket addresses. Each
  * address is contained in its appropriate structure (i.e. struct
  * sockaddr_in or struct sockaddr_in6) the family of the address type
- * must be used to distengish the address length (note that this
+ * must be used to distinguish the address length (note that this
  * representation is termed a packed array of addresses). The caller
  * specifies the number of addresses in the array with addrcnt.
  *
-- 
1.4.2.3

[PATCH 3/13] [IPV6] Make sure error handling is done when calling ip6_route_output().

2006-10-16 Thread Ville Nuorvala


As ip6_route_output() never returns NULL, error checking must be done by
looking at dst->error instead of comparing dst against NULL.
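
The calling convention can be illustrated with a small mock, outside the
kernel. The toy_dst type, the null entry and the reference counting below
are simplified stand-ins (assumptions for illustration), not the kernel's
struct dst_entry or ip6_null_entry.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Minimal mock of the two fields that matter here. */
struct toy_dst {
	int error;	/* 0 for a usable route, negative errno otherwise */
	int refcnt;
};

/* Mock "null entry" handed out when no route matches. */
struct toy_dst toy_null_entry = { -ENETUNREACH, 1 };

/* Mock lookup: like ip6_route_output(), it never returns NULL and always
 * returns an object the caller holds a reference on. */
struct toy_dst *toy_route_output(struct toy_dst *found)
{
	struct toy_dst *dst = found != NULL ? found : &toy_null_entry;

	dst->refcnt++;
	return dst;
}

void toy_dst_release(struct toy_dst *dst)
{
	dst->refcnt--;
}

/* On success store the route in *out and return 0. On failure return the
 * error and drop the reference, because a non-NULL result was returned. */
int toy_dst_lookup(struct toy_dst *found, struct toy_dst **out)
{
	struct toy_dst *dst;
	int err;

	assert(out != NULL);
	dst = toy_route_output(found);
	err = dst->error;
	if (!err)
		*out = dst;
	else
		toy_dst_release(dst);
	return err;
}
```

A NULL check on the result would treat the failure case as success and would
also leak the reference taken on the null entry.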

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 net/ipv6/xfrm6_policy.c |   12 +++-
 net/sctp/ipv6.c |   10 +-
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 6a252e2..db2d55c 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -25,12 +25,14 @@ #endif
 static struct dst_ops xfrm6_dst_ops;
 static struct xfrm_policy_afinfo xfrm6_policy_afinfo;

-static int xfrm6_dst_lookup(struct xfrm_dst **dst, struct flowi *fl)
+static int xfrm6_dst_lookup(struct xfrm_dst **xdst, struct flowi *fl)
 {
-	int err = 0;
-	*dst = (struct xfrm_dst*)ip6_route_output(NULL, fl);
-	if (!*dst)
-		err = -ENETUNREACH;
+	struct dst_entry *dst = ip6_route_output(NULL, fl);
+	int err = dst->error;
+	if (!err)
+		*xdst = (struct xfrm_dst *) dst;
+	else
+		dst_release(dst);
 	return err;
 }

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 249e503..78071c6 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -215,17 +215,17 @@ static struct dst_entry *sctp_v6_get_dst
 	}

 	dst = ip6_route_output(NULL, &fl);
-	if (dst) {
+	if (!dst->error) {
 		struct rt6_info *rt;
 		rt = (struct rt6_info *)dst;
 		SCTP_DEBUG_PRINTK(
 			"rt6_dst:" NIP6_FMT " rt6_src:" NIP6_FMT "\n",
 			NIP6(rt->rt6i_dst.addr), NIP6(rt->rt6i_src.addr));
-	} else {
-		SCTP_DEBUG_PRINTK("NO ROUTE\n");
+		return dst;
 	}
-
-	return dst;
+	SCTP_DEBUG_PRINTK("NO ROUTE\n");
+	dst_release(dst);
+	return NULL;
 }

 /* Returns the number of consecutive initial bits that match in the 2 ipv6
-- 
1.4.2.3
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/13] [IPV6] Clean up BACKTRACK().

2006-10-16 Thread Ville Nuorvala


The fn check is unnecessary as fn can never be NULL in BACKTRACK().

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 net/ipv6/route.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index a1b0f07..263c057 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -484,7 +484,7 @@ #define BACKTRACK(saddr) \
 do { \
 	if (rt == &ip6_null_entry) { \
 		struct fib6_node *pn; \
-		while (fn) { \
+		while (1) { \
 			if (fn->fn_flags & RTN_TL_ROOT) \
 				goto out; \
 			pn = fn->parent; \
-- 
1.4.2.3
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 5/13] [IPV6] Make IPV6_SUBTREES depend on IPV6_MULTIPLE_TABLES.

2006-10-16 Thread Ville Nuorvala


As IPV6_SUBTREES can't work without IPV6_MULTIPLE_TABLES, have IPV6_SUBTREES
depend on it.

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 net/ipv6/Kconfig |   16 
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index a2d211d..5fd2ffd 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -152,9 +152,16 @@ config IPV6_TUNNEL

 	  If unsure, say N.

+config IPV6_MULTIPLE_TABLES
+	bool "IPv6: Multiple Routing Tables"
+	depends on IPV6 && EXPERIMENTAL
+	select FIB_RULES
+	---help---
+	  Support multiple routing tables.
+
 config IPV6_SUBTREES
 	bool "IPv6: source address based routing"
-	depends on IPV6 && EXPERIMENTAL
+	depends on IPV6_MULTIPLE_TABLES
 	---help---
 	  Enable routing by source address or prefix.

@@ -166,13 +173,6 @@ config IPV6_SUBTREES

 	  If unsure, say N.

-config IPV6_MULTIPLE_TABLES
-	bool "IPv6: Multiple Routing Tables"
-	depends on IPV6 && EXPERIMENTAL
-	select FIB_RULES
-	---help---
-	  Support multiple routing tables.
-
 config IPV6_ROUTE_FWMARK
 	bool "IPv6: use netfilter MARK value as routing key"
 	depends on IPV6_MULTIPLE_TABLES && NETFILTER
-- 
1.4.2.3


[PATCH 6/13] [IPV6] Always copy rt-u.dst.error when copying a rt6_info.

2006-10-16 Thread Ville Nuorvala

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 net/ipv6/route.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 263c057..aa96be8 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -618,8 +618,6 @@ static struct rt6_info *rt6_alloc_clone(
 		ipv6_addr_copy(&rt->rt6i_dst.addr, daddr);
 		rt->rt6i_dst.plen = 128;
 		rt->rt6i_flags |= RTF_CACHE;
-		if (rt->rt6i_flags & RTF_REJECT)
-			rt->u.dst.error = ort->u.dst.error;
 		rt->u.dst.flags |= DST_HOST;
 		rt->rt6i_nexthop = neigh_clone(ort->rt6i_nexthop);
 	}
@@ -1540,6 +1538,7 @@ static struct rt6_info * ip6_rt_copy(str
 		rt->u.dst.output = ort->u.dst.output;

 		memcpy(rt->u.dst.metrics, ort->u.dst.metrics, RTAX_MAX*sizeof(u32));
+		rt->u.dst.error = ort->u.dst.error;
 		rt->u.dst.dev = ort->u.dst.dev;
 		if (rt->u.dst.dev)
 			dev_hold(rt->u.dst.dev);
-- 
1.4.2.3
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 7/13] [RFC] [IPV6] Move source address selection into route lookup.

2006-10-16 Thread Ville Nuorvala


This patch moves the normal source address selection from
ip6_dst_lookup() into ip6_pol_route_output(), but shouldn't
change the routing or source address selection behavior in
any way.

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 net/ipv6/ip6_output.c |6 --
 net/ipv6/route.c  |   37 ++---
 2 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6671691..0019007 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -855,12 +855,6 @@ static int ip6_dst_lookup_tail(struct so
 	if ((err = (*dst)->error))
 		goto out_err_release;

-	if (ipv6_addr_any(&fl->fl6_src)) {
-		err = ipv6_get_saddr(*dst, &fl->fl6_dst, &fl->fl6_src);
-		if (err)
-			goto out_err_release;
-	}
-
 	return 0;

 out_err_release:
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index aa96be8..b7b8148 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -536,7 +536,7 @@ struct rt6_info *rt6_lookup(struct in6_a
 	int flags = strict ? RT6_LOOKUP_F_IFACE : 0;

 	if (saddr) {
-		memcpy(&fl.fl6_src, saddr, sizeof(*saddr));
+		ipv6_addr_copy(&fl.fl6_src, saddr);
 		flags |= RT6_LOOKUP_F_HAS_SADDR;
 	}

@@ -629,13 +629,11 @@ static struct rt6_info *ip6_pol_route_in
 {
 	struct fib6_node *fn;
 	struct rt6_info *rt, *nrt;
-	int strict = 0;
+	int strict = flags & RT6_LOOKUP_F_IFACE;
 	int attempts = 3;
 	int err;
 	int reachable = RT6_LOOKUP_F_REACHABLE;

-	strict |= flags & RT6_LOOKUP_F_IFACE;
-
 relookup:
 	read_lock_bh(&table->tb6_lock);

@@ -726,22 +724,22 @@ static struct rt6_info *ip6_pol_route_ou
 {
 	struct fib6_node *fn;
 	struct rt6_info *rt, *nrt;
-	int strict = 0;
-	int attempts = 3;
-	int err;
+	int has_saddr = flags & RT6_LOOKUP_F_HAS_SADDR;
+	int strict = flags & RT6_LOOKUP_F_IFACE;
 	int reachable = RT6_LOOKUP_F_REACHABLE;
+	int attempts = 3;
+	struct in6_addr saddr;

-	strict |= flags & RT6_LOOKUP_F_IFACE;
-
+	ipv6_addr_copy(&saddr, &fl->fl6_src);
 relookup:
 	read_lock_bh(&table->tb6_lock);

 restart_2:
-	fn = fib6_lookup(&table->tb6_root, &fl->fl6_dst, &fl->fl6_src);
+	fn = fib6_lookup(&table->tb6_root, &fl->fl6_dst, &saddr);

 restart:
 	rt = rt6_select(&fn->leaf, fl->oif, strict | reachable);
-	BACKTRACK(&fl->fl6_src);
+	BACKTRACK(&saddr);
 	if (rt == &ip6_null_entry ||
 	    rt->rt6i_flags & RTF_CACHE)
 		goto out;
@@ -749,6 +747,13 @@ restart:
 	dst_hold(&rt->u.dst);
 	read_unlock_bh(&table->tb6_lock);

+	if (!has_saddr) {
+		/* policy rule doesn't restrict source address */
+		if (ipv6_get_saddr(&rt->u.dst, &fl->fl6_dst, &saddr))
+			goto no_saddr;
+		has_saddr = RT6_LOOKUP_F_HAS_SADDR;
+		ipv6_addr_copy(&fl->fl6_src, &saddr);
+	}
 	if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP))
 		nrt = rt6_alloc_cow(rt, &fl->fl6_dst, &fl->fl6_src);
 	else {
@@ -764,8 +769,7 @@ #endif

 	dst_hold(&rt->u.dst);
 	if (nrt) {
-		err = ip6_ins_rt(nrt);
-		if (!err)
+		if (!ip6_ins_rt(nrt))
 			goto out2;
 	}

@@ -778,7 +782,6 @@ #endif
 	 */
 	dst_release(&rt->u.dst);
 	goto relookup;
-
 out:
 	if (reachable) {
 		reachable = 0;
@@ -790,6 +793,10 @@ out2:
 	rt->u.dst.lastuse = jiffies;
 	rt->u.dst.__use++;
 	return rt;
+no_saddr:
+	rt = &ip6_null_entry;
+	dst_hold(&rt->u.dst);
+	goto out2;
 }

 struct dst_entry * ip6_route_output(struct sock *sk, struct flowi *fl)
@@ -2044,7 +2051,7 @@ #endif
 		NLA_PUT_U32(skb, RTA_IIF, iif);
 	else if (dst) {
 		struct in6_addr saddr_buf;
-		if (ipv6_get_saddr(&rt->u.dst, dst, &saddr_buf) == 0)
+		if (!ipv6_get_saddr(&rt->u.dst, dst, &saddr_buf))
 			NLA_PUT(skb, RTA_PREFSRC, 16, &saddr_buf);
 	}

-- 
1.4.2.3


[PATCH 8/13] [RFC] [IPV6] Get rid of ipv6_get_saddr() in xfrm6_get_saddr().

2006-10-16 Thread Ville Nuorvala


As the source address is already selected in ip6_pol_route_output()
there is no need to do the source address lookup a second time.

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 net/ipv6/xfrm6_policy.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index db2d55c..954c9ac 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -48,8 +48,7 @@ static int xfrm6_get_saddr(xfrm_address_
 	};

 	if (!xfrm6_dst_lookup((struct xfrm_dst **)&rt, &fl_tunnel)) {
-		ipv6_get_saddr(&rt->u.dst, (struct in6_addr *)&daddr->a6,
-			       (struct in6_addr *)&saddr->a6);
+		ipv6_addr_copy((struct in6_addr *)saddr, &fl_tunnel.fl6_src);
 		dst_release(&rt->u.dst);
 		return 0;
 	}
-- 
1.4.2.3


[PATCH 9/13] [SCTP] Merge IPv4 and IPv6 versions of get_saddr() with their corresponding get_dst().

2006-10-16 Thread Ville Nuorvala


As the IPv6 route lookup now also returns the selected source address
there is no need for a separate source address lookup. In fact, the
source address selection needs to be moved to get_dst() because the
selected IPv6 source address isn't always stored in the route.
Sometimes this makes it impossible to guess the correct address later on.

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 include/net/sctp/structs.h |7 -
 net/sctp/ipv6.c|  235 +++-
 net/sctp/protocol.c|   56 --
 net/sctp/transport.c   |8 +
 4 files changed, 148 insertions(+), 158 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index c6d93bb..e0973a3 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -529,15 +529,8 @@ struct sctp_af {
 	struct dst_entry *(*get_dst)	(struct sctp_association *asoc,
 					 union sctp_addr *daddr,
 					 union sctp_addr *saddr);
-	void		(*get_saddr)	(struct sctp_association *asoc,
-					 struct dst_entry *dst,
-					 union sctp_addr *daddr,
-					 union sctp_addr *saddr);
 	void		(*copy_addrlist) (struct list_head *,
 					  struct net_device *);
-	void		(*dst_saddr)	(union sctp_addr *saddr,
-					 struct dst_entry *dst,
-					 unsigned short port);
 	int		(*cmp_addr)	(const union sctp_addr *addr1,
 					 const union sctp_addr *addr2);
 	void		(*addr_copy)	(union sctp_addr *dst,
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 78071c6..68ead54 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -188,46 +188,6 @@ static int sctp_v6_xmit(struct sk_buff *
return ip6_xmit(sk, skb, fl, np-opt, ipfragok);
 }

-/* Returns the dst cache entry for the given source and destination ip
- * addresses.
- */
-static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
-union sctp_addr *daddr,
-union sctp_addr *saddr)
-{
-   struct dst_entry *dst;
-   struct flowi fl;
-
-   memset(&fl, 0, sizeof(fl));
-   ipv6_addr_copy(&fl.fl6_dst, &daddr->v6.sin6_addr);
-   if (ipv6_addr_type(&daddr->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL)
-   fl.oif = daddr->v6.sin6_scope_id;
-   
-
-   SCTP_DEBUG_PRINTK("%s: DST=" NIP6_FMT " ",
- __FUNCTION__, NIP6(fl.fl6_dst));
-
-   if (saddr) {
-   ipv6_addr_copy(&fl.fl6_src, &saddr->v6.sin6_addr);
-   SCTP_DEBUG_PRINTK(
-   "SRC=" NIP6_FMT " - ",
-   NIP6(fl.fl6_src));
-   }
-
-   dst = ip6_route_output(NULL, &fl);
-   if (!dst->error) {
-   struct rt6_info *rt;
-   rt = (struct rt6_info *)dst;
-   SCTP_DEBUG_PRINTK(
-   "rt6_dst:" NIP6_FMT " rt6_src:" NIP6_FMT "\n",
-   NIP6(rt->rt6i_dst.addr), NIP6(rt->rt6i_src.addr));
-   return dst;
-   }
-   SCTP_DEBUG_PRINTK("NO ROUTE\n");
-   dst_release(dst);
-   return NULL;
-}
-
 /* Returns the number of consecutive initial bits that match in the 2 ipv6
  * addresses.
  */
@@ -250,69 +210,6 @@ static inline int sctp_v6_addr_match_len
return (i*32);
 }

-/* Fills in the source address(saddr) based on the destination address(daddr)
- * and asoc's bind address list.
- */
-static void sctp_v6_get_saddr(struct sctp_association *asoc,
- struct dst_entry *dst,
- union sctp_addr *daddr,
- union sctp_addr *saddr)
-{
-   struct sctp_bind_addr *bp;
-   rwlock_t *addr_lock;
-   struct sctp_sockaddr_entry *laddr;
-   struct list_head *pos;
-   sctp_scope_t scope;
-   union sctp_addr *baddr = NULL;
-   __u8 matchlen = 0;
-   __u8 bmatchlen;
-
-   SCTP_DEBUG_PRINTK("%s: asoc:%p dst:%p "
- "daddr:" NIP6_FMT " ",
- __FUNCTION__, asoc, dst, NIP6(daddr->v6.sin6_addr));
-
-   if (!asoc) {
-   ipv6_get_saddr(dst, &daddr->v6.sin6_addr,&saddr->v6.sin6_addr);
-   SCTP_DEBUG_PRINTK("saddr from ipv6_get_saddr: " NIP6_FMT "\n",
- NIP6(saddr->v6.sin6_addr));
-   return;
-   }
-
-   scope = sctp_scope(daddr);
-
-   bp = &asoc->base.bind_addr;
-   addr_lock = &asoc->base.addr_lock;
-
-   /* Go through the bind address list and find the best source address
-* that matches the scope of the destination address.
-*/
-   sctp_read_lock(addr_lock);
-

[PATCH 10/13] [RFC] [IPV6] Don't export ipv6_get_saddr().

2006-10-16 Thread Ville Nuorvala


To make sure the source address selection is done correctly, don't let
users outside the ipv6 module call ipv6_get_saddr() directly. Instead,
have them go through ip6_route_output().

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 net/ipv6/ipv6_syms.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/ipv6_syms.c b/net/ipv6/ipv6_syms.c
index 0e8e067..94a9806 100644
--- a/net/ipv6/ipv6_syms.c
+++ b/net/ipv6/ipv6_syms.c
@@ -25,7 +25,6 @@ EXPORT_SYMBOL(inet6_release);
 EXPORT_SYMBOL(inet6_bind);
 EXPORT_SYMBOL(inet6_getname);
 EXPORT_SYMBOL(inet6_ioctl);
-EXPORT_SYMBOL(ipv6_get_saddr);
 EXPORT_SYMBOL(ipv6_chk_addr);
 EXPORT_SYMBOL(in6_dev_finish_destroy);
 #ifdef CONFIG_XFRM
-- 
1.4.2.3


[PATCH 11/13] [RFC] [IPV6] Merge ipv6_dev_get_saddr() and ipv6_get_saddr().

2006-10-16 Thread Ville Nuorvala

The split into ipv6_get_saddr() and ipv6_dev_get_saddr() isn't necessary
anymore, so they can be merged into just the function ipv6_get_saddr().
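
For readers following the change: the merged function identifies the
preferred interface by index rather than by device pointer, and an index
of 0 is treated as "no preference". A minimal userspace sketch of that
matching rule follows. The names fake_dev and matches_pref_if are
hypothetical and do not appear in the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct net_device; only ifindex matters here. */
struct fake_dev {
    int ifindex;
};

/* Mirrors the "Rule 5" test in the patch: pref_if == 0 means "any
 * interface", otherwise the candidate's ifindex must equal pref_if. */
static bool matches_pref_if(int pref_if, const struct fake_dev *dev)
{
    return !pref_if || pref_if == dev->ifindex;
}
```

Callers that used to pass a NULL device would pass 0 under this scheme.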

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 include/net/addrconf.h |5 +
 net/ipv6/addrconf.c|   21 ++---
 net/ipv6/ndisc.c   |2 +-
 net/ipv6/route.c   |5 +++--
 4 files changed, 11 insertions(+), 22 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 44f1b67..d075693 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -67,10 +67,7 @@ #endif
 extern struct inet6_ifaddr *   ipv6_get_ifaddr(struct in6_addr *addr,
struct net_device *dev,
int strict);
-extern int ipv6_get_saddr(struct dst_entry *dst,
-  struct in6_addr *daddr,
-  struct in6_addr *saddr);
-extern int ipv6_dev_get_saddr(struct net_device *dev,
+extern int ipv6_get_saddr(int pref_if,
   struct in6_addr *daddr,
   struct in6_addr *saddr);
 extern int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c186763..09a22c8 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -904,8 +904,7 @@ static int inline ipv6_saddr_label(const
return 1;
 }

-int ipv6_dev_get_saddr(struct net_device *daddr_dev,
-  struct in6_addr *daddr, struct in6_addr *saddr)
+int ipv6_get_saddr(int pref_if, struct in6_addr *daddr, struct in6_addr *saddr)
 {
struct ipv6_saddr_score hiscore;
struct inet6_ifaddr *ifa_result = NULL;
@@ -937,7 +936,7 @@ int ipv6_dev_get_saddr(struct net_device
 */
if ((daddr_type & IPV6_ADDR_MULTICAST ||
 daddr_scope <= IPV6_ADDR_SCOPE_LINKLOCAL) &&
-   daddr_dev && dev != daddr_dev)
+   pref_if && dev->ifindex != pref_if)
continue;

idev = __in6_dev_get(dev);
@@ -1062,13 +1061,13 @@ #endif

/* Rule 5: Prefer outgoing interface */
if (hiscore.rule < 5) {
-   if (daddr_dev == NULL ||
-   daddr_dev == ifa_result->idev->dev)
+   if (!pref_if ||
+   pref_if == ifa_result->idev->dev->ifindex)
hiscore.attrs |= IPV6_SADDR_SCORE_OIF;
hiscore.rule++;
}
-   if (daddr_dev == NULL ||
-   daddr_dev == ifa->idev->dev) {
+   if (!pref_if ||
+   pref_if == ifa->idev->dev->ifindex) {
score.attrs |= IPV6_SADDR_SCORE_OIF;
if (!(hiscore.attrs & IPV6_SADDR_SCORE_OIF)) {
score.rule = 5;
@@ -1158,14 +1157,6 @@ record_it:
return 0;
 }

-
-int ipv6_get_saddr(struct dst_entry *dst,
-  struct in6_addr *daddr, struct in6_addr *saddr)
-{
-   return ipv6_dev_get_saddr(dst ? ((struct rt6_info *)dst)->rt6i_idev->dev : NULL, daddr, saddr);
-}
-
-
 int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr)
 {
struct inet6_dev *idev;
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 0304b5f..3ac4e12 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -449,7 +449,7 @@ static void ndisc_send_na(struct net_dev
src_addr = solicited_addr;
in6_ifa_put(ifp);
} else {
-   if (ipv6_dev_get_saddr(dev, daddr, &tmpaddr))
+   if (ipv6_get_saddr(dev->ifindex, daddr, &tmpaddr))
return;
src_addr = &tmpaddr;
}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b7b8148..7cd7747 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -748,8 +748,9 @@ restart:
read_unlock_bh(table-tb6_lock);

if (!has_saddr) {
+   int oif = rt-rt6i_dev-ifindex;
/* policy rule doesn't restrict source address */
-   if (ipv6_get_saddr(rt-u.dst, fl-fl6_dst, saddr))
+   if (ipv6_get_saddr(oif, fl-fl6_dst, saddr))
goto no_saddr;
has_saddr = RT6_LOOKUP_F_HAS_SADDR;
ipv6_addr_copy(fl-fl6_src, saddr);
@@ -2051,7 +2052,7 @@ #endif
NLA_PUT_U32(skb, RTA_IIF, iif);
else if (dst) {
struct in6_addr saddr_buf;
-   if (!ipv6_get_saddr(&rt->u.dst, dst, &saddr_buf))
+   if (!ipv6_get_saddr(rt->rt6i_dev->ifindex, dst, &saddr_buf))

[PATCH 12/13] [RFC] [IPV6] Make sure route cache entries have a valid source address.

2006-10-16 Thread Ville Nuorvala


Leaving out the source address from routing cache entries when
using routing subtrees causes all kinds of problems. Make sure
this doesn't happen.

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 net/ipv6/route.c |   31 +--
 1 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7cd7747..7c3438e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -594,29 +594,28 @@ static struct rt6_info *rt6_alloc_cow(st

ipv6_addr_copy(&rt->rt6i_dst.addr, daddr);
rt->rt6i_dst.plen = 128;
-   rt->rt6i_flags |= RTF_CACHE;
-   rt->u.dst.flags |= DST_HOST;
-
 #ifdef CONFIG_IPV6_SUBTREES
-   if (rt->rt6i_src.plen && saddr) {
-   ipv6_addr_copy(&rt->rt6i_src.addr, saddr);
-   rt->rt6i_src.plen = 128;
-   }
+   ipv6_addr_copy(&rt->rt6i_src.addr, saddr);
+   rt->rt6i_src.plen = 128;
 #endif
-
+   rt->rt6i_flags |= RTF_CACHE;
+   rt->u.dst.flags |= DST_HOST;
rt->rt6i_nexthop = ndisc_get_neigh(rt->rt6i_dev, &rt->rt6i_gateway);
-
}

return rt;
 }

-static struct rt6_info *rt6_alloc_clone(struct rt6_info *ort, struct in6_addr *daddr)
+static struct rt6_info *rt6_alloc_clone(struct rt6_info *ort, struct in6_addr *daddr, struct in6_addr *saddr)
 {
struct rt6_info *rt = ip6_rt_copy(ort);
if (rt) {
ipv6_addr_copy(&rt->rt6i_dst.addr, daddr);
rt->rt6i_dst.plen = 128;
+#ifdef CONFIG_IPV6_SUBTREES
+   ipv6_addr_copy(&rt->rt6i_src.addr, saddr);
+   rt->rt6i_src.plen = 128;
+#endif
rt->rt6i_flags |= RTF_CACHE;
rt->u.dst.flags |= DST_HOST;
rt->rt6i_nexthop = neigh_clone(ort->rt6i_nexthop);
@@ -654,7 +653,7 @@ restart:
nrt = rt6_alloc_cow(rt, &fl->fl6_dst, &fl->fl6_src);
else {
 #if CLONE_OFFLINK_ROUTE
-   nrt = rt6_alloc_clone(rt, &fl->fl6_dst);
+   nrt = rt6_alloc_clone(rt, &fl->fl6_dst, &fl->fl6_src);
 #else
goto out2;
 #endif
@@ -756,10 +755,10 @@ restart:
ipv6_addr_copy(fl-fl6_src, saddr);
}
if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP))
-   nrt = rt6_alloc_cow(rt, fl-fl6_dst, fl-fl6_src);
+   nrt = rt6_alloc_cow(rt, fl-fl6_dst, saddr);
else {
 #if CLONE_OFFLINK_ROUTE
-   nrt = rt6_alloc_clone(rt, fl-fl6_dst);
+   nrt = rt6_alloc_clone(rt, fl-fl6_dst, saddr);
 #else
goto out2;
 #endif
@@ -1429,6 +1428,10 @@ void rt6_redirect(struct in6_addr *dest,

ipv6_addr_copy(&nrt->rt6i_dst.addr, dest);
nrt->rt6i_dst.plen = 128;
+#ifdef CONFIG_IPV6_SUBTREES
+   ipv6_addr_copy(&nrt->rt6i_src.addr, src);
+   nrt->rt6i_src.plen = 128;
+#endif
nrt->u.dst.flags |= DST_HOST;

ipv6_addr_copy(&nrt->rt6i_gateway, (struct in6_addr*)neigh->primary_key);
@@ -1511,7 +1514,7 @@ void rt6_pmtu_discovery(struct in6_addr
if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP))
nrt = rt6_alloc_cow(rt, daddr, saddr);
else
-   nrt = rt6_alloc_clone(rt, daddr);
+   nrt = rt6_alloc_clone(rt, daddr, saddr);

if (nrt) {
nrt->u.dst.metrics[RTAX_MTU-1] = pmtu;
-- 
1.4.2.3


[PATCH 13/13] [RFC] [IPV6] Fix source prefix routing problems when source address undefined.

2006-10-16 Thread Ville Nuorvala


With IPv6 routing subtrees we need to take into account that the
source address is typically not specified at the time of the route
lookup.

There are two separate cases where this can happen. In the typical
case the source address hasn't been selected before the route lookup.
Skipping a source prefix policy rule because of this will lead to
inconsistent routing behavior between for example bound and unbound
sockets.

We avoid this by passing the policy rule source prefix to the lookup
and source address selection functions. For source prefix rules the
source address is selected before the route lookup, otherwise we do it
the other way around. The source address selection algorithm remains
virtually unchanged; the source prefix is just used to verify the
selected address is compatible with the rule. If the source address
doesn't match, the route lookup with the current rule is aborted and
is started again with the next rule in the policy chain.
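
To make the "verify the selected address is compatible with the rule"
step concrete, here is a small userspace sketch of a prefix comparison
on 128-bit addresses. It only illustrates the idea; the kernel uses its
own ipv6_prefix_equal() helper, and the names prefix_equal and
saddr_allowed below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Returns true when the first plen bits of a and b are identical.
 * a and b are 16-byte IPv6 addresses in network byte order; plen is
 * expected to be in the range 0..128. */
static bool prefix_equal(const uint8_t a[16], const uint8_t b[16],
                         unsigned plen)
{
    unsigned full_bytes = plen / 8;
    unsigned rem_bits = plen % 8;

    if (plen > 128)
        return false;
    if (memcmp(a, b, full_bytes) != 0)
        return false;
    if (rem_bits) {
        /* Compare only the leading rem_bits of the next byte. */
        uint8_t mask = (uint8_t)(0xff << (8 - rem_bits));
        if ((a[full_bytes] ^ b[full_bytes]) & mask)
            return false;
    }
    return true;
}

/* Sketch of the constraint check: a rule with prefix length 0 accepts
 * any selected source address. */
static bool saddr_allowed(const uint8_t saddr[16],
                          const uint8_t rule_prefix[16],
                          unsigned rule_plen)
{
    return rule_plen == 0 || prefix_equal(saddr, rule_prefix, rule_plen);
}
```

When saddr_allowed() returns false, the description above says the
lookup for the current rule is abandoned and the next rule is tried.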

The more uncommon case is where the unspecified address is actually
used as a valid source address. When the kernel uses the unspecified
address it doesn't touch the routing table. We need to make sure a
userland application using a raw socket can do the same. If the user
includes the IPv6 header we therefore have to bypass the source
address selection even when the source address is unspecified. In
addition, we don't insert any routing cache entry created by such a
lookup.

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
---
 include/net/addrconf.h |4 +++-
 include/net/ip6_fib.h  |   16 +++-
 net/ipv6/addrconf.c|   13 +++--
 net/ipv6/fib6_rules.c  |   16 ++--
 net/ipv6/ip6_fib.c |2 +-
 net/ipv6/ndisc.c   |2 +-
 net/ipv6/route.c   |   41 +
 7 files changed, 66 insertions(+), 28 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index d075693..7066362 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -67,8 +67,10 @@ #endif
 extern struct inet6_ifaddr *   ipv6_get_ifaddr(struct in6_addr *addr,
struct net_device *dev,
int strict);
-extern int ipv6_get_saddr(int pref_if,
+struct rt6key;
+extern int ipv6_get_saddr(int pref_if,
   struct in6_addr *daddr,
+  struct rt6key *sconstr,
   struct in6_addr *saddr);
 extern int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *);
 extern int ipv6_rcv_saddr_equal(const struct sock *sk,
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index e4438de..8887b5c 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -21,6 +21,7 @@ #include <linux/spinlock.h>
 #include <net/dst.h>
 #include <net/flow.h>
 #include <net/netlink.h>
+#include <net/fib_rules.h>

 struct rt6_info;

@@ -77,6 +78,18 @@ struct rt6key
int plen;
 };

+struct fib6_rule
+{
+   struct fib_rule common;
+   struct rt6key   src;
+   struct rt6key   dst;
+#ifdef CONFIG_IPV6_ROUTE_FWMARK
+   u32 fwmark;
+   u32 fwmask;
+#endif
+   u8  tclass;
+};
+
 struct fib6_table;

 struct rt6_info
@@ -174,7 +187,8 @@ #define RT6_TABLE_LOCAL RT6_TABLE_MAIN
 #endif

 typedef struct rt6_info *(*pol_lookup_t)(struct fib6_table *,
-struct flowi *, int);
+struct flowi *, int,
+struct fib6_rule *);

 /*
  * exported functions
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 09a22c8..486af76 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -904,7 +904,8 @@ static int inline ipv6_saddr_label(const
return 1;
 }

-int ipv6_get_saddr(int pref_if, struct in6_addr *daddr, struct in6_addr *saddr)
+int ipv6_get_saddr(int pref_if, struct in6_addr *daddr,
+  struct rt6key *sconstr, struct in6_addr *saddr)
 {
struct ipv6_saddr_score hiscore;
struct inet6_ifaddr *ifa_result = NULL;
@@ -1151,7 +1152,15 @@ record_it:

if (!ifa_result)
return -EADDRNOTAVAIL;
-   
+#ifdef CONFIG_IPV6_SUBTREES
+   /* Don't let source address based routing interfere with the
+  address selection, just make sure the selected address
+  matches the routing policy constraints */
+
+   if (sconstr && sconstr->plen > 0 &&
+   !ipv6_prefix_equal(saddr, &sconstr->addr, sconstr->plen))
+   return -EADDRNOTAVAIL;
+#endif
ipv6_addr_copy(saddr, &ifa_result->addr);
in6_ifa_put(ifa_result);
return 0;
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index

[PATCH 9/13] [RFC] [SCTP] Merge IPv4 and IPv6 versions of get_saddr() with their corresponding get_dst().

2006-10-16 Thread Ville Nuorvala

Oops, this patch, almost more than any other, was meant as an RFC. Sorry about that!

Regards,
Ville

[PATCH] Bound TSO defer time (resend)

2006-10-16 Thread John Heffner

The original message didn't show up on the list.  I'm assuming it's
because the filters didn't like the attached postscript.  I posted PDFs of
the figures on the web:

http://www.psc.edu/~jheffner/tmp/a.pdf
http://www.psc.edu/~jheffner/tmp/b.pdf
http://www.psc.edu/~jheffner/tmp/c.pdf

  -John


-- Forwarded message --
Date: Mon, 16 Oct 2006 15:55:53 -0400 (EDT)
From: John Heffner [EMAIL PROTECTED]
To: David Miller [EMAIL PROTECTED]
Cc: netdev netdev@vger.kernel.org
Subject: [PATCH] Bound TSO defer time

This patch limits the amount of time you will defer sending a TSO segment
to less than two clock ticks, or the time between two acks, whichever is
longer.

On slow links, deferring causes significant bursts.  See attached plots,
which show RTT through a 1 Mbps link with a 100 ms RTT and ~100 ms queue
for (a) non-TSO, (b) current TSO, and (c) patched TSO.  This burstiness
causes significant jitter, tends to overflow queues early (bad for short
queues), and makes delay-based congestion control more difficult.

Deferring by a couple clock ticks I believe will have a relatively small
impact on performance.
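
As a reading aid, the time bound can be modelled in userspace roughly as
follows. This is a sketch of the intent stated above, not the kernel
code: the helper names are hypothetical, and the 31-bit wraparound mask
is an addition that the patch itself does not contain.

```c
#include <stdint.h>

/* Hypothetical model of the tso_deferred field: 0 means "not deferring";
 * otherwise bit 0 is set and bits 1..31 hold the tick count at which
 * deferral began (truncated to 31 bits). */
static uint32_t defer_stamp(uint32_t now_ticks)
{
    return 1u | (now_ticks << 1);
}

/* Returns 1 when a deferral already in progress has lasted more than
 * one tick, i.e. the segment should be sent now; 0 otherwise. */
static int defer_expired(uint32_t stamp, uint32_t now_ticks)
{
    uint32_t now31, start31;

    if (!stamp)
        return 0;                      /* nothing has been deferred yet */
    now31 = (now_ticks << 1) >> 1;     /* drop the top bit to match the stamp */
    start31 = stamp >> 1;
    /* Mask keeps the difference correct across a 31-bit wrap. */
    return ((now31 - start31) & 0x7fffffffu) > 1;
}
```

Packing the flag and the timestamp into one 32-bit word keeps the
per-socket cost to a single field, at the price of one bit of tick range.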


Signed-off-by: John Heffner [EMAIL PROTECTED]


diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 0e058a2..27ae4b2 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -341,7 +341,9 @@ #endif
int linger2;

unsigned long last_synq_overflow;
-
+
+   __u32   tso_deferred;
+
 /* Receiver side RTT estimation */
struct {
__u32   rtt;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9a253fa..3ea8973 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1087,11 +1087,15 @@ static int tcp_tso_should_defer(struct s
u32 send_win, cong_win, limit, in_flight;

if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
-   return 0;
+   goto send_now;

if (icsk->icsk_ca_state != TCP_CA_Open)
-   return 0;
+   goto send_now;

+   /* Defer for less than two clock ticks. */
+   if (!tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1)
+   goto send_now;
+
in_flight = tcp_packets_in_flight(tp);

BUG_ON(tcp_skb_pcount(skb) <= 1 ||
@@ -1106,8 +1110,8 @@ static int tcp_tso_should_defer(struct s

/* If a full-sized TSO skb can be sent, do it. */
if (limit >= 65536)
-   return 0;
-
+   goto send_now;
+
if (sysctl_tcp_tso_win_divisor) {
u32 chunk = min(tp->snd_wnd, tp->snd_cwnd * tp->mss_cache);

@@ -1116,7 +1120,7 @@ static int tcp_tso_should_defer(struct s
 */
chunk /= sysctl_tcp_tso_win_divisor;
if (limit >= chunk)
-   return 0;
+   goto send_now;
} else {
/* Different approach, try not to defer past a single
 * ACK.  Receiver should ACK every other full sized
@@ -1124,11 +1128,17 @@ static int tcp_tso_should_defer(struct s
 * then send now.
 */
if (limit > tcp_max_burst(tp) * tp->mss_cache)
-   return 0;
+   goto send_now;
}
-
+
/* Ok, it looks like it is advisable to defer.  */
+   tp->tso_deferred = 1 | (jiffies<<1);
+
return 1;
+
+send_now:
+   tp->tso_deferred = 0;
+   return 0;
 }

 /* Create a new MTU probe if we are ready.

RE: PATCH zero-copy send completion callback

2006-10-16 Thread Eric Barton

David,

 Also, the correct mailing list to get to the networking developers
 is [EMAIL PROTECTED]  linux-net is for users.

Noted.

 Finally, I very much doubt you have much chance getting this
 change in, the infrastructure is implemented in a very ad-hoc
 fashion and it takes into consideration none of the potential
 other users of such a thing.  

Are you referring to the absence of a callback argument other than the
callback descriptor itself?  It seemed natural to me to contain the
descriptor in whatever state the higher-level protocol associates with the
message it's sending, and to derive this from the descriptor address in the
callback.
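
The embedding described here is the usual container_of pattern. A
minimal userspace sketch, assuming hypothetical names (zc_done, my_msg,
my_msg_sent) that are not taken from the patch under discussion:

```c
#include <stddef.h>

/* Hypothetical completion descriptor: it carries only a callback. */
struct zc_done {
    void (*callback)(struct zc_done *d);
};

/* Hypothetical higher-level message state that embeds the descriptor. */
struct my_msg {
    int id;
    int completed;
    struct zc_done done;
};

/* Userspace equivalent of the kernel's container_of(). */
#define CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The callback receives only the descriptor and recovers its owner
 * from the descriptor's address. */
static void my_msg_sent(struct zc_done *d)
{
    struct my_msg *msg = CONTAINER_OF(d, struct my_msg, done);

    msg->completed = 1;
}
```

With this layout no separate "argument" pointer is needed in the
descriptor, which is the point being made about its size.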

If this isn't what you mean, could you explain?  I'm not at all religious
about it.

 And these days we're trying to figure
 out how to eliminate skbuff and skb_shared_info struct members
 whereas you're adding 16-bytes of space on 64-bit platforms.

Do you think the general concept of a zero-copy completion callback is
useful?

If so, do you have any ideas about how to do it more economically?  It's 2
pointers rather than 1 to avoid forcing an unnecessary packet boundary
between successive zero-copy sends.  But I guess that might not be hugely
significant since you're generally sending many pages when zero-copy is
needed for performance.  Also, (please correct me if I'm wrong) I didn't
think this would push the allocation over to the next entry in
'malloc_sizes'.

Cheers,
Eric



Re: socket/IP on Linux

2006-10-16 Thread Jingping Lin


Arnaldo:
Sorry, I have to bother you again with another Linux
socket question.

Suppose that I have a Linux IP socket connected for a
TCP connection and the socket is set as a non-blocking
one with fcntl().

Even if the socket is set as non-blocking, is it really
possible to perform Non-Blocking Close on this socket?
i.e., can I make the int=close(fd) a non-blocking
call? 

The answer seems No to me based on my study. I am not
totally sure though.

If the answer is Yes, how?
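
For context on this question: the socket(7) man page describes
SO_LINGER as the option that decides whether close() waits for queued
data. When lingering is disabled (the default), close() is documented
to return immediately and the closing is done in the background. The
sketch below makes that explicit. The helper name close_in_background
is hypothetical, and whether O_NONBLOCK has any effect on a lingering
close is not something this sketch settles.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Disable SO_LINGER (its default state) and close the descriptor.
 * With lingering off, close() is documented to return immediately
 * while the kernel finishes the shutdown in the background.
 * Returns 0 on success, -1 on error (errno set by the failing call).
 * If setsockopt() fails, the descriptor is left open. */
static int close_in_background(int fd)
{
    struct linger lg = { .l_onoff = 0, .l_linger = 0 };

    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg)) < 0)
        return -1;
    return close(fd);
}
```

An application that must know when the peer has received everything
would need its own acknowledgement rather than relying on close().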

Please help, thanks a lot,
Jingping
  
--- Arnaldo Carvalho de Melo [EMAIL PROTECTED]
wrote:

 On 10/5/06, Jingping Lin [EMAIL PROTECTED] wrote:
  Hello, Linux Kernel:
  For a project I will work on for mobile, I am
 looking
  into the IP stacks on Linux.
 
  I have a few questions to bother you:
 
 No bothering, so far, please see the below answers
 and try to check
 them all before bothering again 8)
 
  1. is socket.c the file handling the socket
  interface?
 
 One of them
 
  2. which function is for opening a socket?
  It looks like sock_map_fd() is the one for
  opening/creating a socket? Is that correct?
  The Linux IP Stacks Commentary book suggested
 the
  function is int socket() which I didn't find in
  socket.c though.
 
 Perhaps it is suggesting that you create the socket
 in userspace using
 the libc socket(2) function (see 'man socket') and
 then passing it
 thru some ioctl if you want to use kernel_sendmsg
 (make tags ; vi -t
 kernel_sendmsg) from kernelspace?
 
  3. Do you have documentations discussing in
 details
  the implemented socket interfaces?
 
 Humm, I guess you can grep the sources for in kernel
 socket usage?
 
  Thanks a lot in advance for your help,
 
 Best Regards,
 
 - Arnaldo
 



Re: 2.6.18-mm2 boot failure on x86-64

2006-10-16 Thread Andrew Morton

On Mon, 16 Oct 2006 14:16:13 -0400
Vivek Goyal [EMAIL PROTECTED] wrote:

 
 Can you please have a look at the attached patch

Looks like a fine patch to me, although it could benefit from a comment
explaining why all those PAGE_ALIGN()s are in there.

 and include it in -mm.

Does it fix a patch in -mm or is it needed in mainline?



Re: [PATCH] Bound TSO defer time (resend)

2006-10-16 Thread Stephen Hemminger

On Mon, 16 Oct 2006 20:53:20 -0400 (EDT)
John Heffner [EMAIL PROTECTED] wrote:

 The original message didn't show up on the list.  I'm assuming it's
 because the filters didn't like the attached postscript.  I posted PDFs of
 the figures on the web:
 
 http://www.psc.edu/~jheffner/tmp/a.pdf
 http://www.psc.edu/~jheffner/tmp/b.pdf
 http://www.psc.edu/~jheffner/tmp/c.pdf
 
   -John
 
 
 -- Forwarded message --
 Date: Mon, 16 Oct 2006 15:55:53 -0400 (EDT)
 From: John Heffner [EMAIL PROTECTED]
 To: David Miller [EMAIL PROTECTED]
 Cc: netdev netdev@vger.kernel.org
 Subject: [PATCH] Bound TSO defer time
 
 This patch limits the amount of time you will defer sending a TSO segment
 to less than two clock ticks, or the time between two acks, whichever is
 longer.
 
 On slow links, deferring causes significant bursts.  See attached plots,
 which show RTT through a 1 Mbps link with a 100 ms RTT and ~100 ms queue
 for (a) non-TSO, (b) current TSO, and (c) patched TSO.  This burstiness
 causes significant jitter, tends to overflow queues early (bad for short
 queues), and makes delay-based congestion control more difficult.
 
 Deferring by a couple clock ticks I believe will have a relatively small
 impact on performance.
 
 
 Signed-off-by: John Heffner [EMAIL PROTECTED]

Okay, but doing any timing on clock ticks makes the behavior dependent
on the value of HZ which doesn't seem desirable. Should this be based
on RTT or real-time values?
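
To put numbers on the HZ dependence, a tiny sketch that converts ticks
to milliseconds (ticks_to_ms is a hypothetical helper, not a kernel
function). With the commonly configured HZ values of 100, 250 and 1000,
a two-tick bound corresponds to 20 ms, 8 ms and 2 ms respectively.

```c
/* Milliseconds covered by `ticks` clock ticks at a given HZ
 * (ticks per second). Integer division truncates. */
static unsigned int ticks_to_ms(unsigned int ticks, unsigned int hz)
{
    return ticks * 1000u / hz;
}
```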

Re: [PATCH 1/14] [TIPC] Add missing unlock in port timeout code.

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:42 +0200

 From: Allan Stephens [EMAIL PROTECTED]

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied, thanks.

Re: [PATCH 2/14] [TIPC] Debug print buffer enhancements and fixes

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:43 +0200

 From: Allan Stephens [EMAIL PROTECTED]

 This change modifies TIPC's print buffer code as follows:
 1) Now supports small print buffers (min. size reduced from 512 bytes to 64)
 2) Now uses TIPC_NULL print buffer structure to indicate null device
instead of NULL pointer (this simplified error handling)
 3) Fixed misuse of console buffer structure by tipc_dump()
 4) Added and corrected comments in various places

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied, please run trailing-whitespace checks on your patches,
f.e. using git apply --check --whitespace=error-all diff.
Because often I have to fix up problems like the following in
your submissions:

[EMAIL PROTECTED]:~/src/GIT/net-2.6$ pcheck diff
+ git apply --check --whitespace=error-all diff
Adds trailing whitespace.
diff:25: * TIPC_LOG: TIPC log buffer 
Adds trailing whitespace.
diff:105: * 
Adds trailing whitespace.
diff:148: * 
Adds trailing whitespace.
diff:334:   printk(\n Start of %s log dump \n\n, 
Adds trailing whitespace.
diff:366:   tipc_printbuf_init(TIPC_LOG, kmalloc(log_size, 
GFP_ATOMIC), 
Adds trailing whitespace.
diff:393: * @next: used to link print buffers when printing to more than one at 
a time 
Adds trailing whitespace.
diff:395: 
fatal: 7 lines add trailing whitespaces.

Thanks.
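
For anyone who wants to see what such a check looks for, here is a
rough userspace sketch that counts added lines ending in a space or
tab. It is only an approximation of the idea and not how git implements
it; count_trailing_ws_additions is a hypothetical name, and an added
line whose own content begins with "++ " would be mistaken for a file
header and skipped.

```c
#include <stddef.h>
#include <string.h>

/* Counts added lines ("+..." but not the "+++ " file header) in a
 * unified diff whose last character before the newline is a space
 * or a tab. */
static size_t count_trailing_ws_additions(const char *diff)
{
    size_t count = 0;

    while (*diff) {
        const char *eol = strchr(diff, '\n');
        size_t len = eol ? (size_t)(eol - diff) : strlen(diff);

        if (len > 1 && diff[0] == '+' && strncmp(diff, "+++ ", 4) != 0) {
            char last = diff[len - 1];

            if (last == ' ' || last == '\t')
                count++;
        }
        diff += len + (eol ? 1 : 0);   /* advance past the newline, if any */
    }
    return count;
}
```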

Re: [PATCH 3/14] [TIPC] Stream socket can now send 66000 bytes at a time

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:44 +0200

 From: Allan Stephens [EMAIL PROTECTED]

 The stream socket send code was not initializing some required fields
 of the temporary msghdr structure it was utilizing; this is now fixed.
 A check has also been added to detect if a user illegally specifies
 a destination address when sending on an established stream connection.

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied, thanks.

Re: [PATCH 4/14] [TIPC] Added duplicate node address detection capability

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:45 +0200

 From: Allan Stephens [EMAIL PROTECTED]

 TIPC now rejects and logs link setup requests from node Z.C.N if the
 receiving node already has a functional link to that node on the associated
 interface, or if the requestor is using the same Z.C.N as the receiver.

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied, but more whitespace crap I had to fix up:

[EMAIL PROTECTED]:~/src/GIT/net-2.6$ pcheck diff
+ git apply --check --whitespace=error-all diff
Adds trailing whitespace.
diff:19:tipc_printf(pb, %s(%s), m_ptr-name, 
Adds trailing whitespace.
diff:46:static void disc_dupl_alert(struct bearer *b_ptr, u32 node_addr, 
Adds trailing whitespace.
diff:84:spin_unlock_bh(n_ptr-lock);   

fatal: 3 lines add trailing whitespaces.

Re: [PATCH 5/14] [TIPC] Optimize wakeup logic when socket has no waiting processes

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:46 +0200

 From: Allan Stephens [EMAIL PROTECTED]

 This patch adds a simple test so TIPC doesn't try waking up processes
 waiting on a socket if there are none waiting.

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied.

Re: [PATCH 6/14] [TIPC] Remove code bloat introduced by print buffer rework

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:47 +0200

 From: Allan Stephens [EMAIL PROTECTED]

 This patch allows the compiler to optimize out any code that tries to
 send debugging output to the null print buffer (TIPC_NULL), a capability
 that was unintentionally broken during the recent print buffer rework.

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied, thanks.

Re: [PATCH 7/14] [TIPC] Add support for Ethernet VLANs

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:48 +0200

 From: Allan Stephens [EMAIL PROTECTED]
 
 This patch enhances TIPC's Ethernet support to include VLAN interfaces.
 
 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied, more whitespace I had to fixup:

+ git apply --check --whitespace=error-all diff
Adds trailing whitespace.
diff:24: * (in case the message is sent off-node), 
fatal: 1 line adds trailing whitespaces.

Re: [PATCH 8/14] [TIPC] Fix socket receive queue NULL pointer dereference on SMP systems

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:49 +0200

 From: P Litov [EMAIL PROTECTED]

 This patch corrects an SMP system-specific race condition which allowed
 TIPC to prematurely dereference the first sk_buff in a socket receive
 queue that was changing from empty to non-empty state.

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

If you are going to access the socket packet without some other kind
of locking that prevents changes to the queue, you must take the skb
queue lock.  You can't dance around it by checking the linked list
pointer instead of the queue length.  Otherwise we'd be doing this all
over the UDP code and other datagram socket layers.  And we don't
because it simply isn't valid.

So I'm not applying this.

Also, this patch is missing a proper signed off line from the
patch author, P Litov.

Re: [PATCH 9/14] [TIPC] Name publication events now delivered in chronological order

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:50 +0200

 From: Allan Stephens [EMAIL PROTECTED]

 This patch trivially re-orders the entries in TIPC's list of local
 publications so that applications will receive publication events
 in the order they were published.

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied, thanks.

Re: [PATCH 10/14] [TIPC] Fixed slow link reactivation when link tolerance is large

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:51 +0200

 From: Allan Stephens [EMAIL PROTECTED]

 This patch corrects an issue wherein a previously failed node could
 not reestablish a link to a non-failing node in the TIPC network
 until the latter node detected the link failure itself (which might
 be configured to take up to 30 seconds).  The non-failing node now
 responds to link setup requests from a previously failed node in at
 most 1 second, allowing it to detect the link failure more quickly.

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied, thanks.

Re: [PATCH 11/14] [TIPC] Can now list multicast link on an isolated network node

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:52 +0200

 From: Allan Stephens [EMAIL PROTECTED]

 This patch fixes a minor bug that prevents tipc-config -l from
 displaying the multicast link if a TIPC node has never successfully
 established at least one unicast link.

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied, thanks.

Re: [PATCH 12/14] [TIPC] Added subscription cancellation capability

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:53 +0200

 From: Lijun Chen [EMAIL PROTECTED]

 This patch allows a TIPC application to cancel an existing
 topology service subscription by re-requesting the subscription
 with the TIPC_SUB_CANCEL filter bit set.  (All other bits of
 the cancel request must match the original subscription request.)

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied, but had some trailing whitespace additions to clean up
and would you please ask all patch authors to provide proper
signed-off-by lines in the future?  Thanks.

Re: [PATCH 13/14] [TIPC] Unrecognized configuration command now returns error message

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:54 +0200

 From: Allan Stephens [EMAIL PROTECTED]

 This patch causes TIPC to return an error message when it receives
 an unrecognized configuration command.  (Previously, the sender
 received no feedback.)

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied.

Re: [PATCH 14/14] [TIPC] Updated TIPC version number to 1.6.2

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:55 +0200

 From: Allan Stephens [EMAIL PROTECTED]

 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]

Applied, thanks.

Re: [PATCH 0/14] TIPC updates

2006-10-16 Thread David Miller

From: Per Liden [EMAIL PROTECTED]
Date: Fri, 13 Oct 2006 13:37:23 +0200 (CEST)

 This patch set includes a number of TIPC fixes/cleanups. Please see each
 individual patch for further description.

 Please pull from:

  git://tipc.cslab.ericsson.net/pub/git/tipc.git

  (rebased on linux/kernel/git/davem/net-2.6.git)

I applied everything except patch 8/14, you really need to
add proper SKB queue locking to handle that race.  I think
the performance cost of taking that lock is much overstated,
you should never have contention on that lock at all.

Secondly, I never pull from your trees because I still have
to make many fixups to your patches:

1) Please add a proper colon to your changeset header lines,
   it should be "[TIPC]: ", not "[TIPC] ".

2) Please check for trailing whitespace added by your patches.
   I've given you the command you can use in another email to
   check this for yourselves before submission.

3) Please get full proper signed-off-by lines from patch submitters,
   especially when the patch is more than a trivial 1 or 2 liner.

Thanks.
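
Points 1) and 2) above are mechanical enough to check before sending. The following is only a sketch of such a pre-submission check, not the command referred to above: `header_has_colon()` and `adds_trailing_whitespace()` are hypothetical helpers that test a subject line for the "[TAG]: " form and a unified-diff line for whitespace added at the end of a line.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if the subject starts with "[TAG]: " (bracketed tag, colon, space). */
static bool header_has_colon(const char *subject)
{
    const char *close;

    assert(subject != NULL);
    if (subject[0] != '[')
        return false;
    close = strchr(subject, ']');
    if (!close || close == subject + 1)   /* no "]" or empty "[]" */
        return false;
    return close[1] == ':' && close[2] == ' ';
}

/*
 * True if a unified-diff line adds text that ends in a space or tab.
 * Lines starting with "+++" are treated as file headers and skipped,
 * so added content that itself begins with "++" is not checked.
 */
static bool adds_trailing_whitespace(const char *line)
{
    size_t len;

    assert(line != NULL);
    len = strlen(line);
    if (line[0] != '+' || strncmp(line, "+++", 3) == 0)
        return false;
    /* Ignore the line terminator itself. */
    while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
        len--;
    return len > 1 && (line[len - 1] == ' ' || line[len - 1] == '\t');
}
```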

Re: [PATCH 2/13] [SCTP] Fix minor typo

2006-10-16 Thread David Miller

From: Ville Nuorvala [EMAIL PROTECTED]
Date: Tue, 17 Oct 2006 02:56:55 +0300

 Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]

Also applied, thanks.

Please format your changelog headers properly, make
it "[TOPIC]: " instead of "[TOPIC] ".  Thanks.

Re: [PATCH 3/13] [IPV6] Make sure error handling is done when calling ip6_route_output().

2006-10-16 Thread David Miller

From: Ville Nuorvala [EMAIL PROTECTED]
Date: Tue, 17 Oct 2006 03:04:08 +0300

 As ip6_route_output() never returns NULL, error checking must be done by
 looking at dst->error instead of comparing dst against NULL.

 Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]

Good catch, patch applied.

Thanks a lot.
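
The pattern the patch description refers to can be shown with a small userspace model. This is a sketch only: `struct dst_like`, `route_output()` and `send_via_route()` are hypothetical stand-ins for the real routing structures, and the error value used for the "no route" entry is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a routing cache entry. */
struct dst_like {
    int error;     /* 0 for a usable route, negative errno otherwise */
    int refcnt;
};

static struct dst_like null_entry = { -ENETUNREACH, 0 };  /* "no route" placeholder */
static struct dst_like good_entry = { 0, 0 };

/* Like the lookup described above, this never returns NULL. */
static struct dst_like *route_output(int reachable)
{
    struct dst_like *dst = reachable ? &good_entry : &null_entry;

    dst->refcnt++;
    return dst;
}

static void dst_put(struct dst_like *dst)
{
    dst->refcnt--;
}

static int send_via_route(int reachable)
{
    struct dst_like *dst = route_output(reachable);
    int err;

    assert(dst != NULL);   /* a "dst == NULL" error check would never fire */
    err = dst->error;      /* the failure is reported here instead */
    if (err) {
        dst_put(dst);      /* the reference is held even on failure */
        return err;
    }
    /* ... the route would be used here ... */
    dst_put(dst);
    return 0;
}
```

Note that the error path still has to drop the reference, which a plain NULL check would have skipped.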

Re: [PATCH 1/13] [IPV6] Remove struct pol_chain.

2006-10-16 Thread David Miller

From: Ville Nuorvala [EMAIL PROTECTED]
Date: Tue, 17 Oct 2006 02:54:27 +0300

 Struct pol_chain has existed since at least the 2.2 kernel, but isn't used
 anymore. As the IPv6 policy routing is implemented in a totally different
 way in the current kernel, just get rid of it.

 Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]

That's obvious enough, good catch.

Applied, thanks a lot.

Re: [PATCH 4/13] [IPV6] Clean up BACKTRACK().

2006-10-16 Thread David Miller

From: Ville Nuorvala [EMAIL PROTECTED]
Date: Tue, 17 Oct 2006 03:06:27 +0300

 The fn check is unnecessary as fn can never be NULL in BACKTRACK().

 Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]

Applied, especially valid since we're walking parents up to
the root, we break out on hitting root, and root's parent is
itself :-)

Thanks.

Re: [PATCH 5/13] [IPV6] Make IPV6_SUBTREES depend on IPV6_MULTIPLE_TABLES.

2006-10-16 Thread David Miller

From: Ville Nuorvala [EMAIL PROTECTED]
Date: Tue, 17 Oct 2006 03:08:35 +0300

 As IPV6_SUBTREES can't work without IPV6_MULTIPLE_TABLES have IPV6_SUBTREES
 depend on it.

 Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]

Good catch, patch applied, thanks.

Re: [PATCH 6/13] [IPV6] Always copy rt->u.dst.error when copying a rt6_info.

2006-10-16 Thread David Miller

From: Ville Nuorvala [EMAIL PROTECTED]
Date: Tue, 17 Oct 2006 03:10:49 +0300

 Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]

Looks good, applied.

Ville, can you fixup Thunderbird to not corrupt your patches?
The specific corruption is that if the patch has a completely
empty line with just a space at the beginning, thunderbird is
killing that space which makes the patch bad (at least in GIT's
eyes, which is all that matters :-)

Thanks!

Re: [PATCH 7/13] [RFC] [IPV6] Move source address selection into route lookup.

2006-10-16 Thread David Miller

From: Ville Nuorvala [EMAIL PROTECTED]
Date: Tue, 17 Oct 2006 03:13:17 +0300

 This patch moves the normal source address selection from
 ip6_dst_lookup() into ip6_pol_route_output(), but shouldn't
 change the routing or source address selection behavior in
 any way.

 Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]

Although this conversion is very clean and the next patch
is very logical, I'm going to hold on all patches from 7 onward
so there is some time for some discussion of the RFC'ness
of them :-)

Thanks.

Re: [PATCH] Bound TSO defer time (resend)

2006-10-16 Thread David Miller

From: John Heffner [EMAIL PROTECTED]
Date: Tue, 17 Oct 2006 00:18:33 -0400

 Stephen Hemminger wrote:
  On Mon, 16 Oct 2006 20:53:20 -0400 (EDT)
  John Heffner [EMAIL PROTECTED] wrote:

  This patch limits the amount of time you will defer sending a TSO segment
  to less than two clock ticks, or the time between two acks, whichever is
  longer.

  Okay, but doing any timing on clock ticks makes the behavior dependent
  on the value of HZ which doesn't seem desirable. Should this be based
  on RTT or real-time values?

 It would be nice to use a high res clock so you don't depend on HZ, but 
 this is still expensive on most SMP arch's as I understand it.

Right, so we do need to use a jiffies-based solution.

Since HZ is variable, I have a feeling that the thing to do here
is pick some timeout in msec.  Then replace the 2 clock ticks
with some msecs_to_jiffies() calls, bottoming out at 1 jiffy.

How does that sound?
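
To make the proposal concrete, here is a userspace sketch of the arithmetic only. It does not use the kernel helper: `msecs_to_ticks()` and `tso_defer_limit()` are hypothetical names, the tick rate is passed in explicitly instead of coming from HZ, and rounding up is an assumption of this sketch.

```c
#include <assert.h>

/* Convert milliseconds to clock ticks for a given tick rate, rounding up. */
static unsigned long msecs_to_ticks(unsigned int msecs, unsigned int hz)
{
    assert(hz > 0);
    return ((unsigned long)msecs * hz + 999UL) / 1000UL;
}

/*
 * Defer limit in ticks for a timeout given in milliseconds, never
 * less than one tick, so the bound stays meaningful at low tick rates.
 */
static unsigned long tso_defer_limit(unsigned int msecs, unsigned int hz)
{
    unsigned long ticks = msecs_to_ticks(msecs, hz);

    return ticks ? ticks : 1;
}
```

For example, a 4 msec timeout gives 4 ticks at a rate of 1000 per second but 1 tick at 250 or 100 per second, so the wall-clock bound stays closer to constant than a fixed count of 2 ticks would.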

Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Johann Borck

Ulrich Drepper wrote:
 Evgeniy Polyakov wrote:
 Existing design does not allow overflow.

 And I've pointed out a number of times that this is not practical at
 best.  There are event sources which can create events which cannot be
 coalesced into one single event as it would be required with your design.

 Signals are one example, specifically realtime signals.  If we do not
 want the design to be limited from the start this approach has to be
 thought over.


 So zap mmap() support completely, since it is not usable at all. We
 won't discuss it.

 Initial implementation did not have it.
 But I was requested to do it, and it is ready now.
 No one likes it, but no one provides an alternative implementation.
 We are stuck.

 We need the mapped ring buffer.  The current design (before it was
 removed) was broken but this does not mean it shouldn't be
 implemented.  We just need more time to figure out how to implement it
 correctly.

Considering the "if at all" and "if so, how" of the ring buffer implementation,
I'd like to throw in some ideas I had when reading the discussion and
the respective code. If I understood Ulrich Drepper right, his notion of a
generic event handling interface is that it has to be flexible enough
to transport additional info from origin to userspace, and to support
queuing of events from the same origin, so that additional
per-event-occurrence data doesn't get lost, which would happen when
coalescing multiple events into one until delivery. From what I read, he
says the ring buffer is broken because of insufficient space for additional
data (mukevent) and the limited number of events that can be put into
the ring buffer. Another argument is the missing notification of userspace
about dropped events in case the ring buffer limit is reached. (Is that right?)
I see no reason why kevent couldn't be modified to fit (all) these
needs. While modifying the server-example and writing a client using
kevent I came across the coalescing problem, there were more incoming
connections than accept events, and I had to work around that. In this
case the pure number of coalesced events would suffice, while it
wouldn't for the example of RT-signals that Ulrich Drepper gave. So if
coalescing can be done at all or if it is impossible depends on the type
of event. The same goes for additional data delivered with the events.
There might be no panacea for all possible scenarios with one fixed
design. Either performance suffers for 'lightweight' events, which don't
need additional data or the ring buffer and for which coalescing is not
problematic, or kevent is not usable for other types of events. Why not treat
different things differently, and let the (kernel-)user decide?
I don't know if I got all this right, but if so, then the ring buffer is needed
especially for cases where coalescing is not possible and additional
data has to be delivered for each triggered notification (so the pure
number of events is not enough; other reasons? performance? ). To me it
doesn't make sense to have kevent fill memory and use processor-time if
buffer is not used at all, which is the case when using kevent_getevents.
So here are my ideas:

1) Make usage of the ring buffer optional; if it is not required for a
   specific event type, it might be chosen by userspace code.

2) Make the limit of events in the ring buffer optional and controllable
   from userspace.

3) Regarding mukevent, I'm thinking of an event-type specific struct that
   is filled by the originating code and placed into a per-event-type ring
   buffer (which requires modification of kevent_wait). To my limited
   understanding it seems that alternative or modified versions of
   kevent_storage_ready, (__)kevent_requeue and kevent_user_ring_add_event
   could return a void pointer to the position in the buffer, and all
   kevent has to know about is the size of the struct.

4) If coalescing doesn't hurt for a specific event type, it might just be
   modified to notify userspace about the number of coalesced events. Make
   it depend on the type of event.
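
To give the ring buffer ideas above a concrete shape, here is a small userspace sketch. It is not kevent code: `struct event_ring`, `ring_reserve()` and `ring_consume()` are hypothetical names. The ring only knows the size of the per-event-type record, hands the producer a void pointer to the next slot, and counts dropped events so that a consumer could be told about overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_BYTES 64   /* arbitrary small size for the example */

struct event_ring {
    size_t rec_size;         /* size of the event-type specific struct */
    size_t capacity;         /* number of records that fit */
    size_t head;             /* next slot to write */
    size_t count;            /* records currently queued */
    unsigned long dropped;   /* events lost because the ring was full */
    unsigned char buf[RING_BYTES];
};

static void ring_init(struct event_ring *r, size_t rec_size)
{
    assert(rec_size > 0 && rec_size <= RING_BYTES);
    memset(r, 0, sizeof(*r));
    r->rec_size = rec_size;
    r->capacity = RING_BYTES / rec_size;
}

/*
 * Reserve the next slot and return a pointer to it for the producer
 * to fill.  On overflow nothing is overwritten; the drop is counted
 * and NULL is returned.
 */
static void *ring_reserve(struct event_ring *r)
{
    void *slot;

    if (r->count == r->capacity) {
        r->dropped++;
        return NULL;
    }
    slot = r->buf + r->head * r->rec_size;
    r->head = (r->head + 1) % r->capacity;
    r->count++;
    return slot;
}

/* Copy the oldest record into out; returns 1 on success, 0 if empty. */
static int ring_consume(struct event_ring *r, void *out)
{
    size_t tail;

    if (r->count == 0)
        return 0;
    tail = (r->head + r->capacity - r->count) % r->capacity;
    memcpy(out, r->buf + tail * r->rec_size, r->rec_size);
    r->count--;
    return 1;
}
```

An event type for which coalescing is harmless would not need such a ring at all; a single counter of coalesced occurrences would do.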

I know this doesn't address all objections that have been made, and
Evgeniy, big sorry for this being just talk again, and maybe not even
applicable for reasons I do not see, but maybe it's worth
consideration. I'll gladly try to put that into code, and see where it
leads. I think kevent is great, and if things can be done to increase
its genericity without sacrificing performance, why not?
Sorry for the length of the post and the repetitions,

Johann
