Re: [PATCH] net: Fragment large datagrams even when IP_HDRINCL is set.

2016-07-08 Thread Alexey Kuznetsov
Hello! I can tell why it has not been done initially. Main problem was in IP options, which can be present in raw packet. They have to be properly fragmented, some options are to be deleted on fragments. Not that it is too complicated, it is just boring and ugly and inconsistent with IP_HDRINCL
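
The option handling mentioned above follows from RFC 791: the high bit of each option type is the "copied" flag, and only options with that flag set are carried into non-first fragments. A minimal userspace sketch of that rule, assuming a raw option area is passed in as a byte array; the function name `blank_uncopied_options` is hypothetical, and overwriting dropped options with No-Operation bytes is just one way to keep the header length unchanged:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* RFC 791: the top bit of the option type is the "copied" flag. */
#define OPT_COPIED(type) ((type) & 0x80)
#define OPT_END  0 /* End of Option List, single byte */
#define OPT_NOOP 1 /* No Operation, single byte */

/*
 * Prepare an option area for a non-first fragment (hypothetical helper).
 * Options with the copied flag are kept; all others are overwritten
 * with NOOP bytes so the header length does not change.
 * Returns 0 on success, -1 if the option area is malformed.
 */
static int blank_uncopied_options(uint8_t *opts, size_t len)
{
    size_t i = 0;

    assert(opts != NULL || len == 0);
    while (i < len) {
        uint8_t type = opts[i];
        size_t optlen;

        if (type == OPT_END)
            break;              /* nothing after End of Option List */
        if (type == OPT_NOOP) {
            i++;                /* single-byte option, no length field */
            continue;
        }
        if (i + 1 >= len)
            return -1;          /* length byte is missing */
        optlen = opts[i + 1];
        if (optlen < 2 || optlen > len - i)
            return -1;          /* length is too small or overruns the area */
        if (!OPT_COPIED(type))
            memset(opts + i, OPT_NOOP, optlen);
        i += optlen;
    }
    return 0;
}
```

With a Record Route option (type 7, not copied) followed by Loose Source Route (type 131, copied), only the Record Route bytes are blanked.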

Re: [PATCH net-2.6 0/3]: Three TCP fixes

2007-12-05 Thread Alexey Kuznetsov
Hello! My theory is that it could relate to tcp_cwnd_restart and tcp_cwnd_application_limited using it and the others are just then accidentally changed as well. Perhaps I'll have to dig once again to changelog history to see if there's some clue (unless Alexey shed some light to this)...

Re: [PATCH 3/3] [UDP6]: Counter increment on BH mode

2007-12-03 Thread Alexey Kuznetsov
On Mon, Dec 03, 2007 at 10:39:35PM +1100, Herbert Xu wrote: So we need to fix this, and whatever the fix is will probably render the BH/USER distinction obsolete. Hmm, I would think opposite. USER (or generic) is expensive variant, BH is lite. No? Alexey -- To unsubscribe from this list: send

Re: [PATCH] net/ipv4/arp.c: Fix arp reply when sender ip 0 (was: Strange behavior in arp probe reply, bug or feature?)

2007-11-19 Thread Alexey Kuznetsov
Hello! Is there a reason that the target hardware address isn't the target hardware address? It is bound only to the fact that linux uses protocol address of the machine, which responds. It would be highly confusing (more than confusing :-)), if we used our protocol address and hardware

Re: [PATCH] net/ipv4/arp.c: Fix arp reply when sender ip 0 (was: Strange behavior in arp probe reply, bug or feature?)

2007-11-15 Thread Alexey Kuznetsov
Hello! Send a correct arp reply instead of one with sender ip and sender hardware address in target fields. I do not see anything more legal in setting target address to 0. Actually, semantics of target address in ARP reply is ambiguous. If it is a reply to some real request, it is set to

Re: [2.6 patch] remove Documentation/networking/routing.txt

2007-11-05 Thread Alexey Kuznetsov
Hello! This file is so outdated that I can't see any value in keeping it. Absolutely agree. Alexey - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH RESEND] ip_gre: sendto/recvfrom NBMA address

2007-10-24 Thread Alexey Kuznetsov
Hello! I was able to set a nbma gre tunnel, add routes to it and it worked perfectly ok. Link-level next hop worked: ip route add route via link-level-address dev tunnel-dev onlink This can work if you use gre0. By plain luck it has all-zero dev_addr. It will break on nbma devices set

Re: [PATCH RESEND] ip_gre: sendto/recvfrom NBMA address

2007-10-23 Thread Alexey Kuznetsov
Hello! When GRE tunnel is in NBMA mode, this patch allows an application to use a PF_PACKET socket to: - send a packet to specific NBMA address with sendto() - use recvfrom() to receive packet and check which NBMA address it came from This is required to implement properly NHRP over GRE

Re: [PATCH RESEND] ip_gre: sendto/recvfrom NBMA address

2007-10-23 Thread Alexey Kuznetsov
Hello! Me wrote: Ack. This is good idea. Frankly, I was sure ip_gre worked in this way all these years. I do not remember any reasons why it was crippled. The only dubious case is when next hop is set using routing tables. But code in ipgre_tunnel_xmit() is ready to accept this

Re: [PATCH 5/10] [NET]: Avoid unnecessary cloning for ingress filtering

2007-10-15 Thread Alexey Kuznetsov
Hello! If it is causing trouble, then one idea would be to move the resetting to a wrapper function which calls clone first and then resets the other fields. All actions currently cloning would need to be mod-ed to use that call. I see not so many places inside net/sched/act* where skb_clone

Re: SFQ qdisc crashes with limit of 2 packets

2007-09-21 Thread Alexey Kuznetsov
the whole range of hash values. Switched to Jenkins' hash. Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED] diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c index 3a23e30..b542c87 100644 --- a/net/sched/sch_sfq.c +++ b/net/sched/sch_sfq.c @@ -19,6 +19,7 @@ #include linux/init.h #include

Re: SFQ qdisc crashes with limit of 2 packets

2007-09-19 Thread Alexey Kuznetsov
Hello! OK the off-by-one prevents an out-of-bounds array access, Yes, this is not off-by-one (off-by-two, to be more exact :-)). Maximal queue length is really limited by SFQ_DEPTH-2, because: 1. SFQ keeps list of queue lengths in array of length SFQ_DEPTH. This means length of queue must

Re: Problem with implementation of TCP_DEFER_ACCEPT?

2007-08-24 Thread Alexey Kuznetsov
the SYN_ACK time-outs finally expire the connection will be dropped. I brought this up a long, long time ago, and I seem to remember Alexey Kuznetsov explained to me at the time that this was intentional. Obviously, I said something like it is exactly what TCP_DEFER_ACCEPT does

Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Alexey Kuznetsov
Hello! I just suggested to Pavel to create only a single device per newlink operation and binding them later, I see some logical inconsistency here. Look, the second end is supposed to be in another namespace. It will have identity, which cannot

Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Alexey Kuznetsov
Hello! Good point, I didn't think of that. Is there a version of this patch that already uses different namespaces so I can look at it? Pavel does not like the idea. It looks not exactly pretty, like you said. :-) The alternative is to create pair in main namespace and then move one end to

Re: [PATCH] [IPV4] nl_fib_lookup: Initialise res.r before fib_res_put(res)

2007-04-26 Thread Alexey Kuznetsov
Hello! When CONFIG_IP_MULTIPLE_TABLES is enabled, the code in nl_fib_lookup() needs to initialize the res.r field before fib_res_put(res) - unlike fib_lookup(), a direct call to ->tb_lookup does not set this field. Indeed, I am sorry. Alexey

[PATCH] infinite recursion in netlink

2007-04-25 Thread Alexey Kuznetsov
table is missing 2. Do not crash when queue is empty (does not happen, but yet) 3. Put result of lookup Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED] diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index fc920f6..cac06c4 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4

Re: [ofa-general] Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello! Well I don't think the loopback device is currently but as soon as we get network namespace support we will have multiple loopback devices and they will get unregistered when we remove the network namespace. There is no logical difference. At the moment when namespace is gone there is

Re: [ofa-general] Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello! Does this look sane (untested)? It does not, unfortunately. Instead of regular crash in infiniband you will get numerous random NULL pointer dereferences both due to dst->neighbour and due to dst->dev. Alexey

Re: [ofa-general] Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello! I think the thing to do is to just leave the loopback references in place, try to unregister the per-namespace loopback device, and that will safely wait for all the references to go away. Yes, it is exactly how it works in openvz. All the sockets are killed, queues are cleared, nobody

Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello! If a device driver sets neigh_destructor in neigh_params, this could get called after the device has been unregistered and the driver module removed. It is the same problem: if dst->neighbour holds neighbour, it should not hold device. parms->dev is not supposed to be used after

Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello! infiniband sets parm->neigh_destructor, and I search for a way to prevent this destructor from being called after the module has been unloaded. Ideas? It must be called in any case to update/release internal ipoib structures. The idea is to move call of parm->neigh_destructor from

Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello! This might work. Could you post a patch to better show what you mean to do? Here it is. ->neigh_destructor() is killed (not used), replaced with ->neigh_cleanup(), which is called when neighbor entry goes to dead state. At this point everything is still valid: neigh->dev, neigh->parms etc.

Re: dst_ifdown breaks infiniband?

2007-03-18 Thread Alexey Kuznetsov
Hello! This is not new code, and should have triggered long time ago, so I am not sure how come we are triggering this only now, but somehow this did not lead to crashes in 2.6.20 I see. I guess this was plain luck. Why is neighbour->dev changed here? It holds reference to device and

Re: dst_ifdown breaks infiniband?

2007-03-18 Thread Alexey Kuznetsov
Hello! Hmm. Something I don't understand: does the code in question not run on *each* device unregister? It does. Why do I only see this under stress? You should have some referenced destination entries to trigger bad path. This should happen not only under stress. F.e. just try to ssh to

Re: dst_ifdown breaks infiniband?

2007-03-18 Thread Alexey Kuznetsov
Hello! It should be cleared and we should be sure it will not be destroyed before quiescent state. I'm confused. didn't you say dst_ifdown is called after quiescent state? Quiescent state should happen after dst->neighbour is invalidated. And this implies that all the users of

Re: [PATCH] Copy mac_len in skb_clone() as well

2007-03-15 Thread Alexey Kuznetsov
Hello! What bug triggered that helped you discover this? Or is it merely from a code audit? I asked the same question. :-) openvz added some other fields to skbuff and when it was found that they are lost on clone, he tried to figure out how all this works and looked for another

Re: [PATCH] TCP: Replace __kfree_skb() with kfree_skb()

2007-01-26 Thread Alexey Kuznetsov
Hello! do you know of any place where __kfree_skb is used to free an skb whose ref count is greater than 1? No. Actually, since kfree_skb is not inline, __kfree_skb could be made static and remaining places still using it switched to kfree_skb.

Re: [BUG] problem with BPF in PF_PACKET sockets, introduced in linux-2.6.19

2007-01-25 Thread Alexey Kuznetsov
Hello! So this whole idea to make run_filter() return signed integers and fail on negative is entirely flawed, it simply cannot work and retain the expected semantics which have been there forever. Actually, it can. Return value was used only as sign of error, so that the mistake was to

Re: RFC: consistent disable_xfrm behaviour

2006-12-04 Thread Alexey Kuznetsov
Hello! Alexey, do you remember what the original intent of this was? disable_policy was supposed to skip policy checks on input. It makes sense only on input device. disable_xfrm was supposed to skip transformations on output. It makes sense only on output device. If it does not work, it was

Re: RFC: consistent disable_xfrm behaviour

2006-12-04 Thread Alexey Kuznetsov
Hello! Here's the patch again properly signed off. I think it is correct. Alexey

Re: 2.6.19-rc1: Volanomark slowdown

2006-11-08 Thread Alexey Kuznetsov
Hello! reduced Volanomark benchmark throughput by 10%. The irony of it is that java vm used to be one of victims of over-delayed acks. I will look, there is a little chance that it is possible to detect the situation and to stretch ACKs. There is one little question though. If you see a

Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-22 Thread Alexey Kuznetsov
Hello! transactions to data segments is fubar. That issue is also why I wonder about the setting of tcp_abc. Yes, switching ABC on/off has visible impact on amount of segments. When ABC is off, amount of segments is almost the same as number of transactions. When it is on, ~1.5% are merged.

Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-09-22 Thread Alexey Kuznetsov
Hello! I can't even find a reference to SIOCGSTAMP in the dhcp-2.0pl5 or dhcp3-3.0.3 sources shipped in Ubuntu. But I will note that tpacket_rcv() expects to always get valid timestamps in the SKB, it does a: It is equally unlikely it uses mmapped packet socket (tpacket_rcv). I even

Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-18 Thread Alexey Kuznetsov
Hello! It looks perfectly fine to me, would you like me to apply it Alexey? Yes, I think it is safe. Theoretically, there is one place where it can be not so good. Good nagling tcp connection, which makes lots of small write()s, will send MSS sized frames due to delayed ACKs. But if we ACK

Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-09-18 Thread Alexey Kuznetsov
Hello! For netdev: I'm more and more thinking we should just avoid the problem completely and switch to true end2end timestamps. This means don't time stamp when a packet is received, but only when it is delivered to a socket. This will work. From viewpoint of existing uses of timestamp by

Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-09-18 Thread Alexey Kuznetsov
Hello! Hmm, not sure how that could happen. Also is it a real problem even if it could? As I said, the problem is _occasionally_ theoretical. This would happen f.e. if packet socket handler was installed after IP handler. Then tcpdump would get packet after it is processed

Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-18 Thread Alexey Kuznetsov
Hello! Of course, number of ACK increases. It is the goal. :-) unpleasant increase in service demands on something like a burst enabled (./configure --enable-burst) netperf TCP_RR test: netperf -t TCP_RR -H foo -- -b N # N 1 foo=localhost b patched orig 2

Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-09-18 Thread Alexey Kuznetsov
Hello! But that never happens right? Right. Well, not right. It happens. Simply because you get packet with newer timestamp after previous handler saw this packet and did some actions. I just do not see any bad consequences. And do you have some other prefered way to solve this? Even if

Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-09-18 Thread Alexey Kuznetsov
Hello! Please think about it this way: suppose you have a heavily loaded router and some network problem is to be diagnosed. You run tcpdump and suddenly router becomes overloaded (by switching to timestamp-it-all mode I am sorry. I cannot think that way. :-) Instead of attempts to scare,

Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-18 Thread Alexey Kuznetsov
Hello! There isn't any sort of clever short-circuiting in loopback is there? No, from all that I know. I do like the convenience of testing things over loopback, but always fret about not including drivers and actual

Re: 2.6.18-rc6 memory mapped pcap truncates outgoing TCP packets, but not icmp

2006-09-14 Thread Alexey Kuznetsov
Hello! [PACKET]: Don't truncate non-linear skbs with mmaped IO Non-linear skbs are truncated to their linear part with mmaped IO. Fix by using skb_copy_bits instead of memcpy. Ack. I remember this trick. The idea was that I needed only TCP header in any case and it was perfect cutoff. This
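
The truncation described above arises because a non-linear buffer keeps only part of the packet in its linear area, so a single memcpy() from that area cannot see the rest. A toy userspace model of the difference, assuming a hypothetical `struct toy_buf` with a linear head plus a small fragment array; it imitates the idea behind skb_copy_bits() but is not the kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FRAGS 4

/* Hypothetical stand-in for one paged fragment of a buffer. */
struct toy_frag {
    const unsigned char *data;
    size_t len;
};

/* Hypothetical stand-in for a non-linear packet buffer. */
struct toy_buf {
    const unsigned char *head;   /* linear part */
    size_t head_len;
    struct toy_frag frags[MAX_FRAGS];
    size_t nr_frags;
};

/*
 * Copy len bytes starting at offset, walking the linear part first and
 * then each fragment in order. Returns 0 on success, -1 if the requested
 * range runs past the end of the buffer.
 */
static int toy_copy_bits(const struct toy_buf *b, size_t offset,
                         unsigned char *to, size_t len)
{
    size_t i, n;

    assert(b != NULL && to != NULL);
    if (offset < b->head_len) {
        n = b->head_len - offset;
        if (n > len)
            n = len;
        memcpy(to, b->head + offset, n);
        to += n;
        len -= n;
        offset = 0;              /* continue at the start of the fragments */
    } else {
        offset -= b->head_len;   /* offset is now relative to the fragments */
    }
    for (i = 0; i < b->nr_frags && len > 0; i++) {
        const struct toy_frag *f = &b->frags[i];

        if (offset >= f->len) {
            offset -= f->len;    /* requested range starts in a later fragment */
            continue;
        }
        n = f->len - offset;
        if (n > len)
            n = len;
        memcpy(to, f->data + offset, n);
        to += n;
        len -= n;
        offset = 0;
    }
    return len == 0 ? 0 : -1;
}
```

A plain memcpy() of `head_len` bytes would stop at the end of the linear part, which is the truncation the quoted patch description refers to.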

Re: [PATCH] make ipv4 multicast packets only get delivered to sockets that are joined to group

2006-09-14 Thread Alexey Kuznetsov
Hello! No, it returns 1 (allow) if there are no filters to explicitly filter it. I wrote that code. :-) I see. It did not behave this way old times. From your mails I understood that current behaviour matches another implementations (BSD whatever), is it true? Alexey

Re: [PATCH] make ipv4 multicast packets only get delivered to sockets that are joined to group

2006-09-13 Thread Alexey Kuznetsov
Hello! IPv6 behaves the same way. Actually, Linux IPv6 filters received multicasts, inet6_mc_check() does this. IPv4 does not. I remember that attempts to do this were made in the past and failed, because some applications, related to multicast routing, did expect to receive all the multicasts

Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-05 Thread Alexey Kuznetsov
Hello! Is this really necessary? No, of course. We lived for ages without this, would live for another age. I thought that the problems with ABC were in trying to apply byte-based heuristics from the RFC(s) to a packet-oriented cwnd in the stack? It was just the

Re: high latency with TCP connections

2006-09-04 Thread Alexey Kuznetsov
Hello! At least for slow start it is safe, but experiments with atcp for netchannels showed that it is better not to send excessive number of acks when slow start is over, If this thing is done from tcp_cleanup_rbuf(), it should not affect performance too much. Note, that with ABC and

Re: 2.6.18-rc5 with GRE, iptables and Speedtouch ADSL, PPP over ATM

2006-09-04 Thread Alexey Kuznetsov
Hello! This path obviously breaks assumption 1) and therefore can lead to ABBA dead-locks. Yes... I've looked at the history and there seems to be no reason for the lock to be held at all in dev_watchdog_up. The lock appeared in day one and even there it was unnecessary. Seems, it

[PATCH][RFC] Re: high latency with TCP connections

2006-09-04 Thread Alexey Kuznetsov
ACK is forced after tcp_recvmsg() drains receive buffer. In other words, it is a soft each-2d-segment ACK, which is enough to preserve ACK clock even when ABC is enabled. Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED] diff --git a/include/net/inet_connection_sock.h b/include/net

Re: ProxyARP and IPSec

2006-09-04 Thread Alexey Kuznetsov
Hello! sarcasm What a great idea. Now I just have to get every host I want to interoperate with to support a nonstandard configuration. The scary part is that if I motivate it with Linux is too stupid to handle standard tunnel-mode IPsec I might actually get away with it. sarcasm

Re: high latency with TCP connections

2006-08-31 Thread Alexey Kuznetsov
Hello! 2) a way to take delayed ACKs into account for cwnd growth This part is OK now, right? 1) protection against ACK division But Linux never had this problem... Congestion window was increased only when a whole skb is ACKed, flag FLAG_DATA_ACKED. (TSO could break this, but should not).

Re: NAPI: netif_rx_reschedule() ??

2006-08-31 Thread Alexey Kuznetsov
Hello! However I'm confused about a couple of things, and there are only two uses of netif_rx_reschedule() in the kernel, so I'm a little stuck. First, do not believe to even single bit of code or docs about netif_rx_reschedule(). It was used once in the first version of NAPI for 3com driver

Re: [PATCH] fix sk->sk_filter field access

2006-08-30 Thread Alexey Kuznetsov
Hello! Function sk_filter() is called from tcp_v{4,6}_rcv() functions with argument needlock = 0, while socket is not locked at that moment. In order to avoid this and similar issues in the future, use rcu for sk->sk_filter field read protection. Patch is for net-2.6.19 What bug

Re: [PATCH] fix sk->sk_filter field access

2006-08-30 Thread Alexey Kuznetsov
Hello! Really? It is used with needlock=0 by DCCP ipv6, for example. This case seems correct too. What about sk_receive_skb()? dn_queue_skb()? In fact, there seems to be numerous uses still with needlock=0, all legitimate. Well, not quite legitimate. sk_receive_skb() has the same bug as

Re: [PATCH 4/6] net neighbour: convert to RCU

2006-08-29 Thread Alexey Kuznetsov
Hello! @@ -346,8 +354,8 @@ struct neighbour *neigh_lookup(struct ne NEIGH_CACHE_STAT_INC(tbl, lookups); - read_lock_bh(&tbl->lock); - hlist_for_each_entry(n, tmp, &tbl->hash_buckets[hash_val], hlist) { + rcu_read_lock(); + hlist_for_each_entry_rcu(n, tmp,

Re: [RFC IPv6] Disabling IPv6 autoconf

2006-08-29 Thread Alexey Kuznetsov
Hello! Yes, it is logical because without multicast IPV6 cannot work correctly. This is not quite true. IFF_BROADCAST is enough, it will work just like IPv4. Real troubles start only when interface is not IFF_BROADCAST and not IFF_POINTOPOINT. IFF_MULTICAST flag seems potentially

Re: [PATCH 4/6] net neighbour: convert to RCU

2006-08-29 Thread Alexey Kuznetsov
Hello! atomic_inc_and_test is true iff result is zero, so that won't work. I meant atomic_inc_not_zero(), as Martin noticed. But the following should work: hlist_for_each_entry_rcu(n, tmp, &tbl->hash_buckets[hash_val], hlist) { if (dev == n->dev && !memcmp(n->primary_key,
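
The distinction drawn above matters for lockless lookups: a reader may take a reference only while the count is still non-zero, otherwise it could revive an entry that is already being freed. A userspace analogue of the atomic_inc_not_zero() idea using C11 atomics, as a sketch only; the name `ref_get_unless_zero` is made up for illustration and this is not the kernel implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Try to take a reference. Succeeds only if the counter was non-zero,
 * i.e. somebody else still holds the object alive.
 * Returns true if the reference was taken, false if the count was zero.
 */
static bool ref_get_unless_zero(atomic_int *refcnt)
{
    int old = atomic_load(refcnt);

    assert(old >= 0);
    while (old != 0) {
        /* On failure the CAS stores the current value into old; retry. */
        if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
            return true;
    }
    return false;               /* object is already on its way out */
}
```

A lookup built on this would skip any entry for which the call returns false, rather than incrementing unconditionally and checking afterwards.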

Re: [PATCH 4/6] net neighbour: convert to RCU

2006-08-29 Thread Alexey Kuznetsov
Hello! Also, probably, it makes sense to add neigh_lookup_light(), which does not take refcnt, but required to call neigh_release_light() (which is just rcu_read_unlock_bh()). Which code paths would that make sense on? fib_detect_death (ok) infiniband (ok)

Re: [PATCH 4/6] net neighbour: convert to RCU

2006-08-29 Thread Alexey Kuznetsov
Hello! This should not be any more racy than the existing code. Existing code is not racy. Critical place is interpretation of refcnt==1. Current code assumes, that when refcnt=1 and entry is in hash table, nobody can take this entry (table is locked). So, it can be unlinked from the table.

Re: [PATCH 4/6] net neighbour: convert to RCU

2006-08-29 Thread Alexey Kuznetsov
Hello! Yes, I forgot to say I take back my suggestion about atomic_inc_test_zero(). It would not work. Seems, it is possible to add some barriers around setting n->dead and testing it in neigh_lookup_rcu(), but it would be scary and ugly. To be honest, I just do not know how to do this. :-)

Re: [PATCH 4/6] net neighbour: convert to RCU

2006-08-29 Thread Alexey Kuznetsov
Hello! Race 1: w/o RCU Cpu 0: is in neigh_lookup gets read_lock() finds entry ++refcount to 2

Re: ProxyARP and IPSec

2006-08-24 Thread Alexey Kuznetsov
Hello! I'm thinking that David definitely has a point about having a usability problem, though. All other kind of tunnels have endpoint devices associated with them, and that would make all these kinds of problems go away, Yes, when you deal with sane practical setups, this approach is

Re: ProxyARP and IPSec

2006-08-23 Thread Alexey Kuznetsov
Hello! What he's trying to accomplish doesn't sound all that weird, Absolutely sane. does anyone have any other ideas? The question is where is this host really? If it is far far away and connected only via IPsec tunnel with destination of tunnel different of host address ip ro add

Re: Get rid of /proc/sys/net/unix/max_dgram_qlen

2006-08-22 Thread Alexey Kuznetsov
Hello! Either this, or it should be implemented correctly, which means poll needs to be fixed to also check for max_dgram_qlen, Feel free to do this correctly. :-) Deleting wrong code rarely helps. It is the only protection of committing infinite amount of memory to a socket. Alexey

Re: Get rid of /proc/sys/net/unix/max_dgram_qlen

2006-08-22 Thread Alexey Kuznetsov
Hello! It is the only protection of committing infinite amount of memory to a socket. Doesn't the if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) check in sock_alloc_send_pskb() limit things already? Unfortunately, it does not. You can open a socket, send something to a selected

Re: [take12 0/3] kevent: Generic event handling mechanism.

2006-08-22 Thread Alexey Kuznetsov
Hello! No way - timespec uses long. I must have missed that discussion. Please enlighten me in what regard using an opaque type with lower resolution is preferable to a type defined in POSIX for this sort of purpose. Let me explain, as a person who did this mistake and deeply regrets about

Re: Get rid of /proc/sys/net/unix/max_dgram_qlen

2006-08-22 Thread Alexey Kuznetsov
Hello! Isn't a socket freed until all skb are handled? In which case the limit on the number of open files limits the total memory usage? (Same as with streaming sockets?) Alas. Number of closed sockets is not limited. Actually, it is limited by sk_max_ack_backlog*max_files, which is a lot.

[PATCH] locking bug in fib_semantics.c

2006-08-17 Thread Alexey Kuznetsov
. Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED] --- diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 4ea6c68..5dfdad5 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -159,7 +159,7 @@ void free_fib_info(struct fib_info *fi) void fib_release_info

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Alexey Kuznetsov
Hello! The netlink header pid is really akin to sadb_msg_pid from RFC 2367. IMHO it should always be zero if the kernel is the originator of the message. No. Analogue of sadb_msg_pid is nladdr.nl_pid. Netlink header pid is not originator of the message, but author of the change. The notion

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Alexey Kuznetsov
Hello! In one conversation with Alexey he told me there was some inspiration from pfkey in the semantics of it i.e processid. Inspiration, but not a copy. :-) Unlike pfkeyv2 it uses addressing usual for networking i.e. struct sockaddr_nl. Alexey

Re: [RFC] network namespaces

2006-08-16 Thread Alexey Kuznetsov
Hello! (application) containers. Performance aside, are there any reasons why this approach would be problematic for c/r? This approach is just perfect for c/r. Probably, this is the only approach when migration can be done in a clean and self-consistent way. Alexey

Re: [PATCH?] tcp and delayed acks

2006-08-16 Thread Alexey Kuznetsov
Hello! send out any delayed ACKs when it is clear that the receiving process is waiting for more data? It has just been done in tcp_cleanup_rbuf() a few lines before your chunk. There is some complex condition to be satisfied there and it is impossible to relax it any further. I do not know

Re: skb_shared_info()

2006-08-15 Thread Alexey Kuznetsov
Hello! I still like existing way - it is much simpler (I hope :) to convince e1000 developers to fix driver's memory usage e1000 is not a problem at all. It just has to use pages. If it is going to use high order allocations, it will suck, be it order 3 or 2. area (does MAX_TCP_HEADER

Re: skb_shared_info()

2006-08-14 Thread Alexey Kuznetsov
Hello! e1000 will setup head/data/tail pointers to point to the area in the first sg page. Maybe. But I still hope this is not necessary, the driver should be able to do at least primitive header splitting, in that case the header could be inlined to skb. Alternatively, header can be copied

Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-14 Thread Alexey Kuznetsov
Hello! Some of these removals of current->pid will affect users such as quagga, zebra, vrrpd etc. If they survived cleanup in IPv4, they definitely will not feel cleanup in IPv6. Thomas does great work, Jamal, do not worry. :-) IMO, I believe there is a strong case that can be made for

Re: [RFC 5/7] neighbour: convert lookup to sequence lock

2006-08-14 Thread Alexey Kuznetsov
Hello! That wouldn't work if hard_header() ever expands the head. Fortunately hard_header() returns the length added even in case of an error so we can undo the absolute value returned. Yes. Or probably it is safer to undo to skb->nh. Even if hard_header expands skb, skb->nh still remains

Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-13 Thread Alexey Kuznetsov
Hello! So we do something like this: Yes, exactly. Actually, there was a function with similar functionality: rtnetlink_send(). net/sched/* used it, older net/ipv4/ still did this directly. Alexey

Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-12 Thread Alexey Kuznetsov
Hello! Makes sense, especially for auto generated handles. I've been listening to the notifications on a separate socket for this purpose. That's... complicated. But cool. :-) It does make sense, the way it has been implemented if at all is creepy. Even worse, IPv6 is using current->pid,

Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-12 Thread Alexey Kuznetsov
Hello! Actually I think the only safe solution is to allocate a separate socket for multicast messages. In other words, if you want reliable unicast reception on a socket, don't bind it to a multicast group. Yes, it was the point of my advocacy of NLM_F_ECHO. :-) Alexey

Re: the mystery that is sock_fasync

2006-08-11 Thread Alexey Kuznetsov
Hello! Did I miss some way that multiple file objects can point to the same socket inode? Absolutely prohibited. Always was. Apparently, sock_fasync() was cloned from tty_fasync(), that's the only reason why it is so creepy. Alexey

Re: skb_shared_info()

2006-08-11 Thread Alexey Kuznetsov
Hello! management schemes and to just wrap SKB's around arbitrary pieces of data. + and something clever like a special page_offset encoding means use data, not page. But for what purpose do you plan to use it? The e1000 issue is just one example of this, another What is this issue?

Re: sender throttling for unreliable protocols not garuanteed? (different units in sock->wmem_alloc and net_devive->tx_queue_len)

2006-08-11 Thread Alexey Kuznetsov
Hello! I'd be interested in any opinions on the above mentioned effect. Everything is right, it is exactly how it works. Well, use another qdisc, which counts in bytes rather than in frames (f.e. bfifo) Set sndbuf small enough. And if sndbuf*#senders is still too large, you have to use fair

Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-11 Thread Alexey Kuznetsov
Hello! I get your point and I see the value. Unfortunately, probably due to lack of documentation, this feature isn't used by any applications I know of. Well, tc was supposed to use it, but this did not happen and it remained deficient. We even put in the hacks to make identification of

Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-10 Thread Alexey Kuznetsov
Hello! This patch handles NLM_F_ECHO in netlink_rcv_skb() to handle it in a central point. Most subsystems currently interpret NLM_F_ECHO as to just unicast events to the originator of the change while the real meaning of the flag is to echo the request. Do not you think it is useless to

Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-10 Thread Alexey Kuznetsov
Hello! What's wrong with listening to the notification for that purpose? Nothing! NLM_F_ECHO _is_ listening for notifications without subscription to multicast groups and need to figure out what messages are yours. But beyond this NLM_F_ECHO is totally subset of this. Which still makes much

Re: [PATCH] llc: SOCK_DGRAM interface fixes

2006-08-08 Thread Alexey Kuznetsov
Hello! This fix goes against the old historical comments about UNIX98 semantics but without this fix SOCK_DGRAM is broken and useless. So either ANK's interpretation was incorrect or UNIX98 standard was wrong. Just found this reference to me. :-) The comment migrated from tcp.c. It is only

Re: [PATCH] limit rt cache size

2006-08-07 Thread Alexey Kuznetsov
Hello! During OpenVZ stress testing we found that UDP traffic with random src can generate too much excessive rt hash growing leading finally to OOM and kernel panics. It was found that for 4GB i686 system (having 1048576 total pages and 225280 normal zone pages) kernel allocates the

Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-08-01 Thread Alexey Kuznetsov
Hello! Alexey, any suggestions on how to handle this kind of thing? Device, which adds something at head must check for space. Anyone, who adds something at head, must check. Otherwise, it will remain buggy forever. What's wrong with my patch? As I already said there is nothing wrong with

Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-08-01 Thread Alexey Kuznetsov
Hello! Do the semantics (I'm not talking about bugs) allow skb passed to dev->hard_header() (if defined) No. dev->hard_header() should get enough of space, which is dev->hard_header_len. Actually, it is historical hole in design, inherited from ancient times. Calling conventions of

Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-07-31 Thread Alexey Kuznetsov
Hello! It does seem weird that IP output won't pay attention to Not so weird, actually. The logic was: Only initial skb allocation tries to reserve all the space to avoid copies in the future. All the rest of places just check, that there is enough space for their immediate needs. If
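
Editor's illustration of the rule described in the snippet above: space is reserved once at allocation, and every later place that prepends a header checks only for its own immediate needs. This is a user-space toy model, not the kernel skb API; ToyBuf and its methods are hypothetical names, and growing the buffer stands in for the copy that a kernel path would have to make when headroom is short.

```python
class ToyBuf:
    """Hypothetical packet buffer with reserved headroom in front of the payload."""

    def __init__(self, payload, headroom):
        self.data = bytearray(headroom) + bytearray(payload)
        self.head = headroom   # offset of the first valid byte
        self.copies = 0        # how many times the buffer had to be regrown

    def ensure_headroom(self, need):
        """Check for space before writing; regrow (copy) only when it is missing."""
        if self.head >= need:
            return
        extra = need - self.head
        self.data = bytearray(extra) + self.data
        self.head += extra
        self.copies += 1

    def push(self, header):
        """Prepend a header, checking only for this header's own needs."""
        self.ensure_headroom(len(header))
        start = self.head - len(header)
        self.data[start:self.head] = header
        self.head = start

    def packet(self):
        return bytes(self.data[self.head:])


buf = ToyBuf(b"payload", headroom=20)  # initial allocation reserved 20 bytes
buf.push(b"I" * 20)                    # fits in the reserved space, no copy
buf.push(b"E" * 14)                    # no space left: the check triggers a regrow
print(buf.copies, len(buf.packet()))
```

Skipping the check in push() would write before the start of the buffer, which is the class of bug the thread is about.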

Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread Alexey Kuznetsov
Hello! On Thu, Jul 27, 2006 at 03:46:12PM +1000, Rusty Russell wrote: Of course, it means rewriting all the userspace tools, documentation, and creating a complete new infrastructure for connection tracking and NAT, but if that's what's required, then so be it. That's what I love to hear. Not

Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-07-27 Thread Alexey Kuznetsov
Hello! ip_output() ignores dev->hard_header_len ip_output() worries about the space, which it needs. If some place needs more, it is its problem to check. To the moment where it is used, hard_header_len can even change. It can be applied, but it does not change the fact, that those placed

Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread Alexey Kuznetsov
Hello! kernel thread takes 100% cpu (with preemption Preemption, you tell... :-) I begged you to spend 1 minute of your time to press ^Z. Did you? Alexey

Re: [PATCH] ip multicast route bug fix

2006-07-26 Thread Alexey Kuznetsov
Hello! I like this. However, since the cloned skb is either discarded in case of error, or queued in which case the caller discards its reference right away, wouldn't it be simpler to just do this? Well, if we wanted just to cheat those checking tools, it is nice. But if we want clarity, it

Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Alexey Kuznetsov
Hello! Code was reusing an skb which could lead to use after free or double free. No, this does not help. The bug is not here. I was so ashamed of this that could not touch the thing. :-) It startled me a lot, how is it possible that the thing was in production for several years and such bad

Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Alexey Kuznetsov
Hello! checking tools because the skb lifetime depends on the return value. Wouldn't it be better to have a consistent interface (skb always freed), and clone the skb if needed for deferred processing? But skb is not always freed in any case. Normally it is submitted to netlink_unicast(). It

Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Alexey Kuznetsov
Hello! Wouldn't it be better to have a consistent interface (skb always freed), and clone the skb if needed for deferred processing? I am sorry, I misunderstood you. I absolutely agree. It is much better, the variant which I suggested is a good sample of bad programming. :-) Alexey

Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Alexey Kuznetsov
Hello! Wouldn't it be better to have a consistent interface (skb always freed), and clone the skb if needed for deferred processing? I think you mean this. Note, it is real skb_clone(), not alloc_skb(). Enqueued skb contains the whole half-prepared netlink message plus room for the rest. It

Re: Netchannles: first stage has been completed. Further ideas.

2006-07-24 Thread Alexey Kuznetsov
Hello! Also, there is some code for refcnt's in it that looks wrong. Yes, it is disgusting. rcu does not allow to increase socket refcnt in lookup routine. Ben's version looks cleaner here, it does not touch refcnt in rcu lookups. But it is dubious too: do_time_wait: + sock_hold(sk);

Re: Netchannles: first stage has been completed. Further ideas.

2006-07-20 Thread Alexey Kuznetsov
Hello! Small question first: userspace, but also there are big problems, like one syscall per ack, I do not see redundant syscalls. Is not it expected to send ACKs only after receiving data as you said? What is the problem? Now boring things: There is no BH protocol processing at all, so

Re: Netchannles: first stage has been completed. Further ideas.

2006-07-20 Thread Alexey Kuznetsov
Hello! Moving protocol (no matter if it is TCP or not) closer to user allows naturally control the dataflow - when user can read that data(and _this_ is the main goal), user acks, when it can not - it does not generate ack. In theory To all that I remember, in theory absence of feedback
