Re: [PATCH] connector: add parent pid and tgid to coredump and exit events
Stefan, hi

Sorry for the delay.

26.04.2018, 15:04, "Stefan Strogin" <stefan.stro...@gmail.com>:
> Hi David, Evgeniy,
>
> Sorry to bother you, but could you please comment about the UAPI change and
> the patch?

With a 4-byte pid_t everything looks fine, and I do not know of an arch where pid is currently larger, so it looks safe.

David, please pull it into your tree, or should it go via a different path?

Acked-by: Evgeniy Polyakov <z...@ioremap.net>

>> I don't see how it breaks UAPI. The point is that structures
>> coredump_proc_event and exit_proc_event are members of *union*
>> event_data, thus position of the existing data in the structure is
>> unchanged. Furthermore, this change won't increase size of struct
>> proc_event, because comm_proc_event (also a member of event_data) is
>> of bigger size than the changed structures.
>>
>> If I'm wrong, could you please explain what exactly will the change
>> break in UAPI?
>>
>> On 30/03/18 19:59, David Miller wrote:
>>> From: Stefan Strogin <sstro...@cisco.com>
>>> Date: Thu, 29 Mar 2018 17:12:47 +0300
>>>
>>>> diff --git a/include/uapi/linux/cn_proc.h b/include/uapi/linux/cn_proc.h
>>>> index 68ff25414700..db210625cee8 100644
>>>> --- a/include/uapi/linux/cn_proc.h
>>>> +++ b/include/uapi/linux/cn_proc.h
>>>> @@ -116,12 +116,16 @@ struct proc_event {
>>>>  		struct coredump_proc_event {
>>>>  			__kernel_pid_t process_pid;
>>>>  			__kernel_pid_t process_tgid;
>>>> +			__kernel_pid_t parent_pid;
>>>> +			__kernel_pid_t parent_tgid;
>>>>  		} coredump;
>>>>
>>>>  		struct exit_proc_event {
>>>>  			__kernel_pid_t process_pid;
>>>>  			__kernel_pid_t process_tgid;
>>>>  			__u32 exit_code, exit_signal;
>>>> +			__kernel_pid_t parent_pid;
>>>> +			__kernel_pid_t parent_tgid;
>>>>  		} exit;
>>>>
>>>>  	} event_data;
>>>
>>> I don't think you can add these members without breaking UAPI.
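
The union argument in this thread can be checked with a small userspace mock. The sketch below is not the real cn_proc.h: `mock_pid_t`, the two union names and `layout_is_compatible()` are made up for illustration, only three union members are modelled, and pid is assumed to be 4 bytes as stated in the reply. In this mock the grown exit structure ends up exactly as large as the comm structure, so the union does not grow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for __kernel_pid_t; assumed to be 4 bytes. */
typedef int32_t mock_pid_t;

/* Simplified pre-patch union: only three of the real members are modelled. */
union event_data_old {
	struct { mock_pid_t process_pid, process_tgid; char comm[16]; } comm;
	struct { mock_pid_t process_pid, process_tgid; } coredump;
	struct {
		mock_pid_t process_pid, process_tgid;
		uint32_t exit_code, exit_signal;
	} exit;
};

/* Same union with parent_pid/parent_tgid appended to coredump and exit. */
union event_data_new {
	struct { mock_pid_t process_pid, process_tgid; char comm[16]; } comm;
	struct {
		mock_pid_t process_pid, process_tgid;
		mock_pid_t parent_pid, parent_tgid;
	} coredump;
	struct {
		mock_pid_t process_pid, process_tgid;
		uint32_t exit_code, exit_signal;
		mock_pid_t parent_pid, parent_tgid;
	} exit;
};

/* Returns 1 if the union size and the offsets of old fields are unchanged. */
static int layout_is_compatible(void)
{
	return sizeof(union event_data_old) == sizeof(union event_data_new) &&
	       offsetof(union event_data_old, exit.exit_code) ==
	       offsetof(union event_data_new, exit.exit_code) &&
	       offsetof(union event_data_old, exit.exit_signal) ==
	       offsetof(union event_data_new, exit.exit_signal);
}
```

A consumer built against the old layout reads the same bytes at the same offsets; the new members occupy space that was already reserved by the largest union member.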
Re: [RFC] connector: add group_exit_code and signal_flags fields to exit_proc_event
Hi everyone

Sorry for the late reply

01.03.2018, 21:58, "Stefan Strogin":
> So I was thinking to add these two fields to union event_data:
> task->signal->group_exit_code
> task->signal->flags
> This won't increase size of struct proc_event (because of comm_proc_event)
> and shouldn't break backward compatibility for the user-space. But it will
> add some useful information about what caused the process death.
> What do you think, is it an acceptable approach?

As I saw in the other discussion, doesn't it break the userspace API, or are you sure that no sizes have been increased?
You are using the same structure as is used for plain signals and adding group status there; how will userspace react if it was compiled with older headers? What if it uses zero-field alignment, i.e. allocating exactly the size of the structure with byte precision?
Re: [PATCH] connector: Delete an error message for a failed memory allocation in cn_queue_alloc_callback_entry()
Hi everyone

27.08.2017, 22:25, "SF Markus Elfring" <elfr...@users.sourceforge.net>:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Sun, 27 Aug 2017 21:18:37 +0200
>
> Omit an extra message for a memory allocation failure in this function.
>
> This issue was detected by using the Coccinelle software.
>
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>

Looks good to me, thanks Markus.
There is virtually zero useful information in this print if we are in a situation where the kernel cannot allocate a few bytes to run the connector queue.

Acked-by: Evgeniy Polyakov <z...@ioremap.net>

kernel-janitors@ please queue this patch up

> ---
>  drivers/connector/cn_queue.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/connector/cn_queue.c b/drivers/connector/cn_queue.c
> index 1f8bf054d11c..e4f31d679f02 100644
> --- a/drivers/connector/cn_queue.c
> +++ b/drivers/connector/cn_queue.c
> @@ -40,10 +40,8 @@ cn_queue_alloc_callback_entry(struct cn_queue_dev *dev,
> const char *name,
>  	struct cn_callback_entry *cbq;
>
>  	cbq = kzalloc(sizeof(*cbq), GFP_KERNEL);
> -	if (!cbq) {
> -		pr_err("Failed to create new callback queue.\n");
> +	if (!cbq)
>  		return NULL;
> -	}
>
>  	atomic_set(&cbq->refcnt, 1);
>
> --
> 2.14.1
Re: [PATCH] [RFC] proc connector: add namespace events
Hi everyone

08.09.2016, 18:39, "Alban Crequy":
> The act of a process creating or joining a namespace via clone(),
> unshare() or setns() is a useful signal for monitoring applications.

> +	if (old_ns->mnt_ns != new_ns->mnt_ns)
> +		proc_ns_connector(tsk, CLONE_NEWNS, PROC_NM_REASON_CLONE, old_mntns_inum,
> new_mntns_inum);
> +
> +	if (old_ns->uts_ns != new_ns->uts_ns)
> +		proc_ns_connector(tsk, CLONE_NEWUTS, PROC_NM_REASON_CLONE,
> old_ns->uts_ns->ns.inum, new_ns->uts_ns->ns.inum);
> +
> +	if (old_ns->ipc_ns != new_ns->ipc_ns)
> +		proc_ns_connector(tsk, CLONE_NEWIPC, PROC_NM_REASON_CLONE,
> old_ns->ipc_ns->ns.inum, new_ns->ipc_ns->ns.inum);
> +
> +	if (old_ns->net_ns != new_ns->net_ns)
> +		proc_ns_connector(tsk, CLONE_NEWNET, PROC_NM_REASON_CLONE,
> old_ns->net_ns->ns.inum, new_ns->net_ns->ns.inum);
> +
> +	if (old_ns->cgroup_ns != new_ns->cgroup_ns)
> +		proc_ns_connector(tsk, CLONE_NEWCGROUP, PROC_NM_REASON_CLONE,
> old_ns->cgroup_ns->ns.inum, new_ns->cgroup_ns->ns.inum);
> +
> +	if (old_ns->pid_ns_for_children != new_ns->pid_ns_for_children)
> +		proc_ns_connector(tsk, CLONE_NEWPID, PROC_NM_REASON_CLONE,
> old_ns->pid_ns_for_children->ns.inum, new_ns->pid_ns_for_children->ns.inum);
> +	}
> +

The patch looks good to me from a technical/connector point of view, but this event multiplication is a bit weird imho.

I'm not against it, but did you consider sending just 2 serialized ns structures via a single message, so that the client would check all ns bits itself?
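
The single-message alternative suggested in the reply could look roughly like the sketch below. Everything here is hypothetical and not part of the posted patch: `enum ns_kind`, `struct ns_snapshot`, `struct ns_change_event` and `ns_changed_mask()` are illustration names. The idea is that the kernel would send the old and new namespace inode numbers once, and the client would compute which namespaces changed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical namespace indices; the patch itself uses CLONE_NEW* flags. */
enum ns_kind { NS_MNT, NS_UTS, NS_IPC, NS_NET, NS_CGROUP, NS_PID, NS_KIND_MAX };

/* One serialized view of a task's namespaces: an inode number per kind. */
struct ns_snapshot {
	uint32_t inum[NS_KIND_MAX];
};

/* Hypothetical payload of a single combined event. */
struct ns_change_event {
	int32_t pid;
	struct ns_snapshot old_ns, new_ns;
};

/* Client side: returns a bitmask with bit k set if namespace k changed. */
static unsigned ns_changed_mask(const struct ns_change_event *ev)
{
	unsigned mask = 0;

	for (int k = 0; k < NS_KIND_MAX; k++)
		if (ev->old_ns.inum[k] != ev->new_ns.inum[k])
			mask |= 1u << k;
	return mask;
}
```

The trade-off is one larger message per clone()/unshare()/setns() instead of up to six small ones, at the cost of the client doing the comparison.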
Re: [PATCH] connector: fix out-of-order cn_proc netlink message delivery
Hi Aaron

24.06.2016, 16:07, "Aaron Campbell":
> The proc connector messages include a sequence number, allowing userspace
> programs to detect lost messages. However, performing this detection is
> currently more difficult than necessary, since netlink messages can be
> delivered to the application out-of-order. To fix this, leave pre-emption
> disabled during cn_netlink_send(), and use GFP_NOWAIT.
>
> The following was written as a test case. Building the kernel w/ make -j32
> proved a reliable way to generate out-of-order cn_proc messages.

This is not actually about out-of-order sending, which is impossible iirc, but about the way fork pushes messages into the socket queue in parallel. What you've done is synchronize one layer higher.

I'm not against this patch if you think it does fix some issues, but the wording is not correct imo.
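
For context, the userspace detection discussed in this thread can be sketched as follows. This is an illustration only, not the test case from the patch: it assumes each message carries a CPU number and a sequence number that increases by one per message on that CPU, and `struct seq_tracker`, `seq_check()` and `MAX_CPUS` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS 64	/* arbitrary bound for this sketch */

enum seq_status { SEQ_OK, SEQ_GAP, SEQ_REORDERED, SEQ_BAD_CPU };

/* Next expected sequence number per CPU. */
struct seq_tracker {
	uint32_t next[MAX_CPUS];
	uint8_t seen[MAX_CPUS];
};

/* Classifies one received (cpu, seq) pair against what was seen before. */
static enum seq_status seq_check(struct seq_tracker *t, uint32_t cpu, uint32_t seq)
{
	int32_t delta;

	if (cpu >= MAX_CPUS)
		return SEQ_BAD_CPU;
	if (!t->seen[cpu]) {
		t->seen[cpu] = 1;
		t->next[cpu] = seq + 1;
		return SEQ_OK;
	}

	/* Signed difference tolerates wrap-around of the 32-bit counter. */
	delta = (int32_t)(seq - t->next[cpu]);
	if (delta == 0) {
		t->next[cpu] = seq + 1;
		return SEQ_OK;
	}
	if (delta > 0) {
		/* Messages were skipped: either lost or still in flight. */
		t->next[cpu] = seq + 1;
		return SEQ_GAP;
	}
	return SEQ_REORDERED;	/* older than expected: arrived late */
}
```

With reordering possible, a SEQ_GAP cannot be reported as a loss right away, because the missing message may still show up as SEQ_REORDERED. That is the extra difficulty the patch description refers to.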
Re: [PATCH] cn_proc: Flag termination of the last thread in the process
Hi Sergei

29.05.2015, 22:50, Sergei Zhirikov sf...@yahoo.com:
> There is no easy and reliable way for userspace to get notified of a
> process termination. The process connector sends out exit events upon
> termination of each thread, but it is not trivial for userspace to tell
> whether the just-terminated thread was the last one in the process.
> With this change a flag will be set in struct cn_proc for the exit
> event of the last thread in the process.

I have no objection against this patch, but it should really go through the cn_proc maintainer.
Feel free to add my Acked-by.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] connector: add cgroup release event report to proc connector
Hi

28.05.2015, 11:54, Dimitri John Ledkov dimitri.j.led...@intel.com:
> What you are saying is that we have inefficient notification mechanism
> that hammers everyone's boot time significantly, and no current path
> to resolve it. What can I do get us efficient cgroup release
> notifications soon?
>
> This patch-set is a no-op if one doesn't subscribe from the userspace
> and has no other side effects that I can trivially see and is very
> similar in-spirit to other notifications that proc-connector
> generates. E.g. /proc/pid/comm is exposed as a file, yet there is proc
> connector notification as well about comm name changes. Maybe Evgeniy
> can chip in, if such a notification would be beneficial to
> proc-connector.

I understand your need for new notifications related to cgroups, although I would rather put them into a separate module than into the proc connector: I'm pretty sure there will be quite a lot of extensions in this module in the future.

But if you do want to extend the proc connector module, I'm ok with it, but it should go via its maintainer.
Re: [PATCH] Add IPv6 support to TCP SYN cookies
On Tue, Feb 05, 2008 at 05:52:31PM -0800, Glenn Griffin ([EMAIL PROTECTED]) wrote:
> +static u32 cookie_hash(struct in6_addr *saddr, struct in6_addr *daddr,
> +		       __be16 sport, __be16 dport, u32 count, int c)
> +{
> +	__u32 tmp[16 + 5 + SHA_WORKSPACE_WORDS];

This huge buffer should not be allocated on stack.

--
	Evgeniy Polyakov
Re: [PATCH] Add IPv6 support to TCP SYN cookies
On Wed, Feb 06, 2008 at 10:30:24AM -0800, Glenn Griffin ([EMAIL PROTECTED]) wrote:
> > > +static u32 cookie_hash(struct in6_addr *saddr, struct in6_addr *daddr,
> > > +		       __be16 sport, __be16 dport, u32 count, int c)
> > > +{
> > > +	__u32 tmp[16 + 5 + SHA_WORKSPACE_WORDS];
> >
> > This huge buffer should not be allocated on stack.
>
> I can replace it with a kmalloc, but for my benefit what's the practical
> size we try and limit the stack to? It seemed at first glance to me
> that 404 bytes plus the arguments, etc. was not such a large buffer for
> a non-recursive function. Plus the alternative with a kmalloc requires

Well, maybe for the connection establishment path it is not, but it is absolutely the case in the sending and sometimes receiving paths for 4k stacks. The main problem is that bugs which happen because of stack overflow are so obscure that it is virtually impossible to detect where the overflow happened. 'Debug stack overflow' somehow does not help to detect it. Usually there is about 1-1.5 kb of free stack for each process, so this change will cut one third of the free stack; taking into account that something can store ipv6 addresses on stack too, this can end up badly.

> propagating the possible error status back up to tcp_ipv6.c in the event
> we are unable to allocate enough memory, so it can simply drop the
> connection. Not an impossible task by any means but it does
> significantly complicate things and I would like to know it's worth the
> effort. Also would it be worth it to provide a supplemental patch for
> the ipv4 implementation as it allocates the same buffer?

One can reorganize syncookie support to work with request hash tables too, so that we could allocate per hash-bucket space and use it as a scratchpad for cookies.

> --Glenn

--
	Evgeniy Polyakov
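
The numbers in this exchange can be checked with a few lines of arithmetic. The sketch below assumes that SHA_WORKSPACE_WORDS is 80, which is the value that yields the 404 bytes quoted by Glenn; `MOCK_SHA_WORKSPACE_WORDS`, `cookie_scratch_bytes()` and `percent_of()` are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value; chosen because (16 + 5 + 80) * 4 gives the quoted 404 bytes. */
#define MOCK_SHA_WORKSPACE_WORDS 80

/* Size in bytes of the tmp[] scratch buffer from cookie_hash(). */
static size_t cookie_scratch_bytes(void)
{
	return (16 + 5 + MOCK_SHA_WORKSPACE_WORDS) * sizeof(uint32_t);
}

/* Integer percentage of `whole` taken by `part` (rounded down). */
static unsigned percent_of(size_t part, size_t whole)
{
	return (unsigned)(part * 100 / whole);
}
```

Against the 1-1.5 kb of free stack mentioned in the reply, the buffer takes roughly 39% of 1024 bytes and 26% of 1536 bytes, which is where the "one third" estimate comes from.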
Re: [PATCH] Add IPv6 support to TCP SYN cookies
On Tue, Feb 05, 2008 at 09:02:11PM +0100, Andi Kleen ([EMAIL PROTECTED]) wrote:
> On Tue, Feb 05, 2008 at 10:29:28AM -0800, Glenn Griffin wrote:
> Syncookies are discouraged these days. They disable too many
> valuable TCP features (window scaling, SACK) and even without them
> the kernel is usually strong enough to defend against syn floods
> and systems have much more memory than they used to be.
>
> So I don't think it makes much sense to add more code to it, sorry.

How do syncookies prevent windows from growing? Most (if not all) distributions have them enabled and window growing works just fine. Actually I do not see any reason why the connection establishment handshake should prevent any run-time operations at all, even if it was set up during the handshake.

--
	Evgeniy Polyakov
Re: [PATCH] Add IPv6 support to TCP SYN cookies
On Tue, Feb 05, 2008 at 09:53:45PM +0100, Andi Kleen ([EMAIL PROTECTED]) wrote:
> > How does syncookies prevent windows from growing?
>
> Syncookies do not allow window scaling so you can't have any windows > 64k

Then you did not mean window changes, but the fact that the option is ignored, as is the one that enables SACK?

> > Most (if not all) distributions have them enabled and window growing
> > works just fine. Actually I do not see any reason why connection
> > establishment handshake should prevent any run-time operations at all,
> > even if it was setup during handshake.
>
> TCP only uses options negotiated during the hand shake and syncookies
> is incapable to do this.

What about fixing the implementation, so that it could take different options into account too?

> -Andi

--
	Evgeniy Polyakov
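
The 64k limit Andi mentions follows from the 16-bit window field in the TCP header: without a window scale option negotiated in the handshake, the advertised window cannot exceed 65535 bytes. A small sketch of the arithmetic, where `effective_window()` is a made-up helper and the cap of 14 on the shift count is taken from the TCP window scale specification:

```c
#include <assert.h>
#include <stdint.h>

#define TCP_MAX_WSCALE 14	/* largest shift count allowed for window scaling */

/*
 * Window in bytes that a peer may use, given the raw 16-bit header field
 * and the shift count negotiated in the SYN/SYN-ACK (0 when the option
 * was not negotiated, which is the syncookie case discussed above).
 */
static uint32_t effective_window(uint16_t raw, unsigned shift)
{
	if (shift > TCP_MAX_WSCALE)
		shift = TCP_MAX_WSCALE;
	return (uint32_t)raw << shift;
}
```

With a shift of 0 the maximum is 65535 bytes; with a shift of 7 the same header value describes about 8 MB.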
Re: [PATCH] Add IPv6 support to TCP SYN cookies
Hi Alan.

On Tue, Feb 05, 2008 at 09:20:17PM +, Alan Cox ([EMAIL PROTECTED]) wrote:
> > Most (if not all) distributions have them enabled and window growing
> > works just fine. Actually I do not see any reason why connection
> > establishment handshake should prevent any run-time operations at all,
> > even if it was setup during handshake.
>
> Syncookies are only triggered if the system is under a load where it would
> begin to lose connections otherwise. So they merely turn a DoS into a
> working if slightly slower setup (and 64K windows don't matter for most
> normal users, especially on mobile devices).

SACK is actually a good idea for mobile devices, so having syncookies ignore some options (btw, does it work with timestamps and PAWS?) is not a solution.

--
	Evgeniy Polyakov
[2/2] POHMELFS: hack to disable writeback.
This patch disables writeback in POHMELFS and creates all objects on
behalf of its own without sync with remote side. This mode is _very_ fast.

If POHMELFS would be bound to single remote filesystem, it could use
its inode allocation policy and be very happy with write-back cache.

By design POHMELFS is a transport layer in distributed filesystem, which
will work with some or other remote filesystem (likely completely new one),
so instead of stupid algorithm shown here, it will contain correct object
creation. Likely the way to go is to use name hash with parent inode number
as unique ID, which can be matched to filesystem path, so that remote side
could create objects without _any_ knowledge of inode numbers on the local fs.

Crappy-stuff-created-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/pohmelfs/dir.c b/fs/pohmelfs/dir.c
index 23f9ecd..5aec593 100644
--- a/fs/pohmelfs/dir.c
+++ b/fs/pohmelfs/dir.c
@@ -80,6 +80,8 @@ static struct pohmelfs_name *pohmelfs_insert_offset(struct pohmelfs_inode *pi,
 	rb_link_node(&new->offset_node, parent, n);
 	rb_insert_color(&new->offset_node, &pi->offset_root);
 
+	pi->total_len += new->len;
+
 	return NULL;
 }
 
@@ -647,6 +649,7 @@ static int pohmelfs_create_entry(struct inode *dir, struct dentry *dentry, u64 s
 	cmd->start = start;
 	netfs_set_cmd_flags(cmd, dentry->d_name.hash, mode);
 
+#if 0
 	netfs_convert_cmd(cmd);
 
 	err = netfs_data_send(st, cmd, sizeof(struct netfs_cmd));
@@ -666,6 +669,30 @@ static int pohmelfs_create_entry(struct inode *dir, struct dentry *dentry, u64 s
 	err = netfs_recv_inode_info(psb, POHMELFS_I(dir), npi, data);
 	if (err < 0)
 		goto err_out_unlock;
+#else
+	{
+		static u64 pohmelfs_ino = 123;
+
+		st->info.mode = netfs_get_inode_mode(cmd);
+		st->info.ino = pohmelfs_ino++;
+		st->info.nlink = 2;
+		st->info.uid = 2319;
+		st->info.gid = 100;
+
+		cmd->ino = st->info.ino;
+		cmd->start = POHMELFS_I(dir)->total_len;
+	}
+
+	npi = pohmelfs_new_inode(psb, POHMELFS_I(dir), data, cmd, &st->info);
+	if (IS_ERR(npi)) {
+		err = PTR_ERR(npi);
+		if (err != -EEXIST)
+			goto err_out_unlock;
+		npi = NULL;
+	} else
+		err = 0;
+	npi->state = 1;
+#endif
 	mutex_unlock(&st->lock);
 
 	d_add(dentry, &npi->vfs_inode);
diff --git a/fs/pohmelfs/inode.c b/fs/pohmelfs/inode.c
index b0ee0b3..6a81bdc 100644
--- a/fs/pohmelfs/inode.c
+++ b/fs/pohmelfs/inode.c
@@ -125,6 +125,16 @@ static int netfs_process_page(struct file *file, struct page *page, __u64 cmd_op
 	int err;
 	void *addr;
 
+	{
+		if (cmd_op == NETFS_READ_PAGE) {
+			if (file)
+				file->f_pos += cmd->size;
+		}
+		SetPageUptodate(page);
+		unlock_page(page);
+		return 0;
+	}
+
 	mutex_lock(&st->lock);
 
 	cmd->ino = inode->i_ino;
@@ -305,6 +315,7 @@ static struct inode *pohmelfs_alloc_inode(struct super_block *sb)
 
 	inode->state = 0;
 	inode->parent = 0;
+	inode->total_len = 0;
 
 	return &inode->vfs_inode;
 }
diff --git a/fs/pohmelfs/netfs.h b/fs/pohmelfs/netfs.h
index 23aa953..b719fbe 100644
--- a/fs/pohmelfs/netfs.h
+++ b/fs/pohmelfs/netfs.h
@@ -163,6 +163,8 @@ struct pohmelfs_inode
 	u64			ino;
 	u64			parent;
 
+	u64			total_len;
+
 	struct pohmelfs_name	name;
 	struct inode		vfs_inode;

--
	Evgeniy Polyakov
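
The "name hash with parent inode number as unique ID" idea from the description could be sketched as below. This is only an illustration under assumptions: the mail does not say which hash would be used, so 64-bit FNV-1a is swapped in here, and `object_id()` is a hypothetical helper. Both sides can compute the same ID from the parent's ID and the entry name, without knowing each other's inode numbers.

```c
#include <assert.h>
#include <stdint.h>

#define FNV64_OFFSET 14695981039346656037ULL
#define FNV64_PRIME  1099511628211ULL

/*
 * Derives a 64-bit object ID from the parent's ID and the entry name
 * using FNV-1a (a stand-in hash, not the one POHMELFS uses). The parent
 * ID is fed byte by byte, lowest byte first, so the result does not
 * depend on host endianness.
 */
static uint64_t object_id(uint64_t parent_id, const char *name)
{
	uint64_t h = FNV64_OFFSET;

	for (int i = 0; i < 8; i++) {
		h ^= (parent_id >> (8 * i)) & 0xff;
		h *= FNV64_PRIME;
	}
	for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
		h ^= *p;
		h *= FNV64_PRIME;
	}
	return h;
}
```

A real design would also need a collision policy, since two different (parent, name) pairs can hash to the same 64-bit value.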
[1/2] POHMELFS - network filesystem with local coherent cache.
Hi.

POHMELFS stands for Parallel Optimized Host Message Exchange
Layered File System. It allows mounting remote servers to a local
directory via network. This filesystem supports local caching
and writeback flushing.
POHMELFS is a brick in a future distributed filesystem.

This set includes two patches:
 * network filesystem with write-through cache (slow, but works with
   remote userspace server)
 * hack to show how local cache works and how much faster it is compared
   to async NFS (see below). hack disables writeback flush and performs
   local allocation of the objects only.

Now, some vaporware aka food for thoughts and your brains.

A small benchmark of the local cached mode (above hack):

$ time tar -xf /home/zbr/threading.tar

		POHMELFS	NFS v3 (async)
real		0m0.043s	0m1.679s

Which is damn 40 times!

Excited? Now get a huge bucket of ice.

The generic problem with a writeback cache is the fact that all local
objects have to have IDs in sync with the remote side. For example, if the
remote side is ext3, the local one should not overwrite the inode with
number 0. In contrast, a write-through cache allows asking the remote side
what ID the given data should have, and so stays in sync. This one is slow.

Of course this will not be _that_ huge a difference in the real world, when
tested archives are larger (this one is a git archive of my userspace
threading library), which is very small. Since it is so small there is no
writeback cache flushing, and thus the remote side never receives data.
Actually one can consider this as tmpfs or something like that. Code
supports sync, but since the inode generation process is very different,
files and dirs can not be blindly synced to the ext3.

So, this release of POHMELFS consists of two patches:
the first one is a network filesystem implementation with write-through
cache, where an object is first created on the remote side and then
populated to the local cache. This one is slow.
The second patch is a hack to disable writeback caching and implement local
caching only, which is very fast.

Next task is to think about how to generically solve the problem with
syncing local changes with the remote server, when the remote server
maintains inodes with completely different numbers. This, among others,
will allow offline work with automatic syncing after reconnect.

This is not intended for inclusion, CRFS by Zach Brown is a bit ahead of
POHMELFS, but it is not generic enough (because of the above problem),
works only with BTRFS, and was closed by Oracle so far :)

So, anyone who managed to read up to this and happened to be at LCA 08
just has to move this Friday to his presentation.

POHMELFS TODO list includes:
 * mechanism of keeping it coherent with other users
 * unified method of syncing with various remote filesystems

Thank you.

P.S. POHMELFS is about one month old, so do not be so severe with it :)

Crappy-stuff-created-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/Kconfig b/fs/Kconfig
index f9eed6d..c40f2c5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1519,6 +1519,8 @@ endmenu
 menu "Network File Systems"
 	depends on NET
 
+source "fs/pohmelfs/Kconfig"
+
 config NFS_FS
 	tristate "NFS file system support"
 	depends on INET
diff --git a/fs/Makefile b/fs/Makefile
index 720c29d..8fff82a 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -118,3 +118,4 @@ obj-$(CONFIG_HPPFS)		+= hppfs/
 obj-$(CONFIG_DEBUG_FS)		+= debugfs/
 obj-$(CONFIG_OCFS2_FS)		+= ocfs2/
 obj-$(CONFIG_GFS2_FS)		+= gfs2/
+obj-$(CONFIG_POHMELFS)		+= pohmelfs/
diff --git a/fs/pohmelfs/Kconfig b/fs/pohmelfs/Kconfig
new file mode 100644
index 000..ac19aac
--- /dev/null
+++ b/fs/pohmelfs/Kconfig
@@ -0,0 +1,6 @@
+config POHMELFS
+	tristate "POHMELFS filesystem support"
+	help
+	  POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.
+	  This is a network filesystem which supports coherent caching of data and metadata
+	  on clients.
diff --git a/fs/pohmelfs/Makefile b/fs/pohmelfs/Makefile
new file mode 100644
index 000..8a87f46
--- /dev/null
+++ b/fs/pohmelfs/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_POHMELFS)	+= pohmelfs.o
+
+pohmelfs-y := inode.o config.o dir.o net.o
diff --git a/fs/pohmelfs/config.c b/fs/pohmelfs/config.c
new file mode 100644
index 000..10eabe1
--- /dev/null
+++ b/fs/pohmelfs/config.c
@@ -0,0 +1,120 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details
Re: [1/2] POHMELFS - network filesystem with local coherent cache.
Hi.

On Fri, Feb 01, 2008 at 02:04:39AM +0100, Jan Engelhardt ([EMAIL PROTECTED]) wrote:
> > POHMELFS stands for Parallel Optimized Host Message Exchange
> > Layered File System. It allows to mount remote servers to local
> > directory via network. This filesystem supports local caching
> > and writeback flushing.
> > POHMELFS is a brick in a future distributed filesystem.
>
> A brick is usually something that is in the way - Or you also say
> the user has bricked his machine when it's quite unusable :)
> Hope you did not mean /that/.

No, this brick is a building block :)

> > This set includes two patches:
> >  * network filesystem with write-through cache (slow, but works with
> >    remote userspace server)
> >  * hack to show how local cache works and how faster it is compared
> >    to async NFS (see below). hack disables writeback flush and performs
> >    local allocation of the objects only.
> >
> > Now, some vaporware aka food for thoughts and your brains.
> >
> > A small benchmark of the local cached mode (above hack):
> >
> > $ time tar -xf /home/zbr/threading.tar
> >
> > 		POHMELFS	NFS v3 (async)
> > real		0m0.043s	0m1.679s
> >
> > Which is damn 40 times!
>
> Needs a bigger data set to compare. But what is much more important:
> does it use a single port for networing, or some
> firewall-unfriendly-by-default multiple dynamic-port-allocation
> like NFS?

It uses a single port, configurable at mount time.

A POHMELFS client can connect to different addresses (including ipv6) and via different protocols (like sctp). The metadata server will provide that information dynamically, so the pohmelfs client will be able to connect to different nodes and perform operations in parallel.

> > Next task is to think about how to generically solve the problem with
> > syncing local changes with remote server, when remote server maintains
> > inodes with completely different numbers. This, among others, will allow
> > offline work with automatic syncing after reconnect.
>
> What will happen when both nodes change an inode in disconnected state?
> Which inode wins out?

Whoever is online first. The second node will be told that there is a merge collision and it has to be resolved by hand.

> > This is not intended for inclusion, CRFS by Zach Brown is a bit ahead of
> > POHMELFS, but it is not generic enough (because of above problem), works
> > only with BTRFS, and was closed by Oracle so far :)
>
> btrfs is all we need :p

Well, at least it has some very interesting ideas. Although there are things which are not that good imho, time will show, maybe there will be another state-of-the-art filesystem at the moment...

This was for information.

> Where's the parallelism that is advertised by the POH in pohmelfs?

First, clients work with local caches and sync them either in writeback mode or via a cache coherency algorithm. This work is effectively parallel.

Second, pohmelfs as part of a distributed filesystem is developed as a transport layer that eliminates the mount operation for each different node, so that after a client asks for data, the request would simply be sent to a different server. This allows making parallel transactions.

Essentially it looks like mounting a different remote server to a virtual directory and working with it, except that connection setup should be done not at mount time, but at run time.

--
	Evgeniy Polyakov
Re: 2.6.24-rc8 ppp regression
On Wed, Jan 23, 2008 at 10:35:09AM +0100, maximilian attems ([EMAIL PROTECTED]) wrote:
> Jan 22 23:23:13 dual kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
> Jan 22 23:23:44 dual last message repeated 3 times
> Jan 22 23:23:54 dual kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
>
> 2.6.24-rc7 works fine, not yet bisected, will do later in the evening.

Fix (revert) is in Dave's tree already.

--
	Evgeniy Polyakov
[0/4] DST: Distributed storage: Succumbed to live ant.
Distributed storage: Succumbed to live ant.

I'm pleased to announce the 14th release of the distributed
storage subsystem (DST). DST allows forming a storage on top of local
and remote nodes and combining them into a linear or mirroring setup,
which in turn can be exported to remote nodes.

This is a maintenance release only.

Short changelog:
 * do not allocate big enough address structure on stack during
   local export node initialization

Thanks to Serge Leschinsky and Konstantin Kalin for testing.

The overall list of DST features can be found on the project's homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

DST is also exported as a git tree available for clone and pull from
http://tservice.net.ru/~s0mbre/archive/dst/dst.git

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
[4/4] DST: Algorithms used in distributed storage.
Algorithms used in distributed storage.
Mirror and linear mapping code.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 000..2f9ed65
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,105 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+
+	dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+			__func__, st->disk_size, n->size);
+
+	mutex_lock(&st->tree_lock);
+	n->start = st->disk_size;
+	st->disk_size += n->size;
+	dst_set_disk_size(st);
+	mutex_unlock(&st->tree_lock);
+
+	return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+	int err;
+
+	if (req->node->bdev) {
+		generic_make_request(req->bio);
+		return 0;
+	}
+
+	err = kst_check_permissions(req->state, req->bio);
+	if (err)
+		return err;
+
+	return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+	if (err)
+		set_bit(DST_NODE_FROZEN, &st->node->flags);
+	else
+		clear_bit(DST_NODE_FROZEN, &st->node->flags);
+	return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+	.remap		= dst_linear_remap,
+	.add_node	= dst_linear_add_node,
+	.del_node	= dst_linear_del_node,
+	.error		= dst_linear_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+	alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+	if (!alg_linear)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+	dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov [EMAIL PROTECTED]");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 000..529b8cb
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1614 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+#include <linux/vmstat.h>
+
+struct dst_write_entry
+{
+	int			error;
+	u32			size;
+	u64			start;
+};
+#define DST_LOG_ENTRIES_PER_PAGE (PAGE_SIZE/sizeof(struct dst_write_entry))
+
+#define DST_MIRROR_COOKIE		0xc47fd0d33274d7c6ULL
+
+struct dst_mirror_node_data
+{
+	u64			age;
+	u32			num, write_idx, resync_idx, unused;
+	u64			magic;
+};
+
+struct dst_mirror_log
+{
+	unsigned int		nr_pages;
+	struct dst_write_entry	**entries;
+};
+
+struct dst_mirror_priv
+{
+	u64			resync_start, resync_size;
+	atomic_t		resync_num;
+	struct completion	resync_complete;
+	struct delayed_work	resync_work;
+	unsigned int		resync_timeout;
+
+	u64			last_start;
+
+	spinlock_t		resync_wait_lock;
+	struct
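
The linear mapping done by dst_linear_add_node() in this patch (each new node starts where the storage currently ends, and the storage grows by the node size) can be modelled in userspace as below. This is a sketch for illustration, not DST code: `struct lin_storage`, `lin_add_node()`, `lin_find_node()` and `MAX_NODES` are made-up names, and locking is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 8	/* arbitrary bound for this sketch */

struct lin_node {
	uint64_t start, size;
};

struct lin_storage {
	uint64_t disk_size;
	size_t nr;
	struct lin_node nodes[MAX_NODES];
};

/* Appends a node at the current end of the storage; returns its index or -1. */
static int lin_add_node(struct lin_storage *st, uint64_t size)
{
	if (st->nr == MAX_NODES)
		return -1;
	st->nodes[st->nr].start = st->disk_size;	/* same rule as n->start */
	st->nodes[st->nr].size = size;
	st->disk_size += size;
	return (int)st->nr++;
}

/* Finds the node that owns a given offset of the combined storage. */
static int lin_find_node(const struct lin_storage *st, uint64_t off)
{
	for (size_t i = 0; i < st->nr; i++) {
		const struct lin_node *n = &st->nodes[i];

		if (off >= n->start && off < n->start + n->size)
			return (int)i;
	}
	return -1;
}
```

For example, after adding nodes of 100 and 50 units, offsets 0..99 belong to node 0 and offsets 100..149 to node 1.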
[3/4] DST: Network state machine.
Network state machine.

Includes network async processing state machine and related tasks.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 000..4ff14ce
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1523 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+	poll_table		pt;
+	struct kst_state	*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+		int type, int proto, int backlog)
+{
+	int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+
+	err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+	if (st) {
+		struct kst_worker *w = st->node->w;
+		unsigned long flags;
+
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (list_empty(&st->ready_entry))
+			list_add_tail(&st->ready_entry, &w->ready_list);
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		wake_up(&w->wait);
+	}
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct kst_state *st = container_of(wait, struct kst_state, wait);
+	kst_wake(st);
+	return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+		poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+	}
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+	struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+	return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+	return req;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+	if (unlikely(req->flags & DST_REQ_CHECK_QUEUE)) {
+		struct dst_request *r;
+
+		list_for_each_entry(r, &st->request_list, request_list_entry) {
+			if (bio_rw(r->bio) != bio_rw(req->bio))
+				continue;
+
+			if (r->start >= req->start + req->size)
+				continue
[1/4] DST: Distributed storage documentation.
Distributed storage documentation.

Algorithms used in the system, userspace interfaces (sysfs dirs and files),
design and implementation details are described here.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 000..1437a6a
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+an allowed number of operations. Nodes, each of which has its own start and
+size, are placed into the storage by the appropriate algorithm, which remaps
+a logical sector number into the real node's sector. One can create
+one's own algorithms, since DST has a pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+This algorithm uses the simple approach of concatenating storages into
+a single device with increased size. Essentially, the new device
+has size equal to the sum of the sizes of the underlying nodes, and the
+nodes are placed one after another.
+
+  [Diagram: Node 1, Node 2 and Node 3 laid out one after another, each with
+   its own start and end, together spanning the DST storage from its start
+   to its end; IO operations enter the DST storage.]
+
+		Figure 1.
+	3 nodes combined into single storage using linear algorithm.
+
+Mirror algorithm.
+In this algorithm nodes are placed under each other, so when an
+operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node - the algorithm keeps track of the previous operation
+and knows where it stopped, so that the subsequent seek to the
+start of the new request will take the shortest time.
+Writing is always mirrored to all underlying nodes.
+
+  [Diagram: IO operations enter the DST storage; Node 1, Node 2 and Node 3
+   are stacked under it, each marked with its own previous position.]
+
+		Figure 2.
+	3 nodes combined into single storage using mirror algorithm.
+
+Each algorithm must implement a number of callbacks,
+which must be registered during initialization time.
+
+struct dst_alg_ops
+{
+	int		(*add_node)(struct dst_node *n);
+	void		(*del_node)(struct dst_node *n);
+	int		(*remap)(struct dst_request *req);
+	int		(*error)(struct kst_state *state, int err);
+	struct module	*owner;
+};
+
+ [EMAIL PROTECTED]
+This callback is invoked when a new node is being added into the storage,
+but before the node is actually added into the storage, so that it could
+be accessed from it. When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to the remote node or got a reference to the local block device). At this
+stage the algorithm can add the node into its private map.
+It must return zero on success or negative value otherwise.
+
+ [EMAIL PROTECTED]
+This callback is invoked when a node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+It must return zero on success or negative value otherwise.
+
+ [EMAIL PROTECTED]
+This callback is invoked each time a new bio hits the storage.
+The request structure contains the BIO itself, a pointer to the node which
+originally stores the whole region under the given IO request, and various
+parameters used by the storage core to process this block request.
+It must return zero on success or negative value otherwise. It is up to
+this method to call all cleaning if remapping failed; for example, it must
+call kst_bio_endio() for the given callback in case of error, which in turn
+will call bio_endio(). Note that the dst_request structure provided in this
+callback is allocated on stack, so if there is a need to use it outside
+of the given function, it must be cloned (it will happen automatically
+in state's push callback, but that copy will not be shared by any other
+user).
+ [EMAIL PROTECTED]
+This callback is invoked for each error, which happened when processed
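The linear algorithm described in the documentation patch above can be illustrated with a small userspace sketch. This is not the code from alg_linear.c; the names struct linear_node and linear_remap() are hypothetical, and the only thing assumed from the text is that each node has a start and a size in sectors and that nodes are placed one after another.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a DST node: only start and size, in sectors. */
struct linear_node {
	uint64_t start;	/* first logical sector covered by this node */
	uint64_t size;	/* number of sectors provided by this node */
};

/*
 * Remap a logical sector of the storage into a sector of the node that
 * covers it. Returns the index of that node and stores the node-local
 * sector in *node_sector, or returns -1 if no node covers the sector.
 */
int linear_remap(const struct linear_node *nodes, size_t num,
		 uint64_t sector, uint64_t *node_sector)
{
	size_t i;

	for (i = 0; i < num; i++) {
		if (sector >= nodes[i].start &&
		    sector - nodes[i].start < nodes[i].size) {
			*node_sector = sector - nodes[i].start;
			return (int)i;
		}
	}

	return -1;
}
```

With two nodes of 800 sectors each, as in the sysfs example from the core patch (starts 0 and 800), logical sector 800 would land on sector 0 of the second node, and sector 1600 would be out of range.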
[2/4] DST: Core distributed storage files.
Core distributed storage files. Include userspace interfaces, initialization, block layer bindings and other core functionality. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index b4c8319..ca6592d 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -451,6 +451,8 @@ config ATA_OVER_ETH This driver provides Support for ATA over Ethernet block devices like the Coraid EtherDrive (R) Storage Blade. +source drivers/block/dst/Kconfig + source drivers/s390/block/Kconfig endmenu diff --git a/drivers/block/Makefile b/drivers/block/Makefile index dd88e33..fcf042d 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o obj-$(CONFIG_BLK_DEV_SX8) += sx8.o obj-$(CONFIG_BLK_DEV_UB) += ub.o +obj-$(CONFIG_DST) += dst/ diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig new file mode 100644 index 000..67a7dad --- /dev/null +++ b/drivers/block/dst/Kconfig @@ -0,0 +1,28 @@ +config DST + tristate Distributed storage + depends on NET + select CONNECTOR + select LIBCRC32C + ---help--- + This driver allows to create a distributed storage. + +config DST_DEBUG + bool DST debug + depends on DST + ---help--- + This option will turn HEAVY debugging of the DST. + Turn it on ONLY if you have to debug some really obscure problem. + +config DST_ALG_LINEAR + tristate Linear distribution algorithm + depends on DST + ---help--- + This module allows to create linear mapping of the nodes + in the distributed storage. + +config DST_ALG_MIRROR + tristate Mirror distribution algorithm + depends on DST + ---help--- + This module allows to create a mirror of the nodes in the + distributed storage. 
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile new file mode 100644 index 000..1400e94 --- /dev/null +++ b/drivers/block/dst/Makefile @@ -0,0 +1,6 @@ +obj-$(CONFIG_DST) += dst.o + +dst-y := dcore.o kst.o + +obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o +obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c new file mode 100644 index 000..22841a7 --- /dev/null +++ b/drivers/block/dst/dcore.c @@ -0,0 +1,1657 @@ +/* + * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include linux/module.h +#include linux/kernel.h +#include linux/init.h +#include linux/blkdev.h +#include linux/bio.h +#include linux/slab.h +#include linux/connector.h +#include linux/socket.h +#include linux/dst.h +#include linux/device.h +#include linux/in.h +#include linux/in6.h +#include linux/buffer_head.h + +#include net/sock.h + +static LIST_HEAD(dst_storage_list); +static LIST_HEAD(dst_alg_list); +static DEFINE_MUTEX(dst_storage_lock); +static DEFINE_MUTEX(dst_alg_lock); +static int dst_major; +static struct kst_worker *kst_main_worker; +static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL }; + +struct kmem_cache *dst_request_cache; + +static char dst_name[] = Succumbed to live ant.; + +/* + * DST sysfs tree. 
For device called 'storage' which is formed + * on top of two nodes this looks like this: + * + * /sys/bus/dst/devices/storage/ + * /sys/bus/dst/devices/storage/alg : alg_linear + * /sys/bus/dst/devices/storage/n-800/type : R: 192.168.4.80:1025 + * /sys/bus/dst/devices/storage/n-800/size : 800 + * /sys/bus/dst/devices/storage/n-800/start : 800 + * /sys/bus/dst/devices/storage/n-800/clean + * /sys/bus/dst/devices/storage/n-800/dirty + * /sys/bus/dst/devices/storage/n-0/type : R: 192.168.4.81:1025 + * /sys/bus/dst/devices/storage/n-0/size : 800 + * /sys/bus/dst/devices/storage/n-0/start : 0 + * /sys/bus/dst/devices/storage/n-0/clean + * /sys/bus/dst/devices/storage/n-0/dirty + * /sys/bus/dst/devices/storage/remove_all_nodes + * /sys/bus/dst/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800] + * /sys/bus/dst/devices/storage/name : storage + */ + +static int dst_dev_match(struct device *dev, struct device_driver *drv) +{ + return 1; +} + +static void dst_dev_release(struct device *dev) +{ +} + +static struct bus_type dst_dev_bus_type = { + .name = dst, + .match = dst_dev_match, +}; + +static struct device dst_dev = { + .bus
Re: [Bugme-new] [Bug 9778] New: unregister_netdevice: waiting for [device] to become free
On Sun, Jan 20, 2008 at 02:30:27AM -0800, David Miller ([EMAIL PROTECTED]) wrote:
> From: Andrew Morton [EMAIL PROTECTED]
> Date: Sat, 19 Jan 2008 16:58:02 -0800
>
> > ouch.
>
> Yep, several people are hitting this it seems. If Pavel doesn't
> provide a fix or direction soon I'll just revert.

It looks like the patch is still valid. Here is the problem description as I
understood it.

When a new device (let's talk about ethernet, since that is what I tested)
is being turned on, it gets a neigh_parms entry allocated for it via
inetdev_init(), which is called for the NETDEV_REGISTER inetdev event. This
entry is stored in the arp_tbl table and is in_dev->arp_parms.

When later a new arp entry is created, the device is provided to
arp_constructor(), which clones (increases the reference counter of) the
device's in_dev->arp_parms and puts it into the provided neighbour entry.

When later we remove the device, its in_dev->arp_parms reference counter is
high enough (it is equal to the number of arp entries found on the given
device plus one), so neigh_parms_destroy() is not called. Later all
neighbour entries are flushed by the garbage collector, the reference
counter for that parm hits zero and the device can be removed.

I will think about how to fix the problem nicely or if this patch still can
be simplified/dropped, but so far it looks valid. Maybe this analysis will
help someone to fix the problem first.

Here is debug dmesg:

[ 21.835595] inetdev_init: allocating parms.
[ 21.839829] neigh_parms_alloc: parms: 81003d8e8df0, dev: eth0, refcnt: 1, dev_refcnt: 2.
...
[ 30.251576] r8169: eth0: link up
[ 31.067079] NET: Registered protocol family 10
[ 31.072055] neigh_parms_alloc: parms: 81003efc72a8, dev: lo, refcnt: 1, dev_refcnt: 9.
[ 31.080891] neigh_alloc: parms: 8812afe8, dev: NULL, refcnt: 2.
[ 31.087816] neigh_parms_alloc: parms: 81003efc7210, dev: eth0, refcnt: 1, dev_refcnt: 9.
[ 31.097335] neigh_alloc: parms: 804deb88, dev: NULL, refcnt: 2.
[ 31.104172] arp_constructor: parms: 81003f8c3be8, dev: lo, refcnt: 2.
[ 31.500348] neigh_alloc: parms: 8812afe8, dev: NULL, refcnt: 2.
[ 32.499628] neigh_alloc: parms: 8812afe8, dev: NULL, refcnt: 2.
[ 102.827796] neigh_destroy: parms: 81003efc7210, dev: eth0, refcnt: 3, dev_refcnt: 13.
[ 106.828843] neigh_destroy: parms: 81003f8c3be8, dev: lo, refcnt: 2, dev_refcnt: 78.
[ 109.810987] neigh_alloc: parms: 804deb88, dev: NULL, refcnt: 2.

First arp entry for eth0 device, bump the counter:
[ 109.817827] arp_constructor: parms: 81003d8e8df0, dev: eth0, refcnt: 2.

[ 109.831811] neigh_alloc: parms: 804deb88, dev: NULL, refcnt: 2.
[ 109.838661] arp_constructor: parms: 81003f8c3be8, dev: lo, refcnt: 2.
[ 110.837894] neigh_destroy: parms: 81003efc7210, dev: eth0, refcnt: 2, dev_refcnt: 15.

Can not release that neigh parm:
[ 113.638228] neigh_parms_release: parms: 81003d8e8df0, dev: eth0, refcnt: 2, dev_refcnt: 5.

Can release some other (for ipv6):
[ 113.649380] neigh_parms_release: parms: 81003efc7210, dev: eth0, refcnt: 1, dev_refcnt: 5.
[ 113.671806] neigh_parms_destroy: parms: 81003efc7210, dev: eth0, dev_refcnt: 3.

[ 123.916250] unregister_netdevice: waiting for eth0 to become free. Usage count = 1

GC hits us:
[ 124.839572] neigh_destroy: parms: 81003d8e8df0, dev: eth0, refcnt: 1, dev_refcnt: 11.
[ 124.847813] neigh_parms_destroy: parms: 81003d8e8df0, dev: eth0, dev_refcnt: 1.
[ 124.952026] ACPI: PCI interrupt for device :02:0d.0 disabled

--
	Evgeniy Polyakov
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
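The reference counting described in the analysis above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the model_* names are hypothetical, plain integers stand in for the kernel's atomic counters, and the device reference is dropped only when the parms counter reaches zero, which is the behaviour the dmesg trace shows.

```c
#include <stdbool.h>

/* Hypothetical model of a net device: just its usage count. */
struct model_dev {
	int refcnt;
};

/* Hypothetical model of neigh_parms: own counter plus the pinned device. */
struct model_parms {
	int refcnt;
	struct model_dev *dev;
	bool destroyed;
};

/* Device registration: parms start with one reference and pin the device. */
void model_parms_alloc(struct model_parms *p, struct model_dev *dev)
{
	p->refcnt = 1;
	p->dev = dev;
	p->destroyed = false;
	dev->refcnt++;
}

/* Drop one parms reference; the device is unpinned only on the last one. */
void model_parms_put(struct model_parms *p)
{
	if (--p->refcnt == 0) {
		p->dev->refcnt--;
		p->destroyed = true;
	}
}

/* A new arp entry clones the parms, i.e. takes one more reference. */
void model_neigh_create(struct model_parms *p)
{
	p->refcnt++;
}

/* The garbage collector destroys an arp entry and drops its reference. */
void model_neigh_destroy(struct model_parms *p)
{
	model_parms_put(p);
}

/* Device removal drops only the initial reference. */
void model_parms_release(struct model_parms *p)
{
	model_parms_put(p);
}
```

In this model, releasing the parms while two arp entries are alive leaves the parms counter at 2 and the device still pinned, which corresponds to the "waiting for eth0 to become free" message; the device count drops only after both entries are destroyed.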
Re: [Bugme-new] [Bug 9778] New: unregister_netdevice: waiting for [device] to become free
On Mon, Jan 21, 2008 at 03:14:45PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> It looks like the patch is still valid. Here is the problem description as I
> understood it.
>
> When a new device (let's talk about ethernet, since that is what I tested)
> is being turned on, it gets a neigh_parms entry allocated for it via
> inetdev_init(), which is called for the NETDEV_REGISTER inetdev event. This
> entry is stored in the arp_tbl table and is in_dev->arp_parms.
>
> When later a new arp entry is created, the device is provided to
> arp_constructor(), which clones (increases the reference counter of) the
> device's in_dev->arp_parms and puts it into the provided neighbour entry.
>
> When later we remove the device, its in_dev->arp_parms reference counter is
> high enough (it is equal to the number of arp entries found on the given
> device plus one), so neigh_parms_destroy() is not called. Later all
> neighbour entries are flushed by the garbage collector, the reference
> counter for that parm hits zero and the device can be removed.
>
> I will think about how to fix the problem nicely or if this patch still can
> be simplified/dropped, but so far it looks valid. Maybe this analysis will
> help someone to fix the problem first.

Yes, the patch is valid, and there is a (very noticeable) race between
neighbour processing and parm release: the parm can still be accessed after
the device was fully freed (as with the old behaviour, when dev_put() was
called from neigh_parms_release()), although no one accesses it. So the
simplest solution is to move dev_put() under the table lock, to allow
access to parms->dev only under the table lock, and to always check that it
is non-NULL.

So I propose the following patch as the simplest solution for the time being.
Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index a4f2618..410b7e7 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -34,6 +34,11 @@ struct neighbour;
 
 struct neigh_parms
 {
+	/*
+	 * This device is only allowed to be accessed under table lock (bh turned off)
+	 * and while device is alive. After parm was released, it will be set to NULL
+	 * and has to be always checked before accessed.
+	 */
 	struct net_device *dev;
 	struct neigh_parms *next;
 	int	(*neigh_setup)(struct neighbour *);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index cc8a2f1..5076acd 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1315,7 +1315,12 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms)
 		if (*p == parms) {
 			*p = parms->next;
 			parms->dead = 1;
+			if (parms->dev) {
+				dev_put(parms->dev);
+				parms->dev = NULL;
+			}
 			write_unlock_bh(&tbl->lock);
+
 			call_rcu(&parms->rcu_head, neigh_rcu_free_parms);
 			return;
 		}
@@ -1326,8 +1331,6 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms)
 
 void neigh_parms_destroy(struct neigh_parms *parms)
 {
-	if (parms->dev)
-		dev_put(parms->dev);
 	kfree(parms);
 }
 
--
	Evgeniy Polyakov
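The rule this patch introduces (drop the device reference and clear the pointer while holding the table lock, and have every reader check the pointer under the same lock) can be sketched in userspace as follows. This is an illustration only, not kernel code: a pthread mutex stands in for the neighbour table lock, and the sketch_* names and the mtu field are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical device model: a usage count and one field readers want. */
struct sketch_dev {
	int refcnt;
	int mtu;
};

struct sketch_parms {
	struct sketch_dev *dev;	/* valid only under tbl_lock, NULL after release */
	int dead;
};

/* Stand-in for the neighbour table lock. */
pthread_mutex_t tbl_lock = PTHREAD_MUTEX_INITIALIZER;

/* Release: the device reference is dropped while the lock is still held. */
void sketch_parms_release(struct sketch_parms *p)
{
	pthread_mutex_lock(&tbl_lock);
	p->dead = 1;
	if (p->dev) {
		p->dev->refcnt--;	/* stands in for dev_put() */
		p->dev = NULL;
	}
	pthread_mutex_unlock(&tbl_lock);
}

/*
 * Reader: take the lock, check the pointer, copy out what is needed.
 * Returns 0 on success or -1 if the parms were already released.
 */
int sketch_parms_dev_mtu(struct sketch_parms *p, int *mtu)
{
	int ret = -1;

	pthread_mutex_lock(&tbl_lock);
	if (p->dev) {
		*mtu = p->dev->mtu;
		ret = 0;
	}
	pthread_mutex_unlock(&tbl_lock);

	return ret;
}
```

The point of the design is that the device can go away as soon as the release returns, because no reader can observe a stale pointer; the parms structure itself may live longer, which is what the RCU-deferred free in the patch relies on.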
Re: SACK scoreboard
Hi.

On Wed, Jan 09, 2008 at 08:03:18AM +0100, Andi Kleen ([EMAIL PROTECTED]) wrote:
> > It adds severe spikes in CPU utilization that at even moderate line
> > rates begin to affect RTTs.
> >
> > Or do you think it's OK to process 500,000 SKBs while locked
> > in a software interrupt.
>
> You can always push it into a work queue. Even put it to
> other cores if you want.
>
> In fact this is already done partly for the ->completion_queue.
> Wouldn't be a big change to queue it another level down.
>
> Also even freeing a lot of objects doesn't have to be
> that expensive. I suspect the most cost is in taking
> the slab locks, but that could be batched. Without
> that the kmem_free fast path isn't particularly
> expensive, as long as the headers are still in cache.

Postponing freeing of the skb has major drawbacks. Some time ago I made a
patch to postpone skb freeing behind RCU and got 2.5 times slower
connection speed on some machines, with decreased CPU usage though.
So a queueing solution has to be proven with real data: although it
looks good in one situation, it can be really bad in another.

For the interested reader, results of the RCUfication of kfree_skbmem():
http://tservice.net.ru/~s0mbre/blog/devel/networking/2006/12/05

--
	Evgeniy Polyakov
Re: Evgeniy Polyakov
On Thu, Dec 20, 2007 at 08:34:59PM -0800, David Miller ([EMAIL PROTECTED]) wrote:
> If someone has a way other than email to contact Evgeniy, could you
> please let him know that his email is bouncing in strange ways.

Yep, I saw him a couple of times and will try to contact him.

> I'll have to unsubscribe him if this goes on much longer, which I
> don't want to do.
>
> Thanks.
>
> Here is some example bounce text:
>
> 451 4.0.0 readqf: cannot open ./dflBL48UH3032179: No such file or directory
> 552 5.3.4 Message is too large; 1500 bytes max
> 554 5.0.0 Service unavailable

This looks really strange to me - I will forward it to the system admin of
the university server where I have my mail. Likely it is because of some
trouble with the mail queue or FS...

Do not unsubscribe me :)

--
	Evgeniy Polyakov
Re: [0/4] DST: Distributed storage.
Hi David.

On Tue, Dec 18, 2007 at 12:00:04PM +1100, David Chinner ([EMAIL PROTECTED]) wrote:
> On Mon, Dec 17, 2007 at 06:03:38PM +0300, Evgeniy Polyakov wrote:
> > DST passed all FS tests in LTP with XFS (modulo MAX_LOCK_DEPTH too low bug:
> > [ 8398.605691] BUG: MAX_LOCK_DEPTH too low!
> > [ 8398.609641] turning off the locking correctness validator.
>
> Evgeniy, can you please start reporting these XFS problems you are
> coming across to the XFS list ([EMAIL PROTECTED])? They may be real
> issues that we need to address and we should not be hearing about
> them for the first time in the release notes for a block device
> project

It is not XFS as such, but a lock validator warning. I just found it while
working with XFS - it can be anything else: VFS, block layer, DST itself
(there are a number of locks there too), so I did not file a bug against
the filesystem, but pointed out that some problem, probably harmless,
exists in the tested environment.

--
	Evgeniy Polyakov
[0/4] DST: Distributed storage.
Distributed storage.

I'm pleased to announce the 12th release of the distributed
storage subsystem (DST).

DST allows forming a storage on top of local and remote nodes and
combining them into a linear or mirroring setup, which in turn can be
exported to remote nodes.

Short changelog:
 * new improved mirroring algorithm. This algorithm uses a sliding
	window approach for full resync and a write log for partial resync.
 * fixed a number of typos and debug cleanups
 * update inode size when linear algorithm changes the size of
	the storage at run time
 * extended number of sysfs files and documentation for them
 * fixed leak in local export node setup
 * name is 'Dancing with the smoked neutrino' now

The overall list of features of the DST can be found on the project's homepage:
http://tservice.net.ru/~s0mbre/old/?section=projectsitem=dst

DST is also exported as a git tree available for clone and pull from
http://tservice.net.ru/~s0mbre/archive/dst/dst.git

The interested reader can test DST with the 2.6.23 tree too (it should
compile fine, but was not tested).

DST passed all FS tests in LTP with XFS (modulo MAX_LOCK_DEPTH too low bug:
[ 8398.605691] BUG: MAX_LOCK_DEPTH too low!
[ 8398.609641] turning off the locking correctness validator.
this is not a DST problem though), but the testing was not performed with
offline/online nodes.

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
[2/4] DST: Core distributed storage files.
Core distributed storage files. Include userspace interfaces, initialization, block layer bindings and other core functionality. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index b4c8319..ca6592d 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -451,6 +451,8 @@ config ATA_OVER_ETH This driver provides Support for ATA over Ethernet block devices like the Coraid EtherDrive (R) Storage Blade. +source drivers/block/dst/Kconfig + source drivers/s390/block/Kconfig endmenu diff --git a/drivers/block/Makefile b/drivers/block/Makefile index dd88e33..fcf042d 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o obj-$(CONFIG_BLK_DEV_SX8) += sx8.o obj-$(CONFIG_BLK_DEV_UB) += ub.o +obj-$(CONFIG_DST) += dst/ diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig new file mode 100644 index 000..67a7dad --- /dev/null +++ b/drivers/block/dst/Kconfig @@ -0,0 +1,28 @@ +config DST + tristate Distributed storage + depends on NET + select CONNECTOR + select LIBCRC32C + ---help--- + This driver allows to create a distributed storage. + +config DST_DEBUG + bool DST debug + depends on DST + ---help--- + This option will turn HEAVY debugging of the DST. + Turn it on ONLY if you have to debug some really obscure problem. + +config DST_ALG_LINEAR + tristate Linear distribution algorithm + depends on DST + ---help--- + This module allows to create linear mapping of the nodes + in the distributed storage. + +config DST_ALG_MIRROR + tristate Mirror distribution algorithm + depends on DST + ---help--- + This module allows to create a mirror of the nodes in the + distributed storage. 
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile new file mode 100644 index 000..1400e94 --- /dev/null +++ b/drivers/block/dst/Makefile @@ -0,0 +1,6 @@ +obj-$(CONFIG_DST) += dst.o + +dst-y := dcore.o kst.o + +obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o +obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c new file mode 100644 index 000..423e7b2 --- /dev/null +++ b/drivers/block/dst/dcore.c @@ -0,0 +1,1622 @@ +/* + * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include linux/module.h +#include linux/kernel.h +#include linux/init.h +#include linux/blkdev.h +#include linux/bio.h +#include linux/slab.h +#include linux/connector.h +#include linux/socket.h +#include linux/dst.h +#include linux/device.h +#include linux/in.h +#include linux/in6.h +#include linux/buffer_head.h + +#include net/sock.h + +static LIST_HEAD(dst_storage_list); +static LIST_HEAD(dst_alg_list); +static DEFINE_MUTEX(dst_storage_lock); +static DEFINE_MUTEX(dst_alg_lock); +static int dst_major; +static struct kst_worker *kst_main_worker; +static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL }; + +struct kmem_cache *dst_request_cache; + +static char dst_name[] = Dancing with the smoked neutrino; + +/* + * DST sysfs tree. 
For device called 'storage' which is formed + * on top of two nodes this looks like this: + * + * /sys/devices/storage/ + * /sys/devices/storage/alg : alg_linear + * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025 + * /sys/devices/storage/n-800/size : 800 + * /sys/devices/storage/n-800/start : 800 + * /sys/devices/storage/n-800/clean + * /sys/devices/storage/n-800/dirty + * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025 + * /sys/devices/storage/n-0/size : 800 + * /sys/devices/storage/n-0/start : 0 + * /sys/devices/storage/n-0/clean + * /sys/devices/storage/n-0/dirty + * /sys/devices/storage/remove_all_nodes + * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800] + * /sys/devices/storage/name : storage + */ + +static int dst_dev_match(struct device *dev, struct device_driver *drv) +{ + return 1; +} + +static void dst_dev_release(struct device *dev) +{ +} + +static struct bus_type dst_dev_bus_type = { + .name = dst, + .match = dst_dev_match, +}; + +static struct device dst_dev = { + .bus= dst_dev_bus_type, + .release= dst_dev_release +}; + +static void dst_node_release(struct device *dev
[3/4] DST: Network state machine.
Network state machine. Includes network async processing state machine and related tasks. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c new file mode 100644 index 000..6d92014 --- /dev/null +++ b/drivers/block/dst/kst.c @@ -0,0 +1,1515 @@ +/* + * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include linux/kernel.h +#include linux/module.h +#include linux/list.h +#include linux/slab.h +#include linux/socket.h +#include linux/kthread.h +#include linux/net.h +#include linux/in.h +#include linux/poll.h +#include linux/bio.h +#include linux/dst.h + +#include net/sock.h + +struct kst_poll_helper +{ + poll_table pt; + struct kst_state*st; +}; + +static LIST_HEAD(kst_worker_list); +static DEFINE_MUTEX(kst_worker_mutex); + +/* + * This function creates bound socket for local export node. 
+ */ +static int kst_sock_create(struct kst_state *st, struct saddr *addr, + int type, int proto, int backlog) +{ + int err; + + err = sock_create(addr-sa_family, type, proto, st-socket); + if (err) + goto err_out_exit; + + err = st-socket-ops-bind(st-socket, (struct sockaddr *)addr, + addr-sa_data_len); + + err = st-socket-ops-listen(st-socket, backlog); + if (err) + goto err_out_release; + + st-socket-sk-sk_allocation = GFP_NOIO; + + return 0; + +err_out_release: + sock_release(st-socket); +err_out_exit: + return err; +} + +static void kst_sock_release(struct kst_state *st) +{ + if (st-socket) { + sock_release(st-socket); + st-socket = NULL; + } +} + +void kst_wake(struct kst_state *st) +{ + if (st) { + struct kst_worker *w = st-node-w; + unsigned long flags; + + spin_lock_irqsave(w-ready_lock, flags); + if (list_empty(st-ready_entry)) + list_add_tail(st-ready_entry, w-ready_list); + spin_unlock_irqrestore(w-ready_lock, flags); + + wake_up(w-wait); + } +} +EXPORT_SYMBOL_GPL(kst_wake); + +/* + * Polling machinery. + */ +static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode, + int sync, void *key) +{ + struct kst_state *st = container_of(wait, struct kst_state, wait); + kst_wake(st); + return 1; +} + +static void kst_queue_func(struct file *file, wait_queue_head_t *whead, +poll_table *pt) +{ + struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)-st; + + st-whead = whead; + init_waitqueue_func_entry(st-wait, kst_state_wake_callback); + add_wait_queue(whead, st-wait); +} + +static void kst_poll_exit(struct kst_state *st) +{ + if (st-whead) { + remove_wait_queue(st-whead, st-wait); + st-whead = NULL; + } +} + +/* + * This function removes request from state tree and ordering list. 
+ */ +void kst_del_req(struct dst_request *req) +{ + list_del_init(req-request_list_entry); +} +EXPORT_SYMBOL_GPL(kst_del_req); + +static struct dst_request *kst_req_first(struct kst_state *st) +{ + struct dst_request *req = NULL; + + if (!list_empty(st-request_list)) + req = list_entry(st-request_list.next, struct dst_request, + request_list_entry); + return req; +} + +/* + * This function dequeues first request from the queue and tree. + */ +static struct dst_request *kst_dequeue_req(struct kst_state *st) +{ + struct dst_request *req; + + mutex_lock(st-request_lock); + req = kst_req_first(st); + if (req) + kst_del_req(req); + mutex_unlock(st-request_lock); + return req; +} + +/* + * This function enqueues request into tree, indexed by start of the request, + * and also puts request into ordered queue. + */ +int kst_enqueue_req(struct kst_state *st, struct dst_request *req) +{ + if (unlikely(req-flags DST_REQ_CHECK_QUEUE)) { + struct dst_request *r; + + list_for_each_entry(r, st-request_list, request_list_entry) { + if (bio_rw(r-bio) != bio_rw(req-bio)) + continue; + + if (r-start = req-start + req-size) + continue
[1/4] DST: Distributed storage documentation.
Distributed storage documentation.

Algorithms used in the system, userspace interfaces (sysfs dirs and
files), design and implementation details are described here.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 000..1437a6a
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has its own start and size,
+are placed into storage by the appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has a pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+Simple approach of concatenating storages into single device with
+increased size is used in this algorithm. Essentially the new device
+has size equal to the sum of sizes of underlying nodes, and nodes are
+placed one after another.
+
+	[ASCII diagram: Node 1, Node 2 and Node 3, each with its own start
+	 and end, laid out one after another between the start and end of
+	 the DST storage; IO operations enter the storage from below.]
+
+		Figure 1.
+	3 nodes combined into single storage using linear algorithm.
+
+Mirror algorithm.
+In this algorithm nodes are placed under each other, so when
+operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node - algorithm keeps track of previous operation
+and knows where it was stopped, so that subsequent seek to the
+start of the new request will take the shortest time.
+Writing is always mirrored to all underlying nodes.
+
+	[ASCII diagram: IO operations enter the DST storage from above;
+	 Node 1, Node 2 and Node 3 are stacked under it, and the storage
+	 and each node carry their own "prev position" marker.]
+
+		Figure 2.
+	3 nodes combined into single storage using mirror algorithm.
+
+Each algorithm must implement a number of callbacks,
+which must be registered during initialization time.
+
+struct dst_alg_ops
+{
+	int (*add_node)(struct dst_node *n);
+	void (*del_node)(struct dst_node *n);
+	int (*remap)(struct dst_request *req);
+	int (*error)(struct kst_state *state, int err);
+	struct module *owner;
+};
+
+@add_node.
+This callback is invoked when new node is being added into the storage,
+but before node is actually added into the storage, so that it could
+be accessed from it. When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to remote node or got a reference to the local block device). At this
+stage algorithm can add node into private map.
+It must return zero on success or negative value otherwise.
+
+@del_node.
+This callback is invoked when node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+It must return zero on success or negative value otherwise.
+
+@remap.
+This callback is invoked each time new bio hits the storage.
+Request structure contains BIO itself, pointer to the node, which originally
+stores the whole region under given IO request, and various parameters
+used by storage core to process this block request.
+It must return zero on success or negative value otherwise. It is up to
+this method to call all cleaning if remapping failed, for example it must
+call kst_bio_endio() for given callback in case of error, which in turn
+will call bio_endio(). Note that the dst_request structure provided in this
+callback is allocated on stack, so if there is a need to use it outside
+of the given function, it must be cloned (it will happen automatically
+in state's push callback, but that copy will not be shared by any other
+user).
+
+@error.
+This callback is invoked for each error, which happened when processed
[4/4] DST: Algorithms used in distributed storage.
Algorithms used in distributed storage.

Mirror and linear mapping code.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 000..836764d
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,114 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+	struct block_device *bdev;
+
+	dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+			__func__, st->disk_size, n->size);
+
+	mutex_lock(&st->tree_lock);
+	n->start = st->disk_size;
+	st->disk_size += n->size;
+	set_capacity(st->disk, st->disk_size);
+
+	bdev = bdget_disk(st->disk, 0);
+	if (bdev) {
+		mutex_lock(&bdev->bd_inode->i_mutex);
+		i_size_write(bdev->bd_inode, to_bytes(st->disk_size));
+		mutex_unlock(&bdev->bd_inode->i_mutex);
+		bdput(bdev);
+	}
+	mutex_unlock(&st->tree_lock);
+
+	return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+	int err;
+
+	if (req->node->bdev) {
+		generic_make_request(req->bio);
+		return 0;
+	}
+
+	err = kst_check_permissions(req->state, req->bio);
+	if (err)
+		return err;
+
+	return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+	if (err)
+		set_bit(DST_NODE_FROZEN, &st->node->flags);
+	else
+		clear_bit(DST_NODE_FROZEN, &st->node->flags);
+	return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+	.remap		= dst_linear_remap,
+	.add_node	= dst_linear_add_node,
+	.del_node	= dst_linear_del_node,
+	.error		= dst_linear_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+	alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+	if (!alg_linear)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+	dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov [EMAIL PROTECTED]");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 000..c10d582
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1536 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+#include <linux/vmstat.h>
+
+struct dst_write_entry
+{
+	int		error;
+	u32		size;
+	u64		start;
+};
+#define DST_LOG_ENTRIES_PER_PAGE (PAGE_SIZE/sizeof(struct dst_write_entry))
+
+struct dst_mirror_node_data
+{
+	u64		age;
+	u64		num, write_idx, resync_idx;
+};
+
+struct dst_mirror_log
+{
+	unsigned int		nr_pages;
+	struct dst_write_entry	**entries;
+};
+
+struct dst_mirror_priv
+{
+	u64			resync_start, resync_size;
+	atomic_t		resync_num;
+	struct completion	resync_complete
Re: Badness at net/core/dev.c:2199
On Sun, Dec 16, 2007 at 07:55:55PM +0200, Meelis Roos ([EMAIL PROTECTED]) wrote:
> Just got this trace from current 2.6.24-rc5+git running on 32-bit ppc
> (PReP subarch, tulip NIC's) during apt-get update (logged in via ssh so
> also ssh traffic):
>
> [ cut here ]
> Badness at net/core/dev.c:2199

Please test the attached patch.

If I understood tulip correctly, it is possible for the reported number
of entries to be higher than the requested budget. When work_done is
equal to budget-1, the last skb has to be processed, and after line 154
work_done becomes equal to budget, so the loop has to break. The check
on that same line 154 then ends the loop, but work_done is incremented
nevertheless, which leaves work_done equal to budget+1 at exit and
fires the warning you saw.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/net/tulip/interrupt.c b/drivers/net/tulip/interrupt.c
index 3653314..9e0e97a 100644
--- a/drivers/net/tulip/interrupt.c
+++ b/drivers/net/tulip/interrupt.c
@@ -151,8 +151,9 @@ int tulip_poll(struct napi_struct *napi, int budget)
 		if (tulip_debug > 5)
 			printk(KERN_DEBUG "%s: In tulip_rx(), entry %d %8.8x.\n",
 				dev->name, entry, status);
-		if (work_done++ >= budget)
+		if (work_done >= budget)
 			goto not_done;
+		work_done++;
 
 		if ((status & 0x38008300) != 0x0300) {
 			if ((status & 0x38000300) != 0x0300) {

-- 
	Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Badness at net/core/dev.c:2199
On Sun, Dec 16, 2007 at 10:33:40AM -0800, Stephen Hemminger ([EMAIL PROTECTED]) wrote:
> > index 3653314..9e0e97a 100644
> > --- a/drivers/net/tulip/interrupt.c
> > +++ b/drivers/net/tulip/interrupt.c
> > @@ -151,8 +151,9 @@ int tulip_poll(struct napi_struct *napi, int budget)
> >  		if (tulip_debug > 5)
> >  			printk(KERN_DEBUG "%s: In tulip_rx(), entry %d %8.8x.\n",
> >  				dev->name, entry, status);
> > -		if (work_done++ >= budget)
> > +		if (work_done >= budget)
> >  			goto not_done;
> > +		work_done++;
> >
> >  		if ((status & 0x38008300) != 0x0300) {
> >  			if ((status & 0x38000300) != 0x0300) {
>
> I already sent out a correct patch last week. It should pre-increment.

That will work too.
Thanks Stephen.

-- 
	Evgeniy Polyakov
Re: [3/4] DST: Network state machine.
On Thu, Dec 13, 2007 at 11:43:43PM +0300, Dmitry Monakhov ([EMAIL PROTECTED]) wrote:
> On 14:47 Mon 10 Dec, Evgeniy Polyakov wrote:
> > Network state machine.
> >
> > Includes network async processing state machine and related tasks.
>
> Hi, I've tried to play a little bit with DST and discovered a huge
> memory leak. Every read request from a remote node results in a leak
> of the bio and the bio's pages.
>
> Data flow:
> ->kst_export_ready           ## prepare and submit bio
> ->generic_make_request(bio)  ## submit it
> ->kst_export_read_end_io     ## block layer calls bio_end_io callback
> ->kst_thread_process_state   ## process ready requests
> ->kst_data_callback
> ->kst_data_process_bio       ## submit pages to network layer
> ->kst_complete_req
> ->kst_bio_endio
> ->kst_export_read_end_io     ## WoW, we are calling the same bio_end_io
>                              ## callback twice
> ->dst_free_request(req);     ## request will be destroyed but its bio
>                              ## and all the bio's pages weren't released.
>
> We may release the bio's pages after they were sent to the network; it
> is safe because sendpage() already called get_page().
> I've attached a simple patch which does this.

Yes, your patch looks good.
Thanks a lot, Dmitry.

-- 
	Evgeniy Polyakov
Re: What was the reason for 2.6.22 SMP kernels to change how sendmsg is called?
Hi Kevin.

On Thu, Dec 13, 2007 at 04:00:02PM -0600, Kevin Wilson ([EMAIL PROTECTED]) wrote:
> I see your point but it just so happens it is a GPL'd driver, as is all
> of our Linux code we produce for our hardware. Granted it is out of
> tree, and after you saw it you would want it to stay that way. However,
> I would have sent you the whole thing if that is a pre-req to cordial
> exchanges on this list.
>
> Nonetheless, a somewhat recent change in your tree, that I could not
> pinpoint on my own, caused the driver to stop functioning properly. So
> after much searching in git/google/sources with no luck, I decided to
> ask for a little assistance, maybe just a hint as to where the culprit
> may be in the tree so I could investigate for myself. For SNGs I tried
> the method that now works but I am still at a loss as to (can't find)
> what changes in the tree caused it to fail.

Without having your code it is virtually impossible to say why you have
a bug. And do not express your frustration by saying 'zero people
responded to my bug report'. This was not a bug report at all, but an
empty message about 'my code stopped working after some network
changes, which broke the stuff'.

> Now in 2.6.22 and later kernels you must use the higher level SOCKET to
> make a call to PROTO_OPS then to sendmsg(). e.g.,
> socket->ops->sendmsg().

It was done because of a bug found in inet_sendmsg(), which tried to
autobind a socket in cases where it should not.

-- 
	Evgeniy Polyakov
Re: [4/4] DST: Algorithms used in distributed storage.
On Wed, Dec 12, 2007 at 12:12:47PM +0300, Dmitry Monakhov ([EMAIL PROTECTED]) wrote:
> On 14:47 Mon 10 Dec, Evgeniy Polyakov wrote:
> > Algorithms used in distributed storage.
> > Mirror and linear mapping code.
>
> Hi, I've finally taken a look at your DST solution.
> It seems that your current implementation will not work on nonstandard
> devices, for example software raid0.
> Other comments follow:
>
> > +static int dst_mirror_process_node_data(struct dst_node *n,
> > +		struct dst_mirror_node_data *ndata, int op)
> > +
> > +	kunmap(cmp->page);
> MINOR_BUG: You forgot to unmap the page on the error path, so IMHO it
> is better to move kunmap to the err_out_free_cmp label.

Yep, I will fix this.

> > +	priv = kzalloc(sizeof(struct dst_mirror_priv), GFP_KERNEL);
> > +	if (!priv)
> > +		return -ENOMEM;
> > +
> > +	priv->chunk_num = st->disk_size;
> > +
> > +	priv->chunk = vmalloc(DIV_ROUND_UP(priv->chunk_num, BITS_PER_LONG) * sizeof(long));
> Ohhh. My. I want to add my 500G hdd. Do you really want to say that I
> have to store a 128Mb in-memory object for this?

Right now yes. There was code which used a single bit for bigger data
units, but I dropped it because of resync troubles (i.e. when a single
sector has been updated, the whole block has to be resynced).
I can not say which case is better though.

> > +	dprintk("%s: start: %llu, size: %llu/%u, bio: %p, req: %p, "
> > +			"node: %p.\n",
> > +			__func__, req->start, req->size, nr_pages, bio,
> > +			req, req->node);
> > +
> > +	err = n->st->queue->make_request_fn(n->st->queue, bio);
> Why direct make_request_fn instead of generic_make_request?

generic_make_request() will queue the bio in this case, so I call
request_fn directly.

> > +	for (i = 0; i < DIV_ROUND_UP(priv->chunk_num, BITS_PER_LONG); ++i) {
> > +		int bit, num, start;
> > +		unsigned long word = priv->chunk[i];
> > +
> > +		if (!word)
> > +			continue;
> > +
> > +		num = 0;
> > +		start = -1;
> > +		while (word && num < BITS_PER_LONG) {
> > +			bit = __ffs(word);
> > +			if (start == -1)
> > +				start = bit;
> > +			num++;
> MINOR_BUG: Seems you have mistyped here. AFAIU @num represents the
> position of the last non-zero bit
> (start + num == last_non_zero_bit_pos)
> 	if (start == -1) {
> 		start = bit;
> 		num = 1;
> 	} else
> 		num += bit;

Yes, you are right of course. Since I shift word by more than a single
bit, @num has to be updated accordingly.

> > +			word >>= (bit+1);

Dmitry, thanks a lot for the comments. I will fix the issues you
pointed out in the next release, although the bitmap question will stay
open for a while.

-- 
	Evgeniy Polyakov
[0/4] DST: Distributed storage.
Distributed storage.

I'm pleased to announce the 11'th release of the distributed
storage subsystem (DST). This is a maintenance release and includes
bug fixes and simple feature extensions only.

DST allows to form a storage on top of local and remote nodes and
combine them into linear or mirroring setup, which in turn can be
exported to remote nodes.

Short changelog:
 * wake up state when mirror detected error to speed up reconnect
 * if connecting in csum mode to no-csum server, do not enable csums
 * do not clean queue until all users are removed
 * allow to increase size of the storage in linear add callback (with
	this change it is possible to add nodes into linear array in
	real time without stopping storage. Filesystem has to be
	prepared for the case when underlying device has changed its
	size. Real-time addition of mirror nodes is also supported)
 * allow to delete gendisk only after device was started
 * dst debug config option
 * Name: Gamardjoba, genacvale! ('Hi friend' in Georgian)

Great thanks to Matthew Hodgson [EMAIL PROTECTED] for debugging!

Overall list of features of the DST can be found on project's homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
[3/4] DST: Network state machine.
Network state machine.

Includes network async processing state machine and related tasks.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 000..8fa3387
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1513 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+	poll_table		pt;
+	struct kst_state	*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+		int type, int proto, int backlog)
+{
+	int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+
+	err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+	if (st) {
+		struct kst_worker *w = st->node->w;
+		unsigned long flags;
+
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (list_empty(&st->ready_entry))
+			list_add_tail(&st->ready_entry, &w->ready_list);
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		wake_up(&w->wait);
+	}
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct kst_state *st = container_of(wait, struct kst_state, wait);
+	kst_wake(st);
+	return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+		poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+	}
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+	struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+	return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+	return req;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+	if (unlikely(req->flags & DST_REQ_CHECK_QUEUE)) {
+		struct dst_request *r;
+
+		list_for_each_entry(r, &st->request_list, request_list_entry) {
+			if (bio_rw(r->bio) != bio_rw(req->bio))
+				continue;
+
+			if (r->start >= req->start + req->size)
+				continue
[4/4] DST: Algorithms used in distributed storage.
Algorithms used in distributed storage.

Mirror and linear mapping code.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 000..9dc0976
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,105 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+
+	dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+			__func__, st->disk_size, n->size);
+
+	mutex_lock(&st->tree_lock);
+	n->start = st->disk_size;
+	st->disk_size += n->size;
+	set_capacity(st->disk, st->disk_size);
+	mutex_unlock(&st->tree_lock);
+
+	return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+	int err;
+
+	if (req->node->bdev) {
+		generic_make_request(req->bio);
+		return 0;
+	}
+
+	err = kst_check_permissions(req->state, req->bio);
+	if (err)
+		return err;
+
+	return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+	if (err)
+		set_bit(DST_NODE_FROZEN, &st->node->flags);
+	else
+		clear_bit(DST_NODE_FROZEN, &st->node->flags);
+	return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+	.remap		= dst_linear_remap,
+	.add_node	= dst_linear_add_node,
+	.del_node	= dst_linear_del_node,
+	.error		= dst_linear_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+	alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+	if (!alg_linear)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+	dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov [EMAIL PROTECTED]");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 000..3c457ff
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1128 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+
+struct dst_mirror_node_data
+{
+	u64		age;
+};
+
+struct dst_mirror_priv
+{
+	unsigned int		chunk_num;
+
+	u64			last_start;
+
+	spinlock_t		backlog_lock;
+	struct list_head	backlog_list;
+
+	struct dst_mirror_node_data	old_data, new_data;
+
+	unsigned long		*chunk;
+};
+
+static struct dst_alg *alg_mirror;
+static struct bio_set *dst_mirror_bio_set;
+
+static int dst_mirror_resync(struct dst_node *n, int ndp);
+
+static void dst_mirror_mark_sync(struct dst_node *n)
+{
+	if (test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+		struct dst_mirror_priv *priv = n->priv;
+
+		clear_bit(DST_NODE_NOTSYNC, &n->flags);
+		dprintk("%s: node: %p, %llu:%llu synchronization "
+				"has been completed.\n",
+				__func__, n, n->start, n->size);
+		priv->old_data.age = 0
[2/4] DST: Core distributed storage files.
Core distributed storage files.

Include userspace interfaces, initialization, block layer bindings
and other core functionality.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
 	This driver provides Support for ATA over Ethernet block
 	devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
 
+obj-$(CONFIG_DST)		+= dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 000..e91f8ed
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,28 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	select CONNECTOR
+	select LIBCRC32C
+	---help---
+	This driver allows to create a distributed storage.
+
+config DST_DEBUG
+	bool "DST debug"
+	depends on DST
+	---help---
+	This option will turn HEAVY debugging of the DST.
+	Turn it on ONLY if you have to debug some really obscure problem.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create linear mapping of the nodes
+	in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create a mirror of the nodes in the
+	distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 000..17a5e61
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1631 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/connector.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL };
+
+struct kmem_cache *dst_request_cache;
+
+static char dst_name[] = "Gamardjoba, genacvale!";
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-800/clean
+ * /sys/devices/storage/n-800/dirty
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/n-0/clean
+ * /sys/devices/storage/n-0/dirty
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+	return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+	.name		= "dst",
+	.match		= dst_dev_match,
+};
+
+static struct device dst_dev = {
+	.bus		= &dst_dev_bus_type,
+	.release	= dst_dev_release
+};
+
+static void dst_node_release(struct device *dev)
+{
+}
+
+static
Re: [1/4] DST: Distributed storage documentation.
On Mon, Dec 10, 2007 at 01:51:43PM +0100, Kay Sievers ([EMAIL PROTECTED]) wrote:
> On Dec 10, 2007 12:47 PM, Evgeniy Polyakov [EMAIL PROTECTED] wrote:
>> diff --git a/Documentation/dst/sysfs.txt b/Documentation/dst/sysfs.txt
>> new file mode 100644
>> index 000..79d79dc
>> --- /dev/null
>> +++ b/Documentation/dst/sysfs.txt
>> @@ -0,0 +1,30 @@
>> +This file describes sysfs files created for each storage.
>> +
>> +1. Per-storage files.
>> +Each storage has its own dir /sysfs/devices/$storage_name,
>
> It's always /sys/devices/.

I meant that for each new device, it will be placed into
/sys/devices/its_name, but it can also be accessed via
/sys/bus/dst/devices/

>> +which contains following files:
>> +
>> +alg - contains name of the algorithm used to created given storage
>> +name - name of the storage
>> +nodes - map of the storage (list of nodes and their sizes and starts)
>> +remove_all_nodes - writable file which allows to remove all nodes from given
>> +	storage
>> +n-$start-$cookie - per node directory, where
>> +	$start - start of the given node in sectors,
>> +	$cookie - unique node's id used by DST
>> +
>> +2. Per-node files.
>> +Node's files are located in /sysfs/devices/$storage_name/n-$start-$cookie
>> +directory, described above.
>
> To which class or bus do the devices you create belong? Care to show a
> tree or ls -la of the device?

It is 'dst' bus.

uganda:~/codes# ls -la /sys/devices/staorge/
total 0
drwxr-xr-x 4 root root    0 2007-12-10 11:46 .
drwxr-xr-x 9 root root    0 2007-12-10 11:46 ..
-r--r--r-- 1 root root 4096 2007-12-10 11:46 alg
lrwxrwxrwx 1 root root    0 2007-12-10 11:46 bus -> ../../bus/dst
drwxr-xr-x 3 root root    0 2007-12-10 11:46 n-0-81003e24117
-r--r--r-- 1 root root 4096 2007-12-10 11:46 name
-r--r--r-- 1 root root 4096 2007-12-10 11:46 nodes
drwxr-xr-x 2 root root    0 2007-12-10 11:46 power
-rw-r--r-- 1 root root 4096 2007-12-10 11:46 remove_all_nodes
lrwxrwxrwx 1 root root    0 2007-12-10 11:46 subsystem -> ../../bus/dst
-rw-r--r-- 1 root root 4096 2007-12-10 11:46 uevent

uganda:~/codes# ls -l /sys/bus/dst/
total 0
drwxr-xr-x 2 root root    0 2007-12-10 09:52 devices
drwxr-xr-x 2 root root    0 2007-12-10 09:52 drivers
-rw-r--r-- 1 root root 4096 2007-12-10 11:46 drivers_autoprobe
--w------- 1 root root 4096 2007-12-10 11:46 drivers_probe

> Kay

-- 
	Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/4] DST: Distributed storage documentation.
On Mon, Dec 10, 2007 at 03:31:48PM +0100, Kay Sievers ([EMAIL PROTECTED]) wrote:
>> I meant that for each new device, it will be placed into
>> /sys/devices/its_name, but it can also be accessed via
>> /sys/bus/dst/devices/
>
> Still, it looks like a path. :) Please don't reference any device
> directly with a /sys/devices/ path. You have to use the subsystem links
> to the devices in /sys/bus/dst/devices/. Devices are free to move around
> in /sys/devices, even during runtime. Yours don't do, but anyway, please
> remove all mentioning of direct access to /sys/devices/.

Ok, I will update the documentation to reference /sys/bus/dst/devices
instead of /sys/devices.

> Btw, where is the top-level /sys/devices/storage/ coming from? I don't
> see that in the code. We don't accept any new virtual parents here. Your
> devices will automatically appear in /sys/devices/virtual/dst/, and not
> below your own parent. But that path does not matter anyway, because you
> should only access them from the /sys/bus/dst/devices/ directory.
> And in general please don't claim generic names like storage in any
> namespace for a very specific subsystem like this.

It is not a parent - it is an example for a device called 'storage'. If it
were called 'testing', the path would be /sys/devices/testing or, more
correctly, /sys/bus/dst/devices/testing :)

>> It is 'dst' bus.
>>
>> uganda:~/codes# ls -la /sys/devices/staorge/
>> total 0
>> drwxr-xr-x 4 root root    0 2007-12-10 11:46 .
>> drwxr-xr-x 9 root root    0 2007-12-10 11:46 ..
>> -r--r--r-- 1 root root 4096 2007-12-10 11:46 alg
>> lrwxrwxrwx 1 root root    0 2007-12-10 11:46 bus -> ../../bus/dst
>> drwxr-xr-x 3 root root    0 2007-12-10 11:46 n-0-81003e24117
>> -r--r--r-- 1 root root 4096 2007-12-10 11:46 name
>> -r--r--r-- 1 root root 4096 2007-12-10 11:46 nodes
>> drwxr-xr-x 2 root root    0 2007-12-10 11:46 power
>> -rw-r--r-- 1 root root 4096 2007-12-10 11:46 remove_all_nodes
>> lrwxrwxrwx 1 root root    0 2007-12-10 11:46 subsystem -> ../../bus/dst
>> -rw-r--r-- 1 root root 4096 2007-12-10 11:46 uevent
>
> Ok, how does:
>   ls -l /sys/devices/storage/n-0-81003e24117
> look?

uganda:~/codes# ls -l /sys/devices/storage/n-0-81003ebc220/
total 0
drwxr-xr-x 2 root root    0 2007-12-10 13:23 power
-r--r--r-- 1 root root 4096 2007-12-10 13:30 size
-r--r--r-- 1 root root 4096 2007-12-10 13:30 start
-r--r--r-- 1 root root 4096 2007-12-10 13:30 type
-rw-r--r-- 1 root root 4096 2007-12-10 13:30 uevent

>> uganda:~/codes# ls -l /sys/bus/dst/
>> total 0
>> drwxr-xr-x 2 root root    0 2007-12-10 09:52 devices
>> drwxr-xr-x 2 root root    0 2007-12-10 09:52 drivers
>> -rw-r--r-- 1 root root 4096 2007-12-10 11:46 drivers_autoprobe
>> --w------- 1 root root 4096 2007-12-10 11:46 drivers_probe
>
> How does:
>   ls -l /sys/bus/dst/devices
> look?

uganda:~/codes# ls -la /sys/bus/dst/devices/
total 0
drwxr-xr-x 2 root root 0 2007-12-10 13:30 .
drwxr-xr-x 4 root root 0 2007-12-10 13:22 ..
lrwxrwxrwx 1 root root 0 2007-12-10 13:30 storage -> ../../../devices/storage

Here 'storage' is just the name of a device called 'storage'; it can be
anything else.

> Further questions:
> Why do you do your own refcounting instead of using kref?

That's because I always used atomic operations as reference counters
and did not try krefs :)
They are the same actually (modulo tricky arches where smp_mb_* are
required), so I can replace them in the next release.

> Why don't you use groups for the attributes?

For 3-4 attributes it is faster to register them in a loop than to type
another structure :)

> Why don't you use default attributes for the device, where you get all
> error handling done by the core.

What are 'default attributes', and for which devices?
All my sysfs files are so trivial that they do not need anything special,
and I do not see which error handling you mean.

-- 
	Evgeniy Polyakov
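
Kay's advice about paths can be illustrated with a small userspace helper. This is only a sketch: dst_attr_path() and dst_attr_read() are hypothetical names, not part of DST, and the attribute names are assumed to be the ones shown in the listings above (alg, name, nodes). The point is that the path is built through the /sys/bus/dst/devices/ link and never hard-codes /sys/devices/.

```c
#include <stdio.h>
#include <string.h>

/* Entry point recommended in the thread; device links live below it. */
#define DST_BUS_DEVICES "/sys/bus/dst/devices"

/*
 * Hypothetical helper: build "/sys/bus/dst/devices/<dev>/<attr>" in buf.
 * Returns 0 on success, -1 if a name is empty, contains '/', or the
 * result does not fit into buf.
 */
static inline int dst_attr_path(char *buf, size_t len,
				const char *dev, const char *attr)
{
	int n;

	if (!dev[0] || !attr[0] || strchr(dev, '/') || strchr(attr, '/'))
		return -1;

	n = snprintf(buf, len, "%s/%s/%s", DST_BUS_DEVICES, dev, attr);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}

/*
 * Hypothetical helper: read the first line of an attribute into out,
 * with the trailing newline stripped.
 * Returns 0 on success, -1 on any failure (including a missing file).
 */
static inline int dst_attr_read(const char *dev, const char *attr,
				char *out, size_t len)
{
	char path[256];
	FILE *f;

	if (len == 0 || dst_attr_path(path, sizeof(path), dev, attr))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(out, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	out[strcspn(out, "\n")] = '\0';
	return 0;
}
```

With a device called 'storage', dst_attr_read("storage", "alg", buf, sizeof(buf)) would open /sys/bus/dst/devices/storage/alg, which keeps working even if the device directory moves inside /sys/devices.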
Re: [1/4] DST: Distributed storage documentation.
On Mon, Dec 10, 2007 at 05:50:55PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
>> Further questions:
>> Why do you do your own refcounting instead of using kref?
>
> That's because I always used atomic operations as reference counters
> and did not try krefs :)
> They are the same actually (modulo tricky arches where smp_mb_* are
> required), so I can replace them in the next release.

Actually not - I have to set the reference counter to something other than
1 or +/- 1, and thus would have to call kref_get() in a loop, which is a
very ugly step. Is there kref_set() or something like that? At least not
in 2.6.22, which I'm using for now.

Sigh, I've converted most of the DST already...

-- 
	Evgeniy Polyakov
Re: [1/4] DST: Distributed storage documentation.
On Mon, Dec 10, 2007 at 08:02:28PM +0100, Kay Sievers ([EMAIL PROTECTED]) wrote:
>> uganda:~/codes# ls -l /sys/devices/storage/n-0-81003ebc220/
>> total 0
>> drwxr-xr-x 2 root root    0 2007-12-10 13:23 power
>> -r--r--r-- 1 root root 4096 2007-12-10 13:30 size
>> -r--r--r-- 1 root root 4096 2007-12-10 13:30 start
>> -r--r--r-- 1 root root 4096 2007-12-10 13:30 type
>> -rw-r--r-- 1 root root 4096 2007-12-10 13:30 uevent
>
> This is a struct device instance without a subsystem (bus/class), right?
> It will not send an uevent to userspace. Is that intended? Why don't you
> add them all to the dst bus?

I created the dst bus for storage devices only. Nodes are very different
objects, and they do not actually need any events from above, but I need
to put some attributes somewhere, so it is an 'empty' device.

>> Actually not - I have to set reference counter to something other than 1
>> or +/- 1, and thus will have to call kref_get() in a loop, which is a
>> very ugly step. Is there kref_set() or something like that? At least not
>> in 2.6.22 what I'm using for now.
>
> Yeah, a loop would look pretty ugly. How about just adding kref_set(),
> if you need it.

Well, then distributed storage will not be able to build as a standalone
module, and kref_set() itself will not be accepted as a single patch,
since there are no in-kernel users :)
It is easily doable though.

>>> Why don't you use groups for the attributes?
>>
>> For 3-4 attributes it is faster to register them in a loop than
>> typing another structure :)
>
> Yeah, but if you would need to recover from an error when the creation
> of a file fails, a group would do the proper rollback.

I do not care about such errors. If creation fails for a file which
exports the type of the node (i.e. the string L or R) or some other very
meaningful info, then the system has more serious things to worry about,
so dst does not do anything special - it ignores such errors :)
On the exit path it will be checked and removed correctly.

If there are additional sysfs files later, I think a group is a good way
to implement them.

>>> Why don't you use default attributes for the device, where you get all
>>> error handling done by the core.
>>
>> What is 'default attributes' and for what devices?
>> All my sysfs files are so much trivial, so they do not need anything
>> special and I do not see what is error handling you mentioned.
>
> If all devices of a subsystem (bus/class) are of the same type, you can
> set a default array of attributes in the struct bus/class to be created
> at every device. If you have multiple types of devices in the same
> subsystem (bus/class) you can assign the device_type, which has the
> default attribute group. That way the core will create the files before
> the event is sent out to userspace, and the files can be accessed from
> the event itself. Not sure if that is needed for dst.

Ok, I see.

DST right now has 3 types of files:
 * storage files, which are common to every storage device;
 * node files, which are the same for every node;
 * per-algorithm private files, which can differ (actually only the
   mirroring algorithm exports something to userspace).

I think it is possible to use default attributes for storage devices, but
node devices do not have a bus/class, so they will be left untouched.

> Thanks,
> Kay

-- 
	Evgeniy Polyakov
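
The trade-off argued above (a creation loop that ignores errors versus a group that rolls back) can be modelled in plain userspace C. The sketch below is not the sysfs API: create_fn, destroy_fn, create_all_or_nothing() and the small in-memory model are hypothetical names used only to show the all-or-nothing pattern that Kay attributes to attribute groups.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callbacks standing in for "create one file" / "remove one file". */
typedef int (*create_fn)(const char *name);
typedef void (*destroy_fn)(const char *name);

/*
 * Create every entry in names[] or none of them: on the first failure the
 * entries created so far are removed in reverse order and the error is
 * returned to the caller.
 */
static inline int create_all_or_nothing(const char *const *names, size_t count,
					create_fn create, destroy_fn destroy)
{
	size_t i;
	int err;

	for (i = 0; i < count; i++) {
		err = create(names[i]);
		if (err) {
			/* Roll back names[i-1] .. names[0] */
			while (i-- > 0)
				destroy(names[i]);
			return err;
		}
	}
	return 0;
}

/* Tiny in-memory model used to exercise the helper. */
static int live_files;		/* number of "files" currently present */
static const char *fail_on;	/* name whose creation fails, or NULL */

static inline int model_create(const char *name)
{
	if (fail_on && strcmp(name, fail_on) == 0)
		return -1;
	live_files++;
	return 0;
}

static inline void model_destroy(const char *name)
{
	(void)name;
	live_files--;
}
```

A plain loop that ignores the return value of each creation would leave the earlier files in place after a failure; whether that matters for three trivial node attributes is exactly the judgment call made in the message above.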
Re: [1/4] DST: Distributed storage documentation.
On Mon, Dec 10, 2007 at 08:44:55PM +0100, Kay Sievers ([EMAIL PROTECTED]) wrote:
>>>> Actually not - I have to set reference counter to something other than 1
>>>> or +/- 1, and thus will have to call kref_get() in a loop, which is a
>>>> very ugly step. Is there kref_set() or something like that? At least not
>>>> in 2.6.22 what I'm using for now.
>>>
>>> Yeah, a loop would look pretty ugly. How about just adding kref_set(),
>>> if you need it.
>>
>> Well, then it distributed storage will not be able to build as
>> standalone module, and kref_set() itself will not be accepted as a
>> single patch, since there are no in-kernel users :)
>> It is easily doable though.
>
> Most rules have exceptions. :) Send a patch, so we can see how it looks
> like.

It looks really non-trivial :)

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/include/linux/kref.h b/include/linux/kref.h
index 6fee353..5d18563 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -24,6 +24,7 @@ struct kref {
 	atomic_t refcount;
 };
 
+void kref_set(struct kref *kref, int num);
 void kref_init(struct kref *kref);
 void kref_get(struct kref *kref);
 int kref_put(struct kref *kref, void (*release) (struct kref *kref));
diff --git a/lib/kref.c b/lib/kref.c
index a6dc3ec..40aa9f9 100644
--- a/lib/kref.c
+++ b/lib/kref.c
@@ -15,13 +15,23 @@
 #include <linux/module.h>
 
 /**
+ * kref_set - initialize object and set refcount to requested number.
+ * @kref: object in question.
+ * @num: initial reference counter
+ */
+void kref_set(struct kref *kref, int num)
+{
+	atomic_set(&kref->refcount, num);
+	smp_mb();
+}
+
+/**
  * kref_init - initialize object.
  * @kref: object in question.
  */
 void kref_init(struct kref *kref)
 {
-	atomic_set(&kref->refcount,1);
-	smp_mb();
+	kref_set(kref, 1);
 }
 
 /**

-- 
	Evgeniy Polyakov
Re: [1/4] DST: Distributed storage documentation.
On Mon, Dec 10, 2007 at 08:56:49PM +0100, Kay Sievers ([EMAIL PROTECTED]) wrote:
> On Mon, 2007-12-10 at 22:51 +0300, Evgeniy Polyakov wrote:
>> On Mon, Dec 10, 2007 at 08:44:55PM +0100, Kay Sievers ([EMAIL PROTECTED]) wrote:
>>>>>> Actually not - I have to set reference counter to something other than 1
>>>>>> or +/- 1, and thus will have to call kref_get() in a loop, which is a
>>>>>> very ugly step. Is there kref_set() or something like that? At least not
>>>>>> in 2.6.22 what I'm using for now.
>>>>>
>>>>> Yeah, a loop would look pretty ugly. How about just adding kref_set(),
>>>>> if you need it.
>>>>
>>>> Well, then it distributed storage will not be able to build as
>>>> standalone module, and kref_set() itself will not be accepted as a
>>>> single patch, since there are no in-kernel users :)
>>>> It is easily doable though.
>>>
>>> Most rules have exceptions. :) Send a patch, so we can see how it looks
>>> like.
>>
>> It looks really non-trivial :)
>
> Yeah, it does. :) We miss an EXPORT_SYMBOL(), right?

Yep :)

diff --git a/include/linux/kref.h b/include/linux/kref.h
index 6fee353..5d18563 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -24,6 +24,7 @@ struct kref {
 	atomic_t refcount;
 };
 
+void kref_set(struct kref *kref, int num);
 void kref_init(struct kref *kref);
 void kref_get(struct kref *kref);
 int kref_put(struct kref *kref, void (*release) (struct kref *kref));
diff --git a/lib/kref.c b/lib/kref.c
index a6dc3ec..9ecd6e8 100644
--- a/lib/kref.c
+++ b/lib/kref.c
@@ -15,13 +15,23 @@
 #include <linux/module.h>
 
 /**
+ * kref_set - initialize object and set refcount to requested number.
+ * @kref: object in question.
+ * @num: initial reference counter
+ */
+void kref_set(struct kref *kref, int num)
+{
+	atomic_set(&kref->refcount, num);
+	smp_mb();
+}
+
+/**
  * kref_init - initialize object.
  * @kref: object in question.
  */
 void kref_init(struct kref *kref)
 {
-	atomic_set(&kref->refcount,1);
-	smp_mb();
+	kref_set(kref, 1);
 }
 
 /**
@@ -61,6 +71,7 @@ int kref_put(struct kref *kref, void (*release)(struct kref *kref))
 	return 0;
 }
 
+EXPORT_SYMBOL(kref_set);
 EXPORT_SYMBOL(kref_init);
 EXPORT_SYMBOL(kref_get);
 EXPORT_SYMBOL(kref_put);

-- 
	Evgeniy Polyakov
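
To make the reason for kref_set() concrete, here is a userspace model of the counter, a sketch only. It assumes C11 atomics in place of the kernel's atomic_t, and the ref_* names are hypothetical stand-ins rather than the kernel functions. The sequentially consistent atomic_store() is used where the patch has atomic_set() followed by smp_mb().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for struct kref (assumption: C11 atomics). */
struct ref {
	atomic_int refcount;
};

/* Counterpart of the proposed kref_set(): start with num references. */
static inline void ref_set(struct ref *r, int num)
{
	atomic_store(&r->refcount, num);
}

/* Counterpart of kref_init(): a single initial reference. */
static inline void ref_init(struct ref *r)
{
	ref_set(r, 1);
}

/* Counterpart of kref_get(): take one more reference. */
static inline void ref_get(struct ref *r)
{
	atomic_fetch_add(&r->refcount, 1);
}

/*
 * Counterpart of kref_put(): drop one reference.
 * Returns true if this was the last one and release() has been called.
 */
static inline bool ref_put(struct ref *r, void (*release)(struct ref *r))
{
	if (atomic_fetch_sub(&r->refcount, 1) == 1) {
		release(r);
		return true;
	}
	return false;
}

/* Demo release callback: only counts how many objects were released. */
static int released;

static inline void count_release(struct ref *r)
{
	(void)r;
	released++;
}
```

An object that is handed to three users at creation time can call ref_set(&obj, 3) once. Without it, the same effect needs ref_init() followed by two ref_get() calls, which is the loop the thread calls ugly.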
Re: TCP event tracking via netlink...
On Wed, Dec 05, 2007 at 09:03:43PM -0800, David Miller ([EMAIL PROTECTED]) wrote:
> I think this work is very different.
>
> When I say state I mean something more significant than
> CLOSE, ESTABLISHED, etc. which is what Samir's patches are
> tracking.
>
> I'm talking about all of the sequence numbers, SACK information,
> congestion control knobs, etc. whose values are nearly impossible to
> track on a packet to packet basis in order to diagnose problems.

I pointed to that work as a possible basis for collecting more info if
needed, including sequence numbers, window sizes and so on. It just
requires a useful structure layout, so that one does not have to recreate
the same bits again and so that it can be called from any place inside
the stack.

-- 
	Evgeniy Polyakov
Re: TCP event tracking via netlink...
Hi.

On Wed, Dec 05, 2007 at 09:11:01AM -0500, John Heffner ([EMAIL PROTECTED]) wrote:
>> Maybe if we want to get really fancy we can have some more-expensive
>> debug mode where detailed specific events get generated via some
>> macros we can scatter all over the place. This won't be useful
>> for general user problem analysis, but it will be excellent for
>> developers.
>>
>> Let me know if you think this is useful enough and I'll work
>> on an implementation we can start playing with.
>
> FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
> http://caia.swin.edu.au/urp/newtcp/tools.html
> http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

And even more similar to this patch from Samir Bellabes of Mandriva:
http://lwn.net/Articles/202255/

-- 
	Evgeniy Polyakov
[4/4] DST: Algorithms used in distributed storage.
Algorithms used in distributed storage.
Mirror and linear mapping code.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 000..cb77b57
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,104 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+
+	dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+			__func__, st->disk_size, n->size);
+
+	mutex_lock(&st->tree_lock);
+	n->start = st->disk_size;
+	st->disk_size += n->size;
+	mutex_unlock(&st->tree_lock);
+
+	return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+	int err;
+
+	if (req->node->bdev) {
+		generic_make_request(req->bio);
+		return 0;
+	}
+
+	err = kst_check_permissions(req->state, req->bio);
+	if (err)
+		return err;
+
+	return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+	if (err)
+		set_bit(DST_NODE_FROZEN, &st->node->flags);
+	else
+		clear_bit(DST_NODE_FROZEN, &st->node->flags);
+	return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+	.remap		= dst_linear_remap,
+	.add_node	= dst_linear_add_node,
+	.del_node	= dst_linear_del_node,
+	.error		= dst_linear_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+	alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+	if (!alg_linear)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+	dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov [EMAIL PROTECTED]");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 000..11a6169
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1122 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+
+struct dst_mirror_node_data
+{
+	u64			age;
+};
+
+struct dst_mirror_priv
+{
+	unsigned int		chunk_num;
+
+	u64			last_start;
+
+	spinlock_t		backlog_lock;
+	struct list_head	backlog_list;
+
+	struct dst_mirror_node_data	old_data, new_data;
+
+	unsigned long		*chunk;
+};
+
+static struct dst_alg *alg_mirror;
+static struct bio_set *dst_mirror_bio_set;
+
+static int dst_mirror_resync(struct dst_node *n, int ndp);
+
+static void dst_mirror_mark_sync(struct dst_node *n)
+{
+	if (test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+		struct dst_mirror_priv *priv = n->priv;
+
+		clear_bit(DST_NODE_NOTSYNC, &n->flags);
+		dprintk("%s: node: %p, %llu:%llu synchronization "
+				"has been completed.\n",
+			__func__, n, n->start, n->size);
+		priv->old_data.age = 0;
+	}
+}
+
+static void dst_mirror_mark_notsync(struct
[0/4] DST: Distributed storage.
Distributed storage.

I'm pleased to announce the 10th release of the distributed
storage subsystem (DST). This is a maintenance release and includes
bug fixes and simple feature extensions only.

DST allows forming a storage on top of local and remote nodes and
combining them into a linear or mirroring setup, which in turn can be
exported to remote nodes.

Short changelog:
 * fixed bug with XFS metadata update (it can provide slab pages
	to the DST, so it is not allowed to transfer them using
	->sendpage())
 * fixed async error completion path
 * extended netlink communication channel to report errors back
	to userspace
 * DST name is now The 10'th dynasty of smuggled slothes
 * number of fixes for userspace DST target

Great thanks to Matthew Hodgson [EMAIL PROTECTED] for debugging and
fixes for the userspace DST target and preliminary netlink extension
patches.

The overall list of DST features can be found on the project's homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
[1/4] DST: Distributed storage documentation.
Distributed storage documentation. Algorithms used in the system, userspace interfaces (sysfs dirs and files), design and implementation details are described here. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt new file mode 100644 index 000..1437a6a --- /dev/null +++ b/Documentation/dst/algorithms.txt @@ -0,0 +1,115 @@ +Each storage by itself is just a set of contiguous logical blocks, with +allowed number of operations. Nodes, each of which has own start and size, +are placed into storage by appropriate algorithm, which remaps +logical sector number into real node's sector. One can create +own algorithms, since DST has pluggable interface for that. +Currently mirrored and linear algorithms are supported. + +Let's briefly describe how they work. + +Linear algorithm. +Simple approach of concatenating storages into single device with +increased size is used in this algorithm. Essentially new device +has size equal to sum of sizes of underlying nodes and nodes are +placed one after another. + + /- Node 1 ---\ /-- Node 3 \ +start end start end + |==||==| + |start end | + | \--- Node 2 -/ | + | | +start end + \-- DST storage --/ + + /\ + || + || + + IO operations + + Figure 1. + 3 nodes combined into single storage using linear algorithm. + +Mirror algorithm. +In this algorithms nodes are placed under each other, so when +operation comes to the first one, it can be mirrored to all +underlying nodes. In case of reading, actual data is obtained from +the nearest node - algoritm keeps track of previous operation +and knows where it was stopped, so that subsequent seek to the +start of the new request will take the shortest time. +Writing is always mirrored to all underlying nodes. + + IO operations + || + || + \/ + +| DST storage ---| +| prev position | +|---| Node 1 | +| prev pos | +| Node 2 -|--| +|prev pos| +|---| Node 3 | + + Figure 2. 
+ 3 nodes combined into single storage using mirror algorithm. + +Each algorithm must implement number of callbacks, +which must be registered during initialization time. + +struct dst_alg_ops +{ + int (*add_node)(struct dst_node *n); + void(*del_node)(struct dst_node *n); + int (*remap)(struct dst_request *req); + int (*error)(struct kst_state *state, int err); + struct module *owner; +}; + [EMAIL PROTECTED] +This callback is invoked when new node is being added into the storage, +but before node is actually added into the storage, so that it could +be accessed from it. When it is called, all appropriate initialization +of the underlying device is already completed (system has been connected +to remote node or got a reference to the local block device). At this +stage algorithm can add node into private map. +It must return zero on success or negative value otherwise. + [EMAIL PROTECTED] +This callback is invoked when node is being deleted from the storage, +i.e. when its reference counter hits zero. It is called before +any cleaning is performed. +It must return zero on success or negative value otherwise. + [EMAIL PROTECTED] +This callback is invoked each time new bio hits the storage. +Request structure contains BIO itself, pointer to the node, which originally +stores the whole region under given IO request, and various parameters +used by storage core to process this block request. +It must return zero on success or negative value otherwise. It is upto +this method to call all cleaning if remapping failed, for example it must +call kst_bio_endio() for given callback in case of error, which in turn +will call bio_endio(). Note, that dst_request structure provided in this +callback is allocated on stack, so if there is a need to use it outside +of the given function, it must be cloned (it will happen automatically +in state's push callback, but that copy will not be shared by any other +user). 
+ [EMAIL PROTECTED] +This callback is invoked for each error, which happend when processed
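
The two placement schemes described in algorithms.txt can be sketched in userspace C. This is an illustration under stated assumptions, not the DST implementation: struct node, linear_add_node(), linear_remap() and mirror_pick_read_node() are hypothetical names, and the "nearest node" rule for mirror reads is modelled simply as the smallest distance between the requested sector and the sector where the previous operation on that node stopped.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified node: only the fields the sketch needs. */
struct node {
	uint64_t start;		/* first logical sector covered by the node */
	uint64_t size;		/* node size in sectors */
	uint64_t prev_pos;	/* sector where the previous operation stopped */
};

/*
 * Linear algorithm, adding a node: the new node starts where the storage
 * currently ends and the storage grows by the node size (the same
 * bookkeeping as in dst_linear_add_node()).
 */
static inline void linear_add_node(struct node *n, uint64_t *disk_size)
{
	n->start = *disk_size;
	*disk_size += n->size;
}

/*
 * Linear algorithm, remapping: find the node which holds the logical
 * sector and translate it into a sector local to that node.
 * Returns the node index, or -1 if the sector is outside the storage.
 */
static inline int linear_remap(const struct node *nodes, size_t count,
			       uint64_t sector, uint64_t *local)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (sector >= nodes[i].start &&
		    sector - nodes[i].start < nodes[i].size) {
			*local = sector - nodes[i].start;
			return (int)i;
		}
	}
	return -1;
}

/*
 * Mirror algorithm, reading: every node holds the same data, so pick the
 * node whose previous position is closest to the requested sector.
 * Returns the node index, or -1 if there are no nodes.
 */
static inline int mirror_pick_read_node(const struct node *nodes, size_t count,
					uint64_t sector)
{
	size_t i;
	int best = -1;
	uint64_t best_dist = 0;

	for (i = 0; i < count; i++) {
		uint64_t p = nodes[i].prev_pos;
		uint64_t dist = p > sector ? p - sector : sector - p;

		if (best < 0 || dist < best_dist) {
			best_dist = dist;
			best = (int)i;
		}
	}
	return best;
}
```

For three nodes of 800, 800 and 400 sectors added in that order, the linear storage is 2000 sectors long and logical sector 900 maps to sector 100 of the second node. Writes in the mirror case are not modelled, since the text says they always go to all nodes.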
[3/4] DST: Network state machine.
Network state machine. Includes network async processing state machine and related tasks. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c new file mode 100644 index 000..8fa3387 --- /dev/null +++ b/drivers/block/dst/kst.c @@ -0,0 +1,1513 @@ +/* + * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include linux/kernel.h +#include linux/module.h +#include linux/list.h +#include linux/slab.h +#include linux/socket.h +#include linux/kthread.h +#include linux/net.h +#include linux/in.h +#include linux/poll.h +#include linux/bio.h +#include linux/dst.h + +#include net/sock.h + +struct kst_poll_helper +{ + poll_table pt; + struct kst_state*st; +}; + +static LIST_HEAD(kst_worker_list); +static DEFINE_MUTEX(kst_worker_mutex); + +/* + * This function creates bound socket for local export node. 
+ */ +static int kst_sock_create(struct kst_state *st, struct saddr *addr, + int type, int proto, int backlog) +{ + int err; + + err = sock_create(addr-sa_family, type, proto, st-socket); + if (err) + goto err_out_exit; + + err = st-socket-ops-bind(st-socket, (struct sockaddr *)addr, + addr-sa_data_len); + + err = st-socket-ops-listen(st-socket, backlog); + if (err) + goto err_out_release; + + st-socket-sk-sk_allocation = GFP_NOIO; + + return 0; + +err_out_release: + sock_release(st-socket); +err_out_exit: + return err; +} + +static void kst_sock_release(struct kst_state *st) +{ + if (st-socket) { + sock_release(st-socket); + st-socket = NULL; + } +} + +void kst_wake(struct kst_state *st) +{ + if (st) { + struct kst_worker *w = st-node-w; + unsigned long flags; + + spin_lock_irqsave(w-ready_lock, flags); + if (list_empty(st-ready_entry)) + list_add_tail(st-ready_entry, w-ready_list); + spin_unlock_irqrestore(w-ready_lock, flags); + + wake_up(w-wait); + } +} +EXPORT_SYMBOL_GPL(kst_wake); + +/* + * Polling machinery. + */ +static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode, + int sync, void *key) +{ + struct kst_state *st = container_of(wait, struct kst_state, wait); + kst_wake(st); + return 1; +} + +static void kst_queue_func(struct file *file, wait_queue_head_t *whead, +poll_table *pt) +{ + struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)-st; + + st-whead = whead; + init_waitqueue_func_entry(st-wait, kst_state_wake_callback); + add_wait_queue(whead, st-wait); +} + +static void kst_poll_exit(struct kst_state *st) +{ + if (st-whead) { + remove_wait_queue(st-whead, st-wait); + st-whead = NULL; + } +} + +/* + * This function removes request from state tree and ordering list. 
+ */ +void kst_del_req(struct dst_request *req) +{ + list_del_init(req-request_list_entry); +} +EXPORT_SYMBOL_GPL(kst_del_req); + +static struct dst_request *kst_req_first(struct kst_state *st) +{ + struct dst_request *req = NULL; + + if (!list_empty(st-request_list)) + req = list_entry(st-request_list.next, struct dst_request, + request_list_entry); + return req; +} + +/* + * This function dequeues first request from the queue and tree. + */ +static struct dst_request *kst_dequeue_req(struct kst_state *st) +{ + struct dst_request *req; + + mutex_lock(st-request_lock); + req = kst_req_first(st); + if (req) + kst_del_req(req); + mutex_unlock(st-request_lock); + return req; +} + +/* + * This function enqueues request into tree, indexed by start of the request, + * and also puts request into ordered queue. + */ +int kst_enqueue_req(struct kst_state *st, struct dst_request *req) +{ + if (unlikely(req-flags DST_REQ_CHECK_QUEUE)) { + struct dst_request *r; + + list_for_each_entry(r, st-request_list, request_list_entry) { + if (bio_rw(r-bio) != bio_rw(req-bio)) + continue; + + if (r-start = req-start + req-size) + continue
[2/4] DST: Core distributed storage files.
Core distributed storage files.
Include userspace interfaces, initialization, block layer bindings
and other core functionality.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
 	This driver provides Support for ATA over Ethernet block
 	devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@
 obj-$(CONFIG_VIODASD)		+= viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
+obj-$(CONFIG_DST)		+= dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 000..e91f8ed
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,28 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	select CONNECTOR
+	select LIBCRC32C
+	---help---
+	This driver allows to create a distributed storage.
+
+config DST_DEBUG
+	bool "DST debug"
+	depends on DST
+	---help---
+	This option will turn on HEAVY debugging of the DST.
+	Turn it on ONLY if you have to debug some really obscure problem.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create linear mapping of the nodes
+	in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create a mirror of the nodes in the
+	distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile new file mode 100644 index 000..1400e94 --- /dev/null +++ b/drivers/block/dst/Makefile @@ -0,0 +1,6 @@ +obj-$(CONFIG_DST) += dst.o + +dst-y := dcore.o kst.o + +obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o +obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c new file mode 100644 index 000..4fdad29 --- /dev/null +++ b/drivers/block/dst/dcore.c @@ -0,0 +1,1629 @@ +/* + * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include linux/module.h +#include linux/kernel.h +#include linux/init.h +#include linux/blkdev.h +#include linux/bio.h +#include linux/slab.h +#include linux/connector.h +#include linux/socket.h +#include linux/dst.h +#include linux/device.h +#include linux/in.h +#include linux/in6.h +#include linux/buffer_head.h + +#include net/sock.h + +static LIST_HEAD(dst_storage_list); +static LIST_HEAD(dst_alg_list); +static DEFINE_MUTEX(dst_storage_lock); +static DEFINE_MUTEX(dst_alg_lock); +static int dst_major; +static struct kst_worker *kst_main_worker; +static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL }; + +struct kmem_cache *dst_request_cache; + +static char dst_name[] = The 10'th dynasty of smuggled slothes; + +/* + * DST sysfs tree. 
For device called 'storage' which is formed + * on top of two nodes this looks like this: + * + * /sys/devices/storage/ + * /sys/devices/storage/alg : alg_linear + * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025 + * /sys/devices/storage/n-800/size : 800 + * /sys/devices/storage/n-800/start : 800 + * /sys/devices/storage/n-800/clean + * /sys/devices/storage/n-800/dirty + * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025 + * /sys/devices/storage/n-0/size : 800 + * /sys/devices/storage/n-0/start : 0 + * /sys/devices/storage/n-0/clean + * /sys/devices/storage/n-0/dirty + * /sys/devices/storage/remove_all_nodes + * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800] + * /sys/devices/storage/name : storage + */ + +static int dst_dev_match(struct device *dev, struct device_driver *drv) +{ + return 1; +} + +static void dst_dev_release(struct device *dev) +{ +} + +static struct bus_type dst_dev_bus_type = { + .name = dst, + .match = dst_dev_match, +}; + +static struct device dst_dev = { + .bus= dst_dev_bus_type, + .release= dst_dev_release +}; + +static void dst_node_release(struct device *dev
Netchannels. The 22'th century release.
Hi.

This is the 22nd release of netchannels, a peer-to-peer, protocol-agnostic communication channel between hardware and users. It uses a unified cache to store channels and allows buffers for data to be allocated from a userspace-mapped area or from another preallocated set of pages (like the VFS cache). All protocol processing happens in process context.

One possible user of the system is userspace: netchannels allow it to receive and send traffic from the wire without any kernel interference, to implement its own protocols and to offload their processing to the hardware.

This idea was originally proposed and implemented by Van Jacobson. This patchset (with the userspace network stack) is a logical continuation of the idea, moving to full peer-to-peer processing. One of its users is the userspace network stack [2].

Short changelog:
* update cached route in the netchannel when it expires.

Thanks to Salvatore Del Popolo [EMAIL PROTECTED] for testing.

1. Netchannels homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=netchannel

2. Userspace network stack.
http://tservice.net.ru/~s0mbre/old/?section=projectsitem=unetstack Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index 2697e92..3231b22 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -319,3 +319,4 @@ ENTRY(sys_call_table) .long sys_move_pages .long sys_getcpu .long sys_epoll_pwait + .long sys_netchannel_control diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index b4aa875..d35d4d8 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -718,4 +718,5 @@ ia32_sys_call_table: .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu + .quad sys_netchannel_control ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index beeeaf6..33242f8 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -325,10 +325,11 @@ #define __NR_move_pages317 #define __NR_getcpu318 #define __NR_epoll_pwait 319 +#define __NR_netchannel_control320 #ifdef __KERNEL__ -#define NR_syscalls 320 +#define NR_syscalls 321 #include linux/err.h /* diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 777288e..16f1aac 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync_file_range) __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_netchannel_control280 +__SYSCALL(__NR_netchannel_control, sys_netchannel_control) -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_netchannel_control #ifdef __KERNEL__ #include linux/err.h diff --git a/include/linux/connector.h b/include/linux/connector.h index 4c02119..bdf6432 100644 --- a/include/linux/connector.h +++ b/include/linux/connector.h @@ -36,9 +36,11 @@ #define CN_VAL_CIFS 0x1 #define CN_W1_IDX 0x3 /* w1 communication */ #define 
CN_W1_VAL 0x1 +#define CN_NETCHANNELS_IDX 0x04/* Netchannels connection control */ +#define CN_NETCHANNELS_VAL 0x01 -#define CN_NETLINK_USERS 4 +#define CN_NETLINK_USERS 5 /* * Maximum connector's message size. diff --git a/include/linux/netchannel.h b/include/linux/netchannel.h new file mode 100644 index 000..c56afc5 --- /dev/null +++ b/include/linux/netchannel.h @@ -0,0 +1,175 @@ +/* + * netchannel.h + * + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __NETCHANNEL_H +#define __NETCHANNEL_H + +#include linux/types.h + +enum netchannel_commands { + NETCHANNEL_CREATE = 0, +}; + +enum netchannel_type { + NETCHANNEL_EMPTY = 0, + NETCHANNEL_COPY_USER, + NETCHANNEL_NAT, + NETCHANNEL_MAX +}; + +/* + * Destination and source addresses/ports are from receiving point ov view, + * i.e. when packet is being
Re: [0/4] DST: Distributed storage.
Hi Mike.

On Tue, Dec 04, 2007 at 10:25:29AM -0500, Mike Snitzer ([EMAIL PROTECTED]) wrote:
> Thanks for your continued work on DST. I'd like to know if you've
> thought further about how synchronous mirroring would be best
> implemented with DST. You shared your views some time ago via
> comments on your blog:
> http://tservice.net.ru/~s0mbre/blog/devel/dst/2007_11_05.html
>
> At that time you were saying you'd add a sync bit to the request
> structure that is sent to remote nodes. I'd imagine this would also
> require ordering of the block io, no? Is order guaranteed when the
> requests are submitted over the DST protocol? Otherwise how can you
> ensure a valid remote mirror (in the case of network disconnects, etc)?
>
> Guaranteeing consistent data on all members of a mirror is important.
> The main question is: what mechanisms _should_ be used in DST to
> provide this consistency? And do you have a timeframe for when DST
> might support such mechanisms for consistent data?
>
> For the purpose of this discussion please assume that the disk cache
> is either write-through or battery-backed.

In this case the sync bit would only imply waiting until all pending requests have reached the remote nodes. This is not implemented yet. The order of requests for a given node is guaranteed by the DST core; it is possible to perform multiple requests in parallel for/from different nodes.

In the more generic case it should wait until the data has reached the media, i.e. perform flushing. I did not implement that, since no multiple-device system in Linux actually supports barriers (please note that in this discussion the sync bit actually means a barrier in the block layer). The protocol changes are pretty trivial and are absolutely transparent to the DST core: only the remote targets (both userspace and kernelspace) have to be changed to invoke the ->issue_flush_fn() callback for the underlying device when needed, and to not process new requests until the flush has completed. Thus the barrier bit can be attached to data packets, and barriers can also be standalone requests without data. DST will continue to collect data, but will not send it to remote nodes (actually it can send it, but the data will not be processed and will stay in the remote's receiving queue).

This is the main concern about barriers: should the main node continue to process requests if previous ones have not reached the media yet? That is why I have not implemented barriers yet.

> regards,
> Mike

-- 
	Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
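As a rough illustration of the barrier semantics described in this reply, here is a minimal user-space sketch in C. It is not DST code: REQ_BARRIER, struct tgt_request, struct tgt_state and the tgt_* helpers are all hypothetical names, and tgt_flush() merely stands in for whatever a remote target would do when it invokes ->issue_flush_fn() on the underlying device. The sketch assumes a single-threaded target that handles requests strictly in arrival order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag; not taken from the DST wire protocol. */
#define REQ_BARRIER 0x1u

/* One request as seen by the remote target (illustrative only). */
struct tgt_request {
	unsigned int flags;
	size_t data_len;	/* 0 models a barrier-only request without data */
};

/* What the target has accepted versus what is known to be on media. */
struct tgt_state {
	size_t received;	/* bytes accepted, possibly still in a cache */
	size_t on_media;	/* bytes known to have reached the media */
	unsigned int flushes;	/* number of flushes issued so far */
};

/* Stand-in for the flush callback: everything received becomes durable. */
static void tgt_flush(struct tgt_state *t)
{
	t->on_media = t->received;
	t->flushes++;
}

/*
 * Process requests in order. A request carrying the barrier bit is
 * flushed before the loop moves on, so later requests are not
 * processed until the flush has completed.
 */
static void tgt_process(struct tgt_state *t,
			const struct tgt_request *reqs, size_t nr)
{
	size_t i;

	assert(t != NULL && (reqs != NULL || nr == 0));

	for (i = 0; i < nr; i++) {
		t->received += reqs[i].data_len;
		if (reqs[i].flags & REQ_BARRIER)
			tgt_flush(t);
	}
}
```

In this model the barrier bit can ride on a data request or arrive as a request with data_len of zero, matching the two cases mentioned in the reply. Whether the sending node should keep queueing while a flush is outstanding is exactly the open question raised above, and the sketch does not try to answer it.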
Re: [0/4] DST: Distributed storage.
On Tue, Dec 04, 2007 at 04:56:26PM +, Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> > * fixed bug with XFS metadata update (it can provide slab pages to
> >   the DST, so it is not allowed to transfer them using ->sendpage())
>
> xfs hasn't been doing that anymore for quite a while. Block drivers
> don't need hacks for it anymore, especially as it's not reliably
> detectable.

I use 2.6.22 and it is there; maybe it was changed later. Right now it can be detected quite trivially, but that can result in a little more bio startup overhead. I just did not know that it was allowed, and thus did not have a check in the DST.

-- 
	Evgeniy Polyakov
Re: [1/4] dst: Distributed storage documentation.
Hi Matt.

On Sun, Dec 02, 2007 at 10:50:59PM -0600, Matt Mackall ([EMAIL PROTECTED]) wrote:
> > Distributed storage documentation.
> >
> > Algorithms used in the system, userspace interfaces
> > (sysfs dirs and files), design and implementation details
> > are described here.
>
> Can you give us a summary of how this differs from using device
> mapper with NBD?

From a high-level point of view it does not, but it operates quite differently: it processes requests asynchronously, so it does not block; it has a different protocol with smaller overhead; it supports strong checksums; and it has an in-kernel export server, which supports simple security attributes (i.e. permission to connect, to read or to write). It uses less memory (zero additional allocations in the common path for linear mapping, not counting network allocations, and fewer additional allocations in the mirroring case).

DST supports failure recovery in case of a dropped connection (the core will reconnect to the remote node when it is ready), so remote nodes can be turned off and on without special administration steps.

DST has simple autoconfiguration at startup time (checksum support and storage size autonegotiation).

It is possible to turn one of the mirror nodes off and use it as an offline backup: since the DST mirror node stores its data at the end of the storage, it can be mounted locally.

-- 
	Evgeniy Polyakov
Re: [Bugme-new] [Bug 9440] New: Problem in joinning a socket to ipv6 multicast address in specific scenario
On Fri, Nov 30, 2007 at 11:02:19PM +1100, Herbert Xu ([EMAIL PROTECTED]) wrote:
> OK, this looks like a good change. However, we should also change
> NETDEV_UP as well to recreate idev if it isn't there and the MTU
> is big enough.

Ok, added netdev_up too.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 567664e..e8c3475 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2293,6 +2293,9 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 				break;
 			}
 
+			if (!idev && dev->mtu >= IPV6_MIN_MTU)
+				idev = ipv6_add_dev(dev);
+
 			if (idev)
 				idev->if_flags |= IF_READY;
 		} else {
@@ -2357,12 +2360,18 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 		break;
 
 	case NETDEV_CHANGEMTU:
-		if ( idev && dev->mtu >= IPV6_MIN_MTU) {
+		if (idev && dev->mtu >= IPV6_MIN_MTU) {
 			rt6_mtu_change(dev, dev->mtu);
 			idev->cnf.mtu6 = dev->mtu;
 			break;
 		}
 
+		if (!idev && dev->mtu >= IPV6_MIN_MTU) {
+			idev = ipv6_add_dev(dev);
+			if (idev)
+				break;
+		}
+
 		/* MTU falled under IPV6_MIN_MTU. Stop IPv6 on this interface. */
 
 	case NETDEV_DOWN:

-- 
	Evgeniy Polyakov
[3/4] dst: Network state machine.
Network state machine. Includes network async processing state machine and related tasks. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c new file mode 100644 index 000..ba5e5ef --- /dev/null +++ b/drivers/block/dst/kst.c @@ -0,0 +1,1475 @@ +/* + * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include linux/kernel.h +#include linux/module.h +#include linux/list.h +#include linux/slab.h +#include linux/socket.h +#include linux/kthread.h +#include linux/net.h +#include linux/in.h +#include linux/poll.h +#include linux/bio.h +#include linux/dst.h + +#include net/sock.h + +struct kst_poll_helper +{ + poll_table pt; + struct kst_state*st; +}; + +static LIST_HEAD(kst_worker_list); +static DEFINE_MUTEX(kst_worker_mutex); + +/* + * This function creates bound socket for local export node. 
+ */ +static int kst_sock_create(struct kst_state *st, struct saddr *addr, + int type, int proto, int backlog) +{ + int err; + + err = sock_create(addr-sa_family, type, proto, st-socket); + if (err) + goto err_out_exit; + + err = st-socket-ops-bind(st-socket, (struct sockaddr *)addr, + addr-sa_data_len); + + err = st-socket-ops-listen(st-socket, backlog); + if (err) + goto err_out_release; + + st-socket-sk-sk_allocation = GFP_NOIO; + + return 0; + +err_out_release: + sock_release(st-socket); +err_out_exit: + return err; +} + +static void kst_sock_release(struct kst_state *st) +{ + if (st-socket) { + sock_release(st-socket); + st-socket = NULL; + } +} + +void kst_wake(struct kst_state *st) +{ + if (st) { + struct kst_worker *w = st-node-w; + unsigned long flags; + + spin_lock_irqsave(w-ready_lock, flags); + if (list_empty(st-ready_entry)) + list_add_tail(st-ready_entry, w-ready_list); + spin_unlock_irqrestore(w-ready_lock, flags); + + wake_up(w-wait); + } +} +EXPORT_SYMBOL_GPL(kst_wake); + +/* + * Polling machinery. + */ +static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode, + int sync, void *key) +{ + struct kst_state *st = container_of(wait, struct kst_state, wait); + kst_wake(st); + return 1; +} + +static void kst_queue_func(struct file *file, wait_queue_head_t *whead, +poll_table *pt) +{ + struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)-st; + + st-whead = whead; + init_waitqueue_func_entry(st-wait, kst_state_wake_callback); + add_wait_queue(whead, st-wait); +} + +static void kst_poll_exit(struct kst_state *st) +{ + if (st-whead) { + remove_wait_queue(st-whead, st-wait); + st-whead = NULL; + } +} + +/* + * This function removes request from state tree and ordering list. 
+ */ +void kst_del_req(struct dst_request *req) +{ + list_del_init(req-request_list_entry); +} +EXPORT_SYMBOL_GPL(kst_del_req); + +static struct dst_request *kst_req_first(struct kst_state *st) +{ + struct dst_request *req = NULL; + + if (!list_empty(st-request_list)) + req = list_entry(st-request_list.next, struct dst_request, + request_list_entry); + return req; +} + +/* + * This function dequeues first request from the queue and tree. + */ +static struct dst_request *kst_dequeue_req(struct kst_state *st) +{ + struct dst_request *req; + + mutex_lock(st-request_lock); + req = kst_req_first(st); + if (req) + kst_del_req(req); + mutex_unlock(st-request_lock); + return req; +} + +/* + * This function enqueues request into tree, indexed by start of the request, + * and also puts request into ordered queue. + */ +int kst_enqueue_req(struct kst_state *st, struct dst_request *req) +{ + if (unlikely(req-flags DST_REQ_CHECK_QUEUE)) { + struct dst_request *r; + + list_for_each_entry(r, st-request_list, request_list_entry) { + if (bio_rw(r-bio) != bio_rw(req-bio)) + continue; + + if (r-start = req-start + req-size) + continue
[4/4] dst: Algorithms used in distributed storage.
Algorithms used in distributed storage.
Mirror and linear mapping code.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 000..cb77b57
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,104 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+
+	dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+			__func__, st->disk_size, n->size);
+
+	mutex_lock(&st->tree_lock);
+	n->start = st->disk_size;
+	st->disk_size += n->size;
+	mutex_unlock(&st->tree_lock);
+
+	return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+	int err;
+
+	if (req->node->bdev) {
+		generic_make_request(req->bio);
+		return 0;
+	}
+
+	err = kst_check_permissions(req->state, req->bio);
+	if (err)
+		return err;
+
+	return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+	if (err)
+		set_bit(DST_NODE_FROZEN, &st->node->flags);
+	else
+		clear_bit(DST_NODE_FROZEN, &st->node->flags);
+	return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+	.remap		= dst_linear_remap,
+	.add_node	= dst_linear_add_node,
+	.del_node	= dst_linear_del_node,
+	.error		= dst_linear_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+	alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+	if (!alg_linear)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+	dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov [EMAIL PROTECTED]");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 000..55cf59c
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1122 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+
+struct dst_mirror_node_data
+{
+	u64			age;
+};
+
+struct dst_mirror_priv
+{
+	unsigned int		chunk_num;
+
+	u64			last_start;
+
+	spinlock_t		backlog_lock;
+	struct list_head	backlog_list;
+
+	struct dst_mirror_node_data	old_data, new_data;
+
+	unsigned long		*chunk;
+};
+
+static struct dst_alg *alg_mirror;
+static struct bio_set *dst_mirror_bio_set;
+
+static int dst_mirror_resync(struct dst_node *n, int ndp);
+
+static void dst_mirror_mark_sync(struct dst_node *n)
+{
+	if (test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+		struct dst_mirror_priv *priv = n->priv;
+
+		clear_bit(DST_NODE_NOTSYNC, &n->flags);
+		dprintk("%s: node: %p, %llu:%llu synchronization "
+				"has been completed.\n",
+			__func__, n, n->start, n->size);
+		priv->old_data.age = 0;
+	}
+}
+
+static void dst_mirror_mark_notsync(struct
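To make the linear placement in dst_linear_add_node() concrete, here is a small user-space sketch in C. It reuses only the two assignments visible in the patch (the new node starts at the current disk size, and the disk size then grows by the node size). Everything else is an assumption made for illustration: struct lin_node, struct lin_storage, LIN_MAX_NODES, lin_add_node() and lin_remap() are hypothetical names, there is no locking, and the lookup is a plain linear scan rather than whatever structure the DST core actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIN_MAX_NODES 8		/* arbitrary limit for this sketch */

/* A node's placement inside the storage, in sectors. */
struct lin_node {
	uint64_t start;
	uint64_t size;
};

/* Hypothetical stand-in for the storage; zero-initialise before use. */
struct lin_storage {
	uint64_t disk_size;
	size_t nr_nodes;
	struct lin_node nodes[LIN_MAX_NODES];
};

/*
 * Append a node: it starts where the storage currently ends, and the
 * storage grows by the node's size. Returns the node index or -1.
 */
static int lin_add_node(struct lin_storage *st, uint64_t size)
{
	size_t idx = st->nr_nodes;

	if (idx == LIN_MAX_NODES)
		return -1;

	st->nodes[idx].start = st->disk_size;
	st->nodes[idx].size = size;
	st->disk_size += size;
	st->nr_nodes++;
	return (int)idx;
}

/*
 * Translate a storage sector into (node index, sector within node).
 * Returns -1 if the sector lies beyond the end of the storage.
 */
static int lin_remap(const struct lin_storage *st, uint64_t sector,
		     uint64_t *offset)
{
	size_t i;

	assert(offset != NULL);

	for (i = 0; i < st->nr_nodes; i++) {
		const struct lin_node *n = &st->nodes[i];

		if (sector >= n->start && sector - n->start < n->size) {
			*offset = sector - n->start;
			return (int)i;
		}
	}
	return -1;
}
```

Adding two nodes of 800 sectors each yields a 1600-sector storage with nodes at 0 and 800, the same layout as the "0 [800] | 800 [800]" sysfs example in the core patch of this series.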
[1/4] dst: Distributed storage documentation.
Distributed storage documentation.
Algorithms used in the system, userspace interfaces
(sysfs dirs and files), design and implementation details
are described here.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 000..1437a6a
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has own start and size,
+are placed into storage by appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+Simple approach of concatenating storages into single device with
+increased size is used in this algorithm. Essentially new device
+has size equal to sum of sizes of underlying nodes and nodes are
+placed one after another.
+
+[Figure 1. 3 nodes combined into single storage using linear algorithm.
+ Node 1, Node 2 and Node 3, each with its own start and end, are placed
+ one after another between the start and end of the DST storage, which
+ receives the IO operations.]
+
+Mirror algorithm.
+In this algorithm nodes are placed under each other, so when
+operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node - algorithm keeps track of previous operation
+and knows where it was stopped, so that subsequent seek to the
+start of the new request will take the shortest time.
+Writing is always mirrored to all underlying nodes.
+
+[Figure 2. 3 nodes combined into single storage using mirror algorithm.
+ IO operations enter the DST storage; Node 1, Node 2 and Node 3 are
+ placed under it, each with its own previous position.]
+
+Each algorithm must implement number of callbacks,
+which must be registered during initialization time.
+
+struct dst_alg_ops
+{
+	int		(*add_node)(struct dst_node *n);
+	void		(*del_node)(struct dst_node *n);
+	int		(*remap)(struct dst_request *req);
+	int		(*error)(struct kst_state *state, int err);
+	struct module	*owner;
+};
+
[EMAIL PROTECTED]
+This callback is invoked when new node is being added into the storage,
+but before node is actually added into the storage, so that it could
+be accessed from it. When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to remote node or got a reference to the local block device). At this
+stage algorithm can add node into private map.
+It must return zero on success or negative value otherwise.
+
[EMAIL PROTECTED]
+This callback is invoked when node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+It must return zero on success or negative value otherwise.
+
[EMAIL PROTECTED]
+This callback is invoked each time new bio hits the storage.
+Request structure contains BIO itself, pointer to the node, which originally
+stores the whole region under given IO request, and various parameters
+used by storage core to process this block request.
+It must return zero on success or negative value otherwise. It is up to
+this method to call all cleaning if remapping failed, for example it must
+call kst_bio_endio() for given callback in case of error, which in turn
+will call bio_endio(). Note, that dst_request structure provided in this
+callback is allocated on stack, so if there is a need to use it outside
+of the given function, it must be cloned (it will happen automatically
+in state's push callback, but that copy will not be shared by any other
+user).
+
[EMAIL PROTECTED]
+This callback is invoked for each error, which happened when processed
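The read policy described for the mirror algorithm (serve a read from the node whose previous position is nearest to the start of the new request, and mirror every write to all nodes) can be sketched in user-space C as follows. This is an illustration under stated assumptions and not the code from alg_mirror.c: struct mir_node and the mir_* functions are hypothetical names, "previous position" is taken to be the sector just past the last operation on that node, and ties go to the lowest node index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-node bookkeeping for this sketch: where the last operation ended. */
struct mir_node {
	uint64_t prev_pos;
};

/* Absolute distance between two sector numbers. */
static uint64_t mir_distance(uint64_t a, uint64_t b)
{
	return a > b ? a - b : b - a;
}

/* Pick the node with the shortest seek to 'start' (lowest index on ties). */
static size_t mir_pick_reader(const struct mir_node *nodes, size_t nr,
			      uint64_t start)
{
	size_t best = 0, i;

	assert(nodes != NULL && nr > 0);

	for (i = 1; i < nr; i++) {
		if (mir_distance(nodes[i].prev_pos, start) <
		    mir_distance(nodes[best].prev_pos, start))
			best = i;
	}
	return best;
}

/* A read is served by one node only; just that node's position moves. */
static size_t mir_read(struct mir_node *nodes, size_t nr,
		       uint64_t start, uint64_t size)
{
	size_t i = mir_pick_reader(nodes, nr, start);

	nodes[i].prev_pos = start + size;
	return i;
}

/* A write is mirrored to every node, so every position moves. */
static void mir_write(struct mir_node *nodes, size_t nr,
		      uint64_t start, uint64_t size)
{
	size_t i;

	for (i = 0; i < nr; i++)
		nodes[i].prev_pos = start + size;
}
```

One consequence visible in this model is that after a write all nodes sit at the same position, so the next read always goes to the first node. How the real implementation breaks such ties is not stated in the documentation.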
[0/4] dst: Distributed storage.
Distributed storage.

I'm pleased to announce the 9th release of the distributed storage subsystem (DST). This is a maintenance release and includes bug fixes only.

DST allows forming a storage on top of local and remote nodes and combining them into a linear or mirroring setup, which in turn can be exported to remote nodes.

Short changelog:
* use node's size in sectors instead of bytes
* fixed old/new ages for the first node. Error spotted by Matthew Hodgson [EMAIL PROTECTED]
* fixed debug printk declaration
* it is now called 'astonishingly screwed tapeworm'

The overall list of DST features can be found on the project's homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
[2/4] dst: Core distributed storage files.
Core distributed storage files.
Include userspace interfaces, initialization, block layer bindings
and other core functionality.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
 	This driver provides Support for ATA over Ethernet block
 	devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@
 obj-$(CONFIG_VIODASD)		+= viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
+obj-$(CONFIG_DST)		+= dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 000..d35e0cc
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,21 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	select CONNECTOR
+	select LIBCRC32C
+	---help---
+	This driver allows to create a distributed storage.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create linear mapping of the nodes
+	in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create a mirror of the nodes in the
+	distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile new file mode 100644 index 000..1400e94 --- /dev/null +++ b/drivers/block/dst/Makefile @@ -0,0 +1,6 @@ +obj-$(CONFIG_DST) += dst.o + +dst-y := dcore.o kst.o + +obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o +obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c new file mode 100644 index 000..06d0810 --- /dev/null +++ b/drivers/block/dst/dcore.c @@ -0,0 +1,1608 @@ +/* + * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include linux/module.h +#include linux/kernel.h +#include linux/init.h +#include linux/blkdev.h +#include linux/bio.h +#include linux/slab.h +#include linux/connector.h +#include linux/socket.h +#include linux/dst.h +#include linux/device.h +#include linux/in.h +#include linux/in6.h +#include linux/buffer_head.h + +#include net/sock.h + +static LIST_HEAD(dst_storage_list); +static LIST_HEAD(dst_alg_list); +static DEFINE_MUTEX(dst_storage_lock); +static DEFINE_MUTEX(dst_alg_lock); +static int dst_major; +static struct kst_worker *kst_main_worker; +static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL }; + +struct kmem_cache *dst_request_cache; + +static char dst_name[] = Astonishingly screwed tapeworm; + +/* + * DST sysfs tree. 
For device called 'storage' which is formed + * on top of two nodes this looks like this: + * + * /sys/devices/storage/ + * /sys/devices/storage/alg : alg_linear + * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025 + * /sys/devices/storage/n-800/size : 800 + * /sys/devices/storage/n-800/start : 800 + * /sys/devices/storage/n-800/clean + * /sys/devices/storage/n-800/dirty + * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025 + * /sys/devices/storage/n-0/size : 800 + * /sys/devices/storage/n-0/start : 0 + * /sys/devices/storage/n-0/clean + * /sys/devices/storage/n-0/dirty + * /sys/devices/storage/remove_all_nodes + * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800] + * /sys/devices/storage/name : storage + */ + +static int dst_dev_match(struct device *dev, struct device_driver *drv) +{ + return 1; +} + +static void dst_dev_release(struct device *dev) +{ +} + +static struct bus_type dst_dev_bus_type = { + .name = dst, + .match = dst_dev_match, +}; + +static struct device dst_dev = { + .bus= dst_dev_bus_type, + .release= dst_dev_release +}; + +static void dst_node_release(struct device *dev) +{ +} + +static struct device dst_node_dev = { + .release= dst_node_release +}; + +static void dst_free_alg(struct dst_alg *alg) +{ + kfree(alg); +} + +/* + * Algorithm is never freed directly, + * since its
Netchannels. The 21'th release.
Hi. This is the 21'th release of the netchannels, a peer-to-peer protocol agnostic communication channel between hardware and users. It uses unified cache to store channels, allows to allocate buffers for data from userspace mapped area or from other preallocated set of pages (like VFS cache). All protocol processing happens in process context. Users of the system can be for example userspace - it allows to receive and send traffic from the wire without any kernel interference, to implement own protocols and offload its processing to the hardware. This idea was originally proposed and implemented by Van Jacobson. This patchset (with userspace netowrk stack) is a logical continuation of the idea with move to the full peer-to-peer processing. One of its users is userspace network stack [2]. Short changelog: * fixed queue length usage * fixed dst release path. Both problems reported by Salvatore Del Popolo [EMAIL PROTECTED] * removed nat user 1. Netchannels homepage. http://tservice.net.ru/~s0mbre/old/?section=projectsitem=netchannel 2. Userspace network stack. 
http://tservice.net.ru/~s0mbre/old/?section=projectsitem=unetstack Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index 2697e92..3231b22 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -319,3 +319,4 @@ ENTRY(sys_call_table) .long sys_move_pages .long sys_getcpu .long sys_epoll_pwait + .long sys_netchannel_control diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index b4aa875..d35d4d8 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -718,4 +718,5 @@ ia32_sys_call_table: .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu + .quad sys_netchannel_control ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index beeeaf6..33242f8 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -325,10 +325,11 @@ #define __NR_move_pages317 #define __NR_getcpu318 #define __NR_epoll_pwait 319 +#define __NR_netchannel_control320 #ifdef __KERNEL__ -#define NR_syscalls 320 +#define NR_syscalls 321 #include linux/err.h /* diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 777288e..16f1aac 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync_file_range) __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_netchannel_control280 +__SYSCALL(__NR_netchannel_control, sys_netchannel_control) -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_netchannel_control #ifdef __KERNEL__ #include linux/err.h diff --git a/include/linux/connector.h b/include/linux/connector.h index 4c02119..bdf6432 100644 --- a/include/linux/connector.h +++ b/include/linux/connector.h @@ -36,9 +36,11 @@ #define CN_VAL_CIFS 0x1 #define CN_W1_IDX 0x3 /* w1 communication */ #define 
CN_W1_VAL 0x1 +#define CN_NETCHANNELS_IDX 0x04/* Netchannels connection control */ +#define CN_NETCHANNELS_VAL 0x01 -#define CN_NETLINK_USERS 4 +#define CN_NETLINK_USERS 5 /* * Maximum connector's message size. diff --git a/include/linux/netchannel.h b/include/linux/netchannel.h new file mode 100644 index 000..c56afc5 --- /dev/null +++ b/include/linux/netchannel.h @@ -0,0 +1,175 @@ +/* + * netchannel.h + * + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __NETCHANNEL_H +#define __NETCHANNEL_H + +#include linux/types.h + +enum netchannel_commands { + NETCHANNEL_CREATE = 0, +}; + +enum netchannel_type { + NETCHANNEL_EMPTY = 0, + NETCHANNEL_COPY_USER, + NETCHANNEL_NAT, + NETCHANNEL_MAX +}; + +/* + * Destination and source addresses/ports are from receiving point ov view, + * i.e
Re: [Bugme-new] [Bug 9440] New: Problem in joinning a socket to ipv6 multicast address in specific scenario
Hi.

Avaid provided a test application, so the bug got fixed.

IPv6 addrconf removes the IPv6 inner device from the netdev each time the MTU changes and the new value is less than IPV6_MIN_MTU (1280 bytes). When the MTU is changed again and the new value is not less than IPV6_MIN_MTU, it does not add the IPv6 addresses and the inner device back.

This patch fixes that. Tested with Avaid's application, which works OK now.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 567664e..4f7e46c 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2357,12 +2358,18 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 		break;
 
 	case NETDEV_CHANGEMTU:
-		if ( idev && dev->mtu >= IPV6_MIN_MTU) {
+		if (idev && dev->mtu >= IPV6_MIN_MTU) {
 			rt6_mtu_change(dev, dev->mtu);
 			idev->cnf.mtu6 = dev->mtu;
 			break;
 		}
 
+		if (!idev && dev->mtu >= IPV6_MIN_MTU) {
+			idev = ipv6_add_dev(dev);
+			if (idev)
+				break;
+		}
+
 		/* MTU falled under IPV6_MIN_MTU. Stop IPv6 on this interface. */
 
 	case NETDEV_DOWN:

-- 
	Evgeniy Polyakov
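For readers who want the control flow without the diff noise, the decision the patched NETDEV_CHANGEMTU branch makes can be sketched in plain userspace C. This is only an illustration under assumptions: `mtu_action()` and the `MTU_*` labels are hypothetical names, a boolean stands in for "idev exists", and the case where ipv6_add_dev() fails (the patch then falls through to NETDEV_DOWN) is not modelled.

```c
#include <stdbool.h>

#define IPV6_MIN_MTU 1280 /* minimum IPv6 MTU, as quoted in the message */

/* Hypothetical outcome labels for illustration; not kernel identifiers. */
enum mtu_action {
	MTU_UPDATE_ROUTES, /* inner device exists and the MTU is still valid */
	MTU_READD_IPV6,    /* inner device was removed earlier, MTU is valid again */
	MTU_STOP_IPV6      /* MTU below the minimum: handled like NETDEV_DOWN */
};

/* have_idev stands in for "idev != NULL" in addrconf_notify(). */
static enum mtu_action mtu_action(bool have_idev, unsigned int mtu)
{
	if (mtu >= IPV6_MIN_MTU)
		return have_idev ? MTU_UPDATE_ROUTES : MTU_READD_IPV6;

	return MTU_STOP_IPV6;
}
```

Before the patch the middle outcome did not exist: once the inner device had been dropped, raising the MTU again left the interface without IPv6.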
Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()
On Fri, Nov 23, 2007 at 01:11:20PM -0600, Matt Mackall ([EMAIL PROTECTED]) wrote:
> On Fri, Nov 23, 2007 at 09:59:06PM +0300, Evgeniy Polyakov wrote:
> > On Fri, Nov 23, 2007 at 09:51:01PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> > > On Fri, Nov 23, 2007 at 09:48:51PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> > > > Stop, we are trying to free skb without destructor and catch connection tracking, so it is not a solution.
> > > >
> > > > To fix the problem we need to check if it is not netfilter related, kind of this (not tested), Simon please give it a try:
> > >
> > > And to be really cool we need to bypass skbs with xfrm attached, since its freeing also assumes BH context.
> >
> > What about compile options?
>
> What about my original suggestion that we mark skbs owned by netpoll and free only those. Much safer, no? Untested:

This should work if there are netpoll's skbs, but if we are under memory pressure we want to free not only netpoll skbs, but at least one, and what if there are no netpoll skbs in the queue?

-- 
	Evgeniy Polyakov
Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()
On Fri, Nov 23, 2007 at 10:54:10PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> On Fri, Nov 23, 2007 at 01:41:39PM -0600, Matt Mackall ([EMAIL PROTECTED]) wrote:
> > Here's another thought: move all this logic into the networking core, unify it with current softirq zapper, then allow it to be called from various other places (like atomic allocators). Then it'll all be in central maintained place with more users.
>
> This can be done quite easily - put a check into __kfree_skb() if netpoll is compiled-in and we are in hardirq context, then put skb into softirq freeing queue. Then zap_completion_queue() can free anything without ever knowing about nature of the packet, since this will be checked in __kfree_skb() anyway.

And let's add some mess...
But should fix the case when netpoll code is being executed in interrupt context and is about to free skb, which should not be freed. Frankly saying this looks like crap.

Crap-added-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..88f8ea9 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -196,10 +196,7 @@ static void zap_completion_queue(void)
 		while (clist != NULL) {
 			struct sk_buff *skb = clist;
 			clist = clist->next;
-			if (skb->destructor)
-				dev_kfree_skb_any(skb); /* put this one back */
-			else
-				__kfree_skb(skb);
+			__kfree_skb(skb);
 		}
 	}
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 27cfe5f..8642097 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -318,6 +318,26 @@ void kfree_skbmem(struct sk_buff *skb)
 
 void __kfree_skb(struct sk_buff *skb)
 {
+#if defined(CONFIG_NETPOLL) || defined(CONFIG_NETPOLL_TRAP)
+	if (in_irq() || irqs_disabled()) {
+		if (skb->destructor) {
+			dev_kfree_skb_irq(skb);
+			return;
+		}
+#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
+		if (skb->nfct || skb->nfct_reasm) {
+			dev_kfree_skb_irq(skb);
+			return;
+		}
+#endif
+#ifdef CONFIG_XFRM
+		if (skb->sp) {
+			dev_kfree_skb_irq(skb);
+			return;
+		}
+#endif
+	}
+#endif
 	dst_release(skb->dst);
 #ifdef CONFIG_XFRM
 	secpath_put(skb->sp);

-- 
	Evgeniy Polyakov
Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()
On Fri, Nov 23, 2007 at 09:48:51PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> Stop, we are trying to free skb without destructor and catch connection tracking, so it is not a solution.
>
> To fix the problem we need to check if it is not netfilter related, kind of this (not tested), Simon please give it a try:

And to be really cool we need to bypass skbs with xfrm attached, since its freeing also assumes BH context.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..5f86e60 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -196,7 +196,8 @@ static void zap_completion_queue(void)
 		while (clist != NULL) {
 			struct sk_buff *skb = clist;
 			clist = clist->next;
-			if (skb->destructor)
+			if (skb->destructor || skb->nfct ||
+			    skb->nfct_reasm || skb->sp)
 				dev_kfree_skb_any(skb); /* put this one back */
 			else
 				__kfree_skb(skb);

-- 
	Evgeniy Polyakov
Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()
On Fri, Nov 23, 2007 at 08:57:57PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: My memory here is hazy, but I think this exists to rescue netconsole in low-memory situations. This bit originated with Ingo, so maybe he can recall. Netpoll can process an arbitrary number of skbs inside a single interrupt. Think sysrq-t at one packet per line or kgdboe where the entire trace session can happen inside one very long interrupt. Perhaps we can refine this to mark netpoll's skbs (perhaps with -destructor?) and delete only skbs we own. As these are never passed through any of the other route/xfrm/filter code, they should be safe to delete even in irq context, yes? Removing zap_completion_queue() from find_skb() will fix the warning, but I'm not sure this is a correct fix. I've added Matt to the Cc list. Care to try the sysrq-t or OOM message tests? We basically can not free skbs there - if it is interrupt context and we are freeing some skb with destructor we will catch the warning anyway. No matter if we are under memory pressure or whatever - it is not allowed - a lot of skbs are supposed to be freed in softirq context, that is why dev_kfree_skb_any() exists. I think we can drop skbs _without_ destructor from the queue though in that conditions given that we actually need only one. Stop, we are trying to free skb without destructor and catch connection tracking, so it is not a solution. 
To fix the problem we need to check if it is not netfilter related, kind of this (not tested), Simon please give it a try: diff --git a/net/core/netpoll.c b/net/core/netpoll.c index 758dafe..855bb3f 100644 --- a/net/core/netpoll.c +++ b/net/core/netpoll.c @@ -196,7 +196,7 @@ static void zap_completion_queue(void) while (clist != NULL) { struct sk_buff *skb = clist; clist = clist-next; - if (skb-destructor) + if (skb-destructor || skb-nfct || skb-nfct_reasm) dev_kfree_skb_any(skb); /* put this one back */ else __kfree_skb(skb); -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()
On Fri, Nov 23, 2007 at 12:21:57AM -0800, Andrew Morton ([EMAIL PROTECTED]) wrote:
> [2059664.615816] __iptables__: init4 IN=ppp0 OUT=ppp0 WARNING: at kernel/softirq.c:139 local_bh_enable()
> [2059664.620535] [<80120364>] local_bh_enable+0x3c/0x97
> [2059664.620657] [<8011c205>] __call_console_drivers+0x61/0x6d
> [2059664.620669] [<8011c3fc>] release_console_sem+0x164/0x1bf
> [2059664.620679] [<8011c81f>] vprintk+0x27a/0x2ff
>
> If that trace is to be believed we're doing netfilter stuff on packets which were sent across netconsole. This probably isn't anything the netfilter guys have thought about. And probably we don't want them to.
>
> Is there some simple way in which we can exempt netconsole from netfilter processing?

This is not about netfilter, but about freeing an skb in interrupt context, which is not allowed. In interrupt context skbs are queued to be freed in softirq, but netconsole wants to flush the softirq freeing queue. That is the question: why?

Removing zap_completion_queue() from find_skb() will fix the warning, but I'm not sure this is a correct fix. I've added Matt to the Cc list.

-- 
	Evgeniy Polyakov
Re: [Bugme-new] [Bug 9440] New: Problem in joinning a socket to ipv6 multicast address in specific scenario
On Thu, Nov 22, 2007 at 05:23:42PM -0800, Andrew Morton ([EMAIL PROTECTED]) wrote:
> 3. Now i am running a program i wrote in c that opens a dgram socket (sock_fd[i] = socket(test_data->protocol, SOCK_DGRAM, 0);) and join it to multicast ipv6 address. if i am running this program after steps 1+2 i get the following error: Resource temporarily unavailable when trying to join the socket to the multicast ipv6 address by the system call :

Could you provide a test application? Given that it is small and can be run without external dependencies, the bug will be fixed much faster.

Thanks.

-- 
	Evgeniy Polyakov
Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()
On Fri, Nov 23, 2007 at 01:41:39PM -0600, Matt Mackall ([EMAIL PROTECTED]) wrote:
> Here's another thought: move all this logic into the networking core, unify it with current softirq zapper, then allow it to be called from various other places (like atomic allocators). Then it'll all be in central maintained place with more users.

This can be done quite easily - put a check into __kfree_skb() if netpoll is compiled-in and we are in hardirq context, then put skb into softirq freeing queue. Then zap_completion_queue() can free anything without ever knowing about nature of the packet, since this will be checked in __kfree_skb() anyway.

Kind of this:

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..88f8ea9 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -196,10 +196,7 @@ static void zap_completion_queue(void)
 		while (clist != NULL) {
 			struct sk_buff *skb = clist;
 			clist = clist->next;
-			if (skb->destructor)
-				dev_kfree_skb_any(skb); /* put this one back */
-			else
-				__kfree_skb(skb);
+			__kfree_skb(skb);
 		}
 	}
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 27cfe5f..f720685 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -318,6 +318,12 @@ void kfree_skbmem(struct sk_buff *skb)
 
 void __kfree_skb(struct sk_buff *skb)
 {
+#if defined(CONFIG_NETPOLL) || defined(CONFIG_NETPOLL_TRAP)
+	if (in_irq() || irqs_disabled()) {
+		dev_kfree_skb_irq(skb);
+		return;
+	}
+#endif
 	dst_release(skb->dst);
 #ifdef CONFIG_XFRM
 	secpath_put(skb->sp);

-- 
	Evgeniy Polyakov
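The idea of deferring the free inside __kfree_skb() itself can be modelled in userspace roughly as below. This is a sketch under assumptions, not kernel code: `struct demo_skb`, `struct demo_ctx`, `demo_kfree_skb()` and `demo_zap_queue()` are hypothetical stand-ins, the `in_hardirq` flag replaces `in_irq() || irqs_disabled()`, and the `deferred` list replaces the per-CPU completion queue.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an sk_buff; only what the sketch needs. */
struct demo_skb {
	struct demo_skb *next;
	bool freed;
};

/* Hypothetical execution context. */
struct demo_ctx {
	bool in_hardirq;           /* models in_irq() || irqs_disabled() */
	struct demo_skb *deferred; /* models the softirq completion queue */
};

/* Free immediately when allowed, otherwise queue for a later context. */
static void demo_kfree_skb(struct demo_ctx *ctx, struct demo_skb *skb)
{
	if (ctx->in_hardirq) {
		skb->next = ctx->deferred;
		ctx->deferred = skb;
		return;
	}
	skb->freed = true;
}

/*
 * Detach the queue and try to free every entry, as the zapper does.
 * Returns how many skbs were really freed by this call.
 */
static unsigned int demo_zap_queue(struct demo_ctx *ctx)
{
	struct demo_skb *clist = ctx->deferred;
	unsigned int freed = 0;

	ctx->deferred = NULL;
	while (clist != NULL) {
		struct demo_skb *skb = clist;

		clist = clist->next;
		demo_kfree_skb(ctx, skb);
		if (skb->freed)
			freed++;
	}
	return freed;
}
```

In this model, zapping the queue while still in hard-irq context frees nothing: every entry is simply queued again, so no memory is reclaimed until the queue is drained from a context where freeing is allowed.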
Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()
On Fri, Nov 23, 2007 at 09:51:01PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> On Fri, Nov 23, 2007 at 09:48:51PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> > Stop, we are trying to free skb without destructor and catch connection tracking, so it is not a solution.
> >
> > To fix the problem we need to check if it is not netfilter related, kind of this (not tested), Simon please give it a try:
>
> And to be really cool we need to bypass skbs with xfrm attached, since its freeing also assumes BH context.

What about compile options?

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..adb3c54 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -196,10 +196,25 @@ static void zap_completion_queue(void)
 		while (clist != NULL) {
 			struct sk_buff *skb = clist;
 			clist = clist->next;
-			if (skb->destructor)
+			if (skb->destructor) {
 				dev_kfree_skb_any(skb); /* put this one back */
-			else
-				__kfree_skb(skb);
+				continue;
+			}
+
+#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
+			if (skb->nfct || skb->nfct_reasm) {
+				dev_kfree_skb_any(skb); /* put this one back */
+				continue;
+			}
+#endif
+
+#ifdef CONFIG_XFRM
+			if (skb->sp) {
+				dev_kfree_skb_any(skb); /* put this one back */
+				continue;
+			}
+#endif
+			__kfree_skb(skb);
 		}
 	}
 

-- 
	Evgeniy Polyakov
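Stripped of the #ifdefs, the test this patch applies to each queued skb boils down to one predicate. The sketch below is a userspace illustration only: `struct demo_skb` is a hypothetical stand-in that carries just the four fields the patch looks at, and `must_defer_free()` and `demo_noop_destructor()` are made-up names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the sk_buff fields the patch inspects. */
struct demo_skb {
	void (*destructor)(struct demo_skb *skb);
	void *nfct;       /* connection tracking reference */
	void *nfct_reasm; /* connection tracking reassembly reference */
	void *sp;         /* xfrm security path */
};

/* An empty destructor still counts: being set at all is what matters. */
static void demo_noop_destructor(struct demo_skb *skb)
{
	(void)skb;
}

/*
 * True when the skb has to be put back on the softirq free queue
 * instead of being freed directly from the zapper.
 */
static bool must_defer_free(const struct demo_skb *skb)
{
	return skb->destructor != NULL ||
	       skb->nfct != NULL ||
	       skb->nfct_reasm != NULL ||
	       skb->sp != NULL;
}
```

Only an skb with none of the four fields set is freed on the spot; everything else keeps waiting for softirq context.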
Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()
On Fri, Nov 23, 2007 at 11:07:56AM -0600, Matt Mackall ([EMAIL PROTECTED]) wrote: On Fri, Nov 23, 2007 at 01:55:19PM +0300, Evgeniy Polyakov wrote: On Fri, Nov 23, 2007 at 12:21:57AM -0800, Andrew Morton ([EMAIL PROTECTED]) wrote: [2059664.615816] __iptables__: init4 IN=ppp0 OUT=ppp0 WARNING: at kernel/softirq.c:139 local_bh_enable() [2059664.620535] [80120364] local_bh_enable+0x3c/0x97 [2059664.620657] [8011c205] __call_console_drivers+0x61/0x6d [2059664.620669] [8011c3fc] release_console_sem+0x164/0x1bf [2059664.620679] [8011c81f] vprintk+0x27a/0x2ff If that trace is to be beieved we're doing nefilter stuff on packets which were sent across netconsole. This probably isn't anything the netfilter guys have thought about. And probably we don't want them to. Is there some simple way in which we can exempt netconsole from netfilter processing? This is not about netfilter, but about freeing skb in interrupt context, which is not allowed, and in interrupt skbs are queued to be freed in softirq, but netcnsole wants to flush softirq freeing queue. That is a question: why? My memory here is hazy, but I think this exists to rescue netconsole in low-memory situations. This bit originated with Ingo, so maybe he can recall. Netpoll can process an arbitrary number of skbs inside a single interrupt. Think sysrq-t at one packet per line or kgdboe where the entire trace session can happen inside one very long interrupt. Perhaps we can refine this to mark netpoll's skbs (perhaps with -destructor?) and delete only skbs we own. As these are never passed through any of the other route/xfrm/filter code, they should be safe to delete even in irq context, yes? Removing zap_completion_queue() from find_skb() will fix the warning, but I'm not sure this is a correct fix. I've added Matt to the Cc list. Care to try the sysrq-t or OOM message tests? We basically can not free skbs there - if it is interrupt context and we are freeing some skb with destructor we will catch the warning anyway. 
No matter if we are under memory pressure or whatever - it is not allowed - a lot of skbs are supposed to be freed in softirq context, that is why dev_kfree_skb_any() exists. I think we can drop skbs _without_ destructor from the queue though in that conditions given that we actually need only one. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()
On Fri, Nov 23, 2007 at 12:59:43PM -0600, Matt Mackall ([EMAIL PROTECTED]) wrote:
> So I'd be surprised if that was a problem. But I can imagine having problems for skbs without destructors which run into one of these in __kfree_skb:
>
> dst_release
> secpath_put
> nf_conntrack_put
> nf_conntrack_put_reasm
> nf_bridge_put
>
> ..some or all of which assume a softirq context.

bridging is ok, others require softirq context. I've sent a patch (the last one should be ok) to guard against xfrm and connection tracking.

> > No matter if we are under memory pressure or whatever - it is not allowed - a lot of skbs are supposed to be freed in softirq context, that is why dev_kfree_skb_any() exists.
>
> Some skbs we definitely -can- free in irq context. The only ones we care about are the ones generated by netpoll. If there's a reason you think netpoll's own skbs can't be freed, please describe it.

Only some and to distinguish them we can not use destructor - if it is set (even empty function) it will fire an alarm.

> > I think we can drop skbs _without_ destructor from the queue though in that conditions given that we actually need only one.
>
> Huh?

Don't mind - friday...
I posted a patch (third one should be ok) to fix this issue.

-- 
	Evgeniy Polyakov
Re: Netfilter: kernel panic with REDIRECT target. (2.6.23 and 2.6.23.8)
> > Ok, let's try it hard way.
> > Please check attached patch and tell if it helped (it will produce some debug though).
>
> With both patches applied - one Patrick showed and this one.
>
> Now works, with this in dmesg
> conntrack: ea94159c, new: ead4d7c4, old: ead4d7d0, ct: .

David (Miller :), please apply the attached patch, which is also needed to fix the netfilter connection tracking bug.

When a connection tracking entry (nf_conn) is about to copy itself, some of its extension users (like nat) can already be freed and thus do not need to be copied.

Frankly, it may not be the correct fix, but judging from code inspection and the test performed by David [EMAIL PROTECTED], it is.

Actually, looking at this function I suspect it was copied from nf_nat_setup_info() and that is how the bug was introduced.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index 70e7997..86b465b 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -607,13 +607,10 @@ static void nf_nat_move_storage(struct nf_conn *conntrack, void *old)
 	struct nf_conn_nat *new_nat = nf_ct_ext_find(conntrack, NF_CT_EXT_NAT);
 	struct nf_conn_nat *old_nat = (struct nf_conn_nat *)old;
 	struct nf_conn *ct = old_nat->ct;
-	unsigned int srchash;
 
-	if (!(ct->status & IPS_NAT_DONE_MASK))
+	if (!ct || !(ct->status & IPS_NAT_DONE_MASK))
 		return;
 
-	srchash = hash_by_src(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
-
 	write_lock_bh(&nf_nat_lock);
 	hlist_replace_rcu(&old_nat->bysource, &new_nat->bysource);
 	new_nat->ct = ct;

-- 
	Evgeniy Polyakov
Re: [PATCH] LRO ack aggregation
Hi.

On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin ([EMAIL PROTECTED]) wrote:
> Hmm.. rather than a global tunable, what if it was a network driver managed tunable which toggled a flag in the lro_mgr features? Would that be better?

What about ethtool control to set LRO_simple and LRO_ACK_aggregation?

-- 
	Evgeniy Polyakov
Re: Netfilter: kernel panic with REDIRECT target. (2.6.23 and 2.6.23.8)
On Tue, Nov 20, 2007 at 01:24:17PM +0100, Patrick McHardy ([EMAIL PROTECTED]) wrote:
> Patrick McHardy wrote:
> > Evgeniy Polyakov wrote:
> > > > > Ok, let's try it hard way.
> > > > > Please check attached patch and tell if it helped (it will produce some debug though).
> > > >
> > > > With both patches applied - one Patrick showed and this one.
> > > >
> > > > Now works, with this in dmesg
> > > > conntrack: ea94159c, new: ead4d7c4, old: ead4d7d0, ct: .
> > >
> > > David (Miller :), please apply attached patch, which also needed to fix netfilter connection tracking bug.
> > >
> > > When connection tracking entry (nf_conn) is about to copy itself it can have some of its extension users (like nat) as being already freed and thus not required to be copied.
> > >
> > > Frankly saying, it can be not the correct fix, but from code observation and test, perfomed by David [EMAIL PROTECTED] it is.
> >
> > I also don't believe this can be correct, let me look into this first.
>
> I now understand whats happening:
>
> - new connection is allocated without helper
> - connection is REDIRECTed to localhost
> - nf_nat_setup_info adds NAT extension, but doesn't initialize it yet
> - nf_conntrack_alter_reply performs a helper lookup based on the new tuple, finds the SIP helper and allocates a helper extension, causing reallocation because of too little space
> - nf_nat_move_storage is called with the uninitialized nat extension
>
> So your fix is entirely correct, thanks a lot :)

It is always better to check my third eye revelations :)
Thanks for checking it.

-- 
	Evgeniy Polyakov
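The sequence Patrick describes ends with nf_nat_move_storage() seeing a NAT extension whose back-pointer was never set, which is exactly what the added `!ct` test guards against. A userspace sketch of that guard follows; it assumes hypothetical types (`struct demo_conn`, `struct demo_nat_ext`) and a made-up flag `DEMO_NAT_DONE` standing in for IPS_NAT_DONE_MASK, and it leaves out the locking and the hash list replacement done by the real function.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NAT_DONE 0x1u /* hypothetical stand-in for IPS_NAT_DONE_MASK */

/* Hypothetical stand-ins for nf_conn and nf_conn_nat. */
struct demo_conn {
	unsigned int status;
};

struct demo_nat_ext {
	struct demo_conn *ct; /* NULL until the extension is initialized */
};

/*
 * Move the NAT state from the old extension storage to the new one.
 * Returns false, touching nothing, when the old extension was never
 * initialized or NAT setup has not completed for the connection.
 */
static bool demo_move_storage(struct demo_nat_ext *new_nat,
			      const struct demo_nat_ext *old_nat)
{
	struct demo_conn *ct = old_nat->ct;

	if (ct == NULL || !(ct->status & DEMO_NAT_DONE))
		return false;

	new_nat->ct = ct;
	return true;
}
```

Without the NULL test, the status check dereferences the unset pointer, which matches the panic reported in this thread.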
Re: [PATCH] LRO ack aggregation
On Tue, Nov 20, 2007 at 09:50:56PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote: On Tue, Nov 20, 2007 at 04:35:09PM +0300, Evgeniy Polyakov wrote: On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin ([EMAIL PROTECTED]) wrote: Hmm.. rather than a global tunable, what if it was a network driver managed tunable which toggled a flag in the lro_mgr features? Would that be better? What about ethtool control to set LRO_simple and LRO_ACK_aggregation? I have two concerns about this: 1) That same option can still be turned on by distros. FC and Debian turn on hardware checksumm offloading in e1000 and I have a card where this results in more than 10% performance _decrease_. I do not know why, but Im able to run script which disables it via ethtool. 2) This doesn't make sense because the code is actually in the core networking stack. It depends. Software lro can be controlled by simple procfs switch, but hardware one? I recall it was number of times pointed that hardware LRO is possible and likely being implemented in some asics. I'm particular unhappy about 2) because I don't want be in a situation down the track where every driver is going to add this option so that they're not left behind in the arms race. For software lro I agree, but this looks exactly like gso/tso case and additional tweak for software gso. Having it per-system is fine, and I believe no one should ever care that some distro will do bad/good things with it. Actually we do have so much tricky options in procfs already which can kill performance... -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] LRO ack aggregation
On Tue, Nov 20, 2007 at 10:08:31PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote:
> Of course we still have the problem with the option in general that Dave raised. That is this may cause the proliferation of TCP receiver behaviour that may be undesirable.

Yes, it results in bursts of traffic because of delayed acks accumulated in the sender's LRO engine. On the other hand, if the receiver is slow, it will send acks slowly and they will be accumulated slowly, which changes not only seq/ack numbers but also timings, and that is equivalent to increasing the length of the pipe between users. TCP is able to balance on this edge.

I'm sure it depends on the workload, but heavy bulk transfers, the only case where LRO with or without ack aggregation can win, are quite usual on long pipes with high performance numbers.

Until it is tested, I doubt it is possible to say it is 100% good or bad, so my proposal is to write the code so that it is tunable from userspace, turn it off by default and allow people to test the change.

-- 
	Evgeniy Polyakov
[take8 4/4] dst: Algorithms used in distributed storage.
Algorithms used in distributed storage.

Mirror and linear mapping code.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 000..cb77b57
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,104 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+
+	dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+			__func__, st->disk_size, n->size);
+
+	mutex_lock(&st->tree_lock);
+	n->start = st->disk_size;
+	st->disk_size += n->size;
+	mutex_unlock(&st->tree_lock);
+
+	return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+	int err;
+
+	if (req->node->bdev) {
+		generic_make_request(req->bio);
+		return 0;
+	}
+
+	err = kst_check_permissions(req->state, req->bio);
+	if (err)
+		return err;
+
+	return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+	if (err)
+		set_bit(DST_NODE_FROZEN, &st->node->flags);
+	else
+		clear_bit(DST_NODE_FROZEN, &st->node->flags);
+	return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+	.remap		= dst_linear_remap,
+	.add_node	= dst_linear_add_node,
+	.del_node	= dst_linear_del_node,
+	.error		= dst_linear_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+	alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+	if (!alg_linear)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+	dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov [EMAIL PROTECTED]");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 000..1b55f4d
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1113 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+
+struct dst_mirror_node_data
+{
+	u64			age;
+};
+
+struct dst_mirror_priv
+{
+	unsigned int		chunk_num;
+
+	u64			last_start;
+
+	spinlock_t		backlog_lock;
+	struct list_head	backlog_list;
+
+	struct dst_mirror_node_data	old_data, new_data;
+
+	unsigned long		*chunk;
+};
+
+static struct dst_alg *alg_mirror;
+static struct bio_set *dst_mirror_bio_set;
+
+static int dst_mirror_resync(struct dst_node *n, int ndp);
+
+static void dst_mirror_mark_sync(struct dst_node *n)
+{
+	if (test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+		struct dst_mirror_priv *priv = n->priv;
+
+		clear_bit(DST_NODE_NOTSYNC, &n->flags);
+		dprintk("%s: node: %p, %llu:%llu synchronization "
+				"has been completed.\n",
+			__func__, n, n->start, n->size);
+		priv->old_data.age = 0;
+	}
+}
+
+static void dst_mirror_mark_notsync(struct
[take8 0/4] dst: Distributed storage.
Distributed storage.

I'm pleased to announce the 8th release of the distributed
storage subsystem (DST). This is a maintenance release and
includes bug fixes only.

DST allows one to form a storage on top of local and remote nodes and
combine them into a linear or mirroring setup, which in turn can be
exported to remote nodes.

Short changelog:
 * cleanup sysfs files on error path. Patch by Chris Madden [EMAIL PROTECTED]

The overall list of DST features can be found on the project's homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
[take8 2/4] dst: Core distributed storage files.
Core distributed storage files.

Include userspace interfaces, initialization,
block layer bindings and other core functionality.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
 	This driver provides Support for ATA over Ethernet block
 	devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@
 obj-$(CONFIG_VIODASD)		+= viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
+obj-$(CONFIG_DST)		+= dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 000..d35e0cc
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,21 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	select CONNECTOR
+	select LIBCRC32C
+	---help---
+	This driver allows to create a distributed storage.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create linear mapping of the nodes
+	in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create a mirror of the noes in the
+	distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 000..77b2c4f
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1608 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/connector.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL };
+
+struct kmem_cache *dst_request_cache;
+
+static char dst_name[] = "Squizzed black-out of the dancing back-aching hippo";
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-800/clean
+ * /sys/devices/storage/n-800/dirty
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/n-0/clean
+ * /sys/devices/storage/n-0/dirty
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+	return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+	.name		= "dst",
+	.match		= dst_dev_match,
+};
+
+static struct device dst_dev = {
+	.bus		= &dst_dev_bus_type,
+	.release	= dst_dev_release
+};
+
+static void dst_node_release(struct device *dev)
+{
+}
+
+static struct device dst_node_dev = {
+	.release	= dst_node_release
+};
+
+static void dst_free_alg(struct dst_alg *alg)
+{
+	kfree(alg);
+}
+
+/*
+ * Algorithm is never freed
[take8 1/4] dst: Distributed storage documentation.
Distributed storage documentation.

Algorithms used in the system, userspace interfaces
(sysfs dirs and files), design and implementation details
are described here.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 000..1437a6a
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has own start and size,
+are placed into storage by appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+Simple approach of concatenating storages into single device with
+increased size is used in this algorithm. Essentially new device
+has size equal to sum of sizes of underlying nodes and nodes are
+placed one after another.
+
+ /- Node 1 ---\ /-- Node 3 \
+start end start end
+ |==||==|
+ |start end |
+ | \--- Node 2 -/ |
+ | |
+start end
+ \-- DST storage --/
+
+ /\
+ ||
+ ||
+
+ IO operations
+
+ Figure 1.
+ 3 nodes combined into single storage using linear algorithm.
+
+Mirror algorithm.
+In this algorithms nodes are placed under each other, so when
+operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node - algoritm keeps track of previous operation
+and knows where it was stopped, so that subsequent seek to the
+start of the new request will take the shortest time.
+Writing is always mirrored to all underlying nodes.
+
+ IO operations
+ ||
+ ||
+ \/
+
+| DST storage ---|
+| prev position |
+|---| Node 1 |
+| prev pos |
+| Node 2 -|--|
+|prev pos|
+|---| Node 3 |
+
+ Figure 2.
+ 3 nodes combined into single storage using mirror algorithm.
+
+Each algorithm must implement number of callbacks,
+which must be registered during initialization time.
+
+struct dst_alg_ops
+{
+	int		(*add_node)(struct dst_node *n);
+	void		(*del_node)(struct dst_node *n);
+	int		(*remap)(struct dst_request *req);
+	int		(*error)(struct kst_state *state, int err);
+	struct module	*owner;
+};
+
+[EMAIL PROTECTED]
+This callback is invoked when new node is being added into the storage,
+but before node is actually added into the storage, so that it could
+be accessed from it. When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to remote node or got a reference to the local block device). At this
+stage algorithm can add node into private map.
+It must return zero on success or negative value otherwise.
+
+[EMAIL PROTECTED]
+This callback is invoked when node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+It must return zero on success or negative value otherwise.
+
+[EMAIL PROTECTED]
+This callback is invoked each time new bio hits the storage.
+Request structure contains BIO itself, pointer to the node, which originally
+stores the whole region under given IO request, and various parameters
+used by storage core to process this block request.
+It must return zero on success or negative value otherwise. It is upto
+this method to call all cleaning if remapping failed, for example it must
+call kst_bio_endio() for given callback in case of error, which in turn
+will call bio_endio(). Note, that dst_request structure provided in this
+callback is allocated on stack, so if there is a need to use it outside
+of the given function, it must be cloned (it will happen automatically
+in state's push callback, but that copy will not be shared by any other
+user).
+
+[EMAIL PROTECTED]
+This callback is invoked for each error, which happend when processed
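The linear layout described in this documentation can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct node`, `struct storage`, `linear_add_node()` and `linear_remap()` are hypothetical names that do not appear in the patch, and the fixed array of 8 nodes is an arbitrary limit. Only the start/size accounting follows what dst_linear_add_node() does (the new node starts at the current disk size, and the disk size grows by the node size).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a node: it covers sectors [start, start + size). */
struct node {
	uint64_t start;	/* first logical sector served by this node */
	uint64_t size;	/* number of sectors in this node */
};

/* Hypothetical model of a storage built from up to 8 nodes. */
struct storage {
	uint64_t disk_size;	/* sum of the sizes of all added nodes */
	size_t nr_nodes;
	struct node nodes[8];
};

/* Place a new node right after the existing ones. Returns 0 or -1 if full. */
int linear_add_node(struct storage *st, uint64_t size)
{
	struct node *n;

	assert(st != NULL);
	if (st->nr_nodes == sizeof(st->nodes) / sizeof(st->nodes[0]))
		return -1;

	n = &st->nodes[st->nr_nodes++];
	n->start = st->disk_size;
	n->size = size;
	st->disk_size += size;
	return 0;
}

/*
 * Remap a logical sector into (node index, sector inside that node).
 * Returns -1 when the sector lies beyond the end of the storage.
 */
int linear_remap(const struct storage *st, uint64_t sector,
		 size_t *idx, uint64_t *offset)
{
	size_t i;

	assert(st != NULL);
	for (i = 0; i < st->nr_nodes; ++i) {
		const struct node *n = &st->nodes[i];

		if (sector >= n->start && sector < n->start + n->size) {
			*idx = i;
			*offset = sector - n->start;
			return 0;
		}
	}
	return -1;
}
```

With two nodes of 800 sectors each, as in the sysfs example from the core patch, logical sector 900 would land in the second node at offset 100.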
[take8 3/4] dst: Network state machine.
Network state machine.

Includes network async processing state machine and related tasks.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 000..ba5e5ef
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1475 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+	poll_table		pt;
+	struct kst_state	*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+		int type, int proto, int backlog)
+{
+	int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+
+	err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+	if (st) {
+		struct kst_worker *w = st->node->w;
+		unsigned long flags;
+
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (list_empty(&st->ready_entry))
+			list_add_tail(&st->ready_entry, &w->ready_list);
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		wake_up(&w->wait);
+	}
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct kst_state *st = container_of(wait, struct kst_state, wait);
+	kst_wake(st);
+	return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+		poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+	}
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+	struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+	return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+	return req;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+	if (unlikely(req->flags & DST_REQ_CHECK_QUEUE)) {
+		struct dst_request *r;
+
+		list_for_each_entry(r, &st->request_list, request_list_entry) {
+			if (bio_rw(r->bio) != bio_rw(req->bio))
+				continue;
+
+			if (r->start >= req->start + req->size)
+				continue
Re: Netfilter: kernel panic with REDIRECT target. (2.6.23 and 2.6.23.8)
On Mon, Nov 19, 2007 at 06:51:38PM +, David ([EMAIL PROTECTED]) wrote:
> Patrick McHardy wrote:
>>> iptables -t nat -A PREROUTING -j REDIRECT -i eth2 -p udp --dport 5061
>>> --to-ports 5060
>>
>> Also post the kernel panic log.
>>
>> Please try if this patch fixes the problem.
>
> No luck with the patch I'm afraid, panic log attached (of patched kernel).

Ok, let's try it the hard way.
Please check the attached patch and tell me if it helped (it will produce
some debug output though).

What is the load on this machine? Is it simple enough to reproduce?
I will take a closer look tomorrow if this does not help.

Thanks.

diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index 70e7997..7dc3496 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -607,13 +607,13 @@ static void nf_nat_move_storage(struct nf_conn *conntrack, void *old)
 	struct nf_conn_nat *new_nat = nf_ct_ext_find(conntrack, NF_CT_EXT_NAT);
 	struct nf_conn_nat *old_nat = (struct nf_conn_nat *)old;
 	struct nf_conn *ct = old_nat->ct;
-	unsigned int srchash;
+
+	printk("conntrack: %p, new: %p, old: %p, ct: %p.\n",
+			conntrack, new_nat, old_nat, ct);
 
-	if (!(ct->status & IPS_NAT_DONE_MASK))
+	if (!ct || !(ct->status & IPS_NAT_DONE_MASK))
 		return;
 
-	srchash = hash_by_src(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
-
 	write_lock_bh(&nf_nat_lock);
 	hlist_replace_rcu(&old_nat->bysource, &new_nat->bysource);
 	new_nat->ct = ct;

--
	Evgeniy Polyakov
Re: Netfilter: kernel panic with REDIRECT target. (2.6.23 and 2.6.23.8)
On Mon, Nov 19, 2007 at 10:24:23PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> On Mon, Nov 19, 2007 at 06:51:38PM +, David ([EMAIL PROTECTED]) wrote:
>> Patrick McHardy wrote:
>>>> iptables -t nat -A PREROUTING -j REDIRECT -i eth2 -p udp --dport 5061
>>>> --to-ports 5060
>>>
>>> Also post the kernel panic log.
>>>
>>> Please try if this patch fixes the problem.
>>
>> No luck with the patch I'm afraid, panic log attached (of patched kernel).
>
> Ok, let's try it hard way.
> Please check attached patch and tell if it helped (it will produce some
> debug though).

With both patches applied - the one Patrick showed and this one.

--
	Evgeniy Polyakov
Re: Re : Bug in using inet_lookup ()
On Fri, Nov 16, 2007 at 09:47:08AM +, Nj A ([EMAIL PROTECTED]) wrote:
> Hello,
>> Please show at least one bug trace when
>> inet_lookup(&tcp_hashinfo, 0, 0, 0, 0, 0) fails :)
> Trying this the system hangs :-( (setting panic* doesn't change more).

Your code below can not work - you _never_ call inet_lookup().
In your bug trace inet_lookup() is called, so either this code is wrong, or
the bug trace is hand written. And you use inet_iif(), which requires a dst
entry (routing cache), which you do not set up either.

You can do the following to find where the problem occurs:

$ gdb vmlinux
p inet_lookup
l *(returned_above_address + 0x300)

It will show you the line where the bug occurs. You have to compile your
kernel with debugging symbols.

To prove that inet_lookup() works correctly, patch tcp_v4_rcv() to print the
lookup result for static source/destination addresses/ports copied from
your message and a zero ifindex (the last field).

I'm pretty sure your code, which was not shown yet, has a bug in the
routine calling inet_lookup().

> However, using (&tcp_hashinfo, ip_src, p_src, ip_dst, p_dst, 0) gives
> the following oops:

Wrong, you do _NOT_ use this in your code.

> BUG: unable to handle kernel NULL pointer dereference at virtual address
> printing eip:
> c02f19e1
> *pde =
> Oops: [#1]
> CPU:0
> EIP:0060:[c02f19e1]Not tainted VLI
> EFLAGS: 00010282 (2.6.18 #1)
> EIP is at inet_lookup+0x300x500
> eax: 9e3779b9 ebx: 0004 ecx: 9e377a57 edx: f4046f84
> esi: f46a6010 edi: ebp: 009e esp: f4046f38
> ds: 007b es: 007b ss: 0068
> Process knl-thread (pid: 3068, ti=f4046000 task=f46f0610 task.ti=f4046000)
> Stack: 22921900 f6953840 f46a6010 f46a6000 f4046f84 0004 f46a6010 f46a6000
>        f6953840 f8d3314a 0004 b7f3a000 0404 0005 0bfe 0bfe 0404
>        f4046fa8 f6953840 f4aa7880 f4aa7800 f4046fa8
> Code: 00 00 00 8d bc 27 00 00 00 00 55 89 cd 57 0f b7 c9 56 81 e9 47 86 c8
>       61 53 83 ec 14 89 54 24 10 8b b8 54 02 00 00 b8 b9 79 37 9e 8b 5f 10
>       29 d8 89 da 03 44 24 28 c1 ea 0d 29 c8 29 d9 31 d0 89
> EIP: [c02f19e1] inet_lookup +0x300x500 SS:ESP 0068:f4046f38
>
>> Yes, to show the code you are using.
> Ok so basically I am receiving via Netlink a state telling me the ip_src,
> psrc, ip_dst, pdst.
>
> sk = inet_lookup (&tcp_hashinfo, payload->src, payload->p_src,
>                   payload->dst, payload->p_dst, inet_iif (s_skb));

WRONG!
You did not set up s_skb->dst, so inet_iif() will fail. Use 0 there, as you
were told already several times. This will not catch device binding though.

> if (!sk)
>         goto no_tcp_socket;
> if (sk->sk_state == TCP_TIME_WAIT)
>         goto time_wait_socket;
> ...
> bh_lock_sock (sk);
> pdev:
> spin_lock (tmp_lock);
> new_dev = list_entry (tmp, struct net_device, todo_list);
> spin_unlock (tmp_lock);
> if (!new_dev)
>         goto err;
> s_skb->dev = new_dev;
> ...
> switch (sk->sk_state) {
> case TCP_SYN_RECV:
>         ..
> case TCP_LISTEN:
>         ..
> case TCP_SYN_SENT:
>         ..
> }
> bh_unlock_sock (sk);
> ...
> /* send reply via Netlink */

This code _NEVER_ calls inet_lookup(), since the first check for s_skb->dev
will fail and you will select the device via your list and then never return
to inet_lookup().

Anyway, until your code is presented fully, so that people could show you
the exact wrong line, it is pretty much impossible to convince you that
inet_lookup() does work and that you have a bug in your setup.

--
	Evgeniy Polyakov
Re: Re : Re : Bug in using inet_lookup ()
On Wed, Nov 14, 2007 at 04:47:22PM +, Nj A ([EMAIL PROTECTED]) wrote:
> By setting the ID of the ingress device to the inet_lookup() to 0, the
> machine reboots automatically. Setting proc/sys/kernel/panic* to non
> zero values doesn't help more..

Sorry, I did not understand. Do you mean that after you provide zero to
inet_lookup() instead of the device id, the machine started to reboot?

--
	Evgeniy Polyakov
Re: Null pointer dereference in nf_nat_move_storage(), kernel 2.6.23.1
Hi Chuck.

On Wed, Nov 14, 2007 at 06:25:15PM -0500, Chuck Ebbert ([EMAIL PROTECTED]) wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=259501#c14
>
> [f8b61643] __nf_ct_ext_add+0x12f/0x1c4 [nf_conntrack]
>
> nf_nat_move_storage():
> /usr/src/debug/kernel-2.6.23/linux-2.6.23.i686/net/ipv4/netfilter/nf_nat_core.c:612
>   87:   f7 47 64 80 01 00 00    testl  $0x180,0x64(%edi)
>   8e:   74 39                   je     c9 nf_nat_move_storage+0x65
>
> line 612:
>         if (!(ct->status & IPS_NAT_DONE_MASK))
>                 return;

Please test the attached patch.

This routine is called each time the hash should be replaced. nf_conn has
an extension list which contains pointers to connection tracking users
(like nat, which is right now the only such user), so when the replace takes
place it should copy its own extensions. The loop above checks for its own
extension, but tries to move the higher-layer one, which can lead to the
above oops.

Not tested, derived from code observation only.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/netfilter/nf_conntrack_extend.c b/net/netfilter/nf_conntrack_extend.c
index a1a65a1..cf6ba66 100644
--- a/net/netfilter/nf_conntrack_extend.c
+++ b/net/netfilter/nf_conntrack_extend.c
@@ -109,7 +109,7 @@ void *__nf_ct_ext_add(struct nf_conn *ct, enum nf_ct_ext_id id, gfp_t gfp)
 			rcu_read_lock();
 			t = rcu_dereference(nf_ct_ext_types[i]);
 			if (t && t->move)
-				t->move(ct, ct->ext + ct->ext->offset[id]);
+				t->move(ct, ct->ext + ct->ext->offset[i]);
 			rcu_read_unlock();
 		}
 		kfree(ct->ext);

--
	Evgeniy Polyakov
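The one-character fix in this patch can be illustrated with a small userspace model. It is a sketch under assumptions, not the conntrack code: `struct ext`, `struct move_log`, `EXT_NUM` and both functions are hypothetical, and the "move callback" is reduced to recording which offset it would have been handed. The point it shows is that the loop iterates over existing extension types `i`, so each callback should receive `offset[i]`, while the buggy variant hands every callback `offset[id]`, the slot of the extension being added.

```c
#include <assert.h>
#include <stddef.h>

#define EXT_NUM 3	/* hypothetical number of extension types */

/* Hypothetical extension area: offset[i] == 0 means type i is absent. */
struct ext {
	size_t offset[EXT_NUM];
};

/* Records the offset that each existing type's move callback received. */
struct move_log {
	size_t seen[EXT_NUM];
};

/* Buggy variant: indexes the offset table by the new extension 'id'. */
void move_all_buggy(const struct ext *e, int id, struct move_log *log)
{
	int i;

	assert(e != NULL && log != NULL);
	for (i = 0; i < EXT_NUM; ++i) {
		if (!e->offset[i])
			continue;	/* type i has no data to move */
		log->seen[i] = e->offset[id];
	}
}

/* Fixed variant: each existing type gets the offset of its own data. */
void move_all_fixed(const struct ext *e, int id, struct move_log *log)
{
	int i;

	(void)id;	/* the new extension's id is not needed here */
	assert(e != NULL && log != NULL);
	for (i = 0; i < EXT_NUM; ++i) {
		if (!e->offset[i])
			continue;
		log->seen[i] = e->offset[i];
	}
}
```

In this model, when type 0 lives at offset 16 and type 1 is being added (so its offset is still 0), the buggy loop hands type 0 an offset of 0, which in the real code would mean a pointer to the wrong data.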
Re: Re : Re : Re : Bug in using inet_lookup ()
On Thu, Nov 15, 2007 at 05:29:52PM +0100, Nj A ([EMAIL PROTECTED]) wrote:
> Hello all,
> No bugs are due to the inet_lookup call now using the following:
>
> if ((s_skb = alloc_skb (MAX_TCP_HEADER + 15, GFP_ATOMIC)) == NULL) {
>         printk ("%s: Unable to allocate memory \n", __FUNCTION__);
>         err = -ENOMEM;
> }
> dev = s_skb->dev;
> if (!dev)
>         printk ("%s: no device attached to s_skb\n", __FUNCTION__);
> goto process_dev;
> sk = inet_lookup (&tcp_hashinfo, src, p_src, dst, p_dst, inet_iif (s_skb));
> bh_lock_sock (sk);
> process_dev:
> spin_lock (tmp_lock);
> new_dev = list_entry (tmp, struct net_device, todo_list);
> spin_unlock (tmp_lock);
> if (!new_dev)
>         printk ("%s: no device attached to new_dev \n", __FUNCTION__);
> s_skb->dev = new_dev;
> ...
> bh_unlock_sock (sk);
> ...
>
> However, I am not having the right results. I checked with an
> established socket and expected to see that the socket is established
> (which is the case) but got the wrong state when testing on
> (sk->sk_state) and the socket seems in the TIME_WAIT / CLOSE state.
> May be I am corrupting the search by manually attaching a device to
> the skb?
> Any idea please?

Well, your code will oops just like before - you provide an empty skb to
inet_iif(), which is wrong. Actually you will not even reach that
point, since your code will exit after the skb->dev check.

Try a simple inet_lookup(&tcp_hashinfo, src, p_src, dst, p_dst, 0).
It does work.

--
	Evgeniy Polyakov
Re: Re : Re : Re : Re : Bug in using inet_lookup ()
On Thu, Nov 15, 2007 at 04:57:17PM +, Nj A ([EMAIL PROTECTED]) wrote:
>> Well, your code will oops just like before - you provide empty skb to
>> the inet_iif(), which is wrong. Actually you will not even reach that
>> point, since your code will exit after skb->dev check.
>> Try simple inet_lookup(&tcp_hashinfo, src, p_src, dst, p_dst, 0).
> But trying inet_lookup(&tcp_hashinfo, src, p_src, dst, p_dst, 0), the
> machine either hangs or panics.

Hmmm, it does not.
Please show at least one bug trace when
inet_lookup(&tcp_hashinfo, 0, 0, 0, 0, 0) fails :)

> Is there any clean manner to come across this issue?

Yes, to show the code you are using.
Sorry, all mind readers are on vacation.

--
	Evgeniy Polyakov
Re: Bug in using inet_lookup ()
On Wed, Nov 14, 2007 at 09:26:18AM +, Nj A ([EMAIL PROTECTED]) wrote:
> /* The kernel TCP hashtable */
> struct inet_hashinfo __cacheline_aligned tcp_hashinfo = {
>         .lhash_lock = __RW_LOCK_UNLOCKED (tcp_hashinfo.lhash_lock),
>         .lhash_users = ATOMIC_INIT (0),
>         .lhash_wait = __WAIT_QUEUE_HEAD_INITIALIZER (tcp_hashinfo.lhash_wait),
> };
> ...
> struct sock *sk;
> struct sk_buff *skb;
> skb = alloc_skb (MAX_TCP_HEADER + 15, GFP_KERNEL);
> if (skb == NULL)
>         printk ("%s: Unable to allocate memory \n", __FUNCTION__);
> sk = inet_lookup (&tcp_hashinfo, ip_src, src_port, ip_dst, dst_port,
>                   inet_iif (skb));
> if (!sk)
> ...
> This portion of code seems to cause the kernel to panic due to
> dereferencing a NULL pointer.
> Can anyone please tell me what is the error above?
> Best Regards,

Where exactly?
Likely in inet_iif(), since it dereferences dst (routing info), which is
not present after a simple alloc_skb(). You have to set up the skb
correctly; check how ip_rcv() does it.

--
	Evgeniy Polyakov
Re: Re : Bug in using inet_lookup ()
On Wed, Nov 14, 2007 at 01:12:11PM +, Nj A ([EMAIL PROTECTED]) wrote:
> I suspected it could be that. However, can't see in ip_rcv the right
> portion that can help. Any further tip please?

It is ip_rcv_finish(), called from ip_rcv():

	if (skb->dst == NULL) {
		int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
					 skb->dev);
		if (unlikely(err)) {
			if (err == -EHOSTUNREACH)
				IP_INC_STATS_BH(IPSTATS_MIB_INADDRERRORS);
			else if (err == -ENETUNREACH)
				IP_INC_STATS_BH(IPSTATS_MIB_INNOROUTES);
			goto drop;
		}
	}

So you will have to specify the device you got your skb via.
Actually it is not strictly needed in some cases; what you need is the
interface index (dev->ifindex). You can find the socket by using that number
instead of dereferencing dst.

--
	Evgeniy Polyakov
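The advice in this thread (do not dereference a missing dst, use the device's interface index, or 0, instead) can be sketched as a userspace model. Everything here is an assumption made for illustration: `struct fake_rtable`, `struct fake_net_device`, `struct fake_skb` and `lookup_ifindex()` are hypothetical stand-ins, not kernel definitions, and the real inet_iif() does not perform these checks.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for kernel structures; not the real definitions. */
struct fake_rtable {
	int rt_iif;		/* incoming interface recorded by routing */
};

struct fake_net_device {
	int ifindex;		/* interface index of the device */
};

struct fake_skb {
	struct fake_rtable *dst;	/* NULL right after allocation */
	struct fake_net_device *dev;	/* NULL until a device is attached */
};

/*
 * Choose an interface index for a socket lookup without touching a
 * missing routing entry: prefer the routing info, fall back to the
 * attached device, and finally to 0, the value suggested in the thread
 * (which will not catch sockets bound to a device).
 */
int lookup_ifindex(const struct fake_skb *skb)
{
	assert(skb != NULL);

	if (skb->dst)
		return skb->dst->rt_iif;
	if (skb->dev)
		return skb->dev->ifindex;
	return 0;
}
```

A freshly allocated buffer has neither pointer set, so this helper would return 0 for it rather than crash on a NULL dereference.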
Re: [BUG] New Kernel Bugs
On Tue, Nov 13, 2007 at 03:15:53AM -0800, Andrew Morton ([EMAIL PROTECTED]) wrote:
> NETWORKING===
>
> RTNLGRP_ND_USEROPT does not report ifindex (IPv6)
> http://bugzilla.kernel.org/show_bug.cgi?id=9349
> Kernel: 2.6.24+
> No response from developers

Fixed (extended) in DaveM's tree (or will be soon - the patch was
submitted by Pierre Ynard).

Sorry, the others are either driver related (and thus require hardware to
be tested on and maintainers to be kicked) or too obscure (like the 2.6.11
bug and the weird network problem which is undetectable on other systems).

Yes, we suck, but we try to recover :)

--
	Evgeniy Polyakov
Re: possible bug in tcp_probe
Hi.

On Tue, Nov 13, 2007 at 11:26:15AM +, Gavin McCullagh ([EMAIL PROTECTED]) wrote:
> 74.259589763 192.168.2.1 36988 192.168.3.5 5001 0x679c23dc 0x679bc3b4 18 13 9114624 78 76 1 0 64
> 74.260590660 192.168.2.1 44261 192.168.3.5 5006 0x573bb3ed 0x573b700d 13 9 5254144 155 127 1 0 64
> 74.261607478 192.168.2.1 44261 192.168.3.5 5006 0x588.066586741 192.168.2.1 33739 192.168.3.5 5009 0xe26d1767 0xe26cf577 2 3 13090816 443 15818 1 0 64
> 88.066690797 192.168.2.1 33739 192.168.3.5 5009 0xe26d1767 0xe26cfb1f 3 3 13092864 2365 15818 1 0 64
> 88.067625714 192.168.2.1 59385 192.168.3.5 5012 0x411c1090 0x411bd258 12 9 14578688 2807 15812 1 0 64
>
> As you can see the third line has been truncated as well as the next
> roughly 14 seconds of data after which data continues writing as usual.
>
> I don't think my small changes are causing this but perhaps I'm wrong.
> Does anyone know what might be causing the above?

The log buffer has a limited size. You cannot write to it from different
threads and expect all of the data to be printed synchronously, so there
is nothing exceptional here.

-- 
	Evgeniy Polyakov
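The effect of a bounded log buffer can be sketched in userspace. This is
a generic fixed-size ring, not the actual tcp_probe implementation, and
struct log_ring, log_put(), log_get() and LOG_SLOTS are hypothetical names
chosen for the illustration. When the reader falls behind, new records
are dropped, which shows up as a gap in the output.

```c
#include <stddef.h>

#define LOG_SLOTS 4	/* deliberately tiny, for illustration only */

struct log_ring {
	int    slot[LOG_SLOTS];
	size_t head;		/* total records written */
	size_t tail;		/* total records read */
	size_t dropped;		/* records lost because the ring was full */
};

/* Stores one record; returns 1 on success, 0 if the ring was full and
 * the record had to be dropped. */
static int log_put(struct log_ring *r, int value)
{
	if (r->head - r->tail == LOG_SLOTS) {
		r->dropped++;
		return 0;
	}
	r->slot[r->head % LOG_SLOTS] = value;
	r->head++;
	return 1;
}

/* Fetches the oldest record; returns 1 on success, 0 if the ring is empty. */
static int log_get(struct log_ring *r, int *value)
{
	if (r->head == r->tail)
		return 0;
	*value = r->slot[r->tail % LOG_SLOTS];
	r->tail++;
	return 1;
}
```

With several writers a real implementation would also need locking
around log_put(), which this single-threaded sketch leaves out.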
Re: Stack Trace. Bad?
Hi Jon.

On Tue, Nov 06, 2007 at 02:23:03PM -0600, Jon Nelson ([EMAIL PROTECTED]) wrote:
> [linux-raid was also emailed this same information]

It looks like it was not :)

> I was testing some network throughput today and ran into this.
> I should note that I've this motherboard has 2x MCP55 Ethernet and one
> of them works fine and the other one gives lots and lots of frame
> errors under load.
>
> The following is only an harmless informational message.
> Unless you get a _continuous_flood_ of these messages it means
> everything is working fine. Allocations from irqs cannot be
> perfectly reliable and the kernel is designed to handle that.
> md0_raid5: page allocation failure. order:2, mode:0x20
>
> Call Trace:
>  <IRQ>  [802684c2] __alloc_pages+0x324/0x33d
>  [80283147] kmem_getpages+0x66/0x116
>  [8028367a] fallback_alloc+0x104/0x174
>  [80283330] kmem_cache_alloc_node+0x9c/0xa8
>  [80396984] __alloc_skb+0x65/0x138
>  [8821d82a] :forcedeth:nv_alloc_rx_optimized+0x4d/0x18f

What is the MTU for this card?

Forcedeth supports jumbo frames, but does so in a very unoptimized way,
in particular by relying on being able to allocate order-2 pages, which
is wrong. So set the MTU to 1500 and things will be back in good shape.
I think adding fragment support is not a short-term solution because of
the closed specs.

-- 
	Evgeniy Polyakov
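The "order:2" in the trace means a request for 2^2 = 4 physically
contiguous pages. A rough sketch of the arithmetic follows, assuming
4 KiB pages and a buffer that must be contiguous; alloc_order() and
PAGE_SIZE_ASSUMED are hypothetical names, and the exact buffer sizes
forcedeth requests are not given in this thread.

```c
#include <stddef.h>

#define PAGE_SIZE_ASSUMED 4096u	/* assumption: 4 KiB pages */

/* Returns the smallest order such that
 * (PAGE_SIZE_ASSUMED << order) >= size. */
static unsigned int alloc_order(size_t size)
{
	unsigned int order = 0;
	size_t span = PAGE_SIZE_ASSUMED;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

Under these assumptions a buffer for a standard 1500-byte MTU fits in a
single page (order 0), while an illustrative 9000-byte jumbo buffer needs
16 KiB, i.e. order 2, which is much harder to satisfy from interrupt
context once memory is fragmented.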
Re: kernel panic removing devices from a teql queuing discipline
On Mon, Nov 05, 2007 at 11:08:00PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> On Tue, Oct 30, 2007 at 01:33:41AM -0700, David Miller ([EMAIL PROTECTED]) wrote:
> The panic is in __teql_resolve (which has been inlined into
> teql_master_xmit) in net/sched/sch_teql.c at this line:
>
> 	if (n && n->tbl == mn->tbl &&
>
> Specifically the dereference of n->tbl is faulting as n is not valid.
>
> n is never valid (null), mn is garbage.
>
> My fault, of course you are right, n is invalid because it is
> dereferenced from qdisc, which was changed.
> That was too late in Moscow for conclusions...
>
> And the address looks like part of an ASCII string... figt
>
> I studied sch_teql.c a bit and I suspect that the slave list
> management in teql_destroy() and teql_qdisc_init() might be
> suspect.
>
> teql_reset() is called from deactivate and qdisc is set to noop already,
> but subsequent teql_xmit does not know about it and dereferences private
> data as teql qdisc and thus oopses.
> I will fix it tomorrow if you will not catch it first :)

It looks like I am. Tested, works, fixed.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index f05ad9a..e0a44b9 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -263,6 +276,9 @@ __teql_resolve(struct sk_buff *skb, struct sk_buff *skb_res, struct net_device *
 static __inline__ int
 teql_resolve(struct sk_buff *skb, struct sk_buff *skb_res, struct net_device *dev)
 {
+	if (dev->qdisc == &noop_qdisc)
+		return -ENODEV;
+
 	if (dev->hard_header == NULL ||
 	    skb->dst == NULL ||
 	    skb->dst->neighbour == NULL)

-- 
	Evgeniy Polyakov
[0/4] Distributed storage. Squizzed black-out of the dancing back-aching hippo.
Hi.

I'm pleased to announce the 7th and final release of the distributed
storage subsystem (DST). It allows one to form storage on top of local
and remote nodes and combine them in a linear or mirroring setup, which
in turn can be exported to remote nodes.

Short changelog:
 * added strong checksum support (Castagnoli CRC)
 * extended autoconfiguration (added the ability to ask whether the
   remote side supports the strong checksum and turn it on if needed)
 * documentation addon - sysfs files
 * added clean/dirty sysfs files, which allow marking a node as clean
   (in sync) or dirty (not in sync)
 * a fair number of bug fixes (including really tricky bastards, which
   are unlikely to be found in real setups, but which were still bugs)
 * and the main one - added a release name (it clearly shows my
   condition)

The overall list of DST features can be found on the project's homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

-- 
	Evgeniy Polyakov
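For readers unfamiliar with the Castagnoli CRC mentioned in the
changelog, a minimal bitwise sketch follows. It uses the conventional
CRC-32C parameters (reflected polynomial 0x82F63B78, initial value and
final XOR of 0xFFFFFFFF); how DST itself seeds, finalizes or computes the
checksum is not described in this announcement, so treat the function
below as an illustration, not as the DST code.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected form, one bit at a time.
 * Slow but easy to follow; real code would use a table or hardware. */
static uint32_t crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;
	int bit;

	while (len--) {
		crc ^= *p++;
		for (bit = 0; bit < 8; bit++) {
			/* If the low bit is set, shift and XOR in the polynomial. */
			if (crc & 1u)
				crc = (crc >> 1) ^ 0x82F63B78u;
			else
				crc >>= 1;
		}
	}
	return (uint32_t)~crc;
}
```

With these parameters the standard check string "123456789" yields
0xE3069283.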