Re: [PATCH] connector: add parent pid and tgid to coredump and exit events

2018-04-30 Thread Evgeniy Polyakov
Stefan, hi

Sorry for the delay.

26.04.2018, 15:04, "Stefan Strogin" <stefan.stro...@gmail.com>:
> Hi David, Evgeniy,
>
> Sorry to bother you, but could you please comment about the UAPI change and 
> the patch?

With a 4-byte pid_t everything looks fine, and I do not know of any arch where
pid is currently larger, so it looks safe.

David, please pull it into your tree, or should it go via a different path?

Acked-by: Evgeniy Polyakov <z...@ioremap.net>


>>  I don't see how it breaks UAPI. The point is that structures
>>  coredump_proc_event and exit_proc_event are members of *union*
>>  event_data, thus position of the existing data in the structure is
>>  unchanged. Furthermore, this change won't increase size of struct
>>  proc_event, because comm_proc_event (also a member of event_data) is
>>  of bigger size than the changed structures.
>>
>>  If I'm wrong, could you please explain what exactly will the change
>>  break in UAPI?
>>
>>  On 30/03/18 19:59, David Miller wrote:
>>>  From: Stefan Strogin <sstro...@cisco.com>
>>>  Date: Thu, 29 Mar 2018 17:12:47 +0300
>>>
>>>>  diff --git a/include/uapi/linux/cn_proc.h b/include/uapi/linux/cn_proc.h
>>>>  index 68ff25414700..db210625cee8 100644
>>>>  --- a/include/uapi/linux/cn_proc.h
>>>>  +++ b/include/uapi/linux/cn_proc.h
>>>>  @@ -116,12 +116,16 @@ struct proc_event {
>>>>   struct coredump_proc_event {
>>>>   __kernel_pid_t process_pid;
>>>>   __kernel_pid_t process_tgid;
>>>>  + __kernel_pid_t parent_pid;
>>>>  + __kernel_pid_t parent_tgid;
>>>>   } coredump;
>>>>
>>>>   struct exit_proc_event {
>>>>   __kernel_pid_t process_pid;
>>>>   __kernel_pid_t process_tgid;
>>>>   __u32 exit_code, exit_signal;
>>>>  + __kernel_pid_t parent_pid;
>>>>  + __kernel_pid_t parent_tgid;
>>>>   } exit;
>>>>
>>>>   } event_data;
>>>
>>>  I don't think you can add these members without breaking UAPI.
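
The union argument quoted above can be checked with a small userspace sketch. The demo_ types below are simplified stand-ins, not the real definitions from cn_proc.h, and a 4-byte pid_t is assumed. The idea: as long as some other union member (here a comm-like one) is at least as large, appending fields to the exit member changes neither the union size nor the offsets of the existing fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the UAPI types; a 4-byte pid_t is assumed. */
typedef int32_t demo_pid_t;

/* Exit event as it looked before the patch. */
struct demo_exit_old {
	demo_pid_t process_pid;
	demo_pid_t process_tgid;
	uint32_t exit_code, exit_signal;
};

/* Exit event with the two parent fields appended at the end. */
struct demo_exit_new {
	demo_pid_t process_pid;
	demo_pid_t process_tgid;
	uint32_t exit_code, exit_signal;
	demo_pid_t parent_pid;
	demo_pid_t parent_tgid;
};

/* A larger sibling member, modelled on comm_proc_event. */
struct demo_comm {
	demo_pid_t process_pid;
	demo_pid_t process_tgid;
	char comm[16];
};

union demo_event_old { struct demo_exit_old exit; struct demo_comm comm; };
union demo_event_new { struct demo_exit_new exit; struct demo_comm comm; };

/* Returns 1 when the extended layout keeps the union size and old offsets. */
static int demo_layout_is_compatible(void)
{
	return sizeof(union demo_event_old) == sizeof(union demo_event_new) &&
	       offsetof(struct demo_exit_old, exit_code) ==
	       offsetof(struct demo_exit_new, exit_code) &&
	       offsetof(struct demo_exit_old, exit_signal) ==
	       offsetof(struct demo_exit_new, exit_signal);
}
```

In this sketch a program built against the old layout still finds every field where it expects it; it simply never looks at the two trailing ones.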



Re: [RFC] connector: add group_exit_code and signal_flags fields to exit_proc_event

2018-04-08 Thread Evgeniy Polyakov
Hi everyone

Sorry for the late reply

01.03.2018, 21:58, "Stefan Strogin" :
> So I was thinking to add these two fields to union event_data:
> task->signal->group_exit_code
> task->signal->flags
> This won't increase size of struct proc_event (because of comm_proc_event)
> and shouldn't break backward compatibility for the user-space. But it will
> add some useful information about what caused the process death.
> What do you think, is it an acceptable approach?

As I saw in the other discussion, doesn't it break the userspace API, or are
you sure that no sizes have increased?
You are using the same structure that is used for plain signals and adding
group status there. How will userspace react if it was compiled with older
headers? What if it uses zero-field alignment, i.e. allocates exactly the
size of the structure with byte precision?


Re: [PATCH] connector: Delete an error message for a failed memory allocation in cn_queue_alloc_callback_entry()

2017-09-05 Thread Evgeniy Polyakov
Hi everyone

27.08.2017, 22:25, "SF Markus Elfring" <elfr...@users.sourceforge.net>:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Sun, 27 Aug 2017 21:18:37 +0200
>
> Omit an extra message for a memory allocation failure in this function.
>
> This issue was detected by using the Coccinelle software.
>
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>

Looks good to me, thanks Markus.
There is virtually zero useful information in this print if we are in a
situation where the kernel cannot allocate a few bytes to run the connector
queue.

Acked-by: Evgeniy Polyakov <z...@ioremap.net>

kernel-janitors@ please queue this patch up

> ---
>  drivers/connector/cn_queue.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/connector/cn_queue.c b/drivers/connector/cn_queue.c
> index 1f8bf054d11c..e4f31d679f02 100644
> --- a/drivers/connector/cn_queue.c
> +++ b/drivers/connector/cn_queue.c
> @@ -40,10 +40,8 @@ cn_queue_alloc_callback_entry(struct cn_queue_dev *dev, 
> const char *name,
>  struct cn_callback_entry *cbq;
>
>  cbq = kzalloc(sizeof(*cbq), GFP_KERNEL);
> - if (!cbq) {
> - pr_err("Failed to create new callback queue.\n");
> + if (!cbq)
>  return NULL;
> - }
>
>  atomic_set(&cbq->refcnt, 1);
>
> --
> 2.14.1


Re: [PATCH] [RFC] proc connector: add namespace events

2016-09-12 Thread Evgeniy Polyakov
Hi everyone

08.09.2016, 18:39, "Alban Crequy" :
> The act of a process creating or joining a namespace via clone(),
> unshare() or setns() is a useful signal for monitoring applications.

> + if (old_ns->mnt_ns != new_ns->mnt_ns)
> + proc_ns_connector(tsk, CLONE_NEWNS, PROC_NM_REASON_CLONE, old_mntns_inum, 
> new_mntns_inum);
> +
> + if (old_ns->uts_ns != new_ns->uts_ns)
> + proc_ns_connector(tsk, CLONE_NEWUTS, PROC_NM_REASON_CLONE, 
> old_ns->uts_ns->ns.inum, new_ns->uts_ns->ns.inum);
> +
> + if (old_ns->ipc_ns != new_ns->ipc_ns)
> + proc_ns_connector(tsk, CLONE_NEWIPC, PROC_NM_REASON_CLONE, 
> old_ns->ipc_ns->ns.inum, new_ns->ipc_ns->ns.inum);
> +
> + if (old_ns->net_ns != new_ns->net_ns)
> + proc_ns_connector(tsk, CLONE_NEWNET, PROC_NM_REASON_CLONE, 
> old_ns->net_ns->ns.inum, new_ns->net_ns->ns.inum);
> +
> + if (old_ns->cgroup_ns != new_ns->cgroup_ns)
> + proc_ns_connector(tsk, CLONE_NEWCGROUP, PROC_NM_REASON_CLONE, 
> old_ns->cgroup_ns->ns.inum, new_ns->cgroup_ns->ns.inum);
> +
> + if (old_ns->pid_ns_for_children != new_ns->pid_ns_for_children)
> + proc_ns_connector(tsk, CLONE_NEWPID, PROC_NM_REASON_CLONE, 
> old_ns->pid_ns_for_children->ns.inum, new_ns->pid_ns_for_children->ns.inum);
> + }
> +

The patch looks good to me from a technical/connector point of view, but this
event multiplication is a bit weird imho.

I'm not against it, but did you consider sending just 2 serialized ns
structures via a single message, so that the client would check all ns bits
itself?
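
A rough userspace sketch of that alternative follows. Everything with a demo_ prefix is hypothetical and not part of the posted patch: one event carries the old and the new set of namespace inode numbers, and the client works out which namespaces changed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical indices, one per namespace type handled in the patch. */
enum {
	DEMO_NS_MNT,
	DEMO_NS_UTS,
	DEMO_NS_IPC,
	DEMO_NS_NET,
	DEMO_NS_CGROUP,
	DEMO_NS_PID,
	DEMO_NS_COUNT
};

/* Serialized view of one nsproxy: just the namespace inode numbers. */
struct demo_ns_set {
	uint32_t inum[DEMO_NS_COUNT];
};

/* A single message carrying both sets instead of one event per namespace. */
struct demo_ns_event {
	struct demo_ns_set old_ns;
	struct demo_ns_set new_ns;
};

/* Client side: returns a bitmask with bit i set when namespace i changed. */
static unsigned int demo_ns_changed_mask(const struct demo_ns_event *ev)
{
	unsigned int mask = 0;
	int i;

	for (i = 0; i < DEMO_NS_COUNT; i++)
		if (ev->old_ns.inum[i] != ev->new_ns.inum[i])
			mask |= 1u << i;
	return mask;
}
```

With this layout a clone() that changes several namespaces at once produces one message, and a client that only cares about, say, the net namespace tests a single bit.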


Re: [PATCH] connector: fix out-of-order cn_proc netlink message delivery

2016-06-28 Thread Evgeniy Polyakov
Hi Aaron

24.06.2016, 16:07, "Aaron Campbell" :
> The proc connector messages include a sequence number, allowing userspace
> programs to detect lost messages. However, performing this detection is
> currently more difficult than necessary, since netlink messages can be
> delivered to the application out-of-order. To fix this, leave pre-emption
> disabled during cn_netlink_send(), and use GFP_NOWAIT.
>
> The following was written as a test case. Building the kernel w/ make -j32
> proved a reliable way to generate out-of-order cn_proc messages.

This is not actually about out-of-order sending, which is impossible iirc,
but about the way fork pushes messages into the socket queue in parallel.
What you've done is synchronize one layer higher.

I'm not against this patch if you think it does fix some issues, but the
wording is not correct imo.
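
For reference, the userspace side of the sequence check discussed here could look roughly like the sketch below. The demo_ names are hypothetical; the tracker only assumes that each message carries a 32-bit sequence number that increases by one. If the counters are kept per CPU, one tracker per value of the event's cpu field would be needed.

```c
#include <assert.h>
#include <stdint.h>

enum demo_seq_status {
	DEMO_SEQ_OK,        /* exactly the number we expected */
	DEMO_SEQ_GAP,       /* jumped ahead: something was lost or is late */
	DEMO_SEQ_REORDERED  /* older than expected: delivered out of order */
};

struct demo_seq_tracker {
	uint32_t next;   /* sequence number expected next */
	int started;     /* 0 until the first message has been seen */
};

static enum demo_seq_status demo_seq_check(struct demo_seq_tracker *t,
					   uint32_t seq)
{
	if (!t->started) {
		t->started = 1;
		t->next = seq + 1;
		return DEMO_SEQ_OK;
	}
	if (seq == t->next) {
		t->next = seq + 1;
		return DEMO_SEQ_OK;
	}
	/* The signed difference keeps the comparison valid across wraparound. */
	if ((int32_t)(seq - t->next) > 0) {
		t->next = seq + 1;
		return DEMO_SEQ_GAP;
	}
	return DEMO_SEQ_REORDERED;
}
```

A client without in-order delivery cannot tell a lost message from a late one at the moment it sees DEMO_SEQ_GAP, which is the difficulty the patch description refers to.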


Re: [PATCH] cn_proc: Flag termination of the last thread in the process

2015-06-11 Thread Evgeniy Polyakov
Hi Sergei

29.05.2015, 22:50, "Sergei Zhirikov" <sf...@yahoo.com>:
> There is no easy and reliable way for userspace to get notified
> of a process termination. The process connector sends out exit
> events upon termination of each thread, but it is not trivial for
> userspace to tell whether the just-terminated thread was the last
> one in the process.
>
> With this change a flag will be set in struct cn_proc for the exit
> event of the last thread in the process.

I have no objection to this patch, but it should really go through the cn_proc
maintainer.
Feel free to add my Acked-by.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] connector: add cgroup release event report to proc connector

2015-06-11 Thread Evgeniy Polyakov
Hi

28.05.2015, 11:54, "Dimitri John Ledkov" <dimitri.j.led...@intel.com>:

> What you are saying is that we have inefficient notification mechanism
> that hammers everyone's boot time significantly, and no current path
> to resolve it. What can I do to get us efficient cgroup release
> notifications soon?
> This patch-set is a no-op if one doesn't subscribe from the userspace
> and has no other side effects that I can trivially see and is very
> similar in-spirit to other notifications that proc-connector
> generates. E.g. /proc/pid/comm is exposed as a file, yet there is proc
> connector notification as well about comm name changes. Maybe Evgeniy
> can chip in, if such a notification would be beneficial to
> proc-connector.

I understand your need for new notifications related to cgroups,
although I would rather put them into a separate module than into the proc
connector: I'm pretty sure there will be quite a lot of extensions to this
module in the future.

But if you do want to extend the proc connector module, I'm ok with it, but it
should go via its maintainer.


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-06 Thread Evgeniy Polyakov
On Tue, Feb 05, 2008 at 05:52:31PM -0800, Glenn Griffin ([EMAIL PROTECTED]) 
wrote:
> +static u32 cookie_hash(struct in6_addr *saddr, struct in6_addr *daddr,
> +__be16 sport, __be16 dport, u32 count, int c)
> +{
> + __u32 tmp[16 + 5 + SHA_WORKSPACE_WORDS];

This huge buffer should not be allocated on the stack.


-- 
Evgeniy Polyakov


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-06 Thread Evgeniy Polyakov
On Wed, Feb 06, 2008 at 10:30:24AM -0800, Glenn Griffin ([EMAIL PROTECTED]) 
wrote:
> > > +static u32 cookie_hash(struct in6_addr *saddr, struct in6_addr *daddr,
> > > +__be16 sport, __be16 dport, u32 count, int c)
> > > +{
> > > + __u32 tmp[16 + 5 + SHA_WORKSPACE_WORDS];
> >
> > This huge buffer should not be allocated on stack.
>
> I can replace it with a kmalloc, but for my benefit what's the practical
> size we try and limit the stack to?  It seemed at first glance to me
> that 404 bytes plus the arguments, etc. was not such a large buffer for
> a non-recursive function.  Plus the alternative with a kmalloc requires

Well, maybe for the connection establishment path it is not, but it is
absolutely the case in the sending and sometimes the receiving paths with 4k
stacks. The main problem is that bugs caused by stack overflow are so
obscure that it is virtually impossible to detect where the overflow
happened. 'Debug stack overflow' somehow does not help to detect it.

Usually there is about 1-1.5 kb of free stack for each process, so this
change will cut one third of the free stack. Taking into account that
something can store ipv6 addresses on the stack too, this can end badly.

> propagating the possible error status back up to tcp_ipv6.c in the event
> we are unable to allocate enough memory, so it can simply drop the
> connection.  Not an impossible task by any means but it does
> significantly complicate things and I would like to know it's worth the
> effort.  Also would it be worth it to provide a supplemental patch for
> the ipv4 implementation as it allocates the same buffer?

One can reorganize syncookie support to work with request hash tables
too, so that we could allocate per hash-bucket space and use it as a
scratchpad for cookies.
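
The numbers in this exchange can be reproduced with a few lines of C. This is only a sketch: DEMO_SHA_WORKSPACE_WORDS is assumed to be 80, which is what makes the 404 bytes quoted above come out, and the free-stack figures are the 1 to 1.5 kb estimate from this mail, not measurements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 80 workspace words, consistent with the 404 bytes quoted. */
#define DEMO_SHA_WORKSPACE_WORDS 80

/* Size in bytes of "__u32 tmp[16 + 5 + SHA_WORKSPACE_WORDS]". */
static size_t demo_cookie_tmp_bytes(void)
{
	return (16 + 5 + DEMO_SHA_WORKSPACE_WORDS) * sizeof(uint32_t);
}

/* Integer percentage of 'total' taken up by 'used' (rounded down). */
static unsigned int demo_percent_of(size_t used, size_t total)
{
	return (unsigned int)(used * 100 / total);
}
```

Against 1024 bytes of free stack the buffer takes 39 percent, against 1536 bytes it takes 26 percent, so "one third" is a fair middle estimate.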

> --Glenn

-- 
Evgeniy Polyakov


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-05 Thread Evgeniy Polyakov
On Tue, Feb 05, 2008 at 09:02:11PM +0100, Andi Kleen ([EMAIL PROTECTED]) wrote:
> On Tue, Feb 05, 2008 at 10:29:28AM -0800, Glenn Griffin wrote:
> > > Syncookies are discouraged these days. They disable too many
> > > valuable TCP features (window scaling, SACK) and even without them
> > > the kernel is usually strong enough to defend against syn floods
> > > and systems have much more memory than they used to.
> >
> > > So I don't think it makes much sense to add more code to it, sorry.

How do syncookies prevent windows from growing?
Most (if not all) distributions have them enabled and window growing
works just fine. Actually I do not see any reason why the connection
establishment handshake should prevent any run-time operations at all,
even if they were set up during the handshake.

-- 
Evgeniy Polyakov


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-05 Thread Evgeniy Polyakov
On Tue, Feb 05, 2008 at 09:53:45PM +0100, Andi Kleen ([EMAIL PROTECTED]) wrote:
> > How does syncookies prevent windows from growing?
>
> Syncookies do not allow window scaling so you can't have any windows > 64k

Then you meant not a window change, but the fact that the option is ignored,
just like the SACK-enable one?

> > Most (if not all) distributions have them enabled and window growing
> > works just fine. Actually I do not see any reason why connection
> > establishment handshake should prevent any run-time operations at all,
> > even if it was setup during handshake.
>
> TCP only uses options negotiated during the hand shake and syncookies
> is incapable to do this.

What about fixing the implementation, so that it could take different
options into account too?

> -Andi

-- 
Evgeniy Polyakov


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-05 Thread Evgeniy Polyakov
Hi Alan.

On Tue, Feb 05, 2008 at 09:20:17PM +, Alan Cox ([EMAIL PROTECTED]) wrote:
> > Most (if not all) distributions have them enabled and window growing
> > works just fine. Actually I do not see any reason why connection
> > establishment handshake should prevent any run-time operations at all,
> > even if it was setup during handshake.
>
> Syncookies are only triggered if the system is under a load where it
> would begin to lose connections otherwise. So they merely turn a DoS into
> a working if slightly slower setup (and > 64K windows don't matter for
> most normal users, especially on mobile devices).

SACK is actually a good idea for mobile devices, so having syncookies
fail to take some options into account (btw, does it work with
timestamps and PAWS?) is not a solution.

-- 
Evgeniy Polyakov


[2/2] POHMELFS: hack to disable writeback.

2008-01-31 Thread Evgeniy Polyakov
This patch disables writeback in POHMELFS and creates all objects
on its own, without syncing with the remote side.
This mode is _very_ fast.

If POHMELFS were bound to a single remote filesystem, it could
use that filesystem's inode allocation policy and be very happy with a
write-back cache. By design POHMELFS is a transport layer in a distributed
filesystem, which will work with one or another remote filesystem (likely a
completely new one), so instead of the stupid algorithm shown here, it will
contain correct object creation.

Likely the way to go is to use the name hash together with the parent inode
number as a unique ID, which can be matched to a filesystem path, so that the
remote side could create objects without _any_ knowledge of inode numbers on
the local fs.
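
To make the idea concrete, here is a userspace sketch of such an ID. All names are hypothetical, and FNV-1a merely stands in for whatever name hash the filesystem would really use. Hash collisions between two names in the same directory are possible and are not handled here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical object key: parent inode number plus a hash of the name. */
struct demo_object_id {
	uint64_t parent_ino;
	uint32_t name_hash;
};

/* 32-bit FNV-1a, used only as a stand-in for the dentry name hash. */
static uint32_t demo_name_hash(const char *name)
{
	uint32_t h = 2166136261u;

	while (*name) {
		h ^= (unsigned char)*name++;
		h *= 16777619u;
	}
	return h;
}

/* Builds the key for entry 'name' inside the directory 'parent_ino'. */
static struct demo_object_id demo_make_id(uint64_t parent_ino, const char *name)
{
	struct demo_object_id id = { parent_ino, demo_name_hash(name) };

	return id;
}

static int demo_id_equal(struct demo_object_id a, struct demo_object_id b)
{
	return a.parent_ino == b.parent_ino && a.name_hash == b.name_hash;
}
```

Both sides can compute such a key from a path component and its parent alone, so neither needs to know how the other numbers its inodes.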

Crappy-stuff-created-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/pohmelfs/dir.c b/fs/pohmelfs/dir.c
index 23f9ecd..5aec593 100644
--- a/fs/pohmelfs/dir.c
+++ b/fs/pohmelfs/dir.c
@@ -80,6 +80,8 @@ static struct pohmelfs_name *pohmelfs_insert_offset(struct pohmelfs_inode *pi,
 	rb_link_node(&new->offset_node, parent, n);
 	rb_insert_color(&new->offset_node, &pi->offset_root);
 
+	pi->total_len += new->len;
+
 	return NULL;
 }
 
@@ -647,6 +649,7 @@ static int pohmelfs_create_entry(struct inode *dir, struct dentry *dentry, u64 s
 	cmd->start = start;
 	netfs_set_cmd_flags(cmd, dentry->d_name.hash, mode);
 
+#if 0
 	netfs_convert_cmd(cmd);
 
 	err = netfs_data_send(st, cmd, sizeof(struct netfs_cmd));
@@ -666,6 +669,30 @@ static int pohmelfs_create_entry(struct inode *dir, struct dentry *dentry, u64 s
 	err = netfs_recv_inode_info(psb, POHMELFS_I(dir), npi, data);
 	if (err < 0)
 		goto err_out_unlock;
+#else
+	{
+		static u64 pohmelfs_ino = 123;
+
+		st->info.mode = netfs_get_inode_mode(cmd);
+		st->info.ino = pohmelfs_ino++;
+		st->info.nlink = 2;
+		st->info.uid = 2319;
+		st->info.gid = 100;
+
+		cmd->ino = st->info.ino;
+		cmd->start = POHMELFS_I(dir)->total_len;
+	}
+
+	npi = pohmelfs_new_inode(psb, POHMELFS_I(dir), data, cmd, &st->info);
+	if (IS_ERR(npi)) {
+		err = PTR_ERR(npi);
+		if (err != -EEXIST)
+			goto err_out_unlock;
+		npi = NULL;
+	} else
+		err = 0;
+	npi->state = 1;
+#endif
 	mutex_unlock(&st->lock);
 
 	d_add(dentry, &npi->vfs_inode);
diff --git a/fs/pohmelfs/inode.c b/fs/pohmelfs/inode.c
index b0ee0b3..6a81bdc 100644
--- a/fs/pohmelfs/inode.c
+++ b/fs/pohmelfs/inode.c
@@ -125,6 +125,16 @@ static int netfs_process_page(struct file *file, struct page *page, __u64 cmd_op
 	int err;
 	void *addr;
 
+	{
+		if (cmd_op == NETFS_READ_PAGE) {
+			if (file)
+				file->f_pos += cmd->size;
+		}
+		SetPageUptodate(page);
+		unlock_page(page);
+		return 0;
+	}
+
 	mutex_lock(&st->lock);
 
 	cmd->ino = inode->i_ino;
@@ -305,6 +315,7 @@ static struct inode *pohmelfs_alloc_inode(struct super_block *sb)
 
 	inode->state = 0;
 	inode->parent = 0;
+	inode->total_len = 0;
 
 	return &inode->vfs_inode;
 }
diff --git a/fs/pohmelfs/netfs.h b/fs/pohmelfs/netfs.h
index 23aa953..b719fbe 100644
--- a/fs/pohmelfs/netfs.h
+++ b/fs/pohmelfs/netfs.h
@@ -163,6 +163,8 @@ struct pohmelfs_inode
 	u64			ino;
 	u64			parent;
 
+	u64			total_len;
+
 	struct pohmelfs_name	name;
 
 	struct inode		vfs_inode;
 

-- 
Evgeniy Polyakov


[1/2] POHMELFS - network filesystem with local coherent cache.

2008-01-31 Thread Evgeniy Polyakov
Hi.

POHMELFS stands for Parallel Optimized Host Message Exchange
Layered File System. It allows mounting remote servers to a local
directory over the network. This filesystem supports local caching
and writeback flushing.
POHMELFS is a brick in a future distributed filesystem.

This set includes two patches:
 * network filesystem with write-through cache (slow, but works with
   a remote userspace server)
 * hack to show how the local cache works and how much faster it is compared
   to async NFS (see below). The hack disables writeback flush and
   performs local allocation of the objects only.

Now, some vaporware aka food for thoughts and your brains.

A small benchmark of the local cached mode (above hack):

$ time tar -xf /home/zbr/threading.tar

        POHMELFS    NFS v3 (async)
real    0m0.043s    0m1.679s

Which is damn 40 times!

Excited? Now get huge bucket with ice.

The generic problem with a writeback cache is the fact that all local objects
have to have IDs in sync with the remote side. For example, if the remote side
is ext3, the local one should not overwrite the inode with number 0.
In contrast, a write-through cache allows asking the remote side
what ID the given data should have, and so stays in sync. This one is slow.

Of course the difference will not be _that_ huge in the real world, when
the tested archives are larger. This one is a git archive of my userspace
threading library, which is very small. Since it is so small there is
no writeback cache flushing, and thus the remote side never receives data.

Actually one can consider this as tmpfs or something like that. The code
supports sync, but since the inode generation process is very different,
files and dirs cannot be blindly synced to ext3. So, this release of POHMELFS
consists of two patches: the first one is a network filesystem implementation
with a write-through cache, where an object is first created on the remote
side and then populated into the local cache. This one is slow.

The second patch is a hack to disable writeback caching and implement local
caching only, which is very fast.

The next task is to think about how to generically solve the problem of
syncing local changes with a remote server when the remote server maintains
inodes with completely different numbers.
This, among other things, will allow offline work with automatic syncing after
reconnect.

This is not intended for inclusion. CRFS by Zach Brown is a bit ahead of
POHMELFS, but it is not generic enough (because of the above problem), works
only with BTRFS, and has been kept closed by Oracle so far :)
So, anyone who managed to read this far and happens to be at LCA 08 just has
to go to his presentation this Friday.

POHMELFS TODO list includes:

 * mechanism of keeping it coherent with other users
 * unified method of syncing with various remote filesystems

Thank you.

P.S. POHMELFS is about one month old, so do not be so severe with it :)

Crappy-stuff-created-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/Kconfig b/fs/Kconfig
index f9eed6d..c40f2c5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1519,6 +1519,8 @@ endmenu
 menu "Network File Systems"
 	depends on NET
 
+source "fs/pohmelfs/Kconfig"
+
 config NFS_FS
 	tristate "NFS file system support"
 	depends on INET
diff --git a/fs/Makefile b/fs/Makefile
index 720c29d..8fff82a 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -118,3 +118,4 @@ obj-$(CONFIG_HPPFS) += hppfs/
 obj-$(CONFIG_DEBUG_FS) += debugfs/
 obj-$(CONFIG_OCFS2_FS) += ocfs2/
 obj-$(CONFIG_GFS2_FS)   += gfs2/
+obj-$(CONFIG_POHMELFS)  += pohmelfs/
diff --git a/fs/pohmelfs/Kconfig b/fs/pohmelfs/Kconfig
new file mode 100644
index 000..ac19aac
--- /dev/null
+++ b/fs/pohmelfs/Kconfig
@@ -0,0 +1,6 @@
+config POHMELFS
+	tristate "POHMELFS filesystem support"
+	help
+	  POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.
+	  This is a network filesystem which supports coherent caching of data and metadata
+	  on clients.
diff --git a/fs/pohmelfs/Makefile b/fs/pohmelfs/Makefile
new file mode 100644
index 000..8a87f46
--- /dev/null
+++ b/fs/pohmelfs/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_POHMELFS) += pohmelfs.o
+
+pohmelfs-y := inode.o config.o dir.o net.o
diff --git a/fs/pohmelfs/config.c b/fs/pohmelfs/config.c
new file mode 100644
index 000..10eabe1
--- /dev/null
+++ b/fs/pohmelfs/config.c
@@ -0,0 +1,120 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details

Re: [1/2] POHMELFS - network filesystem with local coherent cache.

2008-01-31 Thread Evgeniy Polyakov
Hi.

On Fri, Feb 01, 2008 at 02:04:39AM +0100, Jan Engelhardt ([EMAIL PROTECTED]) 
wrote:
 POHMELFS stands for Parallel Optimized Host Message Exchange
 Layered File System. It allows to mount remote servers to local
 directory via network. This filesystem supports local caching
 and writeback flushing.
 POHMELFS is a brick in a future distributed filesystem.
 
 A brick is usually something that is in the way -
 Or you also say the user has bricked his machine
 when it's quite unusable :)
 Hope you did not mean /that/.

No, this is a brick as in a building block :)

 This set includes two patches:
  * network filesystem with write-through cache (slow, but works with
  remote userspace server)
  * hack to show how local cache works and how faster it is compared
  to async NFS (see below). hack disables writeback flush and
  performs local allocation of the objects only.
 
 Now, some vaporware aka food for thoughts and your brains.
 
 A small benchmark of the local cached mode (above hack):
 
 $ time tar -xf /home/zbr/threading.tar
 
         POHMELFS    NFS v3 (async)
 real    0m0.043s    0m1.679s
 
 Which is damn 40 times!
 
 Needs a bigger data set to compare. But what is much more
 important: does it use a single port for networing, or some
 firewall-unfriendly-by-default multiple dynamic-port-allocation
 like NFS?

It uses a single port, configurable at mount time.
A POHMELFS client can connect to different addresses (including ipv6) and
via different protocols (like sctp). The metadata server will provide that
information dynamically, so the pohmelfs client will be able to connect to
different nodes and perform operations in parallel.

 Next task is to think about how to generically solve the problem with
 syncing local changes with remote server, when remote server maintains 
 inodes with
 completely different numbers.
 This, among others, will allow offline work with automatic syncing after 
 reconnect.
 
 What will happen when both nodes change an inode in disconnected state?
 Which inode wins out?

Whichever comes online first. The second node will be told that there is a
merge collision, which has to be resolved by hand.

 This is not intended for inclusion, CRFS by Zach Brown is a bit ahead of 
 POHMELFS,
 but it is not generic enough (because of above problem), works only with 
 BTRFS,
 and was closed by Oracle so far :)
 
 btrfs is all we need :p

Well, at least it has some very interesting ideas.
Although there are things which are not that good imho, time will
show, maybe there will be another state-of-the-art filesystem at the
moment...

This was for information.
 
 Where's the parallelism that is advertised by the POH in pohmelfs?

First, clients work with local caches and sync them either via writeback
or via a cache coherency algorithm. This work is effectively parallel.
Second, pohmelfs, as part of a distributed filesystem, is developed as a
transport layer that eliminates the mount operation for each different node,
so that after a client asks for data, the request is simply sent to a
different server. This allows parallel transactions. Essentially it looks
like mounting a different remote server to a virtual directory and working
with it, except that connection setup is done not at mount time but at
run time.

-- 
Evgeniy Polyakov


Re: 2.6.24-rc8 ppp regression

2008-01-23 Thread Evgeniy Polyakov
On Wed, Jan 23, 2008 at 10:35:09AM +0100, maximilian attems ([EMAIL PROTECTED]) 
wrote:
 Jan 22 23:23:13 dual kernel: unregister_netdevice: waiting for ppp0 to become 
 free. Usage count = 1
 Jan 22 23:23:44 dual last message repeated 3 times
 Jan 22 23:23:54 dual kernel: unregister_netdevice: waiting for ppp0 to become 
 free. Usage count = 1
 
 2.6.24-rc7 works fine, not yet bisected, will do later in the evening.

Fix (revert) is in Dave's tree already.

-- 
Evgeniy Polyakov


[0/4] DST: Distributed storage: Succumbed to live ant.

2008-01-22 Thread Evgeniy Polyakov

Distributed storage: Succumbed to live ant.

I'm pleased to announce the 14th release of the distributed
storage subsystem (DST).

DST allows forming a storage on top of local and remote nodes
and combining them into a linear or mirroring setup, which in
turn can be exported to remote nodes.

This is a maintenance release only.

Short changelog:
 * do not allocate a fairly large address structure on the stack during
   local export node initialization
Thanks to Serge Leschinsky and Konstantin Kalin for testing.

Overall list of features of the DST can be found on project's homepage:

http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

DST is also exported as a git tree available for clone and pull from
http://tservice.net.ru/~s0mbre/archive/dst/dst.git

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]




[4/4] DST: Algorithms used in distributed storage.

2008-01-22 Thread Evgeniy Polyakov

Algorithms used in distributed storage.
Mirror and linear mapping code.
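
As a reading aid for the linear algorithm in the patch below, here is a userspace sketch of the same bookkeeping. The demo_ names are hypothetical. Adding a node mirrors what dst_linear_add_node() does with start and disk_size; the lookup helper is an assumption about how a request offset would be matched to a node and is not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

struct demo_node {
	uint64_t start;  /* first offset covered by this node */
	uint64_t size;   /* number of units the node contributes */
};

struct demo_storage {
	uint64_t disk_size;  /* total size of the concatenated storage */
};

/* Appends a node: it starts where the storage currently ends. */
static void demo_linear_add_node(struct demo_storage *st, struct demo_node *n)
{
	n->start = st->disk_size;
	st->disk_size += n->size;
}

/* Assumed lookup: index of the node covering 'pos', or -1 if out of range. */
static int demo_linear_find(const struct demo_node *nodes, int count,
			    uint64_t pos)
{
	int i;

	for (i = 0; i < count; i++)
		if (pos >= nodes[i].start && pos - nodes[i].start < nodes[i].size)
			return i;
	return -1;
}
```

Nodes are laid out back to back in the order they are added, so the storage size is simply the sum of the node sizes.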

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 000..2f9ed65
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,105 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+
+	dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+			__func__, st->disk_size, n->size);
+
+	mutex_lock(&st->tree_lock);
+	n->start = st->disk_size;
+	st->disk_size += n->size;
+	dst_set_disk_size(st);
+	mutex_unlock(&st->tree_lock);
+
+	return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+	int err;
+
+	if (req->node->bdev) {
+		generic_make_request(req->bio);
+		return 0;
+	}
+
+	err = kst_check_permissions(req->state, req->bio);
+	if (err)
+		return err;
+
+	return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+	if (err)
+		set_bit(DST_NODE_FROZEN, &st->node->flags);
+	else
+		clear_bit(DST_NODE_FROZEN, &st->node->flags);
+	return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+	.remap		= dst_linear_remap,
+	.add_node	= dst_linear_add_node,
+	.del_node	= dst_linear_del_node,
+	.error		= dst_linear_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+	alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+	if (!alg_linear)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+	dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov [EMAIL PROTECTED]");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 000..529b8cb
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1614 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+#include <linux/vmstat.h>
+
+struct dst_write_entry
+{
+   int error;
+   u32 size;
+   u64 start;
+};
+#define DST_LOG_ENTRIES_PER_PAGE   (PAGE_SIZE/sizeof(struct dst_write_entry))
+
+#define DST_MIRROR_COOKIE  0xc47fd0d33274d7c6ULL
+
+struct dst_mirror_node_data
+{
+   u64 age;
+   u32 num, write_idx, resync_idx, unused;
+   u64 magic;
+};
+
+struct dst_mirror_log
+{
+	unsigned int		nr_pages;
+	struct dst_write_entry	**entries;
+};
+
+struct dst_mirror_priv
+{
+	u64			resync_start, resync_size;
+	atomic_t		resync_num;
+	struct completion	resync_complete;
+	struct delayed_work	resync_work;
+	unsigned int		resync_timeout;
+
+	u64			last_start;
+
+	spinlock_t		resync_wait_lock;
+   struct

[3/4] DST: Network state machine.

2008-01-22 Thread Evgeniy Polyakov

Network state machine.

Includes network async processing state machine and related tasks.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 000..4ff14ce
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1523 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+	poll_table		pt;
+	struct kst_state	*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+		int type, int proto, int backlog)
+{
+	int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+	if (err)
+		goto err_out_release;
+
+	err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+	if (st) {
+		struct kst_worker *w = st->node->w;
+		unsigned long flags;
+
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (list_empty(&st->ready_entry))
+			list_add_tail(&st->ready_entry, &w->ready_list);
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		wake_up(&w->wait);
+	}
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct kst_state *st = container_of(wait, struct kst_state, wait);
+	kst_wake(st);
+	return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+		poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+	}
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+	struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+	return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+	return req;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+	if (unlikely(req->flags & DST_REQ_CHECK_QUEUE)) {
+		struct dst_request *r;
+
+		list_for_each_entry(r, &st->request_list, request_list_entry) {
+			if (bio_rw(r->bio) != bio_rw(req->bio))
+				continue;
+
+			if (r->start >= req->start + req->size)
+				continue

[1/4] DST: Distributed storage documentation.

2008-01-22 Thread Evgeniy Polyakov

Distributed storage documentation.

Algorithms used in the system, userspace interfaces
(sysfs dirs and files), design and implementation details
are described here.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 000..1437a6a
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks with
+an allowed set of operations. Nodes, each of which has its own start and size,
+are placed into the storage by the appropriate algorithm, which remaps
+a logical sector number into the real node's sector. One can create
+custom algorithms, since DST has a pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+This algorithm uses the simple approach of concatenating storages into
+a single device of increased size. Essentially the new device
+has a size equal to the sum of the sizes of the underlying nodes, and
+the nodes are placed one after another.
+
+  /----- Node 1 ----\                   /----- Node 3 ----\
+start              end                start              end
+  |=================|===================|=================|
+  |               start                end                |
+  |                 \----- Node 2 ------/                 |
+  |                                                       |
+start                                                    end
+  \--------------------- DST storage ---------------------/
+
+                             /\
+                             ||
+                             ||
+
+                        IO operations
+
+                          Figure 1.
+ 3 nodes combined into single storage using linear algorithm.
+
+Mirror algorithm.
+In this algorithm nodes are placed under each other, so when
+an operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node - the algorithm keeps track of the previous operation
+and knows where it stopped, so that the subsequent seek to the
+start of the new request will take the shortest time.
+Writing is always mirrored to all underlying nodes.
+
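+                  IO operations
+                       ||
+                       ||
+                       \/
+
+ |-------------- DST storage --------------|
+ |   prev position                         |
+ |---------|----------- Node 1 ------------|
+ |                       prev pos          |
+ |-------- Node 2 -----------|-------------|
+ | prev pos                                |
+ |-----|--------------- Node 3 ------------|
+
+                    Figure 2.
+   3 nodes combined into single storage using mirror algorithm.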
+
+Each algorithm must implement a number of callbacks,
+which must be registered at initialization time.
+
+struct dst_alg_ops
+{
+	int		(*add_node)(struct dst_node *n);
+	void		(*del_node)(struct dst_node *n);
+	int		(*remap)(struct dst_request *req);
+	int		(*error)(struct kst_state *state, int err);
+	struct module	*owner;
+};
+
[EMAIL PROTECTED]
+This callback is invoked when a new node is being added into the storage,
+but before the node actually becomes accessible through the storage.
+When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to remote node or got a reference to the local block device). At this
+stage the algorithm can add the node into its private map.
+It must return zero on success or negative value otherwise.
+
[EMAIL PROTECTED]
+This callback is invoked when node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+It must return zero on success or negative value otherwise.
+
[EMAIL PROTECTED]
+This callback is invoked each time a new bio hits the storage.
+The request structure contains the BIO itself, a pointer to the node which
+originally stores the whole region under the given IO request, and various
+parameters used by the storage core to process this block request.
+It must return zero on success or negative value otherwise. It is up to
+this method to perform all cleanup if remapping failed, for example it must
+call kst_bio_endio() for the given callback in case of error, which in turn
+will call bio_endio(). Note that the dst_request structure provided in this
+callback is allocated on the stack, so if there is a need to use it outside
+of the given function, it must be cloned (this will happen automatically
+in the state's push callback, but that copy will not be shared by any other
+user).
+
[EMAIL PROTECTED]
+This callback is invoked for each error, which happened when processed

[2/4] DST: Core distributed storage files.

2008-01-22 Thread Evgeniy Polyakov

Core distributed storage files.
Include userspace interfaces, initialization,
block layer bindings and other core functionality.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
 	  This driver provides Support for ATA over Ethernet block
 	  devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)  += sx8.o
 obj-$(CONFIG_BLK_DEV_UB)   += ub.o
 
+obj-$(CONFIG_DST)  += dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 000..67a7dad
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,28 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	select CONNECTOR
+	select LIBCRC32C
+	---help---
+	This driver allows creating a distributed storage.
+
+config DST_DEBUG
+	bool "DST debug"
+	depends on DST
+	---help---
+	This option will turn on HEAVY debugging of the DST.
+	Turn it on ONLY if you have to debug some really obscure problem.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	This module allows creating a linear mapping of the nodes
+	in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	This module allows creating a mirror of the nodes in the
+	distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 000..22841a7
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1657 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/connector.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL };
+
+struct kmem_cache *dst_request_cache;
+
+static char dst_name[] = "Succumbed to live ant.";
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/bus/dst/devices/storage/
+ * /sys/bus/dst/devices/storage/alg : alg_linear
+ * /sys/bus/dst/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/bus/dst/devices/storage/n-800/size : 800
+ * /sys/bus/dst/devices/storage/n-800/start : 800
+ * /sys/bus/dst/devices/storage/n-800/clean
+ * /sys/bus/dst/devices/storage/n-800/dirty
+ * /sys/bus/dst/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/bus/dst/devices/storage/n-0/size : 800
+ * /sys/bus/dst/devices/storage/n-0/start : 0
+ * /sys/bus/dst/devices/storage/n-0/clean
+ * /sys/bus/dst/devices/storage/n-0/dirty
+ * /sys/bus/dst/devices/storage/remove_all_nodes
+ * /sys/bus/dst/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/bus/dst/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+   return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+	.name		= "dst",
+	.match		= dst_dev_match,
+};
+
+static struct device dst_dev = {
+   .bus

Re: [Bugme-new] [Bug 9778] New: unregister_netdevice: waiting for [device] to become free

2008-01-21 Thread Evgeniy Polyakov
On Sun, Jan 20, 2008 at 02:30:27AM -0800, David Miller ([EMAIL PROTECTED]) wrote:
> From: Andrew Morton [EMAIL PROTECTED]
> Date: Sat, 19 Jan 2008 16:58:02 -0800
>
> > ouch.
>
> Yep, several people are hitting this it seems.
>
> If Pavel doesn't provide a fix or direction soon I'll just revert.

It looks like the patch is still valid.
Here is the problem description as I understood it.

When a new device (let's talk about ethernet, since that is what I tested)
is being turned on, it gets a neigh_parms entry allocated for it via
inetdev_init(), which is called for the NETDEV_REGISTER inetdev event.
This entry is stored in the arp_tbl table and is in_dev->arp_parms.

When a new arp entry is created later, the device is passed into
arp_constructor(), which clones (increases the reference counter of) the
device's in_dev->arp_parms and puts it into the provided neighbour entry.

When we later remove the device, the reference counter of its
in_dev->arp_parms is still high (it is equal to the number of arp entries
found on the given device plus one), so neigh_parms_destroy() is not called.
Later all neighbour entries are flushed by the garbage collector, the
reference counter for that parm hits zero and the device can be removed.

I will think about how to fix the problem nicely or if this patch still
can be simplified/dropped, but so far it looks valid. Maybe this
analysis will help someone to fix the problem first.
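
The counting described above can be modelled in userspace. This is only a
rough sketch for illustration: the toy_* names are hypothetical and are not
kernel structures or functions, and it assumes that parms holds exactly one
reference on the device, which is dropped together with the last parms
reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a network device with a usage counter. */
struct toy_dev {
	int refcnt;
};

/* Hypothetical stand-in for neigh_parms. */
struct toy_parms {
	struct toy_dev *dev;	/* pins the device while parms is alive */
	int refcnt;		/* 1 for the owner + 1 per neighbour entry */
};

/* Allocation time: parms takes one reference on the device. */
static void toy_parms_init(struct toy_parms *p, struct toy_dev *dev)
{
	dev->refcnt++;
	p->dev = dev;
	p->refcnt = 1;
}

/* A new arp entry clones parms: only the parms counter is bumped. */
static void toy_neigh_create(struct toy_parms *p)
{
	p->refcnt++;
}

/* Drop one parms reference; the device is unpinned only by the last one. */
static void toy_parms_put(struct toy_parms *p)
{
	assert(p->refcnt > 0);
	if (--p->refcnt == 0) {
		p->dev->refcnt--;
		p->dev = NULL;
	}
}
```

With one arp entry the parms counter is 2. Releasing the owner reference on
device removal leaves it at 1, so the device stays pinned until the neighbour
entry is destroyed too, which matches the waiting seen in the log.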

Here is debug dmesg:
[   21.835595] inetdev_init: allocating parms.
[   21.839829] neigh_parms_alloc: parms: 81003d8e8df0, dev: eth0, refcnt: 1, dev_refcnt: 2.
...
[   30.251576] r8169: eth0: link up
[   31.067079] NET: Registered protocol family 10
[   31.072055] neigh_parms_alloc: parms: 81003efc72a8, dev: lo, refcnt: 1, dev_refcnt: 9.
[   31.080891] neigh_alloc: parms: 8812afe8, dev: NULL, refcnt: 2.
[   31.087816] neigh_parms_alloc: parms: 81003efc7210, dev: eth0, refcnt: 1, dev_refcnt: 9.
[   31.097335] neigh_alloc: parms: 804deb88, dev: NULL, refcnt: 2.
[   31.104172] arp_constructor: parms: 81003f8c3be8, dev: lo, refcnt: 2.
[   31.500348] neigh_alloc: parms: 8812afe8, dev: NULL, refcnt: 2.
[   32.499628] neigh_alloc: parms: 8812afe8, dev: NULL, refcnt: 2.
[  102.827796] neigh_destroy: parms: 81003efc7210, dev: eth0, refcnt: 3, dev_refcnt: 13.
[  106.828843] neigh_destroy: parms: 81003f8c3be8, dev: lo, refcnt: 2, dev_refcnt: 78.
[  109.810987] neigh_alloc: parms: 804deb88, dev: NULL, refcnt: 2.

First arp entry for eth0 device, bump the counter:
[  109.817827] arp_constructor: parms: 81003d8e8df0, dev: eth0, refcnt: 2.

[  109.831811] neigh_alloc: parms: 804deb88, dev: NULL, refcnt: 2.
[  109.838661] arp_constructor: parms: 81003f8c3be8, dev: lo, refcnt: 2.
[  110.837894] neigh_destroy: parms: 81003efc7210, dev: eth0, refcnt: 2, dev_refcnt: 15.

Can not release that neigh parm:
[  113.638228] neigh_parms_release: parms: 81003d8e8df0, dev: eth0, refcnt: 2, dev_refcnt: 5.

Can release some other (for ipv6):
[  113.649380] neigh_parms_release: parms: 81003efc7210, dev: eth0, refcnt: 1, dev_refcnt: 5.
[  113.671806] neigh_parms_destroy: parms: 81003efc7210, dev: eth0, dev_refcnt: 3.

[  123.916250] unregister_netdevice: waiting for eth0 to become free. Usage count = 1

GC hits us:
[  124.839572] neigh_destroy: parms: 81003d8e8df0, dev: eth0, refcnt: 1, dev_refcnt: 11.
[  124.847813] neigh_parms_destroy: parms: 81003d8e8df0, dev: eth0, dev_refcnt: 1.
[  124.952026] ACPI: PCI interrupt for device :02:0d.0 disabled

-- 
Evgeniy Polyakov
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bugme-new] [Bug 9778] New: unregister_netdevice: waiting for [device] to become free

2008-01-21 Thread Evgeniy Polyakov
On Mon, Jan 21, 2008 at 03:14:45PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> It looks like the patch is still valid.
> Here is the problem description as I understood it.
>
> When a new device (let's talk about ethernet, since that is what I tested)
> is being turned on, it gets a neigh_parms entry allocated for it via
> inetdev_init(), which is called for the NETDEV_REGISTER inetdev event.
> This entry is stored in the arp_tbl table and is in_dev->arp_parms.
>
> When a new arp entry is created later, the device is passed into
> arp_constructor(), which clones (increases the reference counter of) the
> device's in_dev->arp_parms and puts it into the provided neighbour entry.
>
> When we later remove the device, the reference counter of its
> in_dev->arp_parms is still high (it is equal to the number of arp entries
> found on the given device plus one), so neigh_parms_destroy() is not called.
> Later all neighbour entries are flushed by the garbage collector, the
> reference counter for that parm hits zero and the device can be removed.
>
> I will think about how to fix the problem nicely or if this patch still
> can be simplified/dropped, but so far it looks valid. Maybe this
> analysis will help someone to fix the problem first.

Yes, the patch is valid, and there is a (very noticeable) race between
neighbour processing and parm release: the parm can still be accessed after
the device was fully freed (as with the old behaviour, when dev_put() was
called from neigh_parms_release()), although no one accesses it. So the
simplest solution is to move dev_put() under the table lock, to allow access
to parms->dev only under the table lock, and to always check that it is
non-NULL. So I propose the following patch as the simplest solution for the
time being.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index a4f2618..410b7e7 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -34,6 +34,11 @@ struct neighbour;
 
 struct neigh_parms
 {
+	/*
+	 * This device is only allowed to be accessed under table lock (bh turned off)
+	 * and while device is alive. After parm was released, it will be set to NULL
+	 * and has to be always checked before accessed.
+	 */
 	struct net_device *dev;
 	struct neigh_parms *next;
 	int	(*neigh_setup)(struct neighbour *);
 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index cc8a2f1..5076acd 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1315,7 +1315,12 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms)
 		if (*p == parms) {
 			*p = parms->next;
 			parms->dead = 1;
+			if (parms->dev) {
+				dev_put(parms->dev);
+				parms->dev = NULL;
+			}
 			write_unlock_bh(&tbl->lock);
+
 			call_rcu(&parms->rcu_head, neigh_rcu_free_parms);
 			return;
 		}
@@ -1326,8 +1331,6 @@ void neigh_parms_release(struct neigh_table *tbl, struct neigh_parms *parms)
 
 void neigh_parms_destroy(struct neigh_parms *parms)
 {
-	if (parms->dev)
-		dev_put(parms->dev);
 	kfree(parms);
 }
 

-- 
Evgeniy Polyakov


Re: SACK scoreboard

2008-01-09 Thread Evgeniy Polyakov
Hi.

On Wed, Jan 09, 2008 at 08:03:18AM +0100, Andi Kleen ([EMAIL PROTECTED]) wrote:
> > It adds severe spikes in CPU utilization that are even moderate
> > line rates begins to affect RTTs.
> >
> > Or do you think it's OK to process 500,000 SKBs while locked
> > in a software interrupt.
>
> You can always push it into a work queue.  Even put it to
> other cores if you want.
>
> In fact this is already done partly for the ->completion_queue.
> Wouldn't be a big change to queue it another level down.
>
> Also even freeing a lot of objects doesn't have to be
> that expensive. I suspect the most cost is in taking
> the slab locks, but that could be batched. Without
> that the kmem_free fast path isn't particularly
> expensive, as long as the headers are still in cache.

Postponing freeing of the skb has major drawbacks. Some time ago I
made a patch to postpone skb freeing behind rcu and got 2.5 times slower
connection speed on some machines with decreased CPU usage though.
So, queueing solution has to be proven with real data and although it
looks good in one situation, it can be really bad in another.

For interested reader: results of the RCUfication of the kfree_skbmem()
http://tservice.net.ru/~s0mbre/blog/devel/networking/2006/12/05

-- 
Evgeniy Polyakov


Re: Evgeniy Polyakov

2007-12-21 Thread Evgeniy Polyakov
On Thu, Dec 20, 2007 at 08:34:59PM -0800, David Miller ([EMAIL PROTECTED]) wrote:
>
> If someone has a way other than email to contact Evgeniy, could
> you please let him know that his email is bouncing in strange
> ways.

Yep, I saw him a couple of times and will try to contact him.

> I'll have to unsubscribe him if this goes on much longer, which
> I don't want to do.
>
> Thanks.
>
> Here is some example bounce text:
>
> 451 4.0.0 readqf: cannot open ./dflBL48UH3032179: No such file or directory
> 552 5.3.4 Message is too large; 1500 bytes max
> 554 5.0.0 Service unavailable

This looks really strange to me - I will forward it to the system admin of
the university server where I have my mail.
Likely it is because of some troubles with the mail queue or FS...

Do not unsubscribe me :)

-- 
Evgeniy Polyakov


Re: [0/4] DST: Distributed storage.

2007-12-18 Thread Evgeniy Polyakov
Hi David.

On Tue, Dec 18, 2007 at 12:00:04PM +1100, David Chinner ([EMAIL PROTECTED]) wrote:
> On Mon, Dec 17, 2007 at 06:03:38PM +0300, Evgeniy Polyakov wrote:
> > DST passed all FS tests in LTP with XFS (modulo MAX_LOCK_DEPTH too low bug:
> > [ 8398.605691] BUG: MAX_LOCK_DEPTH too low!
> > [ 8398.609641] turning off the locking correctness validator.
>
> Evgeniy, can you please start reporting these XFS problems you are
> coming across to the XFS list ([EMAIL PROTECTED])? They may be
> real issues that we need to address and we should not be hearing
> about them for the first time in the release notes for a block
> device project

It is not XFS as is, but a lock validator warning. I just found it
while working with XFS - it can be anything else: VFS, block layer, DST itself
(there is a number of locks too), so I did not file a bug against
the filesystem, but pointed out that some problem, probably harmless,
exists in the tested environment.

-- 
Evgeniy Polyakov


[0/4] DST: Distributed storage.

2007-12-17 Thread Evgeniy Polyakov

Distributed storage.

I'm pleased to announce the 12th release of the distributed
storage subsystem (DST).

DST allows forming a storage on top of local and remote nodes
and combining them into a linear or mirroring setup, which in
turn can be exported to remote nodes.
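
To make the linear setup more concrete, here is a rough userspace sketch of
how a logical sector could be mapped onto a node when nodes are placed one
after another. It is not the DST code: the toy_* names and the flat array of
nodes are assumptions made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node descriptor: start and size in sectors inside the storage. */
struct toy_node {
	uint64_t start;
	uint64_t size;
};

/*
 * Append a node: it starts where the storage currently ends,
 * and the storage grows by the size of the node.
 */
static void toy_linear_add(struct toy_node *n, uint64_t size, uint64_t *disk_size)
{
	assert(n != NULL && disk_size != NULL);
	n->start = *disk_size;
	n->size = size;
	*disk_size += size;
}

/*
 * Find the node which covers the logical sector and compute the sector
 * inside that node. Returns the node index, or -1 if the sector lies
 * past the end of the storage.
 */
static int toy_linear_remap(const struct toy_node *nodes, size_t num,
		uint64_t sector, uint64_t *node_sector)
{
	size_t i;

	for (i = 0; i < num; ++i) {
		if (sector >= nodes[i].start &&
		    sector - nodes[i].start < nodes[i].size) {
			*node_sector = sector - nodes[i].start;
			return (int)i;
		}
	}
	return -1;
}
```

For two nodes of 800 sectors each the storage is 1600 sectors long: logical
sector 799 is the last sector of the first node and logical sector 800 is
sector 0 of the second one.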

Short changelog:
 * new improved mirroring algorithm.
   This algorithm uses a sliding window approach for full resync
   and a write log for partial resync.
 * fixed a number of typos and debug cleanups
 * update inode size when the linear algorithm changes the size of the
   storage at run time
 * extended the number of sysfs files and the documentation for them
 * fixed a leak in local export node setup
 * name is 'Dancing with the smoked neutrino' now

Overall list of features of the DST can be found on project's homepage:

http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

DST is also exported as a git tree available for clone and pull from
http://tservice.net.ru/~s0mbre/archive/dst/dst.git

Interested reader can test DST with 2.6.23 tree too
(it should compile fine, but was not tested).

DST passed all FS tests in LTP with XFS (modulo MAX_LOCK_DEPTH too low bug:
[ 8398.605691] BUG: MAX_LOCK_DEPTH too low!
[ 8398.609641] turning off the locking correctness validator.

this is not DST problem though), but it was not performed with
offline/online nodes.

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]




[2/4] DST: Core distributed storage files.

2007-12-17 Thread Evgeniy Polyakov

Core distributed storage files.
Include userspace interfaces, initialization,
block layer bindings and other core functionality.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
 	  This driver provides Support for ATA over Ethernet block
 	  devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)  += sx8.o
 obj-$(CONFIG_BLK_DEV_UB)   += ub.o
 
+obj-$(CONFIG_DST)  += dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 000..67a7dad
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,28 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	select CONNECTOR
+	select LIBCRC32C
+	---help---
+	This driver allows creating a distributed storage.
+
+config DST_DEBUG
+	bool "DST debug"
+	depends on DST
+	---help---
+	This option will turn on HEAVY debugging of the DST.
+	Turn it on ONLY if you have to debug some really obscure problem.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	This module allows creating a linear mapping of the nodes
+	in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	This module allows creating a mirror of the nodes in the
+	distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 000..423e7b2
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1622 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/connector.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL };
+
+struct kmem_cache *dst_request_cache;
+
+static char dst_name[] = "Dancing with the smoked neutrino";
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-800/clean
+ * /sys/devices/storage/n-800/dirty
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/n-0/clean
+ * /sys/devices/storage/n-0/dirty
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+   return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+	.name		= "dst",
+	.match		= dst_dev_match,
+};
+
+static struct device dst_dev = {
+	.bus		= &dst_dev_bus_type,
+	.release	= dst_dev_release
+};
+
+static void dst_node_release(struct device *dev

[3/4] DST: Network state machine.

2007-12-17 Thread Evgeniy Polyakov

Network state machine.

Includes network async processing state machine and related tasks.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 000..6d92014
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1515 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+	poll_table		pt;
+	struct kst_state	*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+		int type, int proto, int backlog)
+{
+	int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+
+	err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+	if (st) {
+		struct kst_worker *w = st->node->w;
+		unsigned long flags;
+
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (list_empty(&st->ready_entry))
+			list_add_tail(&st->ready_entry, &w->ready_list);
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		wake_up(&w->wait);
+	}
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct kst_state *st = container_of(wait, struct kst_state, wait);
+	kst_wake(st);
+	return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+		poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+	}
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+	struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+	return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+	return req;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+	if (unlikely(req->flags & DST_REQ_CHECK_QUEUE)) {
+		struct dst_request *r;
+
+		list_for_each_entry(r, &st->request_list, request_list_entry) {
+			if (bio_rw(r->bio) != bio_rw(req->bio))
+				continue;
+
+			if (r->start >= req->start + req->size)
+				continue

[1/4] DST: Distributed storage documentation.

2007-12-17 Thread Evgeniy Polyakov

Distributed storage documentation.

Algorithms used in the system, userspace interfaces
(sysfs dirs and files), design and implementation details
are described here.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 000..1437a6a
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has own start and size,
+are placed into storage by appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+Simple approach of concatenating storages into single device with
+increased size is used in this algorithm. Essentially new device
+has size equal to sum of sizes of underlying nodes and nodes are
+placed one after another.
+
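+  /---- Node 1 ----\                    /---- Node 3 ----\
+start             end                 start              end
+ |==================||================||==================|
+ |                 start             end                  |
+ |                   \---- Node 2 ----/                   |
+ |                                                        |
+start                                                    end
+ \--------------------- DST storage ----------------------/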
+
+   /\
+   ||
+   ||
+
+  IO operations
+
+   Figure 1. 
+ 3 nodes combined into single storage using linear algorithm.
+
+Mirror algorithm.
+In this algorithm nodes are placed under each other, so when
+operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node - the algorithm keeps track of previous operation
+and knows where it was stopped, so that subsequent seek to the 
+start of the new request will take the shortest time.
+Writing is always mirrored to all underlying nodes.
+
+  IO operations
+   ||
+   ||
+   \/
+
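+|------------- DST storage --------------|
+|             prev position              |
+|---|----------- Node 1 -----------------|
+|             prev pos                   |
+|----------- Node 2 -----------|---------|
+|          prev pos                      |
+|---|----------- Node 3 -----------------|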
+
+   Figure 2.
+   3 nodes combined into single storage using mirror algorithm.
+
+Each algorithm must implement number of callbacks,
+which must be registered during initialization time.
+
+struct dst_alg_ops
+{
+   int (*add_node)(struct dst_node *n);
+   void(*del_node)(struct dst_node *n);
+   int (*remap)(struct dst_request *req);
+   int (*error)(struct kst_state *state, int err);
+   struct module   *owner;
+};
+
+@add_node
+This callback is invoked when new node is being added into the storage,
+but before node is actually added into the storage, so that it could
+be accessed from it. When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to remote node or got a reference to the local block device). At this
+stage algorithm can add node into private map. 
+It must return zero on success or negative value otherwise.
+
+@del_node
+This callback is invoked when node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+It does not return a value.
+
+@remap
+This callback is invoked each time new bio hits the storage.
+Request structure contains BIO itself, pointer to the node, which originally
+stores the whole region under given IO request, and various parameters
+used by storage core to process this block request.
+It must return zero on success or negative value otherwise. It is up to
+this method to call all cleaning if remapping failed, for example it must
+call kst_bio_endio() for given callback in case of error, which in turn
+will call bio_endio(). Note, that dst_request structure provided in this
+callback is allocated on stack, so if there is a need to use it outside
+of the given function, it must be cloned (it will happen automatically
+in state's push callback, but that copy will not be shared by any other
+user).
+
+@error
+This callback is invoked for each error, which happened when processed

[4/4] DST: Algorithms used in distributed storage.

2007-12-17 Thread Evgeniy Polyakov

Algorithms used in distributed storage.
Mirror and linear mapping code.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 000..836764d
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,114 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+	struct block_device *bdev;
+
+	dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+			__func__, st->disk_size, n->size);
+
+	mutex_lock(&st->tree_lock);
+	n->start = st->disk_size;
+	st->disk_size += n->size;
+	set_capacity(st->disk, st->disk_size);
+
+	bdev = bdget_disk(st->disk, 0);
+	if (bdev) {
+		mutex_lock(&bdev->bd_inode->i_mutex);
+		i_size_write(bdev->bd_inode, to_bytes(st->disk_size));
+		mutex_unlock(&bdev->bd_inode->i_mutex);
+		bdput(bdev);
+	}
+	mutex_unlock(&st->tree_lock);
+
+	return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+	int err;
+
+	if (req->node->bdev) {
+		generic_make_request(req->bio);
+		return 0;
+	}
+
+	err = kst_check_permissions(req->state, req->bio);
+	if (err)
+		return err;
+
+	return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+	if (err)
+		set_bit(DST_NODE_FROZEN, &st->node->flags);
+	else
+		clear_bit(DST_NODE_FROZEN, &st->node->flags);
+	return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+   .remap  = dst_linear_remap,
+   .add_node   = dst_linear_add_node,
+   .del_node   = dst_linear_del_node,
+   .error  = dst_linear_error,
+   .owner  = THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+	alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+	if (!alg_linear)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+	dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov [EMAIL PROTECTED]");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 000..c10d582
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1536 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+#include <linux/vmstat.h>
+
+struct dst_write_entry
+{
+	int		error;
+	u32		size;
+	u64		start;
+};
+#define DST_LOG_ENTRIES_PER_PAGE	(PAGE_SIZE/sizeof(struct dst_write_entry))
+
+struct dst_mirror_node_data
+{
+	u64		age;
+	u64		num, write_idx, resync_idx;
+};
+
+struct dst_mirror_log
+{
+	unsigned int		nr_pages;
+	struct dst_write_entry	**entries;
+};
+
+struct dst_mirror_priv
+{
+	u64			resync_start, resync_size;
+	atomic_t		resync_num;
+	struct completion	resync_complete

Re: Badness at net/core/dev.c:2199

2007-12-16 Thread Evgeniy Polyakov
On Sun, Dec 16, 2007 at 07:55:55PM +0200, Meelis Roos ([EMAIL PROTECTED]) wrote:
> Just got this trace from current 2.6.24-rc5+git running on 32-bit ppc 
> (PReP subarch, tulip NIC's) during apt-get update (logged in via ssh so 
> also ssh traffic):
> 
> ------------[ cut here ]------------
> Badness at net/core/dev.c:2199

Please test the attached patch.
If I understood tulip correctly, it is possible for the number of
processed entries to exceed the requested budget. When work_done is
equal to budget-1, the last skb is processed. On the next pass the check
on line 154 sees work_done equal to budget and the loop breaks, but the
post-increment in that same check still increases work_done, so it is
equal to budget+1 at exit, which fires the warning you saw.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/drivers/net/tulip/interrupt.c b/drivers/net/tulip/interrupt.c
index 3653314..9e0e97a 100644
--- a/drivers/net/tulip/interrupt.c
+++ b/drivers/net/tulip/interrupt.c
@@ -151,8 +151,9 @@ int tulip_poll(struct napi_struct *napi, int budget)
 		if (tulip_debug > 5)
 			printk(KERN_DEBUG "%s: In tulip_rx(), entry %d %8.8x.\n",
 			       dev->name, entry, status);
-		if (work_done++ >= budget)
+		if (work_done >= budget)
 			goto not_done;
+		work_done++;
 
 		if ((status & 0x38008300) != 0x0300) {
 			if ((status & 0x38000300) != 0x0300) {


-- 
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Badness at net/core/dev.c:2199

2007-12-16 Thread Evgeniy Polyakov
On Sun, Dec 16, 2007 at 10:33:40AM -0800, Stephen Hemminger ([EMAIL PROTECTED]) wrote:
> > index 3653314..9e0e97a 100644
> > --- a/drivers/net/tulip/interrupt.c
> > +++ b/drivers/net/tulip/interrupt.c
> > @@ -151,8 +151,9 @@ int tulip_poll(struct napi_struct *napi, int budget)
> >  		if (tulip_debug > 5)
> >  			printk(KERN_DEBUG "%s: In tulip_rx(), entry %d %8.8x.\n",
> >  			       dev->name, entry, status);
> > -		if (work_done++ >= budget)
> > +		if (work_done >= budget)
> >  			goto not_done;
> > +		work_done++;
> >  
> >  		if ((status & 0x38008300) != 0x0300) {
> >  			if ((status & 0x38000300) != 0x0300) {
> > 
> > 
> 
> I already sent out a correct patch last week. It should pre-increment.

That will work too.
Thanks Stephen.

-- 
Evgeniy Polyakov


Re: [3/4] DST: Network state machine.

2007-12-13 Thread Evgeniy Polyakov
On Thu, Dec 13, 2007 at 11:43:43PM +0300, Dmitry Monakhov ([EMAIL PROTECTED]) wrote:
> On 14:47 Mon 10 Dec , Evgeniy Polyakov wrote:
> > 
> > Network state machine.
> > 
> > Includes network async processing state machine and related tasks.
> Hi, I've tried to play a little bit with DST and discovered a huge memory
> leak. Every read request from a remote node results in a bio + bio's pages leak.
> 
> Data flow:
> ->kst_export_ready                 ## prepare and submit bio
>   ->generic_make_request(bio)      ## submit it
> 
> ->kst_export_read_end_io           ## block layer calls the bio_end_io callback
> 
> ->kst_thread_process_state         ## process ready requests
>   ->kst_data_callback
>     ->kst_data_process_bio         ## submit pages to network layer
>   ->kst_complete_req
>     ->kst_bio_endio
>       ->kst_export_read_end_io     ## Wow, we are calling the same bio_end_io
>                                    ## callback twice
>     ->dst_free_request(req);       ## request will be destroyed but its bio
>                                    ## and all the bio's pages weren't released.
> We may release the bio's pages after it was sent to the network; it is safe
> because sendpage() already called get_page(). I've attached a simple patch
> which does this.

Yes, your patch looks good.
Thanks a lot, Dmitry.

-- 
Evgeniy Polyakov


Re: What was the reason for 2.6.22 SMP kernels to change how sendmsg is called?

2007-12-13 Thread Evgeniy Polyakov
Hi Kevin.

On Thu, Dec 13, 2007 at 04:00:02PM -0600, Kevin Wilson ([EMAIL PROTECTED]) wrote:
> I see your point but it just so happens it is a GPL'd driver, as is all of 
> our Linux code we produce for our hardware. Granted it is out of tree, and 
> after you saw it you would want it to stay that way. However, I would have 
> sent you the whole thing if that is a pre-req to cordial exchanges on this 
> list.
> 
> Nonetheless, a somewhat recent change in your tree, that I could not pinpoint 
> on my own, caused the driver to stop functioning properly. So after much 
> searching in git/google/sources with no luck, I decided to ask for a little 
> assistance, maybe just a hint as to where the culprit may be in the tree so I 
> could investigate for myself. For SNGs I tried the method that now works but 
> I am still at a loss as to (can't find) what changes in the tree caused it to 
> fail.

Without having your code it is virtually impossible to say why you have
a bug. And do not express your frustration by saying 'zero people
responded to my bug report'. This was not a bug report at all, but an
empty message along the lines of 'my code stopped working after some
network changes, which broke the stuff'.

Now in 2.6.22 and later kernels you must use the higher level SOCKET to
make a call to PROTO_OPS then to sendmsg(). e.g., socket->ops->sendmsg().

It was done because of a bug found in inet_sendmsg(), which tried to
autobind a socket when it should not have.

-- 
Evgeniy Polyakov


Re: [4/4] DST: Algorithms used in distributed storage.

2007-12-12 Thread Evgeniy Polyakov
On Wed, Dec 12, 2007 at 12:12:47PM +0300, Dmitry Monakhov ([EMAIL PROTECTED]) wrote:
> On 14:47 Mon 10 Dec , Evgeniy Polyakov wrote:
> > 
> > Algorithms used in distributed storage.
> > Mirror and linear mapping code.
> Hi, I've finally taken a look at your DST solution.
> It seems that your current implementation will not work on nonstandard
> devices, for example software raid0.
> Other comments follow:

> > +static int dst_mirror_process_node_data(struct dst_node *n,
> > +		struct dst_mirror_node_data *ndata, int op)

> > +
> > +	kunmap(cmp->page);
> MINOR_BUG:
> You forgot to unmap the page on the error path, so IMHO it is better to
> move kunmap to the err_out_free_cmp label.

Yep, I will fix this.

> > +	priv = kzalloc(sizeof(struct dst_mirror_priv), GFP_KERNEL);
> > +	if (!priv)
> > +		return -ENOMEM;
> > +
> > +	priv->chunk_num = st->disk_size;
> > +
> > +	priv->chunk = vmalloc(DIV_ROUND_UP(priv->chunk_num, BITS_PER_LONG) * sizeof(long));
> Ohhh. My. I want to add my 500G hdd. Do you really want to
> say that I have to store a 128Mb in-memory object for this?

Right now yes. There was code which used a single bit for bigger
data units, but I dropped it because of resync troubles (i.e. when
one single sector has been updated, the whole block has to be
resynced). I cannot say which case is better though.

> > +	dprintk("%s: start: %llu, size: %llu/%u, bio: %p, req: %p, "
> > +			"node: %p.\n",
> > +			__func__, req->start, req->size, nr_pages, bio,
> > +			req, req->node);
> > +
> > +	err = n->st->queue->make_request_fn(n->st->queue, bio);
> Why direct make_request_fn instead of generic_make_request?

generic_make_request() will queue the bio in this case,
so I call request_fn directly.

> > +	for (i = 0; i < DIV_ROUND_UP(priv->chunk_num, BITS_PER_LONG); ++i) {
> > +		int bit, num, start;
> > +		unsigned long word = priv->chunk[i];
> > +
> > +		if (!word)
> > +			continue;
> > +
> > +		num = 0;
> > +		start = -1;
> > +		while (word && num < BITS_PER_LONG) {
> > +			bit = __ffs(word);
> > +			if (start == -1)
> > +				start = bit;
> > +			num++;
> MINOR_BUG: Seems you have mistyped here. AFAIU @num represents the position
> of the last non-zero bit (start + num == last_non_zero_bit_pos)
> 			if (start == -1) {
> 				start = bit;
> 				num = 1;
> 			} else
> 				num += bit;

Yes, you are right of course.
Since I shift word by more than a single bit, @num has to be updated
accordingly.

> > +			word >>= (bit+1);

Dmitry, thanks a lot for the comments. I will fix the issues you pointed
out in the next release, although the bitmap case will stay open for a while.

-- 
Evgeniy Polyakov


[0/4] DST: Distributed storage.

2007-12-10 Thread Evgeniy Polyakov

Distributed storage.

I'm pleased to announce the 11th release of the distributed
storage subsystem (DST). This is a maintenance release and includes
bug fixes and simple feature extensions only.

DST allows forming a storage on top of local and remote nodes
and combining them into a linear or mirroring setup, which in
turn can be exported to remote nodes.

Short changelog:
 * wake up state when mirror detects an error to speed up reconnect
 * if connecting in csum mode to no-csum server, do not enable csums
 * do not clean queue until all users are removed
 * allow to increase size of the storage in linear add callback
   (with this change it is possible to add nodes into linear array
   in real time without stopping storage. Filesystem has to be prepared
   for the case when underlying device has changed its size.
   Real-time addition of mirror nodes is also supported)
 * allow to delete gendisk only after device was started
 * dst debug config option
 * Name: Gamardjoba, genacvale! ('Hi friend' in Georgian)

Great thanks to Matthew Hodgson [EMAIL PROTECTED] for debugging!

Overall list of features of the DST can be found on project's homepage:

http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]




[1/4] DST: Distributed storage documentation.

2007-12-10 Thread Evgeniy Polyakov

Distributed storage documentation.

Algorithms used in the system, userspace interfaces
(sysfs dirs and files), design and implementation details
are described here.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 000..1437a6a
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has own start and size,
+are placed into storage by appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+Simple approach of concatenating storages into single device with
+increased size is used in this algorithm. Essentially new device
+has size equal to sum of sizes of underlying nodes and nodes are
+placed one after another.
+
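+  /---- Node 1 ----\                    /---- Node 3 ----\
+start             end                 start              end
+ |==================||================||==================|
+ |                 start             end                  |
+ |                   \---- Node 2 ----/                   |
+ |                                                        |
+start                                                    end
+ \--------------------- DST storage ----------------------/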
+
+   /\
+   ||
+   ||
+
+  IO operations
+
+   Figure 1. 
+ 3 nodes combined into single storage using linear algorithm.
+
+Mirror algorithm.
+In this algorithm nodes are placed under each other, so when
+operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node - the algorithm keeps track of previous operation
+and knows where it was stopped, so that subsequent seek to the 
+start of the new request will take the shortest time.
+Writing is always mirrored to all underlying nodes.
+
+  IO operations
+   ||
+   ||
+   \/
+
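+|------------- DST storage --------------|
+|             prev position              |
+|---|----------- Node 1 -----------------|
+|             prev pos                   |
+|----------- Node 2 -----------|---------|
+|          prev pos                      |
+|---|----------- Node 3 -----------------|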
+
+   Figure 2.
+   3 nodes combined into single storage using mirror algorithm.
+
+Each algorithm must implement number of callbacks,
+which must be registered during initialization time.
+
+struct dst_alg_ops
+{
+   int (*add_node)(struct dst_node *n);
+   void(*del_node)(struct dst_node *n);
+   int (*remap)(struct dst_request *req);
+   int (*error)(struct kst_state *state, int err);
+   struct module   *owner;
+};
+
+@add_node
+This callback is invoked when new node is being added into the storage,
+but before node is actually added into the storage, so that it could
+be accessed from it. When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to remote node or got a reference to the local block device). At this
+stage algorithm can add node into private map. 
+It must return zero on success or negative value otherwise.
+
+@del_node
+This callback is invoked when node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+It does not return a value.
+
+@remap
+This callback is invoked each time new bio hits the storage.
+Request structure contains BIO itself, pointer to the node, which originally
+stores the whole region under given IO request, and various parameters
+used by storage core to process this block request.
+It must return zero on success or negative value otherwise. It is up to
+this method to call all cleaning if remapping failed, for example it must
+call kst_bio_endio() for given callback in case of error, which in turn
+will call bio_endio(). Note, that dst_request structure provided in this
+callback is allocated on stack, so if there is a need to use it outside
+of the given function, it must be cloned (it will happen automatically
+in state's push callback, but that copy will not be shared by any other
+user).
+
+@error
+This callback is invoked for each error, which happened when processed

[3/4] DST: Network state machine.

2007-12-10 Thread Evgeniy Polyakov

Network state machine.

Includes network async processing state machine and related tasks.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 000..8fa3387
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1513 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+	poll_table		pt;
+	struct kst_state	*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+		int type, int proto, int backlog)
+{
+	int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+
+	err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+	if (st) {
+		struct kst_worker *w = st->node->w;
+		unsigned long flags;
+
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (list_empty(&st->ready_entry))
+			list_add_tail(&st->ready_entry, &w->ready_list);
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		wake_up(&w->wait);
+	}
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct kst_state *st = container_of(wait, struct kst_state, wait);
+	kst_wake(st);
+	return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+		poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+	}
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+	struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+	return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+	return req;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+	if (unlikely(req->flags & DST_REQ_CHECK_QUEUE)) {
+		struct dst_request *r;
+
+		list_for_each_entry(r, &st->request_list, request_list_entry) {
+			if (bio_rw(r->bio) != bio_rw(req->bio))
+				continue;
+
+			if (r->start >= req->start + req->size)
+				continue

[4/4] DST: Algorithms used in distributed storage.

2007-12-10 Thread Evgeniy Polyakov

Algorithms used in distributed storage.
Mirror and linear mapping code.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 000..9dc0976
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,105 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+   struct dst_storage *st = n->st;
+
+   dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+           __func__, st->disk_size, n->size);
+
+   mutex_lock(&st->tree_lock);
+   n->start = st->disk_size;
+   st->disk_size += n->size;
+   set_capacity(st->disk, st->disk_size);
+   mutex_unlock(&st->tree_lock);
+
+   return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+   int err;
+
+   if (req->node->bdev) {
+       generic_make_request(req->bio);
+       return 0;
+   }
+
+   err = kst_check_permissions(req->state, req->bio);
+   if (err)
+       return err;
+
+   return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+   if (err)
+       set_bit(DST_NODE_FROZEN, &st->node->flags);
+   else
+       clear_bit(DST_NODE_FROZEN, &st->node->flags);
+   return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+   .remap  = dst_linear_remap,
+   .add_node   = dst_linear_add_node,
+   .del_node   = dst_linear_del_node,
+   .error  = dst_linear_error,
+   .owner  = THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+   alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+   if (!alg_linear)
+       return -ENOMEM;
+
+   return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+   dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov [EMAIL PROTECTED]");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 000..3c457ff
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1128 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+
+struct dst_mirror_node_data
+{
+   u64 age;
+};
+
+struct dst_mirror_priv
+{
+   unsigned int    chunk_num;
+
+   u64             last_start;
+
+   spinlock_t      backlog_lock;
+   struct list_head backlog_list;
+
+   struct dst_mirror_node_data old_data, new_data;
+
+   unsigned long   *chunk;
+};
+
+static struct dst_alg *alg_mirror;
+static struct bio_set *dst_mirror_bio_set;
+
+static int dst_mirror_resync(struct dst_node *n, int ndp);
+
+static void dst_mirror_mark_sync(struct dst_node *n)
+{
+   if (test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+       struct dst_mirror_priv *priv = n->priv;
+
+       clear_bit(DST_NODE_NOTSYNC, &n->flags);
+       dprintk("%s: node: %p, %llu:%llu synchronization "
+               "has been completed.\n",
+               __func__, n, n->start, n->size);
+       priv->old_data.age = 0

[2/4] DST: Core distributed storage files.

2007-12-10 Thread Evgeniy Polyakov

Core distributed storage files.
Include userspace interfaces, initialization,
block layer bindings and other core functionality.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
This driver provides Support for ATA over Ethernet block
devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)  += sx8.o
 obj-$(CONFIG_BLK_DEV_UB)   += ub.o
 
+obj-$(CONFIG_DST)  += dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 000..e91f8ed
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,28 @@
+config DST
+   tristate "Distributed storage"
+   depends on NET
+   select CONNECTOR
+   select LIBCRC32C
+   ---help---
+     This driver allows creating a distributed storage.
+
+config DST_DEBUG
+   bool "DST debug"
+   depends on DST
+   ---help---
+     This option will turn on HEAVY debugging of the DST.
+     Turn it on ONLY if you have to debug some really obscure problem.
+
+config DST_ALG_LINEAR
+   tristate "Linear distribution algorithm"
+   depends on DST
+   ---help---
+     This module allows creating a linear mapping of the nodes
+     in the distributed storage.
+
+config DST_ALG_MIRROR
+   tristate "Mirror distribution algorithm"
+   depends on DST
+   ---help---
+     This module allows creating a mirror of the nodes in the
+     distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 000..17a5e61
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1631 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/connector.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL };
+
+struct kmem_cache *dst_request_cache;
+
+static char dst_name[] = "Gamardjoba, genacvale!";
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-800/clean
+ * /sys/devices/storage/n-800/dirty
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/n-0/clean
+ * /sys/devices/storage/n-0/dirty
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+   return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+   .name    = "dst",
+   .match   = dst_dev_match,
+};
+
+static struct device dst_dev = {
+   .bus     = &dst_dev_bus_type,
+   .release = dst_dev_release
+};
+
+static void dst_node_release(struct device *dev)
+{
+}
+
+static

Re: [1/4] DST: Distributed storage documentation.

2007-12-10 Thread Evgeniy Polyakov
On Mon, Dec 10, 2007 at 01:51:43PM +0100, Kay Sievers ([EMAIL PROTECTED]) wrote:
> On Dec 10, 2007 12:47 PM, Evgeniy Polyakov [EMAIL PROTECTED] wrote:
> > diff --git a/Documentation/dst/sysfs.txt b/Documentation/dst/sysfs.txt
> > new file mode 100644
> > index 000..79d79dc
> > --- /dev/null
> > +++ b/Documentation/dst/sysfs.txt
> > @@ -0,0 +1,30 @@
> > +This file describes sysfs files created for each storage.
> > +
> > +1. Per-storage files.
> > +Each storage has its own dir /sysfs/devices/$storage_name,
>
> It's always /sys/devices/.

I meant that for each new device, it will be placed into
/sys/devices/its_name, but it can also be accessed via
/sys/bus/dst/devices/

> > +which contains following files:
> > +
> > +alg - contains name of the algorithm used to create given storage
> > +name - name of the storage
> > +nodes - map of the storage (list of nodes and their sizes and starts)
> > +remove_all_nodes - writable file which allows to remove all nodes from given
> > +   storage
> > +n-$start-$cookie - per node directory, where
> > +   $start - start of the given node in sectors,
> > +   $cookie - unique node's id used by DST
> > +
> > +2. Per-node files.
> > +Node's files are located in /sysfs/devices/$storage_name/n-$start-$cookie
> > +directory, described above.
>
> To which class or bus do the devices you create belong? Care to show a
> tree or ls -la of the device?

It is 'dst' bus.

uganda:~/codes# ls -la /sys/devices/staorge/
total 0
drwxr-xr-x 4 root root    0 2007-12-10 11:46 .
drwxr-xr-x 9 root root    0 2007-12-10 11:46 ..
-r--r--r-- 1 root root 4096 2007-12-10 11:46 alg
lrwxrwxrwx 1 root root    0 2007-12-10 11:46 bus -> ../../bus/dst
drwxr-xr-x 3 root root    0 2007-12-10 11:46 n-0-81003e24117
-r--r--r-- 1 root root 4096 2007-12-10 11:46 name
-r--r--r-- 1 root root 4096 2007-12-10 11:46 nodes
drwxr-xr-x 2 root root    0 2007-12-10 11:46 power
-rw-r--r-- 1 root root 4096 2007-12-10 11:46 remove_all_nodes
lrwxrwxrwx 1 root root    0 2007-12-10 11:46 subsystem -> ../../bus/dst
-rw-r--r-- 1 root root 4096 2007-12-10 11:46 uevent
uganda:~/codes# ls -l /sys/bus/dst/
total 0
drwxr-xr-x 2 root root    0 2007-12-10 09:52 devices
drwxr-xr-x 2 root root    0 2007-12-10 09:52 drivers
-rw-r--r-- 1 root root 4096 2007-12-10 11:46 drivers_autoprobe
--w------- 1 root root 4096 2007-12-10 11:46 drivers_probe


> Kay

-- 
Evgeniy Polyakov
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/4] DST: Distributed storage documentation.

2007-12-10 Thread Evgeniy Polyakov
On Mon, Dec 10, 2007 at 03:31:48PM +0100, Kay Sievers ([EMAIL PROTECTED]) wrote:
> > I meant that for each new device, it will be placed into
> > /sys/devices/its_name, but it can also be accessed via
> > /sys/bus/dst/devices/
>
> Still, it looks like a path. :)
>
> Please don't reference any device directly with a /sys/devices/ path.
> You have to use the subsystem links to the devices
> in /sys/bus/dst/devices/. Devices are free to move around
> in /sys/devices, even during runtime. Yours don't do, but anyway, please
> remove all mentioning of direct access to /sys/devices/.

Ok, I will update documentation to reference /sys/bus/dst/devices
instead of /sys/devices
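
As a minimal sketch of the convention agreed on here (not code from the DST patches), a userspace tool could build the subsystem path of a device instead of hard-coding /sys/devices/. The helper name sysfs_bus_device_path() and its return convention are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the subsystem path of a device on a given bus, for example
 * /sys/bus/dst/devices/storage.
 * Returns 0 on success, -1 if the result does not fit into the buffer.
 * Hypothetical helper, not part of any kernel or libc API.
 */
int sysfs_bus_device_path(char *buf, size_t len,
                          const char *bus, const char *name)
{
    int n;

    assert(buf && bus && name);

    n = snprintf(buf, len, "/sys/bus/%s/devices/%s", bus, name);

    /* snprintf() reports the length it wanted; reject truncated paths. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Called with bus "dst" and device name "testing", this yields the /sys/bus/dst/devices/testing form mentioned in this thread; the /sys/devices/ location is never spelled out by the caller.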

> Btw, where is the top-level /sys/devices/storage/ coming from? I don't
> see that in the code. We don't accept any new virtual parents here.
>
> Your devices will automatically appear in /sys/devices/virtual/dst/, and
> not below your own parent. But that path does not matter anyway, because
> you should only access them from the /sys/bus/dst/devices/ directory.
>
> And in general please don't claim generic names like storage in any
> namespace for a very specific subsystem like this.

It is not a parent - it is an example for a device called 'storage'; if it
were called 'testing', the path would be /sys/devices/testing or, more
correctly, /sys/bus/dst/devices/testing :)

> > It is 'dst' bus.
> >
> > uganda:~/codes# ls -la /sys/devices/staorge/
> > total 0
> > drwxr-xr-x 4 root root    0 2007-12-10 11:46 .
> > drwxr-xr-x 9 root root    0 2007-12-10 11:46 ..
> > -r--r--r-- 1 root root 4096 2007-12-10 11:46 alg
> > lrwxrwxrwx 1 root root    0 2007-12-10 11:46 bus -> ../../bus/dst
> > drwxr-xr-x 3 root root    0 2007-12-10 11:46 n-0-81003e24117
> > -r--r--r-- 1 root root 4096 2007-12-10 11:46 name
> > -r--r--r-- 1 root root 4096 2007-12-10 11:46 nodes
> > drwxr-xr-x 2 root root    0 2007-12-10 11:46 power
> > -rw-r--r-- 1 root root 4096 2007-12-10 11:46 remove_all_nodes
> > lrwxrwxrwx 1 root root    0 2007-12-10 11:46 subsystem -> ../../bus/dst
> > -rw-r--r-- 1 root root 4096 2007-12-10 11:46 uevent
>
> Ok, how does:
>   ls -l /sys/devices/storage/n-0-81003e24117
> look?

uganda:~/codes# ls -l /sys/devices/storage/n-0-81003ebc220/
total 0
drwxr-xr-x 2 root root    0 2007-12-10 13:23 power
-r--r--r-- 1 root root 4096 2007-12-10 13:30 size
-r--r--r-- 1 root root 4096 2007-12-10 13:30 start
-r--r--r-- 1 root root 4096 2007-12-10 13:30 type
-rw-r--r-- 1 root root 4096 2007-12-10 13:30 uevent


> > uganda:~/codes# ls -l /sys/bus/dst/
> > total 0
> > drwxr-xr-x 2 root root    0 2007-12-10 09:52 devices
> > drwxr-xr-x 2 root root    0 2007-12-10 09:52 drivers
> > -rw-r--r-- 1 root root 4096 2007-12-10 11:46 drivers_autoprobe
> > --w------- 1 root root 4096 2007-12-10 11:46 drivers_probe
>
> How does:
>   ls -l /sys/bus/dst/devices
> look?

uganda:~/codes# ls -la /sys/bus/dst/devices/
total 0
drwxr-xr-x 2 root root 0 2007-12-10 13:30 .
drwxr-xr-x 4 root root 0 2007-12-10 13:22 ..
lrwxrwxrwx 1 root root 0 2007-12-10 13:30 storage -> ../../../devices/storage


Here 'storage' is just a name for device called 'storage', it can be
anything else.
 
> Further questions:
> Why do you do your own refcounting instead of using kref?

That's because I always used atomic operations as reference counters
and did not try krefs :)
They are the same actually (modulo tricky arches where smp_mb_* are
required), so I can replace them in the next release.

> Why don't you use groups for the attributes?

For 3-4 attributes it is faster to register them in a loop than typing
another structure :)

> Why don't you use default attributes for the device, where you get all
> error handling done by the core.

What are 'default attributes' and for which devices?
All my sysfs files are so trivial that they do not need anything
special, and I do not see what error handling you mean.

-- 
Evgeniy Polyakov


Re: [1/4] DST: Distributed storage documentation.

2007-12-10 Thread Evgeniy Polyakov
On Mon, Dec 10, 2007 at 05:50:55PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> > Further questions:
> > Why do you do your own refcounting instead of using kref?
>
> That's because I always used atomic operations as reference counters
> and did not try krefs :)
> They are the same actually (modulo tricky arches where smp_mb_* are
> required), so I can replace them in the next release.

Actually not - I have to set the reference counter to something other than 1
or +/- 1, and thus will have to call kref_get() in a loop, which is a
very ugly step. Is there kref_set() or something like that? At least not
in 2.6.22, which I'm using for now.

Sigh, I've converted most of the DST already...

-- 
Evgeniy Polyakov


Re: [1/4] DST: Distributed storage documentation.

2007-12-10 Thread Evgeniy Polyakov
On Mon, Dec 10, 2007 at 08:02:28PM +0100, Kay Sievers ([EMAIL PROTECTED]) wrote:
> > uganda:~/codes# ls -l /sys/devices/storage/n-0-81003ebc220/
> > total 0
> > drwxr-xr-x 2 root root    0 2007-12-10 13:23 power
> > -r--r--r-- 1 root root 4096 2007-12-10 13:30 size
> > -r--r--r-- 1 root root 4096 2007-12-10 13:30 start
> > -r--r--r-- 1 root root 4096 2007-12-10 13:30 type
> > -rw-r--r-- 1 root root 4096 2007-12-10 13:30 uevent
>
> This is a struct device instance without a subsystem (bus/class),
> right? It will not send an uevent to userspace. Is that intended? Why
> don't you add them all to the dst bus?

I created dst bus for storage devices only, nodes are very different
objects, and actually they do not need any events from above, but I need
to put some attributes somewhere, so it is 'empty' device.

> > Actually not - I have to set the reference counter to something other than 1
> > or +/- 1, and thus will have to call kref_get() in a loop, which is a
> > very ugly step. Is there kref_set() or something like that? At least not
> > in 2.6.22, which I'm using for now.
>
> Yeah, a loop would look pretty ugly. How about just adding kref_set(),
> if you need it.

Well, then distributed storage will not be able to build as a
standalone module, and kref_set() itself will not be accepted as a single
patch, since there are no in-kernel users :)
It is easily doable though.

> > > Why don't you use groups for the attributes?
> >
> > For 3-4 attributes it is faster to register them in a loop than typing
> > another structure :)
>
> Yeah, but if you would need to recover from an error when the creation
> of a file fails, a group would do the proper rollback.

I do not care about such errors - if there is such an error for a file,
which exports information about type of the node (i.e. string L or R)
or some other very meaningful info, then system has enough to care about
instead of this, so dst does not do anything special - it ignores such
errors :)

On exit path it will be checked and removed correctly.
If there will be additional sysfs files, I think group is a good way to
implement them.
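
The rollback behaviour Kay describes can be modelled in plain C. This is only a userspace sketch: create_group_model() and the model_* hooks are hypothetical stand-ins for the sysfs create/remove calls, not kernel API, and it shows the all-or-nothing idea rather than how the driver core implements it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-file hooks; they stand in for sysfs create/remove calls. */
struct file_ops_model {
    int  (*create)(const char *name);   /* returns 0 on success */
    void (*remove)(const char *name);
};

/*
 * Create every file in names[]. If one creation fails, remove the files
 * that were already created (in reverse order) and return the error, so
 * the caller never sees a half-populated directory.
 */
int create_group_model(const struct file_ops_model *ops,
                       const char *const *names, size_t count)
{
    size_t i;
    int err;

    assert(ops && ops->create && ops->remove);

    for (i = 0; i < count; i++) {
        err = ops->create(names[i]);
        if (err) {
            while (i-- > 0)
                ops->remove(names[i]);
            return err;
        }
    }
    return 0;
}

/* Demo hooks: count live files and fail on one chosen name. */
int model_live_files;
const char *model_fail_name;

int model_create(const char *name)
{
    if (model_fail_name && strcmp(name, model_fail_name) == 0)
        return -1;
    model_live_files++;
    return 0;
}

void model_remove(const char *name)
{
    (void)name;
    model_live_files--;
}
```

With a plain registration loop the same failure would leave the earlier files behind unless the caller tracks them by hand, which is the trade-off being discussed above.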

> > > Why don't you use default attributes for the device, where you get all
> > > error handling done by the core.
> >
> > What are 'default attributes' and for which devices?
> > All my sysfs files are so trivial that they do not need anything
> > special, and I do not see what error handling you mean.
>
> If all devices of a subsystem (bus/class) are of the same type, you can
> set a default array of attributes in the struct bus/class to be
> created at every device. If you have multiple types of devices in the
> same subsystem (bus/class) you can assign the device_type, which
> has the default attribute group.
> That way the core will create the files before the event is sent out to
> userspace, and the files can be accessed from the event itself. Not sure
> if that is needed for dst.

Ok, I see.

DST right now has 3 types of files - storage files, it is common for
every storage device; node files, which are the same for every node; and
per-algorithm private devices - they can be different (actually only
mirroring algorithm exports something to userspace).

I think it is possible to use default attributes for storage devices,
but node device does not have a bus/class, so they will be untouched.

> Thanks,
> Kay

-- 
Evgeniy Polyakov


Re: [1/4] DST: Distributed storage documentation.

2007-12-10 Thread Evgeniy Polyakov
On Mon, Dec 10, 2007 at 08:44:55PM +0100, Kay Sievers ([EMAIL PROTECTED]) wrote:
> > > > Actually not - I have to set the reference counter to something other than 1
> > > > or +/- 1, and thus will have to call kref_get() in a loop, which is a
> > > > very ugly step. Is there kref_set() or something like that? At least not
> > > > in 2.6.22, which I'm using for now.
> > >
> > > Yeah, a loop would look pretty ugly. How about just adding kref_set(),
> > > if you need it.
> >
> > Well, then distributed storage will not be able to build as a
> > standalone module, and kref_set() itself will not be accepted as a single
> > patch, since there are no in-kernel users :)
> > It is easily doable though.
>
> Most rules have exceptions. :) Send a patch, so we can see how it looks
> like.

It looks really non-trivial :)

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/include/linux/kref.h b/include/linux/kref.h
index 6fee353..5d18563 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -24,6 +24,7 @@ struct kref {
    atomic_t refcount;
 };
 
+void kref_set(struct kref *kref, int num);
 void kref_init(struct kref *kref);
 void kref_get(struct kref *kref);
 int kref_put(struct kref *kref, void (*release) (struct kref *kref));
diff --git a/lib/kref.c b/lib/kref.c
index a6dc3ec..40aa9f9 100644
--- a/lib/kref.c
+++ b/lib/kref.c
@@ -15,13 +15,23 @@
 #include <linux/module.h>
 
 /**
+ * kref_set - initialize object and set refcount to requested number.
+ * @kref: object in question.
+ * @num: initial reference counter
+ */
+void kref_set(struct kref *kref, int num)
+{
+   atomic_set(&kref->refcount, num);
+   smp_mb();
+}
+
+/**
  * kref_init - initialize object.
  * @kref: object in question.
  */
 void kref_init(struct kref *kref)
 {
-   atomic_set(&kref->refcount,1);
-   smp_mb();
+   kref_set(kref, 1);
 }
 
 /**

-- 
Evgeniy Polyakov


Re: [1/4] DST: Distributed storage documentation.

2007-12-10 Thread Evgeniy Polyakov
On Mon, Dec 10, 2007 at 08:56:49PM +0100, Kay Sievers ([EMAIL PROTECTED]) wrote:
> On Mon, 2007-12-10 at 22:51 +0300, Evgeniy Polyakov wrote:
> > On Mon, Dec 10, 2007 at 08:44:55PM +0100, Kay Sievers ([EMAIL PROTECTED]) wrote:
> > > > > > Actually not - I have to set the reference counter to something other than 1
> > > > > > or +/- 1, and thus will have to call kref_get() in a loop, which is a
> > > > > > very ugly step. Is there kref_set() or something like that? At least not
> > > > > > in 2.6.22, which I'm using for now.
> > > > >
> > > > > Yeah, a loop would look pretty ugly. How about just adding kref_set(),
> > > > > if you need it.
> > > >
> > > > Well, then distributed storage will not be able to build as a
> > > > standalone module, and kref_set() itself will not be accepted as a single
> > > > patch, since there are no in-kernel users :)
> > > > It is easily doable though.
> > >
> > > Most rules have exceptions. :) Send a patch, so we can see how it looks
> > > like.
> >
> > It looks really non-trivial :)
>
> Yeah, it does. :)
> We miss an EXPORT_SYMBOL(), right?

Yep :)

diff --git a/include/linux/kref.h b/include/linux/kref.h
index 6fee353..5d18563 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -24,6 +24,7 @@ struct kref {
    atomic_t refcount;
 };
 
+void kref_set(struct kref *kref, int num);
 void kref_init(struct kref *kref);
 void kref_get(struct kref *kref);
 int kref_put(struct kref *kref, void (*release) (struct kref *kref));
diff --git a/lib/kref.c b/lib/kref.c
index a6dc3ec..9ecd6e8 100644
--- a/lib/kref.c
+++ b/lib/kref.c
@@ -15,13 +15,23 @@
 #include <linux/module.h>
 
 /**
+ * kref_set - initialize object and set refcount to requested number.
+ * @kref: object in question.
+ * @num: initial reference counter
+ */
+void kref_set(struct kref *kref, int num)
+{
+   atomic_set(&kref->refcount, num);
+   smp_mb();
+}
+
+/**
  * kref_init - initialize object.
  * @kref: object in question.
  */
 void kref_init(struct kref *kref)
 {
-   atomic_set(&kref->refcount,1);
-   smp_mb();
+   kref_set(kref, 1);
 }
 
 /**
@@ -61,6 +71,7 @@ int kref_put(struct kref *kref, void (*release)(struct kref *kref))
    return 0;
 }
 
+EXPORT_SYMBOL(kref_set);
 EXPORT_SYMBOL(kref_init);
 EXPORT_SYMBOL(kref_get);
 EXPORT_SYMBOL(kref_put);
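
To see why a setter avoids the kref_get() loop, here is a userspace model of the patched interface built on C11 atomics. All kref_model names are hypothetical; the kernel version's explicit smp_mb() after atomic_set() is not reproduced, the sketch simply relies on the default sequentially consistent atomics.

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace stand-in for struct kref; this is not the kernel type. */
struct kref_model {
    atomic_int refcount;
};

/* Counterpart of the proposed kref_set(): store an arbitrary initial count. */
void kref_model_set(struct kref_model *k, int num)
{
    assert(k && num > 0);
    atomic_store(&k->refcount, num);
}

/* Counterpart of kref_init(): a fresh object starts with one reference. */
void kref_model_init(struct kref_model *k)
{
    kref_model_set(k, 1);
}

/* Take one more reference. */
void kref_model_get(struct kref_model *k)
{
    atomic_fetch_add(&k->refcount, 1);
}

/*
 * Drop one reference. Returns 1 if this was the last reference and the
 * release callback ran, 0 otherwise.
 */
int kref_model_put(struct kref_model *k, void (*release)(struct kref_model *))
{
    if (atomic_fetch_sub(&k->refcount, 1) == 1) {
        release(k);
        return 1;
    }
    return 0;
}

/* Example release callback: only counts how often it was invoked. */
int kref_model_released;

void kref_model_count_release(struct kref_model *k)
{
    (void)k;
    kref_model_released++;
}
```

An object that must start with three holders is set up with one kref_model_set(&k, 3) call rather than an init followed by two gets, and the release callback still runs exactly once, on the third put.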

-- 
Evgeniy Polyakov


Re: TCP event tracking via netlink...

2007-12-06 Thread Evgeniy Polyakov
On Wed, Dec 05, 2007 at 09:03:43PM -0800, David Miller ([EMAIL PROTECTED]) wrote:
> I think this work is very different.
>
> When I say state I mean something more significant than
> CLOSE, ESTABLISHED, etc. which is what Samir's patches are
> tracking.
>
> I'm talking about all of the sequence numbers, SACK information,
> congestion control knobs, etc. whose values are nearly impossible to
> track on a packet to packet basis in order to diagnose problems.

I pointed to that work as a possible basis for collecting more info if you
need it, including sequence numbers, window sizes and so on.
It just requires a useful structure layout to be put in place, so that one
does not have to recreate the same bits again and it can be called from
any place inside the stack.

-- 
Evgeniy Polyakov


Re: TCP event tracking via netlink...

2007-12-05 Thread Evgeniy Polyakov
Hi.

On Wed, Dec 05, 2007 at 09:11:01AM -0500, John Heffner ([EMAIL PROTECTED]) wrote:
> > Maybe if we want to get really fancy we can have some more-expensive
> > debug mode where detailed specific events get generated via some
> > macros we can scatter all over the place.  This won't be useful
> > for general user problem analysis, but it will be excellent for
> > developers.
> >
> > Let me know if you think this is useful enough and I'll work on
> > an implementation we can start playing with.
>
>
> FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
> http://caia.swin.edu.au/urp/newtcp/tools.html
> http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

And even more similar to this patch from Samir Bellabes of Mandriva:
http://lwn.net/Articles/202255/

-- 
Evgeniy Polyakov


[4/4] DST: Algorithms used in distributed storage.

2007-12-04 Thread Evgeniy Polyakov

Algorithms used in distributed storage.
Mirror and linear mapping code.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 000..cb77b57
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,104 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+   struct dst_storage *st = n->st;
+
+   dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+           __func__, st->disk_size, n->size);
+
+   mutex_lock(&st->tree_lock);
+   n->start = st->disk_size;
+   st->disk_size += n->size;
+   mutex_unlock(&st->tree_lock);
+
+   return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+   int err;
+
+   if (req->node->bdev) {
+       generic_make_request(req->bio);
+       return 0;
+   }
+
+   err = kst_check_permissions(req->state, req->bio);
+   if (err)
+       return err;
+
+   return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+   if (err)
+       set_bit(DST_NODE_FROZEN, &st->node->flags);
+   else
+       clear_bit(DST_NODE_FROZEN, &st->node->flags);
+   return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+   .remap  = dst_linear_remap,
+   .add_node   = dst_linear_add_node,
+   .del_node   = dst_linear_del_node,
+   .error  = dst_linear_error,
+   .owner  = THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+   alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+   if (!alg_linear)
+       return -ENOMEM;
+
+   return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+   dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov [EMAIL PROTECTED]");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 000..11a6169
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1122 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+
+struct dst_mirror_node_data
+{
+   u64 age;
+};
+
+struct dst_mirror_priv
+{
+   unsigned int    chunk_num;
+
+   u64             last_start;
+
+   spinlock_t      backlog_lock;
+   struct list_head backlog_list;
+
+   struct dst_mirror_node_data old_data, new_data;
+
+   unsigned long   *chunk;
+};
+
+static struct dst_alg *alg_mirror;
+static struct bio_set *dst_mirror_bio_set;
+
+static int dst_mirror_resync(struct dst_node *n, int ndp);
+
+static void dst_mirror_mark_sync(struct dst_node *n)
+{
+   if (test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+       struct dst_mirror_priv *priv = n->priv;
+
+       clear_bit(DST_NODE_NOTSYNC, &n->flags);
+       dprintk("%s: node: %p, %llu:%llu synchronization "
+               "has been completed.\n",
+               __func__, n, n->start, n->size);
+       priv->old_data.age = 0;
+   }
+}
+
+static void dst_mirror_mark_notsync(struct

[0/4] DST: Distributed storage.

2007-12-04 Thread Evgeniy Polyakov

Distributed storage.

I'm pleased to announce the 10th release of the distributed
storage subsystem (DST). This is a maintenance release and includes
bug fixes and simple feature extensions only.

DST allows to form a storage on top of local and remote nodes
and combine them into linear or mirroring setup, which in
turn can be exported to remote nodes.
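
As a rough illustration of the two setups mentioned above (this is not
code from the patchset), the sketch below shows how a logical sector
could be mapped onto a node in the linear case and how a read target
could be picked in the mirror case, following the description in
Documentation/dst/algorithms.txt. The struct sketch_node type and both
function names are hypothetical and exist only for this example; the
real DST structures and remap callbacks differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified node descriptor; not the real struct dst_node. */
struct sketch_node {
	uint64_t start;    /* first logical sector covered by this node */
	uint64_t size;     /* number of sectors in this node */
	uint64_t last_pos; /* sector where the previous operation stopped */
};

/*
 * Linear setup: nodes are laid out one after another, so a logical sector
 * belongs to the node whose [start, start + size) range contains it.
 * Returns the node index and stores the node-local sector in *local,
 * or returns -1 if the sector lies outside the storage.
 */
int sketch_linear_remap(const struct sketch_node *nodes, size_t num,
			uint64_t sector, uint64_t *local)
{
	size_t i;

	assert(local != NULL);

	for (i = 0; i < num; i++) {
		if (sector >= nodes[i].start &&
		    sector - nodes[i].start < nodes[i].size) {
			*local = sector - nodes[i].start;
			return (int)i;
		}
	}
	return -1;
}

/*
 * Mirror setup: every node holds the same data, so a read can go to the
 * node whose previous position is closest to the requested sector.
 * Returns the node index, or -1 if there are no nodes.
 */
int sketch_mirror_pick_reader(const struct sketch_node *nodes, size_t num,
			      uint64_t sector)
{
	uint64_t best = UINT64_MAX;
	int idx = -1;
	size_t i;

	for (i = 0; i < num; i++) {
		uint64_t pos = nodes[i].last_pos;
		uint64_t dist = pos > sector ? pos - sector : sector - pos;

		if (idx < 0 || dist < best) {
			best = dist;
			idx = (int)i;
		}
	}
	return idx;
}
```

For two nodes of 800 sectors each, placed at sectors 0 and 800,
sketch_linear_remap() maps logical sector 900 to node 1, local sector
100. Writes in the mirror case are not shown, since they simply go to
every node.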

Short changelog:
 * fixed bug with XFS metadata update (it can provide slab pages to the
   DST, so it is not allowed to transfer them using ->sendpage())
 * fixed async error completion path
 * extended netlink communication channel to report errors back to userspace
 * DST name is now "The 10'th dynasty of smuggled slothes"
 * a number of fixes for userspace DST target

Great thanks to Matthew Hodgson [EMAIL PROTECTED] for debugging and
fixes for userspace DST target and preliminary netlink extension patches.

Overall list of features of the DST can be found on project's homepage:

http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[1/4] DST: Distributed storage documentation.

2007-12-04 Thread Evgeniy Polyakov

Distributed storage documentation.

Algorithms used in the system, userspace interfaces
(sysfs dirs and files), design and implementation details
are described here.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 0000000..1437a6a
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has own start and size,
+are placed into storage by appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+Simple approach of concatenating storages into single device with
+increased size is used in this algorithm. Essentially new device
+has size equal to sum of sizes of underlying nodes and nodes are
+placed one after another.
+
+  /----- Node 1 -----\                   /----- Node 3 -----\
+start                end              start                 end
+ |====================|=================|====================|
+ |                  start              end                   |
+ |                    \---- Node 2 -----/                    |
+ |                                                           |
+start                                                       end
+ \----------------------- DST storage -----------------------/
+
+                              /\
+                              ||
+                              ||
+
+                         IO operations
+
+                           Figure 1.
+ 3 nodes combined into single storage using linear algorithm.
+
+Mirror algorithm.
+In this algorithm nodes are placed under each other, so when an
+operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node - the algorithm keeps track of the previous operation
+and knows where it stopped, so that the subsequent seek to the
+start of the new request takes the shortest time.
+Writing is always mirrored to all underlying nodes.
+
+                  IO operations
+                       ||
+                       ||
+                       \/
+
+     |----------- DST storage -----------|
+     |  prev position                    |
+     |---|----------- Node 1 ------------|
+     |                     prev pos      |
+     |----------- Node 2 ------|---------|
+     |  prev pos                         |
+     |---|----------- Node 3 ------------|
+
+                    Figure 2.
+   3 nodes combined into single storage using mirror algorithm.
+
+Each algorithm must implement number of callbacks,
+which must be registered during initialization time.
+
+struct dst_alg_ops
+{
+	int		(*add_node)(struct dst_node *n);
+	void		(*del_node)(struct dst_node *n);
+	int		(*remap)(struct dst_request *req);
+	int		(*error)(struct kst_state *state, int err);
+	struct module	*owner;
+};
+
+add_node callback.
+This callback is invoked when new node is being added into the storage,
+but before node is actually added into the storage, so that it could
+be accessed from it. When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to remote node or got a reference to the local block device). At this
+stage algorithm can add node into private map. 
+It must return zero on success or negative value otherwise.
+
+del_node callback.
+This callback is invoked when node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+It does not return a value.
+
+remap callback.
+This callback is invoked each time new bio hits the storage.
+Request structure contains BIO itself, pointer to the node, which originally
+stores the whole region under given IO request, and various parameters
+used by storage core to process this block request.
+It must return zero on success or negative value otherwise. It is up to
+this method to perform all cleanup if remapping failed; for example, it must
+call kst_bio_endio() for the given callback in case of error, which in turn
+will call bio_endio(). Note that the dst_request structure provided to this
+callback is allocated on the stack, so if there is a need to use it outside
+of the given function, it must be cloned (this happens automatically
+in the state's push callback, but that copy will not be shared by any other
+user).
+
+error callback.
+This callback is invoked for each error, which happened when processed

[3/4] DST: Network state machine.

2007-12-04 Thread Evgeniy Polyakov

Network state machine.

Includes network async processing state machine and related tasks.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 0000000..8fa3387
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1513 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+	poll_table		pt;
+	struct kst_state	*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+		int type, int proto, int backlog)
+{
+	int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+	if (err)
+		goto err_out_release;
+
+	err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+	if (st) {
+		struct kst_worker *w = st->node->w;
+		unsigned long flags;
+
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (list_empty(&st->ready_entry))
+			list_add_tail(&st->ready_entry, &w->ready_list);
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		wake_up(&w->wait);
+	}
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct kst_state *st = container_of(wait, struct kst_state, wait);
+	kst_wake(st);
+	return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+				 poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+	}
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+	struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+	return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+	return req;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+	if (unlikely(req->flags & DST_REQ_CHECK_QUEUE)) {
+		struct dst_request *r;
+
+		list_for_each_entry(r, &st->request_list, request_list_entry) {
+			if (bio_rw(r->bio) != bio_rw(req->bio))
+				continue;
+
+			if (r->start >= req->start + req->size)
+				continue

[2/4] DST: Core distributed storage files.

2007-12-04 Thread Evgeniy Polyakov

Core distributed storage files.
Include userspace interfaces, initialization,
block layer bindings and other core functionality.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
This driver provides Support for ATA over Ethernet block
devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)  += sx8.o
 obj-$(CONFIG_BLK_DEV_UB)   += ub.o
 
+obj-$(CONFIG_DST)  += dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 0000000..e91f8ed
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,28 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	select CONNECTOR
+	select LIBCRC32C
+	---help---
+	  This driver allows creating a distributed storage.
+
+config DST_DEBUG
+	bool "DST debug"
+	depends on DST
+	---help---
+	  This option will turn on HEAVY debugging of the DST.
+	  Turn it on ONLY if you have to debug some really obscure problem.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	  This module allows creating a linear mapping of the nodes
+	  in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	  This module allows creating a mirror of the nodes in the
+	  distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 0000000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 0000000..4fdad29
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1629 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/connector.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL };
+
+struct kmem_cache *dst_request_cache;
+
+static char dst_name[] = "The 10'th dynasty of smuggled slothes";
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-800/clean
+ * /sys/devices/storage/n-800/dirty
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/n-0/clean
+ * /sys/devices/storage/n-0/dirty
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+   return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+	.name		= "dst",
+	.match		= dst_dev_match,
+};
+
+static struct device dst_dev = {
+	.bus		= &dst_dev_bus_type,
+	.release	= dst_dev_release
+};
+
+static void dst_node_release(struct device *dev

Netchannels. The 22'th century release.

2007-12-04 Thread Evgeniy Polyakov
Hi.

This is the 22nd release of netchannels, a peer-to-peer,
protocol-agnostic communication channel between hardware and users. It
uses a unified cache to store channels and allows allocating data
buffers from a userspace-mapped area or from another preallocated set
of pages (like the VFS cache). All protocol processing happens in
process context.

Users of the system can be, for example, userspace: it can receive
and send traffic from the wire without any kernel interference,
implement its own protocols and offload their processing to the hardware.

This idea was originally proposed and implemented by Van Jacobson.
This patchset (with the userspace network stack) is a logical
continuation of the idea with a move to full peer-to-peer processing.

One of its users is the userspace network stack [2].

Short changelog:
 * update cached route in the netchannel when it expires.

Thanks to Salvatore Del Popolo [EMAIL PROTECTED] for testing.

1. Netchannels homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=netchannel

2. Userspace network stack.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=unetstack

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 2697e92..3231b22 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_netchannel_control
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..d35d4d8 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -718,4 +718,5 @@ ia32_sys_call_table:
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
+   .quad sys_netchannel_control
 ia32_syscall_end:  
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index beeeaf6..33242f8 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_netchannel_control	320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 #include <linux/err.h>
 
 /*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 777288e..16f1aac 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync_file_range)
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_netchannel_control	280
+__SYSCALL(__NR_netchannel_control, sys_netchannel_control)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_netchannel_control
 
 #ifdef __KERNEL__
 #include <linux/err.h>
diff --git a/include/linux/connector.h b/include/linux/connector.h
index 4c02119..bdf6432 100644
--- a/include/linux/connector.h
+++ b/include/linux/connector.h
@@ -36,9 +36,11 @@
 #define CN_VAL_CIFS 0x1
 #define CN_W1_IDX  0x3 /* w1 communication */
 #define CN_W1_VAL  0x1
+#define CN_NETCHANNELS_IDX 0x04 /* Netchannels connection control */
+#define CN_NETCHANNELS_VAL 0x01
 
 
-#define CN_NETLINK_USERS   4
+#define CN_NETLINK_USERS   5
 
 /*
  * Maximum connector's message size.
diff --git a/include/linux/netchannel.h b/include/linux/netchannel.h
new file mode 100644
index 0000000..c56afc5
--- /dev/null
+++ b/include/linux/netchannel.h
@@ -0,0 +1,175 @@
+/*
+ * netchannel.h
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __NETCHANNEL_H
+#define __NETCHANNEL_H
+
+#include <linux/types.h>
+
+enum netchannel_commands {
+   NETCHANNEL_CREATE = 0,
+};
+
+enum netchannel_type {
+   NETCHANNEL_EMPTY = 0,
+   NETCHANNEL_COPY_USER,
+   NETCHANNEL_NAT,
+   NETCHANNEL_MAX
+};
+
+/*
+ * Destination and source addresses/ports are from receiving point of view,
+ * i.e. when packet is being

Re: [0/4] DST: Distributed storage.

2007-12-04 Thread Evgeniy Polyakov
Hi Mike.

On Tue, Dec 04, 2007 at 10:25:29AM -0500, Mike Snitzer ([EMAIL PROTECTED]) wrote:
> Thanks for your continued work on DST.  I'd like to know if you've
> thought further about how synchronous mirroring would be best
> implemented with DST.
>
> You shared your views some time ago via comments on your blog:
> http://tservice.net.ru/~s0mbre/blog/devel/dst/2007_11_05.html
>
> At that time you were saying you'd add a sync bit to the request
> structure that is sent to remote nodes.  I'd imagine this would also
> require ordering of the block io, no?  Is order guaranteed when the
> requests are submitted over the DST protocol?  Otherwise how can you
> ensure a valid remote mirror (in the case of network disconnects,
> etc)?
>
> Guaranteeing consistent data on all members of a mirror is important.
> The main question is: what mechanisms _should_ be used in DST to
> provide this consistency?  And do you have a timeframe for when DST
> might support such mechanisms for consistent data?
>
> For the purpose of this discussion please assume that the disk cache
> is either write-through or battery-backed.

In this case the sync bit would only imply waiting until all pending
requests have reached the remote nodes. This is not implemented yet.
The order of requests for a given node is guaranteed by the DST core;
multiple requests can be performed in parallel for/from
different nodes.

In the more generic case it should wait until data has reached the media,
i.e. perform flushing.
I did not implement that, since actually no multiple-device system in
Linux supports barriers (please note that in this discussion the sync bit
actually means a barrier in the block layer).

Protocol changes are pretty trivial and absolutely transparent for
the DST core: only remote targets (both userspace and kernelspace)
should be changed to invoke the ->issue_flush_fn() callback for the
underlying device when needed and not process new requests until the
flush has completed. Thus the barrier bit can be attached to data
packets, and a barrier can also be a single request without data.

DST will continue to collect data, but will not send it to remote nodes
(actually it can send it, but the data will not be processed and will stay
in the remote's receiving queue). This is the main concern about barriers:
whether or not the main node should continue to process requests if
previous ones have not reached the media yet. That is why I have not
implemented barriers yet.
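
A minimal userspace sketch of the remote-target behaviour described
above, assuming that a barrier request is itself processed and then
holds back everything queued behind it until the flush completes. All
names here (sketch_target, sketch_receive() and so on) are made up for
illustration, and setting the flushing flag merely stands in for
invoking ->issue_flush_fn() on the underlying device. Nothing like this
exists in DST yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_QUEUE_LEN 16

/* Hypothetical request: an id plus the barrier bit from the protocol. */
struct sketch_req {
	int  id;
	bool barrier;
};

/* Hypothetical remote target with a fixed-size circular receive queue. */
struct sketch_target {
	struct sketch_req queue[SKETCH_QUEUE_LEN];
	size_t head;      /* index of the oldest queued request */
	size_t count;     /* number of queued requests */
	bool   flushing;  /* a flush is in progress, hold new requests */
	int    last_done; /* id of the most recently processed request */
};

/* Requests are always accepted into the receive queue while there is room. */
int sketch_receive(struct sketch_target *t, struct sketch_req req)
{
	assert(t != NULL);

	if (t->count == SKETCH_QUEUE_LEN)
		return -1;

	t->queue[(t->head + t->count) % SKETCH_QUEUE_LEN] = req;
	t->count++;
	return 0;
}

/*
 * Process queued requests in order. A request carrying the barrier bit
 * is processed and then starts a flush; nothing behind it is touched
 * until sketch_flush_done() is called. Returns the number of requests
 * processed by this call.
 */
int sketch_process(struct sketch_target *t)
{
	int done = 0;

	assert(t != NULL);

	while (t->count > 0 && !t->flushing) {
		struct sketch_req req = t->queue[t->head];

		t->head = (t->head + 1) % SKETCH_QUEUE_LEN;
		t->count--;
		t->last_done = req.id;
		done++;

		if (req.barrier)
			t->flushing = true; /* stand-in for ->issue_flush_fn() */
	}
	return done;
}

/* Called when the underlying device reports that the flush has completed. */
void sketch_flush_done(struct sketch_target *t)
{
	assert(t != NULL);
	t->flushing = false;
}
```

With requests 1, 2 (barrier) and 3 queued, the first sketch_process()
call handles 1 and 2 and then stops; request 3, and anything received
meanwhile, stays in the receive queue until sketch_flush_done() is
called.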

> regards,
> Mike

-- 
Evgeniy Polyakov


Re: [0/4] DST: Distributed storage.

2007-12-04 Thread Evgeniy Polyakov
On Tue, Dec 04, 2007 at 04:56:26PM +, Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> >  * fixed bug with XFS metadata update (it can provide slab pages to the
> >    DST, so it is not allowed to transfer them using ->sendpage())
>
> xfs hasn't been doing that anymore for quite a while.  Block drivers
> don't need hacks for it anymore, especially as it's not reliably
> detectable.

I use 2.6.22 and it is there; maybe it was changed later.
Right now it can be detected quite trivially, although the check adds a
little bio startup overhead. I just did not know that it was allowed and
thus did not have a check in the DST.

-- 
Evgeniy Polyakov


Re: [1/4] dst: Distributed storage documentation.

2007-12-03 Thread Evgeniy Polyakov
Hi Matt.

On Sun, Dec 02, 2007 at 10:50:59PM -0600, Matt Mackall ([EMAIL PROTECTED]) wrote:
> > Distributed storage documentation.
> >
> > Algorithms used in the system, userspace interfaces
> > (sysfs dirs and files), design and implementation details
> > are described here.
>
> Can you give us a summary of how this differs from using device mapper
> with NBD?

From a high-level point of view it does not, but it operates quite differently:
it processes requests asynchronously and thus does not block, it has a
different protocol with smaller overhead, it supports strong checksums, and
it has an in-kernel export server which supports simple security attributes
(i.e. permission to connect, to read or to write). It uses less memory
(zero additional allocations in the common path for linear mapping,
not including network allocations, and fewer additional allocations in
the mirroring case).
DST supports failure recovery in case of a dropped connection (the core will
reconnect to the remote node when it is ready), so remote nodes can be
turned off and on without special administration steps. DST
has simple autoconfiguration at startup time (checksum support and
storage size autonegotiation). It is possible to turn one of the mirror
nodes off and use it as an offline backup: the dst mirror node stores
its data at the end of the storage, so the node can be mounted locally.

-- 
Evgeniy Polyakov


Re: [Bugme-new] [Bug 9440] New: Problem in joinning a socket to ipv6 multicast address in specific scenario

2007-11-30 Thread Evgeniy Polyakov
On Fri, Nov 30, 2007 at 11:02:19PM +1100, Herbert Xu ([EMAIL PROTECTED]) wrote:
> OK, this looks like a good change.  However, we should also
> change NETDEV_UP as well to recreate idev if it isn't there
> and the MTU is big enough.

Ok, added netdev_up too.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 567664e..e8c3475 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2293,6 +2293,9 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 				break;
 			}
 
+			if (!idev && dev->mtu >= IPV6_MIN_MTU)
+				idev = ipv6_add_dev(dev);
+
 			if (idev)
 				idev->if_flags |= IF_READY;
 		} else {
@@ -2357,12 +2360,18 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 		break;
 
 	case NETDEV_CHANGEMTU:
-		if ( idev && dev->mtu >= IPV6_MIN_MTU) {
+		if (idev && dev->mtu >= IPV6_MIN_MTU) {
 			rt6_mtu_change(dev, dev->mtu);
 			idev->cnf.mtu6 = dev->mtu;
 			break;
 		}
 
+		if (!idev && dev->mtu >= IPV6_MIN_MTU) {
+			idev = ipv6_add_dev(dev);
+			if (idev)
+				break;
+		}
+
 		/* MTU falled under IPV6_MIN_MTU. Stop IPv6 on this interface. */
 
 	case NETDEV_DOWN:

-- 
Evgeniy Polyakov


[3/4] dst: Network state machine.

2007-11-29 Thread Evgeniy Polyakov

Network state machine.

Includes network async processing state machine and related tasks.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 0000000..ba5e5ef
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1475 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+	poll_table		pt;
+	struct kst_state	*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+		int type, int proto, int backlog)
+{
+	int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+	if (err)
+		goto err_out_release;
+
+	err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+	if (st) {
+		struct kst_worker *w = st->node->w;
+		unsigned long flags;
+
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (list_empty(&st->ready_entry))
+			list_add_tail(&st->ready_entry, &w->ready_list);
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		wake_up(&w->wait);
+	}
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct kst_state *st = container_of(wait, struct kst_state, wait);
+	kst_wake(st);
+	return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+				 poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+	}
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+	struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+	return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+	return req;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+	if (unlikely(req->flags & DST_REQ_CHECK_QUEUE)) {
+		struct dst_request *r;
+
+		list_for_each_entry(r, &st->request_list, request_list_entry) {
+			if (bio_rw(r->bio) != bio_rw(req->bio))
+				continue;
+
+			if (r->start >= req->start + req->size)
+				continue

[4/4] dst: Algorithms used in distributed storage.

2007-11-29 Thread Evgeniy Polyakov

Algorithms used in distributed storage.
Mirror and linear mapping code.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 0000000..cb77b57
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,104 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+
+	dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+			__func__, st->disk_size, n->size);
+
+	mutex_lock(&st->tree_lock);
+	n->start = st->disk_size;
+	st->disk_size += n->size;
+	mutex_unlock(&st->tree_lock);
+
+	return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+	int err;
+
+	if (req->node->bdev) {
+		generic_make_request(req->bio);
+		return 0;
+	}
+
+	err = kst_check_permissions(req->state, req->bio);
+	if (err)
+		return err;
+
+	return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+	if (err)
+		set_bit(DST_NODE_FROZEN, &st->node->flags);
+	else
+		clear_bit(DST_NODE_FROZEN, &st->node->flags);
+	return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+	.remap		= dst_linear_remap,
+	.add_node	= dst_linear_add_node,
+	.del_node	= dst_linear_del_node,
+	.error		= dst_linear_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+	alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+	if (!alg_linear)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+	dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov [EMAIL PROTECTED]");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 000..55cf59c
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1122 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+
+struct dst_mirror_node_data
+{
+	u64			age;
+};
+
+struct dst_mirror_priv
+{
+	unsigned int		chunk_num;
+
+	u64			last_start;
+
+	spinlock_t		backlog_lock;
+	struct list_head	backlog_list;
+
+	struct dst_mirror_node_data	old_data, new_data;
+
+	unsigned long		*chunk;
+};
+
+static struct dst_alg *alg_mirror;
+static struct bio_set *dst_mirror_bio_set;
+
+static int dst_mirror_resync(struct dst_node *n, int ndp);
+
+static void dst_mirror_mark_sync(struct dst_node *n)
+{
+	if (test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+		struct dst_mirror_priv *priv = n->priv;
+
+		clear_bit(DST_NODE_NOTSYNC, &n->flags);
+		dprintk("%s: node: %p, %llu:%llu synchronization "
+				"has been completed.\n",
+			__func__, n, n->start, n->size);
+		priv->old_data.age = 0;
+	}
+}
+
+static void dst_mirror_mark_notsync(struct

[1/4] dst: Distributed storage documentation.

2007-11-29 Thread Evgeniy Polyakov

Distributed storage documentation.

Algorithms used in the system, userspace interfaces
(sysfs dirs and files), design and implementation details
are described here.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 000..1437a6a
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has own start and size,
+are placed into storage by appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+Simple approach of concatenating storages into single device with
+increased size is used in this algorithm. Essentially new device
+has size equal to sum of sizes of underlying nodes and nodes are
+placed one after another.
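+
+      /------ Node 1 ------\                      /------ Node 3 ------\
+    start                 end                   start                 end
+      |====================||====================||====================|
+      |                   start                 end                    |
+      |                     \------ Node 2 ------/                     |
+      |                                                                |
+    start                                                             end
+      \------------------------- DST storage --------------------------/
+
+                                      /\
+                                      ||
+                                      ||
+
+                                 IO operations
+
+                                   Figure 1.
+        3 nodes combined into single storage using linear algorithm.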
+
+Mirror algorithm.
+In this algorithm nodes are placed under each other, so when
+operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node - algorithm keeps track of previous operation
+and knows where it was stopped, so that subsequent seek to the
+start of the new request will take the shortest time.
+Writing is always mirrored to all underlying nodes.
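+
+                      IO operations
+                           ||
+                           ||
+                           \/
+
+    |------------------ DST storage -------------------|
+    |             prev position                        |
+    |-------------------|---------- Node 1 ------------|
+    |        prev pos                                  |
+    |------ Node 2 -----|------------------------------|
+    |                                  prev pos        |
+    |-------------------------------|----- Node 3 -----|
+
+                        Figure 2.
+    3 nodes combined into single storage using mirror algorithm.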
+
+Each algorithm must implement number of callbacks,
+which must be registered during initialization time.
+
+struct dst_alg_ops
+{
+	int			(*add_node)(struct dst_node *n);
+	void			(*del_node)(struct dst_node *n);
+	int			(*remap)(struct dst_request *req);
+	int			(*error)(struct kst_state *state, int err);
+	struct module		*owner;
+};
+
[EMAIL PROTECTED]
+This callback is invoked when new node is being added into the storage,
+but before node is actually added into the storage, so that it could
+be accessed from it. When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to remote node or got a reference to the local block device). At this
+stage algorithm can add node into private map. 
+It must return zero on success or negative value otherwise.
+
[EMAIL PROTECTED]
+This callback is invoked when node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+It must return zero on success or negative value otherwise.
+
[EMAIL PROTECTED]
+This callback is invoked each time new bio hits the storage.
+Request structure contains BIO itself, pointer to the node, which originally
+stores the whole region under given IO request, and various parameters
+used by storage core to process this block request.
+It must return zero on success or negative value otherwise. It is up to
+this method to call all cleaning if remapping failed, for example it must
+call kst_bio_endio() for given callback in case of error, which in turn
+will call bio_endio(). Note, that dst_request structure provided in this
+callback is allocated on stack, so if there is a need to use it outside
+of the given function, it must be cloned (it will happen automatically
+in state's push callback, but that copy will not be shared by any other
+user).
+
[EMAIL PROTECTED]
+This callback is invoked for each error, which happened when processed

[0/4] dst: Distributed storage.

2007-11-29 Thread Evgeniy Polyakov

Distributed storage.

I'm pleased to announce the 9th release of the distributed
storage subsystem (DST). This is a maintenance release and includes
bug fixes only.

DST allows forming a storage on top of local and remote nodes
and combining them into a linear or mirroring setup, which in
turn can be exported to remote nodes.
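
As an illustration of the linear setup only (a minimal sketch, not the DST
code): each added node starts where the storage currently ends, and a
logical sector is mapped to the node whose range contains it. The demo_*
names and the flat array are assumptions made for this example; sizes are
in sectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node descriptor: start and size are in sectors. */
struct demo_node {
	uint64_t start;
	uint64_t size;
};

/*
 * Append a node of the given size: it starts where the storage currently
 * ends, and the storage grows by the node's size. Returns the new total.
 */
static uint64_t demo_linear_add(struct demo_node *n, uint64_t size,
				uint64_t disk_size)
{
	n->start = disk_size;
	n->size = size;
	return disk_size + size;
}

/*
 * Find the node that holds the logical sector and store the
 * node-relative sector in *local. Returns the node index or -1.
 */
static int demo_linear_map(const struct demo_node *nodes, size_t num,
			   uint64_t sector, uint64_t *local)
{
	size_t i;

	for (i = 0; i < num; i++) {
		if (sector >= nodes[i].start &&
		    sector - nodes[i].start < nodes[i].size) {
			*local = sector - nodes[i].start;
			return (int)i;
		}
	}
	return -1;
}
```

With two 800-sector nodes, logical sector 800 lands on the second node at
local sector 0, and sector 1600 is outside the storage.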

Short changelog:
 * use node's size in sectors instead of bytes
 * fixed old/new ages for the first node. 
Error spotted by Matthew Hodgson [EMAIL PROTECTED]
 * fixed debug printk declaration
 * it is now called 'astonishingly screwed tapeworm'

Overall list of features of the DST can be found on project's homepage:

http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[2/4] dst: Core distributed storage files.

2007-11-29 Thread Evgeniy Polyakov

Core distributed storage files.
Include userspace interfaces, initialization,
block layer bindings and other core functionality.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
This driver provides Support for ATA over Ethernet block
devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)  += sx8.o
 obj-$(CONFIG_BLK_DEV_UB)   += ub.o
 
+obj-$(CONFIG_DST)  += dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 000..d35e0cc
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,21 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	select CONNECTOR
+	select LIBCRC32C
+	---help---
+	  This driver allows to create a distributed storage.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	  This module allows to create linear mapping of the nodes
+	  in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	  This module allows to create a mirror of the nodes in the
+	  distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 000..06d0810
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1608 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/connector.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL };
+
+struct kmem_cache *dst_request_cache;
+
+static char dst_name[] = "Astonishingly screwed tapeworm";
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-800/clean
+ * /sys/devices/storage/n-800/dirty
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/n-0/clean
+ * /sys/devices/storage/n-0/dirty
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+   return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+	.name		= "dst",
+	.match		= dst_dev_match,
+};
+
+static struct device dst_dev = {
+	.bus		= &dst_dev_bus_type,
+	.release	= dst_dev_release
+};
+
+static void dst_node_release(struct device *dev)
+{
+}
+
+static struct device dst_node_dev = {
+	.release	= dst_node_release
+};
+
+static void dst_free_alg(struct dst_alg *alg)
+{
+   kfree(alg);
+}
+
+/*
+ * Algorithm is never freed directly,
+ * since its

Netchannels. The 21'th release.

2007-11-29 Thread Evgeniy Polyakov
Hi.

This is the 21st release of netchannels, a peer-to-peer, protocol-agnostic
communication channel between hardware and users. It uses a unified
cache to store channels and allows buffers for data to be allocated
from a userspace-mapped area or from another preallocated set of pages
(like the VFS cache). All protocol processing happens in process context.

Users of the system can be, for example, userspace - it allows receiving
and sending traffic from the wire without any kernel interference,
implementing own protocols and offloading its processing to the hardware.

This idea was originally proposed and implemented by Van Jacobson.
This patchset (with the userspace network stack) is a logical continuation
of the idea, with a move to full peer-to-peer processing.

One of its users is userspace network stack [2].

Short changelog:
 * fixed queue length usage
 * fixed dst release path.
Both problems reported by Salvatore Del Popolo [EMAIL PROTECTED]
 * removed nat user

1. Netchannels homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=netchannel

2. Userspace network stack.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=unetstack

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 2697e92..3231b22 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_netchannel_control
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..d35d4d8 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -718,4 +718,5 @@ ia32_sys_call_table:
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
+   .quad sys_netchannel_control
 ia32_syscall_end:  
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index beeeaf6..33242f8 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_netchannel_control	320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 #include <linux/err.h>
 
 /*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 777288e..16f1aac 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync_file_range)
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_netchannel_control	280
+__SYSCALL(__NR_netchannel_control, sys_netchannel_control)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_netchannel_control
 
 #ifdef __KERNEL__
 #include <linux/err.h>
diff --git a/include/linux/connector.h b/include/linux/connector.h
index 4c02119..bdf6432 100644
--- a/include/linux/connector.h
+++ b/include/linux/connector.h
@@ -36,9 +36,11 @@
 #define CN_VAL_CIFS			0x1
 #define CN_W1_IDX			0x3	/* w1 communication */
 #define CN_W1_VAL			0x1
+#define CN_NETCHANNELS_IDX		0x04	/* Netchannels connection control */
+#define CN_NETCHANNELS_VAL		0x01
 
 
-#define CN_NETLINK_USERS   4
+#define CN_NETLINK_USERS   5
 
 /*
  * Maximum connector's message size.
diff --git a/include/linux/netchannel.h b/include/linux/netchannel.h
new file mode 100644
index 000..c56afc5
--- /dev/null
+++ b/include/linux/netchannel.h
@@ -0,0 +1,175 @@
+/*
+ * netchannel.h
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __NETCHANNEL_H
+#define __NETCHANNEL_H
+
+#include <linux/types.h>
+
+enum netchannel_commands {
+   NETCHANNEL_CREATE = 0,
+};
+
+enum netchannel_type {
+   NETCHANNEL_EMPTY = 0,
+   NETCHANNEL_COPY_USER,
+   NETCHANNEL_NAT,
+   NETCHANNEL_MAX
+};
+
+/*
+ * Destination and source addresses/ports are from receiving point of view,
+ * i.e

Re: [Bugme-new] [Bug 9440] New: Problem in joinning a socket to ipv6 multicast address in specific scenario

2007-11-28 Thread Evgeniy Polyakov
Hi.

Avaid provided a test application, so the bug got fixed.

IPv6 addrconf removes the ipv6 inner device from the netdev each time the mtu
changes and the new value is less than IPV6_MIN_MTU (1280 bytes).
When the mtu is changed and the new value is at least IPV6_MIN_MTU,
it does not add the ipv6 addresses and inner device back.

This patch fixes that.

Tested with Avaid's application, which works ok now.
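
The decision made on an MTU change can be sketched in isolation. This is a
simplified userspace model, not the addrconf code: the demo_* names are
made up for the example, and it assumes that re-adding the IPv6 device
always succeeds (the patch falls through to the stop path if it does not).

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_IPV6_MIN_MTU 1280	/* same value as IPV6_MIN_MTU */

enum demo_mtu_action {
	DEMO_UPDATE_MTU,	/* IPv6 device present, MTU still large enough */
	DEMO_READD_DEV,		/* IPv6 device was removed, MTU is large enough again */
	DEMO_STOP_IPV6,		/* MTU fell under the minimum */
};

/* Pick the action for a device whose MTU has just changed. */
static enum demo_mtu_action demo_on_mtu_change(bool have_idev, unsigned int mtu)
{
	if (mtu < DEMO_IPV6_MIN_MTU)
		return DEMO_STOP_IPV6;
	return have_idev ? DEMO_UPDATE_MTU : DEMO_READD_DEV;
}
```

The DEMO_READD_DEV case is the one that was missing before: the device had
lost its IPv6 state and raising the MTU again did not bring it back.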

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 567664e..4f7e46c 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2357,12 +2358,18 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 		break;
 
 	case NETDEV_CHANGEMTU:
-		if ( idev && dev->mtu >= IPV6_MIN_MTU) {
+		if (idev && dev->mtu >= IPV6_MIN_MTU) {
 			rt6_mtu_change(dev, dev->mtu);
 			idev->cnf.mtu6 = dev->mtu;
 			break;
 		}
 
+		if (!idev && dev->mtu >= IPV6_MIN_MTU) {
+			idev = ipv6_add_dev(dev);
+			if (idev)
+				break;
+		}
+
 		/* MTU falled under IPV6_MIN_MTU. Stop IPv6 on this interface. */
 
 	case NETDEV_DOWN:

-- 
Evgeniy Polyakov


Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()

2007-11-23 Thread Evgeniy Polyakov
On Fri, Nov 23, 2007 at 01:11:20PM -0600, Matt Mackall ([EMAIL PROTECTED]) wrote:
> On Fri, Nov 23, 2007 at 09:59:06PM +0300, Evgeniy Polyakov wrote:
> > On Fri, Nov 23, 2007 at 09:51:01PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> > > On Fri, Nov 23, 2007 at 09:48:51PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> > > > Stop, we are trying to free skb without destructor and catch connection
> > > > tracking, so it is not a solution. To fix the problem we need to check
> > > > if it is not netfilter related, kind of this (not tested), Simon please
> > > > give it a try:
> > >
> > > And to be really cool we need to bypass skbs with xfrm attached, since
> > > its freeing also assumes BH context.
> >
> > What about compile options?
>
> What about my original suggestion that we mark skbs owned by netpoll
> and free only those. Much safer, no? Untested:

This should work if there are netpoll's skbs, but if we are under memory
pressure we want to free not only netpoll skbs, but at least one, and 
what if there are no netpoll skbs in the queue?

-- 
Evgeniy Polyakov


Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()

2007-11-23 Thread Evgeniy Polyakov
On Fri, Nov 23, 2007 at 10:54:10PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> On Fri, Nov 23, 2007 at 01:41:39PM -0600, Matt Mackall ([EMAIL PROTECTED]) wrote:
> > Here's another thought: move all this logic into the networking core,
> > unify it with current softirq zapper, then allow it to be called from
> > various other places (like atomic allocators). Then it'll all be in
> > central maintained place with more users.
>
> This can be done quite easily - put a check into __kfree_skb() if
> netpoll is compiled-in and we are in hardirq context, then put skb
> into softirq freeing queue. Then zap_completion_queue() can free
> anything without ever knowing about nature of the packet, since this
> will be checked in __kfree_skb() anyway.

And let's add some mess...
But should fix the case when netpoll code is being executed in interrupt
context and is about to free skb, which should not be freed.

Frankly saying this looks like crap.

Crap-added-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..88f8ea9 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -196,10 +196,7 @@ static void zap_completion_queue(void)
 		while (clist != NULL) {
 			struct sk_buff *skb = clist;
 			clist = clist->next;
-			if (skb->destructor)
-				dev_kfree_skb_any(skb); /* put this one back */
-			else
-				__kfree_skb(skb);
+			__kfree_skb(skb);
 		}
 	}
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 27cfe5f..8642097 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -318,6 +318,26 @@ void kfree_skbmem(struct sk_buff *skb)
 
 void __kfree_skb(struct sk_buff *skb)
 {
+#if defined(CONFIG_NETPOLL) || defined(CONFIG_NETPOLL_TRAP)
+	if (in_irq() || irqs_disabled()) {
+		if (skb->destructor) {
+			dev_kfree_skb_irq(skb);
+			return;
+		}
+#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
+		if (skb->nfct || skb->nfct_reasm) {
+			dev_kfree_skb_irq(skb);
+			return;
+		}
+#endif
+#ifdef CONFIG_XFRM
+		if (skb->sp) {
+			dev_kfree_skb_irq(skb);
+			return;
+		}
+#endif
+	}
+#endif
 	dst_release(skb->dst);
 #ifdef CONFIG_XFRM
 	secpath_put(skb->sp);

-- 
Evgeniy Polyakov


Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()

2007-11-23 Thread Evgeniy Polyakov
On Fri, Nov 23, 2007 at 09:48:51PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> Stop, we are trying to free skb without destructor and catch connection
> tracking, so it is not a solution. To fix the problem we need to check
> if it is not netfilter related, kind of this (not tested), Simon please
> give it a try:

And to be really cool we need to bypass skbs with xfrm attached, since
its freeing also assumes BH context.
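
The check itself is a plain predicate over the skb. A minimal userspace
sketch, assuming a made-up demo_skb structure that carries only the four
fields being tested (it is not struct sk_buff):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the sk_buff fields the check looks at. */
struct demo_skb {
	void (*destructor)(struct demo_skb *skb);
	void *nfct;		/* connection tracking reference */
	void *nfct_reasm;	/* reassembled packet kept by conntrack */
	void *sp;		/* xfrm security path */
};

/*
 * True if the skb carries state whose release expects BH context, so it
 * has to go back to the softirq completion queue instead of being freed
 * directly.
 */
static bool demo_must_defer_free(const struct demo_skb *skb)
{
	return skb->destructor || skb->nfct || skb->nfct_reasm || skb->sp;
}

/* Empty destructor, used only to mark a demo skb as owned. */
static void demo_destructor(struct demo_skb *skb)
{
	(void)skb;
}
```

Only an skb with none of the four fields set would be freed on the spot;
everything else is put back for the softirq to handle.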

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..5f86e60 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -196,7 +196,8 @@ static void zap_completion_queue(void)
 		while (clist != NULL) {
 			struct sk_buff *skb = clist;
 			clist = clist->next;
-			if (skb->destructor)
+			if (skb->destructor || skb->nfct ||
+					skb->nfct_reasm || skb->sp)
 				dev_kfree_skb_any(skb); /* put this one back */
 			else
 				__kfree_skb(skb);


-- 
Evgeniy Polyakov


Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()

2007-11-23 Thread Evgeniy Polyakov
On Fri, Nov 23, 2007 at 08:57:57PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> > My memory here is hazy, but I think this exists to rescue netconsole
> > in low-memory situations. This bit originated with Ingo, so maybe he
> > can recall.
> >
> > Netpoll can process an arbitrary number of skbs inside a single
> > interrupt. Think sysrq-t at one packet per line or kgdboe where the
> > entire trace session can happen inside one very long interrupt.
> >
> > Perhaps we can refine this to mark netpoll's skbs (perhaps with
> > ->destructor?) and delete only skbs we own. As these are never passed
> > through any of the other route/xfrm/filter code, they should be safe
> > to delete even in irq context, yes?
> >
> > > Removing zap_completion_queue() from find_skb() will fix the warning,
> > > but I'm not sure this is a correct fix. I've added Matt to the Cc list.
> >
> > Care to try the sysrq-t or OOM message tests?
>
> We basically can not free skbs there - if it is interrupt context and
> we are freeing some skb with destructor we will catch the warning anyway.
>
> No matter if we are under memory pressure or whatever - it is not
> allowed - a lot of skbs are supposed to be freed in softirq context,
> that is why dev_kfree_skb_any() exists.
>
> I think we can drop skbs _without_ destructor from the queue though in
> that conditions given that we actually need only one.

Stop, we are trying to free skb without destructor and catch connection
tracking, so it is not a solution. To fix the problem we need to check
if it is not netfilter related, kind of this (not tested), Simon please
give it a try:

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..855bb3f 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -196,7 +196,7 @@ static void zap_completion_queue(void)
 		while (clist != NULL) {
 			struct sk_buff *skb = clist;
 			clist = clist->next;
-			if (skb->destructor)
+			if (skb->destructor || skb->nfct || skb->nfct_reasm)
 				dev_kfree_skb_any(skb); /* put this one back */
 			else
 				__kfree_skb(skb);


-- 
Evgeniy Polyakov


Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()

2007-11-23 Thread Evgeniy Polyakov
On Fri, Nov 23, 2007 at 12:21:57AM -0800, Andrew Morton ([EMAIL PROTECTED]) wrote:
> > [2059664.615816] __iptables__: init4 IN=ppp0 OUT=ppp0 WARNING: at kernel/softirq.c:139 local_bh_enable()
> > [2059664.620535]  [<80120364>] local_bh_enable+0x3c/0x97

> > [2059664.620657]  [<8011c205>] __call_console_drivers+0x61/0x6d
> > [2059664.620669]  [<8011c3fc>] release_console_sem+0x164/0x1bf
> > [2059664.620679]  [<8011c81f>] vprintk+0x27a/0x2ff
>
> If that trace is to be believed we're doing netfilter stuff on packets which
> were sent across netconsole.
>
> This probably isn't anything the netfilter guys have thought about.  And
> probably we don't want them to.  Is there some simple way in which we can
> exempt netconsole from netfilter processing?

This is not about netfilter, but about freeing skb in interrupt context,
which is not allowed, and in interrupt skbs are queued to be freed in softirq,
but netconsole wants to flush softirq freeing queue. That is a question: why?

Removing zap_completion_queue() from find_skb() will fix the warning,
but I'm not sure this is a correct fix. I've added Matt to the Cc list.

-- 
Evgeniy Polyakov


Re: [Bugme-new] [Bug 9440] New: Problem in joinning a socket to ipv6 multicast address in specific scenario

2007-11-23 Thread Evgeniy Polyakov
On Thu, Nov 22, 2007 at 05:23:42PM -0800, Andrew Morton ([EMAIL PROTECTED]) wrote:
> > 3. Now i am running a program i wrote in c that opens a dgram socket
> > (sock_fd[i] = socket(test_data->protocol, SOCK_DGRAM, 0);)  and join it to
> > multicast ipv6 address.
> > if i am running this program after steps 1+2 i get the following error:
> > "Resource temporarily unavailable" when trying to join the socket to the
> > multicast ipv6 address by the
> > system call :

Could you provide a test application?
If it is small and can be run without external dependencies, the bug will
be fixed much faster.

Thanks.

-- 
Evgeniy Polyakov


Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()

2007-11-23 Thread Evgeniy Polyakov
On Fri, Nov 23, 2007 at 01:41:39PM -0600, Matt Mackall ([EMAIL PROTECTED]) wrote:
> Here's another thought: move all this logic into the networking core,
> unify it with current softirq zapper, then allow it to be called from
> various other places (like atomic allocators). Then it'll all be in
> central maintained place with more users.

This can be done quite easily - put a check into __kfree_skb() if
netpoll is compiled-in and we are in hardirq context, then put skb
into softirq freeing queue. Then zap_completion_queue() can free
anything without ever knowing about nature of the packet, since this
will be checked in __kfree_skb() anyway.
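
The idea can be modelled outside the kernel. Below is a minimal sketch,
assuming a made-up demo_buf/demo_queue pair in place of sk_buff and the
per-CPU completion queue, and a plain flag in place of the real
in_irq()/irqs_disabled() checks: the central free routine only queues the
buffer when called from hard interrupt context, and a later flush from a
safe context releases whatever was deferred.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_buf {
	struct demo_buf *next;
	bool freed;
};

/* Hypothetical stand-in for the per-CPU softirq completion queue. */
struct demo_queue {
	struct demo_buf *head;
	size_t len;
};

/*
 * Central free routine: in hard interrupt context the buffer is only
 * queued for later, otherwise it is released immediately.
 * Returns true if the buffer was freed right away.
 */
static bool demo_free(struct demo_queue *q, struct demo_buf *b, bool in_hardirq)
{
	if (in_hardirq) {
		b->next = q->head;
		q->head = b;
		q->len++;
		return false;
	}
	b->freed = true;
	return true;
}

/* Run later from a safe context: release everything that was deferred. */
static size_t demo_flush(struct demo_queue *q)
{
	size_t n = 0;

	while (q->head) {
		struct demo_buf *b = q->head;

		q->head = b->next;
		b->next = NULL;
		b->freed = true;
		n++;
	}
	q->len = 0;
	return n;
}
```

Callers no longer need to know anything about the buffer; the context
check lives in one place. Note that in this model a flush that itself runs
in hard interrupt context would not be able to free anything.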

Kind of this:

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..88f8ea9 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -196,10 +196,7 @@ static void zap_completion_queue(void)
 		while (clist != NULL) {
 			struct sk_buff *skb = clist;
 			clist = clist->next;
-			if (skb->destructor)
-				dev_kfree_skb_any(skb); /* put this one back */
-			else
-				__kfree_skb(skb);
+			__kfree_skb(skb);
 		}
 	}
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 27cfe5f..f720685 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -318,6 +318,12 @@ void kfree_skbmem(struct sk_buff *skb)
 
 void __kfree_skb(struct sk_buff *skb)
 {
+#if defined(CONFIG_NETPOLL) || defined(CONFIG_NETPOLL_TRAP)
+	if (in_irq() || irqs_disabled()) {
+		dev_kfree_skb_irq(skb);
+		return;
+	}
+#endif
 	dst_release(skb->dst);
 #ifdef CONFIG_XFRM
 	secpath_put(skb->sp);

-- 
Evgeniy Polyakov


Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()

2007-11-23 Thread Evgeniy Polyakov
On Fri, Nov 23, 2007 at 09:51:01PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> On Fri, Nov 23, 2007 at 09:48:51PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> > Stop, we are trying to free skb without destructor and catch connection
> > tracking, so it is not a solution. To fix the problem we need to check
> > if it is not netfilter related, kind of this (not tested), Simon please
> > give it a try:
>
> And to be really cool we need to bypass skbs with xfrm attached, since
> its freeing also assumes BH context.

What about compile options?

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..adb3c54 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -196,10 +196,25 @@ static void zap_completion_queue(void)
 		while (clist != NULL) {
 			struct sk_buff *skb = clist;
 			clist = clist->next;
-			if (skb->destructor)
+			if (skb->destructor) {
 				dev_kfree_skb_any(skb); /* put this one back */
-			else
-				__kfree_skb(skb);
+				continue;
+			}
+
+#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
+			if (skb->nfct || skb->nfct_reasm) {
+				dev_kfree_skb_any(skb); /* put this one back */
+				continue;
+			}
+#endif
+
+#ifdef CONFIG_XFRM
+			if (skb->sp) {
+				dev_kfree_skb_any(skb); /* put this one back */
+				continue;
+			}
+#endif
+			__kfree_skb(skb);
 		}
 	}
 

-- 
Evgeniy Polyakov
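
The checks in the patch above can be read as one predicate: an skb is freed directly only if it carries no state whose release assumes softirq (BH) context. Below is a minimal user-space sketch of that predicate. It is not kernel code; struct fake_skb and can_free_immediately() are hypothetical names made up for illustration, and only the four fields the patch tests are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff: only the fields the patch tests. */
struct fake_skb {
	void (*destructor)(struct fake_skb *skb);	/* socket accounting hook */
	void *nfct;		/* conntrack reference */
	void *nfct_reasm;	/* conntrack reassembly reference */
	void *sp;		/* xfrm security path */
};

/*
 * Returns true when the patch would call __kfree_skb() directly, and
 * false when it would hand the skb back via dev_kfree_skb_any().
 */
static bool can_free_immediately(const struct fake_skb *skb)
{
	assert(skb != NULL);

	if (skb->destructor)
		return false;	/* destructor may need softirq context */
	if (skb->nfct || skb->nfct_reasm)
		return false;	/* conntrack put assumes BH context */
	if (skb->sp)
		return false;	/* secpath put assumes BH context */
	return true;
}
```

A zero-initialised struct fake_skb passes the predicate; setting any one of the four fields makes it fail, which mirrors the three "put this one back" branches of the patch.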


Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()

2007-11-23 Thread Evgeniy Polyakov
On Fri, Nov 23, 2007 at 11:07:56AM -0600, Matt Mackall ([EMAIL PROTECTED]) wrote:
> On Fri, Nov 23, 2007 at 01:55:19PM +0300, Evgeniy Polyakov wrote:
> > On Fri, Nov 23, 2007 at 12:21:57AM -0800, Andrew Morton ([EMAIL PROTECTED]) wrote:
> > > > [2059664.615816] __iptables__: init4 IN=ppp0 OUT=ppp0 WARNING: at kernel/softirq.c:139 local_bh_enable()
> > > > [2059664.620535]  [<80120364>] local_bh_enable+0x3c/0x97
> > > >
> > > > [2059664.620657]  [<8011c205>] __call_console_drivers+0x61/0x6d
> > > > [2059664.620669]  [<8011c3fc>] release_console_sem+0x164/0x1bf
> > > > [2059664.620679]  [<8011c81f>] vprintk+0x27a/0x2ff
> > >
> > > If that trace is to be believed we're doing netfilter stuff on packets which
> > > were sent across netconsole.
> > >
> > > This probably isn't anything the netfilter guys have thought about.  And
> > > probably we don't want them to.  Is there some simple way in which we can
> > > exempt netconsole from netfilter processing?
> >
> > This is not about netfilter, but about freeing skb in interrupt context,
> > which is not allowed, and in interrupt skbs are queued to be freed in softirq,
> > but netconsole wants to flush softirq freeing queue. That is a question: why?
>
> My memory here is hazy, but I think this exists to rescue netconsole
> in low-memory situations. This bit originated with Ingo, so maybe he
> can recall.
>
> Netpoll can process an arbitrary number of skbs inside a single
> interrupt. Think sysrq-t at one packet per line or kgdboe where the
> entire trace session can happen inside one very long interrupt.
>
> Perhaps we can refine this to mark netpoll's skbs (perhaps with
> ->destructor?) and delete only skbs we own. As these are never passed
> through any of the other route/xfrm/filter code, they should be safe
> to delete even in irq context, yes?
>
> > Removing zap_completion_queue() from find_skb() will fix the warning,
> > but I'm not sure this is a correct fix. I've added Matt to the Cc list.
>
> Care to try the sysrq-t or OOM message tests?

We basically can not free skbs there - if it is interrupt context and
we are freeing some skb with destructor we will catch the warning anyway.

No matter if we are under memory pressure or whatever - it is not
allowed - a lot of skbs are supposed to be freed in softirq context,
that is why dev_kfree_skb_any() exists.
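
As a rough illustration of the point about dev_kfree_skb_any(): the helper picks a free path from the execution context instead of from the skb. The sketch below models that choice in user space. struct exec_ctx, enum free_path and pick_free_path() are hypothetical names; the two booleans stand in for the kernel's in_irq() and irqs_disabled() tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum free_path {
	FREE_DIRECT,	/* kernel analogue: dev_kfree_skb() */
	FREE_DEFERRED	/* kernel analogue: dev_kfree_skb_irq(), freed later in softirq */
};

/* Hypothetical snapshot of the execution context. */
struct exec_ctx {
	bool in_hardirq;	/* stands in for in_irq() */
	bool irqs_off;		/* stands in for irqs_disabled() */
};

/* Defer the free whenever we may not safely run softirq-only cleanup. */
static enum free_path pick_free_path(const struct exec_ctx *ctx)
{
	assert(ctx != NULL);

	if (ctx->in_hardirq || ctx->irqs_off)
		return FREE_DEFERRED;
	return FREE_DIRECT;
}
```

Under this model, netpoll running inside a long interrupt always lands on the deferred path, which is why flushing the completion queue from that context is the problematic step.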

I think we can drop skbs _without_ destructor from the queue though in
that conditions given that we actually need only one.

-- 
Evgeniy Polyakov


Re: 2.6.23 WARNING: at kernel/softirq.c:139 local_bh_enable()

2007-11-23 Thread Evgeniy Polyakov
On Fri, Nov 23, 2007 at 12:59:43PM -0600, Matt Mackall ([EMAIL PROTECTED]) wrote:
> So I'd be surprised if that was a problem. But I can imagine having
> problems for skbs without destructors which run into one of these in
> __kfree_skb:
>
> dst_release
> secpath_put
> nf_conntrack_put
> nf_conntrack_put_reasm
> nf_bridge_put
>
> ..some or all of which assume a softirq context.

bridging is ok, others require softirq context.
I've sent a patch (the last one should be ok) to guard against xfrm and
connection tracking.

> > No matter if we are under memory pressure or whatever - it is not
> > allowed - a lot of skbs are supposed to be freed in softirq context,
> > that is why dev_kfree_skb_any() exists.
>
> Some skbs we definitely -can- free in irq context. The only ones we
> care about are the ones generated by netpoll. If there's a reason you
> think netpoll's own skbs can't be freed, please describe it.

Only some and to distinguish them we can not use destructor - if it is
set (even empty function) it will fire an alarm.

> > I think we can drop skbs _without_ destructor from the queue though in
> > that conditions given that we actually need only one.
>
> Huh?

Don't mind - friday...
I posted a patch (third one should be ok) to fix this issue.

-- 
Evgeniy Polyakov


Re: Netfilter: kernel panic with REDIRECT target. (2.6.23 and 2.6.23.8)

2007-11-20 Thread Evgeniy Polyakov
> > Ok, let's try it hard way.
> > Please check attached patch and tell if it helped (it will produce
> > some debug though).
>
> > With both patches applied - one Patrick showed and this one.

> Now works, with this in dmesg
>
> conntrack: ea94159c, new: ead4d7c4, old: ead4d7d0, ct: 00000000.

David (Miller :), please apply attached patch, which is also needed to fix
netfilter connection tracking bug.
When connection tracking entry (nf_conn) is about to copy itself it can
have some of its extension users (like nat) as being already freed and
thus not required to be copied.
Frankly saying, it can be not the correct fix, but from code observation
and test, performed by David [EMAIL PROTECTED] it is.

Actually looking at this function I suspect it was copied from
nf_nat_setup_info() and thus bug was introduced.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index 70e7997..86b465b 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -607,13 +607,10 @@ static void nf_nat_move_storage(struct nf_conn *conntrack, void *old)
 	struct nf_conn_nat *new_nat = nf_ct_ext_find(conntrack, NF_CT_EXT_NAT);
 	struct nf_conn_nat *old_nat = (struct nf_conn_nat *)old;
 	struct nf_conn *ct = old_nat->ct;
-	unsigned int srchash;
 
-	if (!(ct->status & IPS_NAT_DONE_MASK))
+	if (!ct || !(ct->status & IPS_NAT_DONE_MASK))
 		return;
 
-	srchash = hash_by_src(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
-
 	write_lock_bh(&nf_nat_lock);
 	hlist_replace_rcu(&old_nat->bysource, &new_nat->bysource);
 	new_nat->ct = ct;

-- 
Evgeniy Polyakov


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Evgeniy Polyakov
Hi.

On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin ([EMAIL PROTECTED]) wrote:
> Hmm.. rather than a global tunable, what if it was a
> network driver managed tunable which toggled a flag in the
> lro_mgr features?  Would that be better?

What about ethtool control to set LRO_simple and LRO_ACK_aggregation?

-- 
Evgeniy Polyakov


Re: Netfilter: kernel panic with REDIRECT target. (2.6.23 and 2.6.23.8)

2007-11-20 Thread Evgeniy Polyakov
On Tue, Nov 20, 2007 at 01:24:17PM +0100, Patrick McHardy ([EMAIL PROTECTED]) wrote:
> Patrick McHardy wrote:
> > Evgeniy Polyakov wrote:
> > > > > Ok, let's try it hard way.
> > > > > Please check attached patch and tell if it helped (it will produce
> > > > > some debug though).
> > > > > With both patches applied - one Patrick showed and this one.
> > > >
> > > > Now works, with this in dmesg
> > > >
> > > > conntrack: ea94159c, new: ead4d7c4, old: ead4d7d0, ct: 00000000.
> > >
> > > David (Miller :), please apply attached patch, which is also needed to fix
> > > netfilter connection tracking bug.
> > > When connection tracking entry (nf_conn) is about to copy itself it can
> > > have some of its extension users (like nat) as being already freed and
> > > thus not required to be copied.
> > > Frankly saying, it can be not the correct fix, but from code observation
> > > and test, performed by David [EMAIL PROTECTED] it is.
> >
> > I also don't believe this can be correct, let me look into this
> > first.
>
>
> I now understand what's happening:
>
> - new connection is allocated without helper
> - connection is REDIRECTed to localhost
> - nf_nat_setup_info adds NAT extension, but doesn't initialize it yet
> - nf_conntrack_alter_reply performs a helper lookup based on the
>   new tuple, finds the SIP helper and allocates a helper extension,
>   causing reallocation because of too little space
> - nf_nat_move_storage is called with the uninitialized nat extension
>
> So your fix is entirely correct, thanks a lot :)

It is always better to check my third eye revelations :)
Thanks for checking it.
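
The sequence Patrick describes boils down to a move callback being handed an extension whose back-pointer has not been set yet. The following is a small user-space sketch of the guarded callback, under the assumption that an uninitialized extension is simply zeroed. The fake_* types, FAKE_NAT_DONE_MASK and fake_nat_move_storage() are hypothetical stand-ins, not the netfilter definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FAKE_NAT_DONE_MASK 0x3UL	/* hypothetical stand-in for IPS_NAT_DONE_MASK */

struct fake_conn {
	unsigned long status;
};

struct fake_nat_ext {
	struct fake_conn *ct;	/* NULL until NAT setup has finished */
};

/*
 * Called when the extension area is reallocated. Returns true if the
 * NAT state was carried over, false if there was nothing to move.
 */
static bool fake_nat_move_storage(struct fake_nat_ext *new_nat,
				  const struct fake_nat_ext *old_nat)
{
	struct fake_conn *ct;

	assert(new_nat != NULL && old_nat != NULL);
	ct = old_nat->ct;

	/* The !ct test is the guard the patch adds; without it the
	 * status check below dereferences a NULL pointer. */
	if (!ct || !(ct->status & FAKE_NAT_DONE_MASK))
		return false;

	new_nat->ct = ct;
	return true;
}
```

With a zeroed old extension the function returns early and leaves the new one untouched, which corresponds to the "ct: 00000000" case in the debug output.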

-- 
Evgeniy Polyakov


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Evgeniy Polyakov
On Tue, Nov 20, 2007 at 09:50:56PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote:
> On Tue, Nov 20, 2007 at 04:35:09PM +0300, Evgeniy Polyakov wrote:
> >
> > On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin ([EMAIL PROTECTED]) wrote:
> > > Hmm.. rather than a global tunable, what if it was a
> > > network driver managed tunable which toggled a flag in the
> > > lro_mgr features?  Would that be better?
> >
> > What about ethtool control to set LRO_simple and LRO_ACK_aggregation?
>
> I have two concerns about this:
>
> 1) That same option can still be turned on by distros.

FC and Debian turn on hardware checksum offloading in e1000 and I have
a card where this results in more than 10% performance _decrease_.
I do not know why, but I'm able to run a script which disables it via
ethtool.

> 2) This doesn't make sense because the code is actually in the
>    core networking stack.

It depends. Software LRO can be controlled by a simple procfs switch, but
hardware one? I recall it was pointed out a number of times that hardware
LRO is possible and is likely being implemented in some ASICs.

> I'm particularly unhappy about 2) because I don't want to be in a
> situation down the track where every driver is going to add this
> option so that they're not left behind in the arms race.

For software LRO I agree, but this looks exactly like the gso/tso case and
an additional tweak for software gso. Having it per-system is fine, and I
believe no one should ever care that some distro will do bad/good things
with it. Actually we already have so many tricky options in procfs
which can kill performance...

-- 
Evgeniy Polyakov


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Evgeniy Polyakov
On Tue, Nov 20, 2007 at 10:08:31PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote:
> Of course we still have the problem with the option in general
> that Dave raised.  That is this may cause the proliferation of
> TCP receiver behaviour that may be undesirable.

Yes, it results in bursts of traffic because of delayed acks accumulated
in sender's lro engine, but from the first point, if receiver is slow,
then it will slowly send acks and they will be slowly accumulated, thus
changing not only seq/ack numbers, but also timings, which is equal to
increasing length of the pipe between users. TCP is able to balance on
this edge. I'm sure it depends on workload, but heavy bulk transfers,
where only lro with and without ack aggregation can win, are quite usual
on long pipes with high performance numbers.

Until it is tested, I doubt it is possible to say it is 100% good or
bad, so my proposal is to write the code, which is tunable from
userspace, turn it off and allow people to test the change.
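
To make the proposal concrete, here is a hedged user-space sketch of what "ack aggregation behind a tunable that defaults to off" could look like. It is not the code from the LRO patch under discussion. struct ack_agg, ack_agg_rx() and ack_agg_flush() are hypothetical names, and the model only handles pure cumulative ACKs of a single flow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-flow aggregation state. */
struct ack_agg {
	bool enabled;		/* the tunable; zero-initialised means off */
	bool pending;		/* an aggregated ACK is waiting to be flushed */
	uint32_t ack_seq;	/* highest cumulative ACK seen so far */
};

/* True if a is later than b in 32-bit sequence space (wraparound-safe). */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/*
 * Feed one pure ACK. Returns true if it must be passed up the stack
 * right away (aggregation off), false if it was absorbed.
 */
static bool ack_agg_rx(struct ack_agg *agg, uint32_t ack_seq)
{
	assert(agg != NULL);

	if (!agg->enabled)
		return true;
	if (!agg->pending || seq_after(ack_seq, agg->ack_seq))
		agg->ack_seq = ack_seq;
	agg->pending = true;
	return false;
}

/* Emit the single aggregated ACK, if any. Returns false when idle. */
static bool ack_agg_flush(struct ack_agg *agg, uint32_t *ack_seq)
{
	assert(agg != NULL && ack_seq != NULL);

	if (!agg->pending)
		return false;
	*ack_seq = agg->ack_seq;
	agg->pending = false;
	return true;
}
```

Because the sender sees one ACK covering everything absorbed since the last flush, the timing effect described above (a burst after a quiet period) follows directly from how often the flush runs.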

-- 
Evgeniy Polyakov


[take8 4/4] dst: Algorithms used in distributed storage.

2007-11-20 Thread Evgeniy Polyakov

Algorithms used in distributed storage.
Mirror and linear mapping code.
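
For readers who want the idea before the code: the documentation patch in this series says that mirror reads go to the "nearest" node, judged by where the previous operation on each node stopped, while writes go to every node. A rough user-space sketch of the read-side choice follows. It is not taken from alg_mirror.c; struct mirror_node, its in_sync flag and mirror_pick_reader() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node bookkeeping for read balancing. */
struct mirror_node {
	uint64_t last_pos;	/* sector where the previous request stopped */
	bool in_sync;		/* assumption: out-of-sync nodes are not read */
};

static uint64_t sector_distance(uint64_t a, uint64_t b)
{
	return a > b ? a - b : b - a;
}

/*
 * Pick the in-sync node whose last position is closest to the start
 * of the new request. Returns the node index, or -1 if none qualifies.
 */
static int mirror_pick_reader(const struct mirror_node *nodes, size_t count,
			      uint64_t start)
{
	int best = -1;
	uint64_t best_dist = 0;
	size_t i;

	assert(nodes != NULL || count == 0);

	for (i = 0; i < count; i++) {
		uint64_t d;

		if (!nodes[i].in_sync)
			continue;
		d = sector_distance(nodes[i].last_pos, start);
		if (best < 0 || d < best_dist) {
			best = (int)i;
			best_dist = d;
		}
	}
	return best;
}
```

A caller would update last_pos of the chosen node once the read completes, so that the next request sees the new head position.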

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 0000000..cb77b57
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,104 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+
+	dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+			__func__, st->disk_size, n->size);
+
+	mutex_lock(&st->tree_lock);
+	n->start = st->disk_size;
+	st->disk_size += n->size;
+	mutex_unlock(&st->tree_lock);
+
+	return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+	int err;
+
+	if (req->node->bdev) {
+		generic_make_request(req->bio);
+		return 0;
+	}
+
+	err = kst_check_permissions(req->state, req->bio);
+	if (err)
+		return err;
+
+	return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+	if (err)
+		set_bit(DST_NODE_FROZEN, &st->node->flags);
+	else
+		clear_bit(DST_NODE_FROZEN, &st->node->flags);
+	return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+	.remap		= dst_linear_remap,
+	.add_node	= dst_linear_add_node,
+	.del_node	= dst_linear_del_node,
+	.error		= dst_linear_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+	alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+	if (!alg_linear)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+	dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov [EMAIL PROTECTED]");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 0000000..1b55f4d
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1113 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+
+struct dst_mirror_node_data
+{
+	u64			age;
+};
+
+struct dst_mirror_priv
+{
+	unsigned int		chunk_num;
+
+	u64			last_start;
+
+	spinlock_t		backlog_lock;
+	struct list_head	backlog_list;
+
+	struct dst_mirror_node_data	old_data, new_data;
+
+	unsigned long		*chunk;
+};
+
+static struct dst_alg *alg_mirror;
+static struct bio_set *dst_mirror_bio_set;
+
+static int dst_mirror_resync(struct dst_node *n, int ndp);
+
+static void dst_mirror_mark_sync(struct dst_node *n)
+{
+	if (test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+		struct dst_mirror_priv *priv = n->priv;
+
+		clear_bit(DST_NODE_NOTSYNC, &n->flags);
+		dprintk("%s: node: %p, %llu:%llu synchronization "
+				"has been completed.\n",
+			__func__, n, n->start, n->size);
+		priv->old_data.age = 0;
+	}
+}
+static void dst_mirror_mark_notsync(struct

[take8 0/4] dst: Distributed storage.

2007-11-20 Thread Evgeniy Polyakov

Distributed storage.

I'm pleased to announce the 8th release of the distributed
storage subsystem (DST). This is a maintenance release and includes
bug fixes only.

DST allows to form a storage on top of local and remote nodes
and combine them into linear or mirroring setup, which in
turn can be exported to remote nodes.
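
As an illustration of the linear setup, the sketch below models in user space what dst_linear_add_node() in patch 4/4 does (each new node starts where the storage currently ends) and adds a lookup that maps a storage sector back to a node. The lin_* names, the fixed-size table and the lookup itself are assumptions made for the example; the real remapping lives in the DST core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIN_MAX_NODES 8	/* arbitrary limit for the sketch */

struct lin_node {
	uint64_t start;	/* first storage sector served by this node */
	uint64_t size;	/* number of sectors */
};

struct lin_storage {
	uint64_t disk_size;	/* sum of all node sizes */
	unsigned int count;
	struct lin_node nodes[LIN_MAX_NODES];
};

/* Append a node at the current end of the storage. Returns its index or -1. */
static int lin_add_node(struct lin_storage *st, uint64_t size)
{
	assert(st != NULL);

	if (st->count == LIN_MAX_NODES)
		return -1;
	st->nodes[st->count].start = st->disk_size;
	st->nodes[st->count].size = size;
	st->disk_size += size;
	return (int)st->count++;
}

/* Map a storage sector to (node index, sector inside that node). */
static int lin_remap(const struct lin_storage *st, uint64_t sector,
		     uint64_t *local)
{
	unsigned int i;

	assert(st != NULL && local != NULL);

	for (i = 0; i < st->count; i++) {
		const struct lin_node *n = &st->nodes[i];

		if (sector >= n->start && sector - n->start < n->size) {
			*local = sector - n->start;
			return (int)i;
		}
	}
	return -1;	/* beyond the end of the storage */
}
```

With two nodes of 800 sectors each (an illustrative size), sector 800 of the storage resolves to sector 0 of the second node and sector 1600 is out of range.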

Short changelog:
 * cleanup sysfs files on error path.
Patch by Chris Madden [EMAIL PROTECTED]

Overall list of features of the DST can be found on project's homepage:

http://tservice.net.ru/~s0mbre/old/?section=projectsitem=dst

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]




[take8 2/4] dst: Core distributed storage files.

2007-11-20 Thread Evgeniy Polyakov

Core distributed storage files.
Include userspace interfaces, initialization,
block layer bindings and other core functionality.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
 	  This driver provides Support for ATA over Ethernet block
 	  devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)  += sx8.o
 obj-$(CONFIG_BLK_DEV_UB)   += ub.o
 
+obj-$(CONFIG_DST)  += dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 0000000..d35e0cc
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,21 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	select CONNECTOR
+	select LIBCRC32C
+	---help---
+	  This driver allows to create a distributed storage.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	  This module allows to create linear mapping of the nodes
+	  in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	  This module allows to create a mirror of the nodes in the
+	  distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 0000000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 0000000..77b2c4f
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1608 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/connector.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL };
+
+struct kmem_cache *dst_request_cache;
+
+static char dst_name[] = "Squizzed black-out of the dancing back-aching hippo";
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-800/clean
+ * /sys/devices/storage/n-800/dirty
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/n-0/clean
+ * /sys/devices/storage/n-0/dirty
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+	return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+	.name		= "dst",
+	.match		= dst_dev_match,
+};
+
+static struct device dst_dev = {
+	.bus		= &dst_dev_bus_type,
+	.release	= dst_dev_release
+};
+
+static void dst_node_release(struct device *dev)
+{
+}
+
+static struct device dst_node_dev = {
+	.release	= dst_node_release
+};
+
+static void dst_free_alg(struct dst_alg *alg)
+{
+	kfree(alg);
+}
+
+/*
+ * Algorithm is never freed

[take8 1/4] dst: Distributed storage documentation.

2007-11-20 Thread Evgeniy Polyakov

Distributed storage documentation.

Algorithms used in the system, userspace interfaces
(sysfs dirs and files), design and implementation details
are described here.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 0000000..1437a6a
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has own start and size,
+are placed into storage by appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+Simple approach of concatenating storages into single device with
+increased size is used in this algorithm. Essentially new device
+has size equal to sum of sizes of underlying nodes and nodes are
+placed one after another.
+
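+  /----- Node 1 -----\                      /----- Node 3 -----\
+start               end                   start               end
+ |====================|====================|====================|
+ |                  start                end                    |
+ |                    \------ Node 2 ------/                    |
+ |                                                              |
+start                                                         end
+ \------------------------ DST storage -------------------------/
+
+                               /\
+                               ||
+                               ||
+
+                          IO operations
+
+                             Figure 1.
+     3 nodes combined into single storage using linear algorithm.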
+
+Mirror algorithm.
+In this algorithm nodes are placed under each other, so when
+operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node - algorithm keeps track of previous operation
+and knows where it was stopped, so that subsequent seek to the
+start of the new request will take the shortest time.
+Writing is always mirrored to all underlying nodes.
+
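+                  IO operations
+                       ||
+                       ||
+                       \/
+
+|------------------ DST storage ------------------|
+|          prev position                          |
+|------------|--------------- Node 1 -------------|
+|                     prev pos                    |
+|------- Node 2 ---------|------------------------|
+|prev pos                                         |
+|---|------------------------- Node 3 ------------|
+
+                      Figure 2.
+   3 nodes combined into single storage using mirror algorithm.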
+
+Each algorithm must implement number of callbacks,
+which must be registered during initialization time.
+
+struct dst_alg_ops
+{
+	int		(*add_node)(struct dst_node *n);
+	void		(*del_node)(struct dst_node *n);
+	int		(*remap)(struct dst_request *req);
+	int		(*error)(struct kst_state *state, int err);
+	struct module	*owner;
+};
+
+@add_node
+This callback is invoked when new node is being added into the storage,
+but before node is actually added into the storage, so that it could
+be accessed from it. When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to remote node or got a reference to the local block device). At this
+stage algorithm can add node into private map. 
+It must return zero on success or negative value otherwise.
+
+@del_node
+This callback is invoked when node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+It does not return a value (the prototype above is void).
+
+@remap
+This callback is invoked each time new bio hits the storage.
+Request structure contains BIO itself, pointer to the node, which originally
+stores the whole region under given IO request, and various parameters
+used by storage core to process this block request.
+It must return zero on success or negative value otherwise. It is up to
+this method to call all cleaning if remapping failed, for example it must
+call kst_bio_endio() for given callback in case of error, which in turn
+will call bio_endio(). Note, that dst_request structure provided in this
+callback is allocated on stack, so if there is a need to use it outside
+of the given function, it must be cloned (it will happen automatically
+in state's push callback, but that copy will not be shared by any other
+user).
+
+@error
+This callback is invoked for each error, which happened when processed

[take8 3/4] dst: Network state machine.

2007-11-20 Thread Evgeniy Polyakov

Network state machine.

Includes network async processing state machine and related tasks.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 0000000..ba5e5ef
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1475 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+	poll_table		pt;
+	struct kst_state	*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+		int type, int proto, int backlog)
+{
+	int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+	if (err)
+		goto err_out_release;
+
+	err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+	if (st) {
+		struct kst_worker *w = st->node->w;
+		unsigned long flags;
+
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (list_empty(&st->ready_entry))
+			list_add_tail(&st->ready_entry, &w->ready_list);
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		wake_up(&w->wait);
+	}
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct kst_state *st = container_of(wait, struct kst_state, wait);
+	kst_wake(st);
+	return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+		poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+	}
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+	struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+	return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+	return req;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+	if (unlikely(req->flags & DST_REQ_CHECK_QUEUE)) {
+		struct dst_request *r;
+
+		list_for_each_entry(r, &st->request_list, request_list_entry) {
+			if (bio_rw(r->bio) != bio_rw(req->bio))
+				continue;
+
+			if (r->start >= req->start + req->size)
+				continue

Re: Netfilter: kernel panic with REDIRECT target. (2.6.23 and 2.6.23.8)

2007-11-19 Thread Evgeniy Polyakov
On Mon, Nov 19, 2007 at 06:51:38PM +0000, David ([EMAIL PROTECTED]) wrote:
> Patrick McHardy wrote:
> > iptables -t nat -A PREROUTING -j REDIRECT -i eth2 -p udp --dport
> > 5061 --to-ports 5060
>
> >
> > Also post the kernel panic log.

>
> > Please try if this patch fixes the problem.
>
> No luck with the patch I'm afraid, panic log attached (of patched kernel).

Ok, let's try it the hard way.
Please check the attached patch and tell me if it helped (it will produce
some debug output though).
What is the load on this machine? Is it simple enough to reproduce?
I will take a closer look tomorrow if this does not help.

Thanks.

diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index 70e7997..7dc3496 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -607,13 +607,13 @@ static void nf_nat_move_storage(struct nf_conn *conntrack, void *old)
 	struct nf_conn_nat *new_nat = nf_ct_ext_find(conntrack, NF_CT_EXT_NAT);
 	struct nf_conn_nat *old_nat = (struct nf_conn_nat *)old;
 	struct nf_conn *ct = old_nat->ct;
-	unsigned int srchash;
+
+	printk("conntrack: %p, new: %p, old: %p, ct: %p.\n",
+			conntrack, new_nat, old_nat, ct);
 
-	if (!(ct->status & IPS_NAT_DONE_MASK))
+	if (!ct || !(ct->status & IPS_NAT_DONE_MASK))
 		return;
 
-	srchash = hash_by_src(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
-
 	write_lock_bh(&nf_nat_lock);
 	hlist_replace_rcu(&old_nat->bysource, &new_nat->bysource);
 	new_nat->ct = ct;

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Netfilter: kernel panic with REDIRECT target. (2.6.23 and 2.6.23.8)

2007-11-19 Thread Evgeniy Polyakov
On Mon, Nov 19, 2007 at 10:24:23PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) 
wrote:
 On Mon, Nov 19, 2007 at 06:51:38PM +0000, David ([EMAIL PROTECTED]) wrote:
  Patrick McHardy wrote:
   iptables -t nat -A PREROUTING -j REDIRECT -i eth2 -p udp --dport
   5061 --to-ports 5060
  
   
   Also post the kernel panic log.
 
  
   Please try if this patch fixes the problem.
  
  No luck with the patch I'm afraid, panic log attached (of patched kernel).
 
 Ok, let's try it hard way.
 Please check attached patch and tell if it helped (it will produce
 some debug though).

With both patches applied - one Patrick showed and this one.

-- 
Evgeniy Polyakov


Re: Re : Bug in using inet_lookup ()

2007-11-16 Thread Evgeniy Polyakov
On Fri, Nov 16, 2007 at 09:47:08AM +0000, Nj A ([EMAIL PROTECTED]) wrote:
> Hello,
> > Please show at least one bug trace when inet_lookup(&tcp_hashinfo, 0, 0, 0, 0,
> > 0) fails :)
> Trying this the system hangs :-( (setting panic* doesn't change more).

Your code below cannot work: you _never_ call inet_lookup().
In your bug trace inet_lookup() is called, so either this code is wrong, or
the bug trace is hand-written.

You also use inet_iif(), which requires a dst entry (routing cache), which
you do not set up either.

You can do the following to find where the problem occurs:
$ gdb vmlinux
(gdb) p inet_lookup
(gdb) l *(returned_above_address + 0x300)

It will show you the line where the bug occurs.
You have to compile your kernel with debugging symbols.

To prove that inet_lookup() works correctly, patch tcp_v4_rcv() to print the
lookup result for static source/destination addresses/ports copied from
your message and a zero ifindex (the last field).

I'm pretty sure your code, which has not been shown yet, has a bug in the
routine that calls inet_lookup().

> However, using (&tcp_hashinfo, ip_src, p_src, ip_dst, p_dst, 0) gives the
> following oops:

Wrong, you do _NOT_ use this in your code.

> BUG: unable to handle kernel NULL pointer dereference at virtual address
>
> printing eip:
> c02f19e1
> *pde =
> Oops:  [#1]
> CPU:0
> EIP:0060:[c02f19e1]Not tainted VLI
> EFLAGS: 00010282   (2.6.18 #1)
> EIP is at inet_lookup+0x300x500
> eax: 9e3779b9   ebx: 0004   ecx: 9e377a57   edx: f4046f84
> esi: f46a6010   edi:    ebp: 009e   esp: f4046f38
> ds: 007b   es: 007b   ss: 0068
> Process knl-thread (pid: 3068, ti=f4046000 task=f46f0610 task.ti=f4046000)
> Stack: 22921900 f6953840 f46a6010 f46a6000 f4046f84 0004 f46a6010 f46a6000
> f6953840 f8d3314a 0004 b7f3a000 0404 0005 0bfe
> 0bfe 0404  f4046fa8 f6953840 f4aa7880 f4aa7800 f4046fa8
> Code: 00 00 00 8d bc 27 00 00 00 00 55 89 cd 57 0f b7 c9 56 81 e9 47 86 c8 61
> 53 83 ec 14 89 54 24 10 8b b8 54 02 00 00 b8 b9 79 37 9e 8b 5f 10 29 d8 89
> da 03 44 24 28 c1 ea 0d 29 c8 29 d9 31 d0 89
> EIP: [c02f19e1] inet_lookup +0x300x500 SS:ESP 0068:f4046f38
>
> > Yes, to show the code you are using.
> Ok so basically I am receiving via Netlink a state telling me the ip_src,
> psrc, ip_dst, pdst.

> sk =
>  inet_lookup (&tcp_hashinfo, payload->src, payload->p_src, payload->dst,
>               payload->p_dst, inet_iif (s_skb));

WRONG!

You did not set up s_skb->dst, so inet_iif() will fail.
Use 0 there, as you have already been told several times.
This will not catch device binding, though.

> if (!sk)
>     goto no_tcp_socket;
> if (sk->sk_state == TCP_TIME_WAIT)
>     goto time_wait_socket;
> ...
> bh_lock_sock (sk);
> pdev:
> spin_lock (tmp_lock);
> new_dev = list_entry (tmp, struct net_device, todo_list);
> spin_unlock (tmp_lock);
> if (!new_dev)
>     goto err;
> s_skb->dev = new_dev;
> ...
> switch (sk->sk_state)
> {
>  case TCP_SYN_RECV:
>  ..
>  case TCP_LISTEN:
>  ..
>  case TCP_SYN_SENT:
>  ..
> }
> bh_unlock_sock (sk);
> ...
> /* send reply via Netlink */

This code _NEVER_ calls inet_lookup(), since the first check for
s_skb->dev will fail and you will select the device via your list and then
never return to inet_lookup().

Anyway, until your code is presented in full, so that people can point at
exactly the wrong line, it is nearly impossible to convince you that
inet_lookup() does work and you have a bug in your setup.
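
A minimal userspace sketch of the point made above, not kernel code: per the thread, inet_iif() reads the input interface through skb->dst, which is NULL right after alloc_skb(). The model_* names and the -1 return value are hypothetical; the real helper has no such guard, which is why the call oopses instead of returning an error.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (hypothetical names). */
struct model_rtable {
	int rt_iif;			/* input interface index */
};

struct model_skb {
	struct model_rtable *dst;	/* NULL right after allocation */
};

/* Model of the interface lookup: the added NULL check turns the
 * would-be oops into a -1 return so the failure is visible. */
int model_inet_iif(const struct model_skb *skb)
{
	if (skb == NULL || skb->dst == NULL)
		return -1;
	return skb->dst->rt_iif;
}

/* Choose the ifindex to pass to the lookup: the routed input interface
 * when the skb has one, otherwise 0 as suggested in the thread. */
int model_lookup_dif(const struct model_skb *skb)
{
	int dif = model_inet_iif(skb);

	return dif < 0 ? 0 : dif;
}
```

With a freshly allocated (unrouted) skb the model falls back to ifindex 0, which is the value the reply above asks to pass directly.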

-- 
Evgeniy Polyakov


Re: Re : Re : Bug in using inet_lookup ()

2007-11-15 Thread Evgeniy Polyakov
On Wed, Nov 14, 2007 at 04:47:22PM +0000, Nj A ([EMAIL PROTECTED]) wrote:
> By setting the ID of the ingress device to the inet_lookup() to 0, the
> machine reboots automatically.
> Setting proc/sys/kernel/panic* to non zero values dosn't help more..

Sorry, I did not understand.
Do you mean that after you pass zero to inet_lookup() instead of the device
id, it started to reboot?

-- 
Evgeniy Polyakov


Re: Null pointer dereference in nf_nat_move_storage(), kernel 2.6.23.1

2007-11-15 Thread Evgeniy Polyakov
Hi Chuck.

On Wed, Nov 14, 2007 at 06:25:15PM -0500, Chuck Ebbert ([EMAIL PROTECTED]) 
wrote:
  https://bugzilla.redhat.com/show_bug.cgi?id=259501#c14

   [<f8b61643>] __nf_ct_ext_add+0x12f/0x1c4 [nf_conntrack]

  nf_nat_move_storage():
  /usr/src/debug/kernel-2.6.23/linux-2.6.23.i686/net/ipv4/netfilter/nf_nat_core.c:612
      87:   f7 47 64 80 01 00 00    testl  $0x180,0x64(%edi)
      8e:   74 39                   je     c9 <nf_nat_move_storage+0x65>

  line 612:
          if (!(ct->status & IPS_NAT_DONE_MASK))
                  return;

Please test attached patch.

This routine is called each time the hash should be replaced. nf_conn has
an extension list which contains pointers to connection tracking users
(like nat, which is right now the only such user), so when a replace takes
place it should copy its own extensions. The loop in __nf_ct_ext_add()
checks for its own extension, but tries to move the higher-layer one,
which can lead to the above oops.

Not tested, derived from code observation only.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/netfilter/nf_conntrack_extend.c b/net/netfilter/nf_conntrack_extend.c
index a1a65a1..cf6ba66 100644
--- a/net/netfilter/nf_conntrack_extend.c
+++ b/net/netfilter/nf_conntrack_extend.c
@@ -109,7 +109,7 @@ void *__nf_ct_ext_add(struct nf_conn *ct, enum nf_ct_ext_id id, gfp_t gfp)
            rcu_read_lock();
            t = rcu_dereference(nf_ct_ext_types[i]);
            if (t && t->move)
-               t->move(ct, ct->ext + ct->ext->offset[id]);
+               t->move(ct, ct->ext + ct->ext->offset[i]);
            rcu_read_unlock();
        }
        kfree(ct->ext);
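
The indexing mix-up the patch addresses can be illustrated with a small userspace model. This is only a sketch: the model_* names, the three-slot table and the convention that offset 0 means "extension not present" are assumptions for illustration, not the nf_conntrack data layout.

```c
#include <stddef.h>

#define MODEL_EXT_NUM 3

/* Old extension area: one offset per extension id, 0 meaning "absent"
 * (a convention assumed for this model). */
struct model_ext {
	size_t offset[MODEL_EXT_NUM];
	char data[64];
};

/* Records the address each extension's move hook was handed. */
const char *moved_from[MODEL_EXT_NUM];

/* Walk all extension types and pass every present one the old location
 * of its own data. new_id is the extension being added; indexing
 * offset[] with it inside the loop is the mistake being corrected. */
void model_move_all(const struct model_ext *old_ext, int new_id)
{
	int i;

	(void)new_id;	/* deliberately not used as an index */

	for (i = 0; i < MODEL_EXT_NUM; i++) {
		if (old_ext->offset[i] == 0)
			continue;
		moved_from[i] = old_ext->data + old_ext->offset[i];
	}
}
```

In this model, indexing with new_id instead of i would hand extension 1 the offset of the extension that is only now being added (0 in the old area), so its hook would be pointed at the wrong data.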

-- 
Evgeniy Polyakov


Re: Re : Re : Re : Bug in using inet_lookup ()

2007-11-15 Thread Evgeniy Polyakov
On Thu, Nov 15, 2007 at 05:29:52PM +0100, Nj A ([EMAIL PROTECTED]) wrote:
> Hello all,
> No bugs are due to the inet_lookup call now using the following:
>   if ((s_skb = alloc_skb (MAX_TCP_HEADER + 15, GFP_ATOMIC)) == NULL)
>   {
>       printk ("%s: Unable to allocate memory \n", __FUNCTION__);
>       err = -ENOMEM;
>   }
>   dev = s_skb->dev;
>
>   if (!dev)
>       printk ("%s: no device attached to s_skb\n", __FUNCTION__);
>       goto process_dev;
>
>   sk = inet_lookup (&tcp_hashinfo, src, p_src, dst, p_dst, inet_iif (s_skb));
>
>   bh_lock_sock (sk);
> process_dev:
>   spin_lock (tmp_lock);
>   new_dev = list_entry (tmp, struct net_device, todo_list);
>   spin_unlock (tmp_lock);
>   if (!new_dev)
>       printk ("%s: no device attached to new_dev \n", __FUNCTION__);
>   s_skb->dev = new_dev;
>
>   ...
>   bh_unlock_sock (sk);
>   ...
>
> However, I am not having the right results. I checked with an established
> socket and expected to see that the socket is established (which is the case)
> but got the wrong state when testing on (sk->sk_state) and the socket seems
> in the TIME_WAIT / CLOSE state.
>
> May be I am corrupting the search by manually attaching a device to the skb?
> Any idea please?

Well, your code will oops just like before: you pass an empty skb to
inet_iif(), which is wrong. Actually you will not even reach that
point, since your code will exit after the skb->dev check.

Try a simple inet_lookup(&tcp_hashinfo, src, p_src, dst, p_dst, 0).
It does work.

-- 
Evgeniy Polyakov


Re: Re : Re : Re : Re : Bug in using inet_lookup ()

2007-11-15 Thread Evgeniy Polyakov
On Thu, Nov 15, 2007 at 04:57:17PM +0000, Nj A ([EMAIL PROTECTED]) wrote:
> > Well, your code will oops just like before - you provide empty skb to
> > the inet_iif(), which is wrong. Actually you will not even reach that
> > point, since your code will exit after skb->dev check.
> >
> > Try simple inet_lookup(&tcp_hashinfo, src, p_src, dst, p_dst, 0).
>
> But trying  inet_lookup(&tcp_hashinfo, src, p_src, dst, p_dst, 0), the
> machine either hangs or panics.

Hmmm, it does not.
Please show at least one bug trace when inet_lookup(&tcp_hashinfo, 0, 0, 0, 0,
0) fails :)

> Is there any clean manner to come across this issue?

Yes, to show the code you are using.
Sorry, all mind readers are on vacation.

-- 
Evgeniy Polyakov


Re: Bug in using inet_lookup ()

2007-11-14 Thread Evgeniy Polyakov
On Wed, Nov 14, 2007 at 09:26:18AM +0000, Nj A ([EMAIL PROTECTED]) wrote:
> /* The kernel TCP hashtable */
> struct inet_hashinfo __cacheline_aligned tcp_hashinfo = {
>     .lhash_lock = __RW_LOCK_UNLOCKED (tcp_hashinfo.lhash_lock),
>     .lhash_users = ATOMIC_INIT (0),
>     .lhash_wait = __WAIT_QUEUE_HEAD_INITIALIZER (tcp_hashinfo.lhash_wait),
> };
> ...
> struct sock *sk;
> struct sk_buff *skb;
> skb = alloc_skb (MAX_TCP_HEADER + 15, GFP_KERNEL);
> if (skb == NULL)
>     printk ("%s: Unable to allocate memory \n", __FUNCTION__);
> sk = inet_lookup (&tcp_hashinfo, ip_src, src_port, ip_dst, dst_port,
>                   inet_iif (skb));
> if (!sk)
> ...
> This portion of code seems to cause the kernel to panic due to dereferencing
> a NULL pointer.
> Can anyone please tell me what is the error above?
> Best Regards,
>
 
Where exactly? Likely in inet_iif(), since it dereferences dst (routing
info), which is not present after a simple alloc_skb().
You have to set up the skb correctly; check how ip_rcv() does it.

-- 
Evgeniy Polyakov


Re: Re : Bug in using inet_lookup ()

2007-11-14 Thread Evgeniy Polyakov
On Wed, Nov 14, 2007 at 01:12:11PM +0000, Nj A ([EMAIL PROTECTED]) wrote:
> I suspected it could be that. However, can't see in ip_rcv the right portion
> that can help.
> Any further tip please?

It is ip_rcv_finish() called from ip_rcv():
    if (skb->dst == NULL) {
        int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
                                 skb->dev);
        if (unlikely(err)) {
            if (err == -EHOSTUNREACH)
                IP_INC_STATS_BH(IPSTATS_MIB_INADDRERRORS);
            else if (err == -ENETUNREACH)
                IP_INC_STATS_BH(IPSTATS_MIB_INNOROUTES);
            goto drop;
        }
    }

So you will have to specify the device your skb came in on.
Actually, the device itself is not always needed; what you need is the
interface index (dev->ifindex). You can look up the socket using that
number instead of dereferencing dst.
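
How the interface index takes part in the match can be modelled in a few lines. This is a sketch under an assumption: it encodes the usual rule that a socket not bound to any device (bound_dev_if == 0) matches every ifindex, while a bound socket matches only its own. The function name is hypothetical and the rule should be checked against the kernel version in use.

```c
/* Model of the device part of a socket match.
 * bound_dev_if: interface the socket is bound to, 0 if unbound.
 * dif: interface index passed to the lookup. */
int model_dev_match(int bound_dev_if, int dif)
{
	return bound_dev_if == 0 || bound_dev_if == dif;
}
```

Under this rule a lookup with ifindex 0 still finds every unbound socket, but misses sockets bound to a specific device, for which dev->ifindex has to be passed.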

-- 
Evgeniy Polyakov


Re: [BUG] New Kernel Bugs

2007-11-13 Thread Evgeniy Polyakov
On Tue, Nov 13, 2007 at 03:15:53AM -0800, Andrew Morton ([EMAIL PROTECTED]) 
wrote:
> > NETWORKING===
> >
> > RTNLGRP_ND_USEROPT does not report ifindex (IPv6)
> > http://bugzilla.kernel.org/show_bug.cgi?id=9349
> > Kernel: 2.6.24+
>
> No response from developers

Fixed (extended) in DaveM's tree (or will be soon; the patch was
submitted by Pierre Ynard).

Sorry, the others are either driver related (and thus require
hardware to test on and maintainers to be kicked in)
or too obscure (like the 2.6.11 bug and the weird network problem
which is undetectable on other systems).

Yes, we suck, but we try to recover :)

-- 
Evgeniy Polyakov


Re: possible bug in tcp_probe

2007-11-13 Thread Evgeniy Polyakov
Hi.

On Tue, Nov 13, 2007 at 11:26:15AM +0000, Gavin McCullagh ([EMAIL PROTECTED])
wrote:
> 74.259589763 192.168.2.1 36988 192.168.3.5 5001 0x679c23dc 0x679bc3b4 18 13 9114624 78 76 1 0 64
> 74.260590660 192.168.2.1 44261 192.168.3.5 5006 0x573bb3ed 0x573b700d 13 9 5254144 155 127 1 0 64
> 74.261607478 192.168.2.1 44261 192.168.3.5 5006 0x588.066586741 192.168.2.1 33739 192.168.3.5 5009 0xe26d1767 0xe26cf577 2 3 13090816 443 15818 1 0 64
> 88.066690797 192.168.2.1 33739 192.168.3.5 5009 0xe26d1767 0xe26cfb1f 3 3 13092864 2365 15818 1 0 64
> 88.067625714 192.168.2.1 59385 192.168.3.5 5012 0x411c1090 0x411bd258 12 9 14578688 2807 15812 1 0 64
>
> As you can see the third line has been truncated as well as the next
> roughly 14 seconds of data after which data continues writing as usual.
>
> I don't think my small changes are causing this but perhaps I'm wrong.
> Does anyone know what might be causing the above?

The log buffer has a limited size; you cannot write to it from different
threads and expect all data to be printed synchronously. There is nothing
exceptional here.
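
To make the "limited size" point concrete, here is a toy bounded log buffer. It is an illustrative model only, not the tcp_probe implementation: the model_* names, the 32-byte size and the "store what fits, drop the rest" policy are assumptions chosen to show how a half-written record can appear.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_LOG_SIZE 32

struct model_log {
	char buf[MODEL_LOG_SIZE];
	size_t used;		/* bytes written and not yet read */
};

/* Append up to len bytes; whatever does not fit is silently lost.
 * Returns the number of bytes actually stored. */
size_t model_log_put(struct model_log *log, const char *line, size_t len)
{
	size_t room = MODEL_LOG_SIZE - log->used;
	size_t n = len < room ? len : room;

	memcpy(log->buf + log->used, line, n);
	log->used += n;
	return n;
}

/* Reader side: copy everything out (out must hold MODEL_LOG_SIZE bytes)
 * and empty the buffer. */
size_t model_log_drain(struct model_log *log, char *out)
{
	size_t n = log->used;

	memcpy(out, log->buf, n);
	log->used = 0;
	return n;
}
```

If the reader does not drain fast enough, a record is cut where the buffer fills up and later records are lost until space is freed, which would look like a truncated line followed by a gap in the timestamps.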

-- 
Evgeniy Polyakov


Re: Stack Trace. Bad?

2007-11-07 Thread Evgeniy Polyakov
Hi Jon.

On Tue, Nov 06, 2007 at 02:23:03PM -0600, Jon Nelson ([EMAIL PROTECTED]) wrote:
> [linux-raid was also emailed this same information]

It looks like it was not :)

> I was testing some network throughput today and ran into this.
> I should note that I've this motherboard has 2x MCP55 Ethernet and one
> of them works fine and the other one gives lots and lots of frame
> errors under load.
>
> The following is only an harmless informational message.
> Unless you get a _continuous_flood_ of these messages it means
> everything is working fine. Allocations from irqs cannot be
> perfectly reliable and the kernel is designed to handle that.
> md0_raid5: page allocation failure. order:2, mode:0x20
>
> Call Trace:
>  <IRQ>  [802684c2] __alloc_pages+0x324/0x33d
>  [80283147] kmem_getpages+0x66/0x116
>  [8028367a] fallback_alloc+0x104/0x174
>  [80283330] kmem_cache_alloc_node+0x9c/0xa8
>  [80396984] __alloc_skb+0x65/0x138
>  [8821d82a] :forcedeth:nv_alloc_rx_optimized+0x4d/0x18f

What is the MTU for this card? Forcedeth supports jumbo frames, but does it
in a very unoptimized way, particularly by relying on being able to
allocate order-2 pages, which is wrong.

So, set the MTU to 1500 and things will be back in good shape.
I think adding fragment support is not a short-term solution because
of closed specs.
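
The "order:2" in the quoted failure means a request for 2^2 = 4 physically contiguous pages. The arithmetic can be sketched as below, assuming 4096-byte pages; the helper name is hypothetical and the 64 bytes of per-buffer overhead used in the example is an illustrative guess, not forcedeth's actual padding.

```c
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size. */
unsigned int model_alloc_order(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

Under these assumptions a 1500-byte MTU buffer (1500 + 64 bytes) fits in a single page, order 0, while a 9000-byte jumbo buffer (9000 + 64 bytes) exceeds two pages and needs order 2, the kind of contiguous allocation that tends to fail in interrupt context on a fragmented system.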

-- 
Evgeniy Polyakov


Re: kernel panic removing devices from a teql queuing discipline

2007-11-06 Thread Evgeniy Polyakov
On Mon, Nov 05, 2007 at 11:08:00PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) 
wrote:
> On Tue, Oct 30, 2007 at 01:33:41AM -0700, David Miller ([EMAIL PROTECTED])
> wrote:
> > > The panic is in __teql_resolve (which has been inlined into
> > > teql_master_xmit) in
> > > net/sched/sch_teql.c at this line:
> > >
> > >     if (n && n->tbl == mn->tbl &&
> > >
> > > Specifically the dereference of n->tbl is faulting as n is not valid.
>
> n is never valid (null), mn is garbage.

My fault, of course you are right: n is invalid because it is
dereferenced from the qdisc, which was changed. It was too late in Moscow
for conclusions...

> > > And the address looks like part of an ASCII string...  figt
> >
> > I studied sch_teql.c a bit and I suspect that the slave list
> > management in teql_destroy() and teql_qdisc_init() might be
> > suspect.
>
> teql_reset() is called from deactivate and qdisc is set to noop already,
> but subsequent teql_xmit does not know about it and dereference private
> data as teql qdisc and thus oopses. I will fix it tomorrow if you will
> not catch it first :)

It looks like I am.
Tested, works, fixed.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index f05ad9a..e0a44b9 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -263,6 +276,9 @@ __teql_resolve(struct sk_buff *skb, struct sk_buff *skb_res, struct net_device *
 static __inline__ int
 teql_resolve(struct sk_buff *skb, struct sk_buff *skb_res, struct net_device *dev)
 {
+   if (dev->qdisc == &noop_qdisc)
+       return -ENODEV;
+
    if (dev->hard_header == NULL ||
        skb->dst == NULL ||
        skb->dst->neighbour == NULL)

-- 
Evgeniy Polyakov


[0/4] Distributed storage. Squizzed black-out of the dancing back-aching hippo.

2007-11-05 Thread Evgeniy Polyakov
Hi.

I'm pleased to announce the 7th and final release of the distributed
storage subsystem (DST). It allows forming storage on top of local and
remote nodes and combining them in a linear or mirroring setup, which in
turn can be exported to remote nodes.

Short changelog:
* added strong checksum support (Castagnoli crc)
* extended autoconfiguration (added the ability to ask whether the remote
  side supports strong checksums and to turn them on if needed)
* documentation addon - sysfs files
* added clean/dirty sysfs files which allow marking a
  node as clean (in sync) or dirty (not in sync)
* fair number of bug fixes (including really tricky
  bastards, which are unlikely to be found in real
  setups, but which were still bugs)
* and the main one - added release name (it clearly shows my condition)
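
For readers unfamiliar with the Castagnoli checksum mentioned in the changelog, a minimal bitwise CRC-32C sketch follows. It is not taken from the DST patches: it uses the common parametrisation (reflected polynomial 0x82F63B78, initial value and final XOR of 0xFFFFFFFF), and whether DST uses exactly these conventions is an assumption. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), processing the least significant bit
 * first. Slow but short; real code would use a table or hardware. */
uint32_t model_crc32c(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t crc = 0xFFFFFFFFu;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= p[i];
		for (bit = 0; bit < 8; bit++)
			/* XOR in the polynomial only when the low bit is set. */
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;
}
```

With this parametrisation the ASCII string "123456789" yields 0xE3069283, the commonly published CRC-32C check value, which is a convenient self-test when comparing implementations on both sides of a link.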

Overall list of features of the DST can be found on project's homepage:

http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


-- 
Evgeniy Polyakov

