[Devel] [PATCH] [NETFILTER] ipt_SAME: add compat conversion functions

2007-11-13 Thread Konstantin Khorenko
[NETFILTER]: ipt_SAME: add compat conversion functions

ipt_SAME needs compat conversion functions because its entry structure
(ipt_same_info) contains a pointer placed between fields that are
filled/checked in both kernel and userspace.
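
A minimal userspace sketch of the layout mismatch, not the kernel code itself: native_info and compat_info are hypothetical stand-ins for ipt_same_info and compat_ipt_same_info with the trailing range array left out, and uint32_t plays the role of compat_uptr_t. On an LP64 build the two sizes differ, so a plain memcpy between the layouts would shift every field after the pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the native structure: contains a real pointer. */
struct native_info {
    unsigned char info;
    uint32_t rangesize;
    uint32_t ipnum;
    uint32_t *iparray;      /* 8 bytes on a 64-bit build */
};

/* Stand-in for the 32-bit userspace view of the same structure. */
struct compat_info {
    unsigned char info;
    uint32_t rangesize;
    uint32_t ipnum;
    uint32_t iparray;       /* plays the role of compat_uptr_t */
};

/* Convert the 32-bit layout to the native one field by field.
 * The kernel-only fields are reset, as the patch does. */
static void compat_to_native(struct native_info *dst,
                             const struct compat_info *src)
{
    assert(dst != NULL && src != NULL);
    memset(dst, 0, sizeof(*dst));
    dst->info = src->info;
    dst->rangesize = src->rangesize;
    dst->ipnum = 0;
    dst->iparray = NULL;
}

/* Returns 1 if the two layouts have different sizes on this build. */
static int layouts_differ(void)
{
    return sizeof(struct native_info) != sizeof(struct compat_info);
}
```

The sizes only differ when a pointer is wider than 32 bits, which is exactly the case CONFIG_COMPAT covers.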

Signed-off-by: Konstantin Khorenko [EMAIL PROTECTED]

---
Thank you,
Konstantin Khorenko

SWsoft Virtuozzo/OpenVZ Linux kernel team

--- ./net/ipv4/netfilter/ipt_SAME.c.SAME 2007-11-06 13:55:16.0 +0300
+++ ./net/ipv4/netfilter/ipt_SAME.c 2007-11-09 16:50:38.0 +0300
@@ -152,6 +152,47 @@ same_target(struct sk_buff *skb,
return nf_nat_setup_info(ct, newrange, hooknum);
 }
 
+#ifdef CONFIG_COMPAT
+struct compat_ipt_same_info
+{
+   unsigned char info;
+   u_int32_t rangesize;
+   u_int32_t ipnum;
+   compat_uptr_t iparray;
+
+   /* hangs off end. */
+   struct nf_nat_range range[IPT_SAME_MAX_RANGE];
+};
+
+static void compat_from_user(void *dst, void *src)
+{
+   const struct compat_ipt_same_info *cl = src;
+   struct ipt_same_info l = {
+   .info   = cl->info,
+   .rangesize  = cl->rangesize,
+   .ipnum  = 0,
+   .iparray= NULL,
+   };
+
+   memcpy(l.range, cl->range, sizeof(l.range));
+   memcpy(dst, &l, sizeof(l));
+}
+
+static int compat_to_user(void __user *dst, void *src)
+{
+   const struct ipt_same_info *l = src;
+   struct compat_ipt_same_info cl = {
+   .info   = l->info,
+   .rangesize  = l->rangesize,
+   .ipnum  = 0,
+   .iparray= (compat_uptr_t)NULL,
+   };
+
+   memcpy(cl.range, l->range, sizeof(cl.range));
+   return copy_to_user(dst, &cl, sizeof(cl)) ? -EFAULT : 0;
+}
+#endif /* CONFIG_COMPAT */
+
 static struct xt_target same_reg __read_mostly = {
.name   = "SAME",
.family = AF_INET,
@@ -161,6 +202,11 @@ static struct xt_target same_reg __read_
.hooks  = (1 << NF_IP_PRE_ROUTING | 1 << NF_IP_POST_ROUTING),
.checkentry = same_check,
.destroy= same_destroy,
+#ifdef CONFIG_COMPAT
+   .compatsize = sizeof(struct compat_ipt_same_info),
+   .compat_from_user = compat_from_user,
+   .compat_to_user = compat_to_user,
+#endif
.me = THIS_MODULE,
 };
 

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [NETFILTER]: Unable to delete a SAME rule (Using SAME target problems)

2007-11-13 Thread Konstantin Khorenko
Dear all,

The problem description: unable to delete a SAME target rule.

The problem has been already raised some time ago - at least here:
http://marc.info/?l=netfilter&m=117246219803862&w=2

The problem was originally found using 2.6.18-8.1.15.el5 x86_64 kernel and
iptables v1.3.5 (stock RHEL5) but it seems to me that the problem is still not
fixed in newer kernel/iptables versions.

---
[EMAIL PROTECTED] ~]# iptables -N foo -t nat
[EMAIL PROTECTED] ~]# iptables -t nat -A foo -j SAME --to 1.2.3.4
[EMAIL PROTECTED] ~]# iptables -L -t nat
Chain PREROUTING (policy ACCEPT)
target prot opt source   destination

Chain POSTROUTING (policy ACCEPT)
target prot opt source   destination

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination

Chain foo (0 references)
target prot opt source   destination
SAME   all  --  anywhere anywheresame:1.2.3.4
[EMAIL PROTECTED] ~]# iptables -t nat -D foo -j SAME --to 1.2.3.4
iptables: No chain/target/match by that name
---


The root of the problem - the structure ipt_same_info:
struct ipt_same_info
{
unsigned char info;
u_int32_t rangesize;
u_int32_t ipnum;
u_int32_t *iparray;

/* hangs off end. */
struct ip_nat_range range[IPT_SAME_MAX_RANGE];
};

'ipnum' and 'iparray' are filled/used in kernel space only.

Userspace (iptables) tries to delete the rule:
1) it asks the kernel for the existing table

2) the kernel provides the table.
Note: because of the generic copy code, the 'ipt_same_info' structure is copied
out in full like any other entry structure, so 'ipnum' and 'iparray' are non-zero!

3) iptables generates the ipt_same_info structure for the rule which it tries
to delete.
ipnum and iparray are zeroed.

4) iptables searches the table provided by the kernel for the rule to be deleted.
It compares many things and, at the end, it compares the module-dependent
structures (ipt_same_info).
iptables also uses generic code for comparing module-dependent structures: it
does not compare the complete structure, only the first
(struct iptables_target).userspacesize bytes of it.

extensions/libipt_SAME.c:
...
static struct iptables_target same_target = {
.name   = "SAME",
.version= IPTABLES_VERSION,
.size   = IPT_ALIGN(sizeof(struct ipt_same_info)),
.userspacesize  = IPT_ALIGN(sizeof(struct ipt_same_info)),
...

But it has to set '.userspacesize' to sizeof(struct ipt_same_info) because it
must compare the 'range' array of 'ipt_same_info', which contains the range
descriptions.

5) Because it compares the complete 'ipt_same_info', iptables is unable to find
the requested rule for deletion: the 'ipnum' and 'iparray' fields always differ
(zero in the userspace-generated structure and non-zero in the tables provided
by the kernel).

6) So the deletion fails.
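
Steps 3) to 6) can be reproduced with a small userspace sketch. It is only an illustration: same_info is a simplified copy of ipt_same_info, struct range and MAX_RANGE are hypothetical stand-ins for the real range type and IPT_SAME_MAX_RANGE, and rule_matches() models the generic "compare the first userspacesize bytes" check rather than the actual iptables code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_RANGE 10            /* stand-in for IPT_SAME_MAX_RANGE */

struct range { uint32_t min_ip, max_ip; };   /* simplified range */

struct same_info {
    unsigned char info;
    uint32_t rangesize;
    uint32_t ipnum;             /* kernel-only */
    uint32_t *iparray;          /* kernel-only */
    struct range range[MAX_RANGE];
};

/* Generic comparison: only the first userspacesize bytes are compared. */
static int rule_matches(const struct same_info *a, const struct same_info *b,
                        size_t userspacesize)
{
    assert(userspacesize <= sizeof(*a));
    return memcmp(a, b, userspacesize) == 0;
}

/* Build a rule the way userspace would: everything zeroed, one range. */
static void build_rule(struct same_info *r, uint32_t ip)
{
    memset(r, 0, sizeof(*r));   /* also zeroes the padding bytes */
    r->rangesize = 1;
    r->range[0].min_ip = ip;
    r->range[0].max_ip = ip;
}
```

Once the "kernel" copy has ipnum and iparray set, a comparison over the full structure fails even though the user-visible fields are identical.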


At the moment I can see only 3 ways of fixing this:
* rearrange struct ipt_same_info - put 'ipnum' and 'iparray' at the end of the
structure. This keeps the generic code both in kernel and userspace.

* leave struct ipt_same_info as is and teach userspace to handle more complex
masks (not only the first X bytes of the structure)

* leave struct ipt_same_info as is and teach the kernel to zero pointers and all
the fields which are used only in kernel.
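
A rough sketch of the first option, under the assumption that the kernel-only fields can simply be moved behind the range array. same_info_reordered, struct range and MAX_RANGE are hypothetical names for illustration; the real change would also alter the layout shared between kernel and userspace, which is the painful part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_RANGE 10            /* stand-in for IPT_SAME_MAX_RANGE */

struct range { uint32_t min_ip, max_ip; };   /* simplified range */

/* Hypothetical reordered layout: kernel-only fields moved to the end. */
struct same_info_reordered {
    unsigned char info;
    uint32_t rangesize;
    struct range range[MAX_RANGE];
    /* kernel-only from here on */
    uint32_t ipnum;
    uint32_t *iparray;
};

/* Size of the prefix that userspace fills in and may compare. */
static size_t user_visible_size(void)
{
    return offsetof(struct same_info_reordered, ipnum);
}

/* Prefix comparison, as the generic "first X bytes" code would do it. */
static int rule_matches_prefix(const struct same_info_reordered *a,
                               const struct same_info_reordered *b)
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, user_visible_size()) == 0;
}
```

With this layout the prefix comparison ignores ipnum and iparray but still covers the whole range array.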

All these ways are quite painful, but could someone please comment on this -
maybe I just missed it and some decision has already been made on this issue?

---
Thank you,
Konstantin Khorenko

SWsoft Virtuozzo/OpenVZ Linux kernel team



[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmscan.c: don't forget to free shrinker->nr_deferred

2015-04-28 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.el7.ovz.4.8
--
commit ed7c753cddb57b06a21bdfb3e46fd38cf567bf68
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Tue Apr 28 18:31:55 2015 +0400

ms/mm/vmscan.c: don't forget to free shrinker->nr_deferred

ms commit: ae39332162a837c3791bb21172d22382a90a6fd1

From: Andrew Vagin ava...@openvz.org

This leak was added by commit 1d3d4437eae1 ("vmscan: per-node deferred
work").

unreferenced object 0x88006ada3bd0 (size 8):
  comm criu, pid 14781, jiffies 4295238251 (age 105.641s)
  hex dump (first 8 bytes):
00 00 00 00 00 00 00 00  
  backtrace:
[8170caee] kmemleak_alloc+0x5e/0xc0
[811c0527] __kmalloc+0x247/0x310
[8117848c] register_shrinker+0x3c/0xa0
[811e115b] sget+0x5ab/0x670
[812532f4] proc_mount+0x54/0x170
[811e1893] mount_fs+0x43/0x1b0
[81202dd2] vfs_kern_mount+0x72/0x110
[81202e89] kern_mount_data+0x19/0x30
[812530a0] pid_ns_prepare_proc+0x20/0x40
[81083c56] alloc_pid+0x466/0x4a0
[8105aeda] copy_process+0xc6a/0x1860
[8105beab] do_fork+0x8b/0x370
[8105c1a6] SyS_clone+0x16/0x20
[8171f739] stub_clone+0x69/0x90
[] 0x

Signed-off-by: Andrew Vagin ava...@openvz.org

Cc: Mel Gorman mgor...@suse.de
Cc: Michal Hocko mho...@suse.cz
Cc: Rik van Riel r...@redhat.com
Cc: Johannes Weiner han...@cmpxchg.org
Cc: Glauber Costa glom...@openvz.org
Cc: Chuck Lever chuck.le...@oracle.com
Signed-off-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Linus Torvalds torva...@linux-foundation.org

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
Reported-by: Cyrill Gorcunov gorcu...@odin.com
---
 mm/vmscan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ce7611c..cd97aed 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -215,6 +215,7 @@ void unregister_shrinker(struct shrinker *shrinker)
down_write(&shrinker_rwsem);
list_del(&shrinker->list);
up_write(&shrinker_rwsem);
+   kfree(shrinker->nr_deferred);
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] bc: Add missing put_beancounter call

2015-04-28 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.el7.ovz.4.8
--
commit aeb358367d9e09f88bf97ccaa6e8cac896f318ce
Author: Cyrill Gorcunov gorcu...@odin.com
Date:   Tue Apr 28 18:37:00 2015 +0400

bc: Add missing put_beancounter call

Looks like an over-#ifdef'ed typo in the first place.
We took the beancounter but never put it back.
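
A small sketch of the imbalance, assuming the reference is taken outside the conditional block as the message implies. struct counter, get_counter() and put_counter() are hypothetical stand-ins for the beancounter API, and the feature_enabled flag models the #ifdef at run time so both cases can be exercised.

```c
#include <assert.h>

struct counter { int refs; };   /* hypothetical stand-in for a beancounter */

static struct counter *get_counter(struct counter *c)
{
    c->refs++;
    return c;
}

static void put_counter(struct counter *c)
{
    assert(c->refs > 0);
    c->refs--;
}

/* Old shape: the put sits inside the optional block, so the reference
 * leaks whenever that block is compiled out. */
static int set_prio_leaky(struct counter *c, int feature_enabled)
{
    int ret = -1;

    get_counter(c);
    if (feature_enabled) {
        ret = 0;
        put_counter(c);
    }
    return ret;
}

/* Fixed shape: the put is unconditional and balances the get. */
static int set_prio_fixed(struct counter *c, int feature_enabled)
{
    int ret = -1;

    get_counter(c);
    if (feature_enabled)
        ret = 0;
    put_counter(c);
    return ret;
}
```

The fixed variant leaves the reference count unchanged whether or not the optional block is active.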

This is a part of the fix for
https://jira.sw.ru/browse/PSBM-29895

Signed-off-by: Cyrill Gorcunov gorcu...@odin.com
Acked-by: Vladimir Davydov vdavy...@parallels.com

---
 kernel/bc/io_prio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bc/io_prio.c b/kernel/bc/io_prio.c
index 66e5404..83d3424 100644
--- a/kernel/bc/io_prio.c
+++ b/kernel/bc/io_prio.c
@@ -48,8 +48,8 @@ int ub_set_ioprio(int id, int ioprio)
ret = 0;
else
ret = -ENOTSUPP;
-   put_beancounter(ub);
 #endif
+   put_beancounter(ub);
 out:
return ret;
 }


[Devel] [PATCH RHEL7 COMMIT] mm: transcendent swap cache

2015-04-28 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.el7.ovz.4.8
--
commit dd56dcc8aa7e8b2426d94309d1b199e9ca40a9e9
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Tue Apr 28 17:32:00 2015 +0400

mm: transcendent swap cache

Transcendent swap cache (tswap) is a simple driver for frontswap, which
stores reclaimed pages in memory unmodified. Its purpose is to adopt
pages evicted from a memory cgroup on local pressure, so that they can
be fetched back later without costly disk accesses. It works similarly
to shadow gangs from PCS6, except that pages have to be copied on eviction.

Tswap pages are reclaimed on global pressure in the LRU order using the
shrinker API. Upon eviction a tswap page is copied back to the swap
cache and writeback is initiated.

* Usage

 - To enable the frontswap backend, pass tswap.enabled=1 at boot.
 - To activate/deactivate tswap, write Y/N to
   /sys/module/tswap/parameters/active
 - To get the number of pages cached, read
   /sys/module/tswap/parameters/nr_pages

https://jira.sw.ru/browse/PSBM-32063

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 config.OpenVZ |  1 +
 mm/Kconfig| 10 ++
 mm/Makefile   |  1 +
 3 files changed, 12 insertions(+)

diff --git a/config.OpenVZ b/config.OpenVZ
index 7275b27..250b85d 100644
--- a/config.OpenVZ
+++ b/config.OpenVZ
@@ -5290,6 +5290,7 @@ CONFIG_HAVE_ARCH_SOFT_DIRTY=y
 CONFIG_MEM_SOFT_DIRTY=y
 
 CONFIG_TCACHE=y
+CONFIG_TSWAP=y
 
 #
 # User resources
diff --git a/mm/Kconfig b/mm/Kconfig
index d91c9a6..e3c7f58 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -531,3 +531,13 @@ config TCACHE
  stores reclaimed pages in memory without any modifications. It is
  only worth enabling if used along with memory cgroups in order to
  cache pages which were reclaimed on local pressure.
+
+config TSWAP
+   bool "Transcendent swap cache"
+   depends on FRONTSWAP
+   default n
+   help
+ Transcendent swap cache is a simple backend for frontswap, which
+ stores reclaimed pages in memory without any modifications. It is
+ only worth enabling if used along with memory cgroups in order to
+ cache pages which were reclaimed on local pressure.
diff --git a/mm/Makefile b/mm/Makefile
index a9b1f75..cb994a6 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -62,3 +62,4 @@ obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
 obj-$(CONFIG_ZBUD) += zbud.o
 obj-$(CONFIG_TCACHE) += tcache.o
+obj-$(CONFIG_TSWAP) += tswap.o


[Devel] [PATCH RHEL7 COMMIT] mm/tcache: change API to conform to tswap

2015-04-28 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear on 
ssh://g...@git.sw.ru/vzs/vzkernel.git
after rh7-3.10.0-123.1.2.el7.ovz.4.8
--
commit 21c8fbe7ffe311f2d830065273a475a61a3b3600
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Tue Apr 28 16:48:56 2015 +0400

mm/tcache: change API to conform to tswap

Since tswap cannot be built as a module, let us forbid building tcache
as a module too. Now the API looks like:

 - To enable the cleancache backend, pass tcache.enabled=1 at boot
 - To activate/deactivate tcache, write Y/N to
   /sys/module/tcache/parameters/active

https://jira.sw.ru/browse/PSBM-31915

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 config.OpenVZ |  2 +-
 mm/Kconfig|  2 +-
 mm/tcache.c   | 18 +++---
 3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/config.OpenVZ b/config.OpenVZ
index 3df8b50..7275b27 100644
--- a/config.OpenVZ
+++ b/config.OpenVZ
@@ -5289,7 +5289,7 @@ CONFIG_MPILIB_EXTRA=y
 CONFIG_HAVE_ARCH_SOFT_DIRTY=y
 CONFIG_MEM_SOFT_DIRTY=y
 
-CONFIG_TCACHE=m
+CONFIG_TCACHE=y
 
 #
 # User resources
diff --git a/mm/Kconfig b/mm/Kconfig
index 7d17e35..d91c9a6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -523,7 +523,7 @@ config MEM_SOFT_DIRTY
  See Documentation/vm/soft-dirty.txt for more details.
 
 config TCACHE
-   tristate "Transcendent file cache"
+   bool "Transcendent file cache"
depends on CLEANCACHE
default n
help
diff --git a/mm/tcache.c b/mm/tcache.c
index 88bf82a..bc740f0 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -124,9 +124,15 @@ static struct tcache_lru *tcache_lru_node;
  * - tcache_lru->lock is independent
  */
 
-/* Enable/disable populating the cache */
+/* Enable/disable tcache backend (set at boot time) */
 static bool tcache_enabled __read_mostly;
+module_param_named(enabled, tcache_enabled, bool, 0444);
+
+/* Enable/disable populating the cache */
+static bool tcache_active __read_mostly;
+module_param_named(active, tcache_active, bool, 0644);
 
+/* Total number of pages cached */
 static DEFINE_PER_CPU(long, nr_tcache_pages);
 
 static inline u32 key_hash(const struct cleancache_filekey *key)
@@ -831,7 +837,7 @@ static void tcache_cleancache_put_page(int pool_id,
struct tcache_node *node;
struct page *cache_page;
 
-   if (!tcache_enabled)
+   if (!tcache_active)
return;
 
cache_page = tcache_alloc_page();
@@ -928,12 +934,7 @@ static int param_get_nr_pages(char *buffer, const struct kernel_param *kp)
 static struct kernel_param_ops param_ops_nr_pages = {
.get = param_get_nr_pages,
 };
-
-module_param_named(enabled, tcache_enabled, bool, 0644);
-MODULE_PARM_DESC(enabled, "Activate/deactivate tcache");
-
 module_param_cb(nr_pages, param_ops_nr_pages, NULL, 0444);
-MODULE_PARM_DESC(nr_pages, "Number of pages cached");
 
 static int __init tcache_lru_init(void)
 {
@@ -955,6 +956,9 @@ static int __init tcache_init(void)
 {
int err;
 
+   if (!tcache_enabled)
+   return 0;
+
err = tcache_lru_init();
if (err)
goto out_fail;


[Devel] [PATCH RHEL7 COMMIT] venet: Dont create venet-s in sub net namespaces (v2)

2015-04-28 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.el7.ovz.4.8
--
commit 1808194996b32fa64ea43aef8eceb6da7b17f073
Author: Cyrill Gorcunov gorcu...@parallels.com
Date:   Tue Apr 28 19:00:38 2015 +0400

venet: Dont create venet-s in sub net namespaces (v2)

Patch diff-venet-dont-create-venet-s-in-sub-net-namespaces
ported from RHEL6-based.

Docker doesn't need venet interfaces in sub net namespaces.

v2: Don't create venet if ve->netns isn't equal to the current netns

https://jira.sw.ru/browse/PSBM-29811

Signed-off-by: Andrew Vagin ava...@openvz.org
Signed-off-by: Cyrill Gorcunov gorcu...@gmail.com
Reviewed-by: Vladimir Davydov vdavy...@odin.com

vdavydov@: this patch fixes 'unshare -n' failing with EEXIST.
---
 drivers/net/venetdev.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/venetdev.c b/drivers/net/venetdev.c
index 20729d0..d4c3fa9 100644
--- a/drivers/net/venetdev.c
+++ b/drivers/net/venetdev.c
@@ -1086,6 +1086,11 @@ static __net_init int venet_init_net(struct net *net)
int err;
 
env = get_exec_env();
+   if (env->ve_netns && net != env->ve_netns) {
+   /* Don't create venet-s in sub net namespaces */
+   return 0;
+   }
+
if (env->veip)
return -EEXIST;
 


[Devel] [PATCH RHEL7 COMMIT] bc: Drop redundant put_beancounter in ubstat_get_list

2015-04-28 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.el7.ovz.4.8
--
commit f44d00cdfbdff0c850a8c8e59a4e537e47cca69b
Author: Cyrill Gorcunov gorcu...@odin.com
Date:   Tue Apr 28 18:45:12 2015 +0400

bc: Drop redundant put_beancounter in ubstat_get_list

sys_ubstat() is currently disabled, but still let's remove
redundant put.

Signed-off-by: Cyrill Gorcunov gorcu...@odin.com
Acked-by: Vladimir Davydov vdavy...@parallels.com
---
 kernel/bc/statd.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/bc/statd.c b/kernel/bc/statd.c
index 559f5ab..25aab55 100644
--- a/kernel/bc/statd.c
+++ b/kernel/bc/statd.c
@@ -85,7 +85,6 @@ static int ubstat_get_list(void __user *buf, long size)
}
rcu_read_unlock();
 
-   put_beancounter(ubp);
size = min_t(long, (ptr - page) * sizeof(*ptr), size);
if (size > 0 && copy_to_user(buf, page, size)) {
retval = -EFAULT;


[Devel] [PATCH RHEL7 COMMIT] ve: Revert ve/pid_ns: reap zombies with external parent on container's init exit

2015-04-30 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 3851f54d93e548e5af6ffeb2bb9fa63b09637d43
Author: Kirill Tkhai ktk...@odin.com
Date:   Thu Apr 30 17:17:00 2015 +0400

ve: Revert "ve/pid_ns: reap zombies with external parent on container's init exit"

Revert commit 4dba4987c61aff42e36aa3c889ba68dab84b0be8 ported from 2.6.32
kernel.

It's unnecessary because the shutdown sequence was reworked by Konstantin
Khlebnikov earlier in commit b5656165832b19ad628eee2a80a939625d43eab1.

With 4dba4987c61aff42e36aa3c889ba68dab84b0be8 applied we have a problem with
double task waiting, which leads to memory corruption.

https://jira.sw.ru/browse/PSBM-33254

NOTE: vzctl from PCS6 and vz7 does not need functionality like this. The first
one ignores signals (so its child is autoreaped), the second waits for its
in-ve child process to exit (and reaps it). So this was needed only for
versions <= PSBM5.

Signed-off-by: Kirill Tkhai ktk...@odin.com
---
 include/linux/sched.h  |  2 --
 include/linux/ve.h |  2 --
 kernel/exit.c  | 15 ---
 kernel/pid_namespace.c |  2 --
 kernel/ve/ve.c | 38 --
 5 files changed, 59 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b5e5a17..7a3b793 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2291,8 +2291,6 @@ extern int allow_signal(int);
 extern void exit_mm(struct task_struct *);
 extern int disallow_signal(int);
 
-extern int reap_zombie(struct task_struct *);
-
 extern int do_execve(const char *,
 const char __user * const __user *,
 const char __user * const __user *);
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 61f31fd..833a731 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -208,7 +208,6 @@ static inline int vtty_open_master(int veid, int idx) { return -ENODEV; }
 
 void ve_stop_ns(struct pid_namespace *ns);
 void ve_exit_ns(struct pid_namespace *ns);
-void ve_reap_external(struct pid_namespace *ns);
 int ve_start_container(struct ve_struct *ve);
 
 #else  /* CONFIG_VE */
@@ -226,7 +225,6 @@ static inline int vz_security_protocol_check(struct net 
*net, int protocol) { re
 
 static inline void ve_stop_ns(struct pid_namespace *ns) { }
 static inline void ve_exit_ns(struct pid_namespace *ns) { }
-static inline void ve_reap_external(struct pid_namespace *s) ( )
 
 #define kthread_create_on_node_ve(ve, threadfn, data, node, namefmt...)
\
kthread_create_on_node_ve(threadfn, data, node, namefmt...)
diff --git a/kernel/exit.c b/kernel/exit.c
index 5e43932..56b840c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1215,21 +1215,6 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
return retval;
 }
 
-int reap_zombie(struct task_struct *p)
-{
-   struct wait_opts wo = {
-   .wo_flags = WEXITED,
-   };
-   int ret = 0;
-
-   if (p->exit_state == EXIT_ZOMBIE && !delay_group_leader(p)) {
-   p->exit_signal = -1;
-   ret = wait_task_zombie(&wo, p);
-   }
-
-   return ret;
-}
-
 static int *task_stopped_code(struct task_struct *p, bool ptrace)
 {
if (ptrace) {
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index cbff312..173b7df 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -236,8 +236,6 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
rc = sys_wait4(-1, NULL, __WALL, NULL);
} while (rc != -ECHILD);
 
-   ve_reap_external(pid_ns);
-
/*
 * sys_wait4() above can't reap the TASK_DEAD children.
 * Make sure they all go away, see free_pid().
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 1944e23..9a6424e 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -588,44 +588,6 @@ err_kthread:
 }
 EXPORT_SYMBOL_GPL(ve_start_container);
 
-static bool ve_reap_one(struct pid_namespace *pid_ns)
-{
-   struct task_struct *task;
-   int nr;
-   bool reaped = false;
-
-   read_lock(&tasklist_lock);
-   nr = next_pidmap(pid_ns, 1);
-   while (nr > 0) {
-   rcu_read_lock();
-
-   task = pid_task(find_vpid(nr), PIDTYPE_PID);
-   if (task && task != current &&
-   task->exit_state != EXIT_DEAD &&
-   !(task->flags & PF_KTHREAD)) {
-   printk(KERN_INFO "VE#%d: found task on stop: %s (pid:"
-   " %d, exit_state: %d)\n", task->task_ve->veid,
-   task->comm, task_pid_nr(task),
-   task->exit_state);
-   reaped = true;
-   if (reap_zombie(task))
-   read_lock(&tasklist_lock);
-   }
-
-   

Re: [Devel] [PATCH] vziolimit: add bc cgroup control

2015-04-28 Thread Konstantin Khorenko
Volodya, please review it from the interface's point of view.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 04/27/2015 05:50 PM, Dmitry Monakhov wrote:
 Example: equivalent of: vzctl set $VE --iopslimit 10 --save
 
 echo 15000 > /sys/fs/cgroup/beancounter/$VE/beancounter.iopslimit.burst
 echo 1000  > /sys/fs/cgroup/beancounter/$VE/beancounter.iopslimit.latency
 echo 10    > /sys/fs/cgroup/beancounter/$VE/beancounter.iopslimit.speed
 
 Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
 ---
  include/bc/beancounter.h |7 ++
  kernel/bc/beancounter.c  |7 +--
  kernel/ve/vziolimit.c|  152 
 +-
  3 files changed, 159 insertions(+), 7 deletions(-)
 
 diff --git a/include/bc/beancounter.h b/include/bc/beancounter.h
 index 3ee6389..faf0197 100644
 --- a/include/bc/beancounter.h
 +++ b/include/bc/beancounter.h
 @@ -178,6 +178,13 @@ enum ub_severity { UB_HARD, UB_SOFT, UB_FORCE };
  #define UB_TEST  0x100
  #define UB_SEV_FLAGS UB_TEST
  
 +extern struct cgroup_subsys ub_subsys;
 +static inline struct user_beancounter *cgroup_ub(struct cgroup *cg)
 +{
 + return container_of(cgroup_subsys_state(cg, ub_subsys_id),
 + struct user_beancounter, css);
 +}
 +
  static inline int ub_barrier_hit(struct user_beancounter *ub, int resource)
  {
   return ub->ub_parms[resource].held > ub->ub_parms[resource].barrier;
 diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
 index c9b17de..76936e0 100644
 --- a/kernel/bc/beancounter.c
 +++ b/kernel/bc/beancounter.c
 @@ -319,12 +319,6 @@ LIST_HEAD(ub_list_head); /* protected by ub_list_lock */
  EXPORT_SYMBOL(ub_list_head);
  int ub_count;
  
 -static inline struct user_beancounter *cgroup_ub(struct cgroup *cg)
 -{
 - return container_of(cgroup_subsys_state(cg, ub_subsys_id),
 - struct user_beancounter, css);
 -}
 -
  /*
   *   Per user resource beancounting. Resources are tied to their luid.
   *   The resource structure itself is tagged both to the process and
 @@ -713,6 +707,7 @@ struct cgroup_subsys ub_subsys = {
   .attach = ub_cgroup_attach,
   .use_id = true,
  };
 +EXPORT_SYMBOL(ub_subsys);
  
  /*
   *   Generic resource charging stuff
 diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
 index 9e2b014..d78bf27 100644
 --- a/kernel/ve/vziolimit.c
 +++ b/kernel/ve/vziolimit.c
 @@ -25,6 +25,16 @@ struct throttle {
 long long state;  /* current state in units */
  };
  
 +enum {
 + UB_CGROUP_IOLIMIT_SPEED = 0,
 + UB_CGROUP_IOLIMIT_BURST = 1,
 + UB_CGROUP_IOLIMIT_LATENCY   = 2,
 + UB_CGROUP_IOPSLIMIT_SPEED   = 3,
 + UB_CGROUP_IOPSLIMIT_BURST   = 4,
 + UB_CGROUP_IOPSLIMIT_LATENCY = 5,
 +
 +};
 +
  /**
   * set throttler initial state, externally serialized
   * @speedmaximum speed (1/sec)
 @@ -350,16 +360,156 @@ static struct vzioctlinfo iolimit_vzioctl = {
   .owner  = THIS_MODULE,
  };
  
 +static ssize_t iolimit_cgroup_read(struct cgroup *cg, struct cftype *cft,
 +   struct file *file, char __user *buf,
 +   size_t nbytes, loff_t *ppos)
 +{
 + struct user_beancounter *ub = cgroup_ub(cg);
 + struct iolimit *iolimit = ub->private_data2;
 + unsigned long val = 0;
 + int len;
 + char str[32];
 +
 + if (!iolimit)
 + goto out;
 +
 + spin_lock_irq(&ub->ub_lock);
 + switch (cft->private) {
 + case UB_CGROUP_IOLIMIT_SPEED:
 + val = iolimit->throttle.speed;
 + break;
 + case UB_CGROUP_IOLIMIT_BURST:
 + val = iolimit->throttle.burst;
 + break;
 + case UB_CGROUP_IOLIMIT_LATENCY:
 + val = iolimit->throttle.latency;
 + break;
 +
 + case UB_CGROUP_IOPSLIMIT_SPEED:
 + val = iolimit->iops.speed;
 + break;
 + case UB_CGROUP_IOPSLIMIT_BURST:
 + val = iolimit->iops.burst;
 + break;
 + case UB_CGROUP_IOPSLIMIT_LATENCY:
 + val = iolimit->iops.latency;
 + break;
 + default:
 + BUG();
 + }
 + spin_unlock_irq(&ub->ub_lock);
 +out:
 + len = scnprintf(str, sizeof(str), "%lu\n", val);
 + return simple_read_from_buffer(buf, nbytes, ppos, str, len);
 +}
 +
 +static int iolimit_cgroup_write_u64(struct cgroup *cg, struct cftype *cft, u64 val)
 +{
 + struct user_beancounter *ub = cgroup_ub(cg);
 + struct iolimit *iolimit;
 +
 + iolimit = iolimit_get(ub);
 + if (!iolimit)
 + return -ENOMEM;
 +
 + spin_lock_irq(&ub->ub_lock);
 + iolimit->throttle.time = iolimit->iops.time = jiffies;
 +
 + switch (cft->private) {
 + case UB_CGROUP_IOLIMIT_SPEED:
 + wmb();
 + iolimit->throttle.speed = val;
 + break;
 + case UB_CGROUP_IOPSLIMIT_SPEED:
 + wmb();
 + iolimit->iops.speed = val

[Devel] test1

2015-04-28 Thread Konstantin Khorenko
test1

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team


[Devel] test3

2015-04-28 Thread Konstantin Khorenko
test3

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team


[Devel] test2

2015-04-28 Thread Konstantin Khorenko
test2

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team


[Devel] [PATCH RHEL7 COMMIT] ve: Make get_ve_by_id() lockless

2015-04-30 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 1b776f45202a6e39b9516a080212a206be840ad4
Author: Kirill Tkhai ktk...@odin.com
Date:   Thu Apr 30 19:24:53 2015 +0400

ve: Make get_ve_by_id() lockless

css_tryget() fails if css_offline has already been called,
so the true result guarantees that final put_ve() hasn't
been made yet.
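
The "take a reference only if the object is still alive" idea can be sketched in userspace with C11 atomics. This is not how css_tryget() is implemented; struct object, object_tryget() and object_put() are hypothetical names, and the sketch leaves out the RCU read-side section that keeps the looked-up object's memory valid while the attempt is made.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* refs == 0 means the final put has happened and teardown is under way. */
struct object { atomic_int refs; };

/* Take a reference only if the count is still non-zero. */
static bool object_tryget(struct object *obj)
{
    int old = atomic_load(&obj->refs);

    while (old > 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&obj->refs, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference previously taken. */
static void object_put(struct object *obj)
{
    int old = atomic_fetch_sub(&obj->refs, 1);

    assert(old > 0);
    (void)old;
}
```

A successful tryget therefore proves the count was non-zero at that moment, which is the property the lockless lookup relies on.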

Signed-off-by: Kirill Tkhai ktk...@odin.com
Acked-by: Vladimir Davydov vdavy...@parallels.com
---
 kernel/ve/ve.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 9a6424e..3ef10bc 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -226,10 +226,11 @@ EXPORT_SYMBOL(__find_ve_by_id);
 struct ve_struct *get_ve_by_id(envid_t veid)
 {
struct ve_struct *ve;
-   mutex_lock(&ve_list_lock);
+   rcu_read_lock();
ve = __find_ve_by_id(veid);
-   get_ve(ve);
-   mutex_unlock(&ve_list_lock);
+   if (ve && !css_tryget(&ve->css))
+   ve = NULL;
+   rcu_read_unlock();
return ve;
 }
 EXPORT_SYMBOL(get_ve_by_id);


[Devel] [PATCH RHEL7 COMMIT] ms/mm/memcg: do not use vmalloc for mem_cgroup allocations

2015-04-30 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit b527d3cd43eea6fa1b0d75d72c99dd1af522f0b7
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Thu Apr 30 19:28:18 2015 +0400

ms/mm/memcg: do not use vmalloc for mem_cgroup allocations

The vmalloc was introduced by 33327948782b ("memcgroup: use vmalloc for
mem_cgroup allocation"), because at that time MAX_NUMNODES was used for
defining the per-node array in the mem_cgroup structure so that the
structure could be huge even if the system had only one NUMA node.

The situation was significantly improved by commit 45cf7ebd5a03 ("memcg:
reduce the size of struct memcg 244-fold"), which made the size of the
mem_cgroup structure calculated dynamically depending on the real number
of NUMA nodes installed on the system (nr_node_ids), so now there is no
point in using vmalloc here: the structure is allocated rarely and on
most systems its size is about 1K.
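
The dynamic size calculation can be sketched in userspace as follows. struct group and struct per_node are hypothetical stand-ins for mem_cgroup and mem_cgroup_per_node, the trailing flexible array is an assumed layout, and calloc() stands in for kzalloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct per_node;                    /* opaque per-node data */

struct group {
    int id;
    struct per_node *nodeinfo[];    /* one slot per node, sized at run time */
};

/* Fixed part plus one pointer per node actually present. */
static size_t group_size(unsigned int nr_nodes)
{
    return sizeof(struct group) + nr_nodes * sizeof(struct per_node *);
}

/* Zeroed allocation of exactly the size needed for nr_nodes nodes. */
static struct group *group_alloc(unsigned int nr_nodes)
{
    struct group *g = calloc(1, group_size(nr_nodes));

    assert(g == NULL || g->id == 0);
    return g;
}
```

Because the size grows by one pointer per node, it stays small on typical systems and a single allocator is enough.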

Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Acked-by: Michal Hocko mho...@suse.cz
Cc: Glauber Costa glom...@openvz.org
Cc: Johannes Weiner han...@cmpxchg.org
Cc: Balbir Singh bsinghar...@gmail.com
Cc: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Signed-off-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Linus Torvalds torva...@linux-foundation.org
(cherry picked from commit 8ff69e2c85f84b6b371e3c1d01207e73c0500125)
---
 mm/memcontrol.c | 28 ++--
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ac75d76..e772a06 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -48,7 +48,6 @@
 #include <linux/sort.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
-#include <linux/vmalloc.h>
 #include <linux/vmpressure.h>
 #include <linux/mm_inline.h>
 #include <linux/page_cgroup.h>
@@ -357,12 +356,6 @@ struct mem_cgroup {
struct mem_cgroup_lru_info info;
 };
 
-static size_t memcg_size(void)
-{
-   return sizeof(struct mem_cgroup) +
-   nr_node_ids * sizeof(struct mem_cgroup_per_node *);
-}
-
 /* internal only representation about the status of kmem accounting. */
 enum {
KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */
@@ -6010,14 +6003,12 @@ static void free_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 static struct mem_cgroup *mem_cgroup_alloc(void)
 {
struct mem_cgroup *memcg;
-   size_t size = memcg_size();
+   size_t size;
 
-   /* Can be very big if nr_node_ids is very big */
-   if (size < PAGE_SIZE)
-   memcg = kzalloc(size, GFP_KERNEL);
-   else
-   memcg = vzalloc(size);
+   size = sizeof(struct mem_cgroup);
+   size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
 
+   memcg = kzalloc(size, GFP_KERNEL);
if (!memcg)
return NULL;
 
@@ -6028,10 +6019,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
return memcg;
 
 out_free:
-   if (size < PAGE_SIZE)
-   kfree(memcg);
-   else
-   vfree(memcg);
+   kfree(memcg);
return NULL;
 }
 
@@ -6049,7 +6037,6 @@ out_free:
 static void __mem_cgroup_free(struct mem_cgroup *memcg)
 {
int node;
-   size_t size = memcg_size();
 
mem_cgroup_remove_from_trees(memcg);
free_css_id(&mem_cgroup_subsys, &memcg->css);
@@ -6071,10 +6058,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 * the cgroup_lock.
 */
disarm_static_keys(memcg);
-   if (size < PAGE_SIZE)
-   kfree(memcg);
-   else
-   vfree(memcg);
+   kfree(memcg);
 }
 
 /*
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] mm/memcg: remove memcg from kmemcg_sharers list on css free

2015-04-30 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit cdfd4f41f48ed049db36168ee4a52a6f91f0640e
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Thu Apr 30 19:17:11 2015 +0400

mm/memcg: remove memcg from kmemcg_sharers list on css free

When a memcg dir is removed, memcg is added to the kmemcg_sharers list
of its parent, so that when the parent dies too, we will be able to
update kmemcg_id of all its children (see memcg_deactivate_kmem). When a
memcg is freed, it should therefore be removed from its parent's
kmemcg_sharers list, but currently it is not. This leads to a
use-after-free which shows up, in particular, as the following warning:

[   94.460097] WARNING: at lib/list_debug.c:29 __list_add+0x65/0xc0()
[   94.460157] list_add corruption. next->prev should be prev (88010b8825d8), but was 88008ed7a5e0. (next=88008ed7a5d8).
[   94.460257] Modules linked in:
[   94.465299] CPU: 1 PID: 12987 Comm: vzctl ve: 0 Not tainted 3.10.0+ #14 ovz.4.8-9-gf68f6df24106
[   94.465359] Hardware name:
[   94.465418]  81806524 7dfeaa4e 8800a27d9d08 815c9c3c
[   94.465745]  8800a27d9d40 8105da71 88008eb525d8 88008ed7a5d8
[   94.466021]  88010b8825d8  88003668bf90 8800a27d9da8
[   94.466467] Call Trace:
[   94.466539]  [815c9c3c] dump_stack+0x19/0x1b
[   94.466609]  [8105da71] warn_slowpath_common+0x61/0x80
[   94.466674]  [8105daec] warn_slowpath_fmt+0x5c/0x80
[   94.466743]  [815cd792] ? mutex_lock+0x12/0x2f
[   94.466812]  [812bba95] __list_add+0x65/0xc0
[   94.466882]  [811aea23] mem_cgroup_css_offline+0x143/0x1d0
[   94.466951]  [810e4317] cgroup_destroy_locked+0xe7/0x370
[   94.467011]  [810e45c2] cgroup_rmdir+0x22/0x40
[   94.467093]  [811ca286] vfs_rmdir+0x96/0xf0
[   94.467192]  [811ca485] do_rmdir+0x1a5/0x200
[   94.467334]  [811c17fe] ? SYSC_newstat+0x3e/0x60
[   94.467396]  [811cd2d6] SyS_rmdir+0x16/0x20
[   94.467455]  [815da3d9] system_call_fastpath+0x16/0x1b

Fix this by adding the missing list_del to css_free. Note, all the list
manipulations are protected by the cgroup_mutex, which is taken for both
css_offline and css_free, so no extra protection is needed.

Also, do not call memcg_destroy_kmem_caches if kmem accounting was not
activated, because it is pointless - there cannot be any slab caches in
such a case.
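
The unlink-before-free rule described above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the memcg code: `struct list_node`, `struct child`, `child_create()` and `child_free()` are hypothetical names modelled loosely on the kernel's `list_head`, and the child is linked at creation time here for brevity, whereas the commit links it on css_offline.

```c
#include <stdlib.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node {
	struct list_node *next, *prev;
};

void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert node right after head. */
void list_add_node(struct list_node *node, struct list_node *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Unlink node from whatever list it is on. */
void list_del_node(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = NULL;
	node->prev = NULL;
}

/* Hypothetical child object linked into its parent's "sharers" list. */
struct child {
	int id;
	struct list_node sharers;
};

struct child *child_create(int id, struct list_node *parent_sharers)
{
	struct child *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->id = id;
	list_add_node(&c->sharers, parent_sharers);
	return c;
}

/* Unlink first, then free: the parent's list never points at freed memory. */
void child_free(struct child *c)
{
	list_del_node(&c->sharers);
	free(c);
}

/* Count entries by walking the list; only safe if no freed node is linked. */
int list_count(const struct list_node *head)
{
	int n = 0;

	for (const struct list_node *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

If `child_free()` skipped the `list_del_node()` call, the next insertion or traversal on the parent's list would touch freed memory, which is the kind of corruption the `__list_add` warning reports.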

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/memcontrol.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7775a9b..a94926f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5733,7 +5733,10 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 
 static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
-   memcg_destroy_kmem_caches(memcg);
+   if (test_bit(KMEM_ACCOUNTED_ACTIVATED, &memcg->kmem_account_flags)) {
+   list_del(&memcg->kmemcg_sharers);
+   memcg_destroy_kmem_caches(memcg);
+   }
mem_cgroup_sockets_destroy(memcg);
 }
 


[Devel] [PATCH RHEL7 COMMIT] ms/mm/memcg: fix memcg_size() calculation

2015-04-30 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit c6094f3be2fa9604f679529532229dc40f90c08a
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Thu Apr 30 19:28:18 2015 +0400

ms/mm/memcg: fix memcg_size() calculation

The mem_cgroup structure contains nr_node_ids pointers to
mem_cgroup_per_node objects, not the objects themselves.
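
The size difference can be made concrete with a small standalone sketch. The structures below are hypothetical stand-ins (their fields are invented for illustration); only the shape matters, a header followed by a trailing array of per-node pointers.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-node object; the real one differs. */
struct per_node_sketch {
	long stats[64];
};

/* Hypothetical stand-in for the container: it ends in an array of pointers. */
struct memcg_sketch {
	int id;
	struct per_node_sketch *nodeinfo[];
};

/* Pre-fix calculation: reserves room for whole per-node objects. */
size_t memcg_size_old(size_t nr_node_ids)
{
	return sizeof(struct memcg_sketch) +
		nr_node_ids * sizeof(struct per_node_sketch);
}

/* Post-fix calculation: reserves room for the pointers only. */
size_t memcg_size_new(size_t nr_node_ids)
{
	return sizeof(struct memcg_sketch) +
		nr_node_ids * sizeof(struct per_node_sketch *);
}
```

With these stand-ins the old formula over-allocates by `nr_node_ids * (sizeof(struct per_node_sketch) - sizeof(void *))` bytes; the result was never too small, only wastefully large.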

Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Acked-by: Michal Hocko mho...@suse.cz
Cc: Glauber Costa glom...@openvz.org
Cc: Johannes Weiner han...@cmpxchg.org
Cc: Balbir Singh bsinghar...@gmail.com
Cc: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Signed-off-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Linus Torvalds torva...@linux-foundation.org
(cherry picked from commit 695c60830764945cf61a2cc623eb1392d137223e)
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a94926f..ac75d76 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -360,7 +360,7 @@ struct mem_cgroup {
 static size_t memcg_size(void)
 {
return sizeof(struct mem_cgroup) +
-   nr_node_ids * sizeof(struct mem_cgroup_per_node);
+   nr_node_ids * sizeof(struct mem_cgroup_per_node *);
 }
 
 /* internal only representation about the status of kmem accounting. */


[Devel] [PATCH RHEL7 COMMIT] ve: add config options for vzlist and vznetstat modules

2015-04-30 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit ec888edb0dcfbfb717a685797a62273a5d3281d9
Author: Kirill Tkhai ktk...@odin.com
Date:   Thu Apr 30 20:06:11 2015 +0400

ve: add config options for vzlist and vznetstat modules

Port diff-vz-hookin-generic-3

+ CONFIG_VZ_LIST
+ CONFIG_VE_NETDEV_ACCOUNTING

https://jira.sw.ru/browse/PSBM-19217

Signed-off-by: Kirill Tkhai ktk...@odin.com
---
 config.OpenVZ |  2 ++
 include/linux/vznetstat.h | 40 ++--
 kernel/Kconfig.openvz | 17 +
 kernel/ve/Makefile|  8 +++-
 4 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/config.OpenVZ b/config.OpenVZ
index 250b85d..35f2609 100644
--- a/config.OpenVZ
+++ b/config.OpenVZ
@@ -5269,8 +5269,10 @@ CONFIG_CHECKPOINT_RESTORE=y
 #
 CONFIG_VE=y
 CONFIG_VE_CALLS=m
+CONFIG_VZ_LIST=m
 CONFIG_VZ_GENCALLS=y
 CONFIG_VE_NETDEV=m
+CONFIG_VE_NETDEV_ACCOUNTING=m
 CONFIG_VE_ETHDEV=m
 CONFIG_VZ_DEV=m
 CONFIG_VE_IPTABLES=y
diff --git a/include/linux/vznetstat.h b/include/linux/vznetstat.h
index 2a6d1ae..b6627cc 100644
--- a/include/linux/vznetstat.h
+++ b/include/linux/vznetstat.h
@@ -47,18 +47,17 @@ static inline int venet_acct_skb_size(struct sk_buff *skb)
return skb->data_len + (skb->tail - skb->network_header);
 }
 
+struct ve_addr_struct;
+
+#if IS_ENABLED(CONFIG_VE_NETDEV_ACCOUNTING)
+struct venet_stat *venet_acct_find_stat(envid_t veid);
+struct venet_stat *venet_acct_find_create_stat(envid_t veid);
 static inline void venet_acct_get_stat(struct venet_stat *stat)
 {
atomic_inc(&stat->users);
 }
 void   venet_acct_put_stat(struct venet_stat *);
 
-struct venet_stat *venet_acct_find_create_stat(envid_t veid);
-struct venet_stat *venet_acct_find_stat(envid_t veid);
-int init_venet_acct_ip_stat(struct ve_struct *env, struct venet_stat *stat);
-void fini_venet_acct_ip_stat(struct ve_struct *env);
-
-struct ve_addr_struct;
 void venet_acct_classify_add_incoming(struct venet_stat *, struct sk_buff *skb);
 void venet_acct_classify_add_outgoing(struct venet_stat *, struct sk_buff *skb);
 void venet_acct_classify_sub_outgoing(struct venet_stat *, struct sk_buff *skb);
@@ -69,4 +68,33 @@ void venet_acct_classify_add_outgoing_plain(struct venet_stat *stat,
struct ve_addr_struct *dst_addr, int data_size);
 void ip_vznetstat_touch(void);
 
+int init_venet_acct_ip_stat(struct ve_struct *env, struct venet_stat *stat);
+void fini_venet_acct_ip_stat(struct ve_struct *env);
+#else /* !CONFIG_VE_NETDEV_ACCOUNTING */
+static inline void venet_acct_get_stat(struct venet_stat *stat) { }
+static inline void venet_acct_put_stat(struct venet_stat *stat) { }
+
+static inline void venet_acct_classify_add_incoming(struct venet_stat *stat,
+   struct sk_buff *skb) {}
+static inline void venet_acct_classify_add_outgoing(struct venet_stat *stat,
+   struct sk_buff *skb) {}
+static inline void venet_acct_classify_sub_outgoing(struct venet_stat *stat,
+   struct sk_buff *skb) {}
+
+static inline void venet_acct_classify_add_incoming_plain(struct venet_stat *stat,
+   struct ve_addr_struct *src_addr, int data_size) {}
+static inline void venet_acct_classify_add_outgoing_plain(struct venet_stat *stat,
+   struct ve_addr_struct *dst_addr, int data_size) {}
+static inline void ip_vznetstat_touch(void) {}
+
+static inline int init_venet_acct_ip_stat(struct ve_struct *env, struct venet_stat *stat)
+{
+   return 0;
+}
+static void fini_venet_acct_ip_stat(struct ve_struct *env)
+{
+}
+
+#endif /* CONFIG_VE_NETDEV_ACCOUNTING */
+
 #endif
diff --git a/kernel/Kconfig.openvz b/kernel/Kconfig.openvz
index 3cd83e3..3eb2fd2 100644
--- a/kernel/Kconfig.openvz
+++ b/kernel/Kconfig.openvz
@@ -66,6 +66,23 @@ config VE_IPTABLES
help
  This option controls whether to build VE netfiltering code.
 
+config VZ_LIST
+   tristate "VE listing/statistics user ioctl interface"
+   depends on VE
+   default m
+   help
+ This options controls building of vzlist module.
+ This module provides ioctl interfaces for fetching VE ids, ip addresses
+ and pids of running processes.
+
+config VE_NETDEV_ACCOUNTING
+   tristate "VE networking accounting"
+   depends on VE_NETDEV
+   default m
+   help
+ This option allows traffic accounting on Virtual Networking device and
+ on real devices moved to a Virtual Environment
+
 config VZ_WDOG
tristate "VE watchdog module"
depends on VE_CALLS
diff --git a/kernel/ve/Makefile b/kernel/ve/Makefile
index 044a337..1037943 100644
--- a/kernel/ve/Makefile
+++ b/kernel/ve/Makefile
@@ -13,12 +13,10 @@ vzmon-objs = vecalls.o
 

[Devel] [PATCH RHEL7 COMMIT] ms/exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 3dd868ef8c2428e0d9c9e148d9a06f972c2f7ade
Author: Oleg Nesterov o...@redhat.com
Date:   Tue May 5 16:27:10 2015 +0400

ms/exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting

alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
fails because disable_pid_allocation() was called by the exiting
child_reaper.

We could simply move get_pid_ns() down to successful return, but this fix
tries to be as trivial as possible.
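
The shape of the leak and of the fix can be sketched in plain C. This is a simplified model, not the kernel's `alloc_pid()`: `struct ns_sketch`, `ns_get()`, `ns_put()` and `alloc_id_sketch()` are hypothetical names, and the reference count is a plain integer rather than a kref.

```c
#include <stdbool.h>

/* Hypothetical namespace object with a plain reference counter. */
struct ns_sketch {
	int refcount;
	bool allocation_disabled;	/* set once the reaper is exiting */
};

void ns_get(struct ns_sketch *ns)
{
	ns->refcount++;
}

void ns_put(struct ns_sketch *ns)
{
	ns->refcount--;
}

/* Returns 0 on success, -1 if allocation has been disabled. */
int alloc_id_sketch(struct ns_sketch *ns)
{
	ns_get(ns);			/* reference taken up front */

	if (ns->allocation_disabled)
		goto out_unlock;

	return 0;			/* success: the new id keeps the reference */

out_unlock:
	ns_put(ns);			/* the fix: drop it again on this failure path */
	return -1;
}
```

Without the `ns_put()` on the failure path, every failed allocation would leave the counter one too high, so the namespace could never be released.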

Signed-off-by: Oleg Nesterov o...@redhat.com
Reviewed-by: Eric W. Biederman ebied...@xmission.com
Cc: Aaron Tomlin atom...@redhat.com
Cc: Pavel Emelyanov xe...@parallels.com
Cc: Serge Hallyn serge.hal...@ubuntu.com
Cc: Sterling Alexander stale...@redhat.com
Cc: sta...@vger.kernel.org
Signed-off-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Linus Torvalds torva...@linux-foundation.org
(cherry picked from commit 24c037ebf5723d4d9ab0996433cee4f96c292a4d)

The memory leak was found by kmemleak:

unreferenced object 0x880099efcec0 (size 2192):
  comm vzctl, pid 11269, jiffies 4294743454 (age 315.703s)
  hex dump (first 32 bytes):
27 00 00 00 00 00 00 00 ff 7f 00 00 00 00 00 00  '...
00 80 fa 6d 00 88 ff ff 00 80 00 00 00 00 00 00  ...m
  backtrace:
[815af0de] kmemleak_alloc+0x4e/0xb0
[8119d288] kmem_cache_alloc+0x148/0x220
[810ec283] copy_pid_ns+0xa3/0x360
[8108be93] create_new_namespaces+0xd3/0x180
[8108c045] copy_namespaces+0x75/0x110
[8105bf1f] copy_process.part.34+0x90f/0x14e0
[8105cbfc] do_fork+0xbc/0x350
[8105cf46] SyS_clone+0x16/0x20
[815da4b9] stub_clone+0x69/0x90
[] 0x
unreferenced object 0x88006dfa8000 (size 4096):
  comm vzctl, pid 11269, jiffies 4294743454 (age 315.703s)
  hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[815af0de] kmemleak_alloc+0x4e/0xb0
[8119d4b4] kmem_cache_alloc_trace+0x154/0x240
[810ec2a8] copy_pid_ns+0xc8/0x360
[8108be93] create_new_namespaces+0xd3/0x180
[8108c045] copy_namespaces+0x75/0x110
[8105bf1f] copy_process.part.34+0x90f/0x14e0
[8105cbfc] do_fork+0xbc/0x350
[8105cf46] SyS_clone+0x16/0x20
[815da4b9] stub_clone+0x69/0x90
[] 0x

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 kernel/pid.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/pid.c b/kernel/pid.c
index f02eafe..5dd2a7e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -336,6 +336,8 @@ out:
 
 out_unlock:
spin_unlock_irq(&pidmap_lock);
+   put_pid_ns(ns);
+
 out_free:
while (++i <= ns->level)
free_pidmap(pid->numbers + i);


[Devel] [PATCH RHEL7 COMMIT] vziolimit: add bc cgroup control

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 2954da8377f2be38c31142d8aea37e22d8cefc4a
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 16:37:42 2015 +0400

vziolimit: add bc cgroup control

Example: Equivalent of: vzctl set $VE --iopslimit 10 --save

echo 15000 > /sys/fs/cgroup/beancounter/$VE/beancounter.iopslimit.burst
echo 1000 > /sys/fs/cgroup/beancounter/$VE/beancounter.iopslimit.latency
echo 10 > /sys/fs/cgroup/beancounter/$VE/beancounter.iopslimit.speed

https://jira.sw.ru/browse/PSBM-32281

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
Acked-by: Vladimir Davydov vdavy...@parallels.com

khorenko@:
A new cgroup interface is needed because a Container must be identified
not by its CTID but by a string (the CT UUID, in particular).
---
 include/bc/beancounter.h |   7 +++
 kernel/bc/beancounter.c  |   7 +--
 kernel/ve/vziolimit.c| 152 ++-
 3 files changed, 159 insertions(+), 7 deletions(-)

diff --git a/include/bc/beancounter.h b/include/bc/beancounter.h
index 3ee6389..faf0197 100644
--- a/include/bc/beancounter.h
+++ b/include/bc/beancounter.h
@@ -178,6 +178,13 @@ enum ub_severity { UB_HARD, UB_SOFT, UB_FORCE };
 #define UB_TEST0x100
 #define UB_SEV_FLAGS   UB_TEST
 
+extern struct cgroup_subsys ub_subsys;
+static inline struct user_beancounter *cgroup_ub(struct cgroup *cg)
+{
+   return container_of(cgroup_subsys_state(cg, ub_subsys_id),
+   struct user_beancounter, css);
+}
+
 static inline int ub_barrier_hit(struct user_beancounter *ub, int resource)
 {
return ub->ub_parms[resource].held > ub->ub_parms[resource].barrier;
diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
index c9b17de..76936e0 100644
--- a/kernel/bc/beancounter.c
+++ b/kernel/bc/beancounter.c
@@ -319,12 +319,6 @@ LIST_HEAD(ub_list_head); /* protected by ub_list_lock */
 EXPORT_SYMBOL(ub_list_head);
 int ub_count;
 
-static inline struct user_beancounter *cgroup_ub(struct cgroup *cg)
-{
-   return container_of(cgroup_subsys_state(cg, ub_subsys_id),
-   struct user_beancounter, css);
-}
-
 /*
  * Per user resource beancounting. Resources are tied to their luid.
  * The resource structure itself is tagged both to the process and
@@ -713,6 +707,7 @@ struct cgroup_subsys ub_subsys = {
.attach = ub_cgroup_attach,
.use_id = true,
 };
+EXPORT_SYMBOL(ub_subsys);
 
 /*
  * Generic resource charging stuff
diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index a6f900d..fc8b24a 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -24,6 +24,16 @@ struct throttle {
long long state;/* current state in units */
 };
 
+enum {
+   UB_CGROUP_IOLIMIT_SPEED = 0,
+   UB_CGROUP_IOLIMIT_BURST = 1,
+   UB_CGROUP_IOLIMIT_LATENCY   = 2,
+   UB_CGROUP_IOPSLIMIT_SPEED   = 3,
+   UB_CGROUP_IOPSLIMIT_BURST   = 4,
+   UB_CGROUP_IOPSLIMIT_LATENCY = 5,
+
+};
+
 /**
  * set throttler initial state, externally serialized
  * @speed  maximum speed (1/sec)
@@ -349,16 +359,156 @@ static struct vzioctlinfo iolimit_vzioctl = {
.owner  = THIS_MODULE,
 };
 
+static ssize_t iolimit_cgroup_read(struct cgroup *cg, struct cftype *cft,
+ struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+   struct user_beancounter *ub = cgroup_ub(cg);
+   struct iolimit *iolimit = ub->private_data2;
+   unsigned long val = 0;
+   int len;
+   char str[32];
+
+   if (!iolimit)
+   goto out;
+
+   spin_lock_irq(&ub->ub_lock);
+   switch (cft->private) {
+   case UB_CGROUP_IOLIMIT_SPEED:
+   val = iolimit->throttle.speed;
+   break;
+   case UB_CGROUP_IOLIMIT_BURST:
+   val = iolimit->throttle.burst;
+   break;
+   case UB_CGROUP_IOLIMIT_LATENCY:
+   val = iolimit->throttle.latency;
+   break;
+
+   case UB_CGROUP_IOPSLIMIT_SPEED:
+   val = iolimit->iops.speed;
+   break;
+   case UB_CGROUP_IOPSLIMIT_BURST:
+   val = iolimit->iops.burst;
+   break;
+   case UB_CGROUP_IOPSLIMIT_LATENCY:
+   val = iolimit->iops.latency;
+   break;
+   default:
+   BUG();
+   }
+   spin_unlock_irq(&ub->ub_lock);
+out:
+   len = scnprintf(str, sizeof(str), "%lu\n", val);
+   return simple_read_from_buffer(buf, nbytes, ppos, str, len);
+}
+
+static int iolimit_cgroup_write_u64(struct cgroup *cg, struct cftype *cft, u64 val)
+{
+   struct user_beancounter *ub = cgroup_ub(cg);
+   struct iolimit *iolimit;
+
+   iolimit = iolimit_get(ub);
+   

[Devel] [PATCH RHEL7 COMMIT] ve/tty: Prevent iteration with NULL dev in device_destroy_namespace()

2015-05-06 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit b60174a217555f4bf9d0c37616fba4a4cb8561dd
Author: Kirill Tkhai ktk...@odin.com
Date:   Wed May 6 18:14:24 2015 +0400

ve/tty: Prevent iteration with NULL dev in device_destroy_namespace()

We must check the return value of class_find_device() before calling the
namespace method. Otherwise we may pass a NULL device to ve_namespace(),
as in the oops below:

[4.455518] BUG: unable to handle kernel NULL pointer dereference at 0278
[4.456336] IP: [81233caa] ve_namespace+0xa/0x40
[4.456336] PGD 0
[4.456336] Oops:  [#1] SMP
[4.456336] Modules linked in:
[4.456336] CPU: 1 PID: 1 Comm: swapper/0 ve: 0 Not tainted 3.10.0-123.1.2.el7.ovz.4.8.x86_64 #1 ovz.4.8
[4.456336] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
[4.456336] task: 8800777e ti: 8800777e8000 task.ti: 8800777e8000
[4.456336] RIP: 0010:[81233caa]  [81233caa] ve_namespace+0xa/0x40
[4.456336] RSP: :8800777e99c8  EFLAGS: 00010286
[4.456336] RAX: 81233ca0 RBX:  RCX:
[4.456336] RDX: 091e RSI: 8800777e99e4 RDI:
[4.456336] RBP: 8800777e99d0 R08:  R09: 0004
[4.456336] R10: 001c2000 R11:  R12: 81960d60
[4.456336] R13:  R14: 819656a0 R15: 81991380
[4.456336] FS:  () GS:88007a88() knlGS:
[4.456336] CS:  0010 DS:  ES:  CR0: 8005003b
[4.456336] CR2: 0278 CR3: 018ae000 CR4: 07e0
[4.456336] DR0:  DR1:  DR2:
[4.456336] DR3:  DR6: 0ff0 DR7: 0400
[4.456336] Stack:
[4.456336]   8800777e9a00 813a1585 0040004281335d74
[4.456336]  88006f348000 0002 88006f36c380 8800777e9a20
[4.456336]  8135b2a2 821f7bf0 88006f36c2b0 8800777e9a50
[4.456336] Call Trace:
[4.456336]  [813a1585] device_destroy_namespace+0x25/0x80
[4.456336]  [8135b2a2] tty_unregister_device+0x32/0x60
[4.456336]  [8137bc4e] uart_remove_one_port+0x9e/0x130
[4.456336]  [81380a5e] serial8250_register_8250_port+0xae/0x390
[4.456336]  [81385b18] pciserial_init_ports+0x118/0x210
[4.456336]  [81385ced] pciserial_init_one+0xdd/0x200
[4.456336]  [812d5915] local_pci_probe+0x45/0xa0
[4.456336]  [812d6d05] ? pci_match_device+0xc5/0xd0
[4.456336]  [812d6e89] pci_device_probe+0x139/0x150
[4.456336]  [813a3d37] driver_probe_device+0x87/0x390
[4.456336]  [813a4113] __driver_attach+0x93/0xa0
[4.456336]  [813a4080] ? __device_attach+0x40/0x40
[4.456336]  [813a1ac3] bus_for_each_dev+0x73/0xc0
[4.456336]  [813a378e] driver_attach+0x1e/0x20
[4.456336]  [813a32e0] bus_add_driver+0x200/0x2d0
[4.456336]  [813a4794] driver_register+0x64/0xf0
[4.456336]  [812d6b75] __pci_register_driver+0xa6/0xc0
[4.456336]  [81a2eadb] ? early_serial_setup+0x129/0x129
[4.456336]  [81a2eaf4] serial_pci_driver_init+0x19/0x1b
[4.456336]  [810020e2] do_one_initcall+0xe2/0x190
[4.456336]  [819ed14f] kernel_init_freeable+0x17d/0x21c
[4.456336]  [819ec92b] ? do_early_param+0x88/0x88
[4.456336]  [815a8010] ? rest_init+0x80/0x80
[4.456336]  [815a801e] kernel_init+0xe/0x180
[4.456336]  [815d5f2c] ret_from_fork+0x7c/0xb0
[4.456336]  [815a8010] ? rest_init+0x80/0x80
[4.456336]  [815a801e] kernel_init+0xe/0x180
[4.456336]  [815d5f2c] ret_from_fork+0x7c/0xb0
[4.456336]  [815a8010] ? rest_init+0x80/0x80
[4.456336] Code: ff ff ff 66 0f 1f 44 00 00 49 89 45 18 e9 35 ff ff ff 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 48 83 bf 78 02 00 00 00 48 89 fb 74 11 48 c7 c0 c0 cb 8e 81 5b

0x278 is the offset of the groups attribute in struct device.
The patch fixes the problem and makes the check logic similar to that in
device_destroy().
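
The pattern being restored is "look up, test for NULL, only then dereference". Below is a minimal userspace sketch of that pattern; it is not the driver-core code, and `struct dev_sketch`, `find_dev()` and `destroy_in_namespace()` are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical device record: a name plus the id of its owning namespace. */
struct dev_sketch {
	const char *name;
	int ns_id;
};

/* Lookup helper that, like a class lookup, may legitimately return NULL. */
struct dev_sketch *find_dev(struct dev_sketch *devs, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(devs[i].name, name) == 0)
			return &devs[i];
	return NULL;
}

/* Returns 1 if a matching device in ns_id was removed, 0 otherwise. */
int destroy_in_namespace(struct dev_sketch *devs, size_t n,
			 const char *name, int ns_id)
{
	struct dev_sketch *dev = find_dev(devs, n, name);

	if (!dev)
		return 0;	/* nothing found: never inspect dev's namespace */
	if (dev->ns_id != ns_id)
		return 0;	/* found, but it belongs to another namespace */

	dev->name = "";		/* mark the slot as removed */
	return 1;
}
```

Reading `dev->ns_id` before the NULL test would be the userspace analogue of the oops above, where a field of a NULL `struct device` was read at offset 0x278.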

https://jira.sw.ru/browse/PSBM-33239

Signed-off-by: Kirill Tkhai ktk...@odin.com
---
 drivers/base/core.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git 

[Devel] [PATCH RHEL7 COMMIT] ms/sysfs: do not account sysfs_ino_ida allocations to memcg

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 5f52f9964f8b79ebf893aca281fdeab82bde4495
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Thu May 7 13:11:38 2015 +0400

ms/sysfs: do not account sysfs_ino_ida allocations to memcg

Patch has been added to the mm tree:

http://www.spinics.net/lists/stable/msg89832.html

==
sysfs_ino_ida is used for sysfs inode number allocations. Since IDA has
a layered structure, different IDs can reside on the same layer, which
is currently accounted to some memory cgroup. The problem is that each
kmem cache of a memory cgroup has its own directory on sysfs (under
/sys/fs/kernel/<cache-name>/cgroup). If the inode number of such a
directory or any file in it gets allocated from a layer accounted to the
cgroup which the cache is created for, the cgroup will get pinned for
good, because one has to free all kmem allocations accounted to a cgroup
in order to release it and destroy all its kmem caches. That said we
must not account layers of sysfs_ino_ida to any memory cgroup.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/sysfs/dir.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 2c68c20..e12273c 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -265,7 +265,7 @@ static int sysfs_alloc_ino(unsigned int *pino)
spin_unlock(&sysfs_ino_lock);
 
if (rc == -EAGAIN) {
-   if (ida_pre_get(&sysfs_ino_ida, GFP_KERNEL))
+   if (ida_pre_get(&sysfs_ino_ida, GFP_KERNEL | __GFP_NOACCOUNT))
goto retry;
rc = -ENOMEM;
}


[Devel] [PATCH RHEL7 COMMIT] ms/gfp: add __GFP_NOACCOUNT

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 9e40c9060eb049c22f3ee44efb58bd3f3dbfc1b2
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Thu May 7 13:11:24 2015 +0400

ms/gfp: add __GFP_NOACCOUNT

Patch has been added to the mm tree:

http://www.spinics.net/lists/stable/msg89831.html

==
Not all kmem allocations should be accounted to memcg. The following
patch gives an example when accounting of a certain type of allocations
to memcg can effectively result in a memory leak. This patch adds the
__GFP_NOACCOUNT flag which if passed to kmalloc and friends will force
the allocation to go through the root cgroup. It will be used by the
next patch.

Note: since with kmemleak enabled each kmalloc implies yet another
allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
gfp_kmemleak_mask.
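
How such an opt-out flag steers an allocation can be sketched in a few lines of standalone C. The flag values, `struct cache_sketch` and `pick_cache()` below are hypothetical and only mirror the order of the checks in the patch; they are not the kernel's definitions.

```c
/* Hypothetical flag bits; the kernel's real values differ. */
typedef unsigned int gfp_sketch_t;

#define SK_GFP_KERNEL		0x01u
#define SK_GFP_NOFAIL		0x02u
#define SK_GFP_NOACCOUNT	0x04u

/* Hypothetical cache descriptor: who gets charged for objects from it. */
struct cache_sketch {
	const char *owner;
};

const struct cache_sketch root_cache = { "root" };
const struct cache_sketch memcg_cache = { "memcg" };

/*
 * Choose the cache to allocate from. Anything that must not be accounted
 * falls back to the root cache, which is not charged to a memory cgroup.
 */
const struct cache_sketch *pick_cache(gfp_sketch_t gfp, int accounting_enabled)
{
	if (!accounting_enabled)
		return &root_cache;
	if (gfp & SK_GFP_NOACCOUNT)	/* caller opted out explicitly */
		return &root_cache;
	if (gfp & SK_GFP_NOFAIL)	/* cannot fail, so it is not charged */
		return &root_cache;
	return &memcg_cache;
}
```

A caller opts out by OR-ing the flag into its usual mask, for example `pick_cache(SK_GFP_KERNEL | SK_GFP_NOACCOUNT, 1)`, in the same way the sysfs patch passes `GFP_KERNEL | __GFP_NOACCOUNT`.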

Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Changes in v2:
 - do not account kmemleak objects to memcg for __GFP_NOACCOUNT allocations

 include/linux/gfp.h|2 ++
 include/linux/memcontrol.h |4 
 mm/kmemleak.c  |3 ++-
 3 files changed, 8 insertions(+), 1 deletion(-)
---
 include/linux/gfp.h| 2 ++
 include/linux/memcontrol.h | 4 
 mm/kmemleak.c  | 3 ++-
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4d66a5d..d14bb78 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -30,6 +30,7 @@ struct vm_area_struct;
 #define ___GFP_HARDWALL0x2u
 #define ___GFP_THISNODE0x4u
 #define ___GFP_RECLAIMABLE 0x8u
+#define ___GFP_NOACCOUNT   0x10u
 #define ___GFP_NOTRACK 0x20u
 #define ___GFP_NO_KSWAPD   0x40u
 #define ___GFP_OTHER_NODE  0x80u
@@ -85,6 +86,7 @@ struct vm_area_struct;
 #define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
+#define __GFP_NOACCOUNT ((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to memcg */
 #define __GFP_NOTRACK  ((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
 
 #define __GFP_NO_KSWAPD((__force gfp_t)___GFP_NO_KSWAPD)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2e6c045..675b4c5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -486,6 +486,8 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
if (!memcg_kmem_enabled())
return true;
 
+   if (gfp & __GFP_NOACCOUNT)
+   return true;
/*
 * __GFP_NOFAIL allocations will move on even if charging is not
 * possible. Therefore we don't even try, and have this allocation
@@ -549,6 +551,8 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 {
if (!memcg_kmem_enabled())
return cachep;
+   if (gfp & __GFP_NOACCOUNT)
+   return cachep;
if (gfp & __GFP_NOFAIL)
return cachep;
if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index c8d7f31..98e1b34 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -114,7 +114,8 @@
 #define BYTES_PER_POINTER  sizeof(void *)
 
 /* GFP bitmask for kmemleak internal allocations */
-#define gfp_kmemleak_mask(gfp) (((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
+#define gfp_kmemleak_mask(gfp) (((gfp) & (GFP_KERNEL | GFP_ATOMIC | \
+  __GFP_NOACCOUNT)) | \
 __GFP_NORETRY | __GFP_NOMEMALLOC | \
 __GFP_NOWARN)
 


[Devel] [PATCH RHEL7 COMMIT] ve/cgroups: Drop virtualization code, v5

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.11
--
commit 7f018a3da1e259083cbd37dc5f1198b17c775f7b
Author: Cyrill Gorcunov gorcu...@odin.com
Date:   Fri May 8 01:44:15 2015 +0400

ve/cgroups: Drop virtualization code, v5

Here we rip off all the virtualization code we introduced into kernel to
behave close to rhel6.

Because we're trying a new concept (bindmounting from the node) it is
no longer needed.

Now some details:

 - drop cgroup_show_path -- we don't hide VEID in /proc/self/cgroup output,
   it doesn't break criu so no need to carry it, same applies to changes
   in cgroup_path;

 - because we drop virtualization of systemd -- disable creation of new
   hierarchies in container: we neither support it nor need it. The
   primary reason why we allowed new hierarchies in container was that
   CRIU used to run the restore procedure inside VE, but now we initiate
   restore from VE0, so we can safely disable new hierarchies;

 - in cgroup_addrm_files go back to former RHEL7 code; if we need something
   special here it must be reviewed carefully and separately;

 - no need to hide /proc/cgroups in VE, there is no sensible info present.

v2:
 - take into account commits 38f039db6e023ac14517219ad6f674633c4e99ca
   and c2ac6df22b20389ae2d0af49c436b00ff3243e89 removing cgroup_is_disposable,
   cgroup_kernel_destroy, ve::ve_cgroup_head.

 - drop CGRP_WEAK, CGRP_SELF_DESTRUCTION and CGRP_VE_TOP_CGROUP_VIRTUAL flags,
   which implies the cgroups are no longer auto-cleaned up; a user-space tool
   (read: vzctl and friends) should handle cgroup removal

 - because we're moving to native cgroups code we don't virtualize release
   agent anymore

 - still cgroup::cgroup_ve member is needed because we're using it
   all over the code

v3:
 - move back ve_offline, we need to free ve id

v4:
- use native call_usermodehelper in release_agent execution, we don't
  virtualize cgroups, but I kept the error code and pr_warn so it would
  be easier to identify problems, if any
- drop cgroup::cgroup_ve member, no longer used
- drop unused cgroup_kernel_destroy

v5:
 - disable mounting of cgroups inside VE
 - disable modifying toplevel bindmount cgroup
   files from inside of container, except ve cgroup,
   where we need to write START to kick container to
   run (probably we will need more control here for
   the restore via CRIU case, this has not been
   investigated yet)
 - drop redundant @cgrp from ve_offline

Signed-off-by: Cyrill Gorcunov gorcu...@odin.com
Acked-by: Vladimir Davydov vdavy...@odin.com

CC: Konstantin Khorenko khore...@odin.com
CC: Pavel Emelyanov xe...@odin.com
CC: Andrey Vagin ava...@odin.com
---
 include/linux/cgroup.h  |  17 +---
 include/linux/ve.h  |   1 -
 kernel/bc/beancounter.c |  11 +--
 kernel/cgroup.c | 252 +++-
 kernel/ve/ve.c  |  10 --
 kernel/ve/vecalls.c |   6 +-
 6 files changed, 63 insertions(+), 234 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f6c6105..a7b6941 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -46,7 +46,6 @@ struct cgroup_sb_opts {
 enum cgroup_open_flags {
CGRP_CREAT  = 0x0001,   /* create if not found */
CGRP_EXCL   = 0x0002,   /* fail if already exist */
-   CGRP_WEAK   = 0x0004,   /* arm cgroup self-destruction */
 };
 
 struct vfsmount *cgroup_kernel_mount(struct cgroup_sb_opts *opts);
@@ -56,7 +55,6 @@ struct cgroup *cgroup_kernel_open(struct cgroup *parent,
 int cgroup_kernel_remove(struct cgroup *parent, const char *name);
 int cgroup_kernel_attach(struct cgroup *cgrp, struct task_struct *tsk);
 void cgroup_kernel_close(struct cgroup *cgrp);
-void cgroup_kernel_destroy(struct cgroup *cgrp);
 
 extern int cgroup_init_early(void);
 extern int cgroup_init(void);
@@ -190,10 +188,6 @@ enum {
CGRP_CPUSET_CLONE_CHILDREN,
/* see the comment above CGRP_ROOT_SANE_BEHAVIOR for details */
CGRP_SANE_BEHAVIOR,
-   CGRP_SELF_DESTRUCTION,
-
-   /* container virtualization */
-   CGRP_VE_TOP_CGROUP_VIRTUAL,
 };
 
 struct cgroup_name {
@@ -241,13 +235,6 @@ struct cgroup {
 
struct cgroupfs_root *root;
 
-   /* The path to use for release notifications. */
-   char *release_agent;
-
-   /* Owner VE for fake cgroup hierarchy */
-   struct ve_struct *cgroup_ve;
-   struct list_head cgroup_ve_list;
-
/*
 * List of cg_cgroup_links pointing at css_sets with
 * tasks in this cgroup. Protected by css_set_lock
@@ -325,7 +312,6 @@ enum {
 
CGRP_ROOT_NOPREFIX  = (1  1

[Devel] [PATCH RHEL7 COMMIT] ve/cgroup: devices -- Modify exception list for docker sake

2015-05-06 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit b41a8db9cdf3a598c9abe35cb968b0ab476e8eeb
Author: Cyrill Gorcunov gorcu...@odin.com
Date:   Wed May 6 20:34:38 2015 +0400

ve/cgroup: devices -- Modify exception list for docker sake

When docker starts up it modifies nested device cgroups. The devices it needs
to operate on are almost the same as those already in our exception list,
except:

 1) Add ACC_MKNOD for every device we have

This is a harmless operation, done simply to make docker happy.

 2) Add setting up ACC_MKNOD for devices created for container
via set_device_perms_ve. At the moment this is important
for VT use inside container.

 3) Add MISC_MAJOR:200 for tun device

Tun/tap is safe to use inside container as far as I know.

p.s. khorenko@ approved this kind of change in pcs7.

 4) For some reason docker requires write access to /dev/random,
grant it (since we're prohibiting writing to /dev/random
from inside of ve on kernel level, it's safe to do).

v2:
 - Use ns_capable(CAP_VE_SYS_ADMIN) instead of plain capable(CAP_SYS_ADMIN)
   for docker sake. Note the vanilla kernel no longer has any can_attach
   helper, but to make the patch smaller let's keep it. ns_capable should
   be enough for security, after all the user in container may attach own
   tasks only.

v3:
 - Use nsown_capable.

v4:
 - Switch back to the plain capable() test. It turned out that the vanilla
   kernel has no capability test in devcgroup_can_attach (nor does it
   have this helper), while nsown_capable looks too relaxed.
   So I think we could use plain capable() as we do in the PCS6 kernel,
   while at the same time requiring CAP_VE_SYS_ADMIN to be present inside
   the container.

Signed-off-by: Cyrill Gorcunov gorcu...@odin.com
Acked-by: Konstantin Khorenko khore...@odin.com

CC: Vladimir Davydov vdavy...@odin.com
CC: Pavel Emelyanov xe...@odin.com
CC: Andrey Vagin ava...@odin.com
---
 security/device_cgroup.c | 38 --
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 53adb00..31024f7 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -16,6 +16,7 @@
 #include <uapi/linux/vzcalluser.h>
 #include <linux/major.h>
 #include <linux/module.h>
+#include <linux/capability.h>
 
 #define ACC_MKNOD 1
 #define ACC_READ  2
@@ -80,7 +81,7 @@ static int devcgroup_can_attach(struct cgroup *new_cgrp,
 {
struct task_struct *task = cgroup_taskset_first(set);
 
-   if (current != task && !capable(CAP_SYS_ADMIN))
+   if (current != task && !capable(CAP_SYS_ADMIN) && !capable(CAP_VE_SYS_ADMIN))
return -EPERM;
return 0;
 }
@@ -662,7 +663,7 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
struct cgroup *p = devcgroup->css.cgroup;
struct dev_cgroup *parent = NULL;
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!capable(CAP_SYS_ADMIN) && !capable(CAP_VE_SYS_ADMIN))
return -EPERM;
 
if (p->parent)
@@ -984,21 +985,22 @@ int devcgroup_inode_mknod(int mode, dev_t dev)
 #ifdef CONFIG_VE
 
 static struct dev_exception_item default_whitelist_items[] = {
-   { ~0,   ~0, DEV_CHAR, ACC_HIDDEN | ACC_MKNOD },
-   { ~0,   ~0, DEV_BLOCK, ACC_HIDDEN | ACC_MKNOD },
-   { UNIX98_PTY_MASTER_MAJOR,  ~0, DEV_CHAR, ACC_HIDDEN | ACC_READ | ACC_WRITE },
-   { UNIX98_PTY_SLAVE_MAJOR,   ~0, DEV_CHAR, ACC_HIDDEN | ACC_READ | ACC_WRITE },
-   { PTY_MASTER_MAJOR, ~0, DEV_CHAR, ACC_HIDDEN | ACC_READ | ACC_WRITE },
-   { PTY_SLAVE_MAJOR,  ~0, DEV_CHAR, ACC_HIDDEN | ACC_READ | ACC_WRITE },
-   { MEM_MAJOR, /* null */  3, DEV_CHAR, ACC_HIDDEN | ACC_READ | ACC_WRITE },
-   { MEM_MAJOR, /* zero */  5, DEV_CHAR, ACC_HIDDEN | ACC_READ | ACC_WRITE },
-   { MEM_MAJOR, /* full */  7, DEV_CHAR, ACC_HIDDEN | ACC_READ | ACC_WRITE },
-   { TTYAUX_MAJOR, /* tty */   0, DEV_CHAR, ACC_HIDDEN | ACC_READ | ACC_WRITE },
-   { TTYAUX_MAJOR, /* console */   1, DEV_CHAR, ACC_HIDDEN | ACC_READ | ACC_WRITE },
-   { TTYAUX_MAJOR, /* ptmx */  2, DEV_CHAR, ACC_HIDDEN | ACC_READ | ACC_WRITE },
-   { MEM_MAJOR, /* random */ 8, DEV_CHAR, ACC_HIDDEN | ACC_READ },
-   { MEM_MAJOR, /* urandom */   9, DEV_CHAR, ACC_HIDDEN | ACC_READ | ACC_WRITE },
-   { MEM_MAJOR, /* kmsg */  11, DEV_CHAR, ACC_HIDDEN | ACC_WRITE },
+   { ~0,   ~0, DEV_CHAR,   ACC_HIDDEN | ACC_MKNOD },
+   { ~0,   ~0, DEV_BLOCK,  ACC_HIDDEN | ACC_MKNOD

[Devel] [PATCH RHEL7 COMMIT] bc: add {, un}charge_beancounter_fast define for !CONFIG_BEANCOUNTERS

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 0a7dbbce30853b60f4f6f8a4c51257f3ed04d158
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:04 2015 +0400

bc: add {,un}charge_beancounter_fast define for !CONFIG_BEANCOUNTERS

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Fix this:

  CC  mm/shmem.o
mm/shmem.c: In function ‘shmem_acct_size’:
mm/shmem.c:180:2: error: implicit declaration of function
‘charge_beancounter_fast’ [-Werror=implicit-function-declaration]
  ret = charge_beancounter_fast(ub, UB_PRIVVMPAGES, pages, UB_HARD);
  ^
mm/shmem.c:211:2: error: implicit declaration of function
‘uncharge_beancounter_fast’ [-Werror=implicit-function-declaration]
  uncharge_beancounter_fast(ub, UB_PRIVVMPAGES, pages);
  ^
cc1: some warnings being treated as errors

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 include/bc/beancounter.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/bc/beancounter.h b/include/bc/beancounter.h
index 4889c3f..f332de9 100644
--- a/include/bc/beancounter.h
+++ b/include/bc/beancounter.h
@@ -270,8 +270,10 @@ static inline void ub_init_early(void) { };
 static inline int charge_beancounter(struct user_beancounter *ub,
int resource, unsigned long val,
enum ub_severity strict) { return 0; }
+#define charge_beancounter_fast charge_beancounter
 static inline void uncharge_beancounter(struct user_beancounter *ub,
int resource, unsigned long val) { }
+#define uncharge_beancounter_fast uncharge_beancounter
 
 static inline void ub_reclaim_rate_limit(struct user_beancounter *ub,
 int wait, unsigned count) { }
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
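
A note on the bc/beancounter.h patch above: when accounting is compiled out, the
header provides empty inline stubs, and the _fast variants are simply aliased to
them with #define. The standalone sketch below illustrates that idea only. All
names in it (demo_counter, demo_charge, demo_charge_fast, DEMO_FEATURE) are
made up for illustration and are not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

struct demo_counter { unsigned long held; };

#ifndef DEMO_FEATURE
/* Feature compiled out: the slow-path helpers collapse to no-ops. */
static inline int demo_charge(struct demo_counter *c, int resource,
                              unsigned long val, int strict)
{
    (void)c; (void)resource; (void)val; (void)strict;
    return 0;
}
/* The fast variant takes the same arguments, so a plain alias is enough. */
#define demo_charge_fast demo_charge

static inline void demo_uncharge(struct demo_counter *c, int resource,
                                 unsigned long val)
{
    (void)c; (void)resource; (void)val;
}
#define demo_uncharge_fast demo_uncharge
#endif /* !DEMO_FEATURE */

/* Caller written once; it builds with the stubs above, and would build with
 * the feature enabled as long as that build supplies the same four names. */
static int demo_account_pages(struct demo_counter *c, unsigned long pages)
{
    assert(c != NULL);
    if (demo_charge_fast(c, 0, pages, 1))
        return -1;
    demo_uncharge_fast(c, 0, pages);
    return 0;
}
```

The alias only works because the stub and the fast variant share a signature;
if they diverged, a separate inline stub would be needed instead.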


[Devel] [PATCH RHEL7 COMMIT] bc/net.h: don't define CONFIG_BEANCOUNTERS if undefined

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit d2036c1302ae10a427d0bfdf22c2e1f8eb692041
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:05 2015 +0400

bc/net.h: don't define CONFIG_BEANCOUNTERS if undefined

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

This is pretty ugly, but without this patch CONFIG_BEANCOUNTERS
appears as defined even if it's not in .config, which leads to the
following error (and who knows what else):

  CC  kernel/cgroup.o
include/linux/cgroup_subsys.h:95:8: error: ‘ub_subsys_id’ undeclared 
here (not in a function)
 SUBSYS(ub)
^
kernel/cgroup.c:108:21: note: in definition of macro ‘SUBSYS’
 #define SUBSYS(_x) [_x ## _subsys_id] = _x ## _subsys,
 ^
In file included from kernel/cgroup.c:111:0:
include/linux/cgroup_subsys.h:95:1: error: array index in initializer not 
of integer type
 SUBSYS(ub)
 ^
include/linux/cgroup_subsys.h:95:1: error: (near initialization for 
‘subsys’)
include/linux/cgroup_subsys.h:95:8: error: ‘ub_subsys’ undeclared here 
(not in a function)
 SUBSYS(ub)
^
kernel/cgroup.c:108:42: note: in definition of macro ‘SUBSYS’
 #define SUBSYS(_x) [_x ## _subsys_id] = _x ## _subsys,
  ^
make[1]: *** [kernel/cgroup.o] Error 1

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 include/bc/net.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/bc/net.h b/include/bc/net.h
index 90e57a1..e0fb572 100644
--- a/include/bc/net.h
+++ b/include/bc/net.h
@@ -18,7 +18,10 @@
 #include <bc/sock.h>
 #include <bc/beancounter.h>
 
+#ifdef CONFIG_BEANCOUNTERS
 #undef CONFIG_BEANCOUNTERS
+#define CONFIG_BEANCOUNTERS_WILL_BE_BACK
+#endif
 #undef __BC_DECL_H_
 #undef UB_DECLARE_FUNC
 #undef UB_DECLARE_VOID_FUNC
@@ -213,7 +216,10 @@ UB_DECLARE_VOID_FUNC(ub_skb_set_charge(struct sk_buff *skb,
struct sock *sk, unsigned long size, int res))
 UB_DECLARE_FUNC(int, __ub_too_many_orphans(struct sock *sk, int count))
 
+#ifdef CONFIG_BEANCOUNTERS_WILL_BE_BACK
 #define CONFIG_BEANCOUNTERS 1
+#undef CONFIG_BEANCOUNTERS_WILL_BE_BACK
+#endif
 #undef __BC_DECL_H_
 #undef UB_DECLARE_FUNC
 #undef UB_DECLARE_VOID_FUNC
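
The save/restore trick in the bc/net.h patch above can be reproduced in a few
lines of plain C. This is only a sketch of the preprocessor pattern: the macro
names DEMO_FEATURE and DEMO_FEATURE_WILL_BE_BACK and the helper functions are
invented here, and DEMO_FEATURE is defined at the top purely so the
"feature enabled" path is exercised.

```c
#include <assert.h>

#define DEMO_FEATURE 1  /* delete this line to model a config without the feature */

/* Hide the feature for one region, remembering whether it was set at all. */
#ifdef DEMO_FEATURE
#undef DEMO_FEATURE
#define DEMO_FEATURE_WILL_BE_BACK
#endif

static int demo_seen_inside(void)
{
#ifdef DEMO_FEATURE
    return 1;
#else
    return 0;
#endif
}

/* Restore the feature only if it was set before the region started. */
#ifdef DEMO_FEATURE_WILL_BE_BACK
#define DEMO_FEATURE 1
#undef DEMO_FEATURE_WILL_BE_BACK
#endif

static int demo_seen_after(void)
{
#ifdef DEMO_FEATURE
    return 1;
#else
    return 0;
#endif
}

/* Holds as written because DEMO_FEATURE is defined at the top of this file. */
static void demo_self_check(void)
{
    assert(demo_seen_inside() == 0);  /* hidden inside the region */
    assert(demo_seen_after() == 1);   /* restored afterwards */
}
```

Without the marker macro, the unconditional "#define ... 1" at the end of the
region would switch the feature on even for builds that never enabled it,
which is the failure the commit message describes.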


[Devel] [PATCH RHEL7 COMMIT] ve/sched: cpu_cgroup_get_stat() declaration in !CONFIG_VZ_FAIRSCHED case

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 26d7aef283252ea60ce7debfd8b8806867efdf7a
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:06 2015 +0400

ve/sched: cpu_cgroup_get_stat() declaration in !CONFIG_VZ_FAIRSCHED case

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Function cpu_cgroup_get_stat() is needed by proc/uptime.c
even in the !CONFIG_VZ_FAIRSCHED case.

This fixes the following compilation problem:

  CC  fs/proc/uptime.o
fs/proc/uptime.c: In function ‘get_veX_idle’:
fs/proc/uptime.c:32:2: error: implicit declaration of function
‘cpu_cgroup_get_stat’ [-Werror=implicit-function-declaration]
  cpu_cgroup_get_stat(cgrp, kstat);
  ^
cc1: some warnings being treated as errors

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 include/linux/fairsched.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/fairsched.h b/include/linux/fairsched.h
index c89cb86..c944bed 100644
--- a/include/linux/fairsched.h
+++ b/include/linux/fairsched.h
@@ -53,8 +53,6 @@ int fairsched_new_node(int id, unsigned int vcpus);
 int fairsched_move_task(int id, struct task_struct *tsk);
 void fairsched_drop_node(int id, int leave);
 
-struct kernel_cpustat;
-void cpu_cgroup_get_stat(struct cgroup *cgrp, struct kernel_cpustat *kstat);
 int fairsched_get_cpu_stat(int id, struct kernel_cpustat *kstat);
 
 int cpu_cgroup_get_avenrun(struct cgroup *cgrp, unsigned long *avenrun);
@@ -75,6 +73,10 @@ static inline int fairsched_get_cpu_avenrun(int id, unsigned long *avenrun) { re
 static inline int fairsched_get_cpu_stat(int id, struct kernel_cpustat *kstat) { return -ENOSYS; }
 
 #endif /* CONFIG_VZ_FAIRSCHED */
+
+struct kernel_cpustat;
+void cpu_cgroup_get_stat(struct cgroup *cgrp, struct kernel_cpustat *kstat);
+
 #endif /* __KERNEL__ */
 
 #endif /* __LINUX_FAIRSCHED_H__ */
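
The header layout that the fairsched.h patch above arrives at can be sketched
as follows: helpers that exist only with the feature get stubs in the #else
branch, while a prototype that every configuration needs sits outside the
conditional. The names here (DEMO_FAIRSCHED, demo_get_avenrun, demo_get_stat,
demo_stat) are hypothetical stand-ins, not the kernel's.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_stat;   /* an incomplete type is enough for a prototype */

#ifdef DEMO_FAIRSCHED
int demo_get_avenrun(int id, unsigned long *avenrun);
#else
/* Feature compiled out: callers get a stub that reports "not supported". */
static inline int demo_get_avenrun(int id, unsigned long *avenrun)
{
    (void)id; (void)avenrun;
    return -ENOSYS;
}
#endif /* DEMO_FAIRSCHED */

/* Declared outside the conditional because it has callers in every config. */
void demo_get_stat(int id, struct demo_stat *st);

/* What would normally live in a .c file that is always built. */
struct demo_stat { unsigned long idle; };

void demo_get_stat(int id, struct demo_stat *st)
{
    assert(st != NULL);
    st->idle = (unsigned long)id * 10;
}
```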


[Devel] [PATCH RHEL7 COMMIT] bc/io_acct: define get_io_ub() for !CONFIG_BC_IO_ACCOUNTING case

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit ad7f5ed73d43314f612d647fb60a56d1562ce773
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:07 2015 +0400

bc/io_acct: define get_io_ub() for !CONFIG_BC_IO_ACCOUNTING case

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

In case CONFIG_BC_IO_ACCOUNTING is defined, virtinfo.h
is included via task_io_accounting_ops.h that includes
bc/io_acct.h that includes virtinfo.h.

In case CONFIG_BC_IO_ACCOUNTING is not defined, we are screwed.

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 fs/direct-io.c   | 1 +
 include/bc/io_acct.h | 5 +
 2 files changed, 6 insertions(+)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 8e67f35..b61227d 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -38,6 +38,7 @@
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
 #include <linux/aio.h>
+#include <linux/virtinfo.h>
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
diff --git a/include/bc/io_acct.h b/include/bc/io_acct.h
index 0456fbf..b3bcfd1 100644
--- a/include/bc/io_acct.h
+++ b/include/bc/io_acct.h
@@ -118,6 +118,11 @@ static inline bool ub_should_skip_writeback(struct user_beancounter *ub,
return false;
 }
 
+static inline struct user_beancounter *get_io_ub(void)
+{
+   return NULL;
+}
+
 #endif /* UBC_IO_ACCT */
 
 #endif


[Devel] [PATCH RHEL7 COMMIT] ve/kernel: fix kernel_thread() compilation for !CONFIG_VE

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit a68f7bbb30ef8ae0adcbd5f772989ae37fac6941
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:09 2015 +0400

ve/kernel: fix kernel_thread() compilation for !CONFIG_VE

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

This is a fix to commit 0800da0.

Variable ve_allow_kthreads is defined in kernel/ve/ve.c which is only
included if CONFIG_VE is set, otherwise we get this:

kernel/fork.c: In function ‘kernel_thread’:
kernel/fork.c:1723:7: error: ‘ve_allow_kthreads’ undeclared (first use
in this function)
  if (!ve_allow_kthreads && !ve_is_super(get_exec_env())) {

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 kernel/fork.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 5c857bc..95950e7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1719,12 +1719,14 @@ long do_fork(unsigned long clone_flags,
  */
 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 {
+#ifdef CONFIG_VE
/* Don't allow kernel_thread() inside VE */
if (!ve_allow_kthreads && !ve_is_super(get_exec_env())) {
printk("kernel_thread call inside container\n");
dump_stack();
return -EPERM;
}
+#endif
 
return do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
(unsigned long)arg, NULL, NULL);


[Devel] [PATCH RHEL7 COMMIT] ve/sched: put move_task_groups() under CONFIG_CFS_CPULIMIT

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 0237fbfd0a0acc3b75cafce1050e392c0463a67b
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:10 2015 +0400

ve/sched: put move_task_groups() under CONFIG_CFS_CPULIMIT

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Function move_task_groups() is not defined if CONFIG_CFS_CPULIMIT
is not set, so the compiler complains.

This is a fix to commit 51fd1aec7.

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfb4c28..ec866ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6947,6 +6947,7 @@ more_balance:
double_rq_unlock(env.dst_rq, busiest);
local_irq_restore(flags);
 
+#ifdef CONFIG_CFS_CPULIMIT
if (!ld_moved && (env.flags & LBF_ALL_PINNED)) {
env.loop = 0;
local_irq_save(flags);
@@ -6955,6 +6956,7 @@ more_balance:
double_rq_unlock(env.dst_rq, busiest);
local_irq_restore(flags);
}
+#endif
 
/*
 * some other cpu did the load balance for us.


[Devel] [PATCH RHEL7 COMMIT] ve/mm/filemap.c: include virtinfo.h for !CONFIG_BC_IO_ACCOUNTING case

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 2aa67fdf2f2a8096311ab22afae0537d123c5446
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:12 2015 +0400

ve/mm/filemap.c: include virtinfo.h for !CONFIG_BC_IO_ACCOUNTING case

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

In case CONFIG_BC_IO_ACCOUNTING is not set, linux/virtinfo.h
is not getting included into mm/filemap.c via bc/io_acct.h,
and we have this:

mm/filemap.c: In function ‘do_generic_file_read’:
mm/filemap.c:1573:3: error: implicit declaration of function
‘virtinfo_notifier_call’ [-Werror=implicit-function-declaration]
   virtinfo_notifier_call(VITYPE_IO, VIRTINFO_IO_PREPARE, NULL);

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 mm/filemap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index c328b4a..43e3345 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -45,6 +45,7 @@
 
 #include <asm/mman.h>
 
+#include <linux/virtinfo.h>
 #include <bc/io_acct.h>
 
 /*


[Devel] [PATCH RHEL7 COMMIT] bc/mm/{memory.c, mprotect.c}: use mm_ub() macro

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 38f2eb7684e989f7cc57f8c6b6b612ddaf7d884e
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:12 2015 +0400

bc/mm/{memory.c,mprotect.c}: use mm_ub() macro

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Fix !CONFIG_BEANCOUNTERS compilation

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 mm/memory.c   | 2 +-
 mm/mprotect.c | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7961198..5ec71da 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3193,7 +3193,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
mem_cgroup_commit_charge_swapin(page, ptr);
 
swap_free(entry);
-   if (vm_swap_full() || ub_swap_full(mm->mm_ub) ||
+   if (vm_swap_full() || ub_swap_full(mm_ub(mm)) ||
(vma->vm_flags & VM_LOCKED) || PageMlocked(page))
try_to_free_swap(page);
unlock_page(page);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index d976ae6..b55899d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -282,7 +282,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
error = -ENOMEM;
if (!VM_UB_PRIVATE(oldflags, vma->vm_file) &&
VM_UB_PRIVATE(newflags, vma->vm_file) &&
-   charge_beancounter_fast(mm->mm_ub, UB_PRIVVMPAGES, nrpages, UB_SOFT))
+   charge_beancounter_fast(mm_ub(mm), UB_PRIVVMPAGES, nrpages, UB_SOFT))
goto fail_ch;
 
/*
@@ -348,7 +348,7 @@ success:
 
if (VM_UB_PRIVATE(oldflags, vma->vm_file) &&
!VM_UB_PRIVATE(newflags, vma->vm_file))
-   uncharge_beancounter_fast(mm->mm_ub, UB_PRIVVMPAGES, nrpages);
+   uncharge_beancounter_fast(mm_ub(mm), UB_PRIVVMPAGES, nrpages);
 
perf_event_mmap(vma);
return 0;
@@ -358,7 +358,7 @@ fail:
 fail_sec:
if (!VM_UB_PRIVATE(oldflags, vma->vm_file) &&
VM_UB_PRIVATE(newflags, vma->vm_file))
-   uncharge_beancounter_fast(mm->mm_ub, UB_PRIVVMPAGES, nrpages);
+   uncharge_beancounter_fast(mm_ub(mm), UB_PRIVVMPAGES, nrpages);
 fail_ch:
return error;
 }
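
Why an accessor macro helps in the mm/memory.c and mm/mprotect.c patch above:
the struct member exists only when the feature is configured in, so direct
member access fails to compile otherwise, while a macro can expand to something
harmless. The sketch below assumes a NULL fallback for the disabled case; how
the real mm_ub() is defined without CONFIG_BEANCOUNTERS is not shown in the
patch, and every demo_* name here is invented.

```c
#include <assert.h>
#include <stddef.h>

struct demo_ub { int id; };

struct demo_mm {
    int users;
#ifdef DEMO_BEANCOUNTERS
    struct demo_ub *mm_ub;   /* member exists only with the feature enabled */
#endif
};

#ifdef DEMO_BEANCOUNTERS
#define demo_mm_ub(mm) ((mm)->mm_ub)
#else
/* Assumed fallback: no owner at all when the feature is compiled out. */
#define demo_mm_ub(mm) ((struct demo_ub *)NULL)
#endif

/* A helper that tolerates a missing owner. */
static int demo_swap_full(struct demo_ub *ub)
{
    return ub != NULL && ub->id < 0;
}

/* Call site written against the accessor, never against the member itself. */
static int demo_should_free_swap(struct demo_mm *mm)
{
    assert(mm != NULL);
    return demo_swap_full(demo_mm_ub(mm));
}
```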


[Devel] [PATCH RHEL7 COMMIT] bc/mm/oom: put UB_OOM_MANUAL_SCORE_ADJ bit check under CONFIG_BEANCOUNTERS

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit a1fe3b80fa701fa9e4894f735f3dfb49337e2d24
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:13 2015 +0400

bc/mm/oom: put UB_OOM_MANUAL_SCORE_ADJ bit check under CONFIG_BEANCOUNTERS

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 mm/oom_group.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/oom_group.c b/mm/oom_group.c
index 206ba06..2401eed 100644
--- a/mm/oom_group.c
+++ b/mm/oom_group.c
@@ -60,8 +60,10 @@ int get_task_oom_score_adj(struct task_struct *t)
uid_t task_uid;
int adj = 0;
 
+#ifdef CONFIG_BEANCOUNTERS
if (test_bit(UB_OOM_MANUAL_SCORE_ADJ, &get_task_ub(t)->ub_flags))
return t->signal->oom_score_adj;
+#endif
 
rcu_read_lock();
cred = __task_cred(t);


[Devel] [PATCH RHEL7 COMMIT] bc/mm/shmem.c: use mm_ub() macro to avoid compilation errors if !CONFIG_BEANCOUNTERS

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 5425b7be953d98968a1b23e613ca45210a6cc6ca
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:14 2015 +0400

bc/mm/shmem.c: use mm_ub() macro to avoid compilation errors if 
!CONFIG_BEANCOUNTERS

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Use mm_ub() macro to avoid this:

mm/shmem.c: In function ‘tmpfs_ram_pages’:
mm/shmem.c:122:18: error: ‘struct mm_struct’ has no member named ‘mm_ub’
  ub = current->mm->mm_ub;
  ^

mm/shmem.c: In function ‘shmem_zero_setup’:
mm/shmem.c:3013:39: error: ‘struct mm_struct’ has no member named ‘mm_ub’
   uncharge_beancounter_fast(vma->vm_mm->mm_ub, UB_PRIVVMPAGES,

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 mm/shmem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index c01b3a2..a6b3e30 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -119,7 +119,7 @@ static unsigned long tmpfs_ram_pages(void)
if (unlikely(!current->mm))
goto out;
 
-   ub = current->mm->mm_ub;
+   ub = mm_ub(current->mm);
if (ub != get_ub0()) {
ub_rampages = ub->ub_parms[UB_PHYSPAGES].limit;
if (ub_rampages == UB_MAXVALUE)
@@ -3010,7 +3010,7 @@ int shmem_zero_setup(struct vm_area_struct *vma)
if (vma->vm_file)
fput(vma->vm_file);
else if (vma->vm_flags & VM_WRITE)
-   uncharge_beancounter_fast(vma->vm_mm->mm_ub, UB_PRIVVMPAGES,
+   uncharge_beancounter_fast(mm_ub(vma->vm_mm), UB_PRIVVMPAGES,
  size >> PAGE_SHIFT);
vma->vm_file = file;
vma->vm_ops = &shmem_vm_ops;


[Devel] [PATCH RHEL7 COMMIT] ve/netfilter: put appropriate part under CONFIG_VE_IPTABLES

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 07f71c74fccb4900b5c2594dc00a9b550499f589
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:14 2015 +0400

ve/netfilter: put appropriate part under CONFIG_VE_IPTABLES

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Fix compilation with !CONFIG_VE_IPTABLES.

  CC [M]  net/netfilter/nf_conntrack_standalone.o
net/netfilter/nf_conntrack_standalone.c: In function ‘nf_conntrack_standalone_init’:
net/netfilter/nf_conntrack_standalone.c:587:12: error: ‘struct ve_struct’ has no member named ‘ipt_mask’
   get_ve0()->ipt_mask = ~(VE_NF_CONNTRACK_MOD | VE_IP_IPTABLE_NAT_MOD);
^
net/netfilter/nf_conntrack_standalone.c:590:12: error: ‘struct ve_struct’ has no member named ‘ipt_mask’
   get_ve0()->ipt_mask |= VE_NF_CONNTRACK_MOD | VE_IP_IPTABLE_NAT_MOD;
^

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 net/netfilter/nf_conntrack_standalone.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 6e45fc2..ee2889d 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -582,6 +582,7 @@ static int __init nf_conntrack_standalone_init(void)
 {
int ret;
 
+#ifdef CONFIG_VE_IPTABLES
if (ip_conntrack_disable_ve0) {
printk("Disabling conntracks and NAT for ve0\n");
get_ve0()->ipt_mask = ~(VE_NF_CONNTRACK_MOD | VE_IP_IPTABLE_NAT_MOD);
@@ -589,6 +590,7 @@ static int __init nf_conntrack_standalone_init(void)
printk("Enabling conntracks and NAT for ve0\n");
get_ve0()->ipt_mask |= VE_NF_CONNTRACK_MOD | VE_IP_IPTABLE_NAT_MOD;
}
+#endif
 
ret = nf_conntrack_init_start();
if (ret < 0)


[Devel] [PATCH RHEL7 COMMIT] ms/ppc: define TIOSAK (fix tty_ioctl() compilation)

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit b8461956ba8785caa41580809f9c31224f54
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:16 2015 +0400

ms/ppc: define TIOSAK (fix tty_ioctl() compilation)

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Fix the following error:

  CC  drivers/tty/tty_io.o
drivers/tty/tty_io.c: In function ‘tty_ioctl’:
drivers/tty/tty_io.c:2843:7: error: ‘TIOSAK’ undeclared (first use in
this function)
  case TIOSAK:

This is an addition to commit 28f8dfa.

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 arch/powerpc/include/uapi/asm/ioctls.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/ioctls.h b/arch/powerpc/include/uapi/asm/ioctls.h
index 49a2579..8cc89fa 100644
--- a/arch/powerpc/include/uapi/asm/ioctls.h
+++ b/arch/powerpc/include/uapi/asm/ioctls.h
@@ -116,4 +116,6 @@
 #define TIOCMIWAIT 0x545C  /* wait for a change on serial input line(s) */
 #define TIOCGICOUNT 0x545D  /* read serial port inline interrupt counts */
 
+#define TIOSAK _IO('T', 0x66)  /* Secure Attention Key */
+
 #endif /* _ASM_POWERPC_IOCTLS_H */
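
For readers unfamiliar with _IO(): it packs a direction, a type character and
a command number into one integer. The sketch below recomputes the value for
_IO('T', 0x66) by hand. The field widths and the "no data" direction value are
my assumption of the powerpc uapi layout (8-bit nr, 8-bit type, 13-bit size,
direction value 1 for "none"), and the DEMO_* names are not kernel macros.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout assumed to match the powerpc uapi ioctl encoding. */
#define DEMO_IOC_NRBITS    8
#define DEMO_IOC_TYPEBITS  8
#define DEMO_IOC_SIZEBITS  13
#define DEMO_IOC_NONE      1U   /* direction value for "no data transfer" */

#define DEMO_IOC_NRSHIFT   0
#define DEMO_IOC_TYPESHIFT (DEMO_IOC_NRSHIFT + DEMO_IOC_NRBITS)
#define DEMO_IOC_SIZESHIFT (DEMO_IOC_TYPESHIFT + DEMO_IOC_TYPEBITS)
#define DEMO_IOC_DIRSHIFT  (DEMO_IOC_SIZESHIFT + DEMO_IOC_SIZEBITS)

/* Equivalent of _IO(type, nr): no direction bits beyond "none", size 0. */
static uint32_t demo_io(unsigned int type, unsigned int nr)
{
    assert(type < 256 && nr < 256);
    return (DEMO_IOC_NONE << DEMO_IOC_DIRSHIFT) |
           ((uint32_t)type << DEMO_IOC_TYPESHIFT) |
           ((uint32_t)nr << DEMO_IOC_NRSHIFT);
}
```

Under these assumed widths, demo_io('T', 0x66) evaluates to 0x20005466. Other
architectures may use a different layout, so the number is only meaningful
for the layout stated here.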


[Devel] [PATCH RHEL7 COMMIT] ve/ppc: include OpenVZ-specific Kconfigs for ppc

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 1c89ca97839638864e363b8bd74bc600ec8705b4
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:17 2015 +0400

ve/ppc: include OpenVZ-specific Kconfigs for ppc

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 arch/powerpc/Kconfig | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e9f9fe1..7761552 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -1015,6 +1015,8 @@ config PHYSICAL_START
default 0x
 endif
 
+source "kernel/Kconfig.openvz"
+
 source "net/Kconfig"
 
 source "drivers/Kconfig"
@@ -1025,6 +1027,8 @@ source "arch/powerpc/sysdev/qe_lib/Kconfig"
 
 source "lib/Kconfig"
 
+source "kernel/bc/Kconfig"
+
 source "arch/powerpc/Kconfig.debug"
 
 source "security/Kconfig"


[Devel] [PATCH RHEL7 COMMIT] ub/kernel: use get_task_ub() - fix compile for !CONFIG_BEANCOUNTERS

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 6137f0c46cd91112c4ac312be969732d5ff6fec9
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:19 2015 +0400

ub/kernel: use get_task_ub() - fix compile for !CONFIG_BEANCOUNTERS

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Fix the following compilation error if CONFIG_BEANCOUNTERS is not set:

  CC  kernel/exit.o
kernel/exit.c: In function ‘release_task’:
kernel/exit.c:215:20: error: ‘struct task_struct’ has no member named ‘task_bc’
  ub_task_uncharge(p->task_bc.task_ub);
   ^

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 kernel/exit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 56b840c..1c65d95 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -211,7 +211,7 @@ repeat:
 
write_unlock_irq(&tasklist_lock);
release_thread(p);
-   ub_task_uncharge(p->task_bc.task_ub);
+   ub_task_uncharge(get_task_ub(p));
call_rcu(&p->rcu, delayed_put_task_struct);
 
p = leader;


[Devel] [PATCH RHEL7 COMMIT] bc: add struct seq_file declaration - fix virtinfo.h compile warning

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit c359f8c3a9ce430c5473cf340fe53f4dc8017012
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:05 2015 +0400

bc: add struct seq_file declaration - fix virtinfo.h compile warning

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Fix the following compilation warning:

C  kernel/bc/sys.o
In file included from kernel/bc/sys.c:11:0:
include/linux/virtinfo.h:60:10: warning: ‘struct seq_file’ declared
inside parameter list [enabled by default]
   struct user_beancounter *ub, unsigned long meminfo_val);
  ^
include/linux/virtinfo.h:60:10: warning: its scope is only this
definition or declaration, which is probably not what you want [enabled
by default]

The fix is a forward declaration of struct seq_file.

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 include/linux/virtinfo.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/virtinfo.h b/include/linux/virtinfo.h
index e8cb94c..1730aca 100644
--- a/include/linux/virtinfo.h
+++ b/include/linux/virtinfo.h
@@ -56,6 +56,8 @@ struct meminfo {
unsigned long slab_reclaimable, slab_unreclaimable;
 };
 
+struct seq_file;
+
 int meminfo_proc_show_ub(struct seq_file *m, void *v,
struct user_beancounter *ub, unsigned long meminfo_val);
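
The warning fixed by the virtinfo.h patch above comes from a C scoping rule: a
struct tag first mentioned inside a prototype's parameter list is visible only
within that prototype. A file-scope forward declaration avoids this. The
sketch below uses invented names (demo_seq_file, demo_show) to show the shape.

```c
#include <assert.h>
#include <stddef.h>

/* File-scope forward declaration: the tag now means the same type everywhere
 * below, even though its members are not known yet. */
struct demo_seq_file;

/* Without the line above, the struct named here would be a new type local to
 * this prototype, and the definition further down would not match it. */
int demo_show(struct demo_seq_file *m, unsigned long val);

/* The full definition can come later, or live in another header. */
struct demo_seq_file { unsigned long last; };

int demo_show(struct demo_seq_file *m, unsigned long val)
{
    assert(m != NULL);
    m->last = val;
    return 0;
}
```

A forward declaration is preferable to including the defining header when only
pointers to the type are needed, since it adds no include dependency.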
 


[Devel] [PATCH RHEL7 COMMIT] bc/dcache: add parameter names to ub_dcache_reclaim() declaration for !CONFIG_BEANCOUNTERS

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit fcf188d71141445bda8dc60d2d3d614dfb2cf6d9
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:09 2015 +0400

bc/dcache: add parameter names to ub_dcache_reclaim() declaration for 
!CONFIG_BEANCOUNTERS

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

In case CONFIG_BEANCOUNTERS is not defined, UB_DECLARE_VOID_FUNC
macro is expanded into static inline function (rather than a
function declaration), which requires parameter names.

Fixes the following error:
  CC  kernel/ve/vecalls.o
In file included from include/bc/dcache.h:4:0,
 from kernel/ve/vecalls.c:78:
include/bc/dcache.h: In function ‘ub_dcache_reclaim’:
include/bc/dcache.h:15:47: error: parameter name omitted
 UB_DECLARE_VOID_FUNC(ub_dcache_reclaim(struct user_beancounter *ub, unsigned long, unsigned long))
                                               ^
include/bc/decl.h:34:21: note: in definition of macro ‘UB_DECLARE_VOID_FUNC’
  static inline void decl   \
                     ^
include/bc/dcache.h:15:47: error: parameter name omitted
 UB_DECLARE_VOID_FUNC(ub_dcache_reclaim(struct user_beancounter *ub, unsigned long, unsigned long))
                                               ^
include/bc/decl.h:34:21: note: in definition of macro ‘UB_DECLARE_VOID_FUNC’
  static inline void decl   \
                     ^

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 include/bc/dcache.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/bc/dcache.h b/include/bc/dcache.h
index e2b2c9a..186e0fc 100644
--- a/include/bc/dcache.h
+++ b/include/bc/dcache.h
@@ -11,7 +11,7 @@ UB_DECLARE_VOID_FUNC(ub_dcache_set_owner(struct dentry *d, struct user_beancount
 UB_DECLARE_VOID_FUNC(ub_dcache_change_owner(struct dentry *dentry, struct user_beancounter *ub))
 UB_DECLARE_VOID_FUNC(ub_dcache_clear_owner(struct dentry *dentry))
 UB_DECLARE_VOID_FUNC(ub_dcache_unuse(struct user_beancounter *ub))
-UB_DECLARE_VOID_FUNC(ub_dcache_reclaim(struct user_beancounter *ub, unsigned long, unsigned long))
+UB_DECLARE_VOID_FUNC(ub_dcache_reclaim(struct user_beancounter *ub, unsigned long numerator, unsigned long denominator))
 UB_DECLARE_FUNC(int, ub_dcache_shrink(struct user_beancounter *ub, unsigned long size, gfp_t gfp_mask))
 UB_DECLARE_FUNC(unsigned long, ub_dcache_get_size(struct dentry *dentry))
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
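
The failure fixed by the patch above can be reproduced outside the kernel. The following is a userspace sketch only, not the contents of include/bc/decl.h: the DEMO_* macros, the DEMO_CONFIG_ENABLED switch and the demo_* functions are hypothetical names chosen for illustration. It assumes, as the compiler output suggests, that the macro emits a prototype when the feature is configured and an empty static inline stub otherwise.

```c
/*
 * Userspace sketch (hypothetical names, not the real include/bc/decl.h).
 * With the feature enabled the macros emit prototypes; without it they
 * emit empty static inline stubs, so "decl" becomes a function definition.
 */
#ifdef DEMO_CONFIG_ENABLED
#define DEMO_DECLARE_FUNC(ret_type, decl)	extern ret_type decl;
#define DEMO_DECLARE_VOID_FUNC(decl)		extern void decl;
#else
#define DEMO_DECLARE_FUNC(ret_type, decl)	\
	static inline ret_type decl		\
	{					\
		return (ret_type)0;		\
	}
#define DEMO_DECLARE_VOID_FUNC(decl)		\
	static inline void decl			\
	{					\
	}
#endif

/*
 * Every parameter is named, so both expansions are valid C.  Unnamed
 * parameters would be accepted in the prototype form, but the stub form
 * is a definition, and the compiler in the log above rejects a
 * definition with unnamed parameters.
 */
DEMO_DECLARE_VOID_FUNC(demo_reclaim(int *counter, unsigned long numerator,
				    unsigned long denominator))
DEMO_DECLARE_FUNC(int, demo_shrink(int *counter, unsigned long size))
```

Built without DEMO_CONFIG_ENABLED, both calls compile down to no-ops, which is the behaviour a stock-config build relies on.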


[Devel] [PATCH RHEL7 COMMIT] ve/mm: include virtinfo.h for !CONFIG_BC_IO_ACCOUNTING case

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit e6c1d7a4f8c5d47dde0dfc58be3d47a653513767
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:11 2015 +0400

ve/mm: include virtinfo.h for !CONFIG_BC_IO_ACCOUNTING case

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

In case CONFIG_BC_IO_ACCOUNTING is defined, virtinfo.h
is included via task_io_accounting_ops.h that includes
bc/io_acct.h that includes virtinfo.h.

In case CONFIG_BC_IO_ACCOUNTING is not defined, we are screwed.

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 mm/page-writeback.c | 1 +
 mm/readahead.c  | 1 +
 2 files changed, 2 insertions(+)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b0f33bf..855043c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -36,6 +36,7 @@
 #include <linux/pagevec.h>
 #include <linux/timer.h>
 #include <linux/sched/rt.h>
+#include <linux/virtinfo.h>
 #include <trace/events/writeback.h>
 
 /*
diff --git a/mm/readahead.c b/mm/readahead.c
index d0b4118..6e53aef 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -19,6 +19,7 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/file.h>
+#include <linux/virtinfo.h>
 
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] ve/ppc: include ve.h in process.c

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 0a15e65aed4380f29bb4265508bb66a871d3f5fb
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:15 2015 +0400

ve/ppc: include ve.h in process.c

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

As get_exec_env() is defined in linux/ve.h we have to include it.
Fixes the following compile error:

  CC  arch/powerpc/kernel/process.o
In file included from arch/powerpc/kernel/process.c:20:0:
arch/powerpc/kernel/process.c: In function ‘arch_align_stack’:
include/linux/mm.h:1951:43: error: dereferencing pointer to incomplete
type
 #define randomize_va_space (get_exec_env()->_randomize_va_space)
   ^
arch/powerpc/kernel/process.c:1564:53: note: in expansion of macro
‘randomize_va_space’
  if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
 ^
make[1]: *** [arch/powerpc/kernel/process.o] Error 1

This is an addition to commit 0c67100

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 arch/powerpc/kernel/process.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 6f55061..50c7d73 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -38,6 +38,7 @@
 #include <linux/personality.h>
 #include <linux/random.h>
 #include <linux/hw_breakpoint.h>
+#include <linux/ve.h>
 
 #include <asm/pgtable.h>
 #include <asm/uaccess.h>
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] ve/proc/meminfo.c: use mm_ub() macro

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit ae181d3fd3012765dd6e6671d258f6fbe2745eea
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:19 2015 +0400

ve/proc/meminfo.c: use mm_ub() macro

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Fixes compilation with !CONFIG_BEANCOUNTERS.

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 fs/proc/meminfo.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index c70b77b..e352241 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -16,6 +16,7 @@
 #include <linux/vmalloc.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
+#include <bc/beancounter.h>
 #include "internal.h"
 
 void __attribute__((weak)) arch_report_meminfo(struct seq_file *m)
@@ -277,7 +278,7 @@ int meminfo_proc_show_ub(struct seq_file *m, void *v,
 
 static int meminfo_proc_show(struct seq_file *m, void *v)
 {
-	return meminfo_proc_show_ub(m, v, current->mm->mm_ub,
+	return meminfo_proc_show_ub(m, v, mm_ub(current->mm),
 			get_exec_env()->meminfo_val);
 }
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-iolimit-plain-throttler

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit f3bba392c33dc26ea98d2977e8ddd2861eb8b805
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:34 2015 +0400

vziolimit: port diff-iolimit-plain-throttler

iolimit: implement plain throttler

Implement throttler logic.

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/ve/vziolimit.c | 72 +++
 1 file changed, 72 insertions(+)

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index d8edb1e..d70d416 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -13,6 +13,78 @@
 #include <linux/vziolimit.h>
 #include <bc/beancounter.h>
 
+struct throttle {
+   unsigned speed; /* maximum speed, units per second */
+   unsigned burst; /* maximum bust, units */
+   unsigned latency;   /* maximum wait delay, jiffies */
+   unsigned state; /* current state */
+   unsigned long time; /* wall time in jiffies */
+};
+
+/**
+ * set throttler initial state, externally serialized
+ * @speed	maximum speed (1/sec)
+ * @burst	maximum burst chunk
+ * @latency	maximum timeout (ms)
+ */
+static void throttle_setup(struct throttle *th, unsigned speed,
+		unsigned burst, unsigned latency)
+{
+	th->time = jiffies;
+	th->burst = burst;
+	th->latency = msecs_to_jiffies(latency);
+	th->state = 0;
+	wmb();
+	th->speed = speed;
+}
+
+/* externally serialized */
+static void throttle_charge(struct throttle *th, unsigned charge)
+{
+	unsigned long now = jiffies;
+	u64 step;
+
+	if (!th->speed)
+		return;
+
+	if (time_before(th->time, now)) {
+		step = (u64)th->speed * (now - th->time);
+		do_div(step, HZ);
+		th->state = min((unsigned)step + th->state, charge + th->burst);
+		th->time = now;
+	}
+
+	if (charge > th->state) {
+		charge -= th->state;
+		step = (u64)charge * HZ;
+		if (do_div(step, th->speed))
+			step++;
+		th->time += step;
+		step *= th->speed;
+		do_div(step, HZ);
+		th->state = max_t(int, (int)step - charge, 0);
+	} else
+		th->state -= charge;
+
+	if (time_after(th->time, now + th->latency))
+		th->time = now + th->latency;
+}
+
+/* lockless */
+static unsigned long throttle_timeout(struct throttle *th, unsigned long now)
+{
+	unsigned long time;
+
+	if (!th->speed)
+		return 0;
+	rmb();
+	time = th->time;
+	if (time_before(time, now))
+		return 0;
+	return min(time - now, (unsigned long)th->latency);
+}
+
+
 static int iolimit_virtinfo(struct vnotifier_block *nb,
unsigned long cmd, void *arg, int old_ret)
 {
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
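
To see how the throttler in the patch above turns charges into a wait time, here is a simplified userspace sketch. It is not the kernel code: the demo_* names and DEMO_HZ are hypothetical, the current tick is passed in explicitly instead of reading jiffies, plain comparisons replace time_before()/time_after() (so counter wraparound is ignored), and the memory barriers are left out.

```c
#include <stdint.h>

#define DEMO_HZ 1000u	/* assumed tick rate; the kernel uses HZ */

struct demo_throttle {
	unsigned speed;		/* maximum speed, units per second */
	unsigned burst;		/* maximum burst, units */
	unsigned latency;	/* maximum wait delay, ticks */
	unsigned state;		/* units that may still pass without delay */
	unsigned long time;	/* tick until which callers must wait */
};

/* Account "charge" units at tick "now" (modelled on throttle_charge()). */
static inline void demo_charge(struct demo_throttle *th, unsigned charge,
			       unsigned long now)
{
	uint64_t step;

	if (!th->speed)
		return;

	if (th->time < now) {
		/* refill for the idle period, capped at charge + burst */
		step = (uint64_t)th->speed * (now - th->time) / DEMO_HZ;
		step += th->state;
		if (step > (uint64_t)charge + th->burst)
			step = (uint64_t)charge + th->burst;
		th->state = (unsigned)step;
		th->time = now;
	}

	if (charge > th->state) {
		/* not enough credit: push the wake-up time forward */
		charge -= th->state;
		step = (uint64_t)charge * DEMO_HZ;
		step = (step + th->speed - 1) / th->speed;	/* round up */
		th->time += (unsigned long)step;
		step = step * th->speed / DEMO_HZ;
		th->state = step > charge ? (unsigned)(step - charge) : 0;
	} else {
		th->state -= charge;
	}

	/* never make a caller wait longer than the configured latency */
	if (th->time > now + th->latency)
		th->time = now + th->latency;
}

/* Ticks the caller should sleep at "now" (modelled on throttle_timeout()). */
static inline unsigned long demo_timeout(const struct demo_throttle *th,
					 unsigned long now)
{
	if (!th->speed || th->time < now)
		return 0;
	return th->time - now < th->latency ? th->time - now : th->latency;
}
```

With speed 1000 units/sec, no burst and DEMO_HZ 1000, charging 500 units at tick 0 moves the wake-up time to tick 500, so a caller asking at tick 200 is told to wait 300 ticks.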


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-iolimit-initial-skeleton-opensource

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit fdb55be0a92a7b0022f68afb1dfd75a8e2220600
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:34 2015 +0400

vziolimit: port diff-iolimit-initial-skeleton-opensource

https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 include/linux/vziolimit.h | 17 +++
 kernel/Kconfig.openvz |  8 +++
 kernel/ve/Makefile|  2 ++
 kernel/ve/vziolimit.c | 53 +++
 4 files changed, 80 insertions(+)

diff --git a/include/linux/vziolimit.h b/include/linux/vziolimit.h
new file mode 100644
index 000..a017b0f
--- /dev/null
+++ b/include/linux/vziolimit.h
@@ -0,0 +1,17 @@
+/*
+ *  include/linux/vziolimit.h
+ *
+ *  Copyright (C) 2010, Parallels inc.
+ *  All rights reserved.
+ *
+ */
+
+#ifndef _LINUX_VZIOLIMIT_H
+#define _LINUX_VZIOLIMIT_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define VZIOLIMITTYPE 'I'
+
+#endif /* _LINUX_VZIOLIMIT_H */
diff --git a/kernel/Kconfig.openvz b/kernel/Kconfig.openvz
index 3eb2fd2..5b6e6c1 100644
--- a/kernel/Kconfig.openvz
+++ b/kernel/Kconfig.openvz
@@ -108,3 +108,11 @@ config VTTYS
default y
 
 endmenu
+
+
+config VZ_IOLIMIT
+	tristate "Container IO-limiting"
+	depends on VE && VE_CALLS && BC_IO_ACCOUNTING
+	default m
+	help
+	  This option provides io-limiting module.
diff --git a/kernel/ve/Makefile b/kernel/ve/Makefile
index cd798ad..a8371d5 100644
--- a/kernel/ve/Makefile
+++ b/kernel/ve/Makefile
@@ -18,6 +18,8 @@ obj-$(CONFIG_VE_NETDEV_ACCOUNTING) += vznetstat/vznetstat.o 
vznetstat/ip_vznetst
 obj-$(CONFIG_VZ_LIST) += vzlist.o
 obj-$(CONFIG_VE_CALLS) += vzstat.o
 
+obj-$(CONFIG_VZ_IOLIMIT) += vziolimit.o
+
 obj-m += dummy/ip6_vzprivnet.o
 obj-m += dummy/ip_vzprivnet.o
 obj-m += dummy/pio_nfs.o
diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
new file mode 100644
index 000..d8edb1e
--- /dev/null
+++ b/kernel/ve/vziolimit.c
@@ -0,0 +1,53 @@
+/*
+ *  kernel/ve/vziolimit.c
+ *
+ *  Copyright (C) 2010, Parallels inc.
+ *  All rights reserved.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/virtinfo.h>
+#include <linux/vzctl.h>
+#include <linux/vziolimit.h>
+#include <bc/beancounter.h>
+
+static int iolimit_virtinfo(struct vnotifier_block *nb,
+		unsigned long cmd, void *arg, int old_ret)
+{
+}
+
+static struct vnotifier_block iolimit_virtinfo_nb = {
+	.notifier_call = iolimit_virtinfo,
+};
+
+static int iolimit_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+}
+
+static struct vzioctlinfo iolimit_vzioctl = {
+	.type		= VZIOLIMITTYPE,
+	.ioctl		= iolimit_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= iolimit_ioctl,
+#endif
+	.owner		= THIS_MODULE,
+};
+
+static int __init iolimit_init(void)
+{
+	virtinfo_notifier_register(VITYPE_IO, &iolimit_virtinfo_nb);
+	vzioctl_register(&iolimit_vzioctl);
+
+	return 0;
+}
+
+static void __exit iolimit_exit(void)
+{
+	vzioctl_unregister(&iolimit_vzioctl);
+	virtinfo_notifier_unregister(VITYPE_IO, &iolimit_virtinfo_nb);
+}
+
+module_init(iolimit_init)
+module_exit(iolimit_exit)
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-iolimit-dont-throttle-flushers

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 75c3c4c7c1ea7734d550f752d4865dbd9e5074b4
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:36 2015 +0400

vziolimit: port diff-iolimit-dont-throttle-flushers

iolimit: do not throttle flush threads

Account pdflush IO progress but not delay it.

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/ve/vziolimit.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index ac5abcd..72b8d51 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -110,6 +110,8 @@ static int iolimit_virtinfo(struct vnotifier_block *nb,
 		break;
 	case VIRTINFO_IO_PREPARE:
 	case VIRTINFO_IO_JOURNAL:
+		if (current->flags & PF_FLUSHER)
+			break;
 		timeout = throttle_timeout(&iolimit->throttle, jiffies);
 		if (timeout) {
 			__set_current_state(TASK_UNINTERRUPTIBLE);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-ubc-iolimit-keep-charge-in-throttler

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit c476e1a6d365c8ee156b05ab0a0ccbcb9601a984
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:39 2015 +0400

vziolimit: port diff-ubc-iolimit-keep-charge-in-throttler

iolimit: keep charge in throttler

Now throttler can keep precharge, so we can start throttling earlier.
* use long long for state to avoid overflows.

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org
Acked-by: Pavel Emelyanov xe...@parallels.com


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/ve/vziolimit.c | 53 ---
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index e2eedae..7ff2854 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -18,8 +18,8 @@ struct throttle {
unsigned speed; /* maximum speed, units per second */
unsigned burst; /* maximum bust, units */
unsigned latency;   /* maximum wait delay, jiffies */
-   unsigned state; /* current state */
unsigned long time; /* wall time in jiffies */
+   long long state;/* current state in units */
 };
 
 /**
@@ -34,41 +34,46 @@ static void throttle_setup(struct throttle *th, unsigned speed,
 	th->time = jiffies;
 	th->burst = burst;
 	th->latency = msecs_to_jiffies(latency);
-	th->state = 0;
+	/* feed throttler to avoid freezing */
+	if (th->state < burst)
+		th->state = burst;
 	wmb();
 	th->speed = speed;
 }
 
 /* externally serialized */
-static void throttle_charge(struct throttle *th, unsigned charge)
+static void throttle_charge(struct throttle *th, long long charge)
 {
-	unsigned long now = jiffies;
-	u64 step;
-
-	if (!th->speed)
-		return;
+	unsigned long time, now = jiffies;
+	long long step, ceiling = charge + th->burst;
 
 	if (time_before(th->time, now)) {
 		step = (u64)th->speed * (now - th->time);
 		do_div(step, HZ);
-		th->state = min((unsigned)step + th->state, charge + th->burst);
+		step += th->state;
+		/* feed throttler as much as we can */
+		if (step <= ceiling)
+			th->state = step;
+		else if (th->state < ceiling)
+			th->state = ceiling;
 		th->time = now;
 	}
 
 	if (charge > th->state) {
 		charge -= th->state;
-		step = (u64)charge * HZ;
+		step = charge * HZ;
 		if (do_div(step, th->speed))
 			step++;
-		th->time += step;
+		time = th->time + step;
+		/* limit maximum latency */
+		if (time_after(time, now + th->latency))
+			time = now + th->latency;
+		th->time = time;
 		step *= th->speed;
-		do_div(step, HZ);
-		th->state = max_t(int, (int)step - charge, 0);
-	} else
-		th->state -= charge;
-
-	if (time_after(th->time, now + th->latency))
-		th->time = now + th->latency;
+		if (do_div(step, HZ))
+			step++;
+		th->state += step;
+	}
 }
 
 /* lockless */
@@ -133,14 +138,22 @@ static int iolimit_virtinfo(struct vnotifier_block *nb,
 		if (!iolimit->throttle.speed)
 			break;
 		spin_lock_irqsave(&ub->ub_lock, flags);
-		throttle_charge(&iolimit->throttle, *(size_t*)arg);
+		if (iolimit->throttle.speed) {
+			long long charge = *(size_t*)arg;
+
+			throttle_charge(&iolimit->throttle, charge);
+			iolimit->throttle.state -= charge;
+		}
 		spin_unlock_irqrestore(&ub->ub_lock, flags);
 		break;
 	case VIRTINFO_IO_OP_ACCOUNT:
 		if (!iolimit->iops.speed)
 			break;
 		spin_lock_irqsave(&ub->ub_lock, flags);
-		throttle_charge(&iolimit->iops, 1);
+		if (iolimit->iops.speed) {
+			throttle_charge(&iolimit->iops, 1);
+			iolimit->iops.state--;
+		}
 		spin_unlock_irqrestore(&ub->ub_lock, flags);
 		break;
 	case VIRTINFO_IO_PREPARE:
___
Devel mailing list
Devel@openvz.org

[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-ubc-iolimit-precharge-dirty-pages-opensource

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit e6d640dd11fe7f1fc172f81d4a2e61d1d198eaf0
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:40 2015 +0400

vziolimit: port diff-ubc-iolimit-precharge-dirty-pages-opensource

https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/ve/vziolimit.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 7ff2854..19dd7ad 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -120,6 +120,35 @@ static unsigned long iolimit_timeout(struct iolimit *iolimit)
 			throttle_timeout(&iolimit->iops, now));
 }
 
+static void iolimit_balance_dirty(struct iolimit *iolimit,
+				  struct user_beancounter *ub,
+				  unsigned long write_chunk)
+{
+	struct throttle *th = &iolimit->throttle;
+	unsigned long flags, dirty, state;
+
+	if (!th->speed)
+		return;
+
+	/* can be non-atomic on i386, but ok. this just hint. */
+	state = th->state >> PAGE_SHIFT;
+	dirty = ub_stat_get(ub, dirty_pages) + write_chunk;
+	/* protect agains ub-stat percpu drift */
+	if (dirty + UB_STAT_BATCH * num_possible_cpus() < state)
+		return;
+	/* get exact value of for smooth throttling */
+	dirty = ub_stat_get_exact(ub, dirty_pages) + write_chunk;
+	if (dirty < state)
+		return;
+
+	spin_lock_irqsave(&ub->ub_lock, flags);
+	/* precharge dirty pages */
+	throttle_charge(th, (long long)dirty << PAGE_SHIFT);
+	/* set dirty_exceeded for smooth throttling */
+	ub->dirty_exceeded = 1;
+	spin_unlock_irqrestore(&ub->ub_lock, flags);
+}
+
 static int iolimit_virtinfo(struct vnotifier_block *nb,
 		unsigned long cmd, void *arg, int old_ret)
 {
@@ -170,6 +199,9 @@ static int iolimit_virtinfo(struct vnotifier_block *nb,
 		if (timeout)
 			return NOTIFY_FAIL;
 		break;
+	case VIRTINFO_IO_BALANCE_DIRTY:
+		iolimit_balance_dirty(iolimit, ub, (unsigned long)arg);
+		break;
 	}
 
 	return NOTIFY_OK;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-vziolimit-fix-rounding-error-in-throttler

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit f3167fce64f7ae1d406fab9ea2845aafd8fc474d
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:41 2015 +0400

vziolimit: port diff-vziolimit-fix-rounding-error-in-throttler

vziolimit: fix rounding error in throttler

In some cases this error can lead to doubling of IO bandwidth,
especially if the speed is small and not a divisor of 1000.

https://jira.sw.ru/browse/PSBM-13302

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/ve/vziolimit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 890d7e8..473e2b7 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -18,6 +18,7 @@ struct throttle {
unsigned speed; /* maximum speed, units per second */
unsigned burst; /* maximum bust, units */
unsigned latency;   /* maximum wait delay, jiffies */
+   unsigned remain;/* units/HZ */
unsigned long time; /* wall time in jiffies */
long long state;/* current state in units */
 };
@@ -70,8 +71,8 @@ static void throttle_charge(struct throttle *th, long long charge)
 			time = now + th->latency;
 		th->time = time;
 		step *= th->speed;
-		if (do_div(step, HZ))
-			step++;
+		step += th->remain;
+		th->remain = do_div(step, HZ);
 		th->state += step;
 	}
 }
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
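
The doubling mentioned in the commit message above can be checked with a little arithmetic. The sketch below is a userspace illustration under stated assumptions, not the kernel code: DEMO_HZ is fixed at 1000 (matching the "divisor of 1000" remark), and the demo_* function names are hypothetical. It compares rounding the credited units up on every call with carrying the remainder forward, which is what the patch does through the new remain field.

```c
#define DEMO_HZ 1000ULL	/* assumed tick rate; the kernel uses HZ */

/* Old behaviour: credit for "ticks" of waiting, rounded up every time. */
static inline unsigned long long demo_credit_round_up(unsigned speed,
						      unsigned long long ticks)
{
	unsigned long long step = ticks * speed;

	return step / DEMO_HZ + (step % DEMO_HZ ? 1 : 0);
}

/* New behaviour: the sub-unit remainder is carried into the next call. */
static inline unsigned long long demo_credit_carry(unsigned speed,
						   unsigned long long ticks,
						   unsigned *remain)
{
	unsigned long long step = ticks * speed + *remain;

	*remain = (unsigned)(step % DEMO_HZ);
	return step / DEMO_HZ;
}
```

At a speed of 3 units/sec, one unit costs 334 ticks of waiting (1000 / 3, rounded up), and 334 * 3 = 1002. Rounding 1002 / 1000 up credits 2 units for a wait that paid for roughly one, which is the doubling; carrying the remainder credits 1 unit and keeps 2 for the next call.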


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-ubc-introduce-atomic-flags-bit-field-shadow

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit d9cd557f5d1ab0ff33d5fe2af033005887693b25
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:43 2015 +0400

vziolimit: port diff-ubc-introduce-atomic-flags-bit-field-shadow

ubc: introduce atomic flags bit-field

vziolimit's part of the patch:

convert ub->dirty_exceeded, ub->ub_oom_noproc, ub->ub_manual_oom_score_adj
into normal atomic bit flags in ub->ub_flags.

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org
Acked-by: Pavel Emelyanov xe...@parallels.com


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/ve/vziolimit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 906e32a..2cfc58f 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -143,7 +143,7 @@ static void iolimit_balance_dirty(struct iolimit *iolimit,
 	/* precharge dirty pages */
 	throttle_charge(th, (long long)dirty << PAGE_SHIFT);
 	/* set dirty_exceeded for smooth throttling */
-	ub->dirty_exceeded = 1;
+	set_bit(UB_DIRTY_EXCEEDED, &ub->ub_flags);
 	spin_unlock_irqrestore(&ub->ub_lock, flags);
 }
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
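
The idea behind the conversion above is that several independent flags can share one word as long as each update is an atomic read-modify-write of a single bit. The following userspace sketch illustrates that with C11 atomics. It is not the kernel's set_bit() implementation, and the demo_* names and bit numbers are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers standing in for the UB_* flag constants. */
enum {
	DEMO_DIRTY_EXCEEDED = 0,
	DEMO_OOM_NOPROC = 1,
};

struct demo_ub {
	atomic_ulong flags;	/* all flags share one atomically updated word */
};

/* Atomically set bit "nr"; concurrent updates of other bits are preserved. */
static inline void demo_set_bit(int nr, atomic_ulong *addr)
{
	atomic_fetch_or(addr, 1UL << nr);
}

/* Atomically clear bit "nr". */
static inline void demo_clear_bit(int nr, atomic_ulong *addr)
{
	atomic_fetch_and(addr, ~(1UL << nr));
}

/* Read bit "nr". */
static inline bool demo_test_bit(int nr, atomic_ulong *addr)
{
	return (atomic_load(addr) >> nr) & 1UL;
}
```

A plain `field = 1` store on a shared bit-field or adjacent small field can overwrite a neighbouring flag that another CPU is updating at the same time; single-bit atomic operations avoid that.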


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-ubc-iostat-wire-vziolimit-into-deadline-io_scheduler

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 2e9e25bf5876c8354762828ca7c86f86eb72873d
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:44 2015 +0400

vziolimit: port diff-ubc-iostat-wire-vziolimit-into-deadline-io_scheduler

vziolimit: wire into deadline io-scheduler

Trivial patch for supporting deadline scheduler in iostat/iolimit

https://jira.sw.ru/browse/PCLIN-31058

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 block/deadline-iosched.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 20614a3..8923afa 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/compiler.h>
 #include <linux/rbtree.h>
+#include <bc/io_acct.h>
 
 /*
  * See Documentation/block/deadline-iosched.txt
@@ -108,6 +109,8 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	 */
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
 	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	ub_writeback_io(1, blk_rq_sectors(rq));
+	virtinfo_notifier_call_irq(VITYPE_IO, VIRTINFO_IO_OP_ACCOUNT, NULL);
 }
 
 /*
@@ -186,6 +189,12 @@ deadline_merged_requests(struct request_queue *q, struct 
request *req,
deadline_remove_request(q, next);
 }
 
+static void deadline_bio_merged(struct request_queue *q, struct request *req,
+   struct bio *bio)
+{
+   ub_writeback_io(0, bio_sectors(bio));
+}
+
 /*
  * move request from sort list to dispatch queue.
  */
@@ -445,6 +454,7 @@ static struct elevator_type iosched_deadline = {
.elevator_merge_fn =deadline_merge,
.elevator_merged_fn =   deadline_merged_request,
.elevator_merge_req_fn =deadline_merged_requests,
+   .elevator_bio_merged_fn =   deadline_bio_merged,
.elevator_dispatch_fn = deadline_dispatch_requests,
.elevator_add_req_fn =  deadline_add_request,
.elevator_former_req_fn =   elv_rb_former_request,
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] vziolimit: add blktrace hooks

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 3bfc1fa5b827b38ecc96c6ddaaa9cacfa4bb2eec
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:46 2015 +0400

vziolimit: add blktrace hooks

These hooks allow us to use standard blktrace tools to trace vziolimit
functionality.

https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 block/cfq-iosched.c  |  2 +-
 block/deadline-iosched.c |  2 +-
 fs/direct-io.c   |  2 +-
 kernel/ve/vziolimit.c| 18 ++
 4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 2937c58..c649678 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3925,7 +3925,7 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	cfqg_stats_update_io_add(RQ_CFQG(rq), cfqd->serving_group,
 				 rq->cmd_flags);
 
-	virtinfo_notifier_call_irq(VITYPE_IO, VIRTINFO_IO_OP_ACCOUNT, NULL);
+	virtinfo_notifier_call_irq(VITYPE_IO, VIRTINFO_IO_OP_ACCOUNT, q);
 	cfq_rq_enqueued(cfqd, cfqq, rq);
 }
 
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 8923afa..792b305 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -110,7 +110,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
 	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
 	ub_writeback_io(1, blk_rq_sectors(rq));
-	virtinfo_notifier_call_irq(VITYPE_IO, VIRTINFO_IO_OP_ACCOUNT, NULL);
+	virtinfo_notifier_call_irq(VITYPE_IO, VIRTINFO_IO_OP_ACCOUNT, q);
 }
 
 /*
diff --git a/fs/direct-io.c b/fs/direct-io.c
index b4bb2e9..8e67f35 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -721,7 +721,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 {
 	int ret = 0;
 
-	virtinfo_notifier_call(VITYPE_IO, VIRTINFO_IO_PREPARE, NULL);
+	virtinfo_notifier_call(VITYPE_IO, VIRTINFO_IO_PREPARE, bdev_get_queue(map_bh->b_bdev));
 
 	if (dio->rw & WRITE) {
 		/*
diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 4a0fee7..6239c4c 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -11,6 +11,8 @@
 #include <linux/virtinfo.h>
 #include <linux/vzctl.h>
 #include <linux/vziolimit.h>
+#include <linux/blkdev.h>
+#include <linux/blktrace_api.h>
 #include <asm/uaccess.h>
 #include <bc/beancounter.h>
 
@@ -153,6 +155,7 @@ static int iolimit_virtinfo(struct vnotifier_block *nb,
 	struct user_beancounter *ub = get_exec_ub();
 	struct iolimit *iolimit = ub->private_data2;
 	unsigned long flags, timeout;
+	struct request_queue *q;
 
 	if (!iolimit)
 		return old_ret;
@@ -175,8 +178,16 @@ static int iolimit_virtinfo(struct vnotifier_block *nb,
 		break;
 	case VIRTINFO_IO_FUSE_REQ:
 	case VIRTINFO_IO_OP_ACCOUNT:
+
 		if (!iolimit->iops.speed)
 			break;
+
+		q = (struct request_queue *) arg;
+		if (q)
+			blk_add_trace_msg(q, "vziolimit iops ub:%s speed:%d remain:%d ",
+					  ub->ub_name,iolimit->iops.speed,
+					  iolimit->iops.remain);
+
 		spin_lock_irqsave(&ub->ub_lock, flags);
 		if (iolimit->iops.speed) {
 			throttle_charge(&iolimit->iops, 1);
@@ -192,9 +203,16 @@ static int iolimit_virtinfo(struct vnotifier_block *nb,
 		break;
 	case VIRTINFO_IO_PREPARE:
 	case VIRTINFO_IO_JOURNAL:
+
 		if (current->flags & PF_SWAPWRITE)
 			break;
+
 		timeout = iolimit_timeout(iolimit);
+		q = (struct request_queue *) arg;
+		if (q)
+			blk_add_trace_msg(q, "vziolimit sleep ub:%s speed:%ld ",
+					  ub->ub_name, timeout);
+
 		if (timeout && !fatal_signal_pending(current))
 			iolimit_wait(iolimit, timeout);
 		break;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-iolimit-handle-virtinfo-events

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 337d82253d3db00d6f8fff7ac939f0cd155173aa
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:35 2015 +0400

vziolimit: port diff-iolimit-handle-virtinfo-events

iolimit: wire throttler into virtinfo

Call throttler methods from virtinfo hook.

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/ve/vziolimit.c | 36 
 1 file changed, 36 insertions(+)

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index d70d416..949d1a6 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -84,10 +84,46 @@ static unsigned long throttle_timeout(struct throttle *th, unsigned long now)
 	return min(time - now, (unsigned long)th->latency);
 }
 
+struct iolimit {
+	struct throttle throttle;
+};
 
 static int iolimit_virtinfo(struct vnotifier_block *nb,
 		unsigned long cmd, void *arg, int old_ret)
 {
+	struct user_beancounter *ub = top_beancounter(get_exec_ub());
+	struct iolimit *iolimit = ub->private_data2;
+	unsigned long flags, timeout;
+
+	if (!iolimit)
+		return old_ret;
+
+	if (!iolimit->throttle.speed)
+		return NOTIFY_OK;
+
+	switch (cmd) {
+	case VIRTINFO_IO_ACCOUNT:
+		spin_lock_irqsave(&ub->ub_lock, flags);
+		throttle_charge(&iolimit->throttle, *(size_t*)arg);
+		spin_unlock_irqrestore(&ub->ub_lock, flags);
+		break;
+	case VIRTINFO_IO_PREPARE:
+	case VIRTINFO_IO_JOURNAL:
+		timeout = throttle_timeout(&iolimit->throttle, jiffies);
+		if (timeout) {
+			__set_current_state(TASK_UNINTERRUPTIBLE);
+			schedule_timeout(timeout);
+		}
+		break;
+	case VIRTINFO_IO_READAHEAD:
+	case VIRTINFO_IO_CONGESTION:
+		timeout = throttle_timeout(&iolimit->throttle, jiffies);
+		if (timeout)
+			return NOTIFY_FAIL;
+		break;
+	}
+
+	return NOTIFY_OK;
 }
 
 static struct vnotifier_block iolimit_virtinfo_nb = {
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-iolimit-vzctl-api

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit e72e1777eb631b77c97358f821d733d93349b8b6
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:35 2015 +0400

vziolimit: port diff-iolimit-vzctl-api

iolimit: vzctl ioctl interface

Implement iolimit control block allocation, add get and set ioctl.

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 include/linux/vziolimit.h | 10 +++
 kernel/ve/vziolimit.c | 68 +++
 2 files changed, 78 insertions(+)

diff --git a/include/linux/vziolimit.h b/include/linux/vziolimit.h
index a017b0f..5af8c04 100644
--- a/include/linux/vziolimit.h
+++ b/include/linux/vziolimit.h
@@ -14,4 +14,14 @@
 
 #define VZIOLIMITTYPE 'I'
 
+struct iolimit_state {
+   unsigned int id;
+   unsigned int speed;
+   unsigned int burst;
+   unsigned int latency;
+};
+
+#define VZCTL_SET_IOLIMIT  _IOW(VZIOLIMITTYPE, 0, struct iolimit_state)
+#define VZCTL_GET_IOLIMIT  _IOR(VZIOLIMITTYPE, 1, struct iolimit_state)
+
 #endif /* _LINUX_VZIOLIMIT_H */
diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 949d1a6..ac5abcd 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -11,6 +11,7 @@
 #include <linux/virtinfo.h>
 #include <linux/vzctl.h>
 #include <linux/vziolimit.h>
+#include <asm/uaccess.h>
 #include <bc/beancounter.h>
 
 struct throttle {
@@ -132,6 +133,73 @@ static struct vnotifier_block iolimit_virtinfo_nb = {
 
 static int iolimit_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
+	struct user_beancounter *ub;
+	struct iolimit *iolimit, *new_iolimit = NULL;
+	struct iolimit_state state;
+	int err;
+
+	if (cmd != VZCTL_SET_IOLIMIT && cmd != VZCTL_GET_IOLIMIT)
+		return -ENOTTY;
+
+	if (copy_from_user(&state, (void __user *)arg, sizeof(state)))
+		return -EFAULT;
+
+	ub = get_beancounter_byuid(state.id, 0);
+	if (!ub)
+		return -ENOENT;
+
+	iolimit = ub->private_data2;
+
+	switch (cmd) {
+	case VZCTL_SET_IOLIMIT:
+		if (!iolimit) {
+			new_iolimit = kmalloc(sizeof(struct iolimit), GFP_KERNEL);
+			err = -ENOMEM;
+			if (!new_iolimit)
+				break;
+		}
+
+		spin_lock_irq(&ub->ub_lock);
+
+		if (!iolimit && ub->private_data2) {
+			kfree(new_iolimit);
+			iolimit = ub->private_data2;
+		} else if (!iolimit)
+			iolimit = new_iolimit;
+
+		throttle_setup(&iolimit->throttle, state.speed,
+				state.burst, state.latency);
+
+		if (!ub->private_data2)
+			ub->private_data2 = iolimit;
+
+		spin_unlock_irq(&ub->ub_lock);
+
+		err = 0;
+		break;
+	case VZCTL_GET_IOLIMIT:
+		err = -ENXIO;
+		if (!iolimit)
+			break;
+
+		spin_lock_irq(&ub->ub_lock);
+		state.speed = iolimit->throttle.speed;
+		state.burst = iolimit->throttle.burst;
+		state.latency = jiffies_to_msecs(iolimit->throttle.latency);
+		spin_unlock_irq(&ub->ub_lock);
+
+		err = -EFAULT;
+		if (copy_to_user((void __user *)arg, &state, sizeof(state)))
+			break;
+
+		err = 0;
+		break;
+	default:
+		err = -ENOTTY;
+	}
+
+	put_beancounter(ub);
+	return err;
 }
 
 static struct vzioctlinfo iolimit_vzioctl = {
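
For readers on the user-space side, the request layout from the header hunk
above can be exercised without a kernel. The sketch below repeats the struct
and the two request macros from the patch; iolimit_make_state(), ioc_nr() and
ioc_type() are hypothetical helpers added for illustration. It deliberately
stops short of calling ioctl(), because the device node used by vzctl is not
named in this mail.

```c
#include <sys/ioctl.h>

#define VZIOLIMITTYPE 'I'

/* Same layout as in include/linux/vziolimit.h from the patch. */
struct iolimit_state {
	unsigned int id;
	unsigned int speed;
	unsigned int burst;
	unsigned int latency;
};

#define VZCTL_SET_IOLIMIT	_IOW(VZIOLIMITTYPE, 0, struct iolimit_state)
#define VZCTL_GET_IOLIMIT	_IOR(VZIOLIMITTYPE, 1, struct iolimit_state)

/*
 * Hypothetical helper: fill the argument for VZCTL_SET_IOLIMIT.
 * latency is in milliseconds, since the kernel side converts it with
 * msecs_to_jiffies() on set and jiffies_to_msecs() on get.
 */
static struct iolimit_state iolimit_make_state(unsigned int id,
					       unsigned int speed,
					       unsigned int burst,
					       unsigned int latency_ms)
{
	struct iolimit_state st;

	st.id = id;
	st.speed = speed;
	st.burst = burst;
	st.latency = latency_ms;
	return st;
}

/* Hypothetical decoders: the low byte of a request is its number ... */
static unsigned int ioc_nr(unsigned long cmd)
{
	return cmd & 0xff;
}

/* ... and the next byte is the type character. */
static unsigned int ioc_type(unsigned long cmd)
{
	return (cmd >> 8) & 0xff;
}
```

A caller would pass the address of such a struct as the third ioctl()
argument; with VZCTL_GET_IOLIMIT only the id field needs to be set on entry.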


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-iolimit-wakup-at-kill-and-limit-change

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit c249e3f74d4d4ff315d9119fe25371f1384d
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:37 2015 +0400

vziolimit: port diff-iolimit-wakup-at-kill-and-limit-change

iolimit: wakeable and killable iolimit wait

* Make iolimit wait killable.
* Wakeup all tasks after iolimit speed change.

https://jira.sw.ru/browse/PCLIN-28566

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/ve/vziolimit.c | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 72b8d51..af16182 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -87,8 +87,25 @@ static unsigned long throttle_timeout(struct throttle *th, unsigned long now)
 
 struct iolimit {
 	struct throttle throttle;
+	wait_queue_head_t wq;
 };
 
+static void iolimit_wait(struct iolimit *iolimit, unsigned long timeout)
+{
+	DEFINE_WAIT(wait);
+
+	do {
+		prepare_to_wait(&iolimit->wq, &wait, TASK_KILLABLE);
+		timeout = schedule_timeout(timeout);
+		if (fatal_signal_pending(current))
+			break;
+		if (unlikely(timeout))
+			timeout = min(throttle_timeout(&iolimit->throttle,
+						jiffies), timeout);
+	} while (timeout);
+	finish_wait(&iolimit->wq, &wait);
+}
+
 static int iolimit_virtinfo(struct vnotifier_block *nb,
 		unsigned long cmd, void *arg, int old_ret)
 {
@@ -113,10 +130,8 @@ static int iolimit_virtinfo(struct vnotifier_block *nb,
 		if (current->flags & PF_FLUSHER)
 			break;
 		timeout = throttle_timeout(&iolimit->throttle, jiffies);
-		if (timeout) {
-			__set_current_state(TASK_UNINTERRUPTIBLE);
-			schedule_timeout(timeout);
-		}
+		if (timeout && !fatal_signal_pending(current))
+			iolimit_wait(iolimit, timeout);
 		break;
 	case VIRTINFO_IO_READAHEAD:
 	case VIRTINFO_IO_CONGESTION:
@@ -159,6 +174,7 @@ static int iolimit_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 			err = -ENOMEM;
 			if (!new_iolimit)
 				break;
+			init_waitqueue_head(&new_iolimit->wq);
 		}
 
 		spin_lock_irq(&ub->ub_lock);
@@ -177,6 +193,8 @@ static int iolimit_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 
 		spin_unlock_irq(&ub->ub_lock);
 
+		wake_up_all(&iolimit->wq);
+
 		err = 0;
 		break;
 	case VZCTL_GET_IOLIMIT:


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-vziolimit-no-top-bc

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 3fb3a3d21ddfa9a891388a6b9a0d7eff59e7ef1c
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:38 2015 +0400

vziolimit: port diff-vziolimit-no-top-bc

Signed-off-by: Stanislav Kinsbursky skinsbur...@parallels.com


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/ve/vziolimit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index af16182..a72a8ec 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -109,7 +109,7 @@ static void iolimit_wait(struct iolimit *iolimit, unsigned long timeout)
 static int iolimit_virtinfo(struct vnotifier_block *nb,
 		unsigned long cmd, void *arg, int old_ret)
 {
-	struct user_beancounter *ub = top_beancounter(get_exec_ub());
+	struct user_beancounter *ub = get_exec_ub();
 	struct iolimit *iolimit = ub->private_data2;
 	unsigned long flags, timeout;
 


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-ubc-iolimit-rework-async-iops-limit

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 351183df592c114604899c7c2c21aac0af75f773
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:41 2015 +0400

vziolimit: port diff-ubc-iolimit-rework-async-iops-limit

iolimit: rework async iops limit

Async IO no longer uses the last iops from the stash: it will slow down sync IO,
but should not choke it. As a result async writeback sometimes becomes free;
we cannot limit iops correctly for this case anyway.

test results:

[ container iops limit 10 iops/sec, ramlimit 256mb, file 1gb ]
# ioping -P 1 -q -i 0 -WWW -L -s 4k zero
[ cached random 4k writes, print statistics every second ]

before patch:

1 27388042 0 150 27388042 27388042 27388042 0
20496 10137323 2022 8281438 1 495 10091730 70489
1 15686980 0 261 15686980 15686980 15686980 0
14350 10145229 1414 5793620 1 707 10097028 84285
1 28054974 0 146 28054974 28054974 28054974 0
20473 10137498 2020 8272002 1 495 10092918 70537

iops not limited at all (3rd column),
but max latency (7th column) > 15 sec

after patch:

2048 957784 2138 8758351 1 468 526912 14975
2048 1157707 1769 7245882 1 565 626641 18083
1024 628864 1628 6669652 1 614 626675 19574
2048 1459687 1403 5746854 1 713 826885 22916
2048 759792 2695 11040664 1 371 428803 11907
2048 864740 2368 9700729 1 422 433732 13439
2048 860729 2379 9745934 1 420 429865 13376

iops not limited, but latency is sane now

meanwhile iops on sync writeback (fsync after each write) perfectly limited:

# ioping -P 1 -q -i 0 -WWW -s 4k zero

10 999898 10 40964 99945 0 100012 17
10 33 10 40963 99979 3 100010 11
10 34 10 40963 99908 3 100066 38
10 1000939 10 40922 99927 100094 101054 321
10 11 10 40964 99973 1 16 10
10 33 10 40963 99976 3 100014 13

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org
Acked-by: Pavel Emelyanov xe...@parallels.com


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/ve/vziolimit.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 19dd7ad..890d7e8 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -181,7 +181,13 @@ static int iolimit_virtinfo(struct vnotifier_block *nb,
 		spin_lock_irqsave(&ub->ub_lock, flags);
 		if (iolimit->iops.speed) {
 			throttle_charge(&iolimit->iops, 1);
-			iolimit->iops.state--;
+			/*
+			 * Writeback doesn't use last iops from stash
+			 * to avoid choking future sync operations.
+			 */
+			if (iolimit->iops.state > 1 ||
+			    !(current->flags & PF_FLUSHER))
+				iolimit->iops.state--;
 		}
 		spin_unlock_irqrestore(&ub->ub_lock, flags);
 		break;
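
The rule added by this hunk is easy to check in isolation. The sketch below
models it in user space; iops_consume() is a hypothetical helper, the stash
is assumed to be a signed counter, and the is_flusher flag stands in for the
PF_FLUSHER test on current->flags.

```c
#include <stdbool.h>

/*
 * Take one operation out of the stash, mirroring the patched condition:
 * a flusher (writeback) task never takes the last remaining operation,
 * so a later sync operation still finds something to consume.
 * Returns true if the stash was decremented.
 */
static bool iops_consume(long *state, bool is_flusher)
{
	if (*state > 1 || !is_flusher) {
		(*state)--;
		return true;
	}
	return false;
}
```

With two operations left, writeback takes one and is then refused the
second, while a sync caller can still take it. This matches the commit
message: async writeback sometimes runs for free, but sync IO is not choked.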
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-vziolimit-dont-change-state-at-speed-change

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 3cbd2017bcb15367f6fa56d273f0d3a292eb9cab
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:42 2015 +0400

vziolimit: port diff-vziolimit-dont-change-state-at-speed-change

vziolimit: dont change state at speed change

The freeze mentioned in the comment does not actually exist.
Besides, feeding the throttler gives an IO activity peak after each speed change.

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/ve/vziolimit.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 473e2b7..367fb68 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -35,9 +35,6 @@ static void throttle_setup(struct throttle *th, unsigned speed,
 	th->time = jiffies;
 	th->burst = burst;
 	th->latency = msecs_to_jiffies(latency);
-	/* feed throttler to avoid freezing */
-	if (th->state < burst)
-		th->state = burst;
 	wmb();
 	th->speed = speed;
 }
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] vziolimit: port diff-vziolimit-charge-fuse-requests-into-iops-throttler

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit a67e2965b48f351fdeee9cfc06b74f53999d80d0
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:43 2015 +0400

vziolimit: port diff-vziolimit-charge-fuse-requests-into-iops-throttler

vziolimit: charge fuse requests into iops throttler

And the final gravestone...

https://jira.sw.ru/browse/PSBM-10626

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org


https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/ve/vziolimit.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 367fb68..906e32a 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -173,6 +173,7 @@ static int iolimit_virtinfo(struct vnotifier_block *nb,
 		}
 		spin_unlock_irqrestore(&ub->ub_lock, flags);
 		break;
+	case VIRTINFO_IO_FUSE_REQ:
 	case VIRTINFO_IO_OP_ACCOUNT:
 		if (!iolimit->iops.speed)
 			break;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] vziolimit: compilation fix

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 1c2434b8793b0ca488370193d7897de94cc8afb1
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:44 2015 +0400

vziolimit: compilation fix

Patch is based on the commit 2bfd4efb04c695aa2c9e60c474d1c93decea9558
  From: Andrew Perepechko pa...@cloudlinux.com

https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 kernel/bc/beancounter.c | 1 +
 kernel/ve/vziolimit.c   | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
index b4da36b..c9b17de 100644
--- a/kernel/bc/beancounter.c
+++ b/kernel/bc/beancounter.c
@@ -428,6 +428,7 @@ struct user_beancounter *get_beancounter_byuid(uid_t uid, int create)
 	snprintf(name, sizeof(name), "%u", uid);
 	return get_beancounter_by_name(name, create);
 }
+EXPORT_SYMBOL(get_beancounter_byuid);
 
 uid_t ub_legacy_id(struct user_beancounter *ub)
 {
diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 2cfc58f..4a0fee7 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -185,14 +185,14 @@ static int iolimit_virtinfo(struct vnotifier_block *nb,
 			 * to avoid choking future sync operations.
 			 */
 			if (iolimit->iops.state > 1 ||
-			    !(current->flags & PF_FLUSHER))
+			    !(current->flags & PF_SWAPWRITE))
 				iolimit->iops.state--;
 		}
 		spin_unlock_irqrestore(&ub->ub_lock, flags);
 		break;
 	case VIRTINFO_IO_PREPARE:
 	case VIRTINFO_IO_JOURNAL:
-		if (current->flags & PF_FLUSHER)
+		if (current->flags & PF_SWAPWRITE)
 			break;
 		timeout = iolimit_timeout(iolimit);
 		if (timeout && !fatal_signal_pending(current))
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] cfq: add virtinfo hook for vziolimits

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 82a17ebd8f7404b3bbc0b3a277930974efdec3d2
Author: Dmitry Monakhov dmonak...@openvz.org
Date:   Tue May 5 13:44:45 2015 +0400

cfq: add virtinfo hook for vziolimits

https://jira.sw.ru/browse/PSBM-20104

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 block/cfq-iosched.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c410752..2937c58 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -14,6 +14,8 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
+#include <bc/io_acct.h>
+
 #include "blk.h"
 #include "blk-cgroup.h"
 
@@ -3922,6 +3924,8 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	cfq_add_rq_rb(rq);
 	cfqg_stats_update_io_add(RQ_CFQG(rq), cfqd->serving_group,
 				 rq->cmd_flags);
+
+	virtinfo_notifier_call_irq(VITYPE_IO, VIRTINFO_IO_OP_ACCOUNT, NULL);
 	cfq_rq_enqueued(cfqd, cfqq, rq);
 }
 


[Devel] [PATCH RHEL7 COMMIT] vziolimit: correct copyright info

2015-05-05 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.9
--
commit 519489776d209b1e00676c569fa03daf38809488
Author: Konstantin Khorenko khore...@openvz.org
Date:   Tue May 5 13:52:17 2015 +0400

vziolimit: correct copyright info

Signed-off-by: Konstantin Khorenko khore...@openvz.org
---
 include/linux/vziolimit.h | 3 +--
 kernel/ve/vziolimit.c | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/vziolimit.h b/include/linux/vziolimit.h
index 6452b22..0eef650 100644
--- a/include/linux/vziolimit.h
+++ b/include/linux/vziolimit.h
@@ -1,8 +1,7 @@
 /*
  *  include/linux/vziolimit.h
  *
- *  Copyright (C) 2010, Parallels inc.
- *  All rights reserved.
+ *  Copyright (c) 2010-2015 Parallels IP Holdings GmbH
  *
  */
 
diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 6239c4c..a6f900d 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -1,8 +1,7 @@
 /*
  *  kernel/ve/vziolimit.c
  *
- *  Copyright (C) 2010, Parallels inc.
- *  All rights reserved.
+ *  Copyright (c) 2010-2015 Parallels IP Holdings GmbH
  *
  */
 


[Devel] [PATCH RHEL7 COMMIT] ve/printk/ppc: fix asm modifier in DEFINE_STRUCT_MEMBER_ALIAS

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit d2b8e29be14cbd4da0b641156fffd9204ca0d70a
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Thu May 7 20:28:18 2015 +0400

ve/printk/ppc: fix asm modifier in DEFINE_STRUCT_MEMBER_ALIAS

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

gcc fails to parse %a0 on PPC for some reason.
Use conventional %c0 instead.

 ‘%cdigit’ can be used to substitute an operand that is a constant
 value without the syntax that normally indicates an immediate operand.
[...]
 ‘%adigit’ can be used to substitute an operand as if it were a memory
 reference, with the actual operand treated as the address. This may be
 useful when outputting a “load address” instruction, because often the
 assembler syntax for such an instruction requires you to write the
 operand as if it were a memory reference.

(Source: 
https://gcc.gnu.org/onlinedocs/gccint/Output-Template.html#Output-Template)

Also from GCC source code (gcc/final.c):

 /* Output text from TEMPLATE to the assembler output file,
obeying %-directions to substitute operands taken from
the vector OPERANDS.

%N (for N a digit) means print operand N in usual manner.
%lN means require operand N to be a CODE_LABEL or LABEL_REF
   and print the label name with no punctuation.
%cN means require operand N to be a constant
   and print the constant expression with no punctuation.
%aN means expect operand N to be a memory address
   (not a memory reference!) and print a reference
   to that address.
%nN means expect operand N to be a constant
   and print a constant expression for minus the value
   of the operand, with no other punctuation.  */

I want to define log_buf, log_buf_len, etc as aliases for
init_log_state.buf, init_log_state.buf_len, etc. So for log_buf_len
(offset 0x8 in struct log_state on 64 bit) I want to generate

.globl log_buf_len
.set log_buf_len, init_log_state+0x8

 ^^^
NOTE: No punctuation here
(no $ sign, which is prepended if %N is used)

So I obviously should have used %cN, because it explicitly states:
constant expression with no punctuation.

It turns out that %aN also fits on Linux/x86, because it seems that a
memory address can always be used as a memory offset. However, it seems
that on Linux/PPC gcc expects a memory address to be in some specific
range, otherwise it just crashes, so that offset and address are
orthogonal things.

Hope this explains the magic behind this patch.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 kernel/printk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/printk.c b/kernel/printk.c
index 9ee2b80..51805a54 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -293,7 +293,7 @@ static struct log_state {
 static void  ## name ## _definition(void) __attribute__((used));   \
 static void  ## name ## _definition(void)  \
 {  \
-	asm (".globl " #name "\n\t.set " #name ", " #inst "+%a0"	\
+	asm (".globl " #name "\n\t.set " #name ", " #inst "+%c0"	\
 	     : : "g" (offsetof(typeof(inst), memb)));		\
 }  \
 extern typeof(inst.memb) name;
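
The trick described in the commit message can be tried in user space. The
following is a sketch under several assumptions: GCC or Clang on a Linux ELF
target, a reduced struct log_state with only two members, an "i" constraint
instead of the "g" used in the patch (so the offset is a compile-time
constant at any optimisation level), and a hypothetical read_log_buf_len()
accessor. The macro name is taken from the patch subject; its body is a
reconstruction, not a copy of kernel/printk.c.

```c
#include <stddef.h>

/* Reduced stand-in for the kernel's struct log_state. */
struct log_state {
	char *buf;
	unsigned long buf_len;
};

struct log_state init_log_state = { NULL, 4096 };

/*
 * Emit ".globl name" and ".set name, inst+<offset>" so that `name`
 * becomes a second symbol for inst.memb. %c0 prints the constant
 * without immediate punctuation (no '$' on x86).
 */
#define DEFINE_STRUCT_MEMBER_ALIAS(name, inst, memb)			\
	static void name##_definition(void) __attribute__((used));	\
	static void name##_definition(void)				\
	{								\
		__asm__ (".globl " #name "\n\t"				\
			 ".set " #name ", " #inst "+%c0"		\
			 : : "i" (offsetof(__typeof__(inst), memb)));	\
	}								\
	extern __typeof__(inst.memb) name

DEFINE_STRUCT_MEMBER_ALIAS(log_buf_len, init_log_state, buf_len);

/* Reads the member through the alias symbol. */
static unsigned long read_log_buf_len(void)
{
	return log_buf_len;
}
```

The compiler treats log_buf_len as an unrelated extern object, so code that
writes through one name and immediately reads through the other should not
rely on the two accesses being ordered.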


[Devel] [PATCH RHEL7 COMMIT] ve/ppc: remove vdso32_pages declaration from ppc elf.h

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 5198d98193b77594bc1c03505dcf1ce0da98cfab
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:20 2015 +0400

ve/ppc: remove vdso32_pages declaration from ppc elf.h

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

Remove some junk from arch/powerpc/include/asm/elf.h.

This is a fix to commit 0800da07.

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 arch/powerpc/include/asm/elf.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index 925e1da..935b5e7 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -118,7 +118,6 @@ extern int ucache_bsize;
 /* vDSO has arch_setup_additional_pages */
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES
 struct linux_binprm;
-export struct page *vdso32_pages[1];
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
   int uses_interp);
 #define VDSO_AUX_ENT(a,b) NEW_AUX_ENT(a,b)


[Devel] [PATCH RHEL7 COMMIT] vzstat: KSTAT_PERF_ENTER redefinition fixed when !CONFIG_VE

2015-05-07 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.4.10
--
commit 4589238b74c04892cdf0a2b31a7beb04a2a93abd
Author: Kir Kolyshkin k...@openvz.org
Date:   Thu May 7 20:28:21 2015 +0400

vzstat: KSTAT_PERF_ENTER redefinition fixed when !CONFIG_VE

This was found while trying to compile the kernel with a stock
config (i.e. no CONFIG_BEANCOUNTERS, CONFIG_VE etc.) and
boot it on IBM Power8.

=

This is a fix to commit df6cdba, fixing the following warning:

In file included from include/linux/ve.h:18:0,
 from init/main.c:78:
include/linux/vzstat.h:120:0: warning: KSTAT_PERF_ENTER redefined
[enabled by default]
 #define KSTAT_PERF_ENTER(name)
 ^
include/linux/vzstat.h:119:0: note: this is the location of the previous
definition
 #define KSTAT_PERF_ENTER(ptr, real_time, cpu_time)
 ^

Signed-off-by: Kir Kolyshkin k...@openvz.org
---
 include/linux/vzstat.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/vzstat.h b/include/linux/vzstat.h
index 0931b76..26f24f7 100644
--- a/include/linux/vzstat.h
+++ b/include/linux/vzstat.h
@@ -116,7 +116,7 @@ extern void KSTAT_LAT_UPDATE(struct kstat_lat_struct *p);
 extern void KSTAT_LAT_PCPU_UPDATE(struct kstat_lat_pcpu_struct *p);
 
 #else
-#define KSTAT_PERF_ENTER(ptr, real_time, cpu_time)
+#define KSTAT_PERF_ADD(ptr, real_time, cpu_time)
 #define KSTAT_PERF_ENTER(name)
 #define KSTAT_PERF_LEAVE(name)
 #define KSTAT_LAT_ADD(p, dur)


Re: [Devel] [PATCH rh6] syslog: distinguish between /proc/kmsg and syscalls (Pavel Tikhomirov)

2015-05-13 Thread Konstantin Khorenko
On 05/11/2015 02:56 PM, Sergey Korshunoff wrote:
> Current openvz kernel (2.6.32) already has this feature.
> What purpose of this message?

So, just for the record: as it is written in comments for
https://bugzilla.openvz.org/show_bug.cgi?id=2693

the patch for the issue is absent in kernels < 2.6.32-042stab109.4.

Sergey had already applied the same or a similar patch to his own kernel,
so he had this fixed in advance.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team


[Devel] [PATCH RHEL7 COMMIT] mm/tswap/tcache: enable tcache and tswap by default

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit a1cd5a98145e5032cad97a0bf15e3e0904fad8d0
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Mon May 18 17:00:04 2015 +0400

mm/tswap/tcache: enable tcache and tswap by default

We use both of them => enable tcache and tswap by default.

In order to disable them add appropriate kernel boot options:
tcache.enabled=0
tswap.enabled=0

https://jira.sw.ru/browse/PSBM-31757
https://jira.sw.ru/browse/PSBM-32063

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/tcache.c | 2 +-
 mm/tswap.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index bc740f0..e83ad05 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -125,7 +125,7 @@ static struct tcache_lru *tcache_lru_node;
  */
 
 /* Enable/disable tcache backend (set at boot time) */
-static bool tcache_enabled __read_mostly;
+static bool tcache_enabled __read_mostly = true;
 module_param_named(enabled, tcache_enabled, bool, 0444);
 
 /* Enable/disable populating the cache */
diff --git a/mm/tswap.c b/mm/tswap.c
index c4effa3..4b792cd 100644
--- a/mm/tswap.c
+++ b/mm/tswap.c
@@ -27,7 +27,7 @@ struct tswap_lru {
 static struct tswap_lru *tswap_lru_node;
 
 /* Enable/disable tswap backend (set at boot time) */
-static bool tswap_enabled __read_mostly;
+static bool tswap_enabled __read_mostly = true;
 module_param_named(enabled, tswap_enabled, bool, 0444);
 
 /* Enable/disable populating the cache */


[Devel] [PATCH RHEL7 COMMIT] ve/cgroups: fake num_cgroups in /proc/cgroups output

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 213b5800cbf1e1f36efaab61f2f49ea198bdb1e8
Author: Vasily Averin v...@odin.com
Date:   Mon May 18 16:32:55 2015 +0400

ve/cgroups: fake num_cgroups in /proc/cgroups output

Like in rh6-based kernels,
/proc/cgroups output inside container will show 1 in 'num_cgroups' column.

https://jira.sw.ru/browse/PSBM-33400

Signed-off-by: Vasily Averin v...@openvz.org

khorenko@:
This is done to discourage people from trying to guess the
number of Containers running on a Hardware Node:
even if the guess is correct, it gives no useful info,
and people can easily come to wrong conclusions.
---
 kernel/cgroup.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f881f69..f897042 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4815,6 +4815,8 @@ out:
return retval;
 }
 
+#define _cg_virtualized(x) ((ve_is_super(get_exec_env())) ? (x) : 1)
+
 /* Display information about each subsystem and each hierarchy */
 static int proc_cgroupstats_show(struct seq_file *m, void *v)
 {
@@ -4829,11 +4831,14 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v)
 	mutex_lock(&cgroup_mutex);
 	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
 		struct cgroup_subsys *ss = subsys[i];
+		int num;
+
 		if (ss == NULL)
 			continue;
+		num = _cg_virtualized(ss->root->number_of_cgroups);
 		seq_printf(m, "%s\t%d\t%d\t%d\n",
 			   ss->name, ss->root->hierarchy_id,
-			   ss->root->number_of_cgroups, !ss->disabled);
+			   num, !ss->disabled);
 	}
 	mutex_unlock(&cgroup_mutex);
 	return 0;
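
The effect of _cg_virtualized() on a /proc/cgroups row can be shown with a
small user-space model. This is a sketch: cg_virtualized() and
cgroupstats_line() are hypothetical helpers, and the boolean argument stands
in for ve_is_super(get_exec_env()).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* The host (VE0) sees the real count, a container always sees 1. */
static int cg_virtualized(bool ve_is_super, int number_of_cgroups)
{
	return ve_is_super ? number_of_cgroups : 1;
}

/*
 * Format one row the way proc_cgroupstats_show() does:
 * name, hierarchy id, num_cgroups, enabled.
 * Returns the number of characters snprintf() would have written.
 */
static int cgroupstats_line(char *buf, size_t len, bool ve_is_super,
			    const char *name, int hierarchy_id,
			    int number_of_cgroups, bool enabled)
{
	int num = cg_virtualized(ve_is_super, number_of_cgroups);

	return snprintf(buf, len, "%s\t%d\t%d\t%d\n",
			name, hierarchy_id, num, enabled);
}
```

For a subsystem with 57 cgroups the host row keeps the value 57, while the
same row read inside a container reports 1.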


[Devel] [PATCH RHEL7 COMMIT] ve/net/printk: net_veboth_ratelimited introduced

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 16d8b1d984f26100bf006ed93fcd47642401dd26
Author: Vasily Averin v...@odin.com
Date:   Mon May 18 12:29:44 2015 +0400

ve/net/printk: net_veboth_ratelimited introduced

net_veboth_ratelimited is required to save net-ratelimited messages
in both the host and the container dmesg buffers

Signed-off-by:  Vasily Averin v...@openvz.org
Acked-by: Kirill Tkhai ktk...@odin.com
---
 include/linux/net.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/net.h b/include/linux/net.h
index d7b2205..7e59abe 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -249,6 +249,8 @@ do {								\
 	net_ratelimited_function(pr_debug, fmt, ##__VA_ARGS__)
 #define net_velog_ratelimited(fmt, ...)				\
 	net_ratelimited_function(ve_printk, VE_LOG, fmt, ##__VA_ARGS__)
+#define net_veboth_ratelimited(fmt, ...)			\
+	net_ratelimited_function(ve_printk, VE_LOG_BOTH, fmt, ##__VA_ARGS__)
 
 
 #define net_random()   prandom_u32()
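
How the new macro composes with the rate limiter can be seen in a user-space
model. Everything below except the two macro bodies is a stub made up for
illustration: the VE0_LOG/VE_LOG/VE_LOG_BOTH values, the counter-based
ve_printk(), and a net_ratelimit() that simply allows two messages. It needs
GNU C because of ##__VA_ARGS__, as the kernel header does.

```c
#include <stdarg.h>
#include <stdio.h>

/* Assumed flag values: one bit per destination buffer. */
enum { VE0_LOG = 1, VE_LOG = 2, VE_LOG_BOTH = VE0_LOG | VE_LOG };

static int host_lines;	/* stands in for the host dmesg buffer */
static int ct_lines;	/* stands in for the container dmesg buffer */
static int budget = 2;	/* stub limiter: allow two messages, drop the rest */

static int net_ratelimit(void)
{
	if (budget <= 0)
		return 0;
	budget--;
	return 1;
}

/* Stub ve_printk(): formats the line and counts where it would go. */
static int ve_printk(int dst, const char *fmt, ...)
{
	char line[128];
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(line, sizeof(line), fmt, ap);
	va_end(ap);

	if (dst & VE0_LOG)
		host_lines++;
	if (dst & VE_LOG)
		ct_lines++;
	return n;
}

#define net_ratelimited_function(function, ...)			\
do {								\
	if (net_ratelimit())					\
		function(__VA_ARGS__);				\
} while (0)

#define net_veboth_ratelimited(fmt, ...)			\
	net_ratelimited_function(ve_printk, VE_LOG_BOTH, fmt, ##__VA_ARGS__)
```

Each message that passes the limiter is counted once for the host and once
for the container; messages beyond the budget reach neither.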


Re: [Devel] [RFC rh7] ve: cgroups -- Allow to attach non-self into ve cgroups

2015-05-18 Thread Konstantin Khorenko
On 05/14/2015 07:52 PM, Cyrill Gorcunov wrote:
> In vzctl/libvzctl bundle we restore container like
>
>  - create ve/$ctid cgroup
>  - move self into this cgroup
>  - run criu from inside
>
> So that kernel code passes ve_can_attach test. In turn for
> our P.Haul project (which is managing live migration) the
> situation is different -- it opens ve/$ctid but moves
> criu service pid instead (so that the service will
> start restore procedure). Which leads to situation
> where ve_can_attach fails with -EINVAL.
>
> Reported-by: Nikita Spiridonov nspirido...@odin.com
> Signed-off-by: Cyrill Gorcunov gorcu...@odin.com
> CC: Vladimir Davydov vdavy...@odin.com
> CC: Konstantin Khorenko khore...@odin.com
> CC: Pavel Emelyanov xe...@odin.com
> CC: Andrey Vagin ava...@odin.com
> ---
>
> Guys, could you please take a look, especially from
> security POV, is it safe to remove all these checks?
>
>  kernel/ve/ve.c |   31 +--
>  1 file changed, 13 insertions(+), 18 deletions(-)
>
> Index: linux-pcs7.git/kernel/ve/ve.c
> ===
> --- linux-pcs7.git.orig/kernel/ve/ve.c
> +++ linux-pcs7.git/kernel/ve/ve.c
> @@ -750,13 +750,6 @@ static void ve_destroy(struct cgroup *cg
>  static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset)
>  {
>  	struct ve_struct *ve = cgroup_ve(cg);
> -	struct task_struct *task = current;
> -
> -	if (cgroup_taskset_size(tset) != 1 ||
> -	    cgroup_taskset_first(tset) != task ||
> -	    !thread_group_leader(task) ||
> -	    !thread_group_empty(task))
> -		return -EINVAL;

Is it true that without these checks a single thread of a multithreaded process
can enter a CT?
If not, where is the check for this case?
If so, let's prohibit it.

>  	if (ve->is_locked)
>  		return -EBUSY;
> @@ -775,20 +768,22 @@ static int ve_can_attach(struct cgroup *
>  static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
>  {
>  	struct ve_struct *ve = cgroup_ve(cg);
> -	struct task_struct *tsk = current;
> -
> -	/* this probihibts ptracing of task entered to VE from host system */
> -	if (ve->is_running && tsk->mm)
> -		tsk->mm->vps_dumpable = VD_VE_ENTER_TASK;
> +	struct task_struct *tsk;
>  
> -	/* Drop OOM protection. */
> -	tsk->signal->oom_score_adj = 0;
> -	tsk->signal->oom_score_adj_min = 0;
> +	cgroup_taskset_for_each(tsk, cg, tset) {
> +		/* this probihibts ptracing of task entered to VE from host system */
> +		if (ve->is_running && tsk->mm)
> +			tsk->mm->vps_dumpable = VD_VE_ENTER_TASK;
> +
> +		/* Drop OOM protection. */
> +		tsk->signal->oom_score_adj = 0;
> +		tsk->signal->oom_score_adj_min = 0;
>  
> -	/* Leave parent exec domain */
> -	tsk->parent_exec_id--;
> +		/* Leave parent exec domain */
> +		tsk->parent_exec_id--;
>  
> -	tsk->task_ve = ve;
> +		tsk->task_ve = ve;
> +	}
>  }
>  
>  static int ve_state_read(struct cgroup *cg, struct cftype *cft,


[Devel] [PATCH RHEL7 COMMIT] ve/netfilter: ve_printk for nf_conntrack: table full

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 8782918c418820d5127afa4a5db74c9b3eac3b82
Author: Vasily Averin v...@odin.com
Date:   Mon May 18 12:29:57 2015 +0400

ve/netfilter: ve_printk for nf_conntrack: table full

Port of diff-ve-printk-conntrack-tables-full from rh6-based kernels.

The "nf_conntrack: table full, dropping packet" message
should be visible both in the CT and on the HN, and
should contain the CTID for readability.

https://bugzilla.openvz.org/show_bug.cgi?id=2940

Signed-off-by: Vasily Averin v...@openvz.org
Acked-by: Kirill Tkhai ktk...@odin.com
---
 net/netfilter/nf_conntrack_core.c   | 4 +++-
 net/netfilter/nf_conntrack_expect.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 495b859..017c755 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -696,7 +696,9 @@ __nf_conntrack_alloc(struct net *net, u16 zone,
unlikely(atomic_read(&net->ct.count) > ct_max)) {
if (!early_drop(net, hash_bucket(hash, net))) {
atomic_dec(&net->ct.count);
-   net_warn_ratelimited("nf_conntrack: table full, dropping packet\n");
+   net_veboth_ratelimited(KERN_WARNING "VE%u: "
+   "nf_conntrack table full, dropping packet\n",
+   net->owner_ve->veid);
return ERR_PTR(-ENOMEM);
}
}
diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index d80db92..bfa95fd 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -408,7 +408,9 @@ static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect)
}
 
if (net->ct.expect_count >= init_net.ct.expect_max) {
-   net_warn_ratelimited("nf_conntrack: expectation table full\n");
+   net_veboth_ratelimited(KERN_WARNING "VE%u "
+   "nf_conntrack: expectation table full\n",
+   net->owner_ve->veid);
ret = -EMFILE;
}
 out:


[Devel] [PATCH RHEL7 COMMIT] ve/fairsched: drop host node

2015-05-13 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 1b02143f5ddecfa202e68708627be4f6aaf5e5ac
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Wed May 13 20:11:17 2015 +0400

ve/fairsched: drop host node

The fairsched host node, i.e. cpu/cpuset cgroup /0, conflicts with
systemd: the latter moves all processes out of it and even tries to
delete it. To make it work as expected we should create /0 from the
userspace via systemd.

https://jira.sw.ru/browse/PSBM-33487

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
Acked-by: Cyrill Gorcunov gorcu...@openvz.org
---
 kernel/fairsched.c | 20 ++--
 1 file changed, 2 insertions(+), 18 deletions(-)

diff --git a/kernel/fairsched.c b/kernel/fairsched.c
index 978fd12..0d0fa5c 100644
--- a/kernel/fairsched.c
+++ b/kernel/fairsched.c
@@ -26,7 +26,6 @@ struct fairsched_node {
 };
 
 static struct fairsched_node root_node = {NULL, NULL};
-static struct fairsched_node host_node = {NULL, NULL};
 
 /* fairsched use node id = INT_MAX for ve0 tasks */
 #define FAIRSCHED_HOST_NODE 2147483647
@@ -91,7 +90,7 @@ static int fairsched_move(struct fairsched_node *node, struct task_struct *tsk)
 
ret = cgroup_kernel_attach(node->cpuset, tsk);
if (ret) {
-   err = cgroup_kernel_attach(host_node.cpu, tsk);
+   err = cgroup_kernel_attach(root_node.cpu, tsk);
if (err)
printk(KERN_ERR "Cleanup error, fairsched id=, err=%d\n", err);
}
@@ -379,7 +378,7 @@ void fairsched_drop_node(int id, int leave)
int err;
 
if (leave) {
-   err = fairsched_move(&host_node, current);
+   err = fairsched_move(&root_node, current);
if (err)
printk(KERN_ERR "Can't leave fairsched node %d err=%d\n", id, err);
@@ -769,7 +768,6 @@ extern int sysctl_sched_rt_runtime;
 int __init fairsched_init(void)
 {
struct vfsmount *cpu_mnt, *cpuset_mnt;
-   int ret;
struct cgroup_sb_opts cpu_opts = {
.name   = vz_compat ? "fairsched" : NULL,
.subsys_mask=
@@ -794,20 +792,6 @@ int __init fairsched_init(void)
}
root_node.cpuset = cgroup_get_root(cpuset_mnt);
 
-   ret = fairsched_create(&host_node, 0);
-   if (ret)
-   return ret;
-
-   ret = sched_cgroup_set_rt_runtime(host_node.cpu,
- 3 * sysctl_sched_rt_runtime / 4);
-   if (ret)
-   printk(KERN_WARNING
-  "Can't set rt runtime for fairsched host: %d\n", ret);
-
-   ret = fairsched_move(&host_node, init_pid_ns.child_reaper);
-   if (ret)
-   return ret;
-
 #ifdef CONFIG_PROC_FS
proc_create("fairsched", S_ISVTX, NULL, &proc_fairsched_operations);
proc_create("fairsched2", S_ISVTX, NULL, &proc_fairsched_operations);


[Devel] [PATCH RHEL7 COMMIT] ub: drop host node

2015-05-13 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 7d10a4007de110332ffb866532fb5bd2ebe7a26a
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Wed May 13 20:11:27 2015 +0400

ub: drop host node

The ub host node, i.e. memory/blkio cgroup /0, conflicts with systemd:
the latter moves all processes out of it and even tries to delete it. To
make it work as expected we should create /0 from the userspace via
systemd.

https://jira.sw.ru/browse/PSBM-33487

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
Acked-by: Cyrill Gorcunov gorcu...@openvz.org
---
 kernel/bc/beancounter.c | 68 ++---
 1 file changed, 31 insertions(+), 37 deletions(-)

diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
index 28dfe43..cdbe846 100644
--- a/kernel/bc/beancounter.c
+++ b/kernel/bc/beancounter.c
@@ -100,6 +100,20 @@ static int resource_precharge_min = 0;
 static int resource_precharge_max = INT_MAX / NR_CPUS;
 static struct cgroup *mem_cgroup_root, *blkio_cgroup_root, *ub_cgroup_root;
 
+static struct cgroup *ub_cgroup_open(struct cgroup *root,
+struct user_beancounter *ub)
+{
+   if (ub == get_ub0())
+   return root;
+   return cgroup_kernel_open(root, CGRP_CREAT, ub->ub_name);
+}
+
+static void ub_cgroup_close(struct cgroup *root, struct cgroup *cg)
+{
+   if (cg != root)
+   cgroup_kernel_close(cg);
+}
+
 extern int mem_cgroup_apply_beancounter(struct cgroup *cg,
struct user_beancounter *ub);
 
@@ -109,18 +123,16 @@ static int ub_mem_cgroup_attach_task(struct user_beancounter *ub,
struct cgroup *cg;
int ret;
 
-   cg = cgroup_kernel_open(mem_cgroup_root, CGRP_CREAT, ub->ub_name);
+   cg = ub_cgroup_open(mem_cgroup_root, ub);
if (IS_ERR(cg))
return PTR_ERR(cg);
-
if (ub != get_ub0())
ret = mem_cgroup_apply_beancounter(cg, ub);
else
ret = 0;
if (!ret)
ret = cgroup_kernel_attach(cg, tsk);
-
-   cgroup_kernel_close(cg);
+   ub_cgroup_close(mem_cgroup_root, cg);
return ret;
 }
 
@@ -132,14 +144,11 @@ static int ub_blkio_cgroup_attach_task(struct user_beancounter *ub,
 
if (!ubc_ioprio)
return 0;
-
-   cg = cgroup_kernel_open(blkio_cgroup_root, CGRP_CREAT, ub->ub_name);
+   cg = ub_cgroup_open(blkio_cgroup_root, ub);
if (IS_ERR(cg))
return PTR_ERR(cg);
-
ret = cgroup_kernel_attach(cg, tsk);
-
-   cgroup_kernel_close(cg);
+   ub_cgroup_close(blkio_cgroup_root, cg);
return ret;
 }
  
@@ -149,17 +158,11 @@ static int ub_cgroup_attach_task(struct user_beancounter *ub,
struct cgroup *cg;
int ret;
 
-   if (ub != get_ub0()) {
-   cg = cgroup_kernel_open(ub_cgroup_root, CGRP_CREAT, ub->ub_name);
-   if (IS_ERR(cg))
-   return PTR_ERR(cg);
-   } else
-   cg = ub_cgroup_root;
-
+   cg = ub_cgroup_open(ub_cgroup_root, ub);
+   if (IS_ERR(cg))
+   return PTR_ERR(cg);
ret = cgroup_kernel_attach(cg, tsk);
-
-   if (ub != get_ub0())
-   cgroup_kernel_close(cg);
+   ub_cgroup_close(ub_cgroup_root, cg);
return ret;
 }
 
@@ -196,11 +199,11 @@ int ub_update_mem_cgroup_limits(struct user_beancounter *ub)
if (ub == get_ub0())
return -EPERM;
 
-   cg = cgroup_kernel_open(mem_cgroup_root, 0, ub->ub_name);
-   if (IS_ERR_OR_NULL(cg))
-   return PTR_ERR(cg) ?: -ENOENT;
+   cg = ub_cgroup_open(mem_cgroup_root, ub);
+   if (IS_ERR(cg))
+   return PTR_ERR(cg);
ret = mem_cgroup_apply_beancounter(cg, ub);
-   cgroup_kernel_close(cg);
+   ub_cgroup_close(mem_cgroup_root, cg);
return ret;
 }
 
@@ -217,10 +220,10 @@ void ub_get_mem_cgroup_parms(struct user_beancounter *ub,
 
memset(parms, 0, sizeof(parms));
 
-   cg = cgroup_kernel_open(mem_cgroup_root, 0, ub->ub_name);
+   cg = ub_cgroup_open(mem_cgroup_root, ub);
if (!IS_ERR_OR_NULL(cg)) {
mem_cgroup_fill_ub_parms(cg, parms[0], parms[1], parms[2]);
-   cgroup_kernel_close(cg);
+   ub_cgroup_close(mem_cgroup_root, cg);
}
 
if (physpages)
@@ -242,14 +245,14 @@ void ub_page_stat(struct user_beancounter *ub, const nodemask_t *nodemask,
 
memset(pages, 0, sizeof(unsigned long) * NR_LRU_LISTS);
 
-   cg = cgroup_kernel_open(mem_cgroup_root, 0, ub->ub_name);
-   if (IS_ERR_OR_NULL(cg))
+   cg = ub_cgroup_open(mem_cgroup_root, ub);
+   if (IS_ERR(cg))
return;
 
for_each_node_mask(nid, *nodemask)

[Devel] [PATCH RHEL7 COMMIT] ve/cgroups: Allow to attach non-self into ve cgroups, v3

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 729323172bc760a2daf4d790a5bffc74ec10c04d
Author: Cyrill Gorcunov gorcu...@odin.com
Date:   Tue May 19 00:43:44 2015 +0400

ve/cgroups: Allow to attach non-self into ve cgroups, v3

In the vzctl/libvzctl bundle we restore a container as follows:

 - create the ve/$ctid cgroup
 - move self into this cgroup
 - run criu from inside

This way the kernel code passes the ve_can_attach test. For our
P.Haul project (which manages live migration) the situation is
different: it opens ve/$ctid but moves the criu service pid
instead (so that the service will start the restore procedure).
As a result, ve_can_attach fails with -EINVAL.

Basically we need to

1) Check that if a single task is being attached to a
   VE cgroup, it is a single-threaded task.

2) In the case of a multithreaded task, move all threads
   in one pass (this is actually prepared by the
   cgroup_attach_task caller).

3) If the VE is stopping or starting, allow only kernel
   threads to attach.

khorenko@:
The thread_group_empty(task) check is enough to be sure
the task is single-threaded.

https://jira.sw.ru/browse/PSBM-33561

Reported-by: Nikita Spiridonov nspirido...@odin.com
Signed-off-by: Cyrill Gorcunov gorcu...@odin.com

CC: Vladimir Davydov vdavy...@odin.com
CC: Konstantin Khorenko khore...@odin.com
CC: Pavel Emelyanov xe...@odin.com
CC: Andrey Vagin ava...@odin.com
---
 kernel/ve/ve.c | 51 ++-
 1 file changed, 30 insertions(+), 21 deletions(-)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index e598d15..cf7c848 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -775,24 +775,31 @@ static void ve_destroy(struct cgroup *cg)
 static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset)
 {
struct ve_struct *ve = cgroup_ve(cg);
-   struct task_struct *task = current;
-
-   if (cgroup_taskset_size(tset) != 1 ||
-   cgroup_taskset_first(tset) != task ||
-   !thread_group_leader(task) ||
-   !thread_group_empty(task))
-   return -EINVAL;
+   struct task_struct *task;
 
if (ve->is_locked)
return -EBUSY;
 
/*
+* We either moving the whole group of threads,
+* either a single thread process.
+*/
+   if (cgroup_taskset_size(tset) == 1) {
+   task = cgroup_taskset_first(tset);
+   if (!thread_group_empty(task))
+   return -EINVAL;
+   }
+
+   /*
 * Forbid userspace tasks to enter during starting or stopping.
-* Permit attaching kernel threads and init task for this containers.
+* Permit attaching kernel threads for this containers.
 */
-   if (!ve->is_running && (ve->ve_ns || nr_threads_ve(ve)) &&
-   !(task->flags & PF_KTHREAD))
-   return -EPIPE;
+   if (!ve->is_running && (ve->ve_ns || nr_threads_ve(ve))) {
+   cgroup_taskset_for_each(task, cg, tset) {
+   if (!(task->flags & PF_KTHREAD))
+   return -EPIPE;
+   }
+   }
 
return 0;
 }
@@ -800,20 +807,22 @@ static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset)
 static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
 {
struct ve_struct *ve = cgroup_ve(cg);
-   struct task_struct *tsk = current;
+   struct task_struct *task;
 
-   /* this probihibts ptracing of task entered to VE from host system */
-   if (ve->is_running && tsk->mm)
-   tsk->mm->vps_dumpable = VD_VE_ENTER_TASK;
+   cgroup_taskset_for_each(task, cg, tset) {
+   /* this probihibts ptracing of task entered to VE from host system */
+   if (ve->is_running && task->mm)
+   task->mm->vps_dumpable = VD_VE_ENTER_TASK;
 
-   /* Drop OOM protection. */
-   tsk->signal->oom_score_adj = 0;
-   tsk->signal->oom_score_adj_min = 0;
+   /* Drop OOM protection. */
+   task->signal->oom_score_adj = 0;
+   task->signal->oom_score_adj_min = 0;
 
-   /* Leave parent exec domain */
-   tsk->parent_exec_id--;
+   /* Leave parent exec domain */
+   task->parent_exec_id--;
 
-   tsk->task_ve = ve;
+   task->task_ve = ve;
+   }
 }
 
 static int ve_state_read(struct cgroup *cg, struct cftype *cft,


[Devel] [PATCH RHEL7 COMMIT] ploop: check new size of block device on ioctl(GROW)

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 0385f754e9f680c7d5095ae981fe29c1b6e7323a
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:26:55 2015 +0400

ploop: check new size of block device on ioctl(GROW)

Return an error if userspace attempts to grow the block device above the
limits imposed by the ploop1 formats.

https://jira.sw.ru/browse/PSBM-21027

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 drivers/block/ploop/fmt_ploop1.c   |  4 
 drivers/block/ploop/ploop1_image.h | 13 +
 2 files changed, 17 insertions(+)

diff --git a/drivers/block/ploop/fmt_ploop1.c b/drivers/block/ploop/fmt_ploop1.c
index 624bdc1..fb12c30 100644
--- a/drivers/block/ploop/fmt_ploop1.c
+++ b/drivers/block/ploop/fmt_ploop1.c
@@ -458,6 +458,10 @@ ploop1_prepare_grow(struct ploop_delta * delta, u64 *new_size, int *reloc)
if (*new_size & ((1 << delta->cluster_log) - 1))
return -EINVAL;
 
+   if (*new_size > ploop1_max_size(1 << delta->plo->cluster_log,
+   delta->plo->fmt_version))
+   return -EFBIG;
+
vh = (struct ploop_pvd_header *)page_address(ph->dyn_page);
n_present  = le32_to_cpu(vh->m_FirstBlockOffset) >> log;
BUG_ON (!n_present);
diff --git a/drivers/block/ploop/ploop1_image.h b/drivers/block/ploop/ploop1_image.h
index 337c05b..c4efe87 100644
--- a/drivers/block/ploop/ploop1_image.h
+++ b/drivers/block/ploop/ploop1_image.h
@@ -247,6 +247,19 @@ ploop1_version(struct ploop_pvd_header *vh)
return -1;
 }
 
+static inline __u64
+ploop1_max_size(__u32 blocksize, int version)
+{
+   switch (version) {
+   case PLOOP_FMT_V1:
+   return (__u32)-1;
+   case PLOOP_FMT_V2:
+   return 0xUL * blocksize;
+   }
+
+   return 0;
+}
+
 #ifdef __KERNEL__
 static inline u64
 get_SizeInSectors_from_le(struct ploop_pvd_header *vh, int version)


[Devel] [PATCH RHEL7 COMMIT] ploop: fix a race condition on relocation of blocks

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit a762247cf8ff0b2ec0ba6e8a9742f7a5e38a8b15
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:02 2015 +0400

ploop: fix a race condition on relocation of blocks

map_release() is not atomic, because it calls atomic_read
and then atomic_dec_and_test. It looks like it was designed to be
called under plo->lock.
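
To illustrate why a read followed by a decrement-and-test needs an outer
lock, here is a minimal user-space sketch. It is not the ploop code: the
struct, the `cached` field and the pthread mutex (standing in for plo->lock)
are assumptions made only for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted map node; not the ploop struct. */
struct demo_node {
	atomic_int refcnt;
	bool cached;	/* illustrative state that depends on refcnt */
};

/* Plays the role of plo->lock in this sketch. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Each atomic operation is atomic on its own, but the read followed by
 * the decrement is not: without the lock another thread could change
 * refcnt between the two steps. Returns true if the last reference
 * was dropped.
 */
static bool demo_release(struct demo_node *n)
{
	bool last;

	pthread_mutex_lock(&demo_lock);
	if (atomic_load(&n->refcnt) == 1)
		n->cached = false;	/* decision based on the value read */
	last = (atomic_fetch_sub(&n->refcnt, 1) == 1);
	pthread_mutex_unlock(&demo_lock);

	return last;
}
```

Taking the lock around both steps makes the check and the decrement one
unit with respect to other callers that take the same lock.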

https://jira.sw.ru/browse/PSBM-23905

Signed-off-by: Andrey Vagin ava...@openvz.org

Acked-by: Maxim Patlasov mpatla...@parallels.com
---
 drivers/block/ploop/dev.c | 6 ++
 drivers/block/ploop/map.c | 6 ++
 2 files changed, 12 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 353fb35..e3422d8 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1471,12 +1471,14 @@ static int prepare_merge_req(struct ploop_request * preq)
return res;
 
 drop_map:
+   spin_lock_irq(&plo->lock);
map_release(preq->trans_map);
preq->trans_map = NULL;
if (preq->map) {
map_release(preq->map);
preq->map = NULL;
}
+   spin_unlock_irq(&plo->lock);
return 1;
 }
 
@@ -1688,8 +1690,10 @@ ploop_entry_reloc_a_req(struct ploop_request *preq, iblock_t *iblk)
if (*clu >= MAP_MAX_IND(preq))
break;
 
+   spin_lock_irq(&plo->lock);
map_release(preq->map);
preq->map = NULL;
+   spin_unlock_irq(&plo->lock);
}
 
if (*clu >= plo->map.max_index) {
@@ -1814,8 +1818,10 @@ static int discard_get_index(struct ploop_request *preq)
preq->iblock = 0;
 
if (preq->map) {
+   spin_lock_irq(&plo->lock);
map_release(preq->map);
preq->map = NULL;
+   spin_unlock_irq(&plo->lock);
}
 
return 0;
diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 5f50f81..2e971cd 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -145,6 +145,10 @@ static void flush_lru_buffer(struct ploop_map * map)
map->lru_buffer_ptr = 0;
 }
 
+/*
+ * map_release() must be called under plo->lock, because
+ * The pair atomic_read & atomic_dec_and_test is not atomic.
+ */
 void map_release(struct map_node * m)
 {
struct ploop_map * map = m->parent;
@@ -1026,9 +1030,11 @@ static void map_wb_complete_post_process(struct ploop_map *map,
}
 
if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
+   spin_lock_irq(&plo->lock);
del_lockout(preq);
map_release(preq->map);
preq->map = NULL;
+   spin_unlock_irq(&plo->lock);
 
requeue_req(preq, PLOOP_E_RELOC_COMPLETE);
return;


[Devel] [PATCH RHEL7 COMMIT] ploop: prioritize BAT operations

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit d742aa564de94c3816a9d3a7991adb00d23678d4
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:04 2015 +0400

ploop: prioritize BAT operations

Ploop uses the ->read_page and ->write_page methods of pio_direct to read/write
the index table. These operations are rare and usually someone is blocked on
them. Let's give them priority by setting the SYNCIO flag.

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 drivers/block/ploop/io_direct.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index c18d2f0..e5eb66a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1432,7 +1432,7 @@ static void
 dio_read_page(struct ploop_io * io, struct ploop_request * preq,
  struct page * page, sector_t sec)
 {
-   dio_io_page(io, READ, preq, page, sec);
+   dio_io_page(io, READ | REQ_SYNC, preq, page, sec);
 }
 
 static void
@@ -1444,7 +1444,8 @@ dio_write_page(struct ploop_io * io, struct ploop_request * preq,
return;
}
 
-   dio_io_page(io, WRITE | (fua ? REQ_FUA : 0), preq, page, sec);
+   dio_io_page(io, WRITE | (fua ? REQ_FUA : 0) | REQ_SYNC,
+   preq, page, sec);
 }
 
 static int


[Devel] [PATCH RHEL7 COMMIT] ploop: make manual abort transition verbose

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 9a5fe498a7a1d9c1ecf4001c0766f325f1139079
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:09 2015 +0400

ploop: make manual abort transition verbose

Signed-off-by: Dmitry Monakhov dmonak...@openvz.org
---
 drivers/block/ploop/sysfs.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/block/ploop/sysfs.c b/drivers/block/ploop/sysfs.c
index 3ef53ac..07a4829 100644
--- a/drivers/block/ploop/sysfs.c
+++ b/drivers/block/ploop/sysfs.c
@@ -326,6 +326,9 @@ static u32 show_aborted(struct ploop_device * plo)
 
 static int store_aborted(struct ploop_device * plo, u32 val)
 {
+   printk(KERN_INFO "ploop: Force %s aborted state for ploop%d\n",
+  val ? "set" : "clear", plo->index);
+
if (val)
set_bit(PLOOP_S_ABORT, &plo->state);
else


[Devel] [PATCH RHEL7 COMMIT] ploop: warning on disk full condition

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit b6eb7575242d5e266d231ed53a4f7e03e47b2a68
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:10 2015 +0400

ploop: warning on disk full condition

People complain that it's not always obvious why an app in CT gets
-ENOSPC while there remains some space on host filesystem.

The patch adds a time-ratelimited printk about the disk full condition.
The maximal rate is one message per hour.
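
The idea of a time-based rate limit can be sketched in user space as
follows. This is only an illustration under stated assumptions: the helper
name, the "earliest next time" state and the use of plain seconds are
hypothetical and do not reproduce the kernel helper used by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space analogue of a timed rate limit: *next_allowed
 * holds the earliest time at which the next message may be emitted
 * (0 means "never emitted yet"). Times are plain seconds here.
 */
static bool demo_timed_ratelimit(uint64_t *next_allowed, uint64_t now,
				 uint64_t interval)
{
	if (*next_allowed != 0 && now < *next_allowed)
		return false;		/* still inside the quiet period */

	*next_allowed = now + interval;	/* open a new quiet period */
	return true;
}
```

With an interval of 3600 seconds a caller gets at most one "true" per hour,
which matches the rate described above.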

https://bugzilla.openvz.org/show_bug.cgi?id=3045

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 drivers/block/ploop/dev.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 9aaab4a..ab99724 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3533,8 +3533,18 @@ static int ploop_bd_full(struct backing_dev_info *bdi, long long nr, int root)
 
current->journal_info = NULL;
ret = sb->s_op->statfs(F_DENTRY(file), &buf);
-   if (ret || buf.f_bfree * buf.f_bsize < reserved + nr)
+   if (ret || buf.f_bfree * buf.f_bsize < reserved + nr) {
+   static unsigned long full_warn_time;
+
+   if (printk_timed_ratelimit(&full_warn_time, 60*60*HZ))
+   printk(KERN_WARNING
+  "ploop%d: host disk is almost full "
+  "(%llu < %llu); CT sees -ENOSPC !\n",
+  plo->index, buf.f_bfree * buf.f_bsize,
+  reserved + nr);
+
rc = 1;
+   }
 
fput(file);
current->journal_info = jctx;


[Devel] [PATCH RHEL7 COMMIT] ploop: fix busyloop on secondary discard bio

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit a5678140dd8f793272b5e562e81e27e2a249e4fd
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:11 2015 +0400

ploop: fix busyloop on secondary discard bio

After diff-ploop-add-a-separate-queue-for-discard-bio-s, ploop_thread()
skips processing previously queued discard bio-s if any discard bio is
already being processed (fbd->fbd_dbl is not empty).

ploop_wait() must take care of such a case, otherwise a busyloop may
happen: ploop_thread() believes that it has to go to sleep because all
incoming queues are empty except plo->bio_discard_list, which cannot be
processed right now, and calls ploop_wait(); the latter returns immediately
because plo->bio_discard_list is not empty and hence needs processing.
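
The wake-up condition described above can be modelled with a small
predicate. This is a simplified sketch, not the ploop_wait() code; the
struct and its field names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the queue state; field names are illustrative. */
struct demo_queues {
	bool bio_pending;		/* regular bios are queued */
	bool discard_pending;		/* discard bios are queued */
	bool discard_in_progress;	/* a discard is already being processed */
};

/*
 * Returns true if the worker has something it can actually process now.
 * Queued discard bios only count when no discard is in flight; otherwise
 * the worker would be woken with nothing to do and spin.
 */
static bool demo_has_runnable_work(const struct demo_queues *q)
{
	return q->bio_pending ||
	       (q->discard_pending && !q->discard_in_progress);
}
```

The sleeping side and the processing side must agree on this predicate;
the busyloop appears when the waiter counts work that the worker skips.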

The patch also fixes a trivial bug in discard bio accounting:

ploop_bio_queue() is called for all bio-s including discard bio-s and
it decrements bio_qlen unconditionally. This is incorrect: it has to
decrement either bio_qlen or discard_bio_qlen depending on the type of
bio.

https://jira.sw.ru/browse/PSBM-30451
https://bugzilla.openvz.org/show_bug.cgi?id=3124

Signed-off-by: Maxim Patlasov mpatla...@parallels.com

Acked-by: Andrew Vagin ava...@parallels.com
---
 drivers/block/ploop/dev.c  |  9 +++--
 drivers/block/ploop/freeblks.c | 12 
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e2ff0aa..ac0f28f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -551,7 +551,11 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * bio,
 
__TRACE("A %p %u\n", preq, preq->req_cluster);
 
-   plo->bio_qlen--;
+   if (unlikely(bio->bi_rw & REQ_DISCARD))
+   plo->bio_discard_qlen--;
+   else
+   plo->bio_qlen--;
+
ploop_entry_add(plo, preq);
 
if (bio->bi_size && !(bio->bi_rw & REQ_DISCARD))
@@ -2563,7 +2567,8 @@ static void ploop_wait(struct ploop_device * plo, int once, struct blk_plug *plu
 !plo->active_reqs))
break;
} else if (plo->bio_head ||
-   !bio_list_empty(&plo->bio_discard_list)) {
+   (!bio_list_empty(&plo->bio_discard_list) &&
+!ploop_discard_is_inprogress(plo->fbd))) {
/* ready_queue and entry_queue are empty, but
 * bio list not. Obviously, we'd like to process
 * bio_list instead of sleeping */
diff --git a/drivers/block/ploop/freeblks.c b/drivers/block/ploop/freeblks.c
index cf48d3a..89108c7 100644
--- a/drivers/block/ploop/freeblks.c
+++ b/drivers/block/ploop/freeblks.c
@@ -696,20 +696,24 @@ int ploop_fb_get_free_block(struct ploop_freeblks_desc *fbd,
 
 static void fbd_complete_bio(struct ploop_freeblks_desc *fbd, int err)
 {
+   struct ploop_device *plo = fbd->plo;
unsigned int nr_completed = 0;
 
while (fbd->fbd_dbl.head) {
struct bio * bio = fbd->fbd_dbl.head;
fbd->fbd_dbl.head = bio->bi_next;
bio->bi_next = NULL;
-   BIO_ENDIO(fbd->plo->queue, bio, err);
+   BIO_ENDIO(plo->queue, bio, err);
nr_completed++;
}
fbd->fbd_dbl.tail = NULL;
 
-   spin_lock_irq(&fbd->plo->lock);
-   fbd->plo->bio_total -= nr_completed;
-   spin_unlock_irq(&fbd->plo->lock);
+   spin_lock_irq(&plo->lock);
+   plo->bio_total -= nr_completed;
+   if (!bio_list_empty(&plo->bio_discard_list) &&
+   waitqueue_active(&plo->waitq))
+   wake_up_interruptible(&plo->waitq);
+   spin_unlock_irq(&plo->lock);
 }
 
 void ploop_fb_reinit(struct ploop_freeblks_desc *fbd, int err)


[Devel] [PATCH RHEL7 COMMIT] ploop: mark reloc reqs to force FUA before write of relocated data

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 622b02378d190968a9ad04f5e8161a1574a1d2df
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:15 2015 +0400

ploop: mark reloc reqs to force FUA before write of relocated data

Series description:

During relocation of ploop clusters (resize/balloon) we need to FUA/fsync
the image file after these operations:
 a) a new data block is written
 b) BAT update
 c) nullifying the old data block for BAT grow. We already nullify the old
    data block in the format module, in the complete_grow callback.

This patch forces fsync (kaio) or FUA (direct) of reloc write I/O to the image
by marking such reloc reqs (A|S) with appropriate flags. The kaio/direct modules
are tuned by the patch to force fsync/FUA if these flags are set. This code does
FUA/fsync only for cases a) and b), while c) is already implemented.

The patch also fixes inconsistent bio list FUA processing in the direct module.
The problem is that for a bunch of bios we only set FUA on the last bio. In case
of a power outage it is possible that the last bio is stored and the previous
ones are not, because they were only in cache at the time of the power failure.
To solve the problem, this patch marks the last bio as FLUSH|FUA if there is
more than one bio in the list.

Moreover, for KAIO, if fsync is possible at the BAT update stage we do that, as
we did in the direct case, instead of 2 fsyncs. For the direct case, if we are
going to make FUA at BAT update only (an optimization trick that already exists),
then we need to mark the req to FLUSH previously written (without FUA) data.

Performance:
Overall (including EXT4 resize up to 16T), resize time degraded by 5%.

https://jira.sw.ru/browse/PSBM-31222
https://jira.sw.ru/browse/PSBM-31225
https://jira.sw.ru/browse/PSBM-31321

Signed-off-by: Andrey Smetanin asmeta...@parallels.com

Andrey Smetanin (7):
  ploop: define struct ploop_request-state flags to force pre FLUSH
before write IO and FUA/fsync at I/O complete
  ploop: mark reloc reqs to force FUA/fsync(kaio) for index update I/O
  ploop: mark reloc reqs to force FUA before write of relocated data
  ploop: direct: to support truly FLUSH/FUA of req we need mark first
bio FLUSH, write all bios and mark last bio as FLUSH/FUA
  ploop: added ploop_req_delay_fua_possible() func that detects possible
delaying of upcoming FUA to index update stage. This function will
be lately used in direct/kaio code to detect and delay FUA
  ploop: make image fsync at I/O complete if it's required by FUA/fsync
force flag or by req-req_rw
  ploop: do preflush or postfua according force FUA/flush flags, and
delay FUA if possible but add force FLUSH to req if so

This patch description:
Need to force FUA/fsync of relocated data write for consistent resize.

Signed-off-by: Andrey Smetanin asmeta...@parallels.com

Reviewed-by: Andrew Vagin ava...@parallels.com
---
 drivers/block/ploop/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index ac0f28f..bd5fe37 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -2434,6 +2434,9 @@ restart:
top_delta = ploop_top_delta(plo);
sbl.head = sbl.tail = preq->aux_bio;
 
+   /* Relocated data write required sync before BAT updatee */
+   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
+
if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
preq->eng_state = PLOOP_E_DATA_WBI;
plo->st.bio_out++;


[Devel] [PATCH RHEL7 COMMIT] ms/memcg/proc: add kpagecgroup file

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 3e48c113a59f801934292fc89e6915b1d8a341a7
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Tue May 19 08:23:48 2015 +0400

ms/memcg/proc: add kpagecgroup file

Patchset description: idle memory tracking

This patch set backports

  https://lkml.org/lkml/2015/5/12/449

which is required by vcmmd.

It is not yet clear if the original patch set will be accepted upstream
as is, there still may be changes. However, I hope the user API will be
preserved. If it is not, we will have to fix this in our kernel too.

https://jira.sw.ru/browse/PSBM-32460

Vladimir Davydov (3):
  memcg: add page_cgroup_ino helper
  proc: add kpagecgroup file
  proc: add kpageidle file

===
This patch description:

/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup
each page is charged to, indexed by PFN. Having this information is
useful for estimating a cgroup working set size.

The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
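
As an illustration of the "indexed by PFN" layout, a user-space reader
could look roughly like the sketch below. It assumes one 64-bit entry per
PFN, as described above; the helper names are hypothetical, and opening
/proc/kpagecgroup (which is expected to require privileges) is left to the
caller.

```c
#define _XOPEN_SOURCE 700	/* for pread() */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Byte offset of the entry for a given PFN: one 64-bit value per page. */
static off_t demo_kpagecgroup_offset(uint64_t pfn)
{
	return (off_t)(pfn * sizeof(uint64_t));
}

/*
 * Read the memory cgroup inode number for one PFN from an already opened
 * file descriptor. Returns 0 on success and -1 on a failed or short read.
 */
static int demo_read_page_cgroup_ino(int fd, uint64_t pfn, uint64_t *ino)
{
	ssize_t n = pread(fd, ino, sizeof(*ino), demo_kpagecgroup_offset(pfn));

	return n == (ssize_t)sizeof(*ino) ? 0 : -1;
}
```

Summing pages per inode number over a range of PFNs then gives a rough
per-cgroup page count.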

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 Documentation/vm/pagemap.txt |  6 -
 fs/proc/Kconfig  |  5 +++--
 fs/proc/page.c   | 53 
 3 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index fd7c3cf..e37cff9 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
 userspace programs to examine the page tables and related information by
 reading files in /proc.
 
-There are three components to pagemap:
+There are four components to pagemap:
 
  * /proc/pid/pagemap.  This file lets a userspace process find out which
physical frame each virtual page is mapped to.  It contains one 64-bit
@@ -63,6 +63,10 @@ There are three components to pagemap:
 21. KSM
 22. THP
 
+ * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
+   memory cgroup each page is charged to, indexed by PFN. Only available when
+   CONFIG_MEMCG is set.
+
 Short descriptions to the page flags:
 
  0. LOCKED
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 15af622..e8ed22d 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -65,5 +65,6 @@ config PROC_PAGE_MONITOR
help
  Various /proc files exist to monitor process memory utilization:
  /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
- /proc/kpagecount, and /proc/kpageflags. Disabling these
-  interfaces will reduce the size of the kernel by approximately 4kb.
+ /proc/kpagecount, /proc/kpageflags, and /proc/kpagecgroup.
+ Disabling these interfaces will reduce the size of the kernel
+ by approximately 4kb.
diff --git a/fs/proc/page.c b/fs/proc/page.c
index cab84b6..c9cbed3 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -8,6 +8,7 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/hugetlb.h>
+#include <linux/memcontrol.h>
 #include <linux/kernel-page-flags.h>
 #include <asm/uaccess.h>
 #include "internal.h"
@@ -213,10 +214,62 @@ static const struct file_operations 
proc_kpageflags_operations = {
.read = kpageflags_read,
 };
 
+#ifdef CONFIG_MEMCG
+static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
+   size_t count, loff_t *ppos)
+{
+   u64 __user *out = (u64 __user *)buf;
+   struct page *ppage;
+   unsigned long src = *ppos;
+   unsigned long pfn;
+   ssize_t ret = 0;
+   u64 ino;
+
+   pfn = src / KPMSIZE;
+   count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+   if (src & KPMMASK || count & KPMMASK)
+   return -EINVAL;
+
+   while (count > 0) {
+   if (pfn_valid(pfn))
+   ppage = pfn_to_page(pfn);
+   else
+   ppage = NULL;
+
+   if (ppage)
+   ino = page_cgroup_ino(ppage);
+   else
+   ino = 0;
+
+   if (put_user(ino, out)) {
+   ret = -EFAULT;
+   break;
+   }
+
+   pfn++;
+   out++;
+   count -= KPMSIZE;
+   }
+
+   *ppos += (char __user *)out - buf;
+   if (!ret)
+   ret = (char __user *)out - buf;
+   return ret;
+}
+
+static const struct file_operations proc_kpagecgroup_operations = {
+   .llseek = mem_lseek,
+   .read = kpagecgroup_read,
+};
+#endif /* CONFIG_MEMCG */
+
 static int __init proc_page_init(void)
 {
proc_create("kpagecount", S_IRUSR, 

[Devel] [PATCH RHEL7 COMMIT] ploop: prevent dangerous ploop-umount

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 8854414d2d97abd7ab86d4c9d1c74d9b2fc04c3c
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:26:56 2015 +0400

ploop: prevent dangerous ploop-umount

Unmounting a ploop device while an inner fs is still mounted on it leads to
numerous complaints in the kernel log, such as:

VFS: Busy inodes after unmount. sb = 880108987000, fs type = ext4, sb count = 2, sb->s_root = /

and is not what the user expected. The patch adds some protection against
simple userspace mistakes: do not allow stopping a ploop device (this is the
first step of ploop-umount) if the user passes /dev/ploopNp1 to the ioctl, or
if someone (the inner fs) is still using the device.
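
The two conditions can be sketched in userspace as follows. This is only a model: struct bdev_model is a hypothetical stand-in for the kernel's struct block_device, and stop_allowed() is not a function from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct block_device. */
struct bdev_model {
    struct bdev_model *contains; /* whole-disk device; itself for the disk */
    int holders;                 /* number of current holders */
};

/* Mirror the two checks the patch adds to ploop_stop(). */
static int stop_allowed(const struct bdev_model *bdev)
{
    assert(bdev != NULL && bdev->contains != NULL);

    if (bdev != bdev->contains)
        return -ENODEV;  /* ioctl came through a partition such as ploopNp1 */
    if (bdev->contains->holders)
        return -EBUSY;   /* someone (the inner fs) still holds the device */
    return 0;
}
```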

https://jira.sw.ru/browse/PSBM-21474

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 drivers/block/ploop/dev.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 5a3a5ec..2f4928d 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3548,6 +3548,20 @@ static int ploop_stop(struct ploop_device * plo, struct 
block_device *bdev)
struct ploop_delta * delta;
int cnt;
 
+   if (bdev != bdev->bd_contains) {
+   if (printk_ratelimit())
+   printk(KERN_INFO "stop ploop%d failed (wrong bdev)\n",
+  plo->index);
+   return -ENODEV;
+   }
+
+   if (bdev->bd_contains->bd_holders) {
+   if (printk_ratelimit())
+   printk(KERN_INFO "stop ploop%d failed (holders=%d)\n",
+  plo->index, bdev->bd_contains->bd_holders);
+   return -EBUSY;
+   }
+
if (!test_bit(PLOOP_S_RUNNING, &plo->state))
return -EINVAL;
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] ploop: fix iblk-to-sector calculations

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 96f009d1061c9e1ec9b6c7699eef565bcd44f26a
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:26:59 2015 +0400

ploop: fix iblk-to-sector calculations

iblk stands for image-file block number. Its size is the same as u32, while
the size of 'sector' is the same as long. When converting the former to the
latter like this: sec = iblk << shift, we must always cast 'iblk' to the wider
type first; otherwise the shift is evaluated in 32 bits and the high bits are
lost. We actually do so in most cases. The patch fixes a place in the
io_direct module where the cast was forgotten.
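
A small userspace sketch of the problem, assuming a platform where unsigned int is 32 bits wide. The function names are hypothetical; uint64_t plays the role of sector_t here.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy form: the shift is evaluated in 32 bits and only then widened. */
static uint64_t iblk_to_sector_unsafe(uint32_t iblk, unsigned int shift)
{
    assert(shift < 32);
    return iblk << shift;
}

/* Fixed form: widen first, then shift. */
static uint64_t iblk_to_sector(uint32_t iblk, unsigned int shift)
{
    assert(shift < 64);
    return (uint64_t)iblk << shift;
}
```

With a shift of 11 (an example value, not taken from the patch), block number 0x400000 should map to sector 0x200000000. The unsafe form wraps around to 0, while the fixed form keeps the high bits.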

https://jira.sw.ru/browse/PSBM-22961

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 drivers/block/ploop/io_direct.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index ab74849..56b9f37 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -119,8 +119,8 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
goto out_em_err;
 
if (write && em->block_start == BLOCK_UNINIT) {
-   sector_t end = (iblk + 1) << preq->plo->cluster_log;
-   sec = iblk << preq->plo->cluster_log;
+   sector_t end = (sector_t)(iblk + 1) << preq->plo->cluster_log;
+   sec = (sector_t)iblk << preq->plo->cluster_log;
 
if (em->start <= sec)
sec = em->end;


[Devel] [PATCH RHEL7 COMMIT] ploop: reverse order of fdatawait and fsync fop

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 0ac13b3ba07b42573f151c21d9727a2cbcd415d1
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:00 2015 +0400

ploop: reverse order of fdatawait and fsync fop

dio_fsync_thread must call filemap_fdatawait() before file->f_op->fsync().
Otherwise:

  8,06   82 0.003095587 12328  D  WS 441706496 + 512 [ploop19054]
  8,06   83 0.003103726 12328  D  WS 441707008 + 512 [ploop19054]
  8,06   84 0.003108627 12328  D  WS 441707520 + 512 [ploop19054]
  8,06   85 0.003113176 12328  D  WS 441708032 + 512 [ploop19054]
  ...
  8,06  102 0.003149386  1299  D  WS 3950526248 + 24 [jbd2/dm-1-8]
  ...
  8,06  103 0.003305550 0  C  WS 441706496 + 512 [0]
  8,06  104 0.003458057 0  C  WS 441707008 + 512 [0]
  8,06  105 0.003608325 0  C  WS 441707520 + 512 [0]
  8,06  106 0.003758297 0  C  WS 441708032 + 512 [0]
  8,06  107 0.003794543 0  C  WS 3950526248 + 24 [0]

If the node crashes (or reboots) after the last dispatch, the journal data
may reach the disk while the user bulk data does not. The result would be
ploop image corruption.

The patch re-arranges the sequence of calls to make it safe and natural (the
same way as in vfs_fsync_range()).

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 drivers/block/ploop/io_direct.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index babc940..c18d2f0 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -735,14 +735,13 @@ static int dio_fsync_thread(void * data)
spin_unlock_irq(&plo->lock);
 
/* filemap_fdatawrite() has been made already */
+   filemap_fdatawait(io->files.mapping);
 
err = 0;
if (io->files.file->f_op->fsync)
err = io->files.file->f_op->FOP_FSYNC(io->files.file,
  0);
 
-   filemap_fdatawait(io->files.mapping);
-
/* Do we need to invalidate page cache? Not really,
 * because we use it only to create full new pages,
 * which we overwrite completely. Probably, we should


[Devel] [PATCH RHEL7 COMMIT] ploop: support 4K block-size of host block-device

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 5e02bd5942dd5cfd66f5b4096e966ae9b134b5ea
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:01 2015 +0400

ploop: support 4K block-size of host block-device

Avoid 512-byte reads/writes. They were used by the 'expanded' format module
to read and save the format header. Let's use 4K reads/writes instead.

Customer's problem:

 [root@pcstest10 ~]# ploop mount /vz3/test.hdd
 add delta dev=/dev/ploop19025 img=/vz3/test.hdd (rw)
 Can't add image /vz3/test.hdd: Input/output error
 [root@pcstest10 ~]#

 Right after trying to mount the image the kernel throws the following:

 [1564044.775584] sd 13:0:0:0: [sde] Bad block number requested

 The block size of this device is not 512 as for other direct attached
 disks. It is 4096 and the device is an iSCSI target.
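
 The failure comes down to alignment: a 512-byte request is not a multiple of
 a 4096-byte logical block size, while a 4096-byte request is aligned for both
 512-byte and 4096-byte devices. A minimal sketch of that check, with a
 hypothetical helper name (userspace can query the logical block size of a
 device with the BLKSSZGET ioctl):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return nonzero if a request of len bytes at byte offset off is aligned
 * to the logical block size blksz, which must be a power of two.
 */
static int io_is_block_aligned(size_t off, size_t len, size_t blksz)
{
    assert(blksz != 0 && (blksz & (blksz - 1)) == 0);
    return ((off | len) & (blksz - 1)) == 0;
}
```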

https://jira.sw.ru/browse/PSBM-21989

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 drivers/block/ploop/fmt_ploop1.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/block/ploop/fmt_ploop1.c b/drivers/block/ploop/fmt_ploop1.c
index fb12c30..5ce6915 100644
--- a/drivers/block/ploop/fmt_ploop1.c
+++ b/drivers/block/ploop/fmt_ploop1.c
@@ -78,7 +78,7 @@ static int ploop1_stop(struct ploop_delta * delta)
 
vh = (struct ploop_pvd_header *)page_address(ph->dyn_page);
 
-   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 512, 0, 0);
+   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 4096, 0, 0);
if (err)
return err;
 
@@ -90,7 +90,7 @@ static int ploop1_stop(struct ploop_delta * delta)
 
vh->m_DiskInUse = 0;
 
-   err = delta->io.ops->sync_write(&delta->io, ph->dyn_page, 512, 0, 0);
+   err = delta->io.ops->sync_write(&delta->io, ph->dyn_page, 4096, 0, 0);
if (err)
return err;
 
@@ -128,7 +128,7 @@ ploop1_open(struct ploop_delta * delta)
goto out_err;
 
/* IO engine is ready. */
-   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 512, 0, 0);
+   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 4096, 0, 0);
if (err)
goto out_err;
 
@@ -168,7 +168,7 @@ ploop1_open(struct ploop_delta * delta)
 
if (!(delta->flags & PLOOP_FMT_RDONLY)) {
vh->m_DiskInUse = cpu_to_le32(SIGNATURE_DISK_IN_USE);
-   err = delta->io.ops->sync_write(&delta->io, ph->dyn_page, 512, 0, 0);
+   err = delta->io.ops->sync_write(&delta->io, ph->dyn_page, 4096, 0, 0);
if (err)
goto out_err;
}
@@ -198,7 +198,7 @@ ploop1_refresh(struct ploop_delta * delta)
 
vh = (struct ploop_pvd_header *)page_address(ph->dyn_page);
 
-   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 512, 0, 0);
+   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 4096, 0, 0);
if (err)
return err;
 
@@ -266,7 +266,7 @@ ploop1_sync(struct ploop_delta * delta)
if (err)
return err;
 
-   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 512, 0, 0);
+   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 4096, 0, 0);
if (err)
return err;
 
@@ -279,7 +279,7 @@ ploop1_sync(struct ploop_delta * delta)
vh->m_Flags = cpu_to_le32(vh->m_Flags);
}
 
-   err = delta->io.ops->sync_write(&delta->io, ph->dyn_page, 512, 0, 0);
+   err = delta->io.ops->sync_write(&delta->io, ph->dyn_page, 4096, 0, 0);
if (err)
return err;
 
@@ -312,7 +312,7 @@ ploop1_complete_snapshot(struct ploop_delta * delta, struct 
ploop_snapdata * sd)
if (err)
goto out;
 
-   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 512, 0, 0);
+   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 4096, 0, 0);
if (err)
goto out;
 
@@ -335,7 +335,7 @@ ploop1_complete_snapshot(struct ploop_delta * delta, struct 
ploop_snapdata * sd)
 * remain valid.
 */
 
-   err = delta->io.ops->sync_write(&delta->io, ph->dyn_page, 512, 0, 0);
+   err = delta->io.ops->sync_write(&delta->io, ph->dyn_page, 4096, 0, 0);
if (err)
goto out;
 
@@ -367,7 +367,7 @@ ploop1_prepare_merge(struct ploop_delta * delta, struct 
ploop_snapdata * sd)
 
vh = (struct ploop_pvd_header *)page_address(ph->dyn_page);
 
-   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 512, 0, 0);
+   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 4096, 0, 0);
if (err)
return err;
 
@@ -403,7 +403,7 @@ ploop1_start_merge(struct ploop_delta * delta, struct 
ploop_snapdata * sd)
return -EIO;
}
 
-   err = delta->io.ops->sync_read(&delta->io, ph->dyn_page, 512, 0, 0);
+   err 

[Devel] [PATCH RHEL7 COMMIT] ploop: bug on bad fiemap (v2)

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit e3b634ed036e618d74643faaa478dc3951c2f781
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:05 2015 +0400

ploop: bug on bad fiemap (v2)

Based on crash analysis, one of the extents in the ploop em-tree is bad:

883fe6230ae0
  start = 19380224
  end = 19447808
  block_start = 0
  refs = {
counter = 1
  }

ploop never calculates em->block_start other than by direct assignment:

 em->block_start = fi_extent.fe_physical >> 9;

The patch attempts to catch erroneous (zero) output immediately after the
fiemap call.

Changed in v2:
 - WARN_ON (instead of BUG_ON) for delalloc extents
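
The decision made after the fiemap call can be modelled in userspace roughly as follows. The flag values mirror FIEMAP_EXTENT_DELALLOC and FIEMAP_EXTENT_UNWRITTEN from <linux/fiemap.h>; the enum and the check_extent() function are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirror FIEMAP_EXTENT_* from <linux/fiemap.h>. */
#define EXT_DELALLOC  0x00000004u
#define EXT_UNWRITTEN 0x00000800u

enum fiemap_verdict { FIEMAP_OK, FIEMAP_DELALLOC, FIEMAP_BAD };

/*
 * Classify the result of a single-extent fiemap call. A written extent
 * whose physical offset converts to sector 0 is suspicious: it is
 * reported as FIEMAP_DELALLOC (the WARN case) if it is a delalloc
 * extent and as FIEMAP_BAD (the BUG case) otherwise.
 */
static enum fiemap_verdict check_extent(int ret, unsigned int mapped,
                                        uint32_t flags, uint64_t physical)
{
    if (ret || mapped != 1)
        return FIEMAP_OK;   /* nothing to check */
    if (flags & EXT_UNWRITTEN)
        return FIEMAP_OK;
    if ((physical >> 9) != 0)
        return FIEMAP_OK;   /* block_start would be nonzero */
    return (flags & EXT_DELALLOC) ? FIEMAP_DELALLOC : FIEMAP_BAD;
}
```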

https://jira.sw.ru/browse/PSBM-26762

Signed-off-by: Maxim Patlasov mpatla...@parallels.com
---
 drivers/block/ploop/io_direct_map.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ploop/io_direct_map.c 
b/drivers/block/ploop/io_direct_map.c
index b3cb04d..b9a0ce9 100644
--- a/drivers/block/ploop/io_direct_map.c
+++ b/drivers/block/ploop/io_direct_map.c
@@ -641,6 +641,7 @@ static struct extent_map *__map_extent_bmap(struct ploop_io 
*io,
 {
struct extent_map_tree *tree = io->files.em_tree;
struct inode *inode = mapping->host;
+   loff_t start_off = (loff_t)start << 9;
struct extent_map *em;
struct fiemap_extent_info fieinfo;
struct fiemap_extent fi_extent;
@@ -681,6 +682,25 @@ again:
old_fs = get_fs();
set_fs(KERNEL_DS);
ret = inode->i_op->fiemap(inode, &fieinfo, start << 9, 1);
+
+   /* chase for PSBM-26762: em->block_start == 0 */
+   if (!ret && fieinfo.fi_extents_mapped == 1 &&
+   !(fi_extent.fe_flags & FIEMAP_EXTENT_UNWRITTEN) &&
+   (fi_extent.fe_physical >> 9) == 0) {
+   /* see how ext4_fill_fiemap_extents() implemented */
+   if (!(fi_extent.fe_flags & FIEMAP_EXTENT_DELALLOC)) {
+   printk("bad fiemap(%ld,%ld) on inode=%p fieinfo=%p"
+" i_size=%lld\n", start, len, inode, &fieinfo,
+   i_size_read(inode));
+   BUG();
+   }
+   /* complain about delalloc case -- ploop always fallocate
+   * before buffered write */
+   WARN(1, "ploop%d: delalloc extent [%lld,%lld] for [%lld,%ld];"
+" i_size=%lld\n", io->plo->index, fi_extent.fe_logical,
+   fi_extent.fe_length, start_off, len << 9, i_size_read(inode));
+   ret = -ENOENT;
+   }
set_fs(old_fs);
 
if (ret) {
@@ -808,9 +828,10 @@ void trim_extent_mappings(struct extent_map_tree *tree, 
sector_t start)
 
while ((em = lookup_extent_mapping(tree, start, ((sector_t)(-1ULL)) - 
start))) {
remove_extent_mapping(tree, em);
+   WARN_ON(atomic_read(&em->refs) != 2);
/* once for us */
extent_put(em);
-   /* _XXX_ This cannot be correct in the case of concurrent 
lookups */
+   /* No concurrent lookups due to ploop_quiesce(). See WARN_ON 
above */
/* once for the tree */
extent_put(em);
}


[Devel] [PATCH RHEL7 COMMIT] ploop: put top-delta back if merge failed

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit ee2968cd8321728956effea19e98959befec32d0
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:07 2015 +0400

ploop: put top-delta back if merge failed

Before the merge, we move the top delta to a temporary plo->trans_map list.
From then on, it is no longer present in the main plo->map list. If the merge
fails, we must put it back into plo->map. Otherwise the delta is lost forever
(visible in /sys/block/ploop*/pdelta/*, but not accessible from ploop).

https://jira.sw.ru/browse/PSBM-25252

Signed-off-by: Maxim Patlasov mpatla...@parallels.com

Acked-by: Pavel Emelyanov xe...@parallels.com
---
 drivers/block/ploop/dev.c | 49 +++
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index d2a9eb4..2e6302f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -3280,6 +3280,26 @@ static void ploop_update_fmt_version(struct ploop_device 
* plo)
}
 }
 
+static void ploop_merge_cleanup(struct ploop_device * plo,
+   struct ploop_map * map,
+   struct ploop_delta * delta, int err)
+{
+   ploop_quiesce(plo);
+   mutex_lock(&plo->sysfs_mutex);
+   list_del(&delta->list);
+
+   if (err)
+   list_add(&delta->list, &plo->map.delta_list);
+   else
+   ploop_update_fmt_version(plo);
+
+   plo->trans_map = NULL;
+   plo->maintenance_type = PLOOP_MNTN_OFF;
+   mutex_unlock(&plo->sysfs_mutex);
+   ploop_map_destroy(map);
+   ploop_relax(plo);
+}
+
 static int ploop_merge(struct ploop_device * plo)
 {
int err;
@@ -3368,32 +3388,19 @@ already:
if (test_bit(PLOOP_S_ABORT, &plo->state)) {
printk(KERN_WARNING "merge for ploop%d failed (state ABORT)\n",
   plo->index);
-   plo->trans_map = NULL;
-   plo->maintenance_type = PLOOP_MNTN_OFF;
err = -EIO;
-   goto out;
}
 
-   ploop_quiesce(plo);
-   mutex_lock(&plo->sysfs_mutex);
-   plo->trans_map = NULL;
-   plo->maintenance_type = PLOOP_MNTN_OFF;
-   list_del(&delta->list);
-   ploop_update_fmt_version(plo);
-   mutex_unlock(&plo->sysfs_mutex);
-   ploop_map_destroy(map);
-   ploop_relax(plo);
+   ploop_merge_cleanup(plo, map, delta, err);
 
-   kfree(map);
-
-   kobject_del(&delta->kobj);
-   kobject_put(&plo->kobj);
-
-   delta->ops->stop(delta);
-   delta->ops->destroy(delta);
-   kobject_put(&delta->kobj);
-   return 0;
+   if (!err) {
+   kobject_del(&delta->kobj);
+   kobject_put(&plo->kobj);
 
+   delta->ops->stop(delta);
+   delta->ops->destroy(delta);
+   kobject_put(&delta->kobj);
+   }
 out:
kfree(map);
return err;


[Devel] [PATCH RHEL7 COMMIT] ploop: added ploop_req_delay_fua_possible() func that detects possible delaying of upcoming FUA to index update stage

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit abc95cfc45bd725c7aba2f7697e322413ae5725a
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:12 2015 +0400

ploop: added ploop_req_delay_fua_possible() func that detects possible 
delaying of upcoming FUA to index update stage

During relocation of ploop clusters (resize/balloon) we need to FUA/fsync
the image file after the following operations:
 a) a new data block is written
 b) the BAT is updated
 c) the old data block is nullified for BAT grow. We already do this
nullification of the old data block in the format module, in the
complete_grow callback.

This patch forces fsync (kaio) or FUA (direct) of reloc write I/O to the image
by marking such reloc reqs (A|S) with appropriate flags. The kaio/direct
modules are tuned by the patch to force fsync/FUA if these flags are set. This
code does FUA/fsync only for cases a) and b), while c) is already implemented.

The patch also fixes inconsistent FUA processing of a bio list in the direct
module. The problem is that for a bunch of bios we only set FUA on the last
bio. In case of a power outage it is possible that the last bio is stored
while the previous ones are not, because they are held only in the cache at
the time of the power failure. To solve the problem, this patch marks the last
bio as FLUSH|FUA if there is more than one bio in the list.

Moreover, for KAIO, if fsync is possible at the BAT update stage, we do it
there, as we did in the direct case, instead of doing 2 fsyncs. For the direct
case, if we are going to make FUA at BAT update only (an optimization trick
that already exists), then we need to mark the req to FLUSH previously written
(without FUA) data.

Performance:
Overall resize performance (including EXT4 resize up to 16T) degraded by 5% in
terms of time.

https://jira.sw.ru/browse/PSBM-31222
https://jira.sw.ru/browse/PSBM-31225
https://jira.sw.ru/browse/PSBM-31321

Signed-off-by: Andrey Smetanin asmeta...@parallels.com

Andrey Smetanin (7):
  ploop: define struct ploop_request-state flags to force pre FLUSH
before write IO and FUA/fsync at I/O complete
  ploop: mark reloc reqs to force FUA/fsync(kaio) for index update I/O
  ploop: mark reloc reqs to force FUA before write of relocated data
  ploop: direct: to support truly FLUSH/FUA of req we need mark first
bio FLUSH, write all bios and mark last bio as FLUSH/FUA
  ploop: added ploop_req_delay_fua_possible() func that detects possible
delaying of upcoming FUA to index update stage. This function will
be lately used in direct/kaio code to detect and delay FUA
  ploop: make image fsync at I/O complete if it's required by FUA/fsync
force flag or by req-req_rw
  ploop: do preflush or postfua according force FUA/flush flags, and
delay FUA if possible but add force FLUSH to req if so

This patch description:

This function will later be used in the direct/kaio code to detect and delay
FUA.
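
The logic of the helper can be modelled in userspace as below. MODEL_REQ_FUA and the enum values are hypothetical placeholders; the kernel's REQ_FUA bit and the PLOOP_E_* states have their own definitions.

```c
#include <assert.h>

/* Hypothetical values; the kernel's REQ_FUA and PLOOP_E_* differ. */
#define MODEL_REQ_FUA (1UL << 0)
enum model_eng_state { MODEL_E_ENTRY, MODEL_E_COMPLETE };

/*
 * FUA can be postponed to the index update stage only if the request
 * asks for FUA and has not yet reached the COMPLETE state.
 */
static int delay_fua_possible(unsigned long rw, enum model_eng_state state)
{
    return (rw & MODEL_REQ_FUA) && state != MODEL_E_COMPLETE;
}
```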

https://jira.sw.ru/browse/PSBM-31222
https://jira.sw.ru/browse/PSBM-31225
https://jira.sw.ru/browse/PSBM-31321

Signed-off-by: Andrey Smetanin asmeta...@parallels.com

Reviewed-by: Andrew Vagin ava...@parallels.com
---
 include/linux/ploop/ploop.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index eacd36a..d8b83a6 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -577,6 +577,23 @@ void ploop_fail_request(struct ploop_request * preq, int 
err);
 void ploop_preq_drop(struct ploop_device * plo, struct list_head *drop_list,
  int keep_locked);
 
+
+static inline int ploop_req_delay_fua_possible(unsigned long rw,
+   struct ploop_request *preq)
+{
+   int delay_fua = 0;
+
+   /* In case of eng_state != COMPLETE, we'll do FUA in
+* ploop_index_update(). Otherwise, we should post
+* fua.
+*/
+   if (rw & REQ_FUA) {
+   if (preq->eng_state != PLOOP_E_COMPLETE)
+   delay_fua = 1;
+   }
+   return delay_fua;
+}
+
 static inline void ploop_set_error(struct ploop_request * preq, int err)
 {
if (!preq->error) {


[Devel] [PATCH RHEL7 COMMIT] ploop: force FUA of nullified blocks for BAT grow

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 051a2a154c0e040d7d15ab9a3b56b77d9de021b3
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:16 2015 +0400

ploop: force FUA of nullified blocks for BAT grow

Previously we assumed that nullified blocks were synced in the format
driver by the image fsync done before the header update that grows the
BAT size. But we write this data directly to the underlying device,
bypassing EXT4, by using the extent map tree (see dio_submit()). So an
fsync of the EXT4 image doesn't help us. We need to force a sync of the
nullified blocks. This patch does so by marking the preq with the
PLOOP_REQ_FORCE_FUA flag.

https://jira.sw.ru/browse/PSBM-31969

Signed-off-by: Andrey Smetanin asmeta...@parallels.com

Acked-by: Andrew Vagin ava...@parallels.com
---
 drivers/block/ploop/map.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/block/ploop/map.c b/drivers/block/ploop/map.c
index 67e2852..8ea67e9 100644
--- a/drivers/block/ploop/map.c
+++ b/drivers/block/ploop/map.c
@@ -1056,10 +1056,14 @@ static void map_wb_complete_post_process(struct 
ploop_map *map,
   0, PAGE_SIZE);
 
/*
-* FUA of this data occures at format driver ->complete_grow() by
-* all image sync. After that header size increased to use this
-* cluster as BAT cluster.
+* Lately we think we does sync of nullified blocks at format
+* driver by image fsync before header update.
+* But we write this data directly into underlying device
+* bypassing EXT4 by usage of extent map tree
+* (see dio_submit()). So fsync of EXT4 image doesnt help us.
+* We need to force sync of nullified blocks.
 */
+   set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
top_delta->io.ops->submit(&top_delta->io, preq, preq->req_rw,
  &sbl, preq->iblock, 1<<plo->cluster_log);
 }


[Devel] [PATCH RHEL7 COMMIT] ms/memcg: add page_cgroup_ino helper

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 77c59afe2b55a1dd631c3b8a6d3763eff8d09941
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Tue May 19 08:23:31 2015 +0400

ms/memcg: add page_cgroup_ino helper

Patchset description: idle memory tracking

This patch set backports

  https://lkml.org/lkml/2015/5/12/449

which is required by vcmmd.

It is not yet clear if the original patch set will be accepted upstream
as is, there still may be changes. However, I hope the user API will be
preserved. If it is not, we will have to fix this in our kernel too.

https://jira.sw.ru/browse/PSBM-32460

Vladimir Davydov (3):
  memcg: add page_cgroup_ino helper
  proc: add kpagecgroup file
  proc: add kpageidle file

===
This patch description:

Hwpoison allows filtering pages by memory cgroup ino. To achieve that, it
calls try_get_mem_cgroup_from_page(), then mem_cgroup_css(), and finally
extracts the inode number from the cgroup returned. This looks bulky.
Since the next patch also needs the ino of the memory cgroup a page is
charged to, this patch introduces the page_cgroup_ino() helper.

Note that page_cgroup_ino() only considers those pages that are charged
to mem_cgroup->res (i.e. page_cgroup->mem_cgroup != NULL), and for
others it returns 0, while try_get_mem_cgroup_from_page(), used by hwpoison
before, may extract the cgroup from a swapcache readahead page too.
Ignoring swapcache readahead pages allows calling page_cgroup_ino() on
unlocked pages, which is nice. Hwpoison users will hardly see any
difference.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/memcontrol.h |  3 +++
 mm/hwpoison-inject.c   |  3 ---
 mm/memcontrol.c| 22 ++
 mm/memory-failure.c| 18 +-
 4 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 675b4c5..5507be5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 bool mem_cgroup_bad_page_check(struct page *page);
 void mem_cgroup_print_bad_page(struct page *page);
 #endif
+
+unsigned long page_cgroup_ino(struct page *page);
+
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 3a61efc..bd580f8 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -44,12 +44,9 @@ static int hwpoison_inject(void *data, u64 val)
/*
 * do a racy check with elevated page count, to make sure PG_hwpoison
 * will only be set for the targeted owner (or on a free page).
-* We temporarily take page lock for try_get_mem_cgroup_from_page().
 * memory_failure() will redo the check reliably inside page lock.
 */
-   lock_page(hpage);
err = hwpoison_filter(hpage);
-   unlock_page(hpage);
if (err)
return 0;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e772a06..9dda309 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2877,6 +2877,28 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct 
page *page)
return memcg;
 }
 
+/**
+ * page_cgroup_ino - return inode number of page's memcg
+ * @page: the page
+ *
+ * Look up the memory cgroup @page is charged to and return its inode number.
+ * It is safe to call this function without taking a reference to the page.
+ */
+unsigned long page_cgroup_ino(struct page *page)
+{
+   struct mem_cgroup *memcg;
+   struct page_cgroup *pc;
+   unsigned long ino = 0;
+
+   pc = lookup_page_cgroup(page);
+   lock_page_cgroup(pc);
+   memcg = pc->mem_cgroup;
+   if (PageCgroupUsed(pc) && memcg)
+   ino = memcg->css.cgroup->dentry->d_inode->i_ino;
+   unlock_page_cgroup(pc);
+   return ino;
+}
+
 static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
   struct page *page,
   unsigned int nr_pages,
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06f8d308..b3b1a2d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -133,26 +133,10 @@ u64 hwpoison_filter_memcg;
 EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
 static int hwpoison_filter_task(struct page *p)
 {
-   struct mem_cgroup *mem;
-   struct cgroup_subsys_state *css;
-   unsigned long ino;
-
if (!hwpoison_filter_memcg)
return 0;
 
-   mem = try_get_mem_cgroup_from_page(p);
-   if (!mem)
-   return -EINVAL;
-
-   css = mem_cgroup_css(mem);
-   /* root_mem_cgroup has NULL dentries */
-   

[Devel] [PATCH RHEL7 COMMIT] ms/mm/proc: add kpageidle file

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 35dcabf891ce1931294c5bf3d98e1203ff656432
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Tue May 19 08:23:57 2015 +0400

ms/mm/proc: add kpageidle file

Patchset description: idle memory tracking

This patch set backports

  https://lkml.org/lkml/2015/5/12/449

which is required by vcmmd.

It is not yet clear if the original patch set will be accepted upstream
as is, there still may be changes. However, I hope the user API will be
preserved. If it is not, we will have to fix this in our kernel too.

https://jira.sw.ru/browse/PSBM-32460

Vladimir Davydov (3):
  memcg: add page_cgroup_ino helper
  proc: add kpagecgroup file
  proc: add kpageidle file

===
This patch description:

Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.
However, this method has two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
can only be set from userspace by setting a bit in /proc/kpageidle at the
offset corresponding to the page, and it is cleared whenever the page is
accessed either through page tables (it is cleared in page_referenced()
in this case) or using the read(2) system call (mark_page_accessed()).
Thus by setting the Idle flag for pages of a particular workload, which
can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
let the workload access its working set, and then reading the kpageidle
file, one can estimate the amount of pages that are not used by the
workload.
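
To locate the bit for a given page, a userspace tool needs to map a PFN to a position in the bitmap. The sketch below assumes the bitmap is accessed in native 64-bit words covering 64 PFNs each, like the other 64-bit /proc/kpage* files; that layout is not spelled out in the description above, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct idle_bit_pos {
    uint64_t byte_offset; /* file offset of the 64-bit word holding the bit */
    uint64_t mask;        /* bit to set or test within that word */
};

/* Assumption: 64 PFNs per 64-bit word, bit (pfn % 64) within the word. */
static struct idle_bit_pos idle_bit_for_pfn(uint64_t pfn)
{
    struct idle_bit_pos pos;

    pos.byte_offset = (pfn / 64) * sizeof(uint64_t);
    pos.mask = UINT64_C(1) << (pfn % 64);
    return pos;
}
```

Under this assumption, marking a page idle means writing a word with the mask bit set at byte_offset. Checking it later means reading the same word back and testing the bit.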

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to kpageidle. If
page_referenced() is called on a Young page, it will add 1 to its return
value, therefore concealing the fact that the Access bit was cleared.

Note, since there is no room for extra page flags on 32 bit, this
feature uses extended page flags when compiled on 32 bit.

(on RH7 page ext is not available so make it depend on 64 bit)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 Documentation/vm/pagemap.txt |  12 +++-
 fs/proc/page.c   | 168 +++
 fs/proc/task_mmu.c   |   3 +-
 include/linux/mm.h   |  50 +
 include/linux/page-flags.h   |   9 +++
 mm/Kconfig   |  12 
 mm/page_alloc.c  |   4 ++
 mm/rmap.c|   9 +++
 mm/swap.c|   2 +
 9 files changed, 267 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index e37cff9..a4fe9b2 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
 userspace programs to examine the page tables and related information by
 reading files in /proc.
 
-There are four components to pagemap:
+There are five components to pagemap:
 
  * /proc/pid/pagemap.  This file lets a userspace process find out which
physical frame each virtual page is mapped to.  It contains one 64-bit
@@ -67,6 +67,16 @@ There are four components to pagemap:
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.
 
+ * /proc/kpageidle.  This file implements a bitmap where each bit corresponds
+   to a page, indexed by PFN. When the bit is set, the corresponding page is
+   idle. A page is considered idle if it has not been accessed since it was
+   marked idle. To mark a page idle one should set the bit corresponding to the
+   page by writing to the file. A value written to the file is OR-ed with the
+   current bitmap value. Only user memory pages can be marked idle, for other
+   page types input is silently ignored. Writing to this file beyond max PFN
+   results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
+   set.
+
 Short descriptions to the page flags:
 
  0. 

[Devel] [PATCH RHEL7 COMMIT] ploop: prevent disclosure 4 bytes of the stack kernel

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit c25ed54c1a19bc8c11fcc472c3e4869c210eca97
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:26:57 2015 +0400

ploop: prevent disclosure 4 bytes of the stack kernel

 Information leak (4 bytes of kernel stack) in the ploop_getdevice_ioc()
 function.

 static int ploop_getdevice_ioc(unsigned long arg)
 {
         int err;
         int index = 0;
         struct rb_node *n;
         struct ploop_getdevice_ctl ctl;

         mutex_lock(&ploop_devices_mutex);
         for (n = rb_first(&ploop_devices_tree); n; n = rb_next(n), index++) {
                 struct ploop_device *plo;
                 plo = rb_entry(n, struct ploop_device, link);
                 if (plo->index != index || list_empty(&plo->map.delta_list))
                         break;
         }
         mutex_unlock(&ploop_devices_mutex);

         ctl.minor = index << PLOOP_PART_SHIFT;
         if (ctl.minor & ~MINORMASK)
                 return -ERANGE;
         err = copy_to_user((void*)arg, &ctl, sizeof(ctl));
         return err;
 }

 The ploop_getdevice_ioc() function copies the ploop_getdevice_ctl
structure to user space but initializes only the 'minor' field. As a
result, 4 bytes of kernel stack can be disclosed through the '__mbz1'
field.

 The 'ploop_getdevice_ctl' structure is shown below:

 struct ploop_getdevice_ctl
 {
         __u32   minor;
         __u32   __mbz1;
 } __attribute__ ((aligned (8)));
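
The effect of the fix can be sketched in user-space C. This is only an
illustration under stated assumptions: struct getdevice_ctl and fill_ctl()
are hypothetical stand-ins, with uint32_t in place of __u32, and the
standard "= {0}" initializer in place of the "= {}" form used by the patch.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for struct ploop_getdevice_ctl. */
struct getdevice_ctl {
    uint32_t minor;
    uint32_t mbz1;   /* "must be zero" field */
} __attribute__((aligned(8)));

/* Build the reply so that no field is left holding stale stack contents. */
static void fill_ctl(struct getdevice_ctl *out, uint32_t minor)
{
    /* Zero-initialize every member first; without this, mbz1 would
     * carry whatever happened to be on the stack. The two 32-bit
     * members fill the 8-byte struct, so there is no padding here. */
    struct getdevice_ctl ctl = {0};

    ctl.minor = minor;
    *out = ctl;
}
```

Initializing the whole object at its declaration keeps the fix to one line
and also covers any fields added to the structure later.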

Signed-off-by: Andrey Vagin ava...@openvz.org

Reported-by: Jonathan Salwan (Sysdream Security Laboratory) 
jonathan.sal...@gmail.com
---
 drivers/block/ploop/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 2f4928d..8556af2 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4277,7 +4277,7 @@ static int ploop_getdevice_ioc(unsigned long arg)
int err;
int index = 0;
struct rb_node *n;
-   struct ploop_getdevice_ctl ctl;
+   struct ploop_getdevice_ctl ctl = {};
 
mutex_lock(&ploop_devices_mutex);
for (n = rb_first(&ploop_devices_tree); n; n = rb_next(n), index++) {
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] ploop: skip writes of zeroes to unallocated blocks by default

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 6051dc5f6e200cef2011e2174d1c3b76280fe75f
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:01 2015 +0400

ploop: skip writes of zeroes to unallocated blocks by default

Reading from unallocated blocks returns zeroes =>
we can safely skip writes of zeroes to unallocated blocks.

As a lot of tests do dd if=/dev/zero ..., this optimization is valuable.

Feature enabled, test results:
  [root@p2 ~]# echo 1 > /sys/block/ploop37803/ptune/check_zeros
  [root@p2 ~]# dd if=/dev/zero of=/mnt/sb-io-test bs=1M count=1k oflag=dsync
  1024+0 records in
  1024+0 records out
  1073741824 bytes (1.1 GB) copied, 1.58975 s, 675 MB/s

The impact on CPU utilization is negligible.
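
The idea behind the optimization can be sketched as follows. This is not the
ploop implementation; buffer_is_zero() and can_skip_write() are hypothetical
helpers that only illustrate the check "the payload is all zeroes and the
target block is unallocated".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if all len bytes of buf are zero. */
static bool buffer_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    if (len == 0)
        return true;
    /* If the first byte is zero and every byte equals its successor,
     * then the whole buffer is zero. */
    return p[0] == 0 && memcmp(p, p + 1, len - 1) == 0;
}

/* A write may be dropped only when the block is unallocated (reads of it
 * already return zeroes) and the data to be written is all zeroes. */
static bool can_skip_write(bool block_allocated, const void *buf, size_t len)
{
    return !block_allocated && buffer_is_zero(buf, len);
}
```

A write of zeroes to an already allocated block must still be performed,
because the block may currently hold non-zero data.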

https://jira.sw.ru/browse/PSBM-22506
https://jira.sw.ru/browse/PSBM-22381

Signed-off-by: Konstantin Khorenko khore...@parallels.com

Acked-by: Maxim V. Patlasov mpatla...@parallels.com
---
 include/linux/ploop/ploop.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index d295cba..434789e 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -323,6 +323,7 @@ struct ploop_tunable
 .congestion_low_watermark = DEFAULT_PLOOP_MAXRQ/2, \
 .pass_flushes = 1, \
 .pass_fuas = 1, \
+.check_zeros = 1, \
 .max_active_requests = DEFAULT_PLOOP_BATCH_ENTRY_QLEN / 2, }
 
 struct ploop_stats


[Devel] [PATCH RHEL7 COMMIT] ploop: fix spurious hole complains

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit e4c1ce43241df81fad73953200d887c6a402d82f
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:07 2015 +0400

ploop: fix spurious hole complains

Spurious complaints were triggered by the fiemap-ahead logic of the
pio_direct module. Fix this by suppressing the complaint when a fiemap
beyond EOF fails. Also print more details about the hole.

Signed-off-by: Maxim Patlasov mpatla...@parallels.com

Acked-by: Andrew Vagin ava...@parallels.com
---
 drivers/block/ploop/io_direct_map.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/block/ploop/io_direct_map.c b/drivers/block/ploop/io_direct_map.c
index b9a0ce9..c1d889b 100644
--- a/drivers/block/ploop/io_direct_map.c
+++ b/drivers/block/ploop/io_direct_map.c
@@ -681,7 +681,7 @@ again:
 
old_fs = get_fs();
set_fs(KERNEL_DS);
-   ret = inode->i_op->fiemap(inode, &fieinfo, start << 9, 1);
+   ret = inode->i_op->fiemap(inode, &fieinfo, start_off, 1);
 
/* chase for PSBM-26762: em->block_start == 0 */
if (!ret && fieinfo.fi_extents_mapped == 1 &&
@@ -709,8 +709,11 @@ again:
}
 
if (fieinfo.fi_extents_mapped != 1) {
-   ploop_msg_once(io->plo, "a hole in image file detected (%d)",
-  fieinfo.fi_extents_mapped);
+   if (start_off < i_size_read(inode))
+   ploop_msg_once(io->plo, "a hole in image file detected"
+   " (mapped=%d i_size=%llu off=%llu)",
+  fieinfo.fi_extents_mapped,
+  i_size_read(inode), start_off);
extent_put(em);
return ERR_PTR(-EINVAL);
}


[Devel] [PATCH RHEL7 COMMIT] ploop: notify blktrace about bio completions

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 3ebe6f8f4178ebf89b3aff5b064657e1e9615dce
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:11 2015 +0400

ploop: notify blktrace about bio completions

Signed-off-by: Andrey Smetanin asmeta...@virtuozzo.com
---
 drivers/block/ploop/dev.c  | 14 --
 drivers/block/ploop/freeblks.c |  4 +++-
 include/linux/ploop/compat.h   |  6 +-
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 225c2ab..e2ff0aa 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -13,6 +13,8 @@
 #include <linux/ve.h>
 #include <asm/uaccess.h>
 
+#include <trace/events/block.h>
+
 #include <linux/ploop/ploop.h>
 #include "ploop_events.h"
 #include "freeblks.h"
@@ -518,7 +520,7 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * bio,
bio->bi_bdev = plo->bdev;
clear_bit(BIO_BDEV_REUSED, &bio->bi_flags);
}
-   BIO_ENDIO(bio, err);
+   BIO_ENDIO(plo->queue, bio, err);
list_add(&preq->list, &plo->free_list);
plo->bio_qlen--;
plo->bio_discard_qlen--;
@@ -591,7 +593,7 @@ DEFINE_BIO_CB(ploop_fast_end_io)
 
plo = orig->bi_bdev->bd_disk->private_data;
 
-   BIO_ENDIO(orig, err);
+   BIO_ENDIO(plo->queue, orig, err);
 
/* End of fast bio wakes up main process only when this could
 * mean exit from ATTENTION state.
@@ -800,13 +802,13 @@ static void ploop_make_request(struct request_queue *q, struct bio *bio)
 * marked as FLUSH, otherwise just warn and complete. */
if (!(bio->bi_rw & REQ_FLUSH)) {
WARN_ON(1);
-   BIO_ENDIO(bio, 0);
+   BIO_ENDIO(q, bio, 0);
return;
}
/* useless to pass this bio further */
if (!plo->tune.pass_flushes) {
ploop_acc_ff_in(plo, bio->bi_rw);
-   BIO_ENDIO(bio, 0);
+   BIO_ENDIO(q, bio, 0);
return;
}
}
@@ -862,7 +864,7 @@ static void ploop_make_request(struct request_queue *q, struct bio *bio)
plo->bio_total--;
spin_unlock_irq(&plo->lock);
 
-   BIO_ENDIO(bio, -EIO);
+   BIO_ENDIO(q, bio, -EIO);
if (nbio)
bio_put(nbio);
return;
@@ -1208,7 +1210,7 @@ static void ploop_complete_request(struct ploop_request * preq)
struct bio * bio = preq->bl.head;
preq->bl.head = bio->bi_next;
bio->bi_next = NULL;
-   BIO_ENDIO(bio, preq->error);
+   BIO_ENDIO(plo->queue, bio, preq->error);
nr_completed++;
}
preq-bl.tail = NULL;
diff --git a/drivers/block/ploop/freeblks.c b/drivers/block/ploop/freeblks.c
index 569cb94..cf48d3a 100644
--- a/drivers/block/ploop/freeblks.c
+++ b/drivers/block/ploop/freeblks.c
@@ -8,6 +8,8 @@
 #include <linux/buffer_head.h>
 #include <linux/kthread.h>
 
+#include <trace/events/block.h>
+
 #include <linux/ploop/ploop.h>
 #include "freeblks.h"
 
@@ -700,7 +702,7 @@ static void fbd_complete_bio(struct ploop_freeblks_desc *fbd, int err)
struct bio * bio = fbd->fbd_dbl.head;
fbd->fbd_dbl.head = bio->bi_next;
bio->bi_next = NULL;
-   BIO_ENDIO(bio, err);
+   BIO_ENDIO(fbd->plo->queue, bio, err);
nr_completed++;
}
fbd-fbd_dbl.tail = NULL;
diff --git a/include/linux/ploop/compat.h b/include/linux/ploop/compat.h
index ace8ec1..03c3ae3 100644
--- a/include/linux/ploop/compat.h
+++ b/include/linux/ploop/compat.h
@@ -44,7 +44,11 @@ static void func(struct bio *bio, int err) {
 
 #define END_BIO_CB(func)  }
 
-#define BIO_ENDIO(_bio, _err)  bio_endio(_bio, _err)
+#define BIO_ENDIO(_queue, _bio, _err)  \
+   do {\
+   trace_block_bio_complete((_queue), (_bio), (_err)); \
+   bio_endio((_bio), (_err));  \
+   } while (0);
 
 #define F_DENTRY(file) (file)->f_path.dentry
 #define F_MNT(file)(file)->f_path.mnt


[Devel] [PATCH RHEL7 COMMIT] ploop: add a separate queue for discard bio-s (v2)

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit fa6f3b8595f13c13eebd452bc0947754ac249c2c
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:10 2015 +0400

ploop: add a separate queue for discard bio-s (v2)

When I added support for discard requests, process_bio_queue was
called only from ploop_thread, so I used ploop_quiesce and ploop_relax
for synchronization. Now it is called from ploop_make_request too,
so that synchronization no longer works.

The race was introduced by
diff-ploop-converting-bio-into-ploop-request-in-function-ploop_make_request.

This patch adds a separate queue for discard requests, which is handled
only from ploop_thread(). In addition, we gain the ability to postpone
discard bios while we are handling others, so we no longer fail if a bio
is received while another one is being processed. In the future this will
allow us to handle more than one bio concurrently.

v2: fix comments from Maxim
 Also, ploop_preq_drop() and ploop_complete_request() must wake up
 ploop-thread if !bio_list_empty(&plo->bio_discard_list) as well.

https://jira.sw.ru/browse/PSBM-27676

Note that this is a plain port (no logic changes) to RHEL7 of Andrew Vagin's
original patch (RHEL6).

Signed-off-by: Andrew Vagin ava...@openvz.org
---
 drivers/block/ploop/dev.c  | 54 ++
 drivers/block/ploop/freeblks.c |  5 
 drivers/block/ploop/freeblks.h |  1 +
 drivers/block/ploop/sysfs.c|  6 +
 include/linux/ploop/ploop.h|  2 ++
 5 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index ab99724..225c2ab 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -117,8 +117,9 @@ static void mitigation_timeout(unsigned long data)
spin_lock_irq(&plo->lock);
if (test_bit(PLOOP_S_WAIT_PROCESS, &plo->state) &&
(!list_empty(&plo->entry_queue) ||
-(plo->bio_head && !list_empty(&plo->free_list))) &&
-   waitqueue_active(&plo->waitq))
+((plo->bio_head && !bio_list_empty(&plo->bio_discard_list)) &&
+   !list_empty(&plo->free_list))) &&
+   waitqueue_active(&plo->waitq))
wake_up_interruptible(&plo->waitq);
spin_unlock_irq(&plo->lock);
 }
@@ -237,7 +238,8 @@ void ploop_preq_drop(struct ploop_device * plo, struct list_head *drop_list,
if (waitqueue_active(&plo->req_waitq))
wake_up(&plo->req_waitq);
else if (test_bit(PLOOP_S_WAIT_PROCESS, &plo->state) &&
-   waitqueue_active(&plo->waitq) && plo->bio_head)
+   waitqueue_active(&plo->waitq) &&
+   (plo->bio_head || !bio_list_empty(&plo->bio_discard_list)))
wake_up_interruptible(&plo->waitq);
 
ploop_uncongest(plo);
@@ -519,6 +521,7 @@ ploop_bio_queue(struct ploop_device * plo, struct bio * bio,
BIO_ENDIO(bio, err);
list_add(&preq->list, &plo->free_list);
plo->bio_qlen--;
+   plo->bio_discard_qlen--;
plo->bio_total--;
return;
}
@@ -756,6 +759,28 @@ static void ploop_unplug(struct blk_plug_cb *cb, bool from_schedule)
kfree(cb);
 }
 
+static void
+process_discard_bio_queue(struct ploop_device *plo, struct list_head *drop_list)
+{
+   bool discard = test_bit(PLOOP_S_DISCARD, &plo->state);
+   while (!list_empty(&plo->free_list)) {
+   struct bio *tmp;
+
+   /* Only one discard bio can be handled concurrently */
+   if (discard && ploop_discard_is_inprogress(plo->fbd))
+   return;
+
+   tmp = bio_list_pop(&plo->bio_discard_list);
+   if (tmp == NULL)
+   break;
+
+   /* If PLOOP_S_DISCARD isn't set, ploop_bio_queue
+* will complete it with a proper error.
+*/
+   ploop_bio_queue(plo, tmp, drop_list);
+   }
+}
+
 static void ploop_make_request(struct request_queue *q, struct bio *bio)
 {
struct bio * nbio;
@@ -843,6 +868,12 @@ static void ploop_make_request(struct request_queue *q, struct bio *bio)
return;
}
 
+   if (bio->bi_rw & REQ_DISCARD) {
+   bio_list_add(&plo->bio_discard_list, bio);
+   plo->bio_discard_qlen++;
+   goto queued;
+   }
+
/* Write tracking in fast path does not work at the moment. */
if (unlikely(test_bit(PLOOP_S_TRACK, &plo->state) &&
 (bio->bi_rw & WRITE)))
@@ -864,9 +895,6 @@ static void ploop_make_request(struct request_queue *q, struct bio *bio)
if (unlikely(nbio == NULL))
goto queue;
 
-   if (bio->bi_rw & REQ_DISCARD)
- 

[Devel] [PATCH RHEL7 COMMIT] ploop: define struct ploop_request->state flags to force pre FLUSH before write IO and FUA/fsync at I/O complete

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit ebf1008ff2e19354244317140b41ae3c2854f74b
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:13 2015 +0400

ploop: define struct ploop_request->state flags to force pre FLUSH before
write IO and FUA/fsync at I/O complete

Series description:

During relocation of ploop clusters (resize/balloon) we need to FUA/fsync
the image file after the following operations:
 a) a new data block is written
 b) BAT update
 c) nullifying the old data block for BAT grow. This is already done in the
    format module, in the complete_grow callback.

This patch forces fsync (kaio) or FUA (direct) of reloc write I/O to the
image by marking such reloc reqs (A|S) with appropriate flags. The
kaio/direct modules are tuned by the patch to force fsync/FUA if these
flags are set. This code does FUA/fsync only for cases a) and b), since
c) is already implemented.

The patch also fixes inconsistent bio list FUA processing in the direct
module. The problem is that for a bunch of bios we set FUA only on the
last bio. In case of a power outage it is possible that the last bio is
stored while the previous ones are not, because they were held only in
the cache at the time of the power failure. To solve the problem, this
patch marks the last bio as FLUSH|FUA if there is more than one bio in
the list.

Moreover, for KAIO, if fsync is possible at the BAT update stage, we do it
there, as in the direct case, instead of issuing two fsyncs. For the direct
case, if we are going to issue FUA at BAT update only (an optimization
trick that already exists), then we need to mark the req to FLUSH
previously written (without FUA) data.

Performance:
Overall resize performance (including EXT4 resize up to 16T) degraded by
5% in terms of time.

https://jira.sw.ru/browse/PSBM-31222
https://jira.sw.ru/browse/PSBM-31225
https://jira.sw.ru/browse/PSBM-31321

Signed-off-by: Andrey Smetanin asmeta...@parallels.com

Andrey Smetanin (7):
  ploop: define struct ploop_request->state flags to force pre FLUSH
before write IO and FUA/fsync at I/O complete
  ploop: mark reloc reqs to force FUA/fsync(kaio) for index update I/O
  ploop: mark reloc reqs to force FUA before write of relocated data
  ploop: direct: to support truly FLUSH/FUA of req we need mark first
bio FLUSH, write all bios and mark last bio as FLUSH/FUA
  ploop: added ploop_req_delay_fua_possible() func that detects possible
delaying of upcoming FUA to index update stage. This function will
be lately used in direct/kaio code to detect and delay FUA
  ploop: make image fsync at I/O complete if it's required by FUA/fsync
force flag or by req->req_rw
  ploop: do preflush or postfua according force FUA/flush flags, and
delay FUA if possible but add force FLUSH to req if so

This patch description:
These defines are needed to force FUA/FLUSH/fsync in the direct/kaio modules.

https://jira.sw.ru/browse/PSBM-31222
https://jira.sw.ru/browse/PSBM-31225
https://jira.sw.ru/browse/PSBM-31321

Signed-off-by: Andrey Smetanin asmeta...@parallels.com

Reviewed-by: Andrew Vagin ava...@parallels.com
---
 include/linux/ploop/ploop.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index d8b83a6..73280e0 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -456,6 +456,9 @@ enum
PLOOP_REQ_ZERO,
PLOOP_REQ_DISCARD,
PLOOP_REQ_RSYNC,
+   PLOOP_REQ_FORCE_FUA,/*force fua of req write I/O by engine */
+   PLOOP_REQ_FORCE_FLUSH,  /*force flush by engine */
+   PLOOP_REQ_KAIO_FSYNC,   /*force image fsync by KAIO module */
 };
 
 enum


[Devel] [PATCH RHEL7 COMMIT] ploop: direct: to support truly FLUSH/FUA of req we need mark first bio FLUSH, write all bios and mark last bio as FLUSH/FUA

2015-05-18 Thread Konstantin Khorenko
The commit is pushed to branch-rh7-3.10.0-123.1.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-123.1.2.vz7.5.1
--
commit 822c64967450485c2a26b9cfbf388d85ad022781
Author: Andrey Smetanin asmeta...@virtuozzo.com
Date:   Tue May 19 08:27:13 2015 +0400

ploop: direct: to support truly FLUSH/FUA of req we need mark first bio 
FLUSH, write all bios and mark last bio as FLUSH/FUA

Series description:

During relocation of ploop clusters (resize/balloon) we need to FUA/fsync
the image file after the following operations:
 a) a new data block is written
 b) BAT update
 c) nullifying the old data block for BAT grow. This is already done in the
    format module, in the complete_grow callback.

This patch forces fsync (kaio) or FUA (direct) of reloc write I/O to the
image by marking such reloc reqs (A|S) with appropriate flags. The
kaio/direct modules are tuned by the patch to force fsync/FUA if these
flags are set. This code does FUA/fsync only for cases a) and b), since
c) is already implemented.

The patch also fixes inconsistent bio list FUA processing in the direct
module. The problem is that for a bunch of bios we set FUA only on the
last bio. In case of a power outage it is possible that the last bio is
stored while the previous ones are not, because they were held only in
the cache at the time of the power failure. To solve the problem, this
patch marks the last bio as FLUSH|FUA if there is more than one bio in
the list.

Moreover, for KAIO, if fsync is possible at the BAT update stage, we do it
there, as in the direct case, instead of issuing two fsyncs. For the direct
case, if we are going to issue FUA at BAT update only (an optimization
trick that already exists), then we need to mark the req to FLUSH
previously written (without FUA) data.

Performance:
Overall resize performance (including EXT4 resize up to 16T) degraded by
5% in terms of time.

https://jira.sw.ru/browse/PSBM-31222
https://jira.sw.ru/browse/PSBM-31225
https://jira.sw.ru/browse/PSBM-31321

Signed-off-by: Andrey Smetanin asmeta...@parallels.com

Andrey Smetanin (7):
  ploop: define struct ploop_request->state flags to force pre FLUSH
before write IO and FUA/fsync at I/O complete
  ploop: mark reloc reqs to force FUA/fsync(kaio) for index update I/O
  ploop: mark reloc reqs to force FUA before write of relocated data
  ploop: direct: to support truly FLUSH/FUA of req we need mark first
bio FLUSH, write all bios and mark last bio as FLUSH/FUA
  ploop: added ploop_req_delay_fua_possible() func that detects possible
delaying of upcoming FUA to index update stage. This function will
be lately used in direct/kaio code to detect and delay FUA
  ploop: make image fsync at I/O complete if it's required by FUA/fsync
force flag or by req->req_rw
  ploop: do preflush or postfua according force FUA/flush flags, and
delay FUA if possible but add force FLUSH to req if so

This patch description:
This patch fixes inconsistent bio list FUA processing in the direct module.
The problem is that for a bunch of bios we set FUA only on the last bio.
In case of a power outage it is possible that the last bio is stored while
the previous ones are not, because they were held only in the cache at the
time of the power failure. To solve the problem, this patch marks the last
bio as FLUSH|FUA if there is more than one bio in the list.
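
The flag assignment described above can be sketched outside the kernel as
follows. This is a simplified model, not the dio_submit() code: the
SKETCH_REQ_* constants and mark_bio_flags() are hypothetical names, and the
"preflush goes on the first bio" behaviour is an assumption based on the
surrounding diff context.

```c
#include <stddef.h>

/* Hypothetical flag bits standing in for REQ_FLUSH and REQ_FUA. */
#define SKETCH_REQ_FLUSH (1u << 0)
#define SKETCH_REQ_FUA   (1u << 1)

/* Compute per-bio flags for a list of n bios submitted in order. */
static void mark_bio_flags(unsigned int *flags, size_t n,
                           int preflush, int postfua)
{
    for (size_t i = 0; i < n; i++) {
        flags[i] = 0;

        /* A requested preflush is attached to the first bio only. */
        if (preflush && i == 0)
            flags[i] |= SKETCH_REQ_FLUSH;

        /* The last bio carries FUA. If other bios preceded it, it also
         * carries FLUSH so that their data leaves the volatile cache
         * before the final write is reported as complete. */
        if (postfua && i == n - 1)
            flags[i] |= SKETCH_REQ_FUA | (i ? SKETCH_REQ_FLUSH : 0);
    }
}
```

With a single bio, FUA alone is enough, because there is no earlier data
that could still be sitting only in the device cache.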

https://jira.sw.ru/browse/PSBM-31222
https://jira.sw.ru/browse/PSBM-31225
https://jira.sw.ru/browse/PSBM-31321

Signed-off-by: Andrey Smetanin asmeta...@parallels.com

Reviewed-by: Andrew Vagin ava...@parallels.com
---
 drivers/block/ploop/io_direct.c | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 5e2e078..2e81d81 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -85,6 +85,7 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
int preflush;
int postfua = 0;
int write = !!(rw & REQ_WRITE);
+   int bio_num;
 
trace_submit(preq);
 
@@ -215,6 +216,7 @@ flush_bio:
}
extent_put(em);
 
+   bio_num = 0;
while (bl.head) {
struct bio * b = bl.head;
unsigned long rw2 = rw;
@@ -230,10 +232,11 @@ flush_bio:
preflush = 0;
}
if (unlikely(postfua && !bl.head))
-   rw2 |= REQ_FUA;
+   rw2 |= (REQ_FUA | ((bio_num) ? REQ_FLUSH : 0));
 
ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
submit_bio(rw2 & ~(bl.head ? REQ_SYNC : 0), b);
+   bio_num++;
}
 
ploop_complete_io_request(preq);
@@ -1341,9 +1344,12 @@ dio_io_page(struct ploop_io * io, unsigned long rw,
int err;
int off;
int postfua;
+   int 
