date:20170726

[PATCHv2 0/4] perf stat: Enable group read of counters

2017-07-26 Thread Jiri Olsa

hi,
sending changes to enable group read of perf counters
for the perf stat command. It allows us to read a whole group
of counters within a single read syscall.

v2 changes:
  - fixed release segfault reported by Arnaldo
  - rebased to latest Arnaldo's perf/core
  - patch 1 already merged in

Also available in here:
  git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  perf/stat_group

Not sure why we haven't supported it yet, but anyway it was
unavailable for some time due to a bug which was fixed
just recently via:
  ba5213ae6b88 ("perf/core: Correct event creation with PERF_FORMAT_GROUP")

thanks,
jirka


---
Jiri Olsa (3):
  perf tools: Add perf_evsel__read_size function
  perf tools: Add perf_evsel__read_counter function
  perf stat: Use group read for event groups

 tools/perf/builtin-stat.c |  30 ---
 tools/perf/util/counts.h  |   1 +
 tools/perf/util/evsel.c   | 139 -
 tools/perf/util/evsel.h   |   2 ++
 tools/perf/util/stat.c    |   4 +++
 tools/perf/util/stat.h    |   5 ++--
 6 files changed, 175 insertions(+), 6 deletions(-)

[PATCH 1/3] perf tools: Add perf_evsel__read_size function

2017-07-26 Thread Jiri Olsa

Currently we use the size of struct perf_counts_values
to read the event, which prevents us from adding any new
member to the struct.

Add perf_evsel__read_size to return the size of the
buffer needed for the event read.

Link: http://lkml.kernel.org/n/tip-cfc3dmil3tlzezzxtyi9f...@git.kernel.org
Signed-off-by: Jiri Olsa 
---
 tools/perf/util/evsel.c | 29 -
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 450b5fadf8cb..4dd0fcc06db9 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1261,15 +1261,42 @@ void perf_counts_values__scale(struct perf_counts_values *count,
*pscaled = scaled;
 }
 
+static int perf_evsel__read_size(struct perf_evsel *evsel)
+{
+   u64 read_format = evsel->attr.read_format;
+   int entry = sizeof(u64); /* value */
+   int size = 0;
+   int nr = 1;
+
+   if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
+   size += sizeof(u64);
+
+   if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
+   size += sizeof(u64);
+
+   if (read_format & PERF_FORMAT_ID)
+   entry += sizeof(u64);
+
+   if (read_format & PERF_FORMAT_GROUP) {
+   nr = evsel->nr_members;
+   size += sizeof(u64);
+   }
+
+   size += entry * nr;
+   return size;
+}
+
 int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 struct perf_counts_values *count)
 {
+   size_t size = perf_evsel__read_size(evsel);
+
memset(count, 0, sizeof(*count));
 
if (FD(evsel, cpu, thread) < 0)
return -EINVAL;
 
-   if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0)
+   if (readn(FD(evsel, cpu, thread), count->values, size) <= 0)
return -errno;
 
return 0;
-- 
2.9.4
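
For readers following along, the size computation above can be tried
outside the perf tree. This is only a sketch: read_size() and the FMT_*
names are made up for illustration (the bit values mirror the
PERF_FORMAT_* bits in linux/perf_event.h), and nr_members stands in for
evsel->nr_members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative local names; values mirror PERF_FORMAT_* in linux/perf_event.h. */
#define FMT_TOTAL_TIME_ENABLED (1U << 0)
#define FMT_TOTAL_TIME_RUNNING (1U << 1)
#define FMT_ID                 (1U << 2)
#define FMT_GROUP              (1U << 3)

/* Hypothetical stand-alone version of the buffer size computation. */
static size_t read_size(uint64_t read_format, int nr_members)
{
	size_t entry = sizeof(uint64_t);	/* value */
	size_t size = 0;
	int nr = 1;

	assert(nr_members >= 1);

	if (read_format & FMT_TOTAL_TIME_ENABLED)
		size += sizeof(uint64_t);
	if (read_format & FMT_TOTAL_TIME_RUNNING)
		size += sizeof(uint64_t);
	if (read_format & FMT_ID)
		entry += sizeof(uint64_t);	/* each entry also carries an id */
	if (read_format & FMT_GROUP) {
		nr = nr_members;
		size += sizeof(uint64_t);	/* leading "nr" field */
	}

	return size + entry * nr;
}
```

With both time fields enabled, a single counter needs 24 bytes, while a
two-member group read with IDs needs 8 + 16 + 2 * 16 = 56 bytes, which is
why the fixed three-u64 struct cannot serve as the read buffer for groups.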

[PATCH 3/3] perf stat: Use group read for event groups

2017-07-26 Thread Jiri Olsa

Make perf stat use group read if there are groups
defined. The group read gets the values for all
members of a group within a single syscall instead of
calling the read syscall for every event.

We can see considerably fewer kernel cycles
spent on a single group read than on reading each event
separately, e.g. for the following perf stat command:

  # perf stat -e {cycles,instructions} -I 10 -a sleep 1

Monitored with "perf stat -r 5 -e '{cycles:u,cycles:k}'"

Before:

24,325,676  cycles:u
   297,040,775  cycles:k

   1.038554134 seconds time elapsed

After:
25,034,418  cycles:u
   158,256,395  cycles:k

   1.036864497 seconds time elapsed
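
To put the two cycles:k figures in relation: (297,040,775 - 158,256,395)
/ 297,040,775 is roughly 0.467, i.e. about 47% fewer kernel cycles in
this one measurement. A trivial helper for the same arithmetic, with a
made-up name, could look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Percentage by which 'after' is smaller than 'before'. */
static double reduction_pct(uint64_t before, uint64_t after)
{
	assert(before > 0 && after <= before);
	return 100.0 * (double)(before - after) / (double)before;
}
```

Calling reduction_pct(297040775, 158256395) reproduces the figure above.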

The perf_evsel__open fallback changes were contributed by Andi Kleen.

Link: http://lkml.kernel.org/n/tip-b6g8qarwvptr81cqdtfst...@git.kernel.org
Signed-off-by: Jiri Olsa 
---
 tools/perf/builtin-stat.c | 30 +++---
 tools/perf/util/counts.h  |  1 +
 tools/perf/util/evsel.c   | 10 ++
 3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 48ac53b199fc..866da7aa54bf 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -213,10 +213,20 @@ static void perf_stat__reset_stats(void)
 static int create_perf_stat_counter(struct perf_evsel *evsel)
 {
struct perf_event_attr *attr = &evsel->attr;
+   struct perf_evsel *leader = evsel->leader;
 
-   if (stat_config.scale)
+   if (stat_config.scale) {
attr->read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
PERF_FORMAT_TOTAL_TIME_RUNNING;
+   }
+
+   /*
+* The event is part of non trivial group, let's enable
+* the group read (for leader) and ID retrieval for all
+* members.
+*/
+   if (leader->nr_members > 1)
+   attr->read_format |= PERF_FORMAT_ID|PERF_FORMAT_GROUP;
 
attr->inherit = !no_inherit;
 
@@ -333,13 +343,21 @@ static int read_counter(struct perf_evsel *counter)
struct perf_counts_values *count;
 
count = perf_counts(counter->counts, cpu, thread);
-   if (perf_evsel__read(counter, cpu, thread, count)) {
+
+   /*
+* The leader's group read loads data into its group members
+* (via perf_evsel__read_counter) and sets their count->loaded.
+*/
+   if (!count->loaded &&
+   perf_evsel__read_counter(counter, cpu, thread)) {
counter->counts->scaled = -1;
perf_counts(counter->counts, cpu, thread)->ena = 0;
perf_counts(counter->counts, cpu, thread)->run = 0;
return -1;
}
 
+   count->loaded = false;
+
if (STAT_RECORD) {
if (perf_evsel__write_stat_event(counter, cpu, thread, count)) {
pr_err("failed to write stat event\n");
@@ -559,6 +577,11 @@ static int store_counter_ids(struct perf_evsel *counter)
return __store_counter_ids(counter, cpus, threads);
 }
 
+static bool perf_evsel__should_store_id(struct perf_evsel *counter)
+{
+   return STAT_RECORD || counter->attr.read_format & PERF_FORMAT_ID;
+}
+
 static int __run_perf_stat(int argc, const char **argv)
 {
int interval = stat_config.interval;
@@ -631,7 +654,8 @@ static int __run_perf_stat(int argc, const char **argv)
if (l > unit_width)
unit_width = l;
 
-   if (STAT_RECORD && store_counter_ids(counter))
+   if (perf_evsel__should_store_id(counter) &&
+   store_counter_ids(counter))
return -1;
}
 
diff --git a/tools/perf/util/counts.h b/tools/perf/util/counts.h
index 34d8baaf558a..cb45a6aecf9d 100644
--- a/tools/perf/util/counts.h
+++ b/tools/perf/util/counts.h
@@ -12,6 +12,7 @@ struct perf_counts_values {
};
u64 values[3];
};
+   bool loaded;
 };
 
 struct perf_counts {
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 89aecf3a35c7..3735c9e0080d 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -49,6 +49,7 @@ static struct {
bool clockid_wrong;
bool lbr_flags;
bool write_backward;
+   bool group_read;
 } perf_missing_features;
 
 static clockid_t clockid;
@@ -1321,6 +1322,7 @@ perf_evsel__set_count(struct perf_evsel *counter, int cpu, int thread,
count->val= val;
count->ena= ena;
count->run= run;
+   count->loaded = true;
 }
 
 static int
@@ -1677,6 +1679,8 @@ int perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
if
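
The interplay of the group read and the count->loaded flag in
read_counter() can be modelled in a few lines. This is a toy simulation
under stated assumptions, not perf code: struct slot, group_read() and
read_all() are invented names, and the "read" just fills in dummy values
while counting how many simulated syscalls were issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one counter's per-cpu/thread counts slot. */
struct slot {
	unsigned long long val;
	bool loaded;
};

static int nr_reads;	/* number of simulated read() syscalls */

/* Simulated group read on the leader: fills every member, marks it loaded. */
static void group_read(struct slot *group, size_t nr)
{
	nr_reads++;
	for (size_t i = 0; i < nr; i++) {
		group[i].val = 100 + i;
		group[i].loaded = true;
	}
}

/* Walk the counters leader-first, the way read_counter() runs per counter. */
static int read_all(struct slot *group, size_t nr)
{
	nr_reads = 0;
	for (size_t i = 0; i < nr; i++) {
		if (!group[i].loaded)
			group_read(group, nr);	/* only reached for the leader */
		group[i].loaded = false;	/* re-arm for the next interval */
	}
	return nr_reads;
}
```

For a three-member group this issues one simulated read per pass instead
of three, and clearing the flag after use means the next interval starts
with a fresh group read.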

[PATCH 2/3] perf tools: Add perf_evsel__read_counter function

2017-07-26 Thread Jiri Olsa

Add the perf_evsel__read_counter function to read a single or
group counter. After calling this function the counter's
evsel::counts struct is filled with values for the counter
and the members of its group, if there are any.

Link: http://lkml.kernel.org/n/tip-itsuxdyt7rp4mvij1t6k7...@git.kernel.org
Signed-off-by: Jiri Olsa 
---
 tools/perf/util/evsel.c | 100 
 tools/perf/util/evsel.h |   2 +
 tools/perf/util/stat.c  |   4 ++
 tools/perf/util/stat.h  |   5 ++-
 4 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 4dd0fcc06db9..89aecf3a35c7 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1302,6 +1302,106 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
return 0;
 }
 
+static int
+perf_evsel__read_one(struct perf_evsel *evsel, int cpu, int thread)
+{
+   struct perf_counts_values *count = perf_counts(evsel->counts, cpu, thread);
+
+   return perf_evsel__read(evsel, cpu, thread, count);
+}
+
+static void
+perf_evsel__set_count(struct perf_evsel *counter, int cpu, int thread,
+ u64 val, u64 ena, u64 run)
+{
+   struct perf_counts_values *count;
+
+   count = perf_counts(counter->counts, cpu, thread);
+
+   count->val= val;
+   count->ena= ena;
+   count->run= run;
+}
+
+static int
+perf_evsel__process_group_data(struct perf_evsel *leader,
+  int cpu, int thread, u64 *data)
+{
+   u64 read_format = leader->attr.read_format;
+   struct sample_read_value *v;
+   u64 nr, ena = 0, run = 0, i;
+
+   nr = *data++;
+
+   if (nr != (u64) leader->nr_members)
+   return -EINVAL;
+
+   if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
+   ena = *data++;
+
+   if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
+   run = *data++;
+
+   v = (struct sample_read_value *) data;
+
+   perf_evsel__set_count(leader, cpu, thread,
+ v[0].value, ena, run);
+
+   for (i = 1; i < nr; i++) {
+   struct perf_evsel *counter;
+
+   counter = perf_evlist__id2evsel(leader->evlist, v[i].id);
+   if (!counter)
+   return -EINVAL;
+
+   perf_evsel__set_count(counter, cpu, thread,
+ v[i].value, ena, run);
+   }
+
+   return 0;
+}
+
+static int
+perf_evsel__read_group(struct perf_evsel *leader, int cpu, int thread)
+{
+   struct perf_stat_evsel *ps = leader->priv;
+   u64 read_format = leader->attr.read_format;
+   int size = perf_evsel__read_size(leader);
+   u64 *data = ps->group_data;
+
+   if (!(read_format & PERF_FORMAT_ID))
+   return -EINVAL;
+
+   if (!perf_evsel__is_group_leader(leader))
+   return -EINVAL;
+
+   if (!data) {
+   data = zalloc(size);
+   if (!data)
+   return -ENOMEM;
+
+   ps->group_data = data;
+   }
+
+   if (FD(leader, cpu, thread) < 0)
+   return -EINVAL;
+
+   if (readn(FD(leader, cpu, thread), data, size) <= 0)
+   return -errno;
+
+   return perf_evsel__process_group_data(leader, cpu, thread, data);
+}
+
+int perf_evsel__read_counter(struct perf_evsel *evsel, int cpu, int thread)
+{
+   u64 read_format = evsel->attr.read_format;
+
+   if (read_format & PERF_FORMAT_GROUP)
+   return perf_evsel__read_group(evsel, cpu, thread);
+   else
+   return perf_evsel__read_one(evsel, cpu, thread);
+}
+
 int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
  int cpu, int thread, bool scale)
 {
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index fb40ca3c6519..de03c18daaf0 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -299,6 +299,8 @@ static inline bool perf_evsel__match2(struct perf_evsel *e1,
 int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
 struct perf_counts_values *count);
 
+int perf_evsel__read_counter(struct perf_evsel *evsel, int cpu, int thread);
+
 int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
  int cpu, int thread, bool scale);
 
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 53b9a994a3dc..35e9848734d6 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -128,6 +128,10 @@ static int perf_evsel__alloc_stat_priv(struct perf_evsel *evsel)
 
 static void perf_evsel__free_stat_priv(struct perf_evsel *evsel)
 {
+   struct perf_stat_evsel *ps = evsel->priv;
+
+   if (ps)
+   free(ps->group_data);
zfree(&evsel->priv);
 }
 
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 7522bf10b03e..eacaf958e19d 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@
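
The buffer layout that perf_evsel__process_group_data walks (nr, then the
optional enabled/running times, then one value/id pair per member) can be
sketched in plain C. The names below (struct group_view, parse_group,
entry_value, entry_id, FMT_*) are hypothetical, and the sketch assumes
PERF_FORMAT_ID is set so that every entry is a {value, id} pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative local names; values mirror PERF_FORMAT_* in linux/perf_event.h. */
#define FMT_TOTAL_TIME_ENABLED (1U << 0)
#define FMT_TOTAL_TIME_RUNNING (1U << 1)

/* Hypothetical decoded view of a group read buffer. */
struct group_view {
	uint64_t nr;			/* number of entries */
	uint64_t ena, run;		/* 0 when not requested */
	const uint64_t *entries;	/* nr pairs of { value, id } */
};

static int parse_group(const uint64_t *data, uint64_t read_format,
		       uint64_t expected_nr, struct group_view *out)
{
	out->nr = *data++;
	if (out->nr != expected_nr)
		return -1;	/* kernel and tool disagree on group size */

	out->ena = (read_format & FMT_TOTAL_TIME_ENABLED) ? *data++ : 0;
	out->run = (read_format & FMT_TOTAL_TIME_RUNNING) ? *data++ : 0;
	out->entries = data;
	return 0;
}

static uint64_t entry_value(const struct group_view *g, size_t i)
{
	assert(i < g->nr);
	return g->entries[2 * i];
}

static uint64_t entry_id(const struct group_view *g, size_t i)
{
	assert(i < g->nr);
	return g->entries[2 * i + 1];
}
```

The id of each entry is what lets the tool map a value back to the right
group member, which is the role perf_evlist__id2evsel plays in the patch.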

Re: [PATCH 1/3] arm/syscalls: Move address limit check in loop

2017-07-26 Thread Will Deacon

On Tue, Jul 25, 2017 at 01:01:17PM -0700, Thomas Garnier wrote:
> On Tue, Jul 25, 2017 at 3:38 AM, Russell King - ARM Linux
>  wrote:
> > On Tue, Jul 25, 2017 at 01:28:01PM +0300, Leonard Crestez wrote:
> >> On Mon, 2017-07-24 at 10:07 -0700, Thomas Garnier wrote:
> >> > On Wed, Jul 19, 2017 at 10:58 AM, Thomas Garnier wrote:
> >> > >
> >> > > The work pending loop can call set_fs after addr_limit_user_check
> >> > > removed the _TIF_FSCHECK flag. To prevent the infinite loop, move
> >> > > the addr_limit_user_check call at the beginning of the loop.
> >> > >
> >> > > Fixes: 73ac5d6a2b6a ("arm/syscalls: Check address limit on user-
> >> > > mode return")
> >> > > Reported-by: Leonard Crestez 
> >> > > Signed-off-by: Thomas Garnier 
> >>
> >> > Any comments on this patch set?
> >>
> >> Tested-by: Leonard Crestez 
> >>
> >> This appears to fix the original issue of failing to boot from NFS when
> >> there are lots of alignment faults. But this is a very basic test
> >> relative to the reach of this change.
> >>
> >> However the original patch has been in linux-next for a while and
> >> apparently nobody else noticed system calls randomly hanging on arm.
> >>
> >> I assume maintainers need to give their opinion.
> >
> > I've already stated my opinion, which is different from what Linus has
> > requested of Thomas.  IMHO, the current approach is going to keep on
> > causing problems along the lines that I've already pointed out.
> 
> I understand. Do you think this problem applies to arm64 as well?

It's probably less of an issue for arm64 because we don't take alignment
faults from the kernel and I think the perf case would resolve itself by
throttling the event. However, I also don't see the advantage of doing
this in the work loop as opposed to leaving it until we're actually doing
the return to userspace.

I looked to see what you've done for x86, but it looks like you check/clear
the flag before the work pending loop (exit_to_usermode_loop), which
subsequently re-enables interrupts and exits when
EXIT_TO_USERMODE_LOOP_FLAGS are all clear. Since TIF_FSCHECK isn't included
in those flags, what stops it being set again by an irq and remaining set
for the return to userspace?

Will

Re: [PATCH] arm64: Convert to using %pOF instead of full_name

2017-07-26 Thread Will Deacon

Hi Rob,

On Tue, Jul 25, 2017 at 07:27:29PM -0500, Rob Herring wrote:
> On Tue, Jul 25, 2017 at 7:04 AM, Will Deacon  wrote:
> > On Tue, Jul 18, 2017 at 04:42:42PM -0500, Rob Herring wrote:
> >> Now that we have a custom printf format specifier, convert users of
> >> full_name to use %pOF instead. This is preparation to remove storing
> >> of the full path string for each node.
> >>
> >> Signed-off-by: Rob Herring 
> >> Cc: Catalin Marinas 
> >> Cc: Will Deacon 
> >> Cc: linux-arm-ker...@lists.infradead.org
> >> ---
> >>  arch/arm64/kernel/cpu_ops.c  |  4 ++--
> >>  arch/arm64/kernel/smp.c  | 12 ++--
> >>  arch/arm64/kernel/topology.c | 22 +++---
> >>  3 files changed, 19 insertions(+), 19 deletions(-)
> >
> > I've queued this and the perf patch too, but it would be good if somebody
> > could update sparse to recognise this format specifier. Currently it
> > just complains about it.
> 
> I'm happy to fix it, but I ran sparse and don't see any errors. Got a pointer?

I went back and checked again and it's not sparse that's warning, it's
actually smatch (sorry for getting that mixed up):

  arch/arm64/kernel/cpu_ops.c:85 cpu_read_enable_method() error: unrecognized %p extension 'O', treated as normal %p [smatch]

Will

Re: [PATCH v4.4.y] sched/cgroup: Move sched_online_group() back into css_online() to fix crash

2017-07-26 Thread Matt Fleming

On Tue, 25 Jul, at 11:04:39AM, Greg KH wrote:
> On Thu, Jul 20, 2017 at 02:53:09PM +0100, Matt Fleming wrote:
> > From: Konstantin Khlebnikov 
> > 
> > commit 96b777452d8881480fd5be50112f791c17db4b6b upstream.
> > 
> > Commit:
> > 
> >   2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init")
> > 
> > .. moved sched_online_group() from css_online() to css_alloc().
> > It exposes half-baked task group into global lists before initializing
> > generic cgroup stuff.
> > 
> > LTP testcase (third in cgroup_regression_test) written for testing
> > similar race in kernels 2.6.26-2.6.28 easily triggers this oops:
> > 
> >   BUG: unable to handle kernel NULL pointer dereference at 0008
> >   IP: kernfs_path_from_node_locked+0x260/0x320
> >   CPU: 1 PID: 30346 Comm: cat Not tainted 4.10.0-rc5-test #4
> >   Call Trace:
> >   ? kernfs_path_from_node+0x4f/0x60
> >   kernfs_path_from_node+0x3e/0x60
> >   print_rt_rq+0x44/0x2b0
> >   print_rt_stats+0x7a/0xd0
> >   print_cpu+0x2fc/0xe80
> >   ? __might_sleep+0x4a/0x80
> >   sched_debug_show+0x17/0x30
> >   seq_read+0xf2/0x3b0
> >   proc_reg_read+0x42/0x70
> >   __vfs_read+0x28/0x130
> >   ? security_file_permission+0x9b/0xc0
> >   ? rw_verify_area+0x4e/0xb0
> >   vfs_read+0xa5/0x170
> >   SyS_read+0x46/0xa0
> >   entry_SYSCALL_64_fastpath+0x1e/0xad
> > 
> > Here the task group is already linked into the global RCU-protected
> > 'task_groups' list, but the css->cgroup pointer is still NULL.
> > 
> > This patch reverts this chunk and moves online back to css_online().
> > 
> > Signed-off-by: Konstantin Khlebnikov 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > Cc: Linus Torvalds 
> > Cc: Peter Zijlstra 
> > Cc: Tejun Heo 
> > Cc: Thomas Gleixner 
> > Fixes: 2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init")
> > Link: http://lkml.kernel.org/r/148655324740.424917.5302984537258726349.stgit@buzz
> > Signed-off-by: Ingo Molnar 
> > Signed-off-by: Matt Fleming 
> > ---
> >  kernel/sched/core.c | 14 --
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> What about 4.9-stable, this should go there too, right?

Yes, good catch. Would you like me to send a separate patch?

[PATCH v5 0/4] ARM: dts: imx: add CX9020 Embedded PC device tree

2017-07-26 Thread linux-kernel-dev

From: Patrick Bruenn 

The CX9020 differs from i.MX53 Quick Start Board by:
- use uart2 instead of uart1
- DVI-D connector instead of VGA
- no audio
- no SATA connector
- CCAT FPGA connected to emi
- enable rtc

v5:
- rebased on v4.13-rc2
- don't take maintainership for imx53-cx9020.dtsi, keep it to
  ARM/FREESCALE IMX maintainers
- add explicit pinmux settings for pwr leds (EIM_D22 - D24)
- remove display0->status="okay"
- use "regulator-vbus" as name for usb_vbus regulator node
- use correct reset values for explicit pinmux settings of:
  MX53_PAD_GPIO_0__CCM_CLKO
  MX53_PAD_GPIO_16__I2C3_SDA
  MX53_PAD_GPIO_1__ESDHC1_CD
  MX53_PAD_GPIO_3__GPIO1_3
  MX53_PAD_GPIO_8__GPIO1_8

v4:
- move alternative UART2 pinmux settings to imx53-pinfunc.h
- fix copyright notice and model name to clarify cx9020 is a
  Beckhoff board and not from Freescale/NXP/Qualcomm
- add "bhf,cx9020" compatible
- remove ccat node and pin configuration as long as the ccat
  driver is not mainlined
- use dvi-connector + ti,tfp410 instead of panel-simple
- add newlines between property list and child nodes
- replace underscores in node names with hyphens
- replace magic number 0 with polarity defines from
  include/dt-bindings/gpio/gpio.h
- move rtc node into imx53.dtsi, change its name to 'srtc',
  to avoid a conflict with 'rtc' node in imx53-m53.dtsi
- rename regulator-3p2v
- drop imx53-qsb container node
- make iomux configuration explicit
- remove unused audmux
- remove unused led_pin_gpio3_23 configuration
- use blue gpio-leds as disk-activity indicators for mmc0 and mmc1
- add mmc indicator leds to sdhc pingroups
- keep node names in alphabetical order
- remove unused sata and ssi2
- remove unused pin configs from hoggrp
- add entry for Beckhoff related files to MAINTAINERS

v3: add missing changelog
v2:
- keep alphabetic order of dts/Makefile
- configure uart2 with 'fsl,dte-mode'
- use display-0 and panel-0 as node names
- remove unnecessary "simple-bus" for fixed regulators

Patrick Bruenn (4):
  dt-bindings: arm: Add entry for Beckhoff CX9020
  ARM: dts: imx53: add srtc node
  ARM: dts: imx53: add alternative UART2 configuration
  ARM: dts: imx: add CX9020 Embedded PC device tree

 Documentation/devicetree/bindings/arm/bhf.txt  |   6 +
 .../devicetree/bindings/vendor-prefixes.txt|   1 +
 MAINTAINERS|   5 +
 arch/arm/boot/dts/Makefile |   1 +
 arch/arm/boot/dts/imx53-cx9020.dts | 297 +
 arch/arm/boot/dts/imx53-pinfunc.h  |   4 +
 arch/arm/boot/dts/imx53.dtsi   |   9 +
 7 files changed, 323 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/bhf.txt
 create mode 100644 arch/arm/boot/dts/imx53-cx9020.dts

-- 
2.11.0

[PATCH v5 4/4] ARM: dts: imx: add CX9020 Embedded PC device tree

2017-07-26 Thread linux-kernel-dev

From: Patrick Bruenn 

The CX9020 differs from i.MX53 Quick Start Board by:
- use uart2 instead of uart1
- DVI-D connector instead of VGA
- no audio
- no SATA connector
- CCAT FPGA connected to emi
- enable rtc

Signed-off-by: Patrick Bruenn 
---
 arch/arm/boot/dts/Makefile |   1 +
 arch/arm/boot/dts/imx53-cx9020.dts | 297 +
 2 files changed, 298 insertions(+)
 create mode 100644 arch/arm/boot/dts/imx53-cx9020.dts

diff --git a/arch/arm/boot/dts/Makefile b/arch/arm/boot/dts/Makefile
index 4b17f35dc9a7..f0ba9be523e0 100644
--- a/arch/arm/boot/dts/Makefile
+++ b/arch/arm/boot/dts/Makefile
@@ -340,6 +340,7 @@ dtb-$(CONFIG_SOC_IMX51) += \
imx51-ts4800.dtb
 dtb-$(CONFIG_SOC_IMX53) += \
imx53-ard.dtb \
+   imx53-cx9020.dtb \
imx53-m53evk.dtb \
imx53-mba53.dtb \
imx53-qsb.dtb \
diff --git a/arch/arm/boot/dts/imx53-cx9020.dts b/arch/arm/boot/dts/imx53-cx9020.dts
new file mode 100644
index ..4f54fd4418a3
--- /dev/null
+++ b/arch/arm/boot/dts/imx53-cx9020.dts
@@ -0,0 +1,297 @@
+/*
+ * Copyright 2017 Beckhoff Automation GmbH & Co. KG
+ * based on imx53-qsb.dts
+ *
+ * The code contained herein is licensed under the GNU General Public
+ * License. You may obtain a copy of the GNU General Public License
+ * Version 2 or later at the following locations:
+ *
+ * http://www.opensource.org/licenses/gpl-license.html
+ * http://www.gnu.org/copyleft/gpl.html
+ */
+
+/dts-v1/;
+#include "imx53.dtsi"
+
+/ {
+   model = "Beckhoff CX9020 Embedded PC";
+   compatible = "bhf,cx9020", "fsl,imx53";
+
+   chosen {
+   stdout-path = 
+   };
+
+   memory {
+   reg = <0x7000 0x2000>,
+ <0xb000 0x2000>;
+   };
+
+   display-0 {
+   #address-cells =<1>;
+   #size-cells = <0>;
+   compatible = "fsl,imx-parallel-display";
+   interface-pix-fmt = "rgb24";
+   pinctrl-names = "default";
+   pinctrl-0 = <_ipu_disp0>;
+
+   port@0 {
+   reg = <0>;
+
+   display0_in: endpoint {
+   remote-endpoint = <_di0_disp0>;
+   };
+   };
+
+   port@1 {
+   reg = <1>;
+
+   display0_out: endpoint {
+   remote-endpoint = <_in>;
+   };
+   };
+   };
+
+   dvi-connector {
+   compatible = "dvi-connector";
+   ddc-i2c-bus = <>;
+   digital;
+
+   port {
+   dvi_connector_in: endpoint {
+   remote-endpoint = <_out>;
+   };
+   };
+   };
+
+   dvi-converter {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   compatible = "ti,tfp410";
+
+   port@0 {
+   reg = <0>;
+
+   tfp410_in: endpoint {
+   remote-endpoint = <_out>;
+   };
+   };
+
+   port@1 {
+   reg = <1>;
+
+   tfp410_out: endpoint {
+   remote-endpoint = <_connector_in>;
+   };
+   };
+   };
+
+   leds {
+   compatible = "gpio-leds";
+
+   pwr-r {
+   gpios = < 22 GPIO_ACTIVE_HIGH>;
+   default-state = "off";
+   };
+
+   pwr-g {
+   gpios = < 24 GPIO_ACTIVE_HIGH>;
+   default-state = "on";
+   };
+
+   pwr-b {
+   gpios = < 23 GPIO_ACTIVE_HIGH>;
+   default-state = "off";
+   };
+
+   sd1-b {
+   linux,default-trigger = "mmc0";
+   gpios = < 20 GPIO_ACTIVE_HIGH>;
+   };
+
+   sd2-b {
+   linux,default-trigger = "mmc1";
+   gpios = < 17 GPIO_ACTIVE_HIGH>;
+   };
+   };
+
+   regulator-3p2v {
+   compatible = "regulator-fixed";
+   regulator-name = "3P2V";
+   regulator-min-microvolt = <3200000>;
+   regulator-max-microvolt = <3200000>;
+   regulator-always-on;
+   };
+
+   reg_usb_vbus: regulator-vbus {
+   compatible = "regulator-fixed";
+   regulator-name = "usb_vbus";
+   regulator-min-microvolt = <5000000>;
+   regulator-max-microvolt = <5000000>;
+   gpio = < 8 GPIO_ACTIVE_HIGH>;
+   enable-active-high;
+   };
+};
+
+ {
+   pinctrl-names = "default";
+   pinctrl-0 = <_esdhc1>;
+   cd-gpios = < 1 GPIO_ACTIVE_LOW>;
+

[PATCH v5 1/4] dt-bindings: arm: Add entry for Beckhoff CX9020

2017-07-26 Thread linux-kernel-dev

From: Patrick Bruenn 

- add vendor prefix bhf for Beckhoff
- add new board binding bhf,cx9020

Signed-off-by: Patrick Bruenn 
---
 Documentation/devicetree/bindings/arm/bhf.txt | 6 ++
 Documentation/devicetree/bindings/vendor-prefixes.txt | 1 +
 MAINTAINERS   | 5 +
 3 files changed, 12 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/bhf.txt

diff --git a/Documentation/devicetree/bindings/arm/bhf.txt b/Documentation/devicetree/bindings/arm/bhf.txt
new file mode 100644
index ..886b503caf9c
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/bhf.txt
@@ -0,0 +1,6 @@
+Beckhoff Automation Platforms Device Tree Bindings
+--
+
+CX9020 Embedded PC
+Required root node properties:
+- compatible = "bhf,cx9020", "fsl,imx53";
diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt
index daf465bef758..20c2cf57ebc9 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.txt
+++ b/Documentation/devicetree/bindings/vendor-prefixes.txt
@@ -47,6 +47,7 @@ avic  Shanghai AVIC Optoelectronics Co., Ltd.
 axentia   Axentia Technologies AB
 axis      Axis Communications AB
 bananapi  BIPAI KEJI LIMITED
+bhf       Beckhoff Automation GmbH & Co. KG
 boe       BOE Technology Group Co., Ltd.
 bosch     Bosch Sensortec GmbH
 boundary  Boundary Devices Inc.
diff --git a/MAINTAINERS b/MAINTAINERS
index f66488dfdbc9..e1d3111aea97 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1196,6 +1196,11 @@ F:   arch/arm/boot/dts/sama*.dtsi
 F: arch/arm/include/debug/at91.S
 F: drivers/memory/atmel*
 
+ARM/BECKHOFF SUPPORT
+M: Patrick Bruenn 
+S: Maintained
+F: Documentation/devicetree/bindings/arm/bhf.txt
+
 ARM/CALXEDA HIGHBANK ARCHITECTURE
 M: Rob Herring 
 L: linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
-- 
2.11.0

Re: [PATCH] mm, memcg: reset low limit during memcg offlining

2017-07-26 Thread Tejun Heo

Hello, Vladimir.

On Wed, Jul 26, 2017 at 11:30:17AM +0300, Vladimir Davydov wrote:
> > As I understand, css_reset() callback is intended to _completely_ disable
> > all limits, as if there were no cgroup at all.
> 
> But that's exactly what cgroup offline is: deletion of a cgroup as if it
> never existed. The fact that we leave the zombie dangling until all
> pages charged to the cgroup are gone is an implementation detail. IIRC
> we would "reparent" those charges and delete the mem_cgroup right away
> if it were not inherently racy.

That may be true for memcg but not in general.  Think about writeback
IOs servicing dirty pages of a removed cgroup.  Removing a cgroup
shouldn't grant it more resources than when it was alive and changing
the membership to the parent will break that.  For memcg, they seem
the same just because no new major consumption can be generated after
removal.

> The user can't tweak limits of an offline cgroup, because the cgroup
> directory no longer exists. So IMHO resetting all limits is reasonable.
> If you want to keep the cgroup limits effective, you shouldn't have
> deleted it in the first place, I suppose.

I don't think that's the direction we wanna go.  Granting more
resources on removal is surprising.

Thanks.

-- 
tejun

Re: [PATCH v2 01/11] ASoC: samsung: s3c2412: Handle return value of clk_prepare_enable.

2017-07-26 Thread Arvind Yadav


Hi,


On Wednesday 26 July 2017 04:58 PM, Mark Brown wrote:
> On Wed, Jul 26, 2017 at 11:15:25AM +0530, Arvind Yadav wrote:
>
> > --- a/sound/soc/samsung/s3c2412-i2s.c
> > +++ b/sound/soc/samsung/s3c2412-i2s.c
> > @@ -65,13 +65,16 @@ static int s3c2412_i2s_probe(struct snd_soc_dai *dai)
> > s3c2412_i2s.iis_cclk = devm_clk_get(dai->dev, "i2sclk");
> > if (IS_ERR(s3c2412_i2s.iis_cclk)) {
> > pr_err("failed to get i2sclk clock\n");
> > -   return PTR_ERR(s3c2412_i2s.iis_cclk);
> > +   ret = PTR_ERR(s3c2412_i2s.iis_cclk);
> > +   goto err;
> > }
>
> Why are we making this unrelated change?  None of the error handling we
> jump to is relevant if this fails...

s3c_i2sv2_probe enables the "iis" clock. If devm_clk_get(, "i2sclk") fails,
we need to disable and free the "iis" clock.

> > /* Set MPLL as the source for IIS CLK */
> >
> > clk_set_parent(s3c2412_i2s.iis_cclk, clk_get(NULL, "mpll"));
> > -   clk_prepare_enable(s3c2412_i2s.iis_cclk);
> > +   ret = clk_prepare_enable(s3c2412_i2s.iis_cclk);
> > +   if (ret)
> > +   goto err;
> >
> > s3c2412_i2s.iis_cclk = s3c2412_i2s.iis_pclk;
> >
> > @@ -80,6 +83,11 @@ static int s3c2412_i2s_probe(struct snd_soc_dai *dai)
> > S3C_GPIO_PULL_NONE);
> >
> > return 0;
> > +
> > +err:
> > +   clk_disable(s3c2412_i2s.iis_pclk);
>
> This will disable the clock if we failed to enable it which is clearly
> not correct.  It's also matching a clk_prepare_enable() with a
> clk_disable() which is going to leave an unbalanced prepare.

s3c_i2sv2_probe enables the "iis" clock, and s3c2412_i2s_probe enables the
"i2sclk" and "mpll" clocks. If clk_prepare_enable for "mpll" fails, we need
to disable and free the "iis" clock; devm will handle the other clock,
"i2sclk". In this code we have used "s3c2412_i2s.iis_cclk" for all the
clocks, which I find confusing.
Please correct me if I am wrong.

~arvind
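
As an illustration of the balancing point raised in the review (a
prepare+enable has to be undone by a disable+unprepare, and only for
clocks that were actually enabled), here is a small userspace model. It
is not the driver's code: struct mock_clk and the mock_* helpers are
invented stand-ins for the kernel clk API, and which clock is enabled
where is an assumption made for the sketch.

```c
#include <assert.h>

/* Hypothetical mock of a clock; not the kernel's struct clk. */
struct mock_clk {
	int prepare_count;
	int enable_count;
	int fail_enable;	/* make mock_prepare_enable() fail */
};

static int mock_prepare_enable(struct mock_clk *c)
{
	if (c->fail_enable)
		return -1;	/* on failure nothing stays prepared or enabled */
	c->prepare_count++;
	c->enable_count++;
	return 0;
}

static void mock_disable_unprepare(struct mock_clk *c)
{
	c->enable_count--;
	c->prepare_count--;
}

/* Probe sketch: pclk is enabled first, then cclk; unwind only what succeeded. */
static int probe(struct mock_clk *pclk, struct mock_clk *cclk)
{
	int ret;

	ret = mock_prepare_enable(pclk);
	if (ret)
		return ret;

	ret = mock_prepare_enable(cclk);
	if (ret)
		goto err_pclk;

	return 0;

err_pclk:
	mock_disable_unprepare(pclk);	/* undoes both enable and prepare */
	return ret;
}
```

In the model, a failure to enable the second clock leaves both counters
of the first clock back at zero, and the clock that failed is never
touched by the error path.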

Re: [PATCH] lib/strscpy: avoid KASAN false positive

2017-07-26 Thread Dmitry Vyukov

On Wed, Jul 19, 2017 at 6:05 PM, Dave Jones  wrote:
> On Wed, Jul 19, 2017 at 11:39:32AM -0400, Chris Metcalf wrote:
>
>  > > We could just remove all that word-at-a-time logic.  Do we have any
>  > > evidence that this would harm anything?
>  >
>  > The word-at-a-time logic was part of the initial commit since I wanted
>  > to ensure that strscpy could be used to replace strlcpy or strncpy without
>  > serious concerns about performance.
>
> I'm curious what the typical length of the strings we're concerned about
> in this case are if this makes a difference.


My vote is for proceeding with Andrey's original patch. It's not
perfect, but it's simple, short, minimally intrusive and fixes the
problem at hand. We can do something more fundamental when/if we have
more such cases.
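
For reference, the alternative floated earlier in the thread (dropping
the word-at-a-time logic) would amount to something like the following
byte-at-a-time copy. This is a userspace sketch with a made-up name, not
Andrey's patch and not the kernel implementation; it only mirrors the
strscpy contract of returning the copied length or -E2BIG on truncation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical byte-at-a-time variant: copies at most count - 1 bytes,
 * always NUL-terminates, and never reads past the source terminator.
 */
static long strscpy_bytewise(char *dest, const char *src, size_t count)
{
	size_t i;

	if (count == 0)
		return -E2BIG;

	for (i = 0; i < count - 1 && src[i] != '\0'; i++)
		dest[i] = src[i];
	dest[i] = '\0';

	/* src[i] is NUL only if the whole string fit. */
	return src[i] == '\0' ? (long)i : -E2BIG;
}
```

Because it reads one byte at a time, a loop like this never touches
memory beyond the terminator, which is the property KASAN cares about;
the cost is the performance concern Chris raises above.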

[PATCH v5 2/4] ARM: dts: imx53: add srtc node

2017-07-26 Thread linux-kernel-dev

From: Patrick Bruenn 

The i.MX53 has an integrated secure real time clock. Add it to the dtsi.

Signed-off-by: Patrick Bruenn 
---
 arch/arm/boot/dts/imx53.dtsi | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/arm/boot/dts/imx53.dtsi b/arch/arm/boot/dts/imx53.dtsi
index 2e516f4985e4..8bf0d89cdd35 100644
--- a/arch/arm/boot/dts/imx53.dtsi
+++ b/arch/arm/boot/dts/imx53.dtsi
@@ -433,6 +433,15 @@
clock-names = "ipg", "per";
};
 
+   srtc: srtc@53fa4000 {
+   compatible = "fsl,imx53-rtc", "fsl,imx25-rtc";
+   reg = <0x53fa4000 0x4000>;
+   interrupts = <24>;
+   interrupt-parent = <>;
+   clocks = < IMX5_CLK_SRTC_GATE>;
+   clock-names = "ipg";
+   };
+
iomuxc: iomuxc@53fa8000 {
compatible = "fsl,imx53-iomuxc";
reg = <0x53fa8000 0x4000>;
-- 
2.11.0

[PATCH v5 3/4] ARM: dts: imx53: add alternative UART2 configuration

2017-07-26 Thread linux-kernel-dev

From: Patrick Bruenn 

UART2 on EIM_D26 - EIM_D29 pins supports interchanging RXD/TXD pins
and RTS/CTS pins.
One board using these alternate settings is Beckhoff CX9020. Add the
alternative configuration here, to make it available to others, too.

Signed-off-by: Patrick Bruenn 
---
 arch/arm/boot/dts/imx53-pinfunc.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/arm/boot/dts/imx53-pinfunc.h b/arch/arm/boot/dts/imx53-pinfunc.h
index aec406bc65eb..59f9c29e3fe2 100644
--- a/arch/arm/boot/dts/imx53-pinfunc.h
+++ b/arch/arm/boot/dts/imx53-pinfunc.h
@@ -524,6 +524,7 @@
 #define MX53_PAD_EIM_D25__UART1_DSR           0x140 0x488 0x000 0x7 0x0
 #define MX53_PAD_EIM_D26__EMI_WEIM_D_26       0x144 0x48c 0x000 0x0 0x0
 #define MX53_PAD_EIM_D26__GPIO3_26            0x144 0x48c 0x000 0x1 0x0
+#define MX53_PAD_EIM_D26__UART2_RXD_MUX       0x144 0x48c 0x880 0x2 0x0
 #define MX53_PAD_EIM_D26__UART2_TXD_MUX       0x144 0x48c 0x000 0x2 0x0
 #define MX53_PAD_EIM_D26__FIRI_RXD            0x144 0x48c 0x80c 0x3 0x0
 #define MX53_PAD_EIM_D26__IPU_CSI0_D_1        0x144 0x48c 0x000 0x4 0x0
@@ -533,6 +534,7 @@
 #define MX53_PAD_EIM_D27__EMI_WEIM_D_27       0x148 0x490 0x000 0x0 0x0
 #define MX53_PAD_EIM_D27__GPIO3_27            0x148 0x490 0x000 0x1 0x0
 #define MX53_PAD_EIM_D27__UART2_RXD_MUX       0x148 0x490 0x880 0x2 0x1
+#define MX53_PAD_EIM_D27__UART2_TXD_MUX       0x148 0x490 0x000 0x2 0x0
 #define MX53_PAD_EIM_D27__FIRI_TXD            0x148 0x490 0x000 0x3 0x0
 #define MX53_PAD_EIM_D27__IPU_CSI0_D_0        0x148 0x490 0x000 0x4 0x0
 #define MX53_PAD_EIM_D27__IPU_DI1_PIN13       0x148 0x490 0x000 0x5 0x0
@@ -541,6 +543,7 @@
 #define MX53_PAD_EIM_D28__EMI_WEIM_D_28       0x14c 0x494 0x000 0x0 0x0
 #define MX53_PAD_EIM_D28__GPIO3_28            0x14c 0x494 0x000 0x1 0x0
 #define MX53_PAD_EIM_D28__UART2_CTS           0x14c 0x494 0x000 0x2 0x0
+#define MX53_PAD_EIM_D28__UART2_RTS           0x14c 0x494 0x87c 0x2 0x0
 #define MX53_PAD_EIM_D28__IPU_DISPB0_SER_DIO  0x14c 0x494 0x82c 0x3 0x1
 #define MX53_PAD_EIM_D28__CSPI_MOSI           0x14c 0x494 0x788 0x4 0x1
 #define MX53_PAD_EIM_D28__I2C1_SDA            0x14c 0x494 0x818 0x5 0x1
@@ -548,6 +551,7 @@
 #define MX53_PAD_EIM_D28__IPU_DI0_PIN13       0x14c 0x494 0x000 0x7 0x0
 #define MX53_PAD_EIM_D29__EMI_WEIM_D_29       0x150 0x498 0x000 0x0 0x0
 #define MX53_PAD_EIM_D29__GPIO3_29            0x150 0x498 0x000 0x1 0x0
+#define MX53_PAD_EIM_D29__UART2_CTS           0x150 0x498 0x000 0x2 0x0
 #define MX53_PAD_EIM_D29__UART2_RTS           0x150 0x498 0x87c 0x2 0x1
 #define MX53_PAD_EIM_D29__IPU_DISPB0_SER_RS   0x150 0x498 0x000 0x3 0x0
 #define MX53_PAD_EIM_D29__CSPI_SS0            0x150 0x498 0x78c 0x4 0x2
-- 
2.11.0

Re: [PATCH v3 3/3] power: wm831x_power: Support USB charger current limit management

2017-07-26 Thread Sebastian Reichel

Hi,

On Wed, Jul 26, 2017 at 11:05:25AM +0800, Baolin Wang wrote:
> On 25 July 2017 at 17:59, Sebastian Reichel
>  wrote:
> > On Tue, Jul 25, 2017 at 04:00:01PM +0800, Baolin Wang wrote:
> >> Integrate with the newly added USB charger interface to limit the current
> >> we draw from the USB input based on the input device configuration
> >> identified by the USB stack, allowing us to charge more quickly from high
> >> current inputs without drawing more current than specified from others.
> >>
> >> Signed-off-by: Mark Brown 
> >> Signed-off-by: Baolin Wang 
> >> ---
> >>  Documentation/devicetree/bindings/mfd/wm831x.txt |1 +
> >>  drivers/power/supply/wm831x_power.c  |   58 
> >> ++
> >>  2 files changed, 59 insertions(+)
> >>
> >> diff --git a/Documentation/devicetree/bindings/mfd/wm831x.txt 
> >> b/Documentation/devicetree/bindings/mfd/wm831x.txt
> >> index 9f8b743..4e3bc07 100644
> >> --- a/Documentation/devicetree/bindings/mfd/wm831x.txt
> >> +++ b/Documentation/devicetree/bindings/mfd/wm831x.txt
> >> @@ -31,6 +31,7 @@ Required properties:
> >>  ../interrupt-controller/interrupts.txt
> >>
> >>  Optional sub-nodes:
> >> +  - usb-phy : Contains a phandle to the USB PHY.
> >>- regulators : Contains sub-nodes for each of the regulators supplied by
> >>  the device. The regulators are bound using their names listed below:
> >>
> >> diff --git a/drivers/power/supply/wm831x_power.c 
> >> b/drivers/power/supply/wm831x_power.c
> >> index 7082301..d3948ab 100644
> >> --- a/drivers/power/supply/wm831x_power.c
> >> +++ b/drivers/power/supply/wm831x_power.c
> >> @@ -13,6 +13,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>
> >>  #include 
> >>  #include 
> >> @@ -31,6 +32,8 @@ struct wm831x_power {
> >>   char usb_name[20];
> >>   char battery_name[20];
> >>   bool have_battery;
> >> + struct usb_phy *usb_phy;
> >> + struct notifier_block usb_notify;
> >>  };
> >>
> >>  static int wm831x_power_check_online(struct wm831x *wm831x, int supply,
> >> @@ -125,6 +128,43 @@ static int wm831x_usb_get_prop(struct power_supply 
> >> *psy,
> >>   POWER_SUPPLY_PROP_VOLTAGE_NOW,
> >>  };
> >>
> >> +/* In milliamps */
> >> +static const unsigned int wm831x_usb_limits[] = {
> >> + 0,
> >> + 2,
> >> + 100,
> >> + 500,
> >> + 900,
> >> + 1500,
> >> + 1800,
> >> + 550,
> >> +};
> >> +
> >> +static int wm831x_usb_limit_change(struct notifier_block *nb,
> >> +unsigned long limit, void *data)
> >> +{
> >> + struct wm831x_power *wm831x_power = container_of(nb,
> >> +  struct wm831x_power,
> >> +  usb_notify);
> >> + unsigned int i, best;
> >> +
> >> + /* Find the highest supported limit */
> >> + best = 0;
> >> + for (i = 0; i < ARRAY_SIZE(wm831x_usb_limits); i++) {
> >> + if (limit >= wm831x_usb_limits[i] &&
> >> + wm831x_usb_limits[best] < wm831x_usb_limits[i])
> >> + best = i;
> >> + }
> >> +
> >> + dev_dbg(wm831x_power->wm831x->dev,
> >> + "Limiting USB current to %umA", wm831x_usb_limits[best]);
> >> +
> >> + wm831x_set_bits(wm831x_power->wm831x, WM831X_POWER_STATE,
> >> + WM831X_USB_ILIM_MASK, best);
> >> +
> >> + return 0;
> >> +}
> >> +
> >>  /*
> >>   *   Battery properties
> >>   */
> >> @@ -607,6 +647,19 @@ static int wm831x_power_probe(struct platform_device 
> >> *pdev)
> >>   }
> >>   }
> >>
> >> + power->usb_phy = devm_usb_get_phy_by_phandle(>dev,
> >> +  "usb-phy", 0);
> >> + if (!IS_ERR(power->usb_phy)) {
> >> + power->usb_notify.notifier_call = wm831x_usb_limit_change;
> >> + ret = usb_register_notifier(power->usb_phy,
> >> + >usb_notify);
> >> + if (ret) {
> >> + dev_err(>dev, "Failed to register notifier: 
> >> %d\n",
> >> + ret);
> >> + goto err_bat_irq;
> >> + }
> >> + }
> >
> > No error handling for power->usb_phy? I think you should bail out
> > for all errors except for "not defined in DT". Especially I would
> > expect probe defer handling in case the power supply driver is
> > loaded before the phy driver.
> 
> Make sense. So I think I need to change like below:
> 
> power->usb_phy = devm_usb_get_phy_by_phandle(>dev, "usb-phy", 0);
> if (!IS_ERR(power->usb_phy)) {
> power->usb_notify.notifier_call = wm831x_usb_limit_change;
> ret = usb_register_notifier(power->usb_phy, >usb_notify);
> if (ret) {
> dev_err(>dev, "Failed to

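For readers following the thread, the limit selection in the quoted wm831x_usb_limit_change() can be tried outside the kernel. The following is a minimal userspace sketch, not the driver code: pick_usb_limit() and usb_limits_ma are hypothetical names, and the table values are copied from the quoted patch.

```c
#include <assert.h>
#include <stddef.h>

/* In milliamps; values copied from the quoted patch. The array index is
 * the value the patch writes into the USB current limit register field. */
static const unsigned int usb_limits_ma[] = {
	0, 2, 100, 500, 900, 1500, 1800, 550,
};

#define N_LIMITS (sizeof(usb_limits_ma) / sizeof(usb_limits_ma[0]))

/* Hypothetical helper: return the index of the largest table entry that
 * does not exceed the requested limit. The table is not sorted, so every
 * entry is checked. */
static size_t pick_usb_limit(unsigned long limit_ma)
{
	size_t i, best = 0;

	for (i = 0; i < N_LIMITS; i++) {
		if (limit_ma >= usb_limits_ma[i] &&
		    usb_limits_ma[best] < usb_limits_ma[i])
			best = i;
	}
	/* Entry 0 is 0 mA, so the result never exceeds the request. */
	assert(usb_limits_ma[best] <= limit_ma);
	return best;
}
```

With this table, a request of 600 mA selects index 7 (550 mA), because the next larger entry, 900 mA, would exceed the request.
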
Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour

2017-07-26 Thread Punit Agrawal

Hi Michal,

Michal Hocko  writes:

> On Wed 26-07-17 10:50:38, Michal Hocko wrote:
>> On Tue 25-07-17 16:41:14, Punit Agrawal wrote:
>> > When walking the page tables to resolve an address that points to
>> > !p*d_present() entry, huge_pte_offset() returns inconsistent values
>> > depending on the level of page table (PUD or PMD).
>> > 
>> > It returns NULL in the case of a PUD entry while in the case of a PMD
>> > entry, it returns a pointer to the page table entry.
>> > 
>> > A similar inconsitency exists when handling swap entries - returns NULL
>> > for a PUD entry while a pointer to the pte_t is retured for the PMD
>> > entry.
>> > 
>> > Update huge_pte_offset() to make the behaviour consistent - return NULL
>> > in the case of p*d_none() and a pointer to the pte_t for hugepage or
>> > swap entries.
>> > 
>> > Document the behaviour to clarify the expected behaviour of this
>> > function. This is to set clear semantics for architecture specific
>> > implementations of huge_pte_offset().
>> 
>> hugetlb pte semantic is a disaster and I agree it could see some
>> cleanup/clarifications but I am quite nervous to see a patchi like this.
>> How do we check that nothing will get silently broken by this change?

Glad I'm not the only one who finds the hugetlb semantics somewhat
confusing. :)

I've been running tests from the mce-test suite and libhugetlbfs for similar
changes we did on arm64. There could be assumptions that were not
exercised but I'm not sure how to check for all the possible usages.

Do you have any other suggestions that can help improve confidence in
the patch?

>
> Forgot to add. Hugetlb have been special because of the pte sharing. I
> haven't looked into that code for quite some time but there might be a
> good reason why pud behave differently.

I checked the code and don't see anything that would explain (or
require) the difference in behaviour.
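
The semantics proposed in the quoted patch description (NULL for p*d_none(), a pointer for hugepage or swap entries, at both levels) can be modelled with a toy two-level walk. This is only a sketch under simplifying assumptions: the types and toy_huge_pte_offset() are hypothetical and do not correspond to any architecture's page table layout.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a page table entry's state; not the kernel's types. */
enum entry_state {
	ENTRY_NONE,	/* p*d_none(): nothing mapped */
	ENTRY_TABLE,	/* present, points to the next level */
	ENTRY_HUGE,	/* present huge page mapping */
	ENTRY_SWAP,	/* not present, holds a swap entry */
};

struct toy_pud {
	enum entry_state state;
	enum entry_state pmd_state;	/* only meaningful for ENTRY_TABLE */
};

/* Return the address of the entry that holds the huge or swap mapping,
 * or NULL when the walk reaches an empty entry. The same rule is applied
 * at the PUD level and at the PMD level. */
static const void *toy_huge_pte_offset(const struct toy_pud *pud)
{
	assert(pud != NULL);

	if (pud->state == ENTRY_NONE)
		return NULL;
	if (pud->state == ENTRY_HUGE || pud->state == ENTRY_SWAP)
		return &pud->state;

	/* The PUD points to a PMD table: apply the same rule one level down. */
	if (pud->pmd_state == ENTRY_HUGE || pud->pmd_state == ENTRY_SWAP)
		return &pud->pmd_state;

	/* Empty PMD, or a PMD that points to a regular PTE table. */
	return NULL;
}
```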

[PATCH] init:main.c: Fixed issues for Block comments and

2017-07-26 Thread janani-sankarababu

Signed-off-by: Janani S 

---
 init/main.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/init/main.c b/init/main.c
index 052481f..f8eb4966 100644
--- a/init/main.c
+++ b/init/main.c
@@ -181,7 +181,8 @@ static bool __init obsolete_checksetup(char *line)
/* Already done in parse_early_param?
 * (Needs exact match on param part).
 * Keep iterating, as we can have early
-* params and __setups of same names 8( */
+* params and __setups of same names
+*/
if (line[n] == '\0' || line[n] == '=')
had_early_param = true;
} else if (!p->setup_func) {
@@ -693,9 +694,9 @@ asmlinkage __visible void __init start_kernel(void)
arch_post_acpi_subsys_init();
sfi_init_late();
 
-   if (efi_enabled(EFI_RUNTIME_SERVICES)) {
+   if (efi_enabled(EFI_RUNTIME_SERVICES))
efi_free_boot_services();
-   }
+
 
/* Do the rest non-__init'ed, we're now alive */
rest_init();
-- 
1.9.1

[tip:locking/core] kasan: Allow kasan_check_read/write() to accept pointers to volatiles

2017-07-26 Thread tip-bot for Dmitry Vyukov

Commit-ID:  f06e8c584fa0d05312c11ea66194f3d2efb93c21
Gitweb: http://git.kernel.org/tip/f06e8c584fa0d05312c11ea66194f3d2efb93c21
Author: Dmitry Vyukov 
AuthorDate: Thu, 22 Jun 2017 16:14:17 +0200
Committer:  Ingo Molnar 
CommitDate: Wed, 26 Jul 2017 13:08:54 +0200

kasan: Allow kasan_check_read/write() to accept pointers to volatiles

Currently kasan_check_read/write() accept 'const void*', make them
accept 'const volatile void*'. This is required for instrumentation
of atomic operations and there is just no reason to not allow that.

Signed-off-by: Dmitry Vyukov 
Reviewed-by: Andrey Ryabinin 
Acked-by: Mark Rutland 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: kasan-...@googlegroups.com
Cc: linux...@kvack.org
Cc: will.dea...@arm.com
Link: http://lkml.kernel.org/r/33e5ec275c1ee89299245b2ebbccd63709c6021f.1498140838.git.dvyu...@google.com
Signed-off-by: Ingo Molnar 
---
 include/linux/kasan-checks.h | 10 ++
 mm/kasan/kasan.c |  4 ++--
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/kasan-checks.h b/include/linux/kasan-checks.h
index b7f8ace..41960fe 100644
--- a/include/linux/kasan-checks.h
+++ b/include/linux/kasan-checks.h
@@ -2,11 +2,13 @@
 #define _LINUX_KASAN_CHECKS_H
 
 #ifdef CONFIG_KASAN
-void kasan_check_read(const void *p, unsigned int size);
-void kasan_check_write(const void *p, unsigned int size);
+void kasan_check_read(const volatile void *p, unsigned int size);
+void kasan_check_write(const volatile void *p, unsigned int size);
 #else
-static inline void kasan_check_read(const void *p, unsigned int size) { }
-static inline void kasan_check_write(const void *p, unsigned int size) { }
+static inline void kasan_check_read(const volatile void *p, unsigned int size)
+{ }
+static inline void kasan_check_write(const volatile void *p, unsigned int size)
+{ }
 #endif
 
 #endif
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index ca11bc4..6f319fb 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -267,13 +267,13 @@ static void check_memory_region(unsigned long addr,
check_memory_region_inline(addr, size, write, ret_ip);
 }
 
-void kasan_check_read(const void *p, unsigned int size)
+void kasan_check_read(const volatile void *p, unsigned int size)
 {
check_memory_region((unsigned long)p, size, false, _RET_IP_);
 }
 EXPORT_SYMBOL(kasan_check_read);
 
-void kasan_check_write(const void *p, unsigned int size)
+void kasan_check_write(const volatile void *p, unsigned int size)
 {
check_memory_region((unsigned long)p, size, true, _RET_IP_);
 }
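
As an illustration of why the wider qualifier is convenient, here is a small userspace sketch, not kernel code. check_read() is a hypothetical stand-in for kasan_check_read() that only counts calls. In C, a pointer converts implicitly to a pointer to a more qualified version of the same type, so a `const volatile void *` parameter accepts plain, const and volatile pointers alike, whereas passing a `volatile int *` to a `const void *` parameter would discard the volatile qualifier and draw a compiler diagnostic.

```c
#include <assert.h>
#include <stddef.h>

static unsigned int checks_seen;

/* Hypothetical stand-in for kasan_check_read(); it only counts calls. */
static void check_read(const volatile void *p, unsigned int size)
{
	assert(p != NULL && size > 0);
	checks_seen++;
}

/* Sketch of an instrumented read of a volatile counter, as an atomic
 * helper might do before touching the memory. */
static int read_counter(const volatile int *counter)
{
	check_read(counter, sizeof(*counter));	/* no cast needed */
	return *counter;
}
```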

Re: [PATCH] arm64: Convert to using %pOF instead of full_name

2017-07-26 Thread Dan Carpenter

Sorry about the false positive.  I will push a fix for that later today
or tomorrow at the latest.

regards,
dan carpenter

[tip:x86/asm] x86/kconfig: Make it easier to switch to the new ORC unwinder

2017-07-26 Thread tip-bot for Josh Poimboeuf

Commit-ID:  a34a766ff96d9e88572e35a45066279e40a85d84
Gitweb: http://git.kernel.org/tip/a34a766ff96d9e88572e35a45066279e40a85d84
Author: Josh Poimboeuf 
AuthorDate: Mon, 24 Jul 2017 18:36:58 -0500
Committer:  Ingo Molnar 
CommitDate: Wed, 26 Jul 2017 13:18:20 +0200

x86/kconfig: Make it easier to switch to the new ORC unwinder

A couple of Kconfig changes which make it much easier to switch to the
new CONFIG_ORC_UNWINDER:

1) Remove x86 dependencies on CONFIG_FRAME_POINTER for lockdep,
   latencytop, and fault injection.  x86 has a 'guess' unwinder which
   just scans the stack for kernel text addresses.  It's not 100%
   accurate but in many cases it's good enough.  This allows those users
   who don't want the text overhead of the frame pointer or ORC
   unwinders to still use these features.  More importantly, this also
   makes it much more straightforward to disable frame pointers.

2) Make CONFIG_ORC_UNWINDER depend on !CONFIG_FRAME_POINTER.  While it
   would be possible to have both enabled, it doesn't really make sense
   to do so.  So enforce a sane configuration to prevent the user from
   making a dumb mistake.

With these changes, when you disable CONFIG_FRAME_POINTER, "make
oldconfig" will ask if you want to enable CONFIG_ORC_UNWINDER.

Signed-off-by: Josh Poimboeuf 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Jiri Slaby 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: live-patch...@vger.kernel.org
Link: http://lkml.kernel.org/r/9985fb91ce5005fe33ea5cc2a20f14bd33c61d03.1500938583.git.jpoim...@redhat.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/Kconfig.debug | 7 +++
 lib/Kconfig.debug  | 6 +++---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index dc10ec6..268a318 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -357,7 +357,7 @@ config PUNIT_ATOM_DEBUG
 
 config ORC_UNWINDER
bool "ORC unwinder"
-   depends on X86_64
+   depends on X86_64 && !FRAME_POINTER
select STACK_VALIDATION
---help---
  This option enables the ORC (Oops Rewind Capability) unwinder for
@@ -365,9 +365,8 @@ config ORC_UNWINDER
  a simplified version of the DWARF Call Frame Information standard.
 
  This unwinder is more accurate across interrupt entry frames than the
- frame pointer unwinder.  It can also enable a 5-10% performance
- improvement across the entire kernel if CONFIG_FRAME_POINTER is
- disabled.
+ frame pointer unwinder.  It also enables a 5-10% performance
+ improvement across the entire kernel compared to frame pointers.
 
  Enabling this option will increase the kernel's runtime memory usage
  by roughly 2-4MB, depending on your kernel config.
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 0f0d019..32a48e7 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1124,7 +1124,7 @@ config LOCKDEP
bool
depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
select STACKTRACE
-   select FRAME_POINTER if !MIPS && !PPC && !ARM_UNWIND && !S390 && !MICROBLAZE && !ARC && !SCORE
+   select FRAME_POINTER if !MIPS && !PPC && !ARM_UNWIND && !S390 && !MICROBLAZE && !ARC && !SCORE && !X86
select KALLSYMS
select KALLSYMS_ALL
 
@@ -1543,7 +1543,7 @@ config FAULT_INJECTION_STACKTRACE_FILTER
depends on FAULT_INJECTION_DEBUG_FS && STACKTRACE_SUPPORT
depends on !X86_64
select STACKTRACE
-   select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC && !SCORE
+   select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC && !SCORE && !X86
help
  Provide stacktrace filter for fault-injection capabilities
 
@@ -1552,7 +1552,7 @@ config LATENCYTOP
depends on DEBUG_KERNEL
depends on STACKTRACE_SUPPORT
depends on PROC_FS
-   select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC
+   select FRAME_POINTER if !MIPS && !PPC && !S390 && !MICROBLAZE && !ARM_UNWIND && !ARC && !X86
select KALLSYMS
select KALLSYMS_ALL
select STACKTRACE

[tip:x86/asm] x86/unwind: Add the ORC unwinder

2017-07-26 Thread tip-bot for Josh Poimboeuf

Commit-ID:  ee9f8fce99640811b2b8e79d0d1dbe8bab69ba67
Gitweb: http://git.kernel.org/tip/ee9f8fce99640811b2b8e79d0d1dbe8bab69ba67
Author: Josh Poimboeuf 
AuthorDate: Mon, 24 Jul 2017 18:36:57 -0500
Committer:  Ingo Molnar 
CommitDate: Wed, 26 Jul 2017 13:18:20 +0200

x86/unwind: Add the ORC unwinder

Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y.
It plugs into the existing x86 unwinder framework.

It relies on objtool to generate the needed .orc_unwind and
.orc_unwind_ip sections.

For more details on why ORC is used instead of DWARF, see
Documentation/x86/orc-unwinder.txt - but the short version is
that it's a simplified, fundamentally more robust debuginfo
data structure, which also allows up to two orders of magnitude
faster lookups than the DWARF unwinder - which matters to
profiling workloads like perf.

Thanks to Andy Lutomirski for the performance improvement ideas:
splitting the ORC unwind table into two parallel arrays and creating a
fast lookup table to search a subset of the unwind table.
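
To illustrate the parallel-array idea mentioned above, here is a minimal userspace sketch. It is not the code in arch/x86/kernel/unwind_orc.c; toy_orc_entry, toy_orc_find() and the field name are hypothetical. One sorted array holds instruction addresses, a second array of the same length holds the per-address data, and a binary search over the address array yields the index into both. A fast lookup table would only narrow the initial search window and is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-address unwind data, stored parallel to the ip array. */
struct toy_orc_entry {
	int sp_offset;
};

/* Return the index of the last address <= ip, or -1 if ip precedes the
 * whole table. ips[] must be sorted in ascending order. */
static long toy_orc_find(const unsigned long *ips, size_t n, unsigned long ip)
{
	size_t lo = 0, hi = n;	/* search window is [lo, hi) */

	assert(ips != NULL || n == 0);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ips[mid] <= ip)
			lo = mid + 1;
		else
			hi = mid;
	}
	/* lo is now the number of entries whose address is <= ip. */
	return (long)lo - 1;
}
```

Keeping the addresses in their own array means the search touches only densely packed integers; the entry array is read once, at the index the search returns.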

Signed-off-by: Josh Poimboeuf 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Jiri Slaby 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: live-patch...@vger.kernel.org
Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoim...@redhat.com
[ Extended the changelog. ]
Signed-off-by: Ingo Molnar 
---
 Documentation/x86/orc-unwinder.txt | 179 
 arch/um/include/asm/unwind.h   |   8 +
 arch/x86/Kconfig   |   1 +
 arch/x86/Kconfig.debug |  25 ++
 arch/x86/include/asm/module.h  |   9 +
 arch/x86/include/asm/orc_lookup.h  |  46 +++
 arch/x86/include/asm/orc_types.h   |   2 +-
 arch/x86/include/asm/unwind.h  |  76 +++--
 arch/x86/kernel/Makefile   |   8 +-
 arch/x86/kernel/module.c   |  11 +-
 arch/x86/kernel/setup.c|   3 +
 arch/x86/kernel/unwind_frame.c |  39 +--
 arch/x86/kernel/unwind_guess.c |   5 +
 arch/x86/kernel/unwind_orc.c   | 582 +
 arch/x86/kernel/vmlinux.lds.S  |   3 +
 include/asm-generic/vmlinux.lds.h  |  27 +-
 lib/Kconfig.debug  |   3 +
 scripts/Makefile.build |  14 +-
 18 files changed, 977 insertions(+), 64 deletions(-)

diff --git a/Documentation/x86/orc-unwinder.txt b/Documentation/x86/orc-unwinder.txt
new file mode 100644
index 000..af0c9a4
--- /dev/null
+++ b/Documentation/x86/orc-unwinder.txt
@@ -0,0 +1,179 @@
+ORC unwinder
+============
+
+Overview
+--------
+
+The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
+similar in concept to a DWARF unwinder.  The difference is that the
+format of the ORC data is much simpler than DWARF, which in turn allows
+the ORC unwinder to be much simpler and faster.
+
+The ORC data consists of unwind tables which are generated by objtool.
+They contain out-of-band data which is used by the in-kernel ORC
+unwinder.  Objtool generates the ORC data by first doing compile-time
+stack metadata validation (CONFIG_STACK_VALIDATION).  After analyzing
+all the code paths of a .o file, it determines information about the
+stack state at each instruction address in the file and outputs that
+information to the .orc_unwind and .orc_unwind_ip sections.
+
+The per-object ORC sections are combined at link time and are sorted and
+post-processed at boot time.  The unwinder uses the resulting data to
+correlate instruction addresses with their stack states at run time.
+
+
+ORC vs frame pointers
+---------------------
+
+With frame pointers enabled, GCC adds instrumentation code to every
+function in the kernel.  The kernel's .text size increases by about
+3.2%, resulting in a broad kernel-wide slowdown.  Measurements by Mel
+Gorman [1] have shown a slowdown of 5-10% for some workloads.
+
+In contrast, the ORC unwinder has no effect on text size or runtime
+performance, because the debuginfo is out of band.  So if you disable
+frame pointers and enable the ORC unwinder, you get a nice performance
+improvement across the board, and still have reliable stack traces.
+
+Ingo Molnar says:
+
+  "Note that it's not just a performance improvement, but also an
+  instruction cache locality improvement: 3.2% .text savings almost
+  directly transform into a similarly sized reduction in cache
+  footprint. That can transform to even higher speedups for workloads
+  whose cache locality is borderline."
+
+Another benefit of ORC compared to frame pointers is that it can
+reliably unwind across interrupts and exceptions.  Frame pointer based
+unwinds can sometimes skip the caller of the interrupted function, if it
+was a leaf function or if the interrupt hit before the frame pointer was
+saved.
+
+The main disadvantage of the ORC unwinder compared to frame pointers is
+that it needs more memory to store the ORC unwind tables: roughly 2-4MB
+depending

[tip:x86/asm] x86/kconfig: Consolidate unwinders into multiple choice selection

2017-07-26 Thread tip-bot for Josh Poimboeuf

Commit-ID:  81d387190039c14edac8de2b3ec789beb899afd9
Gitweb: http://git.kernel.org/tip/81d387190039c14edac8de2b3ec789beb899afd9
Author: Josh Poimboeuf 
AuthorDate: Tue, 25 Jul 2017 08:54:24 -0500
Committer:  Ingo Molnar 
CommitDate: Wed, 26 Jul 2017 14:05:36 +0200

x86/kconfig: Consolidate unwinders into multiple choice selection

There are three mutually exclusive unwinders.  Make that more obvious by
combining them into a multiple-choice selection:

  CONFIG_FRAME_POINTER_UNWINDER
  CONFIG_ORC_UNWINDER
  CONFIG_GUESS_UNWINDER (if CONFIG_EXPERT=y)

Frame pointers are still the default (for now).

The old CONFIG_FRAME_POINTER option is still used in some
arch-independent places, so keep it around, but make it
invisible to the user on x86 - it's now selected by
CONFIG_FRAME_POINTER_UNWINDER=y.

Suggested-by: Ingo Molnar 
Signed-off-by: Josh Poimboeuf 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Jiri Slaby 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: live-patch...@vger.kernel.org
Link: http://lkml.kernel.org/r/20170725135424.zukjmgpz3plf5pmt@treble
Signed-off-by: Ingo Molnar 
---
 arch/x86/Kconfig  |  3 +--
 arch/x86/Kconfig.debug| 47 ---
 arch/x86/configs/tiny.config  |  2 ++
 arch/x86/include/asm/unwind.h |  4 ++--
 4 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7ccf26a..9b30212 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -73,7 +73,6 @@ config X86
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-   select ARCH_WANT_FRAME_POINTERS
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANTS_THP_SWAP  if X86_64
select BUILDTIME_EXTABLE_SORT
@@ -168,7 +167,7 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
-   select HAVE_RELIABLE_STACKTRACE if X86_64 && FRAME_POINTER && STACK_VALIDATION
+   select HAVE_RELIABLE_STACKTRACE if X86_64 && FRAME_POINTER_UNWINDER && STACK_VALIDATION
select HAVE_STACK_VALIDATIONif X86_64
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 268a318..93bbb31 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -355,9 +355,32 @@ config PUNIT_ATOM_DEBUG
  The current power state can be read from
  /sys/kernel/debug/punit_atom/dev_power_state
 
+choice
+   prompt "Choose kernel unwinder"
+   default FRAME_POINTER_UNWINDER
+   ---help---
+ This determines which method will be used for unwinding kernel stack
+ traces for panics, oopses, bugs, warnings, perf, /proc//stack,
+ livepatch, lockdep, and more.
+
+config FRAME_POINTER_UNWINDER
+   bool "Frame pointer unwinder"
+   select FRAME_POINTER
+   ---help---
+ This option enables the frame pointer unwinder for unwinding kernel
+ stack traces.
+
+ The unwinder itself is fast and it uses less RAM than the ORC
+ unwinder, but the kernel text size will grow by ~3% and the kernel's
+ overall performance will degrade by roughly 5-10%.
+
+ This option is recommended if you want to use the livepatch
+ consistency model, as this is currently the only way to get a
+ reliable stack trace (CONFIG_HAVE_RELIABLE_STACKTRACE).
+
 config ORC_UNWINDER
bool "ORC unwinder"
-   depends on X86_64 && !FRAME_POINTER
+   depends on X86_64
select STACK_VALIDATION
---help---
  This option enables the ORC (Oops Rewind Capability) unwinder for
@@ -371,12 +394,22 @@ config ORC_UNWINDER
  Enabling this option will increase the kernel's runtime memory usage
  by roughly 2-4MB, depending on your kernel config.
 
-config FRAME_POINTER_UNWINDER
-   def_bool y
-   depends on !ORC_UNWINDER && FRAME_POINTER
-
 config GUESS_UNWINDER
-   def_bool y
-   depends on !ORC_UNWINDER && !FRAME_POINTER
+   bool "Guess unwinder"
+   depends on EXPERT
+   ---help---
+ This option enables the "guess" unwinder for unwinding kernel stack
+ traces.  It scans the stack and reports every kernel text address it
+ finds.  Some of the addresses it reports may be incorrect.
+
+ While this option often produces false positives, it can still be
+ useful in many cases.  Unlike the other unwinders, it has no runtime
+ overhead.
+
+endchoice
+
+config FRAME_POINTER
+   depends on !ORC_UNWINDER && !GUESS_UNWINDER
+   bool
 
 endmenu
diff --git a/arch/x86/configs/tiny.config b/arch/x86/configs/tiny.config
index 4b429df..550cd50 100644
---

Re: [PATCH net-next v2 01/10] net: dsa: lan9303: Fixed MDIO interface

2017-07-26 Thread Egil Hjelmeland


On 25. juli 2017 21:15, Vivien Didelot wrote:

Hi Egil,

Egil Hjelmeland  writes:


Fixes after testing on actual HW:

- lan9303_mdio_write()/_read() must multiply register number
   by 4 to get offset

- Indirect access (PMI) to phy register only work in I2C mode. In
   MDIO mode phy registers must be accessed directly. Introduced
   struct lan9303_phy_ops to handle the two modes. Renamed functions
   to clarify.

- lan9303_detect_phy_setup() : Failed MDIO read return 0x.
   Handle that.


Small patch series when possible are better. Bullet points in commit
messages are likely to describe how a patch or series may be split up
;-)

This patch seems to be the unique patch of the series resolving what is
described in the cover letter as "Make the MDIO interface work".

I'd suggest you split this one commit into several *atomic* and easy
to review patches and send them separately as one thread named "net: dsa:
lan9303: fix MDIO interface" (also note that the imperative is preferred for
subject lines, see: https://chris.beams.io/posts/git-commit/#imperative)

<...>


-static int lan9303_port_phy_reg_wait_for_completion(struct lan9303 *chip)
+static int lan9303_indirect_phy_wait_for_completion(struct lan9303 *chip)


For instance you can have a first commit only renaming the functions.
The reason for it is to separate the functional changes from cosmetic
changes, which makes it easier for review.

<...>


Thank you for reviewing.

I can split the first patch.

I can also split the patch series into more digestible series. But
since most of the patches touch the same file, I assume that each
series must be completed and applied before starting on a new one.
So I really want to group the patches into only a few series in order
to not spend months on the process.



+   if ((reg != 0) && (reg != 0x))


if (reg && reg != 0x) should be enough.


Of course.


+struct lan9303_phy_ops {
+   /* PHY 1 &2 access*/


The spacing is weird in the comment. "/* PHY 1 & 2 access */" maybe?



Yes.


+int lan9303_mdio_phy_write(struct lan9303 *chip, int phy, int regnum, u16 val)
+{
+   struct lan9303_mdio *sw_dev = dev_get_drvdata(chip->dev);
+   struct mdio_device *mdio = sw_dev->device;
+
+   mutex_lock(>bus->mdio_lock);
+   mdio->bus->write(mdio->bus, phy, regnum, val);
+   mutex_unlock(>bus->mdio_lock);


This is exactly what mdiobus_write(mdio->bus, phy, regnum, val) is
doing. There are very few valid reasons to go play in the mii_bus
structure; using the generic APIs is strongly preferred. Plus you get
checks and traces for free!



Lack of oversight was the only reason. I just adapted stuff from
lan9303_mdio_phy_write above. Will switch to mdiobus_write of course.


Same here, mdiobus_read().


Ditto.



Thanks,

 Vivien



Appreciated,
Egil

Re: [PATCH v7 12/13] ACPI / init: Invoke early ACPI initialization earlier

2017-07-26 Thread Dou Liyang


Hi Baoquan,

At 07/18/2017 04:45 PM, b...@redhat.com wrote:

On 07/18/17 at 02:08pm, Dou Liyang wrote:

Hi, Zheng

At 07/18/2017 01:18 PM, Zheng, Lv wrote:

Hi,

Can the problem be fixed by invoking acpi_put_table() for mapped DMAR table?


Invoking acpi_put_table() is my first choice. But it made the kernel
*panic* when we try to get the table again in intel_iommu_init() in
late stage.

I am also confused that:

There are two places where we used DMAR table in Linux:

1) In detect_intel_iommu() in ACPI early stage:

...
status = acpi_get_table(ACPI_SIG_DMAR, 0, _tbl);

if (dmar_tbl) {
acpi_put_table(dmar_tbl);
dmar_tbl = NULL;
}

2) In dmar_table_init() in ACPI late stage:

...
status = acpi_get_table(ACPI_SIG_DMAR, 0, _tbl);
...

As we know, dmar_table_init() is called by intel_iommu_init() and
intel_prepare_irq_remapping().

When I invoked acpi_put_table() in the intel_prepare_irq_remapping() in
early stage like 1) shows, kernel will panic.


That's because acpi_put_table() will make the table pointer be NULL,
while dmar_table_init() will skip parse_dmar_table() calling if
dmar_table_initialized is set to 1 in intel_prepare_irq_remapping().

Dmar hardware support interrupt remapping and io remapping separately. But
intel_iommu_init() is called later than intel_prepare_irq_remapping().
So what if make dmar_table_init() a reentrant function? You can just
have a try, but maybe not a good idea, the dmar table will be parsed
twice.
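
The caching behaviour described above can be shown with a small userspace sketch. It is only a model with hypothetical names (toy_table_init(), toy_parse()), not the code in drivers/iommu/dmar.c: a static flag records the first result, so later calls return it without parsing again, even if the data obtained by the first parse has since been released.

```c
#include <assert.h>

static int parse_calls;	/* how many times the table was parsed */

/* Hypothetical parser; returns 0 on success, negative on error. */
static int toy_parse(void)
{
	parse_calls++;
	return 0;
}

/* Init-once pattern: 0 = not yet run, 1 = succeeded, <0 = cached error. */
static int toy_table_init(void)
{
	static int initialized;

	if (initialized == 0) {
		int ret = toy_parse();

		initialized = ret < 0 ? ret : 1;
	}
	assert(initialized != 0);
	return initialized < 0 ? initialized : 0;
}
```

Calling toy_table_init() twice runs toy_parse() only once. Making the function re-entrant, as suggested, amounts to dropping the flag so that every call parses again.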


The real reason the kernel panics is that acpi_put_table() only
releases the DMAR table structure, not the remapping structures
parsed from the DMAR table, such as DRHD and RMRR. So the address of
the RMRR parsed in the early ACPI stage is used again in the late ACPI
stage in intel_iommu_init(), which makes the kernel panic.

The solution is to invoke intel_iommu_free_dmars() before
dmar_table_init() in intel_iommu_init() to release the RMRR.
Demo code is shown at the bottom.

I prefer to invoke acpi_early_init() earlier, but that needs a
regression test [1].

I am looking for a Thinkpad x121e (AMD E-450 APU) to test on. I have
tested it on a Thinkpad s430, and it works.

BTW, I am confused about how the ACPI subsystem affects the PIT, which
is used for fast calibration of the CPU frequency [2].

Do you have any idea?

[1] https://lkml.org/lkml/2014/3/10/123
[2] https://lkml.org/lkml/2014/3/12/3


 drivers/iommu/dmar.c| 27 +++
 drivers/iommu/intel-iommu.c |  2 ++
 drivers/iommu/intel_irq_remapping.c | 17 -
 include/linux/dmar.h|  2 ++
 init/main.c |  2 +-
 5 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index c8b0329..e6261b7 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -68,6 +68,8 @@ DECLARE_RWSEM(dmar_global_lock);
 LIST_HEAD(dmar_drhd_units);

 struct acpi_table_header * __initdata dmar_tbl;
+struct acpi_table_header * __initdata dmar_tbl_original;
+
 static int dmar_dev_scope_status = 1;
 static unsigned long dmar_seq_ids[BITS_TO_LONGS(DMAR_UNITS_SUPPORTED)];

@@ -627,6 +629,7 @@ parse_dmar_table(void)
 * fixed map.
 */
dmar_table_detect();
+   dmar_tbl_original = dmar_tbl;

/*
 * ACPI tables may not be DMA protected by tboot, so use DMAR copy
@@ -811,26 +814,18 @@ int __init dmar_dev_scope_init(void)

 int __init dmar_table_init(void)
 {
-   static int dmar_table_initialized;
int ret;

-   if (dmar_table_initialized == 0) {
-   ret = parse_dmar_table();
-   if (ret < 0) {
-   if (ret != -ENODEV)
-   pr_info("Parse DMAR table failure.\n");
-   } else  if (list_empty(_drhd_units)) {
-   pr_info("No DMAR devices found\n");
-   ret = -ENODEV;
-   }
-
-   if (ret < 0)
-   dmar_table_initialized = ret;
-   else
-   dmar_table_initialized = 1;
+   ret = parse_dmar_table();
+   if (ret < 0) {
+   if (ret != -ENODEV)
+   pr_info("Parse DMAR table failure.\n");
+   } else  if (list_empty(&dmar_drhd_units)) {
+   pr_info("No DMAR devices found\n");
+   ret = -ENODEV;
}

-   return dmar_table_initialized < 0 ? dmar_table_initialized : 0;
+   return ret;
 }

 static void warn_invalid_dmar(u64 addr, const char *message)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 687f18f..90f74f4 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4832,6 +4832,8 @@ int __init intel_iommu_init(void)
}

down_write(&dmar_global_lock);
+
+   intel_iommu_free_dmars();
if (dmar_table_init()) {
if (force_on)
panic("tboot: Failed to initialize DMAR table\n");
diff --git

Re: [PATCH] mm: take memory hotplug lock within numa_zonelist_order_handler()

2017-07-26 Thread Michal Hocko

On Wed 26-07-17 13:48:12, Heiko Carstens wrote:
> On Wed, Jul 26, 2017 at 01:31:12PM +0200, Michal Hocko wrote:
> > On Wed 26-07-17 13:17:38, Heiko Carstens wrote:
> > [...]
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 6d30e914afb6..fc32aa81f359 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4891,9 +4891,11 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
> > >   NUMA_ZONELIST_ORDER_LEN);
> > >   user_zonelist_order = oldval;
> > >   } else if (oldval != user_zonelist_order) {
> > > + mem_hotplug_begin();
> > >   mutex_lock(_mutex);
> > >   build_all_zonelists(NULL, NULL);
> > >   mutex_unlock(_mutex);
> > > + mem_hotplug_done();
> > >   }
> > >   }
> > >  out:
> > 
> > Please note that this code has been removed by
> > http://lkml.kernel.org/r/20170721143915.14161-2-mho...@kernel.org. It
> > will get to linux-next as soon as Andrew releases a new version of the
> > mmotm tree.
> 
> We still would need something for 4.13, no?

If this presents a real problem then yes. Has this happened in a real
workload or during some artificial test? I mean the code has been like
that for ages and nobody noticed/reported any problems.

That being said, I do not have anything against your patch. It is
trivial to rebase mine on top of yours. I am just not sure it is worth
the code churn. E.g. do you think this patch is stable backport
material?
-- 
Michal Hocko
SUSE Labs

Re: linux-next: unsigned commits in the drm-misc tree

2017-07-26 Thread Daniel Vetter

On Wed, Jul 26, 2017 at 9:09 AM, Daniel Vetter  wrote:
> Oops, that shouldn't have happened. Actually, our maintainer tooling
> ensures this doesn't happen, by auto-adding the committer sob line.
> But these patches (and a bunch of others pushed by Benjamin) haven't
> been pushed by our tooling it seems (the Link: tag is missing at
> least).
>
> Benjamin, what happened there?

Ok, figured it out, added another safety check to the scripting, and
hard-reset the tree. Unfortunately some of the patches already landed
in drm-next, so that needed a hard-reset too, plus in drm-intel-next,
where I still need to do the hard-reset. Ugh.

Benjamin: As part of the hard-reset I've thrown out all the patches
you've committed. That was simpler than digging out the right patches
from the rebase push. Please re-apply and push the right ones again.

My apologies for the hiccup, we maintainers (Dave, Sean & me) should
have caught this earlier.

Thanks, Daniel

>
> Thanks, Daniel
>
>
> On Wed, Jul 26, 2017 at 8:11 AM, Stephen Rothwell wrote:
>> Hi all,
>>
>> I noticed a set of commits that have no Signed-off-by from their
>> committer:
>>
>>   d9864a1d2dfc ("drm/stm: drv: Rename platform driver name")
>>
>> to
>>
>>   ed34d261a12a ("drm/stm: dsi: Constify phy ops structure")
>>
>> --
>> Cheers,
>> Stephen Rothwell
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

[PATCH] iommu/amd: Fix schedule-while-atomic BUG in initialization code

2017-07-26 Thread Joerg Roedel

Hi Artem, Thomas,

On Wed, Jul 26, 2017 at 12:42:49PM +0200, Thomas Gleixner wrote:
> On Tue, 25 Jul 2017, Artem Savkov wrote:
> 
> > Hi,
> > 
> > Commit 1c3c5ea "sched/core: Enable might_sleep() and smp_processor_id()
> > checks early" seems to have uncovered an issue with amd-iommu/x2apic.
> > 
> > Starting with that commit the following warning started to show up on AMD
> > systems during boot:
>  
> > [0.16] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
> 
> > [0.16]  mutex_lock_nested+0x1b/0x20 
> > [0.16]  register_syscore_ops+0x1d/0x70 
> > [0.16]  state_next+0x119/0x910 
> > [0.16]  iommu_go_to_state+0x29/0x30 
> > [0.16]  amd_iommu_enable+0x13/0x23 
> > [0.16]  irq_remapping_enable+0x1b/0x39 
> > [0.16]  enable_IR_x2apic+0x91/0x196 
> > [0.16]  default_setup_apic_routing+0x16/0x6e 
> > [0.16]  native_smp_prepare_cpus+0x257/0x2d5

Thanks for the report!

> --- a/drivers/iommu/amd_iommu_init.c
> +++ b/drivers/iommu/amd_iommu_init.c
> @@ -2440,7 +2440,6 @@ static int __init state_next(void)
>   break;
>   case IOMMU_ACPI_FINISHED:
>   early_enable_iommus();
> - register_syscore_ops(_iommu_syscore_ops);
>   x86_platform.iommu_shutdown = disable_iommus;
>   init_state = IOMMU_ENABLED;
>   break;
> @@ -2559,6 +2558,8 @@ static int __init amd_iommu_init(void)
>   for_each_iommu(iommu)
>   iommu_flush_all_caches(iommu);
>   }
> + } else {
> + register_syscore_ops(_iommu_syscore_ops);
>   }
>  
>   return ret;

Yes, that should fix it, but I think it's better to just move the
register_syscore_ops() call to a later initialization step, like in the
patch below. I tested it and will queue it to my iommu/fixes branch.

From 461242d7211c901b6ccdf349cc89235bd5da Mon Sep 17 00:00:00 2001
From: Joerg Roedel 
Date: Wed, 26 Jul 2017 14:17:55 +0200
Subject: [PATCH] iommu/amd: Fix schedule-while-atomic BUG in initialization
 code

The register_syscore_ops() function takes a mutex and might
sleep. In the IOMMU initialization code it is invoked during
irq-remapping setup already, where irqs are disabled.

This causes a schedule-while-atomic bug:

 BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
 in_atomic(): 0, irqs_disabled(): 1, pid: 1, name: swapper/0
 no locks held by swapper/0/1.
 irq event stamp: 304
 hardirqs last  enabled at (303): [] _raw_spin_unlock_irqrestore+0x36/0x60
 hardirqs last disabled at (304): [] enable_IR_x2apic+0x79/0x196
 softirqs last  enabled at (36): [] __do_softirq+0x35f/0x4ec
 softirqs last disabled at (31): [] irq_exit+0x105/0x120
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc2.1.el7a.test.x86_64.debug #1
 Hardware name:  PowerEdge C6145 /040N24, BIOS 3.5.0 10/28/2014
 Call Trace:
  dump_stack+0x85/0xca
  ___might_sleep+0x22a/0x260
  __might_sleep+0x4a/0x80
  __mutex_lock+0x58/0x960
  ? iommu_completion_wait.part.17+0xb5/0x160
  ? register_syscore_ops+0x1d/0x70
  ? iommu_flush_all_caches+0x120/0x150
  mutex_lock_nested+0x1b/0x20
  register_syscore_ops+0x1d/0x70
  state_next+0x119/0x910
  iommu_go_to_state+0x29/0x30
  amd_iommu_enable+0x13/0x23

Fix it by moving the register_syscore_ops() call to the next
initialization step, which runs with irqs enabled.

Signed-off-by: Joerg Roedel 
---
 drivers/iommu/amd_iommu_init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/amd_iommu_init.c b/drivers/iommu/amd_iommu_init.c
index 5cc597b383c7..372303700566 100644
--- a/drivers/iommu/amd_iommu_init.c
+++ b/drivers/iommu/amd_iommu_init.c
@@ -2440,11 +2440,11 @@ static int __init state_next(void)
break;
case IOMMU_ACPI_FINISHED:
early_enable_iommus();
-   register_syscore_ops(_iommu_syscore_ops);
x86_platform.iommu_shutdown = disable_iommus;
init_state = IOMMU_ENABLED;
break;
case IOMMU_ENABLED:
+   register_syscore_ops(_iommu_syscore_ops);
ret = amd_iommu_init_pci();
init_state = ret ? IOMMU_INIT_ERROR : IOMMU_PCI_INIT;
enable_iommus_v2();
-- 
2.13.1
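
The reasoning in the commit message above can be modelled in a few lines
of userspace C. This is only a toy sketch: every toy_* name is
hypothetical and stands in for the kernel state machine, the irq state
and the mutex-taking registration call. It shows that registering in the
first step (run with irqs off) trips the "sleeping in atomic context"
check, while registering in the following step does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the init states named in the patch. */
enum toy_state { TOY_ACPI_FINISHED, TOY_ENABLED, TOY_PCI_INIT };

static bool toy_irqs_disabled;      /* models irqs_disabled() */
static int  toy_atomic_sleeps;      /* counts would-be might_sleep() warnings */
static bool toy_syscore_registered;

/* Models a function that takes a mutex and therefore may sleep. */
static void toy_register_syscore_ops(void)
{
	if (toy_irqs_disabled)
		toy_atomic_sleeps++;
	toy_syscore_registered = true;
}

/* One step of the state machine; register_early selects the old ordering. */
static enum toy_state toy_state_next(enum toy_state s, bool register_early)
{
	switch (s) {
	case TOY_ACPI_FINISHED:
		if (register_early)
			toy_register_syscore_ops();
		return TOY_ENABLED;
	case TOY_ENABLED:
		if (!register_early)
			toy_register_syscore_ops();
		return TOY_PCI_INIT;
	default:
		return s;
	}
}

/* Runs both steps: the first with irqs off, the second with irqs on. */
static int toy_run(bool register_early)
{
	enum toy_state s = TOY_ACPI_FINISHED;

	toy_atomic_sleeps = 0;
	toy_syscore_registered = false;

	toy_irqs_disabled = true;   /* irq-remapping setup */
	s = toy_state_next(s, register_early);

	toy_irqs_disabled = false;  /* later initialization step */
	s = toy_state_next(s, register_early);

	assert(s == TOY_PCI_INIT && toy_syscore_registered);
	return toy_atomic_sleeps;
}
```

Calling toy_run(true) models the old ordering and toy_run(false) the
ordering after the patch.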

Re: [PATCH V3 1/4] ARM64: dts: rockchip: rk3328 add iommu nodes

2017-07-26 Thread Joerg Roedel

Hey Heiko,

On Wed, Jul 26, 2017 at 01:44:02PM +0200, Heiko Stübner wrote:
> I really would prefer iommu dt-nodes going through my tree :-)
> 
> Especially as parts of these conflict with already pending patches for
> graphics support and with the iommu nodes sitting in your tree these
> would need to wait another kernel release.

Sure, no problem. I have nothing pushed yet, so it's easy to remove
again. Do you want to take all three patch-sets from Simon through your
tree or just this one?

Regards,

Joerg

Re: [RFC PATCH 3/5] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap

2017-07-26 Thread Michal Hocko

On Wed 26-07-17 13:45:39, Heiko Carstens wrote:
[...]
> In general I do like your idea, however if I understand your patches
> correctly we might have an ordering problem on s390: it is not possible to
> access hot-added memory on s390 before it is online (MEM_GOING_ONLINE
> succeeded).

Could you point me to the code please? I cannot seem to find the
notifier which implements that.

> On MEM_GOING_ONLINE we ask the hypervisor to back the potential available
> hot-added memory region with physical pages. Accessing those ranges before
> that will result in an exception.

Can we make the range which backs the memmap range available? E.g. from
the s390-specific __vmemmap_populate path?
 
> However with your approach the memory is still allocated when add_memory()
> is being called, correct? That wouldn't be a change to the current
> behaviour; except for the ordering problem outlined above.

Could you be more specific please? I do not change when the memmap is
allocated.

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour

2017-07-26 Thread Michal Hocko

On Wed 26-07-17 13:11:46, Punit Agrawal wrote:
> Hi Michal,
> 
> Michal Hocko  writes:
> 
> > On Wed 26-07-17 10:50:38, Michal Hocko wrote:
> >> On Tue 25-07-17 16:41:14, Punit Agrawal wrote:
> >> > When walking the page tables to resolve an address that points to
> >> > !p*d_present() entry, huge_pte_offset() returns inconsistent values
> >> > depending on the level of page table (PUD or PMD).
> >> > 
> >> > It returns NULL in the case of a PUD entry while in the case of a PMD
> >> > entry, it returns a pointer to the page table entry.
> >> > 
> >> > A similar inconsistency exists when handling swap entries - returns NULL
> >> > for a PUD entry while a pointer to the pte_t is returned for the PMD
> >> > entry.
> >> > 
> >> > Update huge_pte_offset() to make the behaviour consistent - return NULL
> >> > in the case of p*d_none() and a pointer to the pte_t for hugepage or
> >> > swap entries.
> >> > 
> >> > Document the behaviour to clarify the expected behaviour of this
> >> > function. This is to set clear semantics for architecture specific
> >> > implementations of huge_pte_offset().
> >> 
> >> hugetlb pte semantic is a disaster and I agree it could see some
> >> cleanup/clarifications but I am quite nervous to see a patch like this.
> >> How do we check that nothing will get silently broken by this change?
> 
> Glad I'm not the only one who finds the hugetlb semantics somewhat
> confusing. :)

This is a huge understatement. It is a source of nightmares.

> I've been running tests from mce-test suite and libhugetlbfs for similar
> changes we did on arm64. There could be assumptions that were not
> exercised but I'm not sure how to check for all the possible usages.
> 
> Do you have any other suggestions that can help improve confidence in
> the patch?

Unfortunately I don't. I just know there were many subtle assumptions
all over the place so I am rather careful to not touch the code unless
really necessary.

That being said, I am not opposing your patch.

-- 
Michal Hocko
SUSE Labs

Re: [PATCH] memory: mtk-smi: Use of_device_get_match_data helper

2017-07-26 Thread Honghui Zhang

On Wed, 2017-07-26 at 11:36 +0100, Robin Murphy wrote:
> On 26/07/17 10:59, honghui.zh...@mediatek.com wrote:
> > From: Honghui Zhang 
> > 

> >  * for mtk smi gen 1, we need to get the ao(always on) base to config
> >  * m4u port, and we need to enable the aync clock for transform the smi
> >  * clock into emi clock domain, but for mtk smi gen2, there's no smi ao
> >  * base.
> >  */
> > -   smi_gen = (enum mtk_smi_gen)of_id->data;
> > -   if (smi_gen == MTK_SMI_GEN1) {
> > +   smi_gen = of_device_get_match_data(dev);
> 
> The data you're retrieving is the exact same thing as of_id->data was,
> i.e. an enum mtk_smi_gen cast to void*, so dereferencing it is not a
> good idea. The first patch was almost right; you just need to keep the
> cast in the assignment to smi_gen.
> 
> Robin.
> 
Hi, Robin, thanks very much.
I will send a new version.

> > +   if (*smi_gen == MTK_SMI_GEN1) {
> > res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> > common->smi_ao_base = devm_ioremap_resource(dev, res);
> > if (IS_ERR(common->smi_ao_base))
> > 
>
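
Robin's point above can be illustrated with a small userspace sketch. It
assumes a match table whose data field carries an enum value cast to a
pointer; the toy_* names are hypothetical and only mimic the shape of an
of_device_id table. The value must be recovered by casting the pointer
back, because the pointer does not point at any storage.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical generation enum, mirroring the idea of enum mtk_smi_gen. */
enum toy_smi_gen { TOY_SMI_GEN1, TOY_SMI_GEN2 };

/* Hypothetical match entry: the enum is smuggled through a void pointer. */
struct toy_match {
	const char *compatible;
	const void *data;
};

static const struct toy_match toy_ids[] = {
	{ "vendor,gen1", (const void *)(uintptr_t)TOY_SMI_GEN1 },
	{ "vendor,gen2", (const void *)(uintptr_t)TOY_SMI_GEN2 },
};

/* Stand-in for a "get match data" helper: returns the raw data pointer. */
static const void *toy_get_match_data(unsigned int idx)
{
	assert(idx < sizeof(toy_ids) / sizeof(toy_ids[0]));
	return toy_ids[idx].data;
}

static enum toy_smi_gen toy_probe_gen(unsigned int idx)
{
	/*
	 * Correct: cast the pointer value back to the enum.
	 * Dereferencing it instead would read from address 0 or 1.
	 */
	return (enum toy_smi_gen)(uintptr_t)toy_get_match_data(idx);
}
```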

Re: [PATCH v3 4/9] pwm: Add STM32 LPTimer PWM driver

2017-07-26 Thread Fabrice Gasnier

On 07/07/2017 06:31 PM, Fabrice Gasnier wrote:
> Add support for single PWM channel on Low-Power Timer, that can be
> found on some STM32 platforms.
> 
> Signed-off-by: Fabrice Gasnier 
> ---
> Changes in v3:
> - remove prescalers[] array, use power-of-2 presc directly
> - Update following Thierry's comments:
> - fix issue using FIELD_GET() macro
> - Add get_state() callback
> - remove some checks in probe
> - slight rework 'reenable' flag
> - use more common method to disable pwm in remove()

Hi Thierry,

Gentle ping for PWM driver review since I did changes in v3.
Please advise.

Best Regards,
Fabrice
> 
> Changes in v2:
> - s/Low Power/Low-Power
> - update few comment lines
> ---
>  drivers/pwm/Kconfig|  10 ++
>  drivers/pwm/Makefile   |   1 +
>  drivers/pwm/pwm-stm32-lp.c | 246 
> +
>  3 files changed, 257 insertions(+)
>  create mode 100644 drivers/pwm/pwm-stm32-lp.c
> 
> diff --git a/drivers/pwm/Kconfig b/drivers/pwm/Kconfig
> index 313c107..7cb982b 100644
> --- a/drivers/pwm/Kconfig
> +++ b/drivers/pwm/Kconfig
> @@ -417,6 +417,16 @@ config PWM_STM32
> To compile this driver as a module, choose M here: the module
> will be called pwm-stm32.
>  
> +config PWM_STM32_LP
> + tristate "STMicroelectronics STM32 PWM LP"
> + depends on MFD_STM32_LPTIMER || COMPILE_TEST
> + help
> +   Generic PWM framework driver for STMicroelectronics STM32 SoCs
> +   with Low-Power Timer (LPTIM).
> +
> +   To compile this driver as a module, choose M here: the module
> +   will be called pwm-stm32-lp.
> +
>  config PWM_STMPE
>   bool "STMPE expander PWM export"
>   depends on MFD_STMPE
> diff --git a/drivers/pwm/Makefile b/drivers/pwm/Makefile
> index 93da1f7..a3a4bee 100644
> --- a/drivers/pwm/Makefile
> +++ b/drivers/pwm/Makefile
> @@ -40,6 +40,7 @@ obj-$(CONFIG_PWM_SAMSUNG)   += pwm-samsung.o
>  obj-$(CONFIG_PWM_SPEAR)  += pwm-spear.o
>  obj-$(CONFIG_PWM_STI)+= pwm-sti.o
>  obj-$(CONFIG_PWM_STM32)  += pwm-stm32.o
> +obj-$(CONFIG_PWM_STM32_LP)   += pwm-stm32-lp.o
>  obj-$(CONFIG_PWM_STMPE)  += pwm-stmpe.o
>  obj-$(CONFIG_PWM_SUN4I)  += pwm-sun4i.o
>  obj-$(CONFIG_PWM_TEGRA)  += pwm-tegra.o
> diff --git a/drivers/pwm/pwm-stm32-lp.c b/drivers/pwm/pwm-stm32-lp.c
> new file mode 100644
> index 000..9793b29
> --- /dev/null
> +++ b/drivers/pwm/pwm-stm32-lp.c
> @@ -0,0 +1,246 @@
> +/*
> + * STM32 Low-Power Timer PWM driver
> + *
> + * Copyright (C) STMicroelectronics 2017
> + *
> + * Author: Gerald Baeza 
> + *
> + * License terms: GNU General Public License (GPL), version 2
> + *
> + * Inspired by Gerald Baeza's pwm-stm32 driver
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct stm32_pwm_lp {
> + struct pwm_chip chip;
> + struct clk *clk;
> + struct regmap *regmap;
> +};
> +
> +static inline struct stm32_pwm_lp *to_stm32_pwm_lp(struct pwm_chip *chip)
> +{
> + return container_of(chip, struct stm32_pwm_lp, chip);
> +}
> +
> +/* STM32 Low-Power Timer is preceded by a configurable power-of-2 prescaler */
> +#define STM32_LPTIM_MAX_PRESCALER128
> +
> +static int stm32_pwm_lp_apply(struct pwm_chip *chip, struct pwm_device *pwm,
> +   struct pwm_state *state)
> +{
> + struct stm32_pwm_lp *priv = to_stm32_pwm_lp(chip);
> + unsigned long long prd, div, dty;
> + struct pwm_state cstate;
> + u32 val, mask, cfgr, presc = 0;
> + bool reenable;
> + int ret;
> +
> + pwm_get_state(pwm, &cstate);
> + reenable = !cstate.enabled;
> +
> + if (!state->enabled) {
> + if (cstate.enabled) {
> + /* Disable LP timer */
> + ret = regmap_write(priv->regmap, STM32_LPTIM_CR, 0);
> + if (ret)
> + return ret;
> + /* disable clock to PWM counter */
> + clk_disable(priv->clk);
> + }
> + return 0;
> + }
> +
> + /* Calculate the period and prescaler value */
> + div = (unsigned long long)clk_get_rate(priv->clk) * state->period;
> + do_div(div, NSEC_PER_SEC);
> + prd = div;
> + while (div > STM32_LPTIM_MAX_ARR) {
> + presc++;
> + if ((1 << presc) > STM32_LPTIM_MAX_PRESCALER) {
> + dev_err(priv->chip.dev, "max prescaler exceeded\n");
> + return -EINVAL;
> + }
> + div = prd >> presc;
> + }
> + prd = div;
> +
> + /* Calculate the duty cycle */
> + dty = prd * state->duty_cycle;
> + do_div(dty, state->period);
> +
> + if (!cstate.enabled) {
> + /* enable clock to drive PWM counter */
> + ret = clk_enable(priv->clk);
> + if (ret)
> + return ret;
> + }
> +
> + ret = regmap_read(priv->regmap,

[PATCH v2] memory: mtk-smi: Use of_device_get_match_data helper

2017-07-26 Thread honghui.zhang

From: Honghui Zhang 

Replace custom code with generic helper to retrieve driver data.

Signed-off-by: Honghui Zhang 
---
 drivers/memory/mtk-smi.c | 14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/drivers/memory/mtk-smi.c b/drivers/memory/mtk-smi.c
index 4afbc41..2b798bb4 100644
--- a/drivers/memory/mtk-smi.c
+++ b/drivers/memory/mtk-smi.c
@@ -240,20 +240,15 @@ static int mtk_smi_larb_probe(struct platform_device *pdev)
struct device *dev = &pdev->dev;
struct device_node *smi_node;
struct platform_device *smi_pdev;
-   const struct of_device_id *of_id;
 
if (!dev->pm_domain)
return -EPROBE_DEFER;
 
-   of_id = of_match_node(mtk_smi_larb_of_ids, pdev->dev.of_node);
-   if (!of_id)
-   return -EINVAL;
-
larb = devm_kzalloc(dev, sizeof(*larb), GFP_KERNEL);
if (!larb)
return -ENOMEM;
 
-   larb->larb_gen = of_id->data;
+   larb->larb_gen = of_device_get_match_data(dev);
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
larb->base = devm_ioremap_resource(dev, res);
if (IS_ERR(larb->base))
@@ -319,7 +314,6 @@ static int mtk_smi_common_probe(struct platform_device *pdev)
struct device *dev = &pdev->dev;
struct mtk_smi *common;
struct resource *res;
-   const struct of_device_id *of_id;
enum mtk_smi_gen smi_gen;
 
if (!dev->pm_domain)
@@ -338,17 +332,13 @@ static int mtk_smi_common_probe(struct platform_device *pdev)
if (IS_ERR(common->clk_smi))
return PTR_ERR(common->clk_smi);
 
-   of_id = of_match_node(mtk_smi_common_of_ids, pdev->dev.of_node);
-   if (!of_id)
-   return -EINVAL;
-
/*
 * for mtk smi gen 1, we need to get the ao(always on) base to config
 * m4u port, and we need to enable the aync clock for transform the smi
 * clock into emi clock domain, but for mtk smi gen2, there's no smi ao
 * base.
 */
-   smi_gen = (enum mtk_smi_gen)of_id->data;
+   smi_gen = (enum mtk_smi_gen)of_device_get_match_data(dev);
if (smi_gen == MTK_SMI_GEN1) {
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
common->smi_ao_base = devm_ioremap_resource(dev, res);
-- 
2.6.4

[PATCH v3 1/2] video/hdmi: Introduce helpers for the HDMI audio infoframe payload

2017-07-26 Thread Chris Zhong

DP uses the same audio infoframe payload as HDMI, per the DP 1.3
spec, but with a different header. Provide a new interface that
packs only the payload.

Signed-off-by: Chris Zhong 
---

Changes in v3:
- add size < HDMI_AUDIO_INFOFRAME_SIZE check according to Doug's advice

Changes in v2: None

 drivers/video/hdmi.c | 66 ++--
 include/linux/hdmi.h |  2 ++
 2 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/drivers/video/hdmi.c b/drivers/video/hdmi.c
index 1cf907e..9868050 100644
--- a/drivers/video/hdmi.c
+++ b/drivers/video/hdmi.c
@@ -240,6 +240,49 @@ int hdmi_audio_infoframe_init(struct hdmi_audio_infoframe *frame)
 EXPORT_SYMBOL(hdmi_audio_infoframe_init);
 
 /**
+ * hdmi_audio_infoframe_pack_payload() - write HDMI audio infoframe payload to
+ * binary buffer
+ * @frame: HDMI audio infoframe
+ * @buffer: destination buffer
+ * @size: size of buffer
+ *
+ * Packs the information contained in the @frame structure into a binary
+ * representation that can be written into the corresponding controller
+ * registers.
+ *
+ * Returns 0 on success or a negative error code on failure.
+ */
+ssize_t hdmi_audio_infoframe_pack_payload(struct hdmi_audio_infoframe *frame,
+ void *buffer, size_t size)
+{
+   unsigned char channels;
+   u8 *ptr = buffer;
+
+   if (size < frame->length || size < HDMI_AUDIO_INFOFRAME_SIZE)
+   return -ENOSPC;
+
+   memset(buffer, 0, size);
+
+   if (frame->channels >= 2)
+   channels = frame->channels - 1;
+   else
+   channels = 0;
+
+   ptr[0] = ((frame->coding_type & 0xf) << 4) | (channels & 0x7);
+   ptr[1] = ((frame->sample_frequency & 0x7) << 2) |
+(frame->sample_size & 0x3);
+   ptr[2] = frame->coding_type_ext & 0x1f;
+   ptr[3] = frame->channel_allocation;
+   ptr[4] = (frame->level_shift_value & 0xf) << 3;
+
+   if (frame->downmix_inhibit)
+   ptr[4] |= BIT(7);
+
+   return 0;
+}
+EXPORT_SYMBOL(hdmi_audio_infoframe_pack_payload);
+
+/**
  * hdmi_audio_infoframe_pack() - write HDMI audio infoframe to binary buffer
  * @frame: HDMI audio infoframe
  * @buffer: destination buffer
@@ -256,22 +299,15 @@ EXPORT_SYMBOL(hdmi_audio_infoframe_init);
 ssize_t hdmi_audio_infoframe_pack(struct hdmi_audio_infoframe *frame,
  void *buffer, size_t size)
 {
-   unsigned char channels;
u8 *ptr = buffer;
size_t length;
+   int ret;
 
length = HDMI_INFOFRAME_HEADER_SIZE + frame->length;
 
if (size < length)
return -ENOSPC;
 
-   memset(buffer, 0, size);
-
-   if (frame->channels >= 2)
-   channels = frame->channels - 1;
-   else
-   channels = 0;
-
ptr[0] = frame->type;
ptr[1] = frame->version;
ptr[2] = frame->length;
@@ -279,16 +315,10 @@ ssize_t hdmi_audio_infoframe_pack(struct hdmi_audio_infoframe *frame,
 
/* start infoframe payload */
ptr += HDMI_INFOFRAME_HEADER_SIZE;
-
-   ptr[0] = ((frame->coding_type & 0xf) << 4) | (channels & 0x7);
-   ptr[1] = ((frame->sample_frequency & 0x7) << 2) |
-(frame->sample_size & 0x3);
-   ptr[2] = frame->coding_type_ext & 0x1f;
-   ptr[3] = frame->channel_allocation;
-   ptr[4] = (frame->level_shift_value & 0xf) << 3;
-
-   if (frame->downmix_inhibit)
-   ptr[4] |= BIT(7);
+   ret = hdmi_audio_infoframe_pack_payload(frame, ptr,
+   size - HDMI_INFOFRAME_HEADER_SIZE);
+   if (ret)
+   return ret;
 
hdmi_infoframe_set_checksum(buffer, length);
 
diff --git a/include/linux/hdmi.h b/include/linux/hdmi.h
index d271ff2..a4be132 100644
--- a/include/linux/hdmi.h
+++ b/include/linux/hdmi.h
@@ -272,6 +272,8 @@ struct hdmi_audio_infoframe {
 int hdmi_audio_infoframe_init(struct hdmi_audio_infoframe *frame);
 ssize_t hdmi_audio_infoframe_pack(struct hdmi_audio_infoframe *frame,
  void *buffer, size_t size);
+ssize_t hdmi_audio_infoframe_pack_payload(struct hdmi_audio_infoframe *frame,
+ void *buffer, size_t size);
 
 enum hdmi_3d_structure {
HDMI_3D_STRUCTURE_INVALID = -1,
-- 
2.7.4
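
The payload packing introduced by this patch can be exercised outside the
kernel with the sketch below. It follows the bit layout shown in the diff,
but the toy_* struct and function are hypothetical, the payload size of 10
bytes is an assumption standing in for HDMI_AUDIO_INFOFRAME_SIZE, and -1
replaces -ENOSPC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_AUDIO_INFOFRAME_SIZE 10  /* assumed payload size */

/* Hypothetical subset of the audio infoframe fields used by the packer. */
struct toy_audio_infoframe {
	uint8_t length;
	uint8_t channels;
	uint8_t coding_type;
	uint8_t sample_size;
	uint8_t sample_frequency;
	uint8_t coding_type_ext;
	uint8_t channel_allocation;
	uint8_t level_shift_value;
	int downmix_inhibit;
};

/* Packs only the payload (no header, no checksum); 0 on success, -1 if too small. */
static int toy_pack_payload(const struct toy_audio_infoframe *f,
			    void *buffer, size_t size)
{
	uint8_t *ptr = buffer;
	uint8_t channels;

	assert(f && buffer);

	if (size < f->length || size < TOY_AUDIO_INFOFRAME_SIZE)
		return -1;

	memset(buffer, 0, size);

	/* The channel count is encoded as "count - 1"; 0 means "refer to stream". */
	channels = f->channels >= 2 ? f->channels - 1 : 0;

	ptr[0] = ((f->coding_type & 0xf) << 4) | (channels & 0x7);
	ptr[1] = ((f->sample_frequency & 0x7) << 2) | (f->sample_size & 0x3);
	ptr[2] = f->coding_type_ext & 0x1f;
	ptr[3] = f->channel_allocation;
	ptr[4] = (f->level_shift_value & 0xf) << 3;

	if (f->downmix_inhibit)
		ptr[4] |= 0x80;

	return 0;
}
```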

[PATCH v3 2/2] drm/rockchip: cdn-dp: send audio infoframe to sink

2017-07-26 Thread Chris Zhong

Some DP/HDMI sinks need to receive the audio infoframe to play sound.
Multi-channel AV receivers in particular need the channel_allocation
field from the infoframe to configure the speakers. Sending the audio
infoframe via SDP makes them work properly.

Signed-off-by: Chris Zhong 

---

Changes in v3: None
Changes in v2:
- According to the advice of Sean Paul and Doug
use hdmi_audio_infoframe_pack_payload to pack the buffer
define a SDP_HEADER_SIZE

 drivers/gpu/drm/rockchip/cdn-dp-core.c | 20 
 drivers/gpu/drm/rockchip/cdn-dp-reg.c  | 27 +++
 drivers/gpu/drm/rockchip/cdn-dp-reg.h  |  6 ++
 include/drm/drm_dp_helper.h|  1 +
 4 files changed, 54 insertions(+)

diff --git a/drivers/gpu/drm/rockchip/cdn-dp-core.c b/drivers/gpu/drm/rockchip/cdn-dp-core.c
index 9b0b058..6a4fc66 100644
--- a/drivers/gpu/drm/rockchip/cdn-dp-core.c
+++ b/drivers/gpu/drm/rockchip/cdn-dp-core.c
@@ -802,6 +802,7 @@ static int cdn_dp_audio_hw_params(struct device *dev,  void *data,
.sample_rate = params->sample_rate,
.channels = params->channels,
};
+   u8 buffer[HDMI_AUDIO_INFOFRAME_SIZE + EDP_SDP_HEADER_SIZE] = { 0 };
int ret;
 
mutex_lock(>lock);
@@ -823,6 +824,25 @@ static int cdn_dp_audio_hw_params(struct device *dev,  void *data,
goto out;
}
 
+   /*
+* Prepare the infoframe header to SDP header per DP 1.3 spec, Table
+* 2-98.
+*/
+   buffer[0] = 0;
+   buffer[1] = HDMI_INFOFRAME_TYPE_AUDIO;
+   buffer[2] = 0x1b;
+   buffer[3] = 0x48;
+
+   ret = hdmi_audio_infoframe_pack_payload(>cea,
+   [EDP_SDP_HEADER_SIZE],
+   HDMI_AUDIO_INFOFRAME_SIZE);
+   if (ret < 0) {
+   DRM_DEV_ERROR(dev, "Failed to pack audio infoframe: %d\n", ret);
+   goto out;
+   }
+
+   cdn_dp_sdp_write(dp, 0, buffer, sizeof(buffer));
+
ret = cdn_dp_audio_config(dp, );
if (!ret)
dp->audio_info = audio;
diff --git a/drivers/gpu/drm/rockchip/cdn-dp-reg.c b/drivers/gpu/drm/rockchip/cdn-dp-reg.c
index b14d211..4a818e4 100644
--- a/drivers/gpu/drm/rockchip/cdn-dp-reg.c
+++ b/drivers/gpu/drm/rockchip/cdn-dp-reg.c
@@ -286,6 +286,33 @@ int cdn_dp_dpcd_write(struct cdn_dp_device *dp, u32 addr, u8 value)
return ret;
 }
 
+void cdn_dp_sdp_write(struct cdn_dp_device *dp, int entry_id, u8 *buf,
+ u32 buf_len)
+{
+   int idx;
+   u32 *packet = (u32 *)buf;
+   u32 num_packets = buf_len / 4;
+   u8 type;
+
+   if (buf_len < EDP_SDP_HEADER_SIZE) {
+   DRM_DEV_ERROR(dp->dev, "sdp buffer length: %d\n", buf_len);
+   return;
+   }
+
+   type = buf[1];
+
+   for (idx = 0; idx < num_packets; idx++)
+   writel(cpu_to_le32(*packet++), dp->regs + SOURCE_PIF_DATA_WR);
+
+   writel(entry_id, dp->regs + SOURCE_PIF_WR_ADDR);
+
+   writel(F_HOST_WR, dp->regs + SOURCE_PIF_WR_REQ);
+
+   writel(PIF_PKT_TYPE_VALID | F_PACKET_TYPE(type) | entry_id,
+  dp->regs + SOURCE_PIF_PKT_ALLOC_REG);
+   writel(PIF_PKT_ALLOC_WR_EN, dp->regs + SOURCE_PIF_PKT_ALLOC_WR_EN);
+}
+
 int cdn_dp_load_firmware(struct cdn_dp_device *dp, const u32 *i_mem,
 u32 i_size, const u32 *d_mem, u32 d_size)
 {
diff --git a/drivers/gpu/drm/rockchip/cdn-dp-reg.h b/drivers/gpu/drm/rockchip/cdn-dp-reg.h
index c4bbb4a83..6ec0e81 100644
--- a/drivers/gpu/drm/rockchip/cdn-dp-reg.h
+++ b/drivers/gpu/drm/rockchip/cdn-dp-reg.h
@@ -424,6 +424,11 @@
 /* Reference cycles when using lane clock as reference */
 #define LANE_REF_CYC   0x8000
 
+#define F_HOST_WR  BIT(0)
+#define PIF_PKT_ALLOC_WR_ENBIT(0)
+#define PIF_PKT_TYPE_VALID (3 << 16)
+#define F_PACKET_TYPE(x)   (((x) & 0xff) << 8)
+
 enum voltage_swing_level {
VOLTAGE_LEVEL_0,
VOLTAGE_LEVEL_1,
@@ -478,5 +483,6 @@ int cdn_dp_set_video_status(struct cdn_dp_device *dp, int active);
 int cdn_dp_config_video(struct cdn_dp_device *dp);
 int cdn_dp_audio_stop(struct cdn_dp_device *dp, struct audio_info *audio);
 int cdn_dp_audio_mute(struct cdn_dp_device *dp, bool enable);
+void cdn_dp_sdp_write(struct cdn_dp_device *dp, int entry_id, u8 *buf, u32 len);
 int cdn_dp_audio_config(struct cdn_dp_device *dp, struct audio_info *audio);
 #endif /* _CDN_DP_REG_H */
diff --git a/include/drm/drm_dp_helper.h b/include/drm/drm_dp_helper.h
index b17476a..5d5dd07 100644
--- a/include/drm/drm_dp_helper.h
+++ b/include/drm/drm_dp_helper.h
@@ -878,6 +878,7 @@ struct edp_sdp_header {
u8 HB3; /* 7:5 reserved, 4:0 number of valid data bytes */
 } __packed;
 
+#define EDP_SDP_HEADER_SIZE4
 #define EDP_SDP_HEADER_REVISION_MASK   0x1F
 #define EDP_SDP_HEADER_VALID_PAYLOAD_BYTES 0x1F
 
-- 
2.7.4
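
The way the SDP buffer (4-byte header followed by the infoframe payload)
is split into 32-bit words before being written to the data register can
be sketched in userspace as follows. The toy_* names are hypothetical, no
hardware registers are touched, and the little-endian word assembly is an
assumption about the intended byte order. Like the patch, the sketch uses
buf_len / 4, so bytes beyond the last full word are not emitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SDP_HEADER_SIZE 4  /* same value as EDP_SDP_HEADER_SIZE in the patch */

/*
 * Converts a byte buffer into little-endian 32-bit words.
 * Returns the number of words produced, or 0 if the buffer is shorter
 * than the header or does not fit into the output array.
 */
static size_t toy_sdp_to_words(const uint8_t *buf, size_t buf_len,
			       uint32_t *words, size_t max_words)
{
	size_t n = buf_len / 4;
	size_t i;

	assert(buf && words);

	if (buf_len < TOY_SDP_HEADER_SIZE || n > max_words)
		return 0;

	for (i = 0; i < n; i++) {
		const uint8_t *p = buf + 4 * i;

		/* Byte 0 lands in the least significant bits of the word. */
		words[i] = (uint32_t)p[0] |
			   ((uint32_t)p[1] << 8) |
			   ((uint32_t)p[2] << 16) |
			   ((uint32_t)p[3] << 24);
	}
	return n;
}
```

With the header bytes used in the patch (0x00, audio infoframe type,
0x1b, 0x48) the first word carries the whole SDP header.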

[PATCH] f2fs: provide f2fs_balance_fs to __write_node_page

2017-07-26 Thread Yunlong Song

Signed-off-by: Yunlong Song 
---
 fs/f2fs/checkpoint.c |  2 +-
 fs/f2fs/f2fs.h   |  2 +-
 fs/f2fs/node.c   | 16 ++--
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 5b876f6..3c84a25 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -1017,7 +1017,7 @@ static int block_operations(struct f2fs_sb_info *sbi)
 
if (get_pages(sbi, F2FS_DIRTY_NODES)) {
up_write(>node_write);
-   err = sync_node_pages(sbi, );
+   err = sync_node_pages(sbi, , false);
if (err) {
up_write(>node_change);
f2fs_unlock_all(sbi);
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 94a88b2..f69051b 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -2293,7 +2293,7 @@ struct page *new_node_page(struct dnode_of_data *dn,
 void move_node_page(struct page *node_page, int gc_type);
 int fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
struct writeback_control *wbc, bool atomic);
-int sync_node_pages(struct f2fs_sb_info *sbi, struct writeback_control *wbc);
+int sync_node_pages(struct f2fs_sb_info *sbi, struct writeback_control *wbc, bool need);
 void build_free_nids(struct f2fs_sb_info *sbi, bool sync, bool mount);
 bool alloc_nid(struct f2fs_sb_info *sbi, nid_t *nid);
 void alloc_nid_done(struct f2fs_sb_info *sbi, nid_t nid);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index d53fe62..b5c0ce3 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1326,7 +1326,7 @@ static struct page *last_fsync_dnode(struct f2fs_sb_info *sbi, nid_t ino)
 }
 
 static int __write_node_page(struct page *page, bool atomic, bool *submitted,
-   struct writeback_control *wbc)
+   struct writeback_control *wbc, bool need)
 {
struct f2fs_sb_info *sbi = F2FS_P_SB(page);
nid_t nid;
@@ -1387,6 +1387,10 @@ static int __write_node_page(struct page *page, bool atomic, bool *submitted,
}
 
unlock_page(page);
+   if (need)
+   f2fs_balance_fs(sbi, false);
+   else
+   f2fs_balance_fs_bg(sbi);
 
if (unlikely(f2fs_cp_error(sbi))) {
f2fs_submit_merged_write(sbi, NODE);
@@ -1405,7 +1409,7 @@ static int __write_node_page(struct page *page, bool atomic, bool *submitted,
 static int f2fs_write_node_page(struct page *page,
struct writeback_control *wbc)
 {
-   return __write_node_page(page, false, NULL, wbc);
+   return __write_node_page(page, false, NULL, wbc, true);
 }
 
 int fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
@@ -1493,7 +1497,7 @@ int fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
 
ret = __write_node_page(page, atomic &&
page == last_page,
-   , wbc);
+   , wbc, true);
if (ret) {
unlock_page(page);
f2fs_put_page(last_page, 0);
@@ -1530,7 +1534,7 @@ int fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
return ret ? -EIO: 0;
 }
 
-int sync_node_pages(struct f2fs_sb_info *sbi, struct writeback_control *wbc)
+int sync_node_pages(struct f2fs_sb_info *sbi, struct writeback_control *wbc, bool need)
 {
pgoff_t index, end;
struct pagevec pvec;
@@ -1608,7 +1612,7 @@ int sync_node_pages(struct f2fs_sb_info *sbi, struct writeback_control *wbc)
set_fsync_mark(page, 0);
set_dentry_mark(page, 0);
 
-   ret = __write_node_page(page, false, &submitted, wbc);
+   ret = __write_node_page(page, false, &submitted, wbc, need);
if (ret)
unlock_page(page);
else if (submitted)
@@ -1697,7 +1701,7 @@ static int f2fs_write_node_pages(struct address_space *mapping,
diff = nr_pages_to_write(sbi, NODE, wbc);
wbc->sync_mode = WB_SYNC_NONE;
blk_start_plug();
-   sync_node_pages(sbi, wbc);
+   sync_node_pages(sbi, wbc, true);
blk_finish_plug();
wbc->nr_to_write = max((long)0, wbc->nr_to_write - diff);
return 0;
-- 
1.8.5.2

Re: [linux-sunxi] [PATCH 10/10] ARM: dts: sun8i: Add SY8106A regulator to Orange Pi PC

2017-07-26 Thread icenowy

On 2017-07-26 19:44, Maxime Ripard wrote:

Hi,

On Wed, Jul 26, 2017 at 12:23:48PM +0200, Ondřej Jirman wrote:

Hi,

icen...@aosc.io wrote on Wed 26. 07. 2017 at 15:36 +0800:
>
> > > >
> > > > Otherwise
> > > >
> > > > > +   regulator-max-microvolt = <140>;
> > > > > +   regulator-ramp-delay = <200>;
> > > >
> > > > Is this an actual constraint of the SoC? Or is it a characteristic
> > > > of the regulator? If it is the latter, it belongs in the driver.
> > > > AFAIK the regulator supports varying the ramp delay (slew rate).
>
> I don't know...
>
> Maybe I should ask Ondrej?

It is probably neither.

It is used to calculate a delay inserted by the kernel between setting
a new target voltage over I2C and changing the frequency of the CPU.
The actual delay is calculated by the difference between previous and
the new voltage.

I don't remember seeing anything in the datasheet of the regulator.
This is just some low value that works.

It would probably be dependent on the capacitance on the output of the
regulator, actual load (which varies), etc. So it is a board specific
value. One could measure it with an oscilloscope if there's a need to
optimize this.

If this is a reasonable default, then this should be in the
driver. You can't expect anyone to properly calculate a ramp delay and
have access to both a scope and the CPU power lines.

It seems that in regulator_desc structure a default value of ramp delay
can be set, and the ones specified in dt can override it.

So just add .ramp_delay = 200 in the driver's regulator_desc part?

Should a comment be added that explains it's only an experienced value
on Allwinner H3/H5 boards VDD-CPUX usage?

Maxime


[GIT PULL] intel_th: Fixes for char-misc-linus

2017-07-26 Thread Alexander Shishkin

Hi Greg,

Here are my fixes for 4.13, please consider pulling. These are really
just two new PCI IDs. Thanks!

The following changes since commit 520eccdfe187591a51ea9ab4c1a024ae4d0f68d9:

  Linux 4.13-rc2 (2017-07-23 16:15:17 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ash/stm.git tags/stm-fixes-for-greg-20170726

for you to fetch changes up to a45ae3526897ebcd128e9044040bc7b4f57de4f0:

  intel_th: pci: Add Cannon Lake PCH-LP support (2017-07-26 15:33:15 +0300)


intel_th: Fixes for v4.13

These are two new PCI IDs (Cannon Lake PCH-H and PCH-LP).


Alexander Shishkin (2):
  intel_th: pci: Add Cannon Lake PCH-H support
  intel_th: pci: Add Cannon Lake PCH-LP support

 drivers/hwtracing/intel_th/pci.c | 10 ++
 1 file changed, 10 insertions(+)

Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour

2017-07-26 Thread Michal Hocko

On Wed 26-07-17 14:33:57, Michal Hocko wrote:
> On Wed 26-07-17 13:11:46, Punit Agrawal wrote:
[...]
> > I've been running tests from mce-test suite and libhugetlbfs for similar
> > changes we did on arm64. There could be assumptions that were not
> > exercised but I'm not sure how to check for all the possible usages.
> > 
> > Do you have any other suggestions that can help improve confidence in
> > the patch?
> 
> Unfortunately I don't. I just know there were many subtle assumptions
> all over the place so I am rather careful to not touch the code unless
> really necessary.
> 
> That being said, I am not opposing your patch.

Let me be more specific. I am not opposing your patch but we
definitely need more reviewers to have a look. I am not seeing any
immediate problems with it but I do not see a large improvement either
(a slightly smaller nightmare doesn't make me sleep all that well ;)). So I
will leave the decision to others.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH v8 1/3] perf: cavium: Support memory controller PMU counters

2017-07-26 Thread Suzuki K Poulose


On 26/07/17 12:19, Jan Glauber wrote:

On Tue, Jul 25, 2017 at 04:39:18PM +0100, Suzuki K Poulose wrote:

On 25/07/17 16:04, Jan Glauber wrote:

Add support for the PMU counters on Cavium SOC memory controllers.

This patch also adds generic functions to allow supporting more
devices with PMU counters.

Properties of the LMC PMU counters:
- not stoppable
- fixed purpose
- read-only
- one PCI device per memory controller

Signed-off-by: Jan Glauber 
---
drivers/perf/Kconfig   |   8 +
drivers/perf/Makefile  |   1 +
drivers/perf/cavium_pmu.c  | 424 +
include/linux/cpuhotplug.h |   1 +
4 files changed, 434 insertions(+)
create mode 100644 drivers/perf/cavium_pmu.c

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index e5197ff..a46c3f0 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -43,4 +43,12 @@ config XGENE_PMU
help
  Say y if you want to use APM X-Gene SoC performance monitors.

+config CAVIUM_PMU
+   bool "Cavium SOC PMU"


Is there any specific reason why this can't be built as a module ?


Yes. I don't know how to load the module automatically. I can't make it
a pci driver as the EDAC driver "owns" the device (and having two
drivers for one device won't work as far as I know). I tried to hook
into the EDAC driver but the EDAC maintainer was not overly welcoming
of that approach.




And while it would be possible to have it as a module I think it is of
no use if it requires manual loading. But maybe there is a simple
solution I'm missing here?



If you are talking about a Cavium specific EDAC driver, maybe we could
make that depend on this driver "at runtime" via symbols (maybe even
trigger the probe of PMU), which will be referenced only when CONFIG_CAVIUM_PMU
is defined. It is not the perfect solution, but that should do the trick.



+   /*
+* Forbid groups containing mixed PMUs, software events are acceptable.
+*/
+   if (event->group_leader->pmu != event->pmu &&
+   !is_software_event(event->group_leader))
+   return -EINVAL;
+
+   list_for_each_entry(sibling, >group_leader->sibling_list,
+   group_entry)
+   if (sibling->pmu != event->pmu &&
+   !is_software_event(sibling))
+   return -EINVAL;


Do we also need to check if the events in the same group can be scheduled
at once? I.e., whether there are enough resources to schedule the requested
events from the group.



Not sure what you mean, do I need to check for programmable counters
that no more counters are programmed than available?



Yes. What if there are two events, both trying to use the same counter (either
due to lack of programmable counters or duplicate events)?


+
+   hwc->config = event->attr.config;
+   hwc->idx = -1;
+   return 0;
+}
+

...


+static int cvm_pmu_add(struct perf_event *event, int flags, u64 config_base,
+  u64 event_base)
+{
+   struct cvm_pmu_dev *pmu_dev = to_pmu_dev(event->pmu);
+   struct hw_perf_event *hwc = >hw;
+
+   if (!cmpxchg(_dev->events[hwc->config], NULL, event))
+   hwc->idx = hwc->config;
+
+   if (hwc->idx == -1)
+   return -EBUSY;
+
+   hwc->config_base = config_base;
+   hwc->event_base = event_base;
+   hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+
+   if (flags & PERF_EF_START)
+   pmu_dev->pmu.start(event, PERF_EF_RELOAD);
+
+   return 0;
+}
+
+static void cvm_pmu_del(struct perf_event *event, int flags)
+{
+   struct cvm_pmu_dev *pmu_dev = to_pmu_dev(event->pmu);
+   struct hw_perf_event *hwc = >hw;
+   int i;
+
+   event->pmu->stop(event, PERF_EF_UPDATE);
+
+   /*
+* For programmable counters we need to check where we installed it.
+* To keep this function generic always test the more complicated
+* case (free running counters won't need the loop).
+*/
+   for (i = 0; i < pmu_dev->num_counters; i++)
+   if (cmpxchg(_dev->events[i], event, NULL) == event)
+   break;


I couldn't see why hwc->config wouldn't give us the index where we installed
the event in pmu_dev->events. What am I missing ?


Did you see the comment above? It is not yet needed but will be when I
add support for programmable counters.


Is it supported in this series ?


If it is still confusing I can
also remove that for now and add it back later when it is needed.


What is the hwc->idx for programmable counters ? is it going to be different
than hwc->config ? If so, can we use hwc->idx to keep the idx where we installed
the event ?
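
To make the question concrete, here is a minimal userspace sketch of what I have in mind. It is not the driver code: `struct event`, `struct pmu_dev`, `pmu_add()` and `pmu_del()` are made-up names, and C11 atomics stand in for the kernel's cmpxchg(). The slot that was claimed is recorded in `idx`, so the release path does not need to search the array:

```c
#include <stdatomic.h>
#include <stddef.h>

#define NUM_COUNTERS 4	/* hypothetical counter count for the sketch */

struct event {
	int config;	/* requested event type */
	int idx;	/* slot actually claimed, -1 if none */
};

struct pmu_dev {
	_Atomic(struct event *) events[NUM_COUNTERS];
};

/* Claim the first free slot and remember it in ev->idx. */
int pmu_add(struct pmu_dev *pmu, struct event *ev)
{
	for (int i = 0; i < NUM_COUNTERS; i++) {
		struct event *expected = NULL;

		if (atomic_compare_exchange_strong(&pmu->events[i],
						   &expected, ev)) {
			ev->idx = i;
			return 0;
		}
	}
	ev->idx = -1;
	return -1;	/* stands in for -EBUSY */
}

/* Release through the recorded index; no loop over the slots. */
void pmu_del(struct pmu_dev *pmu, struct event *ev)
{
	struct event *expected = ev;

	if (ev->idx < 0)
		return;
	atomic_compare_exchange_strong(&pmu->events[ev->idx], &expected,
				       (struct event *)NULL);
	ev->idx = -1;
}
```

For the fixed-purpose counters in this series the claimed slot would simply equal the config value, so the same release path would cover both cases.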

Suzuki

Re: [RFC][PATCH v3]: documentation,atomic: Add new documents

2017-07-26 Thread Boqun Feng

On Wed, Jul 26, 2017 at 01:53:28PM +0200, Peter Zijlstra wrote:
> 
> New version..
> 
> 
> ---
> Subject: documentation,atomic: Add new documents
> From: Peter Zijlstra 
> Date: Mon Jun 12 14:50:27 CEST 2017
> 
> Since we've vastly expanded the atomic_t interface in recent years the
> existing documentation is woefully out of date and people seem to get
> confused a bit.
> 
> Start a new document to hopefully better explain the current state of
> affairs.
> 
> The old atomic_ops.txt also covers bitmaps and a few more details so
> this is not a full replacement and we'll therefore keep that document
> around until such a time that we've managed to write more text to cover
> its entire.
> 

You seem to have an unfinished paragraph..

> Also please, ReST people, go away.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
[...]
> +
> +Further, while something like:
> +
> +  smp_mb__before_atomic();
> +  atomic_dec();
> +
> +is a 'typical' RELEASE pattern, the barrier is strictly stronger than
> +a RELEASE. Similarly for something like:
> +

.. here. Maybe you planned to put a stronger ACQUIRE pattern?

> +
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -498,7 +498,7 @@ VARIETIES OF MEMORY BARRIER
>   This means that ACQUIRE acts as a minimal "acquire" operation and
>   RELEASE acts as a minimal "release" operation.
>  
[...]
> -
> -[!] Note that special memory barrier primitives are available for these
> -situations because on some CPUs the atomic instructions used imply full memory
> -barriers, and so barrier instructions are superfluous in conjunction with them,
> -and in such cases the special barrier primitives will be no-ops.
> -
> -See Documentation/core-api/atomic_ops.rst for more information.
> +See Documentation/atomic_t.txt for more information.
>  

s/atomic_t.txt/atomic_{t,bitops}.txt/ ?

other than those two tiny things,

Reviewed-by: Boqun Feng 

Regards,
Boqun

>  
>  ACCESSING DEVICES



RE: [PATCH v12 6/8] mm: support reporting free page blocks

2017-07-26 Thread Wang, Wei W

On Wednesday, July 26, 2017 7:55 PM, Michal Hocko wrote:
> On Wed 26-07-17 19:44:23, Wei Wang wrote:
> [...]
> > I thought about it more. Probably we can use the callback function
> > with a little change like this:
> >
> > void walk_free_mem(void *opaque1, void (*visit)(void *opaque2,
> > unsigned long pfn,
> >unsigned long nr_pages))
> > {
> > ...
> > for_each_populated_zone(zone) {
> >for_each_migratetype_order(order, type) {
> > report_unused_page_block(zone, order, type,
> > ); // from patch 6
> > pfn = page_to_pfn(page);
> > visit(opaque1, pfn, 1 << order);
> > }
> > }
> > }
> >
> > The above function scans all the free list and directly sends each
> > free page block to the hypervisor via the virtio_balloon callback
> > below. No need to implement a bitmap.
> >
> > In virtio-balloon, we have the callback:
> > void *virtio_balloon_report_unused_pages(void *opaque,  unsigned long
> > pfn, unsigned long nr_pages) {
> > struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > ...put the free page block to the ring of vb; }
> >
> >
> > What do you think?
> 
> I do not mind conveying a context to the callback. I would still prefer
> to keep the original min_order to check semantic though. Why? Well,
> it doesn't make much sense to scan low order free blocks all the time
> because they are simply too volatile. Larger blocks tend to survive for
> longer. So I assume you would only care about larger free blocks. This
> will also make the call cheaper.
> --

OK, I will keep min order there in the next version.

Best,
Wei
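
For reference, a minimal userspace sketch of the shape discussed above: a walker that takes an opaque context plus a visitor callback and skips blocks below min_order. All names here (`struct free_block`, `walk_free_blocks()`, `count_pages()`) are made up for illustration, and the free list is just an array passed in by the caller rather than the real zone free lists:

```c
#include <stddef.h>

/* Visitor type: context pointer, first pfn, number of pages. */
typedef void (*visit_fn)(void *opaque, unsigned long pfn,
			 unsigned long nr_pages);

/* Hypothetical stand-in for one entry on a free list. */
struct free_block {
	unsigned long pfn;
	unsigned int order;
};

/* Report every block of at least min_order to the visitor. */
void walk_free_blocks(const struct free_block *blocks, size_t n,
		      unsigned int min_order, void *opaque, visit_fn visit)
{
	for (size_t i = 0; i < n; i++) {
		if (blocks[i].order < min_order)
			continue;	/* too small, too volatile */
		visit(opaque, blocks[i].pfn, 1UL << blocks[i].order);
	}
}

/* Example visitor: sum the reported pages into a counter. */
void count_pages(void *opaque, unsigned long pfn, unsigned long nr_pages)
{
	(void)pfn;
	*(unsigned long *)opaque += nr_pages;
}
```

In the balloon case the opaque pointer would carry the driver's own state instead of a counter.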

Re: [PATCH V2 net-next 01/21] net-next/hinic: Initialize hw interface

2017-07-26 Thread Aviad Krawczyk

OK, we will use module_pci_driver although it is not very common in the same
segment.

On 7/25/2017 11:02 PM, Francois Romieu wrote:
> Aviad Krawczyk  :
> [...]
>> module_pci_driver - is not used in other drivers in the same segment, is
>> it necessary?
> 
> /me checks... Ok, there seems to be some overenthusiastic copy'paste.
> 
> See drivers/net/ethernet/intel/ixgb/ixgb_main.c:
> [...]
> /**
>  * ixgb_init_module - Driver Registration Routine
>  *
>  * ixgb_init_module is the first routine called when the driver is
>  * loaded. All it does is register with the PCI subsystem.
>  **/
> 
> static int __init
> ixgb_init_module(void)
> {
>   pr_info("%s - version %s\n", ixgb_driver_string, ixgb_driver_version);
>   pr_info("%s\n", ixgb_copyright);
> 
>   return pci_register_driver(_driver);
> }
> 
> module_init(ixgb_init_module);
> 
> /**
>  * ixgb_exit_module - Driver Exit Cleanup Routine
>  *
>  * ixgb_exit_module is called just before the driver is removed
>  * from memory.
>  **/
> 
> static void __exit
> ixgb_exit_module(void)
> {
>   pci_unregister_driver(_driver);
> }
> 
> module_exit(ixgb_exit_module);
> 
> Driver version ought to be fed through ethtool, if ever. Copyright message
> mildly contributes to a better world. So the whole stuff above could be:
> 
> module_pci_driver(ixgb_driver);
>
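
As a rough userspace analogy of what such a helper macro does (this is not the kernel definition, which is built on module_driver() in the kernel headers), the sketch below generates the init/exit pair from one line. `struct demo_driver`, `demo_register_driver()`, `demo_unregister_driver()` and `module_demo_driver()` are all hypothetical names:

```c
/* Stand-in for struct pci_driver. */
struct demo_driver {
	const char *name;
	int registered;
};

/* Stand-ins for pci_register_driver()/pci_unregister_driver(). */
int demo_register_driver(struct demo_driver *drv)
{
	drv->registered = 1;
	return 0;
}

void demo_unregister_driver(struct demo_driver *drv)
{
	drv->registered = 0;
}

/* Generate the init/exit functions that would otherwise be hand-written. */
#define module_demo_driver(drv)					\
	int drv##_init(void)					\
	{							\
		return demo_register_driver(&(drv));		\
	}							\
	void drv##_exit(void)					\
	{							\
		demo_unregister_driver(&(drv));			\
	}

struct demo_driver demo_nic = { "demo_nic", 0 };
module_demo_driver(demo_nic)
```

The point is only that the register/unregister boilerplate collapses into a single macro invocation when init and exit do nothing else.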

Re: [PATCH] virtio-net: fix module unloading

2017-07-26 Thread Michael S. Tsirkin

On Wed, Jul 26, 2017 at 11:52:07AM +0800, Jason Wang wrote:
> 
> 
> On 2017年07月24日 21:38, Andrew Jones wrote:
> > Unregister the driver before removing multi-instance hotplug
> > callbacks. This order avoids the warning issued from
> > __cpuhp_remove_state_cpuslocked when the number of remaining
> > instances isn't yet zero.
> > 
> > Fixes: 8017c279196a ("net/virtio-net: Convert to hotplug state machine")
> > Cc: Sebastian Andrzej Siewior 
> > Signed-off-by: Andrew Jones 
> > ---
> >   drivers/net/virtio_net.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 99a26a9efec1..f41ab0ea942a 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -2743,9 +2743,9 @@ module_init(virtio_net_driver_init);
> >   static __exit void virtio_net_driver_exit(void)
> >   {
> > +   unregister_virtio_driver(_net_driver);
> > cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
> > cpuhp_remove_multi_state(virtionet_online);
> > -   unregister_virtio_driver(_net_driver);
> >   }
> >   module_exit(virtio_net_driver_exit);
> 
> Acked-by: Jason Wang 

Thanks for the review!
I merged it before the tag and don't want to rebase.
Sorry about that.

-- 
MST

Re: [PATCH] fortify: Use WARN instead of BUG for now

2017-07-26 Thread Daniel Micay

It should just be renamed from fortify_panic -> fortify_error, including
in arch/x86/boot/compressed/misc.c and arch/x86/boot/compressed/misc.c.
It can use WARN instead of BUG, with a 'default n', !COMPILE_TEST
option to use BUG again. Otherwise it needs to be patched downstream
when that's wanted.

I don't think splitting it is the right approach to improving the
runtime error handling. That only makes sense for the compile-time
errors due to the limitations of __attribute__((error)). Can we think
about that before changing it? Just make it use WARN for now.

The best debugging experience would be passing along the sizes and
having the fortify_error function convert that into nice error messages.
For memcpy(p, q, n), n can be larger than both the detected sizes of p
and q, not just either one. The error should just be saying the function
name and printing the copy size and maximum sizes of p and q. That's
going to increase the code size too but I think splitting it will be
worse and it goes in the wrong direction in terms of complexity. It's
going to make future extensions / optimization harder if it's split.
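
A minimal userspace sketch of that idea, under stated assumptions: `copy_overflows()` and `fortify_report()` are hypothetical names, and the message format is only an example. One reporting function receives the function name, the copy size and both maximum sizes:

```c
#include <stddef.h>
#include <stdio.h>

/* Nonzero when copying n bytes would overrun either buffer. */
int copy_overflows(size_t n, size_t dst_size, size_t src_size)
{
	return n > dst_size || n > src_size;
}

/*
 * Hypothetical reporter: a single message naming the function and all
 * three sizes, instead of separate read/write overflow variants.
 * Returns the snprintf() result.
 */
int fortify_report(char *buf, size_t buflen, const char *func,
		   size_t n, size_t dst_size, size_t src_size)
{
	return snprintf(buf, buflen,
			"%s: copy size %zu, max dst size %zu, max src size %zu",
			func, n, dst_size, src_size);
}
```

With memcpy(p, q, 64), an 8-byte p and a 32-byte q, the single report covers both overruns at once.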

[REGRESSION 4.13-rc] NFS returns -EACCESS at the first read

2017-07-26 Thread Takashi Iwai

Hi,

I seem to be hitting a regression in the NFS client on today's Linus git
tree.  The symptom is that a file read over NFS occasionally returns
-EACCESS at the first read.  When I try to read the same file again
(or do some other thing), I can read it successfully.

The git bisection led to the commit
bd8b2441742b49c76bec707757bd9c028ea9838e
NFS: Store the raw NFS access mask in the inode's access cache


Any further hint for debugging?


thanks,

Takashi

Re: [linux-sunxi] [PATCH 10/10] ARM: dts: sun8i: Add SY8106A regulator to Orange Pi PC

2017-07-26 Thread Ondřej Jirman

Maxime Ripard wrote on Wed, 26 Jul 2017 at 13:44 +0200:
> Hi,
> 
> On Wed, Jul 26, 2017 at 12:23:48PM +0200, Ondřej Jirman wrote:
> > Hi,
> > 
> > icen...@aosc.io wrote on Wed, 26 Jul 2017 at 15:36 +0800:
> > > 
> > > > > > 
> > > > > > Otherwise
> > > > > > 
> > > > > > > +   regulator-max-microvolt = <140>;
> > > > > > > +   regulator-ramp-delay = <200>;
> > > > > > 
> > > > > > Is this an actual constraint of the SoC? Or is it a characteristic
> > > > > > of the regulator? If it is the latter, it belongs in the driver.
> > > > > > AFAIK the regulator supports varying the ramp delay (slew rate).
> > > 
> > > I don't know...
> > > 
> > > Maybe I should ask Ondrej?
> > 
> > It is probably neither.
> > 
> > It is used to calculate a delay inserted by the kernel between setting
> > a new target voltage over I2C and changing the frequency of the CPU.
> > The actual delay is calculated by the difference between previous and
> > the new voltage.
> > 
> > I don't remember seeing anything in the datasheet of the regulator.
> > This is just some low value that works.
> > 
> > It would probably be dependent on the capacitance on the output of the
> > regulator, actual load (which varies), etc. So it is a board specific
> > value. One could measure it with an oscilloscope if there's a need to
> > optimize this.
> 
> If this is a reasonable default, then this should be in the
> driver. You can't expect anyone to properly calculate a ramp delay and
> have access to both a scope and the CPU power lines.

It translates to 1ms per 0.2V which is highly conservative. The real
times will be in 1-10us range. So I guess this could be a default in
the driver.
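
To spell out the arithmetic, a minimal sketch assuming the ramp rate is given in uV/us (the unit of regulator-ramp-delay) and that the wait is rounded up. `ramp_wait_us()` is a made-up helper, not a regulator core function:

```c
#include <stdlib.h>

/*
 * Microseconds to wait after a voltage change, given a ramp rate in
 * uV/us. Rounds up so the wait is never shorter than the ramp.
 */
unsigned int ramp_wait_us(int old_uv, int new_uv, unsigned int ramp_uv_per_us)
{
	unsigned int delta = (unsigned int)abs(new_uv - old_uv);

	if (ramp_uv_per_us == 0)
		return 0;	/* no ramp rate configured: no delay */

	return (delta + ramp_uv_per_us - 1) / ramp_uv_per_us;
}
```

A 0.2 V step is 200000 uV, and 200000 / 200 = 1000 us, which is the 1 ms figure above.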

regards,
  o.

> Maxime
> 
> -- 
> Maxime Ripard, Free Electrons
> Embedded Linux and Kernel engineering
> http://free-electrons.com
> 


Re: [RFC][PATCH] thunderbolt: icm: Ignore mailbox errors in icm_suspend()

2017-07-26 Thread Rafael J. Wysocki

On Wednesday, July 26, 2017 11:32:44 AM Mika Westerberg wrote:
> On Tue, Jul 25, 2017 at 06:10:57PM +0200, Rafael J. Wysocki wrote:
> > On Tuesday, July 25, 2017 01:00:12 PM Mika Westerberg wrote:
> > > On Tue, Jul 25, 2017 at 01:31:00AM +0200, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki 
> > > > 
> > > > On one of my test machines nhi_mailbox_cmd() called from icm_suspend()
> > > > times out and returns an error which is then propagated to the
> > > > caller and causes the entire system suspend to be aborted, which isn't
> > > > very useful.
> > > > 
> > > > Instead of aborting system suspend, print the error into the log
> > > > and continue.
> > > 
> > > I agree, it should not prevent suspend but I wonder why it fails in the
> > > first place? Can you check what is the return value?
> > 
> > As per the above, the error is a timeout, ie. -ETIMEDOUT.
> 
> Ah, right I somehow missed that.
> 
> Does it have Falcon Ridge controller or Alpine Ridge?

I'll check later today, but I guess you'll know (see below).

> Just to make sure, can you increase the timeout in nhi_mailbox_cmd()
> to 1000ms or so. It should not take that long though but better to check.

Well, I can do that, but I don't think it will help.

It just looks like the chip is not responding at all at that point.

> Which system is this BTW?

It's the Dell 9360. :-)

Sometimes after a reboot or a power cycle it starts in a state in which the
TBT controller and a USB one (which seem to be somehow connected)
appear to be dead or at least really flaky.  Basically, the box needs to be
power-cycled again to get rid of this condition and then everything works.

Thanks,
Rafael

Re: [REGRESSION 4.13-rc] NFS returns -EACCESS at the first read

2017-07-26 Thread Anna Schumaker

Hi Takashi,

On 07/26/2017 08:54 AM, Takashi Iwai wrote:
> Hi,
> 
> I seem to be hitting a regression in the NFS client on today's Linus git
> tree.  The symptom is that a file read over NFS occasionally returns
> -EACCESS at the first read.  When I try to read the same file again
> (or do some other thing), I can read it successfully.
> 
> The git bisection led to the commit
> bd8b2441742b49c76bec707757bd9c028ea9838e
> NFS: Store the raw NFS access mask in the inode's access cache
> 
> 
> Any further hint for debugging?

Does the patch in this email thread help? 
http://www.spinics.net/lists/linux-nfs/msg64930.html

Thanks,
Anna
> 
> 
> thanks,
> 
> Takashi

Re: [PATCH net] Revert "vhost: cache used event for better performance"

2017-07-26 Thread Michael S. Tsirkin

On Wed, Jul 26, 2017 at 04:03:17PM +0800, Jason Wang wrote:
> This reverts commit 809ecb9bca6a9424ccd392d67e368160f8b76c92, since it
> was reported to break vhost_net. We want to cache the used event and use
> it to check for notification. We try to validate the cached used event by
> checking whether or not it is ahead of new, but this is not correct
> all the time: it could be stale and there's no way to know about this.
> 
> Signed-off-by: Jason Wang 


Could you supply a bit more data here please?  How does it get stale?
What does guest need to do to make it stale?  This will be helpful if
anyone wants to bring it back, or if we want to extend the protocol.
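
For anyone following along, a minimal userspace copy of the window test that the cached value feeds into. `need_event()` is a renamed stand-in using the same arithmetic as vring_need_event() from the virtio ring header. It does not explain how the cache goes stale; it only shows that the answer depends on which event index is used:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when event_idx falls in [old, new_idx - 1] modulo 2^16, i.e. the
 * used index moved past the point where the guest asked to be notified.
 */
bool need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}
```

With old = 8 and new = 10, an event index of 9 asks for a notification while an event index of 5 does not, so acting on an outdated value can change the decision.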

> ---
>  drivers/vhost/vhost.c | 28 ++--
>  drivers/vhost/vhost.h |  3 ---
>  2 files changed, 6 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index e4613a3..9cb3f72 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -308,7 +308,6 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>   vq->avail = NULL;
>   vq->used = NULL;
>   vq->last_avail_idx = 0;
> - vq->last_used_event = 0;
>   vq->avail_idx = 0;
>   vq->last_used_idx = 0;
>   vq->signalled_used = 0;
> @@ -1402,7 +1401,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
>   r = -EINVAL;
>   break;
>   }
> - vq->last_avail_idx = vq->last_used_event = s.num;
> + vq->last_avail_idx = s.num;
>   /* Forget the cached index value. */
>   vq->avail_idx = vq->last_avail_idx;
>   break;
> @@ -2241,6 +2240,10 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>   __u16 old, new;
>   __virtio16 event;
>   bool v;
> + /* Flush out used index updates. This is paired
> +  * with the barrier that the Guest executes when enabling
> +  * interrupts. */
> + smp_mb();
>  
>   if (vhost_has_feature(vq, VIRTIO_F_NOTIFY_ON_EMPTY) &&
>   unlikely(vq->avail_idx == vq->last_avail_idx))
> @@ -2248,10 +2251,6 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  
>   if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
>   __virtio16 flags;
> - /* Flush out used index updates. This is paired
> -  * with the barrier that the Guest executes when enabling
> -  * interrupts. */
> - smp_mb();
>   if (vhost_get_avail(vq, flags, >avail->flags)) {
>   vq_err(vq, "Failed to get flags");
>   return true;
> @@ -2266,26 +2265,11 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>   if (unlikely(!v))
>   return true;
>  
> - /* We're sure if the following conditions are met, there's no
> -  * need to notify guest:
> -  * 1) cached used event is ahead of new
> -  * 2) old to new updating does not cross cached used event. */
> - if (vring_need_event(vq->last_used_event, new + vq->num, new) &&
> - !vring_need_event(vq->last_used_event, new, old))
> - return false;
> -
> - /* Flush out used index updates. This is paired
> -  * with the barrier that the Guest executes when enabling
> -  * interrupts. */
> - smp_mb();
> -
>   if (vhost_get_avail(vq, event, vhost_used_event(vq))) {
>   vq_err(vq, "Failed to get used event idx");
>   return true;
>   }
> - vq->last_used_event = vhost16_to_cpu(vq, event);
> -
> - return vring_need_event(vq->last_used_event, new, old);
> + return vring_need_event(vhost16_to_cpu(vq, event), new, old);
>  }
>  
>  /* This actually signals the guest, using eventfd. */
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index f720958..bb7c29b 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -115,9 +115,6 @@ struct vhost_virtqueue {
>   /* Last index we used. */
>   u16 last_used_idx;
>  
> - /* Last used evet we've seen */
> - u16 last_used_event;
> -
>   /* Used flags */
>   u16 used_flags;
>  
> -- 
> 2.7.4

Re: [PATCH RFC] sched: Allow migrating kthreads into online but inactive CPUs

2017-07-26 Thread Paul E. McKenney

On Tue, Jul 25, 2017 at 06:58:21PM +0200, Peter Zijlstra wrote:
> Hi,
> 
> On Sat, Jun 17, 2017 at 08:10:08AM -0400, Tejun Heo wrote:
> > Per-cpu workqueues have been tripping CPU affinity sanity checks while
> > a CPU is being offlined.  A per-cpu kworker ends up running on a CPU
> > which isn't its target CPU while the CPU is online but inactive.
> > 
> > While the scheduler allows kthreads to wake up on an online but
> > inactive CPU, it doesn't allow a running kthread to be migrated to
> > such a CPU, which leads to an odd situation where setting affinity on
> > a sleeping and running kthread leads to different results.
> > 
> > Each mem-reclaim workqueue has one rescuer which guarantees forward
> > progress and the rescuer needs to bind itself to the CPU which needs
> > help in making forward progress; however, due to the above issue,
> > while set_cpus_allowed_ptr() succeeds, the rescuer doesn't end up on
> > the correct CPU if the CPU is in the process of going offline,
> > tripping the sanity check and executing the work item on the wrong
> > CPU.
> > 
> > This patch updates __migrate_task() so that kthreads can be migrated
> > into an inactive but online CPU.
> > 
> > Signed-off-by: Tejun Heo 
> > Reported-by: "Paul E. McKenney" 
> > Reported-by: Steven Rostedt 
> 
> Hmm.. so the rules for running on !active && online are slightly
> stricter than just being a kthread, how about the below, does that work
> too?

Of 24 one-hour runs of the TREE07 rcutorture scenario, two had stalled
tasks with this patch.  One of them had more than 200 instances, the other
two instances.  In contrast, a 30-hour run a week ago with Tejun's patch
completed cleanly.  Here "stalled task" means that one of rcutorture's
update-side kthreads fails to make any progress for more than 15 seconds.
Grace periods are progressing, but a kthread waiting for a grace period
isn't making progress, and is stuck with its ->state field at 0x402,
that is TASK_NOLOAD|TASK_UNINTERRUPTIBLE.  Which is as if it never got
the wakeup, given that it is sleeping on schedule_timeout_idle().

Now, two of 24 might just be bad luck, but I haven't seen anything like
this out of TREE07 since I queued Tejun's patch, so I am inclined to
view your patch below with considerable suspicion.

I -am- seeing this out of TREE01, even with Tejun's patch, but that
scenario sets maxcpu=8 and nr_cpus=43, which seems to be tickling an issue
that several other people are seeing.  Others' testing seems to indicate
that setting CONFIG_SOFTLOCKUP_DETECTOR=y suppresses this issue, but I
need to do an overnight run to check my test cases, and that is tonight.

So there might be something else going on as well.

Thanx, Paul

>  kernel/sched/core.c | 36 ++--
>  1 file changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d3d39a283beb..59b667c16826 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -894,6 +894,22 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
>  }
> 
>  #ifdef CONFIG_SMP
> +
> +/*
> + * Per-CPU kthreads are allowed to run on !actie && online CPUs, see
> + * __set_cpus_allowed_ptr() and select_fallback_rq().
> + */
> +static inline bool is_per_cpu_kthread(struct task_struct *p)
> +{
> + if (!(p->flags & PF_KTHREAD))
> + return false;
> +
> + if (p->nr_cpus_allowed != 1)
> + return false;
> +
> + return true;
> +}
> +
>  /*
>   * This is how migration works:
>   *
> @@ -951,8 +967,13 @@ struct migration_arg {
>  static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
>struct task_struct *p, int dest_cpu)
>  {
> - if (unlikely(!cpu_active(dest_cpu)))
> - return rq;
> + if (is_per_cpu_kthread(p)) {
> + if (unlikely(!cpu_online(dest_cpu)))
> + return rq;
> + } else {
> + if (unlikely(!cpu_active(dest_cpu)))
> + return rq;
> + }
> 
>   /* Affinity changed (again). */
>   if (!cpumask_test_cpu(dest_cpu, >cpus_allowed))
> @@ -1482,10 +1503,13 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>   for (;;) {
>   /* Any allowed, online CPU? */
>   for_each_cpu(dest_cpu, >cpus_allowed) {
> - if (!(p->flags & PF_KTHREAD) && !cpu_active(dest_cpu))
> - continue;
> - if (!cpu_online(dest_cpu))
> - continue;
> + if (is_per_cpu_kthread(p)) {
> + if (!cpu_online(dest_cpu))
> + continue;
> + } else {
> + if (!cpu_active(dest_cpu))
> + continue;
> + }
>   goto out;
>

Re: [PATCH v1] xen: get rid of paravirt op adjust_exception_frame

2017-07-26 Thread Boris Ostrovsky




On 7/24/2017 10:28 AM, Juergen Gross wrote:
> When running as Xen pv-guest the exception frame on the stack contains
> %r11 and %rcx in addition to the other data pushed by the processor.
>
> Instead of having a paravirt op being called for each exception type
> prepend the Xen specific code to each exception entry. When running as
> Xen pv-guest just use the exception entry with prepended instructions,
> otherwise use the entry without the Xen specific code.
>
> Signed-off-by: Juergen Gross


Reviewed-by: Boris Ostrovsky 

(I'd s/xen/x86/ in subject to get x86 maintainers' attention ;-))

Re: [PATCH 1/2] printk/console: Always disable boot consoles that use init memory before it is freed

2017-07-26 Thread Sergey Senozhatsky

On (07/14/17 14:51), Petr Mladek wrote:
> From: Matt Redfearn 
> 
> Commit 4c30c6f566c0 ("kernel/printk: do not turn off bootconsole in
> printk_late_init() if keep_bootcon") added a check on keep_bootcon to
> ensure that boot consoles were kept around until the real console is
> registered.
> 
> This can lead to problems if the boot console data and code are in the
> init section, since it can be freed before the boot console is
> unregistered.
> 
> Commit 81cc26f2bd11 ("printk: only unregister boot consoles when
> necessary") fixed this in a better way. It allowed keeping boot consoles
> that did not use init data. Unfortunately it did not remove the check
> of keep_bootcon.
> 
> This can lead to crashes and weird panics when the bootconsole is
> accessed after free, especially if page poisoning is in use and the
> code / data have been overwritten with a poison value.
> 
> To prevent this, always free the boot console if it is within the init
> section. In addition, print a warning that the console is removed
> prematurely.
> 
> Finally there is a new comment on how to avoid the warning. It replaces
> an explanation that duplicated a more comprehensive function
> description a few lines above.
> 
> Fixes: 4c30c6f566c0 ("kernel/printk: do not turn off bootconsole in printk_late_init() if keep_bootcon")
> Signed-off-by: Matt Redfearn 
> [pmla...@suse.com: print the warning, code and comments clean up]
> Signed-off-by: Petr Mladek 

Reviewed-by: Sergey Senozhatsky 

-ss

Re: [PATCH 2/2] printk/console: Enhance the check for consoles using init memory

2017-07-26 Thread Sergey Senozhatsky

On (07/14/17 14:51), Petr Mladek wrote:
> printk_late_init() is responsible for disabling boot consoles that
> use init memory. It checks the address of struct console for this.
> 
> But this is not enough. For example, there are several early
> consoles that have the write() method in the init section and
> struct console in the normal section. They are not disabled
> and could cause strange and hard-to-debug system states.
> 
> It is even more complicated by the macros EARLYCON_DECLARE() and
> OF_EARLYCON_DECLARE() where various struct members are set at
> runtime by the provided setup() function.
> 
> I have tried to reproduce this problem and forced the classic uart
> early console to stay by using the keep_bootcon parameter. In particular,
> I used earlycon=uart,io,0x3f8 keep_bootcon console=ttyS0,115200.
> The system did not boot:
> 
> [1.570496] PM: Image not found (code -22)
> [1.570496] PM: Image not found (code -22)
> [1.571886] PM: Hibernation image not present or could not be loaded.
> [1.571886] PM: Hibernation image not present or could not be loaded.
> [1.576407] Freeing unused kernel memory: 2528K
> [1.577244] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
> 
> The double lines are caused by having both the early uart console and
> the ttyS0 console enabled at the same time. The early console stopped
> working when the init memory was freed. Fortunately, the invalid
> call was caught by the NX-protected page check and did not cause
> any silent, subtle problems.
> 
> This patch adds a check for many other addresses stored in
> struct console. It omits setup() and match() that are used
> only when the console is registered. Therefore they have
> already been used at this point and there is no reason
> to use them again.
> 
> Signed-off-by: Petr Mladek 

Reviewed-by: Sergey Senozhatsky 

-ss
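
For readers following the quoted patch description, the idea of checking several addresses stored in a console structure against the init section can be sketched in user-space C. This is only an illustration under stated assumptions: struct mem_range, struct console_addrs and console_uses_init_mem() are hypothetical names invented here, and the kernel patch itself may perform the check differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the bounds of the init section. */
struct mem_range {
	uintptr_t start;	/* inclusive */
	uintptr_t end;		/* exclusive */
};

/* Hypothetical, simplified view of a console: only the addresses matter. */
struct console_addrs {
	uintptr_t self;		/* address of the structure itself */
	uintptr_t write;	/* address of the write() callback */
	uintptr_t read;		/* address of the read() callback, 0 if unset */
	uintptr_t data;		/* address of private data, 0 if unset */
};

/* An unset (zero) address never counts as being inside the range. */
static bool in_range(const struct mem_range *r, uintptr_t addr)
{
	return addr != 0 && addr >= r->start && addr < r->end;
}

/* True if any address that may still be used after boot lies in init memory. */
bool console_uses_init_mem(const struct mem_range *init,
			   const struct console_addrs *con)
{
	return in_range(init, con->self) ||
	       in_range(init, con->write) ||
	       in_range(init, con->read) ||
	       in_range(init, con->data);
}
```

Checking only the `self` field would miss the case described in the quoted text, where the structure lives in normal memory but its write() callback lives in the init section.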

[RFC PATCH] mm: memcg: fix css double put in mem_cgroup_iter

2017-07-26 Thread Wenwei Tao

From: Wenwei Tao 

By removing the child cgroup while the parent cgroup is
under reclaim, we could trigger the following kernel panic
on kernel 3.10:

kernel BUG at kernel/cgroup.c:893!
 invalid opcode:  [#1] SMP
 CPU: 1 PID: 22477 Comm: kworker/1:1 Not tainted 3.10.107 #1
 Workqueue: cgroup_destroy css_dput_fn
 task: 8817959a5780 ti: 8817e8886000 task.ti: 8817e8886000
 RIP: 0010:[]  []
cgroup_diput+0xc0/0xf0
 RSP: :8817e8887da0  EFLAGS: 00010246
 RAX:  RBX: 8817a5dd5d40 RCX: dead0200
 RDX:  RSI: 8817973a6910 RDI: 8817f54c2a00
 RBP: 8817e8887dc8 R08: 8817a5dd5dd0 R09: df9fb35794b01820
 R10: df9fb35794b01820 R11: 7fa95b1efcda R12: 8817a5dd5d9c
 R13: 8817f38b3a40 R14: 8817973a6910 R15: 8817973a6910
 FS:  () GS:88181f22()
knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 7fa6e6234000 CR3: 00179f19d000 CR4: 000407e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Stack:
  8817a5dd5d40 8817a5dd5d9c 8817f38b3a40 8817973a6910
  0040 8817e8887df8 811b37c2 8817fa23c000
  8817f57dbb80 88181f232ac0 88181f237500 8817e8887e10
 Call Trace:
  [] dput+0x1a2/0x2f0
  [] cgroup_dput.isra.21+0x1c/0x30
  [] css_dput_fn+0x1d/0x20
  [] process_one_work+0x17c/0x460
  [] worker_thread+0x116/0x3b0
  [] ? manage_workers.isra.25+0x290/0x290
  [] kthread+0xc0/0xd0
  [] ? insert_kthread_work+0x40/0x40
  [] ret_from_fork+0x58/0x90
  [] ? insert_kthread_work+0x40/0x40
 Code: 41 5e 41 5f 5d c3 0f 1f 44 00 00 48 8b 7f 78 48 8b 07 a8 01 74 15
48 81 c7 30 01 00 00 48 c7 c6 a0 a7 0c 81 e8 b2 83 02 00 eb c8 <0f> 0b
49 8b 4e 18 48 c7 c2 7e f1 7a 81 be 85 03 00 00 48 c7 c7
 RIP  [] cgroup_diput+0xc0/0xf0
 RSP 
 ---[ end trace 85eeea5212c44f51 ]---


I think there is a css double put in mem_cgroup_iter. Under reclaim,
we call mem_cgroup_iter the first time with prev == NULL, get the
last_visited memcg from the per-zone reclaim_iter, and then call
__mem_cgroup_iter_next to try to get the next alive memcg.
__mem_cgroup_iter_next can return NULL if last_visited is already the
last one, so we put last_visited's css and continue to the next
iteration of the while loop. This time we might not do
css_tryget(&last_visited->css) if the dead_count has changed, but we
still do css_put(&last_visited->css). We put it twice, which could
trigger the BUG_ON at kernel/cgroup.c:893.
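
The sequence described above can be modelled with a small user-space sketch. It is not the kernel code: struct ref_obj and simulate_iter() are hypothetical names, the reference count is a plain int, and the loop is hard-wired to two passes in which only the first pass takes a reference.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a css with a reference count. */
struct ref_obj {
	int refcnt;
};

/*
 * Model two passes of the iterator loop over a single object.
 * clear_after_put models the proposed fix (last_visited = NULL
 * after the put). Returns the final reference count.
 */
int simulate_iter(struct ref_obj *obj, bool clear_after_put)
{
	struct ref_obj *last_visited = NULL;
	int pass;

	for (pass = 0; pass < 2; pass++) {
		/*
		 * Only the first pass sees a matching dead_count, so only
		 * the first pass reloads last_visited and takes a reference.
		 */
		if (pass == 0) {
			last_visited = obj;
			last_visited->refcnt++;	/* models css_tryget() */
		}

		/* The "next" lookup returns NULL here, so the loop repeats. */

		if (last_visited) {
			last_visited->refcnt--;	/* models css_put() */
			if (clear_after_put)
				last_visited = NULL;
		}
	}

	return obj->refcnt;
}
```

Starting from a count of 1, the model ends at 0 without the fix (one put too many) and at 1 with it.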

Reported-by: Wang Yu 
Tested-by: Wang Yu 
Signed-off-by: Wenwei Tao 
---
 mm/memcontrol.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 437ae2c..3d7a046 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1230,8 +1230,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		memcg = __mem_cgroup_iter_next(root, last_visited);
 
 		if (reclaim) {
-			if (last_visited && last_visited != root)
+			if (last_visited && last_visited != root) {
 				css_put(&last_visited->css);
+				last_visited = NULL;
+			}
 
 			iter->last_visited = memcg;
 			smp_wmb();
-- 
1.8.3.1

Re: [PATCH 2/2] ceph: pagecache writeback fault injection switch

2017-07-26 Thread Yan, Zheng

On Tue, Jul 25, 2017 at 10:50 PM, Jeff Layton  wrote:
> From: Jeff Layton 
>
> Testing ceph for proper writeback error handling turns out to be quite
> difficult. I tried using iptables to block traffic but that didn't
> give reliable results.
>
> I hacked in this wb_fault switch that makes the filesystem pretend that
> writeback failed, even when it succeeds. With this, I could verify that
> cephfs fsync error reporting does work properly.
>
> Signed-off-by: Jeff Layton 
> ---
>  fs/ceph/addr.c| 7 +++
>  fs/ceph/debugfs.c | 8 +++-
>  fs/ceph/super.h   | 2 ++
>  3 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 50836280a6f8..a3831d100e16 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -584,6 +584,10 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
>page_off, len,
>truncate_seq, truncate_size,
>&inode->i_mtime, &page, 1);
> +
> +   if (fsc->wb_fault && err >= 0)
> +   err = -EIO;
> +
> if (err < 0) {
> struct writeback_control tmp_wbc;
> if (!wbc)
> @@ -666,6 +670,9 @@ static void writepages_finish(struct ceph_osd_request *req)
> struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
> bool remove_page;
>
> +   if (fsc->wb_fault && rc >= 0)
> +   rc = -EIO;
> +
> dout("writepages_finish %p rc %d\n", inode, rc);
> if (rc < 0) {
> mapping_set_error(mapping, rc);
> diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
> index 4e2d112c982f..e1e6eaa12031 100644
> --- a/fs/ceph/debugfs.c
> +++ b/fs/ceph/debugfs.c
> @@ -197,7 +197,6 @@ CEPH_DEFINE_SHOW_FUNC(caps_show)
>  CEPH_DEFINE_SHOW_FUNC(dentry_lru_show)
>  CEPH_DEFINE_SHOW_FUNC(mds_sessions_show)
>
> -
>  /*
>   * debugfs
>   */
> @@ -231,6 +230,7 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
> debugfs_remove(fsc->debugfs_caps);
> debugfs_remove(fsc->debugfs_mdsc);
> debugfs_remove(fsc->debugfs_dentry_lru);
> +   debugfs_remove(fsc->debugfs_wb_fault);
>  }
>
>  int ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
> @@ -298,6 +298,12 @@ int ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
> if (!fsc->debugfs_dentry_lru)
> goto out;
>
> +   fsc->debugfs_wb_fault = debugfs_create_bool("wb_fault",
> +   0600, fsc->client->debugfs_dir,
> +   &fsc->wb_fault);
> +   if (!fsc->debugfs_wb_fault)
> +   goto out;
> +
> return 0;
>
>  out:
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index f02a2225fe42..a38fd6203b77 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -84,6 +84,7 @@ struct ceph_fs_client {
>
> unsigned long mount_state;
> int min_caps;  /* min caps i added */
> +   bool wb_fault;
>
> struct ceph_mds_client *mdsc;
>
> @@ -100,6 +101,7 @@ struct ceph_fs_client {
> struct dentry *debugfs_bdi;
> struct dentry *debugfs_mdsc, *debugfs_mdsmap;
> struct dentry *debugfs_mds_sessions;
> +   struct dentry *debugfs_wb_fault;
>  #endif
>

I think it's better not to enable this feature by default. How about
enabling it via a compile-time option or a mount option?

Regards
Yan, Zheng

>  #ifdef CONFIG_CEPH_FSCACHE
> --
> 2.13.3
>

[PATCH] Drivers : edac : checkpatch.pl clean up

2017-07-26 Thread Himanshu Jha

Fixed 'no assignment in if condition' coding style issue and removed 
unnecessary spaces at the start of a line.
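
As an illustration of the first style issue, the sketch below shows the same check written both ways. The helper lookup_name() and both wrapper functions are hypothetical and exist only for this example; they are not taken from the driver.

```c
#include <stddef.h>

/* Hypothetical lookup: returns NULL for a negative id. */
static const char *lookup_name(int id)
{
	return id >= 0 ? "device" : NULL;
}

/* Style that checkpatch.pl complains about: assignment inside the condition. */
int has_name_old_style(int id)
{
	const char *name;

	if ((name = lookup_name(id)) == NULL)
		return 0;

	return name[0] != '\0';
}

/* Preferred style: assign first, then test the result on its own line. */
int has_name_new_style(int id)
{
	const char *name;

	name = lookup_name(id);
	if (name == NULL)
		return 0;

	return name[0] != '\0';
}
```

Both functions behave identically; only the readability of the condition changes.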

Signed-off-by: Himanshu Jha 
---
 drivers/edac/i82860_edac.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/edac/i82860_edac.c b/drivers/edac/i82860_edac.c
index 236c813..c8c1c4d 100644
--- a/drivers/edac/i82860_edac.c
+++ b/drivers/edac/i82860_edac.c
@@ -282,7 +282,9 @@ static void i82860_remove_one(struct pci_dev *pdev)
if (i82860_pci)
edac_pci_release_generic_ctl(i82860_pci);
 
-   if ((mci = edac_mc_del_mc(&pdev->dev)) == NULL)
+   mci = edac_mc_del_mc(&pdev->dev);
+
+   if (mci == NULL)
return;
 
edac_mc_free(mci);
@@ -312,10 +314,11 @@ static int __init i82860_init(void)
 
edac_dbg(3, "\n");
 
-   /* Ensure that the OPSTATE is set correctly for POLL or NMI */
-   opstate_init();
+   /* Ensure that the OPSTATE is set correctly for POLL or NMI */
+   opstate_init();
 
-   if ((pci_rc = pci_register_driver(&i82860_driver)) < 0)
+   pci_rc = pci_register_driver(&i82860_driver);
+   if (pci_rc < 0)
goto fail0;
 
if (!mci_pdev) {
-- 
2.7.4

Re: [PATCH 1/2] ceph: use errseq_t for writeback error reporting

2017-07-26 Thread Yan, Zheng

On Tue, Jul 25, 2017 at 10:50 PM, Jeff Layton  wrote:
> From: Jeff Layton 
>
> Ensure that when writeback errors are marked that we report those to all
> file descriptions that were open at the time of the error.
>
> Signed-off-by: Jeff Layton 
> ---
>  fs/ceph/caps.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 7007ae2a5ad2..13f6edf24acd 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -2110,7 +2110,7 @@ int ceph_fsync(struct file *file, loff_t start, loff_t 
> end, int datasync)
>
> dout("fsync %p%s\n", inode, datasync ? " datasync" : "");
>
> -   ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> +   ret = file_write_and_wait_range(file, start, end);
> if (ret < 0)
> goto out;
>
> --
> 2.13.3
>

Reviewed-by: "Yan, Zheng"

RE: linux-next: Tree for Jul 26

2017-07-26 Thread Rosen, Rami

Hi Sergey,
Paolo Abeni had sent a patch:
https://www.mail-archive.com/netdev@vger.kernel.org/msg179192.html

Regards,
Rami Rosen

-----Original Message-----
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Sergey Senozhatsky
Sent: Wednesday, July 26, 2017 13:49
To: Paolo Abeni; Stephen Rothwell
Cc: Linux-Next Mailing List; Linux Kernel Mailing List; Paul Moore; David S. Miller; net...@vger.kernel.org
Subject: Re: linux-next: Tree for Jul 26

Hello,

On (07/26/17 16:12), Stephen Rothwell wrote:
> Hi all,
> 
> Changes since 20170725:
> 
> Non-merge commits (relative to Linus' tree): 2358
>  2466 files changed, 86994 insertions(+), 44655 deletions(-)


dce4551cb2adb1ac ("udp: preserve head state for IP_CMSG_PASSSEC") causes a build error:

net/ipv4/udp.c: In function ‘__udp_queue_rcv_skb’:
net/ipv4/udp.c:1789:49: error: ‘struct sk_buff’ has no member named ‘sp’; did you mean ‘sk’?
  if (likely(IPCB(skb)->opt.optlen == 0 && !skb->sp))
 ^

-ss

Re: [PATCH v8 1/3] perf: cavium: Support memory controller PMU counters

2017-07-26 Thread Jan Glauber

On Wed, Jul 26, 2017 at 01:47:35PM +0100, Suzuki K Poulose wrote:
> On 26/07/17 12:19, Jan Glauber wrote:
> >On Tue, Jul 25, 2017 at 04:39:18PM +0100, Suzuki K Poulose wrote:
> >>On 25/07/17 16:04, Jan Glauber wrote:
> >>>Add support for the PMU counters on Cavium SOC memory controllers.
> >>>
> >>>This patch also adds generic functions to allow supporting more
> >>>devices with PMU counters.
> >>>
> >>>Properties of the LMC PMU counters:
> >>>- not stoppable
> >>>- fixed purpose
> >>>- read-only
> >>>- one PCI device per memory controller
> >>>
> >>>Signed-off-by: Jan Glauber 
> >>>---
> >>>drivers/perf/Kconfig   |   8 +
> >>>drivers/perf/Makefile  |   1 +
> >>>drivers/perf/cavium_pmu.c  | 424 +
> >>>include/linux/cpuhotplug.h |   1 +
> >>>4 files changed, 434 insertions(+)
> >>>create mode 100644 drivers/perf/cavium_pmu.c
> >>>
> >>>diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
> >>>index e5197ff..a46c3f0 100644
> >>>--- a/drivers/perf/Kconfig
> >>>+++ b/drivers/perf/Kconfig
> >>>@@ -43,4 +43,12 @@ config XGENE_PMU
> >>>help
> >>>  Say y if you want to use APM X-Gene SoC performance monitors.
> >>>
> >>>+config CAVIUM_PMU
> >>>+  bool "Cavium SOC PMU"
> >>
> >>Is there any specific reason why this can't be built as a module ?
> >
> >Yes. I don't know how to load the module automatically. I can't make it
> >a pci driver as the EDAC driver "owns" the device (and having two
> >drivers for one device won't work as far as I know). I tried to hook
> >into the EDAC driver but the EDAC maintainer was not overly welcoming
> >of that approach.
> >
> >And while it would be possible to have it as a module I think it is of
> >no use if it requires manual loading. But maybe there is a simple
> >solution I'm missing here?
> 
> If you are talking about a Cavium specific EDAC driver, maybe we could
> make that depend on this driver "at runtime" via symbols (maybe even
> trigger the probe of PMU), which will be referenced only when
> CONFIG_CAVIUM_PMU is defined. It is not the perfect solution, but that
> should do the trick.

I think that is roughly what I proposed in v6. Can you have a look at:

https://lkml.org/lkml/2017/6/23/333
https://patchwork.kernel.org/patch/9806427/

Probably there is a better way to do it. Or maybe we just keep it as
built-in for the time being.

--Jan

RE: [PATCH v3 00/16] Switchtec NTB Support

2017-07-26 Thread Allen Hubbe

From: Logan Gunthorpe
> Changes since v2:
> 
> - Reordered the ntb_test link patch per Allen
> - Removed an extra call to switchtec_ntb_init_mw
> - Fixed a typo in the switchtec.txt documentation.

Patches 5..16 (also 5 [was 6], and 14, objections notwithstanding):

Acked-by: Allen Hubbe 

> --
> 
> Changes since v1:
> 
> - Rebased onto latest ntb-next branch (with v4.13-rc1)
> - Reworked ntb_mw_count() function so that it can be called all the
>   time (per discussion with Allen)
> - Various spelling and formatting cleanups from Bjorn
> - Added request_module() call such that the NTB module is automatically
>   loaded when appropriate hardware exists.
> 
> --
> 
> Changes since the rfc:
> 
> - Rebased on ntb-next
> - Switched ntb_part_op to use sleep instead of delay
> - Dropped a number of useless dbg __func__ prints
> - Went back to the dynamic instead of the static class
> - Swapped the notifier block for a simple callback
> - Modified the new ntb api so that a couple functions with pidx
>   now must be called after link up. Per our discussion on the list.
> 
> --
> 
> This patchset implements Non-Transparent Bridge (NTB) support for the
> Microsemi Switchtec series of switches. We're looking for some
> review from the community at this point but hope to get it upstreamed
> for v4.14.
> 
> Switchtec NTB support is configured over the same function and bar
> as the management endpoint. Thus, the new driver hooks into the
> management driver which we had merged in v4.12. We use the class
> interface API to register an NTB device for every switchtec device
> which supports NTB (not all do).
> 
> The Switchtec hardware supports doorbells, memory windows and messages.
> Seeing there is no native scratchpad support, 128 spads are emulated
> through the use of a pre-setup memory window. The switch has 64
> doorbells which are shared between the two partitions and a
> configurable set of memory windows. While the hardware supports more
> than 2 partitions, this driver only supports the first two seeing
> the current NTB API only supports two hosts.
> 
> The driver has been tested with ntb_netdev and fully passes the
> ntb_test script.
> 
> This patchset is based off of ntb-next and can be found in this
> git repo:
> 
> https://github.com/sbates130272/linux-p2pmem.git switchtec_ntb_v3
> 
> *** BLURB HERE ***
> 
> Logan Gunthorpe (16):
>   switchtec: move structure definitions into a common header
>   switchtec: export class symbol for use in upper layer driver
>   switchtec: add NTB hardware register definitions
>   switchtec: add link event notifier callback
>   ntb: ntb_test: ensure the link is up before trying to configure the
> mws
>   ntb: ensure ntb_mw_get_align() is only called when the link is up
>   ntb: add check and comment for link up to mw_count() and
> mw_get_align()
>   switchtec_ntb: introduce initial NTB driver
>   switchtec_ntb: initialize hardware for memory windows
>   switchtec_ntb: initialize hardware for doorbells and messages
>   switchtec_ntb: add skeleton NTB driver
>   switchtec_ntb: add link management
>   switchtec_ntb: implement doorbell registers
>   switchtec_ntb: implement scratchpad registers
>   switchtec_ntb: add memory window support
>   switchtec_ntb: update switchtec documentation with notes for NTB
> 
>  Documentation/switchtec.txt |   12 +
>  MAINTAINERS |2 +
>  drivers/ntb/hw/Kconfig  |1 +
>  drivers/ntb/hw/Makefile |1 +
>  drivers/ntb/hw/mscc/Kconfig |9 +
>  drivers/ntb/hw/mscc/Makefile|1 +
>  drivers/ntb/hw/mscc/switchtec_ntb.c | 1211 +++
>  drivers/ntb/ntb_transport.c |   20 +-
>  drivers/ntb/test/ntb_perf.c |   18 +-
>  drivers/ntb/test/ntb_tool.c |6 +-
>  drivers/pci/switch/switchtec.c  |  316 ++--
>  include/linux/ntb.h |   11 +-
>  include/linux/switchtec.h   |  373 ++
>  tools/testing/selftests/ntb/ntb_test.sh |4 +
>  14 files changed, 1702 insertions(+), 283 deletions(-)
>  create mode 100644 drivers/ntb/hw/mscc/Kconfig
>  create mode 100644 drivers/ntb/hw/mscc/Makefile
>  create mode 100644 drivers/ntb/hw/mscc/switchtec_ntb.c
>  create mode 100644 include/linux/switchtec.h
> 
> --
> 2.11.0

Re: [RFC][PATCH] thunderbolt: icm: Ignore mailbox errors in icm_suspend()

2017-07-26 Thread Mika Westerberg

On Wed, Jul 26, 2017 at 02:48:54PM +0200, Rafael J. Wysocki wrote:
> On Wednesday, July 26, 2017 11:32:44 AM Mika Westerberg wrote:
> > On Tue, Jul 25, 2017 at 06:10:57PM +0200, Rafael J. Wysocki wrote:
> > > On Tuesday, July 25, 2017 01:00:12 PM Mika Westerberg wrote:
> > > > On Tue, Jul 25, 2017 at 01:31:00AM +0200, Rafael J. Wysocki wrote:
> > > > > From: Rafael J. Wysocki 
> > > > > 
> > > > > On one of my test machines nhi_mailbox_cmd() called from icm_suspend()
> > > > > times out and returns an error which then is propagated to the
> > > > > caller and causes the entire system suspend to be aborted which isn't
> > > > > very useful.
> > > > > 
> > > > > Instead of aborting system suspend, print the error into the log
> > > > > and continue.
> > > > 
> > > > I agree, it should not prevent suspend but I wonder why it fails in the
> > > > first place? Can you check what is the return value?
> > > 
> > > As per the above, the error is a timeout, ie. -ETIMEDOUT.
> > 
> > Ah, right I somehow missed that.
> > 
> > Does it have Falcon Ridge controller or Alpine Ridge?
> 
> I'll check later today, but I guess you'll know (see below).

No need to check, it is Alpine Ridge (since it is Dell 9360).

> > Just to make sure, can you increase the timeout in nhi_mailbox_cmd()
> > to 1000ms or so. It should not take that long though but better to check.
> 
> Well, I can do that, but I don't think it will help.
> 
> It just looks like the chip is not responding at all at that point.

I see.

Then I think we should apply your patch now and we can investigate this
further offline and hopefully find the root cause for the problem.

For this patch:

Acked-by: Mika Westerberg 

> > Which system is this, BTW?
> 
> It's the Dell 9360. :-)
> 
> Sometimes after a reboot or a power cycle it starts in a state in which the
> TBT controller and a USB one (which seem to be somehow connected)
> appear to be dead or at least really flaky.  Basically, the box needs to be
> power-cycled again to get rid of this condition and then everything works.

The xHCI controller is part of the Thunderbolt controller, so whenever
you have a normal USB-C device connected there, you should also see the
Alpine Ridge hierarchy in the lspci output, but the Thunderbolt host
controller itself is not there.

Re: netlink: NULL timer crash

2017-07-26 Thread Dmitry Vyukov

On Wed, Jul 26, 2017 at 3:09 PM,   wrote:
> Hi Dmitry,
>
> When I apply your reproducer to normal kernels (on Fedora), this scenario
> cannot be reproduced. Does this C source only work for KASAN kernels?

No, NULL derefs are detected without KASAN.


> On Thursday, March 23, 2017 at 8:55:52 PM UTC+8, Dmitry Vyukov wrote:
>>
>> Hello,
>>
>> The following program triggers call of NULL timer func:
>>
>>
>> https://gist.githubusercontent.com/dvyukov/c210d01c74b911273469a93862ea7788/raw/2a3182772a6a6e20af3e71c02c2a1c2895d803fb/gistfile1.txt
>>
>>
>> BUG: unable to handle kernel NULL pointer dereference at   (null)
>> IP:   (null)
>> PGD 0
>> Oops: 0010 [#1] SMP KASAN
>> Modules linked in:
>> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc3+ #365
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
>> 01/01/2011
>> task: 88006c634300 task.stack: 88006c64
>> RIP: 0010:  (null)
>> RSP: 0018:88006d1077c8 EFLAGS: 00010246
>> RAX: dc00 RBX: 880062bddb00 RCX: 8154e161
>> RDX: 1090c1f1 RSI:  RDI: 880062bddb00
>> RBP: 88006d1077e8 R08: fbfff0a936a8 R09: 0001
>> R10: 0001 R11: fbfff0a936a7 R12: 84860f80
>> R13:  R14: 880062bddb60 R15: 11000da20f05
>> FS:  () GS:88006d10()
>> knlGS:
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2:  CR3: 04e21000 CR4: 001406e0
>> Call Trace:
>>  
>>  neigh_timer_handler+0x365/0xd40 net/core/neighbour.c:944
>>  call_timer_fn+0x232/0x8c0 kernel/time/timer.c:1268
>>  expire_timers kernel/time/timer.c:1307 [inline]
>>  __run_timers+0x6f7/0xbd0 kernel/time/timer.c:1601
>>  run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
>>  __do_softirq+0x2d6/0xb54 kernel/softirq.c:284
>>  invoke_softirq kernel/softirq.c:364 [inline]
>>  irq_exit+0x1b1/0x1e0 kernel/softirq.c:405
>>  exiting_irq arch/x86/include/asm/apic.h:657 [inline]
>>  smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
>>  apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:487
>> RIP: 0010:native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:53
>> RSP: 0018:88006c647dc0 EFLAGS: 0286 ORIG_RAX: ff10
>> RAX: dc00 RBX: 11000d8c8fbb RCX: 
>> RDX: 109d8ed4 RSI: 0001 RDI: 84ec76a0
>> RBP: 88006c647dc0 R08: ed000d8c6861 R09: 
>> R10:  R11:  R12: fbfff09d8ed2
>> R13: 88006c647e78 R14: 84ec7690 R15: 0002
>>  
>>  arch_safe_halt arch/x86/include/asm/paravirt.h:98 [inline]
>>  default_idle+0xba/0x450 arch/x86/kernel/process.c:275
>>  arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:266
>>  default_idle_call+0x37/0x80 kernel/sched/idle.c:97
>>  cpuidle_idle_call kernel/sched/idle.c:155 [inline]
>>  do_idle+0x230/0x380 kernel/sched/idle.c:244
>>  cpu_startup_entry+0x18/0x20 kernel/sched/idle.c:346
>>  start_secondary+0x2a7/0x340 arch/x86/kernel/smpboot.c:275
>>  start_cpu+0x14/0x14 arch/x86/kernel/head_64.S:306
>> Code:  Bad RIP value.
>> RIP:   (null) RSP: 88006d1077c8
>> CR2: 
>> ---[ end trace 845120b8a0d21411 ]---
>>
>> On commit 093b995e3b55a0ae0670226ddfcb05bfbf0099ae
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Re: [PATCH] irqchip: create a Kconfig menu for irqchip drivers

2017-07-26 Thread Masahiro Yamada

2017-07-26 19:37 GMT+09:00 Marc Zyngier :
> On 26/07/17 11:18, Masahiro Yamada wrote:
>> Hi Marc,
>>
>>
>> 2017-07-26 17:04 GMT+09:00 Marc Zyngier :
>>> On 26/07/17 05:03, Masahiro Yamada wrote:
>>>> Some irqchip drivers have a Kconfig prompt.  When we run menuconfig
>>>> or friends, those drivers are directly listed in the "Device Drivers"
>>>> menu level.  This does not look nice.  Create a sub-system level menu.
>>>>
>>>> Signed-off-by: Masahiro Yamada 
>>>> ---
>>>>
>>>>  drivers/irqchip/Kconfig | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
>>>> index f1fd5f44d1d4..7b66313a2952 100644
>>>> --- a/drivers/irqchip/Kconfig
>>>> +++ b/drivers/irqchip/Kconfig
>>>> @@ -1,3 +1,5 @@
>>>> +menu "IRQ chip support"
>>>> +
>>>>  config IRQCHIP
>>>>   def_bool y
>>>>   depends on OF_IRQ
>>>> @@ -306,3 +308,5 @@ config QCOM_IRQ_COMBINER
>>>>   help
>>>> Say yes here to add support for the IRQ combiner devices embedded
>>>> in Qualcomm Technologies chips.
>>>> +
>>>> +endmenu
>>>>
>>>
>>> I'm very reluctant to introduce this. IMHO, interrupt controllers are
>>> way too low level a thing to let them be selected by the user. They
>>> really should be selected by the platform that needs them.
>>
>> This is true for the root irqchip.
>> Not necessarily true for child irqchips.
>
> I dispute that argument. We've been able to make this work so far
> *without* exposing yet another menu maze to the user. What has changed?


The irqchip maintainers applied drivers
with user-configurable Kconfig entries.




>>
>>
>>> Do you have any example in mind where having a user-selectable interrupt
>>> controller actually makes sense on its own?
>>
>> Yes.
>>
>> I see some user-selectable drivers in drivers/irqchip/Kconfig
>> and I'd like to add one more for my SoCs.
>>
>>
>> This patch:
>> https://github.com/uniphier/linux/commit/f39efdf0ce34f77ae9e324d9ec6c7f486f43a0ed
>>
>> This is really optional, so
>> I intentionally implemented it as a platform driver
>> instead of IRQCHIP_DECLARE().
>
> I really cannot see how this could be optional. It means that you could
> end up in a situation where the drivers for the devices behind this
> irqchip could have been compiled in, but not their interrupt controller.
> How useful is that?

In my case, the assumed irq consumer is GPIO.

If the irq consumer is probed before the irqchip,
its probe will be retried later via -EPROBE_DEFER.

If the irqchip is not compiled at all, right, the irq consumer will not work.
One possible (and general) solution is to specify "depends on" correctly
between the provider and the consumer.
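
The probe-ordering behaviour mentioned above can be illustrated with a toy user-space model. Everything here is hypothetical (struct fake_bus, irqchip_probe(), gpio_probe(), probe_all()); only the numeric value of EPROBE_DEFER matches the kernel's definition, and the real driver core retries deferred probes through its own deferred-probe list rather than a fixed second attempt.

```c
#include <stdbool.h>

/* Same value as the kernel's EPROBE_DEFER, redefined for user space. */
#define EPROBE_DEFER 517

/* Hypothetical bus state: has the interrupt controller been probed yet? */
struct fake_bus {
	bool irqchip_ready;
};

/* Provider: probing the irqchip makes it available to consumers. */
int irqchip_probe(struct fake_bus *bus)
{
	bus->irqchip_ready = true;
	return 0;
}

/* Consumer: asks to be retried later if its irqchip is not there yet. */
int gpio_probe(struct fake_bus *bus)
{
	return bus->irqchip_ready ? 0 : -EPROBE_DEFER;
}

/*
 * Probe the consumer first, then the provider (if it was built at all),
 * then retry the deferred consumer once. Returns the number of attempts
 * the consumer needed, or -1 if it never succeeded.
 */
int probe_all(struct fake_bus *bus, bool irqchip_built)
{
	int attempts = 1;

	if (gpio_probe(bus) == 0)
		return attempts;

	if (irqchip_built)
		irqchip_probe(bus);

	attempts++;
	return gpio_probe(bus) == 0 ? attempts : -1;
}
```

With the irqchip built, the consumer succeeds on its second attempt; without it, the consumer never probes, which is the case a correct "depends on" is meant to rule out.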



>> Looks like irq-ts4800.c, irq-keystone.c are modules as well.
>
> They are directly selected by their respective defconfig.


Are you sure?

As far as I see, they are not selected by anyone.


$ git grep 'TS4800_IRQ\|KEYSTONE_IRQ'
arch/arm/configs/keystone_defconfig:CONFIG_KEYSTONE_IRQ=y
arch/arm/configs/multi_v7_defconfig:CONFIG_KEYSTONE_IRQ=y
drivers/irqchip/Kconfig:config TS4800_IRQ
drivers/irqchip/Kconfig:config KEYSTONE_IRQ
drivers/irqchip/Makefile:obj-$(CONFIG_TS4800_IRQ)   += irq-ts4800.o
drivers/irqchip/Makefile:obj-$(CONFIG_KEYSTONE_IRQ) += irq-keystone.o



defconfig just provides a default value.

Users are allowed to disable the option from menuconfig.




> On arm64,
> which is what I expect your driver targets, you should simply select it
> in your platform entry.

OK, assuming your claim is correct,
we have 5 suspicious entries in drivers/irqchip/Kconfig.


config JCORE_AIC
bool "J-Core integrated AIC" if COMPILE_TEST

config TS4800_IRQ
tristate "TS-4800 IRQ controller"

config KEYSTONE_IRQ
tristate "Keystone 2 IRQ controller IP"

config EZNPS_GIC
bool "NPS400 Global Interrupt Manager (GIM)"

config QCOM_IRQ_COMBINER
bool "QCOM IRQ combiner support"



The prompt strings make the entries visible in menuconfig.
So, they should be removed.
The prompts are pointless if the options are supposed to be selected by others.

Also, tristate is pointless.
If they are supposed to be selected by platforms,
they have no chance to be a module.
They should be turned into bool (without prompt).

Is this what you mean?



-- 
Best Regards
Masahiro Yamada

[RFC]Add new mdev interface for QoS

2017-07-26 Thread Gao, Ping A

vfio-mdev provides the capability to let different guests share the
same physical device through mediated sharing. As a result, it brings a
requirement to control how the device is shared: we need a QoS-related
interface for mdev to manage virtual device resources.

For example, in practical use, vGPUs assigned to different guests almost
always have different performance requirements. Some guests may need
higher priority for real-time usage, while others may need a larger
portion of the GPU resource to get higher 3D performance. Correspondingly,
we can define some interfaces such as weight/cap for overall budget
control and priority for single-submission control.

So I suggest adding some common, vendor-agnostic attributes to the
mdev core sysfs for QoS purposes.

-Ping

Re: [PATCH net] Revert "vhost: cache used event for better performance"

2017-07-26 Thread Jason Wang




On 2017-07-26 20:57, Michael S. Tsirkin wrote:
> On Wed, Jul 26, 2017 at 04:03:17PM +0800, Jason Wang wrote:
>> This reverts commit 809ecb9bca6a9424ccd392d67e368160f8b76c92, since it
>> was reported to break vhost_net. We want to cache used event and use
>> it to check for notification. We try to validate the cached used event by
>> checking whether or not it was ahead of new, but this is not correct
>> all the time, it could be stale and there's no way to know about this.
>>
>> Signed-off-by: Jason Wang
>
> Could you supply a bit more data here please?  How does it get stale?
> What does guest need to do to make it stale?  This will be helpful if
> anyone wants to bring it back, or if we want to extend the protocol.



The problem is that we don't know whether or not the guest has published
a new used event. The check vring_need_event(vq->last_used_event,
new + vq->num, new) is not sufficient to check for this.
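
To make the check above concrete, here is a small user-space sketch. need_event() repeats the 16-bit arithmetic of vring_need_event() from the virtio ring header; cached_event_looks_valid() is a hypothetical wrapper added only for this example, and the reading of the result in the note below is an interpretation, not a statement from the patch.

```c
#include <stdint.h>

/* Same 16-bit wrap-around arithmetic as vring_need_event(). */
static inline int need_event(uint16_t event_idx, uint16_t new_idx,
			     uint16_t old)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

/*
 * Hypothetical wrapper for the check quoted in the message:
 * vring_need_event(cached, new + num, new). It is true when the
 * cached value lies in the window [new_idx, new_idx + num).
 */
int cached_event_looks_valid(uint16_t cached_event, uint16_t new_idx,
			     uint16_t num)
{
	return need_event(cached_event, (uint16_t)(new_idx + num), new_idx);
}
```

The check only says that the cached value is ahead of new within one ring size. A value cached earlier can still pass it after the guest has published a different used event, so passing the check does not show that the cache is current.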


Thanks

Re: [PATCH net-next 2/2] bnxt_en: define sriov_lock unconditionally

2017-07-26 Thread Arnd Bergmann

On Wed, Jul 26, 2017 at 12:54 PM, Sathya Perla
 wrote:
> On Wed, Jul 26, 2017 at 2:35 PM, Arnd Bergmann  wrote:
> [...]
>>> Sathya already sent 3 patches to fix some of these issues.  But I need
>>> to rework one of his patches and resend.
>>
>> Ok, thanks. I just ran into one more issue, and don't know if that's included
>> as well. If not, please also add the patch below (or fold it into the one
>> that adds the switchdev dependency to the ethernet driver):
>>
>> 8<--
>> Subject: [PATCH] RDMA/bnxt_re: add NET_SWITCHDEV dependency
>>
>> The rdma side of BNXT enables the ethernet driver and has a list
>> of its dependencies. However, the ethernet driver now also depends
>> on NET_SWITCHDEV, so we have to add that dependency for both:
>
> Arnd, after the patch "bnxt_en: use SWITCHDEV_SET_OPS() for setting
> vf_rep_switchdev_ops" the bnxt_en driver doesn't need an explicit
> NET_SWITCHDEV dependency. So, the bnxt_re driver shouldn't need one
> either. Are you still seeing the bnxt_re issue even after pulling the
> above patch??

I think that's fine then. I missed that patch when it went in, so I only
needed the add-on since I still had my own earlier patch. I'll drop both
from my test tree now, and will let you know in case something else
remains.

 Arnd

Re: linux-next: Tree for Jul 26

2017-07-26 Thread Sergey Senozhatsky

Hello,

On (07/26/17 13:09), Rosen, Rami wrote:
> Hi Sergey,
> Paolo Abeni had sent a patch:
> https://www.mail-archive.com/netdev@vger.kernel.org/msg179192.html

yep, this should do the trick. thanks.

-ss

Re: [PATCH] iommu/amd: Fix schedule-while-atomic BUG in initialization code

2017-07-26 Thread Thomas Gleixner

On Wed, 26 Jul 2017, Joerg Roedel wrote:
> Yes, that should fix it, but I think it's better to just move the
> register_syscore_ops() call to a later initialization step, like in the
> patch below. I tested it and will queue it to my iommu/fixes branch.

Fair enough. Acked-by-me.

Re: [PATCH] iommu/amd: Fix schedule-while-atomic BUG in initialization code

2017-07-26 Thread Artem Savkov

On Wed, Jul 26, 2017 at 02:26:14PM +0200, Joerg Roedel wrote:
> Hi Artem, Thomas,
> 
> On Wed, Jul 26, 2017 at 12:42:49PM +0200, Thomas Gleixner wrote:
> > On Tue, 25 Jul 2017, Artem Savkov wrote:
> > 
> > > Hi,
> > > 
> > > Commit 1c3c5ea "sched/core: Enable might_sleep() and smp_processor_id()
> > > checks early" seems to have uncovered an issue with amd-iommu/x2apic.
> > > 
> > > Starting with that commit the following warning started to show up on AMD
> > > systems during boot:
> >  
> > > [0.16] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747 
> > 
> > > [0.16]  mutex_lock_nested+0x1b/0x20 
> > > [0.16]  register_syscore_ops+0x1d/0x70 
> > > [0.16]  state_next+0x119/0x910 
> > > [0.16]  iommu_go_to_state+0x29/0x30 
> > > [0.16]  amd_iommu_enable+0x13/0x23 
> > > [0.16]  irq_remapping_enable+0x1b/0x39 
> > > [0.16]  enable_IR_x2apic+0x91/0x196 
> > > [0.16]  default_setup_apic_routing+0x16/0x6e 
> > > [0.16]  native_smp_prepare_cpus+0x257/0x2d5
> 
> Thanks for the report!
> 
> > --- a/drivers/iommu/amd_iommu_init.c
> > +++ b/drivers/iommu/amd_iommu_init.c
> > @@ -2440,7 +2440,6 @@ static int __init state_next(void)
> > break;
> > case IOMMU_ACPI_FINISHED:
> > early_enable_iommus();
> > -   register_syscore_ops(&amd_iommu_syscore_ops);
> > x86_platform.iommu_shutdown = disable_iommus;
> > init_state = IOMMU_ENABLED;
> > break;
> > @@ -2559,6 +2558,8 @@ static int __init amd_iommu_init(void)
> > for_each_iommu(iommu)
> > iommu_flush_all_caches(iommu);
> > }
> > +   } else {
> > +   register_syscore_ops(&amd_iommu_syscore_ops);
> > }
> >  
> > return ret;
> 
> Yes, that should fix it, but I think it's better to just move the
> register_syscore_ops() call to a later initialization step, like in the
> patch below. I tested it and will queue it to my iommu/fixes branch.

Checked it as well just in case, didn't see any issues. Thank you.

Reported-and-tested-by: Artem Savkov 

-- 
Regards,
  Artem

[v4 2/4] mm, oom: cgroup-aware OOM killer

2017-07-26 Thread Roman Gushchin

Traditionally, the OOM killer operates at the process level.
Under OOM conditions, it finds the process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running containers
well:

1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

3) Per-process oom_score_adj affects global OOM, so it's a breach
in the isolation.

To address these issues, a cgroup-aware OOM killer is introduced.

Under OOM conditions, it tries to find the biggest memory consumer,
and free memory by killing the corresponding task(s). The difference
from the "traditional" OOM killer is that it can treat memory cgroups
as memory consumers as well as single processes.

By default, it will look for the biggest leaf cgroup, and kill
the largest task inside.

But a user can change this behavior by enabling the per-cgroup
oom_kill_all_tasks option. If set, it causes the OOM killer to treat
the whole cgroup as an indivisible memory consumer. If it is
selected as an OOM victim, all tasks belonging to it will be killed.

Tasks in the root cgroup are treated as independent memory consumers,
and are compared with other memory consumers (e.g. leaf cgroups).
The root cgroup doesn't support the oom_kill_all_tasks feature.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h |  23 +
 include/linux/oom.h|   3 +
 mm/memcontrol.c| 208 +
 mm/oom_kill.c  | 172 -
 4 files changed, 349 insertions(+), 57 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3914e3dd6168..b21bbb0edc72 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -199,6 +200,12 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   /* kill all tasks in the subtree in case of OOM */
+   bool oom_kill_all_tasks;
+
+   /* cached OOM score */
+   long oom_score;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
 
@@ -342,6 +349,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+   css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)   \
container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +492,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -739,6 +753,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -926,6 +944,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 static inline void __inc_memcg_state(struct mem_cgroup *memcg,
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2be5a6..b7ec3bd441be 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -39,6 +39,7 @@ struct oom_control {
unsigned long totalpages;
struct task_struct *chosen;
unsigned long chosen_points;
+   struct mem_cgroup *chosen_memcg;
 };
 
 extern struct mutex oom_lock;
@@ -79,6 +80,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9085e55eb69f..ba72d1cf73d0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2625,6 +2625,181 @@ static

[v4 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

2017-07-26 Thread Roman Gushchin

Update cgroups v2 docs.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
---
 Documentation/cgroup-v2.txt | 62 +
 1 file changed, 62 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index cb9ea281ab72..bf106b6b6b52 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+   5-2-4. Cgroup-aware OOM Killer
  5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -1001,6 +1002,37 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
 
+  memory.oom_kill_all_tasks
+
+   A read-write single value file which exists on non-root
+   cgroups.  The default is "0".
+
+   Defines whether the OOM killer should treat the cgroup
+   as a single entity during the victim selection.
+
+   If set, the OOM killer will kill all tasks belonging to the
+   corresponding cgroup if it is selected as an OOM victim.
+
+   By default, the OOM killer respects the /proc/pid/oom_score_adj
+   value -1000, and will never kill such a task, unless
+   oom_kill_all_tasks is set.
+
+  memory.oom_priority
+
+   A read-write single value file which exists on non-root
+   cgroups.  The default is "0".
+
+   An integer number within the [-1, 1] range,
+   which defines the order in which the OOM killer selects victim
+   memory cgroups.
+
+   The OOM killer prefers memory cgroups with larger priority if
+   they are populated with eligible tasks.
+
+   The oom_priority value is compared within sibling cgroups.
+
+   The root cgroup has the oom_priority 0, which cannot be changed.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1205,6 +1237,36 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
 
+Cgroup-aware OOM Killer
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats memory cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, hierarchically looking for the largest memory
+consumer. By default, it will look for the biggest task in the
+biggest leaf cgroup.
+
+By default, all cgroups have oom_priority 0, and the OOM killer will
+choose the largest cgroup recursively on each level. For non-root
+cgroups it's possible to change the oom_priority, and it will cause
+the OOM killer to look at the priority value first, and compare
+sizes only of cgroups with equal priority.
+
+But a user can change this behavior by enabling the per-cgroup
+oom_kill_all_tasks option. If set, it causes the OOM killer to treat
+the whole cgroup as an indivisible memory consumer. If it is
+selected as an OOM victim, all tasks belonging to it will be killed.
+
+Tasks in the root cgroup are treated as independent memory consumers,
+and are compared with other memory consumers (e.g. leaf cgroups).
+The root cgroup doesn't support the oom_kill_all_tasks feature.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
 IO
 --
 
-- 
2.13.3

[v4 3/4] mm, oom: introduce oom_priority for memory cgroups

2017-07-26 Thread Roman Gushchin

Introduce a per-memory-cgroup oom_priority setting: an integer number
within the [-1, 1] range, which defines the order in which
the OOM killer selects victim memory cgroups.

The OOM killer prefers memory cgroups with larger priority if they are
populated with eligible tasks.

The oom_priority value is compared within sibling cgroups.

The root cgroup has the oom_priority 0, which cannot be changed.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: Tetsuo Handa 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c| 55 --
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b21bbb0edc72..d31ac58e08ad 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -206,6 +206,9 @@ struct mem_cgroup {
/* cached OOM score */
long oom_score;
 
+   /* OOM killer priority */
+   short oom_priority;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ba72d1cf73d0..2c1566995077 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2710,12 +2710,21 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
for (;;) {
struct cgroup_subsys_state *css;
struct mem_cgroup *memcg = NULL;
+   short prio = SHRT_MIN;
long score = LONG_MIN;
 
css_for_each_child(css, &root->css) {
struct mem_cgroup *iter = mem_cgroup_from_css(css);
 
-   if (iter->oom_score > score) {
+   if (iter->oom_score == 0)
+   continue;
+
+   if (iter->oom_priority > prio) {
+   memcg = iter;
+   prio = iter->oom_priority;
+   score = iter->oom_score;
+   } else if (iter->oom_priority == prio &&
+  iter->oom_score > score) {
memcg = iter;
score = iter->oom_score;
}
@@ -2782,7 +2791,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 * For system-wide OOMs we should consider tasks in the root cgroup
 * with oom_score larger than oc->chosen_points.
 */
-   if (!oc->memcg) {
+   if (!oc->memcg && !(oc->chosen_memcg &&
+   oc->chosen_memcg->oom_priority > 0)) {
+   /*
+* Root memcg has priority 0, so if chosen memcg has lower
+* priority, any task in root cgroup is preferable.
+*/
+   if (oc->chosen_memcg && oc->chosen_memcg->oom_priority < 0)
+   oc->chosen_points = 0;
+
select_victim_root_cgroup_task(oc);
 
if (oc->chosen && oc->chosen_memcg) {
@@ -5373,6 +5390,34 @@ static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
return nbytes;
 }
 
+static int memory_oom_priority_show(struct seq_file *m, void *v)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+   seq_printf(m, "%d\n", memcg->oom_priority);
+
+   return 0;
+}
+
+static ssize_t memory_oom_priority_write(struct kernfs_open_file *of,
+   char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   int oom_priority;
+   int err;
+
+   err = kstrtoint(strstrip(buf), 0, &oom_priority);
+   if (err)
+   return err;
+
+   if (oom_priority < -1 || oom_priority > 1)
+   return -EINVAL;
+
+   memcg->oom_priority = (short)oom_priority;
+
+   return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5499,6 +5544,12 @@ static struct cftype memory_files[] = {
.write = memory_oom_kill_all_tasks_write,
},
{
+   .name = "oom_priority",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .seq_show = memory_oom_priority_show,
+   .write = memory_oom_priority_write,
+   },
+   {
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
.file_offset = offsetof(struct mem_cgroup, events_file),
-- 
2.13.3
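
The sibling comparison added to select_victim_memcg() above can be tried
out in userspace. The following is only a minimal sketch under stated
assumptions: struct memcg_sketch and pick_victim() are hypothetical names
invented for illustration, not kernel API, and the struct carries only the
two fields the comparison looks at. As in the hunk, entries with a zero
score are skipped, a larger priority always wins, and the score only
breaks ties between equal priorities.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the two struct mem_cgroup fields used here. */
struct memcg_sketch {
	const char *name;
	short oom_priority;
	long oom_score;
};

/*
 * Pick a victim among sibling cgroups: the highest oom_priority wins,
 * and oom_score is compared only between entries of equal priority.
 * Entries with oom_score == 0 are never selected.
 */
static const struct memcg_sketch *
pick_victim(const struct memcg_sketch *siblings, size_t n)
{
	const struct memcg_sketch *victim = NULL;
	int prio = SHRT_MIN;
	long score = LONG_MIN;
	size_t i;

	assert(siblings != NULL || n == 0);

	for (i = 0; i < n; i++) {
		const struct memcg_sketch *iter = &siblings[i];

		if (iter->oom_score == 0)
			continue;

		if (iter->oom_priority > prio) {
			/* Higher priority replaces the candidate outright. */
			victim = iter;
			prio = iter->oom_priority;
			score = iter->oom_score;
		} else if (iter->oom_priority == prio &&
			   iter->oom_score > score) {
			/* Same priority: the larger consumer wins. */
			victim = iter;
			score = iter->oom_score;
		}
	}

	return victim;
}
```

With siblings {a: prio 0, score 900}, {b: prio 1, score 100} and
{c: prio 1, score 300}, the sketch returns c: the larger score of a does
not matter because its priority is lower.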

Re: [REGRESSION 4.13-rc] NFS returns -EACCESS at the first read

2017-07-26 Thread Takashi Iwai

On Wed, 26 Jul 2017 14:57:07 +0200,
Anna Schumaker wrote:
> 
> Hi Takashi,
> 
> On 07/26/2017 08:54 AM, Takashi Iwai wrote:
> > Hi,
> > 
> > I seem to be hitting a regression in the NFS client on today's Linus git
> > tree.  The symptom is that a file read over NFS occasionally returns
> > -EACCESS at the first read.  When I try to read the same file again
> > (or do some other thing), I can read it successfully.
> > 
> > The git bisection led to the commit
> > bd8b2441742b49c76bec707757bd9c028ea9838e
> > NFS: Store the raw NFS access mask in the inode's access cache
> > 
> > 
> > Any further hint for debugging?
> 
> Does the patch in this email thread help? 
> http://www.spinics.net/lists/linux-nfs/msg64930.html

Thanks, I gave it a shot and the result looks good.  Feel free to add my
Tested-by tag:
  Tested-by: Takashi Iwai 

Though, when I look around the code, I feel somewhat uneasy that
MAY_XXX is still used for nfs_access_entry.mask, e.g. in
nfs3_proc_access() or nfs4_proc_access().  Are these functions OK
without a similar conversion?

thanks,

Takashi

[v4 1/4] mm, oom: refactor the TIF_MEMDIE usage

2017-07-26 Thread Roman Gushchin

First, separate tsk_is_oom_victim() and TIF_MEMDIE flag checks:
let the first one indicate that a task is killed by the OOM killer,
and the second one indicate that a task has access to the memory
reserves (with a hope to eliminate it later).

Second, set TIF_MEMDIE on all threads of an OOM victim process.

Third, to limit the number of processes which have access to memory
reserves, keep an atomic pointer to the task which grabbed them.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
---
 kernel/exit.c   |  2 +-
 mm/memcontrol.c |  2 +-
 mm/oom_kill.c   | 30 +-
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 8f40bee5ba9d..d5f372a2a363 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -542,7 +542,7 @@ static void exit_mm(void)
task_unlock(current);
mm_update_next_owner(mm);
mmput(mm);
-   if (test_thread_flag(TIF_MEMDIE))
+   if (tsk_is_oom_victim(current))
exit_oom_victim();
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d61133e6af99..9085e55eb69f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1896,7 +1896,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 * bypass the last charges so that they can exit quickly and
 * free their memory.
 */
-   if (unlikely(test_thread_flag(TIF_MEMDIE) ||
+   if (unlikely(tsk_is_oom_victim(current) ||
 fatal_signal_pending(current) ||
 current->flags & PF_EXITING))
goto force;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..72de01be4d33 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -435,6 +435,8 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 static bool oom_killer_disabled __read_mostly;
 
+static struct task_struct *tif_memdie_owner;
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 
 /*
@@ -656,13 +658,24 @@ static void mark_oom_victim(struct task_struct *tsk)
struct mm_struct *mm = tsk->mm;
 
WARN_ON(oom_killer_disabled);
-   /* OOM killer might race with memcg OOM */
-   if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+
+   if (!cmpxchg(&tif_memdie_owner, NULL, current)) {
+   struct task_struct *t;
+
+   rcu_read_lock();
+   for_each_thread(current, t)
+   set_tsk_thread_flag(t, TIF_MEMDIE);
+   rcu_read_unlock();
+   }
+
+   /*
+* OOM killer might race with memcg OOM.
+* oom_mm is bound to the signal struct life time.
+*/
+   if (cmpxchg(&tsk->signal->oom_mm, NULL, mm))
return;
 
-   /* oom_mm is bound to the signal struct life time. */
-   if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
-   mmgrab(tsk->signal->oom_mm);
+   mmgrab(tsk->signal->oom_mm);
 
/*
 * Make sure that the task is woken up from uninterruptible sleep
@@ -682,6 +695,13 @@ void exit_oom_victim(void)
 {
clear_thread_flag(TIF_MEMDIE);
 
+   /*
+* If the current task is a thread which initially
+* received TIF_MEMDIE, clear tif_memdie_owner to
+* give the next process a chance to capture it.
+*/
+   cmpxchg(&tif_memdie_owner, current, NULL);
+
if (!atomic_dec_return(&oom_victims))
wake_up_all(&oom_victims_wait);
 }
-- 
2.13.3
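
The single-owner idea behind tif_memdie_owner can be sketched outside the
kernel with C11 atomics. This is an illustration only, under stated
assumptions: struct task_sketch, reserve_owner, claim_reserves() and
release_reserves() are hypothetical names, and
atomic_compare_exchange_strong() stands in for the kernel's cmpxchg().
The first task to swap itself into the NULL slot gains access to the
reserves; everyone else is refused until that same task swaps the slot
back to NULL on exit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task with a TIF_MEMDIE-like flag. */
struct task_sketch {
	int pid;
	bool memdie;
};

/* At most one task may own access to the reserves at any time. */
static _Atomic(struct task_sketch *) reserve_owner;

/* Try to become the owner; only succeeds if the slot is empty. */
static bool claim_reserves(struct task_sketch *t)
{
	struct task_sketch *expected = NULL;

	assert(t != NULL);

	if (!atomic_compare_exchange_strong(&reserve_owner, &expected, t))
		return false;

	t->memdie = true;
	return true;
}

/*
 * Drop the flag on exit. The owner slot is cleared only if the exiting
 * task is the one that originally claimed it.
 */
static void release_reserves(struct task_sketch *t)
{
	struct task_sketch *expected = t;

	assert(t != NULL);

	t->memdie = false;
	atomic_compare_exchange_strong(&reserve_owner, &expected,
				       (struct task_sketch *)NULL);
}
```

A release by a task that is not the owner leaves the slot untouched, which
mirrors the cmpxchg(&tif_memdie_owner, current, NULL) call in
exit_oom_victim().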

Re: Sparse warnings on GENMASK + arm32

2017-07-26 Thread Lance Richardson

> From: "Stephen Boyd" 
> To: linux-spa...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Sent: Tuesday, 25 July, 2017 9:30:20 PM
> Subject: Sparse warnings on GENMASK + arm32
> 
> I see a sparse warning when I check a clk driver file in the kernel
> on a 32-bit ARM build.
> 
> drivers/clk/sunxi/clk-sun6i-ar100.c:65:20: warning: cast truncates bits from
> constant value (3 becomes )
> 
> The code in question looks like:
> 
> static const struct factors_data sun6i_ar100_data = {
>   .mux = 16,
>   .muxmask = GENMASK(1, 0),
>   .table = &sun6i_ar100_config,
>   .getter = sun6i_get_ar100_factors,
> };
> 
> where factors_data is
> 
> struct factors_data {
>   int enable;
>   int mux;
>   int muxmask;
>   const struct clk_factors_config *table;
>   void (*getter)(struct factors_request *req);
>   void (*recalc)(struct factors_request *req);
>   const char *name;
> };
> 
> 
> and sparse seems to be complaining about the muxmask assignment
> here. Oddly, this doesn't happen on arm64 builds. Both times, I'm
> checking this on an x86-64 machine.
> 
>  $ sparse --version
>  v0.5.1-rc4-1-gfa71b7ac0594
> 
> Is there something confusing to sparse in the GENMASK macro?
> 

Hmm, it seems sparse is incorrectly taking ~0UL to be a 64-bit value
while BITS_PER_LONG is (correctly) evaluated to be 32.

#define GENMASK(h, l) \
(((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
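
To make the suspected mismatch concrete, here is a small userspace sketch.
It is an assumption-laden illustration, not a statement about sparse
internals: genmask_ulong32() models the macro when unsigned long is really
32 bits wide, while genmask_mixed() models the hypothesis above, i.e. ~0UL
evaluated as a 64-bit value while BITS_PER_LONG is still 32. Both function
names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* GENMASK(h, l) as it behaves when unsigned long is 32 bits wide. */
static uint32_t genmask_ulong32(unsigned int h, unsigned int l)
{
	const uint32_t ones = UINT32_MAX;

	assert(l <= h && h < 32);

	return (uint32_t)(ones << l) & (uint32_t)(ones >> (32 - 1 - h));
}

/*
 * The suspected mix-up: ~0UL treated as 64 bits wide, but the shift
 * count still computed with BITS_PER_LONG == 32.
 */
static uint64_t genmask_mixed(unsigned int h, unsigned int l)
{
	const uint64_t ones = UINT64_MAX;

	assert(l <= h && h < 32);

	return (ones << l) & (ones >> (32 - 1 - h));
}
```

In this sketch genmask_ulong32(1, 0) is 0x3, which fits in an int without
loss, whereas genmask_mixed(1, 0) is 0x3ffffffff, a constant that no
longer fits in 32 bits and would be truncated when stored in the int
muxmask field.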

> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Re: [PATCH 1/1] mm/hugetlb: Make huge_pte_offset() consistent and document behaviour

2017-07-26 Thread Punit Agrawal

Michal Hocko  writes:

> On Wed 26-07-17 14:33:57, Michal Hocko wrote:
>> On Wed 26-07-17 13:11:46, Punit Agrawal wrote:
> [...]
>> > I've been running tests from mce-test suite and libhugetlbfs for similar
>> > changes we did on arm64. There could be assumptions that were not
>> > exercised but I'm not sure how to check for all the possible usages.
>> > 
>> > Do you have any other suggestions that can help improve confidence in
>> > the patch?
>> 
>> Unfortunately I don't. I just know there were many subtle assumptions
>> all over the place so I am rather careful to not touch the code unless
>> really necessary.
>> 
>> That being said, I am not opposing your patch.
>
> Let me be more specific. I am not opposing your patch but we
> definitely need more reviewers to have a look. I am not seeing any
> immediate problems with it but I do not see a large improvement either
> (slightly less nightmare doesn't make me sleep all that well ;)). So I
> will leave the decisions to others.

I hear you - I'd definitely appreciate more eyes on the code change and
description.

Thanks for taking a look.

Re: [PATCH] mm: take memory hotplug lock within numa_zonelist_order_handler()

2017-07-26 Thread Thomas Gleixner

On Wed, 26 Jul 2017, Heiko Carstens wrote:
> Andre Wild reported the following warning:
> 
> WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
> Modules linked in:
> CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57ec20 #10
> Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
> task: 701d8100 task.stack: 73594000
> Krnl PSW : 0704f0018000 00145e24 
> (lockdep_assert_cpus_held+0x4c/0x60)
> ...
> Call Trace:
>  lockdep_assert_cpus_held+0x42/0x60)
>  stop_machine_cpuslocked+0x62/0xf0
>  build_all_zonelists+0x92/0x150
>  numa_zonelist_order_handler+0x102/0x150
>  proc_sys_call_handler.isra.12+0xda/0x118
>  proc_sys_write+0x34/0x48
>  __vfs_write+0x3c/0x178
>  vfs_write+0xbc/0x1a0
>  SyS_write+0x66/0xc0
>  system_call+0xc4/0x2b0
>  locks held by bash/1205:
>  #0:  (sb_writers#4){.+.+.+}, at: [<0037b29e>] vfs_write+0xa6/0x1a0
>  #1:  (zl_order_mutex){+.+...}, at: [<002c8e4c>] 
> numa_zonelist_order_handler+0x44/0x150
>  #2:  (zonelists_mutex){+.+...}, at: [<002c8efc>] 
> numa_zonelist_order_handler+0xf4/0x150
> Last Breaking-Event-Address:
>  [<00145e20>] lockdep_assert_cpus_held+0x48/0x60
> 
> This can be easily triggered with e.g.
> 
>  >echo n > /proc/sys/vm/numa_zonelist_order
> 
> With commit 3f906ba23689a ("mm/memory-hotplug: switch locking to a
> percpu rwsem") memory hotplug locking was changed to fix a potential
> deadlock. This also switched the stop_machine() invocation within
> build_all_zonelists() to stop_machine_cpuslocked() which now expects
> that online cpus are locked when being called.
> 
> This assumption is not true if build_all_zonelists() is being called
> from numa_zonelist_order_handler(). In order to fix this simply add a
> mem_hotplug_begin()/mem_hotplug_done() pair to numa_zonelist_order_handler().

Sorry, I missed that call path when I did the conversion. So yes, that
needs some protection.

Thanks,

tglx

[PATCH 05/11] powerpc/topology: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang

Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro on the POWERPC platform is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-...@lists.ozlabs.org
---
 arch/powerpc/include/asm/topology.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index dc4e159..2d84bca 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -16,8 +16,6 @@ struct device_node;
 
 #include 
 
-#define parent_node(node)  (node)
-
 #define cpumask_of_node(node) ((node) == -1 ?  \
   cpu_all_mask :   \
   node_to_cpumask_map[node])
-- 
2.5.5

[PATCH 10/11] x86/topology: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang

Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro on the X86 platform is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
---
 arch/x86/include/asm/topology.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 6358a85..c1d2a98 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -75,12 +75,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 
 extern void setup_node_to_cpumask_map(void);
 
-/*
- * Returns the number of the node containing Node 'node'. This
- * architecture is flat, so it is a pretty simple function!
- */
-#define parent_node(node) (node)
-
 #define pcibus_to_node(bus) __pcibus_to_node(bus)
 
 extern int __node_distance(int, int);
-- 
2.5.5

[PATCH 08/11] sparc64/topology: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang

Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro on the SPARC64 platform is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: "David S. Miller" 
Cc: sparcli...@vger.kernel.org
---
 arch/sparc/include/asm/topology_64.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h
index ad5293f..0fcc9a0 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -10,8 +10,6 @@ static inline int cpu_to_node(int cpu)
return numa_cpu_lookup_table[cpu];
 }
 
-#define parent_node(node)  (node)
-
 #define cpumask_of_node(node) ((node) == -1 ?  \
   cpu_all_mask :   \
   &numa_cpumask_lookup_table[node])
-- 
2.5.5

Re: [PATCH v2 02/13] xen/pvcalls: connect to the backend

2017-07-26 Thread Boris Ostrovsky




On 7/25/2017 5:21 PM, Stefano Stabellini wrote:

Implement the probe function for the pvcalls frontend. Read the
supported versions, max-page-order and function-calls nodes from
xenstore.

Introduce a data structure named pvcalls_bedata. It contains pointers to
the command ring, the event channel, a list of active sockets and a list
of passive sockets. List accesses are protected by a spin_lock.

Introduce a waitqueue to allow waiting for a response on commands sent
to the backend.

Introduce an array of struct xen_pvcalls_response to store command
responses.

Only one frontend<->backend connection is supported at any given time
for a guest. Store the active frontend device to a static pointer.

Introduce a stub function for the event handler.

Signed-off-by: Stefano Stabellini 
CC: boris.ostrov...@oracle.com
CC: jgr...@suse.com
---
  drivers/xen/pvcalls-front.c | 153 
  1 file changed, 153 insertions(+)

diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index a8d38c2..5e0b265 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -20,6 +20,29 @@
  #include 
  #include 
  
+#define PVCALLS_INVALID_ID (UINT_MAX)


Unnecessary parentheses


+#define RING_ORDER XENBUS_MAX_RING_GRANT_ORDER


PVCALLS_RING_ORDER?


+#define PVCALLS_NR_REQ_PER_RING __CONST_RING_SIZE(xen_pvcalls, XEN_PAGE_SIZE)
+
+struct pvcalls_bedata {
+   struct xen_pvcalls_front_ring ring;
+   grant_ref_t ref;
+   int irq;
+
+   struct list_head socket_mappings;
+   struct list_head socketpass_mappings;
+   spinlock_t pvcallss_lock;
+
+   wait_queue_head_t inflight_req;
+   struct xen_pvcalls_response rsp[PVCALLS_NR_REQ_PER_RING];
+};
+struct xenbus_device *pvcalls_front_dev;


static


+
+static irqreturn_t pvcalls_front_event_handler(int irq, void *dev_id)
+{
+   return IRQ_HANDLED;
+}
+
  static const struct xenbus_device_id pvcalls_front_ids[] = {
{ "pvcalls" },
{ "" }
@@ -33,12 +56,142 @@ static int pvcalls_front_remove(struct xenbus_device *dev)
  static int pvcalls_front_probe(struct xenbus_device *dev,
  const struct xenbus_device_id *id)
  {
+   int ret = -EFAULT, evtchn, ref = -1, i;
+   unsigned int max_page_order, function_calls, len;
+   char *versions;
+   grant_ref_t gref_head = 0;
+   struct xenbus_transaction xbt;
+   struct pvcalls_bedata *bedata = NULL;
+   struct xen_pvcalls_sring *sring;
+
+   if (pvcalls_front_dev != NULL) {
+   dev_err(&dev->dev, "only one PV Calls connection supported\n");
+   return -EINVAL;
+   }
+
+   versions = xenbus_read(XBT_NIL, dev->otherend, "versions", &len);
+   if (!len)
+   return -EINVAL;
+   if (strcmp(versions, "1")) {
+   kfree(versions);
+   return -EINVAL;
+   }
+   kfree(versions);
+   ret = xenbus_scanf(XBT_NIL, dev->otherend,
+  "max-page-order", "%u", &max_page_order);
+   if (ret <= 0)
+   return -ENODEV;
+   if (max_page_order < RING_ORDER)
+   return -ENODEV;
+   ret = xenbus_scanf(XBT_NIL, dev->otherend,
+  "function-calls", "%u", &function_calls);
+   if (ret <= 0 || function_calls != 1)
+   return -ENODEV;
+   pr_info("%s max-page-order is %u\n", __func__, max_page_order);
+
+   bedata = kzalloc(sizeof(struct pvcalls_bedata), GFP_KERNEL);
+   if (!bedata)
+   return -ENOMEM;
+
+   init_waitqueue_head(&bedata->inflight_req);
+   for (i = 0; i < PVCALLS_NR_REQ_PER_RING; i++)
+   bedata->rsp[i].req_id = PVCALLS_INVALID_ID;
+
+   sring = (struct xen_pvcalls_sring *) __get_free_page(GFP_KERNEL |
+__GFP_ZERO);
+   if (!sring)
+   goto error;
+   SHARED_RING_INIT(sring);
+   FRONT_RING_INIT(&bedata->ring, sring, XEN_PAGE_SIZE);
+
+   ret = xenbus_alloc_evtchn(dev, &evtchn);
+   if (ret)
+   goto error;
+
+   bedata->irq = bind_evtchn_to_irqhandler(evtchn,
+   pvcalls_front_event_handler,
+   0, "pvcalls-frontend", dev);
+   if (bedata->irq < 0) {
+   ret = bedata->irq;
+   goto error;
+   }
+
+   ret = gnttab_alloc_grant_references(1, &gref_head);
+   if (ret < 0)
+   goto error;
+   bedata->ref = ref = gnttab_claim_grant_reference(&gref_head);


Is ref really needed?


+   if (ref < 0)
+   goto error;
+   gnttab_grant_foreign_access_ref(ref, dev->otherend_id,
+   virt_to_gfn((void *)sring), 0);
+
+ again:
+   ret = xenbus_transaction_start(&xbt);
+   if (ret) {
+   xenbus_dev_fatal(dev, ret, "starting transaction");
+   goto error;
+   }
+   ret = xenbus_printf(xbt,

[PATCH 03/11] metag/numa: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang

Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro in the METAG architecture is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: James Hogan 
Cc: a...@linux-foundation.org
Cc: linux-me...@vger.kernel.org
---
 arch/metag/include/asm/topology.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index e95f874..707c7f7 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -4,7 +4,6 @@
 #ifdef CONFIG_NUMA
 
 #define cpu_to_node(cpu)   ((void)(cpu), 0)
-#define parent_node(node)  ((void)(node), 0)
 
 #define cpumask_of_node(node)  ((void)node, cpu_online_mask)
 
-- 
2.5.5

[PATCH v2] smp_call_function: use inline helpers instead of macros

2017-07-26 Thread Arnd Bergmann

A new caller of smp_call_function() passes a local variable as the 'wait'
argument, and that variable is otherwise unused, so we get a warning
in non-SMP configurations:

virt/kvm/kvm_main.c: In function 'kvm_make_all_cpus_request':
virt/kvm/kvm_main.c:195:7: error: unused variable 'wait' [-Werror=unused-variable]
  bool wait = req & KVM_REQUEST_WAIT;

This addresses the warning by changing the two macros into inline functions.
As reported by the 0day build bot, a small change is required in the MIPS
r4k code for this, which then gets a warning about a missing variable.

Fixes: 7a97cec26b94 ("KVM: mark requests that need synchronization")
Cc: Paolo Bonzini 
Link: https://patchwork.kernel.org/patch/9722063/
Signed-off-by: Arnd Bergmann 
---
v2: - fix MIPS build error reported by kbuild test robot
- remove up_smp_call_function()
---
 arch/mips/mm/c-r4k.c |  2 ++
 include/linux/smp.h  | 12 +++-
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c
index 81d6a15c93d0..f353bf5f24f1 100644
--- a/arch/mips/mm/c-r4k.c
+++ b/arch/mips/mm/c-r4k.c
@@ -97,9 +97,11 @@ static inline void r4k_on_each_cpu(unsigned int type,
   void (*func)(void *info), void *info)
 {
preempt_disable();
+#ifdef CONFIG_SMP
if (r4k_op_needs_ipi(type))
smp_call_function_many(&cpu_foreign_map[smp_processor_id()],
   func, info, 1);
+#endif
func(info);
preempt_enable();
 }
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 68123c1fe549..ea24e2d3504c 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -135,17 +135,19 @@ static inline void smp_send_stop(void) { }
  * These macros fold the SMP functionality into a single CPU system
  */
 #define raw_smp_processor_id() 0
-static inline int up_smp_call_function(smp_call_func_t func, void *info)
+static inline int smp_call_function(smp_call_func_t func, void *info, int wait)
 {
return 0;
 }
-#define smp_call_function(func, info, wait) \
-   (up_smp_call_function(func, info))
 
 static inline void smp_send_reschedule(int cpu) { }
 #define smp_prepare_boot_cpu() do {} while (0)
-#define smp_call_function_many(mask, func, info, wait) \
-   (up_smp_call_function(func, info))
+
+static inline void smp_call_function_many(const struct cpumask *mask,
+   smp_call_func_t func, void *info, bool wait)
+{
+}
+
 static inline void call_function_init(void) { }
 
 static inline int
-- 
2.9.0
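
The difference between the macro stub and the inline stub described in the smp_call_function patch above can be reproduced in user space. This is only a sketch: the `_macro` and `_inline` suffixes, `make_request()`, `noop()` and the `0x1` flag are hypothetical stand-ins, not kernel code. The point is that a macro which drops its `wait` argument leaves the caller's local variable unused, while an inline function consumes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*smp_call_func_t)(void *info);

/* Old UP stub shape: the macro below never expands 'wait', so a local that
 * is only passed as 'wait' is reported by -Wunused-variable. */
static inline int up_smp_call_function(smp_call_func_t func, void *info)
{
	(void)func;
	(void)info;
	return 0;
}
#define smp_call_function_macro(func, info, wait) \
	(up_smp_call_function(func, info))

/* New UP stub shape: a real function evaluates and type-checks every
 * argument, so the caller's variable counts as used. */
static inline int smp_call_function_inline(smp_call_func_t func, void *info,
					   int wait)
{
	(void)func;
	(void)info;
	(void)wait;
	return 0;
}

static void noop(void *info)
{
	(void)info;
}

/* Hypothetical caller modelled on the kvm_make_all_cpus_request() report. */
int make_request(unsigned int req)
{
	bool wait = req & 0x1;	/* stand-in for KVM_REQUEST_WAIT */

	/* Calling smp_call_function_macro(noop, NULL, wait) here instead
	 * would leave 'wait' unused after preprocessing. */
	return smp_call_function_inline(noop, NULL, wait);
}

/* Usage: both stubs report success on a uniprocessor build. */
void smp_stub_demo(void)
{
	assert(make_request(1) == 0);
	assert(make_request(0) == 0);
}
```

Swapping the call in make_request() for the macro variant and building with -Wall should bring the unused-variable warning back.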

[PATCH 11/11] asm-generic: numa: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang

Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro in generic situation is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: Arnd Bergmann 
Cc: linux-a...@vger.kernel.org
---
 include/asm-generic/topology.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/include/asm-generic/topology.h b/include/asm-generic/topology.h
index fc824e2..a91d842 100644
--- a/include/asm-generic/topology.h
+++ b/include/asm-generic/topology.h
@@ -44,9 +44,6 @@
 #define cpu_to_mem(cpu)((void)(cpu),0)
 #endif
 
-#ifndef parent_node
-#define parent_node(node)  ((void)(node),0)
-#endif
 #ifndef cpumask_of_node
 #define cpumask_of_node(node)  ((void)node, cpu_online_mask)
 #endif
-- 
2.5.5

[PATCH 01/11] arm64: numa: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang

Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro in ARM64 platform is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: Michael Ellerman 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
---
 arch/arm64/include/asm/numa.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
index bf466d1..ef7b238 100644
--- a/arch/arm64/include/asm/numa.h
+++ b/arch/arm64/include/asm/numa.h
@@ -7,9 +7,6 @@
 
 #define NR_NODE_MEMBLKS(MAX_NUMNODES * 2)
 
-/* currently, arm64 implements flat NUMA topology */
-#define parent_node(node)  (node)
-
 int __node_distance(int from, int to);
 #define node_distance(a, b) __node_distance(a, b)
 
-- 
2.5.5

[PATCH 09/11] tile/topology: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang

Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro in tile platform is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: Chris Metcalf 
---
 arch/tile/include/asm/topology.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index b11d5fc..635a0a4 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -29,12 +29,6 @@ static inline int cpu_to_node(int cpu)
return cpu_2_node[cpu];
 }
 
-/*
- * Returns the number of the node containing Node 'node'.
- * This architecture is flat, so it is a pretty simple function!
- */
-#define parent_node(node) (node)
-
 /* Returns a bitmask of CPUs on Node 'node'. */
 static inline const struct cpumask *cpumask_of_node(int node)
 {
-- 
2.5.5

[PATCH v2] Kbuild: use -fshort-wchar globally

2017-07-26 Thread Arnd Bergmann

A previous patch added the --no-wchar-size-warning to the Makefile to
avoid this harmless warning:

arm-linux-gnueabi-ld: warning: drivers/xen/efi.o uses 2-byte wchar_t yet the output is to use 4-byte wchar_t; use of wchar_t values across objects may fail

Changing kbuild to use thin archives instead of recursive linking
unfortunately brings the same warning back during the final link.

The kernel does not use wchar_t string literals at this point, and
xen does not use wchar_t at all (only efi_char16_t), so the flag
has no effect, but as pointed out by Jan Beulich, adding a wchar_t
string literal would be bad here.

Since wchar_t is always defined as u16, independent of the toolchain
default, always passing -fshort-wchar is correct and lets us
remove the Xen specific hack along with fixing the warning.

Signed-off-by: Arnd Bergmann 
Fixes: 971a69db7dc0 ("Xen: don't warn about 2-byte wchar_t in efi")
Acked-by: David Vrabel 
Link: https://patchwork.kernel.org/patch/9275217/
---
I submitted an earlier patch in August 2016, simply removing the
flag in xen, but there seems to be no harm in enabling it globally
---
 Makefile | 2 +-
 drivers/xen/Makefile | 3 ---
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/Makefile b/Makefile
index f1533423094f..0fe63a47fd52 100644
--- a/Makefile
+++ b/Makefile
@@ -396,7 +396,7 @@ LINUXINCLUDE:= \
 KBUILD_CPPFLAGS := -D__KERNEL__
 
 KBUILD_CFLAGS   := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
-  -fno-strict-aliasing -fno-common \
+  -fno-strict-aliasing -fno-common -fshort-wchar \
   -Werror-implicit-function-declaration \
   -Wno-format-security \
   -std=gnu89 $(call cc-option,-fno-PIE)
diff --git a/drivers/xen/Makefile b/drivers/xen/Makefile
index 8feab810aed9..7f188b8d0c67 100644
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -7,9 +7,6 @@ obj-y   += xenbus/
 nostackp := $(call cc-option, -fno-stack-protector)
 CFLAGS_features.o  := $(nostackp)
 
-CFLAGS_efi.o   += -fshort-wchar
-LDFLAGS+= $(call ld-option, --no-wchar-size-warning)
-
 dom0-$(CONFIG_ARM64) += arm-device.o
 dom0-$(CONFIG_PCI) += pci.o
 dom0-$(CONFIG_USB_SUPPORT) += dbgp.o
-- 
2.9.0

Re: [REGRESSION 4.13-rc] NFS returns -EACCESS at the first read

2017-07-26 Thread Anna Schumaker



On 07/26/2017 09:30 AM, Takashi Iwai wrote:
> On Wed, 26 Jul 2017 14:57:07 +0200,
> Anna Schumaker wrote:
>>
>> Hi Takashi,
>>
>> On 07/26/2017 08:54 AM, Takashi Iwai wrote:
>>> Hi,
>>>
>>> I seem to be hitting a regression in the NFS client on today's Linus git
>>> tree.  The symptom is that a file read over NFS occasionally returns
>>> -EACCESS at the first read.  When I try to read the same file again
>>> (or do some other thing), I can read it successfully.
>>>
>>> The git bisection led to the commit
>>> bd8b2441742b49c76bec707757bd9c028ea9838e
>>> NFS: Store the raw NFS access mask in the inode's access cache
>>>
>>>
>>> Any further hint for debugging?
>>
>> Does the patch in this email thread help? 
>> http://www.spinics.net/lists/linux-nfs/msg64930.html
> 
> Thanks, I gave it a shot and the result looks good.  Feel free to add my
> tested-by tag:
>   Tested-by: Takashi Iwai 
> 
> 
> Though, when I look around the code, I feel somewhat uneasy that
> MAY_XXX is still used for nfs_access_entry.mask, e.g. in
> nfs3_proc_access() or nfs4_proc_access().  Are these functions OK
> without a similar conversion?

I just started looking at that at the end of the day yesterday.  I think they 
work by accident, since all the bits in the mask are set by nfs_do_access().  
They should probably be converted, but I don't think it's urgent.

Anna

> 
> 
> thanks,
> 
> Takashi
>

[PATCH 02/11] ia64: topology: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang

Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro in IA64(Itanium) platform is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: linux-i...@vger.kernel.org
---
 arch/ia64/include/asm/topology.h | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index 3ad8f69..82f9bf7 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -34,13 +34,6 @@
   &node_to_cpu_mask[node])
 
 /*
- * Returns the number of the node containing Node 'nid'.
- * Not implemented here. Multi-level hierarchies detected with
- * the help of node_distance().
- */
-#define parent_node(nid) (nid)
-
-/*
  * Determines the node for a given pci bus
  */
 #define pcibus_to_node(bus) PCI_CONTROLLER(bus)->node
-- 
2.5.5

Re: [PATCH net] Revert "vhost: cache used event for better performance"

2017-07-26 Thread Jason Wang




On 2017-07-26 21:18, Jason Wang wrote:



On 2017-07-26 20:57, Michael S. Tsirkin wrote:

On Wed, Jul 26, 2017 at 04:03:17PM +0800, Jason Wang wrote:

This reverts commit 809ecb9bca6a9424ccd392d67e368160f8b76c92, since it
was reported to break vhost_net. We wanted to cache the used event and use
it to check for notification. We tried to validate the cached used event by
checking whether or not it was ahead of new, but this is not always
correct: it could be stale and there's no way to know about this.

Signed-off-by: Jason Wang

Could you supply a bit more data here please?  How does it get stale?
What does guest need to do to make it stale?  This will be helpful if
anyone wants to bring it back, or if we want to extend the protocol.



The problem is that we don't know whether or not the guest has published
a new used event. The check vring_need_event(vq->last_used_event, new + 
vq->num, new) is not sufficient to check for this.


Thanks


One more note: the previous assumption was that we never move the used
event back, but this can in fact happen if idx wraps around. Will
repost and add this to the commit log.


Thanks
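
To make the staleness argument concrete, here is a small user-space sketch. need_event() uses the same arithmetic as vring_need_event() in include/uapi/linux/virtio_ring.h; cached_check_says_skip() is my reading of the reverted shortcut (only its first half is quoted in this thread, the second half is an assumption), and the index values are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as vring_need_event(): true if event_idx lies in
 * [old, new_idx) modulo 2^16. */
int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

/* Assumed shape of the reverted shortcut: skip the notification if the
 * cached used event looks "ahead of new" and was not crossed by old..new. */
int cached_check_says_skip(uint16_t cached, uint16_t old, uint16_t new_idx,
			   uint16_t num)
{
	return need_event(cached, (uint16_t)(new_idx + num), new_idx) &&
	       !need_event(cached, new_idx, old);
}

/* Hypothetical scenario: the cache still holds 100 from before the 16-bit
 * index wrapped. The used index now moves from 90 to 95 and the guest has
 * meanwhile published used_event = 92. */
void stale_cache_demo(void)
{
	/* The stale value 100 looks "ahead of new", so the shortcut skips. */
	assert(cached_check_says_skip(100, 90, 95, 256) == 1);
	/* Checking the value the guest really published demands a notify. */
	assert(need_event(92, 95, 90) == 1);
}
```

Under these assumed numbers the cached check and the real check disagree, which is the lost-notification case described above.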

[PATCH 07/11] sh/numa: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang

Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro in SUPERH platform is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: linux...@vger.kernel.org
---
 arch/sh/include/asm/topology.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/sh/include/asm/topology.h b/arch/sh/include/asm/topology.h
index 358e3f5..6931f50 100644
--- a/arch/sh/include/asm/topology.h
+++ b/arch/sh/include/asm/topology.h
@@ -4,7 +4,6 @@
 #ifdef CONFIG_NUMA
 
 #define cpu_to_node(cpu)   ((void)(cpu),0)
-#define parent_node(node)  ((void)(node),0)
 
 #define cpumask_of_node(node)  ((void)node, cpu_online_mask)
 
-- 
2.5.5

[PATCH 04/11] MIPS: numa: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang

Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macros in both IP27 and Loongson64 are unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: Ralf Baechle 
Cc: James Hogan 
Cc: linux-m...@linux-mips.org
---
 arch/mips/include/asm/mach-ip27/topology.h   | 1 -
 arch/mips/include/asm/mach-loongson64/topology.h | 1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/mips/include/asm/mach-ip27/topology.h b/arch/mips/include/asm/mach-ip27/topology.h
index defd135..3fb7a0e 100644
--- a/arch/mips/include/asm/mach-ip27/topology.h
+++ b/arch/mips/include/asm/mach-ip27/topology.h
@@ -23,7 +23,6 @@ struct cpuinfo_ip27 {
 extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS];
 
 #define cpu_to_node(cpu)   (sn_cpu_info[(cpu)].p_nodeid)
-#define parent_node(node)  (node)
 #define cpumask_of_node(node)  ((node) == -1 ? \
 cpu_all_mask : \
                 &hub_data(node)->h_cpus)
diff --git a/arch/mips/include/asm/mach-loongson64/topology.h b/arch/mips/include/asm/mach-loongson64/topology.h
index 0d8f3b5..bcb8856 100644
--- a/arch/mips/include/asm/mach-loongson64/topology.h
+++ b/arch/mips/include/asm/mach-loongson64/topology.h
@@ -4,7 +4,6 @@
 #ifdef CONFIG_NUMA
 
 #define cpu_to_node(cpu)   (cpu_logical_map(cpu) >> 2)
-#define parent_node(node)  (node)
 #define cpumask_of_node(node)  (&__node_data[(node)]->cpumask)
 
 struct pci_bus;
-- 
2.5.5

[PATCH] [v2] iopoll: avoid -Wint-in-bool-context warning

2017-07-26 Thread Arnd Bergmann

When we pass the result of a multiplication as the timeout or
the delay, we can get a warning:

drivers/mmc/host/bcm2835.c:596:149: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]
drivers/mfd/arizona-core.c:247:195: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]
drivers/gpu/drm/sun4i/sun4i_hdmi_i2c.c:49:27: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]

The warning is a bit questionable inside of a macro, but this
is intentional on the side of the gcc developers. It is also
an indication of another problem: we evaluate the timeout
and sleep arguments multiple times, which can have undesired
side-effects when those are complex expressions.

This changes the three iopoll variants to use local variables
for storing copies of the timeouts. This adds some more type
safety, and avoids both the double-evaluation and the gcc
warning.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81484
Signed-off-by: Arnd Bergmann 
---
v2: - use temporary variables instead of zero-comparison, to
  avoid double evaluation
- also address the delay, not just timeout handling
---
 include/linux/iopoll.h | 24 +++-
 include/linux/regmap.h | 12 +++-
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/include/linux/iopoll.h b/include/linux/iopoll.h
index d29e1e21bf3f..b1d861caca16 100644
--- a/include/linux/iopoll.h
+++ b/include/linux/iopoll.h
@@ -42,18 +42,21 @@
  */
 #define readx_poll_timeout(op, addr, val, cond, sleep_us, timeout_us)  \
 ({ \
-   ktime_t timeout = ktime_add_us(ktime_get(), timeout_us); \
-   might_sleep_if(sleep_us); \
+   u64 __timeout_us = (timeout_us); \
+   unsigned long __sleep_us = (sleep_us); \
+   ktime_t __timeout = ktime_add_us(ktime_get(), __timeout_us); \
+   might_sleep_if((__sleep_us) != 0); \
for (;;) { \
(val) = op(addr); \
if (cond) \
break; \
-   if (timeout_us && ktime_compare(ktime_get(), timeout) > 0) { \
+   if (__timeout_us && \
+   ktime_compare(ktime_get(), __timeout) > 0) { \
(val) = op(addr); \
break; \
} \
-   if (sleep_us) \
-   usleep_range((sleep_us >> 2) + 1, sleep_us); \
+   if (__sleep_us) \
+   usleep_range((__sleep_us >> 2) + 1, __sleep_us); \
} \
(cond) ? 0 : -ETIMEDOUT; \
 })
@@ -77,17 +80,20 @@
  */
 #define readx_poll_timeout_atomic(op, addr, val, cond, delay_us, timeout_us) \
 ({ \
-   ktime_t timeout = ktime_add_us(ktime_get(), timeout_us); \
+   u64 __timeout_us = (timeout_us); \
+   unsigned long __delay_us = (delay_us); \
+   ktime_t __timeout = ktime_add_us(ktime_get(), __timeout_us); \
for (;;) { \
(val) = op(addr); \
if (cond) \
break; \
-   if (timeout_us && ktime_compare(ktime_get(), timeout) > 0) { \
+   if (__timeout_us && \
+   ktime_compare(ktime_get(), __timeout) > 0) { \
(val) = op(addr); \
break; \
} \
-   if (delay_us) \
-   udelay(delay_us);   \
+   if (__delay_us) \
+   udelay(__delay_us); \
} \
(cond) ? 0 : -ETIMEDOUT; \
 })
diff --git a/include/linux/regmap.h b/include/linux/regmap.h
index 1474ab0a3922..a4d30c877f6b 100644
--- a/include/linux/regmap.h
+++ b/include/linux/regmap.h
@@ -120,22 +120,24 @@ struct reg_sequence {
  */
 #define regmap_read_poll_timeout(map, addr, val, cond, sleep_us, timeout_us) \
 ({ \
-   ktime_t __timeout = ktime_add_us(ktime_get(), timeout_us); \
+   u64 __timeout_us = (timeout_us); \
+   unsigned long __sleep_us = (sleep_us); \
+   ktime_t __timeout = ktime_add_us(ktime_get(), __timeout_us); \
int __ret; \
-   might_sleep_if(sleep_us); \
+   might_sleep_if(__sleep_us); \
for (;;) { \
__ret = regmap_read((map), (addr), &(val)); \
if (__ret) \
break; \
if (cond) \
break; \
-   if ((timeout_us) && \
+   if (__timeout_us && \
ktime_compare(ktime_get(), __timeout) > 0) { \
__ret = regmap_read((map), (addr), &(val)); \
break; \
} \
-   if (sleep_us) \
-   usleep_range(((sleep_us) >> 2) + 1, sleep_us); \
+   if (__sleep_us) \
+   usleep_range((__sleep_us >> 2) + 1, __sleep_us); \
} \
__ret ?: ((cond) ? 0 : -ETIMEDOUT); \
 })
-- 
2.9.0
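
The double-evaluation problem mentioned in the iopoll changelog above can be shown with a reduced user-space sketch. The macros here are hypothetical miniatures, not the kernel's readx_poll_timeout(); `slept` stands in for the time passed to usleep_range(), and next_delay() is an argument expression with a visible side effect.

```c
#include <assert.h>

static int calls;		/* how many times the argument expression ran */
static unsigned long slept;	/* stand-in for time handed to usleep_range() */

static unsigned long next_delay(void)
{
	calls++;
	return 10;
}

/* Pre-patch shape: 'sleep_us' is expanded at every use, so an expression
 * with side effects runs once for the test and once for the sleep. */
#define poll_once_naive(sleep_us) \
	do { \
		if (sleep_us) \
			slept += (sleep_us); \
	} while (0)

/* Post-patch shape: copy the argument into a local once, then only use
 * the copy. The local also pins down the type. */
#define poll_once_fixed(sleep_us) \
	do { \
		unsigned long __sleep_us = (sleep_us); \
		if (__sleep_us) \
			slept += __sleep_us; \
	} while (0)

int count_evaluations_naive(void)
{
	calls = 0;
	poll_once_naive(next_delay());
	return calls;
}

int count_evaluations_fixed(void)
{
	calls = 0;
	poll_once_fixed(next_delay());
	return calls;
}

/* Usage: the naive macro evaluates its argument twice, the fixed one once. */
void iopoll_demo(void)
{
	assert(count_evaluations_naive() == 2);
	assert(count_evaluations_fixed() == 1);
}
```

Because the copy is a plain variable, testing it in an `if` no longer puts a multiplication directly into boolean context, which is what the gcc warning complains about.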

[PATCH 06/11] s390/topology: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang

Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro in S390 platform is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Michael Holzheu 
Cc: linux-s...@vger.kernel.org
---
 arch/s390/include/asm/topology.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index fa1bfce..5222da1 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -77,12 +77,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
return &node_to_cpumask_map[node];
 }
 
-/*
- * Returns the number of the node containing node 'node'. This
- * architecture is flat, so it is a pretty simple function!
- */
-#define parent_node(node) (node)
-
 #define pcibus_to_node(bus) __pcibus_to_node(bus)
 
 #define node_distance(a, b) __node_distance(a, b)
-- 
2.5.5

[PATCH 00/11] Remove the parent_node() for each arch

2017-07-26 Thread Dou Liyang

Michael reports that parent_node() is never invoked, since
commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removed its last user.

So remove it from the topology.h headers of each arch.

Dou Liyang (11):
  arm64: numa: Remove the unused parent_node() macro
  ia64: topology: Remove the unused parent_node() macro
  metag/numa: Remove the unused parent_node() macro
  MIPS: numa: Remove the unused parent_node() macro
  powerpc/numa: Remove the unused parent_node() macro
  s390/topology: Remove the unused parent_node() macro
  sh/numa: Remove the unused parent_node() macro
  sparc64/topology: Remove the unused parent_node() macro
  tile/topology: Remove the unused parent_node() macro
  x86/topology: Remove the unused parent_node() macro
  asm-generic: numa: Remove the unused parent_node() macro

 arch/arm64/include/asm/numa.h| 3 ---
 arch/ia64/include/asm/topology.h | 7 ---
 arch/metag/include/asm/topology.h| 1 -
 arch/mips/include/asm/mach-ip27/topology.h   | 1 -
 arch/mips/include/asm/mach-loongson64/topology.h | 1 -
 arch/powerpc/include/asm/topology.h  | 2 --
 arch/s390/include/asm/topology.h | 6 --
 arch/sh/include/asm/topology.h   | 1 -
 arch/sparc/include/asm/topology_64.h | 2 --
 arch/tile/include/asm/topology.h | 6 --
 arch/x86/include/asm/topology.h  | 6 --
 include/asm-generic/topology.h   | 3 ---
 12 files changed, 39 deletions(-)

-- 
2.5.5

Re: [PATCH] iommu/amd: Fix schedule-while-atomic BUG in initialization code

2017-07-26 Thread Joerg Roedel

On Wed, Jul 26, 2017 at 03:25:05PM +0200, Artem Savkov wrote:
> On Wed, Jul 26, 2017 at 02:26:14PM +0200, Joerg Roedel wrote:
> > Yes, that should fix it, but I think it's better to just move the
> > register_syscore_ops() call to a later initialization step, like in the
> > patch below. I tested it and will queue it to my iommu/fixes branch.
> 
> Checked it as well just in case, didn't see any issues. Thank you.
> 
> Reported-and-tested-by: Artem Savkov 

Thanks for testing it! I added your and Thomas' tags and applied the
patch to my tree. It should go upstream this week.


Joerg

Re: [RFC PATCH] mm: memcg: fix css double put in mem_cgroup_iter

2017-07-26 Thread Michal Hocko

On Wed 26-07-17 21:07:42, Wenwei Tao wrote:
> From: Wenwei Tao 
> 
> By removing the child cgroup while the parent cgroup is
> under reclaim, we could trigger the following kernel panic
> on kernel 3.10:
> 
> kernel BUG at kernel/cgroup.c:893!
>  invalid opcode:  [#1] SMP
>  CPU: 1 PID: 22477 Comm: kworker/1:1 Not tainted 3.10.107 #1
>  Workqueue: cgroup_destroy css_dput_fn
>  task: 8817959a5780 ti: 8817e8886000 task.ti: 8817e8886000
>  RIP: 0010:[]  []
> cgroup_diput+0xc0/0xf0
>  RSP: :8817e8887da0  EFLAGS: 00010246
>  RAX:  RBX: 8817a5dd5d40 RCX: dead0200
>  RDX:  RSI: 8817973a6910 RDI: 8817f54c2a00
>  RBP: 8817e8887dc8 R08: 8817a5dd5dd0 R09: df9fb35794b01820
>  R10: df9fb35794b01820 R11: 7fa95b1efcda R12: 8817a5dd5d9c
>  R13: 8817f38b3a40 R14: 8817973a6910 R15: 8817973a6910
>  FS:  () GS:88181f22()
> knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 7fa6e6234000 CR3: 00179f19d000 CR4: 000407e0
>  DR0:  DR1:  DR2: 
>  DR3:  DR6: 0ff0 DR7: 0400
>  Stack:
>   8817a5dd5d40 8817a5dd5d9c 8817f38b3a40 8817973a6910
>   0040 8817e8887df8 811b37c2 8817fa23c000
>   8817f57dbb80 88181f232ac0 88181f237500 8817e8887e10
>  Call Trace:
>   [] dput+0x1a2/0x2f0
>   [] cgroup_dput.isra.21+0x1c/0x30
>   [] css_dput_fn+0x1d/0x20
>   [] process_one_work+0x17c/0x460
>   [] worker_thread+0x116/0x3b0
>   [] ? manage_workers.isra.25+0x290/0x290
>   [] kthread+0xc0/0xd0
>   [] ? insert_kthread_work+0x40/0x40
>   [] ret_from_fork+0x58/0x90
>   [] ? insert_kthread_work+0x40/0x40
>  Code: 41 5e 41 5f 5d c3 0f 1f 44 00 00 48 8b 7f 78 48 8b 07 a8 01 74 15
> 48 81 c7 30 01 00 00 48 c7 c6 a0 a7 0c 81 e8 b2 83 02 00 eb c8 <0f> 0b
> 49 8b 4e 18 48 c7 c2 7e f1 7a 81 be 85 03 00 00 48 c7 c7
>  RIP  [] cgroup_diput+0xc0/0xf0
>  RSP 
>  ---[ end trace 85eeea5212c44f51 ]---
> 
> 
> I think there is a css double put in mem_cgroup_iter. Under reclaim,
> we call mem_cgroup_iter the first time with prev == NULL, get the
> last_visited memcg from the per-zone reclaim_iter, and then call
> __mem_cgroup_iter_next to try to get the next alive memcg.
> __mem_cgroup_iter_next can return NULL if last_visited is already the
> last one, so we put the last_visited memcg's css and continue to the
> next iteration of the while loop. This time we might not do
> css_tryget(&last_visited->css) if the dead_count has changed, but we
> still do css_put(&last_visited->css). The css is put twice, which can
> trigger the BUG_ON at kernel/cgroup.c:893.

Yes, I guess you are right, and I suspect that this has been silently
fixed by 519ebea3bf6d ("mm: memcontrol: factor out reclaim iterator
loading and updating"). I think a more appropriate fix would be the
following. Are you able to reproduce and re-test it?
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 437ae2cbe102..0848ec05c12a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1224,6 +1224,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
if (last_visited && last_visited != root &&
!css_tryget(&last_visited->css))
last_visited = NULL;
+   } else {
+   last_visited = true;
}
}
 
-- 
Michal Hocko
SUSE Labs

Re: Sparse warnings on GENMASK + arm32

2017-07-26 Thread Luc Van Oostenryck

On Wed, Jul 26, 2017 at 09:33:01AM -0400, Lance Richardson wrote:
> > From: "Stephen Boyd" 
> > I see sparse warning when I check a clk driver file in the kernel
> > on a 32-bit ARM build.
> > 
> > drivers/clk/sunxi/clk-sun6i-ar100.c:65:20: warning: cast truncates bits from
> > constant value (3 becomes )
> 
> Hmm, it seems sparse is incorrectly taking ~0UL to be a 64-bit value
> while BITS_PER_LONG is (correctly) evaluated to be 32.
> 
> #define GENMASK(h, l) \
>   (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))

It's the kernel CHECKFLAGS that should be using -m32/-m64 if built
on a machine with a different wordsize than the arch.

I sent a patch for ARM earlier; I just forgot to CC this mailing list.

-- Luc
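
For readers following along, the mismatch can be modelled in plain C. This is a sketch, not sparse's actual evaluation: it assumes the checker treats ~0UL as 64 bits wide while BITS_PER_LONG is 32, and the function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_LONG_32 32	/* the value the ARM32 build defines */

/* GENMASK(h, l) evaluated with a genuinely 32-bit unsigned long. */
uint32_t genmask_long32(unsigned int h, unsigned int l)
{
	return (~UINT32_C(0) << l) &
	       (~UINT32_C(0) >> (BITS_PER_LONG_32 - 1 - h));
}

/* Same expression, but with ~0UL taken as 64 bits wide while
 * BITS_PER_LONG stays 32 (the assumed checker behaviour without -m32). */
uint64_t genmask_mixed(unsigned int h, unsigned int l)
{
	return (~UINT64_C(0) << l) &
	       (~UINT64_C(0) >> (BITS_PER_LONG_32 - 1 - h));
}

/* Usage: GENMASK(1, 0) should be 0x3, but the mixed evaluation yields a
 * 34-bit constant that no longer fits a 32-bit field. */
void genmask_demo(void)
{
	assert(genmask_long32(1, 0) == 0x3);
	assert(genmask_mixed(1, 0) == 0x3ffffffffULL);
	assert((uint32_t)genmask_mixed(1, 0) != genmask_long32(1, 0));
}
```

Passing the matching -m32/-m64 flag to the checker makes the width of long agree with BITS_PER_LONG, so the wide intermediate constant should not appear.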
