date:20171222

[PATCH v2 1/1] usb: xhci: dbc: Add SPDX identifiers to dbc files

2017-12-22 Thread Lu Baolu

Update the xhci dbc files with the correct SPDX license identifiers.

Fixes: dfba2174dc42 ("usb: xhci: Add DbC support in xHCI driver")
Cc: Philippe Ombredanne 
Signed-off-by: Lu Baolu 
---
 changes from v1:
 * Use plain C style comments in header file.

 drivers/usb/host/xhci-dbgcap.c | 1 +
 drivers/usb/host/xhci-dbgcap.h | 2 +-
 drivers/usb/host/xhci-dbgtty.c | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/host/xhci-dbgcap.c b/drivers/usb/host/xhci-dbgcap.c
index 452df0f..beea567 100644
--- a/drivers/usb/host/xhci-dbgcap.c
+++ b/drivers/usb/host/xhci-dbgcap.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /**
  * xhci-dbgcap.c - xHCI debug capability support
  *
diff --git a/drivers/usb/host/xhci-dbgcap.h b/drivers/usb/host/xhci-dbgcap.h
index e66ea07..ce0c607 100644
--- a/drivers/usb/host/xhci-dbgcap.h
+++ b/drivers/usb/host/xhci-dbgcap.h
@@ -1,4 +1,4 @@
-
+/* SPDX-License-Identifier: GPL-2.0 */
 /**
  * xhci-dbgcap.h - xHCI debug capability support
  *
diff --git a/drivers/usb/host/xhci-dbgtty.c b/drivers/usb/host/xhci-dbgtty.c
index 8d47b6f..033cc0e 100644
--- a/drivers/usb/host/xhci-dbgtty.c
+++ b/drivers/usb/host/xhci-dbgtty.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /**
  * xhci-dbgtty.c - tty glue for xHCI debug capability
  *
-- 
2.7.4

[PATCH v2 1/1] usb: xhci: dbc: Add SPDX identifiers to dbc files

2017-12-22 Thread Lu Baolu

Update the xhci dbc files with the correct SPDX license identifiers.

Fixes: dfba2174dc42 ("usb: xhci: Add DbC support in xHCI driver")
Cc: Philippe Ombredanne 
Signed-off-by: Lu Baolu 
---
 changes from v1:
 * Use plain C style comments in header file.

 drivers/usb/host/xhci-dbgcap.c | 1 +
 drivers/usb/host/xhci-dbgcap.h | 2 +-
 drivers/usb/host/xhci-dbgtty.c | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/host/xhci-dbgcap.c b/drivers/usb/host/xhci-dbgcap.c
index 452df0f..beea567 100644
--- a/drivers/usb/host/xhci-dbgcap.c
+++ b/drivers/usb/host/xhci-dbgcap.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /**
  * xhci-dbgcap.c - xHCI debug capability support
  *
diff --git a/drivers/usb/host/xhci-dbgcap.h b/drivers/usb/host/xhci-dbgcap.h
index e66ea07..ce0c607 100644
--- a/drivers/usb/host/xhci-dbgcap.h
+++ b/drivers/usb/host/xhci-dbgcap.h
@@ -1,4 +1,4 @@
-
+/* SPDX-License-Identifier: GPL-2.0 */
 /**
  * xhci-dbgcap.h - xHCI debug capability support
  *
diff --git a/drivers/usb/host/xhci-dbgtty.c b/drivers/usb/host/xhci-dbgtty.c
index 8d47b6f..033cc0e 100644
--- a/drivers/usb/host/xhci-dbgtty.c
+++ b/drivers/usb/host/xhci-dbgtty.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /**
  * xhci-dbgtty.c - tty glue for xHCI debug capability
  *
-- 
2.7.4

[PATCH] clk: mediatek: Fix all warnings for missing struct clk_onecell_data

2017-12-22 Thread sean.wang

From: Sean Wang 

In fact, the clk-mtk.h header is unnecessary for reset.c and thus it's
safe to remove it from the file to get rid of below build warnings.

All warnings (new ones prefixed by >>):

   In file included from drivers/clk/mediatek/reset.c:22:0:
>>drivers/clk/mediatek/clk-mtk.h:44:19: warning: 'struct clk_onecell_data'
declared inside parameter list will not be visible outside of
this definition or declaration
  int num, struct clk_onecell_data *clk_data);
  ^~~~
drivers/clk/mediatek/clk-mtk.h:63:19: warning: 'struct clk_onecell_data'
declared inside parameter list will not be visible outside of
this definition or declaration
  int num, struct clk_onecell_data *clk_data);
  ^~~~
drivers/clk/mediatek/clk-mtk.h:145:10: warning: 'struct clk_onecell_data'
declared inside parameter list will not be visible outside of
this definition or declaration
  struct clk_onecell_data *clk_data);
 ^~~~
drivers/clk/mediatek/clk-mtk.h:164:11: warning: 'struct clk_onecell_data'
declared inside parameter list will not be visible outside of
this definition or declaration
   struct clk_onecell_data *clk_data);
  ^~~~
drivers/clk/mediatek/clk-mtk.h:190:12: warning: 'struct clk_onecell_data'
declared inside parameter list will not be visible outside of this
definition or declaration
struct clk_onecell_data *clk_data);
   ^~~~

Reported-by: kbuild test robot 
Signed-off-by: Sean Wang 
Cc: kbuild-...@01.org
Cc: Stephen Boyd 
Cc: linux-...@vger.kernel.org
---
 drivers/clk/mediatek/reset.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/clk/mediatek/reset.c b/drivers/clk/mediatek/reset.c
index d3551d5..70ebb2e 100644
--- a/drivers/clk/mediatek/reset.c
+++ b/drivers/clk/mediatek/reset.c
@@ -19,8 +19,6 @@
 #include 
 #include 
 
-#include "clk-mtk.h"
-
 struct mtk_reset {
struct regmap *regmap;
int regofs;
-- 
2.7.4

[PATCH] clk: mediatek: Fix all warnings for missing struct clk_onecell_data

2017-12-22 Thread sean.wang

From: Sean Wang 

In fact, the clk-mtk.h header is unnecessary for reset.c and thus it's
safe to remove it from the file to get rid of below build warnings.

All warnings (new ones prefixed by >>):

   In file included from drivers/clk/mediatek/reset.c:22:0:
>>drivers/clk/mediatek/clk-mtk.h:44:19: warning: 'struct clk_onecell_data'
declared inside parameter list will not be visible outside of
this definition or declaration
  int num, struct clk_onecell_data *clk_data);
  ^~~~
drivers/clk/mediatek/clk-mtk.h:63:19: warning: 'struct clk_onecell_data'
declared inside parameter list will not be visible outside of
this definition or declaration
  int num, struct clk_onecell_data *clk_data);
  ^~~~
drivers/clk/mediatek/clk-mtk.h:145:10: warning: 'struct clk_onecell_data'
declared inside parameter list will not be visible outside of
this definition or declaration
  struct clk_onecell_data *clk_data);
 ^~~~
drivers/clk/mediatek/clk-mtk.h:164:11: warning: 'struct clk_onecell_data'
declared inside parameter list will not be visible outside of
this definition or declaration
   struct clk_onecell_data *clk_data);
  ^~~~
drivers/clk/mediatek/clk-mtk.h:190:12: warning: 'struct clk_onecell_data'
declared inside parameter list will not be visible outside of this
definition or declaration
struct clk_onecell_data *clk_data);
   ^~~~

Reported-by: kbuild test robot 
Signed-off-by: Sean Wang 
Cc: kbuild-...@01.org
Cc: Stephen Boyd 
Cc: linux-...@vger.kernel.org
---
 drivers/clk/mediatek/reset.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/clk/mediatek/reset.c b/drivers/clk/mediatek/reset.c
index d3551d5..70ebb2e 100644
--- a/drivers/clk/mediatek/reset.c
+++ b/drivers/clk/mediatek/reset.c
@@ -19,8 +19,6 @@
 #include 
 #include 
 
-#include "clk-mtk.h"
-
 struct mtk_reset {
struct regmap *regmap;
int regofs;
-- 
2.7.4

Re: [PATCH] arm: dts: mt7623: enable all four available UARTs on bananapi-r2

2017-12-22 Thread Matthias Brugger



On 12/22/2017 07:06 AM, sean.w...@mediatek.com wrote:
> From: Sean Wang 
> 
> On bpi-r2 board, totally there're four uarts which we usually called
> uart[0-3] helpful to extend slow I/O devices. Among those ones, uart2 has
> dedicated pin slot which is used to conolse log. uart[0-1] appear at the
> 40-pins connector and uart3 has no pinout, but just has test points (TP47
> for TX and TP48 for RX, respectively) nearby uart2. Also, some missing
> pinctrl is being complemented for those devices.
> 
> Signed-off-by: Sean Wang 
> ---
>  arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts | 26 --
>  1 file changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts 
> b/arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts
> index 7bf5aa2..64bf5db 100644
> --- a/arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts
> +++ b/arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts
> @@ -409,6 +409,20 @@
>;
>   };
>   };
> +
> + uart2_pins_a: uart@2 {
> + pins_dat {
> + pinmux = ,
> +  ;
> + };
> + };
> +
> + uart3_pins_a: uart@3 {
> + pins_dat {
> + pinmux = ,
> +  ;
> + };
> + };
>  };
>  
>   {
> @@ -454,16 +468,24 @@
>   {
>   pinctrl-names = "default";
>   pinctrl-0 = <_pins_a>;
> - status = "disabled";
> + status = "okay";
>  };
>  
>   {
>   pinctrl-names = "default";
>   pinctrl-0 = <_pins_a>;
> - status = "disabled";
> + status = "okay";
>  };
>  
>   {
> + pinctrl-names = "default";
> + pinctrl-0 = <_pins_a>;
> + status = "okay";
> +};
> +
> + {
> + pinctrl-names = "default";
> + pinctrl-0 = <_pins_a>;
>   status = "okay";
>  };
>  

Why do we want to enable uart3 when there are only test points?
It is not very useful, or do I oversee something?

Regards,
Matthias

Re: [PATCH] arm: dts: mt7623: enable all four available UARTs on bananapi-r2

2017-12-22 Thread Matthias Brugger



On 12/22/2017 07:06 AM, sean.w...@mediatek.com wrote:
> From: Sean Wang 
> 
> On bpi-r2 board, totally there're four uarts which we usually called
> uart[0-3] helpful to extend slow I/O devices. Among those ones, uart2 has
> dedicated pin slot which is used to conolse log. uart[0-1] appear at the
> 40-pins connector and uart3 has no pinout, but just has test points (TP47
> for TX and TP48 for RX, respectively) nearby uart2. Also, some missing
> pinctrl is being complemented for those devices.
> 
> Signed-off-by: Sean Wang 
> ---
>  arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts | 26 --
>  1 file changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts 
> b/arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts
> index 7bf5aa2..64bf5db 100644
> --- a/arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts
> +++ b/arch/arm/boot/dts/mt7623n-bananapi-bpi-r2.dts
> @@ -409,6 +409,20 @@
>;
>   };
>   };
> +
> + uart2_pins_a: uart@2 {
> + pins_dat {
> + pinmux = ,
> +  ;
> + };
> + };
> +
> + uart3_pins_a: uart@3 {
> + pins_dat {
> + pinmux = ,
> +  ;
> + };
> + };
>  };
>  
>   {
> @@ -454,16 +468,24 @@
>   {
>   pinctrl-names = "default";
>   pinctrl-0 = <_pins_a>;
> - status = "disabled";
> + status = "okay";
>  };
>  
>   {
>   pinctrl-names = "default";
>   pinctrl-0 = <_pins_a>;
> - status = "disabled";
> + status = "okay";
>  };
>  
>   {
> + pinctrl-names = "default";
> + pinctrl-0 = <_pins_a>;
> + status = "okay";
> +};
> +
> + {
> + pinctrl-names = "default";
> + pinctrl-0 = <_pins_a>;
>   status = "okay";
>  };
>  

Why do we want to enable uart3 when there are only test points?
It is not very useful, or do I oversee something?

Regards,
Matthias

Re: [PATCH 1/1] usb: xhci: dbc: Add SPDX identifiers to dbc files

2017-12-22 Thread Lu Baolu

Hi,

On 12/22/2017 04:32 PM, Philippe Ombredanne wrote:
> Lu,
>
> On Fri, Dec 22, 2017 at 2:34 AM, Lu Baolu  wrote:
>> Update the xhci dbc files with the correct SPDX license identifiers.
>>
>> Fixes: dfba2174dc42 ("usb: xhci: Add DbC support in xHCI driver")
>> Signed-off-by: Lu Baolu 
>> ---
>>  drivers/usb/host/xhci-dbgcap.c | 1 +
>>  drivers/usb/host/xhci-dbgcap.h | 2 +-
>>  drivers/usb/host/xhci-dbgtty.c | 1 +
>>  3 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/usb/host/xhci-dbgcap.c b/drivers/usb/host/xhci-dbgcap.c
>> index 452df0f..beea567 100644
>> --- a/drivers/usb/host/xhci-dbgcap.c
>> +++ b/drivers/usb/host/xhci-dbgcap.c
>> @@ -1,3 +1,4 @@
>> +// SPDX-License-Identifier: GPL-2.0
>>  /**
>>   * xhci-dbgcap.c - xHCI debug capability support
>>   *
>> diff --git a/drivers/usb/host/xhci-dbgcap.h b/drivers/usb/host/xhci-dbgcap.h
>> index e66ea07..6c8906d 100644
>> --- a/drivers/usb/host/xhci-dbgcap.h
>> +++ b/drivers/usb/host/xhci-dbgcap.h
>> @@ -1,4 +1,4 @@
>> -
>> +// SPDX-License-Identifier: GPL-2.0
> Headers should be using plain C style comments per Thomas
> documentation patches.

Thanks. I will send a v2 for this.

Best regards,
Lu Baolu

>> +/* SPDX-License-Identifier: GPL-2.0 */
>>  /**
>>   * xhci-dbgcap.h - xHCI debug capability support
>>   *
>> diff --git a/drivers/usb/host/xhci-dbgtty.c b/drivers/usb/host/xhci-dbgtty.c
>> index 8d47b6f..033cc0e 100644
>> --- a/drivers/usb/host/xhci-dbgtty.c
>> +++ b/drivers/usb/host/xhci-dbgtty.c
>> @@ -1,3 +1,4 @@
>> +// SPDX-License-Identifier: GPL-2.0
>>  /**
>>   * xhci-dbgtty.c - tty glue for xHCI debug capability
>>   *

Re: [PATCH 1/1] usb: xhci: dbc: Add SPDX identifiers to dbc files

2017-12-22 Thread Lu Baolu

Hi,

On 12/22/2017 04:32 PM, Philippe Ombredanne wrote:
> Lu,
>
> On Fri, Dec 22, 2017 at 2:34 AM, Lu Baolu  wrote:
>> Update the xhci dbc files with the correct SPDX license identifiers.
>>
>> Fixes: dfba2174dc42 ("usb: xhci: Add DbC support in xHCI driver")
>> Signed-off-by: Lu Baolu 
>> ---
>>  drivers/usb/host/xhci-dbgcap.c | 1 +
>>  drivers/usb/host/xhci-dbgcap.h | 2 +-
>>  drivers/usb/host/xhci-dbgtty.c | 1 +
>>  3 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/usb/host/xhci-dbgcap.c b/drivers/usb/host/xhci-dbgcap.c
>> index 452df0f..beea567 100644
>> --- a/drivers/usb/host/xhci-dbgcap.c
>> +++ b/drivers/usb/host/xhci-dbgcap.c
>> @@ -1,3 +1,4 @@
>> +// SPDX-License-Identifier: GPL-2.0
>>  /**
>>   * xhci-dbgcap.c - xHCI debug capability support
>>   *
>> diff --git a/drivers/usb/host/xhci-dbgcap.h b/drivers/usb/host/xhci-dbgcap.h
>> index e66ea07..6c8906d 100644
>> --- a/drivers/usb/host/xhci-dbgcap.h
>> +++ b/drivers/usb/host/xhci-dbgcap.h
>> @@ -1,4 +1,4 @@
>> -
>> +// SPDX-License-Identifier: GPL-2.0
> Headers should be using plain C style comments per Thomas
> documentation patches.

Thanks. I will send a v2 for this.

Best regards,
Lu Baolu

>> +/* SPDX-License-Identifier: GPL-2.0 */
>>  /**
>>   * xhci-dbgcap.h - xHCI debug capability support
>>   *
>> diff --git a/drivers/usb/host/xhci-dbgtty.c b/drivers/usb/host/xhci-dbgtty.c
>> index 8d47b6f..033cc0e 100644
>> --- a/drivers/usb/host/xhci-dbgtty.c
>> +++ b/drivers/usb/host/xhci-dbgtty.c
>> @@ -1,3 +1,4 @@
>> +// SPDX-License-Identifier: GPL-2.0
>>  /**
>>   * xhci-dbgtty.c - tty glue for xHCI debug capability
>>   *

Re: [PATCH 04/11] x86: geode: constify gpio_led

2017-12-22 Thread kbuild test robot

Hi Arvind,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on arm-soc/for-next]
[also build test WARNING on v4.15-rc4 next-20171222]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Arvind-Yadav/constify-gpio_led/20171223-142117
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc.git for-next
config: i386-randconfig-s1-201751 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

>> WARNING: vmlinux.o(.data+0x16554): Section mismatch in reference from the 
>> variable geos_leds_data to the variable .init.rodata:geos_leds
   The variable geos_leds_data references
   the variable __initconst geos_leds
   If the reference is valid then annotate the
   variable with or __refdata (see linux/init.h) or name the variable:
   

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

Re: [PATCH 04/11] x86: geode: constify gpio_led

2017-12-22 Thread kbuild test robot

Hi Arvind,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on arm-soc/for-next]
[also build test WARNING on v4.15-rc4 next-20171222]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Arvind-Yadav/constify-gpio_led/20171223-142117
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc.git for-next
config: i386-randconfig-s1-201751 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

>> WARNING: vmlinux.o(.data+0x16554): Section mismatch in reference from the 
>> variable geos_leds_data to the variable .init.rodata:geos_leds
   The variable geos_leds_data references
   the variable __initconst geos_leds
   If the reference is valid then annotate the
   variable with or __refdata (see linux/init.h) or name the variable:
   

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

Re: [PATCH 0/1] Re: kernel BUG at fs/userfaultfd.c:LINE!

2017-12-22 Thread Dmitry Vyukov

On Sat, Dec 23, 2017 at 1:25 AM, Andrea Arcangeli  wrote:
> Hello,
>
> Thanks for the CC, I'm temporarily very busy so if there's something
> urgent, safer to CC.

Hi,

syzbot uses get_maintainer.pl and for fs/userfaultfd.c you are not
there, so if you want to be CCed please add yourself to MAINTAINERS.


> This passed both testcases, the hard part was already done. I'm glad
> there was nothing wrong in the previous fix that had to be redone.
>
> Simply we forgot to undo the vma->vm_userfaultfd_ctx = NULL after
> aborting the new child uffd ctx, the original code of course didn't do
> that either.
>
> Having just seen this issue, this isn't very well tested.
>
> Thank you,
> Andrea
>
> Andrea Arcangeli (1):
>   userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK
> fails
>
>  fs/userfaultfd.c | 20 ++--
>  1 file changed, 18 insertions(+), 2 deletions(-)

The original report footer was stripped, so:

Please credit me with: Reported-by: syzbot 

and we also need to tell syzbot about the fix with:

#syz fix:
userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails

Re: [PATCH 0/1] Re: kernel BUG at fs/userfaultfd.c:LINE!

2017-12-22 Thread Dmitry Vyukov

On Sat, Dec 23, 2017 at 1:25 AM, Andrea Arcangeli  wrote:
> Hello,
>
> Thanks for the CC, I'm temporarily very busy so if there's something
> urgent, safer to CC.

Hi,

syzbot uses get_maintainer.pl and for fs/userfaultfd.c you are not
there, so if you want to be CCed please add yourself to MAINTAINERS.


> This passed both testcases, the hard part was already done. I'm glad
> there was nothing wrong in the previous fix that had to be redone.
>
> Simply we forgot to undo the vma->vm_userfaultfd_ctx = NULL after
> aborting the new child uffd ctx, the original code of course didn't do
> that either.
>
> Having just seen this issue, this isn't very well tested.
>
> Thank you,
> Andrea
>
> Andrea Arcangeli (1):
>   userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK
> fails
>
>  fs/userfaultfd.c | 20 ++--
>  1 file changed, 18 insertions(+), 2 deletions(-)

The original report footer was stripped, so:

Please credit me with: Reported-by: syzbot 

and we also need to tell syzbot about the fix with:

#syz fix:
userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails

Re: [PATCH v16 3/5] mfd: Add driver for RAVE Supervisory Processor

2017-12-22 Thread kbuild test robot

Hi Andrey,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.15-rc4]
[cannot apply to next-20171222]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Andrey-Smirnov/ZII-RAVE-platform-driver/20171223-132544
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.2.0-12) 7.2.1 20171025
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/mfd/rave-sp.c: In function 'csum_ccitt':
>> drivers/mfd/rave-sp.c:227:25: error: implicit declaration of function 
>> 'crc_ccitt_false'; did you mean 'crc_ccitt_byte'? 
>> [-Werror=implicit-function-declaration]
 const u16 calculated = crc_ccitt_false(0x, buf, size);
^~~
crc_ccitt_byte
   cc1: some warnings being treated as errors

vim +227 drivers/mfd/rave-sp.c

   224  
   225  static void csum_ccitt(const u8 *buf, size_t size, u8 *crc)
   226  {
 > 227  const u16 calculated = crc_ccitt_false(0x, buf, size);
   228  
   229  /*
   230   * While the rest of the wire protocol is little-endian,
   231   * CCITT-16 CRC in RDU2 device is sent out in big-endian order.
   232   */
   233  put_unaligned_be16(calculated, crc);
   234  }
   235  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

Re: [PATCH v16 3/5] mfd: Add driver for RAVE Supervisory Processor

2017-12-22 Thread kbuild test robot

Hi Andrey,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.15-rc4]
[cannot apply to next-20171222]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Andrey-Smirnov/ZII-RAVE-platform-driver/20171223-132544
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.2.0-12) 7.2.1 20171025
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/mfd/rave-sp.c: In function 'csum_ccitt':
>> drivers/mfd/rave-sp.c:227:25: error: implicit declaration of function 
>> 'crc_ccitt_false'; did you mean 'crc_ccitt_byte'? 
>> [-Werror=implicit-function-declaration]
 const u16 calculated = crc_ccitt_false(0x, buf, size);
^~~
crc_ccitt_byte
   cc1: some warnings being treated as errors

vim +227 drivers/mfd/rave-sp.c

   224  
   225  static void csum_ccitt(const u8 *buf, size_t size, u8 *crc)
   226  {
 > 227  const u16 calculated = crc_ccitt_false(0x, buf, size);
   228  
   229  /*
   230   * While the rest of the wire protocol is little-endian,
   231   * CCITT-16 CRC in RDU2 device is sent out in big-endian order.
   232   */
   233  put_unaligned_be16(calculated, crc);
   234  }
   235  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

Re: platform/x86/thinkpad_acpi: Adjustments for four function implementations

2017-12-22 Thread SF Markus Elfring

>> Do you find the Linux allocation failure report insufficient in this case?
> 
> Leave those pr_ messages alone, please,

Have you got special software development concerns?


> unless they are really causing some sort of issue (which?).

Can the code be redundant here?


> Doing it just for "compliance" with a doc isn't nearly good enough reason.

Do you give only a low value to coding style guidelines in this use case?

Regards,
Markus

Re: platform/x86/thinkpad_acpi: Adjustments for four function implementations

2017-12-22 Thread SF Markus Elfring

>> Do you find the Linux allocation failure report insufficient in this case?
> 
> Leave those pr_ messages alone, please,

Have you got special software development concerns?


> unless they are really causing some sort of issue (which?).

Can the code be redundant here?


> Doing it just for "compliance" with a doc isn't nearly good enough reason.

Do you give only a low value to coding style guidelines in this use case?

Regards,
Markus

Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-22 Thread Anshuman Khandual

On 12/23/2017 03:43 AM, Ross Zwisler wrote:
> On Fri, Dec 22, 2017 at 08:39:41AM +0530, Anshuman Khandual wrote:
>> On 12/14/2017 07:40 AM, Ross Zwisler wrote:
>>>  Quick Summary 
>>>
>>> Platforms exist today which have multiple types of memory attached to a
>>> single CPU.  These disparate memory ranges have some characteristics in
>>> common, such as CPU cache coherence, but they can have wide ranges of
>>> performance both in terms of latency and bandwidth.
>>
>> Right.
>>
>>>
>>> For example, consider a system that contains persistent memory, standard
>>> DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
>>> There could potentially be an order of magnitude or more difference in
>>> performance between the slowest and fastest memory attached to that CPU.
>>
>> Right.
>>
>>>
>>> With the current Linux code NUMA nodes are CPU-centric, so all the memory
>>> attached to a given CPU will be lumped into the same NUMA node.  This makes
>>> it very difficult for userspace applications to understand the performance
>>> of different memory ranges on a given CPU.
>>
>> Right but that might require fundamental changes to the NUMA representation.
>> Plugging those memory as separate NUMA nodes, identify them through sysfs
>> and try allocating from it through mbind() seems like a short term solution.
>>
>> Though if we decide to go in this direction, sysfs interface or something
>> similar is required to enumerate memory properties.
> 
> Yep, and this patch series is trying to be the sysfs interface that is
> required to the memory properties.  :)  It's a certainty that we will have
> memory-only NUMA nodes, at least on platforms that support ACPI.  Supporting
> memory-only proximity domains (which Linux turns in to memory-only NUMA nodes)
> is explicitly supported with the introduction of the HMAT in ACPI 6.2.

Yeah, even on POWER platforms can have memory only NUMA nodes.

> 
> It also turns out that the existing memory management code already deals with
> them just fine - you see this with my hmat_examples setup:
> 
> https://github.com/rzwisler/hmat_examples
> 
> Both configurations created by this repo create memory-only NUMA nodes, even
> with upstream kernels.  My patches don't change that, they just provide a
> sysfs representation of the HMAT so users can discover the memory that exists
> in the system.

Once its a NUMA node everything will work as is from MM interface
point of view. But the point is how we export these properties to
user space. My only concern is lets not do it in a way which will
be locked without first going through NUMA redesign for these new
attribute based memory, thats all.

> 
>>> We solve this issue by providing userspace with performance information on
>>> individual memory ranges.  This performance information is exposed via
>>> sysfs:
>>>
>>>   # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
>>>   mem_tgt2/firmware_id:1
>>>   mem_tgt2/is_cached:0
>>>   mem_tgt2/local_init/read_bw_MBps:40960
>>>   mem_tgt2/local_init/read_lat_nsec:50
>>>   mem_tgt2/local_init/write_bw_MBps:40960
>>>   mem_tgt2/local_init/write_lat_nsec:50
>>
>> I might have missed discussions from earlier versions, why we have this
>> kind of a "source --> target" model ? We will enlist properties for all
>> possible "source --> target" on the system ? Right now it shows only
>> bandwidth and latency properties, can it accommodate other properties
>> as well in future ?
> 
> The initiator/target model is useful in preventing us from needing a
> MAX_NUMA_NODES x MAX_NUMA_NODES sized table for each performance attribute.  I
> talked about it a little more here:

That makes it even more complex. Not only we have a memory attribute
like bandwidth specific to the range, we are also exporting it's
relative values as seen from different CPU nodes. Its again kind of
a NUMA distance table being exported in the generic sysfs path like
/sys/devices/. The problem is possible future memory attributes like
'reliability', 'density', 'power consumption' might not have a need
for a "source --> destination" kind of model as they dont change
based on which CPU node is accessing it.

> 
> https://lists.01.org/pipermail/linux-nvdimm/2017-December/013654.html
> 
>>> This allows applications to easily find the memory that they want to use.
>>> We expect that the existing NUMA APIs will be enhanced to use this new
>>> information so that applications can continue to use them to select their
>>> desired memory.
>>
>> I had presented a proposal for NUMA redesign in the Plumbers Conference this
>> year where various memory devices with different kind of memory attributes
>> can be represented in the kernel and be used explicitly from the user space.
>> Here is the link to the proposal if you feel interested. The proposal is
>> very intrusive and also I dont have a RFC for it yet for discussion here.
>>
>>

Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-22 Thread Anshuman Khandual

On 12/23/2017 03:43 AM, Ross Zwisler wrote:
> On Fri, Dec 22, 2017 at 08:39:41AM +0530, Anshuman Khandual wrote:
>> On 12/14/2017 07:40 AM, Ross Zwisler wrote:
>>>  Quick Summary 
>>>
>>> Platforms exist today which have multiple types of memory attached to a
>>> single CPU.  These disparate memory ranges have some characteristics in
>>> common, such as CPU cache coherence, but they can have wide ranges of
>>> performance both in terms of latency and bandwidth.
>>
>> Right.
>>
>>>
>>> For example, consider a system that contains persistent memory, standard
>>> DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
>>> There could potentially be an order of magnitude or more difference in
>>> performance between the slowest and fastest memory attached to that CPU.
>>
>> Right.
>>
>>>
>>> With the current Linux code NUMA nodes are CPU-centric, so all the memory
>>> attached to a given CPU will be lumped into the same NUMA node.  This makes
>>> it very difficult for userspace applications to understand the performance
>>> of different memory ranges on a given CPU.
>>
>> Right but that might require fundamental changes to the NUMA representation.
>> Plugging those memory as separate NUMA nodes, identify them through sysfs
>> and try allocating from it through mbind() seems like a short term solution.
>>
>> Though if we decide to go in this direction, sysfs interface or something
>> similar is required to enumerate memory properties.
> 
> Yep, and this patch series is trying to be the sysfs interface that is
> required to the memory properties.  :)  It's a certainty that we will have
> memory-only NUMA nodes, at least on platforms that support ACPI.  Supporting
> memory-only proximity domains (which Linux turns in to memory-only NUMA nodes)
> is explicitly supported with the introduction of the HMAT in ACPI 6.2.

Yeah, even on POWER platforms can have memory only NUMA nodes.

> 
> It also turns out that the existing memory management code already deals with
> them just fine - you see this with my hmat_examples setup:
> 
> https://github.com/rzwisler/hmat_examples
> 
> Both configurations created by this repo create memory-only NUMA nodes, even
> with upstream kernels.  My patches don't change that, they just provide a
> sysfs representation of the HMAT so users can discover the memory that exists
> in the system.

Once its a NUMA node everything will work as is from MM interface
point of view. But the point is how we export these properties to
user space. My only concern is lets not do it in a way which will
be locked without first going through NUMA redesign for these new
attribute based memory, thats all.

> 
>>> We solve this issue by providing userspace with performance information on
>>> individual memory ranges.  This performance information is exposed via
>>> sysfs:
>>>
>>>   # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
>>>   mem_tgt2/firmware_id:1
>>>   mem_tgt2/is_cached:0
>>>   mem_tgt2/local_init/read_bw_MBps:40960
>>>   mem_tgt2/local_init/read_lat_nsec:50
>>>   mem_tgt2/local_init/write_bw_MBps:40960
>>>   mem_tgt2/local_init/write_lat_nsec:50
>>
>> I might have missed discussions from earlier versions, why we have this
>> kind of a "source --> target" model ? We will enlist properties for all
>> possible "source --> target" on the system ? Right now it shows only
>> bandwidth and latency properties, can it accommodate other properties
>> as well in future ?
> 
> The initiator/target model is useful in preventing us from needing a
> MAX_NUMA_NODES x MAX_NUMA_NODES sized table for each performance attribute.  I
> talked about it a little more here:

That makes it even more complex. Not only we have a memory attribute
like bandwidth specific to the range, we are also exporting it's
relative values as seen from different CPU nodes. Its again kind of
a NUMA distance table being exported in the generic sysfs path like
/sys/devices/. The problem is possible future memory attributes like
'reliability', 'density', 'power consumption' might not have a need
for a "source --> destination" kind of model as they dont change
based on which CPU node is accessing it.

> 
> https://lists.01.org/pipermail/linux-nvdimm/2017-December/013654.html
> 
>>> This allows applications to easily find the memory that they want to use.
>>> We expect that the existing NUMA APIs will be enhanced to use this new
>>> information so that applications can continue to use them to select their
>>> desired memory.
>>
>> I had presented a proposal for NUMA redesign in the Plumbers Conference this
>> year where various memory devices with different kind of memory attributes
>> can be represented in the kernel and be used explicitly from the user space.
>> Here is the link to the proposal if you feel interested. The proposal is
>> very intrusive and also I dont have a RFC for it yet for discussion here.
>>
>>

Re: Prototype patch for Linux-kernel memory model

2017-12-22 Thread afzal mohammed

Hi,

On Fri, Dec 22, 2017 at 09:41:32AM +0530, afzal mohammed wrote:
> On Thu, Dec 21, 2017 at 08:15:02AM -0800, Paul E. McKenney wrote:

> > Have you installed and run the herd tool?  Doing so would allow you
> > to experiment with changes to the litmus tests.
> 
> Yes, i installed herd tool and then i was at a loss :(, so started
> re-reading the documentation, yet to run any of the tests.

Above was referring to "opam install herdtools7" & the pre-requisites,
with the current HEAD of herd, build fails as below, but builds fine
with the latest tag - 7.47.

Could run a couple of tests as well now, thanks.

afzal


herdtools7(master)$ make all
sh ./build.sh $HOME
+ /usr/bin/ocamldep.opt -modules gen/RISCVCompile_gen.ml > 
gen/RISCVCompile_gen.ml.depends
File "gen/RISCVCompile_gen.ml", line 94, characters 8-9:
Error: Syntax error
Command exited with code 2.
Compilation unsuccessful after building 1439 targets (0 cached) in 00:00:59.
Makefile:4: recipe for target 'all' failed
make: *** [all] Error 10

Re: Prototype patch for Linux-kernel memory model

2017-12-22 Thread afzal mohammed

Hi,

On Fri, Dec 22, 2017 at 09:41:32AM +0530, afzal mohammed wrote:
> On Thu, Dec 21, 2017 at 08:15:02AM -0800, Paul E. McKenney wrote:

> > Have you installed and run the herd tool?  Doing so would allow you
> > to experiment with changes to the litmus tests.
> 
> Yes, i installed herd tool and then i was at a loss :(, so started
> re-reading the documentation, yet to run any of the tests.

Above was referring to "opam install herdtools7" & the pre-requisites,
with the current HEAD of herd, build fails as below, but builds fine
with the latest tag - 7.47.

Could run a couple of tests as well now, thanks.

afzal


herdtools7(master)$ make all
sh ./build.sh $HOME
+ /usr/bin/ocamldep.opt -modules gen/RISCVCompile_gen.ml > 
gen/RISCVCompile_gen.ml.depends
File "gen/RISCVCompile_gen.ml", line 94, characters 8-9:
Error: Syntax error
Command exited with code 2.
Compilation unsuccessful after building 1439 targets (0 cached) in 00:00:59.
Makefile:4: recipe for target 'all' failed
make: *** [all] Error 10

Re: [RFC][PATCH][REGRESSION FIX] x86/ftrace: Add ORC annotation to 64 bit ftrace assembly

2017-12-22 Thread Steven Rostedt

On Fri, 22 Dec 2017 20:43:37 -0600
Josh Poimboeuf  wrote:

> Objtool doesn't like the fact that save_mcount_regs pushes rbp at the
> beginning, but it's never restored (directly, at least).
> 
> How about something like this instead, where it skips the original rbp
> push?  I didn't test it, but I think it should work.  It at least gets
> rid of the warnings and doesn't need any unwind hints other than the
> UNWIND_HINT_EMPTY in return_to_handler.

There's a reason I did it that way, but don't remember exactly why. I
think we discussed this a while ago when we had issues with ftrace.S
during your first iterations of objtool.

I'll have to test all different compile options and different gcc
compiler versions.

-- Steve

Re: [RFC][PATCH][REGRESSION FIX] x86/ftrace: Add ORC annotation to 64 bit ftrace assembly

2017-12-22 Thread Steven Rostedt

On Fri, 22 Dec 2017 20:43:37 -0600
Josh Poimboeuf  wrote:

> Objtool doesn't like the fact that save_mcount_regs pushes rbp at the
> beginning, but it's never restored (directly, at least).
> 
> How about something like this instead, where it skips the original rbp
> push?  I didn't test it, but I think it should work.  It at least gets
> rid of the warnings and doesn't need any unwind hints other than the
> UNWIND_HINT_EMPTY in return_to_handler.

There's a reason I did it that way, but don't remember exactly why. I
think we discussed this a while ago when we had issues with ftrace.S
during your first iterations of objtool.

I'll have to test all different compile options and different gcc
compiler versions.

-- Steve

Re: [RFC] does ioremap() cause memory leak?

2017-12-22 Thread Xishi Qiu

On 2017/12/21 16:55, Xishi Qiu wrote:

> When we use iounmap() to free the mapping, it calls unmap_vmap_area() to 
> clear page table,
> but do not free the memory of page table, right?
> 
> So when use ioremap() to mapping another area(incluce the area before), it 
> may use
> large mapping(e.g. ioremap_pmd_enabled()), so the original page table 
> memory(e.g. pte memory)
> will be lost, it cause memory leak, right?
> 
> Thanks,
> Xishi Qiu
> 
> 
> .
> 

Hi, here is another question from lious.li...@hisilicon.com

As ARM-ARM said

“The architecture permits the caching of any translation table entry that has 
been returned from memory without a

fault, provided that the entry does not, itself, cause a Translation fault, an 
Address size fault, or an Access Flag fault.

This means that the entries that can be cached include:

• Entries in translation tables that point to subsequent tables to be used in 
that stage of translation.

• Stage 2 translation table entries used as part of a stage 1 translation table 
walk

• Stage 2 translation table entries used to translate the output address of the 
stage 1 translation.”

this means pgd, pud, pmd, pte all can be cached in TLB if itself have not a 
fault.

the scenario want page walk from:

4K:pgd0 --> pud0 --> pmd0 --> pte0 (4K)

To

2M:   pgd0 --> pud0 --> pte1(2M)

--> is connect next pagetable

-X-> is disconnect next pagetable

I have seen the ioremap and iounmap software flow for ARM64 in Kernel version 
4.14.

When I use ioremap to get a valid virtual address for a device address, Kernel 
would use ioremap_page_range to config the pagetable.

In ioremap_page_range function, if there is no pud, pmd or pte, Kernel would 
alloc one page for it. And then Kernel write the valid value into the address.

When I use iounmap to release this area, Kernel would write zero into the last 
level pagetable, then execute tlbi vaae1is to flush the tlb. But I haven`t seen 
Kernel would free the used page for pud, pmd or pte.

So there is a scene, I config Kernel to use 4K pagetable, and enable 
CONFIG_HAVE_ARCH_HUGE_VMAP. The when I use ioremap, Kernel would config 1G, 2M 
or 4K pagetable according to the size.

First I use ioremap to ask for 4K size. Kernel returns a virtual address VA1. 
Then I use iounmap to free this area. Kernel would write zero into the VA1`s 
level3 pagetable. Then when Kernel wants to get VA1 back, Kernel would send a 
tlbi vaae1is.

the page become follow:

1. 4K:   pgd0 --> pud0 --> pmd0 --> pte0 (4K)

2. pte0 write 0

3. 4K:   pgd0 --> pud0 --> pmd0(still valid) -X-> pte0 (4K,not valid)

4. tlbi vaae1is

Sencond I use ioremap to ask for 2M size. Kernel would config a 2M page, then 
return the virtual address. And Kernel just allocates the same virtual address 
VA1 for me. But I see in the ioremap_page_range software flow, Kernel just 
write the valid value into the level2 pagetable address, and doesn`t release 
the allocated page for the previous level3 pagetable. And when Kernel modifies 
the level2 pagetable, it also doesn`t follow the ARM break-before-make flow.

the page change as follow:

1.pgd0 --> pud0 --> pmd0(still valid) -X-> pte0 (4K,not valid)

2.write pmd0(still valid) to block for 2M.

3.expect pgd0 --> pud0 --> pte1(2M)

but because pmd0(4K pmd, still valid) before becoming to pte(2M pte), maybe 
have a speculative access between 1 and 3.

the pgd0, pud0, pmd0 have no fault will be cached in TBL, the pte0 have fault 
so can't be cached, this speculative access will be drop(no exception).

and the page change as:

1.pgd0 --> pud0 --> pmd0(still valid) -X-> pte0 (4K,not valid)

2.speculative access the same VA(pgd0 --> pud0 --> pmd0(still valid) -X-> pte0 
(4K,not valid)). cache the pgd0, pud0, pmd0.

3.write pmd0 from pmd to block(pte) for 2M.

4.the page walker maybe pgd0 --> pud0 --> pmd0(cached in TLB) --> 0x0 
(translation fault)

So I have two questions for this scene.

1. When the same virtual address allocated from ioremap, first is 4K size, 
second is 2M size, if Kernel would leak memory.

2. Kernel modifies the old invalid 4K pagetable to 2M, but doesn`t follow the 
ARM break-before-make flow, CPU maybe get the old invalid 4K pagetable 
information, then Kernel would panic.

Re: [RFC] does ioremap() cause memory leak?

2017-12-22 Thread Xishi Qiu

On 2017/12/21 16:55, Xishi Qiu wrote:

> When we use iounmap() to free the mapping, it calls unmap_vmap_area() to 
> clear page table,
> but do not free the memory of page table, right?
> 
> So when use ioremap() to mapping another area(incluce the area before), it 
> may use
> large mapping(e.g. ioremap_pmd_enabled()), so the original page table 
> memory(e.g. pte memory)
> will be lost, it cause memory leak, right?
> 
> Thanks,
> Xishi Qiu
> 
> 
> .
> 

Hi, here is another question from lious.li...@hisilicon.com

As ARM-ARM said

“The architecture permits the caching of any translation table entry that has 
been returned from memory without a

fault, provided that the entry does not, itself, cause a Translation fault, an 
Address size fault, or an Access Flag fault.

This means that the entries that can be cached include:

• Entries in translation tables that point to subsequent tables to be used in 
that stage of translation.

• Stage 2 translation table entries used as part of a stage 1 translation table 
walk

• Stage 2 translation table entries used to translate the output address of the 
stage 1 translation.”

this means pgd, pud, pmd, pte all can be cached in TLB if itself have not a 
fault.

the scenario want page walk from:

4K:pgd0 --> pud0 --> pmd0 --> pte0 (4K)

To

2M:   pgd0 --> pud0 --> pte1(2M)

--> is connect next pagetable

-X-> is disconnect next pagetable

I have seen the ioremap and iounmap software flow for ARM64 in Kernel version 
4.14.

When I use ioremap to get a valid virtual address for a device address, Kernel 
would use ioremap_page_range to config the pagetable.

In ioremap_page_range function, if there is no pud, pmd or pte, Kernel would 
alloc one page for it. And then Kernel write the valid value into the address.

When I use iounmap to release this area, Kernel would write zero into the last 
level pagetable, then execute tlbi vaae1is to flush the tlb. But I haven`t seen 
Kernel would free the used page for pud, pmd or pte.

So there is a scene, I config Kernel to use 4K pagetable, and enable 
CONFIG_HAVE_ARCH_HUGE_VMAP. The when I use ioremap, Kernel would config 1G, 2M 
or 4K pagetable according to the size.

First I use ioremap to ask for 4K size. Kernel returns a virtual address VA1. 
Then I use iounmap to free this area. Kernel would write zero into the VA1`s 
level3 pagetable. Then when Kernel wants to get VA1 back, Kernel would send a 
tlbi vaae1is.

the page become follow:

1. 4K:   pgd0 --> pud0 --> pmd0 --> pte0 (4K)

2. pte0 write 0

3. 4K:   pgd0 --> pud0 --> pmd0(still valid) -X-> pte0 (4K,not valid)

4. tlbi vaae1is

Sencond I use ioremap to ask for 2M size. Kernel would config a 2M page, then 
return the virtual address. And Kernel just allocates the same virtual address 
VA1 for me. But I see in the ioremap_page_range software flow, Kernel just 
write the valid value into the level2 pagetable address, and doesn`t release 
the allocated page for the previous level3 pagetable. And when Kernel modifies 
the level2 pagetable, it also doesn`t follow the ARM break-before-make flow.

the page change as follow:

1.pgd0 --> pud0 --> pmd0(still valid) -X-> pte0 (4K,not valid)

2.write pmd0(still valid) to block for 2M.

3.expect pgd0 --> pud0 --> pte1(2M)

but because pmd0(4K pmd, still valid) before becoming to pte(2M pte), maybe 
have a speculative access between 1 and 3.

the pgd0, pud0, pmd0 have no fault will be cached in TBL, the pte0 have fault 
so can't be cached, this speculative access will be drop(no exception).

and the page change as:

1.pgd0 --> pud0 --> pmd0(still valid) -X-> pte0 (4K,not valid)

2.speculative access the same VA(pgd0 --> pud0 --> pmd0(still valid) -X-> pte0 
(4K,not valid)). cache the pgd0, pud0, pmd0.

3.write pmd0 from pmd to block(pte) for 2M.

4.the page walker maybe pgd0 --> pud0 --> pmd0(cached in TLB) --> 0x0 
(translation fault)

So I have two questions for this scene.

1. When the same virtual address allocated from ioremap, first is 4K size, 
second is 2M size, if Kernel would leak memory.

2. Kernel modifies the old invalid 4K pagetable to 2M, but doesn`t follow the 
ARM break-before-make flow, CPU maybe get the old invalid 4K pagetable 
information, then Kernel would panic.

Re: thinkpad x60: sound problems in 4.15-rc1 was Re: thinkpad x60: sound problems in 4.14.0-next-20171114

2017-12-22 Thread vcaputo

On Wed, Dec 20, 2017 at 01:33:45AM +0100, Thomas Gleixner wrote:
> On Tue, 19 Dec 2017, vcap...@pengaru.com wrote:
> > On Wed, Dec 20, 2017 at 12:22:12AM +0100, Pavel Machek wrote:
> > > You forgot to mention commit id :-).
> > > 
> > 
> > That is very strange, anyhow:
> > 
> >  commit fdba46ffb4c203b6e6794163493fd310f98bb4be
> >  Author: Thomas Gleixner 
> >  Date:   Wed Sep 13 23:29:27 2017 +0200
> > 
> >  x86/apic: Get rid of multi CPU affinity
> > 
> > 
> > Will try reverting soon, just a bit busy today out in the desert and the sun
> > is going down so my solar panel is useless.
> 
> The revert is not trivial.
> 
> What is the exact problem and how do you reproduce that?
> 
> Thanks,
> 

So I had some time today to poke at this some more.  Since it looks to
be easily reproduced by simply pulling the AC power while playing music
or doing IO, and dmesg clearly reports using mwait, I tried booting with
idle=nomwait to see if that made any difference.  It didn't, the same
thing still occurs.

In trying to make sense of this totally unfamiliar apic code and better
understand these changes, I came across this comment which seemed a bit
telling:

  40 void flat_vector_allocation_domain(int cpu, struct cpumask *retmask,
  41const struct cpumask *mask)
  42 {
  43 /*
  44  * Careful. Some cpus do not strictly honor the set of cpus
  45  * specified in the interrupt destination when using lowest
  46  * priority interrupt delivery mode.
  47  *
  48  * In particular there was a hyperthreading cpu observed to
  49  * deliver interrupts to the wrong hyperthread when only one
  50  * hyperthread was specified in the interrupt desitination.
  51  */
  52 cpumask_clear(retmask);
  53 cpumask_bits(retmask)[0] = APIC_ALL_CPUS;
  54 }

It's this allocation domain mask hook which has been bypassed by the
offending commit.  The existing approach is more robust in the face of
relaxed adherence to destination cpumasks since it's all-inclusive,
whereas the new code is exclusive to a specific cpu.

Is it possible what I'm observing is just another manifestation of
what's being described in that comment?  This is a core 2 duo, so not
hyper-threaded.  But maybe something funny happens when switching
cstates in response to interrupts - like maybe the wrong cpu can be used
if it can save power vs. powering up another?  Just thinking out loud
here.

In any case, 4.15-rc4 is quite unusable on my machine because of this.

Pavel, do you observe the same behavior on your x60, WRT AC power?

I've dropped Takashi from the CC list as this pretty clearly isn't a
sound-specific problem.

Thanks,
Vito Caputo

Re: thinkpad x60: sound problems in 4.15-rc1 was Re: thinkpad x60: sound problems in 4.14.0-next-20171114

2017-12-22 Thread vcaputo

On Wed, Dec 20, 2017 at 01:33:45AM +0100, Thomas Gleixner wrote:
> On Tue, 19 Dec 2017, vcap...@pengaru.com wrote:
> > On Wed, Dec 20, 2017 at 12:22:12AM +0100, Pavel Machek wrote:
> > > You forgot to mention commit id :-).
> > > 
> > 
> > That is very strange, anyhow:
> > 
> >  commit fdba46ffb4c203b6e6794163493fd310f98bb4be
> >  Author: Thomas Gleixner 
> >  Date:   Wed Sep 13 23:29:27 2017 +0200
> > 
> >  x86/apic: Get rid of multi CPU affinity
> > 
> > 
> > Will try reverting soon, just a bit busy today out in the desert and the sun
> > is going down so my solar panel is useless.
> 
> The revert is not trivial.
> 
> What is the exact problem and how do you reproduce that?
> 
> Thanks,
> 

So I had some time today to poke at this some more.  Since it looks to
be easily reproduced by simply pulling the AC power while playing music
or doing IO, and dmesg clearly reports using mwait, I tried booting with
idle=nomwait to see if that made any difference.  It didn't, the same
thing still occurs.

In trying to make sense of this totally unfamiliar apic code and better
understand these changes, I came across this comment which seemed a bit
telling:

  40 void flat_vector_allocation_domain(int cpu, struct cpumask *retmask,
  41const struct cpumask *mask)
  42 {
  43 /*
  44  * Careful. Some cpus do not strictly honor the set of cpus
  45  * specified in the interrupt destination when using lowest
  46  * priority interrupt delivery mode.
  47  *
  48  * In particular there was a hyperthreading cpu observed to
  49  * deliver interrupts to the wrong hyperthread when only one
  50  * hyperthread was specified in the interrupt desitination.
  51  */
  52 cpumask_clear(retmask);
  53 cpumask_bits(retmask)[0] = APIC_ALL_CPUS;
  54 }

It's this allocation domain mask hook which has been bypassed by the
offending commit.  The existing approach is more robust in the face of
relaxed adherence to destination cpumasks since it's all-inclusive,
whereas the new code is exclusive to a specific cpu.

Is it possible what I'm observing is just another manifestation of
what's being described in that comment?  This is a core 2 duo, so not
hyper-threaded.  But maybe something funny happens when switching
cstates in response to interrupts - like maybe the wrong cpu can be used
if it can save power vs. powering up another?  Just thinking out loud
here.

In any case, 4.15-rc4 is quite unusable on my machine because of this.

Pavel, do you observe the same behavior on your x60, WRT AC power?

I've dropped Takashi from the CC list as this pretty clearly isn't a
sound-specific problem.

Thanks,
Vito Caputo

Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-22 Thread Anshuman Khandual

On 12/22/2017 10:43 PM, Dave Hansen wrote:
> On 12/21/2017 07:09 PM, Anshuman Khandual wrote:
>> I had presented a proposal for NUMA redesign in the Plumbers Conference this
>> year where various memory devices with different kind of memory attributes
>> can be represented in the kernel and be used explicitly from the user space.
>> Here is the link to the proposal if you feel interested. The proposal is
>> very intrusive and also I dont have a RFC for it yet for discussion here.
> I think that's the best reason to "re-use NUMA" for this: it's _not_
> intrusive.
> 
> Also, from an x86 perspective, these HMAT systems *will* be out there.
> Old versions of Linux *will* see different types of memory as separate
> NUMA nodes.  So, if we are going to do something different, it's going
> to be interesting to un-teach those systems about using the NUMA APIs
> for this.  That ship has sailed.

I understand the need to fetch these details from ACPI/DT for
applications to target these distinct memory only NUMA nodes.
This can be done by parsing from platform specific values from
/proc/acpi/ or /proc/device-tree/ interfaces. This can be a
short term solution before NUMA redesign can be figured out.
But adding generic devices like "hmat" in the /sys/devices/
path which will be locked for good, seems problematic.

Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-22 Thread Anshuman Khandual

On 12/22/2017 10:43 PM, Dave Hansen wrote:
> On 12/21/2017 07:09 PM, Anshuman Khandual wrote:
>> I had presented a proposal for NUMA redesign in the Plumbers Conference this
>> year where various memory devices with different kind of memory attributes
>> can be represented in the kernel and be used explicitly from the user space.
>> Here is the link to the proposal if you feel interested. The proposal is
>> very intrusive and also I dont have a RFC for it yet for discussion here.
> I think that's the best reason to "re-use NUMA" for this: it's _not_
> intrusive.
> 
> Also, from an x86 perspective, these HMAT systems *will* be out there.
> Old versions of Linux *will* see different types of memory as separate
> NUMA nodes.  So, if we are going to do something different, it's going
> to be interesting to un-teach those systems about using the NUMA APIs
> for this.  That ship has sailed.

I understand the need to fetch these details from ACPI/DT for
applications to target these distinct memory only NUMA nodes.
This can be done by parsing from platform specific values from
/proc/acpi/ or /proc/device-tree/ interfaces. This can be a
short term solution before NUMA redesign can be figured out.
But adding generic devices like "hmat" in the /sys/devices/
path which will be locked for good, seems problematic.

Re: [PATCH v11 2/6] mailbox: qcom: Create APCS child device for clock controller

2017-12-22 Thread Jassi Brar

On Tue, Dec 5, 2017 at 9:16 PM, Georgi Djakov  wrote:
> There is a clock controller functionality provided by the APCS hardware
> block of msm8916 devices. The device-tree would represent an APCS node
> with both mailbox and clock provider properties.
>
The spec might depict a 'clock' box and 'mailbox' box inside the
bigger APCS box. However, from the code I see in this patchset, they
are orthogonal and can & should be represented as independent DT
nodes.
Patch-1 is ok, though.

Thanks

Re: [PATCH v11 2/6] mailbox: qcom: Create APCS child device for clock controller

2017-12-22 Thread Jassi Brar

On Tue, Dec 5, 2017 at 9:16 PM, Georgi Djakov  wrote:
> There is a clock controller functionality provided by the APCS hardware
> block of msm8916 devices. The device-tree would represent an APCS node
> with both mailbox and clock provider properties.
>
The spec might depict a 'clock' box and 'mailbox' box inside the
bigger APCS box. However, from the code I see in this patchset, they
are orthogonal and can & should be represented as independent DT
nodes.
Patch-1 is ok, though.

Thanks

Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-22 Thread Alexandru Chirvasitu

I was just now trying to track down my other issue, whereby somewhere
along the tree kexec stops working properly. In the process of doing
that I realized I had initially made one change to the original 4.9
config beyond oldconfig: I'd turned off WX debugging.

I've now compiled a bunch of versions with WX debugging back on, and
new behavior arises. I am attaching the joournalctl log of a booted
4.13 kernel (from Linus' tree, commit 569dbb8).

It boots and logs me in, but returns a call trace I wasn't seeing
without the WX debugging. I am sending over in case it provides any
information.

The trace bears the 23:24:09 timestamp.

On Sat, Dec 23, 2017 at 01:35:12AM +, Dexuan Cui wrote:
> > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > Sent: Friday, December 22, 2017 14:29
> > 
> > The output of that precise command run just now on a freshly-compiled
> > copy of that commit is attached.
> > 
> > On Fri, Dec 22, 2017 at 09:31:28PM +, Dexuan Cui wrote:
> > > > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > > > Sent: Friday, December 22, 2017 06:21
> > > >
> > > > In the absence of logs, the best I can do at the moment is attach a
> > > > picture of the screen I am presented with on the apic=debug boot
> > > > attempt.
> > > > Alex
> > >
> > > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > > IMO we should find which line of code causes the panic. I suppose
> > > "objdump -D kernel/irq/matrix.o" can help to do that.
> > >
> > > Thanks,
> > > -- Dexuan
> 
> The BUG_ON panic happens at line 147:
>BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> 
> I'm sure Thomas and Dou know it better than me. 
> 
> 137 void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
> 138   bool replace)
> 139 {
> 140 struct cpumap *cm = this_cpu_ptr(m->maps);
> 141
> 142 BUG_ON(bit > m->matrix_bits);
> 143 BUG_ON(m->online_maps > 1 || (m->online_maps && !replace));
> 144
> 145 set_bit(bit, m->system_map);
> 146 if (replace) {
> 147 BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> 148 cm->allocated--;
> 149 m->total_allocated--;
> 150 }
> 151 if (bit >= m->alloc_start && bit < m->alloc_end)
> 152 m->systembits_inalloc++;
> 153
> 154 trace_irq_matrix_assign_system(bit, m);
> 155 }
> 
> -- Dexuan
> 
-- Logs begin at Thu 2017-12-14 19:59:20 EST, end at Fri 2017-12-22 23:45:29 
EST. --
Dec 22 23:24:09 D-69-91-141-110 kernel: random: get_random_bytes called from 
start_kernel+0x32/0x3a0 with crng_init=0
Dec 22 23:24:09 D-69-91-141-110 kernel: Linux version 4.13.0 (root@axiomatic) 
(gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)) #1 SMP Fri Dec 22 
22:18:58 EST 2017
Dec 22 23:24:09 D-69-91-141-110 kernel: x86/fpu: x87 FPU will use FXSAVE
Dec 22 23:24:09 D-69-91-141-110 kernel: e820: BIOS-provided physical RAM map:
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x-0x0009fbff] usable
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x0009fc00-0x0009] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x000e-0x000f] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x0010-0xb7f9] usable
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xb7fa-0xb7fadfff] ACPI data
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xb7fae000-0xb7fe] ACPI NVS
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xb7ff-0xb7ff] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xfee0-0xfee00fff] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xffb8-0x] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: NX (Execute Disable) protection: active
Dec 22 23:24:09 D-69-91-141-110 kernel: random: fast init done
Dec 22 23:24:09 D-69-91-141-110 kernel: SMBIOS 2.4 present.
Dec 22 23:24:09 D-69-91-141-110 kernel: DMI: ASUSTeK Computer Inc. F5RL 
   /F5RL  , BIOS 210 06/12/2008
Dec 22 23:24:09 D-69-91-141-110 kernel: tsc: Fast TSC calibration using PIT
Dec 22 23:24:09 D-69-91-141-110 kernel: e820: update [mem 
0x-0x0fff] usable ==> reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: e820: remove [mem 
0x000a-0x000f] usable
Dec 22 23:24:09 D-69-91-141-110 kernel: e820: last_pfn = 0xb7fa0 max_arch_pfn = 
0x100
Dec 22 23:24:09 D-69-91-141-110 kernel: MTRR default type: uncachable
Dec 22 23:24:09 D-69-91-141-110 kernel: MTRR fixed ranges enabled:
Dec 22 23:24:09 D-69-91-141-110 kernel:   0-9 write-back
Dec 22 23:24:09 D-69-91-141-110 kernel:   A-B uncachable
Dec 22 23:24:09 D-69-91-141-110 kernel:   C-C write-protect
Dec 22 23:24:09 D-69-91-141-110

Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-22 Thread Alexandru Chirvasitu

I was just now trying to track down my other issue, whereby somewhere
along the tree kexec stops working properly. In the process of doing
that I realized I had initially made one change to the original 4.9
config beyond oldconfig: I'd turned off WX debugging.

I've now compiled a bunch of versions with WX debugging back on, and
new behavior arises. I am attaching the joournalctl log of a booted
4.13 kernel (from Linus' tree, commit 569dbb8).

It boots and logs me in, but returns a call trace I wasn't seeing
without the WX debugging. I am sending over in case it provides any
information.

The trace bears the 23:24:09 timestamp.

On Sat, Dec 23, 2017 at 01:35:12AM +, Dexuan Cui wrote:
> > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > Sent: Friday, December 22, 2017 14:29
> > 
> > The output of that precise command run just now on a freshly-compiled
> > copy of that commit is attached.
> > 
> > On Fri, Dec 22, 2017 at 09:31:28PM +, Dexuan Cui wrote:
> > > > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > > > Sent: Friday, December 22, 2017 06:21
> > > >
> > > > In the absence of logs, the best I can do at the moment is attach a
> > > > picture of the screen I am presented with on the apic=debug boot
> > > > attempt.
> > > > Alex
> > >
> > > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > > IMO we should find which line of code causes the panic. I suppose
> > > "objdump -D kernel/irq/matrix.o" can help to do that.
> > >
> > > Thanks,
> > > -- Dexuan
> 
> The BUG_ON panic happens at line 147:
>BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> 
> I'm sure Thomas and Dou know it better than me. 
> 
> 137 void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
> 138   bool replace)
> 139 {
> 140 struct cpumap *cm = this_cpu_ptr(m->maps);
> 141
> 142 BUG_ON(bit > m->matrix_bits);
> 143 BUG_ON(m->online_maps > 1 || (m->online_maps && !replace));
> 144
> 145 set_bit(bit, m->system_map);
> 146 if (replace) {
> 147 BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> 148 cm->allocated--;
> 149 m->total_allocated--;
> 150 }
> 151 if (bit >= m->alloc_start && bit < m->alloc_end)
> 152 m->systembits_inalloc++;
> 153
> 154 trace_irq_matrix_assign_system(bit, m);
> 155 }
> 
> -- Dexuan
> 
-- Logs begin at Thu 2017-12-14 19:59:20 EST, end at Fri 2017-12-22 23:45:29 
EST. --
Dec 22 23:24:09 D-69-91-141-110 kernel: random: get_random_bytes called from 
start_kernel+0x32/0x3a0 with crng_init=0
Dec 22 23:24:09 D-69-91-141-110 kernel: Linux version 4.13.0 (root@axiomatic) 
(gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)) #1 SMP Fri Dec 22 
22:18:58 EST 2017
Dec 22 23:24:09 D-69-91-141-110 kernel: x86/fpu: x87 FPU will use FXSAVE
Dec 22 23:24:09 D-69-91-141-110 kernel: e820: BIOS-provided physical RAM map:
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x-0x0009fbff] usable
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x0009fc00-0x0009] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x000e-0x000f] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0x0010-0xb7f9] usable
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xb7fa-0xb7fadfff] ACPI data
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xb7fae000-0xb7fe] ACPI NVS
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xb7ff-0xb7ff] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xfee0-0xfee00fff] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: BIOS-e820: [mem 
0xffb8-0x] reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: NX (Execute Disable) protection: active
Dec 22 23:24:09 D-69-91-141-110 kernel: random: fast init done
Dec 22 23:24:09 D-69-91-141-110 kernel: SMBIOS 2.4 present.
Dec 22 23:24:09 D-69-91-141-110 kernel: DMI: ASUSTeK Computer Inc. F5RL 
   /F5RL  , BIOS 210 06/12/2008
Dec 22 23:24:09 D-69-91-141-110 kernel: tsc: Fast TSC calibration using PIT
Dec 22 23:24:09 D-69-91-141-110 kernel: e820: update [mem 
0x-0x0fff] usable ==> reserved
Dec 22 23:24:09 D-69-91-141-110 kernel: e820: remove [mem 
0x000a-0x000f] usable
Dec 22 23:24:09 D-69-91-141-110 kernel: e820: last_pfn = 0xb7fa0 max_arch_pfn = 
0x100
Dec 22 23:24:09 D-69-91-141-110 kernel: MTRR default type: uncachable
Dec 22 23:24:09 D-69-91-141-110 kernel: MTRR fixed ranges enabled:
Dec 22 23:24:09 D-69-91-141-110 kernel:   0-9 write-back
Dec 22 23:24:09 D-69-91-141-110 kernel:   A-B uncachable
Dec 22 23:24:09 D-69-91-141-110 kernel:   C-C write-protect
Dec 22 23:24:09 D-69-91-141-110

Re: [PATCH RESEND 1/1] cpufreq: imx6q: switch to Use clk_bulk_get to refine clk operations

2017-12-22 Thread Dong Aisheng

On Fri, Dec 22, 2017 at 07:34:57PM +0100, Rafael J. Wysocki wrote:
> On Thursday, December 21, 2017 2:18:01 PM CET Dong Aisheng wrote:
> > Hi Rafael,
> > 
> > On Thu, Sep 14, 2017 at 02:40:32PM -0700, Viresh Kumar wrote:
> > > On 31-08-17, 19:43, Dong Aisheng wrote:
> > > > Use clk_bulk_get to ease the driver clocks handling.
> > > > 
> > > > Cc: "Rafael J. Wysocki" 
> > > > Cc: Viresh Kumar 
> > > > Cc: Shawn Guo 
> > > > Cc: Anson Huang 
> > > > Cc: Leonard Crestez 
> > > > Signed-off-by: Dong Aisheng 
> > > > --
> > > > The original one is here which depends on clk_bulk APIs.
> > > > https://patchwork.kernel.org/patch/9737337/
> > > > Now the clk_bulk APIs are already in kernel, so resend the patch.
> > > > (Patch title changed a bit to be more specific.)
> > > > ---
> > > >  drivers/cpufreq/imx6q-cpufreq.c | 125 
> > > > ++--
> > > >  1 file changed, 56 insertions(+), 69 deletions(-)
> > > 
> > > Acked-by: Viresh Kumar 
> > > 
> > 
> > Would you help pick it?
> > I did not see it in latest kernel.
> > It still applies.
> 
> OK
> 
> Can you please resend the patch again with the ACK from Viresh?
> 

Of course yes.

Thanks

Regards
Dong Aisheng

> It looks like it fell through the cracks, sorry about that.
> 
> Thanks,
> Rafael
>

Re: [PATCH RESEND 1/1] cpufreq: imx6q: switch to Use clk_bulk_get to refine clk operations

2017-12-22 Thread Dong Aisheng

On Fri, Dec 22, 2017 at 07:34:57PM +0100, Rafael J. Wysocki wrote:
> On Thursday, December 21, 2017 2:18:01 PM CET Dong Aisheng wrote:
> > Hi Rafael,
> > 
> > On Thu, Sep 14, 2017 at 02:40:32PM -0700, Viresh Kumar wrote:
> > > On 31-08-17, 19:43, Dong Aisheng wrote:
> > > > Use clk_bulk_get to ease the driver clocks handling.
> > > > 
> > > > Cc: "Rafael J. Wysocki" 
> > > > Cc: Viresh Kumar 
> > > > Cc: Shawn Guo 
> > > > Cc: Anson Huang 
> > > > Cc: Leonard Crestez 
> > > > Signed-off-by: Dong Aisheng 
> > > > --
> > > > The original one is here which depends on clk_bulk APIs.
> > > > https://patchwork.kernel.org/patch/9737337/
> > > > Now the clk_bulk APIs are already in kernel, so resend the patch.
> > > > (Patch title changed a bit to be more specific.)
> > > > ---
> > > >  drivers/cpufreq/imx6q-cpufreq.c | 125 
> > > > ++--
> > > >  1 file changed, 56 insertions(+), 69 deletions(-)
> > > 
> > > Acked-by: Viresh Kumar 
> > > 
> > 
> > Would you help pick it?
> > I did not see it in latest kernel.
> > It still applies.
> 
> OK
> 
> Can you please resend the patch again with the ACK from Viresh?
> 

Of course yes.

Thanks

Regards
Dong Aisheng

> It looks like it fell through the cracks, sorry about that.
> 
> Thanks,
> Rafael
>

kasan for bpf

2017-12-22 Thread Alexei Starovoitov

Hi All,

the recent bugs make people question the safety of bpf and not a surprise
that some folks propose to set kernel.unprivileged_bpf_disabled = 1 by default.

I think it will be wrong long term decision, since it will steer
bpf into "root only" mentality. The verifier checks will become sloppier
and developers will worry less about missing safety checks.
I'd like bpf to go the other way. The root-only programs should be
treated with the same respect as unprivileged.

Today out of 15 program types only one BPF_PROG_TYPE_SOCKET_FILTER is
allowed for unpriv while in many cases like XDP, CLS, ACT, SKB, SOCK progs
could be relaxed to cap_net_admin and sometimes to unpriv.
I think we need to disallow leaking kernel address in XDP, CLS and
all other networking programs. They should be operating on packets
and should never leak addresses. The existing approach:
env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
if (env->allow_ptr_leaks) ...;
if (env->allow_ptr_leaks) ...;
is just a big hammer.
Realistically only tracing programs need to deal with kernel data and
they probably should be restricted further with additional sysctl or CAPs.

So instead of existing two classes of root/unpriv we need several:
- unpriv prog type
- all prog types for unpriv but execute via bpf_prog_run only (for testing)
- net_admin without leaks (that can run in containers)
- root without leaks
- root with leaks for tracing only

As far as unpriv I think it's clear that the verifier only
is not enough and we need to optionally add run-time checks.
I think 'kasan for bpf' by default would be good first step.
The basic idea is that all accessible memory (stack, ctx, maps)
will have shadow bits and every load/store will be automatically
instrumented with shadow memory checks.
'kasan for bpf' will be on by default for unpriv and extra
sysctl to turn it off in setups that cannot tolerate
slightly degraded performance.
For root and other classes of programs we will be able to
turn it on for testing as well.
I think that will stop many exploits except side channel attacks.

Thoughts?

kasan for bpf

2017-12-22 Thread Alexei Starovoitov

Hi All,

the recent bugs make people question the safety of bpf and not a surprise
that some folks propose to set kernel.unprivileged_bpf_disabled = 1 by default.

I think it will be wrong long term decision, since it will steer
bpf into "root only" mentality. The verifier checks will become sloppier
and developers will worry less about missing safety checks.
I'd like bpf to go the other way. The root-only programs should be
treated with the same respect as unprivileged.

Today out of 15 program types only one BPF_PROG_TYPE_SOCKET_FILTER is
allowed for unpriv while in many cases like XDP, CLS, ACT, SKB, SOCK progs
could be relaxed to cap_net_admin and sometimes to unpriv.
I think we need to disallow leaking kernel address in XDP, CLS and
all other networking programs. They should be operating on packets
and should never leak addresses. The existing approach:
env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
if (env->allow_ptr_leaks) ...;
if (env->allow_ptr_leaks) ...;
is just a big hammer.
Realistically only tracing programs need to deal with kernel data and
they probably should be restricted further with additional sysctl or CAPs.

So instead of existing two classes of root/unpriv we need several:
- unpriv prog type
- all prog types for unpriv but execute via bpf_prog_run only (for testing)
- net_admin without leaks (that can run in containers)
- root without leaks
- root with leaks for tracing only

As far as unpriv I think it's clear that the verifier only
is not enough and we need to optionally add run-time checks.
I think 'kasan for bpf' by default would be good first step.
The basic idea is that all accessible memory (stack, ctx, maps)
will have shadow bits and every load/store will be automatically
instrumented with shadow memory checks.
'kasan for bpf' will be on by default for unpriv and extra
sysctl to turn it off in setups that cannot tolerate
slightly degraded performance.
For root and other classes of programs we will be able to
turn it on for testing as well.
I think that will stop many exploits except side channel attacks.

Thoughts?

[PATCH] mm/fadvise: discard partial pages iff endbyte is also eof

2017-12-22 Thread 十刀

From: "shidao.ytt" 

in commit 441c228f817f7 ("mm: fadvise: document the
fadvise(FADV_DONTNEED) behaviour for partial pages") Mel Gorman
explained why partial pages should be preserved instead of discarded
when using fadvise(FADV_DONTNEED), however the actual codes to calcuate
end_index was unexpectedly wrong, the code behavior didn't match to the
statement in comments; Luckily in another commit 18aba41cbf
("mm/fadvise.c: do not discard partial pages with POSIX_FADV_DONTNEED")
Oleg Drokin fixed this behavior

Here I come up with a new idea that actually we can still discard the
last parital page iff the page-unaligned endbyte is also the end of
file, since no one else will use the rest of the page and it should be
safe enough to discard.

Signed-off-by: shidao.ytt 
Signed-off-by: Caspar Zhang 
---
 mm/fadvise.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/fadvise.c b/mm/fadvise.c
index ec70d6e..f74b21e 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -127,7 +127,8 @@
 */
start_index = (offset+(PAGE_SIZE-1)) >> PAGE_SHIFT;
end_index = (endbyte >> PAGE_SHIFT);
-   if ((endbyte & ~PAGE_MASK) != ~PAGE_MASK) {
+   if ((endbyte & ~PAGE_MASK) != ~PAGE_MASK &&
+   endbyte != inode->i_size - 1) {
/* First page is tricky as 0 - 1 = -1, but pgoff_t
 * is unsigned, so the end_index >= start_index
 * check below would be true and we'll discard the whole
-- 
1.8.3.1

[PATCH] mm/fadvise: discard partial pages iff endbyte is also eof

2017-12-22 Thread 十刀

From: "shidao.ytt" 

in commit 441c228f817f7 ("mm: fadvise: document the
fadvise(FADV_DONTNEED) behaviour for partial pages") Mel Gorman
explained why partial pages should be preserved instead of discarded
when using fadvise(FADV_DONTNEED), however the actual codes to calcuate
end_index was unexpectedly wrong, the code behavior didn't match to the
statement in comments; Luckily in another commit 18aba41cbf
("mm/fadvise.c: do not discard partial pages with POSIX_FADV_DONTNEED")
Oleg Drokin fixed this behavior

Here I come up with a new idea that actually we can still discard the
last parital page iff the page-unaligned endbyte is also the end of
file, since no one else will use the rest of the page and it should be
safe enough to discard.

Signed-off-by: shidao.ytt 
Signed-off-by: Caspar Zhang 
---
 mm/fadvise.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/fadvise.c b/mm/fadvise.c
index ec70d6e..f74b21e 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -127,7 +127,8 @@
 */
start_index = (offset+(PAGE_SIZE-1)) >> PAGE_SHIFT;
end_index = (endbyte >> PAGE_SHIFT);
-   if ((endbyte & ~PAGE_MASK) != ~PAGE_MASK) {
+   if ((endbyte & ~PAGE_MASK) != ~PAGE_MASK &&
+   endbyte != inode->i_size - 1) {
/* First page is tricky as 0 - 1 = -1, but pgoff_t
 * is unsigned, so the end_index >= start_index
 * check below would be true and we'll discard the whole
-- 
1.8.3.1

Re: [PATCH 11/11] evm: Don't update hmacs in user ns mounts

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:35PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> The kernel should not calculate new hmacs for mounts done by
> non-root users. Update evm_calc_hmac_or_hash() to refuse to
> calculate new hmacs for mounts for non-init user namespaces.
> 
> Cc: linux-integr...@vger.kernel.org
> Cc: linux-security-mod...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: James Morris 
> Cc: Mimi Zohar 

Hi Mimi,

does this change seem sufficient to you?

> Cc: "Serge E. Hallyn" 
> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 
> ---
>  security/integrity/evm/evm_crypto.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/security/integrity/evm/evm_crypto.c 
> b/security/integrity/evm/evm_crypto.c
> index bcd64baf..729f4545 100644
> --- a/security/integrity/evm/evm_crypto.c
> +++ b/security/integrity/evm/evm_crypto.c
> @@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
>   int error;
>   int size;
>  
> - if (!(inode->i_opflags & IOP_XATTR))
> + if (!(inode->i_opflags & IOP_XATTR) ||
> + inode->i_sb->s_user_ns != _user_ns)
>   return -EOPNOTSUPP;
>  
>   desc = init_desc(type);
> -- 
> 2.13.6

Re: [PATCH 11/11] evm: Don't update hmacs in user ns mounts

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:35PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> The kernel should not calculate new hmacs for mounts done by
> non-root users. Update evm_calc_hmac_or_hash() to refuse to
> calculate new hmacs for mounts for non-init user namespaces.
> 
> Cc: linux-integr...@vger.kernel.org
> Cc: linux-security-mod...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: James Morris 
> Cc: Mimi Zohar 

Hi Mimi,

does this change seem sufficient to you?

> Cc: "Serge E. Hallyn" 
> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 
> ---
>  security/integrity/evm/evm_crypto.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/security/integrity/evm/evm_crypto.c 
> b/security/integrity/evm/evm_crypto.c
> index bcd64baf..729f4545 100644
> --- a/security/integrity/evm/evm_crypto.c
> +++ b/security/integrity/evm/evm_crypto.c
> @@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
>   int error;
>   int size;
>  
> - if (!(inode->i_opflags & IOP_XATTR))
> + if (!(inode->i_opflags & IOP_XATTR) ||
> + inode->i_sb->s_user_ns != _user_ns)
>   return -EOPNOTSUPP;
>  
>   desc = init_desc(type);
> -- 
> 2.13.6

Re: [PATCH] On ppc64le we HAVE_RELIABLE_STACKTRACE

2017-12-22 Thread Josh Poimboeuf

On Thu, Dec 21, 2017 at 11:10:46PM +1100, Michael Ellerman wrote:
> Josh Poimboeuf  writes:
> 
> > On Tue, Dec 19, 2017 at 12:28:33PM +0100, Torsten Duwe wrote:
> >> On Mon, Dec 18, 2017 at 12:56:22PM -0600, Josh Poimboeuf wrote:
> >> > On Mon, Dec 18, 2017 at 03:33:34PM +1000, Nicholas Piggin wrote:
> >> > > On Sun, 17 Dec 2017 20:58:54 -0600
> >> > > Josh Poimboeuf  wrote:
> >> > > 
> >> > > > On Fri, Dec 15, 2017 at 07:40:09PM +1000, Nicholas Piggin wrote:
> >> > > > > On Tue, 12 Dec 2017 08:05:01 -0600
> >> > > > > Josh Poimboeuf  wrote:
> >> > > > >   
> >> > > > > > What about leaf functions?  If a leaf function doesn't establish 
> >> > > > > > a stack
> >> > > > > > frame, and it has inline asm which contains a blr to another 
> >> > > > > > function,
> >> > > > > > this ABI is broken.  
> >> > > > 
> >> > > > Oops, I meant to say "bl" instead of "blr".
> >> 
> >> You need to save LR, one way or the other. If gcc thinks it's a leaf 
> >> function and
> >> does not do it, nor does your asm code, you'll return in an endless loop 
> >> => bug.
> >
> > Ah, so the function's return path would be corrupted, and an unreliable
> > stack trace would be the least of our problems.
> 
> That's mostly true.
> 
> It is possible to save LR somewhere other than the correct stack slot,
> in which case you can return correctly but still confuse the unwinder. A
> function can hide its caller that way.
> 
> It's stupid and we should never do it, but it's not impossible.
> 
> ...
> 
> > So with your proposal, I think I'm convinced that we don't need objtool
> > for ppc64le.  Does anyone disagree?
> 
> I don't disagree, but I'd be happier if we did have objtool support.
> 
> Just because it would give us a lot more certainty that we're doing the
> right thing everywhere, including in hand-coded asm and inline asm.
> 
> It's easy to write powerpc asm such that stack traces are reliable, but
> it is *possible* to break them.

In the unlikely case where some asm code had its own custom stack
format, I guess there are two things which could go wrong:

1) bad LR:

   If LR isn't a kernel text address, the unwinder can stop the stack
   trace, WARN(), and report an error.  Although if we were _extremely_
   unlucky and a random leftover text address just happened to be in the
   LR slot, then the real function would get skipped in the stack trace.
   But even then, it's probably only going to be an asm function getting
   skipped, and we don't patch asm functions anyway, so it shouldn't
   affect livepatch.

2) bad back chain pointer:

   I'm not sure if this is even a reasonable concern.  I doubt it.  But
   if it were to happen, presumably the unwinder would abort the stack
   trace after reading the bad value.  In this case I think the "end"
   check (#4 below) would be sufficient to catch it.

So even if there were some stupid ppc asm code out there with its own
stack magic, it still sounds to me like objtool wouldn't be needed.

> > There are still a few more things that need to be looked at:
> >
> > 1) With function graph tracing enabled, is the unwinder smart enough to
> >get the original function return address, e.g. by calling
> >ftrace_graph_ret_addr()?
> 
> No I don't think so.
> 
> > 2) Similar question for kretprobes.
> >
> > 3) Any other issues with generated code (e.g., bpf, ftrace trampolines),
> >runtime patching (e.g., CPU feature alternatives), kprobes, paravirt,
> >etc, that might confuse the unwinder?
> 
> We'll have to look, I can't be sure off the top of my head.
> 
> > 4) As a sanity check, it *might* be a good idea for
> >save_stack_trace_tsk_reliable() to ensure that it always reaches the
> >end of the stack.  There are several ways to do that:
> >
> >- If the syscall entry stack frame is always the same size, then the
> >  "end" would simply mean that the stack pointer is at a certain
> >  offset from the end of the task stack page.  However this might not
> >  work for kthreads and idle tasks, unless their stacks also start at
> >  the same offset.  (On x86 we actually standardized the end of stack
> >  location for all tasks, both user and kernel.)
> 
> Yeah it differs between user and kernel.
> 
> >- If the unwinder can get to the syscall frame, it can presumably
> >  examine regs->msr to check the PR bit to ensure it got all the way
> >  to syscall entry.  But again this might only work for user tasks,
> >  depending on how kernel task stacks are set up.
> 
> That sounds like a good idea. We could possibly mark the last frame of
> kernel tasks somehow.
> 
> >- Or a different approach would be to do error checking along the
> >  way, and reporting an error for any unexpected conditions.
> >
> >However, given that backlink/LR corruption doesn't seem possible with
> >this architecture, maybe #4 would be overkill.  Personally I would
> >feel more

Re: [PATCH] On ppc64le we HAVE_RELIABLE_STACKTRACE

2017-12-22 Thread Josh Poimboeuf

On Thu, Dec 21, 2017 at 11:10:46PM +1100, Michael Ellerman wrote:
> Josh Poimboeuf  writes:
> 
> > On Tue, Dec 19, 2017 at 12:28:33PM +0100, Torsten Duwe wrote:
> >> On Mon, Dec 18, 2017 at 12:56:22PM -0600, Josh Poimboeuf wrote:
> >> > On Mon, Dec 18, 2017 at 03:33:34PM +1000, Nicholas Piggin wrote:
> >> > > On Sun, 17 Dec 2017 20:58:54 -0600
> >> > > Josh Poimboeuf  wrote:
> >> > > 
> >> > > > On Fri, Dec 15, 2017 at 07:40:09PM +1000, Nicholas Piggin wrote:
> >> > > > > On Tue, 12 Dec 2017 08:05:01 -0600
> >> > > > > Josh Poimboeuf  wrote:
> >> > > > >   
> >> > > > > > What about leaf functions?  If a leaf function doesn't establish 
> >> > > > > > a stack
> >> > > > > > frame, and it has inline asm which contains a blr to another 
> >> > > > > > function,
> >> > > > > > this ABI is broken.  
> >> > > > 
> >> > > > Oops, I meant to say "bl" instead of "blr".
> >> 
> >> You need to save LR, one way or the other. If gcc thinks it's a leaf 
> >> function and
> >> does not do it, nor does your asm code, you'll return in an endless loop 
> >> => bug.
> >
> > Ah, so the function's return path would be corrupted, and an unreliable
> > stack trace would be the least of our problems.
> 
> That's mostly true.
> 
> It is possible to save LR somewhere other than the correct stack slot,
> in which case you can return correctly but still confuse the unwinder. A
> function can hide its caller that way.
> 
> It's stupid and we should never do it, but it's not impossible.
> 
> ...
> 
> > So with your proposal, I think I'm convinced that we don't need objtool
> > for ppc64le.  Does anyone disagree?
> 
> I don't disagree, but I'd be happier if we did have objtool support.
> 
> Just because it would give us a lot more certainty that we're doing the
> right thing everywhere, including in hand-coded asm and inline asm.
> 
> It's easy to write powerpc asm such that stack traces are reliable, but
> it is *possible* to break them.

In the unlikely case where some asm code had its own custom stack
format, I guess there are two things which could go wrong:

1) bad LR:

   If LR isn't a kernel text address, the unwinder can stop the stack
   trace, WARN(), and report an error.  Although if we were _extremely_
   unlucky and a random leftover text address just happened to be in the
   LR slot, then the real function would get skipped in the stack trace.
   But even then, it's probably only going to be an asm function getting
   skipped, and we don't patch asm functions anyway, so it shouldn't
   affect livepatch.

2) bad back chain pointer:

   I'm not sure if this is even a reasonable concern.  I doubt it.  But
   if it were to happen, presumably the unwinder would abort the stack
   trace after reading the bad value.  In this case I think the "end"
   check (#4 below) would be sufficient to catch it.

So even if there were some stupid ppc asm code out there with its own
stack magic, it still sounds to me like objtool wouldn't be needed.

> > There are still a few more things that need to be looked at:
> >
> > 1) With function graph tracing enabled, is the unwinder smart enough to
> >get the original function return address, e.g. by calling
> >ftrace_graph_ret_addr()?
> 
> No I don't think so.
> 
> > 2) Similar question for kretprobes.
> >
> > 3) Any other issues with generated code (e.g., bpf, ftrace trampolines),
> >runtime patching (e.g., CPU feature alternatives), kprobes, paravirt,
> >etc, that might confuse the unwinder?
> 
> We'll have to look, I can't be sure off the top of my head.
> 
> > 4) As a sanity check, it *might* be a good idea for
> >save_stack_trace_tsk_reliable() to ensure that it always reaches the
> >end of the stack.  There are several ways to do that:
> >
> >- If the syscall entry stack frame is always the same size, then the
> >  "end" would simply mean that the stack pointer is at a certain
> >  offset from the end of the task stack page.  However this might not
> >  work for kthreads and idle tasks, unless their stacks also start at
> >  the same offset.  (On x86 we actually standardized the end of stack
> >  location for all tasks, both user and kernel.)
> 
> Yeah it differs between user and kernel.
> 
> >- If the unwinder can get to the syscall frame, it can presumably
> >  examine regs->msr to check the PR bit to ensure it got all the way
> >  to syscall entry.  But again this might only work for user tasks,
> >  depending on how kernel task stacks are set up.
> 
> That sounds like a good idea. We could possibly mark the last frame of
> kernel tasks somehow.
> 
> >- Or a different approach would be to do error checking along the
> >  way, and reporting an error for any unexpected conditions.
> >
> >However, given that backlink/LR corruption doesn't seem possible with
> >this architecture, maybe #4 would be overkill.  Personally I would
> >feel more comfortable with an "end" check and a WARN() if it doesn't
> >

Re: [PATCH 10/11] fuse: Allow user namespace mounts

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:34PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> To be able to mount fuse from non-init user namespaces, it's necessary
> to set FS_USERNS_MOUNT flag to fs_flags.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/
> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi 
> Signed-off-by: Seth Forshee 
> [dongsu: add a simple commit messasge]
> Signed-off-by: Dongsu Park 

Reviewed-by: Serge Hallyn 

> ---
>  fs/fuse/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 7f6b2e55..8c98edee 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
>  static struct file_system_type fuse_fs_type = {
>   .owner  = THIS_MODULE,
>   .name   = "fuse",
> - .fs_flags   = FS_HAS_SUBTYPE,
> + .fs_flags   = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>   .mount  = fuse_mount,
>   .kill_sb= fuse_kill_sb_anon,
>  };
> @@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
>   .name   = "fuseblk",
>   .mount  = fuse_mount_blk,
>   .kill_sb= fuse_kill_sb_blk,
> - .fs_flags   = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
> + .fs_flags   = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>  };
>  MODULE_ALIAS_FS("fuseblk");
>  
> -- 
> 2.13.6
> 
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

Re: [PATCH 10/11] fuse: Allow user namespace mounts

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:34PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> To be able to mount fuse from non-init user namespaces, it's necessary
> to set FS_USERNS_MOUNT flag to fs_flags.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/
> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi 
> Signed-off-by: Seth Forshee 
> [dongsu: add a simple commit messasge]
> Signed-off-by: Dongsu Park 

Reviewed-by: Serge Hallyn 

> ---
>  fs/fuse/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 7f6b2e55..8c98edee 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
>  static struct file_system_type fuse_fs_type = {
>   .owner  = THIS_MODULE,
>   .name   = "fuse",
> - .fs_flags   = FS_HAS_SUBTYPE,
> + .fs_flags   = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>   .mount  = fuse_mount,
>   .kill_sb= fuse_kill_sb_anon,
>  };
> @@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
>   .name   = "fuseblk",
>   .mount  = fuse_mount_blk,
>   .kill_sb= fuse_kill_sb_blk,
> - .fs_flags   = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
> + .fs_flags   = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>  };
>  MODULE_ALIAS_FS("fuseblk");
>  
> -- 
> 2.13.6
> 
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

Re: [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:33PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> Unprivileged users are normally restricted from mounting with the
> allow_other option by system policy, but this could be bypassed
> for a mount done with user namespace root permissions. In such
> cases allow_other should not allow users outside the userns
> to access the mount as doing so would give the unprivileged user
> the ability to manipulate processes it would otherwise be unable
> to manipulate. Restrict allow_other to apply to users in the same
> userns used at mount or a descendant of that namespace. Also
> export current_in_userns() for use by fuse when built as a
> module.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944671/
> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: "Eric W. Biederman" 
> Cc: Serge Hallyn 
> Cc: Miklos Szeredi 
> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 

Reviewed-by: Serge Hallyn 

> ---
>  fs/fuse/dir.c   | 2 +-
>  kernel/user_namespace.c | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index ad1cfac1..d41559a0 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
>   const struct cred *cred;
>  
>   if (fc->allow_other)
> - return 1;
> + return current_in_userns(fc->user_ns);
>  
>   cred = current_cred();
>   if (uid_eq(cred->euid, fc->user_id) &&
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 246d4d4c..492c255e 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace 
> *target_ns)
>  {
>   return in_userns(target_ns, current_user_ns());
>  }
> +EXPORT_SYMBOL(current_in_userns);

I have to say I'm not happy with this name.  I wish it had been
called current_under_userns or something to indicate it may also
be in a child.

>  
>  static inline struct user_namespace *to_user_ns(struct ns_common *ns)
>  {
> -- 
> 2.13.6

Re: [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:33PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> Unprivileged users are normally restricted from mounting with the
> allow_other option by system policy, but this could be bypassed
> for a mount done with user namespace root permissions. In such
> cases allow_other should not allow users outside the userns
> to access the mount as doing so would give the unprivileged user
> the ability to manipulate processes it would otherwise be unable
> to manipulate. Restrict allow_other to apply to users in the same
> userns used at mount or a descendant of that namespace. Also
> export current_in_userns() for use by fuse when built as a
> module.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944671/
> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: "Eric W. Biederman" 
> Cc: Serge Hallyn 
> Cc: Miklos Szeredi 
> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 

Reviewed-by: Serge Hallyn 

> ---
>  fs/fuse/dir.c   | 2 +-
>  kernel/user_namespace.c | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index ad1cfac1..d41559a0 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
>   const struct cred *cred;
>  
>   if (fc->allow_other)
> - return 1;
> + return current_in_userns(fc->user_ns);
>  
>   cred = current_cred();
>   if (uid_eq(cred->euid, fc->user_id) &&
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 246d4d4c..492c255e 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace 
> *target_ns)
>  {
>   return in_userns(target_ns, current_user_ns());
>  }
> +EXPORT_SYMBOL(current_in_userns);

I have to say I'm not happy with this name.  I wish it had been
called current_under_userns or something to indicate it may also
be in a child.

>  
>  static inline struct user_namespace *to_user_ns(struct ns_common *ns)
>  {
> -- 
> 2.13.6

Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:32PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
> 
>  - The userns for the fuse connection is fixed to the namespace
>from which /dev/fuse is opened.
> 
>  - The namespace must be the same as s_user_ns.
> 
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.
> 
> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi 
> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 

Acked-by: Serge Hallyn 

> ---
>  fs/fuse/cuse.c   |  3 ++-
>  fs/fuse/dev.c| 11 ---
>  fs/fuse/dir.c| 14 +++---
>  fs/fuse/fuse_i.h |  6 +-
>  fs/fuse/inode.c  | 31 +++
>  5 files changed, 41 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "fuse_i.h"
>  
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct 
> file *file)
>   if (!cc)
>   return -ENOMEM;
>  
> - fuse_conn_init(>fc);
> + fuse_conn_init(>fc, current_user_ns());
>  
>   fud = fuse_dev_alloc(>fc);
>   if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>  
>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> - req->in.h.uid = from_kuid_munged(_user_ns, current_fsuid());
> - req->in.h.gid = from_kgid_munged(_user_ns, current_fsgid());
> + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>   req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>  }
>  
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn 
> *fc, unsigned npages,
>   __set_bit(FR_WAITING, >flags);
>   if (for_background)
>   __set_bit(FR_BACKGROUND, >flags);
> + if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> + fuse_put_request(fc, req);
> + return ERR_PTR(-EOVERFLOW);
> + }
>  
>   return req;
>  
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, 
> struct file *file,
>   in = >in;
>   reqsize = in->h.len;
>  
> - if (task_active_pid_ns(current) != fc->pid_ns) {
> + if (task_active_pid_ns(current) != fc->pid_ns ||
> + current_user_ns() != fc->user_ns) {
>   rcu_read_lock();
>   in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>   rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct 
> fuse_attr *attr,
>   stat->ino = attr->ino;
>   stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 0);
>   stat->nlink = attr->nlink;
> - stat->uid = make_kuid(_user_ns, attr->uid);
> - stat->gid = make_kgid(_user_ns, attr->gid);
> + stat->uid = make_kuid(fc->user_ns, attr->uid);
> + stat->gid = make_kgid(fc->user_ns, attr->gid);
>   stat->rdev = inode->i_rdev;
>   stat->atime.tv_sec = attr->atime;
>   stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool 
> trust_local_mtime)
>   return true;
>  }
>  
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> -bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> +struct fuse_setattr_in *arg, bool trust_local_cmtime)
>  {
>   unsigned ivalid = iattr->ia_valid;
>  
>   if (ivalid & ATTR_MODE)
>   arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
>   if (ivalid & ATTR_UID)
> - arg->valid |= FATTR_UID,arg->uid = from_kuid(_user_ns, 
> iattr->ia_uid);
> + arg->valid |= FATTR_UID,arg->uid = from_kuid(fc->user_ns, 
> iattr->ia_uid);
>

Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:32PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
> 
>  - The userns for the fuse connection is fixed to the namespace
>from which /dev/fuse is opened.
> 
>  - The namespace must be the same as s_user_ns.
> 
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.
> 
> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi 
> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 

Acked-by: Serge Hallyn 

> ---
>  fs/fuse/cuse.c   |  3 ++-
>  fs/fuse/dev.c| 11 ---
>  fs/fuse/dir.c| 14 +++---
>  fs/fuse/fuse_i.h |  6 +-
>  fs/fuse/inode.c  | 31 +++
>  5 files changed, 41 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "fuse_i.h"
>  
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct 
> file *file)
>   if (!cc)
>   return -ENOMEM;
>  
> - fuse_conn_init(>fc);
> + fuse_conn_init(>fc, current_user_ns());
>  
>   fud = fuse_dev_alloc(>fc);
>   if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>  
>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> - req->in.h.uid = from_kuid_munged(_user_ns, current_fsuid());
> - req->in.h.gid = from_kgid_munged(_user_ns, current_fsgid());
> + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>   req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>  }
>  
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn 
> *fc, unsigned npages,
>   __set_bit(FR_WAITING, >flags);
>   if (for_background)
>   __set_bit(FR_BACKGROUND, >flags);
> + if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> + fuse_put_request(fc, req);
> + return ERR_PTR(-EOVERFLOW);
> + }
>  
>   return req;
>  
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, 
> struct file *file,
>   in = >in;
>   reqsize = in->h.len;
>  
> - if (task_active_pid_ns(current) != fc->pid_ns) {
> + if (task_active_pid_ns(current) != fc->pid_ns ||
> + current_user_ns() != fc->user_ns) {
>   rcu_read_lock();
>   in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>   rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct 
> fuse_attr *attr,
>   stat->ino = attr->ino;
>   stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 0);
>   stat->nlink = attr->nlink;
> - stat->uid = make_kuid(_user_ns, attr->uid);
> - stat->gid = make_kgid(_user_ns, attr->gid);
> + stat->uid = make_kuid(fc->user_ns, attr->uid);
> + stat->gid = make_kgid(fc->user_ns, attr->gid);
>   stat->rdev = inode->i_rdev;
>   stat->atime.tv_sec = attr->atime;
>   stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool 
> trust_local_mtime)
>   return true;
>  }
>  
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> -bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> +struct fuse_setattr_in *arg, bool trust_local_cmtime)
>  {
>   unsigned ivalid = iattr->ia_valid;
>  
>   if (ivalid & ATTR_MODE)
>   arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
>   if (ivalid & ATTR_UID)
> - arg->valid |= FATTR_UID,arg->uid = from_kuid(_user_ns, 
> iattr->ia_uid);
> + arg->valid |= FATTR_UID,arg->uid = from_kuid(fc->user_ns, 
> iattr->ia_uid);
>   if (ivalid & ATTR_GID)
> - arg->valid |= FATTR_GID,arg->gid = from_kgid(_user_ns, 
>

Re: [PATCH v4] fsck.f2fs: check and fix i_namelen to avoid double free

2017-12-22 Thread Yunlong Song


And there is en[namelen] = '\0', should fix namelen to its right value.

On 2017/12/23 11:35, Chao Yu wrote:

On 2017/12/23 11:19, Yunlong Song wrote:

Double free problem:
Since ddr bit jump makes i_namelen a larger value (> 255)，when file is
not encrypted,
the convert_encrypted_name will memcpy out range of en[255], when en is
freed, there
will be double free problem.

It looks there is only memcpy overflow problem here.

Thanks,


On 2017/12/23 11:05, Chao Yu wrote:

On 2017/12/18 21:25, Yunlong Song wrote:

v1 -> v2: use child_info to pass dentry namelen
v2 -> v3: check child != NULL to include the F2FS_FT_ORPHAN file type
v3 -> v4: fix the i_namelen problem of dump.f2fs、

There is no commit log, so what do you mean about "avoid double free"?

Other than that, looks good to me.

Reviewed-by: Chao Yu 

Thanks,


.



.



--
Thanks,
Yunlong Song

Re: [PATCH v4] fsck.f2fs: check and fix i_namelen to avoid double free

2017-12-22 Thread Yunlong Song


And there is en[namelen] = '\0', should fix namelen to its right value.

On 2017/12/23 11:35, Chao Yu wrote:

On 2017/12/23 11:19, Yunlong Song wrote:

Double free problem:
Since ddr bit jump makes i_namelen a larger value (> 255)，when file is
not encrypted,
the convert_encrypted_name will memcpy out range of en[255], when en is
freed, there
will be double free problem.

It looks there is only memcpy overflow problem here.

Thanks,


On 2017/12/23 11:05, Chao Yu wrote:

On 2017/12/18 21:25, Yunlong Song wrote:

v1 -> v2: use child_info to pass dentry namelen
v2 -> v3: check child != NULL to include the F2FS_FT_ORPHAN file type
v3 -> v4: fix the i_namelen problem of dump.f2fs、

There is no commit log, so what do you mean about "avoid double free"?

Other than that, looks good to me.

Reviewed-by: Chao Yu 

Thanks,


.



.



--
Thanks,
Yunlong Song

Re: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:31PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> The user in control of a super block should be allowed to freeze
> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
> ioctls to require CAP_SYS_ADMIN in s_user_ns.
> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro 
> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 

Reviewed-by: Serge Hallyn 

> ---
>  fs/ioctl.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 5ace7efb..8c628a8d 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -549,7 +549,7 @@ static int ioctl_fsfreeze(struct file *filp)
>  {
>   struct super_block *sb = file_inode(filp)->i_sb;
>  
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>   return -EPERM;
>  
>   /* If filesystem doesn't support freeze feature, return. */
> @@ -566,7 +566,7 @@ static int ioctl_fsthaw(struct file *filp)
>  {
>   struct super_block *sb = file_inode(filp)->i_sb;
>  
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>   return -EPERM;
>  
>   /* Thaw */
> -- 
> 2.13.6
> 
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

Re: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:31PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> The user in control of a super block should be allowed to freeze
> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
> ioctls to require CAP_SYS_ADMIN in s_user_ns.
> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro 
> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 

Reviewed-by: Serge Hallyn 

> ---
>  fs/ioctl.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 5ace7efb..8c628a8d 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -549,7 +549,7 @@ static int ioctl_fsfreeze(struct file *filp)
>  {
>   struct super_block *sb = file_inode(filp)->i_sb;
>  
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>   return -EPERM;
>  
>   /* If filesystem doesn't support freeze feature, return. */
> @@ -566,7 +566,7 @@ static int ioctl_fsthaw(struct file *filp)
>  {
>   struct super_block *sb = file_inode(filp)->i_sb;
>  
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>   return -EPERM;
>  
>   /* Thaw */
> -- 
> 2.13.6
> 
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

Re: [PATCH v2] fsck.f2fs: check nid range before use to avoid segmentation fault

2017-12-22 Thread Yunlong Song

Both are OK, since nid < root_ino cannot trigger segmentation fault 
(nat_block->entries[nid%NAT_ENTRY_PER_BLOCK]).


On 2017/12/23 11:14, Chao Yu wrote:

On 2017/12/18 19:53, Yunlong Song wrote:

Signed-off-by: Yunlong Song 

How about introducing IS_AVAILABLE_NID as below, and use it instead?

#define IS_AVAILABLE_NID(sbi, nid)  (IS_VALID_NID(sbi, nid) && (nid >= 
root_ino))

Thanks,


---
  fsck/fsck.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fsck/fsck.c b/fsck/fsck.c
index 11b8b0b..faf0663 100644
--- a/fsck/fsck.c
+++ b/fsck/fsck.c
@@ -740,7 +740,7 @@ void fsck_chk_inode_blk(struct f2fs_sb_info *sbi, u32 nid,
for (idx = 0; idx < 5; idx++) {
u32 nid = le32_to_cpu(node_blk->i.i_nid[idx]);
  
-		if (nid != 0) {

+   if (nid != 0 && IS_VALID_NID(sbi, nid)) {
struct node_info ni;
  
  			get_node_info(sbi, nid, );




.



--
Thanks,
Yunlong Song

Re: [PATCH v2] fsck.f2fs: check nid range before use to avoid segmentation fault

2017-12-22 Thread Yunlong Song

Both are OK, since nid < root_ino cannot trigger segmentation fault 
(nat_block->entries[nid%NAT_ENTRY_PER_BLOCK]).


On 2017/12/23 11:14, Chao Yu wrote:

On 2017/12/18 19:53, Yunlong Song wrote:

Signed-off-by: Yunlong Song 

How about introducing IS_AVAILABLE_NID as below, and use it instead?

#define IS_AVAILABLE_NID(sbi, nid)  (IS_VALID_NID(sbi, nid) && (nid >= 
root_ino))

Thanks,


---
  fsck/fsck.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fsck/fsck.c b/fsck/fsck.c
index 11b8b0b..faf0663 100644
--- a/fsck/fsck.c
+++ b/fsck/fsck.c
@@ -740,7 +740,7 @@ void fsck_chk_inode_blk(struct f2fs_sb_info *sbi, u32 nid,
for (idx = 0; idx < 5; idx++) {
u32 nid = le32_to_cpu(node_blk->i.i_nid[idx]);
  
-		if (nid != 0) {

+   if (nid != 0 && IS_VALID_NID(sbi, nid)) {
struct node_info ni;
  
  			get_node_info(sbi, nid, );




.



--
Thanks,
Yunlong Song

Re: [PATCH v4] fsck.f2fs: check and fix i_namelen to avoid double free

2017-12-22 Thread Chao Yu

On 2017/12/23 11:19, Yunlong Song wrote:
> Double free problem:
> Since ddr bit jump makes i_namelen a larger value (> 255)，when file is 
> not encrypted,
> the convert_encrypted_name will memcpy out range of en[255], when en is 
> freed, there
> will be double free problem.

It looks there is only memcpy overflow problem here.

Thanks,

> 
> On 2017/12/23 11:05, Chao Yu wrote:
>> On 2017/12/18 21:25, Yunlong Song wrote:
>>> v1 -> v2: use child_info to pass dentry namelen
>>> v2 -> v3: check child != NULL to include the F2FS_FT_ORPHAN file type
>>> v3 -> v4: fix the i_namelen problem of dump.f2fs、
>> There is no commit log, so what do you mean about "avoid double free"?
>>
>> Other than that, looks good to me.
>>
>> Reviewed-by: Chao Yu 
>>
>> Thanks,
>>
>>
>> .
>>
>

Re: [PATCH v4] fsck.f2fs: check and fix i_namelen to avoid double free

2017-12-22 Thread Chao Yu

On 2017/12/23 11:19, Yunlong Song wrote:
> Double free problem:
> Since ddr bit jump makes i_namelen a larger value (> 255)，when file is 
> not encrypted,
> the convert_encrypted_name will memcpy out range of en[255], when en is 
> freed, there
> will be double free problem.

It looks there is only memcpy overflow problem here.

Thanks,

> 
> On 2017/12/23 11:05, Chao Yu wrote:
>> On 2017/12/18 21:25, Yunlong Song wrote:
>>> v1 -> v2: use child_info to pass dentry namelen
>>> v2 -> v3: check child != NULL to include the F2FS_FT_ORPHAN file type
>>> v3 -> v4: fix the i_namelen problem of dump.f2fs、
>> There is no commit log, so what do you mean about "avoid double free"?
>>
>> Other than that, looks good to me.
>>
>> Reviewed-by: Chao Yu 
>>
>> Thanks,
>>
>>
>> .
>>
>

Re: [PATCH 06/11] capabilities: Allow privileged user in s_user_ns to set security.* xattrs

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:30PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> A privileged user in s_user_ns will generally have the ability to
> manipulate the backing store and insert security.* xattrs into
> the filesystem directly. Therefore the kernel must be prepared to
> handle these xattrs from unprivileged mounts, and it makes little
> sense for commoncap to prevent writing these xattrs to the
> filesystem. The capability and LSM code have already been updated
> to appropriately handle xattrs from unprivileged mounts, so it
> is safe to loosen this restriction on setting xattrs.
> 
> The exception to this logic is that writing xattrs to a mounted
> filesystem may also cause the LSM inode_post_setxattr or
> inode_setsecurity callbacks to be invoked. SELinux will deny the
> xattr update by virtue of applying mountpoint labeling to
> unprivileged userns mounts, and Smack will deny the writes for
> any user without global CAP_MAC_ADMIN, so loosening the
> capability check in commoncap is safe in this respect as well.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944641/
> 
> Cc: linux-security-mod...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: James Morris 
> Cc: Serge Hallyn 

Reviewed-by: Serge Hallyn 

> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 
> ---
>  security/commoncap.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 4f8e0934..dd0afef9 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -920,6 +920,8 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
>  int cap_inode_setxattr(struct dentry *dentry, const char *name,
>  const void *value, size_t size, int flags)
>  {
> + struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
> +
>   /* Ignore non-security xattrs */
>   if (strncmp(name, XATTR_SECURITY_PREFIX,
>   sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> @@ -932,7 +934,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char 
> *name,
>   if (strcmp(name, XATTR_NAME_CAPS) == 0)
>   return 0;
>  
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>   return -EPERM;
>   return 0;
>  }
> @@ -950,6 +952,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char 
> *name,
>   */
>  int cap_inode_removexattr(struct dentry *dentry, const char *name)
>  {
> + struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
> +
>   /* Ignore non-security xattrs */
>   if (strncmp(name, XATTR_SECURITY_PREFIX,
>   sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> @@ -965,7 +969,7 @@ int cap_inode_removexattr(struct dentry *dentry, const 
> char *name)
>   return 0;
>   }
>  
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>   return -EPERM;
>   return 0;
>  }
> -- 
> 2.13.6

Re: [PATCH 06/11] capabilities: Allow privileged user in s_user_ns to set security.* xattrs

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:30PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> A privileged user in s_user_ns will generally have the ability to
> manipulate the backing store and insert security.* xattrs into
> the filesystem directly. Therefore the kernel must be prepared to
> handle these xattrs from unprivileged mounts, and it makes little
> sense for commoncap to prevent writing these xattrs to the
> filesystem. The capability and LSM code have already been updated
> to appropriately handle xattrs from unprivileged mounts, so it
> is safe to loosen this restriction on setting xattrs.
> 
> The exception to this logic is that writing xattrs to a mounted
> filesystem may also cause the LSM inode_post_setxattr or
> inode_setsecurity callbacks to be invoked. SELinux will deny the
> xattr update by virtue of applying mountpoint labeling to
> unprivileged userns mounts, and Smack will deny the writes for
> any user without global CAP_MAC_ADMIN, so loosening the
> capability check in commoncap is safe in this respect as well.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944641/
> 
> Cc: linux-security-mod...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: James Morris 
> Cc: Serge Hallyn 

Reviewed-by: Serge Hallyn 

> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 
> ---
>  security/commoncap.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 4f8e0934..dd0afef9 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -920,6 +920,8 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
>  int cap_inode_setxattr(struct dentry *dentry, const char *name,
>  const void *value, size_t size, int flags)
>  {
> + struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
> +
>   /* Ignore non-security xattrs */
>   if (strncmp(name, XATTR_SECURITY_PREFIX,
>   sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> @@ -932,7 +934,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char 
> *name,
>   if (strcmp(name, XATTR_NAME_CAPS) == 0)
>   return 0;
>  
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>   return -EPERM;
>   return 0;
>  }
> @@ -950,6 +952,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char 
> *name,
>   */
>  int cap_inode_removexattr(struct dentry *dentry, const char *name)
>  {
> + struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
> +
>   /* Ignore non-security xattrs */
>   if (strncmp(name, XATTR_SECURITY_PREFIX,
>   sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> @@ -965,7 +969,7 @@ int cap_inode_removexattr(struct dentry *dentry, const 
> char *name)
>   return 0;
>   }
>  
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>   return -EPERM;
>   return 0;
>  }
> -- 
> 2.13.6

Re: [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb()

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:29PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> Superblock level remounts are currently restricted to global
> CAP_SYS_ADMIN, as is the path for changing the root mount to
> read only on umount. Loosen both of these permission checks to
> also allow CAP_SYS_ADMIN in any namespace which is privileged
> towards the userns which originally mounted the filesystem.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944631/
> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro 
> Cc: "Eric W. Biederman" 
> Cc: Serge Hallyn 

Acked-by: Serge Hallyn 

> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 
> ---
>  fs/namespace.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e158ec6b..830040d7 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1589,7 +1589,7 @@ static int do_umount(struct mount *mnt, int flags)
>* Special case for "unmounting" root ...
>* we just try to remount it readonly.
>*/
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>   return -EPERM;
>   down_write(>s_umount);
>   if (!sb_rdonly(sb))
> @@ -2327,7 +2327,7 @@ static int do_remount(struct path *path, int ms_flags, 
> int sb_flags,
>   down_write(>s_umount);
>   if (ms_flags & MS_BIND)
>   err = change_mount_flags(path->mnt, ms_flags);
> - else if (!capable(CAP_SYS_ADMIN))
> + else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>   err = -EPERM;
>   else
>   err = do_remount_sb(sb, sb_flags, data, 0);
> -- 
> 2.13.6

Re: [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb()

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:29PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> Superblock level remounts are currently restricted to global
> CAP_SYS_ADMIN, as is the path for changing the root mount to
> read only on umount. Loosen both of these permission checks to
> also allow CAP_SYS_ADMIN in any namespace which is privileged
> towards the userns which originally mounted the filesystem.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944631/
> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro 
> Cc: "Eric W. Biederman" 
> Cc: Serge Hallyn 

Acked-by: Serge Hallyn 

> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 
> ---
>  fs/namespace.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e158ec6b..830040d7 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1589,7 +1589,7 @@ static int do_umount(struct mount *mnt, int flags)
>* Special case for "unmounting" root ...
>* we just try to remount it readonly.
>*/
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>   return -EPERM;
>   down_write(>s_umount);
>   if (!sb_rdonly(sb))
> @@ -2327,7 +2327,7 @@ static int do_remount(struct path *path, int ms_flags, 
> int sb_flags,
>   down_write(>s_umount);
>   if (ms_flags & MS_BIND)
>   err = change_mount_flags(path->mnt, ms_flags);
> - else if (!capable(CAP_SYS_ADMIN))
> + else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>   err = -EPERM;
>   else
>   err = do_remount_sb(sb, sb_flags, data, 0);
> -- 
> 2.13.6

Re: [PATCH v20 3/7 RESEND] xbitmap: add more operations

2017-12-22 Thread Matthew Wilcox

On Sat, Dec 23, 2017 at 11:59:54AM +0900, Tetsuo Handa wrote:
> Matthew Wilcox wrote:
> > +   bit %= IDA_BITMAP_BITS;
> > +   radix_tree_iter_init(, index);
> > +   slot = idr_get_free_cmn(root, , GFP_NOWAIT | __GFP_NOWARN, index);
> > +   if (IS_ERR(slot)) {
> > +   if (slot == ERR_PTR(-ENOSPC))
> > +   return 0;   /* Already set */
> 
> Why already set? I guess something is there, but is it guaranteed that
> there is a bitmap with the "bit" set?

Yes.  For radix trees tagged with IDR_RT_MARKER, newly created slots
have the IDR_FREE tag set.  We only clear the IDR_FREE tag once the
bitmap is full.  So if we try to find a free slot and the tag is clear,
we know the bitmap is full.

> > +   bitmap = rcu_dereference_raw(*slot);
> > +   if (!bitmap) {
> > +   bitmap = this_cpu_xchg(ida_bitmap, NULL);
> > +   if (!bitmap)
> > +   return -ENOMEM;
> 
> I can't understand this. I can understand if it were
> 
>   BUG_ON(!bitmap);
> 
> because you called xb_preload().
> 
> But
> 
>   /*
>* Regular test 2
>* set bit 2000, 2001, 2040
>* Next 1 in [0, 2048)  --> 2000
>* Next 1 in [2000, 2002)   --> 2000
>* Next 1 in [2002, 2041)   --> 2040
>* Next 1 in [2002, 2040)   --> none
>* Next 0 in [2000, 2048)   --> 2002
>* Next 0 in [2048, 2060)   --> 2048
>*/
>   xb_preload(GFP_KERNEL);
>   assert(!xb_set_bit(, 2000));
>   assert(!xb_set_bit(, 2001));
>   assert(!xb_set_bit(, 2040));
[...]
>   xb_preload_end();
> 
> you are not calling xb_preload() prior to each xb_set_bit() call.
> This means that, if each xb_set_bit() is not surrounded with
> xb_preload()/xb_preload_end(), there is possibility of hitting
> this_cpu_xchg(ida_bitmap, NULL) == NULL.

This is just a lazy test.  We "know" that the bits in the range 1024-2047
will all land in the same bitmap, so there's no need to preload for each
of them.

> If bitmap == NULL at this_cpu_xchg(ida_bitmap, NULL) is allowed,
> you can use kzalloc(sizeof(*bitmap), GFP_NOWAIT | __GFP_NOWARN)
> and get rid of xb_preload()/xb_preload_end().

No, we can't.  GFP_NOWAIT | __GFP_NOWARN won't try very hard to allocate
memory.  There's no reason to fail the call if the user is in a context
where they can try harder to free memory.

> You are using idr_get_free_cmn(GFP_NOWAIT | __GFP_NOWARN), which
> means that the caller has to be prepared for allocation failure
> when calling xb_set_bit(). Thus, there is no need to use preload
> in order to avoid failing to allocate "bitmap".

xb_preload also preloads radix tree nodes.

> Also, please clarify why it is OK to just return here.
> I don't know what
> 
>   radix_tree_iter_replace(root, , slot, bitmap);
> 
> is doing. If you created a slot but did not assign "bitmap",
> what the caller of xb_test_bit() etc. will find? If there is an
> assumption about this slot, won't this cause a problem?

xb_test_bit will find NULL if bitmap wasn't assigned.  That doesn't
harm anything.

Re: [PATCH v20 3/7 RESEND] xbitmap: add more operations

2017-12-22 Thread Matthew Wilcox

On Sat, Dec 23, 2017 at 11:59:54AM +0900, Tetsuo Handa wrote:
> Matthew Wilcox wrote:
> > +   bit %= IDA_BITMAP_BITS;
> > +   radix_tree_iter_init(, index);
> > +   slot = idr_get_free_cmn(root, , GFP_NOWAIT | __GFP_NOWARN, index);
> > +   if (IS_ERR(slot)) {
> > +   if (slot == ERR_PTR(-ENOSPC))
> > +   return 0;   /* Already set */
> 
> Why already set? I guess something is there, but is it guaranteed that
> there is a bitmap with the "bit" set?

Yes.  For radix trees tagged with IDR_RT_MARKER, newly created slots
have the IDR_FREE tag set.  We only clear the IDR_FREE tag once the
bitmap is full.  So if we try to find a free slot and the tag is clear,
we know the bitmap is full.

> > +   bitmap = rcu_dereference_raw(*slot);
> > +   if (!bitmap) {
> > +   bitmap = this_cpu_xchg(ida_bitmap, NULL);
> > +   if (!bitmap)
> > +   return -ENOMEM;
> 
> I can't understand this. I can understand if it were
> 
>   BUG_ON(!bitmap);
> 
> because you called xb_preload().
> 
> But
> 
>   /*
>* Regular test 2
>* set bit 2000, 2001, 2040
>* Next 1 in [0, 2048)  --> 2000
>* Next 1 in [2000, 2002)   --> 2000
>* Next 1 in [2002, 2041)   --> 2040
>* Next 1 in [2002, 2040)   --> none
>* Next 0 in [2000, 2048)   --> 2002
>* Next 0 in [2048, 2060)   --> 2048
>*/
>   xb_preload(GFP_KERNEL);
>   assert(!xb_set_bit(, 2000));
>   assert(!xb_set_bit(, 2001));
>   assert(!xb_set_bit(, 2040));
[...]
>   xb_preload_end();
> 
> you are not calling xb_preload() prior to each xb_set_bit() call.
> This means that, if each xb_set_bit() is not surrounded with
> xb_preload()/xb_preload_end(), there is possibility of hitting
> this_cpu_xchg(ida_bitmap, NULL) == NULL.

This is just a lazy test.  We "know" that the bits in the range 1024-2047
will all land in the same bitmap, so there's no need to preload for each
of them.

> If bitmap == NULL at this_cpu_xchg(ida_bitmap, NULL) is allowed,
> you can use kzalloc(sizeof(*bitmap), GFP_NOWAIT | __GFP_NOWARN)
> and get rid of xb_preload()/xb_preload_end().

No, we can't.  GFP_NOWAIT | __GFP_NOWARN won't try very hard to allocate
memory.  There's no reason to fail the call if the user is in a context
where they can try harder to free memory.

> You are using idr_get_free_cmn(GFP_NOWAIT | __GFP_NOWARN), which
> means that the caller has to be prepared for allocation failure
> when calling xb_set_bit(). Thus, there is no need to use preload
> in order to avoid failing to allocate "bitmap".

xb_preload also preloads radix tree nodes.

> Also, please clarify why it is OK to just return here.
> I don't know what
> 
>   radix_tree_iter_replace(root, , slot, bitmap);
> 
> is doing. If you created a slot but did not assign "bitmap",
> what the caller of xb_test_bit() etc. will find? If there is an
> assumption about this slot, won't this cause a problem?

xb_test_bit will find NULL if bitmap wasn't assigned.  That doesn't
harm anything.

Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> Expand the check in should_remove_suid() to keep privileges for

I realize this description came from Seth, but reading it now,
'Expand' seems wrong.  Expanding a check brings to my mind making
it stricter, not looser.  How about 'Relax the check' ?

> CAP_FSETID in s_user_ns rather than init_user_ns.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
> 
> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid

Why exactly?

This is wrong, because capable_wrt_inode_uidgid() does a check
against current_user_ns, not the  inode->i_sb->s_user_ns

> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro 
> Cc: Serge Hallyn 
> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 
> ---
>  fs/inode.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fd401028..6459a437 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
>   */
>  int should_remove_suid(struct dentry *dentry)
>  {
> - umode_t mode = d_inode(dentry)->i_mode;
> + struct inode *inode = d_inode(dentry);
> + umode_t mode = inode->i_mode;
>   int kill = 0;
>  
>   /* suid always must be killed */
> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
>   if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
>   kill |= ATTR_KILL_SGID;
>  
> - if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
> + if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
> +  S_ISREG(mode)))
>   return kill;
>  
>   return 0;
> -- 
> 2.13.6

Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> Expand the check in should_remove_suid() to keep privileges for

I realize this description came from Seth, but reading it now,
'Expand' seems wrong.  Expanding a check brings to my mind making
it stricter, not looser.  How about 'Relax the check' ?

> CAP_FSETID in s_user_ns rather than init_user_ns.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
> 
> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid

Why exactly?

This is wrong, because capable_wrt_inode_uidgid() does a check
against current_user_ns, not the  inode->i_sb->s_user_ns

> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro 
> Cc: Serge Hallyn 
> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 
> ---
>  fs/inode.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fd401028..6459a437 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
>   */
>  int should_remove_suid(struct dentry *dentry)
>  {
> - umode_t mode = d_inode(dentry)->i_mode;
> + struct inode *inode = d_inode(dentry);
> + umode_t mode = inode->i_mode;
>   int kill = 0;
>  
>   /* suid always must be killed */
> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
>   if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
>   kill |= ATTR_KILL_SGID;
>  
> - if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
> + if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
> +  S_ISREG(mode)))
>   return kill;
>  
>   return 0;
> -- 
> 2.13.6

Re: [PATCH v4] fsck.f2fs: check and fix i_namelen to avoid double free

2017-12-22 Thread Yunlong Song


Double free problem:
Since ddr bit jump makes i_namelen a larger value (> 255)，when file is 
not encrypted,
the convert_encrypted_name will memcpy out range of en[255], when en is 
freed, there

will be double free problem.

On 2017/12/23 11:05, Chao Yu wrote:

On 2017/12/18 21:25, Yunlong Song wrote:

v1 -> v2: use child_info to pass dentry namelen
v2 -> v3: check child != NULL to include the F2FS_FT_ORPHAN file type
v3 -> v4: fix the i_namelen problem of dump.f2fs、

There is no commit log, so what do you mean about "avoid double free"?

Other than that, looks good to me.

Reviewed-by: Chao Yu 

Thanks,


.



--
Thanks,
Yunlong Song

Re: [PATCH v4] fsck.f2fs: check and fix i_namelen to avoid double free

2017-12-22 Thread Yunlong Song


Double free problem:
Since ddr bit jump makes i_namelen a larger value (> 255)，when file is 
not encrypted,
the convert_encrypted_name will memcpy out range of en[255], when en is 
freed, there

will be double free problem.

On 2017/12/23 11:05, Chao Yu wrote:

On 2017/12/18 21:25, Yunlong Song wrote:

v1 -> v2: use child_info to pass dentry namelen
v2 -> v3: check child != NULL to include the F2FS_FT_ORPHAN file type
v3 -> v4: fix the i_namelen problem of dump.f2fs、

There is no commit log, so what do you mean about "avoid double free"?

Other than that, looks good to me.

Reviewed-by: Chao Yu 

Thanks,


.



--
Thanks,
Yunlong Song

Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
> From: Eric W. Biederman 
> 
> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to

Note it is CAP_CHOWN

> chown files.  Ordinarily the capable_wrt_inode_uidgid check is
> sufficient to allow access to files but when the underlying filesystem
> has uids or gids that don't map to the current user namespace it is
> not enough, so the chown permission checks need to be extended to
> allow this case.
> 
> Calling chown on filesystem nodes whose uid or gid don't map is
> necessary if those nodes are going to be modified as writing back
> inodes which contain uids or gids that don't map is likely to cause
> filesystem corruption of the uid or gid fields.
> 
> Once chown has been called the existing capable_wrt_inode_uidgid
> checks are sufficient, to allow the owner of a superblock to do anything
> the global root user can do with an appropriate set of capabilities.
> 
> For the proc filesystem this relaxation of permissions is not safe, as
> some files are owned by users (particularly GLOBAL_ROOT_UID) outside
> of the control of the mounter of the proc and that would be unsafe to
> grant chown access to.  So update setattr on proc to disallow changing
> files whose uids or gids are outside of proc's s_user_ns.
> 
> The original version of this patch was written by: Seth Forshee.  I
> have rewritten and rethought this patch enough so it's really not the
> same thing (certainly it needs a different description), but he
> deserves credit for getting out there and getting the conversation
> started, and finding the potential gotcha's and putting up with my
> semi-paranoid feedback.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944611/
> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro 
> Cc: "Luis R. Rodriguez" 
> Cc: Kees Cook 
> Inspired-by: Seth Forshee 
> Signed-off-by: Eric W. Biederman 
> [saf: Resolve conflicts caused by s/inode_change_ok/setattr_prepare/]
> Signed-off-by: Dongsu Park 

Reviewed-by: Serge Hallyn 

> ---
>  fs/attr.c | 34 ++
>  fs/proc/base.c|  7 +++
>  fs/proc/generic.c |  7 +++
>  fs/proc/proc_sysctl.c |  7 +++
>  4 files changed, 47 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/attr.c b/fs/attr.c
> index 12ffdb6f..bf8e94f3 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -18,6 +18,30 @@
>  #include 
>  #include 
>  
> +static bool chown_ok(const struct inode *inode, kuid_t uid)
> +{
> + if (uid_eq(current_fsuid(), inode->i_uid) &&
> + uid_eq(uid, inode->i_uid))
> + return true;
> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + return true;
> + if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> + return true;
> + return false;
> +}
> +
> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
> +{
> + if (uid_eq(current_fsuid(), inode->i_uid) &&
> + (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
> + return true;
> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + return true;
> + if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> + return true;
> + return false;
> +}
> +
>  /**
>   * setattr_prepare - check if attribute changes to a dentry are allowed
>   * @dentry:  dentry to check
> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr 
> *attr)
>   goto kill_priv;
>  
>   /* Make sure a caller can chown. */
> - if ((ia_valid & ATTR_UID) &&
> - (!uid_eq(current_fsuid(), inode->i_uid) ||
> -  !uid_eq(attr->ia_uid, inode->i_uid)) &&
> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
>   return -EPERM;
>  
>   /* Make sure caller can chgrp. */
> - if ((ia_valid & ATTR_GID) &&
> - (!uid_eq(current_fsuid(), inode->i_uid) ||
> - (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) 
> &&
> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
>   return -EPERM;
>  
>   /* Make sure a caller can chmod. */
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 31934cb9..9d50ec92 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr 
> *attr)
>  {
>   int error;
>   struct inode *inode = d_inode(dentry);
> + struct user_namespace *s_user_ns;
>  
>   if (attr->ia_valid & ATTR_MODE)
>   return -EPERM;
>  
> + /* Don't let anyone mess with weird proc files */
> + s_user_ns = inode->i_sb->s_user_ns;
> +

Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
> From: Eric W. Biederman 
> 
> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to

Note it is CAP_CHOWN

> chown files.  Ordinarily the capable_wrt_inode_uidgid check is
> sufficient to allow access to files but when the underlying filesystem
> has uids or gids that don't map to the current user namespace it is
> not enough, so the chown permission checks need to be extended to
> allow this case.
> 
> Calling chown on filesystem nodes whose uid or gid don't map is
> necessary if those nodes are going to be modified as writing back
> inodes which contain uids or gids that don't map is likely to cause
> filesystem corruption of the uid or gid fields.
> 
> Once chown has been called the existing capable_wrt_inode_uidgid
> checks are sufficient, to allow the owner of a superblock to do anything
> the global root user can do with an appropriate set of capabilities.
> 
> For the proc filesystem this relaxation of permissions is not safe, as
> some files are owned by users (particularly GLOBAL_ROOT_UID) outside
> of the control of the mounter of the proc and that would be unsafe to
> grant chown access to.  So update setattr on proc to disallow changing
> files whose uids or gids are outside of proc's s_user_ns.
> 
> The original version of this patch was written by: Seth Forshee.  I
> have rewritten and rethought this patch enough so it's really not the
> same thing (certainly it needs a different description), but he
> deserves credit for getting out there and getting the conversation
> started, and finding the potential gotcha's and putting up with my
> semi-paranoid feedback.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944611/
> 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro 
> Cc: "Luis R. Rodriguez" 
> Cc: Kees Cook 
> Inspired-by: Seth Forshee 
> Signed-off-by: Eric W. Biederman 
> [saf: Resolve conflicts caused by s/inode_change_ok/setattr_prepare/]
> Signed-off-by: Dongsu Park 

Reviewed-by: Serge Hallyn 

> ---
>  fs/attr.c | 34 ++
>  fs/proc/base.c|  7 +++
>  fs/proc/generic.c |  7 +++
>  fs/proc/proc_sysctl.c |  7 +++
>  4 files changed, 47 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/attr.c b/fs/attr.c
> index 12ffdb6f..bf8e94f3 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -18,6 +18,30 @@
>  #include 
>  #include 
>  
> +static bool chown_ok(const struct inode *inode, kuid_t uid)
> +{
> + if (uid_eq(current_fsuid(), inode->i_uid) &&
> + uid_eq(uid, inode->i_uid))
> + return true;
> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + return true;
> + if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> + return true;
> + return false;
> +}
> +
> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
> +{
> + if (uid_eq(current_fsuid(), inode->i_uid) &&
> + (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
> + return true;
> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + return true;
> + if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> + return true;
> + return false;
> +}
> +
>  /**
>   * setattr_prepare - check if attribute changes to a dentry are allowed
>   * @dentry:  dentry to check
> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr 
> *attr)
>   goto kill_priv;
>  
>   /* Make sure a caller can chown. */
> - if ((ia_valid & ATTR_UID) &&
> - (!uid_eq(current_fsuid(), inode->i_uid) ||
> -  !uid_eq(attr->ia_uid, inode->i_uid)) &&
> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
>   return -EPERM;
>  
>   /* Make sure caller can chgrp. */
> - if ((ia_valid & ATTR_GID) &&
> - (!uid_eq(current_fsuid(), inode->i_uid) ||
> - (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) 
> &&
> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
>   return -EPERM;
>  
>   /* Make sure a caller can chmod. */
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 31934cb9..9d50ec92 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr 
> *attr)
>  {
>   int error;
>   struct inode *inode = d_inode(dentry);
> + struct user_namespace *s_user_ns;
>  
>   if (attr->ia_valid & ATTR_MODE)
>   return -EPERM;
>  
> + /* Don't let anyone mess with weird proc files */
> + s_user_ns = inode->i_sb->s_user_ns;
> + if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> + !kgid_has_mapping(s_user_ns, inode->i_gid))
> + return -EPERM;
> +
>   error =

Re: [PATCH v2] fsck.f2fs: check nid range before use to avoid segmentation fault

2017-12-22 Thread Chao Yu

On 2017/12/18 19:53, Yunlong Song wrote:
> Signed-off-by: Yunlong Song 

How about introducing IS_AVAILABLE_NID as below, and use it instead?

#define IS_AVAILABLE_NID(sbi, nid)  (IS_VALID_NID(sbi, nid) && (nid >= 
root_ino))

Thanks,

> ---
>  fsck/fsck.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fsck/fsck.c b/fsck/fsck.c
> index 11b8b0b..faf0663 100644
> --- a/fsck/fsck.c
> +++ b/fsck/fsck.c
> @@ -740,7 +740,7 @@ void fsck_chk_inode_blk(struct f2fs_sb_info *sbi, u32 nid,
>   for (idx = 0; idx < 5; idx++) {
>   u32 nid = le32_to_cpu(node_blk->i.i_nid[idx]);
>  
> - if (nid != 0) {
> + if (nid != 0 && IS_VALID_NID(sbi, nid)) {
>   struct node_info ni;
>  
>   get_node_info(sbi, nid, );
>

Re: [PATCH v2] fsck.f2fs: check nid range before use to avoid segmentation fault

2017-12-22 Thread Chao Yu

On 2017/12/18 19:53, Yunlong Song wrote:
> Signed-off-by: Yunlong Song 

How about introducing IS_AVAILABLE_NID as below, and use it instead?

#define IS_AVAILABLE_NID(sbi, nid)  (IS_VALID_NID(sbi, nid) && (nid >= 
root_ino))

Thanks,

> ---
>  fsck/fsck.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fsck/fsck.c b/fsck/fsck.c
> index 11b8b0b..faf0663 100644
> --- a/fsck/fsck.c
> +++ b/fsck/fsck.c
> @@ -740,7 +740,7 @@ void fsck_chk_inode_blk(struct f2fs_sb_info *sbi, u32 nid,
>   for (idx = 0; idx < 5; idx++) {
>   u32 nid = le32_to_cpu(node_blk->i.i_nid[idx]);
>  
> - if (nid != 0) {
> + if (nid != 0 && IS_VALID_NID(sbi, nid)) {
>   struct node_info ni;
>  
>   get_node_info(sbi, nid, );
>

Re: [PATCH v4] fsck.f2fs: check and fix i_namelen to avoid double free

2017-12-22 Thread Chao Yu

On 2017/12/18 21:25, Yunlong Song wrote:
> v1 -> v2: use child_info to pass dentry namelen
> v2 -> v3: check child != NULL to include the F2FS_FT_ORPHAN file type
> v3 -> v4: fix the i_namelen problem of dump.f2fs、

There is no commit log, so what do you mean about "avoid double free"?

Other than that, looks good to me.

Reviewed-by: Chao Yu 

Thanks,

Re: [PATCH v4] fsck.f2fs: check and fix i_namelen to avoid double free

2017-12-22 Thread Chao Yu

On 2017/12/18 21:25, Yunlong Song wrote:
> v1 -> v2: use child_info to pass dentry namelen
> v2 -> v3: check child != NULL to include the F2FS_FT_ORPHAN file type
> v3 -> v4: fix the i_namelen problem of dump.f2fs、

There is no commit log, so what do you mean about "avoid double free"?

Other than that, looks good to me.

Reviewed-by: Chao Yu 

Thanks,

Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:26PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> Unprivileged users should not be able to mount mtd block devices
> when they lack sufficient privileges towards the block device
> inode.  Update mount_mtd() to validate that the user has the
> required access to the inode at the specified path. The check
> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
> continue working as before.
> 
> Patch v3 is available: https://patchwork.kernel.org/patch/7640011/
> 
> Cc: linux-...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 

Acked-by: Serge Hallyn 

> ---
>  drivers/mtd/mtdsuper.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
> index 4a4d40c0..3c8734f3 100644
> --- a/drivers/mtd/mtdsuper.c
> +++ b/drivers/mtd/mtdsuper.c
> @@ -129,6 +129,7 @@ struct dentry *mount_mtd(struct file_system_type 
> *fs_type, int flags,
>  #ifdef CONFIG_BLOCK
>   struct block_device *bdev;
>   int ret, major;
> + int perm;
>  #endif
>   int mtdnr;
>  
> @@ -180,7 +181,10 @@ struct dentry *mount_mtd(struct file_system_type 
> *fs_type, int flags,
>   /* try the old way - the hack where we allowed users to mount
>* /dev/mtdblock$(n) but didn't actually _use_ the blockdev
>*/
> - bdev = lookup_bdev(dev_name, 0);
> + perm = MAY_READ;
> + if (!(flags & MS_RDONLY))
> + perm |= MAY_WRITE;
> + bdev = lookup_bdev(dev_name, perm);
>   if (IS_ERR(bdev)) {
>   ret = PTR_ERR(bdev);
>   pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
> -- 
> 2.13.6
> 
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:26PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> Unprivileged users should not be able to mount mtd block devices
> when they lack sufficient privileges towards the block device
> inode.  Update mount_mtd() to validate that the user has the
> required access to the inode at the specified path. The check
> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
> continue working as before.
> 
> Patch v3 is available: https://patchwork.kernel.org/patch/7640011/
> 
> Cc: linux-...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 

Acked-by: Serge Hallyn 

> ---
>  drivers/mtd/mtdsuper.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
> index 4a4d40c0..3c8734f3 100644
> --- a/drivers/mtd/mtdsuper.c
> +++ b/drivers/mtd/mtdsuper.c
> @@ -129,6 +129,7 @@ struct dentry *mount_mtd(struct file_system_type 
> *fs_type, int flags,
>  #ifdef CONFIG_BLOCK
>   struct block_device *bdev;
>   int ret, major;
> + int perm;
>  #endif
>   int mtdnr;
>  
> @@ -180,7 +181,10 @@ struct dentry *mount_mtd(struct file_system_type 
> *fs_type, int flags,
>   /* try the old way - the hack where we allowed users to mount
>* /dev/mtdblock$(n) but didn't actually _use_ the blockdev
>*/
> - bdev = lookup_bdev(dev_name, 0);
> + perm = MAY_READ;
> + if (!(flags & MS_RDONLY))
> + perm |= MAY_WRITE;
> + bdev = lookup_bdev(dev_name, perm);
>   if (IS_ERR(bdev)) {
>   ret = PTR_ERR(bdev);
>   pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
> -- 
> 2.13.6
> 
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:25PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> When looking up a block device by path no permission check is
> done to verify that the user has access to the block device inode
> at the specified path. In some cases it may be necessary to
> check permissions towards the inode, such as allowing
> unprivileged users to mount block devices in user namespaces.
> 
> Add an argument to lookup_bdev() to optionally perform this
> permission check. A value of 0 skips the permission check and
> behaves the same as before. A non-zero value specifies the mask
> of access rights required towards the inode at the specified
> path. The check is always skipped if the user has CAP_SYS_ADMIN.
> 
> All callers of lookup_bdev() currently pass a mask of 0, so this
> patch results in no functional change. Subsequent patches will
> add permission checks where appropriate.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8943601/
> 
> Cc: dm-de...@redhat.com
> Cc: linux-bca...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro 
> Cc: Jan Kara 
> Cc: Serge Hallyn 

Acked-by: Serge Hallyn 

> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 
> ---
>  drivers/md/bcache/super.c |  2 +-
>  drivers/md/dm-table.c |  2 +-
>  drivers/mtd/mtdsuper.c|  2 +-
>  fs/block_dev.c| 13 ++---
>  fs/quota/quota.c  |  2 +-
>  include/linux/fs.h|  2 +-
>  6 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index b4d28928..acc9d56c 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, 
> struct kobj_attribute *attr,
> sb);
>   if (IS_ERR(bdev)) {
>   if (bdev == ERR_PTR(-EBUSY)) {
> - bdev = lookup_bdev(strim(path));
> + bdev = lookup_bdev(strim(path), 0);
>   mutex_lock(_register_lock);
>   if (!IS_ERR(bdev) && bch_is_open(bdev))
>   err = "device already registered";
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 88130b5d..bca5eaf4 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -410,7 +410,7 @@ dev_t dm_get_dev_t(const char *path)
>   dev_t dev;
>   struct block_device *bdev;
>  
> - bdev = lookup_bdev(path);
> + bdev = lookup_bdev(path, 0);
>   if (IS_ERR(bdev))
>   dev = name_to_dev_t(path);
>   else {
> diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
> index e43fea89..4a4d40c0 100644
> --- a/drivers/mtd/mtdsuper.c
> +++ b/drivers/mtd/mtdsuper.c
> @@ -180,7 +180,7 @@ struct dentry *mount_mtd(struct file_system_type 
> *fs_type, int flags,
>   /* try the old way - the hack where we allowed users to mount
>* /dev/mtdblock$(n) but didn't actually _use_ the blockdev
>*/
> - bdev = lookup_bdev(dev_name);
> + bdev = lookup_bdev(dev_name, 0);
>   if (IS_ERR(bdev)) {
>   ret = PTR_ERR(bdev);
>   pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4a181fcb..5ca06095 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1662,7 +1662,7 @@ struct block_device *blkdev_get_by_path(const char 
> *path, fmode_t mode,
>   struct block_device *bdev;
>   int err;
>  
> - bdev = lookup_bdev(path);
> + bdev = lookup_bdev(path, 0);
>   if (IS_ERR(bdev))
>   return bdev;
>  
> @@ -2052,12 +2052,14 @@ EXPORT_SYMBOL(ioctl_by_bdev);
>  /**
>   * lookup_bdev  - lookup a struct block_device by name
>   * @pathname:special file representing the block device
> + * @mask:rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
>   *
>   * Get a reference to the blockdevice at @pathname in the current
>   * namespace if possible and return it.  Return ERR_PTR(error)
> - * otherwise.
> + * otherwise.  If @mask is non-zero, check for access rights to the
> + * inode at @pathname.
>   */
> -struct block_device *lookup_bdev(const char *pathname)
> +struct block_device *lookup_bdev(const char *pathname, int mask)
>  {
>   struct block_device *bdev;
>   struct inode *inode;
> @@ -2072,6 +2074,11 @@ struct block_device *lookup_bdev(const char *pathname)
>   return ERR_PTR(error);
>  
>   inode = d_backing_inode(path.dentry);
> + if (mask != 0 && !capable(CAP_SYS_ADMIN)) {
> + error = __inode_permission(inode, mask);
> + if (error)
> + goto fail;
> + }
>   error = -ENOTBLK;
>   if

Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()

2017-12-22 Thread Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:25PM +0100, Dongsu Park wrote:
> From: Seth Forshee 
> 
> When looking up a block device by path no permission check is
> done to verify that the user has access to the block device inode
> at the specified path. In some cases it may be necessary to
> check permissions towards the inode, such as allowing
> unprivileged users to mount block devices in user namespaces.
> 
> Add an argument to lookup_bdev() to optionally perform this
> permission check. A value of 0 skips the permission check and
> behaves the same as before. A non-zero value specifies the mask
> of access rights required towards the inode at the specified
> path. The check is always skipped if the user has CAP_SYS_ADMIN.
> 
> All callers of lookup_bdev() currently pass a mask of 0, so this
> patch results in no functional change. Subsequent patches will
> add permission checks where appropriate.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8943601/
> 
> Cc: dm-de...@redhat.com
> Cc: linux-bca...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro 
> Cc: Jan Kara 
> Cc: Serge Hallyn 

Acked-by: Serge Hallyn 

> Signed-off-by: Seth Forshee 
> Signed-off-by: Dongsu Park 
> ---
>  drivers/md/bcache/super.c |  2 +-
>  drivers/md/dm-table.c |  2 +-
>  drivers/mtd/mtdsuper.c|  2 +-
>  fs/block_dev.c| 13 ++---
>  fs/quota/quota.c  |  2 +-
>  include/linux/fs.h|  2 +-
>  6 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index b4d28928..acc9d56c 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, 
> struct kobj_attribute *attr,
> sb);
>   if (IS_ERR(bdev)) {
>   if (bdev == ERR_PTR(-EBUSY)) {
> - bdev = lookup_bdev(strim(path));
> + bdev = lookup_bdev(strim(path), 0);
>   mutex_lock(_register_lock);
>   if (!IS_ERR(bdev) && bch_is_open(bdev))
>   err = "device already registered";
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 88130b5d..bca5eaf4 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -410,7 +410,7 @@ dev_t dm_get_dev_t(const char *path)
>   dev_t dev;
>   struct block_device *bdev;
>  
> - bdev = lookup_bdev(path);
> + bdev = lookup_bdev(path, 0);
>   if (IS_ERR(bdev))
>   dev = name_to_dev_t(path);
>   else {
> diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
> index e43fea89..4a4d40c0 100644
> --- a/drivers/mtd/mtdsuper.c
> +++ b/drivers/mtd/mtdsuper.c
> @@ -180,7 +180,7 @@ struct dentry *mount_mtd(struct file_system_type 
> *fs_type, int flags,
>   /* try the old way - the hack where we allowed users to mount
>* /dev/mtdblock$(n) but didn't actually _use_ the blockdev
>*/
> - bdev = lookup_bdev(dev_name);
> + bdev = lookup_bdev(dev_name, 0);
>   if (IS_ERR(bdev)) {
>   ret = PTR_ERR(bdev);
>   pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4a181fcb..5ca06095 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1662,7 +1662,7 @@ struct block_device *blkdev_get_by_path(const char 
> *path, fmode_t mode,
>   struct block_device *bdev;
>   int err;
>  
> - bdev = lookup_bdev(path);
> + bdev = lookup_bdev(path, 0);
>   if (IS_ERR(bdev))
>   return bdev;
>  
> @@ -2052,12 +2052,14 @@ EXPORT_SYMBOL(ioctl_by_bdev);
>  /**
>   * lookup_bdev  - lookup a struct block_device by name
>   * @pathname:special file representing the block device
> + * @mask:rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
>   *
>   * Get a reference to the blockdevice at @pathname in the current
>   * namespace if possible and return it.  Return ERR_PTR(error)
> - * otherwise.
> + * otherwise.  If @mask is non-zero, check for access rights to the
> + * inode at @pathname.
>   */
> -struct block_device *lookup_bdev(const char *pathname)
> +struct block_device *lookup_bdev(const char *pathname, int mask)
>  {
>   struct block_device *bdev;
>   struct inode *inode;
> @@ -2072,6 +2074,11 @@ struct block_device *lookup_bdev(const char *pathname)
>   return ERR_PTR(error);
>  
>   inode = d_backing_inode(path.dentry);
> + if (mask != 0 && !capable(CAP_SYS_ADMIN)) {
> + error = __inode_permission(inode, mask);
> + if (error)
> + goto fail;
> + }
>   error = -ENOTBLK;
>   if (!S_ISBLK(inode->i_mode))
>   goto fail;
> diff --git a/fs/quota/quota.c b/fs/quota/quota.c
> index 43612e2a..e5d47955 100644
> --- a/fs/quota/quota.c

Re: [PATCH v20 3/7 RESEND] xbitmap: add more operations

2017-12-22 Thread Tetsuo Handa

Matthew Wilcox wrote:
> +/**
> + * xb_set_bit() - Set a bit in the XBitmap.
> + * @xb: The XBitmap.
> + * @bit: Index of the bit to set.
> + *
> + * This function is used to set a bit in the xbitmap.
> + *
> + * Return: 0 on success. -ENOMEM if memory could not be allocated.
> + */
> +int xb_set_bit(struct xb *xb, unsigned long bit)
> +{
> + unsigned long index = bit / IDA_BITMAP_BITS;
> + struct radix_tree_root *root = >xbrt;
> + struct radix_tree_iter iter;
> + void __rcu **slot;
> + struct ida_bitmap *bitmap;
> +
> + bit %= IDA_BITMAP_BITS;
> + radix_tree_iter_init(, index);
> + slot = idr_get_free_cmn(root, , GFP_NOWAIT | __GFP_NOWARN, index);
> + if (IS_ERR(slot)) {
> + if (slot == ERR_PTR(-ENOSPC))
> + return 0;   /* Already set */

Why already set? I guess something is there, but is it guaranteed that
there is a bitmap with the "bit" set?

> + return -ENOMEM;
> + }
> + bitmap = rcu_dereference_raw(*slot);
> + if (!bitmap) {
> + bitmap = this_cpu_xchg(ida_bitmap, NULL);
> + if (!bitmap)
> + return -ENOMEM;

I can't understand this. I can understand if it were

  BUG_ON(!bitmap);

because you called xb_preload().

But

/*
 * Regular test 2
 * set bit 2000, 2001, 2040
 * Next 1 in [0, 2048)  --> 2000
 * Next 1 in [2000, 2002)   --> 2000
 * Next 1 in [2002, 2041)   --> 2040
 * Next 1 in [2002, 2040)   --> none
 * Next 0 in [2000, 2048)   --> 2002
 * Next 0 in [2048, 2060)   --> 2048
 */
xb_preload(GFP_KERNEL);
assert(!xb_set_bit(, 2000));
assert(!xb_set_bit(, 2001));
assert(!xb_set_bit(, 2040));
nbit = 0;
assert(xb_find_set(, 2048, ) == true);
assert(nbit == 2000);
assert(xb_find_set(, 2002, ) == true);
assert(nbit == 2000);
nbit = 2002;
assert(xb_find_set(, 2041, ) == true);
assert(nbit == 2040);
nbit = 2002;
assert(xb_find_set(, 2040, ) == true);
assert(nbit == 2040);
nbit = 2000;
assert(xb_find_zero(, 2048, ) == true);
assert(nbit == 2002);
nbit = 2048;
assert(xb_find_zero(, 2060, ) == true);
assert(nbit == 2048);
xb_zero(, 0, 2047);
nbit = 0;
assert(xb_find_set(, 2048, ) == false);
assert(nbit == 0);
xb_preload_end();

you are not calling xb_preload() prior to each xb_set_bit() call.
This means that, if each xb_set_bit() is not surrounded with
xb_preload()/xb_preload_end(), there is possibility of hitting
this_cpu_xchg(ida_bitmap, NULL) == NULL.

If bitmap == NULL at this_cpu_xchg(ida_bitmap, NULL) is allowed,
you can use kzalloc(sizeof(*bitmap), GFP_NOWAIT | __GFP_NOWARN)
and get rid of xb_preload()/xb_preload_end().

You are using idr_get_free_cmn(GFP_NOWAIT | __GFP_NOWARN), which
means that the caller has to be prepared for allocation failure
when calling xb_set_bit(). Thus, there is no need to use preload
in order to avoid failing to allocate "bitmap".

Also, please clarify why it is OK to just return here.
I don't know what

  radix_tree_iter_replace(root, , slot, bitmap);

is doing. If you created a slot but did not assign "bitmap",
what the caller of xb_test_bit() etc. will find? If there is an
assumption about this slot, won't this cause a problem?

> + memset(bitmap, 0, sizeof(*bitmap));
> + radix_tree_iter_replace(root, , slot, bitmap);
> + }
> +
> + __set_bit(bit, bitmap->bitmap);
> + if (bitmap_full(bitmap->bitmap, IDA_BITMAP_BITS))
> + radix_tree_iter_tag_clear(root, , IDR_FREE);
> + return 0;
> +}

Re: [PATCH v20 3/7 RESEND] xbitmap: add more operations

2017-12-22 Thread Tetsuo Handa

Matthew Wilcox wrote:
> +/**
> + * xb_set_bit() - Set a bit in the XBitmap.
> + * @xb: The XBitmap.
> + * @bit: Index of the bit to set.
> + *
> + * This function is used to set a bit in the xbitmap.
> + *
> + * Return: 0 on success. -ENOMEM if memory could not be allocated.
> + */
> +int xb_set_bit(struct xb *xb, unsigned long bit)
> +{
> + unsigned long index = bit / IDA_BITMAP_BITS;
> + struct radix_tree_root *root = >xbrt;
> + struct radix_tree_iter iter;
> + void __rcu **slot;
> + struct ida_bitmap *bitmap;
> +
> + bit %= IDA_BITMAP_BITS;
> + radix_tree_iter_init(, index);
> + slot = idr_get_free_cmn(root, , GFP_NOWAIT | __GFP_NOWARN, index);
> + if (IS_ERR(slot)) {
> + if (slot == ERR_PTR(-ENOSPC))
> + return 0;   /* Already set */

Why already set? I guess something is there, but is it guaranteed that
there is a bitmap with the "bit" set?

> + return -ENOMEM;
> + }
> + bitmap = rcu_dereference_raw(*slot);
> + if (!bitmap) {
> + bitmap = this_cpu_xchg(ida_bitmap, NULL);
> + if (!bitmap)
> + return -ENOMEM;

I can't understand this. I can understand if it were

  BUG_ON(!bitmap);

because you called xb_preload().

But

/*
 * Regular test 2
 * set bit 2000, 2001, 2040
 * Next 1 in [0, 2048)  --> 2000
 * Next 1 in [2000, 2002)   --> 2000
 * Next 1 in [2002, 2041)   --> 2040
 * Next 1 in [2002, 2040)   --> none
 * Next 0 in [2000, 2048)   --> 2002
 * Next 0 in [2048, 2060)   --> 2048
 */
xb_preload(GFP_KERNEL);
assert(!xb_set_bit(, 2000));
assert(!xb_set_bit(, 2001));
assert(!xb_set_bit(, 2040));
nbit = 0;
assert(xb_find_set(, 2048, ) == true);
assert(nbit == 2000);
assert(xb_find_set(, 2002, ) == true);
assert(nbit == 2000);
nbit = 2002;
assert(xb_find_set(, 2041, ) == true);
assert(nbit == 2040);
nbit = 2002;
assert(xb_find_set(, 2040, ) == true);
assert(nbit == 2040);
nbit = 2000;
assert(xb_find_zero(, 2048, ) == true);
assert(nbit == 2002);
nbit = 2048;
assert(xb_find_zero(, 2060, ) == true);
assert(nbit == 2048);
xb_zero(, 0, 2047);
nbit = 0;
assert(xb_find_set(, 2048, ) == false);
assert(nbit == 0);
xb_preload_end();

you are not calling xb_preload() prior to each xb_set_bit() call.
This means that, if each xb_set_bit() is not surrounded with
xb_preload()/xb_preload_end(), there is possibility of hitting
this_cpu_xchg(ida_bitmap, NULL) == NULL.

If bitmap == NULL at this_cpu_xchg(ida_bitmap, NULL) is allowed,
you can use kzalloc(sizeof(*bitmap), GFP_NOWAIT | __GFP_NOWARN)
and get rid of xb_preload()/xb_preload_end().

You are using idr_get_free_cmn(GFP_NOWAIT | __GFP_NOWARN), which
means that the caller has to be prepared for allocation failure
when calling xb_set_bit(). Thus, there is no need to use preload
in order to avoid failing to allocate "bitmap".

Also, please clarify why it is OK to just return here.
I don't know what

  radix_tree_iter_replace(root, , slot, bitmap);

is doing. If you created a slot but did not assign "bitmap",
what the caller of xb_test_bit() etc. will find? If there is an
assumption about this slot, won't this cause a problem?

> + memset(bitmap, 0, sizeof(*bitmap));
> + radix_tree_iter_replace(root, , slot, bitmap);
> + }
> +
> + __set_bit(bit, bitmap->bitmap);
> + if (bitmap_full(bitmap->bitmap, IDA_BITMAP_BITS))
> + radix_tree_iter_tag_clear(root, , IDR_FREE);
> + return 0;
> +}

Re: [RFC][PATCH][REGRESSION FIX] x86/ftrace: Add ORC annotation to 64 bit ftrace assembly

2017-12-22 Thread Josh Poimboeuf

On Fri, Dec 22, 2017 at 01:38:01PM -0500, Steven Rostedt wrote:
> On Fri, 22 Dec 2017 12:30:24 -0600
> Josh Poimboeuf  wrote:
> 
> > I'm officially on vacation but I got a chance to look at this myself a
> > few minutes ago.  This looks mostly right.  In theory, you should able
> > to mark those as functions by changing END to ENDPROC.  Then you won't
> > need all the UNWIND_HINT_FUNCs.
> 
> Yep, that was my first solution, but as you found, objtool complained
> about it.
> 
> > 
> > I tried making that change, but objtool thinks the stack frame is still
> > modified when the functions return.  I didn't see anything obvious in
> > save_mcount_regs or restore_mcount_regs that should cause it to think
> > that.  I'll need to look into it a little more.
> 
> Thanks!
> 
> Worse comes to worse, we can keep this change, and you can enjoy your
> holidays. I just need this fixed before 4.15 is released.
> 
> I'll be jumping into objtool shortly, to see if I can merge the code
> from recordmcount.c with it too. I'm going to need to learn ORC and
> DWARF to start on extending the function tracer to act more like
> tracepoints (like I discussed with Linus in Prague).

Objtool doesn't like the fact that save_mcount_regs pushes rbp at the
beginning, but it's never restored (directly, at least).

How about something like this instead, where it skips the original rbp
push?  I didn't test it, but I think it should work.  It at least gets
rid of the warnings and doesn't need any unwind hints other than the
UNWIND_HINT_EMPTY in return_to_handler.


diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 81bb565f4497..7e2baf7304ae 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -29,10 +29,13 @@ KASAN_SANITIZE_stacktrace.o := n
 KASAN_SANITIZE_paravirt.o  := n
 
 OBJECT_FILES_NON_STANDARD_relocate_kernel_$(BITS).o:= y
-OBJECT_FILES_NON_STANDARD_ftrace_$(BITS).o := y
 OBJECT_FILES_NON_STANDARD_test_nx.o:= y
 OBJECT_FILES_NON_STANDARD_paravirt_patch_$(BITS).o := y
 
+ifdef CONFIG_FRAME_POINTER
+OBJECT_FILES_NON_STANDARD_ftrace_$(BITS).o := y
+endif
+
 # If instrumentation of this dir is enabled, boot hangs during first second.
 # Probably could be more selective here, but note that files related to irqs,
 # boot, dumpstack/stacktrace, etc are either non-interesting or can lead to
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index c832291d948a..38b079626018 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 
.code64
@@ -20,7 +21,6 @@ EXPORT_SYMBOL(__fentry__)
 EXPORT_SYMBOL(mcount)
 #endif
 
-/* All cases save the original rbp (8 bytes) */
 #ifdef CONFIG_FRAME_POINTER
 # ifdef CC_USING_FENTRY
 /* Save parent and function stack frames (rip and rbp) */
@@ -31,7 +31,7 @@ EXPORT_SYMBOL(mcount)
 # endif
 #else
 /* No need to save a stack frame */
-# define MCOUNT_FRAME_SIZE 8
+# define MCOUNT_FRAME_SIZE 0
 #endif /* CONFIG_FRAME_POINTER */
 
 /* Size of stack used to save mcount regs in save_mcount_regs */
@@ -64,10 +64,10 @@ EXPORT_SYMBOL(mcount)
  */
 .macro save_mcount_regs added=0
 
-   /* Always save the original rbp */
+#ifdef CONFIG_FRAME_POINTER
+   /* Save the original rbp */
pushq %rbp
 
-#ifdef CONFIG_FRAME_POINTER
/*
 * Stack traces will stop at the ftrace trampoline if the frame pointer
 * is not set up properly. If fentry is used, we need to save a frame
@@ -105,7 +105,11 @@ EXPORT_SYMBOL(mcount)
 * Save the original RBP. Even though the mcount ABI does not
 * require this, it helps out callers.
 */
+#ifdef CONFIG_FRAME_POINTER
movq MCOUNT_REG_SIZE-8(%rsp), %rdx
+#else
+   movq %rbp, %rdx
+#endif
movq %rdx, RBP(%rsp)
 
/* Copy the parent address into %rsi (second parameter) */
@@ -148,7 +152,7 @@ EXPORT_SYMBOL(mcount)
 
 ENTRY(function_hook)
retq
-END(function_hook)
+ENDPROC(function_hook)
 
 ENTRY(ftrace_caller)
/* save_mcount_regs fills in first two parameters */
@@ -184,7 +188,7 @@ GLOBAL(ftrace_graph_call)
 /* This is weak to keep gas from relaxing the jumps */
 WEAK(ftrace_stub)
retq
-END(ftrace_caller)
+ENDPROC(ftrace_caller)
 
 ENTRY(ftrace_regs_caller)
/* Save the current flags before any operations that can change them */
@@ -255,7 +259,7 @@ GLOBAL(ftrace_regs_caller_end)
 
jmp ftrace_epilogue
 
-END(ftrace_regs_caller)
+ENDPROC(ftrace_regs_caller)
 
 
 #else /* ! CONFIG_DYNAMIC_FTRACE */
@@ -313,9 +317,10 @@ ENTRY(ftrace_graph_caller)
restore_mcount_regs
 
retq
-END(ftrace_graph_caller)
+ENDPROC(ftrace_graph_caller)
 
-GLOBAL(return_to_handler)
+ENTRY(return_to_handler)
+   UNWIND_HINT_EMPTY
subq  $24, %rsp
 
/* Save the return values */
@@ -330,4 +335,5 @@

Re: [RFC][PATCH][REGRESSION FIX] x86/ftrace: Add ORC annotation to 64 bit ftrace assembly

2017-12-22 Thread Josh Poimboeuf

On Fri, Dec 22, 2017 at 01:38:01PM -0500, Steven Rostedt wrote:
> On Fri, 22 Dec 2017 12:30:24 -0600
> Josh Poimboeuf  wrote:
> 
> > I'm officially on vacation but I got a chance to look at this myself a
> > few minutes ago.  This looks mostly right.  In theory, you should able
> > to mark those as functions by changing END to ENDPROC.  Then you won't
> > need all the UNWIND_HINT_FUNCs.
> 
> Yep, that was my first solution, but as you found, objtool complained
> about it.
> 
> > 
> > I tried making that change, but objtool thinks the stack frame is still
> > modified when the functions return.  I didn't see anything obvious in
> > save_mcount_regs or restore_mcount_regs that should cause it to think
> > that.  I'll need to look into it a little more.
> 
> Thanks!
> 
> Worse comes to worse, we can keep this change, and you can enjoy your
> holidays. I just need this fixed before 4.15 is released.
> 
> I'll be jumping into objtool shortly, to see if I can merge the code
> from recordmcount.c with it too. I'm going to need to learn ORC and
> DWARF to start on extending the function tracer to act more like
> tracepoints (like I discussed with Linus in Prague).

Objtool doesn't like the fact that save_mcount_regs pushes rbp at the
beginning, but it's never restored (directly, at least).

How about something like this instead, where it skips the original rbp
push?  I didn't test it, but I think it should work.  It at least gets
rid of the warnings and doesn't need any unwind hints other than the
UNWIND_HINT_EMPTY in return_to_handler.


diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 81bb565f4497..7e2baf7304ae 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -29,10 +29,13 @@ KASAN_SANITIZE_stacktrace.o := n
 KASAN_SANITIZE_paravirt.o  := n
 
 OBJECT_FILES_NON_STANDARD_relocate_kernel_$(BITS).o:= y
-OBJECT_FILES_NON_STANDARD_ftrace_$(BITS).o := y
 OBJECT_FILES_NON_STANDARD_test_nx.o:= y
 OBJECT_FILES_NON_STANDARD_paravirt_patch_$(BITS).o := y
 
+ifdef CONFIG_FRAME_POINTER
+OBJECT_FILES_NON_STANDARD_ftrace_$(BITS).o := y
+endif
+
 # If instrumentation of this dir is enabled, boot hangs during first second.
 # Probably could be more selective here, but note that files related to irqs,
 # boot, dumpstack/stacktrace, etc are either non-interesting or can lead to
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index c832291d948a..38b079626018 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 
.code64
@@ -20,7 +21,6 @@ EXPORT_SYMBOL(__fentry__)
 EXPORT_SYMBOL(mcount)
 #endif
 
-/* All cases save the original rbp (8 bytes) */
 #ifdef CONFIG_FRAME_POINTER
 # ifdef CC_USING_FENTRY
 /* Save parent and function stack frames (rip and rbp) */
@@ -31,7 +31,7 @@ EXPORT_SYMBOL(mcount)
 # endif
 #else
 /* No need to save a stack frame */
-# define MCOUNT_FRAME_SIZE 8
+# define MCOUNT_FRAME_SIZE 0
 #endif /* CONFIG_FRAME_POINTER */
 
 /* Size of stack used to save mcount regs in save_mcount_regs */
@@ -64,10 +64,10 @@ EXPORT_SYMBOL(mcount)
  */
 .macro save_mcount_regs added=0
 
-   /* Always save the original rbp */
+#ifdef CONFIG_FRAME_POINTER
+   /* Save the original rbp */
pushq %rbp
 
-#ifdef CONFIG_FRAME_POINTER
/*
 * Stack traces will stop at the ftrace trampoline if the frame pointer
 * is not set up properly. If fentry is used, we need to save a frame
@@ -105,7 +105,11 @@ EXPORT_SYMBOL(mcount)
 * Save the original RBP. Even though the mcount ABI does not
 * require this, it helps out callers.
 */
+#ifdef CONFIG_FRAME_POINTER
movq MCOUNT_REG_SIZE-8(%rsp), %rdx
+#else
+   movq %rbp, %rdx
+#endif
movq %rdx, RBP(%rsp)
 
/* Copy the parent address into %rsi (second parameter) */
@@ -148,7 +152,7 @@ EXPORT_SYMBOL(mcount)
 
 ENTRY(function_hook)
retq
-END(function_hook)
+ENDPROC(function_hook)
 
 ENTRY(ftrace_caller)
/* save_mcount_regs fills in first two parameters */
@@ -184,7 +188,7 @@ GLOBAL(ftrace_graph_call)
 /* This is weak to keep gas from relaxing the jumps */
 WEAK(ftrace_stub)
retq
-END(ftrace_caller)
+ENDPROC(ftrace_caller)
 
 ENTRY(ftrace_regs_caller)
/* Save the current flags before any operations that can change them */
@@ -255,7 +259,7 @@ GLOBAL(ftrace_regs_caller_end)
 
jmp ftrace_epilogue
 
-END(ftrace_regs_caller)
+ENDPROC(ftrace_regs_caller)
 
 
 #else /* ! CONFIG_DYNAMIC_FTRACE */
@@ -313,9 +317,10 @@ ENTRY(ftrace_graph_caller)
restore_mcount_regs
 
retq
-END(ftrace_graph_caller)
+ENDPROC(ftrace_graph_caller)
 
-GLOBAL(return_to_handler)
+ENTRY(return_to_handler)
+   UNWIND_HINT_EMPTY
subq  $24, %rsp
 
/* Save the return values */
@@ -330,4 +335,5 @@

[PATCH] usbip: fix vudc_rx: harden CMD_SUBMIT path to handle malicious input

2017-12-22 Thread Shuah Khan

Harden CMD_SUBMIT path to handle malicious input that could trigger
large memory allocations. Add checks to validate transfer_buffer_length
and number_of_packets to protect against bad input requesting for
unbounded memory allocations.

Signed-off-by: Shuah Khan 
---
 drivers/usb/usbip/vudc_rx.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/usb/usbip/vudc_rx.c b/drivers/usb/usbip/vudc_rx.c
index df1e30989148..1e8a23d92cb4 100644
--- a/drivers/usb/usbip/vudc_rx.c
+++ b/drivers/usb/usbip/vudc_rx.c
@@ -120,6 +120,25 @@ static int v_recv_cmd_submit(struct vudc *udc,
urb_p->new = 1;
urb_p->seqnum = pdu->base.seqnum;
 
+   if (urb_p->ep->type == USB_ENDPOINT_XFER_ISOC) {
+   /* validate packet size and number of packets */
+   unsigned int maxp, packets, bytes;
+
+   maxp = usb_endpoint_maxp(urb_p->ep->desc);
+   maxp *= usb_endpoint_maxp_mult(urb_p->ep->desc);
+   bytes = pdu->u.cmd_submit.transfer_buffer_length;
+   packets = DIV_ROUND_UP(bytes, maxp);
+
+   if (pdu->u.cmd_submit.number_of_packets < 0 ||
+   pdu->u.cmd_submit.number_of_packets > packets) {
+   dev_err(>gadget.dev,
+   "CMD_SUBMIT: isoc invalid num packets %d\n",
+   pdu->u.cmd_submit.number_of_packets);
+   ret = -EMSGSIZE;
+   goto free_urbp;
+   }
+   }
+
ret = alloc_urb_from_cmd(_p->urb, pdu, urb_p->ep->type);
if (ret) {
usbip_event_add(>ud, VUDC_EVENT_ERROR_MALLOC);
-- 
2.14.1

[PATCH] usbip: fix vudc_rx: harden CMD_SUBMIT path to handle malicious input

2017-12-22 Thread Shuah Khan

Harden CMD_SUBMIT path to handle malicious input that could trigger
large memory allocations. Add checks to validate transfer_buffer_length
and number_of_packets to protect against bad input requesting for
unbounded memory allocations.

Signed-off-by: Shuah Khan 
---
 drivers/usb/usbip/vudc_rx.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/usb/usbip/vudc_rx.c b/drivers/usb/usbip/vudc_rx.c
index df1e30989148..1e8a23d92cb4 100644
--- a/drivers/usb/usbip/vudc_rx.c
+++ b/drivers/usb/usbip/vudc_rx.c
@@ -120,6 +120,25 @@ static int v_recv_cmd_submit(struct vudc *udc,
urb_p->new = 1;
urb_p->seqnum = pdu->base.seqnum;
 
+   if (urb_p->ep->type == USB_ENDPOINT_XFER_ISOC) {
+   /* validate packet size and number of packets */
+   unsigned int maxp, packets, bytes;
+
+   maxp = usb_endpoint_maxp(urb_p->ep->desc);
+   maxp *= usb_endpoint_maxp_mult(urb_p->ep->desc);
+   bytes = pdu->u.cmd_submit.transfer_buffer_length;
+   packets = DIV_ROUND_UP(bytes, maxp);
+
+   if (pdu->u.cmd_submit.number_of_packets < 0 ||
+   pdu->u.cmd_submit.number_of_packets > packets) {
+   dev_err(>gadget.dev,
+   "CMD_SUBMIT: isoc invalid num packets %d\n",
+   pdu->u.cmd_submit.number_of_packets);
+   ret = -EMSGSIZE;
+   goto free_urbp;
+   }
+   }
+
ret = alloc_urb_from_cmd(_p->urb, pdu, urb_p->ep->type);
if (ret) {
usbip_event_add(>ud, VUDC_EVENT_ERROR_MALLOC);
-- 
2.14.1

[PATCH] usbip: vudc_tx: fix v_send_ret_submit() vulnerability to null xfer buffer

2017-12-22 Thread Shuah Khan

v_send_ret_submit() handles urb with a null transfer_buffer, when it
replays a packet with potential malicious data that could contain a
null buffer.

Add a check for the condition when actual_length > 0 and transfer_buffer
is null.

Signed-off-by: Shuah Khan 
---
 drivers/usb/usbip/vudc_tx.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/usb/usbip/vudc_tx.c b/drivers/usb/usbip/vudc_tx.c
index 1440ae0919ec..3ccb17c3e840 100644
--- a/drivers/usb/usbip/vudc_tx.c
+++ b/drivers/usb/usbip/vudc_tx.c
@@ -85,6 +85,13 @@ static int v_send_ret_submit(struct vudc *udc, struct urbp 
*urb_p)
memset(_header, 0, sizeof(pdu_header));
memset(, 0, sizeof(msg));
 
+   if (urb->actual_length > 0 && !urb->transfer_buffer) {
+   dev_err(>gadget.dev,
+   "urb: actual_length %d transfer_buffer null\n",
+   urb->actual_length);
+   return -1;
+   }
+
if (urb_p->type == USB_ENDPOINT_XFER_ISOC)
iovnum = 2 + urb->number_of_packets;
else
@@ -100,8 +107,8 @@ static int v_send_ret_submit(struct vudc *udc, struct urbp 
*urb_p)
 
/* 1. setup usbip_header */
setup_ret_submit_pdu(_header, urb_p);
-   usbip_dbg_stub_tx("setup txdata seqnum: %d urb: %p\n",
- pdu_header.base.seqnum, urb);
+   usbip_dbg_stub_tx("setup txdata seqnum: %d\n",
+ pdu_header.base.seqnum);
usbip_header_correct_endian(_header, 1);
 
iov[iovnum].iov_base = _header;
-- 
2.14.1

[PATCH] usbip: vudc_tx: fix v_send_ret_submit() vulnerability to null xfer buffer

2017-12-22 Thread Shuah Khan

v_send_ret_submit() handles urb with a null transfer_buffer, when it
replays a packet with potential malicious data that could contain a
null buffer.

Add a check for the condition when actual_length > 0 and transfer_buffer
is null.

Signed-off-by: Shuah Khan 
---
 drivers/usb/usbip/vudc_tx.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/usb/usbip/vudc_tx.c b/drivers/usb/usbip/vudc_tx.c
index 1440ae0919ec..3ccb17c3e840 100644
--- a/drivers/usb/usbip/vudc_tx.c
+++ b/drivers/usb/usbip/vudc_tx.c
@@ -85,6 +85,13 @@ static int v_send_ret_submit(struct vudc *udc, struct urbp 
*urb_p)
memset(_header, 0, sizeof(pdu_header));
memset(, 0, sizeof(msg));
 
+   if (urb->actual_length > 0 && !urb->transfer_buffer) {
+   dev_err(>gadget.dev,
+   "urb: actual_length %d transfer_buffer null\n",
+   urb->actual_length);
+   return -1;
+   }
+
if (urb_p->type == USB_ENDPOINT_XFER_ISOC)
iovnum = 2 + urb->number_of_packets;
else
@@ -100,8 +107,8 @@ static int v_send_ret_submit(struct vudc *udc, struct urbp 
*urb_p)
 
/* 1. setup usbip_header */
setup_ret_submit_pdu(_header, urb_p);
-   usbip_dbg_stub_tx("setup txdata seqnum: %d urb: %p\n",
- pdu_header.base.seqnum, urb);
+   usbip_dbg_stub_tx("setup txdata seqnum: %d\n",
+ pdu_header.base.seqnum);
usbip_header_correct_endian(_header, 1);
 
iov[iovnum].iov_base = _header;
-- 
2.14.1

mmotm 2017-12-22-17-55 uploaded

2017-12-22 Thread akpm

The mm-of-the-moment snapshot 2017-12-22-17-55 has been uploaded to

   http://www.ozlabs.org/~akpm/mmotm/

mmotm-readme.txt says

README for mm-of-the-moment:

http://www.ozlabs.org/~akpm/mmotm/

This is a snapshot of my -mm patch queue.  Uploaded at random hopefully
more than once a week.

You will need quilt to apply these patches to the latest Linus release (4.x
or 4.x-rcY).  The series file is in broken-out.tar.gz and is duplicated in
http://ozlabs.org/~akpm/mmotm/series

The file broken-out.tar.gz contains two datestamp files: .DATE and
.DATE--mm-dd-hh-mm-ss.  Both contain the string -mm-dd-hh-mm-ss,
followed by the base kernel version against which this patch series is to
be applied.

This tree is partially included in linux-next.  To see which patches are
included in linux-next, consult the `series' file.  Only the patches
within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
linux-next.

A git tree which contains the memory management portion of this tree is
maintained at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
by Michal Hocko.  It contains the patches which are between the
"#NEXT_PATCHES_START mm" and "#NEXT_PATCHES_END" markers, from the series
file, http://www.ozlabs.org/~akpm/mmotm/series.


A full copy of the full kernel tree with the linux-next and mmotm patches
already applied is available through git within an hour of the mmotm
release.  Individual mmotm releases are tagged.  The master branch always
points to the latest release, so it's constantly rebasing.

http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/

To develop on top of mmotm git:

  $ git remote add mmotm 
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
  $ git remote update mmotm
  $ git checkout -b topic mmotm/master
  
  $ git send-email mmotm/master.. [...]

To rebase a branch with older patches to a new mmotm release:

  $ git remote update mmotm
  $ git rebase --onto mmotm/master  topic




The directory http://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second)
contains daily snapshots of the -mm tree.  It is updated more frequently
than mmotm, and is untested.

A git copy of this tree is available at

http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/

and use of this tree is similar to
http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/, described above.


This mmotm tree contains the following patches against 4.15-rc4:
(patches marked "*" will be included in linux-next)

  origin.patch
  i-need-old-gcc.patch
* scripts-decodecode-fix-decoding-for-aarch64-arm64-instructions.patch
* mm-skip-hwpoisoned-pages-when-onlining-pages.patch
* ubsan-dont-handle-misaligned-address-when-support-unaligned-access.patch
* ubsan-dont-handle-misaligned-address-when-support-unaligned-access-v2.patch
* mm-check-pfn_valid-first-in-zero_resv_unavail.patch
* acct-fix-the-acct-needcheck-check-in-check_free_space.patch
* mm-mprotect-add-a-cond_resched-inside-change_pmd_range.patch
* kernel-exitc-export-abort-to-modules.patch
* provide-useful-debugging-information-for-vm_bug.patch
* zsmalloc-add-fsh-include.patch
* mm-sparsec-wrong-allocation-for-mem_section.patch
* arm-arch-arm-include-asm-pageh-needs-personalityh.patch
* prctl-add-pr_et_pdeathsig_proc.patch
* scripts-decodecode-make-it-take-multiline-code-line.patch
* scripts-tags-change-find_other_sources-for-include-folders.patch
* ntfs-remove-i_version-handling.patch
* ocfs2-dlm-clean-dead-code-up.patch
* ocfs2-cluster-neaten-a-member-of-o2net_msg_handler.patch
* ocfs2-give-an-obvious-tip-for-dismatch-cluster-names.patch
* ocfs2-cluster-close-a-race-that-fence-cant-be-triggered.patch
* 
ocfs2-using-the-ocfs2_xattr_root_size-macro-in-ocfs2_reflink_xattr_header.patch
* ocfs2-clean-dead-code-in-suballocc.patch
* ocfs2-get-rid-of-ocfs2_is_o2cb_active-function.patch
* ocfs2-move-some-definitions-to-header-file.patch
* ocfs2-fix-some-small-problems.patch
* ocfs2-add-kobject-for-online-file-check.patch
* ocfs2-add-duplicative-ino-number-check.patch
* ocfs2-add-ocfs2_try_rw_lock-and-ocfs2_try_inode_lock.patch
* ocfs2-add-ocfs2_overwrite_io-function.patch
* ocfs2-nowait-aio-support.patch
* ocfs2-add-trimfs-dlm-lock-resource.patch
* ocfs2-add-trimfs-lock-to-avoid-duplicated-trims-in-cluster.patch
* ocfs2-dont-merge-rightmost-extent-block-if-it-was-locked.patch
* 
ocfs2-fall-back-to-buffer-io-when-append-dio-is-disabled-with-file-hole-existing.patch
* 
ocfs2-fall-back-to-buffer-io-when-append-dio-is-disabled-with-file-hole-existing-fix.patch
* 
block-restore-proc-partitions-to-not-display-non-partitionable-removable-devices.patch
* dentry-fix-kmemcheck-splat-at-take_dentry_name_snapshot.patch
  mm.patch
* mm-terminate-shrink_slab-loop-if-signal-is-pending.patch
* mm-terminate-shrink_slab-loop-if-signal-is-pending-fix.patch
* mm-slab-make-calculate_alignment-function-static.patch
* mm-slab-remove-redundant-assignments-for-slab_state.patch
* include-linux-sched-mmh-uninline-mmdrop_async-etc.patch
* mm-kmemleak-remove-unused-hardirqh.patch
*

mmotm 2017-12-22-17-55 uploaded

2017-12-22 Thread akpm

The mm-of-the-moment snapshot 2017-12-22-17-55 has been uploaded to

   http://www.ozlabs.org/~akpm/mmotm/

mmotm-readme.txt says

README for mm-of-the-moment:

http://www.ozlabs.org/~akpm/mmotm/

This is a snapshot of my -mm patch queue.  Uploaded at random hopefully
more than once a week.

You will need quilt to apply these patches to the latest Linus release (4.x
or 4.x-rcY).  The series file is in broken-out.tar.gz and is duplicated in
http://ozlabs.org/~akpm/mmotm/series

The file broken-out.tar.gz contains two datestamp files: .DATE and
.DATE--mm-dd-hh-mm-ss.  Both contain the string -mm-dd-hh-mm-ss,
followed by the base kernel version against which this patch series is to
be applied.

This tree is partially included in linux-next.  To see which patches are
included in linux-next, consult the `series' file.  Only the patches
within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
linux-next.

A git tree which contains the memory management portion of this tree is
maintained at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
by Michal Hocko.  It contains the patches which are between the
"#NEXT_PATCHES_START mm" and "#NEXT_PATCHES_END" markers, from the series
file, http://www.ozlabs.org/~akpm/mmotm/series.


A full copy of the full kernel tree with the linux-next and mmotm patches
already applied is available through git within an hour of the mmotm
release.  Individual mmotm releases are tagged.  The master branch always
points to the latest release, so it's constantly rebasing.

http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/

To develop on top of mmotm git:

  $ git remote add mmotm 
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
  $ git remote update mmotm
  $ git checkout -b topic mmotm/master
  
  $ git send-email mmotm/master.. [...]

To rebase a branch with older patches to a new mmotm release:

  $ git remote update mmotm
  $ git rebase --onto mmotm/master  topic




The directory http://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second)
contains daily snapshots of the -mm tree.  It is updated more frequently
than mmotm, and is untested.

A git copy of this tree is available at

http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/

and use of this tree is similar to
http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/, described above.


This mmotm tree contains the following patches against 4.15-rc4:
(patches marked "*" will be included in linux-next)

  origin.patch
  i-need-old-gcc.patch
* scripts-decodecode-fix-decoding-for-aarch64-arm64-instructions.patch
* mm-skip-hwpoisoned-pages-when-onlining-pages.patch
* ubsan-dont-handle-misaligned-address-when-support-unaligned-access.patch
* ubsan-dont-handle-misaligned-address-when-support-unaligned-access-v2.patch
* mm-check-pfn_valid-first-in-zero_resv_unavail.patch
* acct-fix-the-acct-needcheck-check-in-check_free_space.patch
* mm-mprotect-add-a-cond_resched-inside-change_pmd_range.patch
* kernel-exitc-export-abort-to-modules.patch
* provide-useful-debugging-information-for-vm_bug.patch
* zsmalloc-add-fsh-include.patch
* mm-sparsec-wrong-allocation-for-mem_section.patch
* arm-arch-arm-include-asm-pageh-needs-personalityh.patch
* prctl-add-pr_et_pdeathsig_proc.patch
* scripts-decodecode-make-it-take-multiline-code-line.patch
* scripts-tags-change-find_other_sources-for-include-folders.patch
* ntfs-remove-i_version-handling.patch
* ocfs2-dlm-clean-dead-code-up.patch
* ocfs2-cluster-neaten-a-member-of-o2net_msg_handler.patch
* ocfs2-give-an-obvious-tip-for-dismatch-cluster-names.patch
* ocfs2-cluster-close-a-race-that-fence-cant-be-triggered.patch
* 
ocfs2-using-the-ocfs2_xattr_root_size-macro-in-ocfs2_reflink_xattr_header.patch
* ocfs2-clean-dead-code-in-suballocc.patch
* ocfs2-get-rid-of-ocfs2_is_o2cb_active-function.patch
* ocfs2-move-some-definitions-to-header-file.patch
* ocfs2-fix-some-small-problems.patch
* ocfs2-add-kobject-for-online-file-check.patch
* ocfs2-add-duplicative-ino-number-check.patch
* ocfs2-add-ocfs2_try_rw_lock-and-ocfs2_try_inode_lock.patch
* ocfs2-add-ocfs2_overwrite_io-function.patch
* ocfs2-nowait-aio-support.patch
* ocfs2-add-trimfs-dlm-lock-resource.patch
* ocfs2-add-trimfs-lock-to-avoid-duplicated-trims-in-cluster.patch
* ocfs2-dont-merge-rightmost-extent-block-if-it-was-locked.patch
* 
ocfs2-fall-back-to-buffer-io-when-append-dio-is-disabled-with-file-hole-existing.patch
* 
ocfs2-fall-back-to-buffer-io-when-append-dio-is-disabled-with-file-hole-existing-fix.patch
* 
block-restore-proc-partitions-to-not-display-non-partitionable-removable-devices.patch
* dentry-fix-kmemcheck-splat-at-take_dentry_name_snapshot.patch
  mm.patch
* mm-terminate-shrink_slab-loop-if-signal-is-pending.patch
* mm-terminate-shrink_slab-loop-if-signal-is-pending-fix.patch
* mm-slab-make-calculate_alignment-function-static.patch
* mm-slab-remove-redundant-assignments-for-slab_state.patch
* include-linux-sched-mmh-uninline-mmdrop_async-etc.patch
* mm-kmemleak-remove-unused-hardirqh.patch
*

Re: [PATCH 04/17] mm: pass the vmem_altmap to arch_add_memory and __add_pages

2017-12-22 Thread Dan Williams

On Fri, Dec 15, 2017 at 6:09 AM, Christoph Hellwig  wrote:
> We can just pass this on instead of having to do a radix tree lookup
> without proper locking 2 levels into the callchain.
>
> Signed-off-by: Christoph Hellwig 
[..]
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 8acdc35c2dfa..e26ade50ae18 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -772,12 +772,12 @@ static void update_end_of_memory_vars(u64 start, u64 
> size)
> }
>  }
>
> -int add_pages(int nid, unsigned long start_pfn,
> - unsigned long nr_pages, bool want_memblock)
> +int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> +   struct vmem_altmap *altmap, bool want_memblock)
>  {
> int ret;
>
> -   ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
> +   ret = __add_pages(nid, start_pfn, nr_pages, NULL, want_memblock);
> WARN_ON_ONCE(ret);

Should be 'altmap' instead of NULL.

Re: [PATCH 04/17] mm: pass the vmem_altmap to arch_add_memory and __add_pages

2017-12-22 Thread Dan Williams

On Fri, Dec 15, 2017 at 6:09 AM, Christoph Hellwig  wrote:
> We can just pass this on instead of having to do a radix tree lookup
> without proper locking 2 levels into the callchain.
>
> Signed-off-by: Christoph Hellwig 
[..]
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 8acdc35c2dfa..e26ade50ae18 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -772,12 +772,12 @@ static void update_end_of_memory_vars(u64 start, u64 
> size)
> }
>  }
>
> -int add_pages(int nid, unsigned long start_pfn,
> - unsigned long nr_pages, bool want_memblock)
> +int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> +   struct vmem_altmap *altmap, bool want_memblock)
>  {
> int ret;
>
> -   ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
> +   ret = __add_pages(nid, start_pfn, nr_pages, NULL, want_memblock);
> WARN_ON_ONCE(ret);

Should be 'altmap' instead of NULL.

Re: [PATCH v2 1/3] phy: core: Move runtime PM reference counting to the parent device

2017-12-22 Thread Rafael J. Wysocki

On Sat, Dec 23, 2017 at 2:35 AM, Rafael J. Wysocki  wrote:
> On Thu, Dec 21, 2017 at 11:50 AM, Ulf Hansson  wrote:
>> On 21 December 2017 at 02:39, Rafael J. Wysocki  wrote:
>>> On Wed, Dec 20, 2017 at 3:09 PM, Ulf Hansson  wrote:
 The runtime PM deployment in the phy core is deployed using the phy core
 device, which is created by the phy core and assigned as a child device of
 the phy provider device.

[cut]

>>
>> Also, I have considered how to deal with wakeup paths for phys,
>> although I didn't want to post changes as a part of this series, but
>> maybe I should to give a more complete picture?
>
> Yes, you should.
>
> The point is that without genpd using pm_runtime_force_suspend() the
> phy code could very well stay the way it is.  And it is logical,
> because having a parent with enabled runtime PM without enabling
> runtime PM for its children is at least conceptually questionable.

Actually, I sort of agree that the phy's usage of runtime PM is too
convoluted.  For example, it uses pm_runtime_enabled() unnecessarily
at least in some places, but that doesn't seem to be fixed by your
patches.

Thanks,
Rafael

Re: [PATCH v2 1/3] phy: core: Move runtime PM reference counting to the parent device

2017-12-22 Thread Rafael J. Wysocki

On Sat, Dec 23, 2017 at 2:35 AM, Rafael J. Wysocki  wrote:
> On Thu, Dec 21, 2017 at 11:50 AM, Ulf Hansson  wrote:
>> On 21 December 2017 at 02:39, Rafael J. Wysocki  wrote:
>>> On Wed, Dec 20, 2017 at 3:09 PM, Ulf Hansson  wrote:
 The runtime PM deployment in the phy core is deployed using the phy core
 device, which is created by the phy core and assigned as a child device of
 the phy provider device.

[cut]

>>
>> Also, I have considered how to deal with wakeup paths for phys,
>> although I didn't want to post changes as a part of this series, but
>> maybe I should to give a more complete picture?
>
> Yes, you should.
>
> The point is that without genpd using pm_runtime_force_suspend() the
> phy code could very well stay the way it is.  And it is logical,
> because having a parent with enabled runtime PM without enabling
> runtime PM for its children is at least conceptually questionable.

Actually, I sort of agree that the phy's usage of runtime PM is too
convoluted.  For example, it uses pm_runtime_enabled() unnecessarily
at least in some places, but that doesn't seem to be fixed by your
patches.

Thanks,
Rafael

Re: [PATCH 04/17] mm: pass the vmem_altmap to arch_add_memory and __add_pages

2017-12-22 Thread Dan Williams

On Fri, Dec 15, 2017 at 6:09 AM, Christoph Hellwig  wrote:
> We can just pass this on instead of having to do a radix tree lookup
> without proper locking 2 levels into the callchain.
>
> Signed-off-by: Christoph Hellwig 
[..]
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 403ab9cdb949..16456117a1b1 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -427,7 +427,7 @@ void *devm_memremap_pages(struct device *dev, struct 
> resource *res,
> goto err_pfn_remap;
>
> mem_hotplug_begin();
> -   error = arch_add_memory(nid, align_start, align_size, false);
> +   error = arch_add_memory(nid, align_start, align_size, altmap, false);
> if (!error)
> 
> move_pfn_range_to_zone(_DATA(nid)->node_zones[ZONE_DEVICE],
> align_start >> PAGE_SHIFT,

Subtle bug here. This altmap is the one that was passed in that we
copy into its permanent location in the pgmap, so it looks like this
patch needs to fold the following fix:

diff --git a/kernel/memremap.c b/kernel/memremap.c
index f277bf5b8c57..157a3756e1d5 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -382,6 +382,7 @@ void *devm_memremap_pages(struct device *dev,
struct resource *res,
if (altmap) {
memcpy(_map->altmap, altmap, sizeof(*altmap));
pgmap->altmap = _map->altmap;
+   altmap = pgmap->altmap;
}
pgmap->ref = ref;
pgmap->res = _map->res;

Re: [PATCH 04/17] mm: pass the vmem_altmap to arch_add_memory and __add_pages

2017-12-22 Thread Dan Williams

On Fri, Dec 15, 2017 at 6:09 AM, Christoph Hellwig  wrote:
> We can just pass this on instead of having to do a radix tree lookup
> without proper locking 2 levels into the callchain.
>
> Signed-off-by: Christoph Hellwig 
[..]
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 403ab9cdb949..16456117a1b1 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -427,7 +427,7 @@ void *devm_memremap_pages(struct device *dev, struct 
> resource *res,
> goto err_pfn_remap;
>
> mem_hotplug_begin();
> -   error = arch_add_memory(nid, align_start, align_size, false);
> +   error = arch_add_memory(nid, align_start, align_size, altmap, false);
> if (!error)
> 
> move_pfn_range_to_zone(_DATA(nid)->node_zones[ZONE_DEVICE],
> align_start >> PAGE_SHIFT,

Subtle bug here. This altmap is the one that was passed in that we
copy into its permanent location in the pgmap, so it looks like this
patch needs to fold the following fix:

diff --git a/kernel/memremap.c b/kernel/memremap.c
index f277bf5b8c57..157a3756e1d5 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -382,6 +382,7 @@ void *devm_memremap_pages(struct device *dev,
struct resource *res,
if (altmap) {
memcpy(_map->altmap, altmap, sizeof(*altmap));
pgmap->altmap = _map->altmap;
+   altmap = pgmap->altmap;
}
pgmap->ref = ref;
pgmap->res = _map->res;

Re: platform/x86/thinkpad_acpi: Adjustments for four function implementations

2017-12-22 Thread Henrique de Moraes Holschuh

On Tue, 19 Dec 2017, SF Markus Elfring wrote:
> >>   Delete an error message for a failed memory allocation in three functions
> > 
> > This one is questionable since it prints error messages at ->init() stage.
> > I would rather not touch this.
> 
> Do you find the Linux allocation failure report insufficient in this case?

Leave those pr_ messages alone, please, unless they are really causing
some sort of issue (which?).

> >>   Improve a size determination in tpacpi_new_rfkill()
> > 
> > Doesn't make any sense right now. One style over the other.
> > Nothing gets better or worth at this point.
> 
> Would you like to care for a bit more compliance with information
> from the section “14) Allocating memory” in the document “coding-style.rst”?

No, unless the change is actually fixing something, or gives us a
down-to-earth, *real* advantage of some sort.  In which case, the commit
message better do a rather good job of explaining it.

Doing it just for "compliance" with a doc isn't nearly good enough
reason.

-- 
  Henrique Holschuh

Re: platform/x86/thinkpad_acpi: Adjustments for four function implementations

2017-12-22 Thread Henrique de Moraes Holschuh

On Tue, 19 Dec 2017, SF Markus Elfring wrote:
> >>   Delete an error message for a failed memory allocation in three functions
> > 
> > This one is questionable since it prints error messages at ->init() stage.
> > I would rather not touch this.
> 
> Do you find the Linux allocation failure report insufficient in this case?

Leave those pr_ messages alone, please, unless they are really causing
some sort of issue (which?).

> >>   Improve a size determination in tpacpi_new_rfkill()
> > 
> > Doesn't make any sense right now. One style over the other.
> > Nothing gets better or worth at this point.
> 
> Would you like to care for a bit more compliance with information
> from the section “14) Allocating memory” in the document “coding-style.rst”?

No, unless the change is actually fixing something, or gives us a
down-to-earth, *real* advantage of some sort.  In which case, the commit
message better do a rather good job of explaining it.

Doing it just for "compliance" with a doc isn't nearly good enough
reason.

-- 
  Henrique Holschuh

Re: [PATCH 0/2] platform/x86/thinkpad_acpi: Adjustments for four function implementations

2017-12-22 Thread Henrique de Moraes Holschuh

On Tue, 19 Dec 2017, Andy Shevchenko wrote:
> On Mon, Dec 18, 2017 at 11:26 PM, SF Markus Elfring
>  wrote:
> > From: Markus Elfring 
> > Date: Mon, 18 Dec 2017 22:23:45 +0100
> >
> > Two update suggestions were taken into account
> > from static source code analysis.
> >
> > Markus Elfring (2):
> >   Delete an error message for a failed memory allocation in three functions
> 
> This one is questionable since it prints error messages at ->init() stage.
> I would rather not touch this.
> 
> >   Improve a size determination in tpacpi_new_rfkill()
> 
> Doesn't make any sense right now. One style over the other.
> Nothing gets better or worth at this point.
> 
> Sorry, but NAK for both.

Agreed.  NAK from me as well.

-- 
  Henrique Holschuh

Re: [PATCH 0/2] platform/x86/thinkpad_acpi: Adjustments for four function implementations

2017-12-22 Thread Henrique de Moraes Holschuh

On Tue, 19 Dec 2017, Andy Shevchenko wrote:
> On Mon, Dec 18, 2017 at 11:26 PM, SF Markus Elfring
>  wrote:
> > From: Markus Elfring 
> > Date: Mon, 18 Dec 2017 22:23:45 +0100
> >
> > Two update suggestions were taken into account
> > from static source code analysis.
> >
> > Markus Elfring (2):
> >   Delete an error message for a failed memory allocation in three functions
> 
> This one is questionable since it prints error messages at ->init() stage.
> I would rather not touch this.
> 
> >   Improve a size determination in tpacpi_new_rfkill()
> 
> Doesn't make any sense right now. One style over the other.
> Nothing gets better or worth at this point.
> 
> Sorry, but NAK for both.

Agreed.  NAK from me as well.

-- 
  Henrique Holschuh

Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2017-12-22 Thread Minchan Kim

On Fri, Dec 22, 2017 at 10:14:43PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
> >> Minchan Kim  writes:
> >> 
> >> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
> >> >> From: Huang Ying 
> >> >> 
> >> >> When the swapin is performed, after getting the swap entry information
> >> >> from the page table, system will swap in the swap entry, without any
> >> >> lock held to prevent the swap device from being swapoff.  This may
> >> >> cause the race like below,
> >> >> 
> >> >> CPU 1   CPU 2
> >> >> -   -
> >> >> do_swap_page
> >> >>   swapin_readahead
> >> >> __read_swap_cache_async
> >> >> swapoff   swapcache_prepare
> >> >>   p->swap_map = NULL__swap_duplicate
> >> >>   p->swap_map[?] /* !!! NULL 
> >> >> pointer access */
> >> >> 
> >> >> Because swapoff is usually done when system shutdown only, the race
> >> >> may not hit many people in practice.  But it is still a race need to
> >> >> be fixed.
> >> >> 
> >> >> To fix the race, get_swap_device() is added to check whether the
> >> >> specified swap entry is valid in its swap device.  If so, it will keep
> >> >> the swap entry valid via preventing the swap device from being
> >> >> swapoff, until put_swap_device() is called.
> >> >> 
> >> >> Because swapoff() is very race code path, to make the normal path runs
> >> >> as fast as possible, RCU instead of reference count is used to
> >> >> implement get/put_swap_device().  From get_swap_device() to
> >> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> >> >> swapoff() will wait until put_swap_device() is called.
> >> >> 
> >> >> In addition to swap_map, cluster_info, etc. data structure in the
> >> >> struct swap_info_struct, the swap cache radix tree will be freed after
> >> >> swapoff, so this patch fixes the race between swap cache looking up
> >> >> and swapoff too.
> >> >> 
> >> >> Cc: Hugh Dickins 
> >> >> Cc: Paul E. McKenney 
> >> >> Cc: Minchan Kim 
> >> >> Cc: Johannes Weiner 
> >> >> Cc: Tim Chen 
> >> >> Cc: Shaohua Li 
> >> >> Cc: Mel Gorman 
> >> >> Cc: "Jrme Glisse" 
> >> >> Cc: Michal Hocko 
> >> >> Cc: Andrea Arcangeli 
> >> >> Cc: David Rientjes 
> >> >> Cc: Rik van Riel 
> >> >> Cc: Jan Kara 
> >> >> Cc: Dave Jiang 
> >> >> Cc: Aaron Lu 
> >> >> Signed-off-by: "Huang, Ying" 
> >> >> 
> >> >> Changelog:
> >> >> 
> >> >> v4:
> >> >> 
> >> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
> >> >>   normal paths further.
> >> >
> >> > Hi Huang,
> >> 
> >> Hi, Minchan,
> >> 
> >> > This version is much better than old. To me, it's due to not rcu,
> >> > srcu, refcount thing but it adds swap device dependency(i.e., get/put)
> >> > into every swap related functions so users who don't interested on swap
> >> > don't need to care of it. Good.
> >> >
> >> > The problem is caused by freeing by swap related-data structure
> >> > *dynamically* while old swap logic was based on static data
> >> > structure(i.e., never freed and the verify it's stale).
> >> > So, I reviewed some places where use PageSwapCache and swp_entry_t
> >> > which could make access of swap related data structures.
> >> >
> >> > A example is __isolate_lru_page
> >> >
> >> > It calls page_mapping to get a address_space.
> >> > What happens if the page is on SwapCache and raced with swapoff?
> >> > The mapping got could be disappeared by the race. Right?
> >> 
> >> Yes.  We should think about that.  Considering the file cache pages, the
> >> address_space backing the file cache pages may be freed dynamically too.
> >> So to use page_mapping() return value for the file cache pages, some
> >> kind of locking is needed to guarantee the address_space isn't freed
> >> under us.  Page may be locked, or under writeback, or some other locks
> >
> > I didn't look at the code in detail but I guess every file page should
> > be freed before the address space destruction and page_lock/lru_lock makes
> > the work safe, I guess. So, it wouldn't be a problem.
> >
> > However, in case of swapoff, it doesn't remove pages from LRU list
> > so there is no lock to prevent the race at this moment. :(
> 
> Take a look at file cache pages and file cache address_space freeing
> code path.  It appears that similar situation is possible for them too.
> 
> The file cache pages will be delete from

Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2017-12-22 Thread Minchan Kim

On Fri, Dec 22, 2017 at 10:14:43PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
> >> Minchan Kim  writes:
> >> 
> >> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
> >> >> From: Huang Ying 
> >> >> 
> >> >> When the swapin is performed, after getting the swap entry information
> >> >> from the page table, system will swap in the swap entry, without any
> >> >> lock held to prevent the swap device from being swapoff.  This may
> >> >> cause the race like below,
> >> >> 
> >> >> CPU 1   CPU 2
> >> >> -   -
> >> >> do_swap_page
> >> >>   swapin_readahead
> >> >> __read_swap_cache_async
> >> >> swapoff   swapcache_prepare
> >> >>   p->swap_map = NULL__swap_duplicate
> >> >>   p->swap_map[?] /* !!! NULL 
> >> >> pointer access */
> >> >> 
> >> >> Because swapoff is usually done when system shutdown only, the race
> >> >> may not hit many people in practice.  But it is still a race need to
> >> >> be fixed.
> >> >> 
> >> >> To fix the race, get_swap_device() is added to check whether the
> >> >> specified swap entry is valid in its swap device.  If so, it will keep
> >> >> the swap entry valid via preventing the swap device from being
> >> >> swapoff, until put_swap_device() is called.
> >> >> 
> >> >> Because swapoff() is very race code path, to make the normal path runs
> >> >> as fast as possible, RCU instead of reference count is used to
> >> >> implement get/put_swap_device().  From get_swap_device() to
> >> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> >> >> swapoff() will wait until put_swap_device() is called.
> >> >> 
> >> >> In addition to swap_map, cluster_info, etc. data structure in the
> >> >> struct swap_info_struct, the swap cache radix tree will be freed after
> >> >> swapoff, so this patch fixes the race between swap cache looking up
> >> >> and swapoff too.
> >> >> 
> >> >> Cc: Hugh Dickins 
> >> >> Cc: Paul E. McKenney 
> >> >> Cc: Minchan Kim 
> >> >> Cc: Johannes Weiner 
> >> >> Cc: Tim Chen 
> >> >> Cc: Shaohua Li 
> >> >> Cc: Mel Gorman 
> >> >> Cc: "Jrme Glisse" 
> >> >> Cc: Michal Hocko 
> >> >> Cc: Andrea Arcangeli 
> >> >> Cc: David Rientjes 
> >> >> Cc: Rik van Riel 
> >> >> Cc: Jan Kara 
> >> >> Cc: Dave Jiang 
> >> >> Cc: Aaron Lu 
> >> >> Signed-off-by: "Huang, Ying" 
> >> >> 
> >> >> Changelog:
> >> >> 
> >> >> v4:
> >> >> 
> >> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
> >> >>   normal paths further.
> >> >
> >> > Hi Huang,
> >> 
> >> Hi, Minchan,
> >> 
> >> > This version is much better than old. To me, it's due to not rcu,
> >> > srcu, refcount thing but it adds swap device dependency(i.e., get/put)
> >> > into every swap related functions so users who don't interested on swap
> >> > don't need to care of it. Good.
> >> >
> >> > The problem is caused by freeing by swap related-data structure
> >> > *dynamically* while old swap logic was based on static data
> >> > structure(i.e., never freed and the verify it's stale).
> >> > So, I reviewed some places where use PageSwapCache and swp_entry_t
> >> > which could make access of swap related data structures.
> >> >
> >> > A example is __isolate_lru_page
> >> >
> >> > It calls page_mapping to get a address_space.
> >> > What happens if the page is on SwapCache and raced with swapoff?
> >> > The mapping got could be disappeared by the race. Right?
> >> 
> >> Yes.  We should think about that.  Considering the file cache pages, the
> >> address_space backing the file cache pages may be freed dynamically too.
> >> So to use page_mapping() return value for the file cache pages, some
> >> kind of locking is needed to guarantee the address_space isn't freed
> >> under us.  Page may be locked, or under writeback, or some other locks
> >
> > I didn't look at the code in detail but I guess every file page should
> > be freed before the address space destruction and page_lock/lru_lock makes
> > the work safe, I guess. So, it wouldn't be a problem.
> >
> > However, in case of swapoff, it doesn't remove pages from LRU list
> > so there is no lock to prevent the race at this moment. :(
> 
> Take a look at file cache pages and file cache address_space freeing
> code path.  It appears that similar situation is possible for them too.
> 
> The file cache pages will be delete from file cache address_space before
> address_space (embedded in inode) is freed.  But they will be deleted
> from LRU list only when its refcount dropped to zero, please take a look
> at put_page() and release_pages().  While address_space will be freed
> after putting reference to all file cache pages.  If someone holds a
> reference to a file cache page for quite long time, it is possible

RE: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

2017-12-22 Thread Dexuan Cui

> From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> Sent: Friday, December 22, 2017 14:29
> 
> The output of that precise command run just now on a freshly-compiled
> copy of that commit is attached.
> 
> On Fri, Dec 22, 2017 at 09:31:28PM +, Dexuan Cui wrote:
> > > From: Alexandru Chirvasitu [mailto:achirva...@gmail.com]
> > > Sent: Friday, December 22, 2017 06:21
> > >
> > > In the absence of logs, the best I can do at the moment is attach a
> > > picture of the screen I am presented with on the apic=debug boot
> > > attempt.
> > > Alex
> >
> > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > IMO we should find which line of code causes the panic. I suppose
> > "objdump -D kernel/irq/matrix.o" can help to do that.
> >
> > Thanks,
> > -- Dexuan

The BUG_ON panic happens at line 147:
   BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));

I'm sure Thomas and Dou know it better than me. 

137 void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
138   bool replace)
139 {
140 struct cpumap *cm = this_cpu_ptr(m->maps);
141
142 BUG_ON(bit > m->matrix_bits);
143 BUG_ON(m->online_maps > 1 || (m->online_maps && !replace));
144
145 set_bit(bit, m->system_map);
146 if (replace) {
147 BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
148 cm->allocated--;
149 m->total_allocated--;
150 }
151 if (bit >= m->alloc_start && bit < m->alloc_end)
152 m->systembits_inalloc++;
153
154 trace_irq_matrix_assign_system(bit, m);
155 }

-- Dexuan

Re: [PATCH v2 1/3] phy: core: Move runtime PM reference counting to the parent device

2017-12-22 Thread Rafael J. Wysocki

On Thu, Dec 21, 2017 at 11:50 AM, Ulf Hansson  wrote:
> On 21 December 2017 at 02:39, Rafael J. Wysocki  wrote:
>> On Wed, Dec 20, 2017 at 3:09 PM, Ulf Hansson  wrote:
>>> The runtime PM deployment in the phy core is deployed using the phy core
>>> device, which is created by the phy core and assigned as a child device of
>>> the phy provider device.
>>>
>>> The behaviour around the runtime PM deployment cause some issues during
>>> system suspend, in cases when the phy provider device is put into a low
>>> power state via a call to the pm_runtime_force_suspend() helper, as is the
>>> case for a Renesas SoC, which has its phy provider device attached to the
>>> generic PM domain.
>>>
>>> In more detail, the problem in this case is that pm_runtime_force_suspend()
>>> expects the child device of the provider device, which is the phy core
>>> device, to be runtime suspended, else a WARN splat will be printed
>>> (correctly) when runtime PM gets re-enabled at system resume.
>>
>> So we are now trying to work around issues with
>> pm_runtime_force_suspend().  Lovely. :-/
>
> Yes, we have to, as pm_runtime_force_suspend() is widely deployed. Or
> are you saying we should just ignore all issues related to it?

Or get rid of it?

Seriously, is the resulting pain worth it?

> Of course, if we had something that could replace
> pm_runtime_force_suspend(), that would be great, but there isn't.

I beg to differ.

At least some of it could be replaced with the driver flags.

>>> In the current scenario, even if a call to phy_power_off() triggers it to
>>> invoke pm_runtime_put() during system suspend, the phy core device doesn't
>>> get runtime suspended, because this is prevented in the system suspend
>>> phases by the PM core.
>>>
>>> To solve this problem, let's move the runtime PM deployment from the phy
>>> core device to the phy provider device, as this provides the similar
>>> behaviour. Changing this makes it redundant to enable runtime PM for the
>>> phy core device, so let's avoid doing that.
>>
>> I'm not really convinced that this approach is the best one to be honest.
>>
>> I'll have a deeper look at this in the next few days, stay tuned.
>
> There is different ways to solve this, for sure. I picked this one,
> because I think it's the most trivial thing to do, and it shouldn't
> cause any other problems.
>
> I think any other option would involve assigning ->suspend|resume()
> callbacks to the phy core device, but that's fine too, if you prefer
> that.
>
> Also, I have considered how to deal with wakeup paths for phys,
> although I didn't want to post changes as a part of this series, but
> maybe I should to give a more complete picture?

Yes, you should.

The point is that without genpd using pm_runtime_force_suspend() the
phy code could very well stay the way it is.  And it is logical,
because having a parent with enabled runtime PM without enabling
runtime PM for its children is at least conceptually questionable.

So the conclusion may be that the phy code is OK, but calling
pm_runtime_force_suspend() from genpd isn't.

Thanks,
Rafael

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1842 matches

Mail list logo