Re: [Ksummit-2013-discuss] [ATTEND] DT, maintainership, development process

2013-07-30 Thread James Bottomley
On Tue, 2013-07-30 at 22:47 -0700, H. Peter Anvin wrote:
> On 07/29/2013 03:30 PM, Rafael J. Wysocki wrote:
> > On Monday, July 29, 2013 02:17:34 PM John W. Linville wrote:
> >> On Mon, Jul 29, 2013 at 03:27:44PM +0200, Rafael J. Wysocki wrote:
> >>
> >>> That said we have the same issue with commits with just two SOB tags if
> >>> a maintainer applies a patch that nobody has responded to.  Are they 
> >>> going to
> >>> be regarded as "suspicious" too now?
> >>>
> >>> And what about trusting maintainers?  If Linus trusts them enough to pull 
> >>> from
> >>> them, why can't everybody else trust them enough to assume that they 
> >>> don't do
> >>> bad things on purpose?
> >>
> >> Not just Linus -- it's 'turtles all the way down' here.  As someone
> >> else suggested, a Signed-off-by in the merge commit should suffice
> >> here.  Although, I haven't always made a habit of adding S-o-b to
> >> merge commits either...
> > 
> > An SOB in the merge doesn't provide any additional information that can't
> > be retrieved from git, unless you use a different e-mail address for the
> > sign-off. :-)
> > 
> 
> Watch out for fast forward merges.  Ideally I guess maintainers really
> should disable fast forwards and PGP-sign their merge commits...

If you always pull from a signed tag, this isn't a problem.
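To illustrate hpa's point above: a `--no-ff` merge always produces a merge commit (two parents) that the maintainer could then sign, whereas a fast-forward leaves nothing new to sign. A minimal demo (hypothetical throwaway repo; the GPG signing step itself is omitted):

```shell
set -e
dir=$(mktemp -d); cd "$dir"
git init -q
git config user.email a@example.com
git config user.name a
git commit -q --allow-empty -m base
main=$(git symbolic-ref --short HEAD)     # works whether default is master or main
git checkout -q -b topic
git commit -q --allow-empty -m work
git checkout -q "$main"
# A plain "git merge topic" would fast-forward here; --no-ff forces a merge commit.
git merge -q --no-ff -m "merge topic" topic
parents=$(git rev-list --parents -n 1 HEAD | wc -w)   # commit sha + 2 parent shas = 3
echo "words in rev-list output: $parents"
```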

James


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-30 Thread Takao Indoh
(2013/07/31 12:11), Alex Williamson wrote:
> On Wed, 2013-07-31 at 09:35 +0900, Takao Indoh wrote:
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e37fea6..c595997 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -3392,6 +3392,59 @@ int pci_reset_function(struct pci_dev *dev)
>>   EXPORT_SYMBOL_GPL(pci_reset_function);
>>   
>>   /**
>> + * pci_reset_bus - reset a PCI bus
>> + * @bus: PCI bus to reset
>> + *
>> + * Returns 0 if the bus was successfully reset or negative if failed.
>> + */
>> +int pci_reset_bus(struct pci_bus *bus)
>> +{
>> +	struct pci_dev *pdev;
>> +	u16 ctrl;
>> +
>> +	if (!bus->self)
>> +		return -ENOTTY;
>> +
>> +	list_for_each_entry(pdev, &bus->devices, bus_list)
>> +		if (pdev->subordinate)
>> +			return -ENOTTY;
>> +
>> +	/* Save config registers of children */
>> +	list_for_each_entry(pdev, &bus->devices, bus_list) {
>> +		dev_info(&pdev->dev, "Save state\n");
>> +		pci_save_state(pdev);
>> +	}
>> +
>> +	dev_info(&bus->self->dev, "Reset Secondary bus\n");
>> +
>> +	/* Assert Secondary Bus Reset */
>> +	pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
>> +	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>> +	pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> +	/* Read config again to flush previous write */
>> +	pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
>> +
>> +	msleep(2);
>> +
>> +	/* De-assert Secondary Bus Reset */
>> +	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>> +	pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> +	/* Wait for completion */
>> +	msleep(1000);
> 
> 
> We already have secondary bus reset code in this file, why are we
> duplicating it here?  Also, why are these delays different from the
> existing code?  I'm also in need of a bus reset interface for when we
> assign all of the devices on a bus to userspace and do not have working
> function level resets per device.  I'll post my patch series and perhaps
> we can collaborate on a pci bus reset interface.  Thanks,

Good point. Yes, we already have similar functions.

pci_parent_bus_reset()
1. Assert secondary bus reset
2. msleep(100)
3. De-assert secondary bus reset
4. msleep(100)

aer_do_secondary_bus_reset()
1. Assert secondary bus reset
2. msleep(2)
3. De-assert secondary bus reset,
4. msleep(200)

To be honest, I wrote my reset code almost one year ago, so I forget
why I separated them.

Basically my reset code is based on aer_do_secondary_bus_reset(). The
difference is the waiting time after reset. My patch waits 1000 msec.

At first my reset code was almost the same as aer_do_secondary_bus_reset(),
but when I tested it, I found that on a certain machine
restoring config registers failed after reset. It failed because the 200 msec
wait was too short. Then I found the following description in the
PCIe spec. According to this, I think we should wait at least 1000 msec.

6.6.1. Conventional Reset

* The Root Complex and/or system software must allow at least 1.0s
  after a Conventional Reset of a device, before it may determine that a
  device which fails to return a Successful Completion status for a
  valid Configuration Request is a broken device. This period is
  independent of how quickly Link training completes.

  Note: This delay is analogous to the Trhfa parameter specified for
  PCI/PCI-X, and is intended to allow an adequate amount of time for
  devices which require self initialization.

* When attempting a Configuration access to devices on a PCI or PCI-X
  bus segment behind a PCI Express/PCI(-X) Bridge, the timing parameter
  Trhfa must be respected.

And I saw the patches you posted today; yes, your patches look helpful for
my purpose :-)

Thanks,
Takao Indoh



Re: [Ksummit-2013-discuss] [ATTEND] DT, maintainership, development process

2013-07-30 Thread H. Peter Anvin
On 07/29/2013 03:30 PM, Rafael J. Wysocki wrote:
> On Monday, July 29, 2013 02:17:34 PM John W. Linville wrote:
>> On Mon, Jul 29, 2013 at 03:27:44PM +0200, Rafael J. Wysocki wrote:
>>
>>> That said we have the same issue with commits with just two SOB tags if
>>> a maintainer applies a patch that nobody has responded to.  Are they going 
>>> to
>>> be regarded as "suspicious" too now?
>>>
>>> And what about trusting maintainers?  If Linus trusts them enough to pull 
>>> from
>>> them, why can't everybody else trust them enough to assume that they don't 
>>> do
>>> bad things on purpose?
>>
>> Not just Linus -- it's 'turtles all the way down' here.  As someone
>> else suggested, a Signed-off-by in the merge commit should suffice
>> here.  Although, I haven't always made a habit of adding S-o-b to
>> merge commits either...
> 
> An SOB in the merge doesn't provide any additional information that can't
> be retrieved from git, unless you use a different e-mail address for the
> sign-off. :-)
> 

Watch out for fast forward merges.  Ideally I guess maintainers really
should disable fast forwards and PGP-sign their merge commits...

-hpa




Re: [PATCH v3 6/6] ARM: davinci: da850: configure system configuration chip(CFGCHIP3) for emac

2013-07-30 Thread Sekhar Nori
On Sunday 23 June 2013 08:30 PM, Prabhakar Lad wrote:
> From: "Lad, Prabhakar" 
> 
> This patch adds a common function to configure the EMAC and calls
> it appropriately in DT and non-DT boot modes. The system configuration
> register CFGCHIP3 controls the EMAC module. This patch appropriately
> configures this register for the EMAC and sets the DA850_MII_MDIO_CLKEN_PIN
> GPIO pin.
> 
> Signed-off-by: Lad, Prabhakar 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: davinci-linux-open-sou...@linux.davincidsp.com
> Cc: net...@vger.kernel.org
> Cc: devicetree-disc...@lists.ozlabs.org
> Cc: Sekhar Nori 
> Cc: Heiko Schocher 
> ---
>  Changes for v2: none
>  Changes for v3:
>   a> added a common function in da850.c to configure
>  the CFGCHIP3 chip.
>  
>  arch/arm/mach-davinci/board-da850-evm.c|   36 ++
>  arch/arm/mach-davinci/da850.c  |   45 
> 
>  arch/arm/mach-davinci/da8xx-dt.c   |   16 ++
>  arch/arm/mach-davinci/include/mach/da8xx.h |1 +
>  4 files changed, 64 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/arm/mach-davinci/board-da850-evm.c 
> b/arch/arm/mach-davinci/board-da850-evm.c
> index 8a24b6c..03dd1df 100644
> --- a/arch/arm/mach-davinci/board-da850-evm.c
> +++ b/arch/arm/mach-davinci/board-da850-evm.c
> @@ -50,7 +50,6 @@
>  #include 
>  #include 
>  
> -#define DA850_EVM_PHY_ID "davinci_mdio-0:00"
>  #define DA850_LCD_PWR_PIN	GPIO_TO_PIN(2, 8)
>  #define DA850_LCD_BL_PIN GPIO_TO_PIN(2, 15)
>  
> @@ -60,8 +59,6 @@
>  #define DA850_WLAN_EN	GPIO_TO_PIN(6, 9)
>  #define DA850_WLAN_IRQ   GPIO_TO_PIN(6, 10)
>  
> -#define DA850_MII_MDIO_CLKEN_PIN GPIO_TO_PIN(2, 6)
> -
>  static struct mtd_partition da850evm_spiflash_part[] = {
>   [0] = {
>   .name = "UBL",
> @@ -1033,26 +1030,18 @@ static const short da850_evm_rmii_pins[] = {
>  
>  static int __init da850_evm_config_emac(void)
>  {
> - void __iomem *cfg_chip3_base;
> - int ret;
> - u32 val;
>   struct davinci_soc_info *soc_info = &davinci_soc_info;
>   u8 rmii_en = soc_info->emac_pdata->rmii_en;
> + int ret;
>  
>   if (!machine_is_davinci_da850_evm())
>   return 0;
>  
> - cfg_chip3_base = DA8XX_SYSCFG0_VIRT(DA8XX_CFGCHIP3_REG);
> -
> - val = __raw_readl(cfg_chip3_base);
> -
>   if (rmii_en) {
> - val |= BIT(8);
>   ret = davinci_cfg_reg_list(da850_evm_rmii_pins);
>   pr_info("EMAC: RMII PHY configured, MII PHY will not be"
>   " functional\n");
>   } else {
> - val &= ~BIT(8);
>   ret = davinci_cfg_reg_list(da850_evm_mii_pins);
>   pr_info("EMAC: MII PHY configured, RMII PHY will not be"
>   " functional\n");
> @@ -1062,28 +1051,7 @@ static int __init da850_evm_config_emac(void)
>   pr_warn("%s: CPGMAC/RMII mux setup failed: %d\n",
>   __func__, ret);
>  
> - /* configure the CFGCHIP3 register for RMII or MII */
> - __raw_writel(val, cfg_chip3_base);
> -
> - ret = davinci_cfg_reg(DA850_GPIO2_6);
> - if (ret)
> - pr_warn("%s:GPIO(2,6) mux setup failed\n", __func__);
> -
> - ret = gpio_request(DA850_MII_MDIO_CLKEN_PIN, "mdio_clk_en");
> - if (ret) {
> - pr_warn("Cannot open GPIO %d\n", DA850_MII_MDIO_CLKEN_PIN);
> - return ret;
> - }
> -
> - /* Enable/Disable MII MDIO clock */
> - gpio_direction_output(DA850_MII_MDIO_CLKEN_PIN, rmii_en);
> -
> - soc_info->emac_pdata->phy_id = DA850_EVM_PHY_ID;
> -
> - ret = da8xx_register_emac();
> - if (ret)
> - pr_warn("%s: EMAC registration failed: %d\n", __func__, ret);
> -
> + da850_config_emac(0, rmii_en);
>   return 0;
>  }
>  device_initcall(da850_evm_config_emac);
> diff --git a/arch/arm/mach-davinci/da850.c b/arch/arm/mach-davinci/da850.c
> index c43abee..d8021bb 100644
> --- a/arch/arm/mach-davinci/da850.c
> +++ b/arch/arm/mach-davinci/da850.c
> @@ -1197,6 +1197,51 @@ no_ddrpll_mem:
>   return ret;
>  }
>  
> +void __init da850_config_emac(u32 dt_mode, u32 rmii_enabled)
> +{
> +#define DA850_MII_MDIO_CLKEN_PIN GPIO_TO_PIN(2, 6)
> +#define DA850_EVM_PHY_ID "davinci_mdio-0:00"
> +
> + void __iomem *cfg_chip3_base;
> + int ret;
> + u32 val;
> +
> + cfg_chip3_base = DA8XX_SYSCFG0_VIRT(DA8XX_CFGCHIP3_REG);
> +
> + val = readl(cfg_chip3_base);
> +
> + if (rmii_enabled)
> + val |= BIT(8);
> + else
> + val &= ~BIT(8);
> +
> + /* configure the CFGCHIP3 register for RMII or MII */
> + writel(val, cfg_chip3_base);
> +
> + ret = davinci_cfg_reg(DA850_GPIO2_6);
> + if (ret)
> + pr_warn("%s:GPIO(2,6) mux setup failed\n", __func__);
> +

[PATCH] net: Remove extern from include/net/ scheduling prototypes

2013-07-30 Thread Joe Perches
There is a mix of function prototypes with and without extern
in the kernel sources.  Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler.  Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Reflow modified prototypes to 80 columns.

Signed-off-by: Joe Perches 
---
 include/net/act_api.h | 60 +++
 include/net/pkt_cls.h | 42 -
 include/net/pkt_sched.h   | 50 +++
 include/net/sch_generic.h | 53 +
 4 files changed, 104 insertions(+), 101 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index b8ffac7..9e90fdf 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -82,36 +82,36 @@ struct tc_action_ops {
int (*walk)(struct sk_buff *, struct netlink_callback *, int, 
struct tc_action *);
 };
 
-extern struct tcf_common *tcf_hash_lookup(u32 index,
- struct tcf_hashinfo *hinfo);
-extern void tcf_hash_destroy(struct tcf_common *p, struct tcf_hashinfo *hinfo);
-extern int tcf_hash_release(struct tcf_common *p, int bind,
-   struct tcf_hashinfo *hinfo);
-extern int tcf_generic_walker(struct sk_buff *skb, struct netlink_callback *cb,
- int type, struct tc_action *a);
-extern u32 tcf_hash_new_index(u32 *idx_gen, struct tcf_hashinfo *hinfo);
-extern int tcf_hash_search(struct tc_action *a, u32 index);
-extern struct tcf_common *tcf_hash_check(u32 index, struct tc_action *a,
-int bind, struct tcf_hashinfo *hinfo);
-extern struct tcf_common *tcf_hash_create(u32 index, struct nlattr *est,
- struct tc_action *a, int size,
- int bind, u32 *idx_gen,
- struct tcf_hashinfo *hinfo);
-extern void tcf_hash_insert(struct tcf_common *p, struct tcf_hashinfo *hinfo);
+struct tcf_common *tcf_hash_lookup(u32 index, struct tcf_hashinfo *hinfo);
+void tcf_hash_destroy(struct tcf_common *p, struct tcf_hashinfo *hinfo);
+int tcf_hash_release(struct tcf_common *p, int bind,
+struct tcf_hashinfo *hinfo);
+int tcf_generic_walker(struct sk_buff *skb, struct netlink_callback *cb,
+  int type, struct tc_action *a);
+u32 tcf_hash_new_index(u32 *idx_gen, struct tcf_hashinfo *hinfo);
+int tcf_hash_search(struct tc_action *a, u32 index);
+struct tcf_common *tcf_hash_check(u32 index, struct tc_action *a,
+ int bind, struct tcf_hashinfo *hinfo);
+struct tcf_common *tcf_hash_create(u32 index, struct nlattr *est,
+  struct tc_action *a, int size,
+  int bind, u32 *idx_gen,
+  struct tcf_hashinfo *hinfo);
+void tcf_hash_insert(struct tcf_common *p, struct tcf_hashinfo *hinfo);
 
-extern int tcf_register_action(struct tc_action_ops *a);
-extern int tcf_unregister_action(struct tc_action_ops *a);
-extern void tcf_action_destroy(struct tc_action *a, int bind);
-extern int tcf_action_exec(struct sk_buff *skb, const struct tc_action *a, 
struct tcf_result *res);
-extern struct tc_action *tcf_action_init(struct net *net, struct nlattr *nla,
-struct nlattr *est, char *n, int ovr,
-int bind);
-extern struct tc_action *tcf_action_init_1(struct net *net, struct nlattr *nla,
-  struct nlattr *est, char *n, int ovr,
-  int bind);
-extern int tcf_action_dump(struct sk_buff *skb, struct tc_action *a, int, int);
-extern int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, 
int);
-extern int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int, 
int);
-extern int tcf_action_copy_stats (struct sk_buff *,struct tc_action *, int);
+int tcf_register_action(struct tc_action_ops *a);
+int tcf_unregister_action(struct tc_action_ops *a);
+void tcf_action_destroy(struct tc_action *a, int bind);
+int tcf_action_exec(struct sk_buff *skb, const struct tc_action *a,
+   struct tcf_result *res);
+struct tc_action *tcf_action_init(struct net *net, struct nlattr *nla,
+ struct nlattr *est, char *n, int ovr,
+ int bind);
+struct tc_action *tcf_action_init_1(struct net *net, struct nlattr *nla,
+   struct nlattr *est, char *n, int ovr,
+   int bind);
+int tcf_action_dump(struct sk_buff *skb, struct tc_action *a, int, int);
+int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int);
+int 

RE: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle backend driver to sysdev.

2013-07-30 Thread Wang Dongsheng-B40534
Hi Preeti,

> -Original Message-
> From: Preeti U Murthy [mailto:pre...@linux.vnet.ibm.com]
> Sent: Wednesday, July 31, 2013 12:00 PM
> To: Wang Dongsheng-B40534
> Cc: Deepthi Dharwar; b...@kernel.crashing.org; daniel.lezc...@linaro.org;
> linux-kernel@vger.kernel.org; mich...@ellerman.id.au;
> srivatsa.b...@linux.vnet.ibm.com; sva...@linux.vnet.ibm.com; linuxppc-
> d...@lists.ozlabs.org; r...@sisk.pl; linux...@vger.kernel.org
> Subject: Re: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle
> backend driver to sysdev.
> 
> Hi Dongsheng,
> 
> On 07/31/2013 08:52 AM, Wang Dongsheng-B40534 wrote:
> >
> >
> >> -Original Message-
> >> From: Deepthi Dharwar [mailto:deep...@linux.vnet.ibm.com]
> >> Sent: Wednesday, July 31, 2013 10:59 AM
> >> To: b...@kernel.crashing.org; daniel.lezc...@linaro.org; linux-
> >> ker...@vger.kernel.org; mich...@ellerman.id.au;
> >> srivatsa.b...@linux.vnet.ibm.com; pre...@linux.vnet.ibm.com;
> >> sva...@linux.vnet.ibm.com; linuxppc-...@lists.ozlabs.org
> >> Cc: r...@sisk.pl; Wang Dongsheng-B40534; linux...@vger.kernel.org
> >> Subject: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle
> >> backend driver to sysdev.
> >>
> >> Move pseries_idle backend driver code to arch/powerpc/sysdev so that
> >> the code can be used for a common driver for powernv and pseries.
> >> This removes a lot of code duplication.
> >>
> > Why not drivers/cpuidle/?
> >
> > I think it should be move to drivers/cpuidle.
> 
> Please take a look at what the cpuidle under drivers has to provide.
> cpuidle has two parts to it. The front end and the back end. The front
> end constitutes the cpuidle governors, registering of arch specific
> cpuidle drivers, disabling and enabling of cpuidle feature. It is this
> front end code which is present under drivers/cpuidle.
> 
> The arch specific cpuidle drivers which decide what needs to be done to
> enter a specific idle state chosen by the cpuidle governor is what
> constitutes the back end of cpuidle. This will not be in drivers/cpuidle
> but in an arch/ specific code.
> 
> The cpuidle under drivers/cpuidle drives the idle power management, but
> the low level handling of the entry into idle states should be taken care
> of by the architecture.
> 
> Your recent patch :
> cpuidle: add freescale e500 family porcessors idle support IMO should
> hook onto the backend cpuidle driver that this patchset provides.
> 
Sorry, I don't think so; the cpuidle framework is already very generic.
Here we just need to define the states and their handlers, so I wonder
whether we need this layer.

If your handling is platform dependent, it should be in arch/platform.

If it is only for some platforms, and the operations of those platforms can be
shared, why can't it be a driver in drivers/cpuidle?

If it is a general driver, I think we can put the common operations into
drivers/cpuidle and keep the platform-specific code in arch/powerpc/platform.

This patch includes both the front end and the back end, not just the back end.

This patch also includes too many states and handler functions for different
platforms. Those states and handlers should belong to their own platforms;
this is not a general approach. If Deepthi is going to do a general powerpc
cpuidle driver, I think it cannot just use macros to distinguish platforms.
The front-end code (driver registration) would better move to drivers/cpuidle,
and the low-power states and their handlers to arch/powerpc/platform/**,
because different platforms have different low-power states and different
processing methods. The front end can provide some general methods to
register with a general powerpc cpuidle driver.

-dongsheng



Re: [PATCH 01/15] drivers: phy: add generic PHY framework

2013-07-30 Thread Kishon Vijay Abraham I
Hi,

On Tuesday 30 July 2013 12:41 PM, Felipe Balbi wrote:
> On Sun, Jul 21, 2013 at 08:46:53AM -0700, Greg KH wrote:
>> On Sun, Jul 21, 2013 at 01:12:07PM +0200, Tomasz Figa wrote:
>>> On Sunday 21 of July 2013 16:37:33 Kishon Vijay Abraham I wrote:
 Hi,

 On Sunday 21 July 2013 04:01 PM, Tomasz Figa wrote:
> Hi,
>
> On Saturday 20 of July 2013 19:59:10 Greg KH wrote:
>> On Sat, Jul 20, 2013 at 10:32:26PM -0400, Alan Stern wrote:
>>> On Sat, 20 Jul 2013, Greg KH wrote:
>>> That should be passed using platform data.
>>
>> Ick, don't pass strings around, pass pointers.  If you have
>> platform
>> data you can get to, then put the pointer there, don't use a
>> "name".
>
> I don't think I understood you here :-s We wont have phy pointer
> when we create the device for the controller no?(it'll be done in
> board file). Probably I'm missing something.

 Why will you not have that pointer?  You can't rely on the "name"
 as
 the device id will not match up, so you should be able to rely on
 the pointer being in the structure that the board sets up, right?

 Don't use names, especially as ids can, and will, change, that is
 going
 to cause big problems.  Use pointers, this is C, we are supposed to
 be
 doing that :)
>>>
>>> Kishon, I think what Greg means is this:  The name you are using
>>> must
>>> be stored somewhere in a data structure constructed by the board
>>> file,
>>> right?  Or at least, associated with some data structure somehow.
>>> Otherwise the platform code wouldn't know which PHY hardware
>>> corresponded to a particular name.
>>>
>>> Greg's suggestion is that you store the address of that data
>>> structure
>>> in the platform data instead of storing the name string.  Have the
>>> consumer pass the data structure's address when it calls phy_create,
>>> instead of passing the name.  Then you don't have to worry about two
>>> PHYs accidentally ending up with the same name or any other similar
>>> problems.
>>
>> Close, but the issue is that whatever returns from phy_create()
>> should
>> then be used, no need to call any "find" functions, as you can just
>> use
>> the pointer that phy_create() returns.  Much like all other class api
>> functions in the kernel work.
>
> I think there is a confusion here about who registers the PHYs.
>
> All platform code does is registering a platform/i2c/whatever device,
> which causes a driver (located in drivers/phy/) to be instantiated.
> Such drivers call phy_create(), usually in their probe() callbacks,
> so platform_code has no way (and should have no way, for the sake of
> layering) to get what phy_create() returns.
>>
>> Why not put pointers in the platform data structure that can hold these
>> pointers?  I thought that is why we created those structures in the
>> first place.  If not, what are they there for?
> 
> heh, IMO we shouldn't pass pointers of any kind through platform_data,
> we want to pass data :-)
> 
> Allowing to pass pointers through that, is one of the reasons which got
> us in such a big mess in ARM land, well it was much easier for a
> board-file/driver writer to pass a function pointer then to create a
> generic framework :-)
> 
> IMHO we need a lookup method for PHYs, just like for clocks,
> regulators, PWMs or even i2c busses because there are complex cases
> when passing just a name using platform data will not work. I would
> second what Stephen said [1] and define a structure doing things in a
> DT-like way.
>
> Example;
>
> [platform code]
>
> static const struct phy_lookup my_phy_lookup[] = {
>
>   PHY_LOOKUP("s3c-hsotg.0", "otg", "samsung-usbphy.1", "phy.2"),

 The only problem here is that if *PLATFORM_DEVID_AUTO* is used while
 creating the device, the ids in the device name would change and
 PHY_LOOKUP wont be useful.
>>>
>>> I don't think this is a problem. All the existing lookup methods already 
>>> use ID to identify devices (see regulators, clkdev, PWMs, i2c, ...). You 
>>> can simply add a requirement that the ID must be assigned manually, 
>>> without using PLATFORM_DEVID_AUTO to use PHY lookup.
>>
>> And I'm saying that this idea, of using a specific name and id, is
>> frought with fragility and will break in the future in various ways when
>> devices get added to systems, making these strings constantly have to be
>> kept up to date with different board configurations.
>>
>> People, NEVER, hardcode something like an id.  The fact that this
>> happens today with the clock code, doesn't make it right, it makes the
>> clock code wrong.  Others have already said that this is wrong there as
>> well, as systems change and dynamic ids get 

Re: [PATCH v3 6/9] mm, hugetlb: do not use a page in page cache for cow optimization

2013-07-30 Thread Joonsoo Kim
On Tue, Jul 30, 2013 at 08:37:08AM +1000, David Gibson wrote:
> On Mon, Jul 29, 2013 at 02:28:18PM +0900, Joonsoo Kim wrote:
> > Currently, we use a page with mapped count 1 in page cache for cow
> > optimization. If we find this condition, we don't allocate a new
> > page and copy contents. Instead, we map this page directly.
> > This may introduce a problem that writing to a private mapping overwrites
> > the hugetlb file directly. You can reproduce this situation with the following code.
> > 
> > size = 20 * MB;
> > flag = MAP_SHARED;
> > p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
> > if (p == MAP_FAILED) {
> > fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
> > return -1;
> > }
> > p[0] = 's';
> > fprintf(stdout, "BEFORE STEAL PRIVATE WRITE: %c\n", p[0]);
> > munmap(p, size);
> > 
> > flag = MAP_PRIVATE;
> > p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
> > if (p == MAP_FAILED) {
> > fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
> > }
> > p[0] = 'c';
> > munmap(p, size);
> > 
> > flag = MAP_SHARED;
> > p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
> > if (p == MAP_FAILED) {
> > fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
> > return -1;
> > }
> > fprintf(stdout, "AFTER STEAL PRIVATE WRITE: %c\n", p[0]);
> > munmap(p, size);
> > 
> > We can see that "AFTER STEAL PRIVATE WRITE: c", not "AFTER STEAL
> > PRIVATE WRITE: s". If we turn off this optimization for a page
> > in the page cache, the problem disappears.
> 
> Please add this testcase to libhugetlbfs as well.

Okay!

Thanks.

> 
> -- 
> David Gibson  | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au| minimalist, thank you.  NOT _the_ 
> _other_
>   | _way_ _around_!
> http://www.ozlabs.org/~dgibson




Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve

2013-07-30 Thread Joonsoo Kim
Hello, David.

On Mon, Jul 29, 2013 at 05:28:23PM +1000, David Gibson wrote:
> On Mon, Jul 29, 2013 at 02:32:08PM +0900, Joonsoo Kim wrote:
> > If parallel faults occur, we can fail to allocate a hugepage,
> > because many threads dequeue hugepages to handle faults of the same address.
> > This makes the reserved pool run short just for a little while, and causes
> > a faulting thread which is ensured to have enough reserved hugepages
> > to get a SIGBUS signal.
> 
> It's not just about reserved pages.  The same race can happen
> perfectly well when you're really, truly allocating the last hugepage
> in the system.

Yes, you are right.
This is a critical comment on this patchset :(

IIUC, the case you mentioned is about tasks which have a mapping
with MAP_NORESERVE. Should we ensure that they can allocate the last hugepage?
They map a region with MAP_NORESERVE, so they shouldn't assume that their
requests always succeed.

> 
> > 
> > To solve this problem, we already have a nice solution: the
> > hugetlb_instantiation_mutex. This blocks other threads from diving into
> > the fault handler. It solves the problem clearly, but it introduces
> > performance degradation, because it serializes all fault handling.
> > 
> > Now I try to remove the hugetlb_instantiation_mutex to get rid of the
> > performance degradation. A prerequisite is that other threads should
> > not get a SIGBUS if they are ensured to have enough reserved pages.
> > 
> > For this purpose, if we fail to allocate a new hugepage with use_reserve,
> > we return just 0 instead of VM_FAULT_SIGBUS. use_reserve
> > represents that this user is a legitimate one who is ensured to have enough
> > reserved pages. This prevents these threads from getting a SIGBUS signal
> > and makes them retry the fault handling.
> 
> Not sufficient, since it can happen without reserved pages.

Ditto.

> 
> Also, I think there are edge cases where even reserved mappings can
> run out, in particular with the interaction between MAP_PRIVATE,
> fork() and reservations.  In this case, when you have a genuine out of
> memory condition, you will spin forever on the fault.

If there are edge cases, we can fix them. It doesn't matter.
If you find one, please tell me in more detail.

Thanks.

> 
> > 
> > Signed-off-by: Joonsoo Kim 
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 6a9ec69..909075b 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2623,7 +2623,10 @@ retry_avoidcopy:
> > WARN_ON_ONCE(1);
> > }
> >  
> > -   ret = VM_FAULT_SIGBUS;
> > +   if (use_reserve)
> > +   ret = 0;
> > +   else
> > +   ret = VM_FAULT_SIGBUS;
> > goto out_lock;
> > }
> >  
> > @@ -2741,7 +2744,10 @@ retry:
> >  
> > page = alloc_huge_page(vma, address, use_reserve);
> > if (IS_ERR(page)) {
> > -   ret = VM_FAULT_SIGBUS;
> > +   if (use_reserve)
> > +   ret = 0;
> > +   else
> > +   ret = VM_FAULT_SIGBUS;
> > goto out;
> > }
> > clear_huge_page(page, address, pages_per_huge_page(h));
> 
> -- 
> David Gibson  | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au| minimalist, thank you.  NOT _the_ 
> _other_
>   | _way_ _around_!
> http://www.ozlabs.org/~dgibson




Re: [PATCH 3/9] ARM: edma: Add function to manually trigger an EDMA channel

2013-07-30 Thread Sekhar Nori
On Wednesday 31 July 2013 10:00 AM, Joel Fernandes wrote:
> On 07/30/2013 12:18 AM, Sekhar Nori wrote:
>> On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote:
>>> Manual trigger for events missed as a result of splitting a
>>> scatter-gather list and DMA'ing it in batches. Add a helper
>>> function to trigger a channel in case any such events are missed.
>>>
>>> Signed-off-by: Joel Fernandes 
>>> ---
>>>  arch/arm/common/edma.c |   21 +
>>>  include/linux/platform_data/edma.h |2 ++
>>>  2 files changed, 23 insertions(+)
>>>
>>> diff --git a/arch/arm/common/edma.c b/arch/arm/common/edma.c
>>> index 3567ba1..10995b2 100644
>>> --- a/arch/arm/common/edma.c
>>> +++ b/arch/arm/common/edma.c
>>> @@ -1236,6 +1236,27 @@ void edma_resume(unsigned channel)
>>>  }
>>>  EXPORT_SYMBOL(edma_resume);
>>>  
>>> +int edma_manual_trigger(unsigned channel)
>>
>> edma_trigger_channel() maybe? Brings consistency with
>> edma_alloc_channel() edma_free_channel() etc.
> 
> Ok, sure.
> 
>>
>>> +{
>>> +   unsigned ctlr;
>>> +   int j;
>>> +   unsigned int mask;
>>> +
>>> +   ctlr = EDMA_CTLR(channel);
>>> +   channel = EDMA_CHAN_SLOT(channel);
>>> +   mask = BIT(channel & 0x1f);
>>> +
>>> +   j = channel >> 5;
>>> +
>>> +   /* EDMA channels without event association */
>>
>> Maybe actually check for no-event association before you trigger in
>> software? You can do that by looking at the unused channel list, no?
> 
> But, we want to trigger whether there is an event association or not in
> this function. For example, MMC has an event associated, but this function
> is still used to trigger an event for it.

Okay, just drop the misleading comment then.

Regards,
Sekhar


Re: [PATCH 3/9] ARM: edma: Add function to manually trigger an EDMA channel

2013-07-30 Thread Joel Fernandes
On 07/30/2013 12:18 AM, Sekhar Nori wrote:
> On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote:
>> Manual trigger for events missed as a result of splitting a
>> scatter gather list and DMA'ing it in batches. Add a helper
>> function to trigger a channel in case any such events are missed.
>>
>> Signed-off-by: Joel Fernandes 
>> ---
>>  arch/arm/common/edma.c |   21 +
>>  include/linux/platform_data/edma.h |2 ++
>>  2 files changed, 23 insertions(+)
>>
>> diff --git a/arch/arm/common/edma.c b/arch/arm/common/edma.c
>> index 3567ba1..10995b2 100644
>> --- a/arch/arm/common/edma.c
>> +++ b/arch/arm/common/edma.c
>> @@ -1236,6 +1236,27 @@ void edma_resume(unsigned channel)
>>  }
>>  EXPORT_SYMBOL(edma_resume);
>>  
>> +int edma_manual_trigger(unsigned channel)
> 
> edma_trigger_channel() maybe? Brings consistency with
> edma_alloc_channel() edma_free_channel() etc.

Ok, sure.

> 
>> +{
>> +   unsigned ctlr;
>> +   int j;
>> +   unsigned int mask;
>> +
>> +   ctlr = EDMA_CTLR(channel);
>> +   channel = EDMA_CHAN_SLOT(channel);
>> +   mask = BIT(channel & 0x1f);
>> +
>> +   j = channel >> 5;
>> +
>> +   /* EDMA channels without event association */
> 
> Maybe actually check for no-event association before you trigger in
> software? You can do that by looking at unused channel list, no?

But, we want to trigger whether there is event association or not in
this function. For example, MMC has an event associated, but this
function is still used to trigger events for it.

> 
>> +   edma_shadow0_write_array(ctlr, SH_ESR, j, mask);
> 
> edma_shadow0_write_array(ctlr, SH_ESR, channel >> 5, mask) is no less
> readable, but I leave it to you.

Sure, that's more readable, will change it to that.

Thanks,

-Joel
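
Folding both review comments together (the rename and the inlined shift), the helper might end up roughly like the sketch below. This is only an illustration based on the patch quoted above, not the final merged code:

```c
/* Sketch: edma_manual_trigger() renamed per review, with the
 * channel >> 5 shift folded directly into the shadow-region write. */
int edma_trigger_channel(unsigned channel)
{
	unsigned ctlr;
	unsigned int mask;

	ctlr = EDMA_CTLR(channel);
	channel = EDMA_CHAN_SLOT(channel);
	mask = BIT(channel & 0x1f);

	/* Set the event in the shadow region 0 Event Set Register */
	edma_shadow0_write_array(ctlr, SH_ESR, (channel >> 5), mask);
	return 0;
}
EXPORT_SYMBOL(edma_trigger_channel);
```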



Re: [PATCH 4/9] dma: edma: Find missed events and issue them

2013-07-30 Thread Joel Fernandes
Hi Sekhar,

On 07/30/2013 02:05 AM, Sekhar Nori wrote:
> On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote:
>> In an effort to move to using Scatter gather lists of any size with
>> EDMA as discussed at [1] instead of placing limitations on the driver,
>> we work through the limitations of the EDMAC hardware to find missed
>> events and issue them.
>>
>> The sequence of events that require this are:
>>
>> For the scenario where MAX slots for an EDMA channel is 3:
>>
>> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> Null
>>
>> The above SG list will have to be DMA'd in 2 sets:
>>
>> (1) SG1 -> SG2 -> SG3 -> Null
>> (2) SG4 -> SG5 -> SG6 -> Null
>>
>> After (1) is successfully transferred, the events from the MMC controller
>> do not stop coming and are missed by the time we have set up the transfer
>> for (2). So here, we catch the missed events as an error condition and
>> issue them manually.
> 
> Are you sure there won't be any effect of these missed events on the
> peripheral side. For example, won't McASP get into an underrun condition
> when it encounters a null PaRAM set? Even UART has to transmit to a

But it will not encounter null PaRAM set because McASP uses contiguous
buffers for transfer which are not scattered across physical memory.
This can be accomplished with an SG of size 1. For such SGs, this patch
series leaves them linked to the Dummy set and does not link to the Null
set. The Null set is only used for SG lists that are > MAX_NR_SG in size,
such as those created for example by MMC and Crypto.

> particular baud so I guess it cannot wait the way MMC/SD can.

Existing drivers have to wait anyway if they hit the MAX SG limit today. If
they don't want to wait, they would have allocated a contiguous block of
memory and DMA'd that in one stretch so they don't lose any events, and in
such cases we are not linking to Null.

> Also, won't this lead to under-utilization of the peripheral bandwidth?
> Meaning, MMC/SD is ready with data but cannot transfer because the DMA
> is waiting to be set-up.

But it is waiting anyway even today. Currently based on MAX segs, MMC
driver/subsystem will make SG list of size max_segs. Between these
sessions of creating such smaller SG-lists, if for some reason the MMC
controller is sending events, these will be lost anyway.

What will happen now with this patch series is we are simply accepting a
bigger list than this, and handling all the max_segs stuff within the
EDMA driver itself without outside world knowing. This is actually more
efficient as for long transfers, we are not going back and forth much
between the client and EDMA driver.

> Did you consider a ping-pong scheme with say three PaRAM sets per
> channel? That way you can keep a continuous transfer going on from the
> peripheral over the complete SG list.

Do you mean ping-pong scheme as used in the davinci-pcm driver today?
This can be used only for buffers that are contiguous in memory, not
those that are scattered across memory.

Thanks,

-Joel


Re: [PATCH 7/9] ARM: edma: Don't clear EMR of channel in edma_stop

2013-07-30 Thread Joel Fernandes
On 07/30/2013 03:29 AM, Sekhar Nori wrote:
> On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote:
>> We certainly don't want error conditions to be cleared anywhere
> 
> 'anywhere' is a really loaded term.
> 
>> as this will make us 'forget' about missed events. We depend on
>> knowing which events were missed in order to be able to reissue them.
> 
>> This fixes a race condition where the EMR was being cleared
>> by the transfer completion interrupt handler.
>>
>> Basically, what was happening was:
>>
>> Missed event
>>  |
>>  |
>>  V
>> SG1-SG2-SG3-Null
>>  \
>>   \__TC Interrupt (Almost same time as ARM is executing
>> TC interrupt handler, an event got missed and also forgotten
>> by clearing the EMR).
> 
> Sorry, but I dont see how edma_stop() is coming into picture in the race
> you describe?

In edma_callback function, for the case of DMA_COMPLETE (Transfer
completion interrupt), edma_stop() is called when all sets have been
processed. This had the effect of clearing the EMR.

This has 2 problems:

1.
If an error interrupt is also pending when the TC interrupt clears the EMR:

the ARM will enter the error interrupt handler even though the EMR is
clear. As a result, the following if condition in dma_ccerr_handler
will be true and IRQ_NONE is returned.

if ((edma_read_array(ctlr, EDMA_EMR, 0) == 0) &&
(edma_read_array(ctlr, EDMA_EMR, 1) == 0) &&
(edma_read(ctlr, EDMA_QEMR) == 0) &&
(edma_read(ctlr, EDMA_CCERR) == 0))
return IRQ_NONE;

If this happens enough times, the IRQ subsystem disables the
interrupt thinking it's spurious, which creates serious problems.

2.
If the above if statement condition is removed, then the EMR is 0, so the
callback function will not be called in dma_ccerr_handler. Thus the event
is forgotten: it is never triggered manually and never sets the missed
flag of the channel.

So about the race: the TC interrupt handler executing before the error
interrupt handler can clear the EMR, which creates these problems.

>> The EMR is ultimately being cleared by the Error interrupt
>> handler once it is handled so we don't have to do it in edma_stop.
> 
> This, I agree with. edma_clean_channel() is also there to re-initialize the
> channel, so doing it in edma_stop() certainly seems superfluous.

Sure.

Thanks,

-Joel
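
Concretely, the agreed fix amounts to dropping only the EMR clearing from edma_stop() while keeping the event/transfer disables. A sketch of the resulting shape is below; the surrounding register writes follow the 3.x-era edma.c code and may not match the final patch exactly:

```c
/* Sketch only: edma_stop() with the missed-event record preserved. */
void edma_stop(unsigned channel)
{
	unsigned ctlr = EDMA_CTLR(channel);

	channel = EDMA_CHAN_SLOT(channel);
	if (channel < edma_cc[ctlr]->num_channels) {
		int j = channel >> 5;
		unsigned int mask = BIT(channel & 0x1f);

		edma_shadow0_write_array(ctlr, SH_EECR, j, mask);
		edma_shadow0_write_array(ctlr, SH_ECR, j, mask);
		edma_shadow0_write_array(ctlr, SH_SECR, j, mask);
		/*
		 * EMR is deliberately not cleared here: the error
		 * interrupt handler reads it to find missed events
		 * and clears it once they have been re-issued.
		 */
	}
}
```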


Re: [PATCH v2] ARM: EDMA: Fix clearing of unused list for DT DMA resources

2013-07-30 Thread Joel Fernandes
On 07/30/2013 11:29 AM, Sekhar Nori wrote:
> On 7/30/2013 9:17 AM, Joel Fernandes wrote:
> 
 diff --git a/arch/arm/common/edma.c b/arch/arm/common/edma.c
 index a432e6c..765d578 100644
 --- a/arch/arm/common/edma.c
 +++ b/arch/arm/common/edma.c
> 
 +  } else {
 +  for (; i < pdev->num_resources; i++) {
 +  if ((pdev->resource[i].flags & IORESOURCE_DMA) &&
 +  (int)pdev->resource[i].start >= 0) {
 +  ctlr = EDMA_CTLR(pdev->resource[i].start);
 +  clear_bit(EDMA_CHAN_SLOT(
 +pdev->resource[i].start),
 +edma_cc[ctlr]->edma_unused);
 +  }
>>>
>>> So there is very little in common between OF and non-OF versions of this
>>> function. Why not have two different versions of this function for the
>>> two cases? The OF version can reside under the CONFIG_OF conditional
>>> already in use in the file. This will also save you the ugly line breaks
>>> you had to resort to due to too deep indentation.
>>
>> Actually those line breaks are not necessary and wouldn't result in
>> compilation errors. I was planning to drop them. I'll go ahead and split
>> it out anyway, now that also the OF version of the function is going to
>> be bit longer if we use the of_parse functions.
>>
>> Thanks for your review,
> 
> It turns out I gave a bad idea. What I suggested will break the case of
> non-DT boot with CONFIG_OF enabled. So what you had was fine. Maybe
> just return from "if (dev->of_node)" so you don't need to have an else
> block and can save on the indentation.

Ok, sure. I will go ahead and return from the if block.

Thanks,

-Joel




Re: [PATCH 16/18] mm, hugetlb: return a reserved page to a reserved pool if failed

2013-07-30 Thread Joonsoo Kim
On Mon, Jul 29, 2013 at 04:19:10PM -0400, Naoya Horiguchi wrote:
> On Mon, Jul 29, 2013 at 02:32:07PM +0900, Joonsoo Kim wrote:
> > If we fail with a reserved page, just calling put_page() is not sufficient,
> > because put_page() invokes free_huge_page() as its last step and that doesn't
> > know whether the page comes from a reserved pool or not. So it doesn't do
> > anything related to the reserve count. This makes the reserve count lower
> > than it should be, because the reserve count was already decreased in
> > dequeue_huge_page_vma(). This patch fixes this situation.
> 
> I think we could use a page flag (for example PG_reserve) on a hugepage
> in order to record that the hugepage comes from the reserved pool.
> Furthermore, the reserve flag would be set when dequeueing a free hugepage,
> and cleared when hugepage_fault returns, whether it fails or not.
> I think it's simpler than put_page variant approach, but doesn't it work
> to solve your problem?

Yes. That's good idea.
I thought of this idea before, but didn't implement it that way, because
I was worried that it might make the patchset larger and more complex. But
implementing it that way may be better.

Thanks.

> 
> Thanks,
> Naoya Horiguchi
> 
> > Signed-off-by: Joonsoo Kim 
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bb8a45f..6a9ec69 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -649,6 +649,34 @@ struct hstate *size_to_hstate(unsigned long size)
> > return NULL;
> >  }
> >  
> > +static void put_huge_page(struct page *page, int use_reserve)
> > +{
> > +   struct hstate *h = page_hstate(page);
> > +   struct hugepage_subpool *spool =
> > +   (struct hugepage_subpool *)page_private(page);
> > +
> > +   if (!use_reserve) {
> > +   put_page(page);
> > +   return;
> > +   }
> > +
> > +   if (!put_page_testzero(page))
> > +   return;
> > +
> > +   set_page_private(page, 0);
> > +   page->mapping = NULL;
> > +   BUG_ON(page_count(page));
> > +   BUG_ON(page_mapcount(page));
> > +
> > +   spin_lock(&hugetlb_lock);
> > +   hugetlb_cgroup_uncharge_page(hstate_index(h),
> > +pages_per_huge_page(h), page);
> > +   enqueue_huge_page(h, page);
> > +   h->resv_huge_pages++;
> > +   spin_unlock(&hugetlb_lock);
> > +   hugepage_subpool_put_pages(spool, 1);
> > +}
> > +
> >  static void free_huge_page(struct page *page)
> >  {
> > /*
> > @@ -2625,7 +2653,7 @@ retry_avoidcopy:
> > spin_unlock(&mm->page_table_lock);
> > mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> >  
> > -   page_cache_release(new_page);
> > +   put_huge_page(new_page, use_reserve);
> >  out_old_page:
> > page_cache_release(old_page);
> >  out_lock:
> > @@ -2725,7 +2753,7 @@ retry:
> >  
> > err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
> > if (err) {
> > -   put_page(page);
> > +   put_huge_page(page, use_reserve);
> > if (err == -EEXIST)
> > goto retry;
> > goto out;
> > @@ -2798,7 +2826,7 @@ backout:
> > spin_unlock(&mm->page_table_lock);
> >  backout_unlocked:
> > unlock_page(page);
> > -   put_page(page);
> > +   put_huge_page(page, use_reserve);
> > goto out;
> >  }
> >  
> > -- 
> > 1.7.9.5
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majord...@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: em...@kvack.org
> >
> 


Re: [REGRESSION/PATCH] acpi: blacklist win8 OSI for ASUS Zenbook Prime UX31A

2013-07-30 Thread Matthew Garrett
On Tue, Jul 30, 2013 at 07:11:06PM -0500, Felipe Contreras wrote:

> No, the ACPI driver is exposing a backlight interface, which has a
> defined stable API.

From the ACPI spec:

'The OEM may define the number 0 as "Zero brightness" that can mean to 
turn off the lighting (e.g. LCD panel backlight) in the device.'

There's no mechanism for an OS to know whether or not a firmware 
implementation will actually turn the backlight off at 0, so there's no 
way the OS can define the lowest backlight state as anything other than 
"May or may not turn the screen off". If your userspace application 
depends on specific numbers having specific meanings, your userspace 
application is broken. Don't ascribe meanings to arbitrary values.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [PATCH 15/18] mm, hugetlb: move up anon_vma_prepare()

2013-07-30 Thread Joonsoo Kim
On Mon, Jul 29, 2013 at 03:19:15PM -0400, Naoya Horiguchi wrote:
> On Mon, Jul 29, 2013 at 03:05:37PM -0400, Naoya Horiguchi wrote:
> > On Mon, Jul 29, 2013 at 02:32:06PM +0900, Joonsoo Kim wrote:
> > > If we fail with an allocated hugepage, it is hard to recover properly.
> > > One such example is the reserve count. We don't have any method to recover
> > > the reserve count. Although I will introduce a function to recover the
> > > reserve count in a following patch, it is better to avoid allocating a
> > > hugepage where possible. So move up anon_vma_prepare(), which can fail
> > > in an OOM situation.
> > > 
> > > Signed-off-by: Joonsoo Kim 
> > 
> > Reviewed-by: Naoya Horiguchi 
> 
> Sorry, let me suspend this Reviewed for a question.
> If alloc_huge_page failed after anon_vma_prepare succeeded,
> are the allocated anon_vma_chain and/or anon_vma safely freed?
> Or don't we have to free them?

Yes, they will be freed by free_pgtables() and then unlink_anon_vmas()
when a task terminates. So, we don't have to free them.

Thanks.

> 
> Thanks,
> Naoya Horiguchi
> 
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 683fd38..bb8a45f 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -2536,6 +2536,15 @@ retry_avoidcopy:
> > >   /* Drop page_table_lock as buddy allocator may be called */
> > >   spin_unlock(&mm->page_table_lock);
> > >  
> > > + /*
> > > +  * When the original hugepage is shared one, it does not have
> > > +  * anon_vma prepared.
> > > +  */
> > > + if (unlikely(anon_vma_prepare(vma))) {
> > > + ret = VM_FAULT_OOM;
> > > + goto out_old_page;
> > > + }
> > > +
> > >   use_reserve = vma_has_reserves(h, vma, address);
> > >   if (use_reserve == -ENOMEM) {
> > >   ret = VM_FAULT_OOM;
> > > @@ -2590,15 +2599,6 @@ retry_avoidcopy:
> > >   goto out_lock;
> > >   }
> > >  
> > > - /*
> > > -  * When the original hugepage is shared one, it does not have
> > > -  * anon_vma prepared.
> > > -  */
> > > - if (unlikely(anon_vma_prepare(vma))) {
> > > - ret = VM_FAULT_OOM;
> > > - goto out_new_page;
> > > - }
> > > -
> > >   copy_user_huge_page(new_page, old_page, address, vma,
> > >   pages_per_huge_page(h));
> > >   __SetPageUptodate(new_page);
> > > @@ -2625,7 +2625,6 @@ retry_avoidcopy:
> > >   spin_unlock(&mm->page_table_lock);
> > >   mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > >  
> > > -out_new_page:
> > >   page_cache_release(new_page);
> > >  out_old_page:
> > >   page_cache_release(old_page);
> > > -- 
> > > 1.7.9.5
> > > 
> > 
> 


Re: [PATCH 11/18] mm, hugetlb: move down outside_reserve check

2013-07-30 Thread Joonsoo Kim
On Mon, Jul 29, 2013 at 02:39:30PM -0400, Naoya Horiguchi wrote:
> On Mon, Jul 29, 2013 at 02:32:02PM +0900, Joonsoo Kim wrote:
> > Just move down the outside_reserve check.
> > This makes the code more readable.
> > 
> > There is no functional change.
> 
> Why don't you do this in 10/18?

Just to help review :)
A step-by-step approach may help review, so I decided to separate it.
If you don't want that, I will merge them in the next spin.

Thanks.

> 
> Thanks,
> Naoya Horiguchi
> 
> > Signed-off-by: Joonsoo Kim 
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 5f31ca5..94173e0 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2530,20 +2530,6 @@ retry_avoidcopy:
> > return 0;
> > }
> >  
> > -   /*
> > -* If the process that created a MAP_PRIVATE mapping is about to
> > -* perform a COW due to a shared page count, attempt to satisfy
> > -* the allocation without using the existing reserves. The pagecache
> > -* page is used to determine if the reserve at this address was
> > -* consumed or not. If reserves were used, a partial faulted mapping
> > -* at the time of fork() could consume its reserves on COW instead
> > -* of the full address range.
> > -*/
> > -   if (!(vma->vm_flags & VM_MAYSHARE) &&
> > -   is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > -   old_page != pagecache_page)
> > -   outside_reserve = 1;
> > -
> > page_cache_get(old_page);
> >  
> > /* Drop page_table_lock as buddy allocator may be called */
> > @@ -2557,6 +2543,20 @@ retry_avoidcopy:
> > spin_lock(&mm->page_table_lock);
> > return VM_FAULT_OOM;
> > }
> > +
> > +   /*
> > +* If the process that created a MAP_PRIVATE mapping is about to
> > +* perform a COW due to a shared page count, attempt to satisfy
> > +* the allocation without using the existing reserves. The pagecache
> > +* page is used to determine if the reserve at this address was
> > +* consumed or not. If reserves were used, a partial faulted mapping
> > +* at the time of fork() could consume its reserves on COW instead
> > +* of the full address range.
> > +*/
> > +   if (!(vma->vm_flags & VM_MAYSHARE) &&
> > +   is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > +   old_page != pagecache_page)
> > +   outside_reserve = 1;
> > use_reserve = use_reserve && !outside_reserve;
> >  
> > new_page = alloc_huge_page(vma, address, use_reserve);
> > -- 
> > 1.7.9.5
> > 
> 


Re: [PATCH 10/18] mm, hugetlb: call vma_has_reserve() before entering alloc_huge_page()

2013-07-30 Thread Joonsoo Kim
On Mon, Jul 29, 2013 at 02:27:54PM -0400, Naoya Horiguchi wrote:
> On Mon, Jul 29, 2013 at 02:32:01PM +0900, Joonsoo Kim wrote:
> > To implement graceful failure handling, we need to know at a higher
> > level whether an allocation request is for the reserved pool or not.
> > In this patch, we just move vma_has_reserves() up to the caller function
> > in order to know it. There is no functional change.
> > 
> > Following patches implement graceful failure handling and remove
> > the hugetlb_instantiation_mutex.
> > 
> > Signed-off-by: Joonsoo Kim 
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a66226e..5f31ca5 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1123,12 +1123,12 @@ static void vma_commit_reservation(struct hstate *h,
> >  }
> >  
> >  static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > -   unsigned long addr, int avoid_reserve)
> > +   unsigned long addr, int use_reserve)
> >  {
> > struct hugepage_subpool *spool = subpool_vma(vma);
> > struct hstate *h = hstate_vma(vma);
> > struct page *page;
> > -   int ret, idx, use_reserve;
> > +   int ret, idx;
> > struct hugetlb_cgroup *h_cg;
> >  
> > idx = hstate_index(h);
> > @@ -1140,11 +1140,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> >  * need pages and subpool limit allocated allocated if no reserve
> >  * mapping overlaps.
> >  */
> > -   use_reserve = vma_has_reserves(h, vma, addr);
> > -   if (use_reserve < 0)
> > -   return ERR_PTR(-ENOMEM);
> > -
> > -   use_reserve = use_reserve && !avoid_reserve;
> > if (!use_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
> > return ERR_PTR(-ENOSPC);
> >  
> > @@ -2520,7 +2515,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> >  {
> > struct hstate *h = hstate_vma(vma);
> > struct page *old_page, *new_page;
> > -   int outside_reserve = 0;
> > +   int use_reserve, outside_reserve = 0;
> > unsigned long mmun_start;   /* For mmu_notifiers */
> > unsigned long mmun_end; /* For mmu_notifiers */
> >  
> > @@ -2553,7 +2548,18 @@ retry_avoidcopy:
> >  
> > /* Drop page_table_lock as buddy allocator may be called */
> > spin_unlock(&mm->page_table_lock);
> > -   new_page = alloc_huge_page(vma, address, outside_reserve);
> > +
> > +   use_reserve = vma_has_reserves(h, vma, address);
> > +   if (use_reserve == -ENOMEM) {
> > +   page_cache_release(old_page);
> > +
> > +   /* Caller expects lock to be held */
> > +   spin_lock(&mm->page_table_lock);
> > +   return VM_FAULT_OOM;
> > +   }
> > +   use_reserve = use_reserve && !outside_reserve;
> 
> When outside_reserve is true, we don't have to call vma_has_reserves
> because then use_reserve is always false. So something like:
> 
>   use_reserve = 0;
>   if (!outside_reserve) {
>   use_reserve = vma_has_reserves(...);
>   ...
>   }
> 
> looks better to me.
> Or if you expect vma_has_reserves to change resv_map implicitly,
> could you add a comment about it.

Yes, you are right.
I will change it.

Thanks.

> 
> Thanks,
> Naoya Horiguchi
> 
> > +
> > +   new_page = alloc_huge_page(vma, address, use_reserve);
> >  
> > if (IS_ERR(new_page)) {
> > long err = PTR_ERR(new_page);
> > @@ -2679,6 +2685,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > struct page *page;
> > struct address_space *mapping;
> > pte_t new_pte;
> > +   int use_reserve = 0;
> >  
> > /*
> >  * Currently, we are forced to kill the process in the event the
> > @@ -2704,7 +2711,14 @@ retry:
> > size = i_size_read(mapping->host) >> huge_page_shift(h);
> > if (idx >= size)
> > goto out;
> > -   page = alloc_huge_page(vma, address, 0);
> > +
> > +   use_reserve = vma_has_reserves(h, vma, address);
> > +   if (use_reserve == -ENOMEM) {
> > +   ret = VM_FAULT_OOM;
> > +   goto out;
> > +   }
> > +
> > +   page = alloc_huge_page(vma, address, use_reserve);
> > if (IS_ERR(page)) {
> > ret = PTR_ERR(page);
> > if (ret == -ENOMEM)
> > -- 
> > 1.7.9.5
> >
> 


Re: [PATCH 08/18] mm, hugetlb: do hugepage_subpool_get_pages() when avoid_reserve

2013-07-30 Thread Joonsoo Kim
On Mon, Jul 29, 2013 at 02:05:51PM -0400, Naoya Horiguchi wrote:
> On Mon, Jul 29, 2013 at 02:31:59PM +0900, Joonsoo Kim wrote:
> > When we try to get a huge page with avoid_reserve, we don't consume
> > a reserved page. So it is treated like the non-reserve case.
> 
> This patch will be completely overwritten with 9/18.
> So is this patch necessary?

Yes. This is a bug fix, so it should be separate.
When we try to allocate with avoid_reserve, we don't use the reserved page
pool. So, hugepage_subpool_get_pages() should be called, and we should
return if it fails.

If we merge these into one, we cannot tell that a bug exists.

Thanks.

> 
> Naoya Horiguchi
> 
> > Signed-off-by: Joonsoo Kim 
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 1426c03..749629e 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1149,12 +1149,13 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > if (has_reserve < 0)
> > return ERR_PTR(-ENOMEM);
> >  
> > -   if (!has_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
> > +   if ((!has_reserve || avoid_reserve)
> > +   && (hugepage_subpool_get_pages(spool, 1) < 0))
> > return ERR_PTR(-ENOSPC);
> >  
> > ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
> > if (ret) {
> > -   if (!has_reserve)
> > +   if (!has_reserve || avoid_reserve)
> > hugepage_subpool_put_pages(spool, 1);
> > return ERR_PTR(-ENOSPC);
> > }
> > @@ -1167,7 +1168,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > hugetlb_cgroup_uncharge_cgroup(idx,
> >pages_per_huge_page(h),
> >h_cg);
> > -   if (!has_reserve)
> > +   if (!has_reserve || avoid_reserve)
> > hugepage_subpool_put_pages(spool, 1);
> > return ERR_PTR(-ENOSPC);
> > }
> > -- 
> > 1.7.9.5
> >
> 


[PATCH 3/3] selftests: Add test of PMU instruction counting on powerpc

2013-07-30 Thread Michael Ellerman
This commit adds a test of instruction counting using the PMU on powerpc.

Although the bulk of the code is architecture agnostic, the code needs to
run a precisely sized loop which is implemented in assembler.

Signed-off-by: Michael Ellerman 
---
 tools/testing/selftests/powerpc/Makefile   |   2 +-
 tools/testing/selftests/powerpc/pmu/Makefile   |  23 
 .../selftests/powerpc/pmu/count_instructions.c | 135 +
 tools/testing/selftests/powerpc/pmu/event.c| 105 
 tools/testing/selftests/powerpc/pmu/event.h|  39 ++
 tools/testing/selftests/powerpc/pmu/loop.S |  46 +++
 6 files changed, 349 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/pmu/Makefile
 create mode 100644 tools/testing/selftests/powerpc/pmu/count_instructions.c
 create mode 100644 tools/testing/selftests/powerpc/pmu/event.c
 create mode 100644 tools/testing/selftests/powerpc/pmu/event.h
 create mode 100644 tools/testing/selftests/powerpc/pmu/loop.S

diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile
index bade865..7cb4744 100644
--- a/tools/testing/selftests/powerpc/Makefile
+++ b/tools/testing/selftests/powerpc/Makefile
@@ -13,7 +13,7 @@ CFLAGS := -Wall -O2 -flto -Wall -Werror 
-DGIT_VERSION='"$(GIT_VERSION)"' -I$(CUR
 
 export CC CFLAGS
 
-TARGETS =
+TARGETS = pmu
 
 endif
 
diff --git a/tools/testing/selftests/powerpc/pmu/Makefile b/tools/testing/selftests/powerpc/pmu/Makefile
new file mode 100644
index 000..7216f00
--- /dev/null
+++ b/tools/testing/selftests/powerpc/pmu/Makefile
@@ -0,0 +1,23 @@
+noarg:
+   $(MAKE) -C ../
+
+PROGS := count_instructions
+EXTRA_SOURCES := ../harness.c event.c
+
+all: $(PROGS)
+
+$(PROGS): $(EXTRA_SOURCES)
+
+# loop.S can only be built 64-bit
+count_instructions: loop.S count_instructions.c $(EXTRA_SOURCES)
+   $(CC) $(CFLAGS) -m64 -o $@ $^
+
+run_tests: all
+   @-for PROG in $(PROGS); do \
+   ./$$PROG; \
+   done;
+
+clean:
+   rm -f $(PROGS) loop.o
+
+.PHONY: all run_tests clean
diff --git a/tools/testing/selftests/powerpc/pmu/count_instructions.c b/tools/testing/selftests/powerpc/pmu/count_instructions.c
new file mode 100644
index 000..312b4f0
--- /dev/null
+++ b/tools/testing/selftests/powerpc/pmu/count_instructions.c
@@ -0,0 +1,135 @@
+/*
+ * Copyright 2013, Michael Ellerman, IBM Corp.
+ * Licensed under GPLv2.
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+
+#include "event.h"
+#include "utils.h"
+
+extern void thirty_two_instruction_loop(u64 loops);
+
+static void setup_event(struct event *e, u64 config, char *name)
+{
+   event_init_opts(e, config, PERF_TYPE_HARDWARE, name);
+
+   e->attr.disabled = 1;
+   e->attr.exclude_kernel = 1;
+   e->attr.exclude_hv = 1;
+   e->attr.exclude_idle = 1;
+}
+
+static int do_count_loop(struct event *events, u64 instructions,
+u64 overhead, bool report)
+{
+   s64 difference, expected;
+   double percentage;
+
+   prctl(PR_TASK_PERF_EVENTS_ENABLE);
+
+   /* Run for 1M instructions */
+   thirty_two_instruction_loop(instructions >> 5);
+
+   prctl(PR_TASK_PERF_EVENTS_DISABLE);
+
+   event_read(&events[0]);
+   event_read(&events[1]);
+
+   expected = instructions + overhead;
+   difference = events[0].result.value - expected;
+   percentage = (double)difference / events[0].result.value * 100;
+
+   if (report) {
+   event_report(&events[0]);
+   event_report(&events[1]);
+
+   printf("Looped for %llu instructions, overhead %llu\n", instructions, overhead);
+   printf("Expected %llu\n", expected);
+   printf("Actual   %llu\n", events[0].result.value);
+   printf("Delta%lld, %f%%\n", difference, percentage);
+   }
+
+   event_reset([0]);
+   event_reset([1]);
+
+   if (difference < 0)
+   difference = -difference;
+
+   /* Tolerate a difference below 0.0001 % */
+   difference *= 10000 * 100;
+   if (difference / events[0].result.value)
+   return -1;
+
+   return 0;
+}
+
+/* Count how many instructions it takes to do a null loop */
+static u64 determine_overhead(struct event *events)
+{
+   u64 current, overhead;
+   int i;
+
+   do_count_loop(events, 0, 0, false);
+   overhead = events[0].result.value;
+
+   for (i = 0; i < 100; i++) {
+   do_count_loop(events, 0, 0, false);
+   current = events[0].result.value;
+   if (current < overhead) {
+   printf("Replacing overhead %llu with %llu\n",
+  overhead, current);
+   overhead = current;
+   }
+   }
+
+   return overhead;
+}
+
+static int count_instructions(void)
+{
+   struct event events[2];
+   u64 overhead;
+
+   setup_event(&events[0], PERF_COUNT_HW_INSTRUCTIONS, 

[PATCH 2/3] selftests: Add support files for powerpc tests

2013-07-30 Thread Michael Ellerman
This commit adds support code used by upcoming powerpc tests.

Signed-off-by: Michael Ellerman 
---
 tools/testing/selftests/powerpc/harness.c | 83 +++
 tools/testing/selftests/powerpc/subunit.h | 47 +
 tools/testing/selftests/powerpc/utils.h   | 34 +
 3 files changed, 164 insertions(+)
 create mode 100644 tools/testing/selftests/powerpc/harness.c
 create mode 100644 tools/testing/selftests/powerpc/subunit.h
 create mode 100644 tools/testing/selftests/powerpc/utils.h

diff --git a/tools/testing/selftests/powerpc/harness.c 
b/tools/testing/selftests/powerpc/harness.c
new file mode 100644
index 000..ad73a34
--- /dev/null
+++ b/tools/testing/selftests/powerpc/harness.c
@@ -0,0 +1,83 @@
+/*
+ * Copyright 2013, Michael Ellerman, IBM Corp.
+ * Licensed under GPLv2.
+ */
+
+#include <errno.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "subunit.h"
+#include "utils.h"
+
+#define TIMEOUT		120
+#define KILL_TIMEOUT	5
+
+
+int run_test(int (test_function)(void), char *name)
+{
+   bool terminated;
+   int rc, status;
+   pid_t pid;
+
+   /* Make sure output is flushed before forking */
+   fflush(stdout);
+
+   pid = fork();
+   if (pid == 0) {
+   exit(test_function());
+   } else if (pid == -1) {
+   perror("fork");
+   return 1;
+   }
+
+   /* Wake us up in timeout seconds */
+   alarm(TIMEOUT);
+   terminated = false;
+
+wait:
+   rc = waitpid(pid, &status, 0);
+   if (rc == -1) {
+   if (errno != EINTR) {
+   printf("unknown error from waitpid\n");
+   return 1;
+   }
+
+   if (terminated) {
+   printf("!! force killing %s\n", name);
+   kill(pid, SIGKILL);
+   return 1;
+   } else {
+   printf("!! killing %s\n", name);
+   kill(pid, SIGTERM);
+   terminated = true;
+   alarm(KILL_TIMEOUT);
+   goto wait;
+   }
+   }
+
+   if (WIFEXITED(status))
+   status = WEXITSTATUS(status);
+   else
+   status = 1; /* Signal or other */
+
+   return status;
+}
+
+int test_harness(int (test_function)(void), char *name)
+{
+   int rc;
+
+   test_start(name);
+   test_set_git_version(GIT_VERSION);
+
+   rc = run_test(test_function, name);
+
+   test_finish(name, rc);
+
+   return rc;
+}
diff --git a/tools/testing/selftests/powerpc/subunit.h 
b/tools/testing/selftests/powerpc/subunit.h
new file mode 100644
index 000..98a2292
--- /dev/null
+++ b/tools/testing/selftests/powerpc/subunit.h
@@ -0,0 +1,47 @@
+/*
+ * Copyright 2013, Michael Ellerman, IBM Corp.
+ * Licensed under GPLv2.
+ */
+
+#ifndef _SELFTESTS_POWERPC_SUBUNIT_H
+#define _SELFTESTS_POWERPC_SUBUNIT_H
+
+static inline void test_start(char *name)
+{
+   printf("test: %s\n", name);
+}
+
+static inline void test_failure_detail(char *name, char *detail)
+{
+   printf("failure: %s [%s]\n", name, detail);
+}
+
+static inline void test_failure(char *name)
+{
+   printf("failure: %s\n", name);
+}
+
+static inline void test_error(char *name)
+{
+   printf("error: %s\n", name);
+}
+
+static inline void test_success(char *name)
+{
+   printf("success: %s\n", name);
+}
+
+static inline void test_finish(char *name, int status)
+{
+   if (status)
+   test_failure(name);
+   else
+   test_success(name);
+}
+
+static inline void test_set_git_version(char *value)
+{
+   printf("tags: git_version:%s\n", value);
+}
+
+#endif /* _SELFTESTS_POWERPC_SUBUNIT_H */
diff --git a/tools/testing/selftests/powerpc/utils.h 
b/tools/testing/selftests/powerpc/utils.h
new file mode 100644
index 000..5851c4b
--- /dev/null
+++ b/tools/testing/selftests/powerpc/utils.h
@@ -0,0 +1,34 @@
+/*
+ * Copyright 2013, Michael Ellerman, IBM Corp.
+ * Licensed under GPLv2.
+ */
+
+#ifndef _SELFTESTS_POWERPC_UTILS_H
+#define _SELFTESTS_POWERPC_UTILS_H
+
+#include <stdint.h>
+#include <stdio.h>
+
+/* Avoid headaches with PRI?64 - just use %ll? always */
+typedef unsigned long long u64;
+typedef   signed long long s64;
+
+/* Just for familiarity */
+typedef uint32_t u32;
+typedef uint8_t u8;
+
+
+int test_harness(int (test_function)(void), char *name);
+
+
+/* Yes, this is evil */
+#define FAIL_IF(x) \
+do {   \
+   if ((x)) {  \
+   fprintf(stderr, \
+   "[FAIL] Test FAILED on line %d\n", __LINE__);   \
+   return 1;   \
+   }   \
+} while (0)
+
+#endif /* _SELFTESTS_POWERPC_UTILS_H */

[PATCH 1/3] selftests: Add infrastructure for powerpc selftests

2013-07-30 Thread Michael Ellerman
This commit adds a powerpc subdirectory to tools/testing/selftests,
for tests that are powerpc specific.

On other architectures nothing is built. The makefile supports cross
compilation if the user sets ARCH and CROSS_COMPILE.

Signed-off-by: Michael Ellerman 
---
 tools/testing/selftests/Makefile |  1 +
 tools/testing/selftests/powerpc/Makefile | 35 
 2 files changed, 36 insertions(+)
 create mode 100644 tools/testing/selftests/powerpc/Makefile

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 4cb14ca..9f3eae2 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -8,6 +8,7 @@ TARGETS += net
 TARGETS += ptrace
 TARGETS += timers
 TARGETS += vm
+TARGETS += powerpc
 
 all:
for TARGET in $(TARGETS); do \
diff --git a/tools/testing/selftests/powerpc/Makefile 
b/tools/testing/selftests/powerpc/Makefile
new file mode 100644
index 000..bade865
--- /dev/null
+++ b/tools/testing/selftests/powerpc/Makefile
@@ -0,0 +1,35 @@
+# Makefile for powerpc selftests
+
+# ARCH can be overridden by the user for cross compiling
+ARCH ?= $(shell uname -m)
+ARCH := $(shell echo $(ARCH) | sed -e s/ppc.*/powerpc/)
+
+ifeq ($(ARCH),powerpc)
+
+GIT_VERSION = $(shell git describe --always --long --dirty || echo "unknown")
+
+CC := $(CROSS_COMPILE)$(CC)
+CFLAGS := -Wall -O2 -flto -Wall -Werror -DGIT_VERSION='"$(GIT_VERSION)"' -I$(CURDIR) $(CFLAGS)
+
+export CC CFLAGS
+
+TARGETS =
+
+endif
+
+all:
+   @for TARGET in $(TARGETS); do \
+   $(MAKE) -C $$TARGET all; \
+   done;
+
+run_tests: all
+   @for TARGET in $(TARGETS); do \
+   $(MAKE) -C $$TARGET run_tests; \
+   done;
+
+clean:
+   @for TARGET in $(TARGETS); do \
+   $(MAKE) -C $$TARGET clean; \
+   done;
+
+.PHONY: all run_tests clean
-- 
1.8.1.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 06/18] mm, hugetlb: remove vma_need_reservation()

2013-07-30 Thread Joonsoo Kim
On Tue, Jul 30, 2013 at 11:19:58PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim  writes:
> 
> > vma_need_reservation() can be substituted by vma_has_reserves()
> > with minor change. These function do almost same thing,
> > so unifying them is better to maintain.
> 
> I found the resulting code confusing and complex. I am sure there is
> more that what is explained in the commit message. If you are just doing
> this for cleanup, may be we should avoid doing this  ?

I may need this cleanup, because I want to decide more clearly whether a
page comes from the reserved page pool or not. Without it, that decision
is harder.
Anyway, I will describe the purpose of this patch in more detail.

Thanks.

> 
> 
> >
> > Signed-off-by: Joonsoo Kim 
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bf2ee11..ff46a2c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -451,8 +451,18 @@ void reset_vma_resv_huge_pages(struct vm_area_struct 
> > *vma)
> > vma->vm_private_data = (void *)0;
> >  }
> >
> > -/* Returns true if the VMA has associated reserve pages */
> > -static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> > +/*
> > + * Determine if the huge page at addr within the vma has an associated
> > + * reservation.  Where it does not we will need to logically increase
> > + * reservation and actually increase subpool usage before an allocation
> > + * can occur.  Where any new reservation would be required the
> > + * reservation change is prepared, but not committed.  Once the page
> > + * has been allocated from the subpool and instantiated the change should
> > + * be committed via vma_commit_reservation.  No action is required on
> > + * failure.
> > + */
> > +static int vma_has_reserves(struct hstate *h,
> > +   struct vm_area_struct *vma, unsigned long addr)
> >  {
> > if (vma->vm_flags & VM_NORESERVE) {
> > /*
> > @@ -464,10 +474,22 @@ static int vma_has_reserves(struct vm_area_struct 
> > *vma, long chg)
> >  * step. Currently, we don't have any other solution to deal
> >  * with this situation properly, so add work-around here.
> >  */
> > -   if (vma->vm_flags & VM_MAYSHARE && chg == 0)
> > -   return 1;
> > -   else
> > -   return 0;
> > +   if (vma->vm_flags & VM_MAYSHARE) {
> > +   struct address_space *mapping = vma->vm_file->f_mapping;
> > +   struct inode *inode = mapping->host;
> > +   pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> > +   struct resv_map *resv = inode->i_mapping->private_data;
> > +   long chg;
> > +
> > +   chg = region_chg(resv, idx, idx + 1);
> > +   if (chg < 0)
> > +   return -ENOMEM;
> > +
> > +   if (chg == 0)
> > +   return 1;
> > +   }
> > +
> > +   return 0;
> > }
> >
> > /* Shared mappings always use reserves */
> > @@ -478,8 +500,16 @@ static int vma_has_reserves(struct vm_area_struct 
> > *vma, long chg)
> >  * Only the process that called mmap() has reserves for
> >  * private mappings.
> >  */
> > -   if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> > +   if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> > +   pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> > +   struct resv_map *resv = vma_resv_map(vma);
> > +
> > +   /* Just for allocating region structure */
> > +   if (region_chg(resv, idx, idx + 1) < 0)
> > +   return -ENOMEM;
> > +
> > return 1;
> > +   }
> >
> > return 0;
> >  }
> > @@ -542,8 +572,7 @@ static struct page *dequeue_huge_page_node(struct 
> > hstate *h, int nid)
> >
> >  static struct page *dequeue_huge_page_vma(struct hstate *h,
> > struct vm_area_struct *vma,
> > -   unsigned long address, int avoid_reserve,
> > -   long chg)
> > +   unsigned long address, int avoid_reserve)
> >  {
> > struct page *page = NULL;
> > struct mempolicy *mpol;
> > @@ -558,7 +587,7 @@ static struct page *dequeue_huge_page_vma(struct hstate 
> > *h,
> >  * have no page reserves. This check ensures that reservations are
> >  * not "stolen". The child may still get SIGKILLed
> >  */
> > -   if (!vma_has_reserves(vma, chg) &&
> > +   if (!vma_has_reserves(h, vma, address) &&
> > h->free_huge_pages - h->resv_huge_pages == 0)
> > return NULL;
> >
> > @@ -578,7 +607,7 @@ retry_cpuset:
> > if (page) {
> > if (avoid_reserve)
> > break;
> > -   if (!vma_has_reserves(vma, chg))
> > +   if (!vma_has_reserves(h, vma, address))
> >   

Re: [PATCH 06/18] mm, hugetlb: remove vma_need_reservation()

2013-07-30 Thread Joonsoo Kim
Hello, Naoya.

On Mon, Jul 29, 2013 at 01:52:52PM -0400, Naoya Horiguchi wrote:
> Hi, 
> 
> On Mon, Jul 29, 2013 at 02:31:57PM +0900, Joonsoo Kim wrote:
> > vma_need_reservation() can be substituted by vma_has_reserves()
> > with minor change. These function do almost same thing,
> > so unifying them is better to maintain.
> 
> I think it could, but not with minor changes.
> vma_has_reserves is mostly the negation of the vma_needs_reservation,
> but not exactly (consider that vma(VM_NORESERVE) does not have nor
> need any reservation.)

Yes, currently they are not exactly the same, but we can merge these functions.
I think the confusion comes from the choice of base function. If I instead merge
vma_has_reserves() into vma_needs_reservation(), it may be more obvious.

> 
> > Signed-off-by: Joonsoo Kim 
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bf2ee11..ff46a2c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -451,8 +451,18 @@ void reset_vma_resv_huge_pages(struct vm_area_struct 
> > *vma)
> > vma->vm_private_data = (void *)0;
> >  }
> >  
> > -/* Returns true if the VMA has associated reserve pages */
> > -static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> > +/*
> > + * Determine if the huge page at addr within the vma has an associated
> > + * reservation.  Where it does not we will need to logically increase
> > + * reservation and actually increase subpool usage before an allocation
> > + * can occur.  Where any new reservation would be required the
> > + * reservation change is prepared, but not committed.  Once the page
> > + * has been allocated from the subpool and instantiated the change should
> > + * be committed via vma_commit_reservation.  No action is required on
> > + * failure.
> > + */
> > +static int vma_has_reserves(struct hstate *h,
> > +   struct vm_area_struct *vma, unsigned long addr)
> >  {
> > if (vma->vm_flags & VM_NORESERVE) {
> > /*
> 
> Could you add more detailed comment on top of the function for better
> maintenability? What is the expected behavior, or what is the returned
> values for each case with what reasons?

Okay. I should do it.

> And this function is not a simple reserve checker any more, but it can
> change reservation maps, so it would be nice to give more suitable name,
> though I don't have any good suggestions.

Okay.

> 
> > @@ -464,10 +474,22 @@ static int vma_has_reserves(struct vm_area_struct 
> > *vma, long chg)
> >  * step. Currently, we don't have any other solution to deal
> >  * with this situation properly, so add work-around here.
> >  */
> > -   if (vma->vm_flags & VM_MAYSHARE && chg == 0)
> > -   return 1;
> > -   else
> > -   return 0;
> > +   if (vma->vm_flags & VM_MAYSHARE) {
> > +   struct address_space *mapping = vma->vm_file->f_mapping;
> > +   struct inode *inode = mapping->host;
> > +   pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> > +   struct resv_map *resv = inode->i_mapping->private_data;
> 
> It would be nice if we can get resv_map from vma in a single function for
> VM_MAYSHARE, so can you add it on vma_resv_map?

Yes, that's good idea. I will consider it.

> 
> > +   long chg;
> > +
> > +   chg = region_chg(resv, idx, idx + 1);
> > +   if (chg < 0)
> > +   return -ENOMEM;
> > +
> > +   if (chg == 0)
> > +   return 1;
> > +   }
> > +
> > +   return 0;
> > }
> >  
> > /* Shared mappings always use reserves */
> > @@ -478,8 +500,16 @@ static int vma_has_reserves(struct vm_area_struct 
> > *vma, long chg)
> >  * Only the process that called mmap() has reserves for
> >  * private mappings.
> >  */
> > -   if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> > +   if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> > +   pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> > +   struct resv_map *resv = vma_resv_map(vma);
> > +
> > +   /* Just for allocating region structure */
> > +   if (region_chg(resv, idx, idx + 1) < 0)
> > +   return -ENOMEM;
> > +
> > return 1;
> > +   }
> >  
> > return 0;
> >  }
> > @@ -542,8 +572,7 @@ static struct page *dequeue_huge_page_node(struct 
> > hstate *h, int nid)
> >  
> >  static struct page *dequeue_huge_page_vma(struct hstate *h,
> > struct vm_area_struct *vma,
> > -   unsigned long address, int avoid_reserve,
> > -   long chg)
> > +   unsigned long address, int avoid_reserve)
> >  {
> > struct page *page = NULL;
> > struct mempolicy *mpol;
> > @@ -558,7 +587,7 @@ static struct page *dequeue_huge_page_vma(struct hstate 
> > *h,
> >  * have no page reserves. This check 

Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

2013-07-30 Thread Joonsoo Kim
On Wed, Jul 31, 2013 at 10:49:24AM +0800, Hillf Danton wrote:
> On Wed, Jul 31, 2013 at 10:27 AM, Joonsoo Kim  wrote:
> > On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
> >> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim  
> >> wrote:
> >> > alloc_huge_page_node() use dequeue_huge_page_node() without
> >> > any validation check, so it can steal reserved page unconditionally.
> >>
> >> Well, why is it illegal to use reserved page here?
> >
> > If we use reserved page here, other processes which are promised to use
> > enough hugepages cannot get enough hugepages and can die. This is
> > unexpected result to them.
> >
> But, how do you determine that a huge page is requested by a process
> that is not allowed to use reserved pages?

There is just one reserved page for each address or file offset. If we need to
move this page, that means it has already used its own reserved page, namely
this one. So we should not consume another reserved page to move it.

Thanks.

> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org


Re: [PATCH RFC 3/3] cpufreq: cpufreq-cpu0: clk rate-change notifiers

2013-07-30 Thread Viresh Kumar
On 31 July 2013 07:13, Mike Turquette  wrote:
> Quoting Viresh Kumar (2013-07-07 21:10:54)
>> On Mon, Jul 8, 2013 at 7:14 AM, Mike Turquette  wrote:
>> > Removes direct handling of OPP tables and voltage regulators by calling
>> > of_clk_cpufreq_notifier_handler, introduced by commit "clk: cpufreq
>> > helper for voltage scaling".
>> >
>> > In the future this can help consolidate code found across similar
>> > CPUfreq drivers.
>> >
>> > Signed-off-by: Mike Turquette 
>> > ---
>> >  drivers/cpufreq/cpufreq-cpu0.c | 125 
>> > -
>> >  1 file changed, 22 insertions(+), 103 deletions(-)
>>
>> Good patch, really gets lots of stuff out from cpufreq drivers.
>>
>> Acked-by: Viresh Kumar 
>
> Viresh,
>
> Thanks for the Ack. I received no comments on this series besides your
> own, so I plan to merge it. Do you want to take patch #3, or do you want
> me to take it, or should I give you and Rafael a stable branch instead?

It's only related to cpufreq-cpu0, and to handle the dependencies well the
patches should go through a single tree. Probably your tree is the right place
for now. Rafael?


Re: [PATCH v5 0/4] [SCSI] sg: fix race condition in sg_open

2013-07-30 Thread vaughan
On 07/26/2013 04:33 AM, Douglas Gilbert wrote:
> On 13-07-25 11:32 AM, vaughan wrote:
>> On 07/23/2013 01:03 AM, Jörn Engel wrote:
>>> On Mon, 22 July 2013 12:40:29 +0800, Vaughan Cao wrote:
 There is a race when open sg with O_EXCL flag. Also a race may
 happen between
 sg_open and sg_remove.

 Changes from v4:
   * [3/4] use ERR_PTR series instead of adding another parameter in
 sg_add_sfp
   * [4/4] fix conflict for cherry-pick from v3.

 Changes from v3:
   * release o_sem in sg_release(), not in sg_remove_sfp().
   * not set exclude with sfd_lock held.

 Vaughan Cao (4):
[SCSI] sg: use rwsem to solve race during exclusive open
[SCSI] sg: no need sg_open_exclusive_lock
[SCSI] sg: checking sdp->detached isn't protected when open
[SCSI] sg: push file descriptor list locking down to per-device
  locking

   drivers/scsi/sg.c | 178
 +-
   1 file changed, 83 insertions(+), 95 deletions(-)
>>> Patchset looks good to me, although I didn't test it on hardware yet.
>>> Signed-off-by: Joern Engel 
>>>
>>> James, care to pick this up?
>>>
>>> Jörn
>> Hi James,
>>
The sg driver has two races:
  a) between exclusive open and non-exclusive open
  b) between sg removal and sg open
I explained the scenarios in detail in the individual patches. I have tested
the patches and Jörn has reviewed them. I have had no response from Doug
Gilbert for a long time.
Would you care to pick these up?
>
> Hi,
> Your patches applied with a little noise to lk 3.10.2 and
> gave this warning from the build.
>
>   CC [M]  drivers/scsi/sg.o
> drivers/scsi/sg.c: In function ‘sg_open’:
> drivers/scsi/sg.c:242:6: warning: unused variable ‘res’
> [-Wunused-variable]
>
> I'll keep testing ...
Hi Doug,

Can I ask about the test results?

Thanks,
Vaughan
>
> Doug Gilbert
>



[PATCH 01/11] writeback: plug writeback at a high level

2013-07-30 Thread Dave Chinner
From: Dave Chinner 

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes immediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small 4k files using fsmark
on XFS results in a huge number of IOPS being issued for data
writes.  Metadata writes are sorted and plugged at a high level by
XFS, so aggregate nicely into large IOs. However, data writeback IOs
are dispatched in individual 4k IOs, even when the blocks of two
consecutively written files are adjacent.

Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
metadata CRCs enabled.

Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)

Test:

$ ./fs_mark  -D  1  -S0  -n  1  -s  4096  -L  120  -d
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
/mnt/scratch/6  -d  /mnt/scratch/7

Result:

                wall    sys     create rate     Physical write IO
                time    CPU     (avg files/s)    IOPS   Bandwidth
                -----   ------  -------------   ------  ---------
unpatched       6m56s   15m47s  24,000+/-500    26,000  130MB/s
patched         5m06s   13m28s  32,800+/-600     1,500  180MB/s
improvement     -26.44% -14.68%   +36.67%       -94.23% +38.46%

If I use zero length files, this workload runs at about 500 IOPS, so
plugging drops the data IOs from roughly 25,500/s to 1,000/s.
3 lines of code, 35% better throughput for 15% less CPU.

The benefits of plugging at this layer are likely to be higher for
spinning media as the IO patterns for this workload are going make a
much bigger difference on high IO latency devices.

Signed-off-by: Dave Chinner 
---
 fs/fs-writeback.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 68851ff..1d23d9a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -589,7 +589,9 @@ static long writeback_sb_inodes(struct super_block *sb,
unsigned long start_time = jiffies;
long write_chunk;
long wrote = 0;  /* count both pages and inodes */
+   struct blk_plug plug;
 
+   blk_start_plug(&plug);
while (!list_empty(>b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
 
@@ -686,6 +688,7 @@ static long writeback_sb_inodes(struct super_block *sb,
break;
}
}
+   blk_finish_plug(&plug);
return wrote;
 }
 
-- 
1.8.3.2



[PATCH 04/11] sync: serialise per-superblock sync operations

2013-07-30 Thread Dave Chinner
From: Dave Chinner 

When competing sync(2) calls walk the same filesystem, they need to
walk the list of inodes on the superblock to find all the inodes
that we need to wait for IO completion on. However, when multiple
wait_sb_inodes() calls do this at the same time, they contend on the
the inode_sb_list_lock and the contention causes system wide
slowdowns. In effect, concurrent sync(2) calls can take longer and
burn more CPU than if they were serialised.

Stop the worst of the contention by adding a per-sb mutex to wrap
around wait_sb_inodes() so that we only execute one sync(2) IO
completion walk per superblock at a time and hence avoid
contention being triggered by concurrent sync(2) calls.

Signed-off-by: Dave Chinner 
---
 fs/fs-writeback.c  | 11 +++
 fs/super.c |  1 +
 include/linux/fs.h |  2 ++
 3 files changed, 14 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ca66dc8..56272ec 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1207,6 +1207,15 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
+/*
+ * The @s_sync_lock is used to serialise concurrent sync operations
+ * to avoid lock contention problems with concurrent wait_sb_inodes() calls.
+ * Concurrent callers will block on the s_sync_lock rather than doing
+ * contending walks. The queueing maintains sync(2) required behaviour as all the IO that
+ * has been issued up to the time this function is entered is guaranteed to be
+ * completed by the time we have gained the lock and waited for all IO that is
+ * in progress regardless of the order callers are granted the lock.
+ */
 static void wait_sb_inodes(struct super_block *sb)
 {
struct inode *inode, *old_inode = NULL;
@@ -1217,6 +1226,7 @@ static void wait_sb_inodes(struct super_block *sb)
 */
WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+   mutex_lock(&sb->s_sync_lock);
spin_lock(&sb->s_inode_list_lock);
 
/*
@@ -1258,6 +1268,7 @@ static void wait_sb_inodes(struct super_block *sb)
}
spin_unlock(&sb->s_inode_list_lock);
iput(old_inode);
+   mutex_unlock(&sb->s_sync_lock);
 }
 
 /**
diff --git a/fs/super.c b/fs/super.c
index d4d753e..7f98fd6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -200,6 +200,7 @@ static struct super_block *alloc_super(struct 
file_system_type *type, int flags)
s->s_bdi = &default_backing_dev_info;
INIT_HLIST_NODE(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_anon);
+   mutex_init(&s->s_sync_lock);
INIT_LIST_HEAD(&s->s_inodes);
spin_lock_init(&s->s_inode_list_lock);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 923b465..971e8be 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1321,6 +1321,8 @@ struct super_block {
/* Being remounted read-only */
int s_readonly_remount;
 
+   struct mutex	s_sync_lock;	/* sync serialisation lock */
+
/*
 * Keep the lru lists last in the structure so they always sit on their
 * own individual cachelines.
-- 
1.8.3.2



[PATCH 03/11] inode: convert inode_sb_list_lock to per-sb

2013-07-30 Thread Dave Chinner
From: Dave Chinner 

The process of reducing contention on per-superblock inode lists
starts with moving the locking to match the per-superblock inode
list. This takes the global lock out of the picture and reduces the
contention problems to within a single filesystem. This doesn't get
rid of contention as the locks still have global CPU scope, but it
does isolate operations on different superblocks from each other.

Signed-off-by: Dave Chinner 
---
 fs/block_dev.c   | 12 ++--
 fs/drop_caches.c | 10 ++
 fs/fs-writeback.c| 12 ++--
 fs/inode.c   | 28 +---
 fs/internal.h|  1 -
 fs/notify/inode_mark.c   | 20 ++--
 fs/quota/dquot.c | 16 
 fs/super.c   |  3 ++-
 include/linux/fs.h   |  5 -
 include/linux/fsnotify_backend.h |  2 +-
 10 files changed, 56 insertions(+), 53 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c7bda5c..c39c0f3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1669,7 +1669,7 @@ void iterate_bdevs(void (*func)(struct block_device *, 
void *), void *arg)
 {
struct inode *inode, *old_inode = NULL;
 
-   spin_lock(&inode_sb_list_lock);
+   spin_lock(&blockdev_superblock->s_inode_list_lock);
list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
struct address_space *mapping = inode->i_mapping;
 
@@ -1681,13 +1681,13 @@ void iterate_bdevs(void (*func)(struct block_device *, 
void *), void *arg)
}
__iget(inode);
spin_unlock(&inode->i_lock);
-   spin_unlock(&inode_sb_list_lock);
+   spin_unlock(&blockdev_superblock->s_inode_list_lock);
/*
 * We hold a reference to 'inode' so it couldn't have been
 * removed from s_inodes list while we dropped the
-* inode_sb_list_lock.  We cannot iput the inode now as we can
+* s_inode_list_lock  We cannot iput the inode now as we can
 * be holding the last reference and we cannot iput it under
-* inode_sb_list_lock. So we keep the reference and iput it
+* s_inode_list_lock. So we keep the reference and iput it
 * later.
 */
iput(old_inode);
@@ -1695,8 +1695,8 @@ void iterate_bdevs(void (*func)(struct block_device *, 
void *), void *arg)
 
func(I_BDEV(inode), arg);
 
-   spin_lock(&inode_sb_list_lock);
+   spin_lock(&blockdev_superblock->s_inode_list_lock);
}
-   spin_unlock(&inode_sb_list_lock);
+   spin_unlock(&blockdev_superblock->s_inode_list_lock);
iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 9fd702f..f1be790 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,7 +17,7 @@ static void drop_pagecache_sb(struct super_block *sb, void 
*unused)
 {
struct inode *inode, *toput_inode = NULL;
 
-   spin_lock(&inode_sb_list_lock);
+   spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
@@ -27,13 +27,15 @@ static void drop_pagecache_sb(struct super_block *sb, void 
*unused)
}
__iget(inode);
spin_unlock(&inode->i_lock);
-   spin_unlock(&inode_sb_list_lock);
+   spin_unlock(&sb->s_inode_list_lock);
+
invalidate_mapping_pages(inode->i_mapping, 0, -1);
iput(toput_inode);
toput_inode = inode;
-   spin_lock(_sb_list_lock);
+
+   spin_lock(>s_inode_list_lock);
}
-   spin_unlock(&inode_sb_list_lock);
+   spin_unlock(&sb->s_inode_list_lock);
iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1d23d9a..ca66dc8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1217,7 +1217,7 @@ static void wait_sb_inodes(struct super_block *sb)
 */
WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-   spin_lock(&inode_sb_list_lock);
+   spin_lock(&sb->s_inode_list_lock);
 
/*
 * Data integrity sync. Must wait for all pages under writeback,
@@ -1237,14 +1237,14 @@ static void wait_sb_inodes(struct super_block *sb)
}
__iget(inode);
spin_unlock(&inode->i_lock);
-   spin_unlock(&inode_sb_list_lock);
+   spin_unlock(&sb->s_inode_list_lock);
 
/*
 * We hold a reference to 'inode' so it couldn't have been
 * removed from s_inodes list while we dropped the
-* inode_sb_list_lock.  We cannot iput the inode now as we can
+* s_inode_list_lock.  We cannot iput the inode now as we can
 * be holding the last reference and we cannot iput it under
- 

[PATCH 00/11] Sync and VFS scalability improvements

2013-07-30 Thread Dave Chinner
Hi folks,

This series of patches is against the curent mmotm tree here:

http://git.cmpxchg.org/cgit/linux-mmotm.git/

It addresses several VFS scalability issues, the most pressing of which is lock
contention triggered by concurrent sync(2) calls.

The patches in the series are:

writeback: plug writeback at a high level

This patch greatly reduces writeback IOPS on XFS when writing lots of small
files. It improves performance by ~20-30% on XFS on fast devices by cutting small
file write IOPS by 95%, but doesn't seem to impact ext4 or btrfs performance or
IOPS in any noticeable way.

inode: add IOP_NOTHASHED to avoid inode hash lock in evict

Roughly 5-10% of the spinlock contention on 16-way create workloads on XFS comes
from inode_hash_remove(), even though XFS doesn't use the inode hash and uses
inode_hash_fake() to avoid needing to insert inodes into the hash. We still
take the lock to remove the inode from the hash. This patch avoids the lock on
inode eviction, too.

inode: convert inode_sb_list_lock to per-sb
sync: serialise per-superblock sync operations
inode: rename i_wb_list to i_io_list
bdi: add a new writeback list for sync
writeback: periodically trim the writeback list

This series removes the global inode_sb_list_lock and all the contention points
related to sync(2). The global lock is first converted to a per-filesystem lock
to reduce the scope of global contention, and a mutex is added to
wait_sb_inodes() to stop concurrent sync(2) operations from walking the inode
list at the same time while still guaranteeing sync(2) waits for all the IO it
needs to.  It then adds patches to track inodes under writeback for sync(2) in
an optimal manner, greatly reducing the overhead of sync(2) on large inode
caches.

inode: convert per-sb inode list to a list_lru

This patch converts the per-sb list and lock to the per-node list_lru structures
to remove the global lock bottleneck for workloads that have heavy cache
insertion and removal concurrency. A 4-node numa machine saw a 3.5x speedup on
inode cache intensive concurrent bulkstat operation (cycling 1.7 million
inodes/s through the XFS inode cache) as a result of this patch.

c8cb115 fs: Use RCU lookups for inode cache

Lockless inode hash traversals for ext4 and btrfs. Both see significant speedups
for directory traversal intensive workloads with this patch as it removes the
inode_hash_lock from cache lookups. The inode_hash_lock is still a limiting
factor for inode cache inserts and removals, but that's a much more complex
problem to solve.

8925a8d list_lru: don't need node lock in list_lru_count_node
4411917 list_lru: don't lock during add/del if unnecessary

Optimisations for the list_lru primitives. Because of the sheer number of calls
to these functions under heavy concurrent VFS workloads, these functions show up
quite hot in profiles. Hence making sure we don't take locks when we don't
really need to makes a measurable difference to the CPU consumption shown in the
profiles.


Performance Summary
---

Concurrent sync:

Load 8 million XFS inodes into the cache - all clean - and run
100 concurrent sync calls using:

$ time (for i in `seq 0 1 100`; do sync & done; wait)

          inodes     total sync time
                     real       system
mmotm     8366826    146.080s   1481.698s
patched   8560697    0.109s     0.346s

System interactivity on mmotm is crap - it's completely CPU bound and takes
seconds to respond to input.

Run fsmark creating 10 million 4k files with 16 threads, and run the above 100
concurrent sync calls when 1.5 million files have been created.

          fsmark    sync       sync system time
mmotm     259s      502.794s   4893.977s
patched   204s      62.423s    3.224s

Note: the difference in fsmark performance on this workload is due to the
first patch in the series - the writeback plugging patch.

Inode cache modification intensive workloads:

Simple workloads:

- 16 way fsmark to create 51.2 million empty files.
- multithreaded bulkstat, one thread per AG
- 16-way 'find /mnt/N -ctime 1' (directory + inode read)
- 16-way unlink

Storage: 100TB sparse filesystem image with a 1MB extent size hint on XFS on
4x64GB SSD RAID 0 (i.e. thin-provisioned with 1MB allocation granularity):

XFS       create   bulkstat   lookup   unlink
mmotm     4m28s    2m42s      2m20s    6m46s
patched   4m22s    0m37s      1m59s    6m45s

create and unlink are no faster as the reduction in lock contention on the
inode lists translated into causing more contention on the XFS transaction
commit code (I have other patches to address that). The bulkstat scaled almost
linearly with the number of inode lists, and lookup improved significantly as
well.

For ext4, I didn't bother with unlinks because they are single threaded due to
the orphan 

[PATCH 11/11] list_lru: don't lock during add/del if unnecessary

2013-07-30 Thread Dave Chinner
From: Dave Chinner 

We only add an item to the LRU if it is empty, and we only remove
them if they are on the LRU. We can check these conditions without
holding the node lock and avoid grabbing the lock if they evaluate
as not needing a list operation.

We can do this without concern, because we already have the
situation that two concurrent callers to list_lru_add/del() for the
same object will race on the node lock, and so the result of
concurrent operations on the same object has always been
undefined.

We still do the check once we have the lock, so even if we are
racing and decide we need the node lock to do the operation, then
the behaviour under the lock will be unchanged.

This significantly reduces the per-node lock traffic on workloads
where we aren't lazy about adding and removing objects from the LRU
lists.

Signed-off-by: Dave Chinner 
---
 mm/list_lru.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 9aadb6c..5207ae2 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -15,6 +15,9 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
int nid = page_to_nid(virt_to_page(item));
struct list_lru_node *nlru = &lru->node[nid];
 
+   if (!list_empty(item))
+   return false;
+
spin_lock(&nlru->lock);
WARN_ON_ONCE(nlru->nr_items < 0);
if (list_empty(item)) {
@@ -34,6 +37,9 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
int nid = page_to_nid(virt_to_page(item));
struct list_lru_node *nlru = &lru->node[nid];
 
+   if (list_empty(item))
+   return false;
+
spin_lock(&nlru->lock);
if (!list_empty(item)) {
list_del_init(item);
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 06/11] bdi: add a new writeback list for sync

2013-07-30 Thread Dave Chinner
From: Dave Chinner 

wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty ones to wait on during sync. This is highly
inefficient and wastes a lot of CPU when there are lots of clean
cached inodes that we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the bdi when the mapping is
first tagged as having pages under writeback.  wait_sb_inodes() can
then walk this list of "inodes under IO" and wait specifically just
for the inodes that the current sync(2) needs to wait for.

To avoid needing all the related locking to be safe against
interrupts, Jan Kara suggested that we be lazy about removal from
the writeback list. That is, we don't remove inodes from the
writeback list on IO completion, but do it directly during a
wait_sb_inodes() walk.

This means that a rare sync(2) call will have some work to do
skipping clean inodes. However, in the current problem case of
concurrent sync workloads, concurrent wait_sb_inodes() calls only
walk the very recently dispatched inodes and hence should have very
little work to do.

This also means that we have to remove the inodes from the writeback
list during eviction. Do this in inode_wait_for_writeback() once
all writeback on the inode is complete.

Signed-off-by: Dave Chinner 
---
 fs/fs-writeback.c   | 110 +---
 fs/inode.c  |   1 +
 include/linux/backing-dev.h |   3 ++
 include/linux/fs.h  |   1 +
 mm/backing-dev.c|   1 +
 mm/page-writeback.c |  14 ++
 6 files changed, 114 insertions(+), 16 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b7ac1c2..638f122 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -176,6 +176,34 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * mark an inode as under writeback on the given bdi
+ */
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
+{
+   WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+   if (list_empty(&inode->i_wb_list)) {
+   spin_lock(&bdi->wb.list_lock);
+   if (list_empty(&inode->i_wb_list))
+   list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
+   spin_unlock(&bdi->wb.list_lock);
+   }
+}
+
+/*
+ * clear an inode as under writeback on the given bdi
+ */
+static void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
+ struct inode *inode)
+{
+   WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+   if (!list_empty(&inode->i_wb_list)) {
+   spin_lock(&bdi->wb.list_lock);
+   list_del_init(&inode->i_wb_list);
+   spin_unlock(&bdi->wb.list_lock);
+   }
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -330,13 +358,18 @@ static void __inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Wait for writeback on an inode to complete. Caller must have inode pinned.
+ * Wait for writeback on an inode to complete during eviction. Caller must have
+ * inode pinned.
  */
 void inode_wait_for_writeback(struct inode *inode)
 {
+   BUG_ON(!(inode->i_state & I_FREEING));
+
spin_lock(&inode->i_lock);
__inode_wait_for_writeback(inode);
spin_unlock(&inode->i_lock);
+
+   bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
 }
 
 /*
@@ -1218,7 +1251,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  */
 static void wait_sb_inodes(struct super_block *sb)
 {
-   struct inode *inode, *old_inode = NULL;
+   struct backing_dev_info *bdi = sb->s_bdi;
+   struct inode *old_inode = NULL;
+   LIST_HEAD(sync_list);
 
/*
 * We need to be protected against the filesystem going from
@@ -1226,28 +1261,59 @@ static void wait_sb_inodes(struct super_block *sb)
 */
WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-   mutex_lock(&sb->s_sync_lock);
-   spin_lock(&sb->s_inode_list_lock);
-
/*
-* Data integrity sync. Must wait for all pages under writeback,
-* because there may have been pages dirtied before our sync
-* call, but which had writeout started before we write it out.
-* In which case, the inode may not be on the dirty list, but
-* we still have to wait for that writeout.
+* Data integrity sync. Must wait for all pages under writeback, because
+* there may have been pages dirtied before our sync call, but which had
+* writeout started before we write it out.  In which case, the inode
+* may not be on the dirty list, but we still have to wait for that
+* writeout.
+*
+* To avoid syncing inodes put under IO after we have started here,
+* splice the io list to a temporary list head and walk that. Newly
+* dirtied inodes will go onto the primary list so we 

[PATCH 09/11] fs: Use RCU lookups for inode cache

2013-07-30 Thread Dave Chinner
From: Dave Chinner 

Convert inode cache lookups to be protected by RCU locking rather
than the global inode_hash_lock. This will improve scalability of
inode lookup intensive workloads.

Tested w/ ext4 and btrfs on concurrent fsmark/lookup/unlink
workloads only. It removes the inode hash lock from the inode lookup
paths, but does not solve the problem of the inode hash lock being a
bottleneck on the inode cache insert/remove paths.

Signed-off-by: Dave Chinner 
---
 fs/inode.c | 76 +++---
 1 file changed, 43 insertions(+), 33 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 7eea591..810386e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -465,7 +465,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 
spin_lock(&inode_hash_lock);
spin_lock(&inode->i_lock);
-   hlist_add_head(&inode->i_hash, b);
+   hlist_add_head_rcu(&inode->i_hash, b);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
 }
@@ -481,7 +481,7 @@ void __remove_inode_hash(struct inode *inode)
 {
spin_lock(&inode_hash_lock);
spin_lock(&inode->i_lock);
-   hlist_del_init(&inode->i_hash);
+   hlist_del_init_rcu(&inode->i_hash);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_hash_lock);
 }
@@ -776,13 +776,18 @@ static void __wait_on_freeing_inode(struct inode *inode);
 static struct inode *find_inode(struct super_block *sb,
struct hlist_head *head,
int (*test)(struct inode *, void *),
-   void *data)
+   void *data, bool locked)
 {
struct inode *inode = NULL;
 
 repeat:
-   hlist_for_each_entry(inode, head, i_hash) {
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(inode, head, i_hash) {
spin_lock(&inode->i_lock);
+   if (inode_unhashed(inode)) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
if (inode->i_sb != sb) {
spin_unlock(&inode->i_lock);
continue;
@@ -792,13 +797,20 @@ repeat:
continue;
}
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+   rcu_read_unlock();
+   if (locked)
+   spin_unlock(&inode_hash_lock);
__wait_on_freeing_inode(inode);
+   if (locked)
+   spin_lock(&inode_hash_lock);
goto repeat;
}
__iget(inode);
spin_unlock(&inode->i_lock);
+   rcu_read_unlock();
return inode;
}
+   rcu_read_unlock();
return NULL;
 }
 
@@ -807,13 +819,19 @@ repeat:
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-   struct hlist_head *head, unsigned long ino)
+struct hlist_head *head,
+unsigned long ino, bool locked)
 {
struct inode *inode = NULL;
 
 repeat:
-   hlist_for_each_entry(inode, head, i_hash) {
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(inode, head, i_hash) {
spin_lock(&inode->i_lock);
+   if (inode_unhashed(inode)) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
if (inode->i_ino != ino) {
spin_unlock(&inode->i_lock);
continue;
@@ -823,13 +841,20 @@ repeat:
continue;
}
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+   rcu_read_unlock();
+   if (locked)
+   spin_unlock(&inode_hash_lock);
__wait_on_freeing_inode(inode);
+   if (locked)
+   spin_lock(&inode_hash_lock);
goto repeat;
}
__iget(inode);
spin_unlock(&inode->i_lock);
+   rcu_read_unlock();
return inode;
}
+   rcu_read_unlock();
return NULL;
 }
 
@@ -984,9 +1009,7 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
struct hlist_head *head = inode_hashtable + hash(sb, hashval);
struct inode *inode;
 
-   spin_lock(&inode_hash_lock);
-   inode = find_inode(sb, head, test, data);
-   spin_unlock(&inode_hash_lock);
+   inode = find_inode(sb, head, test, data, false);
 
if (inode) {
wait_on_inode(inode);
@@ -998,8 +1021,7 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
struct inode *old;
 
spin_lock(&inode_hash_lock);
-   /* We released the lock, so.. */
-   old = find_inode(sb, head, test, data);
+   

[PATCH 08/11] inode: convert per-sb inode list to a list_lru

2013-07-30 Thread Dave Chinner
From: Dave Chinner 

The per-superblock inode list and lock is a bottleneck for systems
that cycle inodes in and out of cache concurrently. The global lock
is a limiting factor.

Most of the additions to the sb inode list occur on the CPU that
allocated the inode, and most of the removals occur during evict()
calls as a result of memory reclaim. Both of these events are local
to the node that the inode belongs to, so it maps to the per-node
lists that the list_lru uses.

There are several places where the inode list is walked. These can
be converted easily to use list_lru_walk() to do their work on each
inode on the list.

Signed-off-by: Dave Chinner 
---
 fs/block_dev.c |  75 -
 fs/drop_caches.c   |  57 +++-
 fs/fs-writeback.c  |   4 +-
 fs/inode.c | 134 +++--
 fs/notify/inode_mark.c | 111 +--
 fs/quota/dquot.c   | 174 -
 fs/super.c |   9 ++-
 include/linux/fs.h |   9 ++-
 8 files changed, 327 insertions(+), 246 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index bec0c26..b8ec2bd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1665,38 +1665,55 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 }
 EXPORT_SYMBOL(__invalidate_device);
 
-void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
-{
-   struct inode *inode, *old_inode = NULL;
+struct bdev_iter {
+   void (*func)(struct block_device *, void *);
+   void *arg;
+   struct inode *toput_inode;
+};
 
-   spin_lock(&blockdev_superblock->s_inode_list_lock);
-   list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
-   struct address_space *mapping = inode->i_mapping;
+static enum lru_status
+bdev_iter_cb(struct list_head *item, spinlock_t *lock, void *cb_arg)
+{
+   struct bdev_iter *iter = cb_arg;
+   struct inode *inode = container_of(item, struct inode, i_sb_list);
 
-   spin_lock(&inode->i_lock);
-   if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
-   mapping->nrpages == 0) {
-   spin_unlock(&inode->i_lock);
-   continue;
-   }
-   __iget(inode);
+   spin_lock(&inode->i_lock);
+   if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
+   inode->i_mapping->nrpages == 0) {
spin_unlock(&inode->i_lock);
-   spin_unlock(&blockdev_superblock->s_inode_list_lock);
-   /*
-* We hold a reference to 'inode' so it couldn't have been
-* removed from s_inodes list while we dropped the
-* s_inode_list_lock. We cannot iput the inode now as we can
-* be holding the last reference and we cannot iput it under
-* s_inode_list_lock. So we keep the reference and iput it
-* later.
-*/
-   iput(old_inode);
-   old_inode = inode;
+   return LRU_SKIP;
+   }
+   __iget(inode);
+   spin_unlock(&inode->i_lock);
+   spin_unlock(lock);
 
-   func(I_BDEV(inode), arg);
+   iput(iter->toput_inode);
+   iter->toput_inode = inode;
 
-   spin_lock(&blockdev_superblock->s_inode_list_lock);
-   }
-   spin_unlock(&blockdev_superblock->s_inode_list_lock);
-   iput(old_inode);
+   iter->func(I_BDEV(inode), iter->arg);
+
+   /*
+* Even though we have dropped the lock here, we can return LRU_SKIP as
+* we have a reference to the current inode and so its next pointer is
+* guaranteed to be valid even though we dropped the list lock.
+*/
+   spin_lock(lock);
+   return LRU_SKIP;
+}
+
+/*
+ * iterate_bdevs - run a callback across all block devices
+ */
+void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
+{
+   struct bdev_iter iter = {
+   .func = func,
+   .arg = arg,
+   };
+
+   list_lru_walk(&blockdev_superblock->s_inode_list, bdev_iter_cb, &iter,
+ ULONG_MAX);
+
+   /* the list walk doesn't release the last inode it sees! */
+   iput(iter.toput_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index f1be790..048a7d7 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -13,29 +13,50 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb, void *unused)
+static enum lru_status
+drop_pagecache_inode(struct list_head *item, spinlock_t *lock, void *cb_arg)
 {
-   struct inode *inode, *toput_inode = NULL;
+   struct inode **toput_inode = cb_arg;
+   struct inode *inode = container_of(item, struct inode, i_sb_list);
 
-   spin_lock(&sb->s_inode_list_lock);
-   list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-   spin_lock(&inode->i_lock);
-   

[PATCH 02/11] inode: add IOP_NOTHASHED to avoid inode hash lock in evict

2013-07-30 Thread Dave Chinner
From: Dave Chinner 

Some filesystems don't use the VFS inode hash and fake the fact they
are hashed so that all the writeback code works correctly. However,
this means the evict() path still tries to remove the inode from the
hash, meaning that the inode_hash_lock() needs to be taken
unnecessarily. Hence under certain workloads the inode_hash_lock can
be contended even if the inode is never actually hashed.

To avoid this, add an inode opflag to allow inode_hash_remove() to
avoid taking the hash lock on inodes that have never actually been
hashed.

Signed-off-by: Dave Chinner 
---
 fs/xfs/xfs_iops.c  | 2 ++
 include/linux/fs.h | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 96dda62..68a8264 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1168,8 +1168,10 @@ xfs_setup_inode(
inode->i_state = I_NEW;
 
inode_sb_list_add(inode);
+
/* make the inode look hashed for the writeback code */
hlist_add_fake(&inode->i_hash);
+   inode->i_opflags |= IOP_NOTHASHED;
 
inode->i_mode   = ip->i_d.di_mode;
set_nlink(inode, ip->i_d.di_nlink);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b09ddc0..51cf6ed 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -515,6 +515,7 @@ struct posix_acl;
 #define IOP_FASTPERM   0x0001
 #define IOP_LOOKUP 0x0002
 #define IOP_NOFOLLOW   0x0004
+#define IOP_NOTHASHED  0x0008  /* inode never hashed, avoid unhashing */
 
 /*
  * Keep mostly read-only and often accessed (especially for
@@ -2371,7 +2372,7 @@ static inline void insert_inode_hash(struct inode *inode)
 extern void __remove_inode_hash(struct inode *);
 static inline void remove_inode_hash(struct inode *inode)
 {
-   if (!inode_unhashed(inode))
+   if (!((inode->i_opflags & IOP_NOTHASHED) || inode_unhashed(inode)))
__remove_inode_hash(inode);
 }
 
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 07/11] writeback: periodically trim the writeback list

2013-07-30 Thread Dave Chinner
From: Dave Chinner 

Inodes are removed lazily from the bdi writeback list, so in the
absence of sync(2) work, inodes will build up on the bdi writeback
list even though they are no longer under IO. Use the periodic
kupdate work check to remove inodes no longer under IO from the
writeback list.

Signed-off-by: Dave Chinner 
---
 fs/fs-writeback.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 638f122..7c9bbf0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -962,6 +962,23 @@ static long wb_check_background_flush(struct bdi_writeback *wb)
return 0;
 }
 
+/*
+ * clean out writeback list for all inodes that don't have IO in progress
+ */
+static void wb_trim_writeback_list(struct bdi_writeback *wb)
+{
+   struct inode *inode;
+   struct inode *tmp;
+
+   spin_lock(&wb->list_lock);
+   list_for_each_entry_safe(inode, tmp, &wb->b_writeback, i_wb_list) {
+   if (!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))
+   list_del_init(&inode->i_wb_list);
+   }
+   spin_unlock(&wb->list_lock);
+
+}
+
 static long wb_check_old_data_flush(struct bdi_writeback *wb)
 {
unsigned long expired;
@@ -978,6 +995,8 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
if (time_before(jiffies, expired))
return 0;
 
+   wb_trim_writeback_list(wb);
+
wb->last_old_flush = jiffies;
nr_pages = get_nr_dirty_pages();
 
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RESEND 0/1] AHCI: Optimize interrupt processing

2013-07-30 Thread Marc C
>> One thing which would probably be worthwhile tho is getting rid of the
>> bitmap based qc tag allocator in libata.  That one is just borderline
>> stupid to keep around on any setup which is supposed to be scalable.
> Your border might be wider than mine :-). Yes, the bitmap should
> definitely go.
A naive implementation is obviously less-than-efficient. However, what
other problems exist with the libata QC tag allocator? I highly doubt
SATA will change to beyond 32 queue tags, primarily because it would
be a pain to change SDB FIS (it's likely to break the dozens of AHCI
controller implementations out there). Further, it seems like the
industry stopped caring about SATA and is pushing NVMe for future
offerings.

In any event, most modern systems should have instructions to count
leading zeroes and modify bits atomically.

-MC

On Mon, Jul 29, 2013 at 12:19 PM, Nicholas A. Bellinger
 wrote:
> On Mon, 2013-07-29 at 13:18 +0200, Alexander Gordeev wrote:
>> On Fri, Jul 26, 2013 at 05:43:13PM -0700, Nicholas A. Bellinger wrote:
>> > On Fri, 2013-07-26 at 14:14 -0700, Nicholas A. Bellinger wrote:
>
> 
>
>> I also tried to make a "quick" conversion and hit the same issue(s) as you.
>> Generally, I am concerned with these assumptions in such approach:
>>
>> 1. While libata concept of tags matches nicely with blk-mq (blk_mq_hw_ctx::
>> rqs[] vs ata_port::qcmd[]) right now, it is too exposed to changes in blk-mq
>> in the long run. I.e. ata_link::sactive limits tags to indices, while tags
>> might become hashes. Easily fixable, but still.
>>
>> 2. Unallocated requests in blk-mq are accessed/analyzed from libata-eh.c as
>> a result of such iterations:
>>
>> for (tag = 0; tag < ATA_MAX_QUEUE; tag++) {
>> qc = __ata_qc_from_tag(ap, tag);
>>
>> if (!(qc->flags & ATA_QCFLAG_FAILED))
>> continue;
>>
>>   ...
>>   }
>>
>> While it is probably okay right now, it is still based on a premise that
>> blk-mq will not change the contents/concept of "payload", i.e. from embedded
>> to (re-)allocated memory.
>>
>> > The thing that I'm hung up on now for existing __ata_qc_from_tag() usage
>> > outside of the main blk_mq_ops->queue_rq -> SHT->queuecommand_mq()
>> > dispatch path, is how to actually locate the underlying scsi_device ->
>> > request_queue -> blk_mq_ctx -> blk_mq_hw_hctx from the passed
>> > ata_port..?
>>
>> I am actually in favor of getting rid of ata_queued_cmd::tag. Converting
>> ata_link::sactive to a list, making ata_link::active_tag as struct
>> ata_queued_cmd *ata_link::active_qc and converting ata_port::qc_allocated to
>> a list seems to solve it all, including [2]. Have not checked it though.
>>
>> Anyway, if we need a blk-mq tag (why?), we have qc->scsicmd->request->tag.
>>
>
> Hi Alexander,
>
> So given the feedback from Tejun, I'm going to step back for the moment
> from a larger conversion, and keep the SHT->cmd_size=0 setting for
> libata in the scsi-mq WIP branch.
>
> I'm happy to accept patches to drop the bitmap piece that Tejun
> mentioned if you're interested, but at least on my end right now there are
> bigger fish to fry for scsi-mq.  ;)
>
> Thanks,
>
> --nab
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ide" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 10/11] list_lru: don't need node lock in list_lru_count_node

2013-07-30 Thread Dave Chinner
From: Dave Chinner 

The overall count of objects on a node might be accurate, but the
moment it is returned to the caller it is no longer perfectly
accurate. Hence we don't really need to hold the node lock to
protect the read of the object. The count is a long, so can be read
atomically on all platforms and so we don't need the lock there,
either. And the cost of the lock is not trivial, either, as it is
showing up in profiles on 16-way lookup workloads like so:

-  15.44%  [kernel]  [k] __ticket_spin_trylock
  - 46.59% _raw_spin_lock
 + 69.40% list_lru_add
   17.65% list_lru_del
   5.70% list_lru_count_node

IOWs, while the LRU locking scales, it is still costly. The locking
doesn't provide any real advantage for counting, so just kill the
locking in list_lru_count_node().

Signed-off-by: Dave Chinner 
---
 mm/list_lru.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 7246791..9aadb6c 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -51,15 +51,9 @@ EXPORT_SYMBOL_GPL(list_lru_del);
 unsigned long
 list_lru_count_node(struct list_lru *lru, int nid)
 {
-   unsigned long count = 0;
struct list_lru_node *nlru = &lru->node[nid];
-
-   spin_lock(&nlru->lock);
WARN_ON_ONCE(nlru->nr_items < 0);
-   count += nlru->nr_items;
-   spin_unlock(&nlru->lock);
-
-   return count;
+   return nlru->nr_items;
 }
 EXPORT_SYMBOL_GPL(list_lru_count_node);
 
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 05/11] inode: rename i_wb_list to i_io_list

2013-07-30 Thread Dave Chinner
From: Dave Chinner 

There's a small consistency problem between the inode and writeback
naming. Writeback calls the "for IO" inode queues b_io and
b_more_io, but the inode calls these the "writeback list" or
i_wb_list. This makes it hard to add a new "under writeback" list to
the inode, or call it an "under IO" list on the bdi, because either
way we'll have writeback on IO and IO on writeback and it'll just be
confusing. I'm getting confused just writing this!

So, rename the inode "for IO" list variable to i_io_list so we can
add a new "writeback list" in a subsequent patch.

Signed-off-by: Dave Chinner 
---
 fs/block_dev.c |  2 +-
 fs/fs-writeback.c  | 18 +-
 fs/inode.c |  6 +++---
 include/linux/fs.h |  2 +-
 mm/backing-dev.c   |  6 +++---
 5 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c39c0f3..bec0c26 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -68,7 +68,7 @@ static void bdev_inode_switch_bdi(struct inode *inode,
if (inode->i_state & I_DIRTY) {
if (bdi_cap_writeback_dirty(dst) && !wb_has_dirty_io(&dst->wb))
wakeup_bdi = true;
-   list_move(&inode->i_wb_list, &dst->wb.b_dirty);
+   list_move(&inode->i_io_list, &dst->wb.b_dirty);
}
spin_unlock(&inode->i_lock);
spin_unlock(&old->wb.list_lock);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 56272ec..b7ac1c2 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -77,7 +77,7 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 
 static inline struct inode *wb_inode(struct list_head *head)
 {
-   return list_entry(head, struct inode, i_wb_list);
+   return list_entry(head, struct inode, i_io_list);
 }
 
 /*
@@ -171,7 +171,7 @@ void inode_wb_list_del(struct inode *inode)
struct backing_dev_info *bdi = inode_to_bdi(inode);
 
spin_lock(&bdi->wb.list_lock);
-   list_del_init(&inode->i_wb_list);
+   list_del_init(&inode->i_io_list);
spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -194,7 +194,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
if (time_before(inode->dirtied_when, tail->dirtied_when))
inode->dirtied_when = jiffies;
}
-   list_move(&inode->i_wb_list, &wb->b_dirty);
+   list_move(&inode->i_io_list, &wb->b_dirty);
 }
 
 /*
@@ -203,7 +203,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 {
assert_spin_locked(>list_lock);
-   list_move(&inode->i_wb_list, &wb->b_more_io);
+   list_move(&inode->i_io_list, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -254,7 +254,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
if (sb && sb != inode->i_sb)
do_sb_sort = 1;
sb = inode->i_sb;
-   list_move(&inode->i_wb_list, &tmp);
+   list_move(&inode->i_io_list, &tmp);
moved++;
}
 
@@ -270,7 +270,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
list_for_each_prev_safe(pos, node, &tmp) {
inode = wb_inode(pos);
if (inode->i_sb == sb)
-   list_move(&inode->i_wb_list, dispatch_queue);
+   list_move(&inode->i_io_list, dispatch_queue);
}
}
 out:
@@ -418,7 +418,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
redirty_tail(inode, wb);
} else {
/* The inode is clean. Remove from writeback lists. */
-   list_del_init(&inode->i_wb_list);
+   list_del_init(&inode->i_io_list);
}
 }
 
@@ -528,7 +528,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 * touch it. See comment above for explanation.
 */
if (!(inode->i_state & I_DIRTY))
-   list_del_init(&inode->i_wb_list);
+   list_del_init(&inode->i_io_list);
spin_unlock(&wb->list_lock);
inode_sync_complete(inode);
 out:
@@ -1193,7 +1193,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
spin_unlock(&inode->i_lock);
spin_lock(&bdi->wb.list_lock);
inode->dirtied_when = jiffies;
-   list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+   list_move(&inode->i_io_list, &bdi->wb.b_dirty);
spin_unlock(&bdi->wb.list_lock);
 
if (wakeup_bdi)
diff --git a/fs/inode.c b/fs/inode.c
index cd10287..f8e0f2f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -30,7 +30,7 @@
  * inode->i_sb->s_inode_list_lock protects:
  *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
- *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
+ *   bdi->wb.b_{dirty,io,more_io}, inode->i_io_list
  * inode_hash_lock protects:
  *   inode_hashtable, inode->i_hash
  

Re: [PATCH v2] xfs: introduce object readahead to log recovery

2013-07-30 Thread Zhi Yong Wu
On Wed, Jul 31, 2013 at 7:11 AM, Dave Chinner  wrote:
> On Tue, Jul 30, 2013 at 05:59:07PM +0800, zwu.ker...@gmail.com wrote:
>> From: Zhi Yong Wu 
>>
>>   It can take a long time to run the log recovery operation because it is
>> single threaded and bound by read latency. We found that most of the time
>> is spent waiting for read IO to complete, so if object readahead is
>> introduced to log recovery, it will obviously reduce the log recovery
>> time.
>>
>> Log recovery time stat:
>>
>>   w/o this patchw/ this patch
>>
>> real:0m15.023s 0m7.802s
>> user:0m0.001s  0m0.001s
>> sys: 0m0.246s  0m0.107s
>>
>> Signed-off-by: Zhi Yong Wu 
>> ---
>>  fs/xfs/xfs_log_recover.c | 162 
>> +--
>>  fs/xfs/xfs_log_recover.h |   2 +
>>  2 files changed, 159 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
>> index 7681b19..029826f 100644
>> --- a/fs/xfs/xfs_log_recover.c
>> +++ b/fs/xfs/xfs_log_recover.c
>> @@ -3116,6 +3116,111 @@ xlog_recover_free_trans(
>>   kmem_free(trans);
>>  }
>>
>> +STATIC void
>> +xlog_recover_buffer_ra_pass2(
>> + struct xlog *log,
>> + struct xlog_recover_item*item)
>> +{
>> + xfs_buf_log_format_t*buf_f = item->ri_buf[0].i_addr;
>> + xfs_mount_t *mp = log->l_mp;
>
> struct xfs_buf_log_format
> struct xfs_mount
Why? *_t is also used in a lot of other places.
>
>> +
>> + if (xlog_check_buffer_cancelled(log, buf_f->blf_blkno,
>> + buf_f->blf_len, buf_f->blf_flags)) {
>> + return;
>> + }
>> +
>> + xfs_buf_readahead(mp->m_ddev_targp, buf_f->blf_blkno,
>> + buf_f->blf_len, NULL);
>> +}
>> +
>> +STATIC void
>> +xlog_recover_inode_ra_pass2(
>> + struct xlog *log,
>> + struct xlog_recover_item*item)
>> +{
>> + xfs_inode_log_format_t  in_buf, *in_f;
>> + xfs_mount_t *mp = log->l_mp;
>
> struct xfs_inode_log_format
> struct xfs_mount
>
> and a separate declaration for each variable.
>
> Also, in_buf and in_f are not very good names as they don't follow
> any commonly used XFS naming convention. The shorthand for a
> struct xfs_inode_log_format variable is "ilf" and a pointer to such
> an object is "ilfp". i.e:
>
> struct xfs_inode_log_format ilf_buf;
> struct xfs_inode_log_format *ilfp;
>
>> +xlog_recover_dquot_ra_pass2(
>> + struct xlog *log,
>> + struct xlog_recover_item*item)
>> +{
>> + xfs_mount_t *mp = log->l_mp;
>> + struct xfs_disk_dquot   *recddq;
>> + int error;
>> + xfs_dq_logformat_t  *dq_f;
>> + uinttype;
>
> More typedefs.
>
>> +
>> +
>> + if (mp->m_qflags == 0)
>> + return;
>> +
>> + recddq = item->ri_buf[1].i_addr;
>> + if (recddq == NULL)
>> + return;
>> + if (item->ri_buf[1].i_len < sizeof(xfs_disk_dquot_t))
>> + return;
>> +
>> + type = recddq->d_flags & (XFS_DQ_USER | XFS_DQ_PROJ | XFS_DQ_GROUP);
>> + ASSERT(type);
>> + if (log->l_quotaoffs_flag & type)
>> + return;
>> +
>> + dq_f = item->ri_buf[0].i_addr;
>> + ASSERT(dq_f);
>> + error = xfs_qm_dqcheck(mp, recddq, dq_f->qlf_id, 0, XFS_QMOPT_DOWARN,
>> +"xlog_recover_dquot_ra_pass2 (log copy)");
>
> You don't need to do validation of the dquot in the transaction
> here - it's unlikely to be corrupt. Just do the readahead like for a
> normal buffer, and the validation can occur when recovering the
> item in the next pass.
Agreed, done.
>
>> + if (error)
>> + return;
>> + ASSERT(dq_f->qlf_len == 1);
>> +
>> + xfs_buf_readahead(mp->m_ddev_targp, dq_f->qlf_blkno,
>> + dq_f->qlf_len, NULL);
>> +}
>> +
>> +STATIC void
>> +xlog_recover_ra_pass2(
>> + struct xlog *log,
>> + struct xlog_recover_item*item)
>> +{
>> + switch (ITEM_TYPE(item)) {
>> + case XFS_LI_BUF:
>> + xlog_recover_buffer_ra_pass2(log, item);
>> + break;
>> + case XFS_LI_INODE:
>> + xlog_recover_inode_ra_pass2(log, item);
>> + break;
>> + case XFS_LI_DQUOT:
>> + xlog_recover_dquot_ra_pass2(log, item);
>> + break;
>> + case XFS_LI_EFI:
>> + case XFS_LI_EFD:
>> + case XFS_LI_QUOTAOFF:
>> + default:
>> + break;
>> + }
>> +}
>> +
>>  STATIC int
>>  xlog_recover_commit_pass1(
>>   struct xlog *log,
>> @@ -3177,6 +3282,26 @@ xlog_recover_commit_pass2(
>>   }
>>  }
>>
>> +STATIC int
>> +xlog_recover_items_pass2(
>> + struct xlog *log,
>> + struct xlog_recover *trans,
>> + struct 

Re: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle backend driver to sysdev.

2013-07-30 Thread Preeti U Murthy
Hi Dongsheng,

On 07/31/2013 08:52 AM, Wang Dongsheng-B40534 wrote:
> 
> 
>> -Original Message-
>> From: Deepthi Dharwar [mailto:deep...@linux.vnet.ibm.com]
>> Sent: Wednesday, July 31, 2013 10:59 AM
>> To: b...@kernel.crashing.org; daniel.lezc...@linaro.org; linux-
>> ker...@vger.kernel.org; mich...@ellerman.id.au;
>> srivatsa.b...@linux.vnet.ibm.com; pre...@linux.vnet.ibm.com;
>> sva...@linux.vnet.ibm.com; linuxppc-...@lists.ozlabs.org
>> Cc: r...@sisk.pl; Wang Dongsheng-B40534; linux...@vger.kernel.org
>> Subject: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle backend
>> driver to sysdev.
>>
>> Move pseries_idle backend driver code to arch/powerpc/sysdev
>> so that the code can be used for a common driver for powernv
>> and pseries. This removes a lot of code duplication.
>>
> Why not drivers/cpuidle/?
> 
I think it should be moved to drivers/cpuidle.

Please take a look at what the cpuidle code under drivers/ provides.
cpuidle has two parts: the front end and the back end. The front end
comprises the cpuidle governors, the registration of arch-specific
cpuidle drivers, and the enabling/disabling of the cpuidle feature. It
is this front-end code that lives under drivers/cpuidle.

The arch-specific cpuidle drivers, which decide what needs to be done
to enter a specific idle state chosen by the cpuidle governor, are what
constitute the back end of cpuidle. That code will not be in
drivers/cpuidle but in arch-specific code.

The code under drivers/cpuidle drives idle power management, but the
low-level handling of entry into idle states should be taken care of by
the architecture.

Your recent patch "cpuidle: add freescale e500 family porcessors idle
support" should, IMO, hook onto the backend cpuidle driver that this
patchset provides.

Regards
Preeti U Murthy



RE: [PATCH V2 5/6] cpuidle/powerpc: Backend-powerpc idle driver for powernv and pseries.

2013-07-30 Thread Wang Dongsheng-B40534


> 
> -static int pseries_cpuidle_add_cpu_notifier(struct notifier_block *n,
> +static int powerpc_cpuidle_add_cpu_notifier(struct notifier_block *n,
>   unsigned long action, void *hcpu)
>  {
>   int hotcpu = (unsigned long)hcpu;
>   struct cpuidle_device *dev =
> - per_cpu_ptr(pseries_cpuidle_devices, hotcpu);
> + per_cpu_ptr(powerpc_cpuidle_devices, hotcpu);
> 
>   if (dev && cpuidle_get_driver()) {
>   switch (action) {
> @@ -235,16 +270,16 @@ static int pseries_cpuidle_add_cpu_notifier(struct
> notifier_block *n,
>  }
> 
>  static struct notifier_block setup_hotplug_notifier = {
> - .notifier_call = pseries_cpuidle_add_cpu_notifier,
> + .notifier_call = powerpc_cpuidle_add_cpu_notifier,
>  };
> 
I think Daniel means moving the notifier to the cpuidle framework, not just powerpc.

And all the *device* handling should be removed. If the notifier needs to
handle a device, you can use "cpuidle_devices".

- dongsheng

> -static int __init pseries_processor_idle_init(void)
> +static int __init powerpc_processor_idle_init(void)
>  {
>   int retval;
> 
> - retval = pseries_idle_probe();
> + retval = powerpc_idle_probe();
>   if (retval)
>   return retval;
> 
> - pseries_cpuidle_driver_init();
> -	retval = cpuidle_register_driver(&pseries_idle_driver);
> +	powerpc_cpuidle_driver_init();
> +	retval = cpuidle_register_driver(&powerpc_idle_driver);
>   if (retval) {
> - printk(KERN_DEBUG "Registration of pseries driver failed.\n");
> + printk(KERN_DEBUG "Registration of powerpc driver failed.\n");
>   return retval;
>   }
> 
>   update_smt_snooze_delay(-1, per_cpu(smt_snooze_delay, 0));
> 
> - retval = pseries_idle_devices_init();
> + retval = powerpc_idle_devices_init();
All the *device* handling should be removed by using cpuidle_register().

- dongsheng



Re: [PATCH] pinctrl: msm: Add support for MSM TLMM pinmux

2013-07-30 Thread Stephen Warren
On 07/30/2013 06:13 PM, Hanumant Singh wrote:
> On 7/30/2013 5:08 PM, Stephen Warren wrote:
>> On 07/30/2013 06:01 PM, Hanumant Singh wrote:
>>> On 7/30/2013 2:22 PM, Stephen Warren wrote:
 On 07/30/2013 03:10 PM, hanumant wrote:
 ...
> We actually have the same TLMM pinmux used by several socs of a
> family.
> The number of pins on each soc may vary.
> Also a given soc gets used in a number of boards.
> The device tree for a given soc is split into the different boards
> that
> its in ie the boards inherit a common soc.dtsi but have separate dts.
> The boards for the same soc may use different pin groups for
> accomplishing a function, since we have multiple i2c, spi uart etc
> peripheral instances on a soc. A different instance of each of the
> above
> peripherals, can be used in different boards, utilizing different
> or subset of same pin groups.
> Thus I would need to have multiple C files for one soc, based on the
> boards that it goes into.

 The pinctrl driver should be exposing the raw capabilities of the HW.
 All the board-specific configuration should be expressed in DT. So, the
 driver shouldn't have to know anything about different boards at
 compile-time.

>>> I agree, so I wanted to keep the pin grouping information in DT, we
>>> already have a board based differentiation of dts files in DT, for the
>>> same soc.
>>
>> That's the opposite of what I was saying. Pin groups are a feature of
>> the SoC design, not the board.
>>
> Sorry I guess I wasn't clear.
> Right now I have a soc-pinctrl.dtsi containing pin groupings.
> This will be "inherited" by soc-boardtype.dts.
> The pinctrl client device nodes in soc-boardtype.dts will point to pin
> groupings in soc-pinctrl.dtsi that are valid for that particular boardtype.
> Is this a valid design?

OK, so you have two types of child node inside the pinctrl DT node; some
define the pin groups the SoC has (in soc.dtsi) and some define pinctrl
states that reference the pin group nodes and are referenced by the
client nodes.

That's probably fine. However, I'd still question putting the pin group
nodes in DT at all; I'm not convinced it's better than just putting
those into the driver itself. You end up with the same data tables after
parsing the DT anyway.



[PATCH] openrisc: Makefile: append "-D__linux__" to KBUILD_CFLAGS

2013-07-30 Thread Chen Gang
Need to append "-D__linux__" to KBUILD_CFLAGS, just like some other
architectures have done, otherwise 'allmodconfig' cannot be compiled.

The related error:

CC [M]  fs/coda/psdev.o
  In file included from include/linux/coda.h:65:0,
   from fs/coda/psdev.c:45:
  include/uapi/linux/coda.h:221:2: error: expected specifier-qualifier-list 
before 'u_quad_t'

The related compiler information:

  [root@dhcp122 ~]# /usr/local/bin/or32-linux-gcc -v
  Using built-in specs.
  COLLECT_GCC=/usr/local/bin/or32-linux-gcc
  
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/or32-linux/4.5.1-or32-1.0rc1/lto-wrapper
  Target: or32-linux
  Configured with: ../gcc-openrisc/configure --target=or32-linux --disable-nls 
--enable-languages=c --without-headers --disable-shared --disable-threads 
--enable-werror=no
  Thread model: single
  gcc version 4.5.1-or32-1.0rc1 (GCC) 


Signed-off-by: Chen Gang 
---
 arch/openrisc/Makefile |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/openrisc/Makefile b/arch/openrisc/Makefile
index 4739b83..89076a6 100644
--- a/arch/openrisc/Makefile
+++ b/arch/openrisc/Makefile
@@ -24,7 +24,7 @@ OBJCOPYFLAGS:= -O binary -R .note -R .comment -S
 LDFLAGS_vmlinux :=
LIBGCC 		:= $(shell $(CC) $(KBUILD_CFLAGS) -print-libgcc-file-name)
 
-KBUILD_CFLAGS  += -pipe -ffixed-r10
+KBUILD_CFLAGS  += -pipe -ffixed-r10 -D__linux__
 
 ifeq ($(CONFIG_OPENRISC_HAVE_INST_MUL),y)
KBUILD_CFLAGS += $(call cc-option,-mhard-mul)
-- 
1.7.7.6


Re: [PATCH] fbdev: fix build warning in vga16fb.c

2013-07-30 Thread Gu Zheng
Hoho, Tomi has applied the patch from Lius to fix this warning.
And this is the sixth patch to fix the same issue since last week.

Thanks,
Gu


On 07/31/2013 11:21 AM, Xishi Qiu wrote:

> When building v3.11-rc3, I get the following warning:
> ...
> drivers/video/vga16fb.c: In function ‘vga16fb_destroy’:
> drivers/video/vga16fb.c:1268: warning: unused variable ‘dev’
> ...
> 
> Signed-off-by: Xishi Qiu 
> ---
>  drivers/video/vga16fb.c |1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/video/vga16fb.c b/drivers/video/vga16fb.c
> index 830ded4..2827333 100644
> --- a/drivers/video/vga16fb.c
> +++ b/drivers/video/vga16fb.c
> @@ -1265,7 +1265,6 @@ static void vga16fb_imageblit(struct fb_info *info, 
> const struct fb_image *image
>  
>  static void vga16fb_destroy(struct fb_info *info)
>  {
> -	struct platform_device *dev = container_of(info->device, struct platform_device, dev);
>  	iounmap(info->screen_base);
>  	fb_dealloc_cmap(&info->cmap);
>  	/* XXX unshare VGA regions */
>   /* XXX unshare VGA regions */




Re: [PATCH 4/5] mmc: dw_mmc: Indicate that regulators may be absent

2013-07-30 Thread Jaehoon Chung
Acked-by: Jaehoon Chung 

Best Regards,
Jaehoon Chung

On 07/30/2013 10:33 PM, Seungwon Jeon wrote:
> On Tue, July 30, 2013, Mark Brown wrote:
>> From: Mark Brown 
>>
>> Use regulator_get_optional() to tell the core that requests for regulators
>> can fail in a real system.
>>
>> Signed-off-by: Mark Brown 
> Acked-by: Seungwon Jeon 
> 
> Thanks,
> Seungwon Jeon
> 



RE: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle backend driver to sysdev.

2013-07-30 Thread Wang Dongsheng-B40534


> -Original Message-
> From: Deepthi Dharwar [mailto:deep...@linux.vnet.ibm.com]
> Sent: Wednesday, July 31, 2013 10:59 AM
> To: b...@kernel.crashing.org; daniel.lezc...@linaro.org; linux-
> ker...@vger.kernel.org; mich...@ellerman.id.au;
> srivatsa.b...@linux.vnet.ibm.com; pre...@linux.vnet.ibm.com;
> sva...@linux.vnet.ibm.com; linuxppc-...@lists.ozlabs.org
> Cc: r...@sisk.pl; Wang Dongsheng-B40534; linux...@vger.kernel.org
> Subject: [PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle backend
> driver to sysdev.
> 
> Move pseries_idle backend driver code to arch/powerpc/sysdev
> so that the code can be used for a common driver for powernv
> and pseries. This removes a lot of code duplication.
> 
Why not drivers/cpuidle/?

I think it should be moved to drivers/cpuidle.

-dongsheng

> Signed-off-by: Deepthi Dharwar 
> ---
>  arch/powerpc/platforms/pseries/Kconfig  |9 -
>  arch/powerpc/platforms/pseries/Makefile |1
>  arch/powerpc/platforms/pseries/processor_idle.c |  384 -
> --
>  arch/powerpc/sysdev/Kconfig |9 +
>  arch/powerpc/sysdev/Makefile|1
>  arch/powerpc/sysdev/processor_idle.c|  384
> +++
>  6 files changed, 394 insertions(+), 394 deletions(-)
>  delete mode 100644 arch/powerpc/platforms/pseries/processor_idle.c
>  create mode 100644 arch/powerpc/sysdev/processor_idle.c
> 
> diff --git a/arch/powerpc/platforms/pseries/Kconfig
> b/arch/powerpc/platforms/pseries/Kconfig
> index 62b4f80..bb59bb0 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -119,12 +119,3 @@ config DTL
> which are accessible through a debugfs file.
> 
> Say N if you are unsure.
> -
> -config PSERIES_IDLE
> - bool "Cpuidle driver for pSeries platforms"
> - depends on CPU_IDLE
> - depends on PPC_PSERIES
> - default y
> - help
> -   Select this option to enable processor idle state management
> -   through cpuidle subsystem.
> diff --git a/arch/powerpc/platforms/pseries/Makefile
> b/arch/powerpc/platforms/pseries/Makefile
> index 8ae0103..4b22379 100644
> --- a/arch/powerpc/platforms/pseries/Makefile
> +++ b/arch/powerpc/platforms/pseries/Makefile
> @@ -21,7 +21,6 @@ obj-$(CONFIG_HCALL_STATS)   += hvCall_inst.o
>  obj-$(CONFIG_CMM)+= cmm.o
>  obj-$(CONFIG_DTL)+= dtl.o
>  obj-$(CONFIG_IO_EVENT_IRQ)   += io_event_irq.o
> -obj-$(CONFIG_PSERIES_IDLE)   += processor_idle.o
> 
>  ifeq ($(CONFIG_PPC_PSERIES),y)
>  obj-$(CONFIG_SUSPEND)+= suspend.o
> diff --git a/arch/powerpc/platforms/pseries/processor_idle.c
> b/arch/powerpc/platforms/pseries/processor_idle.c
> deleted file mode 100644
> index 0d75a54..000
> --- a/arch/powerpc/platforms/pseries/processor_idle.c
> +++ /dev/null
> @@ -1,384 +0,0 @@
> -/*
> - *  processor_idle - idle state cpuidle driver.
> - *  Adapted from drivers/idle/intel_idle.c and
> - *  drivers/acpi/processor_idle.c
> - *
> - */
> -
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -
> -/* Snooze Delay, pseries_idle */
> -DECLARE_PER_CPU(long, smt_snooze_delay);
> -
> -struct cpuidle_driver pseries_idle_driver = {
> - .name = "pseries_idle",
> - .owner= THIS_MODULE,
> -};
> -
> -#define MAX_IDLE_STATE_COUNT 2
> -
> -static int max_idle_state = MAX_IDLE_STATE_COUNT - 1;
> -static struct cpuidle_device __percpu *pseries_cpuidle_devices;
> -static struct cpuidle_state *cpuidle_state_table;
> -
> -static inline void idle_loop_prolog(unsigned long *in_purr)
> -{
> - *in_purr = mfspr(SPRN_PURR);
> - /*
> -  * Indicate to the HV that we are idle. Now would be
> -  * a good time to find other work to dispatch.
> -  */
> - get_lppaca()->idle = 1;
> -}
> -
> -static inline void idle_loop_epilog(unsigned long in_purr)
> -{
> - get_lppaca()->wait_state_cycles += mfspr(SPRN_PURR) - in_purr;
> - get_lppaca()->idle = 0;
> -}
> -
> -static int snooze_loop(struct cpuidle_device *dev,
> - struct cpuidle_driver *drv,
> - int index)
> -{
> - unsigned long in_purr;
> - int cpu = dev->cpu;
> -
> -	idle_loop_prolog(&in_purr);
> - local_irq_enable();
> - set_thread_flag(TIF_POLLING_NRFLAG);
> -
> - while ((!need_resched()) && cpu_online(cpu)) {
> - ppc64_runlatch_off();
> - HMT_low();
> - HMT_very_low();
> - }
> -
> - HMT_medium();
> - clear_thread_flag(TIF_POLLING_NRFLAG);
> - smp_mb();
> -
> - idle_loop_epilog(in_purr);
> -
> - return index;
> -}
> -
> -static void check_and_cede_processor(void)
> -{
> - /*
> -  * Ensure our interrupt state is properly tracked,
> -  * also checks if no interrupt has occurred while we
> -  * were soft-disabled
> - 

[PATCH] fbdev: fix build warning in vga16fb.c

2013-07-30 Thread Xishi Qiu
When building v3.11-rc3, I get the following warning:
...
drivers/video/vga16fb.c: In function ‘vga16fb_destroy’:
drivers/video/vga16fb.c:1268: warning: unused variable ‘dev’
...

Signed-off-by: Xishi Qiu 
---
 drivers/video/vga16fb.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/video/vga16fb.c b/drivers/video/vga16fb.c
index 830ded4..2827333 100644
--- a/drivers/video/vga16fb.c
+++ b/drivers/video/vga16fb.c
@@ -1265,7 +1265,6 @@ static void vga16fb_imageblit(struct fb_info *info, const 
struct fb_image *image
 
 static void vga16fb_destroy(struct fb_info *info)
 {
-	struct platform_device *dev = container_of(info->device, struct platform_device, dev);
 	iounmap(info->screen_base);
 	fb_dealloc_cmap(&info->cmap);
/* XXX unshare VGA regions */
-- 
1.7.1



Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-30 Thread Martin K. Petersen
> "Nick" == Nick Alcock  writes:

Nick> in which case we don't actually know that your Areca controller
Nick> supports the VPD page we thought it did: quite possibly only this
Nick> underlying disk does.

The ATA Information VPD page is created by the SCSI-ATA Translation
layer. The controller firmware in this case.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH trivial] include/linux/coda.h: remove useless '#else'

2013-07-30 Thread Chen Gang F T
On 07/31/2013 09:53 AM, Chen Gang wrote:
> On 07/31/2013 09:44 AM, Chen Gang wrote:
>> On 07/30/2013 08:29 PM, Joe Perches wrote:
>>> On Tue, 2013-07-30 at 15:30 +0800, Chen Gang wrote:
 '#else' is useless, need remove.

 Signed-off-by: Chen Gang 
 ---
  include/linux/coda.h |1 -
  1 files changed, 0 insertions(+), 1 deletions(-)

 diff --git a/include/linux/coda.h b/include/linux/coda.h
 index cff544f..d30209b 100644
 --- a/include/linux/coda.h
 +++ b/include/linux/coda.h
 @@ -60,7 +60,6 @@ Mellon the rights to redistribute these changes without 
 encumbrance.
  
  #if defined(__linux__)
  typedef unsigned long long u_quad_t;
 -#else
  #endif
  #include 
  #endif 
>>>
>>> Why have the #if at all?
>>>
>>>
>>>
>>

Oh, sorry, what I said about the openrisc cross-compiler was not precise.

Need a patch for openrisc (just like other architectures have done):

---diff begin---

diff --git a/arch/openrisc/Makefile b/arch/openrisc/Makefile
index 4739b83..89076a6 100644
--- a/arch/openrisc/Makefile
+++ b/arch/openrisc/Makefile
@@ -24,7 +24,7 @@ OBJCOPYFLAGS:= -O binary -R .note -R .comment -S
 LDFLAGS_vmlinux :=
LIBGCC 		:= $(shell $(CC) $(KBUILD_CFLAGS) -print-libgcc-file-name)
 
-KBUILD_CFLAGS  += -pipe -ffixed-r10
+KBUILD_CFLAGS  += -pipe -ffixed-r10 -D__linux__
 
 ifeq ($(CONFIG_OPENRISC_HAVE_INST_MUL),y)
KBUILD_CFLAGS += $(call cc-option,-mhard-mul)

---diff end-

But for "include/linux/coda.h", I still suggest to check __linux__
whether defined, at least it can find the building issues.

And next, I will send the related patch to openrisc mailing list.

:-)

Thanks.

>> Hmm... some old version compilers do not define __linux__ automatically
>> (e.g. or32-linux-gcc 4.5.1-or32-1.0rc1, which is the latest cross
>> compiler for openrisc, though).
>>
>> If not define __linux__, the compiler will report error (u_quad_t is
>> not defined).
>>
>> When we remove "#if defined(__linux__)" from "include/linux/coda.h",
>> the compiler will not report error, and 'cdev_t' will be defined as
>> 'dev_t', not 'u_quad_t' (I guess, it is an implicit bug).
>>
>>
>> In "uapi/include/*", only "coda.h" has "#if defined(__linux__)", maybe
>> they want multiple OS can share the same "coda.h" file (at least, we
>> can not say it is a bad idea).
>>
>> Neither suitable to define __linux__ in "include/linux/coda.h".
>>
>>
>> All together, I think:
>>
>>   the direct cause:
>> "uapi/include/coda.h" wants to share itself for multiple OS (so they 
>> need check __linux__).
>>   (in "uapi/include/*", only coda.h need check __linux__)
>>
>>   the root cause:
>> the compiler for linux should define __linux__ automatically, the latest 
>> version of gcc has done, but some of the old version is not.
>> most of cross compilers have merged their code into gcc main tree, but 
>> still left some (e.g. openrisc).
>>   (at least, I can not build openrisc from gcc main tree correctly, but 
>> most of others can)
>> some of latest cross compilers still use old version of gcc, (still not 
>> define __linux__ automatically).
>>
> 
> Maybe what I said is incorrect (it is just my current understanding).
> 
> Welcome any other members' suggestions or completions for discussing and
> checking.
> 
> Thanks.
> 
>>
>> The related code in "uapi/linux/coda.h" is below:
>>
>> ---code begin---
>>
>>  73 #if defined(DJGPP) || defined(__CYGWIN32__)
>>  74 #ifdef KERNEL
>>  75 typedef unsigned long u_long;
>>  76 typedef unsigned int u_int;
>>  77 typedef unsigned short u_short;
>>  78 typedef u_long ino_t;
>>  79 typedef u_long dev_t;
>>  80 typedef void * caddr_t;
>>  81 #ifdef DOS
>>  82 typedef unsigned __int64 u_quad_t;
>>  83 #else
>>  84 typedef unsigned long long u_quad_t;
>>  85 #endif
>>  86 
>>  87 #define inline
>>  88 
>>  89 struct timespec {
>>  90 long   ts_sec;
>>  91 long   ts_nsec;
>>  92 };
>>  93 #else  /* DJGPP but not KERNEL */
>>  94 #include 
>>  95 typedef unsigned long long u_quad_t;
>>  96 #endif /* !KERNEL */
>>  97 #endif /* !DJGPP */
>>  98 
>>  99 
>> 100 #if defined(__linux__)
>> 101 #include 
>> 102 #define cdev_t u_quad_t
>> 103 #ifndef __KERNEL__
>> 104 #if !defined(_UQUAD_T_) && (!defined(__GLIBC__) || __GLIBC__ < 2)
>> 105 #define _UQUAD_T_ 1
>> 106 typedef unsigned long long u_quad_t;
>> 107 #endif
>> 108 #endif /* __KERNEL__ */
>> 109 #else
>> 110 #define cdev_t dev_t
>> 111 #endif
>> 112 
>>
>> ---code end-
>>
>>
>> Thanks.
>>
> 
> 


-- 
Chen Gang

Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-30 Thread Martin K. Petersen
> "Bernd" == Bernd Schubert  writes:

Bernd,

>> Product revision level: R001 

It's clearly not verbatim passthrough...

Bernd> Besides the firmware, the difference might be that I'm exporting
Bernd> single disks without any areca-raidset in between.  I can try to
Bernd> confirm that tomorrow, I just need the system as it is till
Bernd> tomorrow noon.

That would be a great data point. I don't have any Areca boards.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-30 Thread Alex Williamson
On Wed, 2013-07-31 at 09:35 +0900, Takao Indoh wrote:
> (2013/07/31 0:59), Bjorn Helgaas wrote:
> > On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
> >  wrote:
> >> (2013/07/29 23:17), Bjorn Helgaas wrote:
> >>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh  
> >>> wrote:
>  (2013/07/26 2:00), Bjorn Helgaas wrote:
> > 
> > My point about IOMMU and PCI initialization order doesn't go away just
> > because it doesn't fit "kdump policy."  Having system initialization
> > occur in a logical order is far more important than making kdump work.
> 
>  My next plan is as follows. I think this is matched to logical order
>  on boot.
> 
>  drivers/pci/pci.c
>  - Add function to reset bus, for example, pci_reset_bus(struct pci_bus 
>  *bus)
> 
>  drivers/iommu/intel-iommu.c
>  - On initialization, if IOMMU is already enabled, call this bus reset
>  function before disabling and re-enabling IOMMU.
> >>>
> >>> I raised this issue because of arches like sparc that enumerate the
> >>> IOMMU before the PCI devices that use it.  In that situation, I think
> >>> you're proposing this:
> >>>
> >>> panic kernel
> >>>   enable IOMMU
> >>>   panic
> >>> kdump kernel
> >>>   initialize IOMMU (already enabled)
> >>> pci_reset_bus
> >>> disable IOMMU
> >>> enable IOMMU
> >>>   enumerate PCI devices
> >>>
> >>> But the problem is that when you call pci_reset_bus(), you haven't
> >>> enumerated the PCI devices, so you don't know what to reset.
> >>
> >> Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
> >> initialization is based on the assumption that enumeration of PCI devices
> >> is already done. We can find target device from IOMMU page table instead
> >> of scanning all devices in pci tree.
> >>
> >> Therefore, this idea is only for intel-iommu. Other architectures need
> >> to implement their own reset code.
> > 
> > That's my point.  I'm opposed to adding code to PCI when it only
> > benefits x86 and we know other arches will need a fundamentally
> > different design.  I would rather have a design that can work for all
> > arches.
> > 
> > If your implementation is totally implemented under arch/x86 (or in
> > intel-iommu.c, I guess), I can't object as much.  However, I think
> > that eventually even x86 should enumerate the IOMMUs via ACPI before
> > we enumerate PCI devices.
> > 
> > It's pretty clear that's how BIOS designers expect the OS to work.
> > For example, sec 8.7.3 of the Intel Virtualization Technology for
> > Directed I/O spec, rev 1.3, shows the expectation that remapping
> > hardware (IOMMU) is initialized before discovering the I/O hierarchy
> > below a hot-added host bridge.  Obviously you're not talking about a
> > hot-add scenario, but I think the same sequence should apply at
> > boot-time as well.
> 
> Of course I won't add something just for x86 into common PCI layer. I
> attach my new patch, though it is not well tested yet.
> 
> On x86, currently IOMMU initialization run *after* PCI enumeration, but
> what you are talking about is that it should be changed so that x86
> IOMMU initialization is done *before* PCI enumeration like sparc, right?
> 
> Hmm, ok, I think I need to post attached patch to iommu list and
> discuss it including current order of x86 IOMMU initialization.
> 
> Thanks,
> Takao Indoh
> ---
>  drivers/iommu/intel-iommu.c |   55 +-
>  drivers/pci/pci.c   |   53 
>  include/linux/pci.h |1
>  3 files changed, 108 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index eec0d3e..fb8a546 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3663,6 +3663,56 @@ static struct notifier_block device_nb = {
>   .notifier_call = device_notifier,
>  };
>  
> +/* Reset PCI device if its entry exists in DMAR table */
> +static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 segment)
> +{
> + u64 addr;
> + struct root_entry *root;
> + struct context_entry *context;
> + int bus, devfn;
> + struct pci_dev *dev;
> +
> + addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
> + if (!addr)
> + return;
> +
> + /*
> +  *  In the case of kdump, ioremap is needed because root-entry table
> +  *  exists in first kernel's memory area which is not mapped in second
> +  *  kernel
> +  */
> + root = (struct root_entry*)ioremap(addr, PAGE_SIZE);
> + if (!root)
> + return;
> +
> + for (bus=0; bus + if (!root_present([bus]))
> + continue;
> +
> + context = (struct context_entry *)ioremap(
> + root[bus].val & VTD_PAGE_MASK, PAGE_SIZE);
> + if (!context)
> + continue;
> +
> + for (devfn = 0; devfn < 256; devfn++) {

Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-30 Thread Martin K. Petersen
> "Doug" == Douglas Gilbert  writes:

Doug> I just examined a more recent Areca SAS RAID controller and would
Doug> describe it as the SCSI device from hell. One solution to this
Doug> problem is to modify the arcmsr driver so it returns a more
Doug> consistent set of lies to the management SCSI commands that Martin
Doug> is asking about.

Yeah. This is quite the challenge given that the product id is
user-specified and product revision hardcoded. I can match on "Areca" in
the vendor id and that's about it :(

My current approach is to tweak the driver so that I can set
skip_vpd_pages for the ATA models. Under the assumption that the SAS
controllers actually feature the ATA Information VPD...

-- 
Martin K. Petersen  Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V2 3/6] pseries: Move plpar_wrapper.h to powerpc common include/asm location.

2013-07-30 Thread Deepthi Dharwar
As a part of pseries_idle backend driver cleanup to make
the code common to both pseries and powernv archs, it
is necessary to move the backend-driver code to powerpc/sysdev.

As a pre-requisite to that, it is essential to move plpar_wrapper.h
to include/asm.

Signed-off-by: Deepthi Dharwar 
---
 arch/powerpc/include/asm/plpar_wrappers.h   |  324 +++
 arch/powerpc/platforms/pseries/cmm.c|3 
 arch/powerpc/platforms/pseries/dtl.c|3 
 arch/powerpc/platforms/pseries/hotplug-cpu.c|3 
 arch/powerpc/platforms/pseries/hvconsole.c  |2 
 arch/powerpc/platforms/pseries/iommu.c  |3 
 arch/powerpc/platforms/pseries/kexec.c  |2 
 arch/powerpc/platforms/pseries/lpar.c   |2 
 arch/powerpc/platforms/pseries/plpar_wrappers.h |  324 ---
 arch/powerpc/platforms/pseries/processor_idle.c |3 
 arch/powerpc/platforms/pseries/setup.c  |2 
 arch/powerpc/platforms/pseries/smp.c|2 
 12 files changed, 335 insertions(+), 338 deletions(-)
 create mode 100644 arch/powerpc/include/asm/plpar_wrappers.h
 delete mode 100644 arch/powerpc/platforms/pseries/plpar_wrappers.h

diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
new file mode 100644
index 000..f35787b
--- /dev/null
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -0,0 +1,324 @@
+#ifndef _PSERIES_PLPAR_WRAPPERS_H
+#define _PSERIES_PLPAR_WRAPPERS_H
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+/* Get state of physical CPU from query_cpu_stopped */
+int smp_query_cpu_stopped(unsigned int pcpu);
+#define QCSS_STOPPED 0
+#define QCSS_STOPPING 1
+#define QCSS_NOT_STOPPED 2
+#define QCSS_HARDWARE_ERROR -1
+#define QCSS_HARDWARE_BUSY -2
+
+static inline long poll_pending(void)
+{
+   return plpar_hcall_norets(H_POLL_PENDING);
+}
+
+static inline u8 get_cede_latency_hint(void)
+{
+   return get_lppaca()->cede_latency_hint;
+}
+
+static inline void set_cede_latency_hint(u8 latency_hint)
+{
+   get_lppaca()->cede_latency_hint = latency_hint;
+}
+
+static inline long cede_processor(void)
+{
+   return plpar_hcall_norets(H_CEDE);
+}
+
+static inline long extended_cede_processor(unsigned long latency_hint)
+{
+   long rc;
+   u8 old_latency_hint = get_cede_latency_hint();
+
+   set_cede_latency_hint(latency_hint);
+
+   rc = cede_processor();
+#ifdef CONFIG_TRACE_IRQFLAGS
+   /* Ensure that H_CEDE returns with IRQs on */
+   if (WARN_ON(!(mfmsr() & MSR_EE)))
+   __hard_irq_enable();
+#endif
+
+   set_cede_latency_hint(old_latency_hint);
+
+   return rc;
+}
+
+static inline long vpa_call(unsigned long flags, unsigned long cpu,
+   unsigned long vpa)
+{
+   flags = flags << H_VPA_FUNC_SHIFT;
+
+   return plpar_hcall_norets(H_REGISTER_VPA, flags, cpu, vpa);
+}
+
+static inline long unregister_vpa(unsigned long cpu)
+{
+   return vpa_call(H_VPA_DEREG_VPA, cpu, 0);
+}
+
+static inline long register_vpa(unsigned long cpu, unsigned long vpa)
+{
+   return vpa_call(H_VPA_REG_VPA, cpu, vpa);
+}
+
+static inline long unregister_slb_shadow(unsigned long cpu)
+{
+   return vpa_call(H_VPA_DEREG_SLB, cpu, 0);
+}
+
+static inline long register_slb_shadow(unsigned long cpu, unsigned long vpa)
+{
+   return vpa_call(H_VPA_REG_SLB, cpu, vpa);
+}
+
+static inline long unregister_dtl(unsigned long cpu)
+{
+   return vpa_call(H_VPA_DEREG_DTL, cpu, 0);
+}
+
+static inline long register_dtl(unsigned long cpu, unsigned long vpa)
+{
+   return vpa_call(H_VPA_REG_DTL, cpu, vpa);
+}
+
+static inline long plpar_page_set_loaned(unsigned long vpa)
+{
+   unsigned long cmo_page_sz = cmo_get_page_size();
+   long rc = 0;
+   int i;
+
+   for (i = 0; !rc && i < PAGE_SIZE; i += cmo_page_sz)
+   rc = plpar_hcall_norets(H_PAGE_INIT, H_PAGE_SET_LOANED, vpa + 
i, 0);
+
+   for (i -= cmo_page_sz; rc && i != 0; i -= cmo_page_sz)
+   plpar_hcall_norets(H_PAGE_INIT, H_PAGE_SET_ACTIVE,
+  vpa + i - cmo_page_sz, 0);
+
+   return rc;
+}
+
+static inline long plpar_page_set_active(unsigned long vpa)
+{
+   unsigned long cmo_page_sz = cmo_get_page_size();
+   long rc = 0;
+   int i;
+
+   for (i = 0; !rc && i < PAGE_SIZE; i += cmo_page_sz)
+   rc = plpar_hcall_norets(H_PAGE_INIT, H_PAGE_SET_ACTIVE, vpa + 
i, 0);
+
+   for (i -= cmo_page_sz; rc && i != 0; i -= cmo_page_sz)
+   plpar_hcall_norets(H_PAGE_INIT, H_PAGE_SET_LOANED,
+  vpa + i - cmo_page_sz, 0);
+
+   return rc;
+}
+
+extern void vpa_init(int cpu);
+
+static inline long plpar_pte_enter(unsigned long flags,
+   unsigned long hpte_group, unsigned long hpte_v,
+   unsigned long hpte_r, unsigned long *slot)
+{
+   long rc;
+   unsigned long 

[PATCH V2 4/6] cpuidle/pseries: Move the pseries_idle backend driver to sysdev.

2013-07-30 Thread Deepthi Dharwar
Move the pseries_idle backend driver code to arch/powerpc/sysdev
so that it can serve as a common driver for powernv
and pseries. This removes a lot of code duplication.

Signed-off-by: Deepthi Dharwar 
---
 arch/powerpc/platforms/pseries/Kconfig  |9 -
 arch/powerpc/platforms/pseries/Makefile |1 
 arch/powerpc/platforms/pseries/processor_idle.c |  384 ---
 arch/powerpc/sysdev/Kconfig |9 +
 arch/powerpc/sysdev/Makefile|1 
 arch/powerpc/sysdev/processor_idle.c|  384 +++
 6 files changed, 394 insertions(+), 394 deletions(-)
 delete mode 100644 arch/powerpc/platforms/pseries/processor_idle.c
 create mode 100644 arch/powerpc/sysdev/processor_idle.c

diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 62b4f80..bb59bb0 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -119,12 +119,3 @@ config DTL
  which are accessible through a debugfs file.
 
  Say N if you are unsure.
-
-config PSERIES_IDLE
-   bool "Cpuidle driver for pSeries platforms"
-   depends on CPU_IDLE
-   depends on PPC_PSERIES
-   default y
-   help
- Select this option to enable processor idle state management
- through cpuidle subsystem.
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 8ae0103..4b22379 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -21,7 +21,6 @@ obj-$(CONFIG_HCALL_STATS) += hvCall_inst.o
 obj-$(CONFIG_CMM)  += cmm.o
 obj-$(CONFIG_DTL)  += dtl.o
 obj-$(CONFIG_IO_EVENT_IRQ) += io_event_irq.o
-obj-$(CONFIG_PSERIES_IDLE) += processor_idle.o
 
 ifeq ($(CONFIG_PPC_PSERIES),y)
 obj-$(CONFIG_SUSPEND)  += suspend.o
diff --git a/arch/powerpc/platforms/pseries/processor_idle.c 
b/arch/powerpc/platforms/pseries/processor_idle.c
deleted file mode 100644
index 0d75a54..000
--- a/arch/powerpc/platforms/pseries/processor_idle.c
+++ /dev/null
@@ -1,384 +0,0 @@
-/*
- *  processor_idle - idle state cpuidle driver.
- *  Adapted from drivers/idle/intel_idle.c and
- *  drivers/acpi/processor_idle.c
- *
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-/* Snooze Delay, pseries_idle */
-DECLARE_PER_CPU(long, smt_snooze_delay);
-
-struct cpuidle_driver pseries_idle_driver = {
-   .name = "pseries_idle",
-   .owner= THIS_MODULE,
-};
-
-#define MAX_IDLE_STATE_COUNT   2
-
-static int max_idle_state = MAX_IDLE_STATE_COUNT - 1;
-static struct cpuidle_device __percpu *pseries_cpuidle_devices;
-static struct cpuidle_state *cpuidle_state_table;
-
-static inline void idle_loop_prolog(unsigned long *in_purr)
-{
-   *in_purr = mfspr(SPRN_PURR);
-   /*
-* Indicate to the HV that we are idle. Now would be
-* a good time to find other work to dispatch.
-*/
-   get_lppaca()->idle = 1;
-}
-
-static inline void idle_loop_epilog(unsigned long in_purr)
-{
-   get_lppaca()->wait_state_cycles += mfspr(SPRN_PURR) - in_purr;
-   get_lppaca()->idle = 0;
-}
-
-static int snooze_loop(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv,
-   int index)
-{
-   unsigned long in_purr;
-   int cpu = dev->cpu;
-
-   idle_loop_prolog(_purr);
-   local_irq_enable();
-   set_thread_flag(TIF_POLLING_NRFLAG);
-
-   while ((!need_resched()) && cpu_online(cpu)) {
-   ppc64_runlatch_off();
-   HMT_low();
-   HMT_very_low();
-   }
-
-   HMT_medium();
-   clear_thread_flag(TIF_POLLING_NRFLAG);
-   smp_mb();
-
-   idle_loop_epilog(in_purr);
-
-   return index;
-}
-
-static void check_and_cede_processor(void)
-{
-   /*
-* Ensure our interrupt state is properly tracked,
-* also checks if no interrupt has occurred while we
-* were soft-disabled
-*/
-   if (prep_irq_for_idle()) {
-   cede_processor();
-#ifdef CONFIG_TRACE_IRQFLAGS
-   /* Ensure that H_CEDE returns with IRQs on */
-   if (WARN_ON(!(mfmsr() & MSR_EE)))
-   __hard_irq_enable();
-#endif
-   }
-}
-
-static int dedicated_cede_loop(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv,
-   int index)
-{
-   unsigned long in_purr;
-
-   idle_loop_prolog(_purr);
-   get_lppaca()->donate_dedicated_cpu = 1;
-
-   ppc64_runlatch_off();
-   HMT_medium();
-   check_and_cede_processor();
-
-   get_lppaca()->donate_dedicated_cpu = 0;
-
-   idle_loop_epilog(in_purr);
-
-   return index;
-}
-
-static int shared_cede_loop(struct 

[PATCH V2 5/6] cpuidle/powerpc: Backend-powerpc idle driver for powernv and pseries.

2013-07-30 Thread Deepthi Dharwar
The following patch extends the current pseries backend
idle driver to the powernv platform.

Signed-off-by: Deepthi Dharwar 
---
 arch/powerpc/include/asm/processor.h |2 -
 arch/powerpc/sysdev/Kconfig  |8 +-
 arch/powerpc/sysdev/Makefile |2 -
 arch/powerpc/sysdev/processor_idle.c |  132 ++
 4 files changed, 92 insertions(+), 52 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index 47a35b0..e64b817 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -426,7 +426,7 @@ enum idle_boot_override {IDLE_NO_OVERRIDE = 0, 
IDLE_POWERSAVE_OFF};
 extern int powersave_nap;  /* set if nap mode can be used in idle loop */
 extern void power7_nap(void);
 
-#ifdef CONFIG_PSERIES_IDLE
+#ifdef CONFIG_POWERPC_IDLE
 extern void update_smt_snooze_delay(int cpu, int residency);
 #else
 static inline void update_smt_snooze_delay(int cpu, int residency) {}
diff --git a/arch/powerpc/sysdev/Kconfig b/arch/powerpc/sysdev/Kconfig
index 8564a3f..f61d794 100644
--- a/arch/powerpc/sysdev/Kconfig
+++ b/arch/powerpc/sysdev/Kconfig
@@ -35,11 +35,11 @@ config GE_FPGA
bool
default n
 
-config PSERIES_IDLE
-   bool "Cpuidle driver for pSeries platforms"
+config POWERPC_IDLE
+   bool "Cpuidle driver for POWERPC platforms"
depends on CPU_IDLE
-   depends on PPC_PSERIES
+   depends on PPC_PSERIES || PPC_POWERNV
default y
help
  Select this option to enable processor idle state management
- for pSeries through cpuidle subsystem.
+ for POWER and POWERNV through cpuidle subsystem.
diff --git a/arch/powerpc/sysdev/Makefile b/arch/powerpc/sysdev/Makefile
index 93d2cdd..ec290e2 100644
--- a/arch/powerpc/sysdev/Makefile
+++ b/arch/powerpc/sysdev/Makefile
@@ -49,7 +49,7 @@ endif
 obj-$(CONFIG_PPC4xx_MSI)   += ppc4xx_msi.o
 obj-$(CONFIG_PPC4xx_CPM)   += ppc4xx_cpm.o
 obj-$(CONFIG_PPC4xx_GPIO)  += ppc4xx_gpio.o
-obj-$(CONFIG_PSERIES_IDLE) += processor_idle.o
+obj-$(CONFIG_POWERPC_IDLE) += processor_idle.o
 
 obj-$(CONFIG_CPM)  += cpm_common.o
 obj-$(CONFIG_CPM2) += cpm2.o cpm2_pic.o
diff --git a/arch/powerpc/sysdev/processor_idle.c 
b/arch/powerpc/sysdev/processor_idle.c
index 0d75a54..d152f540d 100644
--- a/arch/powerpc/sysdev/processor_idle.c
+++ b/arch/powerpc/sysdev/processor_idle.c
@@ -20,18 +20,18 @@
 #include 
 #include 
 
-/* Snooze Delay, pseries_idle */
+/* Snooze Delay, powerpc_idle */
 DECLARE_PER_CPU(long, smt_snooze_delay);
 
-struct cpuidle_driver pseries_idle_driver = {
-   .name = "pseries_idle",
+struct cpuidle_driver powerpc_idle_driver = {
+   .name = "powerpc_idle",
.owner= THIS_MODULE,
 };
 
 #define MAX_IDLE_STATE_COUNT   2
 
 static int max_idle_state = MAX_IDLE_STATE_COUNT - 1;
-static struct cpuidle_device __percpu *pseries_cpuidle_devices;
+static struct cpuidle_device __percpu *powerpc_cpuidle_devices;
 static struct cpuidle_state *cpuidle_state_table;
 
 static inline void idle_loop_prolog(unsigned long *in_purr)
@@ -55,13 +55,14 @@ static int snooze_loop(struct cpuidle_device *dev,
int index)
 {
unsigned long in_purr;
-   int cpu = dev->cpu;
 
+#ifndef PPC_POWERNV
idle_loop_prolog(_purr);
+#endif
local_irq_enable();
set_thread_flag(TIF_POLLING_NRFLAG);
 
-   while ((!need_resched()) && cpu_online(cpu)) {
+   while (!need_resched()) {
ppc64_runlatch_off();
HMT_low();
HMT_very_low();
@@ -71,7 +72,9 @@ static int snooze_loop(struct cpuidle_device *dev,
clear_thread_flag(TIF_POLLING_NRFLAG);
smp_mb();
 
+#ifndef PPC_POWERNV
idle_loop_epilog(in_purr);
+#endif
 
return index;
 }
@@ -135,10 +138,21 @@ static int shared_cede_loop(struct cpuidle_device *dev,
return index;
 }
 
+#ifdef PPC_POWERNV
+static int nap_loop(struct cpuidle_device *dev,
+   struct cpuidle_driver *drv,
+   int index)
+{
+   ppc64_runlatch_off();
+   power7_idle();
+   return index;
+}
+#endif
+
 /*
  * States for dedicated partition case.
  */
-static struct cpuidle_state dedicated_states[MAX_IDLE_STATE_COUNT] = {
+static struct cpuidle_state pseries_dedicated_states[MAX_IDLE_STATE_COUNT] = {
{ /* Snooze */
.name = "snooze",
.desc = "snooze",
@@ -158,7 +172,7 @@ static struct cpuidle_state 
dedicated_states[MAX_IDLE_STATE_COUNT] = {
 /*
  * States for shared partition case.
  */
-static struct cpuidle_state shared_states[MAX_IDLE_STATE_COUNT] = {
+static struct cpuidle_state pseries_shared_states[MAX_IDLE_STATE_COUNT] = {
{ /* Shared Cede */
.name = "Shared Cede",
.desc = "Shared Cede",
@@ -168,13 +182,34 @@ static struct cpuidle_state 

[PATCH V2 2/6] cpuidle/pseries: Remove dependency of pseries.h file

2013-07-30 Thread Deepthi Dharwar
As a part of the pseries_idle cleanup to make the backend driver
code common to both pseries and powernv, this patch
removes the dependency on the pseries.h
header file.

Move the declaration of smt_snooze_delay from pseries.h
to processor_idle.c. smt_snooze_delay variable is
needed for both pseries and powernv.

/* Snooze Delay, pseries_idle */
DECLARE_PER_CPU(long, smt_snooze_delay);

Signed-off-by: Deepthi Dharwar 
---
 arch/powerpc/platforms/pseries/processor_idle.c |4 +++-
 arch/powerpc/platforms/pseries/pseries.h|3 ---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/processor_idle.c 
b/arch/powerpc/platforms/pseries/processor_idle.c
index d6a1caa..64da5cc 100644
--- a/arch/powerpc/platforms/pseries/processor_idle.c
+++ b/arch/powerpc/platforms/pseries/processor_idle.c
@@ -20,7 +20,9 @@
 #include 
 
 #include "plpar_wrappers.h"
-#include "pseries.h"
+
+/* Snooze Delay, pseries_idle */
+DECLARE_PER_CPU(long, smt_snooze_delay);
 
 struct cpuidle_driver pseries_idle_driver = {
.name = "pseries_idle",
diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index c2a3a25..d1b07e6 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -60,9 +60,6 @@ extern struct device_node *dlpar_configure_connector(u32);
 extern int dlpar_attach_node(struct device_node *);
 extern int dlpar_detach_node(struct device_node *);
 
-/* Snooze Delay, pseries_idle */
-DECLARE_PER_CPU(long, smt_snooze_delay);
-
 /* PCI root bridge prepare function override for pseries */
 struct pci_host_bridge;
 int pseries_root_bridge_prepare(struct pci_host_bridge *bridge);



[PATCH V2 6/6] cpuidle/powernv: Enable idle powernv cpu to call into the cpuidle framework.

2013-07-30 Thread Deepthi Dharwar
This patch enables an idle powernv cpu to hook into the cpuidle
framework, if available, and otherwise fall back to the default
platform idle code.

Signed-off-by: Deepthi Dharwar 
---
 arch/powerpc/platforms/powernv/setup.c |   12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/setup.c 
b/arch/powerpc/platforms/powernv/setup.c
index 84438af..d4283dd 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -196,6 +197,15 @@ static int __init pnv_probe(void)
return 1;
 }
 
+void powernv_idle(void)
+{
+   /* Hook to cpuidle framework if available, else
+* call on default platform idle code
+   */
+   if (cpuidle_idle_call())
+   power7_idle();
+}
+
 define_machine(powernv) {
.name   = "PowerNV",
.probe  = pnv_probe,
@@ -205,7 +215,7 @@ define_machine(powernv) {
.show_cpuinfo   = pnv_show_cpuinfo,
.progress   = pnv_progress,
.machine_shutdown   = pnv_shutdown,
-   .power_save = power7_idle,
+   .power_save = powernv_idle,
.calibrate_decr = generic_calibrate_decr,
 #ifdef CONFIG_KEXEC
.kexec_cpu_down = pnv_kexec_cpu_down,



[PATCH V2 1/6] cpuidle/pseries: Fix kernel command line parameter smt-snooze-delay

2013-07-30 Thread Deepthi Dharwar
smt-snooze-delay is a tunable currently provided on powerpc to delay the
entry of an idle cpu into NAP state. By default the value is 100us,
which is the entry criterion for NAP, i.e. a cpu enters NAP only if
its idle period exceeds 100us. A value of -1 disables entry into NAP.
The value can be set through sysfs, via the ppc64_cpu utility, or by
passing it on the kernel command line. Currently this feature is broken
when the value is passed via the kernel command line.

This patch fixes this by taking the appropriate action
based on the value after the pseries driver is registered.
The check is carried out in the backend driver rather than in
setup_smt_snooze_delay(), because the cpuidle driver may not yet
be registered when the setup routine runs.
It also fixes re-enabling of NAP states by letting an appropriate
value be set without having to reboot.

Note that smt-snooze-delay is a per-cpu variable.
It can be used to enable/disable NAP on a per-cpu
basis through sysfs, but when the value is passed
on the kernel command line or set via ppc64_cpu
it applies to all cpus. Per-cpu tuning can
only be done via sysfs.

Signed-off-by: Deepthi Dharwar 
---
 arch/powerpc/platforms/pseries/processor_idle.c |   35 ++-
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/processor_idle.c 
b/arch/powerpc/platforms/pseries/processor_idle.c
index 4644efa0..d6a1caa 100644
--- a/arch/powerpc/platforms/pseries/processor_idle.c
+++ b/arch/powerpc/platforms/pseries/processor_idle.c
@@ -170,18 +170,37 @@ static struct cpuidle_state 
shared_states[MAX_IDLE_STATE_COUNT] = {
 void update_smt_snooze_delay(int cpu, int residency)
 {
struct cpuidle_driver *drv = cpuidle_get_driver();
-   struct cpuidle_device *dev = per_cpu(cpuidle_devices, cpu);
+   struct cpuidle_device *dev;
 
if (cpuidle_state_table != dedicated_states)
return;
 
-   if (residency < 0) {
-   /* Disable the Nap state on that cpu */
-   if (dev)
-   dev->states_usage[1].disable = 1;
-   } else
-   if (drv)
+   if (!drv)
+   return;
+
+   if (cpu == -1) {
+   if (residency < 0) {
+   /* Disable NAP on all cpus */
+   drv->states[1].disabled = true;
+   return;
+   } else {
drv->states[1].target_residency = residency;
+   drv->states[1].disabled = false;
+   return;
+   }
+   }
+
+   dev = per_cpu(cpuidle_devices, cpu);
+   if (!dev)
+   return;
+
+   if (residency < 0)
+   dev->states_usage[1].disable = 1;
+   else {
+   drv->states[1].target_residency = residency;
+   drv->states[1].disabled = false;
+   dev->states_usage[1].disable = 0;
+   }
 }
 
 static int pseries_cpuidle_add_cpu_notifier(struct notifier_block *n,
@@ -331,6 +350,8 @@ static int __init pseries_processor_idle_init(void)
return retval;
}
 
+   update_smt_snooze_delay(-1, per_cpu(smt_snooze_delay, 0));
+
retval = pseries_idle_devices_init();
if (retval) {
pseries_idle_devices_uninit();



[PATCH V2 0/6] cpuidle/powerpc: POWERPC cpuidle driver for POWER and POWERNV platforms

2013-07-30 Thread Deepthi Dharwar
The following patch series consolidates the backend cpuidle driver
for the pseries and powernv platforms.

The existing backend driver for pseries has been moved to
arch/powerpc/sysdev and has been extended to accommodate
powernv idle power management states.
As seen in V1 of this patch series, keeping a separate back-end
driver per platform results in too much code duplication.

Moving the idle states over to the cpuidle framework takes advantage
of the advanced heuristics, tunables and features it provides, and
additional idle states can be exploited through it. The statistics
and tracing infrastructure provided by the cpuidle framework also
helps enable power-management-related tools and helps tune the
system and applications.

This series aims to maintain compatibility and functionality with the
existing pseries and powernv idle cpu management code.
No new functions or idle states are added as part of this series;
it can be extended by adding more states to the existing framework.

With this patch series, the powernv cpuidle functionality is
on par with pSeries idle management. Other POWERPC
platforms can use this to exploit the CPUIDLE framework.

This series mainly focuses on an integrated CPUIDLE backend
driver for POWERPC. Minor cpuidle clean-ups will be
taken up going forward.

The POWERPC_IDLE config option must be enabled to use these
backend drivers.

V1 -> http://lkml.org/lkml/2013/7/23/143

 Deepthi Dharwar (6):
  cpuidle/pseries: Fix kernel command line parameter smt-snooze-delay
  cpuidle/pseries: Remove dependency of pseries.h file
  pseries: Move plpar_wrapper.h to powerpc common include/asm location.
  cpuidle/pseries: Move the pseries_idle backend driver to sysdev.
  cpuidle/powerpc: Backend-powerpc idle driver for powernv and pseries.
  cpuidle/powernv: Enable idle powernv cpu to call into the cpuidle 
framework.


  arch/powerpc/include/asm/plpar_wrappers.h   |  324 ++
 arch/powerpc/include/asm/processor.h|2 
 arch/powerpc/platforms/powernv/setup.c  |   12 +
 arch/powerpc/platforms/pseries/Kconfig  |9 
 arch/powerpc/platforms/pseries/Makefile |1 
 arch/powerpc/platforms/pseries/cmm.c|3 
 arch/powerpc/platforms/pseries/dtl.c|3 
 arch/powerpc/platforms/pseries/hotplug-cpu.c|3 
 arch/powerpc/platforms/pseries/hvconsole.c  |2 
 arch/powerpc/platforms/pseries/iommu.c  |3 
 arch/powerpc/platforms/pseries/kexec.c  |2 
 arch/powerpc/platforms/pseries/lpar.c   |2 
 arch/powerpc/platforms/pseries/plpar_wrappers.h |  324 --
 arch/powerpc/platforms/pseries/processor_idle.c |  362 
 arch/powerpc/platforms/pseries/pseries.h|3 
 arch/powerpc/platforms/pseries/setup.c  |2 
 arch/powerpc/platforms/pseries/smp.c|2 
 arch/powerpc/sysdev/Kconfig |9 
 arch/powerpc/sysdev/Makefile|1 
 arch/powerpc/sysdev/processor_idle.c|  424 +++
 20 files changed, 780 insertions(+), 713 deletions(-)
 create mode 100644 arch/powerpc/include/asm/plpar_wrappers.h
 delete mode 100644 arch/powerpc/platforms/pseries/plpar_wrappers.h
 delete mode 100644 arch/powerpc/platforms/pseries/processor_idle.c
 create mode 100644 arch/powerpc/sysdev/processor_idle.c


-- Deepthi



[PATCH v2] usb: core: don't try to reset_device() a port that got just disconnected

2013-07-30 Thread Julius Werner
The USB hub driver's event handler contains a check to catch SuperSpeed
devices that transitioned into the SS.Inactive state and tries to fix
them with a reset. It decides whether to do a plain hub port reset or
call the usb_reset_device() function based on whether there was a device
attached to the port.

However, there are device/hub combinations (found with a JetFlash
Transcend mass storage stick (8564:1000) on the root hub of an Intel
LynxPoint PCH) which can transition to the SS.Inactive state on
disconnect (and stay there long enough for the host to notice). In this
case, above-mentioned reset check will call usb_reset_device() on the
stale device data structure. The kernel will send pointless LPM control
messages to the no longer connected device address and can even cause
several 5 second khubd stalls on some (buggy?) host controllers, before
finally accepting the device's fate amongst a flurry of error messages.

This patch makes the choice of reset dependent on the port status that
has just been read from the hub in addition to the existence of an
in-kernel data structure for the device, and only proceeds with the more
extensive reset if both are valid.

Signed-off-by: Julius Werner 
---
 drivers/usb/core/hub.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 4a8a1d6..558313d 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -4798,7 +4798,8 @@ static void hub_events(void)
hub->ports[i - 1]->child;
 
dev_dbg(hub_dev, "warm reset port %d\n", i);
-   if (!udev) {
+   if (!udev || !(portstatus &
+   USB_PORT_STAT_CONNECTION)) {
status = hub_port_reset(hub, i,
NULL, HUB_BH_RESET_TIME,
true);
@@ -4808,8 +4809,8 @@ static void hub_events(void)
usb_lock_device(udev);
status = usb_reset_device(udev);
usb_unlock_device(udev);
+   connect_change = 0;
}
-   connect_change = 0;
}
 
if (connect_change)
-- 
1.7.12.4



Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

2013-07-30 Thread Hillf Danton
On Wed, Jul 31, 2013 at 10:27 AM, Joonsoo Kim  wrote:
> On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
>> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim  wrote:
>> > alloc_huge_page_node() use dequeue_huge_page_node() without
>> > any validation check, so it can steal reserved page unconditionally.
>>
>> Well, why is it illegal to use reserved page here?
>
> If we use a reserved page here, other processes which were promised
> enough hugepages may not get them and can die. This is an
> unexpected result for them.
>
But, how do you determine that a huge page is requested by a process
that is not allowed to use reserved pages?


Re: [PATCH 05/18] mm, hugetlb: protect region tracking via newly introduced resv_map lock

2013-07-30 Thread Joonsoo Kim
On Mon, Jul 29, 2013 at 11:53:05AM -0700, Davidlohr Bueso wrote:
> On Mon, 2013-07-29 at 14:31 +0900, Joonsoo Kim wrote:
> > There is a race condition if we map the same file from different processes.
> > Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> > When we do mmap, we don't grab a hugetlb_instantiation_mutex, but only
> > a mmap_sem. This doesn't prevent another process from modifying the region
> > structure, so it can be modified by two processes concurrently.
> > 
> > To solve this, I introduce a lock to resv_map and make the region
> > manipulation functions grab the lock before they do actual work. This
> > makes region tracking safe.
> 
> I think we can use a rwsem here instead of a spinlock, similar to my
> previous 1/2 approach.

Hello, Davidlohr.

As I said in reply to Hillf, I prefer to use a spinlock.
Please refer that reply.

Thanks.

> 
> > 
> > Signed-off-by: Joonsoo Kim 
> > 
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 2677c07..e29e28f 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -26,6 +26,7 @@ struct hugepage_subpool {
> >  
> >  struct resv_map {
> > struct kref refs;
> > +   spinlock_t lock;
> > struct list_head regions;
> >  };
> >  extern struct resv_map *resv_map_alloc(void);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 24c0111..bf2ee11 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -134,15 +134,8 @@ static inline struct hugepage_subpool 
> > *subpool_vma(struct vm_area_struct *vma)
> >   * Region tracking -- allows tracking of reservations and instantiated 
> > pages
> >   *across the pages in a mapping.
> >   *
> > - * The region data structures are protected by a combination of the 
> > mmap_sem
> > - * and the hugetlb_instantiation_mutex.  To access or modify a region the 
> > caller
> > - * must either hold the mmap_sem for write, or the mmap_sem for read and
> > - * the hugetlb_instantiation_mutex:
> > - *
> > - * down_write(>mmap_sem);
> > - * or
> > - * down_read(>mmap_sem);
> > - * mutex_lock(_instantiation_mutex);
> > + * The region data structures are embedded into a resv_map and
> > + * protected by a resv_map's lock
> >   */
> >  struct file_region {
> > struct list_head link;
> > @@ -155,6 +148,7 @@ static long region_add(struct resv_map *resv, long f, 
> > long t)
> > struct list_head *head = >regions;
> > struct file_region *rg, *nrg, *trg;
> >  
> > +   spin_lock(>lock);
> > /* Locate the region we are either in or before. */
> > list_for_each_entry(rg, head, link)
> > if (f <= rg->to)
> > @@ -184,6 +178,7 @@ static long region_add(struct resv_map *resv, long f, 
> > long t)
> > }
> > nrg->from = f;
> > nrg->to = t;
> > +   spin_unlock(>lock);
> > return 0;
> >  }
> >  
> > @@ -193,6 +188,7 @@ static long region_chg(struct resv_map *resv, long f, 
> > long t)
> > struct file_region *rg, *nrg;
> > long chg = 0;
> >  
> > +   spin_lock(>lock);
> > /* Locate the region we are before or in. */
> > list_for_each_entry(rg, head, link)
> > if (f <= rg->to)
> > @@ -203,14 +199,18 @@ static long region_chg(struct resv_map *resv, long f, 
> > long t)
> >  * size such that we can guarantee to record the reservation. */
> > if (>link == head || t < rg->from) {
> > nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> > -   if (!nrg)
> > -   return -ENOMEM;
> > +   if (!nrg) {
> > +   chg = -ENOMEM;
> > +   goto out;
> > +   }
> > +
> > nrg->from = f;
> > nrg->to   = f;
> > INIT_LIST_HEAD(&nrg->link);
> > list_add(&nrg->link, rg->link.prev);
> >  
> > -   return t - f;
> > +   chg = t - f;
> > +   goto out;
> > }
> >  
> > /* Round our left edge to the current segment if it encloses us. */
> > @@ -223,7 +223,7 @@ static long region_chg(struct resv_map *resv, long f, 
> > long t)
> > if (&rg->link == head)
> > break;
> > if (rg->from > t)
> > -   return chg;
> > +   goto out;
> >  
> > /* We overlap with this area, if it extends further than
> >  * us then we must extend ourselves.  Account for its
> > @@ -234,6 +234,9 @@ static long region_chg(struct resv_map *resv, long f, 
> > long t)
> > }
> > chg -= rg->to - rg->from;
> > }
> > +
> > +out:
> > +   spin_unlock(&resv->lock);
> > return chg;
> >  }
> >  
> > @@ -243,12 +246,13 @@ static long region_truncate(struct resv_map *resv, 
> > long end)
> > struct file_region *rg, *trg;
> > long chg = 0;
> >  
> > +   spin_lock(&resv->lock);
> > /* Locate the region we are either in or before. */
> > list_for_each_entry(rg, head, link)
> > if (end <= rg->to)
> > break;
> > if (&rg->link == head)
> > -   return 0;
> 
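[Editor's note: the region_add()/region_chg() list walk quoted above can be sketched as a runnable userspace analog. Everything below is an illustrative stand-in, not the kernel API: a sorted singly linked list of disjoint half-open [from, to) intervals, where the _chg function counts the units of a range not yet covered and the _add function inserts a range, absorbing any overlapping entries.]

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace analog of the file_region list: sorted, disjoint,
 * half-open [from, to) intervals. Names are illustrative only. */
struct region {
	long from, to;
	struct region *next;
};

/* region_chg() analog: how many units of [f, t) are not yet covered. */
static long region_chg_demo(const struct region *head, long f, long t)
{
	long chg = t - f;
	for (const struct region *rg = head; rg; rg = rg->next) {
		long lo = rg->from > f ? rg->from : f;
		long hi = rg->to < t ? rg->to : t;
		if (lo < hi)
			chg -= hi - lo;	/* subtract the covered overlap */
	}
	return chg;
}

/* region_add() analog: insert [f, t), absorbing overlapping or
 * adjacent entries so the list stays sorted and disjoint. */
static void region_add_demo(struct region **headp, long f, long t)
{
	struct region **pp = headp, *rg, *nrg;

	while ((rg = *pp) != NULL) {
		if (rg->to < f) {	/* strictly before us: keep */
			pp = &rg->next;
			continue;
		}
		if (rg->from > t)	/* strictly after us: done */
			break;
		if (rg->from < f)	/* overlaps/adjacent: absorb it */
			f = rg->from;
		if (rg->to > t)
			t = rg->to;
		*pp = rg->next;
		free(rg);
	}
	nrg = malloc(sizeof(*nrg));
	assert(nrg);
	nrg->from = f;
	nrg->to = t;
	nrg->next = *pp;
	*pp = nrg;
}
```

This mirrors why the kernel code needs a lock: both functions walk and splice the shared list, so two concurrent callers on the same resv_map would corrupt it.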

Re: [PATCH 05/18] mm, hugetlb: protect region tracking via newly introduced resv_map lock

2013-07-30 Thread Joonsoo Kim
On Mon, Jul 29, 2013 at 04:58:57PM +0800, Hillf Danton wrote:
> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim  wrote:
> > There is a race condition if we map the same file in different processes.
> > Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> > When we do mmap, we don't grab the hugetlb_instantiation_mutex, only
> > the mmap_sem. This doesn't prevent other processes from modifying the
> > region structure, so it can be modified by two processes concurrently.
> >
> > To solve this, I introduce a lock into resv_map and make the region
> > manipulation functions grab it before they do the actual work. This makes
> > region tracking safe.
> >
> > Signed-off-by: Joonsoo Kim 
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 2677c07..e29e28f 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -26,6 +26,7 @@ struct hugepage_subpool {
> >
> >  struct resv_map {
> > struct kref refs;
> > +   spinlock_t lock;
> > struct list_head regions;
> >  };
> >  extern struct resv_map *resv_map_alloc(void);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 24c0111..bf2ee11 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> [...]
> > @@ -193,6 +188,7 @@ static long region_chg(struct resv_map *resv, long f, 
> > long t)
> > struct file_region *rg, *nrg;
> > long chg = 0;
> >
> > +   spin_lock(&resv->lock);
> > /* Locate the region we are before or in. */
> > list_for_each_entry(rg, head, link)
> > if (f <= rg->to)
> > @@ -203,14 +199,18 @@ static long region_chg(struct resv_map *resv, long f, 
> > long t)
> >  * size such that we can guarantee to record the reservation. */
> > if (&rg->link == head || t < rg->from) {
> > nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> 
> Hm, you are allocating a piece of memory with the spin lock held.
> How about replacing that spin lock with a mutex?

I think the lock hold period here is very short, so a mutex is not
appropriate here. How about trying to allocate with GFP_NOWAIT? Then, if that
fails, release the lock, allocate with GFP_KERNEL, and retry from the beginning.

Thanks.

> 
> > -   if (!nrg)
> > -   return -ENOMEM;
> > +   if (!nrg) {
> > +   chg = -ENOMEM;
> > +   goto out;
> > +   }
> > +
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
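[Editor's note: the GFP_NOWAIT fallback Joonsoo suggests above can be sketched as a runnable userspace analog. The names below are stand-ins, not kernel APIs: alloc_nowait() plays kmalloc(GFP_NOWAIT) and is hard-wired to fail so the fallback path is exercised, alloc_sleeping() plays kmalloc(GFP_KERNEL), and the spinlock is a C11 atomic_flag.]

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

static atomic_flag demo_lock = ATOMIC_FLAG_INIT;
static void demo_spin_lock(void)   { while (atomic_flag_test_and_set(&demo_lock)) ; }
static void demo_spin_unlock(void) { atomic_flag_clear(&demo_lock); }

/* Stand-ins for kmalloc(GFP_NOWAIT) and kmalloc(GFP_KERNEL). */
static void *alloc_nowait(size_t sz)   { (void)sz; return NULL; }
static void *alloc_sleeping(size_t sz) { return malloc(sz); }

/* Try a non-sleeping allocation under the spinlock; on failure, drop
 * the lock, do a sleeping allocation, re-take the lock and recheck,
 * since the protected state may have changed in the meantime. */
void *alloc_node_locked(size_t sz)
{
	void *p;

	demo_spin_lock();
	/* ... scan the protected structure; decide a new node is needed ... */
	p = alloc_nowait(sz);
	if (!p) {
		demo_spin_unlock();
		p = alloc_sleeping(sz);	/* may sleep; lock is not held */
		if (!p)
			return NULL;	/* would be -ENOMEM in the kernel */
		demo_spin_lock();
		/* ... the lock was dropped, so recheck: if the node turned
		 * out to be unneeded, free it instead of linking it in ... */
	}
	/* ... link the node into the protected structure ... */
	demo_spin_unlock();
	return p;
}
```

The key invariant is that no allocation that can sleep happens while the spinlock is held; the price is the recheck after re-acquiring the lock.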


[ANNOUNCE] ktap 0.2 released

2013-07-30 Thread zhangwei(Jovi)
Dear,

I'm pleased to announce that ktap released v0.2, the archive is available at:

https://github.com/ktap/ktap/archive/v0.2.tar.gz


= what's ktap?

   A New Scripting Dynamic Tracing Tool For Linux

   ktap has different design principles from Linux mainstream dynamic tracing
   languages in that it's based on bytecode, so it doesn't depend upon GCC,
   doesn't require compiling a kernel module for each script, and is safe to
   use in production environments, fulfilling the embedded ecosystem's tracing needs.

   ktap is released under the GPL license.

   More information can be found at ktap/doc directory.

= Highlight features

   - support tracepoints, k(ret)probe, u(ret)probe, timer, function tracing, 
etc.
   - support x86, x86-64, powerpc, arm, and preempt-rt kernel.
   - support kernel 3.1 and later versions, include Linux mainline.

= Script highlight changes from v0.1

   * new tracing block syntax (introduce keywords: trace, trace_end)
trace EVENTDEF function (e) { BODY }
trace_end function () { BODY }

   * events became a built-in type in the language
print(e) will output the string representation of the event.

for syscalls:sys_enter_write event, print(e) will output like:
sys_write(fd: 3, buf: b9369c78, count: 60)

   * support trace filter
trace 'sched:sched_switch /prev_comm == foo || next_comm == foo/'

   * support kprobe/kretprobe (no debuginfo handling)
trace "probe:do_sys_open dfd=%di filename=%dx flags=%cx mode=+4($stack)"
trace "probe:do_sys_open%return fd=$retval"

   * support uprobe/uretprobe (no debuginfo handling)
trace "probe:/lib/libc.so.6:0x000773c0"
trace "probe:/lib/libc.so.6:0x000773c0%return"

   * support function tracing
trace "ftrace:function /ip == mutex*/"

   * support oneline scripting
ktap -e 'trace "syscalls:*" function (e) { print(e) }'

   * restrict tracing to a specific pid or cpu
ktap -C cpu *.kp
ktap -p pid *.kp

   * more sample scripts

   * support calling print_backtrace() in any context

   * support calling exit() in any context

= Backend highlight changes from v0.1

   * unified perf callback mechanism
   * use ring buffer transport instead of relayfs
   * reentrant in ktap tracing
   * performance boost (use percpu data in many cases)
   * safe table/string manipulation
   * safe ktap exit
   * big code cleanups
   * fixed a lot of bugs, more stable than v0.1

= Source code

   Please get latest code from:
   https://github.com/ktap/ktap.git

= Building & Running

   [root@jovi]# git clone https://github.com/ktap/ktap.git

   [root@jovi]# cd ktap
   [root@jovi]# make    # generate ktapvm kernel module and ktap binary

   [root@jovi]# insmod ./ktapvm.ko
   [root@jovi]# ./ktap scripts/syscalls.kp

= Samples scripts

   There are many sample scripts in the README and in ktap/scripts.

= Documentation

   Documentation is in ktap/doc/

= Mailing list

   k...@freelists.org
   You can subscribe to the ktap mailing list at:
http://www.freelists.org/list/ktap

= Feedback

   ktap is still under active development, so any feedback, bug reports,
   patches, and feature requests are, as always, welcome.


.jovi









Re: [PATCH 03/18] mm, hugetlb: unify region structure handling

2013-07-30 Thread Joonsoo Kim
On Tue, Jul 30, 2013 at 10:57:37PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim  writes:
> 
> > Currently, to track reserved and allocated regions, we use two different
> > ways for MAP_SHARED and MAP_PRIVATE. For MAP_SHARED, we use the
> > address_mapping's private_list and, for MAP_PRIVATE, we use a resv_map.
> > Now, we are preparing to change the coarse-grained lock which protects
> > the region structure to a fine-grained lock, and this difference hinders it.
> > So, before changing it, unify the region structure handling.
> >
> > Signed-off-by: Joonsoo Kim 
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index a3f868a..a1ae3ada 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -366,7 +366,12 @@ static void truncate_hugepages(struct inode *inode, 
> > loff_t lstart)
> >
> >  static void hugetlbfs_evict_inode(struct inode *inode)
> >  {
> > +   struct resv_map *resv_map;
> > +
> > truncate_hugepages(inode, 0);
> > +   resv_map = (struct resv_map *)inode->i_mapping->private_data;
> > +   if (resv_map)
> 
> can resv_map == NULL ?

Hello, Aneesh.

Yes, it can.
The root inode, which is allocated in hugetlbfs_get_root(), doesn't have a
resv_map.

> 
> 
> > +   kref_put(&resv_map->refs, resv_map_release);
> 
> Also the kref_put is confusing. For shared mapping we don't have ref
> count incremented right ? So may be you can directly call
> resv_map_release or add a comment around explaining this more ?

Yes, I could call resv_map_release() directly, but I think that releasing it
via reference management is better.

Thanks.
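[Editor's note: the "release via reference management" pattern Joonsoo prefers can be sketched as a runnable userspace analog of the kernel's kref. All _demo names below are illustrative stand-ins: the final put invokes the release callback, so no caller ever frees the object directly.]

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Minimal userspace kref analog. */
struct kref_demo { atomic_int refcount; };

static void kref_init_demo(struct kref_demo *k) { atomic_store(&k->refcount, 1); }
static void kref_get_demo(struct kref_demo *k)  { atomic_fetch_add(&k->refcount, 1); }

/* Returns 1 if this put released the object, 0 otherwise. */
static int kref_put_demo(struct kref_demo *k, void (*release)(struct kref_demo *))
{
	/* fetch_sub returns the previous value: 1 means we dropped it to 0 */
	if (atomic_fetch_sub(&k->refcount, 1) == 1) {
		release(k);
		return 1;
	}
	return 0;
}

struct resv_map_demo {
	struct kref_demo refs;	/* first member: &map->refs aliases map */
	/* ... region list and lock would live here ... */
};

static int demo_released;

static void resv_map_release_demo(struct kref_demo *k)
{
	/* The kernel would use container_of(); the cast works here only
	 * because refs is the first member of resv_map_demo. */
	free((struct resv_map_demo *)k);
	demo_released = 1;
}
```

The advantage over calling the release function directly is that every holder follows one rule (put your reference), and the object's lifetime rules live in exactly one place.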

> 
> 
> > clear_inode(inode);
> >  }
> >
> > @@ -468,6 +473,11 @@ static struct inode *hugetlbfs_get_inode(struct 
> > super_block *sb,
> > umode_t mode, dev_t dev)
> >  {
> > struct inode *inode;
> > +   struct resv_map *resv_map;
> > +
> > +   resv_map = resv_map_alloc();
> > +   if (!resv_map)
> > +   return NULL;
> 
> -aneesh
> 


Re: [PATCH] usb: core: don't try to reset_device() a port that got just disconnected

2013-07-30 Thread Julius Werner
> Wait a moment.  Why does each of these attempts lead to a 5-second
> timeout?  Why don't they fail immediately?

Now that you mention it, that's a very good question. The kernel
enqueues a control transfer to the now disconnected address because
it's internal bookkeeping is not yet updated, but I guess that should
usually fail very fast from an xHC-internal transaction timeout. I
have now tried to debug the cause of this: I see the transfer being
enqueued and the doorbell triggered, but never get a transfer event
back from it (until the software timeout calls usb_kill_urb() which
stops the endpoint). With the same setup on a PantherPoint system I
get the same U1/U2 disable control messages on unplugging, but they
fail within <5ms with a transaction error... so I guess this must be a
LynxPoint hardware bug.

Regardless, calling usb_reset_device() is still wrong and will at
least lead to pointless transfer attempts and error messages, so I
think my patch still has merit.

> What will happen here if udev is NULL?  Maybe you should change the
> test to (!udev || !(portstatus & ...)).

Right... I'm not sure if that can happen in practice, but I'll change
it just in case.


Re: [PATCH v2] staging/lustre: lloop depends on BLOCK

2013-07-30 Thread Xiong Zhou


On Tue, 30 Jul 2013, Peng Tao wrote:

> On Tue, Jul 30, 2013 at 8:29 AM, Xiong Zhou  wrote:
> > From: Xiong Zhou 
> >
> > In the lustre client driver, lloop depends on BLOCK. Add an
> > option for this dependence. The last version of this patch made
> > LUSTRE_FS depend on BLOCK.
> > Remove unnecessary jbd header files which depend on BLOCK.
> >
> Thanks for the patch. One minor comment below. Other than that, you can add
> Reviewed-by: Peng Tao 
> 
> > Signed-off-by: Xiong Zhou 
> > ---
> >  drivers/staging/lustre/lustre/Kconfig  |7 ++-
> >  drivers/staging/lustre/lustre/fld/fld_cache.c  |1 -
> >  drivers/staging/lustre/lustre/fld/fld_request.c|1 -
> >  .../lustre/lustre/include/linux/lustre_compat25.h  |2 ++
> >  drivers/staging/lustre/lustre/llite/Makefile   |2 +-
> >  drivers/staging/lustre/lustre/lvfs/fsfilt.c|1 -
> >  6 files changed, 9 insertions(+), 5 deletions(-)
> >
> >  #define TREE_READ_UNLOCK_IRQ(mapping)  
> > spin_unlock_irq(&(mapping)->tree_lock)
> >
> > +#ifdef CONFIG_BLOCK
> >  static inline
> >  int ll_unregister_blkdev(unsigned int dev, const char *name)
> It is better to remove the wrapper and just call unregister_blkdev in
> lloop.c. The wrapper exists to support old kernels in Lustre master
> but doesn't make sense in upstream kernel.
> 
> Thanks,
> Tao
> 

Good comment, Thanks for the review.



From: Xiong Zhou 

The first version of this patch made LUSTRE_FS depend on BLOCK.
The second version made only lloop depend on BLOCK, with a config
option for this dependence, and removed unnecessary jbd header
files which depend on BLOCK.
This version removes the wrapper ll_unregister_blkdev, which
depends on BLOCK, from the header and just calls unregister_blkdev in
lloop.c, based on Peng Tao's comment.

Signed-off-by: Xiong Zhou 
Reviewed-by: Peng Tao 
---
 drivers/staging/lustre/lustre/Kconfig  |7 ++-
 drivers/staging/lustre/lustre/fld/fld_cache.c  |1 -
 drivers/staging/lustre/lustre/fld/fld_request.c|1 -
 .../lustre/lustre/include/linux/lustre_compat25.h  |7 ---
 drivers/staging/lustre/lustre/llite/Makefile   |2 +-
 drivers/staging/lustre/lustre/llite/lloop.c|6 ++
 drivers/staging/lustre/lustre/lvfs/fsfilt.c|1 -
 7 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/drivers/staging/lustre/lustre/Kconfig 
b/drivers/staging/lustre/lustre/Kconfig
index 0b45de0..4e898e4 100644
--- a/drivers/staging/lustre/lustre/Kconfig
+++ b/drivers/staging/lustre/lustre/Kconfig
@@ -1,6 +1,6 @@
 config LUSTRE_FS
tristate "Lustre file system client support"
-   depends on STAGING && INET && BLOCK && m
+   depends on INET && m
select LNET
select CRYPTO
select CRYPTO_CRC32
@@ -53,3 +53,8 @@ config LUSTRE_TRANSLATE_ERRNOS
bool
depends on LUSTRE_FS && !X86
default true
+
+config LUSTRE_LLITE_LLOOP
+   bool "Lustre virtual block device"
+   depends on LUSTRE_FS && BLOCK
+   default m
diff --git a/drivers/staging/lustre/lustre/fld/fld_cache.c 
b/drivers/staging/lustre/lustre/fld/fld_cache.c
index 347f2ae..8410107 100644
--- a/drivers/staging/lustre/lustre/fld/fld_cache.c
+++ b/drivers/staging/lustre/lustre/fld/fld_cache.c
@@ -45,7 +45,6 @@
 
 # include 
 # include 
-# include 
 # include 
 
 #include 
diff --git a/drivers/staging/lustre/lustre/fld/fld_request.c 
b/drivers/staging/lustre/lustre/fld/fld_request.c
index c99b945..049322a 100644
--- a/drivers/staging/lustre/lustre/fld/fld_request.c
+++ b/drivers/staging/lustre/lustre/fld/fld_request.c
@@ -44,7 +44,6 @@
 
 # include 
 # include 
-# include 
 # include 
 
 #include 
diff --git a/drivers/staging/lustre/lustre/include/linux/lustre_compat25.h 
b/drivers/staging/lustre/lustre/include/linux/lustre_compat25.h
index 426533b..018e604 100644
--- a/drivers/staging/lustre/lustre/include/linux/lustre_compat25.h
+++ b/drivers/staging/lustre/lustre/include/linux/lustre_compat25.h
@@ -111,13 +111,6 @@ static inline void ll_set_fs_pwd(struct fs_struct *fs, 
struct vfsmount *mnt,
 #define TREE_READ_LOCK_IRQ(mapping)spin_lock_irq(&(mapping)->tree_lock)
 #define TREE_READ_UNLOCK_IRQ(mapping)  spin_unlock_irq(&(mapping)->tree_lock)
 
-static inline
-int ll_unregister_blkdev(unsigned int dev, const char *name)
-{
-   unregister_blkdev(dev, name);
-   return 0;
-}
-
 #define ll_invalidate_bdev(a,b) invalidate_bdev((a))
 
 #ifndef FS_HAS_FIEMAP
diff --git a/drivers/staging/lustre/lustre/llite/Makefile 
b/drivers/staging/lustre/lustre/llite/Makefile
index dff0c04..f493e07 100644
--- a/drivers/staging/lustre/lustre/llite/Makefile
+++ b/drivers/staging/lustre/lustre/llite/Makefile
@@ -1,5 +1,5 @@
 obj-$(CONFIG_LUSTRE_FS) += lustre.o
-obj-$(CONFIG_LUSTRE_FS) += llite_lloop.o
+obj-$(CONFIG_LUSTRE_LLITE_LLOOP) += llite_lloop.o
 lustre-y := dcache.o dir.o file.o llite_close.o llite_lib.o llite_nfs.o \
rw.o lproc_llite.o namei.o symlink.o 

Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

2013-07-30 Thread Joonsoo Kim
On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim  wrote:
> > alloc_huge_page_node() uses dequeue_huge_page_node() without
> > any validation check, so it can steal reserved pages unconditionally.
> 
> Well, why is it illegal to use reserved page here?

Hello, Hillf.

If we use a reserved page here, other processes which were promised enough
hugepages cannot get them and can die. This is an unexpected result for
them.

Thanks.
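[Editor's note: the accounting check under discussion can be seen with a toy model. The struct and function below are illustrative; only the two field names mirror the kernel's struct hstate, where free_huge_pages counts all free huge pages and resv_huge_pages counts those already promised to existing reservations.]

```c
/* Toy model of the hugepage accounting consulted by the patch. */
struct hstate_demo {
	long free_huge_pages;	/* all free huge pages */
	long resv_huge_pages;	/* free pages promised to reservations */
};

/* Patched behavior: hand out a free page only if at least one free
 * page is NOT backing a reservation; otherwise the caller must fall
 * back (the kernel falls back to alloc_buddy_huge_page()). */
static int dequeue_unreserved(struct hstate_demo *h)
{
	if (h->free_huge_pages - h->resv_huge_pages > 0) {
		h->free_huge_pages--;
		return 1;
	}
	return 0;
}
```

Without the check, a soft-offline allocation would decrement free_huge_pages below resv_huge_pages, so a process that made (and was promised) a reservation could later fault and find no page.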

> 
> > To fix it, check the number of free_huge_page in alloc_huge_page_node().
> >
> > Signed-off-by: Joonsoo Kim 
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 6782b41..d971233 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -935,10 +935,11 @@ static struct page *alloc_buddy_huge_page(struct 
> > hstate *h, int nid)
> >   */
> >  struct page *alloc_huge_page_node(struct hstate *h, int nid)
> >  {
> > -   struct page *page;
> > +   struct page *page = NULL;
> >
> > spin_lock(&hugetlb_lock);
> > -   page = dequeue_huge_page_node(h, nid);
> > +   if (h->free_huge_pages - h->resv_huge_pages > 0)
> > +   page = dequeue_huge_page_node(h, nid);
> > spin_unlock(&hugetlb_lock);
> >
> > if (!page)
> > --
> > 1.7.9.5
> >
> 


Re: [REGRESSION/PATCH] acpi: blacklist win8 OSI for ASUS Zenbok Prime UX31A

2013-07-30 Thread Aaron Lu
On 07/31/2013 10:07 AM, Felipe Contreras wrote:
> On Tue, Jul 30, 2013 at 8:36 PM, Aaron Lu  wrote:
>> On 07/31/2013 08:11 AM, Felipe Contreras wrote:
>>> On Tue, Jul 30, 2013 at 6:13 PM, Rafael J. Wysocki  wrote:
>>>
> If 0 turns the screen off with the intel driver, 0 should turn the
> screen off with the ACPI driver, having inconsistent behavior
> depending on which driver is used is a bug.

 The ACPI driver simply exposes an interface to interact with the AML 
 methods
 in the BIOS directly.
>>>
>>> No, the ACPI driver is exposing a backlight interface, which has a
>>> defined stable API.
>>>
>>> Documentation/ABI/stable/sysfs-class-backlight
>>>
>>> Yes, the interface doesn't define what should happen at 0, that is a
>>> bug in the interface definition.
>>>
>>> *How* it achieves that is an implementation detail.
>>>
 Yes, this is a mistake and shouldn't be designed this way.

 However, incidentally, this makes backlight control work on your machine.

 Anyway, we need all backlight drivers to work consistently and don't tempt 
 me
 to rip the ACPI driver entirely from the kernel for what it's worth.
>>>
>>> Yes, they should work consistently, and go ahead, rip the ACPI driver,
>>> *then* you'll see many more people complaining about the Linux kernel
>>> breaking user-space, which should never happen. Mistakes happen, but
>>> if you do this willingly and knowingly, I think there would be
>>> repercussions for you.
>>>
 Yes, that will break backlight on your system and *then* you can complain 
 to
 Linus if you wish.
>>>
>>> It is already broken in v3.11-rc3, in fact I just booted that to try
>>> it out and it booted with the screen completely black (fortunately I
>>> knew exactly what to type to change that).
>>
>> That is bad, can you please file a bug for that? I'll need to take a
>> look at your ACPI tables, thanks.
> 
> File a bug where?

https://bugzilla.kernel.org/
For component, please choose ACPI->Power_Video.

> 
>>> Apparently this commit also needs to be reverted: efaa14c (ACPI /
>>> video: no automatic brightness changes by win8-compatible firmware).
>>> In this machine it makes the backlight work again (without
>>> acpi_osi="!Windows 2012"), but by doing so the ACPI driver also turns
>>> off the screen completely at level 0. Also, each time I change the
>>
>> So with rc3 and commit efaa14c reverted, when you set level 0 to ACPI
>> video's backlight interface, the screen will be off now? And this is not
>> the case in 3.10, right?
> 
> No, setting level 0 turns it off if efaa14c is *not* reverted. In 3.10
> 0 doesn't turn it off.

AFAIK, Win8 firmware will check a flag to decide what to do on hotkey
press and backlight brightness level change. Common behavior is:
If that flag is set:
  - on hotkey press, send out event;
  - on brightness level change, use GPU driver's interface to change
backlight level following operation region spec.
If that flag is not set:
  - on hotkey press, adjust the brightness level without sending out
event;
  - on brightness level change, use EC to change the brightness level.

The said commit sets the flag, so it seems that with the flag now set,
it is possible the firmware interface will also give a screen-off
result. But since we want to have the notification event, we will have to
set that flag, I think.

> 
>>> backlight level from X, the screen blinks as if going 100%, 0%, and
>>> then the desired level.
>>
>> Please attach acpidump output to the to be opened bugzilla page, thanks.
> 
> I looked at the code in the DCDT, it appears to me that they store
> different levels depending on the OSI version, so I don't think the
> problem is in the ACPI driver. Yet, the issue remains there.

It's good to know what is the problem, so I would like to see the table.
Most of the bugs I solved in ACPI video driver's backlight control are
caused by firmware, so yes, it's very unlikely a bug of the ACPI video
driver itself.

-Aaron


Re: [REGRESSION/PATCH] acpi: blacklist win8 OSI for ASUS Zenbok Prime UX31A

2013-07-30 Thread Felipe Contreras
On Tue, Jul 30, 2013 at 8:59 PM, Aaron Lu  wrote:
> On 07/31/2013 04:59 AM, Felipe Contreras wrote:

>> There is another interface the turn the screen off.
>>
>> If 0 turns the screen off with the intel driver, 0 should turn the
>> screen off with the ACPI driver, having inconsistent behavior
>> depending on which driver is used is a bug.
>
> I'm not sure of this. Remember the days when we switched the hard disk
> driver from IDE to SCSI? The block device name changed from hdx to sdx.
> Is this a bug?

The name might have changed, but the way the interface worked did not;
both hdx and sdx worked exactly the same.

-- 
Felipe Contreras


Re: [RFC PATCH] cfq-iosched: limit slice_idle when many busy queues are in idle window

2013-07-30 Thread Shaohua Li
On Tue, Jul 30, 2013 at 03:30:33PM -0400, Tomoki Sekiyama wrote:
> Hi,
> 
> When some application launches several hundred processes that issue
> only a few small sync I/O requests, CFQ may cause heavy latencies
> (10+ seconds in the worst case), although the request rate is low enough for
> the disk to handle it without waiting. This is because CFQ waits for
> slice_idle (default: 8ms) before processing each request, until
> their thinktimes are evaluated.
> 
> This scenario can be reproduced using fio with parameters below:
>   fio -filename=/tmp/test -rw=randread -size=5G -runtime=15 -name=file1 \
>   -bs=4k -numjobs=500 -thinktime=100
> In this case, 500 processes issue a random read request every second.

For this workload CFQ should perfectly detect it's a seeky queue and disable
idling. I suppose the reason is that CFQ doesn't have enough data/time to
disable idling yet, since your thinktime is long and the runtime is short.

I think the real problem here is that cfq_init_cfqq() shouldn't set idle_window
when initializing a queue. We should enable the idle window only after we
detect that the queue is worth idling on.

Thanks,
Shaohua
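[Editor's note: Shaohua's alternative above — start queues without the idle window and enable it only once observed think times justify it — can be sketched with a toy tracker. All names below are illustrative, not the kernel's; the point is only the decision rule.]

```c
#include <stdbool.h>

/* Toy think-time tracker in the spirit of CFQ's ttime samples. */
struct ttime_demo {
	long samples;	/* number of think-time samples seen */
	long total_ns;	/* sum of the samples, in nanoseconds */
};

static void ttime_sample(struct ttime_demo *t, long ns)
{
	t->samples++;
	t->total_ns += ns;
}

/* Enable idling only once enough history shows the mean think time is
 * shorter than slice_idle; a brand-new queue is never idled on. */
static bool should_idle(const struct ttime_demo *t, long slice_idle_ns)
{
	if (t->samples < 4)	/* no history yet: don't pay for idling */
		return false;
	return t->total_ns / t->samples < slice_idle_ns;
}
```

Under this rule the fio reproduction above (hundreds of queues, long think times, short runtime) would never enable idling, instead of paying slice_idle per queue until the heuristic catches up.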


Re: [REGRESSION/PATCH] acpi: blacklist win8 OSI for ASUS Zenbok Prime UX31A

2013-07-30 Thread Felipe Contreras
On Tue, Jul 30, 2013 at 8:36 PM, Aaron Lu  wrote:
> On 07/31/2013 08:11 AM, Felipe Contreras wrote:
>> On Tue, Jul 30, 2013 at 6:13 PM, Rafael J. Wysocki  wrote:
>>
 If 0 turns the screen off with the intel driver, 0 should turn the
 screen off with the ACPI driver, having inconsistent behavior
 depending on which driver is used is a bug.
>>>
>>> The ACPI driver simply exposes an interface to interact with the AML 
>>> methods
>>> in the BIOS directly.
>>
>> No, the ACPI driver is exposing a backlight interface, which has a
>> defined stable API.
>>
>> Documentation/ABI/stable/sysfs-class-backlight
>>
>> Yes, the interface doesn't define what should happen at 0, that is a
>> bug in the interface definition.
>>
>> *How* it achieves that is an implementation detail.
>>
>>> Yes, this is a mistake and shouldn't be designed this way.
>>>
>>> However, incidentally, this makes backlight control work on your machine.
>>>
>>> Anyway, we need all backlight drivers to work consistently and don't tempt 
>>> me
>>> to rip the ACPI driver entirely from the kernel for what it's worth.
>>
>> Yes, they should work consistently, and go ahead, rip the ACPI driver,
>> *then* you'll see many more people complaining about the Linux kernel
>> breaking user-space, which should never happen. Mistakes happen, but
>> if you do this willingly and knowingly, I think there would be
>> repercussions for you.
>>
>>> Yes, that will break backlight on your system and *then* you can complain to
>>> Linus if you wish.
>>
>> It is already broken in v3.11-rc3, in fact I just booted that to try
>> it out and it booted with the screen completely black (fortunately I
>> knew exactly what to type to change that).
>
> That is bad, can you please file a bug for that? I'll need to take a
> look at your ACPI tables, thanks.

File a bug where?

>> Apparently this commit also needs to be reverted: efaa14c (ACPI /
>> video: no automatic brightness changes by win8-compatible firmware).
>> In this machine it makes the backlight work again (without
>> acpi_osi="!Windows 2012"), but by doing so the ACPI driver also turns
>> off the screen completely at level 0. Also, each time I change the
>
> So with rc3 and commit efaa14c reverted, when you set level 0 to ACPI
> video's backlight interface, the screen will be off now? And this is not
> the case in 3.10, right?

No, setting level 0 turns it off if efaa14c is *not* reverted. In 3.10
0 doesn't turn it off.

>> backlight level from X, the screen blinks as if going 100%, 0%, and
>> then the desired level.
>
> Please attach acpidump output to the to be opened bugzilla page, thanks.

I looked at the code in the DCDT, it appears to me that they store
different levels depending on the OSI version, so I don't think the
problem is in the ACPI driver. Yet, the issue remains there.

-- 
Felipe Contreras


Re: [PATCH] f2fs: add a wait step when submit bio with {READ,WRITE}_SYNC

2013-07-30 Thread Gu Zheng
Hi Kim,

On 07/30/2013 08:29 PM, Jaegeuk Kim wrote:

> Hi Gu,
> 
> The original read flow was to avoid redundant lock/unlock_page() calls.

Right, this can gain better read performance. But is the wait step after 
submitting a bio with READ_SYNC needless too?

> And we should not wait for WRITE_SYNC, since it is just for write
> priority, not for synchronization of the file system.

Got it, thanks for your explanation.:) 

Best regards,
Gu

> Thanks,
> 
> 2013-07-30 (화), 18:06 +0800, Gu Zheng:
>> When we submit a bio with READ_SYNC or WRITE_SYNC, we need to wait a
>> moment for the I/O completion. Currently only find_data_page() follows the
>> rule; other places miss this step, so add it.
>>
>> Furthermore, move the PageUptodate check into f2fs_readpage() to clean up
>> the code.
>>
>> Signed-off-by: Gu Zheng 
>> ---
>>  fs/f2fs/checkpoint.c |1 -
>>  fs/f2fs/data.c   |   39 +--
>>  fs/f2fs/node.c   |1 -
>>  fs/f2fs/recovery.c   |2 --
>>  fs/f2fs/segment.c|2 +-
>>  5 files changed, 18 insertions(+), 27 deletions(-)
>>
>> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
>> index fe91773..e376a42 100644
>> --- a/fs/f2fs/checkpoint.c
>> +++ b/fs/f2fs/checkpoint.c
>> @@ -64,7 +64,6 @@ repeat:
>>  if (f2fs_readpage(sbi, page, index, READ_SYNC))
>>  goto repeat;
>>  
>> -lock_page(page);
>>  if (page->mapping != mapping) {
>>  f2fs_put_page(page, 1);
>>  goto repeat;
>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>> index 19cd7c6..b048936 100644
>> --- a/fs/f2fs/data.c
>> +++ b/fs/f2fs/data.c
>> @@ -216,13 +216,11 @@ struct page *find_data_page(struct inode *inode, 
>> pgoff_t index, bool sync)
>>  
>>  err = f2fs_readpage(sbi, page, dn.data_blkaddr,
>>  sync ? READ_SYNC : READA);
>> -if (sync) {
>> -wait_on_page_locked(page);
>> -if (!PageUptodate(page)) {
>> -f2fs_put_page(page, 0);
>> -return ERR_PTR(-EIO);
>> -}
>> -}
>> +if (err)
>> +return ERR_PTR(err);
>> +
>> +if (sync)
>> +unlock_page(page);
>>  return page;
>>  }
>>  
>> @@ -267,11 +265,6 @@ repeat:
>>  if (err)
>>  return ERR_PTR(err);
>>  
>> -lock_page(page);
>> -if (!PageUptodate(page)) {
>> -f2fs_put_page(page, 1);
>> -return ERR_PTR(-EIO);
>> -}
>>  if (page->mapping != mapping) {
>>  f2fs_put_page(page, 1);
>>  goto repeat;
>> @@ -325,11 +318,7 @@ repeat:
>>  err = f2fs_readpage(sbi, page, dn.data_blkaddr, READ_SYNC);
>>  if (err)
>>  return ERR_PTR(err);
>> -lock_page(page);
>> -if (!PageUptodate(page)) {
>> -f2fs_put_page(page, 1);
>> -return ERR_PTR(-EIO);
>> -}
>> +
>>  if (page->mapping != mapping) {
>>  f2fs_put_page(page, 1);
>>  goto repeat;
>> @@ -399,6 +388,16 @@ int f2fs_readpage(struct f2fs_sb_info *sbi, struct page 
>> *page,
>>  
>>  submit_bio(type, bio);
>>  up_read(&sbi->bio_sem);
>> +
>> +if (type == READ_SYNC) {
>> +wait_on_page_locked(page);
>> +lock_page(page);
>> +if (!PageUptodate(page)) {
>> +f2fs_put_page(page, 1);
>> +return -EIO;
>> +}
>> +}
>> +
>>  return 0;
>>  }
>>  
>> @@ -679,11 +678,7 @@ repeat:
>>  err = f2fs_readpage(sbi, page, dn.data_blkaddr, READ_SYNC);
>>  if (err)
>>  return err;
>> -lock_page(page);
>> -if (!PageUptodate(page)) {
>> -f2fs_put_page(page, 1);
>> -return -EIO;
>> -}
>> +
>>  if (page->mapping != mapping) {
>>  f2fs_put_page(page, 1);
>>  goto repeat;
>> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
>> index f5172e2..f061554 100644
>> --- a/fs/f2fs/node.c
>> +++ b/fs/f2fs/node.c
>> @@ -1534,7 +1534,6 @@ int restore_node_summary(struct f2fs_sb_info *sbi,
>>  if (f2fs_readpage(sbi, page, addr, READ_SYNC))
>>  goto out;
>>  
>> -lock_page(page);
>>  rn = F2FS_NODE(page);
>>  sum_entry->nid = rn->footer.nid;
>>  sum_entry->version = 0;
>> diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c
>> index 639eb34..ec68183 100644
>> --- a/fs/f2fs/recovery.c
>> +++ b/fs/f2fs/recovery.c
>> @@ -140,8 +140,6 @@ static int find_fsync_dnodes(struct f2fs_sb_info *sbi, struct list_head *head)
>>  if (err)
>>  goto out;
>>  
>> -lock_page(page);
>> -
>>  if (cp_ver != cpver_of_node(page))
>>  break;
>>  
>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c

Re: [REGRESSION/PATCH] acpi: blacklist win8 OSI for ASUS Zenbook Prime UX31A

2013-07-30 Thread Aaron Lu
On 07/31/2013 04:59 AM, Felipe Contreras wrote:
> On Tue, Jul 30, 2013 at 8:42 AM, Rafael J. Wysocki  wrote:
>> On Tuesday, July 30, 2013 01:57:55 PM Aaron Lu wrote:
>>> On 07/30/2013 01:51 PM, Aaron Lu wrote:
 On 07/30/2013 11:44 AM, Felipe Contreras wrote:
> On Mon, Jul 29, 2013 at 10:11 PM, Aaron Lu  wrote:
>> On 07/30/2013 03:20 AM, Felipe Contreras wrote:
>>> Since v3.7 the acpi backlight driver doesn't work at all on this machine
>>> because presumably the ACPI code contains stub code when Windows 8 OSI 
>>> is
>>> reported.
>>>
>>> The commit ea45ea7 (in v3.11-rc2) tried to fix this problem by using 
>>> the intel
>>> backlight driver, however, on this machine it turns the backlight 
>>> completely
>>> off when it reaches level 0%, after which the user might have a lot 
>>> of trouble
>>> trying to bring it back.
>>
>> What do you mean by a lot of trouble? If you press hotkey to increase
>> backlight brightness level, does it work?
>
> I guess so, *if* there is indeed a user-space power manager handling
> that, *and* the keyboard has such keys, *or* the user has set custom
> hotkeys.

 Right, for a GUI environment this may not be a big problem(user uses Fn
 key to decrease brightness level and then hit the black screen, it's
 natural he will use Fn key to increase brightness level).

>
>> If so, the screen should not
>> be black any more, it's not that user has to blindly enter some command
>> to get out of the black screen.
>>
>> And I'm not sure if this is a bug of intel_backlight(setting a low level
>> makes the screen almost off), I see different models with different
>> vendors having this behavior.
>
> I mean, the screen is completely off, I cannot see absolutely
> anything. I don't see this behavior with the ACPI backlight driver,
> nor do I see that in Windows 7.
>
>> If this is deemed a bug, then I'm afraid
>> intel_backlight interface is useless for a lot of systems...perhaps we
>> can only say, intel_backlight's interpretation of low levels are
>> different with ACPI video's, and that's probably why its type is named
>> as raw :-)
>
> Well, a bug is defined as unexpected behavior -- as a user, if I'm
> changing the brightness of the screen, I certainly don't expect the
> screen to turn off, and I think that's the expectation from most
> people. It's the first time I see something like that.

 I agree this is kind of un-expected. At the same time, this seems to be
 the normal behavior for intel_backlight. I don't know what the correct
 thing to do here if this is something we want to avoid - modify intel
 backlight handling code not to set too low value or change the user
 space tool not to set a too low value if they are interacting with a
 raw type interface. Neither of them sounds cool... Or, users may get
 used to it, I for example, don't find this to be very annoying, but
 maybe I'm already used to it.
>>>
>>> BTW, for the complete screen off problem, I don't see there is anything
>>> wrong with it from code's point of view. It's not that there is an error
>>> in code or this is a broken hardware that caused the screen off when
>>> setting a very low or 0 brightness level, it is simply the expected
>>> behavior of what this interface can provide. It can really set the
>>> brightness level to minimum(zero) or maximum. Don't get me wrong, I
>>> didn't mean this is a good user experience, I don't know that. I just
>>> don't think this is a program bug, and I don't know if this should be
>>> fixed or not - obviously this interface did what it is asked to do,
>>> correctly.
>>
>> Precisely, user space asks for 0 and the kernel delivers.
>>
>> And there are reasons why 0 should be "screen off", like power management
>> (when you have a policy to dim the screen completely after a period of
>> inactivity, for example).
> 
> There is another interface to turn the screen off.
> 
> If 0 turns the screen off with the intel driver, 0 should turn the
> screen off with the ACPI driver, having inconsistent behavior
> depending on which driver is used is a bug.

I'm not sure of this. Remember the days when we switched the hard disk
driver from IDE to SCSI? The block device names changed from hdx to sdx.
Is this a bug?

> 
> If 0 did not turn off the screen in v3.10, 0 should not turn off the
> screen in v3.11, to do so would be a *regression*.

That depends on how you see it. I believe 0 also turns off the screen in
v3.10 if we are talking about the same driver(intel_backlight).

-Aaron

> 
>> So in my opinion, if that's a problem for anyone, it has to be addressed in
>> user space and if there are any vendors who try to address *that* in their 
>> ACPI
>> tables, that's one more reason to avoid using ACPI for backlight control.
> 
> If you think it's the user-space 

Re: [PATCH trivial] include/linux/coda.h: remove useless '#else'

2013-07-30 Thread Chen Gang
On 07/31/2013 09:44 AM, Chen Gang wrote:
> On 07/30/2013 08:29 PM, Joe Perches wrote:
>> On Tue, 2013-07-30 at 15:30 +0800, Chen Gang wrote:
>>> '#else' is useless, need remove.
>>>
>>> Signed-off-by: Chen Gang 
>>> ---
>>>  include/linux/coda.h |1 -
>>>  1 files changed, 0 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/include/linux/coda.h b/include/linux/coda.h
>>> index cff544f..d30209b 100644
>>> --- a/include/linux/coda.h
>>> +++ b/include/linux/coda.h
>>> @@ -60,7 +60,6 @@ Mellon the rights to redistribute these changes without 
>>> encumbrance.
>>>  
>>>  #if defined(__linux__)
>>>  typedef unsigned long long u_quad_t;
>>> -#else
>>>  #endif
>>>  #include 
>>>  #endif 
>>
>> Why have the #if at all?
>>
>>
>>
> 
> Hmm... some old version compilers do not define __linux__ automatically
> (e.g. or32-linux-gcc 4.5.1-or32-1.0rc1, which is the latest cross
> compiler for openrisc, though).
> 
> If not define __linux__, the compiler will report error (u_quad_t is
> not defined).
> 
> When we remove "#if defined(__linux__)" from "include/linux/coda.h",
> the compiler will not report error, and 'cdev_t' will be defined as
> 'dev_t', not 'u_quad_t' (I guess, it is an implicit bug).
> 
> 
> In "uapi/include/*", only "coda.h" has "#if defined(__linux__)", maybe
> they want multiple OS can share the same "coda.h" file (at least, we
> can not say it is a bad idea).
> 
> Neither suitable to define __linux__ in "include/linux/coda.h".
> 
> 
> All together, I think:
> 
>   the direct cause:
> "uapi/include/coda.h" wants to share itself for multiple OS (so they need 
> check __linux__).
>   (in "uapi/include/*", only coda.h need check __linux__)
> 
>   the root cause:
> the compiler for linux should define __linux__ automatically, the latest 
> version of gcc has done, but some of the old version is not.
> most of cross compilers have merged their code into gcc main tree, but 
> still left some (e.g. openrisc).
>   (at least, I can not build openrisc from gcc main tree correctly, but 
> most of others can)
> some of latest cross compilers still use old version of gcc, (still not 
> define __linux__ automatically).
> 

Maybe what I said is incorrect (it is just my current understanding).

Any other members' suggestions or corrections for discussion and
checking are welcome.

Thanks.

> 
> The related code in "uapi/linux/coda.h" is below:
> 
> ---code begin---
> 
>  73 #if defined(DJGPP) || defined(__CYGWIN32__)
>  74 #ifdef KERNEL
>  75 typedef unsigned long u_long;
>  76 typedef unsigned int u_int;
>  77 typedef unsigned short u_short;
>  78 typedef u_long ino_t;
>  79 typedef u_long dev_t;
>  80 typedef void * caddr_t;
>  81 #ifdef DOS
>  82 typedef unsigned __int64 u_quad_t;
>  83 #else
>  84 typedef unsigned long long u_quad_t;
>  85 #endif
>  86 
>  87 #define inline
>  88 
>  89 struct timespec {
>  90 long   ts_sec;
>  91 long   ts_nsec;
>  92 };
>  93 #else  /* DJGPP but not KERNEL */
>  94 #include 
>  95 typedef unsigned long long u_quad_t;
>  96 #endif /* !KERNEL */
>  97 #endif /* !DJGPP */
>  98 
>  99 
> 100 #if defined(__linux__)
> 101 #include 
> 102 #define cdev_t u_quad_t
> 103 #ifndef __KERNEL__
> 104 #if !defined(_UQUAD_T_) && (!defined(__GLIBC__) || __GLIBC__ < 2)
> 105 #define _UQUAD_T_ 1
> 106 typedef unsigned long long u_quad_t;
> 107 #endif
> 108 #endif /* __KERNEL__ */
> 109 #else
> 110 #define cdev_t dev_t
> 111 #endif
> 112 
> 
> ---code end-
> 
> 
> Thanks.
> 


-- 
Chen Gang


[PATCH v4 5/8] memcg: convert to use cgroup id

2013-07-30 Thread Li Zefan
Use cgroup id instead of css id. This is a preparation to kill css id.

Note: as memcg treats 0 as an invalid id while cgroup ids start at 0,
we define memcg_id == cgroup_id + 1.

Signed-off-by: Li Zefan 
Acked-by: Michal Hocko 
---
 mm/memcontrol.c | 34 --
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 626c426..62541f5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -512,6 +512,25 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
return (memcg == root_mem_cgroup);
 }
 
+static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+{
+   /*
+* The ID of the root cgroup is 0, but memcg treat 0 as an
+* invalid ID, so we return (cgroup_id + 1).
+*/
+   return memcg->css.cgroup->id + 1;
+}
+
+static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+   struct cgroup *cgrp;
+
+   cgrp = cgroup_from_id(&mem_cgroup_subsys, id - 1);
+   if (!cgrp)
+   return NULL;
+   return mem_cgroup_from_cont(cgrp);
+}
+
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
@@ -2821,15 +2840,10 @@ static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
  */
 static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 {
-   struct cgroup_subsys_state *css;
-
/* ID 0 is unused ID */
if (!id)
return NULL;
-   css = css_lookup(&mem_cgroup_subsys, id);
-   if (!css)
-   return NULL;
-   return mem_cgroup_from_css(css);
+   return mem_cgroup_from_id(id);
 }
 
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
@@ -4328,7 +4342,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
 * css_get() was called in uncharge().
 */
if (do_swap_account && swapout && memcg)
-   swap_cgroup_record(ent, css_id(&memcg->css));
+   swap_cgroup_record(ent, mem_cgroup_id(memcg));
 }
 #endif
 
@@ -4380,8 +4394,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 {
unsigned short old_id, new_id;
 
-   old_id = css_id(&from->css);
-   new_id = css_id(&to->css);
+   old_id = mem_cgroup_id(from);
+   new_id = mem_cgroup_id(to);
 
if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
mem_cgroup_swap_statistics(from, false);
@@ -6542,7 +6556,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
}
/* There is a swap entry and a page doesn't exist or isn't charged */
if (ent.val && !ret &&
-   css_id(&mc.from->css) == lookup_swap_cgroup_id(ent)) {
+   mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) {
ret = MC_TARGET_SWAP;
if (target)
target->ent = ent;
-- 
1.8.0.2


[PATCH v4 8/8] cgroup: kill css_id

2013-07-30 Thread Li Zefan
The only user of css_id was memcg, and it has been converted to use
cgroup->id, so kill css_id.

Signed-off-by: Li Zefan 
Reviewed-by: Michal Hocko 
---
 include/linux/cgroup.h |  37 
 kernel/cgroup.c| 252 +
 2 files changed, 1 insertion(+), 288 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4ef8452..c07e175 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -597,11 +597,6 @@ struct cgroup_subsys {
int subsys_id;
int disabled;
int early_init;
-   /*
-* True if this subsys uses ID. ID is not available before cgroup_init()
-* (not available in early_init time.)
-*/
-   bool use_id;
 
/*
 * If %false, this subsystem is properly hierarchical -
@@ -627,9 +622,6 @@ struct cgroup_subsys {
 */
struct cgroupfs_root *root;
struct list_head sibling;
-   /* used when use_id == true */
-   struct idr idr;
-   spinlock_t id_lock;
 
/* list of cftype_sets */
struct list_head cftsets;
@@ -874,35 +866,6 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
 int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
 
-/*
- * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
- * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
- * CSS ID is assigned at cgroup allocation (create) automatically
- * and removed when subsys calls free_css_id() function. This is because
- * the lifetime of cgroup_subsys_state is subsys's matter.
- *
- * Looking up and scanning function should be called under rcu_read_lock().
- * Taking cgroup_mutex is not necessary for following calls.
- * But the css returned by this routine can be "not populated yet" or "being
- * destroyed". The caller should check css and cgroup's status.
- */
-
-/*
- * Typically Called at ->destroy(), or somewhere the subsys frees
- * cgroup_subsys_state.
- */
-void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
-
-/* Find a cgroup_subsys_state which has given ID */
-
-struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
-
-/* Returns true if root is ancestor of cg */
-bool css_is_ancestor(struct cgroup_subsys_state *cg,
-const struct cgroup_subsys_state *root);
-
-/* Get id and depth of css */
-unsigned short css_id(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
 
 #else /* !CONFIG_CGROUPS */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a26e083..a64b7b8 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -123,38 +123,6 @@ struct cfent {
 };
 
 /*
- * CSS ID -- ID per subsys's Cgroup Subsys State(CSS). used only when
- * cgroup_subsys->use_id != 0.
- */
-#define CSS_ID_MAX (65535)
-struct css_id {
-   /*
-* The css to which this ID points. This pointer is set to valid value
-* after cgroup is populated. If cgroup is removed, this will be NULL.
-* This pointer is expected to be RCU-safe because destroy()
-* is called after synchronize_rcu(). But for safe use, css_tryget()
-* should be used for avoiding race.
-*/
-   struct cgroup_subsys_state __rcu *css;
-   /*
-* ID of this css.
-*/
-   unsigned short id;
-   /*
-* Depth in hierarchy which this ID belongs to.
-*/
-   unsigned short depth;
-   /*
-* ID is freed by RCU. (and lookup routine is RCU safe.)
-*/
-   struct rcu_head rcu_head;
-   /*
-* Hierarchy of CSS ID belongs to.
-*/
-   unsigned short stack[0]; /* Array of Length (depth+1) */
-};
-
-/*
  * cgroup_event represents events which userspace want to receive.
  */
 struct cgroup_event {
@@ -364,9 +332,6 @@ struct cgrp_cset_link {
 static struct css_set init_css_set;
 static struct cgrp_cset_link init_cgrp_cset_link;
 
-static int cgroup_init_idr(struct cgroup_subsys *ss,
-  struct cgroup_subsys_state *css);
-
 /* css_set_lock protects the list of css_set objects, and the
  * chain of tasks off each css_set.  Nests outside task->alloc_lock
  * due to cgroup_iter_start() */
@@ -815,9 +780,6 @@ static struct backing_dev_info cgroup_backing_dev_info = {
.capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
 
-static int alloc_css_id(struct cgroup_subsys *ss,
-   struct cgroup *parent, struct cgroup *child);
-
 static struct inode *cgroup_new_inode(umode_t mode, struct super_block *sb)
 {
struct inode *inode = new_inode(sb);
@@ -4160,20 +4122,6 @@ static int cgroup_populate_dir(struct cgroup *cgrp, unsigned long subsys_mask)
}
}
 
-   /* This cgroup is ready now */
-   for_each_root_subsys(cgrp->root, ss) {
-   

[PATCH v4 3/8] cgroup: implement cgroup_from_id()

2013-07-30 Thread Li Zefan
This will be used as a replacement for css_lookup().

There's a difference between cgroup ids and css ids: cgroup ids start
at 0, while css ids start at 1.

v4:
- also check if cgroup_mutex is held.
- make it an inline function.

Signed-off-by: Li Zefan 
Reviewed-by: Michal Hocko 
---
 include/linux/cgroup.h | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8c107e9..4ef8452 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -720,6 +720,24 @@ static inline struct cgroup* task_cgroup(struct task_struct *task,
return task_subsys_state(task, subsys_id)->cgroup;
 }
 
+/**
+ * cgroup_from_id - lookup cgroup by id
+ * @ss: cgroup subsys to be looked into
+ * @id: the cgroup id
+ *
+ * Returns the cgroup if there's valid one with @id, otherwise returns NULL.
+ * Should be called under rcu_read_lock().
+ */
+static inline struct cgroup *cgroup_from_id(struct cgroup_subsys *ss, int id)
+{
+#ifdef CONFIG_PROVE_RCU
+   rcu_lockdep_assert(rcu_read_lock_held() ||
+  lockdep_is_held(&cgroup_mutex),
+  "cgroup_from_id() needs proper protection");
+#endif
+   return idr_find(&ss->root->cgroup_idr, id);
+}
+
 struct cgroup *cgroup_next_sibling(struct cgroup *pos);
 
 /**
-- 
1.8.0.2


[PATCH v4 6/8] memcg: fail to create cgroup if the cgroup id is too big

2013-07-30 Thread Li Zefan
memcg requires the cgroup id to be smaller than 65536.

This is a preparation to kill css id.

Signed-off-by: Li Zefan 
Acked-by: Michal Hocko 
---
 mm/memcontrol.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 62541f5..ad7d7c9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -512,6 +512,12 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
return (memcg == root_mem_cgroup);
 }
 
+/*
+ * We restrict the id in the range of [1, 65535], so it can fit into
+ * an unsigned short.
+ */
+#define MEM_CGROUP_ID_MAX  USHRT_MAX
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
/*
@@ -6248,6 +6254,9 @@ mem_cgroup_css_alloc(struct cgroup *cont)
long error = -ENOMEM;
int node;
 
+   if (cont->id > MEM_CGROUP_ID_MAX)
+   return ERR_PTR(-ENOSPC);
+
memcg = mem_cgroup_alloc();
if (!memcg)
return ERR_PTR(error);
-- 
1.8.0.2


[PATCH v4 7/8] memcg: stop using css id

2013-07-30 Thread Li Zefan
Now memcg uses cgroup id instead of css id. Update some comments and
set mem_cgroup_subsys->use_id to 0.

Signed-off-by: Li Zefan 
Acked-by: Michal Hocko 
---
 mm/memcontrol.c | 23 ---
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ad7d7c9..6b77ab6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -608,16 +608,11 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * This will be the memcg's index in each cache's ->memcg_params->memcg_caches.
- * There are two main reasons for not using the css_id for this:
- *  1) this works better in sparse environments, where we have a lot of memcgs,
- * but only a few kmem-limited. Or also, if we have, for instance, 200
- * memcgs, and none but the 200th is kmem-limited, we'd have to have a
- * 200 entry array for that.
- *
- *  2) In order not to violate the cgroup API, we would like to do all memory
- * allocation in ->create(). At that point, we haven't yet allocated the
- * css_id. Having a separate index prevents us from messing with the cgroup
- * core for this
+ * The main reason for not using cgroup id for this:
+ *  this works better in sparse environments, where we have a lot of memcgs,
+ *  but only a few kmem-limited. Or also, if we have, for instance, 200
+ *  memcgs, and none but the 200th is kmem-limited, we'd have to have a
+ *  200 entry array for that.
  *
  * The current size of the caches array is stored in
  * memcg_limited_groups_array_size.  It will double each time we have to
@@ -632,14 +627,14 @@ int memcg_limited_groups_array_size;
  * cgroups is a reasonable guess. In the future, it could be a parameter or
  * tunable, but that is strictly not necessary.
  *
- * MAX_SIZE should be as large as the number of css_ids. Ideally, we could get
+ * MAX_SIZE should be as large as the number of cgrp_ids. Ideally, we could get
  * this constant directly from cgroup, but it is understandable that this is
  * better kept as an internal representation in cgroup.c. In any case, the
- * css_id space is not getting any smaller, and we don't have to necessarily
+ * cgrp_id space is not getting any smaller, and we don't have to necessarily
  * increase ours as well if it increases.
  */
 #define MEMCG_CACHES_MIN_SIZE 4
-#define MEMCG_CACHES_MAX_SIZE 65535
+#define MEMCG_CACHES_MAX_SIZE MEM_CGROUP_ID_MAX
 
 /*
  * A lot of the calls to the cache allocation functions are expected to be
@@ -6188,7 +6183,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
size_t size = memcg_size();
 
mem_cgroup_remove_from_trees(memcg);
-   free_css_id(&mem_cgroup_subsys, &memcg->css);
 
for_each_node(node)
free_mem_cgroup_per_zone_info(memcg, node);
@@ -6985,7 +6979,6 @@ struct cgroup_subsys mem_cgroup_subsys = {
.bind = mem_cgroup_bind,
.base_cftypes = mem_cgroup_files,
.early_init = 0,
-   .use_id = 1,
 };
 
 #ifdef CONFIG_MEMCG_SWAP
-- 
1.8.0.2



[PATCH v4 2/8] cgroup: document how cgroup IDs are assigned

2013-07-30 Thread Li Zefan
As cgroup id has been used in netprio cgroup and will be used in memcg,
it's important to make it clear how a cgroup id is allocated.

For example, in netprio cgroup, the id is used as the index of an array.

Signed-off-by: Li Zefan 
Reviewed-by: Michal Hocko 
---
 include/linux/cgroup.h | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2bd052d..8c107e9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -161,7 +161,13 @@ struct cgroup_name {
 struct cgroup {
unsigned long flags;/* "unsigned long" so bitops work */
 
-   int id; /* idr allocated in-hierarchy ID */
+   /*
+* idr allocated in-hierarchy ID.
+*
+* The ID of the root cgroup is always 0, and a new cgroup
+* will be assigned with a smallest available ID.
+*/
+   int id;
 
/*
 * We link our 'sibling' struct into our parent's 'children'.
-- 
1.8.0.2


[PATCH v4 4/8] memcg: convert to use cgroup_is_descendant()

2013-07-30 Thread Li Zefan
This is a preparation to kill css_id.

Signed-off-by: Li Zefan 
Acked-by: Michal Hocko 
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d12ca6f..626c426 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1434,7 +1434,7 @@ bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
return true;
if (!root_memcg->use_hierarchy || !memcg)
return false;
-   return css_is_ancestor(&memcg->css, &root_memcg->css);
+   return cgroup_is_descendant(memcg->css.cgroup, root_memcg->css.cgroup);
 }
 
 static bool mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
-- 
1.8.0.2



[PATCH v4 1/8] cgroup: convert cgroup_ida to cgroup_idr

2013-07-30 Thread Li Zefan
This enables us to lookup a cgroup by its id.

v4:
- add a comment for idr_remove() in cgroup_offline_fn().

v3:
- on success, idr_alloc() returns the new id rather than 0, so fix the BUG_ON()
  in cgroup_init().
- pass the right value to idr_alloc() so that the id for dummy cgroup is 0.

Signed-off-by: Li Zefan 
Reviewed-by: Michal Hocko 
---
 include/linux/cgroup.h |  4 ++--
 kernel/cgroup.c| 33 +++--
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 297462b..2bd052d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -161,7 +161,7 @@ struct cgroup_name {
 struct cgroup {
unsigned long flags;/* "unsigned long" so bitops work */
 
-   int id; /* ida allocated in-hierarchy ID */
+   int id; /* idr allocated in-hierarchy ID */
 
/*
 * We link our 'sibling' struct into our parent's 'children'.
@@ -322,7 +322,7 @@ struct cgroupfs_root {
unsigned long flags;
 
/* IDs for cgroups in this hierarchy */
-   struct ida cgroup_ida;
+   struct idr cgroup_idr;
 
/* The path to use for release notifications. */
char release_agent_path[PATH_MAX];
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 345fac8..a26e083 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -866,8 +866,6 @@ static void cgroup_free_fn(struct work_struct *work)
 */
dput(cgrp->parent->dentry);
 
-   ida_simple_remove(&cgrp->root->cgroup_ida, cgrp->id);
-
/*
 * Drop the active superblock reference that we took when we
 * created the cgroup. This will free cgrp->root, if we are
@@ -1379,6 +1377,7 @@ static void init_cgroup_root(struct cgroupfs_root *root)
cgrp->root = root;
RCU_INIT_POINTER(cgrp->name, &root_cgroup_name);
init_cgroup_housekeeping(cgrp);
+   idr_init(&root->cgroup_idr);
 }
 
 static int cgroup_init_root_id(struct cgroupfs_root *root, int start, int end)
@@ -1451,7 +1450,6 @@ static struct cgroupfs_root *cgroup_root_from_opts(struct cgroup_sb_opts *opts)
 */
root->subsys_mask = opts->subsys_mask;
root->flags = opts->flags;
-   ida_init(&root->cgroup_ida);
if (opts->release_agent)
strcpy(root->release_agent_path, opts->release_agent);
if (opts->name)
@@ -1467,7 +1465,7 @@ static void cgroup_free_root(struct cgroupfs_root *root)
/* hierarhcy ID shoulid already have been released */
WARN_ON_ONCE(root->hierarchy_id);
 
-   ida_destroy(&root->cgroup_ida);
+   idr_destroy(&root->cgroup_idr);
kfree(root);
}
 }
@@ -1582,6 +1580,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
mutex_lock(&cgroup_mutex);
mutex_lock(&cgroup_root_mutex);
 
+   root_cgrp->id = idr_alloc(&root->cgroup_idr, root_cgrp,
+  0, 1, GFP_KERNEL);
+   if (root_cgrp->id < 0)
+   goto unlock_drop;
+
/* Check for name clashes with existing mounts */
ret = -EBUSY;
if (strlen(root->name))
@@ -4273,7 +4276,11 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
goto err_free_cgrp;
rcu_assign_pointer(cgrp->name, name);
 
-   cgrp->id = ida_simple_get(&root->cgroup_ida, 1, 0, GFP_KERNEL);
+   /*
+* Temporarily set the pointer to NULL, so idr_find() won't return
+* a half-baked cgroup.
+*/
+   cgrp->id = idr_alloc(&root->cgroup_idr, NULL, 1, 0, GFP_KERNEL);
if (cgrp->id < 0)
goto err_free_name;
 
@@ -4371,6 +4378,8 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
}
}
 
+   idr_replace(&root->cgroup_idr, cgrp, cgrp->id);
+
err = cgroup_addrm_files(cgrp, NULL, cgroup_base_files, true);
if (err)
goto err_destroy;
@@ -4397,7 +4406,7 @@ err_free_all:
/* Release the reference count that we took on the superblock */
deactivate_super(sb);
 err_free_id:
-   ida_simple_remove(&root->cgroup_ida, cgrp->id);
+   idr_remove(&root->cgroup_idr, cgrp->id);
 err_free_name:
kfree(rcu_dereference_raw(cgrp->name));
 err_free_cgrp:
@@ -4590,6 +4599,14 @@ static void cgroup_offline_fn(struct work_struct *work)
/* delete this cgroup from parent->children */
list_del_rcu(&cgrp->sibling);
 
+   /*
+* We should remove the cgroup object from idr before its grace
+* period starts, so we won't be looking up a cgroup while the
+* cgroup is being freed.
+*/
+   idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
+   cgrp->id = -1;
+
dput(d);
 
set_bit(CGRP_RELEASABLE, >flags);
@@ -4915,6 +4932,10 @@ int __init cgroup_init(void)
 
BUG_ON(cgroup_init_root_id(_dummy_root, 

[PATCH v4 0/8] memcg, cgroup: kill css_id

2013-07-30 Thread Li Zefan
This patchset converts memcg to use cgroup->id, and then we can remove
cgroup css_id.

As we've removed memcg's own refcnt, converting memcg to use cgroup->id
is very straight-forward.

The patchset is based on Tejun's cgroup tree.

v4:
- make cgroup_from_id() inline and check if cgroup_mutex is held.
- add a comment for idr_remove() in cgroup_offline_fn().

v2->v3:
- some minor cleanups suggested by Michal.
- fixed the call to idr_alloc() in cgroup_init() in the first patch.

Li Zefan (8):
  cgroup: convert cgroup_ida to cgroup_idr
  cgroup: document how cgroup IDs are assigned
  cgroup: implement cgroup_from_id()
  memcg: convert to use cgroup_is_descendant()
  memcg: convert to use cgroup id
  memcg: fail to create cgroup if the cgroup id is too big
  memcg: stop using css id
  cgroup: kill css_id
--
 include/linux/cgroup.h |  65 ++
 kernel/cgroup.c| 285 ++-
 mm/memcontrol.c|  68 --
 3 files changed, 96 insertions(+), 322 deletions(-)


Re: [PATCH trivial] include/linux/coda.h: remove useless '#else'

2013-07-30 Thread Chen Gang
On 07/30/2013 08:29 PM, Joe Perches wrote:
> On Tue, 2013-07-30 at 15:30 +0800, Chen Gang wrote:
>> '#else' is useless, need remove.
>>
>> Signed-off-by: Chen Gang 
>> ---
>>  include/linux/coda.h |1 -
>>  1 files changed, 0 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/coda.h b/include/linux/coda.h
>> index cff544f..d30209b 100644
>> --- a/include/linux/coda.h
>> +++ b/include/linux/coda.h
>> @@ -60,7 +60,6 @@ Mellon the rights to redistribute these changes without 
>> encumbrance.
>>  
>>  #if defined(__linux__)
>>  typedef unsigned long long u_quad_t;
>> -#else
>>  #endif
>>  #include 
>>  #endif 
> 
> Why have the #if at all?
> 
> 
> 

Hmm... some old version compilers do not define __linux__ automatically
(e.g. or32-linux-gcc 4.5.1-or32-1.0rc1, which is the latest cross
compiler for openrisc, though).

If __linux__ is not defined, the compiler will report an error
(u_quad_t is not defined).

When we remove "#if defined(__linux__)" from "include/linux/coda.h",
the compiler will not report an error, and 'cdev_t' will be defined as
'dev_t', not 'u_quad_t' (I guess it is an implicit bug).


In "uapi/include/*", only "coda.h" has "#if defined(__linux__)"; maybe
they want multiple OSes to be able to share the same "coda.h" file (at
least, we can not say it is a bad idea).

Nor is it suitable to define __linux__ in "include/linux/coda.h" itself.


All together, I think:

  the direct cause:
"uapi/include/coda.h" wants to be shareable across multiple OSes (so it
needs to check __linux__).
  (in "uapi/include/*", only coda.h needs to check __linux__)

  the root cause:
the compiler for Linux should define __linux__ automatically; the latest
versions of gcc do, but some old versions do not.
most cross compilers have merged their code into the gcc main tree, but
some are still left out (e.g. openrisc).
  (at least, I can not build openrisc from the gcc main tree correctly,
but most others can)
some of the latest cross compilers still use an old version of gcc (which
still does not define __linux__ automatically).


The related code in "uapi/linux/coda.h" is below:

---code begin---

 73 #if defined(DJGPP) || defined(__CYGWIN32__)
 74 #ifdef KERNEL
 75 typedef unsigned long u_long;
 76 typedef unsigned int u_int;
 77 typedef unsigned short u_short;
 78 typedef u_long ino_t;
 79 typedef u_long dev_t;
 80 typedef void * caddr_t;
 81 #ifdef DOS
 82 typedef unsigned __int64 u_quad_t;
 83 #else
 84 typedef unsigned long long u_quad_t;
 85 #endif
 86 
 87 #define inline
 88 
 89 struct timespec {
 90 long   ts_sec;
 91 long   ts_nsec;
 92 };
 93 #else  /* DJGPP but not KERNEL */
 94 #include 
 95 typedef unsigned long long u_quad_t;
 96 #endif /* !KERNEL */
 97 #endif /* !DJGPP */
 98 
 99 
100 #if defined(__linux__)
101 #include 
102 #define cdev_t u_quad_t
103 #ifndef __KERNEL__
104 #if !defined(_UQUAD_T_) && (!defined(__GLIBC__) || __GLIBC__ < 2)
105 #define _UQUAD_T_ 1
106 typedef unsigned long long u_quad_t;
107 #endif
108 #endif /* __KERNEL__ */
109 #else
110 #define cdev_t dev_t
111 #endif
112 

---code end-


Thanks.
-- 
Chen Gang


Re: [ANNOUNCE] ARM kernel summit 2013, Oct 22-23

2013-07-30 Thread Barry Song
2013/7/30 Olof Johansson :
> As in previous years, we are planning on hosting an ARM Kernel Summit
> attached to the main Kernel Summit, and we have graciously been given
> space to host it for two days, Oct 22-23 (Oct 23 is the track day for
> the main KS, so we're getting an extra day) in Edinburgh, UK.
>
> To make sure we get the right set of people invited, we're looking for
> people to propose topics they would like to see covered as part of the
> two-day agenda, as well as general invite requests. To request an
> invite or to propose a topic, please send an email to
> ksummit-2013-disc...@lists.linuxfoundation.org, and cc LAKML/LKML as
> appropriate. Please use the tag "[ARM ATTEND]" in the subject to
> distinguish these from the general KS ATTEND threads.
>
> It's pretty obvious that device trees and bindings are hot topics this
> year, with plenty of overlap between ARM and regular kernel summit and
> other subsystems. We are tentatively planning on dedicating Wednesday
> afternoon to the topic, to fit the regular agenda of KS track day to
> make sure as many interested parties as possible can attend. We'd
> still be interested in hearing specific topic requests in that area
> though.

It seems IKS and HMP for big.LITTLE are becoming hot topics. There is a
prototype now, and we need to push it into practice.
Also, as a user of both mainline and Google's Android kernel: even though
many Android patches are in mainline now, we still need to maintain both.
It seems Google likes to rebase its patch series against each new kernel
version instead of merging every time. So what about talking about the
Android mainlining project more?

>
>
> The group of us organizing and coordinating this year is:
>
> Arnd Bergmann
> Mark Brown
> Kevin Hilman
> Olof Johansson
> Grant Likely

-barry


Re: [patch]workqueue: copy attr with all fields.

2013-07-30 Thread Shaohua Li
On Tue, Jul 30, 2013 at 10:14:19AM -0400, Tejun Heo wrote:
> Hello, Shaohua.
> 
> On Tue, Jul 30, 2013 at 01:49:55PM +0800, Shaohua Li wrote:
> > > Note that copy_workqueue_attrs() is also used while creating worker
> > > pools and this change would mean that there will be worker pools with
> > > no_numa attribute assigned basically randomly depending on which wq
> > > created the pool for the first time which should be fine AFAICS but
> > > isn't very nice.  Given that there's only single place where a pool's
> > > attrs is set, maybe it'd be best to clear no_numa right after copy and
> > > explain what's going on there?
> > 
> > alloc_workqueue_attrs() always zero the new allocated attr.
> 
> Hmmm?  Yeah, sure, that's how it currently is always zero for all
> pools.  If you change the copy function, it'll now be set to whatever
> the input value was when a pool was created, which shouldn't break
> anything but is a bit confusing as no_numa doesn't have anything to do
> with pools directly.  Just in case, I'm talking about
> copy_workqueue_attrs() call in get_unbound_pool().

Hmm, I don't agree that changing copy_workqueue_attrs() is more confusing;
the name of the function suggests it is a 'copy'. And clearing no_numa in
apply_workqueue_attrs() after copy_workqueue_attrs() looks like a hack to me.
But it's up to you; feel free to fix it yourself.

Thanks,
Shaohua


Re: [PATCH 4/4] ARM: sun6i: Enable clock support in the DTSI

2013-07-30 Thread Emilio López
Hi Maxime,

Overall this looks good to me, but I have some small comments:

On 30/07/13 11:44, Maxime Ripard wrote:
> Now that the clock driver has support for the A31 clocks, we can add
> them to the DTSI and start using them in the relevant hardware blocks.
> 
> Signed-off-by: Maxime Ripard 
> ---
>  arch/arm/boot/dts/sun6i-a31.dtsi | 137 
> ---
>  1 file changed, 127 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm/boot/dts/sun6i-a31.dtsi 
> b/arch/arm/boot/dts/sun6i-a31.dtsi
> index 0d13dc6..c6a3a91 100644
> --- a/arch/arm/boot/dts/sun6i-a31.dtsi
> +++ b/arch/arm/boot/dts/sun6i-a31.dtsi
> @@ -51,13 +51,130 @@
>  
>   clocks {
>   #address-cells = <1>;
> - #size-cells = <0>;
> + #size-cells = <1>;
> + ranges;
>  
> - osc: oscillator {
> + osc24M: hosc {

Please use osc24M and osc32k instead of hosc and losc, respectively.

>   #clock-cells = <0>;
>   compatible = "fixed-clock";
>   clock-frequency = <2400>;

Is osc24M not gatable on A31?

>   };
> +
> + osc32k: losc {
> + #clock-cells = <0>;
> + compatible = "fixed-clock";
> + clock-frequency = <32768>;
> + };
> +
> + pll1: pll1@01c20000 {
> + #clock-cells = <0>;
> + compatible = "allwinner,sun6i-pll1-clk";
> + reg = <0x01c20000 0x4>;
> + clocks = <>;
> + };
> +
> + /*
> +  * This is a dummy clock, to be used as placeholder on
> +  * other mux clocks when a specific parent clock is not
> +  * yet implemented. It should be dropped when the driver
> +  * is complete.
> +  */
> + pll6: pll6 {
> + #clock-cells = <0>;
> + compatible = "fixed-clock";
> + clock-frequency = <0>;
> + };
> +
> + cpu: cpu@01c20050 {
> + #clock-cells = <0>;
> + compatible = "allwinner,sun4i-cpu-clk";
> + reg = <0x01c20050 0x4>;
> + clocks = <>, <>, <>, <>;

Listing pll1 twice doesn't sound correct, but vendor code seems to
indicate so. A comment to clarify it's not a typo would be good I think.

> + };
> +
> + axi: axi@01c20050 {
> + #clock-cells = <0>;
> + compatible = "allwinner,sun4i-axi-clk";
> + reg = <0x01c20050 0x4>;
> + clocks = <>;
> + };
> +
> + ahb1_mux: ahb1_mux@01c20054 {
> + #clock-cells = <0>;
> + compatible = "allwinner,sun6i-ahb1-mux-clk";
> + reg = <0x01c20054 0x4>;
> + clocks = <>, <>, <>, <>;
> + };
> +
> + ahb1: ahb1@01c20054 {
> + #clock-cells = <0>;
> + compatible = "allwinner,sun4i-ahb-clk";
> + reg = <0x01c20054 0x4>;
> + clocks = <&ahb1_mux>;
> + };

Depending on when this lands, I believe these two above could be merged
into one with the refactoring introduced on my patchset.

> +
> + ahb1_gates: ahb1_gates@01c20060 {
> + #clock-cells = <1>;
> + compatible = "allwinner,sun6i-a31-ahb1-gates-clk";
> + reg = <0x01c20060 0x8>;
> + clocks = <>;
> + clock-output-names = "ahb1_mipidsi", "ahb1_ss",
> + "ahb1_dma", "ahb1_mmc0", "ahb1_mmc1",
> + "ahb1_mmc2", "ahb1_mmc3", "ahb1_nand1",
> + "ahb1_nand0", "ahb1_sdram",
> + "ahb1_gmac", "ahb1_ts", "ahb1_hstimer",
> + "ahb1_spi0", "ahb1_spi1", "ahb1_spi2",
> + "ahb1_spi3", "ahb1_otg", "ahb1_ehci0",
> + "ahb1_ehci1", "ahb1_ohci0",
> + "ahb1_ohci1", "ahb1_ohci2", "ahb1_ve",
> + "ahb1_lcd0", "ahb1_lcd1", "ahb1_csi",
> + "ahb1_hdmi", "ahb1_de0", "ahb1_de1",
> + "ahb1_fe0", "ahb1_fe1", "ahb1_mp",
> + "ahb1_gpu", "ahb1_deu0", "ahb1_deu1",
> + "ahb1_drc0", "ahb1_drc1";
> + };
> +
> + apb1: apb1@01c20054 {
> + #clock-cells = <0>;
> + compatible = "allwinner,sun4i-apb0-clk";
> + reg = <0x01c20054 0x4>;
> + clocks = <>;
> 

Re: [REGRESSION/PATCH] acpi: blacklist win8 OSI for ASUS Zenbok Prime UX31A

2013-07-30 Thread Aaron Lu
On 07/31/2013 08:11 AM, Felipe Contreras wrote:
> On Tue, Jul 30, 2013 at 6:13 PM, Rafael J. Wysocki  wrote:
> 
>>> If 0 turns the screen off with the intel driver, 0 should turn the
>>> screen off with the ACPI driver, having inconsistent behavior
>>> depending on which driver is used is a bug.
>>
>> The ACPI driver simply exposes and interface to interact with the AML methods
>> in the BIOS directly.
> 
> No, the ACPI driver is exposing a backlight interface, which has a
> defined stable API.
> 
> Documentation/ABI/stable/sysfs-class-backlight
> 
> Yes, the interface doesn't define what should happen at 0, that is a
> bug in the interface definition.
> 
> *How* it achieves that is an implementation detail.
> 
>> Yes, this is a mistake and shouldn't be designed this way.
>>
>> However, incidentally, this makes backlight control work on your machine.
>>
>> Anyway, we need all backlight drivers to work consistently and don't tempt me
>> to rip the ACPI driver entirely from the kernel for what it's worth.
> 
> Yes, they should work consistently, and go ahead, rip the ACPI driver,
> *then* you'll see many more people complaining about the Linux kernel
> breaking user-space, which should never happen. Mistakes happen, but
> if you do this willingly and knowingly, I think there would be
> repercussions for you.
> 
>> Yes, that will break backlight on your system and *then* you can complain to
>> Linus if you wish.
> 
> It is already broken in v3.11-rc3, in fact I just booted that to try
> it out and it booted with the screen completely black (fortunately I
> knew exactly what to type to change that).

That is bad, can you please file a bug for that? I'll need to take a
look at your ACPI tables, thanks.

> 
> Apparently this commit also needs to be reverted: efaa14c (ACPI /
> video: no automatic brightness changes by win8-compatible firmware).
> In this machine it makes the backlight work again (without
> acpi_osi="!Windows 2012"), but by doing so the ACPI driver also turns
> off the screen completely at level 0. Also, each time I change the

So with rc3 and commit efaa14c reverted, when you set level 0 via the ACPI
video backlight interface, the screen now turns off? And this was not the
case in 3.10, right?

> backlight level from X, the screen blinks as if going 100%, 0%, and
> then the desired level.

Please attach acpidump output to the to be opened bugzilla page, thanks.

-Aaron

> 
> For this particular machine simply applying the attached patch would
> solve all those regressions, but who knows in other machines, I think
> it's safer to revert efaa14c.
> 



Re: linux-next: build failure after merge of the ext4 tree

2013-07-30 Thread Stephen Rothwell
Hi Ted,

On Mon, 29 Jul 2013 11:08:16 +1000 Stephen Rothwell  
wrote:
>
> After merging the ext4 tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
> 
> fs/ext4/ialloc.c: In function '__ext4_new_inode':
> fs/ext4/ialloc.c:817:1: warning: label 'next_ino' defined but not used 
> [-Wunused-label]
>  next_ino:
>  ^
> fs/ext4/ialloc.c:792:4: error: label 'next_inode' used but not defined
> goto next_inode;
> ^
> 
> Hmm ...
> 
> Caused by commit 4a8603ef197a ("ext4: avoid reusing recently deleted
> inodes in no journal mode").
> 
> I have used the ext4 tree from next-20130726 for today.

I am still getting this ...
-- 
Cheers,
Stephen Rothwell  s...@canb.auug.org.au




Re: [patch 1/3] raid5: offload stripe handle to workqueue

2013-07-30 Thread Shaohua Li
On Tue, Jul 30, 2013 at 09:57:51AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 30, 2013 at 09:07:08PM +0800, Shaohua Li wrote:
> > Ok, I should explain here. I can't add a work_struct for each stripe, 
> > because
> > this will stress workqueue very hard. My system handles > 1M/s stripes, 
> > which
> > makes workqueue pool lock contended very hard.
> 
> It doesn't have to be embedding work_struct in each stripe and
> schduling them altogether.  It's more about scheduling "work units"
> rather than "workers" - ie. letting each scheduled work item handle
> single work unit rather than making it dispatch multiple work items.
> It may make controlling concurrency a bit more interesting but you can
> always do it with workqueue_set_max_active(), which is the intended
> usage anyway.

The stripe actually is the work unit. As I said, if I queue a work item for
each stripe, queue_work() alone will hammer the system because of
pwq->pool->lock contention. Dispatching one work item per stripe has another
side effect: I can't add a block plug. Since the contention is at the
queueing stage, workqueue_set_max_active() doesn't help.

Thanks,
Shaohua


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-30 Thread Takao Indoh
(2013/07/31 0:59), Bjorn Helgaas wrote:
> On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
>  wrote:
>> (2013/07/29 23:17), Bjorn Helgaas wrote:
>>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh  
>>> wrote:
 (2013/07/26 2:00), Bjorn Helgaas wrote:
> 
> My point about IOMMU and PCI initialization order doesn't go away just
> because it doesn't fit "kdump policy."  Having system initialization
> occur in a logical order is far more important than making kdump work.

 My next plan is as follows. I think this is matched to logical order
 on boot.

 drivers/pci/pci.c
 - Add function to reset bus, for example, pci_reset_bus(struct pci_bus 
 *bus)

 drivers/iommu/intel-iommu.c
 - On initialization, if IOMMU is already enabled, call this bus reset
 function before disabling and re-enabling IOMMU.
>>>
>>> I raised this issue because of arches like sparc that enumerate the
>>> IOMMU before the PCI devices that use it.  In that situation, I think
>>> you're proposing this:
>>>
>>> panic kernel
>>>   enable IOMMU
>>>   panic
>>> kdump kernel
>>>   initialize IOMMU (already enabled)
>>> pci_reset_bus
>>> disable IOMMU
>>> enable IOMMU
>>>   enumerate PCI devices
>>>
>>> But the problem is that when you call pci_reset_bus(), you haven't
>>> enumerated the PCI devices, so you don't know what to reset.
>>
>> Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
>> initialization is based on the assumption that enumeration of PCI devices
>> is already done. We can find target device from IOMMU page table instead
>> of scanning all devices in pci tree.
>>
>> Therefore, this idea is only for intel-iommu. Other architectures need
>> to implement their own reset code.
> 
> That's my point.  I'm opposed to adding code to PCI when it only
> benefits x86 and we know other arches will need a fundamentally
> different design.  I would rather have a design that can work for all
> arches.
> 
> If your implementation is totally implemented under arch/x86 (or in
> intel-iommu.c, I guess), I can't object as much.  However, I think
> that eventually even x86 should enumerate the IOMMUs via ACPI before
> we enumerate PCI devices.
> 
> It's pretty clear that's how BIOS designers expect the OS to work.
> For example, sec 8.7.3 of the Intel Virtualization Technology for
> Directed I/O spec, rev 1.3, shows the expectation that remapping
> hardware (IOMMU) is initialized before discovering the I/O hierarchy
> below a hot-added host bridge.  Obviously you're not talking about a
> hot-add scenario, but I think the same sequence should apply at
> boot-time as well.

Of course I won't add something just for x86 into the common PCI layer. I
am attaching my new patch, though it is not well tested yet.

On x86, IOMMU initialization currently runs *after* PCI enumeration, but
what you are talking about is changing that so x86 IOMMU initialization is
done *before* PCI enumeration, like sparc, right?

Hmm, ok, I think I need to post the attached patch to the iommu list and
discuss it there, including the current order of x86 IOMMU initialization.

Thanks,
Takao Indoh
---
 drivers/iommu/intel-iommu.c |   55 +-
 drivers/pci/pci.c   |   53 
 include/linux/pci.h |1
 3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index eec0d3e..fb8a546 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3663,6 +3663,56 @@ static struct notifier_block device_nb = {
.notifier_call = device_notifier,
 };
 
+/* Reset PCI device if its entry exists in DMAR table */
+static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 segment)
+{
+   u64 addr;
+   struct root_entry *root;
+   struct context_entry *context;
+   int bus, devfn;
+   struct pci_dev *dev;
+
+   addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
+   if (!addr)
+   return;
+
+   /*
+*  In the case of kdump, ioremap is needed because root-entry table
+*  exists in first kernel's memory area which is not mapped in second
+*  kernel
+*/
+   root = (struct root_entry*)ioremap(addr, PAGE_SIZE);
+   if (!root)
+   return;
+
+   for (bus = 0; bus < ROOT_ENTRY_NR; bus++) {
+   if (!root_present(&root[bus]))
+   continue;
+
+   context = (struct context_entry *)
+   ioremap(get_context_addr_from_root(&root[bus]), PAGE_SIZE);
+   if (!context)
+   continue;
+
+   for (devfn = 0; devfn < 256; devfn++) {
+   if (!context_present(&context[devfn]))
+   continue;
+
+   dev = pci_get_domain_bus_and_slot(segment, bus, devfn);
+   if (!dev)
+   continue;
+
+   /* Try bus reset first */
+   if (!pci_reset_bus(dev->bus)) /* go to next bus */
+   break;
+   else /* Try per-function reset */
+   pci_reset_function(dev);
+
+   }
+   iounmap(context);
+   }
+   iounmap(root);
+}
+
 int __init intel_iommu_init(void)
 {
int ret = 0;
@@ -3687,8 +3737,11 @@ int __init intel_iommu_init(void)
continue;
 
iommu = drhd->iommu;
-   if (iommu->gcmd & DMA_GCMD_TE)
+   if (iommu->gcmd & DMA_GCMD_TE) {
+

Re: [PATCH 3/4] clk: sunxi: Add A31 clocks support

2013-07-30 Thread Emilio López
Hi Maxime,

On 30/07/13 11:44, Maxime Ripard wrote:
> The A31 has a mostly different clock set compared to the other older
> SoCs currently supported in the Allwinner clock driver.
> 
> Add support for the basic useful clocks. The other ones will come in
> eventually.
> 
> Signed-off-by: Maxime Ripard 
> ---
>  drivers/clk/sunxi/clk-sunxi.c | 120 
> ++
>  1 file changed, 120 insertions(+)
> 
> diff --git a/drivers/clk/sunxi/clk-sunxi.c b/drivers/clk/sunxi/clk-sunxi.c
> index 6e9cbc9..8cd69e6 100644
> --- a/drivers/clk/sunxi/clk-sunxi.c
> +++ b/drivers/clk/sunxi/clk-sunxi.c
> @@ -124,7 +124,67 @@ static void sun4i_get_pll1_factors(u32 *freq, u32 
> parent_rate,
>   *n = div / 4;
>  }
>  
> +/**
> + * sun6i_get_pll1_factors() - calculates n, k and m factors for PLL1
> + * PLL1 rate is calculated as follows
> + * rate = parent_rate * (n + 1) * (k + 1) / (m + 1);
> + * parent_rate should always be 24MHz
> + */
> +static void sun6i_get_pll1_factors(u32 *freq, u32 parent_rate,
> +u8 *n, u8 *k, u8 *m, u8 *p)
> +{
> + /*
> +  * We can operate only on MHz, this will make our life easier
> +  * later.
> +  */
> + u32 freq_mhz = *freq / 1000000;
> + u32 parent_freq_mhz = parent_rate / 1000000;
> +
> + /* If the frequency is a multiple of 32 MHz, k is always 3 */
> + if (!(freq_mhz % 32))
> + *k = 3;
> + /* If the frequency is a multiple of 9 MHz, k is always 2 */
> + else if (!(freq_mhz % 9))
> + *k = 2;
> + /* If the frequency is a multiple of 8 MHz, k is always 1 */
> + else if (!(freq_mhz % 8))
> + *k = 1;
> + /* Otherwise, we don't use the k factor */
> + else
> + *k = 0;
>  
> + /*
> +  * If the frequency is a multiple of 2 but not a multiple of
> +  * 3, m is 3. This is the first time we use 6 here, yet we
> +  * will use it on several other places.
> +  * We use this number because it's the lowest frequency we can
> +  * generate (with n = 0, k = 0, m = 3), so every other frequency
> +  * somehow relates to this frequency.
> +  */
> + if ((freq_mhz % 6) == 2 || (freq_mhz % 6) == 4)
> + *m = 2;
> + /*
> +  * If the frequency is a multiple of 6MHz, but the factor is
> +  * odd, m will be 3
> +  */
> + else if ((freq_mhz / 6) & 1)
> + *m = 3;
> + /* Otherwise, we end up with m = 1 */
> + else
> + *m = 1;
> +
> + /* Calculate n thanks to the above factors we already got */
> + *n = freq_mhz * (*m + 1) / ((*k + 1) * parent_freq_mhz) - 1;
> +
> + /*
> +  * If n end up being outbound, and that we can still decrease
> +  * m, do it.
> +  */
> + if ((*n + 1) > 31 && (*m + 1) > 1) {
> + *n = (*n + 1) / 2 - 1;
> + *m = (*m + 1) / 2 - 1;
> + }
> +}

Nice! My rate-to-factors functions pale in comparison :) Remember that
n/k/m/p may be NULL when the caller just wants you to round freq to an
achievable value (see clk_factors_round_rate(...) in clk-factors.c).

>  /**
>   * sun4i_get_apb1_factors() - calculates m, p factors for APB1
> @@ -189,6 +249,15 @@ static struct clk_factors_config sun4i_pll1_config = {
>   .pwidth = 2,
>  };
>  
> +static struct clk_factors_config sun6i_pll1_config = {
> + .nshift = 8,
> + .nwidth = 5,
> + .kshift = 4,
> + .kwidth = 2,
> + .mshift = 0,
> + .mwidth = 2,
> +};
> +
>  static struct clk_factors_config sun4i_apb1_config = {
>   .mshift = 0,
>   .mwidth = 5,
> @@ -201,6 +270,11 @@ static const __initconst struct factors_data 
> sun4i_pll1_data = {
>   .getter = sun4i_get_pll1_factors,
>  };
>  
> +static const __initconst struct factors_data sun6i_pll1_data = {
> + .table = &sun6i_pll1_config,
> + .getter = sun6i_get_pll1_factors,
> +};
> +
>  static const __initconst struct factors_data sun4i_apb1_data = {
>   .table = &sun4i_apb1_config,
>   .getter = sun4i_get_apb1_factors,
> @@ -243,6 +317,10 @@ static const __initconst struct mux_data 
> sun4i_cpu_mux_data = {
>   .shift = 16,
>  };
>  
> +static const __initconst struct mux_data sun6i_ahb1_mux_data = {
> + .shift = 12,
> +};
> +
>  static const __initconst struct mux_data sun4i_apb1_mux_data = {
>   .shift = 24,
>  };
> @@ -301,6 +379,12 @@ static const __initconst struct div_data sun4i_apb0_data 
> = {
>   .width  = 2,
>  };
>  
> +static const __initconst struct div_data sun6i_apb2_div_data = {
> + .shift  = 0,
> + .pow= 0,
> + .width  = 4,
> +};
> +
>  static void __init sunxi_divider_clk_setup(struct device_node *node,
>  struct div_data *data)
>  {
> @@ -347,6 +431,10 @@ static const __initconst struct gates_data 
> sun5i_a13_ahb_gates_data = {
>   .mask = {0x107067e7, 0x185111},
>  };
>  
> +static const __initconst struct gates_data sun6i_a31_ahb1_gates_data = {
> + .mask = 

Re: [PATCH 01/17] perf util: Save pid-cmdline mapping into tracing header

2013-07-30 Thread Namhyung Kim

Hi Arnaldo,

On 2013-07-30 10:28 PM, Arnaldo Carvalho de Melo wrote:

> On Tue, Jul 30, 2013 at 06:18:58PM +0900, Namhyung Kim wrote:
>> From: Namhyung Kim 
>>
>> Current trace info data lacks the saved cmdline mapping which is
>> needed for pevent to find out the comm of a task.  Add this and bump
>> up the version number so that perf can determine its presence when
>> reading.
>
> Can't we get this from the PERF_RECORD_COMM synthesized at the beginning
> of the session + the kernel generated ones for new threads?
>
> We would just call machine__find_thread() and use thread->comm, etc, no?


Yes, I think it's doable.  However, it needs an additional perf event 
just to generate the COMM events.  So I decided not to do that and 
instead reused the saved cmdline info from ftrace.  But IIUC that also 
has a problem of possible information loss due to the size of the table.


Thanks,
Namhyung



Re: [PATCH resend] shdma: fixup sh_dmae_get_partial() calculation error

2013-07-30 Thread Simon Horman
On Tue, Jul 30, 2013 at 11:22:31AM +0200, Guennadi Liakhovetski wrote:
> Hi Simon
> 
> On Tue, 30 Jul 2013, Simon Horman wrote:
> 
> > [ Cc Vinod ]
> > 
> > On Tue, Jul 23, 2013 at 11:12:41PM -0700, Kuninori Morimoto wrote:
> > > sh_desc->hw.tcr is controlling real data size,
> > > and, register TCR is controlling data transfer count
> > > which was xmit_shifted value of hw.tcr.
> > > Current sh_dmae_get_partial() is calculating in different unit.
> > > This patch fixes it.
> > > 
> > > Signed-off-by: Kuninori Morimoto 
> > 
> > Can I confirm that this was a regression introduced by
> > 4f46f8ac80416b0e8fd3aba6a0d842205fb29140   
> > ("dmaengine: shdma: restore partial transfer calculation") ?
> 
> No, don't think so. That patch only restores what was accidentally deleted 
> from the driver before. So, you cannot say, that before that patch the 
> calculation was correct. And even the way it restores it was already wrong 
> before. The error goes back to the original implementation:
> 
> commit c014906a870ce70e009def0c9d170ccabeb0be63
> Author: Guennadi Liakhovetski 
> Date:   Thu Feb 18 16:30:02 2010 +
> 
> dmaengine: shdma: extend .device_terminate_all() to record partial 
> transfer

Thanks, I will add the following to the changelog:

This bug has been present since c014906a870ce70e009def0c9d170ccabeb0be63
("dmaengine: shdma: extend .device_terminate_all() to record partial
transfer"), which was added in 2.6.34-rc1.


Re: __ftrace_hash_rec_update FTRACE_WARN_ON.

2013-07-30 Thread Steven Rostedt
On Sat, 2013-07-27 at 12:16 -0700, Steve Hodgson wrote:

> This patch fixes ftrace across module removal/reinsertion on our 3.6.11 
> kernel.

I found a few corner cases that this patch doesn't work with, and the
solution is just getting uglier and uglier.

I found a new solution that seems to handle all the corner cases
(including this one). Can you test this as well. I want to add your
tested-by tag for this too.

Thanks!

-- Steve

>From 4c647dde26cf8c47ccaaed6f2e2fffee5dab5871 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Red Hat)" 
Date: Tue, 30 Jul 2013 00:04:32 -0400
Subject: [PATCH] ftrace: Check module functions being traced on reload
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

There's been a nasty bug that would show up and not give much info.
The bug displayed the following warning:

 WARNING: at kernel/trace/ftrace.c:1529 __ftrace_hash_rec_update+0x1e3/0x230()
 Pid: 20903, comm: bash Tainted: G   O 3.6.11+ #38405.trunk
 Call Trace:
  [] warn_slowpath_common+0x7f/0xc0
  [] warn_slowpath_null+0x1a/0x20
  [] __ftrace_hash_rec_update+0x1e3/0x230
  [] ftrace_hash_move+0x28/0x1d0
  [] ? kfree+0x2c/0x110
  [] ftrace_regex_release+0x8e/0x150
  [] __fput+0xae/0x220
  [] fput+0xe/0x10
  [] task_work_run+0x72/0x90
  [] do_notify_resume+0x6c/0xc0
  [] ? trace_hardirqs_on_thunk+0x3a/0x3c
  [] int_signal+0x12/0x17
 ---[ end trace 793179526ee09b2c ]---

It was finally narrowed down to unloading a module that was being traced.

It was actually more than that. When functions are being traced, there's
a table of all functions that have a ref count of the number of active
tracers attached to that function. When a function trace callback is
registered to a function, the function's record ref count is incremented.
When it is unregistered, the function's record ref count is decremented.
If an inconsistency is detected (ref count goes below zero) the above
warning is shown and the function tracing is permanently disabled until
reboot.

The ftrace callback ops holds a hash of functions that it filters on
(and/or filters off). If the hash is empty, the default means to filter
all functions (for the filter_hash) or to disable no functions (for the
notrace_hash).

When a module is unloaded, it frees the function records that represent
the module functions. These records exist on their own pages, that is
function records for one module will not exist on the same page as
function records for other modules or even the core kernel.

Now when a module unloads, the records that represents its functions are
freed. When the module is loaded again, the records are recreated with
a default ref count of zero (unless there's a callback that traces all
functions, then they will also be traced, and the ref count will be
incremented).

The problem is that if an ftrace callback hash includes functions of the
module being unloaded, those hash entries will not be removed. If the
module is reloaded in the same location, the hash entries still point
to the functions of the module but the module's ref counts do not reflect
that.

With the help of Steve and Joern, we found a reproducer:

 Using uinput module and uinput_release function.

 cd /sys/kernel/debug/tracing
 modprobe uinput
 echo uinput_release > set_ftrace_filter
 echo function > current_tracer
 rmmod uinput
 modprobe uinput
 # check /proc/modules to see if loaded in same addr, otherwise try again
 echo nop > current_tracer

 [BOOM]

The above loads the uinput module, which creates a table of functions that
can be traced within the module.

We add uinput_release to the filter_hash to trace just that function.

Enable function tracing, which increments the ref count of the record
associated with uinput_release.

Remove uinput, which frees the records including the one that represents
uinput_release.

Load the uinput module again (and make sure it's at the same address).
This recreates the function records all with a ref count of zero,
including uinput_release.

Disable function tracing, which will decrement the ref count for uinput_release
which is now zero because of the module removal and reload, and we have
a mismatch (below zero ref count).

The solution is to check all currently tracing ftrace callbacks to see if any
are tracing any of the module's functions when a module is loaded (it already 
does
that with callbacks that trace all functions). If a callback happens to have
a module function being traced, it increments that record's ref count and starts
tracing that function.

There may be a strange side effect with this, where tracing module functions
on unload and then reloading a new module may have that new module's functions
being traced. This may be something that confuses the user, but it's not
a big deal. Another approach is to disable all callback hashes on module unload,
but this leaves some ftrace callbacks that may not be registered, but can
still have hashes tracing the module's function where ftrace 

Re: [PATCH] Driver core: Add accessor for device tree node

2013-07-30 Thread Jingoo Han
On Tuesday, July 30, 2013 10:57 PM, Greg Kroah-Hartman wrote:
> On Tue, Jul 30, 2013 at 06:20:47PM +0900, Jingoo Han wrote:
> > Add the wrapper function for retrieving device tree node instead of
> > accessing dev->of_node directly.
> 
> Why?

Sorry, I cannot find the special reason why it should be used.
Please, ignore it. :)
Thanks.

Best regards,
Jingoo Han



  1   2   3   4   5   6   7   8   9   10   >