Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff

2015-12-31 Thread Eduardo Valentin
can we have a shorter title?

On Tue, Dec 29, 2015 at 02:46:49PM +0530, Keerthy wrote:
> Hi Nishanth,
> 

 
> >
> >I am not sure if this #ifdeffery is even needed.
> >
> >
> >Eduardo, Rui: If this is not the suggested technique, maybe you guys
> >could suggest how we could handle a case where userspace might be
> >hungup due to some reason and a case where a critical temperature
> >event in the middle of device probe was triggered?

Orderly power off is supposed to take care of this. Looking at the code,
it will force a shutdown in case execution of userland command fails:

static int __orderly_poweroff(bool force)
{
int ret;

ret = run_cmd(poweroff_cmd);

if (ret && force) {
pr_warn("Failed to start orderly shutdown: forcing the 
issue\n");

/*
 * I guess this should try to kick off some daemon to sync and
 * poweroff asap.  Or not even bother syncing if we're doing an
 * emergency shutdown?
 */
emergency_sync();
kernel_power_off();
}

> >
> >Obviously, we'd like to take into consideration userspace latencies as
> >well- but that is very specific to fs being run.. not really a simple
> >problem, IMHO..
> >
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff

2015-12-31 Thread Nishanth Menon
On 12/31/2015 12:20 PM, Eduardo Valentin wrote:
> On Thu, Dec 31, 2015 at 11:47:57AM -0600, Nishanth Menon wrote:
>> On 12/31/2015 11:29 AM, Eduardo Valentin wrote:
>>> can we have a shorter title?
>>>
>>>
>>> Orderly power off is supposed to take care of this. Looking at the code,
>>> it will force a shutdown in case execution of userland command fails:
>>>
>>> static int __orderly_poweroff(bool force)
>>> {
>>> int ret;
>>>
>>> ret = run_cmd(poweroff_cmd);
>>>
>>> if (ret && force) {
>>> pr_warn("Failed to start orderly shutdown: forcing the 
>>> issue\n");
>>>
>>> /*
>>>  * I guess this should try to kick off some daemon to sync 
>>> and
>>>  * poweroff asap.  Or not even bother syncing if we're 
>>> doing an
>>>  * emergency shutdown?
>>>  */
>>> emergency_sync();
>>> kernel_power_off();
>>> }
>>
>> Yes, it will *IF* userspace fails. the condition that I had tracked
>> was before identifying the following fix[1] - Example fail is here[2]
>>
> 
> OK. But still, why other users of orderly_poweroff do not
> deserve to be fixed, then?
> 


I'd agree as well.. I guess the comment from Robin Holt 
anticipated something like this will eventually occur.

"* I guess this should try to kick off some daemon to sync and
* poweroff asap.  Or not even bother syncing if we're doing an
* emergency shutdown?
"

Keerthy - would you spin this as a generic fix?

>>
>> I hope this explains the problem.
>>
>> [1]
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224
>> [2] http://pastebin.ubuntu.com/14326688/
>>
>> [3]
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738
>>
>> -- 
>> Regards,
>> Nishanth Menon


-- 
Regards,
Nishanth Menon
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff

2015-12-31 Thread Eduardo Valentin
On Mon, Dec 21, 2015 at 11:16:18AM +0530, Keerthy wrote:
> In few rare conditions like during boot up the orderly_poweroff
> function might not be able to complete fully leaving the device
> running at dangerously high temperatures. Hence adding a backup
> workqueue to act after a known period of time and poweroff the device.

I really wish a better description of what is going on. I am not really
sure why thermal subsystem must deal with the case of a failing kernel
API. If orderly power off is failing, why don't we fix it instead?
What are the failing conditions? few rare conditions seams very odd.
How does it fail? Why fixing it in thermal? Other users of it don't
deserver the same fix?

>   }
>  }
> -- 
> 1.9.1
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff

2015-12-31 Thread Eduardo Valentin
On Thu, Dec 31, 2015 at 11:47:57AM -0600, Nishanth Menon wrote:
> On 12/31/2015 11:29 AM, Eduardo Valentin wrote:
> > can we have a shorter title?
> > 
> > 
> > Orderly power off is supposed to take care of this. Looking at the code,
> > it will force a shutdown in case execution of userland command fails:
> > 
> > static int __orderly_poweroff(bool force)
> > {
> > int ret;
> > 
> > ret = run_cmd(poweroff_cmd);
> > 
> > if (ret && force) {
> > pr_warn("Failed to start orderly shutdown: forcing the 
> > issue\n");
> > 
> > /*
> >  * I guess this should try to kick off some daemon to sync 
> > and
> >  * poweroff asap.  Or not even bother syncing if we're 
> > doing an
> >  * emergency shutdown?
> >  */
> > emergency_sync();
> > kernel_power_off();
> > }
> 
> Yes, it will *IF* userspace fails. the condition that I had tracked
> was before identifying the following fix[1] - Example fail is here[2]
> 

OK. But still, why other users of orderly_poweroff do not
deserve to be fixed, then?

> 
> I hope this explains the problem.
> 
> [1]
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224
> [2] http://pastebin.ubuntu.com/14326688/
> 
> [3]
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738
> 
> -- 
> Regards,
> Nishanth Menon
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff

2015-12-31 Thread Nishanth Menon
On 12/31/2015 11:29 AM, Eduardo Valentin wrote:
> can we have a shorter title?
> 
> On Tue, Dec 29, 2015 at 02:46:49PM +0530, Keerthy wrote:
>> Hi Nishanth,
>>
> 
>  
>>>
>>> I am not sure if this #ifdeffery is even needed.
>>>
>>>
>>> Eduardo, Rui: If this is not the suggested technique, maybe you guys
>>> could suggest how we could handle a case where userspace might be
>>> hungup due to some reason and a case where a critical temperature
>>> event in the middle of device probe was triggered?
> 
> Orderly power off is supposed to take care of this. Looking at the code,
> it will force a shutdown in case execution of userland command fails:
> 
> static int __orderly_poweroff(bool force)
> {
> int ret;
> 
> ret = run_cmd(poweroff_cmd);
> 
> if (ret && force) {
> pr_warn("Failed to start orderly shutdown: forcing the 
> issue\n");
> 
> /*
>  * I guess this should try to kick off some daemon to sync and
>  * poweroff asap.  Or not even bother syncing if we're doing 
> an
>  * emergency shutdown?
>  */
> emergency_sync();
> kernel_power_off();
> }

Yes, it will *IF* userspace fails. the condition that I had tracked
was before identifying the following fix[1] - Example fail is here[2]

In this case, tmp102 is setup for X15 as [3] - and built as a module.
as the kernel startsup filesystem and starts a modprobe of all modules
via udev rules, the probe of tmp102 detects (falsely) a critical
temperature condition. Shutdown attempt in the middle of driver probe
is always a tricky business.

As we look at the log in [2], Line  472
> thermal thermal_zone3: critical temperature reached(108 C),shutting down
We have userspace trigger for shutdown taking place.

Line 495: INIT: Sending processes the TERM signal

userspace starts shutting down services. (but note that probe for
other devices were either in progress or queued up to complete)..

at line 647 - we are in a weird place -> sysrq shows that system is
idled and userspace is shutdown and system is still active.


In this case, we entered the case thanks to a driver bug, but if this
situation was a real world temperature scenario, then we'd probably in
an overtemp scenario, then device damage could take place OR something
much worse.

The only alternative is to run a parallel thread in case userspace
fails to complete the job in some given period of time - due to what
ever be the condition triggering the problem.

I hope this explains the problem.

[1]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224
[2] http://pastebin.ubuntu.com/14326688/

[3]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738

-- 
Regards,
Nishanth Menon
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff

2015-12-29 Thread Keerthy

Hi Nishanth,

On Monday 28 December 2015 11:11 PM, Nishanth Menon wrote:

On 12/20/2015 11:46 PM, Keerthy wrote:

+linux-pm.


Thanks for looping!




In few rare conditions like during boot up the orderly_poweroff
function might not be able to complete fully leaving the device
running at dangerously high temperatures. Hence adding a backup
workqueue to act after a known period of time and poweroff the device.





Suggested-by: Nishanth Menon 
Reported-by: Nishanth Menon 


The specific case I hit was a critical temperature event as soon as
the hwmon device was probed (the driver tmp102 was a kernel module).


Signed-off-by: Keerthy 
---
  drivers/thermal/Kconfig|  9 +
  drivers/thermal/thermal_core.c | 26 ++
  2 files changed, 35 insertions(+)

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 8cc4ac6..25584ee 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -92,6 +92,15 @@ config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR

  endchoice

+config THERMAL_BKUP_SHUTDOWN_DELAY_MS
+int "Thermal shutdown  backup delay in milli-seconds"
+depends on THERMAL
+default "5000"
+---help---
+The number of milliseconds to delay after orderly_poweroff call.


Probably needs a rephrase.


Delay in milliseconds before the backup thermal shutdown is triggered.




+
+Default: 5000 (5 seconds)
+
  config THERMAL_GOV_FAIR_SHARE
bool "Fair-share thermal governor"
help
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index d9e525c..b793857 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -61,6 +61,12 @@ static DEFINE_MUTEX(thermal_governor_lock);

  static struct thermal_governor *def_governor;

+#ifdef CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
+#define BKUP_SHUTDOWN_DELAY CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
+#else
+#define BKUP_SHUTDOWN_DELAY 5000
+#endif
+


I am not sure if this #ifdeffery is even needed.


Eduardo, Rui: If this is not the suggested technique, maybe you guys
could suggest how we could handle a case where userspace might be
hungup due to some reason and a case where a critical temperature
event in the middle of device probe was triggered?

Obviously, we'd like to take into consideration userspace latencies as
well- but that is very specific to fs being run.. not really a simple
problem, IMHO..


  static struct thermal_governor *__find_governor(const char *name)
  {
struct thermal_governor *pos;
@@ -423,6 +429,18 @@ static void handle_non_critical_trips(struct 
thermal_zone_device *tz,
   def_governor->throttle(tz, trip);
  }

+static void bkup_shutdown_func(struct work_struct *unused);
+static DECLARE_DELAYED_WORK(bkup_shutdown_work, bkup_shutdown_func);
+
+static void bkup_shutdown_func(struct work_struct *work)
+{
+   pr_warn("Orderly_poweroff has failed! Attempting kernel_power_off\n");
+   kernel_power_off();
+
+   pr_warn("kernel_power_off has failed! Attempting emergency_restart\n");
+   emergency_restart();


I think documentation is necessary that we are hoping for bootloader
to be able to detect and halt as needed -> risk here obviously is an
infinite reboot loop :( .


Agreed.





+}
+
  static void handle_critical_trips(struct thermal_zone_device *tz,
int trip, enum thermal_trip_type trip_type)
  {
@@ -443,6 +461,14 @@ static void handle_critical_trips(struct 
thermal_zone_device *tz,
dev_emerg(>device,
  "critical temperature reached(%d C),shutting down\n",
  tz->temperature / 1000);
+   /**


needs to be kernel doc style?


+* This is a backup option to shutdown the
+* system in case orderly_poweroff
+* fails

Maybe adjust to 80char?


Okay.




+*/
+   schedule_delayed_work(_shutdown_work,
+ msecs_to_jiffies(BKUP_SHUTDOWN_DELAY));
+
orderly_poweroff(true);
}
  }






I will wait for Eduardo/Rui's inputs before posting the next version.

Best Regards,
Keerthy
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff

2015-12-28 Thread Nishanth Menon
On 12/20/2015 11:46 PM, Keerthy wrote:

+linux-pm.

> In few rare conditions like during boot up the orderly_poweroff
> function might not be able to complete fully leaving the device
> running at dangerously high temperatures. Hence adding a backup
> workqueue to act after a known period of time and poweroff the device.
> 


> Suggested-by: Nishanth Menon 
> Reported-by: Nishanth Menon 

The specific case I hit was a critical temperature event as soon as
the hwmon device was probed (the driver tmp102 was a kernel module).

> Signed-off-by: Keerthy 
> ---
>  drivers/thermal/Kconfig|  9 +
>  drivers/thermal/thermal_core.c | 26 ++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 8cc4ac6..25584ee 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -92,6 +92,15 @@ config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
>  
>  endchoice
>  
> +config THERMAL_BKUP_SHUTDOWN_DELAY_MS
> +int "Thermal shutdown  backup delay in milli-seconds"
> +depends on THERMAL
> +default "5000"
> +---help---
> +The number of milliseconds to delay after orderly_poweroff call.

Probably needs a rephrase.

> +
> +Default: 5000 (5 seconds)
> +
>  config THERMAL_GOV_FAIR_SHARE
>   bool "Fair-share thermal governor"
>   help
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index d9e525c..b793857 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -61,6 +61,12 @@ static DEFINE_MUTEX(thermal_governor_lock);
>  
>  static struct thermal_governor *def_governor;
>  
> +#ifdef CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
> +#define BKUP_SHUTDOWN_DELAY CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
> +#else
> +#define BKUP_SHUTDOWN_DELAY 5000
> +#endif
> +

I am not sure if this #ifdeffery is even needed.


Eduardo, Rui: If this is not the suggested technique, maybe you guys
could suggest how we could handle a case where userspace might be
hungup due to some reason and a case where a critical temperature
event in the middle of device probe was triggered?

Obviously, we'd like to take into consideration userspace latencies as
well- but that is very specific to fs being run.. not really a simple
problem, IMHO..

>  static struct thermal_governor *__find_governor(const char *name)
>  {
>   struct thermal_governor *pos;
> @@ -423,6 +429,18 @@ static void handle_non_critical_trips(struct 
> thermal_zone_device *tz,
>  def_governor->throttle(tz, trip);
>  }
>  
> +static void bkup_shutdown_func(struct work_struct *unused);
> +static DECLARE_DELAYED_WORK(bkup_shutdown_work, bkup_shutdown_func);
> +
> +static void bkup_shutdown_func(struct work_struct *work)
> +{
> + pr_warn("Orderly_poweroff has failed! Attempting kernel_power_off\n");
> + kernel_power_off();
> +
> + pr_warn("kernel_power_off has failed! Attempting emergency_restart\n");
> + emergency_restart();

I think documentation is necessary that we are hoping for bootloader
to be able to detect and halt as needed -> risk here obviously is an
infinite reboot loop :( .


> +}
> +
>  static void handle_critical_trips(struct thermal_zone_device *tz,
>   int trip, enum thermal_trip_type trip_type)
>  {
> @@ -443,6 +461,14 @@ static void handle_critical_trips(struct 
> thermal_zone_device *tz,
>   dev_emerg(>device,
> "critical temperature reached(%d C),shutting down\n",
> tz->temperature / 1000);
> + /**

needs to be kernel doc style?

> +  * This is a backup option to shutdown the
> +  * system in case orderly_poweroff
> +  * fails
Maybe adjust to 80char?

> +  */
> + schedule_delayed_work(_shutdown_work,
> +   msecs_to_jiffies(BKUP_SHUTDOWN_DELAY));
> +
>   orderly_poweroff(true);
>   }
>  }
> 


-- 
Regards,
Nishanth Menon
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff

2015-12-20 Thread Keerthy
In few rare conditions like during boot up the orderly_poweroff
function might not be able to complete fully leaving the device
running at dangerously high temperatures. Hence adding a backup
workqueue to act after a known period of time and poweroff the device.

Suggested-by: Nishanth Menon 
Reported-by: Nishanth Menon 
Signed-off-by: Keerthy 
---
 drivers/thermal/Kconfig|  9 +
 drivers/thermal/thermal_core.c | 26 ++
 2 files changed, 35 insertions(+)

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 8cc4ac6..25584ee 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -92,6 +92,15 @@ config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
 
 endchoice
 
+config THERMAL_BKUP_SHUTDOWN_DELAY_MS
+int "Thermal shutdown  backup delay in milli-seconds"
+depends on THERMAL
+default "5000"
+---help---
+The number of milliseconds to delay after orderly_poweroff call.
+
+Default: 5000 (5 seconds)
+
 config THERMAL_GOV_FAIR_SHARE
bool "Fair-share thermal governor"
help
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index d9e525c..b793857 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -61,6 +61,12 @@ static DEFINE_MUTEX(thermal_governor_lock);
 
 static struct thermal_governor *def_governor;
 
+#ifdef CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
+#define BKUP_SHUTDOWN_DELAY CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
+#else
+#define BKUP_SHUTDOWN_DELAY 5000
+#endif
+
 static struct thermal_governor *__find_governor(const char *name)
 {
struct thermal_governor *pos;
@@ -423,6 +429,18 @@ static void handle_non_critical_trips(struct 
thermal_zone_device *tz,
   def_governor->throttle(tz, trip);
 }
 
+static void bkup_shutdown_func(struct work_struct *unused);
+static DECLARE_DELAYED_WORK(bkup_shutdown_work, bkup_shutdown_func);
+
+static void bkup_shutdown_func(struct work_struct *work)
+{
+   pr_warn("Orderly_poweroff has failed! Attempting kernel_power_off\n");
+   kernel_power_off();
+
+   pr_warn("kernel_power_off has failed! Attempting emergency_restart\n");
+   emergency_restart();
+}
+
 static void handle_critical_trips(struct thermal_zone_device *tz,
int trip, enum thermal_trip_type trip_type)
 {
@@ -443,6 +461,14 @@ static void handle_critical_trips(struct 
thermal_zone_device *tz,
dev_emerg(>device,
  "critical temperature reached(%d C),shutting down\n",
  tz->temperature / 1000);
+   /**
+* This is a backup option to shutdown the
+* system in case orderly_poweroff
+* fails
+*/
+   schedule_delayed_work(_shutdown_work,
+ msecs_to_jiffies(BKUP_SHUTDOWN_DELAY));
+
orderly_poweroff(true);
}
 }
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html