On 07/26/2016 05:26 AM, Sathya Perla wrote:
-----Original Message-----
From: Guilherme G. Piccoli [mailto:gpicc...@linux.vnet.ibm.com]

On 07/25/2016 07:48 AM, Sathya Perla wrote:
-----Original Message-----
From: Guilherme G. Piccoli [mailto:gpicc...@linux.vnet.ibm.com]

Temperature values in the be2net driver are made available to userspace
via the hwmon abstraction, so tools like lm-sensors can present them to
the user. The driver provides hwmon structures for each adapter's
function. Nevertheless, the temperature information comes from fw
queries performed periodically by be_worker(), and this procedure is
called with a single function as argument; this means that the
temperature value is updated only in the specific function that was
passed to be_worker().

This can lead to inconsistent temperatures being reported by different
functions or, in a worse scenario, some functions might be unable to
provide temperature info to userspace if they weren't fed with this
information from fw in a be_worker() run.

Hi, I'm wondering: if you are OK with the temperature value being 128s
old (2/2 patch), then why is it a problem if two different functions
report a temperature value that is queried a few seconds apart?
Also, you'll not have a scenario where the FW cmd succeeds for one
function and fails for other functions. It's a common FW for the entire
adapter.


This patch changes the way temperature is set in the be2net driver. Any
time the fw query is performed, it will set the temperature value for
all functions of the adapter, instead of only setting the temperature
of the function passed to be_worker().

If the possible inconsistency across functions is indeed a problem,
then a simpler solution would be to issue the FW cmd synchronously
when the sysfs attr is read, i.e., in the be_hwmon_show_temp() routine
itself.


Hi Sathya, thanks very much for your quick reply. I agree with you that
a 1 or 2 sec inconsistency wouldn't harm, but the main problem we're
seeing is that be_worker() is being called with a single function as a
parameter. In our case, the last function is being passed as argument
to be_worker() multiple times in a row, and then we have its
temperature updated but the other functions' temperatures set as
invalid.

Hi Guilherme, this doesn't sound right to me and is not expected. The
be_worker() routine must execute for *each* function every second.
Can you pls share the driver/fw version, any debug logs (with prints)
you may have, and the lspci output?

Hi Sathya, indeed...this is _not right_...from my side heheh
Unfortunately I made a mistake in my analysis and ended up over-engineering a "solution" to an issue whose root cause wasn't clear to me! I want to thank you for your relevant questions and the information you provided, which helped a lot to figure out exactly what's going on.

Our issue is seen because some of the adapter's functions (3 out of 4) have their interface down, and the fw temperature queries are performed only for functions whose interface is up. The following conditional prevents the fw query from running whenever the adapter's interface is down:

  if (!netif_running(adapter->netdev))
[be_main.c:5002, kernel v4.7]

It seems harmless to move the fw query so that the temperature is read for all functions regardless of the state of their interfaces; this will solve our issue. I wrote a simple patch (for "net", and no longer "net-next") to improve this driver's behavior.
I'll send it right after this message, please let me know what you think.
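
To make the ordering issue concrete, here is a minimal userspace sketch
of the behavior described above. It is not the driver code: struct
fn_model, TEMP_INVALID, model_fw_read_temp() and both worker functions
are hypothetical stand-ins, and the if_up flag only models what
netif_running(adapter->netdev) reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TEMP_INVALID (-1)  /* hypothetical "no reading yet" sentinel */

/* Hypothetical model of one adapter function. */
struct fn_model {
    bool if_up;  /* models netif_running(adapter->netdev) */
    int  temp;   /* last temperature stored for this function */
};

/* Stand-in for the fw temperature query; returns a fixed reading. */
static int model_fw_read_temp(void)
{
    return 55;
}

/* Ordering as described in the message: the early return for a down
 * interface comes first, so the temperature is never refreshed. */
static void worker_gated(struct fn_model *f)
{
    assert(f != NULL);
    if (!f->if_up)
        return;
    f->temp = model_fw_read_temp();
}

/* Proposed ordering: refresh the temperature first, then skip the
 * interface-dependent work when the interface is down. */
static void worker_ungated(struct fn_model *f)
{
    assert(f != NULL);
    f->temp = model_fw_read_temp();
    if (!f->if_up)
        return;
    /* interface-dependent housekeeping would go here */
}
```

With if_up set to false, worker_gated() leaves temp at TEMP_INVALID no
matter how often it runs, which matches the "invalid" readings seen on
the 3 functions whose interface is down; worker_ungated() stores a
reading in the same situation.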

Again, thanks very much for your attention and sorry for my confusion.
Cheers,


Guilherme



Regarding running the temperature update in be_hwmon_show_temp(), that
was an idea too, but I was afraid of delaying this output too much.
Imagine some userspace tool reads the hwmon attributes for all functions
at almost the "same time": supposing the fw command can't run in
parallel, the "last" read would need to wait for 4 fw commands to
complete before showing its output.

I don't see any issue even if the sensors program queries each function
one after another. These calls would only be a few milliseconds apart.

Besides, in a worse scenario, some "not-friendly" tool might issue lots
of reads to hwmon per second, thus issuing lots of fw commands, which
does not seem like a good idea. Of course, we can avoid this last case
by implementing a counter or timer in be_hwmon_show_temp() to cap the
number of fw cmds in a time frame.

Yes, this is not an issue. If the hwmon read is issued within a few
seconds of the previous read then you can just return the old
temperature value. We are anyway querying this value only once in 64s
now.
But, I'd like to root-cause the issue you are seeing above before we
"fix" anything.
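
The "return the old value if the previous read was recent" idea can be
sketched roughly as follows. This is only a userspace illustration
under stated assumptions: struct temp_cache, TEMP_MAX_AGE_S,
cache_fw_read_temp() and show_temp_cached() are hypothetical names, the
2-second window is an arbitrary choice, and time is passed in as plain
seconds (a kernel version would more likely compare jiffies with
time_after()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TEMP_MAX_AGE_S 2  /* hypothetical: reuse readings younger than this */

/* Hypothetical per-adapter cache of the last fw reading. */
struct temp_cache {
    bool     valid;    /* false until the first fw query completes */
    long     stamp;    /* time of the last fw query, in seconds */
    int      temp;     /* cached temperature */
    unsigned fw_cmds;  /* number of fw queries issued, for illustration */
};

/* Stand-in for the fw temperature query; counts how often it runs. */
static int cache_fw_read_temp(struct temp_cache *c)
{
    c->fw_cmds++;
    return 55;
}

/* Read path: query fw only when the cached value is missing or too
 * old, otherwise return the cached value. */
static int show_temp_cached(struct temp_cache *c, long now)
{
    assert(c != NULL);
    if (!c->valid || now - c->stamp >= TEMP_MAX_AGE_S) {
        c->temp = cache_fw_read_temp(c);
        c->stamp = now;
        c->valid = true;
    }
    return c->temp;
}
```

With this shape, a tool that hammers the hwmon attribute issues at most
one fw command per TEMP_MAX_AGE_S window, and reads for several
functions arriving a few milliseconds apart share one query if the cache
is kept per adapter.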

thanks,
-Sathya

