Re: flush_icache_range in the sticore driver

2022-07-05 Thread James Bottomley
On Tue, 2022-07-05 at 18:46 +0200, Christoph Hellwig wrote:
> Hi all,
> 
> flush_icache_range is supposed to flush the instruction cache, which
> is something no driver should be doing.  It was added in commit
> 03b18f1b2afe ("[PARISC] Clean up sti_flush") but the explanation in
> there looks odd.
> 
> Can someone shed a light what flushes this tries to flush and why it
> can't be done behind a proper API?

What's wrong with the explanation:

"sti_flush is supposed to flush the caches so we can execute the STI
rom we copied to memory."

Which is in that commit?

The STI driver is taking the executable STI ROM code, which is in a
very slow to access part of the system and copying it to RAM so it can
be executed.  After the driver has done the copy it needs to flush the
icache just in case there's anything stale in there before executing
the newly copied code.

If you grep the drive for sti_call, you'll see all the points we call
into the copied rom code.

James




Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread James Bottomley
On Mon, 2022-02-28 at 23:59 +0200, Mike Rapoport wrote:
> 
> On February 28, 2022 10:42:53 PM GMT+02:00, James Bottomley <
> james.bottom...@hansenpartnership.com> wrote:
> > On Mon, 2022-02-28 at 21:07 +0100, Christian König wrote:
[...]
> > > > I do wish we could actually poison the 'pos' value after the
> > > > loop somehow - but clearly the "might be uninitialized" I was
> > > > hoping for isn't the way to do it.
> > > > 
> > > > Anybody have any ideas?
> > > 
> > > I think we should look at the use cases why code is touching
> > > (pos) after the loop.
> > > 
> > > Just from skimming over the patches to change this and experience
> > > with the drivers/subsystems I help to maintain I think the
> > > primary pattern looks something like this:
> > > 
> > > list_for_each_entry(entry, head, member) {
> > >  if (some_condition_checking(entry))
> > >  break;
> > > }
> > > do_something_with(entry);
> > 
> > Actually, we usually have a check to see if the loop found
> > anything, but in that case it should something like
> > 
> > if (list_entry_is_head(entry, head, member)) {
> >return with error;
> > }
> > do_somethin_with(entry);
> > 
> > Suffice?  The list_entry_is_head() macro is designed to cope with
> > the bogus entry on head problem.
> 
> Won't suffice because the end goal of this work is to limit scope of
> entry only to loop. Hence the need for additional variable.

Well, yes, but my objection is more to the size of churn than the
desire to do loop local.  I'm not even sure loop local is possible,
because it's always annoyed me that for (int i = 0; ...  in C++ defines
i in the outer scope not the loop scope, which is why I never use it.

However, if the desire is really to poison the loop variable then we
can do

#define list_for_each_entry(pos, head, member)  \
for (pos = list_first_entry(head, typeof(*pos), member);\
 !list_entry_is_head(pos, head, member) && ((pos = NULL) == NULL;   
\
 pos = list_next_entry(pos, member))

Which would at least set pos to NULL when the loop completes.

> Besides, there are no guarantees that people won't
> do_something_with(entry) without the check or won't compare entry to
> NULL to check if the loop finished with break or not.

I get the wider goal, but we have to patch the problem cases now and a
simple one-liner is better than a larger patch that may or may not work
if we ever achieve the local definition or value poisoning idea.  I'm
also fairly certain coccinelle can come up with a use without checking
for loop completion semantic patch which we can add to 0day.

James




Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread James Bottomley
On Mon, 2022-02-28 at 21:56 +0100, Christian König wrote:
> 
> Am 28.02.22 um 21:42 schrieb James Bottomley:
> > On Mon, 2022-02-28 at 21:07 +0100, Christian König wrote:
> > > Am 28.02.22 um 20:56 schrieb Linus Torvalds:
> > > > On Mon, Feb 28, 2022 at 4:19 AM Christian König
> > > >  wrote:
> > > > [SNIP]
> > > > Anybody have any ideas?
> > > I think we should look at the use cases why code is touching
> > > (pos)
> > > after the loop.
> > > 
> > > Just from skimming over the patches to change this and experience
> > > with the drivers/subsystems I help to maintain I think the
> > > primary pattern looks something like this:
> > > 
> > > list_for_each_entry(entry, head, member) {
> > >   if (some_condition_checking(entry))
> > >   break;
> > > }
> > > do_something_with(entry);
> > 
> > Actually, we usually have a check to see if the loop found
> > anything, but in that case it should something like
> > 
> > if (list_entry_is_head(entry, head, member)) {
> >  return with error;
> > }
> > do_somethin_with(entry);
> > 
> > Suffice?  The list_entry_is_head() macro is designed to cope with
> > the bogus entry on head problem.
> 
> That will work and is also what people already do.
> 
> The key problem is that we let people do the same thing over and
> over again with slightly different implementations.
> 
> Out in the wild I've seen at least using a separate variable, using
> a bool to indicate that something was found and just assuming that
> the list has an entry.
> 
> The last case is bogus and basically what can break badly.

Yes, I understand that.  I'm saying we should replace that bogus checks
of entry->something against some_value loop termination condition with
the list_entry_is_head() macro.  That should be a one line and fairly
mechanical change rather than the explosion of code changes we seem to
have in the patch series.

James




Re: [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread James Bottomley
On Mon, 2022-02-28 at 21:07 +0100, Christian König wrote:
> Am 28.02.22 um 20:56 schrieb Linus Torvalds:
> > On Mon, Feb 28, 2022 at 4:19 AM Christian König
> >  wrote:
> > > I don't think that using the extra variable makes the code in any
> > > way
> > > more reliable or easier to read.
> > So I think the next step is to do the attached patch (which
> > requires
> > that "-std=gnu11" that was discussed in the original thread).
> > 
> > That will guarantee that the 'pos' parameter of
> > list_for_each_entry()
> > is only updated INSIDE the for_each_list_entry() loop, and can
> > never
> > point to the (wrongly typed) head entry.
> > 
> > And I would actually hope that it should actually cause compiler
> > warnings about possibly uninitialized variables if people then use
> > the
> > 'pos' pointer outside the loop. Except
> > 
> >   (a) that code in sgx/encl.c currently initializes 'tmp' to NULL
> > for
> > inexplicable reasons - possibly because it already expected this
> > behavior
> > 
> >   (b) when I remove that NULL initializer, I still don't get a
> > warning,
> > because we've disabled -Wno-maybe-uninitialized since it results in
> > so
> > many false positives.
> > 
> > Oh well.
> > 
> > Anyway, give this patch a look, and at least if it's expanded to do
> > "(pos) = NULL" in the entry statement for the for-loop, it will
> > avoid the HEAD type confusion that Jakob is working on. And I think
> > in a cleaner way than the horrid games he plays.
> > 
> > (But it won't avoid possible CPU speculation of such type
> > confusion. That, in my opinion, is a completely different issue)
> 
> Yes, completely agree.
> 
> > I do wish we could actually poison the 'pos' value after the loop
> > somehow - but clearly the "might be uninitialized" I was hoping for
> > isn't the way to do it.
> > 
> > Anybody have any ideas?
> 
> I think we should look at the use cases why code is touching (pos)
> after the loop.
> 
> Just from skimming over the patches to change this and experience
> with the drivers/subsystems I help to maintain I think the primary
> pattern looks something like this:
> 
> list_for_each_entry(entry, head, member) {
>  if (some_condition_checking(entry))
>  break;
> }
> do_something_with(entry);


Actually, we usually have a check to see if the loop found anything,
but in that case it should something like

if (list_entry_is_head(entry, head, member)) {
return with error;
}
do_somethin_with(entry);

Suffice?  The list_entry_is_head() macro is designed to cope with the
bogus entry on head problem.

James




Re: [PATCH v3 3/4] tpm_tis: Disable interrupts if interrupt storm detected

2020-12-07 Thread James Bottomley
On Mon, 2020-12-07 at 15:28 -0400, Jason Gunthorpe wrote:
> On Sun, Dec 06, 2020 at 08:26:16PM +0100, Thomas Gleixner wrote:
> > Just as a side note. I was looking at tpm_tis_probe_irq_single()
> > and that function is leaking the interrupt request if any of the
> > checks afterwards fails, except for the final interrupt probe check
> > which does a cleanup. That means on fail before that the interrupt
> > handler stays requested up to the point where the module is
> > removed. If that's a shared interrupt and some other device is
> > active on the same line, then each interrupt from that device will
> > call into the TPM code. Something like the below is needed.
> > 
> > Also the X86 autoprobe mechanism is interesting:
> > 
> > if (IS_ENABLED(CONFIG_X86))
> > for (i = 3; i <= 15; i++)
> > if (!tpm_tis_probe_irq_single(chip, intmask, 0,
> > i))
> > return;
> > 
> > The third argument is 'flags' which is handed to request_irq(). So
> > that won't ever be able to probe a shared interrupt. But if an
> > interrupt number > 0 is handed to tpm_tis_core_init() the interrupt
> > is requested with IRQF_SHARED. Same issue when the chip has an
> > interrupt number in the register. It's also requested exclusive
> > which is pretty likely to fail on ancient x86 machines.
> 
> It is very likely none of this works any more, it has been repeatedly
> reworked over the years and just left behind out of fear someone
> needs it. I've thought it should be deleted for a while now.
> 
> I suppose the original logic was to try and probe without SHARED
> because a probe would need exclusive access to the interrupt to tell
> if the TPM was actually the source, not some other device.
> 
> It is all very old and very out of step with current thinking, IMHO.
> I skeptical that TPM interrupts were ever valuable enough to deserve
> this in the first place.

For what it's worth, I agree.  Trying to probe all 15 ISA interrupts is
last millennium thinking we should completely avoid.  If it's not
described in ACPI then you don't get an interrupt full stop.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v3 1/4] irq: export kstat_irqs

2020-12-06 Thread James Bottomley
On Sun, 2020-12-06 at 17:40 +0100, Thomas Gleixner wrote:
> On Sat, Dec 05 2020 at 12:39, Jarkko Sakkinen wrote:
> > On Fri, Dec 04, 2020 at 06:43:37PM -0700, Jerry Snitselaar wrote:
> > > To try and detect potential interrupt storms that
> > > have been occurring with tpm_tis devices it was suggested
> > > to use kstat_irqs() to get the number of interrupts.
> > > Since tpm_tis can be built as a module it needs kstat_irqs
> > > exported.
> > 
> > I think you should also have a paragraph explicitly stating that
> > i915_pmu.c contains a duplicate of kstat_irqs() because it is not
> > exported as of today. It adds a lot more weight to this given that
> > there is already existing mainline usage (kind of).
> 
> It's abusage and just the fact that it exists is not an argument by
> itself.

What we want is a count of the interrupts to see if we're having an
interrupt storm from the TPM device (some seem to be wired to fire the
interrupt even when there's no event to warrant it).  Since
kstat_irqs_user() does the correct RCU locking, should we be using that
instead?

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v1 00/12] Replace strstarts() by str_has_prefix()

2020-12-04 Thread James Bottomley
On Fri, 2020-12-04 at 18:03 +0100, laniel_fran...@privacyrequired.com
wrote:
> In this patch set, I replaced all calls to strstarts() by calls to
> str_has_prefix(). Indeed, the kernel has two functions to test if a
> string begins with an other:
> 1. strstarts() which returns a bool, so 1 if the string begins with
> the prefix,0 otherwise.
> 2. str_has_prefix() which returns the length of the prefix or 0.
> 
> str_has_prefix() was introduced later than strstarts(), in commit
> 495d714ad140 which also stated that str_has_prefix() should replace
> strstarts(). This is what this patch set does.

What's the reason why?  If you look at the use cases for the
replacement of strstart()  they're all cases where we need to know the
length we're skipping and this is hard coded, leading to potential
errors later.  This is a classic example:  3d739c1f6156 ("tracing: Use
the return of str_has_prefix() to remove open coded numbers").  However
you're not doing this transformation in the conversion, so the
conversion is pretty useless.  I also see no case for replacing
strstart() where we're using it simply as a boolean without needing to
know the length of the prefix.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [Intel-wired-lan] [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-24 Thread James Bottomley
On Tue, 2020-11-24 at 13:32 -0800, Kees Cook wrote:
> On Mon, Nov 23, 2020 at 08:31:30AM -0800, James Bottomley wrote:
> > Really, no ... something which produces no improvement has no value
> > at all ... we really shouldn't be wasting maintainer time with it
> > because it has a cost to merge.  I'm not sure we understand where
> > the balance lies in value vs cost to merge but I am confident in
> > the zero value case.
> 
> What? We can't measure how many future bugs aren't introduced because
> the kernel requires explicit case flow-control statements for all new
> code.

No but we can measure how vulnerable our current coding habits are to
the mistake this warning would potentially prevent.  I don't think it's
wrong to extrapolate that if we had no instances at all of prior coding
problems we likely wouldn't have any in future either making adopting
the changes needed to enable the warning valueless ... that's the zero
value case I was referring to above.

Now, what we have seems to be about 6 cases (at least what's been shown
in this thread) where a missing break would cause potentially user
visible issues.  That means the value of this isn't zero, but it's not
a no-brainer massive win either.  That's why I think asking what we've
invested vs the return isn't a useless exercise.

> We already enable -Wimplicit-fallthrough globally, so that's not the
> discussion. The issue is that Clang is (correctly) even more strict
> than GCC for this, so these are the remaining ones to fix for full
> Clang coverage too.
> 
> People have spent more time debating this already than it would have
> taken to apply the patches. :)

You mean we've already spent 90% of the effort to come this far so we
might as well go the remaining 10% because then at least we get some
return? It's certainly a clinching argument in defence procurement ...

> This is about robustness and language wrangling. It's a big code-
> base, and this is the price of our managing technical debt for
> permanent robustness improvements. (The numbers I ran from Gustavo's
> earlier patches were that about 10% of the places adjusted were
> identified as legitimate bugs being fixed. This final series may be
> lower, but there are still bugs being found from it -- we need to
> finish this and shut the door on it for good.)

I got my six patches by analyzing the lwn.net report of the fixes that
was cited which had 21 of which 50% didn't actually change the emitted
code, and 25% didn't have a user visible effect.

But the broader point I'm making is just because the compiler people
come up with a shiny new warning doesn't necessarily mean the problem
it's detecting is one that causes us actual problems in the code base. 
I'd really be happier if we had a theory about what classes of CVE or
bug we could eliminate before we embrace the next new warning.

James



___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-23 Thread James Bottomley
On Mon, 2020-11-23 at 19:56 +0100, Miguel Ojeda wrote:
> On Mon, Nov 23, 2020 at 4:58 PM James Bottomley
>  wrote:
> > Well, I used git.  It says that as of today in Linus' tree we have
> > 889 patches related to fall throughs and the first series went in
> > in october 2017 ... ignoring a couple of outliers back to February.
> 
> I can see ~10k insertions over ~1k commits and 15 years that mention
> a fallthrough in the entire repo. That is including some commits
> (like the biggest one, 960 insertions) that have nothing to do with C
> fallthrough. A single kernel release has an order of magnitude more
> changes than this...
> 
> But if we do the math, for an author, at even 1 minute per line
> change and assuming nothing can be automated at all, it would take 1
> month of work. For maintainers, a couple of trivial lines is noise
> compared to many other patches.

So you think a one line patch should take one minute to produce ... I
really don't think that's grounded in reality.  I suppose a one line
patch only takes a minute to merge with b4 if no-one reviews or tests
it, but that's not really desirable.

> In fact, this discussion probably took more time than the time it
> would take to review the 200 lines. :-)

I'm framing the discussion in terms of the whole series of changes we
have done for fall through, both what's in the tree currently (889
patches) both in terms of the produce and the consumer.  That's what I
used for my figures for cost.

> > We're also complaining about the inability to recruit maintainers:
> > 
> > https://www.theregister.com/2020/06/30/hard_to_find_linux_maintainers_says_torvalds/
> > 
> > And burn out:
> > 
> > http://antirez.com/news/129
> 
> Accepting trivial and useful 1-line patches

Part of what I'm trying to measure is the "and useful" bit because
that's not a given.

> is not what makes a voluntary maintainer quit...

so the proverb "straw which broke the camel's back" uniquely doesn't
apply to maintainers

>  Thankless work with demanding deadlines is.

That's another potential reason, but it doesn't may other reasons less
valid.

> > The whole crux of your argument seems to be maintainers' time isn't
> > important so we should accept all trivial patches
> 
> I have not said that, at all. In fact, I am a voluntary one and I
> welcome patches like this. It takes very little effort on my side to
> review and it helps the kernel overall.

Well, you know, subsystems are very different in terms of the amount of
patches a maintainer has to process per release cycle of the kernel. 
If a maintainer is close to capacity, additional patches, however
trivial, become a problem.  If a maintainer has spare cycles, trivial
patches may look easy.

> Paid maintainers are the ones that can take care of big
> features/reviews.
> 
> > What I'm actually trying to articulate is a way of measuring value
> > of the patch vs cost ... it has nothing really to do with who foots
> > the actual bill.
> 
> I understand your point, but you were the one putting it in terms of
> a junior FTE.

No, I evaluated the producer side in terms of an FTE.  What we're
mostly arguing about here is the consumer side: the maintainers and
people who have to rework their patch sets. I estimated that at 100h.

>  In my view, 1 month-work (worst case) is very much worth
> removing a class of errors from a critical codebase.
> 
> > One thesis I'm actually starting to formulate is that this
> > continual devaluing of maintainers is why we have so much
> > difficulty keeping and recruiting them.
> 
> That may very well be true, but I don't feel anybody has devalued
> maintainers in this discussion.

You seem to be saying that because you find it easy to merge trivial
patches, everyone should.  I'm reminded of a friend long ago who
thought being a Tees River Pilot was a sinecure because he could
navigate the Tees blindfold.  What he forgot, of course, is that just
because it's easy with a trawler doesn't mean it's easy with an oil
tanker.  In fact it takes longer to qualify as a Tees River Pilot than
it does to get a PhD.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [Intel-wired-lan] [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-23 Thread James Bottomley
On Mon, 2020-11-23 at 07:03 -0600, Gustavo A. R. Silva wrote:
> On Sun, Nov 22, 2020 at 11:53:55AM -0800, James Bottomley wrote:
> > On Sun, 2020-11-22 at 11:22 -0800, Joe Perches wrote:
> > > On Sun, 2020-11-22 at 11:12 -0800, James Bottomley wrote:
> > > > On Sun, 2020-11-22 at 10:25 -0800, Joe Perches wrote:
> > > > > On Sun, 2020-11-22 at 10:21 -0800, James Bottomley wrote:
> > > > > > Please tell me our reward for all this effort isn't a
> > > > > > single missing error print.
> > > > > 
> > > > > There were quite literally dozens of logical defects found
> > > > > by the fallthrough additions.  Very few were logging only.
> > > > 
> > > > So can you give us the best examples (or indeed all of them if
> > > > someone is keeping score)?  hopefully this isn't a US election
> > > > situation ...
> > > 
> > > Gustavo?  Are you running for congress now?
> > > 
> > > https://lwn.net/Articles/794944/
> > 
> > That's 21 reported fixes of which about 50% seem to produce no
> > change in code behaviour at all, a quarter seem to have no user
> > visible effect with the remaining quarter producing unexpected
> > errors on obscure configuration parameters, which is why no-one
> > really noticed them before.
> 
> The really important point here is the number of bugs this has
> prevented and will prevent in the future. See an example of this,
> below:
> 
> https://lore.kernel.org/linux-iio/20190813135802.gb27...@kroah.com/

I think this falls into the same category as the other six bugs: it
changes the output/input for parameters but no-one has really noticed,
usually because the command is obscure or the bias effect is minor.

> This work is still relevant, even if the total number of issues/bugs
> we find in the process is zero (which is not the case).

Really, no ... something which produces no improvement has no value at
all ... we really shouldn't be wasting maintainer time with it because
it has a cost to merge.  I'm not sure we understand where the balance
lies in value vs cost to merge but I am confident in the zero value
case.

> "The sucky thing about doing hard work to deploy hardening is that
> the result is totally invisible by definition (things not happening)
> [..]"
> - Dmitry Vyukov

Really, no.  Something that can't be measured at all doesn't exist.

And actually hardening is one of those things you can measure (which I
do have to admit isn't true for everything in the security space) ...
it's number of exploitable bugs found before you did it vs number of
exploitable bugs found after you did it.  Usually hardening eliminates
a class of bug, so the way I've measured hardening before is to go
through the CVE list for the last couple of years for product X, find
all the bugs that are of the class we're looking to eliminate and say
if we had hardened X against this class of bug we'd have eliminated Y%
of the exploits.  It can be quite impressive if Y is a suitably big
number.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-23 Thread James Bottomley
On Mon, 2020-11-23 at 15:19 +0100, Miguel Ojeda wrote:
> On Sun, Nov 22, 2020 at 11:36 PM James Bottomley
>  wrote:
> > Well, it seems to be three years of someone's time plus the
> > maintainer review time and series disruption of nearly a thousand
> > patches.  Let's be conservative and assume the producer worked
> > about 30% on the series and it takes about 5-10 minutes per patch
> > to review, merge and for others to rework existing series.  So
> > let's say it's cost a person year of a relatively junior engineer
> > producing the patches and say 100h of review and application
> > time.  The latter is likely the big ticket item because it's what
> > we have in least supply in the kernel (even though it's 20x vs the
> > producer time).
> 
> How are you arriving at such numbers? It is a total of ~200 trivial
> lines.

Well, I used git.  It says that as of today in Linus' tree we have 889
patches related to fall throughs and the first series went in in
october 2017 ... ignoring a couple of outliers back to February.

> > It's not about the risk of the changes it's about the cost of
> > implementing them.  Even if you discount the producer time (which
> > someone gets to pay for, and if I were the engineering manager, I'd
> > be unhappy about), the review/merge/rework time is pretty
> > significant in exchange for six minor bug fixes.  Fine, when a new
> > compiler warning comes along it's certainly reasonable to see if we
> > can benefit from it and the fact that the compiler people think
> > it's worthwhile is enough evidence to assume this initially.  But
> > at some point you have to ask whether that assumption is supported
> > by the evidence we've accumulated over the time we've been using
> > it.  And if the evidence doesn't support it perhaps it is time to
> > stop the experiment.
> 
> Maintainers routinely review 1-line trivial patches, not to mention
> internal API changes, etc.

We're also complaining about the inability to recruit maintainers:

https://www.theregister.com/2020/06/30/hard_to_find_linux_maintainers_says_torvalds/

And burn out:

http://antirez.com/news/129

The whole crux of your argument seems to be maintainers' time isn't
important so we should accept all trivial patches ... I'm pushing back
on that assumption in two places, firstly the valulessness of the time
and secondly that all trivial patches are valuable.

> If some company does not want to pay for that, that's fine, but they
> don't get to be maintainers and claim `Supported`.

What I'm actually trying to articulate is a way of measuring value of
the patch vs cost ... it has nothing really to do with who foots the
actual bill.

One thesis I'm actually starting to formulate is that this continual
devaluing of maintainers is why we have so much difficulty keeping and
recruiting them.

James



___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-22 Thread James Bottomley
On Mon, 2020-11-23 at 09:54 +1100, Finn Thain wrote:
> But is anyone keeping score of the regressions? If unreported bugs
> count, what about unreported regressions?

Well, I was curious about the former (obviously no tool will tell me
about the latter), so I asked git what patches had a fall-through
series named in a fixes tag and these three popped out:

9cf51446e686 bpf, powerpc: Fix misuse of fallthrough in bpf_jit_comp()
6a9dc5fd6170 lib: Revert use of fallthrough pseudo-keyword in lib/
91dbd73a1739 mips/oprofile: Fix fallthrough placement

I don't think any of these is fixing a significant problem, but they
did cause someone time and trouble to investigate.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-22 Thread James Bottomley
On Sun, 2020-11-22 at 21:35 +0100, Miguel Ojeda wrote:
> On Sun, Nov 22, 2020 at 7:22 PM James Bottomley
>  wrote:
> > Well, it's a problem in an error leg, sure, but it's not a really
> > compelling reason for a 141 patch series, is it?  All that fixing
> > this error will do is get the driver to print "oh dear there's a
> > problem" under four more conditions than it previously did.
> > 
> > We've been at this for three years now with nearly a thousand
> > patches, firstly marking all the fall throughs with /* fall through
> > */ and later changing it to fallthrough.  At some point we do have
> > to ask if the effort is commensurate with the protection
> > afforded.  Please tell me our reward for all this effort isn't a
> > single missing error print.
> 
> It isn't that much effort, isn't it?

Well, it seems to be three years of someone's time plus the maintainer
review time and series disruption of nearly a thousand patches.  Let's
be conservative and assume the producer worked about 30% on the series
and it takes about 5-10 minutes per patch to review, merge and for
others to rework existing series.  So let's say it's cost a person year
of a relatively junior engineer producing the patches and say 100h of
review and application time.  The latter is likely the big ticket item
because it's what we have in least supply in the kernel (even though
it's 20x vs the producer time).

>  Plus we need to take into account the future mistakes that it might
> prevent, too. So even if there were zero problems found so far, it is
> still a positive change.

Well, the question I was asking is if it's worth the cost which I've
tried to outline above.

> I would agree if these changes were high risk, though; but they are
> almost trivial.

It's not about the risk of the changes it's about the cost of
implementing them.  Even if you discount the producer time (which
someone gets to pay for, and if I were the engineering manager, I'd be
unhappy about), the review/merge/rework time is pretty significant in
exchange for six minor bug fixes.  Fine, when a new compiler warning
comes along it's certainly reasonable to see if we can benefit from it
and the fact that the compiler people think it's worthwhile is enough
evidence to assume this initially.  But at some point you have to ask
whether that assumption is supported by the evidence we've accumulated
over the time we've been using it.  And if the evidence doesn't support
it perhaps it is time to stop the experiment.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-22 Thread James Bottomley
On Sun, 2020-11-22 at 11:22 -0800, Joe Perches wrote:
> On Sun, 2020-11-22 at 11:12 -0800, James Bottomley wrote:
> > On Sun, 2020-11-22 at 10:25 -0800, Joe Perches wrote:
> > > On Sun, 2020-11-22 at 10:21 -0800, James Bottomley wrote:
> > > > Please tell me our reward for all this effort isn't a single
> > > > missing error print.
> > > 
> > > There were quite literally dozens of logical defects found
> > > by the fallthrough additions.  Very few were logging only.
> > 
> > So can you give us the best examples (or indeed all of them if
> > someone is keeping score)?  hopefully this isn't a US election
> > situation ...
> 
> Gustavo?  Are you running for congress now?
> 
> https://lwn.net/Articles/794944/

That's 21 reported fixes of which about 50% seem to produce no change
in code behaviour at all, a quarter seem to have no user visible effect
with the remaining quarter producing unexpected errors on obscure
configuration parameters, which is why no-one really noticed them
before.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-22 Thread James Bottomley
On Sun, 2020-11-22 at 10:25 -0800, Joe Perches wrote:
> On Sun, 2020-11-22 at 10:21 -0800, James Bottomley wrote:
> > Please tell me our reward for all this effort isn't a single
> > missing error print.
> 
> There were quite literally dozens of logical defects found
> by the fallthrough additions.  Very few were logging only.

So can you give us the best examples (or indeed all of them if someone
is keeping score)?  hopefully this isn't a US election situation ...

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-22 Thread James Bottomley
On Sun, 2020-11-22 at 08:17 -0800, Kees Cook wrote:
> On Fri, Nov 20, 2020 at 11:51:42AM -0800, Jakub Kicinski wrote:
> > On Fri, 20 Nov 2020 11:30:40 -0800 Kees Cook wrote:
> > > On Fri, Nov 20, 2020 at 10:53:44AM -0800, Jakub Kicinski wrote:
> > > > On Fri, 20 Nov 2020 12:21:39 -0600 Gustavo A. R. Silva wrote:  
> > > > > This series aims to fix almost all remaining fall-through
> > > > > warnings in order to enable -Wimplicit-fallthrough for Clang.
> > > > > 
> > > > > In preparation to enable -Wimplicit-fallthrough for Clang,
> > > > > explicitly add multiple break/goto/return/fallthrough
> > > > > statements instead of just letting the code fall through to
> > > > > the next case.
> > > > > 
> > > > > Notice that in order to enable -Wimplicit-fallthrough for
> > > > > Clang, this change[1] is meant to be reverted at some point.
> > > > > So, this patch helps to move in that direction.
> > > > > 
> > > > > Something important to mention is that there is currently a
> > > > > discrepancy between GCC and Clang when dealing with switch
> > > > > fall-through to empty case statements or to cases that only
> > > > > contain a break/continue/return statement[2][3][4].  
> > > > 
> > > > Are we sure we want to make this change? Was it discussed
> > > > before?
> > > > 
> > > > Are there any bugs Clangs puritanical definition of fallthrough
> > > > helped find?
> > > > 
> > > > IMVHO compiler warnings are supposed to warn about issues that
> > > > could be bugs. Falling through to default: break; can hardly be
> > > > a bug?!  
> > > 
> > > It's certainly a place where the intent is not always clear. I
> > > think this makes all the cases unambiguous, and doesn't impact
> > > the machine code, since the compiler will happily optimize away
> > > any behavioral redundancy.
> > 
> > If none of the 140 patches here fix a real bug, and there is no
> > change to machine code then it sounds to me like a W=2 kind of a
> > warning.
> 
> FWIW, this series has found at least one bug so far:
> https://lore.kernel.org/lkml/CAFCwf11izHF=g1mGry1fE5kvFFFrxzhPSM6qKAO8gxSp=kr...@mail.gmail.com/


Well, it's a problem in an error leg, sure, but it's not a really
compelling reason for a 141 patch series, is it?  All that fixing this
error will do is get the driver to print "oh dear there's a problem"
under four more conditions than it previously did.

We've been at this for three years now with nearly a thousand patches,
firstly marking all the fall throughs with /* fall through */ and later
changing it to fallthrough.  At some point we do have to ask if the
effort is commensurate with the protection afforded.  Please tell me
our reward for all this effort isn't a single missing error print.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC] MAINTAINERS tag for cleanup robot

2020-11-22 Thread James Bottomley
On Sun, 2020-11-22 at 08:10 -0800, Tom Rix wrote:
> On 11/22/20 6:56 AM, Matthew Wilcox wrote:
> > On Sun, Nov 22, 2020 at 06:46:46AM -0800, Tom Rix wrote:
> > > On 11/21/20 7:23 PM, Matthew Wilcox wrote:
> > > > On Sat, Nov 21, 2020 at 08:50:58AM -0800, t...@redhat.com
> > > > wrote:
> > > > > The fixer review is
> > > > > https://reviews.llvm.org/D91789
> > > > > 
> > > > > A run over allyesconfig for x86_64 finds 62 issues, 5 are
> > > > > false positives. The false positives are caused by macros
> > > > > passed to other macros and by some macro expansions that did
> > > > > not have an extra semicolon.
> > > > > 
> > > > > This cleans up about 1,000 of the current 10,000 -Wextra-
> > > > > semi-stmt warnings in linux-next.
> > > > Are any of them not false-positives?  It's all very well to
> > > > enable stricter warnings, but if they don't fix any bugs,
> > > > they're just churn.
> > > > 
> > > While enabling additional warnings may be a side effect of this
> > > effort
> > > 
> > > the primary goal is to set up a cleaning robot. After that a
> > > refactoring robot.
> > Why do we need such a thing?  Again, it sounds like more churn.
> > It's really annoying when I'm working on something important that
> > gets derailed by pointless churn.  Churn also makes it harder to
> > backport patches to earlier kernels.
> > 
> A refactoring example on moving to treewide, consistent use of a new
> api may help.
> 
> Consider
> 
> 2efc459d06f1630001e3984854848a5647086232
> 
> sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output
> 
> A new api for printing in the sysfs.  How do we use it treewide ?
> 
> Done manually, it would be a heroic effort requiring high level
> maintainers pushing and likely only get partially done.
> 
> If a refactoring programatic fixit is done and validated on a one
> subsystem, it can run on all the subsystems.
> 
> The effort is a couple of weeks to write and validate the fixer,
> hours to run over the tree.
> 
> It won't be perfect but will be better than doing it manually.

Here's a thought: perhaps we don't.  sysfs_emit isn't a "new api" its a
minor rewrap of existing best practice.  The damage caused by the churn
of forcing its use everywhere would far outweigh any actual benefit
because pretty much every bug in this area has already been caught and
killed by existing tools.  We can enforce sysfs_emit going forwards
using tools like checkpatch but there's no benefit and a lot of harm to
be done by trying to churn the entire tree retrofitting it (both in
terms of review time wasted as well as patch series derailed).

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC] MAINTAINERS tag for cleanup robot

2020-11-21 Thread James Bottomley
On Sat, 2020-11-21 at 08:50 -0800, t...@redhat.com wrote:
> A difficult part of automating commits is composing the subsystem
> preamble in the commit log.  For the ongoing effort of a fixer
> producing
> one or two fixes a release the use of 'treewide:' does not seem
> appropriate.
> 
> It would be better if the normal prefix was used.  Unfortunately
> normal is
> not consistent across the tree.
> 
> 
>   D: Commit subsystem prefix
> 
> ex/ for FPGA DFL DRIVERS
> 
>   D: fpga: dfl:
> 

I've got to bet this is going to cause more issues than it solves. 
SCSI uses scsi: : for drivers but not every driver has a
MAINTAINERS entry.  We use either scsi: or scsi: core: for mid layer
things, but we're not consistent.  Block uses blk-: for all
of it's stuff but almost no s have a MAINTAINERS entry.  So
the next thing you're going to cause is an explosion of suggested
MAINTAINERs entries.

Has anyone actually complained about treewide:?

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [Ocfs2-devel] [RFC] treewide: cleanup unreachable breaks

2020-10-18 Thread James Bottomley
On Sun, 2020-10-18 at 20:16 +0100, Matthew Wilcox wrote:
> On Sun, Oct 18, 2020 at 12:13:35PM -0700, James Bottomley wrote:
> > On Sun, 2020-10-18 at 19:59 +0100, Matthew Wilcox wrote:
> > > On Sat, Oct 17, 2020 at 09:09:28AM -0700, t...@redhat.com wrote:
> > > > clang has a number of useful, new warnings see
> > > > https://urldefense.com/v3/__https://clang.llvm.org/docs/DiagnosticsReference.html__;!!GqivPVa7Brio!Krxz78O3RKcB9JBMVo_F98FupVhj_jxX60ddN6tKGEbv_cnooXc1nnBmchm-e_O9ieGnyQ$
> > > >  
> > > 
> > > Please get your IT department to remove that stupidity.  If you
> > > can't, please send email from a non-Red Hat email address.
> > 
> > Actually, the problem is at Oracle's end somewhere in the ocfs2
> > list ... if you could fix it, that would be great.  The usual real
> > mailing lists didn't get this transformation
> > 
> > https://lore.kernel.org/bpf/20201017160928.12698-1-t...@redhat.com/
> > 
> > but the ocfs2 list archive did:
> > 
> > https://oss.oracle.com/pipermail/ocfs2-devel/2020-October/015330.html
> > 
> > I bet Oracle IT has put some spam filter on the list that mangles
> > URLs this way.
> 
> *sigh*.  I'm sure there's a way.  I've raised it with someone who
> should be able to fix it.

As someone who works for IBM I can only say I feel your pain ...

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [Ocfs2-devel] [RFC] treewide: cleanup unreachable breaks

2020-10-18 Thread James Bottomley
On Sun, 2020-10-18 at 19:59 +0100, Matthew Wilcox wrote:
> On Sat, Oct 17, 2020 at 09:09:28AM -0700, t...@redhat.com wrote:
> > clang has a number of useful, new warnings see
> > https://urldefense.com/v3/__https://clang.llvm.org/docs/DiagnosticsReference.html__;!!GqivPVa7Brio!Krxz78O3RKcB9JBMVo_F98FupVhj_jxX60ddN6tKGEbv_cnooXc1nnBmchm-e_O9ieGnyQ$
> >  
> 
> Please get your IT department to remove that stupidity.  If you
> can't, please send email from a non-Red Hat email address.

Actually, the problem is at Oracle's end somewhere in the ocfs2 list
... if you could fix it, that would be great.  The usual real mailing
lists didn't get this transformation

https://lore.kernel.org/bpf/20201017160928.12698-1-t...@redhat.com/

but the ocfs2 list archive did:

https://oss.oracle.com/pipermail/ocfs2-devel/2020-October/015330.html

I bet Oracle IT has put some spam filter on the list that mangles URLs
this way.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-09 Thread James Bottomley
On Fri, 2020-10-09 at 14:34 -0700, Eric Biggers wrote:
> On Fri, Oct 09, 2020 at 12:49:57PM -0700, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > The kmap() calls in this FS are localized to a single thread.  To
> > avoid the over head of global PKRS updates use the new
> > kmap_thread() call.
> > 
> > Cc: Jaegeuk Kim 
> > Cc: Chao Yu 
> > Signed-off-by: Ira Weiny 
> > ---
> >  fs/f2fs/f2fs.h | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> > index d9e52a7f3702..ff72a45a577e 100644
> > --- a/fs/f2fs/f2fs.h
> > +++ b/fs/f2fs/f2fs.h
> > @@ -2410,12 +2410,12 @@ static inline struct page
> > *f2fs_pagecache_get_page(
> >  
> >  static inline void f2fs_copy_page(struct page *src, struct page
> > *dst)
> >  {
> > -   char *src_kaddr = kmap(src);
> > -   char *dst_kaddr = kmap(dst);
> > +   char *src_kaddr = kmap_thread(src);
> > +   char *dst_kaddr = kmap_thread(dst);
> >  
> > memcpy(dst_kaddr, src_kaddr, PAGE_SIZE);
> > -   kunmap(dst);
> > -   kunmap(src);
> > +   kunmap_thread(dst);
> > +   kunmap_thread(src);
> >  }
> 
> Wouldn't it make more sense to switch cases like this to
> kmap_atomic()?
> The pages are only mapped to do a memcpy(), then they're immediately
> unmapped.

On a VIPT/VIVT architecture, this is horrendously wasteful.  You're
taking something that was mapped at colour c_src mapping it to a new
address src_kaddr, which is likely a different colour and necessitates
flushing the original c_src, then you copy it to dst_kaddr, which is
also likely a different colour from c_dst, so dst_kaddr has to be
flushed on kunmap and c_dst has to be invalidated on kmap.  What we
should have is an architectural primitive for doing this, something
like kmemcopy_arch(dst, src).  PIPT architectures can implement it as
the above (possibly losing kmap if they don't need it) but VIPT/VIVT
architectures can set up a correctly coloured mapping so they can
simply copy from c_src to c_dst without any need to flush and the data
arrives cache hot at c_dst.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] block: convert tasklets to use new tasklet_setup() API

2020-08-19 Thread James Bottomley
On Wed, 2020-08-19 at 21:54 +0530, Allen wrote:
> > [...]
> > > > Since both threads seem to have petered out, let me suggest in
> > > > kernel.h:
> > > > 
> > > > #define cast_out(ptr, container, member) \
> > > > container_of(ptr, typeof(*container), member)
> > > > 
> > > > It does what you want, the argument order is the same as
> > > > container_of with the only difference being you name the
> > > > containing structure instead of having to specify its type.
> > > 
> > > Not to incessantly bike shed on the naming, but I don't like
> > > cast_out, it's not very descriptive. And it has connotations of
> > > getting rid of something, which isn't really true.
> > 
> > Um, I thought it was exactly descriptive: you're casting to the
> > outer container.  I thought about following the C++ dynamic casting
> > style, so out_cast(), but that seemed a bit pejorative.  What about
> > outer_cast()?
> > 
> > > FWIW, I like the from_ part of the original naming, as it has
> > > some clues as to what is being done here. Why not just
> > > from_container()? That should immediately tell people what it
> > > does without having to look up the implementation, even before
> > > this becomes a part of the accepted coding norm.
> > 
> > I'm not opposed to container_from() but it seems a little less
> > descriptive than outer_cast() but I don't really care.  I always
> > have to look up container_of() when I'm using it so this would just
> > be another macro of that type ...
> > 
> 
>  So far we have a few which have been suggested as replacement
> for from_tasklet()
> 
> - out_cast() or outer_cast()
> - from_member().
> - container_from() or from_container()
> 
> from_container() sounds fine, would trimming it a bit work? like
> from_cont().

I'm fine with container_from().  It's the same form as container_of()
and I think we need urgent agreement to not stall everything else so
the most innocuous name is likely to get the widest acceptance.

James

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] block: convert tasklets to use new tasklet_setup() API

2020-08-19 Thread James Bottomley
On Wed, 2020-08-19 at 07:00 -0600, Jens Axboe wrote:
> On 8/18/20 1:00 PM, James Bottomley wrote:
[...]
> > Since both threads seem to have petered out, let me suggest in
> > kernel.h:
> > 
> > #define cast_out(ptr, container, member) \
> > container_of(ptr, typeof(*container), member)
> > 
> > It does what you want, the argument order is the same as
> > container_of with the only difference being you name the containing
> > structure instead of having to specify its type.
> 
> Not to incessantly bike shed on the naming, but I don't like
> cast_out, it's not very descriptive. And it has connotations of
> getting rid of something, which isn't really true.

Um, I thought it was exactly descriptive: you're casting to the outer
container.  I thought about following the C++ dynamic casting style, so
out_cast(), but that seemed a bit pejorative.  What about outer_cast()?

> FWIW, I like the from_ part of the original naming, as it has some
> clues as to what is being done here. Why not just from_container()?
> That should immediately tell people what it does without having to
> look up the implementation, even before this becomes a part of the
> accepted coding norm.

I'm not opposed to container_from() but it seems a little less
descriptive than outer_cast() but I don't really care.  I always have
to look up container_of() when I'm using it so this would just be
another macro of that type ...

James

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] block: convert tasklets to use new tasklet_setup() API

2020-08-18 Thread James Bottomley
On Tue, 2020-08-18 at 13:10 -0700, Kees Cook wrote:
> On Tue, Aug 18, 2020 at 01:00:33PM -0700, James Bottomley wrote:
> > On Mon, 2020-08-17 at 13:02 -0700, Jens Axboe wrote:
> > > On 8/17/20 12:48 PM, Kees Cook wrote:
> > > > On Mon, Aug 17, 2020 at 12:44:34PM -0700, Jens Axboe wrote:
> > > > > On 8/17/20 12:29 PM, Kees Cook wrote:
> > > > > > On Mon, Aug 17, 2020 at 06:56:47AM -0700, Jens Axboe wrote:
> > > > > > > On 8/17/20 2:15 AM, Allen Pais wrote:
> > > > > > > > From: Allen Pais 
> > > > > > > > 
> > > > > > > > In preparation for unconditionally passing the
> > > > > > > > struct tasklet_struct pointer to all tasklet
> > > > > > > > callbacks, switch to using the new tasklet_setup()
> > > > > > > > and from_tasklet() to pass the tasklet pointer
> > > > > > > > explicitly.
> > > > > > > 
> > > > > > > Who came up with the idea to add a macro 'from_tasklet'
> > > > > > > that
> > > > > > > is just container_of? container_of in the code would be
> > > > > > > _much_ more readable, and not leave anyone guessing wtf
> > > > > > > from_tasklet is doing.
> > > > > > > 
> > > > > > > I'd fix that up now before everything else goes in...
> > > > > > 
> > > > > > As I mentioned in the other thread, I think this makes
> > > > > > things
> > > > > > much more readable. It's the same thing that the
> > > > > > timer_struct
> > > > > > conversion did (added a container_of wrapper) to avoid the
> > > > > > ever-repeating use of typeof(), long lines, etc.
> > > > > 
> > > > > But then it should use a generic name, instead of each sub-
> > > > > system 
> > > > > using some random name that makes people look up exactly what
> > > > > it
> > > > > does. I'm not huge fan of the container_of() redundancy, but
> > > > > adding private variants of this doesn't seem like the best
> > > > > way
> > > > > forward. Let's have a generic helper that does this, and use
> > > > > it
> > > > > everywhere.
> > > > 
> > > > I'm open to suggestions, but as things stand, these kinds of
> > > > treewide
> > > 
> > > On naming? Implementation is just as it stands, from_tasklet() is
> > > totally generic which is why I objected to it. from_member()? Not
> > > great with naming... But I can see this going further and then
> > > we'll
> > > suddenly have tons of these. It's not good for readability.
> > 
> > Since both threads seem to have petered out, let me suggest in
> > kernel.h:
> > 
> > #define cast_out(ptr, container, member) \
> > container_of(ptr, typeof(*container), member)
> > 
> > It does what you want, the argument order is the same as
> > container_of with the only difference being you name the containing
> > structure instead of having to specify its type.
> 
> I like this! Shall I send this to Linus to see if this can land in
> -rc2 for use going forward?

Sure ... he's probably been lurking on this thread anyway ... it's
about time he got off his arse^Wthe fence and made an executive
decision ...

James

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] block: convert tasklets to use new tasklet_setup() API

2020-08-18 Thread James Bottomley
On Mon, 2020-08-17 at 13:02 -0700, Jens Axboe wrote:
> On 8/17/20 12:48 PM, Kees Cook wrote:
> > On Mon, Aug 17, 2020 at 12:44:34PM -0700, Jens Axboe wrote:
> > > On 8/17/20 12:29 PM, Kees Cook wrote:
> > > > On Mon, Aug 17, 2020 at 06:56:47AM -0700, Jens Axboe wrote:
> > > > > On 8/17/20 2:15 AM, Allen Pais wrote:
> > > > > > From: Allen Pais 
> > > > > > 
> > > > > > In preparation for unconditionally passing the
> > > > > > struct tasklet_struct pointer to all tasklet
> > > > > > callbacks, switch to using the new tasklet_setup()
> > > > > > and from_tasklet() to pass the tasklet pointer explicitly.
> > > > > 
> > > > > Who came up with the idea to add a macro 'from_tasklet' that
> > > > > is just container_of? container_of in the code would be
> > > > > _much_ more readable, and not leave anyone guessing wtf
> > > > > from_tasklet is doing.
> > > > > 
> > > > > I'd fix that up now before everything else goes in...
> > > > 
> > > > As I mentioned in the other thread, I think this makes things
> > > > much more readable. It's the same thing that the timer_struct
> > > > conversion did (added a container_of wrapper) to avoid the
> > > > ever-repeating use of typeof(), long lines, etc.
> > > 
> > > But then it should use a generic name, instead of each sub-system 
> > > using some random name that makes people look up exactly what it
> > > does. I'm not huge fan of the container_of() redundancy, but
> > > adding private variants of this doesn't seem like the best way
> > > forward. Let's have a generic helper that does this, and use it
> > > everywhere.
> > 
> > I'm open to suggestions, but as things stand, these kinds of
> > treewide
> 
> On naming? Implementation is just as it stands, from_tasklet() is
> totally generic which is why I objected to it. from_member()? Not
> great with naming... But I can see this going further and then we'll
> suddenly have tons of these. It's not good for readability.

Since both threads seem to have petered out, let me suggest in
kernel.h:

#define cast_out(ptr, container, member) \
container_of(ptr, typeof(*container), member)

It does what you want, the argument order is the same as container_of
with the only difference being you name the containing structure
instead of having to specify its type.

James

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH RFC 00/15] Zero ****s, hugload of hugs <3

2018-11-30 Thread James Bottomley
On Fri, 2018-11-30 at 14:26 -0800, Jarkko Sakkinen wrote:
> On Fri, Nov 30, 2018 at 03:14:59PM -0700, Jonathan Corbet wrote:
[...]
> > Have you read Documentation/process/code-of-conduct-
> > interpretation.rst? 
> > As has been pointed out, it contains a clear answer to how things
> > should be interpreted here.
> 
> Ugh, was not aware that there two documents.
> 
> Yeah, definitely sheds light. Why the documents could not be merged
> to single common sense code of conduct?

The fact that we've arrived at essentially an original CoC
reinterpreted to the point where it's effectively a new CoC has been
the source of much debate and recrimination over the last few months
... you can read it in the ksummit-discuss archives, but I really think
we don't want to reopen that can of worms.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH RFC 00/15] Zero ****s, hugload of hugs <3

2018-11-30 Thread James Bottomley
On Fri, 2018-11-30 at 14:12 -0800, Jarkko Sakkinen wrote:
[...]
> I pasted this already to another response and this was probably the
> part that ignited me to send the patch set (was a few days ago, so
> had to revisit to find the exact paragraph):

I replied in to the other thread.

> "Maintainers have the right and responsibility to remove, edit, or
> reject comments, commits, code, wiki edits, issues, and other
> contributions that are not aligned to this Code of Conduct, or to ban
> temporarily or permanently any contributor for other behaviors that
> they deem inappropriate, threatening, offensive, or harmful."
> 
> The whole patch set is neither a joke/troll nor something I would
> necessarily want to be include myself. It does have the RFC tag.
> 
> As a maintainer myself (and based on somewhat disturbed feedback from
> other maintainers) I can only make the conclusion that nobody knows
> what the responsibility part here means.
> 
> I would interpret, if I read it like at lawyer at least, that even
> for existing code you would need to do the changes postmorterm.

That's wrong in the light of the interpretation document, yes.

> Is this wrong interpretation?  Should I conclude that I made a
> mistake by reading the CoC and trying to understand what it
> *actually* says?

You can't read it in isolation, you need to read it along with the
interpretation document.  The latter was created precisely because
there was a lot of push back on interpretation problems and ambiguities
with the original CoC and it specifically covers this case (and a lot
of others).

James


> After this discussion, I can say that I understand it less than
> before.
> 
> /Jarkko
> 

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH RFC 00/15] Zero ****s, hugload of hugs <3

2018-11-30 Thread James Bottomley
On Fri, 2018-11-30 at 13:54 -0800, Jarkko Sakkinen wrote:
> On Fri, Nov 30, 2018 at 01:48:08PM -0800, David Miller wrote:
> > From: Jarkko Sakkinen 
> > Date: Fri, 30 Nov 2018 13:44:05 -0800
> > 
> > > On Fri, Nov 30, 2018 at 01:01:02PM -0800, James Bottomley wrote:
> > > > No because use of what some people consider to be bad language
> > > > isn't
> > > > necessarily abusive, offensive or degrading.  Our most heavily
> > > > censored
> > > > medium is TV and "fuck" is now considered acceptable in certain
> > > > contexts on most channels in the UK and EU.
> > > 
> > > This makes following the CoC extremely hard to a non-native
> > > speaker as
> > > it is not too explicit on what is OK and what is not. I did
> > > through the
> > > whole thing with an eye glass and this what I deduced from it.
> > 
> > It would be helpful if you could explain what part of the language
> > is unclear wrt. explaining how CoC does not apply to existing code.
> > 
> > That part seems very explicit to me.
> 
> Well, now that I re-read it, it was this part to be exact:
> 
> "Maintainers have the right and responsibility to remove, edit, or
> reject comments, commits, code, wiki edits, issues, and other
> contributions that are not aligned to this Code of Conduct, or to ban
> temporarily or permanently any contributor for other behaviors that
> they deem inappropriate, threatening, offensive, or harmful."
> 
> How this should be interpreted?

Firstly, this is *only* about contributions going forward.  The
interpretation document says we don't have to look back into the
repository and we definitely can't remove something from git that's
already been committed.

As a Maintainer, if you feel bad language is inappropriate for your
subsystem then you say so and reject with that reason patches that come
in containing it (suggesting alternative rewordings).  However, your
determination about what is or isn't acceptable in your subsystem isn't
binding on other maintainers, who may have different standards ... this
is identical to what we do with coding, like your insistence on one
line per variable or other subsystem's insistence on reverse christmas
tree for includes ...

James


> I have not really followed the previous CoC discussions as I try to
> always use polite language so I'm sorry if this duplicate.
> 
> /Jarkko
> 

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH RFC 00/15] Zero ****s, hugload of hugs <3

2018-11-30 Thread James Bottomley
On Fri, 2018-11-30 at 13:44 -0800, Jarkko Sakkinen wrote:
> On Fri, Nov 30, 2018 at 01:01:02PM -0800, James Bottomley wrote:
> > No because use of what some people consider to be bad language
> > isn't necessarily abusive, offensive or degrading.  Our most
> > heavily censored medium is TV and "fuck" is now considered
> > acceptable in certain contexts on most channels in the UK and EU.
> 
> This makes following the CoC extremely hard to a non-native speaker
> as it is not too explicit on what is OK and what is not. I did
> through the whole thing with an eye glass and this what I deduced
> from it.

OK, so something that would simply be considered in some quarters as
bad language isn't explicitly banned.  The thing which differentiates
simple bad language from "abusive, offensive or degrading language",
which is called out by the CoC, is the context and the target.

So when it's a simple expletive or the code of the author or even the
hardware is the target, I'd say it's an easy determination it's not a
CoC violation.  If someone else's code is the target or the inventor of
the hardware is targetted by name, I'd say it is.  Even non-native
English speakers should be able to determine target and context,
because that's the essence of meaning.

James

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH RFC 00/15] Zero ****s, hugload of hugs <3

2018-11-30 Thread James Bottomley
On Fri, 2018-11-30 at 12:55 -0800, Jarkko Sakkinen wrote:
> On Fri, Nov 30, 2018 at 11:56:52AM -0800, Davidlohr Bueso wrote:
> > On Fri, 30 Nov 2018, Kees Cook wrote:
> > 
> > > On Fri, Nov 30, 2018 at 11:27 AM Jarkko Sakkinen
> > >  wrote:
> > > > 
> > > > In order to comply with the CoC, replace  with a hug.
> > 
> > I hope this is some kind of joke. How would anyone get offended by
> > reading technical comments? This is all beyond me...
> 
> Well... Not a joke really but more like conversation starter :-)
> 
> I had 10h flight from Amsterdam to Portland and one of the things
> that I did was to read the new CoC properly.
> 
> This a direct quote from the CoC:
> 
> "Harassment includes the use of abusive, offensive or degrading
> language, intimidation, stalking, harassing photography or recording,
> inappropriate physical contact, sexual imagery and unwelcome sexual
> advances or requests for sexual favors."
> 
> Doesn't this fall into this category?

No because use of what some people consider to be bad language isn't
necessarily abusive, offensive or degrading.  Our most heavily censored
medium is TV and "fuck" is now considered acceptable in certain
contexts on most channels in the UK and EU.

> Your argument is not that great because you could say that from any
> LKML discussion. If you don't like hugging, please propose something
> else
> :-)

The interpretation document also says this:

   ontributions submitted for the kernel should use appropriate
   language. Content that already exists predating the Code of Conduct
   will not be addressed now as a violation.

So that definitely means there should be no hunting down of existing
comments in kernel code.

James

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


[Intel-gfx] Skylake graphics regression: projector failure with 4.8-rc3

2016-09-19 Thread James Bottomley
On Mon, 2016-09-19 at 08:09 -0700, James Bottomley wrote:
> On Sun, 2016-09-18 at 13:35 +0200, Thorsten Leemhuis wrote:
> > Hi! James & Paulo: What's the current status of this?
> 
> No, the only interaction has been the suggestion below for a revert,
> which didn't fix the problem.
> 
> >  Was this issue discussed elsewhere or even fixed in between? Just 
> > asking, because this issue is on the list of regressions for 4.8.
> 
> 
> I'm just about to try out -rc7, but it's not fixed so far.

OK, with -rc7 and the i915 fixes, there seems to be a marked
improvement.  I can no longer crash the crtc by using lots of xrandr
switches, which was the principal problem.

I've so far only got one of these in the logs

[14858.635035] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR*
CPU pipe B FIFO underrun

And the only residual problem seems to be that the monitor goes blank
periodically, but this can be fixed by switching resolution a couple of
times.

I haven't seen any of the link training errors so far and I've run
through my usual battery of be nasty to the external monitor tests.

James





[Intel-gfx] Skylake graphics regression: projector failure with 4.8-rc3

2016-09-19 Thread James Bottomley
On Sun, 2016-09-18 at 13:35 +0200, Thorsten Leemhuis wrote:
> Hi! James & Paulo: What's the current status of this?

No, the only interaction has been the suggestion below for a revert,
which didn't fix the problem.

>  Was this issue discussed elsewhere or even fixed in between? Just 
> asking, because this issue is on the list of regressions for 4.8.


I'm just about to try out -rc7, but it's not fixed so far.

James


> Ciao, Thorsten
> 
> On 01.09.2016 00:25, James Bottomley wrote:
> > On Wed, 2016-08-31 at 21:51 +, Zanoni, Paulo R wrote:
> > > Em Qua, 2016-08-31 às 14:43 -0700, James Bottomley escreveu:
> > > > On Wed, 2016-08-31 at 11:23 -0700, James Bottomley wrote:
> > > > > On Fri, 2016-08-26 at 09:10 -0400, James Bottomley wrote:
> > > > > > We seem to have an xrandr regression with skylake now. 
> > > > > >  What's
> > > > > > happening is that I can get output on to a projector, but
> > > > > > the 
> > > > > > system is losing video when I change the xrandr sessions
> > > > > > (like 
> > > > > > going from a --above b to a --same-as b).  The main screen
> > > > > > goes
> > > > > > blank, which is basically a reboot situation. 
> > > > > >  Unfortunately, I
> > > > > > can't seem to get the logs out of systemd to see if there
> > > > > > was a
> > > > > > dump to dmesg (the system was definitely responding).
> > > > > > 
> > > > > > I fell back to 4.6.2 which worked perfectly, so this is
> > > > > > definitely 
> > > > > > some sort of regression.  I'll be able to debug more fully
> > > > > > when
> > > > > > I 
> > > > > > get back home from the Linux Security Summit.
> > > > > 
> > > > > I'm home now.  Unfortunately, my monitor isn't as problematic
> > > > > as
> > > > > the
> > > > > projector, but by flipping between various modes and
> > > > > separating
> > > > > and
> > > > > overlaying the panels with --above and --same-as (xrandr), I
> > > > > can
> > > > > eventually get it to the point where the main LCD panel goes
> > > > > black 
> > > > > and can only be restarted by specifying a different mode.
> > > > > 
> > > > > This seems to be associated with these lines in the X
> > > > > 
> > > > > [ 14714.389] (EE) intel(0): failed to set mode: Invalid
> > > > > argument
> > > > > [22]
> > > > > 
> > > > > But the curious thing is that even if this fails with the
> > > > > error 
> > > > > message once, it may succeed a second time, so it looks to be
> > > > > a 
> > > > > transient error translation problem from the kernel driver.
> > > > > 
> > > > > I've attached the full log below.
> > > > > 
> > > > > This is only with a VGA output.  I currently don't have a
> > > > > HDMI 
> > > > > dongle, but I'm in the process of acquiring one.
> > > > 
> > > > After more playing around, I'm getting thousands of these in
> > > > the
> > > > kernel
> > > > log (possibly millions: the log wraps very fast):
> > > > 
> > > > [23504.873606] [drm:intel_dp_start_link_train [i915]] *ERROR*
> > > > failed
> > > > to train DP, aborting
> > > > 
> > > > And then finally it gives up with 
> > > > 
> > > > [25023.770951] [drm:intel_cpu_fifo_underrun_irq_handler [i915]]
> > > > *ERROR* CPU pipe B FIFO underrun
> > > > [25561.926075] [drm:intel_cpu_fifo_underrun_irq_handler [i915]]
> > > > *ERROR* CPU pipe A FIFO underrun
> > > > 
> > > > And the crtc for the VGA output becomes non-responsive to any
> > > > configuration command.  This requires a reboot and sometimes a
> > > > UEFI
> > > > variable reset before it comes back.
> > > 
> > > Please see this discussion:
> > > https://patchwork.freedesktop.org/patch/103237/
> > > 
> > > Do you have this patch on your tree? Does the problem go away if
> > > you
> > > revert it?
> > 
> > Yes, I've got it, it went in in 4.8-rc3 according to git:
> > 
> > commit 58e311b09c319183254d9220c50a533e7157c9ab
> > Author: Matt Roper 
> > Date:   Thu Aug 4 14:08:00 2016 -0700
> > 
> > drm/i915/gen9: Give one extra block per line for SKL plane WM
> > calculations
> > 
> > Reverting it causes the secondary display not to sync pretty much
> > at
> > all.  However, in the flickers I can see, it does work OK and
> > doesn't
> > now crash switching from --same-as to --above and back
> > 
> > I also still get the logs filling up with the link training errors.
> > 
> > On balance, although the behaviour is different, it's not an
> > improvement because if I can't sync with the projector, I can't
> > really
> > use this as a fix.
> > 
> > James
> 



[Intel-gfx] Skylake graphics regression: projector failure with 4.8-rc3

2016-08-31 Thread James Bottomley
On Wed, 2016-08-31 at 21:51 +, Zanoni, Paulo R wrote:
> Hi
> 
> Em Qua, 2016-08-31 às 14:43 -0700, James Bottomley escreveu:
> > On Wed, 2016-08-31 at 11:23 -0700, James Bottomley wrote:
> > > 
> > > On Fri, 2016-08-26 at 09:10 -0400, James Bottomley wrote:
> > > > 
> > > > We seem to have an xrandr regression with skylake now.  What's
> > > > happening is that I can get output on to a projector, but the 
> > > > system is losing video when I change the xrandr sessions (like 
> > > > going from a --above b to a --same-as b).  The main screen goes
> > > > blank, which is basically a reboot situation.  Unfortunately, I
> > > > can't seem to get the logs out of systemd to see if there was a
> > > > dump to dmesg (the system was definitely responding).
> > > > 
> > > > I fell back to 4.6.2 which worked perfectly, so this is
> > > > definitely 
> > > > some sort of regression.  I'll be able to debug more fully when
> > > > I 
> > > > get back home from the Linux Security Summit.
> > > 
> > > I'm home now.  Unfortunately, my monitor isn't as problematic as
> > > the
> > > projector, but by flipping between various modes and separating
> > > and
> > > overlaying the panels with --above and --same-as (xrandr), I can
> > > eventually get it to the point where the main LCD panel goes
> > > black 
> > > and can only be restarted by specifying a different mode.
> > > 
> > > This seems to be associated with these lines in the X
> > > 
> > > [ 14714.389] (EE) intel(0): failed to set mode: Invalid argument
> > > [22]
> > > 
> > > But the curious thing is that even if this fails with the error 
> > > message once, it may succeed a second time, so it looks to be a 
> > > transient error translation problem from the kernel driver.
> > > 
> > > I've attached the full log below.
> > > 
> > > This is only with a VGA output.  I currently don't have a HDMI 
> > > dongle, but I'm in the process of acquiring one.
> > 
> > After more playing around, I'm getting thousands of these in the
> > kernel
> > log (possibly millions: the log wraps very fast):
> > 
> > [23504.873606] [drm:intel_dp_start_link_train [i915]] *ERROR*
> > failed
> > to train DP, aborting
> > 
> > And then finally it gives up with 
> > 
> > [25023.770951] [drm:intel_cpu_fifo_underrun_irq_handler [i915]]
> > *ERROR* CPU pipe B FIFO underrun
> > [25561.926075] [drm:intel_cpu_fifo_underrun_irq_handler [i915]]
> > *ERROR* CPU pipe A FIFO underrun
> > 
> > And the crtc for the VGA output becomes non-responsive to any
> > configuration command.  This requires a reboot and sometimes a UEFI
> > variable reset before it comes back.
> 
> Please see this discussion:
> https://patchwork.freedesktop.org/patch/103237/
> 
> Do you have this patch on your tree? Does the problem go away if you
> revert it?

Yes, I've got it, it went in in 4.8-rc3 according to git:

commit 58e311b09c319183254d9220c50a533e7157c9ab
Author: Matt Roper 
Date:   Thu Aug 4 14:08:00 2016 -0700

drm/i915/gen9: Give one extra block per line for SKL plane WM
calculations

Reverting it causes the secondary display not to sync pretty much at
all.  However, in the flickers I can see, it does work OK and doesn't
now crash switching from --same-as to --above and back

I also still get the logs filling up with the link training errors.

On balance, although the behaviour is different, it's not an
improvement because if I can't sync with the projector, I can't really
use this as a fix.

James



Skylake graphics regression: projector failure with 4.8-rc3

2016-08-31 Thread James Bottomley
On Wed, 2016-08-31 at 11:23 -0700, James Bottomley wrote:
> On Fri, 2016-08-26 at 09:10 -0400, James Bottomley wrote:
> > We seem to have an xrandr regression with skylake now.  What's
> > happening is that I can get output on to a projector, but the 
> > system is losing video when I change the xrandr sessions (like 
> > going from a --above b to a --same-as b).  The main screen goes 
> > blank, which is basically a reboot situation.  Unfortunately, I 
> > can't seem to get the logs out of systemd to see if there was a 
> > dump to dmesg (the system was definitely responding).
> > 
> > I fell back to 4.6.2 which worked perfectly, so this is definitely 
> > some sort of regression.  I'll be able to debug more fully when I 
> > get back home from the Linux Security Summit.
> 
> I'm home now.  Unfortunately, my monitor isn't as problematic as the
> projector, but by flipping between various modes and separating and
> overlaying the panels with --above and --same-as (xrandr), I can
> eventually get it to the point where the main LCD panel goes black 
> and can only be restarted by specifying a different mode.
> 
> This seems to be associated with these lines in the X
> 
> [ 14714.389] (EE) intel(0): failed to set mode: Invalid argument [22]
> 
> But the curious thing is that even if this fails with the error 
> message once, it may succeed a second time, so it looks to be a 
> transient error translation problem from the kernel driver.
> 
> I've attached the full log below.
> 
> This is only with a VGA output.  I currently don't have a HDMI 
> dongle, but I'm in the process of acquiring one.

After more playing around, I'm getting thousands of these in the kernel
log (possibly millions: the log wraps very fast):

[23504.873606] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train 
DP, aborting

And then finally it gives up with 

[25023.770951] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU 
pipe B FIFO underrun
[25561.926075] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU 
pipe A FIFO underrun

And the crtc for the VGA output becomes non-responsive to any
configuration command.  This requires a reboot and sometimes a UEFI
variable reset before it comes back.

James



Skylake graphics regression: projector failure with 4.8-rc3

2016-08-31 Thread James Bottomley
On Fri, 2016-08-26 at 09:10 -0400, James Bottomley wrote:
> We seem to have an xrandr regression with skylake now.  What's
> happening is that I can get output on to a projector, but the system 
> is losing video when I change the xrandr sessions (like going from a 
> --above b to a --same-as b).  The main screen goes blank, which is
> basically a reboot situation.  Unfortunately, I can't seem to get the
> logs out of systemd to see if there was a dump to dmesg (the system 
> was definitely responding).
> 
> I fell back to 4.6.2 which worked perfectly, so this is definitely 
> some sort of regression.  I'll be able to debug more fully when I get 
> back home from the Linux Security Summit.

I'm home now.  Unfortunately, my monitor isn't as problematic as the
projector, but by flipping between various modes and separating and
overlaying the panels with --above and --same-as (xrandr), I can
eventually get it to the point where the main LCD panel goes black and
can only be restarted by specifying a different mode.

This seems to be associated with these lines in the X

[ 14714.389] (EE) intel(0): failed to set mode: Invalid argument [22]

But the curious thing is that even if this fails with the error message
once, it may succeed a second time, so it looks to be a transient error
translation problem from the kernel driver.

I've attached the full log below.

This is only with a VGA output.  I currently don't have a HDMI dongle,
but I'm in the process of acquiring one.

James
-- next part --
A non-text attachment was scrubbed...
Name: Xorg.0.log
Type: text/x-log
Size: 31792 bytes
Desc: not available
URL: 
<https://lists.freedesktop.org/archives/dri-devel/attachments/20160831/9456e526/attachment-0001.bin>


Skylake graphics regression: projector failure with 4.8-rc3

2016-08-26 Thread James Bottomley
On Fri, 2016-08-26 at 09:10 -0400, James Bottomley wrote:
> We seem to have an xrandr regression with skylake now.  What's
> happening is that I can get output on to a projector, but the system 
> is losing video when I change the xrandr sessions (like going from a 
> --above b to a --same-as b).  The main screen goes blank, which is
> basically a reboot situation.  Unfortunately, I can't seem to get the
> logs out of systemd to see if there was a dump to dmesg (the system 
> was definitely responding).
> 
> I fell back to 4.6.2 which worked perfectly, so this is definitely 
> some sort of regression.  I'll be able to debug more fully when I get 
> back home from the Linux Security Summit.

Sorry, ignore the logs: they're from a previous bug; as I said I'll try
to gather logs when I get back to a more controlled environment over
the weekend.

James




Skylake graphics regression: projector failure with 4.8-rc3

2016-08-26 Thread James Bottomley
We seem to have an xrandr regression with skylake now.  What's
happening is that I can get output on to a projector, but the system is
losing video when I change the xrandr sessions (like going from a -
-above b to a --same-as b).  The main screen goes blank, which is
basically a reboot situation.  Unfortunately, I can't seem to get the
logs out of systemd to see if there was a dump to dmesg (the system was
definitely responding).

I fell back to 4.6.2 which worked perfectly, so this is definitely some
sort of regression.  I'll be able to debug more fully when I get back
home from the Linux Security Summit.

James
-- next part --
A non-text attachment was scrubbed...
Name: Xorg.0.log
Type: text/x-log
Size: 27983 bytes
Desc: not available
URL: 

-- next part --
[0.00] Linux version 4.7.0-rc1+ (jejb at jarvis) (gcc version 4.8.5 
(SUSE Linux) ) #1 SMP PREEMPT Mon May 30 12:29:53 PDT 2016
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-test 
root=UUID=994363fd-c12c-4c6c-92e9-85e8ac695cf9 ro resume=/dev/nvme0n1p2 
splash=silent quiet showopts pcie_aspm=force
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: xstate_offset[3]:  960, xstate_sizes[3]:   64
[0.00] x86/fpu: xstate_offset[4]: 1024, xstate_sizes[4]:   64
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point 
registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[0.00] x86/fpu: Enabled xstate features 0x1f, context size is 1088 
bytes, using 'standard' format.
[0.00] x86/fpu: Using 'eager' FPU context switches.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0fff] reserved
[0.00] BIOS-e820: [mem 0x1000-0x00057fff] usable
[0.00] BIOS-e820: [mem 0x00058000-0x00058fff] reserved
[0.00] BIOS-e820: [mem 0x00059000-0x0009cfff] usable
[0.00] BIOS-e820: [mem 0x0009d000-0x0009] reserved
[0.00] BIOS-e820: [mem 0x0010-0x69537fff] usable
[0.00] BIOS-e820: [mem 0x69538000-0x69538fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x69539000-0x69562fff] reserved
[0.00] BIOS-e820: [mem 0x69563000-0x695bdfff] usable
[0.00] BIOS-e820: [mem 0x695be000-0x69daefff] reserved
[0.00] BIOS-e820: [mem 0x69daf000-0x7907dfff] usable
[0.00] BIOS-e820: [mem 0x7907e000-0x793f8fff] reserved
[0.00] BIOS-e820: [mem 0x793f9000-0x79448fff] ACPI data
[0.00] BIOS-e820: [mem 0x79449000-0x79c00fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x79c01000-0x7a4fefff] reserved
[0.00] BIOS-e820: [mem 0x7a4ff000-0x7a4f] usable
[0.00] BIOS-e820: [mem 0x7a50-0x7a5f] reserved
[0.00] BIOS-e820: [mem 0xe000-0xefff] reserved
[0.00] BIOS-e820: [mem 0xfe00-0xfe010fff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00027eff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] e820: update [mem 0x67823018-0x67833057] usable ==> usable
[0.00] extended physical RAM map:
[0.00] reserve setup_data: [mem 0x-0x0fff] 
reserved
[0.00] reserve setup_data: [mem 0x1000-0x00057fff] 
usable
[0.00] reserve setup_data: [mem 0x00058000-0x00058fff] 
reserved
[0.00] reserve setup_data: [mem 0x00059000-0x0009cfff] 
usable
[0.00] reserve setup_data: [mem 0x0009d000-0x0009] 
reserved
[0.00] reserve setup_data: [mem 0x0010-0x67823017] 
usable
[0.00] reserve setup_data: [mem 0x67823018-0x67833057] 
usable
[0.00] reserve setup_data: [mem 0x67833058-0x69537fff] 
usable
[0.00] reserve setup_data: [mem 0x69538000-0x69538fff] 
ACPI NVS
[0.00] reserve setup_data: [mem 0x69539000-0x69562fff] 
reserved
[0.00] reserve setup_data: [mem 0x69563000-0x695bdfff] 
usable
[0.00] 

[Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-07-08 Thread James Bottomley
On Fri, 2016-07-08 at 13:19 +0300, Ville Syrjälä wrote:
> On Thu, Jul 07, 2016 at 12:19:36PM -0700, James Bottomley wrote:
> > On Thu, 2016-07-07 at 09:55 -0700, James Bottomley wrote:
> > > On Thu, 2016-07-07 at 19:14 +0300, Ville Syrjälä wrote:
> > > > On Tue, Jun 21, 2016 at 06:44:34PM +0300, Ville Syrjälä wrote:
> > > > > On Tue, Jun 21, 2016 at 09:53:15AM -0400, James Bottomley
> > > > > wrote:
> > > > > > On Mon, 2016-06-20 at 11:03 +0300, Jani Nikula wrote:
> > > > > > > Cc: Ville
> > > > > > > 
> > > > > > > On Mon, 20 Jun 2016, James Bottomley <
> > > > > > > James.Bottomley at HansenPartnership.com> wrote:
> > > > > > > > OK, my candidate bad commit is this one:
> > > > > > > > 
> > > > > > > > commit a05628195a0d9f3173dd9aa76f482aef692e46ee
> > > > > > > > Author: Ville Syrjälä 
> > > > > > > > Date:   Mon Apr 11 10:23:51 2016 +0300
> > > > > > > > 
> > > > > > > > drm/i915: Get panel_type from OpRegion panel
> > > > > > > > details
> > > > > > > > 
> > > > > > > > After being more careful about waiting to identify
> > > > > > > > flicker,
> > > > > > > > this one seems to be the one the bisect finds.  I'm now
> > > > > > > > running v4.7-rc3 with this one reverted and am
> > > > > > > > currently 
> > > > > > > > seeing no flicker problems.   It is, however, early
> > > > > > > > days 
> > > > > > > > because the flicker can hide for long periods, so I 'll
> > > > > > > > wait 
> > > > > > > > until Monday evening and a few reboots before declaring
> > > > > > > > victory.
> > > > > > > 
> > > > > > > If that turns out to be the bad commit, it doesn't really
> > > > > > > surprise me, and that in itself is depressing.
> > > > > > 
> > > > > > As far as I can tell, after running for a day with this
> > > > > > reverted, 
> > > > > > this is the problem.  The flicker hasn't appeared with it 
> > > > > > reverted.  It's pretty noticeable with this commit
> > > > > > included.
> > > > > 
> > > > > Hmm. The only difference I can see is low vs. normal vswing.
> > > > > Panel 
> > > > > 0 has low, panel 2 has normal. So either the VBT or opregion
> > > > > is 
> > > > > telling utter lies, or there's some other bug in our low
> > > > > vswing
> > > > > support.
> > > > 
> > > > I did a quick once over of out DDI vswing stuff and didn't find
> > > > anything too serious. There were some buglets in the iboost
> > > > handling, 
> > > > but I'm not very hopeful that fixing those would help with your
> > > > machine. 
> > > > 
> > > > Here's a branch anyway in case you want to give it a go:
> > > > git://github.com/vsyrjala/linux.git ddi_iboost_fixes
> > > > 
> > > > Actually, I think the only patch in there that might make a 
> > > > difference is 15d887855180 ("drm/i915: Fix iboost setting for
> > > > DDI 
> > > > with 4 lanes on SKL")
> > > 
> > > Running with it now (the entire branch).  So far it looks OK, but
> > > I'll give it a couple of days to see if anything manifests before
> > > declaring victory.
> > 
> > Bad news, I'm afraid: after a couple of hours of run time, there is
> > now
> > noticeable flicker on the display, so although the iboost fixes may
> > have lessened it, it's still present.
> 
> Oh well, I suspected as much. Which BIOS version did you have 
> exactly?

As mentioned upthread, now 1.4.3 since skylake microcode was
potentially implicated in the problem:

http://mid.gmane.org/1466179729.2271.33.camel at HansenPartnership.com

> If I'm reading the Dell website correctly there's a new one (1.4.4)
> released on Jun 30, which is after you reported the issue. Might be
> I'm reading the wrong thing though. Can you double check this?

New bios doesn't change the microcode:

[2.390172] microcode: CPU0 sig=0x406e3, pf=0x80, revision=0x8a

And the flicker is still present.

James



[Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-07-07 Thread James Bottomley
On Thu, 2016-07-07 at 09:55 -0700, James Bottomley wrote:
> On Thu, 2016-07-07 at 19:14 +0300, Ville Syrjälä wrote:
> > On Tue, Jun 21, 2016 at 06:44:34PM +0300, Ville Syrjälä wrote:
> > > On Tue, Jun 21, 2016 at 09:53:15AM -0400, James Bottomley wrote:
> > > > On Mon, 2016-06-20 at 11:03 +0300, Jani Nikula wrote:
> > > > > Cc: Ville
> > > > > 
> > > > > On Mon, 20 Jun 2016, James Bottomley <
> > > > > James.Bottomley at HansenPartnership.com> wrote:
> > > > > > OK, my candidate bad commit is this one:
> > > > > > 
> > > > > > commit a05628195a0d9f3173dd9aa76f482aef692e46ee
> > > > > > Author: Ville Syrjälä 
> > > > > > Date:   Mon Apr 11 10:23:51 2016 +0300
> > > > > > 
> > > > > > drm/i915: Get panel_type from OpRegion panel details
> > > > > > 
> > > > > > After being more careful about waiting to identify flicker,
> > > > > > this one seems to be the one the bisect finds.  I'm now 
> > > > > > running v4.7-rc3 with this one reverted and am currently 
> > > > > > seeing no flicker problems.   It is, however, early days 
> > > > > > because the flicker can hide for long periods, so I 'll
> > > > > > wait 
> > > > > > until Monday evening and a few reboots before declaring
> > > > > > victory.
> > > > > 
> > > > > If that turns out to be the bad commit, it doesn't really 
> > > > > surprise me, and that in itself is depressing.
> > > > 
> > > > As far as I can tell, after running for a day with this
> > > > reverted, 
> > > > this is the problem.  The flicker hasn't appeared with it 
> > > > reverted.  It's pretty noticeable with this commit included.
> > > 
> > > Hmm. The only difference I can see is low vs. normal vswing.
> > > Panel 
> > > 0 has low, panel 2 has normal. So either the VBT or opregion is 
> > > telling utter lies, or there's some other bug in our low vswing
> > > support.
> > 
> > I did a quick once over of out DDI vswing stuff and didn't find 
> > anything too serious. There were some buglets in the iboost
> > handling, 
> > but I'm not very hopeful that fixing those would help with your 
> > machine. 
> > 
> > Here's a branch anyway in case you want to give it a go:
> > git://github.com/vsyrjala/linux.git ddi_iboost_fixes
> > 
> > Actually, I think the only patch in there that might make a 
> > difference is 15d887855180 ("drm/i915: Fix iboost setting for DDI 
> > with 4 lanes on SKL")
> 
> Running with it now (the entire branch).  So far it looks OK, but 
> I'll give it a couple of days to see if anything manifests before
> declaring victory.

Bad news, I'm afraid: after a couple of hours of run time, there is now
noticeable flicker on the display, so although the iboost fixes may
have lessened it, it's still present.

James




[Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-07-07 Thread James Bottomley
On Thu, 2016-07-07 at 19:14 +0300, Ville Syrjälä wrote:
> On Tue, Jun 21, 2016 at 06:44:34PM +0300, Ville Syrjälä wrote:
> > On Tue, Jun 21, 2016 at 09:53:15AM -0400, James Bottomley wrote:
> > > On Mon, 2016-06-20 at 11:03 +0300, Jani Nikula wrote:
> > > > Cc: Ville
> > > > 
> > > > On Mon, 20 Jun 2016, James Bottomley <
> > > > James.Bottomley at HansenPartnership.com> wrote:
> > > > > OK, my candidate bad commit is this one:
> > > > > 
> > > > > commit a05628195a0d9f3173dd9aa76f482aef692e46ee
> > > > > Author: Ville Syrjälä 
> > > > > Date:   Mon Apr 11 10:23:51 2016 +0300
> > > > > 
> > > > > drm/i915: Get panel_type from OpRegion panel details
> > > > > 
> > > > > After being more careful about waiting to identify flicker, 
> > > > > this one seems to be the one the bisect finds.  I'm now 
> > > > > running v4.7-rc3 with this one reverted and am currently 
> > > > > seeing no flicker problems.   It is, however, early days 
> > > > > because the flicker can hide for long periods, so I 'll wait 
> > > > > until Monday evening and a few reboots before declaring
> > > > > victory.
> > > > 
> > > > If that turns out to be the bad commit, it doesn't really 
> > > > surprise me, and that in itself is depressing.
> > > 
> > > As far as I can tell, after running for a day with this reverted, 
> > > this is the problem.  The flicker hasn't appeared with it 
> > > reverted.  It's pretty noticeable with this commit included.
> > 
> > Hmm. The only difference I can see is low vs. normal vswing. Panel 
> > 0 has low, panel 2 has normal. So either the VBT or opregion is 
> > telling utter lies, or there's some other bug in our low vswing
> > support.
> 
> I did a quick once over of out DDI vswing stuff and didn't find 
> anything too serious. There were some buglets in the iboost handling, 
> but I'm not very hopeful that fixing those would help with your 
> machine. 
> 
> Here's a branch anyway in case you want to give it a go:
> git://github.com/vsyrjala/linux.git ddi_iboost_fixes
> 
> Actually, I think the only patch in there that might make a 
> difference is 15d887855180 ("drm/i915: Fix iboost setting for DDI 
> with 4 lanes on SKL")

Running with it now (the entire branch).  So far it looks OK, but I'll
give it a couple of days to see if anything manifests before declaring
victory.

James



[Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-23 Thread James Bottomley
On Tue, 2016-06-21 at 17:00 -0400, James Bottomley wrote:
> On Tue, 2016-06-21 at 18:44 +0300, Ville Syrjälä wrote:
> > On Tue, Jun 21, 2016 at 09:53:15AM -0400, James Bottomley wrote:
> > > On Mon, 2016-06-20 at 11:03 +0300, Jani Nikula wrote:
> > > > Cc: Ville
> > > > 
> > > > On Mon, 20 Jun 2016, James Bottomley <
> > > > James.Bottomley at HansenPartnership.com> wrote:
> > > > > OK, my candidate bad commit is this one:
> > > > > 
> > > > > commit a05628195a0d9f3173dd9aa76f482aef692e46ee
> > > > > Author: Ville Syrjälä 
> > > > > Date:   Mon Apr 11 10:23:51 2016 +0300
> > > > > 
> > > > > drm/i915: Get panel_type from OpRegion panel details
> > > > > 
> > > > > After being more careful about waiting to identify flicker, 
> > > > > this one seems to be the one the bisect finds.  I'm now 
> > > > > running v4.7-rc3 with this one reverted and am currently 
> > > > > seeing no flicker problems.   It is, however, early days 
> > > > > because the flicker can hide for long periods, so I 'll wait 
> > > > > until Monday evening and a few reboots before declaring
> > > > > victory.
> > > > 
> > > > If that turns out to be the bad commit, it doesn't really 
> > > > surprise me, and that in itself is depressing.
> > > 
> > > As far as I can tell, after running for a day with this reverted,
> > > this is the problem.  The flicker hasn't appeared with it 
> > > reverted.  It's pretty noticeable with this commit included.
> > 
> > Hmm. The only difference I can see is low vs. normal vswing. Panel 
> > 0 has low, panel 2 has normal. So either the VBT or opregion is
> > telling utter lies, or there's some other bug in our low vswing
> > support.
> > 
> > To confirm it's really a vswing issue, you should be able to run 
> > with i915.edp_vswing=2 without flickers on the broken kernel.
> 
> Preliminary boot indicates no flicker with the bad commit included 
> and this option, but I'll have to run for quite a bit longer to 
> verify, since it can sometimes be elusive.

Two days of runtime seems to confirm this is the problem (still no
flicker issues).

James




[Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-21 Thread James Bottomley
On Tue, 2016-06-21 at 18:44 +0300, Ville Syrjälä wrote:
> On Tue, Jun 21, 2016 at 09:53:15AM -0400, James Bottomley wrote:
> > On Mon, 2016-06-20 at 11:03 +0300, Jani Nikula wrote:
> > > Cc: Ville
> > > 
> > > On Mon, 20 Jun 2016, James Bottomley <
> > > James.Bottomley at HansenPartnership.com> wrote:
> > > > OK, my candidate bad commit is this one:
> > > > 
> > > > commit a05628195a0d9f3173dd9aa76f482aef692e46ee
> > > > Author: Ville Syrjälä 
> > > > Date:   Mon Apr 11 10:23:51 2016 +0300
> > > > 
> > > > drm/i915: Get panel_type from OpRegion panel details
> > > > 
> > > > After being more careful about waiting to identify flicker, 
> > > > this one seems to be the one the bisect finds.  I'm now running 
> > > > v4.7-rc3 with this one reverted and am currently seeing no 
> > > > flicker problems.   It is, however, early days because the 
> > > > flicker can hide for long periods, so I 'll wait until Monday 
> > > > evening and a few reboots before declaring victory.
> > > 
> > > If that turns out to be the bad commit, it doesn't really 
> > > surprise me, and that in itself is depressing.
> > 
> > As far as I can tell, after running for a day with this reverted, 
> > this is the problem.  The flicker hasn't appeared with it reverted.
> >  It's pretty noticeable with this commit included.
> 
> Hmm. The only difference I can see is low vs. normal vswing. Panel 0 
> has low, panel 2 has normal. So either the VBT or opregion is telling
> utter lies, or there's some other bug in our low vswing support.
> 
> To confirm it's really a vswing issue, you should be able to run with
> i915.edp_vswing=2 without flickers on the broken kernel.

Preliminary boot indicates no flicker with the bad commit included and
this option, but I'll have to run for quite a bit longer to verify,
since it can sometimes be elusive.

James



[Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-21 Thread James Bottomley
On Mon, 2016-06-20 at 11:03 +0300, Jani Nikula wrote:
> Cc: Ville
> 
> On Mon, 20 Jun 2016, James Bottomley <
> James.Bottomley at HansenPartnership.com> wrote:
> > OK, my candidate bad commit is this one:
> > 
> > commit a05628195a0d9f3173dd9aa76f482aef692e46ee
> > Author: Ville Syrjälä 
> > Date:   Mon Apr 11 10:23:51 2016 +0300
> > 
> > drm/i915: Get panel_type from OpRegion panel details
> > 
> > After being more careful about waiting to identify flicker, this 
> > one seems to be the one the bisect finds.  I'm now running v4.7-rc3
> > with this one reverted and am currently seeing no flicker problems.
> >   It is, however, early days because the flicker can hide for long 
> > periods, so I 'll wait until Monday evening and a few reboots 
> > before declaring victory.
> 
> If that turns out to be the bad commit, it doesn't really surprise 
> me, and that in itself is depressing.

As far as I can tell, after running for a day with this reverted, this
is the problem.  The flicker hasn't appeared with it reverted.  It's
pretty noticeable with this commit included.

> It might be helpful if you could add drm.debug=14 module parameter, 
> and provide dmesgs from boot both with and without the above commit 
> (it's enough to see i915 load). Please also provide /sys/kernel/debug
> /dri/0/i915_opregion.

Assuming the Mailing list doesn't eat the attachments, they should all
be included.

James

-- next part --
[0.00] Linux version 4.7.0-rc4+ (jejb at jarvis) (gcc version 4.8.5 
(SUSE Linux) ) #2 SMP PREEMPT Tue Jun 21 07:59:21 EDT 2016
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-test 
root=UUID=994363fd-c12c-4c6c-92e9-85e8ac695cf9 ro resume=/dev/nvme0n1p2 
splash=silent quiet showopts pcie_aspm=force drm.debug=14
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: xstate_offset[3]:  960, xstate_sizes[3]:   64
[0.00] x86/fpu: xstate_offset[4]: 1024, xstate_sizes[4]:   64
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point 
registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[0.00] x86/fpu: Enabled xstate features 0x1f, context size is 1088 
bytes, using 'standard' format.
[0.00] x86/fpu: Using 'eager' FPU context switches.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0fff] reserved
[0.00] BIOS-e820: [mem 0x1000-0x00057fff] usable
[0.00] BIOS-e820: [mem 0x00058000-0x00058fff] reserved
[0.00] BIOS-e820: [mem 0x00059000-0x0009cfff] usable
[0.00] BIOS-e820: [mem 0x0009d000-0x0009] reserved
[0.00] BIOS-e820: [mem 0x0010-0x69537fff] usable
[0.00] BIOS-e820: [mem 0x69538000-0x69538fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x69539000-0x69562fff] reserved
[0.00] BIOS-e820: [mem 0x69563000-0x695bdfff] usable
[0.00] BIOS-e820: [mem 0x695be000-0x69daefff] reserved
[0.00] BIOS-e820: [mem 0x69daf000-0x7907dfff] usable
[0.00] BIOS-e820: [mem 0x7907e000-0x793f8fff] reserved
[0.00] BIOS-e820: [mem 0x793f9000-0x79448fff] ACPI data
[0.00] BIOS-e820: [mem 0x79449000-0x79c00fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x79c01000-0x7a4fefff] reserved
[0.00] BIOS-e820: [mem 0x7a4ff000-0x7a4f] usable
[0.00] BIOS-e820: [mem 0x7a50-0x7a5f] reserved
[0.00] BIOS-e820: [mem 0xe000-0xefff] reserved
[0.00] BIOS-e820: [mem 0xfe00-0xfe010fff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00027eff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] e820: update [mem 0x67a09018-0x67a19057] usable ==> usable
[0.00] extended physical RAM map:
[0.00] reserve setup_data: [mem 0x-0x0fff] 
reserved
[0.00] reserve setup_data: [mem 0x1000-0x00057fff] 
usable
[0.00] reserve setup_data: [mem 0x00058000-0x00058fff] 
reserved
[0.00] rese

[Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-19 Thread James Bottomley
On Fri, 2016-06-17 at 16:06 -0700, James Bottomley wrote:
> On Fri, 2016-06-17 at 16:34 +0300, Jani Nikula wrote:
> > On Fri, 17 Jun 2016, Daniel Vetter  wrote:
> > > On Thu, Jun 16, 2016 at 03:42:12PM -0700, James Bottomley wrote:
> > > > On Thu, 2016-06-16 at 14:29 -0700, James Bottomley wrote:
> > > > > On Thu, 2016-06-16 at 23:24 +0200, Daniel Vetter wrote:
> > > > > > I guess we'll need the bisect on this one to make progress.
> > > > > 
> > > > > Sigh, I was afraid that might be the next step.
> > > > 
> > > > OK, I have a curious data point.  I assumed the problem would
> > > > be
> > > > somewhere in the drm update, so I started bisecting that at the
> > > > top. 
> > > >  However, the top most commit:
> > > > 
> > > > commit 1d6da87a3241deb13d073c4125d19ed0e5a0c62c
> > > > Merge: 1f40c49 a39ed68
> > > > Author: Linus Torvalds 
> > > > Date:   Mon May 23 11:48:48 2016 -0700
> > > > 
> > > > Merge branch 'drm-next' of
> > > > git://people.freedesktop.org/~airlied/linux
> > > > 
> > > > Isn't actually bad.  There's no flicker here, so whatever
> > > > caused
> > > > the
> > > > problem came from some update after this.
> > > 
> > > There was a fixes pull after this. Might be worth it to restrict
> > > to
> > > just
> > > the i915 changes, which are just
> > > 5b4fd5bb1230cd037..157d2c7fad0863222
> > > 
> > > Looking at those nothing seems to stick out which might explain
> > > what's
> > > happening for you.
> 
> OK, so just on the firmware, the system seems less flickery with the
> new 1.4.3 UEFI, so I'm starting to think it is a Skylake errata 
> issue.  The flicker isn't gone for good, but seems to be reboot 
> dependent (it's there in some boots, but gone on a reboot).
> 
> > This should be easy enough to try before bisecting:
> > http://patchwork.freedesktop.org/patch/msgid/1466162081-12042-1-git
> > -s
> > end-email-mika.kahola at intel.com
> 
> Applying this didn't seem to make a difference: still there on some 
> and gone on other reboots.

OK, my candidate bad commit is this one:

commit a05628195a0d9f3173dd9aa76f482aef692e46ee
Author: Ville Syrjälä 
Date:   Mon Apr 11 10:23:51 2016 +0300

drm/i915: Get panel_type from OpRegion panel details

After being more careful about waiting to identify flicker, this one
seems to be the one the bisect finds.  I'm now running v4.7-rc3 with
this one reverted and am currently seeing no flicker problems.  It is,
however, early days because the flicker can hide for long periods, so I
'll wait until Monday evening and a few reboots before declaring
victory.

James




[Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-17 Thread James Bottomley
On Fri, 2016-06-17 at 16:34 +0300, Jani Nikula wrote:
> On Fri, 17 Jun 2016, Daniel Vetter  wrote:
> > On Thu, Jun 16, 2016 at 03:42:12PM -0700, James Bottomley wrote:
> > > On Thu, 2016-06-16 at 14:29 -0700, James Bottomley wrote:
> > > > On Thu, 2016-06-16 at 23:24 +0200, Daniel Vetter wrote:
> > > > > I guess we'll need the bisect on this one to make progress.
> > > > 
> > > > Sigh, I was afraid that might be the next step.
> > > 
> > > OK, I have a curious data point.  I assumed the problem would be
> > > somewhere in the drm update, so I started bisecting that at the
> > > top. 
> > >  However, the top most commit:
> > > 
> > > commit 1d6da87a3241deb13d073c4125d19ed0e5a0c62c
> > > Merge: 1f40c49 a39ed68
> > > Author: Linus Torvalds 
> > > Date:   Mon May 23 11:48:48 2016 -0700
> > > 
> > > Merge branch 'drm-next' of
> > > git://people.freedesktop.org/~airlied/linux
> > > 
> > > Isn't actually bad.  There's no flicker here, so whatever caused
> > > the
> > > problem came from some update after this.
> > 
> > There was a fixes pull after this. Might be worth it to restrict to
> > just
> > the i915 changes, which are just
> > 5b4fd5bb1230cd037..157d2c7fad0863222
> > 
> > Looking at those nothing seems to stick out which might explain
> > what's
> > happening for you.

OK, so just on the firmware, the system seems less flickery with the
new 1.4.3 UEFI, so I'm starting to think it is a Skylake errata issue. 
 The flicker isn't gone for good, but seems to be reboot dependent
(it's there in some boots, but gone on a reboot).

> This should be easy enough to try before bisecting:
> http://patchwork.freedesktop.org/patch/msgid/1466162081-12042-1-git-s
> end-email-mika.kahola at intel.com

Applying this didn't seem to make a difference: still there on some and
gone on other reboots.

James




[Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-17 Thread James Bottomley
On Fri, 2016-06-17 at 16:34 +0300, Jani Nikula wrote:
> On Fri, 17 Jun 2016, Daniel Vetter  wrote:
> > On Thu, Jun 16, 2016 at 03:42:12PM -0700, James Bottomley wrote:
> > > On Thu, 2016-06-16 at 14:29 -0700, James Bottomley wrote:
> > > > On Thu, 2016-06-16 at 23:24 +0200, Daniel Vetter wrote:
> > > > > I guess we'll need the bisect on this one to make progress.
> > > > 
> > > > Sigh, I was afraid that might be the next step.
> > > 
> > > OK, I have a curious data point.  I assumed the problem would be
> > > somewhere in the drm update, so I started bisecting that at the
> > > top. 
> > >  However, the top most commit:
> > > 
> > > commit 1d6da87a3241deb13d073c4125d19ed0e5a0c62c
> > > Merge: 1f40c49 a39ed68
> > > Author: Linus Torvalds 
> > > Date:   Mon May 23 11:48:48 2016 -0700
> > > 
> > > Merge branch 'drm-next' of
> > > git://people.freedesktop.org/~airlied/linux
> > > 
> > > Isn't actually bad.  There's no flicker here, so whatever caused
> > > the
> > > problem came from some update after this.
> > 
> > There was a fixes pull after this. Might be worth it to restrict to
> > just
> > the i915 changes, which are just
> > 5b4fd5bb1230cd037..157d2c7fad0863222
> > 
> > Looking at those nothing seems to stick out which might explain
> > what's
> > happening for you.
> 
> This should be easy enough to try before bisecting:
> http://patchwork.freedesktop.org/patch/msgid/1466162081-12042-1-git-s
> end-email-mika.kahola at intel.com

I'll try that.  You should know this platform has a firmware bug in the
digital signals across the display port ... some problem with the mux
in the thunderbolt.  Unfortunately the microcode update for this (the
thunderbolt firmware) can't be applied from Linux or even the usual
FreeDOS, so I'm a bit stuck.

It's also been pointed out to me off list that there's a lot of Skylake
errata that have display flicker due to non-graphics issues:

http://www.intel.com/content/www/us/en/processors/core/desktop-6th-gen-core-family-spec-update.html

My system is a Dell XPS 9350 and I have updated it with the just released 
firmware:

http://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=VNT5J

Which causes my cpu microcode version to go from

> [2.314472] microcode: CPU0 sig=0x406e3, pf=0x80, revision=0x82

to

> [2.264669] microcode: CPU0 sig=0x406e3, pf=0x80, revision=0x8a

But thanks to the completely opaque microcode update process via the
OEMs it's not clear whether this is sufficient to fix the errata that
are listed as causing flicker (the reported revision on current skylake
lenovos seems to be 0x94)

James



[Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-16 Thread James Bottomley
On Thu, 2016-06-16 at 14:29 -0700, James Bottomley wrote:
> On Thu, 2016-06-16 at 23:24 +0200, Daniel Vetter wrote:
> > I guess we'll need the bisect on this one to make progress.
> 
> Sigh, I was afraid that might be the next step.

OK, I have a curious data point.  I assumed the problem would be
somewhere in the drm update, so I started bisecting that at the top. 
 However, the top most commit:

commit 1d6da87a3241deb13d073c4125d19ed0e5a0c62c
Merge: 1f40c49 a39ed68
Author: Linus Torvalds 
Date:   Mon May 23 11:48:48 2016 -0700

Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux

Isn't actually bad.  There's no flicker here, so whatever caused the
problem came from some update after this.

James



[Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-16 Thread James Bottomley
On Thu, 2016-06-16 at 23:24 +0200, Daniel Vetter wrote:
> On Thu, Jun 16, 2016 at 11:15 PM, James Bottomley
>  wrote:
> > On Mon, 2016-06-13 at 13:14 +0300, Jani Nikula wrote:
> > > On Tue, 31 May 2016, James Bottomley <
> > > James.Bottomley at HansenPartnership.com> wrote:
> > > > On Tue, 2016-05-31 at 10:51 +0300, Jani Nikula wrote:
> > > > > On Mon, 30 May 2016, James Bottomley <
> > > > > James.Bottomley at HansenPartnership.com> wrote:
> > > > > > I've tested a pristine 4.6.0 system, so it's definitely
> > > > > > something
> > > > > > that
> > > > > > went in during the merge window.  The flicker isn't
> > > > > > continuous,
> > > > > > it's
> > > > > > periodic, with an interval of something like 2-5 seconds. 
> > > > > >  It
> > > > > > looks
> > > > > > like an old analogue TV going out of sync and then
> > > > > > resyncing.
> > > > > >  I've
> > > > > > attached the dmesg and X.org log below just in case they
> > > > > > can
> > > > > > help.
> > > > > >  I
> > > > > > might be able to bisect this next week, but, unfortunately,
> > > > > > this is
> > > > > > my
> > > > > > current laptop and I'm travelling this week.
> > > > > 
> > > > > Please try i915.enable_psr=0 module parameter.
> > > > 
> > > > Makes no discernable difference.  Current parameter settings
> > > > are:
> > > 
> > > Sorry for the silence. Would you mind trying out drm-intel
> > > -nightly
> > > branch of [1]?
> > > 
> > > BR,
> > > Jani.
> > > 
> > > [1] http://cgit.freedesktop.org/drm-intel
> > 
> > No, flicker is still there (and in fact seems worse) with the tree
> > with
> > this commit at the top:
> > 
> > commit 3eb202ecc3668583f9ff4338211dbab47d755d1c
> > Author: Daniel Vetter 
> > Date:   Thu Jun 16 14:38:54 2016 +0200
> > 
> > drm-intel-nightly: 2016y-06m-16d-12h-38m-37s UTC integration
> > manifest
> 
> Strange indeed, I hoped the improved watermark code in -nightly would
> help. I assume nothing in dmesg about underruns or something similar?

Not that I can tell.  I've attached the full dmesg just in case.

> I guess we'll need the bisect on this one to make progress.

Sigh, I was afraid that might be the next step.

James

-- next part --
[0.00] Linux version 4.7.0-rc3+ (jejb at jarvis) (gcc version 4.8.5 
(SUSE Linux) ) #4 SMP PREEMPT Thu Jun 16 13:57:07 PDT 2016
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-test 
root=UUID=994363fd-c12c-4c6c-92e9-85e8ac695cf9 ro resume=/dev/nvme0n1p2 
splash=silent quiet showopts pcie_aspm=force
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: xstate_offset[3]:  960, xstate_sizes[3]:   64
[0.00] x86/fpu: xstate_offset[4]: 1024, xstate_sizes[4]:   64
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point 
registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[0.00] x86/fpu: Enabled xstate features 0x1f, context size is 1088 
bytes, using 'standard' format.
[0.00] x86/fpu: Using 'eager' FPU context switches.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0fff] reserved
[0.00] BIOS-e820: [mem 0x1000-0x00057fff] usable
[0.00] BIOS-e820: [mem 0x00058000-0x00058fff] reserved
[0.00] BIOS-e820: [mem 0x00059000-0x0009cfff] usable
[0.00] BIOS-e820: [mem 0x0009d000-0x0009] reserved
[0.00] BIOS-e820: [mem 0x0010-0x69537fff] usable
[0.00] BIOS-e820: [mem 0x69538000-0x69538fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x69539000-0x69562fff] reserved
[0.00] BIOS-e820: [mem 0x69563000-0x695bdfff] usable
[0.00] BIOS-e820: [mem 0x695be000-0x69daefff] reserved
[0.00] BIOS-e820: [mem 0x69daf000-0x7907dfff] usable
[0.00] BIOS-e820: [mem 0x7907e000-0x793f8fff] reserved
[0.00] BIOS-e820: [mem 0x793f9000-0x79448

Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-16 Thread James Bottomley
On Mon, 2016-06-13 at 13:14 +0300, Jani Nikula wrote:
> On Tue, 31 May 2016, James Bottomley <
> James.Bottomley at HansenPartnership.com> wrote:
> > On Tue, 2016-05-31 at 10:51 +0300, Jani Nikula wrote:
> > > On Mon, 30 May 2016, James Bottomley <
> > > James.Bottomley at HansenPartnership.com> wrote:
> > > > I've tested a pristine 4.6.0 system, so it's definitely
> > > > something
> > > > that
> > > > went in during the merge window.  The flicker isn't continuous,
> > > > it's
> > > > periodic, with an interval of something like 2-5 seconds.  It
> > > > looks
> > > > like an old analogue TV going out of sync and then resyncing. 
> > > >  I've
> > > > attached the dmesg and X.org log below just in case they can
> > > > help. 
> > > >  I
> > > > might be able to bisect this next week, but, unfortunately,
> > > > this is
> > > > my
> > > > current laptop and I'm travelling this week.
> > > 
> > > Please try i915.enable_psr=0 module parameter.
> > 
> > Makes no discernable difference.  Current parameter settings are:
> 
> Sorry for the silence. Would you mind trying out drm-intel-nightly
> branch of [1]?
> 
> BR,
> Jani.
> 
> [1] http://cgit.freedesktop.org/drm-intel

No, flicker is still there (and in fact seems worse) with the tree with
this commit at the top:

commit 3eb202ecc3668583f9ff4338211dbab47d755d1c
Author: Daniel Vetter 
Date:   Thu Jun 16 14:38:54 2016 +0200

drm-intel-nightly: 2016y-06m-16d-12h-38m-37s UTC integration
manifest

James



Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-05-31 Thread James Bottomley
On Tue, 2016-05-31 at 10:51 +0300, Jani Nikula wrote:
> On Mon, 30 May 2016, James Bottomley <
> James.Bottomley at HansenPartnership.com> wrote:
> > I've tested a pristine 4.6.0 system, so it's definitely something
> > that
> > went in during the merge window.  The flicker isn't continuous,
> > it's
> > periodic, with an interval of something like 2-5 seconds.  It looks
> > like an old analogue TV going out of sync and then resyncing.  I've
> > attached the dmesg and X.org log below just in case they can help. 
> >  I
> > might be able to bisect this next week, but, unfortunately, this is
> > my
> > current laptop and I'm travelling this week.
> 
> Please try i915.enable_psr=0 module parameter.

Makes no discernable difference.  Current parameter settings are:

disable_display: N
disable_power_well: 1
edp_vswing: 0
enable_cmd_parser: 1
enable_dc: -1
enable_dp_mst: Y
enable_execlists: 1
enable_fbc: -1
enable_guc_submission: N
enable_hangcheck: Y
enable_ips: 1
enable_ppgtt: 3
enable_psr: 0
enable_rc6: 1
fastboot: N
guc_log_level: -1
inject_load_failure: 0
invert_brightness: 0
load_detect_test: N
lvds_channel_mode: 0
lvds_use_ssc: -1
mmio_debug: 0
modeset: -1
nuclear_pageflip: N
panel_ignore_lid: 1
prefault_disable: N
preliminary_hw_support: 1
reset: Y
semaphores: -1
use_mmio_flip: 0
vbt_sdvo_panel_type: -1
verbose_state_checks: Y

James



Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-05-30 Thread James Bottomley
I've tested a pristine 4.6.0 system, so it's definitely something that
went in during the merge window.  The flicker isn't continuous, it's
periodic, with an interval of something like 2-5 seconds.  It looks
like an old analogue TV going out of sync and then resyncing.  I've
attached the dmesg and X.org log below just in case they can help.  I
might be able to bisect this next week, but, unfortunately, this is my
current laptop and I'm travelling this week.

James
-- next part --
A non-text attachment was scrubbed...
Name: Xorg.0.log
Type: text/x-log
Size: 27983 bytes
Desc: not available
URL: 

-- next part --
[0.00] Linux version 4.7.0-rc1+ (jejb at jarvis) (gcc version 4.8.5 
(SUSE Linux) ) #1 SMP PREEMPT Mon May 30 12:29:53 PDT 2016
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-test 
root=UUID=994363fd-c12c-4c6c-92e9-85e8ac695cf9 ro resume=/dev/nvme0n1p2 
splash=silent quiet showopts pcie_aspm=force
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: xstate_offset[3]:  960, xstate_sizes[3]:   64
[0.00] x86/fpu: xstate_offset[4]: 1024, xstate_sizes[4]:   64
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point 
registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[0.00] x86/fpu: Enabled xstate features 0x1f, context size is 1088 
bytes, using 'standard' format.
[0.00] x86/fpu: Using 'eager' FPU context switches.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0fff] reserved
[0.00] BIOS-e820: [mem 0x1000-0x00057fff] usable
[0.00] BIOS-e820: [mem 0x00058000-0x00058fff] reserved
[0.00] BIOS-e820: [mem 0x00059000-0x0009cfff] usable
[0.00] BIOS-e820: [mem 0x0009d000-0x0009] reserved
[0.00] BIOS-e820: [mem 0x0010-0x69537fff] usable
[0.00] BIOS-e820: [mem 0x69538000-0x69538fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x69539000-0x69562fff] reserved
[0.00] BIOS-e820: [mem 0x69563000-0x695bdfff] usable
[0.00] BIOS-e820: [mem 0x695be000-0x69daefff] reserved
[0.00] BIOS-e820: [mem 0x69daf000-0x7907dfff] usable
[0.00] BIOS-e820: [mem 0x7907e000-0x793f8fff] reserved
[0.00] BIOS-e820: [mem 0x793f9000-0x79448fff] ACPI data
[0.00] BIOS-e820: [mem 0x79449000-0x79c00fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x79c01000-0x7a4fefff] reserved
[0.00] BIOS-e820: [mem 0x7a4ff000-0x7a4f] usable
[0.00] BIOS-e820: [mem 0x7a50-0x7a5f] reserved
[0.00] BIOS-e820: [mem 0xe000-0xefff] reserved
[0.00] BIOS-e820: [mem 0xfe00-0xfe010fff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00027eff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] e820: update [mem 0x67823018-0x67833057] usable ==> usable
[0.00] extended physical RAM map:
[0.00] reserve setup_data: [mem 0x-0x0fff] 
reserved
[0.00] reserve setup_data: [mem 0x1000-0x00057fff] 
usable
[0.00] reserve setup_data: [mem 0x00058000-0x00058fff] 
reserved
[0.00] reserve setup_data: [mem 0x00059000-0x0009cfff] 
usable
[0.00] reserve setup_data: [mem 0x0009d000-0x0009] 
reserved
[0.00] reserve setup_data: [mem 0x0010-0x67823017] 
usable
[0.00] reserve setup_data: [mem 0x67823018-0x67833057] 
usable
[0.00] reserve setup_data: [mem 0x67833058-0x69537fff] 
usable
[0.00] reserve setup_data: [mem 0x69538000-0x69538fff] 
ACPI NVS
[0.00] reserve setup_data: [mem 0x69539000-0x69562fff] 
reserved
[0.00] reserve setup_data: [mem 0x69563000-0x695bdfff] 
usable
[0.00] reserve setup_data: [mem 0x695be000-0x69daefff] 
reserved
[0.00] reserve setup_data: [mem 0x69daf000-0x7907dfff] 

[PATCH 0/25] Replace DEFINE_PCI_DEVICE_TABLE macro use

2014-07-18 Thread James Bottomley
On Fri, 2014-07-18 at 11:17 -0700, Greg KH wrote:
> On Fri, Jul 18, 2014 at 09:54:32AM -0700, James Bottomley wrote:
> > On Fri, 2014-07-18 at 09:43 -0700, Greg KH wrote:
> > > On Fri, Jul 18, 2014 at 12:22:13PM -0400, John W. Linville wrote:
> > > > On Fri, Jul 18, 2014 at 05:26:47PM +0200, Benoit Taine wrote:
> > > > > We should prefer `const struct pci_device_id` over
> > > > > `DEFINE_PCI_DEVICE_TABLE` to meet kernel coding style guidelines.
> > > > > This issue was reported by checkpatch.
> > > > 
> > > > Honestly, I prefer the macro -- it stands-out more.  Maybe the style
> > > > guidelines and/or checkpatch should change instead?
> > > 
> > > The macro is horrid, no other bus has this type of thing just to save a
> > > few characters in typing
> > 
> > OK, so this is the macro:
> > 
> > #define DEFINE_PCI_DEVICE_TABLE(_table) \
> > const struct pci_device_id _table[]
> > 
> > Could you explain what's so horrible?
> > 
> > The reason it's useful today is that people forget the const (and
> > sometimes the [] making it a true table instead of a pointer).  If you
> > use the DEFINE_PCI_DEVICE_TABLE macro, the compile breaks if you use it
> > wrongly (good) and you automatically get the correct annotations.
> 
> We have almost 1000 more uses of the non-macro version than the "macro"
> version in the kernel today:
> $ git grep -w DEFINE_PCI_DEVICE_TABLE | wc -l
> 262
> $ git grep "const struct pci_device_id" | wc -l
> 1254
> 
> My big complaint is that we need to be consistant, either pick one or
> the other and stick to it.  As the macro is the least used, it's easiest
> to fix up, and it also is more consistant with all other kernel
> subsystems which do not have such a macro.

I've a weak preference for consistency, but not at the expense of
hundreds of patches churning the kernel to remove an innocuous macro.
Churn costs time and effort.

> As there is no need for the __init macro mess anymore, there's no real
> need for the DEFINE_PCI_DEVICE_TABLE macro either.  I think checkpatch
> will catch the use of non-const users for the id table already today, it
> catches lots of other uses like this already.
> 
> > > , so why should PCI be "special" in this regard
> > > anymore?
> > 
> > I think the PCI usage dwarfs most other bus types now, so you could turn
> > the question around.  However, I don't think majority voting is a good
> > guide to best practise; lets debate the merits for their own sake.
> 
> Not really "dwarf", USB is close with over 700 such structures:
> $ git grep "const struct usb_device_id" | wc -l
> 725
> 
> And i2c is almost just as big as PCI:
> $ git grep "const struct i2c_device_id" | wc -l
> 1223
> 
> So again, this macro is not consistent with the majority of PCI drivers,
> nor with any other type of "device id" declaration in the kernel, which
> is why I feel it should be removed.
> 
> And finally, the PCI documentation itself says to not use this macro, so
> this isn't a "new" thing.  From Documentation/PCI/pci.txt:
> 
>   The ID table is an array of struct pci_device_id entries ending with an
>   all-zero entry.  Definitions with static const are generally preferred.
>   Use of the deprecated macro DEFINE_PCI_DEVICE_TABLE should be avoided.
> 
> That wording went into the file last December, when we last talked about
> this and everyone in that discussion agreed to remove the macro for the
> above reasons.
> 
> Consistency matters.

In this case, I don't think it does that much ... a cut and paste either
way (from a macro or non-macro based driver) yields correct code.  Since
there's no bug here and no apparent way to misuse the macro, why bother?

Anyway, it's PCI code ... let the PCI maintainer decide.  However, if he
does want this do it as one big bang patch via either the PCI or Trivial
tree (latter because Ji?? has experience doing this, but the former
might be useful so the decider feels the pain ...)

James




[PATCH 0/25] Replace DEFINE_PCI_DEVICE_TABLE macro use

2014-07-18 Thread James Bottomley
On Fri, 2014-07-18 at 09:43 -0700, Greg KH wrote:
> On Fri, Jul 18, 2014 at 12:22:13PM -0400, John W. Linville wrote:
> > On Fri, Jul 18, 2014 at 05:26:47PM +0200, Benoit Taine wrote:
> > > We should prefer `const struct pci_device_id` over
> > > `DEFINE_PCI_DEVICE_TABLE` to meet kernel coding style guidelines.
> > > This issue was reported by checkpatch.
> > 
> > Honestly, I prefer the macro -- it stands-out more.  Maybe the style
> > guidelines and/or checkpatch should change instead?
> 
> The macro is horrid, no other bus has this type of thing just to save a
> few characters in typing

OK, so this is the macro:

#define DEFINE_PCI_DEVICE_TABLE(_table) \
const struct pci_device_id _table[]

Could you explain what's so horrible?

The reason it's useful today is that people forget the const (and
sometimes the [] making it a true table instead of a pointer).  If you
use the DEFINE_PCI_DEVICE_TABLE macro, the compile breaks if you use it
wrongly (good) and you automatically get the correct annotations.

> , so why should PCI be "special" in this regard
> anymore?

I think the PCI usage dwarfs most other bus types now, so you could turn
the question around.  However, I don't think majority voting is a good
guide to best practise; lets debate the merits for their own sake.

James




[PATCH 0/25] Replace DEFINE_PCI_DEVICE_TABLE macro use

2014-07-18 Thread James Bottomley
On Fri, 2014-07-18 at 17:26 +0200, Benoit Taine wrote:
> We should prefer `const struct pci_device_id` over
> `DEFINE_PCI_DEVICE_TABLE` to meet kernel coding style guidelines.
> This issue was reported by checkpatch.

What kernel coding style?  checkpatch isn't the arbiter of style, if
that's the only problem.

The DEFINE_PCI macro was a well reasoned addition when it was added in
2008.  The problem was most people were getting the definition wrong.
When we converted away from CONFIG_HOTPLUG, having this DEFINE_ meant
that only one place needed changing instead of hundreds for PCI tables.

The reason people were getting the PCI table wrong was mostly the init
section specifiers which are now gone, but it has enough underlying
utility (mostly constification) that I don't think we'd want to churn
the kernel hugely to make a change to struct pci_table and then have to
start detecting and fixing misuses.

James




[REPOST PATCH 1/8] fence: dma-buf cross-device synchronization (v17)

2014-06-19 Thread James Bottomley
On Thu, 2014-06-19 at 15:39 -0700, H. Peter Anvin wrote:
> On 06/19/2014 01:01 PM, Greg KH wrote:
> > On Thu, Jun 19, 2014 at 09:15:36PM +0200, Daniel Vetter wrote:
> >> On Thu, Jun 19, 2014 at 7:00 PM, Greg KH  
> >> wrote:
> >> + BUG_ON(f1->context != f2->context);
> >
> > Nice, you just crashed the kernel, making it impossible to debug or
> > recover :(
> 
>  agreed, that should probably be 'if (WARN_ON(...)) return NULL;'
> 
>  (but at least I wouldn't expect to hit that under console_lock so you
>  should at least see the last N lines of the backtrace on the screen
>  ;-))
> >>>
> >>> Lots of devices don't have console screens :)
> >>
> >> Aside: This is a pet peeve of mine and recently I've switched to
> >> rejecting all patch that have a BUG_ON, period.
> > 
> > Please do, I have been for a few years now as well for the same reasons
> > you cite.
> > 
> 
> I'm actually concerned about this trend.  Downgrading things to WARN_ON
> can allow a security bug in the kernel to continue to exist, for
> example, or make the error message disappear.

Me too.  We use BUG_ON in the I/O subsystem where we're forced to
violate a guarantee.  When the choice is corrupt something or panic the
system, I prefer the latter every time.

> I am wondering if the right thing here isn't to have a user (command
> line?) settable policy as to how to proceed on an assert violation,
> instead of hardcoding it at compile time.

I'd say it depends on the consequence of the assertion violation.  We
have assertions that are largely theoretical, ones that govern process
internal state (so killing the process mostly sanitizes the system) and
a few that imply data loss or data corruption.

James




[REPOST PATCH 1/8] fence: dma-buf cross-device synchronization (v17)

2014-06-19 Thread James Bottomley
On Thu, 2014-06-19 at 11:19 -0700, Greg KH wrote:
> On Thu, Jun 19, 2014 at 01:45:30PM -0400, Rob Clark wrote:
> > On Thu, Jun 19, 2014 at 1:00 PM, Greg KH  
> > wrote:
> > > On Thu, Jun 19, 2014 at 10:00:18AM -0400, Rob Clark wrote:
> > >> On Wed, Jun 18, 2014 at 9:13 PM, Greg KH  
> > >> wrote:
> > >> > On Wed, Jun 18, 2014 at 12:36:54PM +0200, Maarten Lankhorst wrote:
> > >> >> +#define CREATE_TRACE_POINTS
> > >> >> +#include 
> > >> >> +
> > >> >> +EXPORT_TRACEPOINT_SYMBOL(fence_annotate_wait_on);
> > >> >> +EXPORT_TRACEPOINT_SYMBOL(fence_emit);
> > >> >
> > >> > Are you really willing to live with these as tracepoints for forever?
> > >> > What is the use of them in debugging?  Was it just for debugging the
> > >> > fence code, or for something else?
> > >> >
> > >> >> +/**
> > >> >> + * fence_context_alloc - allocate an array of fence contexts
> > >> >> + * @num: [in]amount of contexts to allocate
> > >> >> + *
> > >> >> + * This function will return the first index of the number of fences 
> > >> >> allocated.
> > >> >> + * The fence context is used for setting fence->context to a unique 
> > >> >> number.
> > >> >> + */
> > >> >> +unsigned fence_context_alloc(unsigned num)
> > >> >> +{
> > >> >> + BUG_ON(!num);
> > >> >> + return atomic_add_return(num, _context_counter) - num;
> > >> >> +}
> > >> >> +EXPORT_SYMBOL(fence_context_alloc);
> > >> >
> > >> > EXPORT_SYMBOL_GPL()?  Same goes for all of the exports in here.
> > >> > Traditionally all of the driver core exports have been with this
> > >> > marking, any objection to making that change here as well?
> > >>
> > >> tbh, I prefer EXPORT_SYMBOL()..  well, I'd prefer even more if there
> > >> wasn't even a need for EXPORT_SYMBOL_GPL(), but sadly it is a fact of
> > >> life.  We already went through this debate once with dma-buf.  We
> > >> aren't going to change $evil_vendor's mind about non-gpl modules.  The
> > >> only result will be a more flugly convoluted solution (ie. use syncpt
> > >> EXPORT_SYMBOL() on top of fence EXPORT_SYMBOL_GPL()) just as a
> > >> workaround, with the result that no-one benefits.
> > >
> > > It has been proven that using _GPL() exports have caused companies to
> > > release their code "properly" over the years, so as these really are
> > > Linux-only apis, please change them to be marked this way, it helps
> > > everyone out in the end.
> > 
> > Well, maybe that is the true in some cases.  But it certainly didn't
> > work out that way for dma-buf.  And I think the end result is worse.
> > 
> > I don't really like coming down on the side of EXPORT_SYMBOL() instead
> > of EXPORT_SYMBOL_GPL(), but if we do use EXPORT_SYMBOL_GPL() then the
> > result will only be creative workarounds using the _GPL symbols
> > indirectly by whatever is available via EXPORT_SYMBOL().  I don't
> > really see how that will be better.
> 
> You are saying that you _know_ companies will violate our license, so
> you should just "give up"?  And how do you know people aren't working on
> preventing those "indirect" usages as well?  :)
> 
> Sorry, I'm not going to give up here, again, it has proven to work in
> the past in changing the ways of _very_ large companies, why stop now?

When you try to train a dog, you have to be consistent about it.  We're
fantastically inconsistent in symbol exports.

For instance, the mutex primitives are all EXPORT_SYMBOL(), so we're
telling proprietary modules they can use them.  However, when the kernel
is built with CONFIG_DEBUG_MUTEX, they all become
EXPORT_SYMBOL_GPL() ... what type of crazy message is that supposed to
send?  It's OK to use mutexes but it's potentially a GPL violation to
debug them?

We either need to decide that we have a defined and consistent part of
our API that's GPL only or make the bold statement that we don't have
any part of our API that's usable by non-GPL modules.  Right at the
minute we do neither and it confuses people no end about what is and
isn't allowed.

James




Massive power regression going 3.4->3.5

2012-08-07 Thread James Bottomley
On Sun, 2012-08-05 at 22:36 +0200, Daniel Vetter wrote:
> On Wed, Aug 01, 2012 at 11:08:19AM +0100, James Bottomley wrote:
> > On Wed, 2012-08-01 at 09:58 +0100, Chris Wilson wrote:
> > > On Wed, 01 Aug 2012 09:45:04 +0100, James Bottomley  > > HansenPartnership.com> wrote:
> > > > On Wed, 2012-08-01 at 09:16 +0100, Chris Wilson wrote:
> > > > > On Wed, 01 Aug 2012 09:06:12 +0100, James Bottomley  > > > > at HansenPartnership.com> wrote:
> > > > > > I got the attached to apply and it doesn't really improve the idle 
> > > > > > power
> > > > > > much (12.5W).
> > > > > 
> > > > > That's good to know. Next step is to try overriding i915.semaphores.
> > > > > Can you please test with i915.semaphores=0 and i915.semaphores=1?
> > > > 
> > > > There's not much point doing i915_semaphores=1 since that's the default
> > > > on gen 6 hardware, but i915_semaphores=0 recovers and idle power of
> > > > ~6.5W
> > > 
> > > It is only the default if iommu is off, and changing the default
> > > was one of the side-effects of the patch you bisected.
> > > 
> > > Can you please login to the desktop, let it idle, record
> > > /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
> > > Then trace-cmd record -e i915 sleep 10s, and follow up with a new pair
> > > of /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
> > > 
> > > This will let us see whether the pm counters are truly advancing and
> > > what activity the driver is performing whilst idle.
> > 
> > OK, so here it is
> > 
> > James
> 
> Hm, if I haven't botched the math, you have a rc6 residency of about 320
> seconds between the two cats of drpc_info. Can you please script this so
> that we have exactly 10s in between? (Aside: 3.6 has a neat interface for
> rc6 residency in sysfs ...)

You botched the maths, I think.  The three cats after the sleep was
three up arrows ... if it went over 11s I'd be surprised.

> Also, you need to attach the output of trace-cmd report (like with perf),
> so that we see the tracepoints in detail.

You mean you want the full trace.dat file rather than what the output
summary says?  I can, but it's 800k compressed which is probably over
the list limit ... I can upload it somewhere when I get back from
holiday next Monday.

> Another quick thing to confirm: What is the power consumption on the old
> kernel when booting with i915.i915_semaphores=1?

It idles at around 13W, which means the history of the problem must be
this:

What looks to have happened seems to be because of a merge failure in
drm:

In 3.2 Keith Packard disabled semaphores on sandybridge with

commit ebbd857e6b9a92c0aff4aacd1b1d2361d888633e
Author: Keith Packard 
Date:   Mon Dec 26 17:02:10 2011 -0800

drm/i915: Disable semaphores by default on SNB

Because of an apparent bug causing a GPU hang.

I think this is what gave me the power savings in 3.4 when the PCI layer
was ready for it.

It got re-enabled accidentally in 3.5 by a mismerge of 

commit 2911a35b2e4eb87ec48d03aeb11f019e51ae3c0d
Author: Ben Widawsky 
Date:   Thu Apr 5 14:47:36 2012 -0700

drm/i915: use semaphores for the display plane

Because that puts back the pre ebbd857e6b9a92c0aff4aacd1b1d2361d888633e
semaphore enabling code, but in a different place, which is probably why
it wasn't spotted, so semaphores got re-enabled on sandybridge.

Perhaps what we should be doing is verifying that semaphores aren't
sucking the same 6W of power on ivybridge and if not, just re-disable
them on sandybridge, since we'll have to do that anyway to re-apply the
bug fix.

James






Re: Massive power regression going 3.4-3.5

2012-08-07 Thread James Bottomley
On Sun, 2012-08-05 at 22:36 +0200, Daniel Vetter wrote:
 On Wed, Aug 01, 2012 at 11:08:19AM +0100, James Bottomley wrote:
  On Wed, 2012-08-01 at 09:58 +0100, Chris Wilson wrote:
   On Wed, 01 Aug 2012 09:45:04 +0100, James Bottomley 
   james.bottom...@hansenpartnership.com wrote:
On Wed, 2012-08-01 at 09:16 +0100, Chris Wilson wrote:
 On Wed, 01 Aug 2012 09:06:12 +0100, James Bottomley 
 james.bottom...@hansenpartnership.com wrote:
  I got the attached to apply and it doesn't really improve the idle 
  power
  much (12.5W).
 
 That's good to know. Next step is to try overriding i915.semaphores.
 Can you please test with i915.semaphores=0 and i915.semaphores=1?

There's not much point doing i915_semaphores=1 since that's the default
on gen 6 hardware, but i915_semaphores=0 recovers and idle power of
~6.5W
   
   It is only the default if iommu is off, and changing the default
   was one of the side-effects of the patch you bisected.
   
   Can you please login to the desktop, let it idle, record
   /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
   Then trace-cmd record -e i915 sleep 10s, and follow up with a new pair
   of /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
   
   This will let us see whether the pm counters are truly advancing and
   what activity the driver is performing whilst idle.
  
  OK, so here it is
  
  James
 
 Hm, if I haven't botched the math, you have a rc6 residency of about 320
 seconds between the two cats of drpc_info. Can you please script this so
 that we have exactly 10s in between? (Aside: 3.6 has a neat interface for
 rc6 residency in sysfs ...)

You botched the maths, I think.  The three cats after the sleep was
three up arrows ... if it went over 11s I'd be surprised.

 Also, you need to attach the output of trace-cmd report (like with perf),
 so that we see the tracepoints in detail.

You mean you want the full trace.dat file rather than what the output
summary says?  I can, but it's 800k compressed which is probably over
the list limit ... I can upload it somewhere when I get back from
holiday next Monday.

 Another quick thing to confirm: What is the power consumption on the old
 kernel when booting with i915.i915_semaphores=1?

It idles at around 13W, which means the history of the problem must be
this:

What looks to have happened seems to be because of a merge failure in
drm:

In 3.2 Keith Packard disabled semaphores on sandybridge with

commit ebbd857e6b9a92c0aff4aacd1b1d2361d888633e
Author: Keith Packard kei...@keithp.com
Date:   Mon Dec 26 17:02:10 2011 -0800

drm/i915: Disable semaphores by default on SNB

Because of an apparent bug causing a GPU hang.

I think this is what gave me the power savings in 3.4 when the PCI layer
was ready for it.

It got re-enabled accidentally in 3.5 by a mismerge of 

commit 2911a35b2e4eb87ec48d03aeb11f019e51ae3c0d
Author: Ben Widawsky b...@bwidawsk.net
Date:   Thu Apr 5 14:47:36 2012 -0700

drm/i915: use semaphores for the display plane

Because that puts back the pre ebbd857e6b9a92c0aff4aacd1b1d2361d888633e
semaphore enabling code, but in a different place, which is probably why
it wasn't spotted, so semaphores got re-enabled on sandybridge.

Perhaps what we should be doing is verifying that semaphores aren't
sucking the same 6W of power on ivybridge and if not, just re-disable
them on sandybridge, since we'll have to do that anyway to re-apply the
bug fix.

James




___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Massive power regression going 3.4->3.5

2012-08-02 Thread James Bottomley
On Wed, 2012-08-01 at 22:08 -0700, bwidawsk wrote:
> On 2012-08-01 03:06, Chris Wilson wrote:
> > On Wed, 01 Aug 2012 10:38:36 +0100, James Bottomley
> >  wrote:
> >> On Wed, 2012-08-01 at 09:58 +0100, Chris Wilson wrote:
> >> > On Wed, 01 Aug 2012 09:45:04 +0100, James Bottomley 
> >>  wrote:
> >> > > On Wed, 2012-08-01 at 09:16 +0100, Chris Wilson wrote:
> >> > > > On Wed, 01 Aug 2012 09:06:12 +0100, James Bottomley 
> >>  wrote:
> >> > > > > I got the attached to apply and it doesn't really improve 
> >> the idle power
> >> > > > > much (12.5W).
> >> > > >
> >> > > > That's good to know. Next step is to try overriding 
> >> i915.semaphores.
> >> > > > Can you please test with i915.semaphores=0 and 
> >> i915.semaphores=1?
> >> > >
> >> > > There's not much point doing i915_semaphores=1 since that's the 
> >> default
> >> > > on gen 6 hardware, but i915_semaphores=0 recovers and idle power 
> >> of
> >> > > ~6.5W
> >> >
> >> > It is only the default if iommu is off, and changing the default
> >> > was one of the side-effects of the patch you bisected.
> >> >
> >> > Can you please login to the desktop, let it idle, record
> >> > /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
> >> > Then trace-cmd record -e i915 sleep 10s,
> >>
> >> OK, what is trace-cmd?  It looks similar to perf tools ... is that 
> >> it?
> >
> > Yes, it is roughly equivalent and you should be able to achieve the 
> > same
> > with perf trace - except I haven't done it before so I don't have 
> > quick
> > advice on how to drive it. :)
> > -Chris
> 
> Should be something like:
> perf record -f -g -e i915:* -a

I already sent the output of trace-cmd ... is that enough?

James




Re: Massive power regression going 3.4-3.5

2012-08-02 Thread James Bottomley
On Wed, 2012-08-01 at 22:08 -0700, bwidawsk wrote:
 On 2012-08-01 03:06, Chris Wilson wrote:
  On Wed, 01 Aug 2012 10:38:36 +0100, James Bottomley
  james.bottom...@hansenpartnership.com wrote:
  On Wed, 2012-08-01 at 09:58 +0100, Chris Wilson wrote:
   On Wed, 01 Aug 2012 09:45:04 +0100, James Bottomley 
  james.bottom...@hansenpartnership.com wrote:
On Wed, 2012-08-01 at 09:16 +0100, Chris Wilson wrote:
 On Wed, 01 Aug 2012 09:06:12 +0100, James Bottomley 
  james.bottom...@hansenpartnership.com wrote:
  I got the attached to apply and it doesn't really improve 
  the idle power
  much (12.5W).

 That's good to know. Next step is to try overriding 
  i915.semaphores.
 Can you please test with i915.semaphores=0 and 
  i915.semaphores=1?
   
There's not much point doing i915_semaphores=1 since that's the 
  default
on gen 6 hardware, but i915_semaphores=0 recovers and idle power 
  of
~6.5W
  
   It is only the default if iommu is off, and changing the default
   was one of the side-effects of the patch you bisected.
  
   Can you please login to the desktop, let it idle, record
   /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
   Then trace-cmd record -e i915 sleep 10s,
 
  OK, what is trace-cmd?  It looks similar to perf tools ... is that 
  it?
 
  Yes, it is roughly equivalent and you should be able to achieve the 
  same
  with perf trace - except I haven't done it before so I don't have 
  quick
  advice on how to drive it. :)
  -Chris
 
 Should be something like:
 perf record -f -g -e i915:* -a

I already sent the output of trace-cmd ... is that enough?

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Massive power regression going 3.4->3.5

2012-08-01 Thread James Bottomley
On Wed, 2012-08-01 at 09:58 +0100, Chris Wilson wrote:
> On Wed, 01 Aug 2012 09:45:04 +0100, James Bottomley  HansenPartnership.com> wrote:
> > On Wed, 2012-08-01 at 09:16 +0100, Chris Wilson wrote:
> > > On Wed, 01 Aug 2012 09:06:12 +0100, James Bottomley  > > HansenPartnership.com> wrote:
> > > > I got the attached to apply and it doesn't really improve the idle power
> > > > much (12.5W).
> > > 
> > > That's good to know. Next step is to try overriding i915.semaphores.
> > > Can you please test with i915.semaphores=0 and i915.semaphores=1?
> > 
> > There's not much point doing i915_semaphores=1 since that's the default
> > on gen 6 hardware, but i915_semaphores=0 recovers and idle power of
> > ~6.5W
> 
> It is only the default if iommu is off, and changing the default
> was one of the side-effects of the patch you bisected.
> 
> Can you please login to the desktop, let it idle, record
> /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
> Then trace-cmd record -e i915 sleep 10s, and follow up with a new pair
> of /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
> 
> This will let us see whether the pm counters are truly advancing and
> what activity the driver is performing whilst idle.

OK, so here it is

James

---

jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
GT_PERF_STATUS: 0x16c7
RPSTAT1: 0x0004160d
Render p-state ratio: 22
Render p-state VID: 199
Render p-state limit: 255
CAGF: 1100MHz
RP CUR UP EI: 96491us
RP CUR UP: 252us
RP PREV UP: 0us
RP CUR DOWN EI: 0us
RP CUR DOWN: 513us
RP PREV DOWN: 0us
Lowest (RPN) frequency: 650MHz
Nominal (RP1) frequency: 650MHz
Max non-overclocked (RP0) frequency: 1100MHz
jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_drpc_info
RC information inaccurate because somebody holds a forcewake reference 
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: yes
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: on
Core Power Down: no
RC6 "Locked to RPn" residency since boot: 0
RC6 residency since boot: 360123443
RC6+ residency since boot: 0
RC6++ residency since boot: 0
jejb at dabdike> ./git/trace-cmd/trace-cmd record -e i915 sleep 10s
trace-cmd: Permission denied
  opening '/sys/kernel/debug/tracing/tracing_on'
jejb at dabdike> sudo ./git/trace-cmd/trace-cmd record -e i915 sleep 10s
/sys/kernel/debug/tracing/events/i915/filter
/sys/kernel/debug/tracing/events/*/i915/filter
Kernel buffer statistics:
  Note: "entries" are the entries left in the kernel ring buffer and are not
recorded in the trace data. They should all be zero.

CPU: 0
entries: 0
overrun: 0
commit overrun: 0
bytes: 1080
oldest event ts:  1076.352744
now ts:  1076.651396

CPU: 1
entries: 0
overrun: 0
commit overrun: 0
bytes: 932
oldest event ts:  1067.676405
now ts:  1076.651452

CPU: 2
entries: 0
overrun: 0
commit overrun: 0
bytes: 3784
oldest event ts:  1076.090225
now ts:  1076.651501

CPU: 3
entries: 0
overrun: 0
commit overrun: 0
bytes: 0
oldest event ts: 15281105439.050279
now ts:  1076.651550

CPU0 data recorded at offset=0x39a000
221184 bytes in size
CPU1 data recorded at offset=0x3d
16384 bytes in size
CPU2 data recorded at offset=0x3d4000
32768 bytes in size
CPU3 data recorded at offset=0x3dc000
0 bytes in size
jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
GT_PERF_STATUS: 0x16c7
RPSTAT1: 0x0004160d
Render p-state ratio: 22
Render p-state VID: 199
Render p-state limit: 255
CAGF: 1100MHz
RP CUR UP EI: 49171us
RP CUR UP: 122us
RP PREV UP: 0us
RP CUR DOWN EI: 0us
RP CUR DOWN: 562us
RP PREV DOWN: 0us
Lowest (RPN) frequency: 650MHz
Nominal (RP1) frequency: 650MHz
Max non-overclocked (RP0) frequency: 1100MHz
jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_drpc_info
RC information accurate: yes
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: yes
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: RC6
Core Power Down: no
RC6 "Locked to RPn" residency since boot: 0
RC6 residency since boot: 362653127
RC6+ residency since boot: 0
RC6++ residency since boot: 0




Massive power regression going 3.4->3.5

2012-08-01 Thread James Bottomley
On Wed, 2012-08-01 at 09:58 +0100, Chris Wilson wrote:
> On Wed, 01 Aug 2012 09:45:04 +0100, James Bottomley  HansenPartnership.com> wrote:
> > On Wed, 2012-08-01 at 09:16 +0100, Chris Wilson wrote:
> > > On Wed, 01 Aug 2012 09:06:12 +0100, James Bottomley  > > HansenPartnership.com> wrote:
> > > > I got the attached to apply and it doesn't really improve the idle power
> > > > much (12.5W).
> > > 
> > > That's good to know. Next step is to try overriding i915.semaphores.
> > > Can you please test with i915.semaphores=0 and i915.semaphores=1?
> > 
> > There's not much point doing i915_semaphores=1 since that's the default
> > on gen 6 hardware, but i915_semaphores=0 recovers and idle power of
> > ~6.5W
> 
> It is only the default if iommu is off, and changing the default
> was one of the side-effects of the patch you bisected.
> 
> Can you please login to the desktop, let it idle, record
> /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
> Then trace-cmd record -e i915 sleep 10s,

OK, what is trace-cmd?  It looks similar to perf tools ... is that it?

James


>  and follow up with a new pair
> of /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
> 
> This will let us see whether the pm counters are truly advancing and
> what activity the driver is performing whilst idle.
> -Chris
> 




Massive power regression going 3.4->3.5

2012-08-01 Thread James Bottomley
On Wed, 2012-08-01 at 09:58 +0100, Chris Wilson wrote:
> On Wed, 01 Aug 2012 09:45:04 +0100, James Bottomley  HansenPartnership.com> wrote:
> > On Wed, 2012-08-01 at 09:16 +0100, Chris Wilson wrote:
> > > On Wed, 01 Aug 2012 09:06:12 +0100, James Bottomley  > > HansenPartnership.com> wrote:
> > > > I got the attached to apply and it doesn't really improve the idle power
> > > > much (12.5W).
> > > 
> > > That's good to know. Next step is to try overriding i915.semaphores.
> > > Can you please test with i915.semaphores=0 and i915.semaphores=1?
> > 
> > There's not much point doing i915_semaphores=1 since that's the default
> > on gen 6 hardware, but i915_semaphores=0 recovers and idle power of
> > ~6.5W
> 
> It is only the default if iommu is off, and changing the default
> was one of the side-effects of the patch you bisected.

Sandybridge mobile doesn't have an iommu (or at least, if it does, the
kernel doesn't detect it).

> Can you please login to the desktop, let it idle, record
> /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
> Then trace-cmd record -e i915 sleep 10s, and follow up with a new pair
> of /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
> 
> This will let us see whether the pm counters are truly advancing and
> what activity the driver is performing whilst idle.

With or without i915_semaphore=0?

James




Massive power regression going 3.4->3.5

2012-08-01 Thread James Bottomley
On Wed, 2012-08-01 at 09:16 +0100, Chris Wilson wrote:
> On Wed, 01 Aug 2012 09:06:12 +0100, James Bottomley  HansenPartnership.com> wrote:
> > I got the attached to apply and it doesn't really improve the idle power
> > much (12.5W).
> 
> That's good to know. Next step is to try overriding i915.semaphores.
> Can you please test with i915.semaphores=0 and i915.semaphores=1?

There's not much point doing i915_semaphores=1 since that's the default
on gen 6 hardware, but i915_semaphores=0 recovers and idle power of
~6.5W

James




Massive power regression going 3.4->3.5

2012-08-01 Thread James Bottomley
On Tue, 2012-07-31 at 20:24 +0100, Chris Wilson wrote:
> On Tue, 31 Jul 2012 11:14:17 +0100, Chris Wilson  chris-wilson.co.uk> wrote:
> > On Tue, 31 Jul 2012 10:57:10 +0100, James Bottomley  > HansenPartnership.com> wrote:
> > > > When did you inspect the debug files? One effect I can imagine is that
> > > > if your system was previously stuck at RPn and never upclocking the GPU
> > > > when X starts. The question would then be what is preventing the GPU
> > > > from reaching its lowest power state again.
> > > 
> > > After I logged into an xfce4 session and powertop showed idle had been
> > > reached.
> 
> That you are using xfce4 makes the use of semaphores for pageflips as
> being the root cause even more suspect. Pageflips are only used for a
> fullscreen DRI client caalling SwapBuffers, to my knowledge xfce4 does
> not use DRI at all - its compositing manager is XRender based if you
> happen to be using it.
> 
> Please can you try the small patch to disable the use of semaphores for
> pageflips and see if the regression remains (which I judge it will...):
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c
> b/drivers/gpu/drm/i915/i915_gem.c
> index 5c4657a..f301f2f 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3067,7 +3067,7 @@ i915_gem_object_pin_to_display_plane(struct 
> drm_i915_gem_o
> return ret;
>  
> if (pipelined != obj->ring) {
> -   ret = i915_gem_object_sync(obj, pipelined);
> +   ret = i915_gem_object_wait_rendering(obj);
> if (ret)
> return ret;
> }

Your patch doesn't apply ... I think because in v3.5 this line is
displaced by about 200 lines in the file.

patching file drivers/gpu/drm/i915/i915_gem.c
Hunk #1 FAILED at 3067.
1 out of 1 hunk FAILED -- saving rejects to file
drivers/gpu/drm/i915/i915_gem.c.rej

I got the attached to apply and it doesn't really improve the idle power
much (12.5W).

James

---

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 288d7b8..2f3f279 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2869,7 +2869,7 @@ i915_gem_object_pin_to_display_plane(struct 
drm_i915_gem_object *obj,
return ret;

if (pipelined != obj->ring) {
-   ret = i915_gem_object_sync(obj, pipelined);
+   ret = i915_gem_object_wait_rendering(obj);
if (ret)
return ret;
}




Re: Massive power regression going 3.4-3.5

2012-08-01 Thread James Bottomley
On Tue, 2012-07-31 at 20:24 +0100, Chris Wilson wrote:
 On Tue, 31 Jul 2012 11:14:17 +0100, Chris Wilson ch...@chris-wilson.co.uk 
 wrote:
  On Tue, 31 Jul 2012 10:57:10 +0100, James Bottomley 
  james.bottom...@hansenpartnership.com wrote:
When did you inspect the debug files? One effect I can imagine is that
if your system was previously stuck at RPn and never upclocking the GPU
when X starts. The question would then be what is preventing the GPU
from reaching its lowest power state again.
   
   After I logged into an xfce4 session and powertop showed idle had been
   reached.
 
 That you are using xfce4 makes the use of semaphores for pageflips as
 being the root cause even more suspect. Pageflips are only used for a
 fullscreen DRI client caalling SwapBuffers, to my knowledge xfce4 does
 not use DRI at all - its compositing manager is XRender based if you
 happen to be using it.
 
 Please can you try the small patch to disable the use of semaphores for
 pageflips and see if the regression remains (which I judge it will...):
 
 diff --git a/drivers/gpu/drm/i915/i915_gem.c
 b/drivers/gpu/drm/i915/i915_gem.c
 index 5c4657a..f301f2f 100644
 --- a/drivers/gpu/drm/i915/i915_gem.c
 +++ b/drivers/gpu/drm/i915/i915_gem.c
 @@ -3067,7 +3067,7 @@ i915_gem_object_pin_to_display_plane(struct 
 drm_i915_gem_o
 return ret;
  
 if (pipelined != obj-ring) {
 -   ret = i915_gem_object_sync(obj, pipelined);
 +   ret = i915_gem_object_wait_rendering(obj);
 if (ret)
 return ret;
 }

Your patch doesn't apply ... I think because in v3.5 this line is
displaced by about 200 lines in the file.

patching file drivers/gpu/drm/i915/i915_gem.c
Hunk #1 FAILED at 3067.
1 out of 1 hunk FAILED -- saving rejects to file
drivers/gpu/drm/i915/i915_gem.c.rej

I got the attached to apply and it doesn't really improve the idle power
much (12.5W).

James

---

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 288d7b8..2f3f279 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2869,7 +2869,7 @@ i915_gem_object_pin_to_display_plane(struct 
drm_i915_gem_object *obj,
return ret;
 
if (pipelined != obj-ring) {
-   ret = i915_gem_object_sync(obj, pipelined);
+   ret = i915_gem_object_wait_rendering(obj);
if (ret)
return ret;
}


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Massive power regression going 3.4-3.5

2012-08-01 Thread James Bottomley
On Wed, 2012-08-01 at 09:16 +0100, Chris Wilson wrote:
 On Wed, 01 Aug 2012 09:06:12 +0100, James Bottomley 
 james.bottom...@hansenpartnership.com wrote:
  I got the attached to apply and it doesn't really improve the idle power
  much (12.5W).
 
 That's good to know. Next step is to try overriding i915.semaphores.
 Can you please test with i915.semaphores=0 and i915.semaphores=1?

There's not much point doing i915_semaphores=1 since that's the default
on gen 6 hardware, but i915_semaphores=0 recovers and idle power of
~6.5W

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Massive power regression going 3.4-3.5

2012-08-01 Thread James Bottomley
On Wed, 2012-08-01 at 09:58 +0100, Chris Wilson wrote:
 On Wed, 01 Aug 2012 09:45:04 +0100, James Bottomley 
 james.bottom...@hansenpartnership.com wrote:
  On Wed, 2012-08-01 at 09:16 +0100, Chris Wilson wrote:
   On Wed, 01 Aug 2012 09:06:12 +0100, James Bottomley 
   james.bottom...@hansenpartnership.com wrote:
I got the attached to apply and it doesn't really improve the idle power
much (12.5W).
   
   That's good to know. Next step is to try overriding i915.semaphores.
   Can you please test with i915.semaphores=0 and i915.semaphores=1?
  
  There's not much point doing i915_semaphores=1 since that's the default
  on gen 6 hardware, but i915_semaphores=0 recovers and idle power of
  ~6.5W
 
 It is only the default if iommu is off, and changing the default
 was one of the side-effects of the patch you bisected.

Sandybridge mobile doesn't have an iommu (or at least, if it does, the
kernel doesn't detect it).

 Can you please login to the desktop, let it idle, record
 /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
 Then trace-cmd record -e i915 sleep 10s, and follow up with a new pair
 of /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
 
 This will let us see whether the pm counters are truly advancing and
 what activity the driver is performing whilst idle.

With or without i915_semaphore=0?

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Massive power regression going 3.4-3.5

2012-08-01 Thread James Bottomley
On Wed, 2012-08-01 at 09:58 +0100, Chris Wilson wrote:
 On Wed, 01 Aug 2012 09:45:04 +0100, James Bottomley 
 james.bottom...@hansenpartnership.com wrote:
  On Wed, 2012-08-01 at 09:16 +0100, Chris Wilson wrote:
   On Wed, 01 Aug 2012 09:06:12 +0100, James Bottomley 
   james.bottom...@hansenpartnership.com wrote:
I got the attached to apply and it doesn't really improve the idle power
much (12.5W).
   
   That's good to know. Next step is to try overriding i915.semaphores.
   Can you please test with i915.semaphores=0 and i915.semaphores=1?
  
  There's not much point doing i915_semaphores=1 since that's the default
  on gen 6 hardware, but i915_semaphores=0 recovers and idle power of
  ~6.5W
 
 It is only the default if iommu is off, and changing the default
 was one of the side-effects of the patch you bisected.
 
 Can you please login to the desktop, let it idle, record
 /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
 Then trace-cmd record -e i915 sleep 10s,

OK, what is trace-cmd?  It looks similar to perf tools ... is that it?

James


  and follow up with a new pair
 of /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
 
 This will let us see whether the pm counters are truly advancing and
 what activity the driver is performing whilst idle.
 -Chris
 


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Massive power regression going 3.4-3.5

2012-08-01 Thread James Bottomley
On Wed, 2012-08-01 at 09:58 +0100, Chris Wilson wrote:
 On Wed, 01 Aug 2012 09:45:04 +0100, James Bottomley 
 james.bottom...@hansenpartnership.com wrote:
  On Wed, 2012-08-01 at 09:16 +0100, Chris Wilson wrote:
   On Wed, 01 Aug 2012 09:06:12 +0100, James Bottomley 
   james.bottom...@hansenpartnership.com wrote:
I got the attached to apply and it doesn't really improve the idle power
much (12.5W).
   
   That's good to know. Next step is to try overriding i915.semaphores.
   Can you please test with i915.semaphores=0 and i915.semaphores=1?
  
  There's not much point doing i915_semaphores=1 since that's the default
  on gen 6 hardware, but i915_semaphores=0 recovers and idle power of
  ~6.5W
 
 It is only the default if iommu is off, and changing the default
 was one of the side-effects of the patch you bisected.
 
 Can you please login to the desktop, let it idle, record
 /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
 Then trace-cmd record -e i915 sleep 10s, and follow up with a new pair
 of /sys/kernel/debug/dri/0/i915_cur_delayinfo and .../i915_drpc_info.
 
 This will let us see whether the pm counters are truly advancing and
 what activity the driver is performing whilst idle.

OK, so here it is

James

---

jejb@dabdike cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
GT_PERF_STATUS: 0x16c7
RPSTAT1: 0x0004160d
Render p-state ratio: 22
Render p-state VID: 199
Render p-state limit: 255
CAGF: 1100MHz
RP CUR UP EI: 96491us
RP CUR UP: 252us
RP PREV UP: 0us
RP CUR DOWN EI: 0us
RP CUR DOWN: 513us
RP PREV DOWN: 0us
Lowest (RPN) frequency: 650MHz
Nominal (RP1) frequency: 650MHz
Max non-overclocked (RP0) frequency: 1100MHz
jejb@dabdike cat /sys/kernel/debug/dri/0/i915_drpc_info
RC information inaccurate because somebody holds a forcewake reference 
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: yes
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: on
Core Power Down: no
RC6 Locked to RPn residency since boot: 0
RC6 residency since boot: 360123443
RC6+ residency since boot: 0
RC6++ residency since boot: 0
jejb@dabdike ./git/trace-cmd/trace-cmd record -e i915 sleep 10s
trace-cmd: Permission denied
  opening '/sys/kernel/debug/tracing/tracing_on'
jejb@dabdike sudo ./git/trace-cmd/trace-cmd record -e i915 sleep 10s
/sys/kernel/debug/tracing/events/i915/filter
/sys/kernel/debug/tracing/events/*/i915/filter
Kernel buffer statistics:
  Note: entries are the entries left in the kernel ring buffer and are not
recorded in the trace data. They should all be zero.

CPU: 0
entries: 0
overrun: 0
commit overrun: 0
bytes: 1080
oldest event ts:  1076.352744
now ts:  1076.651396

CPU: 1
entries: 0
overrun: 0
commit overrun: 0
bytes: 932
oldest event ts:  1067.676405
now ts:  1076.651452

CPU: 2
entries: 0
overrun: 0
commit overrun: 0
bytes: 3784
oldest event ts:  1076.090225
now ts:  1076.651501

CPU: 3
entries: 0
overrun: 0
commit overrun: 0
bytes: 0
oldest event ts: 15281105439.050279
now ts:  1076.651550

CPU0 data recorded at offset=0x39a000
221184 bytes in size
CPU1 data recorded at offset=0x3d
16384 bytes in size
CPU2 data recorded at offset=0x3d4000
32768 bytes in size
CPU3 data recorded at offset=0x3dc000
0 bytes in size
jejb@dabdike cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
GT_PERF_STATUS: 0x16c7
RPSTAT1: 0x0004160d
Render p-state ratio: 22
Render p-state VID: 199
Render p-state limit: 255
CAGF: 1100MHz
RP CUR UP EI: 49171us
RP CUR UP: 122us
RP PREV UP: 0us
RP CUR DOWN EI: 0us
RP CUR DOWN: 562us
RP PREV DOWN: 0us
Lowest (RPN) frequency: 650MHz
Nominal (RP1) frequency: 650MHz
Max non-overclocked (RP0) frequency: 1100MHz
jejb@dabdike cat /sys/kernel/debug/dri/0/i915_drpc_info
RC information accurate: yes
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: yes
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: RC6
Core Power Down: no
RC6 Locked to RPn residency since boot: 0
RC6 residency since boot: 362653127
RC6+ residency since boot: 0
RC6++ residency since boot: 0


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Massive power regression going 3.4->3.5

2012-07-31 Thread James Bottomley
On Tue, 2012-07-31 at 16:09 +0100, James Bottomley wrote:
> On Tue, 2012-07-31 at 07:27 -0700, Keith Packard wrote:
> > James Bottomley  writes:
> > 
> > > on 3.5 killing X causes idle power to go 14W -> 5.9W
> > > on 3.4.6 killing X causes idle power to go 6.8W -> 5.7W
> > 
> > That's actually pretty good news -- you're just not getting to RC6
> > when X is running, but RC6 is otherwise working. And, yes, the GPU
> > really can suck that much power. The debug info that Chris pointed you
> > at should tell a more complete story. For comparison, on my sandybridge
> > box this morning:
> 
> Well, I don't know what it means, but I have a culprit from bisect (I
> managed to manually verify the bisection heads which would step back
> into 3.3).
> 
> 2911a35b2e4eb87ec48d03aeb11f019e51ae3c0d is the first bad commit
> commit 2911a35b2e4eb87ec48d03aeb11f019e51ae3c0d
> Author: Ben Widawsky 
> Date:   Thu Apr 5 14:47:36 2012 -0700
> 
> drm/i915: use semaphores for the display plane
> 
> I'm going to try building 3.5 with this reverted (it doesn't revert
> cleanly).

OK, I confirm that reverting this patch in 3.5 recovers a ~6.5W idle
power.

James




Massive power regression going 3.4->3.5

2012-07-31 Thread James Bottomley
On Tue, 2012-07-31 at 07:27 -0700, Keith Packard wrote:
> James Bottomley  writes:
> 
> > on 3.5 killing X causes idle power to go 14W -> 5.9W
> > on 3.4.6 killing X causes idle power to go 6.8W -> 5.7W
> 
> That's actually pretty good news -- you're just not getting to RC6
> when X is running, but RC6 is otherwise working. And, yes, the GPU
> really can suck that much power. The debug info that Chris pointed you
> at should tell a more complete story. For comparison, on my sandybridge
> box this morning:

Well, I don't know what it means, but I have a culprit from bisect (I
managed to manually verify the bisection heads which would step back
into 3.3).

2911a35b2e4eb87ec48d03aeb11f019e51ae3c0d is the first bad commit
commit 2911a35b2e4eb87ec48d03aeb11f019e51ae3c0d
Author: Ben Widawsky 
Date:   Thu Apr 5 14:47:36 2012 -0700

drm/i915: use semaphores for the display plane

I'm going to try building 3.5 with this reverted (it doesn't revert
cleanly).

James




Massive power regression going 3.4->3.5

2012-07-31 Thread James Bottomley
On Tue, 2012-07-31 at 10:54 +0100, Chris Wilson wrote:
> On Tue, 31 Jul 2012 10:37:35 +0100, James Bottomley  HansenPartnership.com> wrote:
> > On Tue, 2012-07-31 at 09:28 +0100, Chris Wilson wrote:
> > > On Tue, 31 Jul 2012 09:06:42 +0100, James Bottomley  > > HansenPartnership.com> wrote:
> > > > Actually, bad news: it looks like the problem is drm:
> > > > 
> > > > on 3.5 killing X causes idle power to go 14W -> 5.9W
> > > > on 3.4.6 killing X causes idle power to go 6.8W -> 5.7W
> > > 
> > > The files that will be the most interesting to compare at first are:
> > > 
> > > /sys/kernel/debug/dri/0/i915_drpc_info
> > > /sys/kernel/debug/dri/0/i915_cur_delayinfo
> > > /sys/kernel/debug/dri/0/i915_fbc_status
> > 
> > This is for the good kernel 3.4.6
> > 
> > jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_drpc_info
> > RC information accurate: yes
> > Video Turbo Mode: yes
> > HW control enabled: yes
> > SW control enabled: no
> > RC1e Enabled: no
> > RC6 Enabled: yes
> > Deep RC6 Enabled: no
> > Deepest RC6 Enabled: no
> > Current RC state: RC6
> > Core Power Down: no
> > jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
> > GT_PERF_STATUS: 0x0d29
> > RPSTAT1: 0x00040d00
> > Render p-state ratio: 13
> > Render p-state VID: 41
> > Render p-state limit: 255
> > CAGF: 650MHz
> > RP CUR UP EI: 20459us
> > RP CUR UP: 172us
> > RP PREV UP: 0us
> > RP CUR DOWN EI: 0us
> > RP CUR DOWN: 0us
> > RP PREV DOWN: 0us
> > Lowest (RPN) frequency: 650MHz
> > Nominal (RP1) frequency: 650MHz
> > Max non-overclocked (RP0) frequency: 1100MHz
> > jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_fbc_status
> > FBC disabled: disabled per module param (default off)
> > 
> > And the bad kernel 3.5
> > 
> > jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_drpc_info
> > RC information accurate: yes
> > Video Turbo Mode: yes
> > HW control enabled: yes
> > SW control enabled: no
> > RC1e Enabled: no
> > RC6 Enabled: yes
> > Deep RC6 Enabled: no
> > Deepest RC6 Enabled: no
> > Current RC state: RC6
> > Core Power Down: no
> > RC6 "Locked to RPn" residency since boot: 0
> > RC6 residency since boot: 97671911
> > RC6+ residency since boot: 0
> > RC6++ residency since boot: 0
> > jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
> > GT_PERF_STATUS: 0x0d29
> > RPSTAT1: 0x00048d00
> > Render p-state ratio: 13
> > Render p-state VID: 41
> > Render p-state limit: 255
> > CAGF: 650MHz
> > RP CUR UP EI: 63719us
> > RP CUR UP: 26us
> > RP PREV UP: 0us
> > RP CUR DOWN EI: 0us
> > RP CUR DOWN: 0us
> > RP PREV DOWN: 0us
> > Lowest (RPN) frequency: 650MHz
> > Nominal (RP1) frequency: 650MHz
> > Max non-overclocked (RP0) frequency: 1100MHz
> > jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_fbc_status
> > FBC disabled: disabled per module param (default off)
> 
> Ok, that rules out the the easy case of rc6 being disabled or not
> functioning at all, which could easily account for 6W.
> 
> When did you inspect the debug files? One effect I can imagine is that
> if your system was previously stuck at RPn and never upclocking the GPU
> when X starts. The question would then be what is preventing the GPU
> from reaching its lowest power state again.

After I logged into an xfce4 session and powertop showed idle had been
reached.

James




Massive power regression going 3.4->3.5

2012-07-31 Thread James Bottomley
On Tue, 2012-07-31 at 09:28 +0100, Chris Wilson wrote:
> On Tue, 31 Jul 2012 09:06:42 +0100, James Bottomley  HansenPartnership.com> wrote:
> > Actually, bad news: it looks like the problem is drm:
> > 
> > on 3.5 killing X causes idle power to go 14W -> 5.9W
> > on 3.4.6 killing X causes idle power to go 6.8W -> 5.7W
> 
> The files that will be the most interesting to compare at first are:
> 
> /sys/kernel/debug/dri/0/i915_drpc_info
> /sys/kernel/debug/dri/0/i915_cur_delayinfo
> /sys/kernel/debug/dri/0/i915_fbc_status

This is for the good kernel 3.4.6

jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_drpc_info
RC information accurate: yes
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: yes
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: RC6
Core Power Down: no
jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
GT_PERF_STATUS: 0x0d29
RPSTAT1: 0x00040d00
Render p-state ratio: 13
Render p-state VID: 41
Render p-state limit: 255
CAGF: 650MHz
RP CUR UP EI: 20459us
RP CUR UP: 172us
RP PREV UP: 0us
RP CUR DOWN EI: 0us
RP CUR DOWN: 0us
RP PREV DOWN: 0us
Lowest (RPN) frequency: 650MHz
Nominal (RP1) frequency: 650MHz
Max non-overclocked (RP0) frequency: 1100MHz
jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_fbc_status
FBC disabled: disabled per module param (default off)

And the bad kernel 3.5

jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_drpc_info
RC information accurate: yes
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: yes
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: RC6
Core Power Down: no
RC6 "Locked to RPn" residency since boot: 0
RC6 residency since boot: 97671911
RC6+ residency since boot: 0
RC6++ residency since boot: 0
jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
GT_PERF_STATUS: 0x0d29
RPSTAT1: 0x00048d00
Render p-state ratio: 13
Render p-state VID: 41
Render p-state limit: 255
CAGF: 650MHz
RP CUR UP EI: 63719us
RP CUR UP: 26us
RP PREV UP: 0us
RP CUR DOWN EI: 0us
RP CUR DOWN: 0us
RP PREV DOWN: 0us
Lowest (RPN) frequency: 650MHz
Nominal (RP1) frequency: 650MHz
Max non-overclocked (RP0) frequency: 1100MHz
jejb at dabdike> cat /sys/kernel/debug/dri/0/i915_fbc_status
FBC disabled: disabled per module param (default off)


> However if it was simple regression in drm, then the bisect would have
> continued to work despite the merge point jumping between 3.4 and 3.5,
> right?

No ... the bisect stepped back into 3.3 which mean I lost the ability to
detect the regression.  I think it might be fixable given I have a more
precise identifier for the tree because it looks like none of the roots
of the drm tree is before 3.4-rc.

James




Massive power regression going 3.4->3.5

2012-07-31 Thread James Bottomley
On Tue, 2012-07-31 at 08:31 +0100, James Bottomley wrote:
> On Mon, 2012-07-30 at 11:23 -0700, Keith Packard wrote:
> > James Bottomley  writes:
> > 
> > > On Mon, 2012-07-30 at 09:33 -0700, Keith Packard wrote:
> > >> James Bottomley  writes:
> > >> 
> > >> > OK, I've run the bisect as far as I can.  It looks to be in the drm
> > >> > tree.  Unfortunately, this tree has several merge points, some of which
> > >> > go further back than v3.4.  Unfortunately, once the bisect steps back
> > >> > before 3.4, we lose the changes that gave us the power savings, making
> > >> > further debugging impossible
> > >> 
> > >> What machine is this on? There are a few 'disable some power savings'
> > >> patches in that list to work around issues on various machines; knowing
> > >> what machine you're using can isolate which ones might have had some
> > >> effect on power usage...
> > >
> > > Lenovo X220i
> > 
> > I don't see a whole lot of context from the elided email bits you'd sent
> > previously; can you summarize the issue in terms of how much power
> > savings you're losing, how you're measuring it and what's going on in
> > the system when the power savings is different?
> 
> Sure.  Going from 3.3->3.4 we saw massive increase in power savings due
> to various autosuspend updates.  The idle power consumption of the X220i
> went from about 13W to 6.5W.  In 3.5 this seems to be reversed, with the
> idle power consumption back up to around 14W.  I can't quite believe the
> graphics chip is responsible for around 7W, so it looks like it's some
> interaction between graphics and other subsystems.
> 
> > Have you tried measuring power with X not running? How about with
> > compositing and other desktop effects disabled? 
> 
> Sure, I can try doing that ... but remember this is a system where the
> drm is used for the console as well.

Actually, bad news: it looks like the problem is drm:

on 3.5 killing X causes idle power to go 14W -> 5.9W
on 3.4.6 killing X causes idle power to go 6.8W -> 5.7W

James




Massive power regression going 3.4->3.5

2012-07-31 Thread James Bottomley
On Mon, 2012-07-30 at 11:23 -0700, Keith Packard wrote:
> James Bottomley  writes:
> 
> > On Mon, 2012-07-30 at 09:33 -0700, Keith Packard wrote:
> >> James Bottomley  writes:
> >> 
> >> > OK, I've run the bisect as far as I can.  It looks to be in the drm
> >> > tree.  Unfortunately, this tree has several merge points, some of which
> >> > go further back than v3.4.  Unfortunately, once the bisect steps back
> >> > before 3.4, we lose the changes that gave us the power savings, making
> >> > further debugging impossible
> >> 
> >> What machine is this on? There are a few 'disable some power savings'
> >> patches in that list to work around issues on various machines; knowing
> >> what machine you're using can isolate which ones might have had some
> >> effect on power usage...
> >
> > Lenovo X220i
> 
> I don't see a whole lot of context from the elided email bits you'd sent
> previously; can you summarize the issue in terms of how much power
> savings you're losing, how you're measuring it and what's going on in
> the system when the power savings is different?

Sure.  Going from 3.3->3.4 we saw massive increase in power savings due
to various autosuspend updates.  The idle power consumption of the X220i
went from about 13W to 6.5W.  In 3.5 this seems to be reversed, with the
idle power consumption back up to around 14W.  I can't quite believe the
graphics chip is responsible for around 7W, so it looks like it's some
interaction between graphics and other subsystems.

> Have you tried measuring power with X not running? How about with
> compositing and other desktop effects disabled? 

Sure, I can try doing that ... but remember this is a system where the
drm is used for the console as well.

James




Re: Massive power regression going 3.4-3.5

2012-07-31 Thread James Bottomley
On Tue, 2012-07-31 at 08:31 +0100, James Bottomley wrote:
 On Mon, 2012-07-30 at 11:23 -0700, Keith Packard wrote:
  James Bottomley james.bottom...@hansenpartnership.com writes:
  
   On Mon, 2012-07-30 at 09:33 -0700, Keith Packard wrote:
   James Bottomley james.bottom...@hansenpartnership.com writes:
   
OK, I've run the bisect as far as I can.  It looks to be in the drm
tree.  Unfortunately, this tree has several merge points, some of which
go further back than v3.4.  Unfortunately, once the bisect steps back
before 3.4, we lose the changes that gave us the power savings, making
further debugging impossible
   
   What machine is this on? There are a few 'disable some power savings'
   patches in that list to work around issues on various machines; knowing
   what machine you're using can isolate which ones might have had some
   effect on power usage...
  
   Lenovo X220i
  
  I don't see a whole lot of context from the elided email bits you'd sent
  previously; can you summarize the issue in terms of how much power
  savings you're losing, how you're measuring it and what's going on in
  the system when the power savings is different?
 
 Sure.  Going from 3.3-3.4 we saw massive increase in power savings due
 to various autosuspend updates.  The idle power consumption of the X220i
 went from about 13W to 6.5W.  In 3.5 this seems to be reversed, with the
 idle power consumption back up to around 14W.  I can't quite believe the
 graphics chip is responsible for around 7W, so it looks like it's some
 interaction between graphics and other subsystems.
 
  Have you tried measuring power with X not running? How about with
  compositing and other desktop effects disabled? 
 
 Sure, I can try doing that ... but remember this is a system where the
 drm is used for the console as well.

Actually, bad news: it looks like the problem is drm:

on 3.5 killing X causes idle power to go 14W - 5.9W
on 3.4.6 killing X causes idle power to go 6.8W - 5.7W

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Massive power regression going 3.4-3.5

2012-07-31 Thread James Bottomley
On Tue, 2012-07-31 at 09:28 +0100, Chris Wilson wrote:
 On Tue, 31 Jul 2012 09:06:42 +0100, James Bottomley 
 james.bottom...@hansenpartnership.com wrote:
  Actually, bad news: it looks like the problem is drm:
  
  on 3.5 killing X causes idle power to go 14W - 5.9W
  on 3.4.6 killing X causes idle power to go 6.8W - 5.7W
 
 The files that will be the most interesting to compare at first are:
 
 /sys/kernel/debug/dri/0/i915_drpc_info
 /sys/kernel/debug/dri/0/i915_cur_delayinfo
 /sys/kernel/debug/dri/0/i915_fbc_status

This is for the good kernel 3.4.6

jejb@dabdike cat /sys/kernel/debug/dri/0/i915_drpc_info
RC information accurate: yes
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: yes
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: RC6
Core Power Down: no
jejb@dabdike cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
GT_PERF_STATUS: 0x0d29
RPSTAT1: 0x00040d00
Render p-state ratio: 13
Render p-state VID: 41
Render p-state limit: 255
CAGF: 650MHz
RP CUR UP EI: 20459us
RP CUR UP: 172us
RP PREV UP: 0us
RP CUR DOWN EI: 0us
RP CUR DOWN: 0us
RP PREV DOWN: 0us
Lowest (RPN) frequency: 650MHz
Nominal (RP1) frequency: 650MHz
Max non-overclocked (RP0) frequency: 1100MHz
jejb@dabdike cat /sys/kernel/debug/dri/0/i915_fbc_status
FBC disabled: disabled per module param (default off)

And the bad kernel 3.5

jejb@dabdike cat /sys/kernel/debug/dri/0/i915_drpc_info
RC information accurate: yes
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: yes
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: RC6
Core Power Down: no
RC6 Locked to RPn residency since boot: 0
RC6 residency since boot: 97671911
RC6+ residency since boot: 0
RC6++ residency since boot: 0
jejb@dabdike cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
GT_PERF_STATUS: 0x0d29
RPSTAT1: 0x00048d00
Render p-state ratio: 13
Render p-state VID: 41
Render p-state limit: 255
CAGF: 650MHz
RP CUR UP EI: 63719us
RP CUR UP: 26us
RP PREV UP: 0us
RP CUR DOWN EI: 0us
RP CUR DOWN: 0us
RP PREV DOWN: 0us
Lowest (RPN) frequency: 650MHz
Nominal (RP1) frequency: 650MHz
Max non-overclocked (RP0) frequency: 1100MHz
jejb@dabdike cat /sys/kernel/debug/dri/0/i915_fbc_status
FBC disabled: disabled per module param (default off)


 However if it was simple regression in drm, then the bisect would have
 continued to work despite the merge point jumping between 3.4 and 3.5,
 right?

No ... the bisect stepped back into 3.3 which mean I lost the ability to
detect the regression.  I think it might be fixable given I have a more
precise identifier for the tree because it looks like none of the roots
of the drm tree is before 3.4-rc.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Massive power regression going 3.4-3.5

2012-07-31 Thread James Bottomley
On Tue, 2012-07-31 at 10:54 +0100, Chris Wilson wrote:
 On Tue, 31 Jul 2012 10:37:35 +0100, James Bottomley 
 james.bottom...@hansenpartnership.com wrote:
  On Tue, 2012-07-31 at 09:28 +0100, Chris Wilson wrote:
   On Tue, 31 Jul 2012 09:06:42 +0100, James Bottomley 
   james.bottom...@hansenpartnership.com wrote:
Actually, bad news: it looks like the problem is drm:

on 3.5 killing X causes idle power to go 14W - 5.9W
on 3.4.6 killing X causes idle power to go 6.8W - 5.7W
   
   The files that will be the most interesting to compare at first are:
   
   /sys/kernel/debug/dri/0/i915_drpc_info
   /sys/kernel/debug/dri/0/i915_cur_delayinfo
   /sys/kernel/debug/dri/0/i915_fbc_status
  
  This is for the good kernel 3.4.6
  
  jejb@dabdike cat /sys/kernel/debug/dri/0/i915_drpc_info
  RC information accurate: yes
  Video Turbo Mode: yes
  HW control enabled: yes
  SW control enabled: no
  RC1e Enabled: no
  RC6 Enabled: yes
  Deep RC6 Enabled: no
  Deepest RC6 Enabled: no
  Current RC state: RC6
  Core Power Down: no
  jejb@dabdike cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
  GT_PERF_STATUS: 0x0d29
  RPSTAT1: 0x00040d00
  Render p-state ratio: 13
  Render p-state VID: 41
  Render p-state limit: 255
  CAGF: 650MHz
  RP CUR UP EI: 20459us
  RP CUR UP: 172us
  RP PREV UP: 0us
  RP CUR DOWN EI: 0us
  RP CUR DOWN: 0us
  RP PREV DOWN: 0us
  Lowest (RPN) frequency: 650MHz
  Nominal (RP1) frequency: 650MHz
  Max non-overclocked (RP0) frequency: 1100MHz
  jejb@dabdike cat /sys/kernel/debug/dri/0/i915_fbc_status
  FBC disabled: disabled per module param (default off)
  
  And the bad kernel 3.5
  
  jejb@dabdike cat /sys/kernel/debug/dri/0/i915_drpc_info
  RC information accurate: yes
  Video Turbo Mode: yes
  HW control enabled: yes
  SW control enabled: no
  RC1e Enabled: no
  RC6 Enabled: yes
  Deep RC6 Enabled: no
  Deepest RC6 Enabled: no
  Current RC state: RC6
  Core Power Down: no
  RC6 Locked to RPn residency since boot: 0
  RC6 residency since boot: 97671911
  RC6+ residency since boot: 0
  RC6++ residency since boot: 0
  jejb@dabdike cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
  GT_PERF_STATUS: 0x0d29
  RPSTAT1: 0x00048d00
  Render p-state ratio: 13
  Render p-state VID: 41
  Render p-state limit: 255
  CAGF: 650MHz
  RP CUR UP EI: 63719us
  RP CUR UP: 26us
  RP PREV UP: 0us
  RP CUR DOWN EI: 0us
  RP CUR DOWN: 0us
  RP PREV DOWN: 0us
  Lowest (RPN) frequency: 650MHz
  Nominal (RP1) frequency: 650MHz
  Max non-overclocked (RP0) frequency: 1100MHz
  jejb@dabdike cat /sys/kernel/debug/dri/0/i915_fbc_status
  FBC disabled: disabled per module param (default off)
 
 Ok, that rules out the the easy case of rc6 being disabled or not
 functioning at all, which could easily account for 6W.
 
 When did you inspect the debug files? One effect I can imagine is that
 if your system was previously stuck at RPn and never upclocking the GPU
 when X starts. The question would then be what is preventing the GPU
 from reaching its lowest power state again.

After I logged into an xfce4 session and powertop showed idle had been
reached.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Massive power regression going 3.4-3.5

2012-07-31 Thread James Bottomley
On Tue, 2012-07-31 at 07:27 -0700, Keith Packard wrote:
 James Bottomley james.bottom...@hansenpartnership.com writes:
 
  on 3.5 killing X causes idle power to go 14W - 5.9W
  on 3.4.6 killing X causes idle power to go 6.8W - 5.7W
 
 That's actually pretty good news -- you're just not getting to RC6
 when X is running, but RC6 is otherwise working. And, yes, the GPU
 really can suck that much power. The debug info that Chris pointed you
 at should tell a more complete story. For comparison, on my sandybridge
 box this morning:

Well, I don't know what it means, but I have a culprit from bisect (I
managed to manually verify the bisection heads which would step back
into 3.3).

2911a35b2e4eb87ec48d03aeb11f019e51ae3c0d is the first bad commit
commit 2911a35b2e4eb87ec48d03aeb11f019e51ae3c0d
Author: Ben Widawsky b...@bwidawsk.net
Date:   Thu Apr 5 14:47:36 2012 -0700

drm/i915: use semaphores for the display plane

I'm going to try building 3.5 with this reverted (it doesn't revert
cleanly).

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Massive power regression going 3.4-3.5

2012-07-31 Thread James Bottomley
On Tue, 2012-07-31 at 16:09 +0100, James Bottomley wrote:
 On Tue, 2012-07-31 at 07:27 -0700, Keith Packard wrote:
  James Bottomley james.bottom...@hansenpartnership.com writes:
  
   on 3.5 killing X causes idle power to go 14W - 5.9W
   on 3.4.6 killing X causes idle power to go 6.8W - 5.7W
  
  That's actually pretty good news -- you're just not getting to RC6
  when X is running, but RC6 is otherwise working. And, yes, the GPU
  really can suck that much power. The debug info that Chris pointed you
  at should tell a more complete story. For comparison, on my sandybridge
  box this morning:
 
 Well, I don't know what it means, but I have a culprit from bisect (I
 managed to manually verify the bisection heads which would step back
 into 3.3).
 
 2911a35b2e4eb87ec48d03aeb11f019e51ae3c0d is the first bad commit
 commit 2911a35b2e4eb87ec48d03aeb11f019e51ae3c0d
 Author: Ben Widawsky b...@bwidawsk.net
 Date:   Thu Apr 5 14:47:36 2012 -0700
 
 drm/i915: use semaphores for the display plane
 
 I'm going to try building 3.5 with this reverted (it doesn't revert
 cleanly).

OK, I confirm that reverting this patch in 3.5 recovers a ~6.5W idle
power.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Massive power regression going 3.4->3.5

2012-07-30 Thread James Bottomley
On Mon, 2012-07-30 at 09:33 -0700, Keith Packard wrote:
> James Bottomley  writes:
> 
> > OK, I've run the bisect as far as I can.  It looks to be in the drm
> > tree.  Unfortunately, this tree has several merge points, some of which
> > go further back than v3.4.  Unfortunately, once the bisect steps back
> > before 3.4, we lose the changes that gave us the power savings, making
> > further debugging impossible
> 
> What machine is this on? There are a few 'disable some power savings'
> patches in that list to work around issues on various machines; knowing
> what machine you're using can isolate which ones might have had some
> effect on power usage...

Lenovo X220i

The display device is

00:02.0 VGA compatible controller: Intel Corporation
82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 03)
(prog-if 00 [VGA controller])
Subsystem: Giga-byte Technology Device 2562
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at e000 (32-bit, prefetchable) [size=128M]
Memory at e820 (32-bit, non-prefetchable) [size=512K]
Expansion ROM at  [disabled]
Capabilities: 
Kernel driver in use: i915


James





Massive power regression going 3.4->3.5

2012-07-30 Thread James Bottomley
On Mon, 2012-07-30 at 10:46 +0100, James Bottomley wrote:
> On Sun, 2012-07-29 at 21:25 +0200, Rafael J. Wysocki wrote:
> > On Sunday, July 29, 2012, Rafael J. Wysocki wrote:
> > > On Sunday, July 29, 2012, James Bottomley wrote:
> > > > On Sat, 2012-07-28 at 22:29 +0200, Rafael J. Wysocki wrote:
> > > > > On Saturday, July 28, 2012, Rafael J. Wysocki wrote:
> > > > > > On Saturday, July 28, 2012, James Bottomley wrote:
> > > > > > > One of the great things about the 3.4 kernel was the massive 
> > > > > > > increase in
> > > > > > > power savings on my x220i laptop.  With full PCI suspend, it 
> > > > > > > could get
> > > > > > > down to 6.5W in idle with a dim screen, provided I used powertop 
> > > > > > > 2.0 to
> > > > > > > activate all the tunings).  I just upgraded to 3.5 (the openSUSE
> > > > > > > tumbleweed kernel) and all the power savings are gone.  Now it's 
> > > > > > > back to
> > > > > > > its previous behaviour of idle somewhere between 16-18W.
> > > > > > 
> > > > > > Please check dbe9a2e (ACPI / PM: Make acpi_pm_device_sleep_state() 
> > > > > > follow the
> > > > > > specification) for starters.
> > > > > > 
> > > > > > If that is not the culprit, 38c92ff (ACPI / PM: Make 
> > > > > > __acpi_bus_get_power()
> > > > > > cover D3cold correctly) is worth ckecking too.
> > > > > > 
> > > > > > If none of the above, c2fb8a3 (USB: add NO_D3_DURING_SLEEP flag and 
> > > > > > revert
> > > > > > 151b61284776be2) is one more candidate.
> > > > > > 
> > > > > > I can't recall anything else that might be related to the symptom 
> > > > > > at the moment.
> > > > > 
> > > > > Actually, dbe9a2e and c2fb8a3 only affect system suspend, so for 
> > > > > runtime PM
> > > > > 38c92ff seems to be the only relevant one from the above.
> > > > 
> > > > I verified the problem shows up on vanilla 3.5 (just in case it was an
> > > > openSUSE problem).  I also tried reverting 38c92ff with no success.  The
> > > > symptoms appear to be that something is preventing the PCI/device
> > > > subsystem from going into auto suspend.
> > > 
> > > The number of core power management commits during the 3.5 cycle was 
> > > rather
> > > limited and none of them should affect PCI runtime PM.  I have no idea 
> > > what
> > > the root cause of that may be, quite frankly.
> > 
> > I've just realized that you said "auto suspend", which makes me think that 
> > the
> > problem may be related to USB.
> > 
> > PCI doesn't really use any kind of auto suspend for what I know, but USB 
> > does
> > and it may cause PCI USB controllers to be suspended as a result.
> 
> I'm trying to bisect this, but I've got stuck around here:
> 
> git bisect bad 2f78d8e249973f1eeb88315e6444e616c60177ae
> git bisect good 28f3d717618156c0dcd2f497d791b578a7931d87
> 
> That's around the drm tree ... unfortunately that broke a lot of the
> basics of my i915 based system (compositing and resolution) as I step
> into it.

OK, I've run the bisect as far as I can.  It looks to be in the drm
tree.  Unfortunately, this tree has several merge points, some of which
go further back than v3.4.  Unfortunately, once the bisect steps back
before 3.4, we lose the changes that gave us the power savings, making
further debugging impossible

Here's the log

git bisect start
# bad: [28a33cbc24e4256c143dce96c7d93bf423229f92] Linux 3.5
git bisect bad 28a33cbc24e4256c143dce96c7d93bf423229f92
# bad: [28a33cbc24e4256c143dce96c7d93bf423229f92] Linux 3.5
git bisect bad 28a33cbc24e4256c143dce96c7d93bf423229f92
# good: [76e10d158efb6d4516018846f60c2ab5501900bc] Linux 3.4
git bisect good 76e10d158efb6d4516018846f60c2ab5501900bc
# good: [59cd358a7a5b2f6b61faa01dae6cfda3830ac62a] Merge tag 
'perf-core-for-mingo' of 
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
git bisect good 59cd358a7a5b2f6b61faa01dae6cfda3830ac62a
# bad: [7e5b2db77b05746613516599c916a8cc2e321077] Merge branch 'upstream' of 
git://git.linux-mips.org/pub/scm/ralf/upstream-linus
git bisect bad 7e5b2db77b05746613516599c916a8cc2e321077
# good: [f9369910a6225b8d4892c3f20ae740a711cd5ace] Merge branch 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/viro

Re: Massive power regression going 3.4-3.5

2012-07-30 Thread James Bottomley
On Mon, 2012-07-30 at 10:46 +0100, James Bottomley wrote:
 On Sun, 2012-07-29 at 21:25 +0200, Rafael J. Wysocki wrote:
  On Sunday, July 29, 2012, Rafael J. Wysocki wrote:
   On Sunday, July 29, 2012, James Bottomley wrote:
On Sat, 2012-07-28 at 22:29 +0200, Rafael J. Wysocki wrote:
 On Saturday, July 28, 2012, Rafael J. Wysocki wrote:
  On Saturday, July 28, 2012, James Bottomley wrote:
   One of the great things about the 3.4 kernel was the massive 
   increase in
   power savings on my x220i laptop.  With full PCI suspend, it 
   could get
   down to 6.5W in idle with a dim screen, provided I used powertop 
   2.0 to
   activate all the tunings).  I just upgraded to 3.5 (the openSUSE
   tumbleweed kernel) and all the power savings are gone.  Now it's 
   back to
   its previous behaviour of idle somewhere between 16-18W.
  
  Please check dbe9a2e (ACPI / PM: Make acpi_pm_device_sleep_state() 
  follow the
  specification) for starters.
  
  If that is not the culprit, 38c92ff (ACPI / PM: Make 
  __acpi_bus_get_power()
  cover D3cold correctly) is worth ckecking too.
  
  If none of the above, c2fb8a3 (USB: add NO_D3_DURING_SLEEP flag and 
  revert
  151b61284776be2) is one more candidate.
  
  I can't recall anything else that might be related to the symptom 
  at the moment.
 
 Actually, dbe9a2e and c2fb8a3 only affect system suspend, so for 
 runtime PM
 38c92ff seems to be the only relevant one from the above.

I verified the problem shows up on vanilla 3.5 (just in case it was an
openSUSE problem).  I also tried reverting 38c92ff with no success.  The
symptoms appear to be that something is preventing the PCI/device
subsystem from going into auto suspend.
   
   The number of core power management commits during the 3.5 cycle was 
   rather
   limited and none of them should affect PCI runtime PM.  I have no idea 
   what
   the root cause of that may be, quite frankly.
  
  I've just realized that you said auto suspend, which makes me think that 
  the
  problem may be related to USB.
  
  PCI doesn't really use any kind of auto suspend for what I know, but USB 
  does
  and it may cause PCI USB controllers to be suspended as a result.
 
 I'm trying to bisect this, but I've got stuck around here:
 
 git bisect bad 2f78d8e249973f1eeb88315e6444e616c60177ae
 git bisect good 28f3d717618156c0dcd2f497d791b578a7931d87
 
 That's around the drm tree ... unfortunately that broke a lot of the
 basics of my i915 based system (compositing and resolution) as I step
 into it.

OK, I've run the bisect as far as I can.  It looks to be in the drm
tree.  Unfortunately, this tree has several merge points, some of which
go further back than v3.4.  Unfortunately, once the bisect steps back
before 3.4, we lose the changes that gave us the power savings, making
further debugging impossible

Here's the log

git bisect start
# bad: [28a33cbc24e4256c143dce96c7d93bf423229f92] Linux 3.5
git bisect bad 28a33cbc24e4256c143dce96c7d93bf423229f92
# bad: [28a33cbc24e4256c143dce96c7d93bf423229f92] Linux 3.5
git bisect bad 28a33cbc24e4256c143dce96c7d93bf423229f92
# good: [76e10d158efb6d4516018846f60c2ab5501900bc] Linux 3.4
git bisect good 76e10d158efb6d4516018846f60c2ab5501900bc
# good: [59cd358a7a5b2f6b61faa01dae6cfda3830ac62a] Merge tag 
'perf-core-for-mingo' of 
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
git bisect good 59cd358a7a5b2f6b61faa01dae6cfda3830ac62a
# bad: [7e5b2db77b05746613516599c916a8cc2e321077] Merge branch 'upstream' of 
git://git.linux-mips.org/pub/scm/ralf/upstream-linus
git bisect bad 7e5b2db77b05746613516599c916a8cc2e321077
# good: [f9369910a6225b8d4892c3f20ae740a711cd5ace] Merge branch 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
git bisect good f9369910a6225b8d4892c3f20ae740a711cd5ace
# bad: [2f78d8e249973f1eeb88315e6444e616c60177ae] Merge tag 'firewire-updates' 
of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
git bisect bad 2f78d8e249973f1eeb88315e6444e616c60177ae
# good: [28f3d717618156c0dcd2f497d791b578a7931d87] Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good 28f3d717618156c0dcd2f497d791b578a7931d87
# bad: [b3daeaef559d87b974c13a096582c5c70dc11061] drm/i915: move rps/emon 
function declarations
git bisect bad b3daeaef559d87b974c13a096582c5c70dc11061
# bad: [246bdbeb0f0fb14c4c085b6ed7edbab11afccd33] drm/i915: share forcewaking 
code between IVB and HSW
git bisect bad 246bdbeb0f0fb14c4c085b6ed7edbab11afccd33

If you do a gitk on the last bad and good 

gitk 
28f3d717618156c0dcd2f497d791b578a7931d87..246bdbeb0f0fb14c4c085b6ed7edbab11afccd33

You see there are only drm commits in there.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Massive power regression going 3.4-3.5

2012-07-30 Thread James Bottomley
On Mon, 2012-07-30 at 09:33 -0700, Keith Packard wrote:
 James Bottomley james.bottom...@hansenpartnership.com writes:
 
  OK, I've run the bisect as far as I can.  It looks to be in the drm
  tree.  Unfortunately, this tree has several merge points, some of which
  go further back than v3.4.  Unfortunately, once the bisect steps back
  before 3.4, we lose the changes that gave us the power savings, making
  further debugging impossible
 
 What machine is this on? There are a few 'disable some power savings'
 patches in that list to work around issues on various machines; knowing
 what machine you're using can isolate which ones might have had some
 effect on power usage...

Lenovo X220i

The display device is

00:02.0 VGA compatible controller: Intel Corporation
82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 03)
(prog-if 00 [VGA controller])
Subsystem: Giga-byte Technology Device 2562
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at e000 (32-bit, prefetchable) [size=128M]
Memory at e820 (32-bit, non-prefetchable) [size=512K]
Expansion ROM at unassigned [disabled]
Capabilities: access denied
Kernel driver in use: i915


James



___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


[PATCH 00/12] Fix various section mismatches and build errors.

2011-06-29 Thread James Bottomley
On Wed, 2011-06-29 at 17:19 +0100, Ralf Baechle wrote:
> On Wed, Jun 29, 2011 at 08:14:24AM -0700, Greg KH wrote:
> 
> > 
> > On Wed, Jun 29, 2011 at 08:58:19AM -0500, James Bottomley wrote:
> > > I think we should simply concentrate on __init and __exit; that's where
> > > most of the discard value lies and stop expending huge efforts on the
> > > __devX stuff which adds huge complexity for no real gain.
> > 
> > I have long felt that those __devX markings should just go away as they
> > cause nothing but problems and have no real gain as you point out.
> 
> The suggestion to do that has been floated around before but seems to
> have missed sufficient thrust.  I'm all for it; the manual tagging with
> __devX has not been very efficient on developer time.  I just want to see
> meaningful warnings again over all that noise the current mechanisn may
> produce.

For me, just go ahead and fix the actual problems: so _init sections and
_exit sections that are used from the main body, just strip the
annotations, don't try to change them for _devX ones.

Thanks,

James




[PATCH 00/12] Fix various section mismatches and build errors.

2011-06-29 Thread James Bottomley
On Wed, 2011-06-29 at 14:07 +0100, Ralf Baechle wrote:
> On Mon, Jun 27, 2011 at 10:12:57PM -0700, David Miller wrote:
> 
> > commit 948252cb9e01d65a89ecadf67be5018351eee15e
> > Author: David S. Miller 
> > Date:   Tue May 31 19:27:48 2011 -0700
> > 
> > Revert "net: fix section mismatches"
> > 
> > This reverts commit e5cb966c0838e4da43a3b0751bdcac7fe719f7b4.
> > 
> > It causes new build regressions with gcc-4.2 which is
> > pretty common on non-x86 platforms.
> > 
> > Reported-by: James Bottomley 
> > Signed-off-by: David S. Miller 
> > 
> > and postings that led to this revert including:
> > 
> > http://marc.info/?l=linux-netdev=130653748205263=2
> 
> Thanks for the pointers; I looked into it a bit deeper and found that the
> construct which hppa64-linux-gcc 4.2.4 doesn't like is the combination of
> const and __devinitconst __devinitdata.
> 
> My patches are minimalistic and don't do any constification and seem to
> work fine for PA-RISC.
> 
> A possible alternative to allow the use of Micha?'s reverted patch would
> be to conditionalize the definition of __devinitconst.  There is no
> user of __devexitconst so I left that unchanged.

To be honest, my own take on this is that, apart from the compiler
cockups trying to do read only annotations, which affect various
versions of gcc not just the parisc ones, the _devX annotations are
pretty pointless.  They only really do something in the non-hotplug
case, so since 95% of the world seems to use hotplug now and the other
5% doesn't care that much about the odd page of memory (which you rarely
get, since modules sections are accumulated per module, not aggregate),
I'd just favour stripping __init and __exit where there's a problem.

I think we should simply concentrate on __init and __exit; that's where
most of the discard value lies and stop expending huge efforts on the
__devX stuff which adds huge complexity for no real gain.

James




Re: [PATCH 00/12] Fix various section mismatches and build errors.

2011-06-29 Thread James Bottomley
On Wed, 2011-06-29 at 14:07 +0100, Ralf Baechle wrote:
 On Mon, Jun 27, 2011 at 10:12:57PM -0700, David Miller wrote:
 
  commit 948252cb9e01d65a89ecadf67be5018351eee15e
  Author: David S. Miller da...@davemloft.net
  Date:   Tue May 31 19:27:48 2011 -0700
  
  Revert net: fix section mismatches
  
  This reverts commit e5cb966c0838e4da43a3b0751bdcac7fe719f7b4.
  
  It causes new build regressions with gcc-4.2 which is
  pretty common on non-x86 platforms.
  
  Reported-by: James Bottomley james.bottom...@hansenpartnership.com
  Signed-off-by: David S. Miller da...@davemloft.net
  
  and postings that led to this revert including:
  
  http://marc.info/?l=linux-netdevm=130653748205263w=2
 
 Thanks for the pointers; I looked into it a bit deeper and found that the
 construct which hppa64-linux-gcc 4.2.4 doesn't like is the combination of
 const and __devinitconst __devinitdata.
 
 My patches are minimalistic and don't do any constification and seem to
 work fine for PA-RISC.
 
 A possible alternative to allow the use of Michał's reverted patch would
 be to conditionalize the definition of __devinitconst.  There is no
 user of __devexitconst so I left that unchanged.

To be honest, my own take on this is that, apart from the compiler
cockups trying to do read only annotations, which affect various
versions of gcc not just the parisc ones, the _devX annotations are
pretty pointless.  They only really do something in the non-hotplug
case, so since 95% of the world seems to use hotplug now and the other
5% doesn't care that much about the odd page of memory (which you rarely
get, since modules sections are accumulated per module, not aggregate),
I'd just favour stripping __init and __exit where there's a problem.

I think we should simply concentrate on __init and __exit; that's where
most of the discard value lies and stop expending huge efforts on the
__devX stuff which adds huge complexity for no real gain.

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


(Short?) merge window reminder

2011-05-24 Thread James Bottomley
On Mon, 2011-05-23 at 12:22 -0700, Greg KH wrote:
> On Mon, May 23, 2011 at 12:13:29PM -0700, Linus Torvalds wrote:
> > PS. The voices in my head also tell me that the numbers are getting
> > too big. I may just call the thing 2.8.0. And I almost guarantee that
> > this PS is going to result in more discussion than the rest, but when
> > the voices tell me to do things, I listen.
> 
> If you do this, I will buy you a bottle of whatever whiskey you want
> that I can get my hands on in Tokyo next week.

I can recommend Hanyu Ace of Spades ...  I can even arrange to be on
hand just to make sure it's as good as it should be ...

James




Re: (Short?) merge window reminder

2011-05-24 Thread James Bottomley
On Mon, 2011-05-23 at 12:22 -0700, Greg KH wrote:
 On Mon, May 23, 2011 at 12:13:29PM -0700, Linus Torvalds wrote:
  PS. The voices in my head also tell me that the numbers are getting
  too big. I may just call the thing 2.8.0. And I almost guarantee that
  this PS is going to result in more discussion than the rest, but when
  the voices tell me to do things, I listen.
 
 If you do this, I will buy you a bottle of whatever whiskey you want
 that I can get my hands on in Tokyo next week.

I can recommend Hanyu Ace of Spades ...  I can even arrange to be on
hand just to make sure it's as good as it should be ...

James


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel