Re: [Intel-gfx] [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread James Bottomley
On Mon, 2022-02-28 at 23:59 +0200, Mike Rapoport wrote:
> 
> On February 28, 2022 10:42:53 PM GMT+02:00, James Bottomley <
> james.bottom...@hansenpartnership.com> wrote:
> > On Mon, 2022-02-28 at 21:07 +0100, Christian König wrote:
[...]
> > > > I do wish we could actually poison the 'pos' value after the
> > > > loop somehow - but clearly the "might be uninitialized" I was
> > > > hoping for isn't the way to do it.
> > > > 
> > > > Anybody have any ideas?
> > > 
> > > I think we should look at the use cases why code is touching
> > > (pos) after the loop.
> > > 
> > > Just from skimming over the patches to change this and experience
> > > with the drivers/subsystems I help to maintain I think the
> > > primary pattern looks something like this:
> > > 
> > > list_for_each_entry(entry, head, member) {
> > >  if (some_condition_checking(entry))
> > >  break;
> > > }
> > > do_something_with(entry);
> > 
> > Actually, we usually have a check to see if the loop found
> > anything, but in that case it should something like
> > 
> > if (list_entry_is_head(entry, head, member)) {
> >return with error;
> > }
> > do_somethin_with(entry);
> > 
> > Suffice?  The list_entry_is_head() macro is designed to cope with
> > the bogus entry on head problem.
> 
> Won't suffice because the end goal of this work is to limit scope of
> entry only to loop. Hence the need for additional variable.

Well, yes, but my objection is more to the size of churn than the
desire to do loop local.  I'm not even sure loop local is possible,
because it's always annoyed me that for (int i = 0; ...  in C++ defines
i in the outer scope not the loop scope, which is why I never use it.

However, if the desire is really to poison the loop variable then we
can do

#define list_for_each_entry(pos, head, member)  \
for (pos = list_first_entry(head, typeof(*pos), member);\
 !list_entry_is_head(pos, head, member) && ((pos = NULL) == NULL;   
\
 pos = list_next_entry(pos, member))

Which would at least set pos to NULL when the loop completes.

> Besides, there are no guarantees that people won't
> do_something_with(entry) without the check or won't compare entry to
> NULL to check if the loop finished with break or not.

I get the wider goal, but we have to patch the problem cases now and a
simple one-liner is better than a larger patch that may or may not work
if we ever achieve the local definition or value poisoning idea.  I'm
also fairly certain coccinelle can come up with a use without checking
for loop completion semantic patch which we can add to 0day.

James




Re: [Intel-gfx] [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread James Bottomley
On Mon, 2022-02-28 at 21:56 +0100, Christian König wrote:
> 
> Am 28.02.22 um 21:42 schrieb James Bottomley:
> > On Mon, 2022-02-28 at 21:07 +0100, Christian König wrote:
> > > Am 28.02.22 um 20:56 schrieb Linus Torvalds:
> > > > On Mon, Feb 28, 2022 at 4:19 AM Christian König
> > > >  wrote:
> > > > [SNIP]
> > > > Anybody have any ideas?
> > > I think we should look at the use cases why code is touching
> > > (pos)
> > > after the loop.
> > > 
> > > Just from skimming over the patches to change this and experience
> > > with the drivers/subsystems I help to maintain I think the
> > > primary pattern looks something like this:
> > > 
> > > list_for_each_entry(entry, head, member) {
> > >   if (some_condition_checking(entry))
> > >   break;
> > > }
> > > do_something_with(entry);
> > 
> > Actually, we usually have a check to see if the loop found
> > anything, but in that case it should something like
> > 
> > if (list_entry_is_head(entry, head, member)) {
> >  return with error;
> > }
> > do_somethin_with(entry);
> > 
> > Suffice?  The list_entry_is_head() macro is designed to cope with
> > the bogus entry on head problem.
> 
> That will work and is also what people already do.
> 
> The key problem is that we let people do the same thing over and
> over again with slightly different implementations.
> 
> Out in the wild I've seen at least using a separate variable, using
> a bool to indicate that something was found and just assuming that
> the list has an entry.
> 
> The last case is bogus and basically what can break badly.

Yes, I understand that.  I'm saying we should replace that bogus checks
of entry->something against some_value loop termination condition with
the list_entry_is_head() macro.  That should be a one line and fairly
mechanical change rather than the explosion of code changes we seem to
have in the patch series.

James




Re: [Intel-gfx] [PATCH 2/6] treewide: remove using list iterator after loop body as a ptr

2022-02-28 Thread James Bottomley
On Mon, 2022-02-28 at 21:07 +0100, Christian König wrote:
> Am 28.02.22 um 20:56 schrieb Linus Torvalds:
> > On Mon, Feb 28, 2022 at 4:19 AM Christian König
> >  wrote:
> > > I don't think that using the extra variable makes the code in any
> > > way
> > > more reliable or easier to read.
> > So I think the next step is to do the attached patch (which
> > requires
> > that "-std=gnu11" that was discussed in the original thread).
> > 
> > That will guarantee that the 'pos' parameter of
> > list_for_each_entry()
> > is only updated INSIDE the for_each_list_entry() loop, and can
> > never
> > point to the (wrongly typed) head entry.
> > 
> > And I would actually hope that it should actually cause compiler
> > warnings about possibly uninitialized variables if people then use
> > the
> > 'pos' pointer outside the loop. Except
> > 
> >   (a) that code in sgx/encl.c currently initializes 'tmp' to NULL
> > for
> > inexplicable reasons - possibly because it already expected this
> > behavior
> > 
> >   (b) when I remove that NULL initializer, I still don't get a
> > warning,
> > because we've disabled -Wno-maybe-uninitialized since it results in
> > so
> > many false positives.
> > 
> > Oh well.
> > 
> > Anyway, give this patch a look, and at least if it's expanded to do
> > "(pos) = NULL" in the entry statement for the for-loop, it will
> > avoid the HEAD type confusion that Jakob is working on. And I think
> > in a cleaner way than the horrid games he plays.
> > 
> > (But it won't avoid possible CPU speculation of such type
> > confusion. That, in my opinion, is a completely different issue)
> 
> Yes, completely agree.
> 
> > I do wish we could actually poison the 'pos' value after the loop
> > somehow - but clearly the "might be uninitialized" I was hoping for
> > isn't the way to do it.
> > 
> > Anybody have any ideas?
> 
> I think we should look at the use cases why code is touching (pos)
> after the loop.
> 
> Just from skimming over the patches to change this and experience
> with the drivers/subsystems I help to maintain I think the primary
> pattern looks something like this:
> 
> list_for_each_entry(entry, head, member) {
>  if (some_condition_checking(entry))
>  break;
> }
> do_something_with(entry);


Actually, we usually have a check to see if the loop found anything,
but in that case it should something like

if (list_entry_is_head(entry, head, member)) {
return with error;
}
do_somethin_with(entry);

Suffice?  The list_entry_is_head() macro is designed to cope with the
bogus entry on head problem.

James




Re: [Intel-gfx] [PATCH v3 3/4] tpm_tis: Disable interrupts if interrupt storm detected

2020-12-07 Thread James Bottomley
On Mon, 2020-12-07 at 15:28 -0400, Jason Gunthorpe wrote:
> On Sun, Dec 06, 2020 at 08:26:16PM +0100, Thomas Gleixner wrote:
> > Just as a side note. I was looking at tpm_tis_probe_irq_single()
> > and that function is leaking the interrupt request if any of the
> > checks afterwards fails, except for the final interrupt probe check
> > which does a cleanup. That means on fail before that the interrupt
> > handler stays requested up to the point where the module is
> > removed. If that's a shared interrupt and some other device is
> > active on the same line, then each interrupt from that device will
> > call into the TPM code. Something like the below is needed.
> > 
> > Also the X86 autoprobe mechanism is interesting:
> > 
> > if (IS_ENABLED(CONFIG_X86))
> > for (i = 3; i <= 15; i++)
> > if (!tpm_tis_probe_irq_single(chip, intmask, 0,
> > i))
> > return;
> > 
> > The third argument is 'flags' which is handed to request_irq(). So
> > that won't ever be able to probe a shared interrupt. But if an
> > interrupt number > 0 is handed to tpm_tis_core_init() the interrupt
> > is requested with IRQF_SHARED. Same issue when the chip has an
> > interrupt number in the register. It's also requested exclusive
> > which is pretty likely to fail on ancient x86 machines.
> 
> It is very likely none of this works any more, it has been repeatedly
> reworked over the years and just left behind out of fear someone
> needs it. I've thought it should be deleted for a while now.
> 
> I suppose the original logic was to try and probe without SHARED
> because a probe would need exclusive access to the interrupt to tell
> if the TPM was actually the source, not some other device.
> 
> It is all very old and very out of step with current thinking, IMHO.
> I skeptical that TPM interrupts were ever valuable enough to deserve
> this in the first place.

For what it's worth, I agree.  Trying to probe all 15 ISA interrupts is
last millennium thinking we should completely avoid.  If it's not
described in ACPI then you don't get an interrupt full stop.

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH v3 1/4] irq: export kstat_irqs

2020-12-06 Thread James Bottomley
On Sun, 2020-12-06 at 17:40 +0100, Thomas Gleixner wrote:
> On Sat, Dec 05 2020 at 12:39, Jarkko Sakkinen wrote:
> > On Fri, Dec 04, 2020 at 06:43:37PM -0700, Jerry Snitselaar wrote:
> > > To try and detect potential interrupt storms that
> > > have been occurring with tpm_tis devices it was suggested
> > > to use kstat_irqs() to get the number of interrupts.
> > > Since tpm_tis can be built as a module it needs kstat_irqs
> > > exported.
> > 
> > I think you should also have a paragraph explicitly stating that
> > i915_pmu.c contains a duplicate of kstat_irqs() because it is not
> > exported as of today. It adds a lot more weight to this given that
> > there is already existing mainline usage (kind of).
> 
> It's abusage and just the fact that it exists is not an argument by
> itself.

What we want is a count of the interrupts to see if we're having an
interrupt storm from the TPM device (some seem to be wired to fire the
interrupt even when there's no event to warrant it).  Since
kstat_irqs_user() does the correct RCU locking, should we be using that
instead?

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [Intel-wired-lan] [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-24 Thread James Bottomley
On Tue, 2020-11-24 at 13:32 -0800, Kees Cook wrote:
> On Mon, Nov 23, 2020 at 08:31:30AM -0800, James Bottomley wrote:
> > Really, no ... something which produces no improvement has no value
> > at all ... we really shouldn't be wasting maintainer time with it
> > because it has a cost to merge.  I'm not sure we understand where
> > the balance lies in value vs cost to merge but I am confident in
> > the zero value case.
> 
> What? We can't measure how many future bugs aren't introduced because
> the kernel requires explicit case flow-control statements for all new
> code.

No but we can measure how vulnerable our current coding habits are to
the mistake this warning would potentially prevent.  I don't think it's
wrong to extrapolate that if we had no instances at all of prior coding
problems we likely wouldn't have any in future either making adopting
the changes needed to enable the warning valueless ... that's the zero
value case I was referring to above.

Now, what we have seems to be about 6 cases (at least what's been shown
in this thread) where a missing break would cause potentially user
visible issues.  That means the value of this isn't zero, but it's not
a no-brainer massive win either.  That's why I think asking what we've
invested vs the return isn't a useless exercise.

> We already enable -Wimplicit-fallthrough globally, so that's not the
> discussion. The issue is that Clang is (correctly) even more strict
> than GCC for this, so these are the remaining ones to fix for full
> Clang coverage too.
> 
> People have spent more time debating this already than it would have
> taken to apply the patches. :)

You mean we've already spent 90% of the effort to come this far so we
might as well go the remaining 10% because then at least we get some
return? It's certainly a clinching argument in defence procurement ...

> This is about robustness and language wrangling. It's a big code-
> base, and this is the price of our managing technical debt for
> permanent robustness improvements. (The numbers I ran from Gustavo's
> earlier patches were that about 10% of the places adjusted were
> identified as legitimate bugs being fixed. This final series may be
> lower, but there are still bugs being found from it -- we need to
> finish this and shut the door on it for good.)

I got my six patches by analyzing the lwn.net report of the fixes that
was cited which had 21 of which 50% didn't actually change the emitted
code, and 25% didn't have a user visible effect.

But the broader point I'm making is just because the compiler people
come up with a shiny new warning doesn't necessarily mean the problem
it's detecting is one that causes us actual problems in the code base. 
I'd really be happier if we had a theory about what classes of CVE or
bug we could eliminate before we embrace the next new warning.

James



___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-23 Thread James Bottomley
On Mon, 2020-11-23 at 19:56 +0100, Miguel Ojeda wrote:
> On Mon, Nov 23, 2020 at 4:58 PM James Bottomley
>  wrote:
> > Well, I used git.  It says that as of today in Linus' tree we have
> > 889 patches related to fall throughs and the first series went in
> > in october 2017 ... ignoring a couple of outliers back to February.
> 
> I can see ~10k insertions over ~1k commits and 15 years that mention
> a fallthrough in the entire repo. That is including some commits
> (like the biggest one, 960 insertions) that have nothing to do with C
> fallthrough. A single kernel release has an order of magnitude more
> changes than this...
> 
> But if we do the math, for an author, at even 1 minute per line
> change and assuming nothing can be automated at all, it would take 1
> month of work. For maintainers, a couple of trivial lines is noise
> compared to many other patches.

So you think a one line patch should take one minute to produce ... I
really don't think that's grounded in reality.  I suppose a one line
patch only takes a minute to merge with b4 if no-one reviews or tests
it, but that's not really desirable.

> In fact, this discussion probably took more time than the time it
> would take to review the 200 lines. :-)

I'm framing the discussion in terms of the whole series of changes we
have done for fall through, both what's in the tree currently (889
patches) both in terms of the produce and the consumer.  That's what I
used for my figures for cost.

> > We're also complaining about the inability to recruit maintainers:
> > 
> > https://www.theregister.com/2020/06/30/hard_to_find_linux_maintainers_says_torvalds/
> > 
> > And burn out:
> > 
> > http://antirez.com/news/129
> 
> Accepting trivial and useful 1-line patches

Part of what I'm trying to measure is the "and useful" bit because
that's not a given.

> is not what makes a voluntary maintainer quit...

so the proverb "straw which broke the camel's back" uniquely doesn't
apply to maintainers

>  Thankless work with demanding deadlines is.

That's another potential reason, but it doesn't may other reasons less
valid.

> > The whole crux of your argument seems to be maintainers' time isn't
> > important so we should accept all trivial patches
> 
> I have not said that, at all. In fact, I am a voluntary one and I
> welcome patches like this. It takes very little effort on my side to
> review and it helps the kernel overall.

Well, you know, subsystems are very different in terms of the amount of
patches a maintainer has to process per release cycle of the kernel. 
If a maintainer is close to capacity, additional patches, however
trivial, become a problem.  If a maintainer has spare cycles, trivial
patches may look easy.

> Paid maintainers are the ones that can take care of big
> features/reviews.
> 
> > What I'm actually trying to articulate is a way of measuring value
> > of the patch vs cost ... it has nothing really to do with who foots
> > the actual bill.
> 
> I understand your point, but you were the one putting it in terms of
> a junior FTE.

No, I evaluated the producer side in terms of an FTE.  What we're
mostly arguing about here is the consumer side: the maintainers and
people who have to rework their patch sets. I estimated that at 100h.

>  In my view, 1 month-work (worst case) is very much worth
> removing a class of errors from a critical codebase.
> 
> > One thesis I'm actually starting to formulate is that this
> > continual devaluing of maintainers is why we have so much
> > difficulty keeping and recruiting them.
> 
> That may very well be true, but I don't feel anybody has devalued
> maintainers in this discussion.

You seem to be saying that because you find it easy to merge trivial
patches, everyone should.  I'm reminded of a friend long ago who
thought being a Tees River Pilot was a sinecure because he could
navigate the Tees blindfold.  What he forgot, of course, is that just
because it's easy with a trawler doesn't mean it's easy with an oil
tanker.  In fact it takes longer to qualify as a Tees River Pilot than
it does to get a PhD.

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [Intel-wired-lan] [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-23 Thread James Bottomley
On Mon, 2020-11-23 at 07:03 -0600, Gustavo A. R. Silva wrote:
> On Sun, Nov 22, 2020 at 11:53:55AM -0800, James Bottomley wrote:
> > On Sun, 2020-11-22 at 11:22 -0800, Joe Perches wrote:
> > > On Sun, 2020-11-22 at 11:12 -0800, James Bottomley wrote:
> > > > On Sun, 2020-11-22 at 10:25 -0800, Joe Perches wrote:
> > > > > On Sun, 2020-11-22 at 10:21 -0800, James Bottomley wrote:
> > > > > > Please tell me our reward for all this effort isn't a
> > > > > > single missing error print.
> > > > > 
> > > > > There were quite literally dozens of logical defects found
> > > > > by the fallthrough additions.  Very few were logging only.
> > > > 
> > > > So can you give us the best examples (or indeed all of them if
> > > > someone is keeping score)?  hopefully this isn't a US election
> > > > situation ...
> > > 
> > > Gustavo?  Are you running for congress now?
> > > 
> > > https://lwn.net/Articles/794944/
> > 
> > That's 21 reported fixes of which about 50% seem to produce no
> > change in code behaviour at all, a quarter seem to have no user
> > visible effect with the remaining quarter producing unexpected
> > errors on obscure configuration parameters, which is why no-one
> > really noticed them before.
> 
> The really important point here is the number of bugs this has
> prevented and will prevent in the future. See an example of this,
> below:
> 
> https://lore.kernel.org/linux-iio/20190813135802.gb27...@kroah.com/

I think this falls into the same category as the other six bugs: it
changes the output/input for parameters but no-one has really noticed,
usually because the command is obscure or the bias effect is minor.

> This work is still relevant, even if the total number of issues/bugs
> we find in the process is zero (which is not the case).

Really, no ... something which produces no improvement has no value at
all ... we really shouldn't be wasting maintainer time with it because
it has a cost to merge.  I'm not sure we understand where the balance
lies in value vs cost to merge but I am confident in the zero value
case.

> "The sucky thing about doing hard work to deploy hardening is that
> the result is totally invisible by definition (things not happening)
> [..]"
> - Dmitry Vyukov

Really, no.  Something that can't be measured at all doesn't exist.

And actually hardening is one of those things you can measure (which I
do have to admit isn't true for everything in the security space) ...
it's number of exploitable bugs found before you did it vs number of
exploitable bugs found after you did it.  Usually hardening eliminates
a class of bug, so the way I've measured hardening before is to go
through the CVE list for the last couple of years for product X, find
all the bugs that are of the class we're looking to eliminate and say
if we had hardened X against this class of bug we'd have eliminated Y%
of the exploits.  It can be quite impressive if Y is a suitably big
number.

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-23 Thread James Bottomley
On Mon, 2020-11-23 at 15:19 +0100, Miguel Ojeda wrote:
> On Sun, Nov 22, 2020 at 11:36 PM James Bottomley
>  wrote:
> > Well, it seems to be three years of someone's time plus the
> > maintainer review time and series disruption of nearly a thousand
> > patches.  Let's be conservative and assume the producer worked
> > about 30% on the series and it takes about 5-10 minutes per patch
> > to review, merge and for others to rework existing series.  So
> > let's say it's cost a person year of a relatively junior engineer
> > producing the patches and say 100h of review and application
> > time.  The latter is likely the big ticket item because it's what
> > we have in least supply in the kernel (even though it's 20x vs the
> > producer time).
> 
> How are you arriving at such numbers? It is a total of ~200 trivial
> lines.

Well, I used git.  It says that as of today in Linus' tree we have 889
patches related to fall throughs and the first series went in in
october 2017 ... ignoring a couple of outliers back to February.

> > It's not about the risk of the changes it's about the cost of
> > implementing them.  Even if you discount the producer time (which
> > someone gets to pay for, and if I were the engineering manager, I'd
> > be unhappy about), the review/merge/rework time is pretty
> > significant in exchange for six minor bug fixes.  Fine, when a new
> > compiler warning comes along it's certainly reasonable to see if we
> > can benefit from it and the fact that the compiler people think
> > it's worthwhile is enough evidence to assume this initially.  But
> > at some point you have to ask whether that assumption is supported
> > by the evidence we've accumulated over the time we've been using
> > it.  And if the evidence doesn't support it perhaps it is time to
> > stop the experiment.
> 
> Maintainers routinely review 1-line trivial patches, not to mention
> internal API changes, etc.

We're also complaining about the inability to recruit maintainers:

https://www.theregister.com/2020/06/30/hard_to_find_linux_maintainers_says_torvalds/

And burn out:

http://antirez.com/news/129

The whole crux of your argument seems to be maintainers' time isn't
important so we should accept all trivial patches ... I'm pushing back
on that assumption in two places, firstly the valulessness of the time
and secondly that all trivial patches are valuable.

> If some company does not want to pay for that, that's fine, but they
> don't get to be maintainers and claim `Supported`.

What I'm actually trying to articulate is a way of measuring value of
the patch vs cost ... it has nothing really to do with who foots the
actual bill.

One thesis I'm actually starting to formulate is that this continual
devaluing of maintainers is why we have so much difficulty keeping and
recruiting them.

James



___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-22 Thread James Bottomley
On Mon, 2020-11-23 at 09:54 +1100, Finn Thain wrote:
> But is anyone keeping score of the regressions? If unreported bugs
> count, what about unreported regressions?

Well, I was curious about the former (obviously no tool will tell me
about the latter), so I asked git what patches had a fall-through
series named in a fixes tag and these three popped out:

9cf51446e686 bpf, powerpc: Fix misuse of fallthrough in bpf_jit_comp()
6a9dc5fd6170 lib: Revert use of fallthrough pseudo-keyword in lib/
91dbd73a1739 mips/oprofile: Fix fallthrough placement

I don't think any of these is fixing a significant problem, but they
did cause someone time and trouble to investigate.

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-22 Thread James Bottomley
On Sun, 2020-11-22 at 21:35 +0100, Miguel Ojeda wrote:
> On Sun, Nov 22, 2020 at 7:22 PM James Bottomley
>  wrote:
> > Well, it's a problem in an error leg, sure, but it's not a really
> > compelling reason for a 141 patch series, is it?  All that fixing
> > this error will do is get the driver to print "oh dear there's a
> > problem" under four more conditions than it previously did.
> > 
> > We've been at this for three years now with nearly a thousand
> > patches, firstly marking all the fall throughs with /* fall through
> > */ and later changing it to fallthrough.  At some point we do have
> > to ask if the effort is commensurate with the protection
> > afforded.  Please tell me our reward for all this effort isn't a
> > single missing error print.
> 
> It isn't that much effort, isn't it?

Well, it seems to be three years of someone's time plus the maintainer
review time and series disruption of nearly a thousand patches.  Let's
be conservative and assume the producer worked about 30% on the series
and it takes about 5-10 minutes per patch to review, merge and for
others to rework existing series.  So let's say it's cost a person year
of a relatively junior engineer producing the patches and say 100h of
review and application time.  The latter is likely the big ticket item
because it's what we have in least supply in the kernel (even though
it's 20x vs the producer time).

>  Plus we need to take into account the future mistakes that it might
> prevent, too. So even if there were zero problems found so far, it is
> still a positive change.

Well, the question I was asking is if it's worth the cost which I've
tried to outline above.

> I would agree if these changes were high risk, though; but they are
> almost trivial.

It's not about the risk of the changes it's about the cost of
implementing them.  Even if you discount the producer time (which
someone gets to pay for, and if I were the engineering manager, I'd be
unhappy about), the review/merge/rework time is pretty significant in
exchange for six minor bug fixes.  Fine, when a new compiler warning
comes along it's certainly reasonable to see if we can benefit from it
and the fact that the compiler people think it's worthwhile is enough
evidence to assume this initially.  But at some point you have to ask
whether that assumption is supported by the evidence we've accumulated
over the time we've been using it.  And if the evidence doesn't support
it perhaps it is time to stop the experiment.

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-22 Thread James Bottomley
On Sun, 2020-11-22 at 11:22 -0800, Joe Perches wrote:
> On Sun, 2020-11-22 at 11:12 -0800, James Bottomley wrote:
> > On Sun, 2020-11-22 at 10:25 -0800, Joe Perches wrote:
> > > On Sun, 2020-11-22 at 10:21 -0800, James Bottomley wrote:
> > > > Please tell me our reward for all this effort isn't a single
> > > > missing error print.
> > > 
> > > There were quite literally dozens of logical defects found
> > > by the fallthrough additions.  Very few were logging only.
> > 
> > So can you give us the best examples (or indeed all of them if
> > someone is keeping score)?  hopefully this isn't a US election
> > situation ...
> 
> Gustavo?  Are you running for congress now?
> 
> https://lwn.net/Articles/794944/

That's 21 reported fixes of which about 50% seem to produce no change
in code behaviour at all, a quarter seem to have no user visible effect
with the remaining quarter producing unexpected errors on obscure
configuration parameters, which is why no-one really noticed them
before.

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-22 Thread James Bottomley
On Sun, 2020-11-22 at 10:25 -0800, Joe Perches wrote:
> On Sun, 2020-11-22 at 10:21 -0800, James Bottomley wrote:
> > Please tell me our reward for all this effort isn't a single
> > missing error print.
> 
> There were quite literally dozens of logical defects found
> by the fallthrough additions.  Very few were logging only.

So can you give us the best examples (or indeed all of them if someone
is keeping score)?  hopefully this isn't a US election situation ...

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-22 Thread James Bottomley
On Sun, 2020-11-22 at 08:17 -0800, Kees Cook wrote:
> On Fri, Nov 20, 2020 at 11:51:42AM -0800, Jakub Kicinski wrote:
> > On Fri, 20 Nov 2020 11:30:40 -0800 Kees Cook wrote:
> > > On Fri, Nov 20, 2020 at 10:53:44AM -0800, Jakub Kicinski wrote:
> > > > On Fri, 20 Nov 2020 12:21:39 -0600 Gustavo A. R. Silva wrote:  
> > > > > This series aims to fix almost all remaining fall-through
> > > > > warnings in order to enable -Wimplicit-fallthrough for Clang.
> > > > > 
> > > > > In preparation to enable -Wimplicit-fallthrough for Clang,
> > > > > explicitly add multiple break/goto/return/fallthrough
> > > > > statements instead of just letting the code fall through to
> > > > > the next case.
> > > > > 
> > > > > Notice that in order to enable -Wimplicit-fallthrough for
> > > > > Clang, this change[1] is meant to be reverted at some point.
> > > > > So, this patch helps to move in that direction.
> > > > > 
> > > > > Something important to mention is that there is currently a
> > > > > discrepancy between GCC and Clang when dealing with switch
> > > > > fall-through to empty case statements or to cases that only
> > > > > contain a break/continue/return statement[2][3][4].  
> > > > 
> > > > Are we sure we want to make this change? Was it discussed
> > > > before?
> > > > 
> > > > Are there any bugs Clangs puritanical definition of fallthrough
> > > > helped find?
> > > > 
> > > > IMVHO compiler warnings are supposed to warn about issues that
> > > > could be bugs. Falling through to default: break; can hardly be
> > > > a bug?!  
> > > 
> > > It's certainly a place where the intent is not always clear. I
> > > think this makes all the cases unambiguous, and doesn't impact
> > > the machine code, since the compiler will happily optimize away
> > > any behavioral redundancy.
> > 
> > If none of the 140 patches here fix a real bug, and there is no
> > change to machine code then it sounds to me like a W=2 kind of a
> > warning.
> 
> FWIW, this series has found at least one bug so far:
> https://lore.kernel.org/lkml/CAFCwf11izHF=g1mGry1fE5kvFFFrxzhPSM6qKAO8gxSp=kr...@mail.gmail.com/


Well, it's a problem in an error leg, sure, but it's not a really
compelling reason for a 141 patch series, is it?  All that fixing this
error will do is get the driver to print "oh dear there's a problem"
under four more conditions than it previously did.

We've been at this for three years now with nearly a thousand patches,
firstly marking all the fall throughs with /* fall through */ and later
changing it to fallthrough.  At some point we do have to ask if the
effort is commensurate with the protection afforded.  Please tell me
our reward for all this effort isn't a single missing error print.

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC] MAINTAINERS tag for cleanup robot

2020-11-22 Thread James Bottomley
On Sun, 2020-11-22 at 08:10 -0800, Tom Rix wrote:
> On 11/22/20 6:56 AM, Matthew Wilcox wrote:
> > On Sun, Nov 22, 2020 at 06:46:46AM -0800, Tom Rix wrote:
> > > On 11/21/20 7:23 PM, Matthew Wilcox wrote:
> > > > On Sat, Nov 21, 2020 at 08:50:58AM -0800, t...@redhat.com
> > > > wrote:
> > > > > The fixer review is
> > > > > https://reviews.llvm.org/D91789
> > > > > 
> > > > > A run over allyesconfig for x86_64 finds 62 issues, 5 are
> > > > > false positives. The false positives are caused by macros
> > > > > passed to other macros and by some macro expansions that did
> > > > > not have an extra semicolon.
> > > > > 
> > > > > This cleans up about 1,000 of the current 10,000 -Wextra-
> > > > > semi-stmt warnings in linux-next.
> > > > Are any of them not false-positives?  It's all very well to
> > > > enable stricter warnings, but if they don't fix any bugs,
> > > > they're just churn.
> > > > 
> > > While enabling additional warnings may be a side effect of this
> > > effort
> > > 
> > > the primary goal is to set up a cleaning robot. After that a
> > > refactoring robot.
> > Why do we need such a thing?  Again, it sounds like more churn.
> > It's really annoying when I'm working on something important that
> > gets derailed by pointless churn.  Churn also makes it harder to
> > backport patches to earlier kernels.
> > 
> A refactoring example on moving to treewide, consistent use of a new
> api may help.
> 
> Consider
> 
> 2efc459d06f1630001e3984854848a5647086232
> 
> sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output
> 
> A new api for printing in the sysfs.  How do we use it treewide ?
> 
> Done manually, it would be a heroic effort requiring high level
> maintainers pushing and likely only get partially done.
> 
> If a refactoring programatic fixit is done and validated on a one
> subsystem, it can run on all the subsystems.
> 
> The effort is a couple of weeks to write and validate the fixer,
> hours to run over the tree.
> 
> It won't be perfect but will be better than doing it manually.

Here's a thought: perhaps we don't.  sysfs_emit isn't a "new api" its a
minor rewrap of existing best practice.  The damage caused by the churn
of forcing its use everywhere would far outweigh any actual benefit
because pretty much every bug in this area has already been caught and
killed by existing tools.  We can enforce sysfs_emit going forwards
using tools like checkpatch but there's no benefit and a lot of harm to
be done by trying to churn the entire tree retrofitting it (both in
terms of review time wasted as well as patch series derailed).

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC] MAINTAINERS tag for cleanup robot

2020-11-21 Thread James Bottomley
On Sat, 2020-11-21 at 08:50 -0800, t...@redhat.com wrote:
> A difficult part of automating commits is composing the subsystem
> preamble in the commit log.  For the ongoing effort of a fixer
> producing
> one or two fixes a release the use of 'treewide:' does not seem
> appropriate.
> 
> It would be better if the normal prefix was used.  Unfortunately
> normal is
> not consistent across the tree.
> 
> 
>   D: Commit subsystem prefix
> 
> ex/ for FPGA DFL DRIVERS
> 
>   D: fpga: dfl:
> 

I've got to bet this is going to cause more issues than it solves. 
SCSI uses scsi: : for drivers but not every driver has a
MAINTAINERS entry.  We use either scsi: or scsi: core: for mid layer
things, but we're not consistent.  Block uses blk-: for all
of it's stuff but almost no s have a MAINTAINERS entry.  So
the next thing you're going to cause is an explosion of suggested
MAINTAINERs entries.

Has anyone actually complained about treewide:?

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-09 Thread James Bottomley
On Fri, 2020-10-09 at 14:34 -0700, Eric Biggers wrote:
> On Fri, Oct 09, 2020 at 12:49:57PM -0700, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > The kmap() calls in this FS are localized to a single thread.  To
> > avoid the over head of global PKRS updates use the new
> > kmap_thread() call.
> > 
> > Cc: Jaegeuk Kim 
> > Cc: Chao Yu 
> > Signed-off-by: Ira Weiny 
> > ---
> >  fs/f2fs/f2fs.h | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> > index d9e52a7f3702..ff72a45a577e 100644
> > --- a/fs/f2fs/f2fs.h
> > +++ b/fs/f2fs/f2fs.h
> > @@ -2410,12 +2410,12 @@ static inline struct page
> > *f2fs_pagecache_get_page(
> >  
> >  static inline void f2fs_copy_page(struct page *src, struct page
> > *dst)
> >  {
> > -   char *src_kaddr = kmap(src);
> > -   char *dst_kaddr = kmap(dst);
> > +   char *src_kaddr = kmap_thread(src);
> > +   char *dst_kaddr = kmap_thread(dst);
> >  
> > memcpy(dst_kaddr, src_kaddr, PAGE_SIZE);
> > -   kunmap(dst);
> > -   kunmap(src);
> > +   kunmap_thread(dst);
> > +   kunmap_thread(src);
> >  }
> 
> Wouldn't it make more sense to switch cases like this to
> kmap_atomic()?
> The pages are only mapped to do a memcpy(), then they're immediately
> unmapped.

On a VIPT/VIVT architecture, this is horrendously wasteful.  You're
taking something that was mapped at colour c_src mapping it to a new
address src_kaddr, which is likely a different colour and necessitates
flushing the original c_src, then you copy it to dst_kaddr, which is
also likely a different colour from c_dst, so dst_kaddr has to be
flushed on kunmap and c_dst has to be invalidated on kmap.  What we
should have is an architectural primitive for doing this, something
like kmemcopy_arch(dst, src).  PIPT architectures can implement it as
the above (possibly losing kmap if they don't need it) but VIPT/VIVT
architectures can set up a correctly coloured mapping so they can
simply copy from c_src to c_dst without any need to flush and the data
arrives cache hot at c_dst.

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] block: convert tasklets to use new tasklet_setup() API

2020-08-19 Thread James Bottomley
On Wed, 2020-08-19 at 21:54 +0530, Allen wrote:
> > [...]
> > > > Since both threads seem to have petered out, let me suggest in
> > > > kernel.h:
> > > > 
> > > > #define cast_out(ptr, container, member) \
> > > > container_of(ptr, typeof(*container), member)
> > > > 
> > > > It does what you want, the argument order is the same as
> > > > container_of with the only difference being you name the
> > > > containing structure instead of having to specify its type.
> > > 
> > > Not to incessantly bike shed on the naming, but I don't like
> > > cast_out, it's not very descriptive. And it has connotations of
> > > getting rid of something, which isn't really true.
> > 
> > Um, I thought it was exactly descriptive: you're casting to the
> > outer container.  I thought about following the C++ dynamic casting
> > style, so out_cast(), but that seemed a bit pejorative.  What about
> > outer_cast()?
> > 
> > > FWIW, I like the from_ part of the original naming, as it has
> > > some clues as to what is being done here. Why not just
> > > from_container()? That should immediately tell people what it
> > > does without having to look up the implementation, even before
> > > this becomes a part of the accepted coding norm.
> > 
> > I'm not opposed to container_from() but it seems a little less
> > descriptive than outer_cast() but I don't really care.  I always
> > have to look up container_of() when I'm using it so this would just
> > be another macro of that type ...
> > 
> 
>  So far we have a few which have been suggested as replacement
> for from_tasklet()
> 
> - out_cast() or outer_cast()
> - from_member().
> - container_from() or from_container()
> 
> from_container() sounds fine, would trimming it a bit work? like
> from_cont().

I'm fine with container_from().  It's the same form as container_of()
and I think we need urgent agreement to not stall everything else so
the most innocuous name is likely to get the widest acceptance.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] block: convert tasklets to use new tasklet_setup() API

2020-08-19 Thread James Bottomley
On Wed, 2020-08-19 at 07:00 -0600, Jens Axboe wrote:
> On 8/18/20 1:00 PM, James Bottomley wrote:
[...]
> > Since both threads seem to have petered out, let me suggest in
> > kernel.h:
> > 
> > #define cast_out(ptr, container, member) \
> > container_of(ptr, typeof(*container), member)
> > 
> > It does what you want, the argument order is the same as
> > container_of with the only difference being you name the containing
> > structure instead of having to specify its type.
> 
> Not to incessantly bike shed on the naming, but I don't like
> cast_out, it's not very descriptive. And it has connotations of
> getting rid of something, which isn't really true.

Um, I thought it was exactly descriptive: you're casting to the outer
container.  I thought about following the C++ dynamic casting style, so
out_cast(), but that seemed a bit pejorative.  What about outer_cast()?

> FWIW, I like the from_ part of the original naming, as it has some
> clues as to what is being done here. Why not just from_container()?
> That should immediately tell people what it does without having to
> look up the implementation, even before this becomes a part of the
> accepted coding norm.

I'm not opposed to container_from() but it seems a little less
descriptive than outer_cast() but I don't really care.  I always have
to look up container_of() when I'm using it so this would just be
another macro of that type ...

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] block: convert tasklets to use new tasklet_setup() API

2020-08-18 Thread James Bottomley
On Tue, 2020-08-18 at 13:10 -0700, Kees Cook wrote:
> On Tue, Aug 18, 2020 at 01:00:33PM -0700, James Bottomley wrote:
> > On Mon, 2020-08-17 at 13:02 -0700, Jens Axboe wrote:
> > > On 8/17/20 12:48 PM, Kees Cook wrote:
> > > > On Mon, Aug 17, 2020 at 12:44:34PM -0700, Jens Axboe wrote:
> > > > > On 8/17/20 12:29 PM, Kees Cook wrote:
> > > > > > On Mon, Aug 17, 2020 at 06:56:47AM -0700, Jens Axboe wrote:
> > > > > > > On 8/17/20 2:15 AM, Allen Pais wrote:
> > > > > > > > From: Allen Pais 
> > > > > > > > 
> > > > > > > > In preparation for unconditionally passing the
> > > > > > > > struct tasklet_struct pointer to all tasklet
> > > > > > > > callbacks, switch to using the new tasklet_setup()
> > > > > > > > and from_tasklet() to pass the tasklet pointer
> > > > > > > > explicitly.
> > > > > > > 
> > > > > > > Who came up with the idea to add a macro 'from_tasklet'
> > > > > > > that
> > > > > > > is just container_of? container_of in the code would be
> > > > > > > _much_ more readable, and not leave anyone guessing wtf
> > > > > > > from_tasklet is doing.
> > > > > > > 
> > > > > > > I'd fix that up now before everything else goes in...
> > > > > > 
> > > > > > As I mentioned in the other thread, I think this makes
> > > > > > things
> > > > > > much more readable. It's the same thing that the
> > > > > > timer_struct
> > > > > > conversion did (added a container_of wrapper) to avoid the
> > > > > > ever-repeating use of typeof(), long lines, etc.
> > > > > 
> > > > > But then it should use a generic name, instead of each sub-
> > > > > system 
> > > > > using some random name that makes people look up exactly what
> > > > > it
> > > > > does. I'm not huge fan of the container_of() redundancy, but
> > > > > adding private variants of this doesn't seem like the best
> > > > > way
> > > > > forward. Let's have a generic helper that does this, and use
> > > > > it
> > > > > everywhere.
> > > > 
> > > > I'm open to suggestions, but as things stand, these kinds of
> > > > treewide
> > > 
> > > On naming? Implementation is just as it stands, from_tasklet() is
> > > totally generic which is why I objected to it. from_member()? Not
> > > great with naming... But I can see this going further and then
> > > we'll
> > > suddenly have tons of these. It's not good for readability.
> > 
> > Since both threads seem to have petered out, let me suggest in
> > kernel.h:
> > 
> > #define cast_out(ptr, container, member) \
> > container_of(ptr, typeof(*container), member)
> > 
> > It does what you want, the argument order is the same as
> > container_of with the only difference being you name the containing
> > structure instead of having to specify its type.
> 
> I like this! Shall I send this to Linus to see if this can land in
> -rc2 for use going forward?

Sure ... he's probably been lurking on this thread anyway ... it's
about time he got off his arse^Wthe fence and made an executive
decision ...

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] block: convert tasklets to use new tasklet_setup() API

2020-08-18 Thread James Bottomley
On Mon, 2020-08-17 at 13:02 -0700, Jens Axboe wrote:
> On 8/17/20 12:48 PM, Kees Cook wrote:
> > On Mon, Aug 17, 2020 at 12:44:34PM -0700, Jens Axboe wrote:
> > > On 8/17/20 12:29 PM, Kees Cook wrote:
> > > > On Mon, Aug 17, 2020 at 06:56:47AM -0700, Jens Axboe wrote:
> > > > > On 8/17/20 2:15 AM, Allen Pais wrote:
> > > > > > From: Allen Pais 
> > > > > > 
> > > > > > In preparation for unconditionally passing the
> > > > > > struct tasklet_struct pointer to all tasklet
> > > > > > callbacks, switch to using the new tasklet_setup()
> > > > > > and from_tasklet() to pass the tasklet pointer explicitly.
> > > > > 
> > > > > Who came up with the idea to add a macro 'from_tasklet' that
> > > > > is just container_of? container_of in the code would be
> > > > > _much_ more readable, and not leave anyone guessing wtf
> > > > > from_tasklet is doing.
> > > > > 
> > > > > I'd fix that up now before everything else goes in...
> > > > 
> > > > As I mentioned in the other thread, I think this makes things
> > > > much more readable. It's the same thing that the timer_struct
> > > > conversion did (added a container_of wrapper) to avoid the
> > > > ever-repeating use of typeof(), long lines, etc.
> > > 
> > > But then it should use a generic name, instead of each sub-system 
> > > using some random name that makes people look up exactly what it
> > > does. I'm not huge fan of the container_of() redundancy, but
> > > adding private variants of this doesn't seem like the best way
> > > forward. Let's have a generic helper that does this, and use it
> > > everywhere.
> > 
> > I'm open to suggestions, but as things stand, these kinds of
> > treewide
> 
> On naming? Implementation is just as it stands, from_tasklet() is
> totally generic which is why I objected to it. from_member()? Not
> great with naming... But I can see this going further and then we'll
> suddenly have tons of these. It's not good for readability.

Since both threads seem to have petered out, let me suggest in
kernel.h:

#define cast_out(ptr, container, member) \
container_of(ptr, typeof(*container), member)

It does what you want, the argument order is the same as container_of
with the only difference being you name the containing structure
instead of having to specify its type.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] screen freeze with 5.2-rc6 Dell XPS-13 skylake i915

2019-07-17 Thread James Bottomley
On Wed, 2019-07-17 at 23:27 +0200, Paul Bolle wrote:
> Hi Jose,
> 
> Souza, Jose schreef op di 16-07-2019 om 16:32 [+]:
> > Paul and James could you test this final solution(at least for
> > 5.2)? Please revert the hack patch and apply this one.
> 
> I've just reached a day of uptime with your revert. (The proper
> uptime is just a fraction of a day, this being a laptop.) Anyhow, no
> screen freezes occurred during this day.

I'm afraid my status is that I'm in Tokyo doing conference
presentations (and kernel demos) so I'm a bit reluctant to mess with my
setup until I finish everything on Friday, but I will test it after
then, promise.

> So feel free to add my Tested-by tag to your revert. But please don't
> forget that James earned a Reported-by tag.

Thanks,

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] screen freeze with 5.2-rc6 Dell XPS-13 skylake i915

2019-07-12 Thread James Bottomley
On Thu, 2019-07-11 at 16:40 -0700, James Bottomley wrote:
> On Thu, 2019-07-11 at 23:28 +, Souza, Jose wrote:
> > On Fri, 2019-07-12 at 01:03 +0200, Paul Bolle wrote:
> > > James Bottomley schreef op do 11-07-2019 om 15:38 [-0700]:
> > > > On Thu, 2019-07-11 at 22:26 +, Souza, Jose wrote:
> > > > > It eventually comes back from screen freeze? Like moving the
> > > > > mouse or typing brings it back?
> > > > 
> > > > No, it seems to be frozen for all time (at least until I got
> > > > bored waiting, which was probably 20 minutes).  Even if I
> > > > reboot the machine, the current screen state stays until the
> > > > system powers off.
> > > 
> > > As I mentioned earlier, a suspend/resume cycle unfreezes the
> > > screen.
> > > 
> > > And I seem to remember that, if the gnome screen-locking
> > > eventually kicks in, unlocking the screen still works, as the
> > > screen then isn't frozen anymore.
> > > 
> > > Thanks,
> > 
> > Thanks for all the information Paul.
> > 
> > Could test with the patch attached?
> 
> Applied and running with it now.

It has survived 6h without manifesting the regression.  Starting again
to try a whole day.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] screen freeze with 5.2-rc6 Dell XPS-13 skylake i915

2019-07-11 Thread James Bottomley
On Thu, 2019-07-11 at 23:28 +, Souza, Jose wrote:
> On Fri, 2019-07-12 at 01:03 +0200, Paul Bolle wrote:
> > James Bottomley schreef op do 11-07-2019 om 15:38 [-0700]:
> > > On Thu, 2019-07-11 at 22:26 +, Souza, Jose wrote:
> > > > It eventually comes back from screen freeze? Like moving the
> > > > mouse or typing brings it back?
> > > 
> > > No, it seems to be frozen for all time (at least until I got
> > > bored waiting, which was probably 20 minutes).  Even if I reboot
> > > the machine, the current screen state stays until the system
> > > powers off.
> > 
> > As I mentioned earlier, a suspend/resume cycle unfreezes the
> > screen.
> > 
> > And I seem to remember that, if the gnome screen-locking eventually
> > kicks in, unlocking the screen still works, as the screen then
> > isn't frozen anymore.
> > 
> > Thanks,
> 
> Thanks for all the information Paul.
> 
> Could test with the patch attached?

Applied and running with it now.

> If the issue happens again could send the output of:
> 
> /sys/kernel/debug/dri/0/eDP-1/i915_psr_sink_status
> /sys/kernel/debug/dri/0/i915_edp_psr_status
> 
> Thanks so much for all the help

Sure,

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] screen freeze with 5.2-rc6 Dell XPS-13 skylake i915

2019-07-11 Thread James Bottomley
On Thu, 2019-07-11 at 22:26 +, Souza, Jose wrote:
> On Thu, 2019-07-11 at 14:57 -0700, James Bottomley wrote:
> > On Thu, 2019-07-11 at 13:28 -0700, James Bottomley wrote:
> > > I've also updated to the released 5.2 kernel and am running with
> > > the
> > > debug parameters you requested ... but so far no reproduction.
> > 
> > OK, it's happened.  I've attached the dmesg (it's 4MB
> > uncompressed). 
> > Is there any other output you'd like from the machine?  I've got an
> > ssh session into it so I can try anything.
> 
> Thanks, could you also share the output of this after the screen
> freeze?
> 
> /sys/kernel/debug/dri/0/i915_edp_psr_status
> /sys/kernel/debug/dri/0/i915_display_info
> /sys/kernel/debug/dri/0/i915_dmc_info
> /sys/kernel/debug/pmc_core/package_cstate_show

jarvis:~ # for f in `cat ~/tmp.txt`; do echo $f; cat $f; done
/sys/kernel/debug/dri/0/i915_edp_psr_status
Sink support: yes [0x01]
PSR mode: PSR1 enabled
Source PSR ctl: disabled [0x01f00726]
Source PSR status: IDLE [0x04010216]
Busy frontbuffer bits: 0x0001
/sys/kernel/debug/dri/0/i915_display_info
CRTC info
-
CRTC 47: pipe: A, active=yes, (size=3200x1800), dither=no, bpp=24
fb: 118, pos: 0x0, size: 3200x1800
encoder 84: type: DDI A, connectors:
connector 85: type: eDP-1, status: connected, mode:
"": 0 373250 3200 3248 3280 3360 1800 1803 1808 1852
0x0 0xa
cursor visible? yes, position (-3, 261), size 256x256, addr
0x0174
num_scalers=2, scaler_users=0 scaler_id=-1, scalers[0]: use=no,
mode=0, scalers[1]: use=no, mode=0
--Plane id 30: type=PRI, crtc_pos=   0x   0,
crtc_size=3200x1800, src_pos=0.x0.,
src_size=3200.x1800., format=XR24 little-endian (0x34325258),
rotation=0 (0x0001)
--Plane id 37: type=OVL, crtc_pos=   0x   0,
crtc_size=   0x   0, src_pos=0.x0., src_size=0.x0.,
format=N/A, rotation=0 (0x0001)
--Plane id 44: type=CUR, crtc_pos=  -3x 261, crtc_size= 256x
256, src_pos=0.x0., src_size=256.x256., format=AR24
little-endian (0x34325241), rotation=0 (0x0001)
underrun reporting: cpu=yes pch=yes 
CRTC 65: pipe: B, active=yes, (size=1600x1200), dither=no, bpp=24
fb: 118, pos: 0x0, size: 3200x1800
encoder 90: type: DDI B, connectors:
connector 91: type: DP-1, status: connected, mode:
"": 0 162000 1600 1664 1856 2160 1200 1201 1204 1250
0x0 0x5
cursor visible? yes, position (-3, 261), size 256x256, addr
0x017c
num_scalers=2, scaler_users=0 scaler_id=-1, scalers[0]: use=no,
mode=0, scalers[1]: use=no, mode=0
--Plane id 48: type=PRI, crtc_pos=   0x   0,
crtc_size=1600x1200, src_pos=0.x0.,
src_size=1600.x1200., format=XR24 little-endian (0x34325258),
rotation=0 (0x0001)
--Plane id 55: type=OVL, crtc_pos=   0x   0,
crtc_size=   0x   0, src_pos=0.x0., src_size=0.x0.,
format=N/A, rotation=0 (0x0001)
--Plane id 62: type=CUR, crtc_pos=  -3x 261, crtc_size= 256x
256, src_pos=0.x0., src_size=256.x256., format=AR24
little-endian (0x34325241), rotation=0 (0x0001)
underrun reporting: cpu=yes pch=yes 
CRTC 83: pipe: C, active=no, (size=0x0), dither=no, bpp=0
underrun reporting: cpu=yes pch=yes 

Connector info
--
connector 85: type eDP-1, status: connected
physical dimensions: 290x170mm
subpixel order: Unknown
CEA rev: 0
DPCD rev: 12
audio support: no
fixed mode:
"3200x1800": 60 373250 3200 3248 3280 3360 1800 1803
1808 1852 0x48 0xa
DP branch device present: no
modes:
"3200x1800": 60 373250 3200 3248 3280 3360 1800 1803
1808 1852 0x48 0xa
"3200x1800": 48 298600 3200 3248 3280 3360 1800 1803
1808 1852 0x40 0xa
connector 91: type DP-1, status: connected
physical dimensions: 430x320mm
subpixel order: Unknown
CEA rev: 0
DPCD rev: 12
audio support: no
DP branch device present: yes
Type: VGA
ID: 
HW: 0.0
SW: 1.0
modes:
"1600x1200": 60 162000 1600 1664 1856 2160 1200 1201
1204 1250 0x48 0x5
"1400x1050": 75 156000 1400 1504 1648 1896 1050 1053
1057 1099 0x40 0x6
"1400x1050": 60 121750 1400 1488 1632 1864 1050 1053
1057 1089 0x40 0x6
"1280x1024": 75 135000 1280 1296 1440 1688 1024 1025
1028 1066 0x40 0x5
"1280x1024": 60 108000 1280 1328 1440 1688 1024 1025
1028 1066 0x40 0x5
"1280x960": 60 108000 1280 1376 1488 1800 960 961 964
1000 0x40 0x5
"1152x864": 75 10800

Re: [Intel-gfx] screen freeze with 5.2-rc6 Dell XPS-13 skylake i915

2019-07-11 Thread James Bottomley
On Thu, 2019-07-11 at 20:25 +, Souza, Jose wrote:
> On Thu, 2019-07-11 at 13:11 -0700, James Bottomley wrote:
> > On Thu, 2019-07-11 at 10:29 +0100, Chris Wilson wrote:
> > > Quoting James Bottomley (2019-06-29 19:56:52)
> > > > The symptoms are really weird: the screen image is locked in
> > > > place.  The machine is still functional and if I log in over
> > > > the network can do anything I like, including killing the X
> > > > server and the display will never alter.  It also seems that
> > > > the system is accepting keyboard input because when it freezes
> > > > I can cat information to a file (if the mouse was over an
> > > > xterm) and verify over the network the file contents. Nothing
> > > > unusual appears in dmesg when the lockup happens.
> > > > 
> > > > The last kernel I booted successfully on the system was 5.0, so
> > > > I'll try compiling 5.1 to narrow down the changes.
> > > 
> > > It's likely this is panel self-refresh going haywire.
> > > 
> > > commit 8f6e87d6d561f10cfa48a687345512419839b6d8
> > > Author: José Roberto de Souza 
> > > Date:   Thu Mar 7 16:00:50 2019 -0800
> > > 
> > > drm/i915: Enable PSR2 by default
> > > 
> > > The support for PSR2 was polished, IGT tests for PSR2 was
> > > added and
> > > it was tested performing regular user workloads like
> > > browsing,
> > > editing documents and compiling Linux, so it is time to
> > > enable it by
> > > default and enjoy even more power-savings.
> > > 
> > > Temporary workaround would be to set i915.enable_psr=0
> > 
> > It looks plausible.  I have to say I was just about to mark a
> > bisect containing this as good, but that probably reflects my
> > difficulty
> > reproducing the issue.
> 
> Take at look of what PSR version is supported by your panel, it
> likely that a notebook shipped with Skylake will have panel that
> supports only PSR1 so that patch has no effect on your machine.
> 
> sudo more /sys/kernel/debug/dri/0/i915_edp_psr_status
> Sink support: yes [0x01]

It says

Sink support: yes [0x01]
PSR mode: PSR1 enabled
Source PSR ctl: enabled [0x81f00726]
Source PSR status: IDLE [0x04010212]
Busy frontbuffer bits: 0x


I've also updated to the released 5.2 kernel and am running with the
debug parameters you requested ... but so far no reproduction.

James



Re: [Intel-gfx] screen freeze with 5.2-rc6 Dell XPS-13 skylake i915

2019-07-11 Thread James Bottomley
On Thu, 2019-07-11 at 10:29 +0100, Chris Wilson wrote:
> Quoting James Bottomley (2019-06-29 19:56:52)
> > The symptoms are really weird: the screen image is locked in
> > place.  The machine is still functional and if I log in over the
> > network can do anything I like, including killing the X server and
> > the display will never alter.  It also seems that the system is
> > accepting keyboard input because when it freezes I can cat
> > information to a file (if the mouse was over an xterm) and verify
> > over the network the file contents. Nothing unusual appears in
> > dmesg when the lockup happens.
> > 
> > The last kernel I booted successfully on the system was 5.0, so
> > I'll try compiling 5.1 to narrow down the changes.
> 
> It's likely this is panel self-refresh going haywire.
> 
> commit 8f6e87d6d561f10cfa48a687345512419839b6d8
> Author: José Roberto de Souza 
> Date:   Thu Mar 7 16:00:50 2019 -0800
> 
> drm/i915: Enable PSR2 by default
> 
> The support for PSR2 was polished, IGT tests for PSR2 was added
> and
> it was tested performing regular user workloads like browsing,
> editing documents and compiling Linux, so it is time to enable it
> by
> default and enjoy even more power-savings.
> 
> Temporary workaround would be to set i915.enable_psr=0

It looks plausible.  I have to say I was just about to mark a bisect
containing this as good, but that probably reflects my difficulty
reproducing the issue.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] screen freeze with 5.2-rc6 Dell XPS-13 skylake i915

2019-07-10 Thread James Bottomley
On Wed, 2019-07-10 at 23:59 +0200, Paul Bolle wrote:
> James Bottomley schreef op wo 10-07-2019 om 10:35 [-0700]:
> > I can get back to it this afternoon, when I'm done with the meeting
> > requirements and doing other dev stuff.
> 
> I've started bisecting using your suggestion of that drm merge:
> $ git bisect log
> git bisect start
> # good: [89c3b37af87ec183b666d83428cb28cc421671a6] Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide
> git bisect good 89c3b37af87ec183b666d83428cb28cc421671a6
> # bad: [a2d635decbfa9c1e4ae15cb05b68b2559f7f827c] Merge tag 'drm-
> next-2019-05-09' of git://anongit.freedesktop.org/drm/drm
> git bisect bad a2d635decbfa9c1e4ae15cb05b68b2559f7f827c
> # bad: [ad2c467aa92e283e9e8009bb9eb29a5c6a2d1217] drm/i915:
> Update DRIVER_DATE to 20190417
> git bisect bad ad2c467aa92e283e9e8009bb9eb29a5c6a2d1217
> 
> Git told me I have nine steps after this. So at two hours per step I
> might
> pinpoint the offending commit by Friday the 12th. If I'm lucky.
> (There are
> other things to do than bisecting this issue.)
> 
> If you find that commit before I do, I'll be all ears.

Sure ... I'm doing the holistic thing and looking at the tree in that
branch.  It seems to consist of 7 i915 updates

c09d39166d8a3f3788680b32dbb0a40a70de32e2 DRIVER_DATE to 20190207
47ed55a9bb9e284d46d6f2489e32a53b59152809 DRIVER_DATE to 20190220
f4ecb8ae70de86710e85138ce49af5c689951953 DRIVER_DATE to 20190311
1284ec985572232ace4817476baeb2d82b60be7a DRIVER_DATE to 20190320
a01b2c6f47d86c7d1a9fa822b3b91ec233b61784 DRIVER_DATE to 20190328
28d618e9ab86f26a31af0b235ced55beb3e343c8 DRIVER_DATE to 20190404
ad2c467aa92e283e9e8009bb9eb29a5c6a2d1217 DRIVER_DATE to 20190417

So I figured I'd see if I can locate the problem by bisection of those
plus inspection.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] screen freeze with 5.2-rc6 Dell XPS-13 skylake i915

2019-07-10 Thread James Bottomley
On Wed, 2019-07-10 at 18:45 +0200, Paul Bolle wrote:
> James Bottomley schreef op wo 10-07-2019 om 09:32 [-0700]:
> > You seem to be getting it to happen much more often than I can.
> > Last
> > night, on the below pull request it took me a good hour to see the
> > freeze.
> 
> Yes. Sometimes within a minute of resuming. Typing stuff into
> evolution seems to help with triggering this. It's all a bit
> mysterious, but this message alone frooze my laptop a few times.
> Seriously!
> 
> > Sure, my current testing indicates it's somewhere inside this pull
> > request:
> > 
> > Merge: 89c3b37af87e eb85d03e01c3
> > Author: Linus Torvalds 
> > Date:   Wed May 8 21:35:19 2019 -0700
> > 
> > Merge tag 'drm-next-2019-05-09' of
> > git://anongit.freedesktop.org/drm/drm
> > 
> > Pull drm updates from Dave Airlie:
> 
> Lazy question: how does one determine the first and last commit
> inside a merge request? Can I simply do
> good   a2d635decbfa9c1e4ae15cb05b68b2559f7f827c^
> bada2d635decbfa9c1e4ae15cb05b68b2559f7f827c
> 
> for git bisect?

I think so.  I actually did

bad a2d635decbfa9c1e4ae15cb05b68b2559f7f827c
good89c3b37af87ec183b666d83428cb28cc421671a6

But I think ^ steps down in the main branch.  Note, I've only done a
couple of hours run on what I think is the good commit ... and actually
I'd already marked a2d635decbfa9c1e4ae15cb05b68b2559f7f827c as good
until the screen froze just after I'd built the new bisect kernel, so
there's still some chance 89c3b37af87ec183b666d83428cb28cc421671a6 is
bad.

> > So I was about to test out the i915 changes in that but since my
> > laptop is what I use for daily work, it's a bit hard (can't freeze
> > up on video calls for instance).
> 
> I usually use one of the ThinkPads from my embarrassing pile of
> outdated hardware to do nasty bisects, but I'm not about to loose any
> income if my much appreciated XPS 13 is out of order for a while.

I can get back to it this afternoon, when I'm done with the meeting
requirements and doing other dev stuff.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] screen freeze with 5.2-rc6 Dell XPS-13 skylake i915

2019-07-10 Thread James Bottomley
On Wed, 2019-07-10 at 18:16 +0200, Paul Bolle wrote:
> Hi James,
> 
> James Bottomley schreef op wo 10-07-2019 om 08:01 [-0700]:
> > I've confirmed that 5.1 doesn't have the regression and I'm now
> > trying to bisect the 5.2 merge window, but since the problem takes
> > quite a while to manifest this will take some time.  Any hints
> > about specific patches that might be the problem would be welcome.
> 
> (Perhaps my message of yesterday never reached you.)

No, sorry, if the list is followup to list, I'm not subscribed.  I see
it now I look in the archives, though.

---
> Upgrading to 5.2 (from 5.1.y) on a "Dell XPS 13 9350" (is that a
> skylake too?)

I believe so.  My laptop is a 9350.  I believe they're the earliest
skylake produced. 

>  showed similar symptoms. There's no pattern to the freezes that I
> can see. They're rather frequent too (think every few minutes). Eg,
> two freezes while composing this message!

You seem to be getting it to happen much more often than I can. Last
night, on the below pull request it took me a good hour to see the
freeze.

---
> It seems I hit this problem quite easily. Bisecting v5.1..v5.2 could
> be a real chore, so perhaps we could coordinate efforts (off-list)?

Sure, my current testing indicates it's somewhere inside this pull
request:

Merge: 89c3b37af87e eb85d03e01c3
Author: Linus Torvalds 
Date:   Wed May 8 21:35:19 2019 -0700

Merge tag 'drm-next-2019-05-09' of git://anongit.freedesktop.org/drm/drm

Pull drm updates from Dave Airlie:

So I was about to test out the i915 changes in that but since my laptop
is what I use for daily work, it's a bit hard (can't freeze up on video
calls for instance).

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] screen freeze with 5.2-rc6 Dell XPS-13 skylake i915

2019-07-10 Thread James Bottomley
On Sat, 2019-06-29 at 11:56 -0700, James Bottomley wrote:
> The symptoms are really weird: the screen image is locked in place. 
> The machine is still functional and if I log in over the network I
> can do anything I like, including killing the X server and the
> display will never alter.  It also seems that the system is accepting
> keyboard input because when it freezes I can cat information to a
> file (if the mouse was over an xterm) and verify over the network the
> file contents. Nothing unusual appears in dmesg when the lockup
> happens.
> 
> The last kernel I booted successfully on the system was 5.0, so I'll
> try compiling 5.1 to narrow down the changes.

I've confirmed that 5.1 doesn't have the regression and I'm now trying
to bisect the 5.2 merge window, but since the problem takes quite a
while to manifest this will take some time.  Any hints about specific
patches that might be the problem would be welcome.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

screen freeze with 5.2-rc6 Dell XPS-13 skylake i915

2019-06-29 Thread James Bottomley
The symptoms are really weird: the screen image is locked in place. 
The machine is still functional and if I log in over the network I can
do anything I like, including killing the X server and the display will
never alter.  It also seems that the system is accepting keyboard input
because when it freezes I can cat information to a file (if the mouse
was over an xterm) and verify over the network the file contents. 
Nothing unusual appears in dmesg when the lockup happens.

The last kernel I booted successfully on the system was 5.0, so I'll
try compiling 5.1 to narrow down the changes.

James



Re: [Intel-gfx] Skylake graphics regression: projector failure with 4.8-rc3

2016-09-19 Thread James Bottomley
On Mon, 2016-09-19 at 08:09 -0700, James Bottomley wrote:
> On Sun, 2016-09-18 at 13:35 +0200, Thorsten Leemhuis wrote:
> > Hi! James & Paulo: What's the current status of this?
> 
> No, the only interaction has been the suggestion below for a revert,
> which didn't fix the problem.
> 
> >  Was this issue discussed elsewhere or even fixed in between? Just 
> > asking, because this issue is on the list of regressions for 4.8.
> 
> 
> I'm just about to try out -rc7, but it's not fixed so far.

OK, with -rc7 and the i915 fixes, there seems to be a marked
improvement.  I can no longer crash the crtc by using lots of xrandr
switches, which was the principal problem.

I've so far only got one of these in the logs

[14858.635035] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR*
CPU pipe B FIFO underrun

And the only residual problem seems to be that the monitor goes blank
periodically, but this can be fixed by switching resolution a couple of
times.

I haven't seen any of the link training errors so far and I've run
through my usual battery of be nasty to the external monitor tests.

James



___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Skylake graphics regression: projector failure with 4.8-rc3

2016-09-19 Thread James Bottomley
On Sun, 2016-09-18 at 13:35 +0200, Thorsten Leemhuis wrote:
> Hi! James & Paulo: What's the current status of this?

No, the only interaction has been the suggestion below for a revert,
which didn't fix the problem.

>  Was this issue discussed elsewhere or even fixed in between? Just 
> asking, because this issue is on the list of regressions for 4.8.


I'm just about to try out -rc7, but it's not fixed so far.

James


> Ciao, Thorsten
> 
> On 01.09.2016 00:25, James Bottomley wrote:
> > On Wed, 2016-08-31 at 21:51 +, Zanoni, Paulo R wrote:
> > > Em Qua, 2016-08-31 às 14:43 -0700, James Bottomley escreveu:
> > > > On Wed, 2016-08-31 at 11:23 -0700, James Bottomley wrote:
> > > > > On Fri, 2016-08-26 at 09:10 -0400, James Bottomley wrote:
> > > > > > We seem to have an xrandr regression with skylake now. 
> > > > > >  What's
> > > > > > happening is that I can get output on to a projector, but
> > > > > > the 
> > > > > > system is losing video when I change the xrandr sessions
> > > > > > (like 
> > > > > > going from a --above b to a --same-as b).  The main screen
> > > > > > goes
> > > > > > blank, which is basically a reboot situation. 
> > > > > >  Unfortunately, I
> > > > > > can't seem to get the logs out of systemd to see if there
> > > > > > was a
> > > > > > dump to dmesg (the system was definitely responding).
> > > > > > 
> > > > > > I fell back to 4.6.2 which worked perfectly, so this is
> > > > > > definitely 
> > > > > > some sort of regression.  I'll be able to debug more fully
> > > > > > when
> > > > > > I 
> > > > > > get back home from the Linux Security Summit.
> > > > > 
> > > > > I'm home now.  Unfortunately, my monitor isn't as problematic
> > > > > as
> > > > > the
> > > > > projector, but by flipping between various modes and
> > > > > separating
> > > > > and
> > > > > overlaying the panels with --above and --same-as (xrandr), I
> > > > > can
> > > > > eventually get it to the point where the main LCD panel goes
> > > > > black 
> > > > > and can only be restarted by specifying a different mode.
> > > > > 
> > > > > This seems to be associated with these lines in the X
> > > > > 
> > > > > [ 14714.389] (EE) intel(0): failed to set mode: Invalid
> > > > > argument
> > > > > [22]
> > > > > 
> > > > > But the curious thing is that even if this fails with the
> > > > > error 
> > > > > message once, it may succeed a second time, so it looks to be
> > > > > a 
> > > > > transient error translation problem from the kernel driver.
> > > > > 
> > > > > I've attached the full log below.
> > > > > 
> > > > > This is only with a VGA output.  I currently don't have a
> > > > > HDMI 
> > > > > dongle, but I'm in the process of acquiring one.
> > > > 
> > > > After more playing around, I'm getting thousands of these in
> > > > the
> > > > kernel
> > > > log (possibly millions: the log wraps very fast):
> > > > 
> > > > [23504.873606] [drm:intel_dp_start_link_train [i915]] *ERROR*
> > > > failed
> > > > to train DP, aborting
> > > > 
> > > > And then finally it gives up with 
> > > > 
> > > > [25023.770951] [drm:intel_cpu_fifo_underrun_irq_handler [i915]]
> > > > *ERROR* CPU pipe B FIFO underrun
> > > > [25561.926075] [drm:intel_cpu_fifo_underrun_irq_handler [i915]]
> > > > *ERROR* CPU pipe A FIFO underrun
> > > > 
> > > > And the crtc for the VGA output becomes non-responsive to any
> > > > configuration command.  This requires a reboot and sometimes a
> > > > UEFI
> > > > variable reset before it comes back.
> > > 
> > > Please see this discussion:
> > > https://patchwork.freedesktop.org/patch/103237/
> > > 
> > > Do you have this patch on your tree? Does the problem go away if
> > > you
> > > revert it?
> > 
> > Yes, I've got it, it went in in 4.8-rc3 according to git:
> > 
> > commit 58e311b09c319183254d9220c50a533e7157c9ab
> > Author: Matt Roper <matthew.d.ro...@intel.com>
> > Date:   Thu Aug 4 14:08:00 2016 -0700
> > 
> > drm/i915/gen9: Give one extra block per line for SKL plane WM
> > calculations
> > 
> > Reverting it causes the secondary display not to sync pretty much
> > at
> > all.  However, in the flickers I can see, it does work OK and
> > doesn't
> > now crash switching from --same-as to --above and back
> > 
> > I also still get the logs filling up with the link training errors.
> > 
> > On balance, although the behaviour is different, it's not an
> > improvement because if I can't sync with the projector, I can't
> > really
> > use this as a fix.
> > 
> > James
> 

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Skylake graphics regression: projector failure with 4.8-rc3

2016-08-31 Thread James Bottomley
On Wed, 2016-08-31 at 21:51 +, Zanoni, Paulo R wrote:
> Hi
> 
> Em Qua, 2016-08-31 às 14:43 -0700, James Bottomley escreveu:
> > On Wed, 2016-08-31 at 11:23 -0700, James Bottomley wrote:
> > > 
> > > On Fri, 2016-08-26 at 09:10 -0400, James Bottomley wrote:
> > > > 
> > > > We seem to have an xrandr regression with skylake now.  What's
> > > > happening is that I can get output on to a projector, but the 
> > > > system is losing video when I change the xrandr sessions (like 
> > > > going from a --above b to a --same-as b).  The main screen goes
> > > > blank, which is basically a reboot situation.  Unfortunately, I
> > > > can't seem to get the logs out of systemd to see if there was a
> > > > dump to dmesg (the system was definitely responding).
> > > > 
> > > > I fell back to 4.6.2 which worked perfectly, so this is
> > > > definitely 
> > > > some sort of regression.  I'll be able to debug more fully when
> > > > I 
> > > > get back home from the Linux Security Summit.
> > > 
> > > I'm home now.  Unfortunately, my monitor isn't as problematic as
> > > the
> > > projector, but by flipping between various modes and separating
> > > and
> > > overlaying the panels with --above and --same-as (xrandr), I can
> > > eventually get it to the point where the main LCD panel goes
> > > black 
> > > and can only be restarted by specifying a different mode.
> > > 
> > > This seems to be associated with these lines in the X
> > > 
> > > [ 14714.389] (EE) intel(0): failed to set mode: Invalid argument
> > > [22]
> > > 
> > > But the curious thing is that even if this fails with the error 
> > > message once, it may succeed a second time, so it looks to be a 
> > > transient error translation problem from the kernel driver.
> > > 
> > > I've attached the full log below.
> > > 
> > > This is only with a VGA output.  I currently don't have a HDMI 
> > > dongle, but I'm in the process of acquiring one.
> > 
> > After more playing around, I'm getting thousands of these in the
> > kernel
> > log (possibly millions: the log wraps very fast):
> > 
> > [23504.873606] [drm:intel_dp_start_link_train [i915]] *ERROR*
> > failed
> > to train DP, aborting
> > 
> > And then finally it gives up with 
> > 
> > [25023.770951] [drm:intel_cpu_fifo_underrun_irq_handler [i915]]
> > *ERROR* CPU pipe B FIFO underrun
> > [25561.926075] [drm:intel_cpu_fifo_underrun_irq_handler [i915]]
> > *ERROR* CPU pipe A FIFO underrun
> > 
> > And the crtc for the VGA output becomes non-responsive to any
> > configuration command.  This requires a reboot and sometimes a UEFI
> > variable reset before it comes back.
> 
> Please see this discussion:
> https://patchwork.freedesktop.org/patch/103237/
> 
> Do you have this patch on your tree? Does the problem go away if you
> revert it?

Yes, I've got it, it went in in 4.8-rc3 according to git:

commit 58e311b09c319183254d9220c50a533e7157c9ab
Author: Matt Roper <matthew.d.ro...@intel.com>
Date:   Thu Aug 4 14:08:00 2016 -0700

drm/i915/gen9: Give one extra block per line for SKL plane WM
calculations

Reverting it causes the secondary display not to sync pretty much at
all.  However, in the flickers I can see, it does work OK and doesn't
now crash switching from --same-as to --above and back

I also still get the logs filling up with the link training errors.

On balance, although the behaviour is different, it's not an
improvement because if I can't sync with the projector, I can't really
use this as a fix.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Skylake graphics regression: projector failure with 4.8-rc3

2016-08-31 Thread James Bottomley
On Wed, 2016-08-31 at 11:23 -0700, James Bottomley wrote:
> On Fri, 2016-08-26 at 09:10 -0400, James Bottomley wrote:
> > We seem to have an xrandr regression with skylake now.  What's
> > happening is that I can get output on to a projector, but the 
> > system is losing video when I change the xrandr sessions (like 
> > going from a --above b to a --same-as b).  The main screen goes 
> > blank, which is basically a reboot situation.  Unfortunately, I 
> > can't seem to get the logs out of systemd to see if there was a 
> > dump to dmesg (the system was definitely responding).
> > 
> > I fell back to 4.6.2 which worked perfectly, so this is definitely 
> > some sort of regression.  I'll be able to debug more fully when I 
> > get back home from the Linux Security Summit.
> 
> I'm home now.  Unfortunately, my monitor isn't as problematic as the
> projector, but by flipping between various modes and separating and
> overlaying the panels with --above and --same-as (xrandr), I can
> eventually get it to the point where the main LCD panel goes black 
> and can only be restarted by specifying a different mode.
> 
> This seems to be associated with these lines in the X
> 
> [ 14714.389] (EE) intel(0): failed to set mode: Invalid argument [22]
> 
> But the curious thing is that even if this fails with the error 
> message once, it may succeed a second time, so it looks to be a 
> transient error translation problem from the kernel driver.
> 
> I've attached the full log below.
> 
> This is only with a VGA output.  I currently don't have a HDMI 
> dongle, but I'm in the process of acquiring one.

After more playing around, I'm getting thousands of these in the kernel
log (possibly millions: the log wraps very fast):

[23504.873606] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train 
DP, aborting

And then finally it gives up with 

[25023.770951] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU 
pipe B FIFO underrun
[25561.926075] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU 
pipe A FIFO underrun

And the crtc for the VGA output becomes non-responsive to any
configuration command.  This requires a reboot and sometimes a UEFI
variable reset before it comes back.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Skylake graphics regression: projector failure with 4.8-rc3

2016-08-31 Thread James Bottomley
On Fri, 2016-08-26 at 09:10 -0400, James Bottomley wrote:
> We seem to have an xrandr regression with skylake now.  What's
> happening is that I can get output on to a projector, but the system 
> is losing video when I change the xrandr sessions (like going from a 
> --above b to a --same-as b).  The main screen goes blank, which is
> basically a reboot situation.  Unfortunately, I can't seem to get the
> logs out of systemd to see if there was a dump to dmesg (the system 
> was definitely responding).
> 
> I fell back to 4.6.2 which worked perfectly, so this is definitely 
> some sort of regression.  I'll be able to debug more fully when I get 
> back home from the Linux Security Summit.

I'm home now.  Unfortunately, my monitor isn't as problematic as the
projector, but by flipping between various modes and separating and
overlaying the panels with --above and --same-as (xrandr), I can
eventually get it to the point where the main LCD panel goes black and
can only be restarted by specifying a different mode.

This seems to be associated with these lines in the X

[ 14714.389] (EE) intel(0): failed to set mode: Invalid argument [22]

But the curious thing is that even if this fails with the error message
once, it may succeed a second time, so it looks to be a transient error
translation problem from the kernel driver.

I've attached the full log below.

This is only with a VGA output.  I currently don't have a HDMI dongle,
but I'm in the process of acquiring one.

James
[30.672] 
X.Org X Server 1.17.2
Release Date: 2015-06-16
[30.672] X Protocol Version 11, Revision 0
[30.672] Build Operating System: openSUSE SUSE LINUX
[30.672] Current Operating System: Linux jarvis 4.8.0-rc3+ #12 SMP PREEMPT Mon Aug 22 08:16:45 EDT 2016 x86_64
[30.672] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-test root=UUID=994363fd-c12c-4c6c-92e9-85e8ac695cf9 ro resume=/dev/nvme0n1p2 splash=silent quiet showopts pcie_aspm=force
[30.672] Build Date: 07 March 2016  08:22:28AM
[30.672]  
[30.672] Current version of pixman: 0.32.6
[30.672] 	Before reporting problems, check http://wiki.x.org
	to make sure that you have the latest version.
[30.672] Markers: (--) probed, (**) from config file, (==) default setting,
	(++) from command line, (!!) notice, (II) informational,
	(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[30.673] (==) Log file: "/var/log/Xorg.0.log", Time: Wed Aug 31 07:10:56 2016
[30.673] (==) Using config directory: "/etc/X11/xorg.conf.d"
[30.673] (==) Using system config directory "/usr/share/X11/xorg.conf.d"
[30.673] (==) No Layout section.  Using the first Screen section.
[30.673] (==) No screen section available. Using defaults.
[30.673] (**) |-->Screen "Default Screen Section" (0)
[30.673] (**) |   |-->Monitor ""
[30.673] (==) No monitor specified for screen "Default Screen Section".
	Using a default monitor configuration.
[30.673] (==) Automatically adding devices
[30.673] (==) Automatically enabling devices
[30.673] (==) Automatically adding GPU devices
[30.674] (WW) The directory "/usr/share/fonts/misc/sgi" does not exist.
[30.674] 	Entry deleted from font path.
[30.674] (==) FontPath set to:
	/usr/share/fonts/misc:unscaled,
	/usr/share/fonts/Type1/,
	/usr/share/fonts/100dpi:unscaled,
	/usr/share/fonts/75dpi:unscaled,
	/usr/share/fonts/ghostscript/,
	/usr/share/fonts/cyrillic:unscaled,
	/usr/share/fonts/truetype/,
	built-ins
[30.674] (==) ModulePath set to "/usr/lib64/xorg/modules"
[30.674] (II) The server relies on udev to provide the list of input devices.
	If no devices become available, reconfigure udev or disable AutoAddDevices.
[30.674] (II) Loader magic: 0x80dd00
[30.674] (II) Module ABI versions:
[30.674] 	X.Org ANSI C Emulation: 0.4
[30.674] 	X.Org Video Driver: 19.0
[30.674] 	X.Org XInput driver : 21.0
[30.674] 	X.Org Server Extension : 9.0
[30.675] (++) using VT number 7

[30.675] (II) systemd-logind: logind integration requires -keeptty and -keeptty was not provided, disabling logind integration
[30.675] (II) xfree86: Adding drm device (/dev/dri/card0)
[30.688] (--) PCI:*(0:0:2:0) 8086:1916:1028:0704 rev 7, Mem @ 0xdb00/16777216, 0x9000/268435456, I/O @ 0xf000/64, BIOS @ 0x/131072
[30.688] (II) LoadModule: "glx"
[30.689] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
[30.691] (II) Module glx: vendor="X.Org Foundation"
[30.691] 	compiled for 1.17.2, module version = 1.0.0
[30.691] 	ABI class: X.Org Server Extension, version 9.0
[30.691] (==) AIGLX enabled
[30.691] (==) Matched intel as autoconfigured driver 0
[30.691] (==) Matched intel as autoconfigured driver 1
[30.691] (==) Matched modesetting as autoconfigured driver 2
[30.691] (=

Re: [Intel-gfx] Skylake graphics regression: projector failure with 4.8-rc3

2016-08-26 Thread James Bottomley
On Fri, 2016-08-26 at 09:10 -0400, James Bottomley wrote:
> We seem to have an xrandr regression with skylake now.  What's
> happening is that I can get output on to a projector, but the system 
> is losing video when I change the xrandr sessions (like going from a 
> --above b to a --same-as b).  The main screen goes blank, which is
> basically a reboot situation.  Unfortunately, I can't seem to get the
> logs out of systemd to see if there was a dump to dmesg (the system 
> was definitely responding).
> 
> I fell back to 4.6.2 which worked perfectly, so this is definitely 
> some sort of regression.  I'll be able to debug more fully when I get 
> back home from the Linux Security Summit.

Sorry, ignore the logs: they're from a previous bug; as I said I'll try
to gather logs when I get back to a more controlled environment over
the weekend.

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] Skylake graphics regression: projector failure with 4.8-rc3

2016-08-26 Thread James Bottomley
We seem to have an xrandr regression with skylake now.  What's
happening is that I can get output on to a projector, but the system is
losing video when I change the xrandr sessions (like going from a -
-above b to a --same-as b).  The main screen goes blank, which is
basically a reboot situation.  Unfortunately, I can't seem to get the
logs out of systemd to see if there was a dump to dmesg (the system was
definitely responding).

I fell back to 4.6.2 which worked perfectly, so this is definitely some
sort of regression.  I'll be able to debug more fully when I get back
home from the Linux Security Summit.

James
[13.504] 
X.Org X Server 1.17.2
Release Date: 2015-06-16
[13.504] X Protocol Version 11, Revision 0
[13.504] Build Operating System: openSUSE SUSE LINUX
[13.504] Current Operating System: Linux jarvis 4.7.0-rc1+ #1 SMP PREEMPT Mon May 30 12:29:53 PDT 2016 x86_64
[13.504] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-test root=UUID=994363fd-c12c-4c6c-92e9-85e8ac695cf9 ro resume=/dev/nvme0n1p2 splash=silent quiet showopts pcie_aspm=force
[13.504] Build Date: 07 March 2016  08:22:28AM
[13.504]  
[13.504] Current version of pixman: 0.32.6
[13.504] 	Before reporting problems, check http://wiki.x.org
	to make sure that you have the latest version.
[13.504] Markers: (--) probed, (**) from config file, (==) default setting,
	(++) from command line, (!!) notice, (II) informational,
	(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[13.504] (==) Log file: "/var/log/Xorg.0.log", Time: Mon May 30 12:40:53 2016
[13.504] (==) Using config directory: "/etc/X11/xorg.conf.d"
[13.504] (==) Using system config directory "/usr/share/X11/xorg.conf.d"
[13.504] (==) No Layout section.  Using the first Screen section.
[13.504] (==) No screen section available. Using defaults.
[13.504] (**) |-->Screen "Default Screen Section" (0)
[13.504] (**) |   |-->Monitor ""
[13.505] (==) No monitor specified for screen "Default Screen Section".
	Using a default monitor configuration.
[13.505] (==) Automatically adding devices
[13.505] (==) Automatically enabling devices
[13.505] (==) Automatically adding GPU devices
[13.509] (WW) The directory "/usr/share/fonts/misc/sgi" does not exist.
[13.509] 	Entry deleted from font path.
[13.509] (==) FontPath set to:
	/usr/share/fonts/misc:unscaled,
	/usr/share/fonts/Type1/,
	/usr/share/fonts/100dpi:unscaled,
	/usr/share/fonts/75dpi:unscaled,
	/usr/share/fonts/ghostscript/,
	/usr/share/fonts/cyrillic:unscaled,
	/usr/share/fonts/truetype/,
	built-ins
[13.509] (==) ModulePath set to "/usr/lib64/xorg/modules"
[13.509] (II) The server relies on udev to provide the list of input devices.
	If no devices become available, reconfigure udev or disable AutoAddDevices.
[13.509] (II) Loader magic: 0x80dd00
[13.509] (II) Module ABI versions:
[13.509] 	X.Org ANSI C Emulation: 0.4
[13.509] 	X.Org Video Driver: 19.0
[13.509] 	X.Org XInput driver : 21.0
[13.509] 	X.Org Server Extension : 9.0
[13.509] (++) using VT number 7

[13.509] (II) systemd-logind: logind integration requires -keeptty and -keeptty was not provided, disabling logind integration
[13.510] (II) xfree86: Adding drm device (/dev/dri/card0)
[13.925] (--) PCI:*(0:0:2:0) 8086:1916:1028:0704 rev 7, Mem @ 0xdb00/16777216, 0x9000/268435456, I/O @ 0xf000/64, BIOS @ 0x/131072
[13.925] (II) LoadModule: "glx"
[13.926] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
[13.930] (II) Module glx: vendor="X.Org Foundation"
[13.930] 	compiled for 1.17.2, module version = 1.0.0
[13.931] 	ABI class: X.Org Server Extension, version 9.0
[13.931] (==) AIGLX enabled
[13.931] (==) Matched intel as autoconfigured driver 0
[13.931] (==) Matched intel as autoconfigured driver 1
[13.931] (==) Matched modesetting as autoconfigured driver 2
[13.931] (==) Matched fbdev as autoconfigured driver 3
[13.931] (==) Matched vesa as autoconfigured driver 4
[13.931] (==) Assigned the driver to the xf86ConfigLayout
[13.931] (II) LoadModule: "intel"
[13.931] (II) Loading /usr/lib64/xorg/modules/drivers/intel_drv.so
[13.932] (II) Module intel: vendor="X.Org Foundation"
[13.932] 	compiled for 1.17.2, module version = 2.99.917
[13.932] 	Module class: X.Org Video Driver
[13.932] 	ABI class: X.Org Video Driver, version 19.0
[13.932] (II) LoadModule: "modesetting"
[13.932] (II) Loading /usr/lib64/xorg/modules/drivers/modesetting_drv.so
[13.932] (II) Module modesetting: vendor="X.Org Foundation"
[13.932] 	compiled for 1.17.2, module version = 1.17.2
[13.932] 	Module class: X.Org Video Driver
[13.932] 	ABI class: X.Org Video Driver, version 19.0
[13.932] (II) LoadModule: "fbdev"
[13.933] (II) Loading /usr/lib64/xorg/modules/drivers/fbdev_drv.so
[13.933] (II) Module fbdev: vendor="X.Org Foundation"
[

Re: [Intel-gfx] [PATCH] drm/i915: Ignore panel type from OpRegion on SKL

2016-07-13 Thread James Bottomley
On Tue, 2016-07-12 at 15:00 +0300, ville.syrj...@linux.intel.com wrote:
> From: Ville Syrjälä <ville.syrj...@linux.intel.com>
> 
> Dell XPS 13 9350 apparently doesn't like it when we use the panel
> type
> from OpRegion. The OpRegion panel type (0) tells us to use use low
> vswing for eDP, whereas the VBT panel type (2) tells us to use normal
> vswing. The problem is that low vswing results in some display
> flickers.
> Since no one seems to know how this stuff is supposed to be handled,
> let's just ignore the OpRegion panel type on SKL for now.
> 
> Reported-by: James Bottomley <james.bottom...@hansenpartnership.com>
> Cc: James Bottomley <james.bottom...@hansenpartnership.com>
> Cc: drm-intel-fi...@lists.freedesktop.org
> References: https://lists.freedesktop.org/archives/intel-gfx/2016-Jun
> e/098826.html
> Fixes: a05628195a0d ("drm/i915: Get panel_type from OpRegion panel
> details")
> Signed-off-by: Ville Syrjälä <ville.syrj...@linux.intel.com>

This one doesn't actually compile on git head:


drivers/gpu/drm/i915/intel_opregion.c: In function
‘intel_opregion_get_panel_type’:
drivers/gpu/drm/i915/intel_opregion.c:1047:17: error: ‘dev_priv’
undeclared (first use in this function)
  if (IS_SKYLAKE(dev_priv))

So it's going to be hard to backport to stable.

However, I've applied it on top of the drm branch and there have been
no flicker problems after 12h of use, so you can add a tested by from
me if you want.

James



___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-07-08 Thread James Bottomley
On Fri, 2016-07-08 at 13:19 +0300, Ville Syrjälä wrote:
> On Thu, Jul 07, 2016 at 12:19:36PM -0700, James Bottomley wrote:
> > On Thu, 2016-07-07 at 09:55 -0700, James Bottomley wrote:
> > > On Thu, 2016-07-07 at 19:14 +0300, Ville Syrjälä wrote:
> > > > On Tue, Jun 21, 2016 at 06:44:34PM +0300, Ville Syrjälä wrote:
> > > > > On Tue, Jun 21, 2016 at 09:53:15AM -0400, James Bottomley
> > > > > wrote:
> > > > > > On Mon, 2016-06-20 at 11:03 +0300, Jani Nikula wrote:
> > > > > > > Cc: Ville
> > > > > > > 
> > > > > > > On Mon, 20 Jun 2016, James Bottomley <
> > > > > > > james.bottom...@hansenpartnership.com> wrote:
> > > > > > > > OK, my candidate bad commit is this one:
> > > > > > > > 
> > > > > > > > commit a05628195a0d9f3173dd9aa76f482aef692e46ee
> > > > > > > > Author: Ville Syrjälä <ville.syrj...@linux.intel.com>
> > > > > > > > Date:   Mon Apr 11 10:23:51 2016 +0300
> > > > > > > > 
> > > > > > > > drm/i915: Get panel_type from OpRegion panel
> > > > > > > > details
> > > > > > > > 
> > > > > > > > After being more careful about waiting to identify
> > > > > > > > flicker,
> > > > > > > > this one seems to be the one the bisect finds.  I'm now
> > > > > > > > running v4.7-rc3 with this one reverted and am
> > > > > > > > currently 
> > > > > > > > seeing no flicker problems.   It is, however, early
> > > > > > > > days 
> > > > > > > > because the flicker can hide for long periods, so I 'll
> > > > > > > > wait 
> > > > > > > > until Monday evening and a few reboots before declaring
> > > > > > > > victory.
> > > > > > > 
> > > > > > > If that turns out to be the bad commit, it doesn't really
> > > > > > > surprise me, and that in itself is depressing.
> > > > > > 
> > > > > > As far as I can tell, after running for a day with this
> > > > > > reverted, 
> > > > > > this is the problem.  The flicker hasn't appeared with it 
> > > > > > reverted.  It's pretty noticeable with this commit
> > > > > > included.
> > > > > 
> > > > > Hmm. The only difference I can see is low vs. normal vswing.
> > > > > Panel 
> > > > > 0 has low, panel 2 has normal. So either the VBT or opregion
> > > > > is 
> > > > > telling utter lies, or there's some other bug in our low
> > > > > vswing
> > > > > support.
> > > > 
> > > > I did a quick once over of out DDI vswing stuff and didn't find
> > > > anything too serious. There were some buglets in the iboost
> > > > handling, 
> > > > but I'm not very hopeful that fixing those would help with your
> > > > machine. 
> > > > 
> > > > Here's a branch anyway in case you want to give it a go:
> > > > git://github.com/vsyrjala/linux.git ddi_iboost_fixes
> > > > 
> > > > Actually, I think the only patch in there that might make a 
> > > > difference is 15d887855180 ("drm/i915: Fix iboost setting for
> > > > DDI 
> > > > with 4 lanes on SKL")
> > > 
> > > Running with it now (the entire branch).  So far it looks OK, but
> > > I'll give it a couple of days to see if anything manifests before
> > > declaring victory.
> > 
> > Bad news, I'm afraid: after a couple of hours of run time, there is
> > now
> > noticeable flicker on the display, so although the iboost fixes may
> > have lessened it, it's still present.
> 
> Oh well, I suspected as much. Which BIOS version did you have 
> exactly?

As mentioned upthread, now 1.4.3 since skylake microcode was
potentially implicated in the problem:

http://mid.gmane.org/1466179729.2271.33.ca...@hansenpartnership.com

> If I'm reading the Dell website correctly there's a new one (1.4.4)
> released on Jun 30, which is after you reported the issue. Might be
> I'm reading the wrong thing though. Can you double check this?

New bios doesn't change the microcode:

[2.390172] microcode: CPU0 sig=0x406e3, pf=0x80, revision=0x8a

And the flicker is still present.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-07-07 Thread James Bottomley
On Thu, 2016-07-07 at 09:55 -0700, James Bottomley wrote:
> On Thu, 2016-07-07 at 19:14 +0300, Ville Syrjälä wrote:
> > On Tue, Jun 21, 2016 at 06:44:34PM +0300, Ville Syrjälä wrote:
> > > On Tue, Jun 21, 2016 at 09:53:15AM -0400, James Bottomley wrote:
> > > > On Mon, 2016-06-20 at 11:03 +0300, Jani Nikula wrote:
> > > > > Cc: Ville
> > > > > 
> > > > > On Mon, 20 Jun 2016, James Bottomley <
> > > > > james.bottom...@hansenpartnership.com> wrote:
> > > > > > OK, my candidate bad commit is this one:
> > > > > > 
> > > > > > commit a05628195a0d9f3173dd9aa76f482aef692e46ee
> > > > > > Author: Ville Syrjälä <ville.syrj...@linux.intel.com>
> > > > > > Date:   Mon Apr 11 10:23:51 2016 +0300
> > > > > > 
> > > > > > drm/i915: Get panel_type from OpRegion panel details
> > > > > > 
> > > > > > After being more careful about waiting to identify flicker,
> > > > > > this one seems to be the one the bisect finds.  I'm now 
> > > > > > running v4.7-rc3 with this one reverted and am currently 
> > > > > > seeing no flicker problems.   It is, however, early days 
> > > > > > because the flicker can hide for long periods, so I 'll
> > > > > > wait 
> > > > > > until Monday evening and a few reboots before declaring
> > > > > > victory.
> > > > > 
> > > > > If that turns out to be the bad commit, it doesn't really 
> > > > > surprise me, and that in itself is depressing.
> > > > 
> > > > As far as I can tell, after running for a day with this
> > > > reverted, 
> > > > this is the problem.  The flicker hasn't appeared with it 
> > > > reverted.  It's pretty noticeable with this commit included.
> > > 
> > > Hmm. The only difference I can see is low vs. normal vswing.
> > > Panel 
> > > 0 has low, panel 2 has normal. So either the VBT or opregion is 
> > > telling utter lies, or there's some other bug in our low vswing
> > > support.
> > 
> > I did a quick once over of out DDI vswing stuff and didn't find 
> > anything too serious. There were some buglets in the iboost
> > handling, 
> > but I'm not very hopeful that fixing those would help with your 
> > machine. 
> > 
> > Here's a branch anyway in case you want to give it a go:
> > git://github.com/vsyrjala/linux.git ddi_iboost_fixes
> > 
> > Actually, I think the only patch in there that might make a 
> > difference is 15d887855180 ("drm/i915: Fix iboost setting for DDI 
> > with 4 lanes on SKL")
> 
> Running with it now (the entire branch).  So far it looks OK, but 
> I'll give it a couple of days to see if anything manifests before
> declaring victory.

Bad news, I'm afraid: after a couple of hours of run time, there is now
noticeable flicker on the display, so although the iboost fixes may
have lessened it, it's still present.

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-07-07 Thread James Bottomley
On Thu, 2016-07-07 at 19:14 +0300, Ville Syrjälä wrote:
> On Tue, Jun 21, 2016 at 06:44:34PM +0300, Ville Syrjälä wrote:
> > On Tue, Jun 21, 2016 at 09:53:15AM -0400, James Bottomley wrote:
> > > On Mon, 2016-06-20 at 11:03 +0300, Jani Nikula wrote:
> > > > Cc: Ville
> > > > 
> > > > On Mon, 20 Jun 2016, James Bottomley <
> > > > james.bottom...@hansenpartnership.com> wrote:
> > > > > OK, my candidate bad commit is this one:
> > > > > 
> > > > > commit a05628195a0d9f3173dd9aa76f482aef692e46ee
> > > > > Author: Ville Syrjälä <ville.syrj...@linux.intel.com>
> > > > > Date:   Mon Apr 11 10:23:51 2016 +0300
> > > > > 
> > > > > drm/i915: Get panel_type from OpRegion panel details
> > > > > 
> > > > > After being more careful about waiting to identify flicker, 
> > > > > this one seems to be the one the bisect finds.  I'm now 
> > > > > running v4.7-rc3 with this one reverted and am currently 
> > > > > seeing no flicker problems.   It is, however, early days 
> > > > > because the flicker can hide for long periods, so I 'll wait 
> > > > > until Monday evening and a few reboots before declaring
> > > > > victory.
> > > > 
> > > > If that turns out to be the bad commit, it doesn't really 
> > > > surprise me, and that in itself is depressing.
> > > 
> > > As far as I can tell, after running for a day with this reverted, 
> > > this is the problem.  The flicker hasn't appeared with it 
> > > reverted.  It's pretty noticeable with this commit included.
> > 
> > Hmm. The only difference I can see is low vs. normal vswing. Panel 
> > 0 has low, panel 2 has normal. So either the VBT or opregion is 
> > telling utter lies, or there's some other bug in our low vswing
> > support.
> 
> I did a quick once over of out DDI vswing stuff and didn't find 
> anything too serious. There were some buglets in the iboost handling, 
> but I'm not very hopeful that fixing those would help with your 
> machine. 
> 
> Here's a branch anyway in case you want to give it a go:
> git://github.com/vsyrjala/linux.git ddi_iboost_fixes
> 
> Actually, I think the only patch in there that might make a 
> difference is 15d887855180 ("drm/i915: Fix iboost setting for DDI 
> with 4 lanes on SKL")

Running with it now (the entire branch).  So far it looks OK, but I'll
give it a couple of days to see if anything manifests before declaring
victory.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-23 Thread James Bottomley
On Tue, 2016-06-21 at 17:00 -0400, James Bottomley wrote:
> On Tue, 2016-06-21 at 18:44 +0300, Ville Syrjälä wrote:
> > On Tue, Jun 21, 2016 at 09:53:15AM -0400, James Bottomley wrote:
> > > On Mon, 2016-06-20 at 11:03 +0300, Jani Nikula wrote:
> > > > Cc: Ville
> > > > 
> > > > On Mon, 20 Jun 2016, James Bottomley <
> > > > james.bottom...@hansenpartnership.com> wrote:
> > > > > OK, my candidate bad commit is this one:
> > > > > 
> > > > > commit a05628195a0d9f3173dd9aa76f482aef692e46ee
> > > > > Author: Ville Syrjälä <ville.syrj...@linux.intel.com>
> > > > > Date:   Mon Apr 11 10:23:51 2016 +0300
> > > > > 
> > > > > drm/i915: Get panel_type from OpRegion panel details
> > > > > 
> > > > > After being more careful about waiting to identify flicker, 
> > > > > this one seems to be the one the bisect finds.  I'm now 
> > > > > running v4.7-rc3 with this one reverted and am currently 
> > > > > seeing no flicker problems.   It is, however, early days 
> > > > > because the flicker can hide for long periods, so I 'll wait 
> > > > > until Monday evening and a few reboots before declaring
> > > > > victory.
> > > > 
> > > > If that turns out to be the bad commit, it doesn't really 
> > > > surprise me, and that in itself is depressing.
> > > 
> > > As far as I can tell, after running for a day with this reverted,
> > > this is the problem.  The flicker hasn't appeared with it 
> > > reverted.  It's pretty noticeable with this commit included.
> > 
> > Hmm. The only difference I can see is low vs. normal vswing. Panel 
> > 0 has low, panel 2 has normal. So either the VBT or opregion is
> > telling utter lies, or there's some other bug in our low vswing
> > support.
> > 
> > To confirm it's really a vswing issue, you should be able to run 
> > with i915.edp_vswing=2 without flickers on the broken kernel.
> 
> Preliminary boot indicates no flicker with the bad commit included 
> and this option, but I'll have to run for quite a bit longer to 
> verify, since it can sometimes be elusive.

Two days of runtime seems to confirm this is the problem (still no
flicker issues).

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-21 Thread James Bottomley
On Tue, 2016-06-21 at 18:44 +0300, Ville Syrjälä wrote:
> On Tue, Jun 21, 2016 at 09:53:15AM -0400, James Bottomley wrote:
> > On Mon, 2016-06-20 at 11:03 +0300, Jani Nikula wrote:
> > > Cc: Ville
> > > 
> > > On Mon, 20 Jun 2016, James Bottomley <
> > > james.bottom...@hansenpartnership.com> wrote:
> > > > OK, my candidate bad commit is this one:
> > > > 
> > > > commit a05628195a0d9f3173dd9aa76f482aef692e46ee
> > > > Author: Ville Syrjälä <ville.syrj...@linux.intel.com>
> > > > Date:   Mon Apr 11 10:23:51 2016 +0300
> > > > 
> > > > drm/i915: Get panel_type from OpRegion panel details
> > > > 
> > > > After being more careful about waiting to identify flicker, 
> > > > this one seems to be the one the bisect finds.  I'm now running 
> > > > v4.7-rc3 with this one reverted and am currently seeing no 
> > > > flicker problems.   It is, however, early days because the 
> > > > flicker can hide for long periods, so I 'll wait until Monday 
> > > > evening and a few reboots before declaring victory.
> > > 
> > > If that turns out to be the bad commit, it doesn't really 
> > > surprise me, and that in itself is depressing.
> > 
> > As far as I can tell, after running for a day with this reverted, 
> > this is the problem.  The flicker hasn't appeared with it reverted.
> >  It's pretty noticeable with this commit included.
> 
> Hmm. The only difference I can see is low vs. normal vswing. Panel 0 
> has low, panel 2 has normal. So either the VBT or opregion is telling
> utter lies, or there's some other bug in our low vswing support.
> 
> To confirm it's really a vswing issue, you should be able to run with
> i915.edp_vswing=2 without flickers on the broken kernel.

Preliminary boot indicates no flicker with the bad commit included and
this option, but I'll have to run for quite a bit longer to verify,
since it can sometimes be elusive.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-19 Thread James Bottomley
On Fri, 2016-06-17 at 16:06 -0700, James Bottomley wrote:
> On Fri, 2016-06-17 at 16:34 +0300, Jani Nikula wrote:
> > On Fri, 17 Jun 2016, Daniel Vetter <dan...@ffwll.ch> wrote:
> > > On Thu, Jun 16, 2016 at 03:42:12PM -0700, James Bottomley wrote:
> > > > On Thu, 2016-06-16 at 14:29 -0700, James Bottomley wrote:
> > > > > On Thu, 2016-06-16 at 23:24 +0200, Daniel Vetter wrote:
> > > > > > I guess we'll need the bisect on this one to make progress.
> > > > > 
> > > > > Sigh, I was afraid that might be the next step.
> > > > 
> > > > OK, I have a curious data point.  I assumed the problem would
> > > > be
> > > > somewhere in the drm update, so I started bisecting that at the
> > > > top. 
> > > >  However, the top most commit:
> > > > 
> > > > commit 1d6da87a3241deb13d073c4125d19ed0e5a0c62c
> > > > Merge: 1f40c49 a39ed68
> > > > Author: Linus Torvalds <torva...@linux-foundation.org>
> > > > Date:   Mon May 23 11:48:48 2016 -0700
> > > > 
> > > > Merge branch 'drm-next' of
> > > > git://people.freedesktop.org/~airlied/linux
> > > > 
> > > > Isn't actually bad.  There's no flicker here, so whatever
> > > > caused
> > > > the
> > > > problem came from some update after this.
> > > 
> > > There was a fixes pull after this. Might be worth it to restrict
> > > to
> > > just
> > > the i915 changes, which are just
> > > 5b4fd5bb1230cd037..157d2c7fad0863222
> > > 
> > > Looking at those nothing seems to stick out which might explain
> > > what's
> > > happening for you.
> 
> OK, so just on the firmware, the system seems less flickery with the
> new 1.4.3 UEFI, so I'm starting to think it is a Skylake errata 
> issue.  The flicker isn't gone for good, but seems to be reboot 
> dependent (it's there in some boots, but gone on a reboot).
> 
> > This should be easy enough to try before bisecting:
> > http://patchwork.freedesktop.org/patch/msgid/1466162081-12042-1-git
> > -s
> > end-email-mika.kah...@intel.com
> 
> Applying this didn't seem to make a difference: still there on some 
> and gone on other reboots.

OK, my candidate bad commit is this one:

commit a05628195a0d9f3173dd9aa76f482aef692e46ee
Author: Ville Syrjälä <ville.syrj...@linux.intel.com>
Date:   Mon Apr 11 10:23:51 2016 +0300

drm/i915: Get panel_type from OpRegion panel details

After being more careful about waiting to identify flicker, this one
seems to be the one the bisect finds.  I'm now running v4.7-rc3 with
this one reverted and am currently seeing no flicker problems.  It is,
however, early days because the flicker can hide for long periods, so I
'll wait until Monday evening and a few reboots before declaring
victory.

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-17 Thread James Bottomley
On Fri, 2016-06-17 at 16:34 +0300, Jani Nikula wrote:
> On Fri, 17 Jun 2016, Daniel Vetter <dan...@ffwll.ch> wrote:
> > On Thu, Jun 16, 2016 at 03:42:12PM -0700, James Bottomley wrote:
> > > On Thu, 2016-06-16 at 14:29 -0700, James Bottomley wrote:
> > > > On Thu, 2016-06-16 at 23:24 +0200, Daniel Vetter wrote:
> > > > > I guess we'll need the bisect on this one to make progress.
> > > > 
> > > > Sigh, I was afraid that might be the next step.
> > > 
> > > OK, I have a curious data point.  I assumed the problem would be
> > > somewhere in the drm update, so I started bisecting that at the
> > > top. 
> > >  However, the top most commit:
> > > 
> > > commit 1d6da87a3241deb13d073c4125d19ed0e5a0c62c
> > > Merge: 1f40c49 a39ed68
> > > Author: Linus Torvalds <torva...@linux-foundation.org>
> > > Date:   Mon May 23 11:48:48 2016 -0700
> > > 
> > > Merge branch 'drm-next' of
> > > git://people.freedesktop.org/~airlied/linux
> > > 
> > > Isn't actually bad.  There's no flicker here, so whatever caused
> > > the
> > > problem came from some update after this.
> > 
> > There was a fixes pull after this. Might be worth it to restrict to
> > just
> > the i915 changes, which are just
> > 5b4fd5bb1230cd037..157d2c7fad0863222
> > 
> > Looking at those nothing seems to stick out which might explain
> > what's
> > happening for you.

OK, so just on the firmware, the system seems less flickery with the
new 1.4.3 UEFI, so I'm starting to think it is a Skylake errata issue. 
 The flicker isn't gone for good, but seems to be reboot dependent
(it's there in some boots, but gone on a reboot).

> This should be easy enough to try before bisecting:
> http://patchwork.freedesktop.org/patch/msgid/1466162081-12042-1-git-s
> end-email-mika.kah...@intel.com

Applying this didn't seem to make a difference: still there on some and
gone on other reboots.

James


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-17 Thread James Bottomley
On Fri, 2016-06-17 at 16:34 +0300, Jani Nikula wrote:
> On Fri, 17 Jun 2016, Daniel Vetter <dan...@ffwll.ch> wrote:
> > On Thu, Jun 16, 2016 at 03:42:12PM -0700, James Bottomley wrote:
> > > On Thu, 2016-06-16 at 14:29 -0700, James Bottomley wrote:
> > > > On Thu, 2016-06-16 at 23:24 +0200, Daniel Vetter wrote:
> > > > > I guess we'll need the bisect on this one to make progress.
> > > > 
> > > > Sigh, I was afraid that might be the next step.
> > > 
> > > OK, I have a curious data point.  I assumed the problem would be
> > > somewhere in the drm update, so I started bisecting that at the
> > > top. 
> > >  However, the top most commit:
> > > 
> > > commit 1d6da87a3241deb13d073c4125d19ed0e5a0c62c
> > > Merge: 1f40c49 a39ed68
> > > Author: Linus Torvalds <torva...@linux-foundation.org>
> > > Date:   Mon May 23 11:48:48 2016 -0700
> > > 
> > > Merge branch 'drm-next' of
> > > git://people.freedesktop.org/~airlied/linux
> > > 
> > > Isn't actually bad.  There's no flicker here, so whatever caused
> > > the
> > > problem came from some update after this.
> > 
> > There was a fixes pull after this. Might be worth it to restrict to
> > just
> > the i915 changes, which are just
> > 5b4fd5bb1230cd037..157d2c7fad0863222
> > 
> > Looking at those nothing seems to stick out which might explain
> > what's
> > happening for you.
> 
> This should be easy enough to try before bisecting:
> http://patchwork.freedesktop.org/patch/msgid/1466162081-12042-1-git-s
> end-email-mika.kah...@intel.com

I'll try that.  You should know this platform has a firmware bug in the
digital signals across the display port ... some problem with the mux
in the thunderbolt.  Unfortunately the microcode update for this (the
thunderbolt firmware) can't be applied from Linux or even the usual
FreeDOS, so I'm a bit stuck.

It's also been pointed out to me off list that there's a lot of Skylake
errata that have display flicker due to non-graphics issues:

http://www.intel.com/content/www/us/en/processors/core/desktop-6th-gen-core-family-spec-update.html

My system is a Dell XPS 9350 and I have updated it with the just released 
firmware:

http://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=VNT5J

Which causes my cpu microcode version to go from

> [2.314472] microcode: CPU0 sig=0x406e3, pf=0x80, revision=0x82

to

> [2.264669] microcode: CPU0 sig=0x406e3, pf=0x80, revision=0x8a

But thanks to the completely opaque microcode update process via the
OEMs it's not clear whether this is sufficient to fix the errata that
are listed as causing flicker (the reported revision on current skylake
lenovos seems to be 0x94)

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-16 Thread James Bottomley
On Thu, 2016-06-16 at 14:29 -0700, James Bottomley wrote:
> On Thu, 2016-06-16 at 23:24 +0200, Daniel Vetter wrote:
> > I guess we'll need the bisect on this one to make progress.
> 
> Sigh, I was afraid that might be the next step.

OK, I have a curious data point.  I assumed the problem would be
somewhere in the drm update, so I started bisecting that at the top. 
 However, the top most commit:

commit 1d6da87a3241deb13d073c4125d19ed0e5a0c62c
Merge: 1f40c49 a39ed68
Author: Linus Torvalds <torva...@linux-foundation.org>
Date:   Mon May 23 11:48:48 2016 -0700

Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux

Isn't actually bad.  There's no flicker here, so whatever caused the
problem came from some update after this.

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-16 Thread James Bottomley
On Thu, 2016-06-16 at 23:24 +0200, Daniel Vetter wrote:
> On Thu, Jun 16, 2016 at 11:15 PM, James Bottomley
> <james.bottom...@hansenpartnership.com> wrote:
> > On Mon, 2016-06-13 at 13:14 +0300, Jani Nikula wrote:
> > > On Tue, 31 May 2016, James Bottomley <
> > > james.bottom...@hansenpartnership.com> wrote:
> > > > On Tue, 2016-05-31 at 10:51 +0300, Jani Nikula wrote:
> > > > > On Mon, 30 May 2016, James Bottomley <
> > > > > james.bottom...@hansenpartnership.com> wrote:
> > > > > > I've tested a pristine 4.6.0 system, so it's definitely
> > > > > > something
> > > > > > that
> > > > > > went in during the merge window.  The flicker isn't
> > > > > > continuous,
> > > > > > it's
> > > > > > periodic, with an interval of something like 2-5 seconds. 
> > > > > >  It
> > > > > > looks
> > > > > > like an old analogue TV going out of sync and then
> > > > > > resyncing.
> > > > > >  I've
> > > > > > attached the dmesg and X.org log below just in case they
> > > > > > can
> > > > > > help.
> > > > > >  I
> > > > > > might be able to bisect this next week, but, unfortunately,
> > > > > > this is
> > > > > > my
> > > > > > current laptop and I'm travelling this week.
> > > > > 
> > > > > Please try i915.enable_psr=0 module parameter.
> > > > 
> > > > Makes no discernable difference.  Current parameter settings
> > > > are:
> > > 
> > > Sorry for the silence. Would you mind trying out drm-intel
> > > -nightly
> > > branch of [1]?
> > > 
> > > BR,
> > > Jani.
> > > 
> > > [1] http://cgit.freedesktop.org/drm-intel
> > 
> > No, flicker is still there (and in fact seems worse) with the tree
> > with
> > this commit at the top:
> > 
> > commit 3eb202ecc3668583f9ff4338211dbab47d755d1c
> > Author: Daniel Vetter <daniel.vet...@ffwll.ch>
> > Date:   Thu Jun 16 14:38:54 2016 +0200
> > 
> > drm-intel-nightly: 2016y-06m-16d-12h-38m-37s UTC integration
> > manifest
> 
> Strange indeed, I hoped the improved watermark code in -nightly would
> help. I assume nothing in dmesg about underruns or something similar?

Not that I can tell.  I've attached the full dmesg just in case.

> I guess we'll need the bisect on this one to make progress.

Sigh, I was afraid that might be the next step.

James

[0.00] Linux version 4.7.0-rc3+ (jejb@jarvis) (gcc version 4.8.5 (SUSE 
Linux) ) #4 SMP PREEMPT Thu Jun 16 13:57:07 PDT 2016
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-test 
root=UUID=994363fd-c12c-4c6c-92e9-85e8ac695cf9 ro resume=/dev/nvme0n1p2 
splash=silent quiet showopts pcie_aspm=force
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: xstate_offset[3]:  960, xstate_sizes[3]:   64
[0.00] x86/fpu: xstate_offset[4]: 1024, xstate_sizes[4]:   64
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point 
registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[0.00] x86/fpu: Enabled xstate features 0x1f, context size is 1088 
bytes, using 'standard' format.
[0.00] x86/fpu: Using 'eager' FPU context switches.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0fff] reserved
[0.00] BIOS-e820: [mem 0x1000-0x00057fff] usable
[0.00] BIOS-e820: [mem 0x00058000-0x00058fff] reserved
[0.00] BIOS-e820: [mem 0x00059000-0x0009cfff] usable
[0.00] BIOS-e820: [mem 0x0009d000-0x0009] reserved
[0.00] BIOS-e820: [mem 0x0010-0x69537fff] usable
[0.00] BIOS-e820: [mem 0x69538000-0x69538fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x69539000-0x69562fff] reserved
[0.00] BIOS-e820: [mem 0x69563000-0x695bdfff] usable
[0.00] BIOS-e820: [mem 0x695be000-0x69daefff] reserved
[0.00] BIOS-e820: [mem 0x69daf000-0x7907dfff] usable
[0.00] BIOS-e820: [mem 0x7907e000-0x793f8fff] reserved
[0.00] BIOS-e820: [mem 0x7

Re: [Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-06-16 Thread James Bottomley
On Mon, 2016-06-13 at 13:14 +0300, Jani Nikula wrote:
> On Tue, 31 May 2016, James Bottomley <
> james.bottom...@hansenpartnership.com> wrote:
> > On Tue, 2016-05-31 at 10:51 +0300, Jani Nikula wrote:
> > > On Mon, 30 May 2016, James Bottomley <
> > > james.bottom...@hansenpartnership.com> wrote:
> > > > I've tested a pristine 4.6.0 system, so it's definitely
> > > > something
> > > > that
> > > > went in during the merge window.  The flicker isn't continuous,
> > > > it's
> > > > periodic, with an interval of something like 2-5 seconds.  It
> > > > looks
> > > > like an old analogue TV going out of sync and then resyncing. 
> > > >  I've
> > > > attached the dmesg and X.org log below just in case they can
> > > > help. 
> > > >  I
> > > > might be able to bisect this next week, but, unfortunately,
> > > > this is
> > > > my
> > > > current laptop and I'm travelling this week.
> > > 
> > > Please try i915.enable_psr=0 module parameter.
> > 
> > Makes no discernable difference.  Current parameter settings are:
> 
> Sorry for the silence. Would you mind trying out drm-intel-nightly
> branch of [1]?
> 
> BR,
> Jani.
> 
> [1] http://cgit.freedesktop.org/drm-intel

No, flicker is still there (and in fact seems worse) with the tree with
this commit at the top:

commit 3eb202ecc3668583f9ff4338211dbab47d755d1c
Author: Daniel Vetter <daniel.vet...@ffwll.ch>
Date:   Thu Jun 16 14:38:54 2016 +0200

drm-intel-nightly: 2016y-06m-16d-12h-38m-37s UTC integration
manifest

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-05-31 Thread James Bottomley
On Tue, 2016-05-31 at 10:51 +0300, Jani Nikula wrote:
> On Mon, 30 May 2016, James Bottomley <
> james.bottom...@hansenpartnership.com> wrote:
> > I've tested a pristine 4.6.0 system, so it's definitely something
> > that
> > went in during the merge window.  The flicker isn't continuous,
> > it's
> > periodic, with an interval of something like 2-5 seconds.  It looks
> > like an old analogue TV going out of sync and then resyncing.  I've
> > attached the dmesg and X.org log below just in case they can help. 
> >  I
> > might be able to bisect this next week, but, unfortunately, this is
> > my
> > current laptop and I'm travelling this week.
> 
> Please try i915.enable_psr=0 module parameter.

Makes no discernable difference.  Current parameter settings are:

disable_display: N
disable_power_well: 1
edp_vswing: 0
enable_cmd_parser: 1
enable_dc: -1
enable_dp_mst: Y
enable_execlists: 1
enable_fbc: -1
enable_guc_submission: N
enable_hangcheck: Y
enable_ips: 1
enable_ppgtt: 3
enable_psr: 0
enable_rc6: 1
fastboot: N
guc_log_level: -1
inject_load_failure: 0
invert_brightness: 0
load_detect_test: N
lvds_channel_mode: 0
lvds_use_ssc: -1
mmio_debug: 0
modeset: -1
nuclear_pageflip: N
panel_ignore_lid: 1
prefault_disable: N
preliminary_hw_support: 1
reset: Y
semaphores: -1
use_mmio_flip: 0
vbt_sdvo_panel_type: -1
verbose_state_checks: Y

James

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] Bad flicker on skylake HQD due to code in the 4.7 merge window

2016-05-30 Thread James Bottomley
I've tested a pristine 4.6.0 system, so it's definitely something that
went in during the merge window.  The flicker isn't continuous, it's
periodic, with an interval of something like 2-5 seconds.  It looks
like an old analogue TV going out of sync and then resyncing.  I've
attached the dmesg and X.org log below just in case they can help.  I
might be able to bisect this next week, but, unfortunately, this is my
current laptop and I'm travelling this week.

James
[13.504] 
X.Org X Server 1.17.2
Release Date: 2015-06-16
[13.504] X Protocol Version 11, Revision 0
[13.504] Build Operating System: openSUSE SUSE LINUX
[13.504] Current Operating System: Linux jarvis 4.7.0-rc1+ #1 SMP PREEMPT Mon May 30 12:29:53 PDT 2016 x86_64
[13.504] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-test root=UUID=994363fd-c12c-4c6c-92e9-85e8ac695cf9 ro resume=/dev/nvme0n1p2 splash=silent quiet showopts pcie_aspm=force
[13.504] Build Date: 07 March 2016  08:22:28AM
[13.504]  
[13.504] Current version of pixman: 0.32.6
[13.504] 	Before reporting problems, check http://wiki.x.org
	to make sure that you have the latest version.
[13.504] Markers: (--) probed, (**) from config file, (==) default setting,
	(++) from command line, (!!) notice, (II) informational,
	(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[13.504] (==) Log file: "/var/log/Xorg.0.log", Time: Mon May 30 12:40:53 2016
[13.504] (==) Using config directory: "/etc/X11/xorg.conf.d"
[13.504] (==) Using system config directory "/usr/share/X11/xorg.conf.d"
[13.504] (==) No Layout section.  Using the first Screen section.
[13.504] (==) No screen section available. Using defaults.
[13.504] (**) |-->Screen "Default Screen Section" (0)
[13.504] (**) |   |-->Monitor ""
[13.505] (==) No monitor specified for screen "Default Screen Section".
	Using a default monitor configuration.
[13.505] (==) Automatically adding devices
[13.505] (==) Automatically enabling devices
[13.505] (==) Automatically adding GPU devices
[13.509] (WW) The directory "/usr/share/fonts/misc/sgi" does not exist.
[13.509] 	Entry deleted from font path.
[13.509] (==) FontPath set to:
	/usr/share/fonts/misc:unscaled,
	/usr/share/fonts/Type1/,
	/usr/share/fonts/100dpi:unscaled,
	/usr/share/fonts/75dpi:unscaled,
	/usr/share/fonts/ghostscript/,
	/usr/share/fonts/cyrillic:unscaled,
	/usr/share/fonts/truetype/,
	built-ins
[13.509] (==) ModulePath set to "/usr/lib64/xorg/modules"
[13.509] (II) The server relies on udev to provide the list of input devices.
	If no devices become available, reconfigure udev or disable AutoAddDevices.
[13.509] (II) Loader magic: 0x80dd00
[13.509] (II) Module ABI versions:
[13.509] 	X.Org ANSI C Emulation: 0.4
[13.509] 	X.Org Video Driver: 19.0
[13.509] 	X.Org XInput driver : 21.0
[13.509] 	X.Org Server Extension : 9.0
[13.509] (++) using VT number 7

[13.509] (II) systemd-logind: logind integration requires -keeptty and -keeptty was not provided, disabling logind integration
[13.510] (II) xfree86: Adding drm device (/dev/dri/card0)
[13.925] (--) PCI:*(0:0:2:0) 8086:1916:1028:0704 rev 7, Mem @ 0xdb00/16777216, 0x9000/268435456, I/O @ 0xf000/64, BIOS @ 0x/131072
[13.925] (II) LoadModule: "glx"
[13.926] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
[13.930] (II) Module glx: vendor="X.Org Foundation"
[13.930] 	compiled for 1.17.2, module version = 1.0.0
[13.931] 	ABI class: X.Org Server Extension, version 9.0
[13.931] (==) AIGLX enabled
[13.931] (==) Matched intel as autoconfigured driver 0
[13.931] (==) Matched intel as autoconfigured driver 1
[13.931] (==) Matched modesetting as autoconfigured driver 2
[13.931] (==) Matched fbdev as autoconfigured driver 3
[13.931] (==) Matched vesa as autoconfigured driver 4
[13.931] (==) Assigned the driver to the xf86ConfigLayout
[13.931] (II) LoadModule: "intel"
[13.931] (II) Loading /usr/lib64/xorg/modules/drivers/intel_drv.so
[13.932] (II) Module intel: vendor="X.Org Foundation"
[13.932] 	compiled for 1.17.2, module version = 2.99.917
[13.932] 	Module class: X.Org Video Driver
[13.932] 	ABI class: X.Org Video Driver, version 19.0
[13.932] (II) LoadModule: "modesetting"
[13.932] (II) Loading /usr/lib64/xorg/modules/drivers/modesetting_drv.so
[13.932] (II) Module modesetting: vendor="X.Org Foundation"
[13.932] 	compiled for 1.17.2, module version = 1.17.2
[13.932] 	Module class: X.Org Video Driver
[13.932] 	ABI class: X.Org Video Driver, version 19.0
[13.932] (II) LoadModule: "fbdev"
[13.933] (II) Loading /usr/lib64/xorg/modules/drivers/fbdev_drv.so
[13.933] (II) Module fbdev: vendor="X.Org Foundation"
[13.933] 	compiled for 1.17.2, module version = 0.4.4
[13.933] 	Module class: X.Org Video Driver
[13.933] 	ABI class: X.Org Video Driver, version 

[Intel-gfx] i915 Sky Lake graphics problem: screen blank after grub boot

2015-12-18 Thread James Bottomley
With kernel 4.3.x (tested x=2,3) the screen goes blank as soon as the
system boots (both for VGA console and graphics).  The only way to get
it to turn on is to suspend and resume the system.

The system is a Dell XPS 13 (latest rev) running OpenSuse Leap 42.1 and
this is the lspci output:

00:00.0 Host bridge: Intel Corporation Sky Lake Host Bridge/DRAM Registers (rev 
08)
00:02.0 VGA compatible controller: Intel Corporation Sky Lake Integrated 
Graphics (rev 07)
00:04.0 Signal processing controller: Intel Corporation Device 1903 (rev 08)
00:14.0 USB controller: Intel Corporation Device 9d2f (rev 21)
00:14.2 Signal processing controller: Intel Corporation Device 9d31 (rev 21)
00:15.0 Signal processing controller: Intel Corporation Device 9d60 (rev 21)
00:15.1 Signal processing controller: Intel Corporation Device 9d61 (rev 21)
00:16.0 Communication controller: Intel Corporation Device 9d3a (rev 21)
00:1c.0 PCI bridge: Intel Corporation Device 9d10 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Device 9d14 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Device 9d15 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Device 9d18 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Device 9d48 (rev 21)
00:1f.2 Memory controller: Intel Corporation Device 9d21 (rev 21)
00:1f.3 Audio device: Intel Corporation Device 9d70 (rev 21)
00:1f.4 SMBus: Intel Corporation Device 9d23 (rev 21)
3a:00.0 Network controller: Intel Corporation Wireless 7265 (rev 59)
3b:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 525a 
(rev 01)
3c:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a802 
(rev 01)

It looks like it might be something to do with this warning in the boot dmesgs 
(full set below):

[2.951393] WARNING: CPU: 3 PID: 6 at ../drivers/gpu/drm/i915/intel_dp.c:854 
intel_dp_aux_ch+0x105/0x660 [i915]()
[2.951394] dp_aux_ch not started status 0xad40001f

James

---

[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.3.2-1.g2aebb11-default (geeko@buildhost) (gcc 
version 5.2.1 20151008 [gcc-5-branch revision 228597] (SUSE Linux) ) #1 SMP 
PREEMPT Fri Dec 11 08:22:04 UTC 2015 (2aebb11)
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-4.3.2-1.g2aebb11-default 
root=UUID=994363fd-c12c-4c6c-92e9-85e8ac695cf9 ro resume=/dev/nvme0n1p2 
splash=silent quiet showopts
[0.00] x86/fpu: xstate_offset[2]: 0240, xstate_sizes[2]: 0100
[0.00] x86/fpu: xstate_offset[3]: 03c0, xstate_sizes[3]: 0040
[0.00] x86/fpu: xstate_offset[4]: 0400, xstate_sizes[4]: 0040
[0.00] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point 
registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x08: 'MPX bounds registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x10: 'MPX CSR'
[0.00] x86/fpu: Enabled xstate features 0x1f, context size is 0x440 
bytes, using 'standard' format.
[0.00] x86/fpu: Using 'eager' FPU context switches.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x00057fff] usable
[0.00] BIOS-e820: [mem 0x00058000-0x00058fff] reserved
[0.00] BIOS-e820: [mem 0x00059000-0x0009cfff] usable
[0.00] BIOS-e820: [mem 0x0009d000-0x0009] reserved
[0.00] BIOS-e820: [mem 0x0010-0x27265fff] usable
[0.00] BIOS-e820: [mem 0x27266000-0x27277fff] reserved
[0.00] BIOS-e820: [mem 0x27278000-0x27318fff] usable
[0.00] BIOS-e820: [mem 0x27319000-0x27319fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x2731a000-0x27343fff] reserved
[0.00] BIOS-e820: [mem 0x27344000-0x2739cfff] usable
[0.00] BIOS-e820: [mem 0x2739d000-0x27b8dfff] reserved
[0.00] BIOS-e820: [mem 0x27b8e000-0x36e5cfff] usable
[0.00] BIOS-e820: [mem 0x36e5d000-0x371d6fff] reserved
[0.00] BIOS-e820: [mem 0x371d7000-0x37226fff] ACPI data
[0.00] BIOS-e820: [mem 0x37227000-0x379defff] ACPI NVS
[0.00] BIOS-e820: [mem 0x379df000-0x37ffefff] reserved
[0.00] BIOS-e820: [mem 0x37fff000-0x37ff] usable
[0.00] BIOS-e820: [mem 0x3800-0x380f] reserved
[0.00] BIOS-e820: [mem 0xe000-0xefff] reserved
[0.00] BIOS-e820: [mem 0xfe00-0xfe010fff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00]