Re: [Intel-gfx] NVIDIA GPU fallen off the bus after exiting s2idle

2021-05-21 Thread Saarinen, Jani
Hi, 

> -----Original Message-----
> From: Intel-gfx On Behalf Of Chris Chiu
> Sent: Friday, 21 May 2021 7:02
> To: Rafael J. Wysocki
> Cc: Brown, Len; Karol Herbst; Linux PM; Linux PCI; Westerberg, Mika;
> Rafael J. Wysocki; dri-devel; Bjorn Helgaas;
> intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] NVIDIA GPU fallen off the bus after exiting s2idle
> 
> On Thu, May 6, 2021 at 5:46 PM Rafael J. Wysocki  wrote:
> >
> > On Tue, May 4, 2021 at 10:08 AM Chris Chiu  wrote:
> > >
> > > Hi,
> > > We have some Intel laptops (11th generation CPU) with NVIDIA GPU
> > > suffering the same GPU falling off the bus problem while exiting
> > > s2idle with external display connected. These laptops connect the
> > > external display via the HDMI/DisplayPort on a USB Type-C interfaced
> > > dock. If we enter and exit s2idle with the dock connected, the
> > > NVIDIA GPU (confirmed on 10de:24b6 and 10de:25b8) and the PCIe port
> > > can come back to D0 w/o problem. If we enter the s2idle, disconnect
> > > the dock, then exit the s2idle, both external display and the panel
> > > will remain with no output. The dmesg as follows shows the "nvidia
> > > :01:00.0: can't change power state from D3cold to D0 (config space
> > > inaccessible)" due to the following ACPI error
> > > [ 154.446781]
> > > [ 154.446783]
> > > [ 154.446783] Initialized Local Variables for Method [IPCS]:
> > > [ 154.446784] Local0: 9863e365  Integer 09C5
> > > [ 154.446790]
> > > [ 154.446791] Initialized Arguments for Method [IPCS]: (7 arguments
> > > defined for method invocation)
> > > [ 154.446792] Arg0: 25568fbd  Integer 00AC
> > > [ 154.446795] Arg1: 9ef30e76  Integer 
> > > [ 154.446798] Arg2: fdf820f0  Integer 0010
> > > [ 154.446801] Arg3: 9fc2a088  Integer 0001
> > > [ 154.446804] Arg4: 3a3418f7  Integer 0001
> > > [ 154.446807] Arg5: 20c4b87c  Integer 
> > > [ 154.446810] Arg6: 8b965a8a  Integer 
> > > [ 154.446813]
> > > [ 154.446815] ACPI Error: Aborting method \IPCS due to previous error
> > > (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > > [ 154.446824] ACPI Error: Aborting method \MCUI due to previous error
> > > (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > > [ 154.446829] ACPI Error: Aborting method \SPCX due to previous error
> > > (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > > [ 154.446835] ACPI Error: Aborting method \_SB.PC00.PGSC due to
> > > previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > > [ 154.446841] ACPI Error: Aborting method \_SB.PC00.PGON due to
> > > previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > > [ 154.446846] ACPI Error: Aborting method \_SB.PC00.PEG1.NPON due to
> > > previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > > [ 154.446852] ACPI Error: Aborting method \_SB.PC00.PEG1.PG01._ON due
> > > to previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > > [ 154.446860] acpi device:02: Failed to change power state to D0
> > > [ 154.690760] video LNXVIDEO:00: Cannot transition to power state D0
> > > for parent in (unknown)
> >
> > If I were to guess, I would say that AML tries to access memory that
> > is not accessible while suspended, probably PCI config space.
> >
> > > IPCS is the last method called from \_SB.PC00.PEG1.PG01._ON. We
> > > expect it to prepare everything before the NVIDIA GPU is brought
> > > back, but it gets stuck in the loop described below.
> > > Please refer to
> > > https://gist.github.com/mschiu77/fa4f5a97297749d0d66fe60c1d421c44
> > > for the full DSDT.dsl.
> >
> > The DSDT alone may not be sufficient.
> >
> > Can you please create a bug entry at bugzilla.kernel.org for this
> > issue and attach the full output of acpidump from one of the affected
> > machines to it?  And please let me know the number of the bug.
> >
> > Also please attach the output of dmesg including a suspend-resume
> > cycle including dock disconnection while suspended and the ACPI
> > messages quoted below.
> >
> > > While (One)
> > > {
> > >     If ((!IBSY || (IERR == One)))
> > >     {
> > >         Break
> > > 

Re: [Intel-gfx] NVIDIA GPU fallen off the bus after exiting s2idle

2021-05-20 Thread Chris Chiu
On Thu, May 6, 2021 at 5:46 PM Rafael J. Wysocki  wrote:
>
> On Tue, May 4, 2021 at 10:08 AM Chris Chiu  wrote:
> >
> > Hi,
> > We have some Intel laptops (11th generation CPU) with NVIDIA GPU
> > suffering the same GPU falling off the bus problem while exiting
> > s2idle with external display connected. These laptops connect the
> > external display via the HDMI/DisplayPort on a USB Type-C interfaced
> > dock. If we enter and exit s2idle with the dock connected, the NVIDIA
> > GPU (confirmed on 10de:24b6 and 10de:25b8) and the PCIe port can come
> > back to D0 w/o problem. If we enter the s2idle, disconnect the dock,
> > then exit the s2idle, both external display and the panel will remain
> > with no output. The dmesg as follows shows the "nvidia :01:00.0:
> > can't change power state from D3cold to D0 (config space
> > inaccessible)" due to the following ACPI error
> > [ 154.446781]
> > [ 154.446783]
> > [ 154.446783] Initialized Local Variables for Method [IPCS]:
> > [ 154.446784] Local0: 9863e365  Integer 09C5
> > [ 154.446790]
> > [ 154.446791] Initialized Arguments for Method [IPCS]: (7 arguments
> > defined for method invocation)
> > [ 154.446792] Arg0: 25568fbd  Integer 00AC
> > [ 154.446795] Arg1: 9ef30e76  Integer 
> > [ 154.446798] Arg2: fdf820f0  Integer 0010
> > [ 154.446801] Arg3: 9fc2a088  Integer 0001
> > [ 154.446804] Arg4: 3a3418f7  Integer 0001
> > [ 154.446807] Arg5: 20c4b87c  Integer 
> > [ 154.446810] Arg6: 8b965a8a  Integer 
> > [ 154.446813]
> > [ 154.446815] ACPI Error: Aborting method \IPCS due to previous error
> > (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > [ 154.446824] ACPI Error: Aborting method \MCUI due to previous error
> > (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > [ 154.446829] ACPI Error: Aborting method \SPCX due to previous error
> > (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > [ 154.446835] ACPI Error: Aborting method \_SB.PC00.PGSC due to
> > previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > [ 154.446841] ACPI Error: Aborting method \_SB.PC00.PGON due to
> > previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > [ 154.446846] ACPI Error: Aborting method \_SB.PC00.PEG1.NPON due to
> > previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > [ 154.446852] ACPI Error: Aborting method \_SB.PC00.PEG1.PG01._ON due
> > to previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
> > [ 154.446860] acpi device:02: Failed to change power state to D0
> > [ 154.690760] video LNXVIDEO:00: Cannot transition to power state D0
> > for parent in (unknown)
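
For anyone triaging similar logs, the abort chain quoted above can be pulled out of a dmesg capture mechanically. What follows is a minimal Python sketch and not part of the original thread; the function name `aborted_methods` and the shortened `sample` text are illustrative assumptions. In the quoted log the innermost method (\IPCS) is reported first and the outermost (\_SB.PC00.PEG1.PG01._ON) last, so the first entry returned is the method in which the timeout was hit.

```python
import re

# Matches ACPICA abort lines of the form seen in the quoted log, e.g.
#   ACPI Error: Aborting method \IPCS due to previous error (AE_AML_LOOP_TIMEOUT)
ABORT_RE = re.compile(
    r"ACPI Error: Aborting method (?P<method>\S+) due to previous error "
    r"\((?P<status>AE_[A-Z_]+)\)"
)


def aborted_methods(dmesg_text):
    """Return (method, status) pairs in log order.

    Mail clients wrap long dmesg lines, so all whitespace (including
    newlines) is collapsed to single spaces before matching.
    """
    flat = " ".join(dmesg_text.split())
    return [(m.group("method"), m.group("status")) for m in ABORT_RE.finditer(flat)]


# Shortened excerpt of the log quoted in this thread (hypothetical capture).
sample = r"""
[ 154.446815] ACPI Error: Aborting method \IPCS due to previous error
(AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
[ 154.446852] ACPI Error: Aborting method \_SB.PC00.PEG1.PG01._ON due
to previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
"""

if __name__ == "__main__":
    for method, status in aborted_methods(sample):
        print(f"{status}: {method}")
```

Lines that still carry mail quote markers ("> >") between the wrapped halves of a message would need those markers stripped first; the sketch assumes raw dmesg output.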
>
> If I were to guess, I would say that AML tries to access memory that
> is not accessible while suspended, probably PCI config space.
>
> > IPCS is the last method called from \_SB.PC00.PEG1.PG01._ON. We
> > expect it to prepare everything before the NVIDIA GPU is brought
> > back, but it gets stuck in the loop described below.
> > Please refer to
> > https://gist.github.com/mschiu77/fa4f5a97297749d0d66fe60c1d421c44 for
> > the full DSDT.dsl.
>
> The DSDT alone may not be sufficient.
>
> Can you please create a bug entry at bugzilla.kernel.org for this
> issue and attach the full output of acpidump from one of the affected
> machines to it?  And please let me know the number of the bug.
>
> Also please attach the output of dmesg including a suspend-resume
> cycle including dock disconnection while suspended and the ACPI
> messages quoted below.
>
> > While (One)
> > {
> >     If ((!IBSY || (IERR == One)))
> >     {
> >         Break
> >     }
> >
> >     If ((Local0 > TMOV))
> >     {
> >         RPKG [Zero] = 0x03
> >         Return (RPKG) /* \IPCS.RPKG */
> >     }
> >
> >     Sleep (One)
> >     Local0++
> > }
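
To make the control flow of the quoted loop easier to reason about, here is a small Python model of it. This is a sketch under stated assumptions, not the firmware's actual behavior: `ipcs_wait`, `is_busy`, `has_error` and `sleep_ms` are hypothetical names standing in for the IBSY and IERR fields and for Sleep (One), and the value of TMOV is not given anywhere in this thread. The model shows that the loop has its own exit once the counter exceeds TMOV. If that exit had been taken, IPCS would have returned RPKG with 0x03 in element zero instead of being aborted, so one possible reading of the log is that the interpreter's loop limit (AE_AML_LOOP_TIMEOUT) was reached before Local0 exceeded TMOV.

```python
def ipcs_wait(is_busy, has_error, tmov, sleep_ms=lambda ms: None):
    """Model of the quoted While (One) loop.

    is_busy / has_error: callables standing in for reads of IBSY and IERR.
    tmov: iteration limit standing in for TMOV (actual value unknown).
    sleep_ms: stand-in for Sleep (One); a no-op by default so the model
              runs instantly.
    Returns ("done", n) when the loop breaks, or ("timeout", n) when the
    method's own timeout path (RPKG[0] = 0x03) would be taken.
    """
    local0 = 0
    while True:
        if not is_busy() or has_error():
            return ("done", local0)
        if local0 > tmov:
            return ("timeout", local0)
        sleep_ms(1)
        local0 += 1


if __name__ == "__main__":
    # Busy flag never clears: the method-level timeout fires after tmov + 1 polls.
    print(ipcs_wait(lambda: True, lambda: False, tmov=10))

    # Busy flag clears on the fourth poll: the loop breaks normally.
    polls = iter([True, True, True, False])
    print(ipcs_wait(lambda: next(polls), lambda: False, tmov=10))
```

In the real system each iteration also costs at least the Sleep (One) delay, so how many iterations fit before the interpreter gives up depends on timing that this model deliberately ignores.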
> >
> > The upstream PCIe port of the NVIDIA GPU also seems to become
> > inaccessible, judging by the following messages.
> > [ 292.746508] pcieport :00:01.0: waiting 100 ms for downstream
> > link, after activation
> > [ 292.882296] pci :01:00.0: waiting additional 100 ms to become 
> > accessible
> > [ 316.876997] pci :01:00.0: can't change power state from D3cold
> > to D0 (config space inaccessible)
> >
> > IPCS is part of the Intel Reference Code, and we don't really know
> > why the loop never ends just because we unplug the dock while the
> > system is still in s2idle. Can anyone from Intel suggest what
> > happens here?
>
> This list is not the right channel for inquiries related to Intel
> support, we can only help you as Linux kernel developers in this
> venue.
>
> > And one thing also worth mentioning, if we unplug the display cable
> > from the dock before entering the s2idle, NVIDIA GPU can c