HI Team,

Hope you are doing well.

Can I request you to support review for
https://review.coreboot.org/q/topic:PCH_HW_SEQ_Cleanup and validation on
older ICH/PCH chipsets.

I have done my testing so far on Chrome OS devices but would like to get
your help on older devices if possible.

Thanks,
Subrata


On Fri, Mar 18, 2022 at 12:11 AM Subrata Banik <subrataba...@google.com>
wrote:

> I agree with Tim and will try to provide more requested information and
> testing with those CLs.
>
> > Only your reasoning "maintain the code symmetry" is not enough to
> justify changes.
>
> Kindly ignore my #2 comment, re-reading it doesn't make sense to me :) to
> maintain the symmetry is the real reason for those code changes.
>
> Looking for review comments and I will address those earliest.
>
> Thanks for having good discussion here.
>
> Thanks,
> Subrata
>
>
> On Thu, Mar 17, 2022 at 10:24 PM Tim Wawrzynczak <twawrzync...@google.com>
> wrote:
>
>>
>>
>> On Thu, Mar 17, 2022 at 10:35 AM Nico Huber <nic...@gmx.de> wrote:
>>
>>> On 17.03.22 15:35, Subrata Banik wrote:
>>> > On Thu, Mar 17, 2022 at 6:08 PM Nico Huber <nic...@gmx.de> wrote:
>>> >> On 16.03.22 17:19, Subrata Banik wrote:
>>> >>> The reason for this refactoring of HW sequencing SPI driver code are:
>>> >>> 1. We (Chrome OS team) recently ran into a firmware update issue on
>>> >>> dogfooders (test device utilise by real users and share feedback)
>>> >>> systems,
>>> >>> where the attempt to perform SPI flash update operation from host-cpu
>>> >>> (using underlying flashrom utility) is silently failing.
>>> Furthermore, we
>>> >>> debug the issue with help of Intel and figure out the problem is
>>> related
>>> >>> to
>>> >>> multiple master is accessing the SPI flash and operation triggered by
>>> >>> host-cpu side using flashrom (write and erase operation) is getting
>>> timed
>>> >>> out (due to given lesser timeout boundary) due to underlying SPI bus
>>> is
>>> >>> occupied by other master.
>>> >>
>>> >> Does the issue persist with the following commit applied to
>>> libflashrom,
>>> >> activated and integrated into futility?
>>> >>
>>> >>   commit 330088d77 [CHROMIUM]: libflashrom: Also use USE_BIG_LOCK
>>> >>
>>> >> This would give us a hint if the issue is indeed due to multiple
>>> masters
>>> >> or actually due to multiple programs trying to control the same
>>> master.
>>> >>
>>> > [Subrata] Yes, we are seeing failure even with USE_BIG_LOCK code
>>> changes
>>> > being integrated. The only code change that helps us to W/A this issue
>>> is
>>> > here
>>> >
>>> https://chromium-review.googlesource.com/c/chromiumos/third_party/flashrom/+/3498895
>>> > (increase
>>> > timeout in flashrom operations). Also, Intel disable the CSE runtime
>>> access
>>> > to SPI using a custom tool and that helped to fix this update failure
>>> as
>>> > well. These are the evidence towards multi master theory in AU failure
>>> > issue.
>>>
>>> I see. I was probably too distracted by the refactoring patches earlier.
>>> CB:62867 is indeed necessary.
>>>
>>
>> We seem to have actually figured out what the actual root cause of our
>> problem was, and it indeed
>> does seem to be the multi-master problem and flashrom not waiting long
>> enough in these cases.
>> tl;dr widevine provisioning was failing, and repeating endlessly in a
>> loop (~each second or so) and
>> each attempt actually causes the CSE to write to its data area in flash.
>> Unfortunately, we do not have
>> fantastic visibility into the processes that are running during AU, and
>> so this was very hard to track down.
>> We have fixed the issues with widevine provisioning and also added a max
>> 5 attempts to provision a device
>> instead of writing endlessly to the flash in the failure case.
>>
>>
>>>
>>> We already assumed a 5s timeout is needed for erasing a block. Worst
>>> case seems to be that all masters try to do this at once and then one
>>> master would have to wait (NM-1)*5s before its transaction even starts
>>> (NM being the number of masters).
>>>
>>> We could make it depend on the number of masters, or just choose some-
>>> thing safe like 30s. WDYT?
>>>
>>>
>> I think the longer the timeout (until it gets ridiculous), probably the
>> better. Unless there is buggy hardware
>> out there, there is no good reason for these operations to timeout; there
>> could for sure be SPI flash
>> chips out there that perhaps aren't always performing to-spec, who knows.
>>
>> But in this case, since do we know the maximum number of masters, and 5s
>> per erase seems reasonable too,
>> then 30 seconds makes sense to me.
>>
>>
>>> >> This is the special case, starting from Tiger Lake Chrome Platform,
>>> where
>>> >>> we have enable the PVAP (Protected Audio Video Path) which request
>>> CSE to
>>> >>> perform some erase and write operation as needed.
>>> >>
>>> >> We used flashrom on PAVP enabled systems for more than a decade
>>> without
>>> >> issues. Did something change in that area for Tiger Lake in
>>> particular?
>>> >>
>>> > [Subrata]  I'm not sure about other platforms, but from TGL onwards, on
>>> > Chrome devices we have enabled PAVP use case and Intel had tried to
>>> > replicate this issue on older platforms as well and informed us it's
>>> > replicated the issue on TGL platform as well where PAVP first being
>>> enabled
>>> > and not on JSL (which doesn't enable PAVP). Right now the
>>> recommendation
>>> > that is coming out from Intel is that, on latest Chrome OS devices,
>>> there
>>> > might be concurrent SPI accesses between host CPU and CSE. But I will
>>> ask
>>> > this question about the older generation platform as you have
>>> highlighted
>>> > which has PAVP enabled. Can you please let me know those platform
>>> names ?
>>> > Also,  do you know on those devices if we are trying to update the FW
>>> using
>>> > flashrom tool as well or not?
>>>
>>> Thanks for looking into it. As our timeouts were way below the worst
>>> case, I assume enabling PAVP just makes the problem more visible and is
>>> not directly related. It could also be any other code running on the CSE
>>> that changed between JSL and TGL (unless you know, ofc., that PAVP is
>>> the only code that writes to flash).
>>>
>>>
>> PAVP is the only thing we know of ATM that writes to the flash aside from
>> perhaps after the first flash of the ME region.
>>
>> The reason this became visible on brya only now I explained just above :)
>>
>>
>>> Flashrom basically runs on everything so the platform names would be all
>>> platforms that support PAVP. Probably not worth looking further into it
>>> unless there are still issues with a reasonable timeout.
>>>
>>> >
>>> >>
>>> >>> 2. Maintain the code symmetry between coreboot and flashrom SPI
>>> hardware
>>> >>> sequencing operations.
>>> >>
>>> >> Flashrom has tighter requirements compared to coreboot because we try
>>> to
>>> >> support all compatible chipset generations with the same code. If you
>>> >> want to align the two code-bases, please use flashrom as reference and
>>> >> patch coreboot.
>>> >>
>>> > [Subrata] Let me understand this a little better, are you suggesting
>>> not to
>>> > increase the SPI operation timing in flashrom upstream master ? knowing
>>> > that there is a hidden issue acknowledged by Intel with their
>>> validation
>>> > data, also, promised to update the SPI programming guide to capture the
>>> > recommendation of timeout in multi master scenarios.
>>>
>>> No I'm not suggesting any such thing. What I wrote was related to your
>>> 2. point as quoted above.
>>>
>>> >
>>> > Apart from timeout increasing CLs, other code changes are to increase
>>> the
>>> > code modularity (like using macro and helper function), are you
>>> suggesting
>>> > to not touch the flashrom code? can't we all contribute towards
>>> validating
>>> > the concerned target platforms. At Least in the coreboot community we
>>> all
>>> > help each other to validate the platform if we are out of required
>>> devices.
>>>
>>> We can have a look at the individual changes if there is a reason to
>>> change the code, i.e. to make it better. Only your reasoning "maintain
>>> the code symmetry" is not enough to justify changes.
>>>
>>> I'm sorry if this affects your work, but we have very bad experience
>>> with unnecessary code changes by CrOS developers. Currently `sb600spi`
>>> (the equivalent to `ichspi` for AMD platforms) is broken since months
>>> and only CrOS developers could clarify why the code was changed as it
>>> was. We need to prevent that this happens to `ichspi` too, right?
>>>
>>> Nico
>>>
>>
>> We are certainly not intending to break ichspi or make it worse, we are
>> trying to make it better, perhaps we were a little
>> hasty in trying to get through solutions that were not perhaps the root
>> of the problem, but were discovered along the way.
>>
>> We are also working with Intel to get some documentation out about the
>> expected behaviors with regard to the
>> multi-master scenarios, hopefully that will help clarify things some in
>> this regard.
>>
>> -Tim
>>
>
_______________________________________________
flashrom mailing list -- flashrom@flashrom.org
To unsubscribe send an email to flashrom-le...@flashrom.org

Reply via email to