On Wed, Dec 18, 2024 at 5:48 AM Russell Senior
<[email protected]> wrote:
>
> The right solution is probably just to retire the one in the field and
> put the whole lot of them into a "museum box", but hey, it's the
> holidays. What better period to waste a bunch of time keeping creaking
> hardware alive. And anyway, the museum curators will be more thrilled
> to create an exhibit if they have working firmware.

I am happy to report that I was able to use the periodic builds I made
historically to narrow down the region of the introduction of the
breakage to a few months in 2019, between late February and late May
of that year. Then I used classic git bisection, in half-a-dozen or so
iterations, to narrow the breakage to a single commit. To bisect what
is basically a 5-year-old snapshot of a constantly changing project,
I had to set up a "period correct" build environment. That
is because the state of the project back in 2019 did/could not
anticipate the changes in the build host environment (things like new
compiler and toolchain versions, in particular gcc, g++ and python).
That meant I had to find a "spare" machine that I could commit to an
old OS version. I ended up with Ubuntu 18.04.6, which would have been
extant in 2019. I tried a Debian version, but it didn't have the
non-free firmware blobs needed to get my "spare" laptop connected
to a network.
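
For anyone who has not done this before, the search itself can be
sketched as below. This is only a rough illustration of what git
bisect does, not the tool itself: the names first_bad and
example_is_bad, the range of 64 commits, and the position of the
breaking commit are all hypothetical. It assumes the oldest commit is
known good, the newest is known bad, and everything after the first
bad commit is also bad.

```c
#include <stddef.h>

/* Hypothetical predicate: nonzero if the build at this index shows the breakage. */
typedef int (*is_bad_fn)(size_t index);

/*
 * Find the first bad commit between a known-good index and a known-bad
 * index (good < bad). Each loop iteration stands for one build-and-test
 * cycle; the count is stored in *steps.
 */
size_t first_bad(size_t good, size_t bad, is_bad_fn is_bad, unsigned *steps)
{
    *steps = 0;
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;  /* test the midpoint */
        ++*steps;
        if (is_bad(mid))
            bad = mid;   /* breakage already present: look earlier */
        else
            good = mid;  /* still working: look later */
    }
    return bad;  /* first index at which the breakage appears */
}

/* Made-up example: pretend commit 37 introduced the breakage. */
int example_is_bad(size_t index)
{
    return index >= 37;
}
```

With 64 candidate commits, first_bad(0, 64, example_is_bad, &steps)
needs six build-and-test cycles, which is the same ballpark as the
half-a-dozen iterations mentioned above.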

The single commit was a kernel bump from v4.14.112 to v4.14.113. So, I
looked at the commits involved in that transition and spotted one that
changed how the Cyrix chips were supported. So, I took
v4.14.113 and reverted that single change, and *boom* my breakage was
fixed. So, I reported that upstream to the Linux kernel people who
were involved in that commit. While waiting for a response from them,
an OpenWrt guy and I (mostly following his reasonable suggestions and
intuition) narrowed the problem down even further. The root cause
appears to be that the SC1100 chip does *NOT* want its SUSP# pin
enabled. This pin allows an external device (part of the chipset) to
stop and start the CPU. Apparently, during warm boots, that pin gets
pulled low and the CPU dutifully stops. So, I have a patch that works
for my specific context, although it probably breaks in some other
contexts, so upstream will need to determine how to deal with that. The
same local fix works in modern OpenWrt with a v6.6.67 kernel. So, my
field-deployed Soekris net4826 *can* be updated to modern firmware.

  
https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/datasheets/goede_gx1_databook-rev5.pdf

In the Geode GX1 family, there is a set of CPU registers that are
accessed by first writing a register index to port 0x22 then reading
or writing to port 0x23. The "fix" that broke the SC1100 was to
actually do that getting/setting correctly in the right order. I
*think* the reason it was breaking is that the Old Method was trying
to set the SUSP# enable bit, but actually failing, so it was not
enabled and my warm boot succeeded. When the v4.14.113 changes fixed
the getter/setter functions, they did the Wrong Thing successfully. So,
the right fix is just to not do the Wrong Thing at all.
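
To make the ordering issue concrete, here is a small user-space
simulation. It is not the kernel code and it does not touch real
ports. It assumes one plausible model of the hardware: an access to
the data port 0x23 only reaches the register file if the immediately
preceding port access was an index write to 0x22. The sim_* names are
hypothetical, and the register index 0xC2 and bit mask 0x80 are
illustrative stand-ins for the SUSP# enable bit, not values taken from
the databook.

```c
#include <stdbool.h>
#include <stdint.h>

#define SIM_INDEX_PORT 0x22
#define SIM_DATA_PORT  0x23
#define SIM_CCR2       0xC2  /* illustrative register index (assumption) */
#define SIM_SUSP_EN    0x80  /* illustrative SUSP# enable bit (assumption) */

uint8_t sim_regs[256];  /* simulated configuration registers, all zero */
uint8_t sim_index;      /* last index written to port 0x22 */
bool sim_armed;         /* true only right after an index write */

void sim_outb(uint8_t value, uint16_t port)
{
    if (port == SIM_INDEX_PORT) {
        sim_index = value;
        sim_armed = true;
    } else if (port == SIM_DATA_PORT) {
        if (sim_armed)
            sim_regs[sim_index] = value;  /* write lands */
        sim_armed = false;                /* otherwise silently dropped */
    }
}

uint8_t sim_inb(uint16_t port)
{
    uint8_t value = 0xFF;  /* what an unarmed read returns in this model */
    if (port == SIM_DATA_PORT) {
        if (sim_armed)
            value = sim_regs[sim_index];
        sim_armed = false;
    }
    return value;
}

/* Old order: the data write is not directly preceded by an index write. */
void sim_set_bits_old(uint8_t reg, uint8_t bits)
{
    sim_outb(reg, SIM_INDEX_PORT);
    sim_outb(reg, SIM_INDEX_PORT);
    uint8_t cur = sim_inb(SIM_DATA_PORT);  /* the read disarms the data port */
    sim_outb(cur | bits, SIM_DATA_PORT);   /* dropped in this model */
}

/* Fixed order: every data access is directly preceded by an index write. */
void sim_set_bits_fixed(uint8_t reg, uint8_t bits)
{
    sim_outb(reg, SIM_INDEX_PORT);
    uint8_t cur = sim_inb(SIM_DATA_PORT);
    sim_outb(reg, SIM_INDEX_PORT);
    sim_outb(cur | bits, SIM_DATA_PORT);
}

uint8_t sim_read(uint8_t reg)
{
    sim_outb(reg, SIM_INDEX_PORT);
    return sim_inb(SIM_DATA_PORT);
}
```

In this model, calling sim_set_bits_old(SIM_CCR2, SIM_SUSP_EN) leaves
the bit clear, while sim_set_bits_fixed(SIM_CCR2, SIM_SUSP_EN) sets
it. That matches the story above: the old code asked for the bit and
silently failed, and the corrected code asked for it and got it.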

Merry Christmas,

-- 
Russell Senior
[email protected]
