Kudos, Russell!  It nearly brought me to tears, reading that.  This is REAL 
old-school troubleshooting, like was routinely done back in the day when Linux 
wasn't so full of itself that it could just tell people to discard gear (or 
send it to a museum) that didn't work with it.

Does this apply to the net4801 or any other more commonly available used 
Soekris models that use the same CPU?

One of the critical reasons this kind of research is so important is that when 
you do the Wrong Thing successfully, the code diverges from the documentation, 
and later on someone hits a bug and has no clue why it's there.

Ted

-----Original Message-----
From: PLUG <[email protected]> On Behalf Of Russell Senior
Sent: Tuesday, December 24, 2024 4:59 PM
To: Portland Linux/Unix Group <[email protected]>
Subject: Re: [PLUG] Netbooting device needs NFSv2

On Wed, Dec 18, 2024 at 5:48 AM Russell Senior <[email protected]> 
wrote:
>
> The right solution is probably just to retire the one in the field and 
> put the whole lot of them into a "museum box", but hey, it's the 
> holidays. What better period to waste a bunch of time keeping creaking 
> hardware alive. And anyway, the museum curators will be more thrilled 
> to create an exhibit if they have working firmware.

I am happy to report that I was able to use the periodic builds I made 
historically to narrow the introduction of the breakage to a few months in 
2019, between late February and late May of that year. Then I used classic git 
bisection, in half-a-dozen or so iterations, to narrow the breakage to a single 
commit. To do the bisection on a 5 year old project that is constantly 
changing, I had to set up a "period correct" build environment. That is because 
the state of the project back in 2019 did not (and could not) anticipate the 
changes in the build host environment (things like new compiler and toolchain 
versions, in particular gcc, g++ and python). That meant I had to find a 
"spare" machine that I could commit to an old OS version. I ended up with 
Ubuntu 18.04.6, which would have been extant in 2019. I tried a Debian version, 
but it didn't have the non-free firmware blobs needed to get the "spare" laptop 
I had connected to a network.
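
For readers who have not used bisection, here is a minimal sketch of the idea 
in Python. It is not the actual git or OpenWrt workflow: the commit names, the 
64-commit history and the is_bad() test are all hypothetical stand-ins for 
"build the firmware at this commit and see whether the board boots".

```python
from typing import Callable, Sequence, Tuple


def bisect_first_bad(commits: Sequence[str],
                     is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Find the first bad commit by halving the search range.

    Assumes commits are ordered oldest to newest, commits[0] is known
    good, commits[-1] is known bad, and every commit after the first
    bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1                      # one build-and-boot test per step
        if is_bad(commits[mid]):
            bad = mid                   # breakage is at mid or earlier
        else:
            good = mid                  # breakage is after mid
    return commits[bad], steps


# Hypothetical history of 64 commits; "c41" stands in for the culprit.
commits = ["c%02d" % i for i in range(64)]
culprit = "c41"
found, steps = bisect_first_bad(commits, lambda c: c >= culprit)
print(found, steps)
```

Each test halves the remaining range, which is why a few months of commits can 
be searched in a handful of builds.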

The single commit was a kernel bump from v4.14.112 to v4.14.113. So, I looked 
at the commits involved in that transition and spotted one that changed how 
the Cyrix chips were supported. So, I took v4.14.113 and reverted that single 
change, and *boom* my breakage was fixed. So, I reported that upstream to the 
linux kernel people who were involved in that commit. While waiting for a 
response from them, an OpenWrt guy and I (mostly following his reasonable 
suggestions and intuition) narrowed the problem down even further. The root 
cause appears to be that the SC1100 chip does *NOT* want its SUSP# pin enabled. 
This pin allows an external device (part of the chipset) to stop and start the 
CPU. Apparently, during warm boots, that pin gets pulled low and the CPU 
dutifully stops. So, I have a patch that works for my specific context, 
although it probably breaks in some other contexts, so upstream will need to 
determine how to deal with that. My same local fix works in modern OpenWrt with 
a v6.6.67 kernel. So, my field deployed Soekris net4826 *can* be updated to 
modern firmware.

  
https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/datasheets/goede_gx1_databook-rev5.pdf

In the Geode GX1 family, there is a set of CPU registers that are accessed by 
first writing a register index to port 0x22 then reading or writing to port 
0x23. The "fix" that broke the SC1100 was to actually do that getting/setting 
correctly in the right order. I *think* the reason it was breaking is that the 
Old Method was trying to set the SUSP# enable bit, but actually failing, so it 
was not enabled and my warm boot succeeded. When the v4.14.113 changes fixed 
the getter/setter functions, the code did the Wrong Thing successfully. So, the 
right fix is just to not do the Wrong Thing at all.
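
To illustrate how an access-order mistake can make a register write fail 
silently, here is a small Python simulation of an index/data port pair. It is a 
sketch under assumptions, not the kernel's code or the chip's documented 
behavior: the model assumes every access to port 0x23 must be immediately 
preceded by an index write to port 0x22, and the register index and bit mask 
are placeholders chosen for illustration.

```python
INDEX_PORT = 0x22
DATA_PORT = 0x23


class IndexedRegisters:
    """Toy model: a data access only works right after an index write."""

    def __init__(self):
        self.regs = {}
        self.selected = None            # index latched by the last 0x22 write

    def outb(self, value, port):
        if port == INDEX_PORT:
            self.selected = value
        elif port == DATA_PORT and self.selected is not None:
            self.regs[self.selected] = value
            self.selected = None        # selection is used up
        # a data write with nothing selected is silently dropped

    def inb(self, port):
        if port == DATA_PORT and self.selected is not None:
            value = self.regs.get(self.selected, 0)
            self.selected = None        # selection is used up
            return value
        return 0xFF                     # nothing selected: junk value


def set_bits_wrong_order(chip, reg, mask):
    chip.outb(reg, INDEX_PORT)
    old = chip.inb(DATA_PORT)           # consumes the index selection
    chip.outb(old | mask, DATA_PORT)    # no fresh index write: dropped


def set_bits_right_order(chip, reg, mask):
    chip.outb(reg, INDEX_PORT)
    old = chip.inb(DATA_PORT)
    chip.outb(reg, INDEX_PORT)          # re-select before the data write
    chip.outb(old | mask, DATA_PORT)


REG = 0xC2           # placeholder register index (assumption)
SUSP_ENABLE = 0x80   # placeholder bit mask (assumption)

a = IndexedRegisters()
set_bits_wrong_order(a, REG, SUSP_ENABLE)
b = IndexedRegisters()
set_bits_right_order(b, REG, SUSP_ENABLE)

wrong_result = a.regs.get(REG, 0)
right_result = b.regs.get(REG, 0)
print(hex(wrong_result), hex(right_result))
```

In this model the wrong-order sequence leaves the bit clear and the right-order 
sequence sets it, which mirrors the guess above: code that had been failing 
harmlessly starts taking effect once the access order is fixed.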

Merry Christmas,

--
Russell Senior
[email protected]
