On Wed, 2024-01-31 at 09:27 -0500, Gary Dale wrote:
> On 2024-01-30 15:54, hw wrote:
> > On Mon, 2024-01-29 at 11:42 -0500, Gary Dale wrote:
> > > I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability
> > > to see the root directory even when I am logged in as root (su -).
> > > 
> > > This has been happening intermittently for several months. I initially
> > > thought it might be related to failing NVME drive that was part of a
> > > RAID1 array that is mounted as "/" but I replaced the device and the
> > > problem is still happening.
> > > [...]
> > What happens when you put the device you replaced back?
> > 
> How could putting a known-failing device back in help? The problem 
> existed before I replaced it and continues to exist after the replacement.

It sounded like you were able to list the root directory (at least
sometimes) before you did the replacement.  Manually failing the
device (perhaps after adding it back first) could make a difference.

I've seen such indefinite hangs only when an NFS share had become
unreachable after being mounted.  You could use Clonezilla to make a
copy and then perhaps convert the file system to btrfs.
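To rule out that kind of stale network mount, the short Python sketch below reads /proc/mounts and prints any entries whose file system type is a network type. The set NETWORK_FS_TYPES and the function name network_mounts are hypothetical choices for illustration, not anything from mdadm or Debian.

```python
from __future__ import annotations

import os

# Hypothetical, hand-picked set of network file system types as they appear
# in the third field of /proc/mounts; extend it for your own setup.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "fuse.sshfs"}


def network_mounts(mounts_text: str) -> list[tuple[str, str, str]]:
    """Return (source, mount point, fs type) for each network mount in mounts_text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        source, mount_point, fs_type = fields[:3]
        if fs_type in NETWORK_FS_TYPES:
            found.append((source, mount_point, fs_type))
    return found


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    # Reading /proc/mounts only queries the kernel's mount table; it does not
    # stat the mount points, so it should not block on an unreachable server.
    with open("/proc/mounts") as f:
        for source, mount_point, fs_type in network_mounts(f.read()):
            print(f"{mount_point}: {fs_type} from {source}")
```

If nothing is printed, there is no mount of those types, and the NFS explanation probably doesn't apply.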

Do you still have the problem when you remove one of the NVMe drives?
Perhaps you have the equivalent of a bad SATA cable, or the mainboard
doesn't like it when you access two of those at the same time, or
something like that.  Even simple network cables can behave very
strangely, and NVMe may be a bit more complicated than that.

Running fsck on every boot to work around an issue like this is
certainly a bad idea.  Doesn't fsck report anything?  If it really
makes a difference in itself rather than creating some side effect
that leads to the root directory being readable, it should report
something.  Perhaps you need to increase its verbosity.

If there's no report, then it looks like a side effect, which raises
the question of what side effect it might be.  Does fsck run before the
RAID has been brought up or after?  Is the RAID up when booting is
completed?  What does mdadm say about the device(s)?  Can you still
list the root directory when you manually fail either drive?  What
exactly are the circumstances under which you can and cannot list the
root directory?

You need to do some investigating and ask questions like those ...
