Re: Can't list root directory

2024-02-17 Thread Gary Dale

On 2024-02-01 02:37, Loren M. Lang wrote:


On January 31, 2024 1:28:37 PM PST, hw  wrote:

On Wed, 2024-01-31 at 09:27 -0500, Gary Dale wrote:

On 2024-01-30 15:54, hw wrote:

On Mon, 2024-01-29 at 11:42 -0500, Gary Dale wrote:

I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability
to see the root directory even when I am logged in as root (su -).

This has been happening intermittently for several months. I initially
thought it might be related to failing NVME drive that was part of a
RAID1 array that is mounted as "/" but I replaced the device and the
problem is still happening.
[...]

What happens when you put the device you replaced back?


How could putting a known-failing device back in help? The problem
existed before I replaced it and continues to exist after the replacement.

It sounded like you were able to list the root directory (at least
sometimes) before you did the replacement.  Manually failing the
device (perhaps after adding it back first) could make a difference.

I've seen such indefinite hangs only when an NFS share has become
unreachable after it had been mounted.  You could use clonezilla to
make a copy and then perhaps convert the file system to btrfs.

Do you still have the problem when you remove one of the NVME storage
things?  Perhaps you have the equivalent of a bad SATA cable or the
mainboard doesn't like it when you access two of those at the same
time, or something like that.  Even simple network cables can behave
very strangely, and NVME may be a bit more complicated than that.

Running fsck on every boot to work around an issue like this is
certainly a bad idea.  Doesn't fsck report anything?  If it really
makes a difference in itself rather than creating some side effect
that leads to the root directory being readable, it should report
something.  Perhaps you need to increase its verbosity.

If there's no report then it would look like a side effect and raise
the question what side effect it might be.  Does fsck run before the
RAID has been brought up or after?  Is the RAID up when booting is
completed?  What does mdadm say about the device(s)?  Can you still
list the root directory when you manually fail either drive?  What
exactly are the circumstances under which you can and cannot list the
root directory?

You need to do some investigating and ask questions like those ...


Also, instead of doing "ls -l /" which will stat() every child folder under
root, try "/bin/ls -f /" and see if that is successful. That will only do a
readdir() on root itself. Also, it might be interesting to get a log of
"strace ls -l /" to confirm exactly where the hang happens.

-Loren


Thanks, Loren. /bin/ls -l works. The strace shows the hang is on
/keybase. The strace itself hung badly: Ctrl-C wouldn't kill it.
I've set the fsck count to 1 again, so I can reboot and take a look at it.





Re: Can't list root directory

2024-02-02 Thread Gary Dale

On 2024-01-31 12:02, Max Nikulin wrote:

On 29/01/2024 23:42, Gary Dale wrote:

"ls -l /" just hangs


It may dereference symlinks, call stat, etc. to colorize output. Could
it be that you have automount points or something related to
network mounts?


Does "echo /*" hang?

Even bash prompt may do some funny stuff. I would try it from "dash".

Can you install strace? E.g. copy the files over while booted from live media.

Thanks everyone for the suggestions. I'll retune the array to not fsck 
every boot and see if the problem recurs so I can try your suggestions.




Re: Can't list root directory

2024-01-31 Thread Loren M. Lang



On January 31, 2024 1:28:37 PM PST, hw  wrote:
>On Wed, 2024-01-31 at 09:27 -0500, Gary Dale wrote:
>> On 2024-01-30 15:54, hw wrote:
>> > On Mon, 2024-01-29 at 11:42 -0500, Gary Dale wrote:
>> > > I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability
>> > > to see the root directory even when I am logged in as root (su -).
>> > > 
>> > > This has been happening intermittently for several months. I initially
>> > > thought it might be related to failing NVME drive that was part of a
>> > > RAID1 array that is mounted as "/" but I replaced the device and the
>> > > problem is still happening.
>> > > [...]
>> > What happens when you put the device you replaced back?
>> > 
>> How could putting a known-failing device back in help? The problem 
>> existed before I replaced it and continues to exist after the replacement.
>
>It sounded like you were able to list the root directory (at least
>sometimes) before you did the replacement.  Manually failing the
>device (perhaps after adding it back first) could make a difference.
>
>I've seen such indefinite hangs only when an NFS share has become
>unreachable after it had been mounted.  You could use clonezilla to
>make a copy and then perhaps convert the file system to btrfs.
>
>Do you still have the problem when you remove one of the NVME storage
>things?  Perhaps you have the equivalent of a bad SATA cable or the
>mainboard doesn't like it when you access two of those at the same
>time, or something like that.  Even simple network cables can behave
>very strangely, and NVME may be a bit more complicated than that.
>
>Running fsck on every boot to work around an issue like this is
>certainly a bad idea.  Doesn't fsck report anything?  If it really
>makes a difference in itself rather than creating some side effect
>that leads to the root directory being readable, it should report
>something.  Perhaps you need to increase its verbosity.
>
>If there's no report then it would look like a side effect and raise
>the question what side effect it might be.  Does fsck run before the
>RAID has been brought up or after?  Is the RAID up when booting is
>completed?  What does mdadm say about the device(s)?  Can you still
>list the root directory when you manually fail either drive?  What
>exactly are the circumstances under which you can and cannot list the
>root directory?
>
>You need to do some investigating and ask questions like those ...
>

Also, instead of doing "ls -l /" which will stat() every child folder under 
root, try "/bin/ls -f /" and see if that is successful. That will only do a 
readdir() on root itself. Also, it might be interesting to get a log of "strace 
ls -l /" to confirm exactly where the hang happens.
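The readdir()-versus-stat() distinction above can also be checked one entry at a time. The following is only a sketch, not something anyone in this thread ran: the function names and the 5-second timeout are made up for illustration. It reads the directory without touching the entries, then runs lstat() on each entry in a separate child process so that one stuck entry does not freeze the whole check.

```python
import os
import subprocess
import sys


def list_names(directory="/"):
    """Read the directory itself, roughly what "ls -f" does: no per-entry stat."""
    with os.scandir(directory) as entries:
        return sorted(entry.name for entry in entries)


def probe_entry(path, timeout=5.0):
    """lstat() one path in a child process and return "ok", "error" or "timeout"."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.lstat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return "ok" if child.wait(timeout=timeout) == 0 else "error"
    except subprocess.TimeoutExpired:
        # kill() may not take effect while the child is blocked inside the
        # kernel, so do not wait for it again here.
        child.kill()
        return "timeout"


def find_slow_entries(directory="/", timeout=5.0):
    """Return the entries of `directory` whose lstat() did not finish in time."""
    slow = []
    for name in list_names(directory):
        path = os.path.join(directory, name)
        if probe_entry(path, timeout) == "timeout":
            slow.append(path)
    return slow
```

Calling find_slow_entries("/") while the problem is present should return the paths whose lstat() never came back. A child that is blocked in the kernel may linger until the underlying mount recovers.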

-Loren 




Re: Can't list root directory

2024-01-31 Thread hw
On Wed, 2024-01-31 at 09:27 -0500, Gary Dale wrote:
> On 2024-01-30 15:54, hw wrote:
> > On Mon, 2024-01-29 at 11:42 -0500, Gary Dale wrote:
> > > I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability
> > > to see the root directory even when I am logged in as root (su -).
> > > 
> > > This has been happening intermittently for several months. I initially
> > > thought it might be related to failing NVME drive that was part of a
> > > RAID1 array that is mounted as "/" but I replaced the device and the
> > > problem is still happening.
> > > [...]
> > What happens when you put the device you replaced back?
> > 
> How could putting a known-failing device back in help? The problem 
> existed before I replaced it and continues to exist after the replacement.

It sounded like you were able to list the root directory (at least
sometimes) before you did the replacement.  Manually failing the
device (perhaps after adding it back first) could make a difference.

I've seen such indefinite hangs only when an NFS share has become
unreachable after it had been mounted.  You could use clonezilla to
make a copy and then perhaps convert the file system to btrfs.

Do you still have the problem when you remove one of the NVME storage
things?  Perhaps you have the equivalent of a bad SATA cable or the
mainboard doesn't like it when you access two of those at the same
time, or something like that.  Even simple network cables can behave
very strangely, and NVME may be a bit more complicated than that.

Running fsck on every boot to work around an issue like this is
certainly a bad idea.  Doesn't fsck report anything?  If it really
makes a difference in itself rather than creating some side effect
that leads to the root directory being readable, it should report
something.  Perhaps you need to increase its verbosity.

If there's no report then it would look like a side effect and raise
the question what side effect it might be.  Does fsck run before the
RAID has been brought up or after?  Is the RAID up when booting is
completed?  What does mdadm say about the device(s)?  Can you still
list the root directory when you manually fail either drive?  What
exactly are the circumstances under which you can and cannot list the
root directory?

You need to do some investigating and ask questions like those ...
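
One way to answer the "is the RAID up?" question without typing mdadm commands each time is to read /proc/mdstat. This is a rough sketch that swaps mdadm for the kernel's own status file. The function name is invented, and SAMPLE is made-up text in the usual /proc/mdstat layout, not output from the machine discussed here.

```python
import re

# Matches the "[2/1] [U_]" part of an array's status line.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

# Hypothetical sample in the usual /proc/mdstat layout (not real output).
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme0n1p2[0]
      488253440 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: member_map} for arrays that show a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if current and status and "_" in status.group(3):
            degraded[current] = status.group(3)
    return degraded


def read_mdstat(path="/proc/mdstat"):
    """Read the live status file; only meaningful on a Linux system using md."""
    with open(path, encoding="ascii", errors="replace") as handle:
        return handle.read()
```

On a live system, degraded_arrays(read_mdstat()) returning an empty dict would suggest that every array has all of its members; "mdadm --detail" on the array remains the fuller report.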



Re: Can't list root directory

2024-01-31 Thread Max Nikulin

On 29/01/2024 23:42, Gary Dale wrote:

"ls -l /" just hangs


It may dereference symlinks, call stat, etc. to colorize output. Could it
be that you have automount points or something related to network
mounts?


Does "echo /*" hang?

Even bash prompt may do some funny stuff. I would try it from "dash".

Can you install strace? E.g. copy the files over while booted from live media.



Re: Can't list root directory

2024-01-31 Thread The Wanderer
On 2024-01-29 at 11:42, Gary Dale wrote:

> I'm running Debian/Trixie on an AMD64 workstation. I've lost the
> ability to see the root directory even when I am logged in as root
> (su -).
> 
> This has been happening intermittently for several months. I
> initially thought it might be related to failing NVME drive that was
> part of a RAID1 array that is mounted as "/" but I replaced the
> device and the problem is still happening.
> 
> I had been able to fix it by booting to SystemRescue and running an
> fsck on the device but it didn't work this time. The device checks
> out OK (even when using fsck /dev/mdx -f) but I still can't list the
> root. "ls -l /" just hangs, as do any attempts to see the root
> directory in a graphical file manager. In dolphin this means there is
> nothing in the folders - and since that is the default starting point
> I have to manually enter a folder name (e.g. /home/me) in the
> location bar to be able to see anything - but even then the folders
> panel remains empty.
> 
> Even running commands like df -h hang because they can't access the
> root folder. However the system is otherwise running normally.

I'm not sure it'll help lead to anything, but out of curiosity and/or as
a possible diagnostic: when the problem is manifesting, what happens if
you run 'stat /'? Does it report data (similar to what you'd get from
'stat' on another directory), or does it hang, or give errors, or...?

My thought is that this will give information about the filesystem
object that is the root directory, without trying to also access
information about the *contents* of that directory. If the one succeeds
where the other fails, that might help narrow down where the actual
issue is.
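
The same split between the directory object and its contents can be sketched in a few lines of Python. This is only an illustration with a hypothetical function name. It announces each step before running it, so if the script hangs, the last line printed shows which access got stuck.

```python
import os
import stat


def check_steps(path="/"):
    """Access the directory object first, then its contents, as separate steps."""
    results = {}

    # Step 1: information about the directory itself, like "stat /".
    print(f"stat({path!r}) ...", flush=True)
    info = os.stat(path)
    results["is_dir"] = stat.S_ISDIR(info.st_mode)
    results["inode"] = info.st_ino

    # Step 2: the names inside it; this reads the directory but does not
    # stat the individual entries.
    print(f"listdir({path!r}) ...", flush=True)
    results["entries"] = len(os.listdir(path))

    return results
```

If both steps complete but "ls -l /" still hangs, that would point at one of the entries rather than at the root directory itself.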

-- 
   The Wanderer

The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man. -- George Bernard Shaw





Re: Can't list root directory

2024-01-31 Thread Gary Dale

On 2024-01-29 12:55, Hans wrote:

Hi Gary,

Before losing any data, I suggest booting from a live Linux system. Please use
a modern live system like Knoppix or Kali Linux.

If it is not a BIOS problem, you should see the device again and be able to
mount it. If /root is on a separate partition, you can do some filesystem
checks, like e2fsck or similar.

And most important: with a live system you can mount an external hard drive
and back up all files. Thus, even if the /dev/nvme*** has died or is partly
broken, you may be able to restore /root on another partition.

Second: Please check the ACLs. Although I do not believe they are the reason, it
is worth a look. Maybe you or someone else has changed them accidentally.

Third idea: Is the hard drive full? In the past I had the problem of not being
able to do anything. The reason: my hard drive was completely full (some
temporary file was the cause). Deleting this big file did the trick.

Just some ideas, maybe it could help.

Good luck!

Best

Hans


There is no problem seeing the root folder when I boot from a live distro.

fsck never finds any significant issue.

An ACL issue would be permanent. This comes and goes.

I actually doubled the size of the root device when I put in the new 
NVME drive. When I set up the RAID array, I'd bought a 500G second drive 
to mirror the 256G original drive. When I replaced the 256G drive, I was 
able to expand the array to 500G (less a small amount for the EFI 
partition). The partition has lots of free space.


As I said, running an fsck seems to fix the issue temporarily. I now run 
an fsck on every boot.




Re: Can't list root directory

2024-01-31 Thread Gary Dale

On 2024-01-29 11:42, Gary Dale wrote:
I'm running Debian/Trixie on an AMD64 workstation. I've lost the 
ability to see the root directory even when I am logged in as root (su 
-).


This has been happening intermittently for several months. I initially 
thought it might be related to a failing NVME drive that was part of a 
RAID1 array that is mounted as "/" but I replaced the device and the 
problem is still happening.


I had been able to fix it by booting to SystemRescue and running an 
fsck on the device but it didn't work this time. The device checks out 
OK (even when using fsck /dev/mdx -f) but I still can't list the 
root. "ls -l /" just hangs, as do any attempts to see the root 
directory in a graphical file manager. In dolphin this means there is 
nothing in the folders - and since that is the default starting point 
I have to manually enter a folder name (e.g. /home/me) in the location 
bar to be able to see anything - but even then the folders panel 
remains empty.


Even running commands like df -h hang because they can't access the 
root folder. However the system is otherwise running normally.


Strangely, in the past simply booting to a rescue shell then exiting 
would also work. I'd usually try to do an fsck on the raid device but 
that would always fail because it was mounted.


The only thing I noticed that was unusual was I rebooted after 
installing the latest Trixie updates this morning. That took about 10 
minutes to shut down - 6 of which were spent waiting for a drkonqi 
process to finish. There was also a systemd message really late in the 
shutdown about /dev/md0 but that's not the root device.


I'm used to Linux taking its time to shut down lately so I don't think 
this was related. The systemd shutdown just seems to be easily delayed.


Any ideas on how I can restore my ability to see the root directory?

OK, got it working again. I used tune2fs to do an fsck on every boot. 
This being an NVME device, it's barely noticeable.




Re: Can't list root directory

2024-01-31 Thread Gary Dale

On 2024-01-30 15:54, hw wrote:

On Mon, 2024-01-29 at 11:42 -0500, Gary Dale wrote:

I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability
to see the root directory even when I am logged in as root (su -).

This has been happening intermittently for several months. I initially
thought it might be related to failing NVME drive that was part of a
RAID1 array that is mounted as "/" but I replaced the device and the
problem is still happening.
[...]

What happens when you put the device you replaced back?

How could putting a known-failing device back in help? The problem 
existed before I replaced it and continues to exist after the replacement.





Re: Can't list root directory

2024-01-30 Thread hw
On Mon, 2024-01-29 at 11:42 -0500, Gary Dale wrote:
> I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability 
> to see the root directory even when I am logged in as root (su -).
> 
> This has been happening intermittently for several months. I initially 
> thought it might be related to failing NVME drive that was part of a 
> RAID1 array that is mounted as "/" but I replaced the device and the 
> problem is still happening.
> [...]

What happens when you put the device you replaced back?



Re: Can't list root directory

2024-01-29 Thread Hans
Hi Gary,

Before losing any data, I suggest booting from a live Linux system. Please use
a modern live system like Knoppix or Kali Linux.

If it is not a BIOS problem, you should see the device again and be able to
mount it. If /root is on a separate partition, you can do some filesystem
checks, like e2fsck or similar.

And most important: with a live system you can mount an external hard drive
and back up all files. Thus, even if the /dev/nvme*** has died or is partly
broken, you may be able to restore /root on another partition.

Second: Please check the ACLs. Although I do not believe they are the reason, it
is worth a look. Maybe you or someone else has changed them accidentally.

Third idea: Is the hard drive full? In the past I had the problem of not being
able to do anything. The reason: my hard drive was completely full (some
temporary file was the cause). Deleting this big file did the trick.

Just some ideas, maybe it could help.

Good luck!

Best 

Hans 




Re: Can't list root directory

2024-01-29 Thread tomas
On Mon, Jan 29, 2024 at 11:42:14AM -0500, Gary Dale wrote:
> I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability to
> see the root directory even when I am logged in as root (su -).
> 
> This has been happening intermittently for several months. I initially
> thought it might be related to failing NVME drive that was part of a RAID1
> array that is mounted as "/" but I replaced the device and the problem is
> still happening.

[...]

Anything mounted below / whose block device is taking its time?
Maybe a network device?

What does mount say?
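
For a quick look at what is mounted directly below /, the mount table can be read from /proc/self/mounts, which does not touch the mount points themselves. The following is a sketch under assumptions: the list of "suspect" filesystem type prefixes is illustrative and incomplete, the function names are invented, and SAMPLE is made-up text rather than output from the affected machine.

```python
import os

# Type prefixes that commonly depend on a network or a user-space daemon.
# Illustrative and incomplete; adjust as needed.
SUSPECT_PREFIXES = ("nfs", "cifs", "smb", "fuse", "autofs", "9p", "ceph")

# Hypothetical sample in /proc/self/mounts format (not real output).
SAMPLE = """\
/dev/md1 / ext4 rw,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
/dev/md0 /boot/efi vfat rw,relatime 0 0
example-fuse /example fuse rw,nosuid,nodev,relatime 0 0
"""


def parse_mounts(text):
    """Return (source, mount_point, fs_type) tuples from mount-table text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        source, mount_point, fs_type = fields[:3]
        # The kernel escapes spaces in paths as the octal sequence \040.
        mounts.append((source, mount_point.replace("\\040", " "), fs_type))
    return mounts


def top_level_mounts(mounts):
    """Keep only mounts whose mount point is a direct child of /."""
    return [m for m in mounts if m[1] != "/" and os.path.dirname(m[1]) == "/"]


def suspects(mounts):
    """Top-level mounts whose type suggests a network or FUSE filesystem."""
    return [m for m in top_level_mounts(mounts) if m[2].startswith(SUSPECT_PREFIXES)]


def read_mounts(path="/proc/self/mounts"):
    """Read the live mount table on a Linux system."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

On the affected system, suspects(parse_mounts(read_mounts())) would list the candidates to check first when "ls -l /" hangs.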

Cheers
-- 
t

