Rick,
Thanks, we are going to try some of these suggestions later this evening or
tomorrow. We are currently backing up the MDT (as described in the Lustre
manual). I will post further once we get there.
Thanks for the suggestions.
Mike
On Wed, Jun 21, 2023 at 4:32 PM Mohr, Rick wrote:
>
Mike,
On the off chance that the recovery process is causing the issue, you could try
mounting the mdt with the "abort_recov" option and see if the behavior changes.
--Rick
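For reference, a minimal sketch of what that mount attempt might look like. The device path and mount point are placeholders, not values from this thread; abort_recov is the mount option named above. The function only prints the command so it can be reviewed before being run by hand on the MDS.

```shell
#!/bin/sh
# Sketch only: the device and mount point passed in are placeholders.
# abort_recov is the mount option named above; it aborts client recovery
# when the target starts.

# mdt_mount_cmd DEVICE MOUNTPOINT
# Prints (does not execute) a mount command for the MDT with recovery aborted.
mdt_mount_cmd() {
    echo "mount -t lustre -o abort_recov $1 $2"
}

# Example with hypothetical paths:
mdt_mount_cmd /dev/mdt_device /mnt/mdt
```

Printing first, rather than mounting directly, leaves a chance to check the device name against the backup that is in progress.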
On 6/21/23, 2:33 PM, "lustre-discuss on behalf of Jeff Johnson"
mailto:lustre-discuss-boun...@lists.lustre.org> on
Maybe someone else in the list can add clarity but I don't believe a
recovery process on mount would keep the MDS read-only or trigger that
trace. Something else may be going on.
I would start from the ground up. Bring your servers up, unmounted. Ensure
lnet is loaded and configured properly.
Jeff,
At this point we have the OSS shut down. We were coming back from a full
outage, so we are trying to get the MDS up before starting to bring up
the OSS.
Mike
On Wed, Jun 21, 2023 at 2:15 PM Jeff Johnson
wrote:
Mike,
Have you made sure the o2ib interfaces on all of your Lustre servers
(MDS & OSS) are functioning properly? Are you able to `lctl ping
x.x.x.x@o2ib` successfully between MDS and OSS nodes?
--Jeff
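A rough sketch of how that check could be looped over several NIDs. The helper name ping_nids is made up for illustration, and the NID in the usage comment reuses the OSS address mentioned elsewhere in this thread; the rest of the list would have to come from the site's own configuration.

```shell
#!/bin/sh
# ping_nids PING_CMD NID...
# Runs PING_CMD once per NID and reports which ones respond.
# PING_CMD is passed in as a string (normally "lctl ping") so the loop
# can be dry-run with a stand-in command on a node without Lustre.
ping_nids() {
    ping_cmd=$1
    shift
    failed=0
    for nid in "$@"; do
        # $ping_cmd is left unquoted on purpose so "lctl ping" splits
        # into the command and its subcommand.
        if $ping_cmd "$nid" >/dev/null 2>&1; then
            echo "ok   $nid"
        else
            echo "FAIL $nid"
            failed=1
        fi
    done
    return $failed
}

# Usage on the MDS (add the remaining server NIDs as needed):
#   ping_nids "lctl ping" 172.16.100.4@o2ib
```

The function returns non-zero if any NID fails, so it can gate later steps in a bring-up script.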
On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss <
Rick,
172.16.100.4 is the IB address of one of the OSS servers. I
believe the MGT and MDT0 are the same target. My understanding is that
we have a single instance of the MGT, which is on the first MDT server, i.e.
it was created via a command similar to:
# mkfs.lustre --fsname=scratch --index=0
Hi Rick,
The MGS/MDS are combined. The output I posted is from the primary.
Thanks,
Mike
On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick wrote:
Mike,
It looks like the mds server is having a problem contacting the mgs server.
I'm guessing the mgs is a separate host? I would start by looking for possible
network problems that might explain the LNet timeouts. You can try using "lctl
ping" to test the LNet connection between nodes,
Thanks, Rick, for that suggestion. TCP ping between a problematic host and
the MDS indeed does not go through.
Not exactly sure what to investigate next, but that gives me somewhere to
start...
- Youssef
On Tue, Jun 20, 2023 at 7:00 PM Mohr, Rick via lustre-discuss <
Greetings,
We have run into an issue that leaves both of our MDS servers able to
mount the MDT device only in read-only mode. Below are some of the error
messages we are seeing in the log files. We
lost our Lustre expert a while back and we are not sure how to
Have you run ibdiagnet?
Also, you will want to run ibqueryerrors.
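A small sketch for running both fabric checks from one place. The wrapper name run_if_present is made up for illustration; ibdiagnet and ibqueryerrors are invoked with no options here, and whether the defaults suit a True Scale fabric is an assumption to verify locally.

```shell
#!/bin/sh
# run_if_present TOOL [ARGS...]
# Runs TOOL with ARGS if it is on PATH; otherwise prints a skip message.
run_if_present() {
    tool=$1
    if command -v "$tool" >/dev/null 2>&1; then
        "$@"
    else
        echo "skip: $tool not installed"
    fi
}

# Usage on a node with the InfiniBand diagnostic tools installed:
#   run_if_present ibdiagnet
#   run_if_present ibqueryerrors
```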
On Tue, 20 Jun 2023, 17:11 Youssef Eldakar via lustre-discuss, <
lustre-discuss@lists.lustre.org> wrote:
> In a cluster having ~100 Lustre clients (compute nodes) connected together
> with the MDS and OSS over Intel True Scale InfiniBand