Hello:

Thanks for your reply; answers inline below.

On Wed, Jul 29, 2009 at 2:30 PM, Sam Lang<[email protected]> wrote:
>
> Hi Jim,
>
> Unfortunately that log probably isn't going to be useful.  All it shows is
> that a client wasn't able to contact a server (pvfs2-io-0-0) and then (on a
> different day) that client was restarted a number of times.  The client log
> doesn't rotate, it just keeps growing.  If possible, could you send the
> pvfs2-client.log files from all the clients in your cluster, as well as the
> pvfs2-server.log files from each of the three servers?  Also, can you
> clarify some of the above?  I've asked a few clarifying questions below.

I can, but they never show anything.  The pvfs server logs are always
perfectly clean, and as the nodes are currently normally idle, there's
nothing there either.  It's only on the headnode of the cluster (which
is just a client to pvfs) that I've been experiencing the issues, as
the demands on pvfs2 are much greater on the headnode at present.

> You migrated to PVFS 2.8.1 and the I/O servers were "down" during that time.
>  Did you keep the PVFS volume mounted on the clients, with the kernel module
> loaded and the client daemon running during that time?

I had shut down all the nodes and stopped pvfs on the headnode, then
logged in to my 3 pvfs servers and stopped the service, installed the
new RPM that I had built earlier that day, and started the pvfs
service.  Not knowing that an upgrade/migration was taking place on
the servers, I then went and started my headnode.  In addition, I
kicked off a rebuild of my compute nodes, which is how I do software
installation on them (installing the new pvfs2 rpm).  It turned out
that while pvfs was running on my pvfs server nodes, it was not
answering requests because of the upgrade; I only found this out when
my clients failed to mount or talk to the servers.

Later that day, one of my users managed to fill up the pvfs volume,
which caused hung processes.  I was able to promptly release a few
gig, and sent an e-mail out to all my users asking them to clean out
their space.  Freeing the space did not allow the hung processes to
resume or be killed (kill and kill -9 still had no effect on them).
My users promptly began to log in and clean their space, and between
the du's and rm's, at least one of my pvfs servers crashed.  It was a
few minutes before I figured out that this had happened; when I did, I
restarted pvfs2 on all 3 nodes.  Then again I had issues where my
clients still were not able to access them, and when I looked on the
server nodes again, the pvfs process was using 100% of one core, just
like during the upgrade.  After about 10 minutes, it recovered, and
things started working again.

> When the servers went down yesterday and the other days, did just the
> pvfs2-server process die?  Or did the entire server spontaneously reboot?

My pvfs servers had never died or rebooted prior to yesterday, when at
least one of the pvfs server processes crashed.  I've never had an
unexpected reboot on my pvfs servers.  The servers always seem to be
functioning well; it's always been the clients that have the problems.
>
> Did you reboot the _server_ to clear up hung processes? Hung processes are
> going to be on the client nodes, unless PVFS is also mounted on the same
> node the servers are running on, but that's not true for your config IIRC.

I rebooted my cluster headnode to clear the processes, as they were
user processes trying to do I/O on the pvfs volume through the
kernel-module mount (/mnt/pvfs2).  As far as pvfs is concerned, my
headnode is only a client.

> When it rebooted around noon, which node rebooted?  A server node or a
> client?  Which node in particular?  You've seen a number of spontaneous
> reboots.  Are they limited to client nodes or server nodes (or maybe just a
> few client nodes)?  Are there no kernel panics in syslog when those reboots
> occur?

Spontaneous reboots have always been client nodes.  Normally it's my
cluster headnode, as that gets the heaviest pvfs usage, especially
"parallel usage" (multiple users performing unrelated I/O on the pvfs
volume, such as tar/untar, mv, cp, scp, and some minor preprocessing
of data).  I have occasionally seen a client node reboot for no
apparent reason (but the job that was running there was always an
I/O-intensive job).  In that case, though, I don't get any logs or
such, as when a compute node reboots in Rocks, it is reformatted and
rebuilt on its way back up.

I have searched and scoured all the logs (except the pvfs2-client log,
which I keep forgetting about since it doesn't go in /var/log like all
my other logs), and the only thing that shows up is the system booting
back up.  There's no sign of a problem or of a cause for going down.

> If it's helpful, our approach to debugging these sorts of problems usually
> involves:
>
> * Isolating the problem.  If we can know that process hangs or node reboots
> always occur at node X, or always occur because the client daemon dies, or
> one of the servers falls over, etc., it gives us an area to focus on.  PVFS
> (or any distributed file system) has a number of different components that
> interact in different ways.  Knowing which bits of code are the source of
> the problem makes it a lot easier for us to figure out.  Basically, we're not
> very good at whack-a-mole.

Yep.  This makes sense.  While I'm not an expert with pvfs, my best
guess is that the problem lies in the pvfs2 kernel module and/or the
code responsible for mounting the filesystem.  Like I've said, I've
never had issues with the servers.  Normally when a process hangs
(like ls), any other I/O in that directory or below it will also
hang.  However, other client systems will continue to work just fine,
as will pvfs2-ls on the affected system.  To me, this says the servers
are likely working correctly, and the pvfs2-* commands have not broken
either.  I've never done any ROMIO jobs, so my "client access methods"
are limited.
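
In case it's useful, here is the sort of check I do by hand, written
out as a small Python sketch.  The mount point /mnt/pvfs2 and the
30-second timeout are assumptions, and it expects pvfs2-ls to be in
the PATH:

#!/usr/bin/env python3
"""Rough check: is a hang in the kernel-module (VFS) path or at the servers?

Lists a directory two ways:
  * with pvfs2-ls, which talks to the servers from userspace, and
  * through the mounted volume (os.listdir), which goes via the kernel module.
If the userspace listing works but the VFS listing never returns, the servers
are probably healthy and the client-side kernel path is the place to look.
"""
import concurrent.futures
import os
import subprocess
import sys

MOUNT_DIR = sys.argv[1] if len(sys.argv) > 1 else "/mnt/pvfs2"
TIMEOUT = 30  # seconds before declaring a path hung

# 1. Userspace path (bypasses the kernel module entirely).
try:
    out = subprocess.run(["pvfs2-ls", MOUNT_DIR], capture_output=True,
                         text=True, timeout=TIMEOUT)
    print("pvfs2-ls exit status:", out.returncode)
except subprocess.TimeoutExpired:
    print("pvfs2-ls itself hung -- suspect the servers, not the kernel module")
except FileNotFoundError:
    print("pvfs2-ls not found in PATH; skipping the userspace check")

# 2. Kernel-module path, attempted in a worker thread so it can be timed out.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(os.listdir, MOUNT_DIR)
try:
    entries = future.result(timeout=TIMEOUT)
    print("VFS listing OK: %d entries" % len(entries))
except concurrent.futures.TimeoutError:
    print("VFS listing did not return within %ds -- kernel-module path looks hung"
          % TIMEOUT)
    os._exit(1)  # don't wait on the stuck worker thread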

> * Reproducing the problem.  If you can reproduce what you're seeing by
> running some application with a specific set of parameters, or using a
> synthetic test, we can probably reproduce it on our own systems as well.
>  That makes debugging the problem a _ton_ easier, and a fix is usually in
> short order.  Obviously, it's not always possible to reproduce a problem,
> especially for those problems that are more system specific.

This is very hard.  I know what kinds of conditions the problems occur
under, but only from watching, and asking users what they were doing,
over the course of a few dozen crashes.  I've tried to stage crashes,
but have not succeeded.  Repeating the same actions to the best of our
ability does not reproduce them.
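
If it's any use, below is roughly the kind of synthetic load I would
run to try to stage a crash -- a hedged sketch, not a known
reproducer.  The path, worker count, and file sizes are just guesses
at our real workload (many users doing small cp/mv/tar/scp-style I/O
at once).

#!/usr/bin/env python3
"""Synthetic "many users doing small I/O" load against the PVFS mount.

Each worker process loops creating, writing, reading back, renaming,
listing, and deleting small files in its own directory, roughly imitating
the cp/mv/tar/scp-style traffic described above.
"""
import multiprocessing
import os

MOUNT_DIR = "/mnt/pvfs2/stress"  # assumed test directory on the PVFS volume
WORKERS = 8                      # concurrent "users"
ITERATIONS = 1000
FILE_SIZE = 64 * 1024            # small files, like ls/cp/scp traffic

def worker(idx):
    workdir = os.path.join(MOUNT_DIR, "worker-%d" % idx)
    os.makedirs(workdir, exist_ok=True)
    payload = os.urandom(FILE_SIZE)
    for i in range(ITERATIONS):
        path = os.path.join(workdir, "file-%d" % i)
        with open(path, "wb") as f:          # cp/scp-style write
            f.write(payload)
        with open(path, "rb") as f:          # read it back
            assert f.read() == payload
        os.rename(path, path + ".moved")     # mv-style metadata traffic
        os.listdir(workdir)                  # ls-style metadata traffic
        os.remove(path + ".moved")           # rm-style cleanup

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(i,))
             for i in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()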

Note that there are two different failure modes I'm experiencing.  The
more common (at least prior to 2.8.1; I don't have enough experience
with 2.8.1 yet to speak to it) is that a directory and its
subdirectories will "freeze", and all processes that attempt I/O in
those directories will hang in an unkillable state.  They will each
add 1.0 to the system load average, but normally not add anything to
the actual %cpu in use.  The only method I have of clearing these
processes and restoring access to that directory and its subordinates
is to reboot the CLIENT system responsible (which is usually my
headnode).
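
For what it's worth, here is a small Python sketch of the kind of
snapshot that could be grabbed on the client before rebooting it -- it
records the state and kernel wait channel of every process touching
the mount, which might at least show where they are stuck.  The mount
point is an assumption, and it needs root so that other users' /proc
entries are readable.

#!/usr/bin/env python3
"""Record pid, state, kernel wait channel, and command line for every
process with its cwd or an open file under the PVFS mount."""
import os

MOUNT_DIR = "/mnt/pvfs2"  # assumed mount point

def read(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""

def uses_mount(pid):
    """True if the process has its cwd or an open fd under MOUNT_DIR."""
    try:
        paths = [os.readlink("/proc/%s/cwd" % pid)]
        fd_dir = "/proc/%s/fd" % pid
        paths += [os.readlink(os.path.join(fd_dir, fd))
                  for fd in os.listdir(fd_dir)]
    except OSError:
        return False
    return any(p.startswith(MOUNT_DIR) for p in paths)

for pid in filter(str.isdigit, os.listdir("/proc")):
    if not uses_mount(pid):
        continue
    # State is the first field after the closing ')' in /proc/<pid>/stat.
    fields = read("/proc/%s/stat" % pid).rsplit(")", 1)[-1].split()
    state = fields[0] if fields else "?"
    wchan = read("/proc/%s/wchan" % pid).strip() or "-"
    cmdline = read("/proc/%s/cmdline" % pid).replace("\0", " ").strip()
    print("%6s %s %-30s %s" % (pid, state, wchan, cmdline))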

The other failure mode is spontaneous reboot.  This shows up less
frequently, but has happened at least a dozen times, and at least once
(possibly twice) since upgrading to 2.8.1.  There are no logs of
anything going wrong, just that the machine reboots.  It always
happens when some users are doing lots of I/O.  It's more likely to
happen when I have a user running 4-6 scp processes and some other
users working on their data, but it's not guaranteed.

In short, I cannot reproduce the problem on demand, but the problem
does reproduce itself often enough.

> * Lots of logging/debugging.  Problems in PVFS get written to logs as error
> messages most of the time.  It's probably not 100%, but we've gotten pretty
> good about writing errors to logs when something goes wrong.  Kernel panics
> usually show up in syslog on modern day kernels when the node falls over,
> with information about what bit of code caused the panic.  Giving us as much
> information as you can throw at us is better than trying to filter out the
> things that seem important.  We're used to handling lots of data. :-)

Unfortunately, I normally cannot find any trace of a problem.  I've
checked the syslog of both the client and all 3 servers, and I've
checked the pvfs client logs on the 3 servers.  I have not previously
checked the client logs, as I usually forget they exist.  However, for
today's spontaneous reboot there was nothing, as you can see above.
When a cluster node spontaneously reboots, no logs can be recovered at
all, since the node is rebuilt on its way back up.  I do not have any
known / confirmed cases of a client (other than the headnode) having
"frozen directory" issues since much older versions of pvfs2 (1.6.x?
1.7.0 maybe).

> We also have quite a few debugging options that can be enabled and written
> to logs.  That's the direction we'll probably have to go if we can't solve
> your problem otherwise.
>
> Thanks,
> -sam
>
>>
>> --Jim
>>
>> On Tue, Jul 28, 2009 at 9:35 AM, Sam Lang<[email protected]> wrote:
>>>
>>> Jim,
>>>
>>> We'll definitely try to help you resolve the problem you're seeing.  That
>>> said, I responded to a similar query of yours back in April.  See:
>>>
>>>
>>> http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-April/002765.html
>>>
>>> It would be great if you could answer the questions I asked in that
>>> email.
>>>
>>> Also, it's been hinted by yourself and others that this may not be PVFS
>>> related, as other users aren't experiencing the same problem.  I
>>> encourage
>>> you to eliminate the possibility of memory problems on your system.  You
>>> could try to run memtester  (http://pyropus.ca/software/memtester/) on
>>> both
>>> servers and clients to verify that memory on your system isn't the
>>> problem.
>>>
>>> I've created a trac ticket for the problem you're seeing, so that we can
>>> keep track of it that way.  See:
>>>
>>> https://trac.mcs.anl.gov/projects/pvfs/ticket/113
>>>
>>> -sam
>>>
>>> On Jul 28, 2009, at 10:58 AM, Jim Kusznir wrote:
>>>
>>>> Hi all:
>>>>
>>>> More or less since I've installed pvfs2, I've had recurring stability
>>>> issues.  Presently, my cluster headnode has 3 processes, each using
>>>> 100% of a core, that are "hung" on I/O (all of that processor usage is
>>>> in "system", not "user"), but the process is not in "D" state (its
>>>> moving between S and R).  The process should have completed in an hour
>>>> or less, its now been running for over 18 hours.  It also is not
>>>> responding to kills (including kill -9).  From the sounds of the
>>>> users' message, any additional processes started in the same working
>>>> directory will hang in the same way.
>>>>
>>>> This happens a lot.  Presently, the 3 hung processes are a binary
>>>> specific to the research (x2) and gzip; often, the hung processes are
>>>> ls and ssh (for scp), etc.  When this happens, all other physical
>>>> systems are still fully functional.  This has happened repeatedly
>>>> (although not repeatable on demand) on versions 1.5 through 1.8.1.
>>>> The only recovery option I have found to date is to reboot the system.
>>>> This normally only happens on the head node, but the head node is
>>>> also where a lot of the user I/O takes place (especially a lot of
>>>> small I/O accesses such as a few scp sessions, some gzips, and 5-10
>>>> users doing ls, mv, and cp operations).
>>>>
>>>> Given what I understand about pvfs2's current user base, I'd think it
>>>> must be stable; a large cluster could never run pvfs2 and still be
>>>> useful to users with the types of instability I keep experiencing.  As
>>>> such, I suspect the problem is somewhere with my system/setup, but to
>>>> date pcarns and others on #pvfs2 have not been able to identify what
>>>> it is.  These stability issues are significantly affecting the
>>>> usability of the cluster, and of course, beginning to deter users from
>>>> it, and/or my competency in administrating it.  Yet from what I can
>>>> tell, I'm experiencing some bug in the pvfs kernel module.  I'd really
>>>> like to get this problem fixed, and I'm at a loss of how, other than
>>>> replacing pvfs2 with some other filesystem, which I'd rather not do.
>>>>
>>>> How do I fix this problem without replacing pvfs2?
>>>>
>>>> --Jim
>>>> _______________________________________________
>>>> Pvfs2-users mailing list
>>>> [email protected]
>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>
>>>
>
>

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
