I've been running into a problem on and off for a while now and I'm
hoping maybe someone here might have some advice or suggestions...

For reasons unknown to me, it seems that my NFS mounts are
occasionally unstable (this is on Solaris and the NFS mounts are
hosted on NetApp Filers). Of course, Weblogic is writing logs to those
NFS mounts and all of my incoming and outgoing data files (100+ MB XML
files mostly) are located there along with logs of a Java daemon
process.

The Java daemon monitors the incoming data directory and automatically
"processes" the new files it finds by prepending a timestamp to the
name, moving it to an archive directory (on the same NFS mount), and
then parsing and working with the data in the XML files (mostly
reading and pushing it into an Oracle database).
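
To make the rename-and-archive step concrete, here is a rough sketch of
it. This is not the daemon's actual code: the class name, the timestamp
pattern, and the use of java.nio.file (Java 7+) are all assumptions made
for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ArchiveStep {
    // Assumed timestamp layout; the real daemon may use a different one.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Builds the archived file name by prepending a timestamp. */
    public static String stampedName(String original, LocalDateTime now) {
        return STAMP.format(now) + "_" + original;
    }

    /**
     * Moves the incoming file into the archive directory under its
     * timestamped name. Both paths are assumed to be on the same mount,
     * so an atomic rename is requested; the call throws if the
     * filesystem cannot do that.
     */
    public static Path archive(Path incomingFile, Path archiveDir, LocalDateTime now)
            throws IOException {
        String name = stampedName(incomingFile.getFileName().toString(), now);
        Path target = archiveDir.resolve(name);
        return Files.move(incomingFile, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```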

Every couple of months, the NFS mounts seem to drop out for a short
period of time (about 5 minutes or less). When it happens,
my Java daemon can no longer scan the filesystem for new files and it
never seems to recover even once the NFS mount is restored. And
sometimes I lose a Weblogic instance to boot. Most recently, I lost
the WL cluster admin instance AND a WL instance on the same box.

My network, OS and NetApp guys are all looking into why that is
happening (but it happens so infrequently that it is difficult to
investigate) and in the meantime, I need to find a way for my Java
processes to more gracefully recover from this. It is a real pain to
stop and restart this daemon process since we have ~10 instances
running on a variety of machines, and the production ones require
approvals to touch, etc.

Has anyone run into this? Does anyone have any specific suggestions or
advice? Obviously I can (and probably will) adjust the file scanner
code to try to catch this error and maybe throw away the File
object/handle and get a new one that might restore the connection to
the filesystem, but since that is basically already happening in the
daemon child process, I'm honestly unsure that will do much for me.
(Probably I will find there is a longer-lived File object somewhere
that is somehow trying to connect to the "old" filesystem and is in a
bad state, and so we're not really getting a "new" File object... but
I haven't dug deep into the code yet.)
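
The "throw away the File object and get a new one" idea could look
roughly like the sketch below. It is only a sketch under assumptions:
the class, the FileHandler callback, and the backoff numbers are
hypothetical and not taken from the real daemon. It relies on the
documented behaviour that File.listFiles() returns null when the path is
not a directory or an I/O error occurs. If the mount is a hard mount,
the call may simply block instead of returning, in which case this alone
would not help.

```java
import java.io.File;

public class ResilientScanner {
    /** Hypothetical callback for whatever the daemon does with each file. */
    public interface FileHandler {
        void handle(File f);
    }

    /** Runs one scan pass; returns false if the directory could not be listed. */
    public static boolean scanOnce(String dirPath, FileHandler handler) {
        File dir = new File(dirPath);    // fresh object on every pass, nothing cached
        File[] files = dir.listFiles();  // null: not a directory, or an I/O error
        if (files == null) {
            return false;
        }
        for (File f : files) {
            if (f.isFile()) {
                handler.handle(f);
            }
        }
        return true;
    }

    /** Doubles the delay after a failed pass, capped at max. */
    public static long nextDelay(long current, long max) {
        return Math.min(current * 2, max);
    }

    /** Keeps scanning until interrupted, backing off while the directory is unavailable. */
    public static void runForever(String dirPath, FileHandler handler,
                                  long baseMs, long maxMs) throws InterruptedException {
        long delay = baseMs;
        while (!Thread.currentThread().isInterrupted()) {
            boolean ok = scanOnce(dirPath, handler);
            delay = ok ? baseMs : nextDelay(delay, maxMs);  // reset on success
            Thread.sleep(delay);
        }
    }
}
```

The point of the design is that a failed pass is treated as a normal,
retryable outcome rather than an exception that kills the scanning
thread.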

Finally, if you have any suggestions on how I can effectively set up a
unit test for this disconnected filesystem so I can be certain that
I've fixed the problem, that would be appreciated too!
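
One possible way to unit test this without a real NFS outage is to hide
the directory listing behind a small interface and substitute a fake
that fails for a few calls and then recovers. The sketch below assumes
exactly that; DirectoryLister, FlakyLister, and poll are hypothetical
names, and the production implementation would presumably wrap
File.listFiles(). It only simulates a listing that fails and returns; it
cannot reproduce a call that hangs on a hard mount.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutageSimulation {
    /** Hypothetical seam; a null return means the listing failed. */
    public interface DirectoryLister {
        String[] list();
    }

    /** Fake lister that fails for the first N calls, then returns the given names. */
    public static class FlakyLister implements DirectoryLister {
        private int remainingFailures;
        private final String[] names;

        public FlakyLister(int failures, String... names) {
            this.remainingFailures = failures;
            this.names = names;
        }

        @Override
        public String[] list() {
            if (remainingFailures > 0) {
                remainingFailures--;
                return null;  // simulated outage
            }
            return names.clone();
        }
    }

    /** Polls the lister a fixed number of times, skipping failed passes. */
    public static List<String> poll(DirectoryLister lister, int passes) {
        List<String> seen = new ArrayList<>();
        for (int i = 0; i < passes; i++) {
            String[] names = lister.list();
            if (names == null) {
                continue;  // outage on this pass; try again on the next one
            }
            seen.addAll(Arrays.asList(names));
        }
        return seen;
    }
}
```

A test would then assert that the scanner picks files up again once the
fake stops failing, for example that two failed passes followed by one
good pass still yields the file.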

Thanks.
Wayne
