I've been running into a problem on and off for a while now, and I'm hoping someone here has some advice or suggestions...
For reasons unknown to me, my NFS mounts are occasionally unstable (this is on Solaris, and the NFS mounts are hosted on NetApp filers). Of course, WebLogic writes its logs to those NFS mounts, and all of my incoming and outgoing data files (mostly 100+ MB XML files) live there too, along with the logs of a Java daemon process.

The Java daemon monitors the incoming data directory and automatically "processes" each new file it finds: it prepends a timestamp to the name, moves the file to an archive directory (on the same NFS mount), and then parses the XML and works with the data (mostly reading it and pushing it into an Oracle database).

Every couple of months, the NFS mounts drop out for a short period (on the order of 5 minutes or less). When that happens, my Java daemon can no longer scan the filesystem for new files, and it never seems to recover, even once the NFS mount is restored. Sometimes I lose a WebLogic instance to boot. Most recently, I lost the WL cluster admin instance AND a WL instance on the same box.

My network, OS and NetApp guys are all looking into why this happens (but it happens so infrequently that it is difficult to investigate). In the meantime, I need to find a way for my Java processes to recover more gracefully. It is a real pain to stop and restart the daemon, since we have ~10 instances running on a variety of machines, and the production ones require approvals to touch, etc.

Has anyone run into this? Does anyone have any specific suggestions or advice? Obviously I can (and probably will) adjust the file scanner code to catch this error and maybe throw away the File object/handle and get a new one that might restore the connection to the filesystem. But since that is basically already happening in the daemon child process, I'm honestly unsure that will do much for me.
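
To make the scanner change concrete, here is a minimal sketch of what I have in mind. It is not the real daemon code: the class and method names (ResilientScanner, scanOnce, scanWithRetry) and the retry count and delay are made up for illustration. The one thing I am relying on from the Javadoc is that File.listFiles() returns null, rather than throwing, when an I/O error occurs or the path is not a directory. This would only help if the call actually fails during the outage; if it hangs instead (which I understand can happen on a hard mount), retrying won't save me.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical scanner helper; names are illustrative, not from the real daemon. */
class ResilientScanner {

    /** Lists the directory once, building a fresh File on every call. */
    static File[] scanOnce(String dirPath) throws IOException {
        File dir = new File(dirPath);    // new object each pass, no long-lived handle
        File[] files = dir.listFiles();  // null on I/O error or if not a directory
        if (files == null) {
            throw new IOException("Could not list " + dirPath);
        }
        return files;
    }

    /** Retries the scan with a fixed delay until it succeeds or attempts run out. */
    static File[] scanWithRetry(String dirPath, int maxAttempts, long delayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return scanOnce(dirPath);
            } catch (IOException e) {
                last = e;                // remember the failure and try again
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;                      // every attempt failed
    }
}
```

The idea is that the polling loop would call something like scanWithRetry(incomingDir, 5, 60000L) and, on final failure, log the problem and carry on to the next polling cycle instead of dying.
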
(Probably I will find there is a longer-lived File object somewhere that is somehow trying to connect to the "old" filesystem and is in a bad state, and so we're not really getting a "new" File object... but I haven't dug deep into the code yet.)

Finally, if you have any suggestions on how I can effectively set up a unit test for this disconnected filesystem so I can be certain that I've fixed the problem, that would be appreciated too!

Thanks.
Wayne
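
P.S. On the unit test question, the only approach I have come up with so far is to hide the directory listing behind a small interface and swap in a fake that fails a few times before "recovering". A rough sketch follows. Everything in it (OutageSimulation, DirectoryLister, FlakyLister, pollOnce) is hypothetical and does not exist in the daemon today, and it only simulates a listing that throws, not a real NFS outage or a call that hangs.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative names only; none of these types exist in the real daemon. */
class OutageSimulation {

    /** Seam that hides the real filesystem call so a test can swap in a fake. */
    interface DirectoryLister {
        List<String> list() throws IOException;
    }

    /** Fake that fails for the first N calls (the "outage"), then returns names. */
    static class FlakyLister implements DirectoryLister {
        private int failuresLeft;
        private final List<String> names;

        FlakyLister(int failures, List<String> names) {
            this.failuresLeft = failures;
            this.names = names;
        }

        @Override
        public List<String> list() throws IOException {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new IOException("simulated stale NFS mount");
            }
            return names;
        }
    }

    /** One polling pass: a failed listing means "nothing new", not a fatal error. */
    static List<String> pollOnce(DirectoryLister lister) {
        try {
            return lister.list();
        } catch (IOException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        DirectoryLister lister = new FlakyLister(2, Arrays.asList("incoming.xml"));
        for (int pass = 1; pass <= 3; pass++) {
            System.out.println("pass " + pass + ": " + pollOnce(lister));
        }
    }
}
```

Running main prints an empty list for the first two passes and [incoming.xml] on the third, which is the "recovers after the outage" behavior I would want a real test to assert.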
