Thanks for the suggestion, but that isn't it. We already tried this.
We did "find . | wc -l" to get the object count (1.1M) with no problems. But the backup still will not work. It constantly fails, in unpredictable/inconsistent places, with the same "Producer Thread" error.

I spent 2+ days drilling through the various subdirectories (of the directory that causes the failures), one by one, and was able to back up 38 of the 40 subdirs, totaling over 980K objects, without a problem. When I included the two remaining directories in the same pile, the backup would fail. When I then went back and individually selected the sub-sub directories of these two subdirectories (one at a time), I was able to back up *ALL* of the sub-sub directories, no problem. Then I went back, selected the upper-level directory, and backed it up, no problem.

Let me draw a picture of the structure of these directories. The problem directories are under /coyote/dsk3/patients/prostateReOpt/Mount_0/ . If I try to back up /Mount_0/ as a whole, it crashes every time. If I point to the subdirs below /Mount_0/ (40 of these, all with the same four named sub-sub dirs), two of them cause a crash. I noted that these two both have >72K objects while the other 38 have fewer than 60K objects. Yet when I manually picked the four sub-sub dirs of the Patient_172 dir, the backup worked (sort of - see below). Same for Patient_173.

To really drive me crazy, the first attempt at backing up one of the sub-sub dirs under Patient_172 crashed, yet I could back up the other three with no issue. So we started looking at the problem subdir and noticed a weird file name that ended in a tilde (~). When I excluded it, the backup ran. Then when I went back and picked just the file with the tilde, it backed up fine (my head is getting balder and balder!!). I then went back, re-selected the whole Patient_172 directory, and it backed up (or at least scanned it, since everything was already backed up) just fine!! ARRRGGHH!!
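For what it's worth, the per-subdir object counts came from a loop along these lines (a minimal sketch; the 60K flag threshold and the Patient_* glob are my own assumptions from what I observed, not anything official):

```shell
#!/bin/sh
# Sketch: count objects under each patient subdir of Mount_0 and flag
# the ones above ~60K objects, since the two crashing dirs both had >72K.
BASE=/coyote/dsk3/patients/prostateReOpt/Mount_0

for d in "$BASE"/Patient_*/; do
    # find lists the dir itself plus every file/subdir beneath it
    count=$(find "$d" 2>/dev/null | wc -l)
    if [ "$count" -gt 60000 ]; then
        flag=" <-- large"
    else
        flag=""
    fi
    printf '%8d  %s%s\n' "$count" "$d" "$flag"
done
```

Running this made the two oversized directories stand out immediately, though as noted above, size alone doesn't explain the crashes.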
This is maddening and shows no rhyme or reason.

Henk ten Have <[EMAIL PROTECTED]>
Sent by: "ADSM: Dist Stor Manager" <[email protected]>
04/01/2005 08:29 AM
Please respond to "ADSM: Dist Stor Manager" <[email protected]>
To: [email protected]
cc:
Subject: Re: [ADSM-L] Large Linux clients

An old trick I used for many years: to investigate a "problem" filesystem, do a "find" in that filesystem. If the find dies, TSM definitely will die. I'll bet your find will die, and that's why your backup will die/hang or whatever also. A find will do a file stat on all files/dirs, actually the same as the backup does. So your issue is OS-related and not TSM.

Cheers
Henk

On Tuesday 29 March 2005 12:11, you wrote:
> On Mar 29, 2005, at 12:37 PM, Zoltan Forray/AC/VCU wrote:
> > ...However, when I try to back up the tree at the third level (e.g.
> > /coyote/dsk3/), the client pretty much seizes immediately and
> > dsmerror.log says "B/A Txn Producer Thread, fatal error, Signal 11".
> > The server shows the session as "SendW" and nothing else going on....
>
> Zoltan -
>
> Signal 11 is a segfault - a software failure.
> The client programming has a defect, which may be incited by a problem
> in that area of the file system (so have that investigated). A segfault
> can be induced by memory constraint, which in this context would most
> likely be Unix Resource Limits, so also enter the command 'limit' in
> Linux csh or tcsh and potentially boost the stack size ('unlimit
> stacksize'). This is to say that the client was probably invoked under
> artificially limited environmentals.
>
> Richard Sims
