Hi All - long time since I have posted here, been lurking though :-). After reading this you may prefer I crawl back under my rock.

I have a number of sites running on versions of IB 7.x. We get a fair array of performance problems, and admins letting servers run out of space, not backing up, and so on, but we have a couple of sites that have gnarly problems and one that I want to ask about.

Server = MS Windows Server 2003. The server abends (-1) every so often, e.g. it will go weeks without failing and then suddenly abend a couple of times. It seems to come in waves and I can't see a pattern. We have another problem in that there are 2 levels of admins: those onsite who can do some things on the server, and others offsite who seem to pull levers as they see fit (e.g. AV / backup).

So I am looking for causes and diagnostic paths to a fix. This is an end-of-life legacy system, so no massive budgets to re-write software / replace with a different version of IB/FB. I am fairly well versed in setting up, maintaining and supporting IB/FB, but am scratching my head with it at the moment.

Usage pattern:
- 20 - 50 concurrent users
- very few users at night, but it is a 24 x 7 x 29 operation (they backup and restore once a month)
- some bots reading (automated users polling for changes / populating 3rd party DBs and kit)
- some bots writing (need to confirm this though)

Failure pattern:
- not proportionate to usage, e.g. a lot of failures appear to happen in the middle of the night
- tend to cluster, e.g. 2 or 3 in a couple of hours
- some white outs (when the server has not crashed, but is so slow to respond that users get admins to restart the service)

What we have tried:
- excluding AV from the DB folder / IB exe folder / temp folder
- ensuring enough space on drives
- excluding backup from the DB folder
Gut feel is that it is a non-IB process accessing a file IB cares about (the gdb) whilst IB is accessing it. The trouble is we don't control the whole server.

By looking at the event log for the login events we identified a 95% correlation between a system user (backupUserTest, of all users) logging in and the system failing. Consensus was that this may be caused by a rogue process / virus; it seemed to log on the split second the crash happened. This could also have been an effect and not the cause (e.g. the process crashes and a Dr Watson type process is created and authenticated to do some housekeeping).

Next was HW. It was running on very old kit, so we VM'd it and put it on a reliable box. Still it has crashes, not sure about the frequency.

I had wanted to use sysinternals to monitor the GDB file to see if anyone else messed with it, but it was felt that this would be too much of a performance hit. Maybe they will reconsider this now.

Now I am working on a sysinternals command to scrape stats about:
1) Who is logged in
2) Monitor the temp folder
3) Get a snapshot of process / CPU / disk queue / RAM usage etc
4) File monitor on the temp folder and the gdb file

Have these logged to a file and deleted every few hours (I would imagine that they would grow pretty quickly). When the server stops, I think you can run a command to rename all the log files, so that you keep the last period prior to the crash.

Does this seem sensible? Any pointers on tools / techniques which may enhance this?

Also, I wanted to use a SQL monitor to scrape all TCP/IP interaction, but some of the very old robot processes are DOS / Java / Delphi, may use local connections, and also run from this same machine. I look at the robots as a potential source of the problem as they are active all the time. I didn't consider IB's performance monitoring, as the chance of catching the problem at the point of failure would be very small.
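For the rolling logs, here is a rough Python sketch of what I have in mind. It is only an illustration under assumptions, nothing like this is in place yet: it assumes a Python 3 interpreter can be installed on the box, it uses the stock Windows commands qwinsta, tasklist and netstat as stand-ins for the sysinternals tools, and the folder names and the service name you pass in are placeholders. Old logs are deleted after a few hours, and when the service is seen to drop the current logs are moved aside so they survive.

```python
import os
import shutil
import subprocess
import time

# Assumed snapshot commands (stock Windows tools, not sysinternals).
SNAPSHOT_COMMANDS = [
    ["qwinsta"],          # who is logged in (sessions)
    ["tasklist", "/V"],   # per-process memory and CPU time
    ["netstat", "-ano"],  # connections with owning PID
]


def collect_snapshot(commands=SNAPSHOT_COMMANDS):
    """Run each command and return all output as one text block."""
    parts = []
    for cmd in commands:
        try:
            out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
            text = out.decode("ascii", "replace")
        except (OSError, subprocess.CalledProcessError) as exc:
            text = "FAILED: {0}".format(exc)
        parts.append("--- {0}\n{1}".format(" ".join(cmd), text))
    return "\n".join(parts)


def append_snapshot(log_dir, text, now):
    """Append one snapshot to the log file for the current hour."""
    os.makedirs(log_dir, exist_ok=True)
    name = time.strftime("snap-%Y%m%d-%H.log", time.localtime(now))
    path = os.path.join(log_dir, name)
    with open(path, "a") as fh:
        fh.write("=== {0}\n{1}\n".format(time.ctime(now), text))
    return path


def prune_old_logs(log_dir, max_age_seconds, now):
    """Delete rolling logs last written more than max_age_seconds ago."""
    removed = []
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed


def freeze_logs(log_dir, keep_dir, now):
    """Move the current rolling logs aside so pruning cannot delete them."""
    stamp = time.strftime("crash-%Y%m%d-%H%M%S", time.localtime(now))
    target = os.path.join(keep_dir, stamp)
    os.makedirs(target)
    for name in os.listdir(log_dir):
        src = os.path.join(log_dir, name)
        if os.path.isfile(src):
            shutil.move(src, os.path.join(target, name))
    return target


def service_running(service_name):
    """True if `sc query` reports the service as RUNNING."""
    try:
        out = subprocess.check_output(["sc", "query", service_name])
    except (OSError, subprocess.CalledProcessError):
        return False
    return b"RUNNING" in out


def watch(log_dir, keep_dir, service_name, interval=30, max_age=4 * 3600):
    """Snapshot forever; freeze the logs each time the service drops."""
    was_up = True
    while True:
        now = time.time()
        append_snapshot(log_dir, collect_snapshot(), now)
        up = service_running(service_name)
        if was_up and not up:
            freeze_logs(log_dir, keep_dir, now)
        else:
            prune_old_logs(log_dir, max_age, now)
        was_up = up
        time.sleep(interval)
```

It would be started with something like watch(r"D:\iblogs\rolling", r"D:\iblogs\kept", "InterBaseServer"), where both paths and the service name are guesses to be checked against the real install. One weakness I can see: if the Guardian restarts ibserver faster than the polling interval, the service check will never see it down, so watching interbase.log for a new abend entry may be a better trigger.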
Finally, to exclude all external influence I had thought of:
1) Setting up a new OS user for IB
2) Giving this user permissions on the IB progs / gdb path / backup path / external tables path / dll path / temp folder
3) Removing access for all other users (including SYSTEM) from the gdb path to begin with

Does this seem feasible / sensible?

As crashes are intermittent (e.g. up to 3 weeks between failures), whatever I put in place will need to run for a long time and not require a lot of daily management.

Sorry for the long waffle, but we have been struggling with this one for a long time. Normally I can completely isolate the servers, even put them behind a physical NAT router with specific ports open to remove all third party variables, but this server sits in some data centre somewhere.
