If I'm understanding correctly, you're running into a problem with a TSM client node, not with your TSM server.
We have a similar setup, although our proxy nodes are RHEL and use NFS rather than CIFS. We have a pool of nine 10GbE-attached nodes that back up a variety of storage devices that are either too big to run a single backup schedule on (GPFS), or that we don't have a good backup client for (Isilon, BlueARC). In aggregate these systems inspect a bit over 250 million objects spread over ~2.5PB.

A few issues we've run into:

* Under high load, the storage servers can bog down and cause backups to run a day behind, but it's rarely a serious problem.

* The Linux dentry cache will get pinned and cause the system to run out of RAM. By echo'ing 3 into /proc/sys/vm/drop_caches occasionally we can work around this problem. Our original proxy nodes also had 12GB of RAM, but we've progressively bumped this up to 24GB and 48GB as we buy newer systems (RAM is cheap these days).

* The Linux NFS client is pretty poor, and there are performance problems when stat'ing lots of files, even on separate filesystems. This appears to us to be a context-switch issue, so we try to keep the number of simultaneous backups below the number of CPUs each proxy node has.

* The atomic unit of parallelization in the TSM world is the filespace, not the filesystem. By working with end users before we start doing backups, we can find ways to divvy up each filesystem (the smallest are in the hundreds of TB, and one ranges past a PB) into multiple filespaces that we can mount separately in /etc/fstab. With judicious use of -domain statements in schedules, we can assign different filespaces to different schedules that all process the same filesystem, and still get decent parallelization.

For your particular problem, I would see if you can figure out where the bottleneck is. Is it data throughput? Metadata latency? Locking within Windows itself? Contention on the TSM server side (network throughput, DB, mount limits, etc.)?
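For what it's worth, the drop_caches workaround can be automated from cron. This is only a sketch; the hourly interval is an assumption, and value 3 drops pagecache plus dentries/inodes, which is the heaviest option:

```
# Hypothetical /etc/crontab entry: once an hour, sync dirty pages,
# then drop pagecache, dentries, and inodes so a pinned dentry
# cache can't exhaust RAM on the proxy node.
0 * * * * root /bin/sync && /bin/echo 3 > /proc/sys/vm/drop_caches
```

Echoing 1 instead of 3 drops only the pagecache, which is gentler if you find 3 hurts warm-cache performance.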
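To make the filespace-splitting idea concrete, here's a hedged sketch; the exports, mount points, and schedule names are all hypothetical, and you'd adapt the syntax to your own server. Each subtree of the big filesystem gets its own mount (and therefore its own filespace), and each client schedule is pointed at a subset via -domain:

```
# /etc/fstab on the proxy node: mount subtrees of one huge
# filesystem separately so each becomes its own TSM filespace
bigfiler:/export/projects/a  /backup/projects-a  nfs  ro,hard  0 0
bigfiler:/export/projects/b  /backup/projects-b  nfs  ro,hard  0 0

# On the TSM server: one client schedule per filespace, each
# overriding the domain so they can run in parallel
DEFINE SCHEDULE STANDARD PROJ_A OPTIONS="-domain=/backup/projects-a"
DEFINE SCHEDULE STANDARD PROJ_B OPTIONS="-domain=/backup/projects-b"
```

Staggering the schedules' start times (as you're already doing) then spreads the metadata scans across the available CPUs.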
In the UNIX world, "strace -ttf" is a useful tool in that it will print the latency of every system call made by a process. Failing that, TSM client tracing can give the same information, albeit with much more cruft around the timing.

On 08/16/13 08:29, Zoltan Forray wrote:
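As a hedged sketch of that strace technique (the traced command and paths here are just stand-ins for your real scan): -tt prints microsecond timestamps, -f follows forked children, and -T appends the time spent inside each call, which makes slow stat()s easy to spot.

```shell
# Hypothetical sketch: find where a metadata-heavy scan spends time.
if command -v strace >/dev/null 2>&1; then
    # Trace a recursive listing as a stand-in for the backup scan
    strace -ttfT -o /tmp/scan.trace ls -lR /etc >/dev/null 2>&1
    # Rank the slowest syscalls by their trailing <duration> field
    grep -o '<[0-9.]*>$' /tmp/scan.trace | tr -d '<>' | sort -rn | head
else
    echo "strace not installed"
fi
```

If the slow calls cluster on stat()/lstat() against the network mount, you're looking at metadata latency rather than data throughput.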
We are starting to experience performance issues on a server that acts as the "head" for multiple (31 currently) TSM nodes. This server CIFS-mounts multiple departmental filesystems, all on various EMC SANs. Each filesystem is a different TSM node. The "head" server is running Windows 2012 Server with 12GB RAM and two quad-core processors.

Is anyone out there doing something like this? What are the realistic limits? I have tried spreading the backup start times as much as I can. As expected, a lot of the time is spent scanning files; one node has >10M files.

Thoughts? Comments? Suggestions?

--
*Zoltan Forray*
TSM Software & Hardware Administrator
Virginia Commonwealth University
UCC/Office of Technology Services
[email protected] - 804-828-4807
Don't be a phishing victim - VCU and other reputable organizations will never use email to request that you reply with your password, social security number or confidential personal information. For more details visit http://infosecurity.vcu.edu/phishing.html
--
-- Skylar Thompson ([email protected])
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine
