My advice would also be to look into this document from IBM:
*Blueprint and Server Automated Configuration for Linux x86*: three system configurations with exact specs and deduplication backup/restore performance figures for the two larger ones; small-config performance should be available later this year.

On Mon, Dec 9, 2013 at 7:30 PM, Prather, Wanda <[email protected]> wrote:

> Hi Sergio,
>
> It took me a while to get back to this, but it's important and I see site after site falling into a mudhole with disk purchases.
>
> This is both an "it depends" and a "what not to do" answer. And I hope we can get some other people to chime in with their methodologies.
>
> What not to do:
>
> The problem with calling disk vendors: if you tell them you need 400 TB, they will assume you are shopping around and bid you the lowest-priced thing they have at 400 TB; or, if they think you have money and you don't know anything, they will bid you the highest-priced thing they think they can get away with. The first will not do the job, and since you asked about midrange I'm assuming you don't want to solve the problem by going to 400 TB of the fastest disk money can buy, which would work but be expensive.
>
> What you seriously have to do is:
>
> 1. Define your size *and* your throughput requirements, and
> 2. Put them in writing, and tell the vendor you will not accept delivery of the box until *after* it meets a throughput test. Make sure you hold them to it, and that they agree to take the box back if it fails to meet the throughput.
>
> I've seen failure to do those two critical things create nightmares many times (for sites who call me in after the fact to solve their TSM performance problems).
>
> Vendors *will* do this. My commercial customers *do* demand it, and get it. If you find a vendor who won't, or pretends it doesn't matter, don't talk to them, because you aren't talking to anybody who understands the technical issues.
> (And yes, there are many folks in disk sales in the Mid-Atlantic who don't know what they are doing. I know some of their names.) Demand a pre-sales conference with the engineers, not just the salesmen.
>
> The mid-range and low-end disk market has many, many options now. Performance depends on how much cache you have, the firmware, and how many spindles you have spinning, not just the type of RAID anymore.
>
> I do not sell hardware or software. But I have worked with vendors who will take your performance requirements and configure the box to meet the throughput you need, which is what you have to do. (If you need a contact in MD, I can give you one.)
>
> Now some "it depends", with an illustration:
>
> In TSM 6.3.4, server-end dedup is incredibly I/O intensive.
>
> I have one customer using a very powerful Windows box to do server-end dedup of 3.5-4 TB of TSM/VE backups per day (tiny blocks, 3.8-4.0:1 dedup). They have a DS35xx disk array (which is incredibly affordable). As originally configured (by a crappy vendor) with one controller, that DS35xx array would do at most 40 MB/second. We beefed up that array by adding a controller and disks, and upgrading to XIV-like DDP firmware (striped all ways, all disks spinning all the time). Now that little inexpensive box will do 10,000 I/Os per second, 400+ MB per second. We got a 10-fold improvement in throughput for relatively little $$.
>
> So if you are talking server-end dedup, start with the amount of data you have incoming each day, mumblety terabytes. Consider that you have to land it on disk. Then you have to read it again for the BACKUP STGPOOL. Then you have to read it again for the IDENTIFY DUPLICATES process. Then you have to read it again for the reclaim/dedup process, and write the resulting deduped blocks back out (25% of the data for a 4:1 dedup).
> And oh, by the way, if there is tape reclaim involved, read some of it again, and there will be lock conflicts that slow that process. And if you replicate, do it again (post-dedup would be 25% at 4:1). And if your server is a replication target, include that I/O as well. And that doesn't include your TSM DB I/O.
>
> So I have no documented rules of thumb, but it seems to me that for every 1 TB of data coming in, I'd assume at least 4 TB of I/O just for the data, not including replication or the TSM DB I/O.
>
> So assume you have 2 TB coming in per day.
> Multiply by 4: 8 TB * 1024 * 1024 = 8,388,608 megabytes.
> Assume you want to get everything done in 16 hours: 16 * 60 * 60 = 57,600 seconds.
> That means you need disk that will sustain about 145 megabytes per second (not including the I/O to the TSM DB).
>
> That's very doable with midrange disk, but as illustrated above, *it matters* how it's configured.
>
> That's just an example for a server-end dedup case. I would be interested in hearing from anybody else what methodology they use to figure this out, or any other ROT.
>
> Other things you could do:
>
> 1) Dedup on the client end. I don't have any numbers on that.
> 2) TSM 7.1 is advertising a 10-fold improvement in dedup throughput. At this point I have no idea what that means or what is required to achieve it. Anybody got numbers or information about how it works?
>
> Another thing you have to consider: when you ask vendors about throughput on a disk array, you have to ask "for how many concurrent processes?"
>
> 1. If you are trying to back up one big SAP DB, for example, with one session, what you care about is the throughput of a single process.
> 2. If you are backing up many small clients at once, plus doing backup stgpool and dedup, what you (usually) care about is not the throughput for a single process, but the total throughput when many processes are running at once.
> Most disk arrays get more throughput for case 2 than for case 1. Be sure you specify which case you are asking about when you give the vendor your throughput requirements. And again, don't assume that the disk salesman has any idea what you are talking about. Talk to the engineers, and SPECIFY the case you will use for your throughput test.
>
> My recommendations:
>
> * My personal favorite for mid-range disk is the V7000 with the XIV-like DDP firmware. There is a low-end, inexpensive version and a higher-end version. The cool thing is that you can improve performance as you grow by adding spindles.
>
> * Test. If you don't have dedup now, set up a small pool and play with it, so you get a feel for the lifecycle.
>
> Wanda
>
> -----Original Message-----
> From: ADSM: Dist Stor Manager [mailto:[email protected]] On Behalf Of Sergio O. Fuentes
> Sent: Wednesday, November 13, 2013 10:32 AM
> To: [email protected]
> Subject: [ADSM-L] TSM Dedup stgpool target
>
> In an earlier thread, I polled this group on whether people recommend going with an array-based dedup solution or doing a TSM dedup solution. Well, the answers came back mixed, obviously with an 'it depends'-type clause.
>
> So, moving on... assuming that I'm using TSM dedup, what sort of target arrays are people putting behind their TSM servers? Assume here, also, that you'll be having multiple TSM servers, another backup product (*cough* Veeam), and potentially having to do backup stgpools on the dedup stgpools. I ask because I've been barking up the mid-tier storage array market as our potential disk-based backup target, simply because of the combination of cost, performance, and scalability. I'd prefer something that is dense, i.e., more capacity in less footprint, and can scale up to 400 TB. It seems like vendors get disappointed when you're asking for a 400 TB array with just SATA disk simply for backup targets.
> None of that fancy array intelligence like auto-tiering, large caches, replication, dedup, etc. is required.
>
> Is there another storage market I should be looking at, i.e., really dumb RAID arrays, direct-attached, NAS, etc.?
>
> Any feedback is appreciated, even the 'it depends' type.
>
> Thanks!
> Sergio
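The I/O passes Wanda enumerates for server-end dedup can be tallied in a quick sketch. The pass list and the 25% write-back figure come from her post; the dictionary layout and function name here are just illustrative, and TSM DB, tape-reclaim, and replication I/O are excluded, as she notes:

```python
# Per-TB disk I/O passes for server-end dedup, as enumerated in the post.
# Multipliers are illustrative; TSM DB, tape reclaim, and replication
# I/O are deliberately excluded.
PASSES = {
    "land incoming data (write)":          1.00,
    "BACKUP STGPOOL (read)":               1.00,
    "IDENTIFY DUPLICATES (read)":          1.00,
    "reclaim/dedup (read)":                1.00,
    "write deduped blocks (25% at 4:1)":   0.25,
}

def io_per_tb_ingested():
    """Total TB of disk I/O generated per TB of incoming backup data."""
    return sum(PASSES.values())

print(io_per_tb_ingested())  # 4.25, close to the "at least 4 TB" rule of thumb
```

Adding a replication pass or a copy to tape just means another entry in the table, which is why the post calls 4x a floor rather than a ceiling.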

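Wanda's sizing arithmetic (2 TB/day, a 4x I/O multiplier, and a 16-hour batch window yielding roughly 145 MB/s) generalizes to a one-line estimator. The function name and default parameters below are mine; the figures are from the post:

```python
def required_throughput_mb_per_s(ingest_tb_per_day, io_multiplier=4, window_hours=16):
    """Sustained disk throughput needed to push the nightly dedup
    workload through the batch window (TSM DB I/O not included)."""
    total_mb = ingest_tb_per_day * io_multiplier * 1024 * 1024  # TB -> MB
    window_seconds = window_hours * 60 * 60
    return total_mb / window_seconds

# Worked example from the post: 2 TB/day, everything done in 16 hours
print(round(required_throughput_mb_per_s(2)))  # 146, i.e. the ~145 MB/s figure in the post
```

This is the number to put in writing for the vendor's throughput acceptance test, along with the concurrency case (single stream vs. many sessions) it must be measured under.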