Hi Everyone, Thanks for the thoughts/comments.
> I was told the only reason EMC recommends to turn off collocation is that
> collocation on shoots up the individual volume count

Generally yup, that's my understanding - more volumes, more reclamation.

> They also recommend a relatively high reclamation threshold. I think these
> 2 factors together might end up in a lot of wasted unreclaimed space. I
> think it would be ok if you were more aggressive with your reclamation.
> Something to keep an eye on at the least.

They suggested a reclamation threshold of 30%; we are using 35% - fairly aggressive reclamation. I'm not too worried about throughput issues with reclamation: we are implementing 10G Ethernet connections from the TSM servers into the DD. We are not currently replicating between these new DDs, since we don't have enough network between our datacenters. Once a higher-speed (dedicated) network is in place for DD replication, we can get rid of our tape copy pools, which will greatly lower the traffic from the DDs.

> On another note, I've always been suspicious of whether or not granular
> analysis like this is accurate. Using collocation to identify "bad dedupe
> citizens" sounds reasonable, but only if the values being returned by the
> "filesys show compression" command are accurate. Is that data dynamically
> updated? Are the individual file deduplication ratios immediately updated
> automatically as data is written or cleaned?

So far, one of our new DDs has 215 TB stored on 48 TB of disk. I've been testing nodes whose occupancy is greater than 10 TB to see what kind of dedup they get. I created a DD-based file pool and made it a next pool behind our existing tape pool, set to never migrate. I then run MOVE NODEDATA from the tape pool into the file pool (and back), let about 500 GB get moved, and then check what dedup the file pool is getting. So far the DD responds with stats (filesys show compression /path/to/dir) that reflect the immediate status of the directory.
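For a sense of scale, the overall factor on that box falls straight out of the logical (pre-comp) and physical (post-comp) sizes - a quick back-of-the-envelope check using the figures above:

```python
# Back-of-the-envelope check of the overall factor on the DD described
# above: ~215 TB of logical (pre-comp) data resting on ~48 TB of disk
# (post-comp). The total compression factor is just their ratio.

pre_comp_tb = 215   # logical data written to the DD
post_comp_tb = 48   # physical disk actually consumed

total_comp_factor = pre_comp_tb / post_comp_tb
print(f"total compression factor: {total_comp_factor:.1f}x")  # ~4.5x
```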
It seems to gather the stats right when the command is run. The manual indicates that this command can take quite some time if there are many files to be looked at under a directory.

> I find it hard to believe this is dynamically maintained by the Data
> Domain, but I'd definitely want to know before switching to collocation
> for this purpose.

Dynamically maintained - no, I don't think so for an arbitrary directory or file. I think it computes the stats when the command is run, hence the warning in the manual about how long it can take.

> Deduplication adds an abstraction layer between the file metadata and the
> actual storage. I don't see how you could really get an accurate picture
> of the true storage an individual file is occupying, since it is sharing
> space. Say there are 10x 100MB files sharing 50 percent of their data
> with each other. How much space is one of those files occupying?

Our concern is that the DD is a limited resource - how much disk it has. If some node dumps a bunch of backups that don't dedup, that could greatly affect the disk available. A couple of examples:

1) Audio files - we have a node that records audio which is kept for 2 years. It currently has a TSM occupancy of 48 TB. We expected it wouldn't be a good fit for a DD, and sure enough, when I put some of its backups onto the DD I got almost no dedup/compression.

2) Notes mail backups - we use Notes for email. It's our understanding that the mailboxes are already compressed somehow by Notes; if you take one and gzip it, it doesn't shrink much. We figured it was another bad fit for the DD. To our surprise, when I put part of the backups on the DD it deduped at a 5x ratio. It turns out this is a good fit for the DD.

What we are worried about is a client that starts sending backups that shouldn't be on the DD. Also, we can't test every node for a DD fit. If TSM occupancy stays static but DD disk usage starts to grow, how will we tell which node or nodes are the problem?
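One way to answer that question would be a report correlating each node's TSM occupancy with its measured DD footprint. A hedged sketch - all node names and numbers below are hypothetical (the first two ratios mirror the audio and Notes examples above); real occupancy would come from the TSM server and real DD usage from "filesys show compression" against a node's collocated volumes:

```python
# Hedged sketch of a per-node dedup report: TSM occupancy vs. the DD
# disk a node actually consumes, with a computed ratio. Figures are
# hypothetical; ascending sort puts the worst dedup citizens first.

rows = [
    # (node, tsm_occupancy_gb, dd_disk_gb)
    ("AUDIO_NODE", 49152, 47500),   # ~48 TB, almost no dedup
    ("NOTES_MAIL", 10240, 2050),    # ~5x dedup
    ("FILESRV1",   8192,  900),     # ~9x dedup
]

report = sorted((occ / dd, node, occ, dd) for node, occ, dd in rows)

print(f"{'NODE':<12}{'OCC_GB':>8}{'DD_GB':>8}{'RATIO':>7}")
for ratio, node, occ, dd in report:
    print(f"{node:<12}{occ:>8}{dd:>8}{ratio:>7.1f}")
```

A node whose ratio sits near 1.0 while holding tens of terabytes is the kind of candidate that should probably stay on tape.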
We would see it as the DD filling disk, but that may have little relation to the TSM occupancy of any one node. We can easily see having to dig for nodes that should really be on tape (our tape system isn't going away any time soon).

So, after reading everyone's replies, we are going to collocate our DD file pool, and I'm going to try to create a report listing our nodes, their occupancy, DD disk used, and a computed ratio. We'll see . . . this may crash and burn . . .

Thanks!
Rick
