Hi Everyone, Thanks for the thoughts/comments.
> I was told the only reason EMC recommends to turn off collocation is that
> collocation on shoots up the individual volume count

Generally yup, that's my understanding - more volumes, more reclamation.

> They also recommend a relatively high reclamation threshold. I think these
> 2 factors together might end up in a lot of wasted unreclaimed space. I
> think it would be ok if you were more aggressive with your reclamation.
> Something to keep an eye on at the least.

They suggested a reclamation threshold of 30%; we are using 35% - fairly aggressive reclamation. I'm not too worried about throughput issues with reclamation: we are implementing 10G Ethernet connections from the TSM servers into the DD. We are not currently replicating between these new DDs, since we don't have enough network between our datacenters. Once a higher-speed (dedicated) network is in place for DD replication, we can get rid of our tape copy pools, which will greatly lower the traffic from the DDs.

> On another note, I've always been suspicious of whether or not granular
> analysis like this is accurate. Using collocation to identify "bad dedupe
> citizens" sounds reasonable, but only if the values being returned by the
> "filesys show compression" command are accurate. Is that data dynamically
> updated? Are the individual file deduplication ratios immediately updated
> automatically as data is written or cleaned?

So far, one of our new DDs has 215 TB stored on 48 TB of disk. I've been testing nodes whose occupancy is greater than 10 TB to see what kind of dedup they get. I created a DD-based file pool and made it a next pool behind our existing tape pool, set to never migrate. I then run MOVE NODEDATA from the tape pool into the file pool (and back), let about 500 GB get moved, and then check what dedup the file pool is getting. So far the DD responds with stats (filesys show compression /path/to/dir) that reflect the immediate status of the directory.
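For a sense of scale, the overall factor on that box falls straight out of the logical (pre-comp) and physical (post-comp) sizes - a quick back-of-the-envelope check using the figures above:

```python
# Back-of-the-envelope check of the overall factor on the DD described
# above: ~215 TB of logical (pre-comp) data resting on ~48 TB of disk
# (post-comp). The total compression factor is just their ratio.

pre_comp_tb = 215   # logical data written to the DD
post_comp_tb = 48   # physical disk actually consumed

total_comp_factor = pre_comp_tb / post_comp_tb
print(f"total compression factor: {total_comp_factor:.1f}x")  # ~4.5x
```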
It seems to gather the stats right when the command is run. The manual indicates that this command can take quite some time if there are many files to be looked at under a directory.

> I find it hard to believe this is dynamically maintained by the Data
> Domain, but I'd definitely want to know before switching to collocation
> for this purpose.

Dynamically maintained - no, I don't think so for an arbitrary directory or file. I think it computes the stats when the command is run, hence the warning in the manual about how long it can take.

> Deduplication adds an abstraction layer between the file metadata and the
> actual storage. I don't see how you could really get an accurate picture
> of the true storage an individual file is occupying, since it is sharing
> space. Say there are 10x 100MB files sharing 50 percent of their data
> with each other. How much space is one of those files occupying?

Our concern is that the DD is a limited resource - how much disk it has. If some node dumps a bunch of backups that don't dedup, that could greatly affect the disk available. A couple of examples:

1) Audio files - we have a node that records audio which is kept for 2 years. It currently has a TSM occupancy of 48 TB. We expected it wouldn't be a good fit for a DD, and sure enough, when I put some of its backups onto the DD I got almost no dedup/compression.

2) Notes mail backups - we use Notes for email. It's our understanding that the mailboxes are already compressed somehow by Notes; if you take one and gzip it, it doesn't shrink much. We figured it was another bad fit for the DD. To our surprise, when I put part of the backups on the DD it deduped at a 5x ratio. It turns out this is a good fit for the DD.

What we are worried about is a client that starts sending backups that shouldn't be on the DD. Also, we can't test every node for a DD fit. If TSM occupancy stays static but DD disk usage starts to grow, how will we tell which node or nodes are the problem?
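One way to answer that question would be a report correlating each node's TSM occupancy with its measured DD footprint. A hedged sketch - all node names and numbers below are hypothetical (the first two ratios mirror the audio and Notes examples above); real occupancy would come from the TSM server and real DD usage from "filesys show compression" against a node's collocated volumes:

```python
# Hedged sketch of a per-node dedup report: TSM occupancy vs. the DD
# disk a node actually consumes, with a computed ratio. Figures are
# hypothetical; ascending sort puts the worst dedup citizens first.

rows = [
    # (node, tsm_occupancy_gb, dd_disk_gb)
    ("AUDIO_NODE", 49152, 47500),   # ~48 TB, almost no dedup
    ("NOTES_MAIL", 10240, 2050),    # ~5x dedup
    ("FILESRV1",   8192,  900),     # ~9x dedup
]

report = sorted((occ / dd, node, occ, dd) for node, occ, dd in rows)

print(f"{'NODE':<12}{'OCC_GB':>8}{'DD_GB':>8}{'RATIO':>7}")
for ratio, node, occ, dd in report:
    print(f"{node:<12}{occ:>8}{dd:>8}{ratio:>7.1f}")
```

A node whose ratio sits near 1.0 while holding tens of terabytes is the kind of candidate that should probably stay on tape.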
We would see it as the DD filling disk, but that may have little relation to the TSM occupancy of any one node. We can easily see having to dig for nodes that should really be on tape (our tape system isn't going away any time soon).

So, after reading everyone's replies, we are going to collocate our DD file pool, and I'm going to try to create a report listing our nodes, their occupancy, DD disk used, and a computed ratio. We'll see . . . this may crash and burn . . .

Thanks!
Rick
