On Tue, Feb 10, 2009 at 2:22 AM, Allen Wittenauer <[email protected]> wrote:
>
> The key here is to prioritize your data.  Impossible to replicate data gets
> backed up using whatever means necessary, hard-to-regenerate data, next
> priority. Easy to regenerate and ok to nuke data, doesn't get backed up.
>

I think that's good advice to start with when creating a backup strategy.
For example, what we do at the moment is analyze huge volumes of access
logs: we import the logs into HDFS, create aggregates for several
metrics, and finally store the results in SequenceFiles using
block-level compression. It's kind of an intermediate format that can
be used for further analysis. Those files end up being pretty small;
they are exported daily to storage and backed up. In case HDFS goes to
hell we can restore some raw log data from the servers and only lose
historical logs, which should not be a big deal.
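To make the last step concrete, here is a minimal sketch of how such a job
could be set up with the old mapred API of that era. The class name, the
identity mapper/reducer, and the key/value types are placeholders for the
real aggregation logic, not our actual code; the point is just the
block-compressed SequenceFile output.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.compress.GzipCodec;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class LogAggregationJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(LogAggregationJob.class);
      conf.setJobName("access-log-aggregation");

      // Placeholders: the real job plugs in mapper/reducer classes that
      // compute the per-metric aggregates with their own key/value types.
      conf.setMapperClass(IdentityMapper.class);
      conf.setReducerClass(IdentityReducer.class);
      conf.setOutputKeyClass(LongWritable.class);
      conf.setOutputValueClass(Text.class);

      // Store the results as block-compressed SequenceFiles, the
      // intermediate format described above.
      conf.setOutputFormat(SequenceFileOutputFormat.class);
      FileOutputFormat.setCompressOutput(conf, true);
      SequenceFileOutputFormat.setOutputCompressionType(conf,
          SequenceFile.CompressionType.BLOCK);
      FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }
  }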

I must also add that I really enjoy the optimization opportunities
Hadoop gives you by letting you implement the serialization strategies
directly. You really get control over every bit and byte that gets
recorded. Same with compression. So you can make the best trade-offs
possible and in the end store only the data you really need.
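As an illustration of that kind of control, here is a hedged sketch of a
hand-rolled Writable; the record name and fields are invented, not taken
from our actual format. Using variable-length encodings is one of the
trade-offs you get to make per field.

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.io.WritableUtils;

  public class MetricRecord implements Writable {
    private long timestamp;   // seconds since epoch
    private int statusCode;   // HTTP status of the aggregated requests
    private long hitCount;    // aggregated count for this key

    public void write(DataOutput out) throws IOException {
      // Variable-length encodings keep small values small on disk.
      WritableUtils.writeVLong(out, timestamp);
      WritableUtils.writeVInt(out, statusCode);
      WritableUtils.writeVLong(out, hitCount);
    }

    public void readFields(DataInput in) throws IOException {
      timestamp = WritableUtils.readVLong(in);
      statusCode = WritableUtils.readVInt(in);
      hitCount = WritableUtils.readVLong(in);
    }
  }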
