On Wed, Mar 21, 2012 at 10:47 AM,  <milind.bhandar...@emc.com> wrote:
> Answers inline.
>
> On 3/21/12 10:32 AM, "Eli Collins" <e...@cloudera.com> wrote:
>
>>
>>Why not just write new files and use Har files? Because Har files are a
>>pita?
>
> Yes, and har creation is an MR job, which is totally I/O bound, and yet
> takes up slots/containers, reducing cluster utilization.
>
>>Can you elaborate on the 1st one, how it's especially helpful for
>>archival?
>
> Say you have daily log files (consider many small job history files).
> Instead of keeping them as separate files, one appends them into a
> monthly file (this in itself is a complete rewrite), but appending
> monthly files to a year-to-date file should not require a rewrite
> (because after March, rewriting the whole thing becomes very
> inefficient).
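>
> For concreteness, a rough sketch of that roll-up using the FileSystem
> API (the paths are made up, and this assumes append support is
> available on the target file system):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.*;
>   import org.apache.hadoop.io.IOUtils;
>
>   public class RollUp {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>       FileSystem fs = FileSystem.get(conf);
>       Path monthly = new Path("/logs/2012/03.log");   // hypothetical paths
>       Path ytd     = new Path("/logs/2012/ytd.log");
>       // Copy the monthly file onto the end of the year-to-date file;
>       // nothing already in ytd.log gets rewritten.
>       try (FSDataInputStream in = fs.open(monthly);
>            FSDataOutputStream out = fs.append(ytd)) {
>         IOUtils.copyBytes(in, out, conf, false);
>       }
>     }
>   }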

Why not just keep the original daily files instead of continually
either rewriting (yuck) or duplicating (yuck) the data by aggregating
them into rollups?  I can think of two reasons:

1. If the daily files are smaller than 1 block (seems unlikely)
2. The small files problem (a typical NN can store 100-200M files, so
a problem for big users)

In which case maybe better to focus on #2 rather than work around it?

Thanks,
Eli

>
> Reducing the number of files this way also makes it easy to copy, take
> snapshots, etc. without having to write special parallel code to do it.
>
>>
>>I assume the 2nd one refers to not having to use Multi*InputFormat, and
>>the 3rd refers to appending to an old file instead of creating a new
>>one.
>
> Yes.
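>
> For example, with a separate file (or directory) per generation of small
> files, job setup tends to look something like the sketch below (paths are
> made up, and the identity Mapper stands in for a real mapper); with a
> single appended file it collapses to one FileInputFormat.addInputPath
> call.
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.mapreduce.Job;
>   import org.apache.hadoop.mapreduce.Mapper;
>   import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
>   import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>
>   public class ManySmallInputs {
>     public static void main(String[] args) throws Exception {
>       Job job = new Job(new Configuration(), "monthly-report");
>       // One addInputPath call per generation of small files.
>       MultipleInputs.addInputPath(job, new Path("/logs/2012/01"),
>           TextInputFormat.class, Mapper.class);
>       MultipleInputs.addInputPath(job, new Path("/logs/2012/02"),
>           TextInputFormat.class, Mapper.class);
>       // ...and so on for each month; reducer, output path, etc. omitted.
>     }
>   }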
>
>>
>>> In addition, the small-files problem in HDFS forces people to write MR
>>> code, and causes rewrites of large datasets even if only a small amount
>>> of data is added to them.
>
>
>>
>>Do people rewrite large datasets today just to add 1mb? I haven't
>>heard of that from big users (Yahoo!, FB, Twitter, eBay, ...) or my
>>customer base.  If so I would have expected people to put energy into
>>getting append working in 1.x, which no one has put energy into (I know
>>some people feel the 20-based design is unworkable; I don't know it
>>well enough to comment there).
>
> With HDFS, they do not rewrite large datasets just to add a small amount
> of data. Instead they create new files and use a separate metadata
> service (or just file-numbering conventions) to make the added data part
> of the large dataset. But with other file systems, they just ">>".
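>
> A sketch of that workaround, for concreteness (the directory and naming
> convention here are made up):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.*;
>
>   public class AddPart {
>     public static void main(String[] args) throws Exception {
>       FileSystem fs = FileSystem.get(new Configuration());
>       Path dir = new Path("/data/events");        // hypothetical dataset dir
>       // Without append: the "dataset" is, by convention, the union of
>       // /data/events/part-*; new data becomes the next numbered part file.
>       int next = fs.listStatus(dir).length;       // naive sequence number
>       Path part = new Path(dir, String.format("part-%05d", next));
>       try (FSDataOutputStream out = fs.create(part)) {
>         out.writeBytes("new records\n");
>       }
>     }
>   }
>
> With working append, the new records would simply go onto the end of an
> existing file, which is what ">>" does on a local file system.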
>
> Thanks,
>
> - milind
>
>
>>---
>>Milind Bhandarkar
>>Greenplum Labs, EMC
>>(Disclaimer: Opinions expressed in this email are those of the author,
>>and do not necessarily represent the views of any organization, past or
>>present, the author might be affiliated with.)
>
