>1. If the daily files are smaller than 1 block (seems unlikely)

Even at a large HDFS installation, the avg file size was < 1.5 blocks.
Bucketing causes the file sizes to drop.

>2. The small files problem (a typical NN can store 100-200M files, so
>a problem for big users)

Big users probably have enough people to write their own roll-up code to
avoid the small-files problem. It's the rest who are used to storage
systems that handle billions of files.
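
For anyone unfamiliar with the term, "roll-up code" here means something
along the lines of the sketch below: periodically concatenate many small
files into one larger file. This is only an illustration under stated
assumptions. It uses the local file system as a stand-in for HDFS, and the
function name roll_up, the "part-*" naming pattern and the output name
rolled.dat are all hypothetical, not taken from any particular site's code.

```python
import shutil
from pathlib import Path


def roll_up(src_dir: Path, dest: Path, pattern: str = "part-*") -> int:
    """Concatenate the small files matching `pattern` into `dest`.

    Files are appended in sorted name order. Returns the number of
    files merged. `dest` is assumed not to match `pattern` itself.
    """
    parts = sorted(p for p in src_dir.glob(pattern) if p.is_file())
    with dest.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                # Stream the bytes so large inputs are not held in memory.
                shutil.copyfileobj(src, out)
    return len(parts)
```

A real job would also have to delete or archive the merged inputs and deal
with files that arrive while the roll-up is running, which is part of why
it takes people to write and maintain.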

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)



>
>In which case maybe better to focus on #2 rather than work around it?
>
>Thanks,
>Eli
>
>>
>> Reducing number of files this way also makes it easy to copy, take
>> snapshots etc without having to write special parallel code to do it.
>>
>>>
>>>I assume the 2nd one refers to not having to use Multi*InputFormat. And
>>>the 3rd refers to appending to an old file instead of creating a new
>>>one.
>>
>> Yes.
>>
>>>
>>>> In addition, the small-files problem in HDFS forces people to write MR
>>>> code, and causes rewrite of large datasets even if a small amount of
>>>> data is added to it.
>>
>>
>>>
>>>Do people rewrite large datasets today just to add 1 MB? I haven't
>>>heard of that from big users (Yahoo!, FB, Twitter, eBay..) or my
>>>customer base.  If so, I would have expected people to put energy
>>>into getting append working in 1.x, which no one has put energy into
>>>(I know some people feel the 20-based design is unworkable; I don't
>>>know it well enough to comment there).
>>
>> With HDFS, they do not rewrite large datasets just to add a small amount
>> of data. Instead they create new files, and use a separate
>> metadata-service (or just file numbering conventions) to make the added
>> data part of the large dataset. But with other file systems, they just
>> ">>".
>>
>> Thanks,
>>
>> - milind
>>
>>
>>
>
