>1. If the daily files are smaller than 1 block (seems unlikely)

Even at a large HDFS installation, the average file size was < 1.5 blocks.
Bucketing causes the file sizes to drop.

>2. The small files problem (a typical NN can store 100-200M files, so
>a problem for big users)

Big users probably have enough people to write their own roll-up code to
avoid the small-files problem. It's the rest that are used to storage
systems handling billions of files.

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)

>
>In which case maybe better to focus on #2 rather than work around it?
>
>Thanks,
>Eli
>
>>
>> Reducing number of files this way also makes it easy to copy, take
>> snapshots etc without having to write special parallel code to do it.
>>
>>>
>>>I assume the 2nd one refers to not having to Multi*InputFormat. And
>>>the 3rd refers to appending to an old file instead of creating a new
>>>one.
>>
>> Yes.
>>
>>>
>>>> In addition, the small-files problem in HDFS forces people to write MR
>>>> code, and causes rewrite of large datasets even if a small amount of
>>>> data is added to it.
>>
>>
>>>
>>>Do people rewrite large datasets today just to add 1mb? I haven't
>>>heard of that from big users (Yahoo!, FB, Twitter, eBay..) or my
>>>customer base. If so I'd would have expected people to put energy
>>>into getting append working in 1.x which know was has put energy into
>>>(I know some people feel the 20-based design is unworkable, I don't
>>>know it well enough to comment there).
>>
>> With HDFS, they do not rewrite large datasets just to add a small amount
>> of data. Instead they create new files, and use a separate
>> metadata-service (or just file numbering conventions) to make the added
>> data part of the large dataset. But with other file systems, they just
>> ">>".
>>
>> Thanks,
>>
>> - milind
>>
>>
>>>---
>>>Milind Bhandarkar
>>>Greenplum Labs, EMC
>>>(Disclaimer: Opinions expressed in this email are those of the author,
>>>and do not necessarily represent the views of any organization, past or
>>>present, the author might be affiliated with.)
>>
>