On Wed, Mar 21, 2012 at 10:32 AM, Eli Collins <e...@cloudera.com> wrote:
> Thanks for the feedback Milind, questions inline.
>
> On Wed, Mar 21, 2012 at 10:17 AM, <milind.bhandar...@emc.com> wrote:
>> As someone who has worked with hdfs-compatible distributed file systems
>> that support append, I can vouch for its extensive usage.
>>
>> I have seen how simple it becomes to create tar archives, and later append
>> files to them, without writing special inefficient code to do so.
>>
>
> Why not just write new files and use Har files? Or is it that Har files
> are a pita?
>
>> I have seen it used in archiving cold data, reducing MR task launch
>> overhead without having to use a different input format, so that the same
>> code can be used for both hot and cold data.
>>
>
> Can you elaborate on the 1st one, how it's especially helpful for archival?
>
> I assume the 2nd one refers to not having to use Multi*InputFormat. And
> the 3rd refers to appending to an old file instead of creating a new
> one.
>
>> In addition, the small-files problem in HDFS forces people to write MR
>> code, and causes rewrite of large datasets even if a small amount of data
>> is added to it.
>
> Do people rewrite large datasets today just to add 1mb? I haven't
> heard of that from big users (Yahoo!, FB, Twitter, eBay..) or my
> customer base. If so I would have expected people to put energy
> into getting append working in 1.x, which know was has put energy into
Arg, that should read "no one has put energy into". </drinks coffee>
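For anyone following along who hasn't used the feature being debated: in HDFS the pattern is FileSystem.append(path), which returns an output stream positioned at the end of an existing file so new records land after the old data without rewriting the file. A minimal sketch of that pattern using plain java.nio on the local filesystem (not the Hadoop API, so it runs without a cluster; the file name is just a temp file for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendSketch {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("append-demo", ".log");

        // Initial write -- analogous to creating and closing a file in HDFS.
        Files.write(p, "first record\n".getBytes());

        // Later append -- analogous to FileSystem.append(path) in HDFS:
        // existing bytes are untouched, new bytes go at the end.
        Files.write(p, "second record\n".getBytes(), StandardOpenOption.APPEND);

        System.out.print(new String(Files.readAllBytes(p)));
        Files.delete(p);
    }
}
```

The alternative Eli raises (write a new file per batch, then bundle with Har) avoids append entirely but leaves many small files until the archive step runs, which is exactly the small-files pressure Milind describes.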