On Wed, Mar 21, 2012 at 10:32 AM, Eli Collins <e...@cloudera.com> wrote:
> Thanks for the feedback Milind, questions inline.
>
> On Wed, Mar 21, 2012 at 10:17 AM,  <milind.bhandar...@emc.com> wrote:
>> As someone who has worked with hdfs-compatible distributed file systems
>> that support append, I can vouch for its extensive usage.
>>
>> I have seen how simple it becomes to create tar archives, and later append
>> files to them, without writing special inefficient code to do so.
>>
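For anyone following along, the use case above boils down to the FileSystem
append API. A rough sketch only: the path is made up, and it assumes a
build/config where append is actually enabled (dfs.support.append in 1.x).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical archive file that already exists in HDFS.
        Path archive = new Path("/user/foo/archive.tar");
        // Reopen the existing file and add bytes at the end, instead of
        // rewriting the whole file into a new one.
        FSDataOutputStream out = fs.append(archive);
        try {
          out.write("new entry bytes".getBytes("UTF-8"));
        } finally {
          out.close();
        }
      }
    }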
>
> Why not just write new files and use Har files? Is it because Har files are a pita?
>
>> I have seen it used for archiving cold data and for reducing MR task
>> launch overhead without having to use a different input format, so that
>> the same code can be used for both hot and cold data.
>>
>
> Can you elaborate on the 1st one, how it's especially helpful for archival?
>
> I assume the 2nd one refers to not having to use Multi*InputFormat,
> and the 3rd refers to appending to an old file instead of creating a
> new one.
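
On the Multi*InputFormat point, this is roughly the per-path wiring a job
ends up with when hot and cold copies of the same data live in different
formats. Sketch only, with made-up paths, using the old mapred API of the
1.x line:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class HotColdJobSetup {
      public static JobConf configure() {
        JobConf conf = new JobConf(HotColdJobSetup.class);
        // Hot data still lands as plain text, cold data has been compacted
        // into SequenceFiles, so each path needs its own input format...
        MultipleInputs.addInputPath(conf, new Path("/data/hot"),
            TextInputFormat.class);
        MultipleInputs.addInputPath(conf, new Path("/data/cold"),
            SequenceFileInputFormat.class);
        // ...whereas if cold data could be appended into the same files, a
        // single setInputFormat call would do.
        return conf;
      }
    }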
>
>> In addition, the small-files problem in HDFS forces people to write MR
>> code, and causes rewrites of large datasets even when only a small
>> amount of data is added to them.
>
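To make the rewrite cost concrete, this is roughly the workaround being
described: without append, folding a small delta into a large file means
copying every existing byte into a new file first. Sketch only, paths made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class RewriteToAdd {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path existing  = new Path("/data/events/part-00000");    // large file
        Path delta     = new Path("/incoming/today.log");        // small addition
        Path rewritten = new Path("/data/events/part-00000.new");

        FSDataOutputStream out = fs.create(rewritten);
        FSDataInputStream in = fs.open(existing);
        IOUtils.copyBytes(in, out, conf, false);  // copy all the old bytes
        in.close();
        in = fs.open(delta);
        IOUtils.copyBytes(in, out, conf, false);  // then tack on the new ones
        in.close();
        out.close();

        // Swap the rewritten file into place.
        fs.delete(existing, false);
        fs.rename(rewritten, existing);
      }
    }
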
> Do people rewrite large datasets today just to add 1MB? I haven't
> heard of that from big users (Yahoo!, FB, Twitter, eBay, ...) or my
> customer base.  If so, I would have expected people to put energy
> into getting append working in 1.x which know was has put energy into

Arg, that should read "no one has put energy into".  </drinks coffee>
