You just can't have many distributed jobs write into the same file without locking/synchronizing those writes, even with append(). It's no different from using a regular file from multiple processes in this respect. Maybe you need to collect your data up front before processing it in Hadoop? Have a look at Chukwa: http://wiki.apache.org/hadoop/Chukwa
On Sat, May 9, 2009 at 9:44 AM, Sasha Dolgy <sdo...@gmail.com> wrote:
> Would WritableFactories not allow me to open one output stream and continue
> to write() and sync()?
>
> Maybe I'm reading that wrong. Although UUID would be nice, it would
> still leave me with the problem of having lots of little files instead of a
> few large files.
>
> -sd
>
> On Sat, May 9, 2009 at 8:37 AM, jason hadoop <jason.had...@gmail.com> wrote:
>
>> You must create unique file names. I don't believe (but I do not know) that
>> the append code will allow multiple writers.
>>
>> Are you writing from within a task, or as an external application writing
>> into Hadoop?
>>
>> You may try using a UUID,
>> http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part of
>> your filename.
>> Without knowing more about your goals, environment and constraints it is
>> hard to offer any more detailed suggestions.
>> You could also have an application aggregate the streams and write out
>> chunks, with one or more writers, one per output file.
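
For what it's worth, here is a minimal sketch of the unique-filename-per-writer idea suggested above: each process creates its own file under a shared directory and tags the name with a UUID, so no two writers ever touch the same path. The class name, output directory and record contents are made up for illustration; only the FileSystem/UUID calls are the standard APIs.

    import java.util.UUID;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UniqueFileWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical output directory; the UUID in the filename keeps
            // names unique across processes, so no two writers share a file.
            Path out = new Path("/tmp/events/part-" + UUID.randomUUID());

            // overwrite=false: fail loudly if the name ever collides.
            FSDataOutputStream stream = fs.create(out, false);
            try {
                stream.writeBytes("one record stream per writer-owned file\n");
            } finally {
                stream.close();
            }
        }
    }

You still end up with one file per writer (the "lots of little files" concern), so a downstream aggregation step, or something like Chukwa, would be needed to roll them up into larger chunks.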