You just can't have many distributed jobs write into the same file without locking/synchronizing those writes, even with append(). It's no different from using a regular file from multiple processes in this respect. Maybe you need to collect your data up front before processing it in Hadoop? Have a look at Chukwa: http://wiki.apache.org/hadoop/Chukwa
On Sat, May 9, 2009 at 9:44 AM, Sasha Dolgy <sdo...@gmail.com> wrote:
> Would WritableFactories not allow me to open one output stream and continue
> to write() and sync()?
>
> Maybe I'm reading that wrong. Although UUID would be nice, it would
> still leave me with the problem of having lots of little files instead of a
> few large files.
>
> -sd
>
> On Sat, May 9, 2009 at 8:37 AM, jason hadoop <jason.had...@gmail.com> wrote:
>
>> You must create unique file names. I don't believe (but I do not know) that
>> the append code will allow multiple writers.
>>
>> Are you writing from within a task, or as an external application writing
>> into Hadoop?
>>
>> You may try using a UUID,
>> http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part of
>> your filename.
>> Without knowing more about your goals, environment and constraints it is
>> hard to offer any more detailed suggestions.
>> You could also have an application aggregate the streams and write out
>> chunks, with one or more writers, one per output file.
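
For what it's worth, here is a minimal sketch of the unique-filename-per-writer idea suggested above: each process creates its own file under a shared directory and tags the name with a UUID, so no two writers ever touch the same path. The class name, output directory and record contents are made up for illustration; only the FileSystem/UUID calls are the standard APIs.

    import java.util.UUID;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UniqueFileWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical output directory; the UUID in the filename keeps
            // names unique across processes, so no two writers share a file.
            Path out = new Path("/tmp/events/part-" + UUID.randomUUID());

            // overwrite=false: fail loudly if the name ever collides.
            FSDataOutputStream stream = fs.create(out, false);
            try {
                stream.writeBytes("one record stream per writer-owned file\n");
            } finally {
                stream.close();
            }
        }
    }

You still end up with one file per writer (the "lots of little files" concern), so a downstream aggregation step, or something like Chukwa, would be needed to roll them up into larger chunks.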