On Tue, Jan 15, 2008 at 09:09:07AM -0800, Ted Dunning wrote:
> 
> Regarding the race condition, hadoop builds task specific temporary
> directories in the output directory, one per reduce task, that hold these
> output files (as long as you don't use absolute path names).  When the
> process completes successfully, the output files from that temporary
> directory are moved to the correct place and the temporary task-specific
> directory is deleted.  If the reduce task dies or is superseded by another
> task, then the directory is simply deleted.  The file is not kept in memory
> pending write.

That sounds like the "generic" record writer I sketched might work.  I'd love
to hear your comments on it.

Are there any bindings (e.g., Perl) that would let our scripts write to files
in those temp directories (on HDFS)?  That could be another solution to our problem.
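
In the meantime, I suppose our scripts could shell out to a tiny helper that
copies a finished local file into that task-specific directory.  This is only
a sketch, not compiled, and it rests on assumptions I haven't verified: that
streaming exports the job conf to the child's environment with dots turned
into underscores, and that a property like mapred_work_output_dir (the name
is a guess and may not exist in our version) holds the task's temporary
output path.

  // Hypothetical helper -- see the assumptions above.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CopyToTaskOutput {
    public static void main(String[] args) throws Exception {
      // Task-specific temporary output dir, read from the environment that
      // streaming (presumably) sets up for the child process.
      String workDir = System.getenv("mapred_work_output_dir");
      Path target = new Path(workDir, args[1]);        // args[1] = final file name
      // Resolve the filesystem from the (hopefully fully qualified) path.
      FileSystem fs = target.getFileSystem(new Configuration());
      fs.copyFromLocalFile(new Path(args[0]), target); // args[0] = local temp file
    }
  }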

> I am curious about how to demarcate the image boundaries in your current
> output.  Hadoop streaming makes the strong presumption of line orientation.
> If that isn't valid for your output, then you may have a program that is
> only accidentally working by finding line boundaries in binary data.  In
> particular, you may someday have a situation where some of the data has one
> kind of line boundary that is recognized, but on output the corresponding
> boundary is generated in a different form.  For instance, if your program
> sees CR-LF, it might take the pair as a line boundary and emit just LF.
> Even if this is not happening now, you may be in for some trouble later.

Currently we are not using any image boundaries.  Our current reducer bunches
up all records with the same key and feeds them to an image generation program
that writes to a unique file name on NFS (I described it in the previous post).
Each image is in a separate file; the image boundary is the key boundary.

The "generic" record writer for multiple files would take a file size as an
argument, so it can know how long the rawbytes field is.
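
For the Java side, here is roughly what I'm picturing.  It's just a sketch,
not compiled, and the class/method names I'm leaning on (FileOutputFormat,
getWorkOutputPath) may not match the version we're running, so treat them as
placeholders.  It assumes the reducer emits lines of the form
"filename TAB filesize TAB rawbytes", with rawbytes in some line-safe
encoding so Ted's line-boundary concern doesn't bite; the decode step is a
placeholder.

  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordWriter;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.util.Progressable;

  public class GenericFileOutputFormat extends FileOutputFormat<Text, Text> {

    public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored,
        JobConf job, String name, Progressable progress) throws IOException {
      // Task-specific temporary output dir; the framework promotes its
      // contents on success and deletes it if the task is killed.
      final Path workDir = FileOutputFormat.getWorkOutputPath(job);
      final FileSystem fs = workDir.getFileSystem(job);

      return new RecordWriter<Text, Text>() {
        public void write(Text key, Text value) throws IOException {
          // key = destination file name, value = "filesize TAB rawbytes"
          String[] parts = value.toString().split("\t", 2);
          long size = Long.parseLong(parts[0]);
          byte[] bytes = decode(parts[1]);     // line-safe decoding, elided
          if (bytes.length != size) {
            throw new IOException("truncated record for " + key);
          }
          FSDataOutputStream out = fs.create(new Path(workDir, key.toString()));
          try {
            out.write(bytes);
          } finally {
            out.close();
          }
        }
        public void close(Reporter reporter) throws IOException { }
      };
    }

    // Placeholder: whatever line-safe encoding we pick for rawbytes.
    private static byte[] decode(String s) {
      return s.getBytes();
    }
  }

Whether streaming's default key/value splitting hands the writer exactly this
key and value is something I'd still need to check.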

Thanks!

  -Yuri

> 
> 
> On 1/15/08 8:57 AM, "Yuri Pradkin" <[EMAIL PROTECTED]> wrote:
> 
> > Well, in our case the reducer munches key-value pairs to
> > generate images; it's conceivable that we'll have other
> > reducers in the future to do other interesting things.
> > So, it would be impractical to move all that code into
> > a RecordWriter.  We don't want to have a new  RecordWriter
> > for each new job, and we'd like to keep our processing
> > code in languages other than Java, which is the only reason
> > to use streaming, right?
> > 
> > 
> > Do you think it would be a good solution to come up with a
> > "generic" version of a record writer that would take as
> > input, say: 
> > <filename, filesize, rawbytes[filesize]>
> > and do the actual writing?
> > 
> > Will Hadoop guarantee that only one "filename" will be
> > created/written to, and that the file will not be
> > corrupted, even if there are racing tasks?
> > 
> > And what about memory requirements? -- When filesize is large,
> > would it all have to be stored in memory before it's written,
> > or will Hadoop cache it in a temp file?
> > 
> > Thanks much for your input.
> > 
> >   -Yuri 
> > 
> > On Mon, Jan 14, 2008 at 01:06:13PM -0800, Runping Qi wrote:
> >> 
> >> One way to achieve your goal is to implement your own
> >> OutputFormat/RecordWriter classes.
> >> Your reducer will emit all the key/value pairs as in the normal case.
> >> Your record writer class can open multiple output files and dispatch
> >> the key/value pairs to the appropriate files based on the actual values.
> >> This way, the Hadoop framework takes care of all the issues related to the
> >> namespace and the necessary cleanup of the output files.
> >> 
> >> 
> >> Runping
> >>  
> >> 
> >>> -----Original Message-----
> >>> From: Yuri Pradkin [mailto:[EMAIL PROTECTED]
> >>> Sent: Monday, January 14, 2008 12:33 PM
> >>> To: hadoop-user@lucene.apache.org
> >>> Subject: writing output files in hadoop streaming
> >>> 
> >>> Hi,
> >>> 
> >>> We've been using Hadoop streaming for the last 3-4 months and
> >>> it all worked out fine except for one little problem:
> >>> 
> >>> in some situations a Hadoop reduce job gets multiple key groups
> >>> and needs to write out a separate binary output file for
> >>> each group.  However, when a reduce task takes too long and
> >>> there is spare capacity, the task may be replicated on another
> >>> node and the two are basically racing each other.  One finishes
> >>> cleanly and the other is terminated.  Hadoop takes care to remove
> >>> the terminated task's output from HDFS, but since we're writing
> >>> files from scripts, it's up to us to separate the output of cleanly
> >>> finished tasks from the output of tasks that are terminated
> >>> prematurely.
> >>> 
> >>> Does somebody have answers to the following questions:
> >>> 1. Is there an easy way for a script launched by Hadoop streaming to
> >>>    tell whether it was terminated before it received complete input?
> >>>    As far as I was able to ascertain, no signals are sent to those
> >>>    Unix jobs; they just stop receiving data from STDIN.  The only way
> >>>    that seems to work for me is to process all input, then write
> >>>    something to STDOUT/STDERR and see if that causes a SIGPIPE.  But
> >>>    this is ugly; I hope there is a better solution.
> >>> 
> >>> 2. Is there any good way to write multiple HDFS files from a streaming
> >>>    script *and have Hadoop clean up those files* when it decides to
> >>>    destroy the task?  If there were just one file, I could simply use
> >>>    STDOUT, but dumping multiple binary files to STDOUT is not pretty.
> >>> 
> >>> We are writing output files to an NFS partition shared among all
> >>> reducers, which makes it all slightly more complicated because of
> >>> possible file overwrites.
> >>> 
> >>> Our current solution, which is not pretty but avoids directly addressing
> >>> this problem, is to write out files with random names (created with
> >>> mktemp) and write to STDOUT the command to rename each file to its
> >>> desired name.  Then, as a post-processing stage, I execute all those
> >>> commands and delete the remaining temporary files as
> >>> duplicates/incompletes.
> >>> 
> >>> Thanks,
> >>> 
> >>>   -Yuri
