Surely the clean way (in a streaming environment) would be to define a
representation of some kind which serialises the output
(http://en.wikipedia.org/wiki/Serialization).  After your mappers and
reducers have completed, you would then have some code which
deserialises (unpacks) the output as desired.  This would easily allow
you to reconstruct the two files from a single set of file fragments.

This approach would entail defining the serialisation / deserialisation
process in a way that is distinct from the actual mappers / reducers,
and then having a small compilation step take that definition and both
generate the necessary serialisers / deserialisers and serve as
documentation.  It does add some overhead, but in the long run it is
worth it, since the interfaces are actually documented.
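For concreteness, a minimal sketch of the sort of thing I mean (Python;
the chunked, base64-encoded record format and every name in it are just
illustrative assumptions, chosen to keep the records line-safe for
streaming):

#!/usr/bin/env python
"""Illustrative serialiser / deserialiser for line-oriented streaming output.

Every record is one line:  <file name> TAB <base64 payload chunk>
Large files are split across several lines and reassembled afterwards.
All names here are made up for the example.
"""
import base64
import os
import sys

CHUNK = 48 * 1024   # raw bytes per output line; an arbitrary choice


def serialise(name, data, out=sys.stdout):
    """Emit one binary blob as one or more line-safe records."""
    for off in range(0, len(data), CHUNK):
        payload = base64.b64encode(data[off:off + CHUNK]).decode('ascii')
        out.write('%s\t%s\n' % (name, payload))


def deserialise(lines, target_dir):
    """Rebuild the files from the concatenated part-* records.

    Assumes the chunks of a given file appear in the order they were
    written, which holds when a single reducer wrote them.
    """
    for line in lines:
        name, payload = line.rstrip('\n').split('\t', 1)
        path = os.path.join(target_dir, os.path.basename(name))
        with open(path, 'ab') as fh:
            fh.write(base64.b64decode(payload))

The point is only that the pack/unpack format lives in one place,
separate from the mappers / reducers themselves, and that definition
doubles as documentation of the interface.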
Miles

On 15/01/2008, John Heidemann <[EMAIL PROTECTED]> wrote:
>
> On Tue, 15 Jan 2008 09:09:07 PST, Ted Dunning wrote:
> >
> > Regarding the race condition, hadoop builds task-specific temporary
> > directories in the output directory, one per reduce task, that hold
> > these output files (as long as you don't use absolute path names).
> > When the process completes successfully, the output files from that
> > temporary directory are moved to the correct place and the temporary
> > task-specific directory is deleted.  If the reduce task dies or is
> > superseded by another task, then the directory is simply deleted.
> > The file is not kept in memory pending write.
> >
> > I am curious about how to demarcate the image boundaries in your
> > current output.  Hadoop streaming makes the strong presumption of
> > line orientation.  If that isn't valid for your output, then you may
> > have a program that is only accidentally working by finding line
> > boundaries in binary data.  In particular, you may someday have a
> > situation where some of the data has one kind of line boundary that
> > is recognized, but on output the corresponding boundary is generated
> > in a different form.  For instance, if your program sees CR-LF, it
> > might take the pair as a line boundary and emit just LF.  Even if
> > this is not happening now, you may be in for some trouble later.
>
> I think Yuri left out a bit about what we're doing.
> He wasn't clear about what files we're talking about writing.
> Let me try to clarify.
>
> As context, all this is in Hadoop streaming.
>
> Here's one way, the "side-effect way" (this is what we're doing now):
>
> In principle, we'd like to not output ANYTHING to stdout from
> streaming.  Instead, we create new files somewhere in the shared Unix
> filespace.  Basically, these files are side-effects of the map/reduce
> computation.
>
> This approach is described in Dean & Ghemawat section 4.5
> (Side-effects), with the caveat that the user must be responsible for
> making any side-effect atomic.
>
> Our problem is, I think, that duplicated reducers scheduled for
> straggler elimination can result in extra, partial side-effect files.
> We're trying to figure out how to clean them up properly.
>
> Currently it seems that prematurely terminated reducers (due to
> cancelled straggler-elimination jobs) are not told they are terminated.
> They just get a SIGPIPE because their write destination goes away.
>
> This prompted Yuri's first question:
>
> >>>> 1. Is there an easy way to tell in a script launched by the Hadoop
> >>>>    streaming, if the script was terminated before it received
> >>>>    complete input?
>
> To me, it seems that cancelled jobs should get a SIGTERM or SIGUSR1 so
> they can catch it and clean up properly.
> Otherwise there seems to be no clean way to distinguish a half-run job
> from a fully run job that happens to have less input.  (I.e., no way
> for our reducer to do a commit or abort properly.)
>
> (It would be nicer to send an in-band termination signal down stdin,
> but I don't think a streaming reducer can do that.)
>
> So what do the Hadoop architects think about side-effects and
> recovering from half-run jobs?  Does Hadoop intend to support
> side-effects (for interested users, obviously not as standard
> practice)?  If we were in Java, would we get a signal we could use to
> do cleanup?
>
> What do the Hadoop streaming people think?  Is this just a bug, that
> streaming is not propagating a signal that appears in Javaland?
>
>
> There's a second way, which is where most of the discussion has gone;
> call it the "proper" way:
>
> Rather than writing files as side-effects, the argument is to just
> output the data with the standard Hadoop mechanism.  In streaming,
> this means through stdout.
>
> Which prompted Yuri's second question:
>
> >>>> 2. Is there any good way to write multiple HDFS files from a
> >>>>    streaming script *and have Hadoop clean up those files* when it
> >>>>    decides to destroy the task?  If there was just one file, I
> >>>>    could simply use STDOUT, but dumping multiple binary files to
> >>>>    STDOUT is not pretty.
>
> But I actually think this is not viable for us, because we are writing
> images, which are binary.  As per Doug's comment:
>
> > If that isn't valid for your output, then you may have a program that
> > is only accidentally working by finding line boundaries in binary
> > data.
>
> (Doug, we're not doing it this way right now.)
>
> That said, if it worked, this way is clearly a lot cleaner, since
> Hadoop already handles commit/abort for half-run jobs.  Basically all
> of our half-run problems go away.  But they are replaced with
> file-format problems.
>
> If we were in Java, we could write our own OutputRecord class.  This
> is what Runping suggested and Yuri was discussing.  I don't think that
> works for us (because we're not in Java, although I suppose it might
> be made to work).
>
> If we go that way, then we're basically packing many files into one.
> To me it seems cleanest, if one wants to do that, to use some existing
> format, like tar or zip or cpio, or maybe the Hadoop multi-file
> format.  But this way seems fraught with peril, since we have to fight
> streaming and custom record output, and then still extract the files
> after output completes anyway.  Lots and lots of work -- it feels like
> this can't be right.
>
> (Another hacky way to make this work in streaming is to convert binary
> to ASCII, i.e. base-64-encode the files.  Been there in SQL.  Done
> that.  Don't want to do it again.  It still has all the encoding and
> post-processing junk. :-)
>
>
> Yuri had a very clever hack that merges the two schemes.  He writes to
> random filenames as side-effects, but then writes the side-effect
> filenames as Hadoop output.  Therefore Hadoop handles commit/abort,
> and post-run he just collects the files that appear in Hadoop's part-*
> output and discards the others.
>
> This hack works, but IMHO the reducer should do the commit/abort of
> side-effects, not some post-processing job.
>
>
> So any thoughts about supporting side-effects?
>
>
>   -John
>
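PS -- for anyone following along, Yuri's filenames-as-output hack could
look roughly like the sketch below (Python; every path, directory and
helper name is invented purely for illustration):

#!/usr/bin/env python
"""Rough sketch of the filenames-as-output hack (all paths are made up).

Reducer: write the image as a uniquely named side-effect file and emit
only that file name on stdout, so Hadoop's own commit/abort decides
which names survive into the part-* output.

Sweep: afterwards, keep the side-effect files whose names appear in
part-* and delete the rest (the partial output of killed attempts).
"""
import glob
import os
import sys
import uuid

SIDE_EFFECT_DIR = '/shared/images'            # assumed shared filesystem path


def process(line):
    # stand-in for whatever the real per-record work is
    return line.encode('utf-8')


def reducer(stdin=sys.stdin, stdout=sys.stdout):
    name = os.path.join(SIDE_EFFECT_DIR, uuid.uuid4().hex + '.img')
    with open(name, 'wb') as fh:
        for line in stdin:
            fh.write(process(line))
    stdout.write(name + '\n')                 # the only "official" record


def sweep(part_glob='/shared/job-output/part-*'):
    committed = set()
    for part in glob.glob(part_glob):
        with open(part) as fh:
            committed.update(line.strip().split('\t')[0]
                             for line in fh if line.strip())
    for path in glob.glob(os.path.join(SIDE_EFFECT_DIR, '*.img')):
        if path not in committed:
            os.remove(path)                   # partial output of a killed attempt

It doesn't answer John's objection, of course: the reducer itself,
rather than a post-processing sweep, ought to be able to do the
commit/abort of its side-effects.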