[
https://issues.apache.org/jira/browse/MAPREDUCE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421922#comment-13421922
]
Andy Isaacson commented on MAPREDUCE-2374:
------------------------------------------
bq. I have tested Andy's suggestion of not using -c switch. It does resolve the
issue on our test cluster.
Thanks for testing! This is great news.
bq. to avoid potential data loss issues (from delayed allocation by file
systems like ext4) I have made some changes so that the IO buffer contents of
taskjvm.sh file are committed to the underlying storage before shell executor
is called.
I'm a bit confused why you're concerned about crash consistency here. AFAIK,
ext4 delayed allocation is *completely* invisible to application code, unless
the machine crashes and you're recovering afterwards.
(OK, that's not quite true, since you could use {{filefrag}} to find out where
the allocations are, or maybe use high-resolution timers to notice seek-induced
IO timing discontinuities, but neither is relevant to this discussion.)
Since the taskjvm.sh script is used immediately and never reused across a
kernel crash, why do you care whether the IO buffer is synced to stable
storage?
Looking at mapreduce-2374-branch-1.patch, I see that you've made two related
changes, one getting rid of the {{BufferedOutputStream}} from
{{RawLocalFileSystem#create}} and another adding calls to
{{FSDataOutputStream#flush}} and {{#sync}}. I don't see how either of those
changes can make much of a difference given that we call {{w.close}} 8 lines
down in the finally block.
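To illustrate the point about {{close}}, here is a minimal sketch using plain {{java.io}} as a stand-in for the Hadoop stream classes (the class and method names are made up for illustration and are not from the patch). It writes through a {{BufferedOutputStream}} without any explicit flush and relies on {{close}} alone, which should be enough for a subsequent reader to see the full contents:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; plain java.io stands in for the Hadoop streams.
class CloseFlushesDemo {
    // Writes text through a buffered stream with NO explicit flush or sync,
    // closes the stream, and returns what a fresh read of the file sees.
    static String writeThenRead(String text) {
        try {
            Path script = Files.createTempFile("taskjvm", ".sh");
            try {
                OutputStream out =
                    new BufferedOutputStream(new FileOutputStream(script.toFile()));
                try {
                    out.write(text.getBytes(StandardCharsets.UTF_8));
                } finally {
                    // close() flushes the buffer to the file before
                    // releasing the descriptor.
                    out.close();
                }
                return new String(Files.readAllBytes(script), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(script);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(writeThenRead("#!/bin/sh\nexec true\n"));
    }
}
```

This only shows that the data is visible to other readers once {{close}} returns. It says nothing about durability across a crash, which is the separate concern that {{sync}} addresses.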
bq. with strace the frequency of Text busy errors goes down drastically
Not too surprising; to get ETXTBSY we have to get to the relevant {{execve}}
before the forked process that's holding the fd exits. Adding a strace before
that slows down the shell and adds another child process into the scheduling
mix, making the race harder to win.
> Should not use PrintWriter to write taskjvm.sh
> ----------------------------------------------
>
> Key: MAPREDUCE-2374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2374
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 0.22.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Fix For: 0.22.1
>
> Attachments: failed_taskjvmsh.strace, mapreduce-2374-branch-1.patch,
> mapreduce-2374-on-20sec.txt, mapreduce-2374.txt, mapreduce-2374.txt,
> successfull_taskjvmsh.strace
>
>
> Our use of PrintWriter in TaskController.writeCommand is unsafe, since that
> class swallows all IO exceptions. We're not currently checking for errors,
> which I'm seeing result in occasional task failures with the message "Text
> file busy" - assumedly because the close() call is failing silently for some
> reason.
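
For reference, the {{PrintWriter}} behaviour described in the issue summary can be reproduced with a small sketch. The failing stream and the class and method names below are hypothetical stand-ins and are not taken from {{TaskController}}. The sketch shows that a write failure surfaces only through {{checkError}}, whereas a plain {{Writer}} propagates the {{IOException}}:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Hadoop.
class PrintWriterSwallowDemo {
    // Stand-in for an underlying stream whose writes fail.
    static class FailingStream extends OutputStream {
        @Override
        public void write(int b) throws IOException {
            throw new IOException("simulated write failure");
        }
    }

    // Returns true if the PrintWriter recorded an error. No exception
    // reaches the caller from println(), checkError(), or close().
    static boolean printWriterHidesFailure(String line) {
        PrintWriter w = new PrintWriter(new FailingStream());
        w.println(line);                 // buffered; does not throw
        boolean failed = w.checkError(); // flushes, then reports the error state
        w.close();
        return failed;
    }

    // Returns true if a plain Writer surfaces the failure as an IOException.
    static boolean plainWriterThrows(String line) {
        try (Writer w = new OutputStreamWriter(new FailingStream(),
                                               StandardCharsets.UTF_8)) {
            w.write(line);
            w.flush(); // pushes bytes to FailingStream, which throws
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("PrintWriter error flag: " + printWriterHidesFailure("exec true"));
        System.out.println("Plain Writer threw:     " + plainWriterThrows("exec true"));
    }
}
```

Code that keeps {{PrintWriter}} would have to call {{checkError}} explicitly after writing. Switching to a {{Writer}} that throws makes the failure impossible to ignore.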