[
https://issues.apache.org/jira/browse/HADOOP-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586511#action_12586511
]
Doug Cutting commented on HADOOP-3196:
--------------------------------------
> mapper and the reducer classes are consuming data - isn't that signal enough
> [...]?
We currently count both consumption and generation of data as activity. Some
mappers might, e.g., take lists of file names as input and spend a long time on
each one, so they do not consume new input lines very often. For such
applications, noticing progress on output is important.
I'm not arguing that this is the only or even the ideal way to handle progress
notification for streaming, but rather that it is the way that things currently
work, and that simply eliminating these flushes has the potential to break
applications which rely on them for prompt progress reports.
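
To make the trade-off concrete, here is a minimal sketch of what a per-record
flush buys and costs. It is not the PipeMapper code: FlushVisibilityDemo,
CountingSink and send are hypothetical names, and an in-memory sink stands in
for the child process's stdin. It only assumes the standard behaviour of
java.io.BufferedOutputStream, which hands bytes to the underlying stream when
it is flushed or when its buffer fills.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Hadoop streaming.
class FlushVisibilityDemo {

    /** Stand-in for the tool's stdin: records how many write calls reach it. */
    static final class CountingSink extends ByteArrayOutputStream {
        int writeCalls = 0;

        @Override
        public synchronized void write(byte[] b, int off, int len) {
            writeCalls++;
            super.write(b, off, len);
        }

        @Override
        public synchronized void write(int b) {
            writeCalls++;
            super.write(b);
        }
    }

    /**
     * Writes `records` short key/value lines through an 8 KB buffer,
     * optionally flushing after each one. There is deliberately no final
     * flush, so the returned sink shows what the tool would have seen so far.
     */
    static CountingSink send(int records, boolean flushEachRecord) {
        CountingSink sink = new CountingSink();
        OutputStream out = new BufferedOutputStream(sink, 8192);
        try {
            for (int i = 0; i < records; i++) {
                out.write(("key\tvalue" + i + "\n").getBytes(StandardCharsets.UTF_8));
                if (flushEachRecord) {
                    out.flush();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink;
    }

    public static void main(String[] args) {
        CountingSink flushed = send(10, true);
        CountingSink buffered = send(10, false);
        System.out.println("flush per record: " + flushed.writeCalls
                + " writes, " + flushed.size() + " bytes visible");
        System.out.println("no flush:         " + buffered.writeCalls
                + " writes, " + buffered.size() + " bytes visible");
    }
}
```

With ten 11-byte records, the flushing variant performs ten writes on the sink
and all 110 bytes are visible, while the non-flushing variant has delivered
nothing yet. A tool that only sees its input once a buffer fills cannot
produce output, and hence cannot show progress, any sooner than that.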
> get rid of excessive flushes from PipeMapper/Reducer
> ----------------------------------------------------
>
> Key: HADOOP-3196
> URL: https://issues.apache.org/jira/browse/HADOOP-3196
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/streaming
> Affects Versions: 0.16.2
> Reporter: Joydeep Sen Sarma
>
> there's a flush on the buffered output streams in mapper/reducer for every
> row of data.
> // 2/4 Hadoop to Tool
>
>     if (numExceptions_ == 0) {
>       if (!this.ignoreKey) {
>         write(key);
>         clientOut_.write('\t');
>       }
>       write(value);
>       if (!this.skipNewline) {
>         clientOut_.write('\n');
>       }
>       clientOut_.flush();
>     } else {
>       numRecSkipped_++;
>     }
> tried to measure impact of removing this. number of context switches reported
> by vmstat shows marked decline.
> with flush (10 second intervals):
>  r  b swpd   free  buff   cache si so   bi    bo   in    cs us sy id wa
>  4  2  784  23140 83352 3114648  0  0 4819 32397 1175 13220 59 11 13 17
>  1  2  784 129724 80704 3075696  0  0 4614 27196 1156 14797 49 11 19 21
>  4  0  784  24160 83440 3174880  0  0   96 36070 1337 10976 67 11  9 12
>  5  0  784 155872 84400 3158840  0  0  125 44084 1280 11044 68 14 10  8
>  2  1  784 365128 87048 2892032  0  0  119 38472 1317 11610 69 14 10  7
> without flush:
>  5  0  784  24652 56056 3217864  0  0  310 29499 1379  7603 76  9  7  8
>  5  3  784 118456 54568 3209992  0  0 3249 33426 1173  6828 63 11 12 14
>  0  2  784 227628 54820 3198560  0  0 7840 30063 1146  8899 60 10 15 15
>  3  1  784  25608 55048 3313512  0  0 3251 36276 1194  7915 60 10 15 15
>  1  2  784 197324 49968 3194572  0  0 4714 35479 1281  8204 62 13 12 13
> cs goes down by about 20-30%, but I'm having trouble measuring the overall
> speed improvement (too many variables due to speculative execution etc.; a
> better benchmark is needed).
> can't hurt.
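
As a rough cross-check of the quoted estimate, the cs column of the two vmstat
runs above can be averaged directly. The sketch below is plain arithmetic on
the ten quoted samples; ContextSwitchDelta and its methods are hypothetical
names. On these samples the mean drop works out to roughly 36%, somewhat above
the quoted 20-30%, but with only five samples per run the figure should be
treated as indicative at best.

```java
import java.util.Arrays;

// Hypothetical helper; the sample values are copied from the vmstat output above.
class ContextSwitchDelta {

    /** Arithmetic mean of a column of vmstat samples. */
    static double mean(int[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    /** Percentage by which `after` is lower than `before`. */
    static double percentDecline(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        int[] withFlush = {13220, 14797, 10976, 11044, 11610};
        int[] withoutFlush = {7603, 6828, 8899, 7915, 8204};

        double before = mean(withFlush);
        double after = mean(withoutFlush);

        System.out.printf("mean cs with flush:    %.1f%n", before);
        System.out.printf("mean cs without flush: %.1f%n", after);
        System.out.printf("decline:               %.1f%%%n",
                percentDecline(before, after));
    }
}
```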