[
https://issues.apache.org/jira/browse/TEZ-2918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984592#comment-14984592
]
Rajesh Balamohan edited comment on TEZ-2918 at 11/2/15 3:22 AM:
----------------------------------------------------------------
Patch introduces volatile updates in inner loops of the runtime. Is it
possible to avoid that as it would hurt performance. For instance,
pipelinedsorter does not have any sync or volatile in its core loop. In other
places like merge, this is one of the reasons for introducing
TEZ_RUNTIME_RECORDS_BEFORE_PROGRESS so that progress is updated every 10000
records or so. TEZ-1491, HADOOP-11029 provide details on the impact on counter
stats with volatile and sync.
Places where this can hurt are
- DefaultSorter's collect code path as it would get executed for every record.
- DefaultSorter's sortAndSpill, as quick sort would internally invoke
progress() lots of times. Instead "progressable.progress()" can be invoked just
before calling sort in sortAndSpill?
- MRReaderMapred.next(), MRReaderMapReduce.next(),
MROutput.KeyValueWriter.write(), OrderedGroupedKVInput.getReader().next(),
PipelinedSorter.collect(), UnorderedKVInput.getReader().next(),
UnorderedKVReader.next(). If needed, should it make use of
TEZ_RUNTIME_RECORDS_BEFORE_PROGRESS as they are hot code paths?
>> In UnorderedPartitionedKVWriter, {code} "writer.append(key, value); // ????
>> Why is outputrecordscounter not updated here?" {code}
- Writer already has reference to outputRecordsCounter. So when it gets closed,
outputRecordsCounter is automatically adjusted.
>> In PipelinedSorter, {code} // ??? why are events not being sent here? {code}
- Final event is sent out by OrderedPartitionedKVOutput itself. So in cases
where only one output is available, there is no need to send any events from
within sorter itself.
was (Author: rajesh.balamohan):
Patch introduces volatile updates in inner loops of the runtime. Is it
possible to avoid that as it would hurt performance. For instance,
pipelinedsorter does not have any sync or volatile in its core loop. In other
places like merge, this is one of the reasons for introducing
TEZ_RUNTIME_RECORDS_BEFORE_PROGRESS so that progress is updated every 10000
records or so. TEZ-1491, HADOOP-11029 refer to counter stats on volatile and
sync. Places where this can hurt are
- DefaultSorter's collect code path as it would get executed for every record.
- DefaultSorter's sortAndSpill, as quick sort would internally invoke
progress() lots of times. Instead "progressable.progress()" can be invoked just
before calling sort in sortAndSpill?
- MRReaderMapred.next(), MRReaderMapReduce.next(),
MROutput.KeyValueWriter.write(), OrderedGroupedKVInput.getReader().next(),
PipelinedSorter.collect(), UnorderedKVInput.getReader().next(),
UnorderedKVReader.next(). If needed, should it make use of
TEZ_RUNTIME_RECORDS_BEFORE_PROGRESS as they are hot code paths?
>> In UnorderedPartitionedKVWriter, {code} "writer.append(key, value); // ????
>> Why is outputrecordscounter not updated here?" {code}
- Writer already has reference to outputRecordsCounter. So when it gets closed,
outputRecordsCounter is automatically adjusted.
>> In PipelinedSorter, {code} // ??? why are events not being sent here? {code}
- Final event is sent out by OrderedPartitionedKVOutput itself. So in cases
where only one output is available, there is no need to send any events from
within sorter itself.
> Make progress notifications in IOs
> ----------------------------------
>
> Key: TEZ-2918
> URL: https://issues.apache.org/jira/browse/TEZ-2918
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Attachments: TEZ-2918.1.patch, TEZ-2918.2.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)