[
https://issues.apache.org/jira/browse/TEZ-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345853#comment-14345853
]
Siddharth Seth commented on TEZ-2001:
-------------------------------------
From the previous review.
bq. Changing remaining to a List from a Set in the Fetcher leads to some
inefficiency - since the size of this list can be ~30, and remove() calls can
be expensive. We may want to fix this later - by using the spillId in the
hashCode - or a wrapping structure for just this.
Follow on jira ?
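To make the spillId suggestion concrete, here is a minimal sketch of a key type that folds the spill id into equals() and hashCode(), so the remaining inputs can go back into a hash-based Set with constant-time remove(). SpillAwareAttemptId, RemainingInputs, and their fields are hypothetical names for illustration only; they are not the actual Tez classes.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for the identifier tracked by the Fetcher.
// Field names are assumptions, not the real Tez class layout.
final class SpillAwareAttemptId {
    final int inputIndex;
    final int attemptNumber;
    final int spillId;

    SpillAwareAttemptId(int inputIndex, int attemptNumber, int spillId) {
        this.inputIndex = inputIndex;
        this.attemptNumber = attemptNumber;
        this.spillId = spillId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof SpillAwareAttemptId)) {
            return false;
        }
        SpillAwareAttemptId other = (SpillAwareAttemptId) o;
        // spillId takes part in equality, so two spills of the same
        // attempt are distinct set members.
        return inputIndex == other.inputIndex
            && attemptNumber == other.attemptNumber
            && spillId == other.spillId;
    }

    @Override
    public int hashCode() {
        return Objects.hash(inputIndex, attemptNumber, spillId);
    }
}

// Hypothetical wrapper around the "remaining" collection.
final class RemainingInputs {
    private final Set<SpillAwareAttemptId> remaining = new HashSet<>();

    boolean add(SpillAwareAttemptId id) {
        return remaining.add(id);
    }

    // Hash lookup instead of the linear scan done by List.remove(Object).
    boolean markFetched(SpillAwareAttemptId id) {
        return remaining.remove(id);
    }

    int size() {
        return remaining.size();
    }
}
```

With a key like this, several spills from one attempt can coexist in the set, which is presumably why a plain Set keyed without the spill id no longer worked.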
bq. For e.g, it would try to fetch from
output/attempt_1418684642047_0006_1_00_000000_0_10003_0/file.out when
OPTIMIZE_LOCAL_FETCH is enabled. I haven't seen any issue here. Am I missing
something?
Sorry, missed that this is based on the PathComponent - so will work. We
probably should have the fetchers use a method from TezTaskOutputFiles to be
more consistent. On the merge side (DiskMerger), things should work as well -
due to the mergeId being used in the filename. If you don't mind, could you
please scan through the OnDiskMerger code to confirm this ?
- Minor: "Speculative execution needs to be turned when using this parameter" -
"off" missing
- ShuffleScheduler - dedupedList.put(inputNumber, id); - Is it possible for id
to have an older revision compared to what's in the oldList ? I think that
check should be in place.
- DefaultSorter - "if (spillRecord == null) { ... else writeIndexFile" - By
this point, indexCacheList will always be populated, which means we could end
up over-writing previously written spill index files.
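For the ShuffleScheduler point above, the guard could look roughly like the sketch below: an entry in the deduped map is replaced only when the incoming id is strictly newer. AttemptRef and Deduper are hypothetical names, and treating the attempt number as the "revision" is an assumption; the real comparison depends on how the patch defines a newer id.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical identifier: attemptNumber is assumed to act as the revision.
final class AttemptRef {
    final int inputNumber;
    final int attemptNumber;

    AttemptRef(int inputNumber, int attemptNumber) {
        this.inputNumber = inputNumber;
        this.attemptNumber = attemptNumber;
    }
}

// Hypothetical holder for the deduped map discussed in the review.
final class Deduper {
    private final Map<Integer, AttemptRef> dedupedList = new HashMap<>();

    // Stores id only if no entry exists for its input, or if the existing
    // entry has a lower attempt number. Returns true when id was stored.
    boolean putIfNewer(AttemptRef id) {
        AttemptRef existing = dedupedList.get(id.inputNumber);
        if (existing != null && existing.attemptNumber >= id.attemptNumber) {
            return false; // keep the newer (or equal) entry already present
        }
        dedupedList.put(id.inputNumber, id);
        return true;
    }

    AttemptRef get(int inputNumber) {
        return dedupedList.get(inputNumber);
    }
}
```

An unconditional put() would let a late-arriving older id replace a newer one; the comparison before the put is the check being asked for.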
Optimize Local Fetch:
The rest looks good to me.
- Future jira: Fail fast - "if (eventInfo != null &&
srcAttemptIdentifier.getAttemptNumber() > 0) {". Probably a repetition, but
even for non deterministic cases, this could be changed to check if a previous
attempt exists or not. For example, task attempts which are KILLED before they
start running. IAC, this is an optimization and a future jira.
> Support pipelined data transfer for ordered output
> --------------------------------------------------
>
> Key: TEZ-2001
> URL: https://issues.apache.org/jira/browse/TEZ-2001
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2001.1.patch, TEZ-2001.2.patch, TEZ-2001.3.patch,
> TEZ-2001.4.patch, TEZ-2001.5.patch, benchmark_q17_10TB.png, dag_plan.jpg
>
>