[ 
https://issues.apache.org/jira/browse/TEZ-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345853#comment-14345853
 ] 

Siddharth Seth commented on TEZ-2001:
-------------------------------------

From the previous review.
bq. Changing remaining to a List from a Set in the Fetcher leads to some 
inefficiency - since the size of this list can be ~30, and remove() calls can 
be expensive. We may want to fix this later - by using the spillId in the 
hashCode - or a wrapping structure for just this.
Follow on jira ?
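
One possible shape for the "wrapping structure" mentioned above, as a rough sketch only. SpillKey and its fields are hypothetical names, not existing Tez classes; the assumption is that an input is uniquely identified by (input index, attempt number, spillId). With the spillId included in equals()/hashCode(), the fetcher could keep "remaining" as a Set again and get constant-time remove() calls.

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

final class SpillKeyDemo {
    /** Hypothetical wrapper: identity includes the spillId, not just the attempt. */
    static final class SpillKey {
        final int inputIndex;
        final int attemptNumber;
        final int spillId;

        SpillKey(int inputIndex, int attemptNumber, int spillId) {
            this.inputIndex = inputIndex;
            this.attemptNumber = attemptNumber;
            this.spillId = spillId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof SpillKey)) return false;
            SpillKey other = (SpillKey) o;
            return inputIndex == other.inputIndex
                && attemptNumber == other.attemptNumber
                && spillId == other.spillId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(inputIndex, attemptNumber, spillId);
        }
    }

    /** Builds the set of spills still to be fetched for one source attempt. */
    static Set<SpillKey> remaining(int inputIndex, int attemptNumber, int spillCount) {
        // LinkedHashSet keeps insertion order, similar to what a List would give.
        Set<SpillKey> set = new LinkedHashSet<>();
        for (int spillId = 0; spillId < spillCount; spillId++) {
            set.add(new SpillKey(inputIndex, attemptNumber, spillId));
        }
        return set;
    }

    public static void main(String[] args) {
        Set<SpillKey> pending = remaining(0, 0, 3);
        // Hash-based removal, no linear scan over the collection.
        pending.remove(new SpillKey(0, 0, 1));
        System.out.println(pending.size());
    }
}
```

Several spills of the same attempt can then coexist in the set, which is what forced the move to a List in the first place.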

bq. For e.g, it would try to fetch from 
output/attempt_1418684642047_0006_1_00_000000_0_10003_0/file.out when 
OPTIMIZE_LOCAL_FETCH is enabled. I haven't seen any issue here. Am I missing 
something?
Sorry, missed that this is based on the PathComponent - so will work. We 
probably should have the fetchers use a method from TezTaskOutputFiles to be 
more consistent. On the merge side (OnDiskMerger), things should work as well, 
since the mergeId is used in the filename. If you don't mind, could you 
please scan through the OnDiskMerger code to confirm this?

- Minor: "Speculative execution needs to be turned when using this parameter" - 
"off" missing
- ShuffleScheduler - dedupedList.put(inputNumber, id); - Is it possible for id 
to have an older revision compared to what's in the oldList ? I think that 
check should be in place.
- DefaultSorter - "if (spillRecord == null) { ... else writeIndexFile" - By 
this point, indexCacheList will always be populated, which means we could end 
up over-writing previously written spill index files.
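
To illustrate the ShuffleScheduler check suggested above, here is a minimal sketch. It assumes "revision" maps to the attempt number, and the Attempt class and putIfNewer method are hypothetical names rather than the actual ShuffleScheduler code. The point is only that the put should be guarded so an older attempt cannot replace a newer one already in the map.

```java
import java.util.HashMap;
import java.util.Map;

final class DedupDemo {
    /** Hypothetical stand-in for an input attempt identifier. */
    static final class Attempt {
        final int inputNumber;
        final int attemptNumber;

        Attempt(int inputNumber, int attemptNumber) {
            this.inputNumber = inputNumber;
            this.attemptNumber = attemptNumber;
        }
    }

    /**
     * Stores the candidate only if no entry exists for its input, or if the
     * existing entry has a lower attempt number. Returns true if stored.
     */
    static boolean putIfNewer(Map<Integer, Attempt> deduped, Attempt candidate) {
        Attempt existing = deduped.get(candidate.inputNumber);
        if (existing != null && existing.attemptNumber >= candidate.attemptNumber) {
            return false; // keep the newer (or equal) attempt already present
        }
        deduped.put(candidate.inputNumber, candidate);
        return true;
    }

    public static void main(String[] args) {
        Map<Integer, Attempt> deduped = new HashMap<>();
        putIfNewer(deduped, new Attempt(7, 1));
        putIfNewer(deduped, new Attempt(7, 0)); // older attempt, ignored
        System.out.println(deduped.get(7).attemptNumber);
    }
}
```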
Optimize Local Fetch: 

The rest looks good to me.

- Future jira: Fail fast - "if (eventInfo != null && 
srcAttemptIdentifier.getAttemptNumber() > 0) {". Probably a repetition, but 
even for non deterministic cases, this could be changed to check if a previous 
attempt exists or not. For example, task attempts which are KILLED before they 
start running. IAC, this is an optimization and a future jira.


> Support pipelined data transfer for ordered output
> --------------------------------------------------
>
>                 Key: TEZ-2001
>                 URL: https://issues.apache.org/jira/browse/TEZ-2001
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2001.1.patch, TEZ-2001.2.patch, TEZ-2001.3.patch, 
> TEZ-2001.4.patch, TEZ-2001.5.patch, benchmark_q17_10TB.png, dag_plan.jpg
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
