[ 
https://issues.apache.org/jira/browse/TEZ-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355446#comment-14355446
 ] 

Siddharth Seth commented on TEZ-1094:
-------------------------------------

Comments on the patch
- UnorderedPartitionedKVWriter - the changes are very confusing. I don't think 
SpillCallable should be dealing with any spill other than the one that was sent 
in. It shouldn't be generating events for all spills. One possible approach 
here would be add parameters to SpillCallable on whether it should be 
generating an event, whether it should be writing the index file, and whether 
it should be sending the event out immediately. isLastSpill - may not be 
required at all if the filename / counter is generated before calling 
SpillCallable. It'll be better to have the logic of sending events at the end 
for pipelining disabled, final merge disabled in the close method itself. This 
should simplify the code quite a bit, and make it easier to understand. 
SpillInfo could also be enhanced to make this simpler.
- UnorderedPartitionedKVWriter: cleanupCurrentBuffer - this method can be used 
in one more place in the close() method.
- UnorderedPartitionedKVWriter: I think emptyPartition information is being 
generated from the global empty list. This would imply that the empty stats for 
later spills will be an aggregate of previous ones.
- UnorderedPartitionedKVWriter: It's possible for multiple threads to be 
spilling at the same time. Not sure if this needs to be factored in anywhere.
- ShuffleInputEventHandlerImpl: Nit: Move "int spillEventId = 
shufflePayload.getSpillId();" inside the hasSpillId check.
- ShuffleManager: " if (shuffleInfoEventsMap.get(srcAttemptIdentifier) == null) 
{" - Why this check instead of directly checking the attemptNumber being 0 ? 
Don't think it's required at the moment, but will be in the future when we 
allow any one single attempt number. For now there's enough checks on attempt 0 
to make this check unnecessary.
- ShuffleManager: registerCompletedInputForPipelinedShuffle - the same check 
falls through. Missing return statement ?
- ShuffleManager: registerCompletedInputForPipelinedShuffle: 
"completedInputSet.add(fetchedInput.getInputAttemptIdentifier().getInputIdentifier());
 .... numCompletedInputs..." This is repeated from registerCompletedInput. 
Possible to split registerCompletedInput into different components and re-use 
the function ? The next comment may complicate this.
- ShuffleManager:registerCompletedInputForPipelinedShuffle: "if 
(!inputReadyNotificationSent.getAndSet(true)) {" - This implies that the input 
ready notification will only be sent out when all spills of at least one source 
have completed, which is an unnecessary delay. Ideally, the Processor should be 
able to start processing the moment a single spill is available.
- ShuffleManager.registerCompletedInputForPipelinedShuffle: "catch 
(InterruptedException e) {" - required ? Can we use completedInputs.add instead 
of put ?
- ShuffleManager: numCompletedInputs : Would be useful to have another 
parameter - numFetchedSpills or some such, that would make the LOG message a 
little more useful.
- Nit: javadoc "Ensure to set tez.runtime.disable.final-merge.in.sorter=false." 
- the property name has changed
- Nit: enable.final-merge-in.output rename to enable.final-merge.in.output 
(similar to the old property name)

> Support pipelined data transfer for Unordered Output
> ----------------------------------------------------
>
>                 Key: TEZ-1094
>                 URL: https://issues.apache.org/jira/browse/TEZ-1094
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-1094.1.patch
>
>
> For unsorted output (and possibly for sorted output), it should be possible 
> to send data in small batches instead of waiting for everything to be 
> generated before transmitting. For now, planning on getting started with 
> UnsortedOutput / Input pairs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to