[ 
https://issues.apache.org/jira/browse/TEZ-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1094:
----------------------------------
    Attachment: TEZ-1094.2.patch

UnorderedPartitionedKVWriter related changes
- SpillCallable is now used only for spilling, not for sending the events.
- Send pipeline events in SpillCallback.onSuccess().
- Send out final events (if any) in close().
- Also, writeLargeRecord was not handled in one case. This is fixed in the 
latest patch.
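
The flow described in the three points above can be sketched roughly as 
follows. This is only an illustration and not code from the patch: 
PipelinedWriterSketch, doSpill(), send() and the event strings are hypothetical 
names, and CompletableFuture stands in for whatever callback mechanism the 
writer actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the writer; none of these names come from the Tez code base.
class PipelinedWriterSketch {
  private final ExecutorService spillExecutor = Executors.newSingleThreadExecutor();
  private final List<CompletableFuture<Void>> pending = new ArrayList<>();
  private final List<String> sentEvents = new ArrayList<>(); // guarded by "this"
  private int nextSpillId = 0;

  // The spill task only spills; it does not send any event.
  private int doSpill(int spillId, byte[] data) {
    // (a real writer would write 'data' to disk here)
    return spillId;
  }

  void spill(byte[] data) {
    final int spillId = nextSpillId++;
    CompletableFuture<Void> done = CompletableFuture
        .supplyAsync(() -> doSpill(spillId, data), spillExecutor)
        // Success callback: the pipelined event goes out only after the spill completed.
        .thenAccept(id -> send("spill-" + id));
    pending.add(done);
  }

  private synchronized void send(String event) {
    sentEvents.add(event);
  }

  // close() waits for outstanding spills, then sends the final event.
  List<String> close() {
    for (CompletableFuture<Void> f : pending) {
      f.join();
    }
    send("final");
    spillExecutor.shutdown();
    try {
      spillExecutor.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
    }
    synchronized (this) {
      return new ArrayList<>(sentEvents);
    }
  }
}
```

Keeping event sending out of the spill task means a failed spill never emits 
an event for data that was not written.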


ShuffleInputEventHandlerImpl: Nit: Move "int spillEventId = 
shufflePayload.getSpillId();" inside the hasSpillId check.
- Fixed

ShuffleManager: " if (shuffleInfoEventsMap.get(srcAttemptIdentifier) == null) 
{" - Why this check instead of directly checking the attemptNumber being 0 ? 
Don't think it's required at the moment, but will be in the future when we 
allow any one single attempt number. For now there's enough checks on attempt 0 
to make this check unnecessary.
- InputAttemptIdentifier doesn't consider spillId in hashCode(). Without this 
check, the existing map entry might be overwritten, losing any previous values.
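
To illustrate the concern, here is a minimal sketch. It assumes that equals() 
ignores spillId in the same way as hashCode(), and all names in it 
(AttemptKeySketch, SpillTrackingSketch, onSpillEvent) are hypothetical, not the 
actual Tez classes.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key type: equality and hashing ignore spillId (an assumption of this sketch).
final class AttemptKeySketch {
  final int inputIndex;
  final int attemptNumber;
  final int spillId;

  AttemptKeySketch(int inputIndex, int attemptNumber, int spillId) {
    this.inputIndex = inputIndex;
    this.attemptNumber = attemptNumber;
    this.spillId = spillId;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof AttemptKeySketch)) {
      return false;
    }
    AttemptKeySketch other = (AttemptKeySketch) o;
    return inputIndex == other.inputIndex && attemptNumber == other.attemptNumber;
  }

  @Override
  public int hashCode() {
    return Objects.hash(inputIndex, attemptNumber);
  }
}

class SpillTrackingSketch {
  // One entry per source attempt, recording which spill ids have been seen.
  private final Map<AttemptKeySketch, BitSet> seen = new HashMap<>();

  void onSpillEvent(AttemptKeySketch key) {
    BitSet spills = seen.get(key);
    if (spills == null) {
      // Create the entry only once; an unconditional put() of a fresh BitSet
      // would discard the spill ids recorded by earlier events.
      spills = new BitSet();
      seen.put(key, spills);
    }
    spills.set(key.spillId);
  }

  int spillsSeen(AttemptKeySketch key) {
    BitSet spills = seen.get(key);
    return spills == null ? 0 : spills.cardinality();
  }
}
```

Because every spill of one attempt maps to the same key, the null check is 
what keeps the earlier spills' bookkeeping intact.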

ShuffleManager: registerCompletedInputForPipelinedShuffle - the same check 
falls through. Missing return statement ?
- Fixed

ShuffleManager: registerCompletedInputForPipelinedShuffle: 
"completedInputSet.add(fetchedInput.getInputAttemptIdentifier().getInputIdentifier());
 .... numCompletedInputs..." This is repeated from registerCompletedInput. 
Possible to split registerCompletedInput into different components and re-use 
the function ? The next comment may complicate this.
ShuffleManager:registerCompletedInputForPipelinedShuffle: "if 
(!inputReadyNotificationSent.getAndSet(true)) {" - This implies that the input 
ready notification will only be sent out when all spills of at least one source 
have completed, which is an unnecessary delay. Ideally, the Processor should be 
able to start processing the moment a single spill is available.
- Split into two functions, maybeInformInputReady() and adjustCompletedInputs(), 
which are reused in registerCompletedInput() and 
registerCompletedInputForPipelinedShuffle().
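
A rough sketch of how such a split could look is shown below. Only the two 
helper names and the two register method names come from the discussion above; 
the class, the fields, the allSpillsFetched parameter and the method bodies are 
assumptions. The sketch follows the reviewer's suggestion of notifying as soon 
as the first spill is available, which may or may not match the patch.

```java
import java.util.BitSet;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical tracker; the fields and bodies are illustrative only.
class CompletionTrackerSketch {
  private final AtomicBoolean inputReadyNotificationSent = new AtomicBoolean(false);
  private final BitSet completedInputSet = new BitSet();
  private int numCompletedInputs = 0;
  private int notifications = 0;

  // Sends the "input ready" notification at most once.
  private void maybeInformInputReady() {
    if (!inputReadyNotificationSent.getAndSet(true)) {
      notifications++; // stand-in for the real notification call
    }
  }

  // Marks a source as complete, counting each source only once.
  private void adjustCompletedInputs(int inputIndex) {
    if (!completedInputSet.get(inputIndex)) {
      completedInputSet.set(inputIndex);
      numCompletedInputs++;
    }
  }

  synchronized void registerCompletedInput(int inputIndex) {
    maybeInformInputReady();
    adjustCompletedInputs(inputIndex);
  }

  synchronized void registerCompletedInputForPipelinedShuffle(int inputIndex,
      boolean allSpillsFetched) {
    maybeInformInputReady(); // the first available spill is enough to notify
    if (!allSpillsFetched) {
      return; // the source is not complete until every spill has been fetched
    }
    adjustCompletedInputs(inputIndex);
  }

  synchronized int getNumCompletedInputs() {
    return numCompletedInputs;
  }

  synchronized int getNotifications() {
    return notifications;
  }
}
```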

ShuffleManager.registerCompletedInputForPipelinedShuffle: "catch 
(InterruptedException e) {" - required ? Can we use completedInputs.add instead 
of put ?
- Fixed
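
For context, a small sketch of the difference, assuming completedInputs is an 
unbounded java.util.concurrent.BlockingQueue (the class and method names here 
are hypothetical). put() may block and therefore declares 
InterruptedException, while add() never blocks and throws 
IllegalStateException only if a capacity-restricted queue is full.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical holder; assumes an unbounded queue, so add() cannot fail for lack of space.
class QueueAddSketch {
  private final BlockingQueue<String> completedInputs = new LinkedBlockingQueue<>();

  // put() forces callers to deal with InterruptedException.
  void registerWithPut(String input) {
    try {
      completedInputs.put(input);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
    }
  }

  // add() needs no try/catch for a checked exception.
  void registerWithAdd(String input) {
    completedInputs.add(input);
  }

  int size() {
    return completedInputs.size();
  }
}
```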

ShuffleManager: numCompletedInputs : Would be useful to have another parameter 
- numFetchedSpills or some such, that would make the LOG message a little more 
useful.
- Can you provide more info on this? I thought printing numFetchedSpills would 
be a little more confusing.

Nit: javadoc "Ensure to set tez.runtime.disable.final-merge.in.sorter=false." - 
the property name has changed
- Fixed

Nit: enable.final-merge-in.output rename to enable.final-merge.in.output 
(similar to the old property name)
- Fixed

> Support pipelined data transfer for Unordered Output
> ----------------------------------------------------
>
>                 Key: TEZ-1094
>                 URL: https://issues.apache.org/jira/browse/TEZ-1094
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-1094.1.patch, TEZ-1094.2.patch
>
>
> For unsorted output (and possibly for sorted output), it should be possible 
> to send data in small batches instead of waiting for everything to be 
> generated before transmitting. For now, planning on getting started with 
> UnsortedOutput / Input pairs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
