[
https://issues.apache.org/jira/browse/TEZ-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated TEZ-1094:
----------------------------------
Attachment: TEZ-1094.2.patch
UnorderedPartitionedKVWriter related changes
- Changed the code to do the following: use SpillCallable only for spilling,
not for sending events.
- Send pipeline events in SpillCallback.onSuccess().
- Send out final events (if any) in close().
- Also, writeLargeRecord was another case that wasn't handled. This is fixed in
the latest patch.
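
The split described above (the spill task only spills, events go out on
successful completion, the final event goes out in close()) could look roughly
like the sketch below. This is an illustration only, not the code in the patch:
PipelinedWriterSketch, doSpill and the event strings are made-up names, and
CompletableFuture.thenAccept stands in for SpillCallback.onSuccess().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the writer; none of these names come from Tez.
class PipelinedWriterSketch {
    private final ExecutorService spillExecutor = Executors.newSingleThreadExecutor();
    private final List<CompletableFuture<Void>> pendingSpills = new ArrayList<>();
    private final List<String> sentEvents =
        Collections.synchronizedList(new ArrayList<String>());
    private int nextSpillId = 0;

    // The spill task only writes data and reports which spill it finished.
    // It does not send any events itself.
    private int doSpill(int spillId) {
        return spillId; // a real writer would flush a buffer to disk here
    }

    void spill() {
        final int spillId = nextSpillId++;
        CompletableFuture<Void> done = CompletableFuture
            .supplyAsync(() -> doSpill(spillId), spillExecutor)
            // Plays the role of the success callback: the pipelined event
            // is sent only after the spill has completed successfully.
            .thenAccept(id -> sentEvents.add("pipelined-event-for-spill-" + id));
        pendingSpills.add(done);
    }

    // close() waits for outstanding spills, then sends the final event.
    List<String> close() {
        CompletableFuture.allOf(pendingSpills.toArray(new CompletableFuture<?>[0])).join();
        sentEvents.add("final-event");
        spillExecutor.shutdown();
        return new ArrayList<>(sentEvents);
    }
}
```

Calling spill() twice and then close() yields one event per spill followed by
the final event.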
ShuffleInputEventHandlerImpl: Nit: Move "int spillEventId =
shufflePayload.getSpillId();" inside the hasSpillId check.
- Fixed
ShuffleManager: "if (shuffleInfoEventsMap.get(srcAttemptIdentifier) == null) {"
- Why this check instead of directly checking whether attemptNumber is 0? I
don't think it's required at the moment, but it will be in the future when we
allow any single attempt number. For now there are enough checks on attempt 0
to make this check unnecessary.
- InputAttemptIdentifier doesn't include spillId in its hashCode. Without this
check, a later put might overwrite the previously stored value.
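
To illustrate the overwrite concern, here is a small sketch. AttemptKey is a
hypothetical stand-in for InputAttemptIdentifier (its fields and the map's
value type are assumptions); the only property carried over from the
discussion is that equals/hashCode ignore spillId.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key: equality and hashing deliberately ignore spillId.
final class AttemptKey {
    final int inputIndex;
    final int attemptNumber;
    final int spillId;

    AttemptKey(int inputIndex, int attemptNumber, int spillId) {
        this.inputIndex = inputIndex;
        this.attemptNumber = attemptNumber;
        this.spillId = spillId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof AttemptKey)) return false;
        AttemptKey other = (AttemptKey) o;
        return inputIndex == other.inputIndex && attemptNumber == other.attemptNumber;
    }

    @Override
    public int hashCode() {
        return Objects.hash(inputIndex, attemptNumber);
    }
}

class SpillMapSketch {
    // Unconditional put: the second spill of the same attempt replaces the first value.
    static String registerUnconditionally() {
        Map<AttemptKey, String> infoMap = new HashMap<>();
        infoMap.put(new AttemptKey(0, 0, 0), "info created for spill 0");
        infoMap.put(new AttemptKey(0, 0, 1), "info created for spill 1");
        return infoMap.get(new AttemptKey(0, 0, 0));
    }

    // Guarded put: the null check keeps the value created for the first spill.
    static String registerWithNullCheck() {
        Map<AttemptKey, String> infoMap = new HashMap<>();
        for (int spillId = 0; spillId < 2; spillId++) {
            AttemptKey key = new AttemptKey(0, 0, spillId);
            if (infoMap.get(key) == null) {
                infoMap.put(key, "info created for spill " + spillId);
            }
        }
        return infoMap.get(new AttemptKey(0, 0, 0));
    }
}
```

With the guard, the entry created for the first spill survives; without it,
the entry is silently replaced when the next spill of the same attempt arrives.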
ShuffleManager: registerCompletedInputForPipelinedShuffle - the same check
falls through. Missing return statement?
- Fixed
ShuffleManager: registerCompletedInputForPipelinedShuffle:
"completedInputSet.add(fetchedInput.getInputAttemptIdentifier().getInputIdentifier());
.... numCompletedInputs..." This is repeated from registerCompletedInput.
Is it possible to split registerCompletedInput into separate components and
reuse the function? The next comment may complicate this.
ShuffleManager:registerCompletedInputForPipelinedShuffle: "if
(!inputReadyNotificationSent.getAndSet(true)) {" - This implies that the input
ready notification will only be sent out when all spills of at least one source
have completed, which is an unnecessary delay. Ideally, the Processor should be
able to start processing the moment a single spill is available.
- Split into 2 functions maybeInformInputReady(), adjustCompletedInputs() which
are reused in registerCompletedInput() &
registerCompletedInputForPipelinedShuffle().
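
A rough sketch of how the two helpers might be shared is below. Only the four
method names and the getAndSet(true) idiom come from this thread; the class
name, the int identifiers, the lastSpill flag and the point at which each
helper is called are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical tracker; not the actual ShuffleManager.
class CompletionTrackerSketch {
    private final AtomicBoolean inputReadyNotificationSent = new AtomicBoolean(false);
    private final Set<Integer> completedInputSet = new HashSet<>();
    private int numCompletedInputs = 0;
    private int notificationsSent = 0;

    // Sends the "input ready" notification at most once.
    private void maybeInformInputReady() {
        if (!inputReadyNotificationSent.getAndSet(true)) {
            notificationsSent++; // a real implementation would notify the input context
        }
    }

    // Counts each source input once, however often it is reported.
    private void adjustCompletedInputs(int inputIdentifier) {
        if (completedInputSet.add(inputIdentifier)) {
            numCompletedInputs++;
        }
    }

    synchronized void registerCompletedInput(int inputIdentifier) {
        adjustCompletedInputs(inputIdentifier);
        maybeInformInputReady();
    }

    // Assumed behaviour: notify as soon as any spill is available, but only
    // count the source as complete once its last spill has been fetched.
    synchronized void registerCompletedInputForPipelinedShuffle(int inputIdentifier,
                                                                boolean lastSpill) {
        maybeInformInputReady();
        if (lastSpill) {
            adjustCompletedInputs(inputIdentifier);
        }
    }

    synchronized int getNumCompletedInputs() { return numCompletedInputs; }
    synchronized int getNotificationsSent() { return notificationsSent; }
}
```

In this arrangement the Processor can be informed after the first spill, while
numCompletedInputs still advances once per source.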
ShuffleManager.registerCompletedInputForPipelinedShuffle: "catch
(InterruptedException e) {" - Is this required? Can we use completedInputs.add
instead of put?
- Fixed
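
For reference, the difference between put and add on a java.util.concurrent
BlockingQueue can be sketched as follows. This assumes completedInputs is an
unbounded BlockingQueue; the class name and the String element type are
placeholders.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical holder for the queue; element type is a placeholder.
class CompletedInputsSketch {
    // No capacity argument, so the queue is effectively unbounded.
    private final BlockingQueue<String> completedInputs = new LinkedBlockingQueue<>();

    // put() may block when a bounded queue is full, so it declares
    // InterruptedException and forces a catch block on the caller.
    void registerWithPut(String fetchedInput) {
        try {
            completedInputs.put(fetchedInput);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while registering input", e);
        }
    }

    // add() never blocks; it throws IllegalStateException only if a
    // capacity-restricted queue is full, so no catch block is needed here.
    void registerWithAdd(String fetchedInput) {
        completedInputs.add(fetchedInput);
    }

    int size() {
        return completedInputs.size();
    }
}
```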
ShuffleManager: numCompletedInputs : Would be useful to have another parameter
- numFetchedSpills or some such, that would make the LOG message a little more
useful.
- Can you provide more info on this? I thought printing numFetchedSpills would
be a little more confusing.
Nit: javadoc "Ensure to set tez.runtime.disable.final-merge.in.sorter=false." -
the property name has changed
- Fixed
Nit: rename enable.final-merge-in.output to enable.final-merge.in.output
(similar to the old property name)
- Fixed
> Support pipelined data transfer for Unordered Output
> ----------------------------------------------------
>
> Key: TEZ-1094
> URL: https://issues.apache.org/jira/browse/TEZ-1094
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1094.1.patch, TEZ-1094.2.patch
>
>
> For unsorted output (and possibly for sorted output), it should be possible
> to send data in small batches instead of waiting for everything to be
> generated before transmitting. For now, planning on getting started with
> UnsortedOutput / Input pairs.