[
https://issues.apache.org/jira/browse/TEZ-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355446#comment-14355446
]
Siddharth Seth commented on TEZ-1094:
-------------------------------------
Comments on the patch
- UnorderedPartitionedKVWriter - the changes are very confusing. I don't think
SpillCallable should be dealing with any spill other than the one that was sent
in. It shouldn't be generating events for all spills. One possible approach
here would be add parameters to SpillCallable on whether it should be
generating an event, whether it should be writing the index file, and whether
it should be sending the event out immediately. isLastSpill - may not be
required at all if the filename / counter is generated before calling
SpillCallable. It'll be better to have the logic of sending events at the end
for pipelining disabled, final merge disabled in the close method itself. This
should simplify the code quite a bit, and make it easier to understand.
SpillInfo could also be enhanced to make this simpler.
- UnorderedPartitionedKVWriter: cleanupCurrentBuffer - this method can be used
in one more place in the close() method.
- UnorderedPartitionedKVWriter: I think emptyPartition information is being
generated from the global empty list. This would imply that the empty stats for
later spills will be an aggregate of previous ones.
- UnorderedPartitionedKVWriter: It's possible for multiple threads to be
spilling at the same time. Not sure if this needs to be factored in anywhere.
- ShuffleInputEventHandlerImpl: Nit: Move "int spillEventId =
shufflePayload.getSpillId();" inside the hasSpillId check.
- ShuffleManager: " if (shuffleInfoEventsMap.get(srcAttemptIdentifier) == null)
{" - Why this check instead of directly checking the attemptNumber being 0 ?
Don't think it's required at the moment, but will be in the future when we
allow any one single attempt number. For now there's enough checks on attempt 0
to make this check unnecessary.
- ShuffleManager: registerCompletedInputForPipelinedShuffle - the same check
falls through. Missing return statement ?
- ShuffleManager: registerCompletedInputForPipelinedShuffle:
"completedInputSet.add(fetchedInput.getInputAttemptIdentifier().getInputIdentifier());
.... numCompletedInputs..." This is repeated from registerCompletedInput.
Possible to split registerCompletedInput into different components and re-use
the function ? The next comment may complicate this.
- ShuffleManager:registerCompletedInputForPipelinedShuffle: "if
(!inputReadyNotificationSent.getAndSet(true)) {" - This implies that the input
ready notification will only be sent out when all spills of at least one source
have completed, which is an unnecessary delay. Ideally, the Processor should be
able to start processing the moment a single spill is available.
- ShuffleManager.registerCompletedInputForPipelinedShuffle: "catch
(InterruptedException e) {" - required ? Can we use completedInputs.add instead
of put ?
- ShuffleManager: numCompletedInputs : Would be useful to have another
parameter - numFetchedSpills or some such, that would make the LOG message a
little more useful.
- Nit: javadoc "Ensure to set tez.runtime.disable.final-merge.in.sorter=false."
- the property name has changed
- Nit: enable.final-merge-in.output rename to enable.final-merge.in.output
(similar to the old property name)
> Support pipelined data transfer for Unordered Output
> ----------------------------------------------------
>
> Key: TEZ-1094
> URL: https://issues.apache.org/jira/browse/TEZ-1094
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1094.1.patch
>
>
> For unsorted output (and possibly for sorted output), it should be possible
> to send data in small batches instead of waiting for everything to be
> generated before transmitting. For now, planning on getting started with
> UnsortedOutput / Input pairs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)