johnyangk commented on a change in pull request #232: [NEMO-411] Bug in
ScheduleGroupPass, OutputTag, DuplicateEdgeGroup
URL: https://github.com/apache/incubator-nemo/pull/232#discussion_r327450874
##########
File path: common/src/main/java/org/apache/nemo/common/ir/IRDAGChecker.java
##########
@@ -471,9 +472,11 @@ private boolean isConnectedToStreamVertex(final IREdge
irEdge) {
return irEdge.getDst() instanceof RelayVertex || irEdge.getSrc()
instanceof RelayVertex;
}
- private Map<Optional<String>, List<IREdge>>
groupOutEdgesByAdditionalOutputTag(final List<IREdge> outEdges) {
+ private final AtomicInteger distinctIntegerForEmptyOutputTag = new
AtomicInteger(0);
+ private Map<String, List<IREdge>> groupOutEdgesByAdditionalOutputTag(final
List<IREdge> outEdges) {
return outEdges.stream().collect(Collectors.groupingBy(
- (outEdge -> outEdge.getPropertyValue(AdditionalOutputTagProperty.class)),
+ (outEdge -> outEdge.getPropertyValue(AdditionalOutputTagProperty.class)
+
.orElse(String.valueOf(distinctIntegerForEmptyOutputTag.getAndIncrement()))),
Review comment:
Two potential issues with this change:
(1)
PCollection x;
PCollection y = x.map(..);
PCollection z = x.map(..);
When this application is converted to an IRDAG, x->y and x->z become two
IREdges with an empty output tag. With this change the two edges won't be
grouped, so `addEncodingCompressionCheckers()` is no longer able to catch bugs,
for example the two edges having different encoders. This change is also
problematic when later we apply the `DuplicateEdgeGroupProperty` optimizations
to non-loops such as the simple application above.
(2)
PCollection x;
TupleTag tt = "0";
PCollection y/z = x.do(emitMain(), emitAdditional())
With this change two IREdges x->y and x->z will be grouped, even though they
produce different data types.
Perhaps `distinctIntegerForEmptyOutputTag` is needed only when specifically
handling loops, and a separate method specific to loops would help to avoid the
issues above?
Would https://issues.apache.org/jira/browse/NEMO-348 help? Assigning the
same `OutputTag` for main outputs early on in the Beam frontend.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services