[ https://issues.apache.org/jira/browse/BEAM-10395?focusedWorklogId=456893&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-456893 ]
ASF GitHub Bot logged work on BEAM-10395:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 09/Jul/20 23:48
Start Date: 09/Jul/20 23:48
Worklog Time Spent: 10m
Work Description: ihji commented on a change in pull request #12144:
URL: https://github.com/apache/beam/pull/12144#discussion_r452549676
##########
File path: runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/PackageUtil.java
##########
@@ -311,8 +315,26 @@ public DataflowPackage stageToFile(
     CompletionStage<StagingResult> stagingResult =
         computePackageAttributes(source, hash, dest, stagingPath)
             .thenComposeAsync(
-                packageAttributes ->
-                    stagePackage(packageAttributes, retrySleeper, createOptions));
+                packageAttributes -> {
+                  String destLocation = packageAttributes.getDestination().getLocation();
+                  String existingHash =
+                      distinctDestinations.putIfAbsent(destLocation, packageAttributes.getHash());
+                  if (existingHash == null) {
+                    return stagePackage(packageAttributes, retrySleeper, createOptions);
+                  } else {
+                    if (!existingHash.equals(packageAttributes.getHash())) {
+                      LOG.warn(
+                          "Upload of {} would overwrite {} with different content",
+                          packageAttributes.getSource(),
+                          destLocation);
Review comment:
Two files should not have the same destination in the first place. They
are deduplicated by suffixing the file name with its content hash when the
dependency information is created:
https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java#L320
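To illustrate the hash-suffix idea, here is a minimal sketch. It is not the code in Environments.java: the class name HashedStagingName, both helper methods, the choice of SHA-256, and the "name-hash.ext" layout are assumptions made for this example only.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Beam: derives a staged file name that
// embeds the content hash, so different contents never share a destination.
final class HashedStagingName {

  /** Returns the lowercase hex SHA-256 digest of the given bytes. */
  static String sha256Hex(byte[] content) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest(content)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  /** Inserts the hash before the extension: "dep.jar" becomes "dep-<hash>.jar". */
  static String uniqueName(String fileName, String contentHash) {
    int dot = fileName.lastIndexOf('.');
    if (dot <= 0) {
      // No extension (or a dotfile): append the hash at the end.
      return fileName + "-" + contentHash;
    }
    return fileName.substring(0, dot) + "-" + contentHash + fileName.substring(dot);
  }
}
```

With names built this way, two files called dep.jar with different contents get different destinations, while two identical copies map to the same one.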
----------------------------------------------------------------
Issue Time Tracking
-------------------
Worklog Id: (was: 456893)
Time Spent: 50m (was: 40m)
> Dataflow runner should deduplicate files to stage by destination
> -----------------------------------------------------------------
>
> Key: BEAM-10395
> URL: https://issues.apache.org/jira/browse/BEAM-10395
> Project: Beam
> Issue Type: Improvement
> Components: runner-dataflow
> Reporter: Steve Niemitz
> Assignee: Steve Niemitz
> Priority: P2
> Time Spent: 50m
> Remaining Estimate: 0h
>
> If a pipeline contains multiple files with the same destination path, the
> Dataflow runner will try to stage them all in parallel, and the upload
> usually fails because the uploads conflict.
> The runner should upload only one file per destination, and ideally check
> that the sources are the same as well.
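The behavior requested above can be sketched on its own, outside the runner. This is an illustrative sketch only: the class DestinationDeduper, its Decision enum, and the claim method are hypothetical names, and only the use of ConcurrentMap.putIfAbsent mirrors the diff under review.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch, not Beam code: lets exactly one caller stage a given
// destination, even when several staging tasks run in parallel.
final class DestinationDeduper {

  enum Decision {
    STAGE,           // first claim on this destination: upload the file
    SKIP_DUPLICATE,  // same destination, same content hash: nothing to do
    SKIP_CONFLICT    // same destination, different content: warn, do not upload
  }

  // Maps each destination path to the content hash that first claimed it.
  private final ConcurrentMap<String, String> hashByDestination = new ConcurrentHashMap<>();

  Decision claim(String destination, String contentHash) {
    // putIfAbsent is atomic, so only one thread sees null for a given key.
    String existing = hashByDestination.putIfAbsent(destination, contentHash);
    if (existing == null) {
      return Decision.STAGE;
    }
    return existing.equals(contentHash) ? Decision.SKIP_DUPLICATE : Decision.SKIP_CONFLICT;
  }
}
```

A caller would upload only on STAGE and log a warning on SKIP_CONFLICT, since that case means two different sources were given the same destination.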