[
https://issues.apache.org/jira/browse/BEAM-6955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845921#comment-16845921
]
Daniel Lescohier commented on BEAM-6955:
----------------------------------------
The modification is the bug fix for BEAM-6952 (which I created, but I haven't
finished the work on the PR needed to get it upstreamed). I believe we
needed it for reading Google Doubleclick data files, which consist of two
concatenated gzip files: the CSV header line in one, and all the data
lines in the other. Our patched version also contains this ticket's fix.
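To illustrate the kind of file described above, here is a minimal sketch using
only the Python standard library. It is not Beam's code and not the BEAM-6952
patch; the sample data and the helper name decompress_all_members are made up
for illustration. It shows that a single zlib decompressor stops at the end of
the first gzip member, so a reader has to start a new decompressor on the
leftover bytes to see the data lines.

```python
import gzip
import zlib

# Two independent gzip members concatenated into one file, mimicking the
# layout described above (header in one member, data lines in the other).
# The contents are invented sample data.
header_member = gzip.compress(b"id,clicks\n")
data_member = gzip.compress(b"1,10\n2,20\n")
blob = header_member + data_member

# A single decompressor (16 + MAX_WBITS selects the gzip container format)
# stops at the end of the first member; the rest is left in unused_data.
single = zlib.decompressobj(16 + zlib.MAX_WBITS)
first_only = single.decompress(blob)


def decompress_all_members(data):
    """Hypothetical helper: decompress every concatenated gzip member."""
    chunks = []
    while data:
        member = zlib.decompressobj(16 + zlib.MAX_WBITS)
        chunks.append(member.decompress(data))
        chunks.append(member.flush())
        data = member.unused_data  # bytes belonging to the next member
    return b"".join(chunks)


everything = decompress_all_members(blob)
```

The standard library's gzip.decompress() already handles multi-member input,
so the loop is only needed when working with raw zlib decompressor objects.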
This came up when we upgraded from Apache Beam 2.2 to 2.11 because of the
Dataflow retirement notice for Beam < 2.5. Version 2.2 used different logic
from what I implemented in this PR, but it behaved similarly: it found the
Docker image based on a portion of the version number, not the complete
version number. Version 2.11, however, used the whole version number, so we
couldn't launch Dataflow jobs any more. I investigated the problem and found
the change that made it stop using only a portion of the version number to
determine the Docker image name.
I didn't know about the --worker_harness_container_image option, but I don't
think it's a good user interface to require you to set two options. If you're
passing --sdk_location, I think it should just find the correct image for you,
without having to know about and specify a second option.
> Support Dataflow --sdk_location with modified version number
> ------------------------------------------------------------
>
> Key: BEAM-6955
> URL: https://issues.apache.org/jira/browse/BEAM-6955
> Project: Beam
> Issue Type: Bug
> Components: runner-dataflow
> Affects Versions: 2.11.0
> Reporter: Daniel Lescohier
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Support Dataflow --sdk_location with modified version number
> Determine the version tag to use for the Google Container Registry, for the
> service image versions to use on the Dataflow worker nodes. Users of Dataflow
> may be using a locally-modified version of Apache Beam, which they submit to
> Dataflow with the --sdk_location option. Those users would most likely modify
> the version number of Apache Beam, so they can distinguish it from the public
> distribution of Apache Beam. However, the remote nodes in Dataflow still need
> to bootstrap the worker service with a Docker image for which a version tag
> exists.
> The most appropriate way for system integrators to modify the Apache Beam
> version number would be to add a Local Version Identifier:
> https://www.python.org/dev/peps/pep-0440/#local-version-identifiers
> If people only use Local Version Identifiers, then we could use the "public"
> attribute of the pkg_resources version object.
> If people instead use a post-release version identifier:
> https://www.python.org/dev/peps/pep-0440/#post-releases then only the
> "base_version" attribute would work for both of these version number changes.
> Since Dataflow documentation does not specify how to modify version numbers,
> I am choosing to use the "base_version" attribute.
> Will shortly submit a PR with the change.
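The difference between the two attributes discussed in the quoted description
can be sketched as follows. This is an illustration, not the code from the PR:
the example version strings are made up, and the import falls back from the
standalone packaging library to pkg_resources (which the description mentions)
only so the snippet runs in either environment.

```python
try:
    from packaging.version import Version as parse_version
except ImportError:
    # pkg_resources returns the same kind of PEP 440 version object.
    from pkg_resources import parse_version

# Hypothetical locally-modified version numbers, for illustration only.
local_id = parse_version("2.11.0+mycompany.1")  # Local Version Identifier
post_release = parse_version("2.11.0.post1")    # post-release identifier

# "public" drops only the local part after "+".
local_public = local_id.public
post_public = post_release.public

# "base_version" keeps only the release segment, so both map to 2.11.0.
local_base = local_id.base_version
post_base = post_release.base_version

print(local_public, post_public, local_base, post_base)
```

With these inputs, "public" yields 2.11.0 for the local identifier but
2.11.0.post1 for the post-release, while "base_version" yields 2.11.0 for
both, which matches the reasoning given in the description.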
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)