[
https://issues.apache.org/jira/browse/BEAM-3883?focusedWorklogId=100789&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-100789
]
ASF GitHub Bot logged work on BEAM-3883:
----------------------------------------
Author: ASF GitHub Bot
Created on: 10/May/18 19:44
Start Date: 10/May/18 19:44
Worklog Time Spent: 10m
Work Description: tvalentyn commented on issue #5251: [BEAM-3883]
Refactor and clean dependency.py to make it reusable with artifact service
URL: https://github.com/apache/beam/pull/5251#issuecomment-388163831
I think we can simplify the logic here. During resource staging for the Python
SDK we need to stage pipeline artifacts (the SDK, the workflow main session,
requirements.txt, cached versions of the packages in requirements.txt, and
possibly a workflow tarball).
Staging happens in two steps:
1. If the artifact is not local, download it from its source location to a
local temp folder.
2. Stage the artifact from the local folder into the staging location.
Depending on the execution environment, we can have different staging
locations:
- All portable runners, including the ULR, should stage artifacts to an
Artifact Server over gRPC.
- The Dataflow runner for non-portable pipelines needs to stage artifacts to a
GCS bucket.
Step 2 needs to differ between stager implementations, but step 1 does not. I
don't see a reason to implement pre-staging the dependencies separately for
each stager. Being able to stage the SDK from GCS, HTTP, or another location
is a capability of the SDK, regardless of the runner, so I think it should be
common. We can support other locations as well when the need arises.
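The shared step 1 could look something like the following sketch. The function name `fetch_artifact` and its details are assumptions for illustration; a real implementation would plug in a GCS client for gs:// paths.

```python
import os
import shutil
import tempfile
import urllib.request


def fetch_artifact(path, temp_dir):
    """Ensures an artifact is available locally, downloading it if needed.

    `path` may be a local file, an HTTP(S) URL, or (with a real GCS client)
    a gs:// path. Returns a local file path ready for step 2 (staging).
    """
    if os.path.isfile(path):
        return path  # Already local; nothing to download.
    local_path = os.path.join(temp_dir, os.path.basename(path))
    if path.startswith(('http://', 'https://')):
        with urllib.request.urlopen(path) as response, \
             open(local_path, 'wb') as f:
            shutil.copyfileobj(response, f)
        return local_path
    if path.startswith('gs://'):
        # A real implementation would use a GCS client library here.
        raise NotImplementedError('GCS download not sketched here.')
    raise ValueError('Unsupported artifact location: %s' % path)
```

Because this step is the same for every stager, it can live in the common base class while step 2 stays runner-specific.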
That said, I suggest the following sketch for the abstractions:
```
class Stager(object):
  def stage_job_resources(self, options, temp_dir=None, ...):
    """Materializes all resources to be staged in a local folder, stages
    artifacts one by one using stage_artifact(), and calls commit_manifest()
    at the end.

    SDK: will be staged from PyPI if sdk_location=default, otherwise from
    sdk_location. sdk_location can be a path to a local directory, a GCS
    path, or an HTTP URL. Extra packages are staged from a local directory.
    Packages from requirements.txt are downloaded from PyPI into a temporary
    folder, then staged to the staging location.
    ...
    """
    # Move existing functionality from dependency.stage_job_resources() here.

  def stage_artifact(self, local_path_to_artifact, artifact_name):
    """Stages the artifact to self._staging_location; if successful, returns
    True and adds artifact_name to the manifest of staged artifacts.
    """
    raise NotImplementedError

  def commit_manifest(self):
    """Commits the manifest through the Artifact API."""
    raise NotImplementedError


class GcsStager(Stager):
  # Stager for legacy Dataflow pipelines.
  def stage_artifact(self, local_path_to_artifact, artifact_name):
    # Check that self._staging_location is a GCS bucket.
    # Copy the artifact to gs://<staging_location>/<artifact_name> + suffix.

  def commit_manifest(self):
    pass  # No need to do anything here for legacy pipelines.


class ArtifactServerStager(Stager):
  def stage_artifact(self, local_path_to_artifact, artifact_name):
    # Implementation that talks to the Artifact Server via the Fn Artifact API.

  def commit_manifest(self):
    # Implementation that talks to the Artifact Server via the Fn Artifact API.
```
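To illustrate how a runner would pick an implementation, here is a minimal runnable sketch of the selection logic. The stub classes below stand in for the sketched abstractions, and `get_stager` is a hypothetical helper, not an existing Beam function:

```python
class Stager(object):
    def __init__(self, staging_location):
        self._staging_location = staging_location


class GcsStager(Stager):
    """Stub: stager for legacy (non-portable) Dataflow pipelines."""


class ArtifactServerStager(Stager):
    """Stub: stager that talks to an Artifact Server over gRPC."""


def get_stager(runner, staging_location):
    # Only legacy Dataflow stages directly to GCS; every portable runner
    # (including the ULR) goes through the Artifact Server.
    if runner == 'DataflowRunner':
        return GcsStager(staging_location)
    return ArtifactServerStager(staging_location)
```

The point is that all runner-specific behavior is confined to the choice of subclass; stage_job_resources() and the step-1 download logic stay common.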
What do you think?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 100789)
Time Spent: 6h 40m (was: 6.5h)
> Python SDK stages artifacts when talking to job server
> ------------------------------------------------------
>
> Key: BEAM-3883
> URL: https://issues.apache.org/jira/browse/BEAM-3883
> Project: Beam
> Issue Type: Sub-task
> Components: sdk-py-core
> Reporter: Ben Sidhom
> Assignee: Ankur Goenka
> Priority: Major
> Time Spent: 6h 40m
> Remaining Estimate: 0h
>
> The Python SDK does not currently stage its user-defined functions or
> dependencies when talking to the job API. Artifacts that need to be staged
> include the user code itself, any SDK components not included in the
> container image, and the list of Python packages that must be installed at
> runtime.
>
> Artifacts that are currently expected can be found in the harness boot code:
> [https://github.com/apache/beam/blob/58e3b06bee7378d2d8db1c8dd534b415864f63e1/sdks/python/container/boot.go#L52].
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)