[ https://issues.apache.org/jira/browse/BEAM-3883?focusedWorklogId=100789&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-100789 ]

ASF GitHub Bot logged work on BEAM-3883:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 10/May/18 19:44
            Start Date: 10/May/18 19:44
    Worklog Time Spent: 10m 
      Work Description: tvalentyn commented on issue #5251: [BEAM-3883] 
Refactor and clean dependency.py to make it reusable with artifact service
URL: https://github.com/apache/beam/pull/5251#issuecomment-388163831
 
 
   I think we can simplify the logic here. During resource staging for the 
Python SDK we need to stage pipeline artifacts (the SDK, the workflow main 
session, requirements.txt, cached versions of the packages in 
requirements.txt, and maybe a workflow tarball). 
   
   Staging happens in two steps: 
   
   1. If the artifact is not local, download the artifact from artifact source 
location to a local temp folder.
   2. Stage the artifact from local folder into the staging location.
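   A minimal sketch of step 1, assuming a hypothetical `fetch_artifact` helper 
(the name and the HTTP/local handling are mine, not existing dependency.py 
code; a real version would also handle gs:// paths):

   ```python
   import os
   import shutil
   import urllib.request


   def fetch_artifact(source, temp_dir):
       # Hypothetical step-1 helper: make sure the artifact exists in a local
       # folder. Local files are copied into temp_dir; HTTP(S) URLs are
       # downloaded. (A real implementation would also handle gs:// paths.)
       local_path = os.path.join(temp_dir, os.path.basename(source))
       if source.startswith(('http://', 'https://')):
           urllib.request.urlretrieve(source, local_path)
       else:
           shutil.copyfile(source, local_path)
       return local_path
   ```

   Because this step is the same for every stager, it can live once in the 
common code and every stager implementation gets it for free.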
   
   Depending on the execution environment, we can have different staging 
locations: 
   - All portable runners, including the ULR, should stage artifacts to an 
Artifact Server over gRPC. 
   - The Dataflow runner for non-portable pipelines needs to stage artifacts 
to a GCS bucket. 
   
   Step 2 needs to be different for different stager implementations, but the 
first step does not. I don't see a reason to implement prestaging of the 
dependencies separately for each stager. Being able to stage the SDK from GCS, 
HTTP, or another location is a capability of the SDK, regardless of the 
runner, so I think it should be common. We can support other locations as well 
when the need arises.
   
   That said, I suggest the following sketch for the abstractions:
   
   ```
   class Stager(object):
     def stage_job_resources(self, options, temp_dir=None, ...):
       """Materializes all resources to be staged in a local folder, stages
       artifacts one by one using stage_artifact(), and calls commit_manifest()
       at the end.
       The SDK will be staged from PyPI if sdk_location=default, otherwise from
       sdk_location, which can be a path to a local directory, a GCS path, or
       an HTTP URL.
       Extra packages are staged from a local directory.
       Packages from requirements.txt are downloaded from PyPI into a temporary
       folder, then staged to the staging location.
       ...
       """
       # Move the existing functionality from dependency.stage_job_resources()
       # here.

     def stage_artifact(self, local_path_to_artifact, artifact_name):
       """Stages the artifact to self._staging_location; if successful, returns
       True and adds artifact_name to the manifest of staged artifacts.
       """
       raise NotImplementedError

     def commit_manifest(self):
       """Commits the manifest through the Artifact API."""
       raise NotImplementedError


   class GcsStager(Stager):
     # Stager for legacy Dataflow pipelines.
     def stage_artifact(self, local_path_to_artifact, artifact_name):
       # Check that self._staging_location is a GCS bucket, then copy the
       # artifact to gs://self._staging_location/artifact_name + some_suffix.

     def commit_manifest(self):
       pass  # No need to do anything here for legacy pipelines.


   class ArtifactServerStager(Stager):
     def stage_artifact(self, local_path_to_artifact, artifact_name):
       # Implementation that talks to the Artifact Server via the Fn Artifact
       # API.

     def commit_manifest(self):
       # Implementation that talks to the Artifact Server via the Fn Artifact
       # API.
   ```
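   
   To make the interface concrete, here is a tiny runnable sketch with a 
`LocalDirStager` I made up purely to exercise the base class (it is not one of 
the proposed stagers):

   ```python
   import os
   import shutil


   class Stager(object):
       # Base class from the sketch above; stage_job_resources() is omitted
       # for brevity.
       def __init__(self, staging_location):
           self._staging_location = staging_location
           self._manifest = []

       def stage_artifact(self, local_path_to_artifact, artifact_name):
           raise NotImplementedError

       def commit_manifest(self):
           raise NotImplementedError


   class LocalDirStager(Stager):
       # Hypothetical stager used here only to exercise the interface: it
       # "stages" artifacts by copying them into a local directory.
       def stage_artifact(self, local_path_to_artifact, artifact_name):
           shutil.copyfile(
               local_path_to_artifact,
               os.path.join(self._staging_location, artifact_name))
           self._manifest.append(artifact_name)
           return True

       def commit_manifest(self):
           return list(self._manifest)
   ```

   A runner would pick GcsStager or ArtifactServerStager the same way, based 
on the pipeline options, while the download/prestaging logic stays shared.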
   
   What do you think?
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 100789)
    Time Spent: 6h 40m  (was: 6.5h)

> Python SDK stages artifacts when talking to job server
> ------------------------------------------------------
>
>                 Key: BEAM-3883
>                 URL: https://issues.apache.org/jira/browse/BEAM-3883
>             Project: Beam
>          Issue Type: Sub-task
>          Components: sdk-py-core
>            Reporter: Ben Sidhom
>            Assignee: Ankur Goenka
>            Priority: Major
>          Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> The Python SDK does not currently stage its user-defined functions or 
> dependencies when talking to the job API. Artifacts that need to be staged 
> include the user code itself, any SDK components not included in the 
> container image, and the list of Python packages that must be installed at 
> runtime.
>  
> Artifacts that are currently expected can be found in the harness boot code: 
> [https://github.com/apache/beam/blob/58e3b06bee7378d2d8db1c8dd534b415864f63e1/sdks/python/container/boot.go#L52]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
