[ https://issues.apache.org/jira/browse/BEAM-13459?focusedWorklogId=702101&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-702101 ]

ASF GitHub Bot logged work on BEAM-13459:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 29/Dec/21 22:09
            Start Date: 29/Dec/21 22:09
    Worklog Time Spent: 10m 
      Work Description: steveniemitz commented on a change in pull request 
#16229:
URL: https://github.com/apache/beam/pull/16229#discussion_r776508707



##########
File path: sdks/python/apache_beam/runners/dataflow/internal/apiclient.py
##########
@@ -556,9 +565,79 @@ def _get_sdk_image_overrides(self, pipeline_options):
     return (
         dict(s.split(',', 1) for s in sdk_overrides) if sdk_overrides else {})
 
+  @staticmethod
+  def _compute_sha256(file):
+    hasher = hashlib.sha256()
+    with open(file, 'rb') as f:
+      for chunk in iter(partial(f.read,
+                                DataflowApplicationClient._HASH_CHUNK_SIZE),
+                        b""):
+        hasher.update(chunk)
+    return hasher.hexdigest()
+
+  @staticmethod
+  def _split_gcs_path(path):
+    if not path.startswith('gs://'):
+      raise RuntimeError('Expected gs:// path, got %s' % path)
+    return path[5:].split('/', 1)
+
+  def _cached_location(self, sha256):
+    sha_prefix = sha256[0:2]
+    return FileSystems.join(
+        self._root_staging_location,
+        DataflowApplicationClient._GCS_CACHE_PREFIX,
+        sha_prefix,
+        sha256)
+
+  @retry.with_exponential_backoff(
+      retry_filter=retry.retry_on_server_errors_and_timeout_filter)
+  def _gcs_object_exists(self, gcs_or_local_path):
+    if not gcs_or_local_path.startswith('gs://'):
+      return False
+    else:
+      bucket, name = self._split_gcs_path(gcs_or_local_path)
+      request = storage.StorageObjectsGetRequest(bucket=bucket, object=name)
+      try:
+        self._storage_client.objects.Get(request)
+        return True
+      except exceptions.HttpError as e:
+        return e.status_code not in (403, 404)
+
+  @retry.with_exponential_backoff(
+      retry_filter=retry.retry_on_server_errors_and_timeout_filter)
+  def _gcs_to_gcs_copy(self, from_gcs, to_gcs):

Review comment:
       I didn't try that (I didn't even know it existed), but it does seem like 
it would remove the duplication here if it works correctly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 702101)
    Time Spent: 3h 40m  (was: 3.5h)

> Dataflow python runner should cache uploaded artifacts across job runs
> ----------------------------------------------------------------------
>
>                 Key: BEAM-13459
>                 URL: https://issues.apache.org/jira/browse/BEAM-13459
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-dataflow
>            Reporter: Steve Niemitz
>            Assignee: Steve Niemitz
>            Priority: P1
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Similar to how the jvm dataflow runner caches artifacts uploaded to GCS 
> across job runs (based on their sha256), the python runner should do the same.
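
The content-addressed caching idea in the diff above can be sketched as
follows. This is an illustrative standalone sketch, not the PR's actual code:
the function names, the `cache` path segment, and the 1 MiB chunk size are
assumptions; the PR itself uses `DataflowApplicationClient._HASH_CHUNK_SIZE`
and `_GCS_CACHE_PREFIX`. The two-character digest prefix in the path mirrors
the `sha_prefix = sha256[0:2]` line in the diff.

```python
import hashlib
import posixpath

# Chunk size is an assumption for illustration; the PR reads
# DataflowApplicationClient._HASH_CHUNK_SIZE instead.
_HASH_CHUNK_SIZE = 1 << 20  # 1 MiB


def compute_sha256(path):
  """Hash a file in fixed-size chunks so large artifacts are never
  loaded fully into memory."""
  hasher = hashlib.sha256()
  with open(path, 'rb') as f:
    for chunk in iter(lambda: f.read(_HASH_CHUNK_SIZE), b''):
      hasher.update(chunk)
  return hasher.hexdigest()


def cached_location(staging_root, sha256):
  """Derive a stable cache path from the digest. The two-character
  prefix fans objects out across GCS 'directories', matching the
  sha_prefix scheme in the diff."""
  return posixpath.join(staging_root, 'cache', sha256[0:2], sha256)
```

With this scheme, a second job run that stages a byte-identical artifact
computes the same digest, finds the object already present at the cached
location, and can copy or reuse it instead of re-uploading.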



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
