[GitHub] [beam] ajamato commented on a change in pull request #12762: Ensuring that BigQuery jobs are tagged with the Dataflow step that launches them

GitBox Thu, 03 Sep 2020 13:42:16 -0700


ajamato commented on a change in pull request #12762:
URL: https://github.com/apache/beam/pull/12762#discussion_r483239088




##########
File path: sdks/python/apache_beam/io/gcp/bigquery_io_metadata.py
##########
@@ -64,6 +64,8 @@ def create_bigquery_io_metadata():
     # As we do not want a bad label to fail the BQ job.
     if _is_valid_cloud_label_value(dataflow_job_id):
       kwargs['beam_job_id'] = dataflow_job_id
+  if step_name:
+    kwargs['step_name'] = step_name

Review comment:
       This is likely to cause the BQ job to fail. The step_name would need to 
be sanitized. For the cloud label format
   
https://cloud.google.com/resource-manager/docs/creating-managing-labels#requirements
   
   
   - Keys have a minimum length of 1 character and a maximum length of 63 
characters, and cannot be empty. Values can be empty, and have a maximum length 
of 63 characters.
   - Keys and values can contain only lowercase letters, numeric characters, 
underscores, and dashes. All characters must use UTF-8 encoding, and 
international characters are allowed.
   
   

##########
File path: sdks/python/apache_beam/io/gcp/bigquery_file_loads.py
##########
@@ -826,7 +836,8 @@ def _load_data(
             TriggerCopyJobs(
                 create_disposition=self.create_disposition,
                 write_disposition=self.write_disposition,
-                test_client=self.test_client),
+                test_client=self.test_client,
+                step_name=step_name),

Review comment:
       Do we know the format of the step_name. Is this just what the SDK 
harness is referring to the step as?
   
   IIRC these did not reflect something meaningful, in the original graph to 
the user. So we probably don't want to show them
   
   I know the pcollections in python at least were very opaque "pcollection1", 
"pcollection2", etc.
   
   Normally we rely on the DF RunnerHarness to translate those before sending 
to the UI (ex: on metrics)

##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -625,6 +625,7 @@ def to_type_hint(self):
 
 
 class _CustomBigQuerySource(BoundedSource):
+
   def __init__(

Review comment:
       Would you mind sharing the motivation with me on this. Its not entirely 
clear why you are making this change to me.
   
   Can you make equivalent changes in the Java implementation as well? For a 
consistent experience

##########
File path: sdks/python/apache_beam/io/gcp/bigquery.py
##########
@@ -1875,14 +1895,22 @@ def process(self, unused_element, signal):
     temp_location = pcoll.pipeline.options.view_as(
         GoogleCloudOptions).temp_location
     gcs_location = self._get_destination_uri(temp_location)
+    job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name
 
+    try:
+      step_name = self.label

Review comment:
       I looked at the file, and it seems like self.label not referenced 
anywhere else in the class, and I don't see a call to super ctor which would 
set it.
   
   Is this using a werid pattern, where something external is taking the object 
and attaching parameters to it?
   
   Is it possible to fix up that style, so there is a clear method or something 
we can see where it gets added extenrally?
   
   Or to t lease define a default, empty value with a comment saying how it 
gets set




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] ajamato commented on a change in pull request #12762: Ensuring that BigQuery jobs are tagged with the Dataflow step that launches them

Reply via email to