[GitHub] [beam] robertwb commented on a diff in pull request #27939: Update how we specify artifact modes

via GitHub Wed, 09 Aug 2023 13:36:08 -0700


robertwb commented on code in PR #27939:
URL: https://github.com/apache/beam/pull/27939#discussion_r1289167487



##########
sdks/python/apache_beam/ml/transforms/base.py:
##########
@@ -143,23 +143,46 @@ def __init__(
         retrieved from this location. Note that when consuming artifacts,
         it is not necessary to pass the transforms since they are inherently
         stored within the artifacts themselves. The value assigned to
-        `artifact_location` should be a valid storage path where the artifacts
-        can be written to or read from.
+        `write_artifact_location` should be a valid storage directory where the
+        artifacts from this transform can be written to. If no directory exists
+        at this location, one will be created. This will overwrite any
+        artifacts already in this location, so distinct locations should be
+        used for each instance of MLTransform. Only one of
+        write_artifact_location and read_artifact_location should be specified.
+      read_artifact_location: A storage location to read artifacts resulting
+        froma previous MLTransform. These artifacts include transformations
+        applied to the dataset and generated values like min, max from
+        ScaleTo01, and mean, var from ScaleToZScore. Note that when consuming
+        artifacts, it is not necessary to pass the transforms since they are
+        inherently stored within the artifacts themselves. The value assigned
+        to `read_artifact_location` should be a valid storage path where the
+        artifacts can be read from. Only one of write_artifact_location and
+        read_artifact_location should be specified.
       transforms: A list of transforms to apply to the data. All the transforms
         are applied in the order they are specified. The input of the
         i-th transform is the output of the (i-1)-th transform. Multi-input
         transforms are not supported yet.
-      artifact_mode: Whether to produce or consume artifacts. If set to
-        'consume', MLTransform will assume that the artifacts are already
-        computed and stored in the artifact_location. Pass the same artifact
-        location that was passed during produce phase to ensure that the
-        right artifacts are read. If set to 'produce', MLTransform
-        will compute the artifacts and store them in the artifact_location.
-        The artifacts will be read from this location during the consume phase.
     """
     if transforms:
       _ = [self._validate_transform(transform) for transform in transforms]
 
+    if read_artifact_location and write_artifact_location:
+      raise ValueError(
+          'Only one of read_artifact_location or write_artifact_location can '
+          'be specified to initialize MLTransform')
+
+    if not read_artifact_location and not write_artifact_location:
+      raise ValueError(
+          'Either a read_artifact_location or write_artifact_location must be '
+          'specified to initialize MLTransform')
+
+    artifact_location = write_artifact_location
+    artifact_mode = ArtifactMode.PRODUCE
+
+    if len(read_artifact_location) > 0:

Review Comment:
   Just noticed this will be an error with None
   
   I prefer to avoid the write-then-overwrite pattern. E.g. I'd do something 
like
   
   ```
   if read_artifact_location and write_artifact_location:
      [error]
   elif write_artifact_location:
     artifact_location = write_artifact_location
     artifact_mode = ArtifactMode.PRODUCE
   elif read_artifact_location:
     artifact_location = read_artifact_location
     artifact_mode = ArtifactMode.CONSUME
   else:
     error that at least one must be provided.
   ```
   
   This does of course move a bit of logic up into the argument checking. One 
can also do
   
   ```
   if write_artifact_location:
     artifact_location = write_artifact_location
     artifact_mode = ArtifactMode.PRODUCE
   else read_artifact_location:
     artifact_location = read_artifact_location
     artifact_mode = ArtifactMode.CONSUME
   ```
   
   which assumes the preconditions are met. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] robertwb commented on a diff in pull request #27939: Update how we specify artifact modes

Reply via email to