[GitHub] [beam] damccorm commented on a diff in pull request #27493: Add ScaleByMinMax and NGrams data processing transformations

via GitHub Fri, 14 Jul 2023 08:40:19 -0700


damccorm commented on code in PR #27493:
URL: https://github.com/apache/beam/pull/27493#discussion_r1263874412



##########
sdks/python/apache_beam/ml/transforms/tft_test.py:
##########
@@ -391,5 +392,81 @@ def equals_fn(a, b):
       assert_that(actual_output, equal_to(expected_output, equals_fn=equals_fn))
 
 
+class ScaleToMinMaxTest(unittest.TestCase):
+  def setUp(self) -> None:
+    self.artifact_location = tempfile.mkdtemp()
+
+  def tearDown(self):
+    shutil.rmtree(self.artifact_location)
+
+  def test_scale_to_min_max(self):
+    data = [{
+        'x': 4,
+    }, {
+        'x': 4,
+    }, {
+        'x': 4,
+    }]

Review Comment:
   Nit: could we add a test where the input values aren't all the same?
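
One way to cover this is a fixture with distinct values, checked against the min-max formula. The sketch below is a pure-Python reference for the expected values only. `expected_min_max` is a hypothetical helper, not part of Beam or tensorflow_transform, and it deliberately rejects constant input, since that case follows the library's own rule, which is not modelled here.

```python
from typing import List


def expected_min_max(
    values: List[float], output_min: float = 0.0,
    output_max: float = 1.0) -> List[float]:
  """Hypothetical reference helper: linear min-max scaling of `values`."""
  lo, hi = min(values), max(values)
  if lo == hi:
    # Constant columns are library-defined behaviour; not modelled here.
    raise ValueError('values must not all be equal')
  scale = (output_max - output_min) / (hi - lo)
  return [output_min + (v - lo) * scale for v in values]


# Fixture with distinct values, in the same shape as the test data above.
data = [{'x': 1.0}, {'x': 5.0}, {'x': 3.0}]
expected = expected_min_max([d['x'] for d in data])
```

With distinct inputs, the smallest value should map to `min_value` and the largest to `max_value`, which the all-4s fixture cannot check.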



##########
sdks/python/apache_beam/ml/transforms/tft.py:
##########
@@ -434,3 +437,71 @@ def apply_transform(
         output_column_name + '_tfidf_weight': tfidf_weight
     }
     return output
+
+
+@register_input_dtype(float)
+class ScaleByMinMax(TFTOperation):
+  def __init__(
+      self,
+      columns: List[str],
+      min_value: float = 0.0,
+      max_value: float = 1.0,
+      name: Optional[str] = None):
+    """
+    This function applies a scaling transformation on the given columns
+    of incoming data. The transformation scales the input values to the
+    range [min_value, max_value].
+
+    Args:
+      columns: A list of column names to apply the transformation on.
+      min_value: The minimum value of the output range.
+      max_value: The maximum value of the output range.
+      name: A name for the operation (optional).
+    """
+    super().__init__(columns)
+    self.min_value = min_value
+    self.max_value = max_value
+    self.name = name
+
+    if self.max_value <= self.min_value:
+      raise ValueError('max_value must be greater than min_value')
+
+  def apply_transform(
+      self, data: tf.Tensor, output_column_name: str) -> tf.Tensor:
+
+    output = tft.scale_by_min_max(
+        x=data, output_min=self.min_value, output_max=self.max_value)
+    return {output_column_name: output}
+
+
+@register_input_dtype(str)
+class NGrams(TFTOperation):
+  def __init__(
+      self,
+      columns: List[str],
+      ngram_range: Tuple[int, int],
+      separator: str,
+      name: Optional[str] = None):
+    """
+    An n-gram is a contiguous sequence of n items from a given sample of text
+    or speech. This operation applies an n-gram transformation to
+    specified columns of incoming data, splitting the input data into a
+    set of consecutive n-grams.
+
+    Args:
+      columns: A list of column names to apply the transformation on.
+      ngram_range: A tuple of integers(inclusive) specifying the range of
+        n-gram sizes.
+      separator: A string that specifies the separator between tokens.
+      name: A name for the operation (optional).
+    """
+    super().__init__(columns)
+    self.ngram_range = ngram_range
+    self.separator = separator
+    self.name = name
+
+  def apply_transform(self, data: tf.SparseTensor,
+                      output_column_name: str) -> Dict[str, tf.SparseTensor]:
+    # TODO: Perform splitting using separator when the input is a string.

Review Comment:
   Not sure what this TODO means - aren't we already doing this?
   
   Also, should it have a linked issue?
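
To make the question concrete: if the TODO refers to accepting raw strings rather than pre-tokenized input, the missing step would be a split on `separator` before the n-grams are built. Below is a minimal pure-Python sketch of that reading, which is only an assumption about what the TODO means. `ngrams_from_string` is a hypothetical name, and the output ordering (by n, then by position) is a choice of this sketch that may not match `tft.ngrams`.

```python
from typing import List, Tuple


def ngrams_from_string(
    text: str, ngram_range: Tuple[int, int], separator: str) -> List[str]:
  """Hypothetical helper: split `text` on `separator`, then join n-grams."""
  # Drop empty tokens produced by repeated separators.
  tokens = [t for t in text.split(separator) if t]
  min_n, max_n = ngram_range
  result = []
  for n in range(min_n, max_n + 1):  # ngram_range is inclusive on both ends
    for start in range(len(tokens) - n + 1):
      result.append(separator.join(tokens[start:start + n]))
  return result


grams = ngrams_from_string('a b c', (1, 2), ' ')
```

The sketch assumes a non-empty `separator`; `str.split` raises `ValueError` on an empty one.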



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

