tvalentyn commented on code in PR #31026:
URL: https://github.com/apache/beam/pull/31026#discussion_r1571040470
##########
sdks/python/apache_beam/ml/transforms/tft_test.py:
##########
@@ -155,6 +155,134 @@ def test_ScaleTo01(self):
           actual_output, equal_to(expected_output, equals_fn=np.array_equal))
+class ScaleToGaussianTest(unittest.TestCase):
+  def setUp(self) -> None:
+    self.artifact_location = tempfile.mkdtemp()
+
+  def tearDown(self):
+    shutil.rmtree(self.artifact_location)
+
+  def test_gaussian_list_uniform_distribution(self):
+    list_data = [{'x': [1, 2, 3]}, {'x': [4, 5, 6]}]
+    with beam.Pipeline() as p:
+      list_result = (
+          p
+          | "listCreate" >> beam.Create(list_data)
+          | "listMLTransform" >> base.MLTransform(
+              write_artifact_location=self.artifact_location).with_transform(
+                  tft.ScaleToGaussian(columns=['x'])))
+
+      expected_data = [
+          np.array([-1.46385, -0.87831, -0.29277], dtype=np.float32),
Review Comment:
Can we compare the floats up to a fixed tolerance instead of exactly? That
avoids floating-point rounding surprises like https://0.30000000000000004.com/
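A minimal sketch of such a comparison, assuming NumPy's `np.allclose` and an
illustrative tolerance of 1e-5. The expected values are copied from the test
above; `actual` and `arrays_close` are hypothetical names, and `actual` is only
a stand-in for the pipeline output:

```python
import numpy as np

# Expected values copied from the test in this PR.
expected = np.array([-1.46385, -0.87831, -0.29277], dtype=np.float32)

# Hypothetical stand-in for the pipeline output: same values plus small noise.
actual = expected + np.float32(1e-6)


def arrays_close(a, b, atol=1e-5):
  """Tolerance-based comparison; the tolerance is an illustrative choice."""
  return np.allclose(a, b, rtol=0, atol=atol)


# Exact comparison is sensitive to the noise; the tolerant one is not.
exact_match = np.array_equal(actual, expected)
close_match = arrays_close(actual, expected)
```

A function like `arrays_close` could presumably be passed as `equals_fn` in
place of `np.array_equal`, though that is not verified against this test.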
##########
sdks/python/apache_beam/ml/transforms/tft.py:
##########
@@ -291,6 +291,41 @@ def apply_transform(
     return output_dict
+@register_input_dtype(float)
+class ScaleToGaussian(TFTOperation):
+  def __init__(
+      self,
+      columns: List[str],
+      elementwise: bool = False,
+      name: Optional[str] = None):
+    """
+    This function applies a scaling transformation on the given columns
+    of incoming data. The operation transforms the input column values
+    to an approximately normal distribution with mean 0 and variance of 1.
+    The Gaussian transformation is only applied if the column has long tails;
Review Comment:
Does this mean: "do not use this operation if your data doesn't have long
tails, use z-score instead" or does it mean: "this transform is doing some
sophisticated normalization but if your data doesn't have long tails, then it's
the same as z-score normalization"?
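For reference, a minimal NumPy sketch of plain z-score normalization on the
values used in the test in this PR, assuming the population standard deviation
(NumPy's default, `ddof=0`). This is not the TFT implementation; it only shows
what the second reading would produce on this data:

```python
import numpy as np

# The six values from the test input, flattened into one column.
x = np.array([1, 2, 3, 4, 5, 6], dtype=np.float32)

# Plain z-score: subtract the mean, divide by the population standard deviation.
z = (x - x.mean()) / x.std()

# Values corresponding to the first input row [1, 2, 3].
first_row = z[:3]
```

On this data, `first_row` agrees with the expected values in the test
(-1.46385, -0.87831, -0.29277) to about five decimal places, which is
consistent with the second reading.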
##########
sdks/python/apache_beam/ml/transforms/tft.py:
##########
@@ -291,6 +291,41 @@ def apply_transform(
     return output_dict
+@register_input_dtype(float)
+class ScaleToGaussian(TFTOperation):
+  def __init__(
+      self,
+      columns: List[str],
+      elementwise: bool = False,
+      name: Optional[str] = None):
+    """
+    This function applies a scaling transformation on the given columns
Review Comment:
Nit: consider a shorter description: `This operation scales the given
input column values to an approximately normal distribution with mean 0 and
variance 1.`
##########
sdks/python/apache_beam/ml/transforms/tft.py:
##########
@@ -291,6 +291,41 @@ def apply_transform(
     return output_dict
+@register_input_dtype(float)
+class ScaleToGaussian(TFTOperation):
+  def __init__(
+      self,
+      columns: List[str],
+      elementwise: bool = False,
+      name: Optional[str] = None):
+    """
+    This function applies a scaling transformation on the given columns
+    of incoming data. The operation transforms the input column values
+    to an approximately normal distribution with mean 0 and variance of 1.
+    The Gaussian transformation is only applied if the column has long tails;
Review Comment:
Consider adding more details from the link below, and/or adding:
"For more information, see:
https://www.tensorflow.org/tfx/transform/api_docs/python/tft/scale_to_gaussian"