Re: [PR] Implement the hash_words TFT operation [beam]

via GitHub Mon, 13 May 2024 07:17:01 -0700


jrmccluskey commented on code in PR #31249:
URL: https://github.com/apache/beam/pull/31249#discussion_r1598554126



##########
sdks/python/apache_beam/ml/transforms/tft.py:
##########
@@ -637,3 +637,48 @@ def apply_transform(self, data: tf.SparseTensor, output_col_name: str):
 def count_unique_words(
     data: tf.SparseTensor, output_vocab_name: Optional[str]) -> None:
   tft.count_per_key(data, key_vocabulary_filename=output_vocab_name)
+
+
+@register_input_dtype(str)
+class HashStrings(TFTOperation):
+  def __init__(
+      self,
+      columns: List[str],
+      hash_buckets: int,
+      key: Optional[Iterable[int]] = None,
+      name: Optional[str] = None):
+    '''Hashes strings into the provided number of buckets.
+    
+    Args:
+      columns: A list of the column names to apply the transformation on.
+      hash_buckets: the number of buckets to hash the strings into.
+      key: optional. An array of two Python `uint64`. If passed, output will be

Review Comment:
   I matched the type hint and docstring to the TFT implementation, but yes, it's
surprising that the docstring specifies a fixed length of two while the type
hint is a general iterable. Tightening the type on our end doesn't affect
usage, so it's a reasonable improvement. Done.
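
For illustration, a minimal sketch of what a tightened hint could look like, assuming the change replaces `Optional[Iterable[int]]` with `Optional[Tuple[int, int]]`. The class name `HashStringsSketch` and the length check are hypothetical stand-ins, not the code merged in this PR:

```python
from typing import List, Optional, Tuple


class HashStringsSketch:
  """Hypothetical stand-in for the HashStrings operation in this PR."""
  def __init__(
      self,
      columns: List[str],
      hash_buckets: int,
      key: Optional[Tuple[int, int]] = None,
      name: Optional[str] = None):
    # Assumption: the runtime check below is illustrative only; the PR may
    # rely on the type hint alone to document the two-element requirement.
    if key is not None and len(key) != 2:
      raise ValueError('key must contain exactly two integers')
    self.columns = columns
    self.hash_buckets = hash_buckets
    self.key = key
    self.name = name
```

With a hint like this, a type checker can flag a call such as `HashStringsSketch(['text'], 10, key=(1, 2, 3))`, while a valid two-element key is used exactly as before.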



