AnandInguva commented on code in PR #29542:
URL: https://github.com/apache/beam/pull/29542#discussion_r1410044972
##########
sdks/python/apache_beam/ml/transforms/handlers.py:
##########
@@ -134,7 +135,9 @@ def process(self, element):
hash_object.update(str(list(value)).encode())
else: # assume value is a primitive that can be turned into str
hash_object.update(str(value).encode())
- yield (hash_object.hexdigest(), element)
+ # add a unique suffix to the hash key to avoid collisions.
+ unique_suffix = uuid.uuid4().hex
Review Comment:
I don't think we need to raise an error for now, since the chance of a
collision is very low.
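
For illustration, here is a minimal sketch of the keying idea in the hunk
above: hash the element's values, then add a random suffix so that two
identical rows still get distinct keys. The helper name `make_join_key` is
hypothetical, and appending the suffix directly to the digest is an
assumption, since the hunk does not show how the suffix is combined with the
hash. Only plain lists and tuples are handled here, to keep the sketch
self-contained.

```python
import hashlib
import uuid


def make_join_key(element: dict) -> str:
    """Hypothetical helper: content hash of the row plus a random suffix."""
    hash_object = hashlib.sha256()
    for value in element.values():
        if isinstance(value, (list, tuple)):
            # Normalize sequences to a list before stringifying.
            hash_object.update(str(list(value)).encode())
        else:
            # Assume value is a primitive that can be turned into str.
            hash_object.update(str(value).encode())
    # Assumption: the suffix is appended to the hex digest.
    unique_suffix = uuid.uuid4().hex
    return hash_object.hexdigest() + unique_suffix


row = {"x": [1, 2, 3], "label": "a"}
key_a = make_join_key(row)
key_b = make_join_key(row)
```

With this scheme, identical rows share the 64-character digest prefix but
differ in the 32-character suffix, so a key clash would require a uuid4
collision.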
>> I wonder if performance and pipeline cost would improve if we can find a
way to pass-through columns that do not need to be processed to tft, converting
them to bytes if necessary, avoiding the shuffle step.
I can try this and measure the performance. If it works, we can remove the
CoGroupByKey altogether.
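
A rough sketch of what the pass-through could look like, in plain Python and
independent of Beam or tft: the columns that need no processing are packed
into a single bytes field that travels with the row and is unpacked
afterwards. Every name here (`PASSTHROUGH_FIELD`, `split_row`, `merge_row`)
is hypothetical, and whether tft would carry such a bytes column through
unchanged is an assumption that still needs testing.

```python
import pickle

# Hypothetical reserved column name for the packed pass-through data.
PASSTHROUGH_FIELD = "__passthrough__"


def split_row(row: dict, processed_columns: set) -> dict:
    """Keep processed columns as-is; pack the rest into one bytes field."""
    to_process = {k: v for k, v in row.items() if k in processed_columns}
    passthrough = {k: v for k, v in row.items() if k not in processed_columns}
    to_process[PASSTHROUGH_FIELD] = pickle.dumps(passthrough)
    return to_process


def merge_row(processed: dict) -> dict:
    """Unpack the bytes field back into ordinary columns."""
    out = dict(processed)
    out.update(pickle.loads(out.pop(PASSTHROUGH_FIELD)))
    return out


row = {"x": 1.5, "id": 42, "note": "keep me"}
packed = split_row(row, processed_columns={"x"})
restored = merge_row(packed)
```

Because the unprocessed columns stay attached to their row, no join key or
shuffle is needed to reunite them in this sketch.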