gemini-code-assist[bot] commented on code in PR #38357:
URL: https://github.com/apache/beam/pull/38357#discussion_r3174964050
##########
sdks/python/apache_beam/runners/dataflow/internal/apiclient.py:
##########
@@ -279,6 +279,10 @@ def __init__(
           for k, v in sdk_pipeline_options.items() if v is not None
       }
       options_dict["pipelineUrl"] = proto_pipeline_staged_url
+      if self._proto_pipeline:
+        serialized_pipeline = self._proto_pipeline.SerializeToString()
+        options_dict["pipelineProtoHash"] = hashlib.sha256(
+            serialized_pipeline).hexdigest()
Review Comment:

The pipeline proto is re-serialized here solely to compute the SHA-256 hash.
For large pipelines, `SerializeToString()` can be expensive in both CPU and
memory. In addition, although the output is usually stable within a single
process, the protobuf specification does not guarantee deterministic
serialization, so this hash may not match the bytes actually uploaded to GCS.
Hashing the uploaded bytes would be safer and more efficient. Consider letting
the caller supply the hash if it was already computed during staging.
```python
if self._proto_pipeline and "pipelineProtoHash" not in options_dict:
  options_dict["pipelineProtoHash"] = hashlib.sha256(
      self._proto_pipeline.SerializeToString()).hexdigest()
```
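
One way to hash the uploaded bytes rather than a second serialization is sketched below. This is a minimal illustration, not the code in this PR: `serialize_and_hash`, `add_pipeline_hash`, and the `staged_hash` parameter are hypothetical names, and the sketch assumes the staging step can hand its digest to the caller.

```python
import hashlib
from typing import Callable, Dict, Optional, Tuple


def serialize_and_hash(serialize: Callable[[], bytes]) -> Tuple[bytes, str]:
  """Serializes once and returns the payload with its SHA-256 hex digest.

  Hypothetical helper: the returned payload is what would be uploaded, so the
  digest always describes the uploaded bytes.
  """
  payload = serialize()
  return payload, hashlib.sha256(payload).hexdigest()


def add_pipeline_hash(
    options_dict: Dict[str, str],
    staged_hash: Optional[str] = None,
    serialize: Optional[Callable[[], bytes]] = None) -> Dict[str, str]:
  """Sets "pipelineProtoHash", preferring a digest computed during staging."""
  if "pipelineProtoHash" in options_dict:
    # The caller already supplied a hash; leave it untouched.
    return options_dict
  if staged_hash is not None:
    # Reuse the digest of the bytes that were actually uploaded.
    options_dict["pipelineProtoHash"] = staged_hash
  elif serialize is not None:
    # Fallback: serialize here, at the cost the comment above describes.
    _, options_dict["pipelineProtoHash"] = serialize_and_hash(serialize)
  return options_dict
```

With this shape, the staging code would call `serialize_and_hash(proto_pipeline.SerializeToString)` once, upload the returned payload, and pass the digest on as `staged_hash`, so the fallback path runs only when no staged digest is available.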