gemini-code-assist[bot] commented on code in PR #38374:
URL: https://github.com/apache/beam/pull/38374#discussion_r3203019604
##########
sdks/python/apache_beam/examples/inference/pytorch_sentiment.py:
##########
@@ -62,16 +59,35 @@ def process(self, element: tuple[str, PredictionResult]) -> Iterable[dict]:
 }
-def tokenize_text(text: str,
-                  tokenizer: DistilBertTokenizerFast) -> tuple[str, dict]:
-  """Tokenizes input text using the specified tokenizer."""
-  tokenized = tokenizer(
-      text,
-      padding='max_length',
-      truncation=True,
-      max_length=128,
-      return_tensors="pt")
-  return text, {k: torch.squeeze(v) for k, v in tokenized.items()}
+class TokenizeTextDoFn(beam.DoFn):
+  """Initializes tokenizer per worker and tokenizes input text."""
+  def __init__(self, model_path: str):
+    self.model_path = model_path
+    self.tokenizer = None
+
+  def setup(self):
+    self.tokenizer = DistilBertTokenizerFast.from_pretrained(self.model_path)
+    # Some transformers builds expose pad token through legacy attributes.
+    if not hasattr(self.tokenizer, '_pad_token'):
+      self.tokenizer._pad_token = '[PAD]'
Review Comment:

The check `hasattr(self.tokenizer, '_pad_token')` is likely to always return
`True` because `_pad_token` is an internal attribute defined in the base class
of `transformers` tokenizers (initialized to `None` in `__init__`), even if no
padding token has been set. This makes the conditional assignment ineffective.
It is recommended to check the public `pad_token` property and use the public
API to set it if it is missing.
```suggestion
    if self.tokenizer.pad_token is None:
      self.tokenizer.pad_token = '[PAD]'
```
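
For illustration only (not part of the suggested change), here is a minimal standalone sketch of the public-API pattern; `distilbert-base-uncased` is just an example checkpoint, not necessarily the model used by this pipeline:
```python
# Sketch, assuming transformers is installed and the example checkpoint is available.
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
if tokenizer.pad_token is None:
  # Registers the pad token through the supported API. If a genuinely new token
  # is added this way, the model's embeddings must be resized to match, e.g.
  # model.resize_token_embeddings(len(tokenizer)), before running inference.
  tokenizer.add_special_tokens({'pad_token': '[PAD]'})
print(tokenizer.pad_token, tokenizer.pad_token_id)
```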
##########
sdks/python/apache_beam/examples/inference/pytorch_sentiment.py:
##########
@@ -62,16 +59,35 @@ def process(self, element: tuple[str, PredictionResult]) -> Iterable[dict]:
 }
-def tokenize_text(text: str,
-                  tokenizer: DistilBertTokenizerFast) -> tuple[str, dict]:
-  """Tokenizes input text using the specified tokenizer."""
-  tokenized = tokenizer(
-      text,
-      padding='max_length',
-      truncation=True,
-      max_length=128,
-      return_tensors="pt")
-  return text, {k: torch.squeeze(v) for k, v in tokenized.items()}
+class TokenizeTextDoFn(beam.DoFn):
+  """Initializes tokenizer per worker and tokenizes input text."""
+  def __init__(self, model_path: str):
+    self.model_path = model_path
+    self.tokenizer = None
+
+  def setup(self):
+    self.tokenizer = DistilBertTokenizerFast.from_pretrained(self.model_path)
+    # Some transformers builds expose pad token through legacy attributes.
+    if not hasattr(self.tokenizer, '_pad_token'):
+      self.tokenizer._pad_token = '[PAD]'
+
+  def process(self, text: str) -> Iterable[tuple[str, dict]]:
+    tokenized = self.tokenizer(
+        text,
+        padding='max_length',
+        truncation=True,
+        max_length=128,
+        return_tensors="pt")
+    yield text, {k: torch.squeeze(v) for k, v in tokenized.items()}
Review Comment:

Using `torch.squeeze(v)` without specifying a dimension can be risky as it
removes all dimensions of size 1. If the sequence length happens to be 1, the
tensor would be reduced to a scalar, which might cause issues during batching
in `RunInference`. Since the tokenizer with `return_tensors="pt"` adds a batch
dimension at index 0, it is safer to use `torch.squeeze(v, 0)` to specifically
remove only that dimension.
```suggestion
    yield text, {k: torch.squeeze(v, 0) for k, v in tokenized.items()}
```
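
As a quick illustration of the shape difference (a standalone sketch, not part of the suggested change):
```python
# Sketch: an unqualified squeeze collapses every size-1 axis, while squeeze(dim=0)
# only drops the batch axis added by the tokenizer via return_tensors="pt".
import torch

v = torch.zeros(1, 1, dtype=torch.long)  # batch dim plus a length-1 sequence
print(torch.squeeze(v).shape)     # torch.Size([])  -- reduced to a scalar
print(torch.squeeze(v, 0).shape)  # torch.Size([1]) -- sequence axis preserved
```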
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]