gemini-code-assist[bot] commented on code in PR #38374:
URL: https://github.com/apache/beam/pull/38374#discussion_r3203019604
##########
sdks/python/apache_beam/examples/inference/pytorch_sentiment.py:
##########
@@ -62,16 +59,35 @@ def process(self, element: tuple[str, PredictionResult]) -> Iterable[dict]:
 }
-def tokenize_text(text: str,
-                  tokenizer: DistilBertTokenizerFast) -> tuple[str, dict]:
-  """Tokenizes input text using the specified tokenizer."""
-  tokenized = tokenizer(
-      text,
-      padding='max_length',
-      truncation=True,
-      max_length=128,
-      return_tensors="pt")
-  return text, {k: torch.squeeze(v) for k, v in tokenized.items()}
+class TokenizeTextDoFn(beam.DoFn):
+  """Initializes tokenizer per worker and tokenizes input text."""
+  def __init__(self, model_path: str):
+    self.model_path = model_path
+    self.tokenizer = None
+
+  def setup(self):
+    self.tokenizer = DistilBertTokenizerFast.from_pretrained(self.model_path)
+    # Some transformers builds expose pad token through legacy attributes.
+    if not hasattr(self.tokenizer, '_pad_token'):
+      self.tokenizer._pad_token = '[PAD]'
Review Comment:

The check `hasattr(self.tokenizer, '_pad_token')` is likely to always return
`True` because `_pad_token` is an internal attribute defined in the base class
of `transformers` tokenizers (initialized to `None` in `__init__`), even if no
padding token has been set. This makes the conditional assignment ineffective.
It is recommended to check the public `pad_token` property and use the public
API to set it if it is missing.
```suggestion
    if self.tokenizer.pad_token is None:
      self.tokenizer.pad_token = '[PAD]'
```
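
For illustration only (not part of the suggested change), here is a minimal standalone sketch of the public-API pattern; `distilbert-base-uncased` is just an example checkpoint, not necessarily the model used by this pipeline:
```python
# Sketch, assuming transformers is installed and the example checkpoint is available.
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
if tokenizer.pad_token is None:
  # Registers the pad token through the supported API. If a genuinely new token
  # is added this way, the model's embeddings must be resized to match, e.g.
  # model.resize_token_embeddings(len(tokenizer)), before running inference.
  tokenizer.add_special_tokens({'pad_token': '[PAD]'})
print(tokenizer.pad_token, tokenizer.pad_token_id)
```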
##########
sdks/python/apache_beam/examples/inference/pytorch_sentiment.py:
##########
@@ -62,16 +59,35 @@ def process(self, element: tuple[str, PredictionResult]) -> Iterable[dict]:
 }
-def tokenize_text(text: str,
-                  tokenizer: DistilBertTokenizerFast) -> tuple[str, dict]:
-  """Tokenizes input text using the specified tokenizer."""
-  tokenized = tokenizer(
-      text,
-      padding='max_length',
-      truncation=True,
-      max_length=128,
-      return_tensors="pt")
-  return text, {k: torch.squeeze(v) for k, v in tokenized.items()}
+class TokenizeTextDoFn(beam.DoFn):
+  """Initializes tokenizer per worker and tokenizes input text."""
+  def __init__(self, model_path: str):
+    self.model_path = model_path
+    self.tokenizer = None
+
+  def setup(self):
+    self.tokenizer = DistilBertTokenizerFast.from_pretrained(self.model_path)
+    # Some transformers builds expose pad token through legacy attributes.
+    if not hasattr(self.tokenizer, '_pad_token'):
+      self.tokenizer._pad_token = '[PAD]'
+
+  def process(self, text: str) -> Iterable[tuple[str, dict]]:
+    tokenized = self.tokenizer(
+        text,
+        padding='max_length',
+        truncation=True,
+        max_length=128,
+        return_tensors="pt")
+    yield text, {k: torch.squeeze(v) for k, v in tokenized.items()}
Review Comment:

Using `torch.squeeze(v)` without specifying a dimension can be risky as it
removes all dimensions of size 1. If the sequence length happens to be 1, the
tensor would be reduced to a scalar, which might cause issues during batching
in `RunInference`. Since the tokenizer with `return_tensors="pt"` adds a batch
dimension at index 0, it is safer to use `torch.squeeze(v, 0)` to specifically
remove only that dimension.
```suggestion
    yield text, {k: torch.squeeze(v, 0) for k, v in tokenized.items()}
```
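
As a quick illustration of the shape difference (a standalone sketch, not part of the suggested change):
```python
# Sketch: an unqualified squeeze collapses every size-1 axis, while squeeze(dim=0)
# only drops the batch axis added by the tokenizer via return_tensors="pt".
import torch

v = torch.zeros(1, 1, dtype=torch.long)  # batch dim plus a length-1 sequence
print(torch.squeeze(v).shape)     # torch.Size([])  -- reduced to a scalar
print(torch.squeeze(v, 0).shape)  # torch.Size([1]) -- sequence axis preserved
```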
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]