GitHub user xXMrNidaXx added a comment to the discussion: How to validate 
before update without having the full graph?

Including word boundaries in biasing for ASR:

**The goal:**
Bias toward specific terms while respecting word boundaries.

**Approach 1: BPE with boundary tokens**
```python
# Add special tokens for word boundaries
biasing_phrases = [
    "▁RevolutionAI",  # ▁ (U+2581) marks a word start in SentencePiece
    "▁artificial▁intelligence",
]
```
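If you'd rather not type the marker by hand, here is a minimal sketch that produces the same form from plain phrases. `to_boundary_form` is a hypothetical helper name; it only mimics the "▁" word-start convention and does not call SentencePiece, so the real tokenizer may still split the phrase into different pieces.

```python
# Hypothetical helper: rewrites a plain phrase with the "▁" (U+2581)
# word-start marker. It does not call SentencePiece itself.
WORD_START = "\u2581"

def to_boundary_form(phrase: str) -> str:
    """Prefix every whitespace-separated word with the word-start marker."""
    return "".join(WORD_START + word for word in phrase.split())

biasing_phrases = [
    to_boundary_form(p) for p in ["RevolutionAI", "artificial intelligence"]
]
```

This gives the same two strings as the hand-written list above, and extra whitespace in the input is ignored.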

**Approach 2: Phrase-level biasing**
```yaml
model:
  decoding:
    beam_search:
      context_biasing:
        phrases: ["RevolutionAI", "machine learning"]
        bias_weight: 2.0  # Higher = stronger bias
```
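To make the effect of `bias_weight` concrete, here is a rough sketch that rescores finished hypotheses by adding the weight once per matched phrase. This is an assumption for illustration only: real decoders usually apply the bias token by token during beam search, and `contains_phrase`, `biased_score`, and the scores below are all made up.

```python
def contains_phrase(words, phrase_words):
    """True if phrase_words occurs as a contiguous run of whole words."""
    n = len(phrase_words)
    return any(words[i:i + n] == phrase_words for i in range(len(words) - n + 1))

def biased_score(text, base_score, phrases, bias_weight=2.0):
    """Add bias_weight to the score for every biasing phrase found in the text."""
    words = text.lower().split()
    hits = sum(contains_phrase(words, p.lower().split()) for p in phrases)
    return base_score + bias_weight * hits

phrases = ["RevolutionAI", "machine learning"]
# (text, log-probability-like score); values are invented for the example
hypotheses = [
    ("revolution a i does machine learning", -4.0),
    ("RevolutionAI does machine learning", -5.5),
]
best = max(hypotheses, key=lambda h: biased_score(h[0], h[1], phrases))
```

Without the bias the first hypothesis has the better score; with `bias_weight=2.0` the second one matches both phrases and wins.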

**Approach 3: Word-level with boundaries**
```python
# Explicit boundary markers around each biasing phrase
def add_boundaries(phrase):
    return f"<w>{phrase}</w>"

phrases = ["RevolutionAI", "machine learning"]  # example phrases
biased = [add_boundaries(p) for p in phrases]
```

**Why boundaries matter:**
- They prevent partial matches: "AI" shouldn't bias "AIR" or "WAIT"
- They make phrase extraction cleaner
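
The difference is easy to see with plain text matching. This is a small sketch using Python's `re` module, not any ASR toolkit's matcher; `substring_match` and `exact_word_match` are hypothetical names.

```python
import re

def substring_match(term, text):
    """Naive match: fires anywhere the characters appear."""
    return term.lower() in text.lower()

def exact_word_match(term, text):
    """Match only when the term is not glued to other word characters."""
    pattern = rf"(?<!\w){re.escape(term)}(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

for text in ["AI", "AIR", "WAIT", "the AI model"]:
    print(text, substring_match("AI", text), exact_word_match("AI", text))
```

The substring check fires on all four inputs, while the boundary-aware check fires only on "AI" and "the AI model".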

**NeMo config:**
```yaml
decoder:
  context_graph:
    phrases: ["word1", "word2"]
    match_mode: "exact_word"  # vs "substring"
```

We do domain-specific ASR at [RevolutionAI](https://revolutionai.io). What 
terms are you biasing?

GitHub link: 
https://github.com/apache/jena/discussions/3697#discussioncomment-15898543

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

