here:

the input_ids and running score product take an order of magnitude
less space to store than what's needed to actually run the beams
so you can cache a huge number of them, and only run the
highest probability ones
it's quite efficient
[they also share prefix sequences]
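
a rough sketch of the idea, not the actual code: each suspended beam is kept as just its token ids plus the running score product, in a priority queue, and only the highest-scoring one gets run. the names here (`Node`, `BeamCache`, `best_first_search`, `next_token_probs`) are made up for illustration, and `next_token_probs` stands in for whatever model call returns next-token probabilities. prefix sharing is done with parent pointers, which is one possible way to do it.

```python
import heapq
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One token plus a link to its parent, so beams share prefix storage."""
    token: int
    parent: Optional["Node"] = None

    def tokens(self):
        # Walk back to the root to rebuild the full input_ids list.
        out = []
        node = self
        while node is not None:
            out.append(node.token)
            node = node.parent
        return out[::-1]


class BeamCache:
    """Holds suspended beams as (score, node) pairs; no model state is stored."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so nodes themselves are never compared

    def push(self, score, node):
        # heapq is a min-heap, so negate the score to pop the largest first.
        heapq.heappush(self._heap, (-score, self._counter, node))
        self._counter += 1

    def pop_best(self):
        neg_score, _, node = heapq.heappop(self._heap)
        return -neg_score, node

    def __len__(self):
        return len(self._heap)


def best_first_search(start_token, next_token_probs, eos_token, max_steps):
    """Repeatedly run only the highest-probability cached beam.

    next_token_probs(tokens) is a hypothetical callable returning a
    {token_id: probability} dict for the next position.
    """
    cache = BeamCache()
    cache.push(1.0, Node(start_token))
    for _ in range(max_steps):
        if not cache:
            break
        score, node = cache.pop_best()
        if node.token == eos_token:
            return score, node.tokens()
        # Only this beam touches the model; the rest stay cached as ids + score.
        for token, prob in next_token_probs(node.tokens()).items():
            cache.push(score * prob, Node(token, node))
    return None
```

in practice you'd probably keep log-probabilities and add them rather than multiply raw probabilities, to avoid underflow on long sequences; the product is used here only to match the description above.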
