On 11/12/25 19:08, Jason J.G. White wrote:
On 12/11/25 10:17, Aaron Chantrill wrote:
I'm working on an article for Linux Magazine. For this article, I'm
interested in talking about setting up Speech Dispatcher with different
text-to-speech engines, like Piper TTS or Coqui TTS. This is based on a
question from this mailing list a couple of months ago. I'm hoping to
start a series on accessibility issues while deepening my own
understanding.
For screen reader users, minimizing audio latency is important.
Unfortunately,
the neural network-based TTS systems, including Coqui and Piper, have
a reputation for producing high latency. This is an important reason
why screen reader users tend not to use them.
I don't know whether this is improved if you have appropriate GPU
processing for the neural network models. Piper was unusably slow on
my machine, but I didn't investigate deeply enough to find out whether
it was using the GPU.
When run as a command-line program, Piper is unusably slow because it has
to load the full ONNX model on every invocation. My goal is to use Piper's
built-in HTTP server. This is the same way the older mimic3-general.conf
module worked: the generic module hands each utterance to a long-running
web service instead of starting the synthesizer from scratch.
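On the Speech Dispatcher side, a generic module config roughly like the
following should do it. Treat this as a sketch rather than a tested
config: the file name, the localhost:5000 URL, and the voice name are my
assumptions, and you'd want to check the exact endpoint and port your
Piper HTTP server actually exposes.

# ~/.config/speech-dispatcher/modules/piper-generic.conf  (hypothetical name)
# Assumes Piper's HTTP server is already running and returns WAV audio
# for a POSTed text body; adjust the URL and port to your setup.
GenericExecuteSynth \
"printf %s \'$DATA\' | curl -s --data-binary @- --output - http://localhost:5000 | aplay -q"
GenericCmdDependency "curl"
GenericCmdDependency "aplay"
AddVoice "en" "MALE1" "en_US-lessac-medium"

You still have to register the module in speechd.conf, as with any other
generic module, but after that every utterance is just a quick curl
against the already-loaded model.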
Of course, writing an HTTP server front end that can hold a model in
memory isn't that difficult, so if another TTS program doesn't include a
web service, it shouldn't be hard to write one. Once the ONNX model is
loaded, Piper runs faster than real time (it takes longer to say the
output than to generate it) even on a Raspberry Pi 3, so latency and the
lack of a GPU shouldn't be an issue; the trade-off is that running an
additional web server as a service does introduce extra complexity.
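To make "isn't that difficult" concrete, here is a bare-bones Python
sketch of such a front end. It keeps a model resident in memory and
answers POST requests with WAV audio. The load_model()/synthesize() names
are placeholders for whatever Python API your TTS engine actually
provides (Piper, Coqui, etc.); the silence generator is only there so the
skeleton runs as-is.

#!/usr/bin/env python3
"""Minimal HTTP front end that keeps a TTS model loaded in memory."""
import io
import wave
from http.server import BaseHTTPRequestHandler, HTTPServer

# Load the model ONCE at startup -- this is the whole point compared with
# invoking a command-line synthesizer for every utterance.
# MODEL = load_model("/path/to/voice.onnx")   # placeholder for your engine

def synthesize(text: str) -> bytes:
    """Return WAV bytes for `text`. Placeholder: one second of silence."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(22050)    # 22.05 kHz
        wav.writeframes(b"\x00\x00" * 22050)
    return buf.getvalue()

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        audio = synthesize(text)
        self.send_response(200)
        self.send_header("Content-Type", "audio/wav")
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)

if __name__ == "__main__":
    # The same shape of service the GenericExecuteSynth line above talks to.
    HTTPServer(("127.0.0.1", 5000), Handler).serve_forever()

That's roughly all a "hold the model in memory" service needs; the rest is
the Speech Dispatcher plumbing shown earlier.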
Thank you, Aaron