Dear Audio Engineers,

I'm writing an app to interact with OpenAI's 'realtime' API (bidirectional
realtime audio over WebSocket, with the AI on the server side).

To do this, I need to be careful that the AI's speech doesn't make its way
out of the speakers, back in through the mic, and back to their server
(else it starts to talk to itself and gets very confused).

So I need acoustic echo cancellation (AEC), which I've got working using
kAudioUnitSubType_VoiceProcessingIO, with
kAUVoiceIOProperty_BypassVoiceProcessing set to false via
AudioUnitSetProperty.
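For concreteness, the AEC side is essentially this (a trimmed sketch of
what I'm doing, not my exact code; error handling stripped):

#include <AudioToolbox/AudioToolbox.h>

AudioUnit MakeVoiceProcessingUnit(void) {
    AudioComponentDescription desc = {
        .componentType         = kAudioUnitType_Output,
        .componentSubType      = kAudioUnitSubType_VoiceProcessingIO,
        .componentManufacturer = kAudioUnitManufacturer_Apple,
    };
    AudioUnit vpUnit = NULL;
    AudioComponentInstanceNew(AudioComponentFindNext(NULL, &desc), &vpUnit);

    // Enable mic input on bus 1 (speaker output on bus 0 is on by default).
    UInt32 one = 1;
    AudioUnitSetProperty(vpUnit, kAudioOutputUnitProperty_EnableIO,
                         kAudioUnitScope_Input, 1, &one, sizeof(one));

    // bypass = 0 (false) => voice processing, including AEC, is active.
    UInt32 bypass = 0;
    AudioUnitSetProperty(vpUnit, kAUVoiceIOProperty_BypassVoiceProcessing,
                         kAudioUnitScope_Global, 0, &bypass, sizeof(bypass));

    return vpUnit;
}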

Now I also wish to detect when the person at the mic (me) is speaking or
not speaking, which I've also managed to do via
kAudioDevicePropertyVoiceActivityDetectionEnable.
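That part is just the (macOS 14+) HAL property; roughly this (a sketch,
assuming deviceID is the default input device):

#include <CoreAudio/CoreAudio.h>
#include <stdio.h>

static OSStatus VADListener(AudioObjectID device, UInt32 nAddresses,
                            const AudioObjectPropertyAddress addresses[],
                            void *clientData) {
    AudioObjectPropertyAddress state = {
        kAudioDevicePropertyVoiceActivityDetectionState,
        kAudioObjectPropertyScopeInput, kAudioObjectPropertyElementMain };
    UInt32 speaking = 0, size = sizeof(speaking);
    AudioObjectGetPropertyData(device, &state, 0, NULL, &size, &speaking);
    printf(speaking ? "speech started\n" : "speech stopped\n");
    return noErr;
}

void EnableVAD(AudioObjectID deviceID) {
    AudioObjectPropertyAddress enable = {
        kAudioDevicePropertyVoiceActivityDetectionEnable,
        kAudioObjectPropertyScopeInput, kAudioObjectPropertyElementMain };
    UInt32 on = 1;
    AudioObjectSetPropertyData(deviceID, &enable, 0, NULL, sizeof(on), &on);

    // Speech start/stop arrives as changes to the ...State property.
    AudioObjectPropertyAddress state = {
        kAudioDevicePropertyVoiceActivityDetectionState,
        kAudioObjectPropertyScopeInput, kAudioObjectPropertyElementMain };
    AudioObjectAddPropertyListener(deviceID, &state, VADListener, NULL);
}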

But getting them to play together is another matter, and I'm struggling
hard here.

I've rigged up a simple test
(https://gist.github.com/p-i-/d262e492073d20338e8fcf9273a355b4), in which a
440 Hz sine wave is generated in the render callback and the mic input is
recorded to file in the input callback.
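(The generator is essentially the following; this is a sketch, not the
gist verbatim, and 48 kHz mono float is an assumption here.)

#include <AudioToolbox/AudioToolbox.h>
#include <math.h>

static double gPhase = 0.0;

static OSStatus RenderSine(void *inRefCon,
                           AudioUnitRenderActionFlags *ioActionFlags,
                           const AudioTimeStamp *inTimeStamp,
                           UInt32 inBusNumber, UInt32 inNumberFrames,
                           AudioBufferList *ioData) {
    const double freq = 440.0, sampleRate = 48000.0, amp = 0.5;
    float *out = (float *)ioData->mBuffers[0].mData;
    for (UInt32 i = 0; i < inNumberFrames; i++) {
        out[i] = (float)(amp * sin(gPhase));
        gPhase = fmod(gPhase + 2.0 * M_PI * freq / sampleRate, 2.0 * M_PI);
    }
    return noErr;
}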

The AEC works delightfully, subtracting the sine wave and recording my
voice. And if I turn the sine-wave amplitude down to 0, the VAD correctly
fires the speech-started and speech-stopped events.

But if I turn the sine wave back up, it messes up the VAD.

Presumably the VAD is operating on the pre-echo-cancelled audio (the raw
mic signal), which is most undesirable.

How can I progress here?

My thought was to build an audio pipeline using AUGraph, but my efforts
have so far been unsuccessful, and I lack confidence that I'm even pushing
in the right direction.

The idea: an IO unit that interfaces with the hardware (mic/speaker),
which plugs into an AEC unit, which plugs into a VAD unit.

But I can't see how to set this up.

On iOS there's a RemoteIO unit to deal with the hardware, but I can't see
any such unit on macOS. It seems the VoiceProcessing unit wants to do that
itself.

And then I wonder: could I create a second VoiceProcessing unit and route
vp1_aec's bus 1 (mic) output scope into vp2_vad's bus 1 input scope?
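In AUGraph terms I'm imagining something like this (untested; I don't
know whether a VoiceProcessing unit will even accept a connection on its
bus-1 input scope, or whether one graph tolerates two I/O units; that is
exactly what I'm asking):

#include <AudioToolbox/AudioToolbox.h>

void BuildVpChainSketch(void) {
    AUGraph graph;
    NewAUGraph(&graph);

    AudioComponentDescription vpDesc = {
        .componentType         = kAudioUnitType_Output,
        .componentSubType      = kAudioUnitSubType_VoiceProcessingIO,
        .componentManufacturer = kAudioUnitManufacturer_Apple,
    };
    AUNode vp1_aec, vp2_vad;
    AUGraphAddNode(graph, &vpDesc, &vp1_aec);  // talks to hardware, does AEC
    AUGraphAddNode(graph, &vpDesc, &vp2_vad);  // hoped-for VAD stage
    AUGraphOpen(graph);

    // vp1's bus 1 output scope (post-AEC mic) -> vp2's bus 1 input scope?
    AUGraphConnectNodeInput(graph, vp1_aec, 1, vp2_vad, 1);

    AUGraphInitialize(graph);
    AUGraphStart(graph);
}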

Can I do this kind of work by routing audio, or do I need to get my hands
dirty with input/render callbacks?

It feels like I'm going hard against the grain if I'm faffing about with
these callbacks.

If there's anyone out there who would care to offer me some guidance here,
I would be most grateful!

π

PS Is it not a serious problem that the VAD can't operate on post-AEC
input?