Dear Audio Engineers,

I'm writing an app that talks to OpenAI's Realtime API (bidirectional realtime audio over WebSocket, with the AI on the server side).
To do this, I need to be careful that the AI's speech doesn't make its way out of the speakers, back in through the mic, and back to their server (else it starts talking to itself and gets very confused). So I need AEC, which I've actually got working, using kAudioUnitSubType_VoiceProcessingIO and AudioUnitSetProperty with kAUVoiceIOProperty_BypassVoiceProcessing set to false.

I also wish to detect when the speaker (me) is speaking or not speaking, which I've also managed to do, via kAudioDevicePropertyVoiceActivityDetectionEnable. But getting the two to play together is another matter, and I'm struggling hard here.

I've rigged up a simple test (https://gist.github.com/p-i-/d262e492073d20338e8fcf9273a355b4), where a 440 Hz sine wave is generated in the render callback and mic input is recorded to file in the input callback. The AEC works delightfully, subtracting the sine wave and recording my voice. And if I turn the sine-wave amplitude down to 0, the VAD correctly triggers the speech-started and speech-stopped events. But if I turn the sine wave up, it messes up the VAD. Presumably the VAD is working over the pre-echo-cancelled audio, which is most undesirable.

How can I progress here? My thought was to create an audio pipeline using AUGraph: an IO unit that interfaces with the hardware (mic/speaker), which plugs into an AEC unit, which plugs into a VAD unit. But my efforts have thus far been unsuccessful, and I lack confidence that I'm even pushing in the right direction; I can't see how to set this up. On iOS there's a RemoteIO unit to deal with the hardware, but I can't see any such unit on macOS. It seems the VoiceProcessing unit wants to own the hardware itself.

And then I wonder: could I make a second VoiceProcessing unit, and have vp1_aec send its bus[1 (mic)] output scope into vp2_vad's bus[1] input scope? Can I do this kind of work purely by routing audio, or do I need to get my hands dirty with input/render callbacks?
It feels like I'm going hard against the grain if I'm faffing with these callbacks. If there's anyone out there who would care to offer me some guidance here, I'd be most grateful!

π

PS: Is it not a serious problem that the VAD can't operate on post-AEC input?
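For concreteness, here is roughly how I'm enabling the device-level VAD (a sketch, assuming the VoiceProcessing unit's input device is already known; the function names enableDeviceVAD/vadListener are mine). Note the property lives on the AudioObject for the device, not on the AU, which may be why it sees raw mic input:

```c
#include <CoreAudio/CoreAudio.h>
#include <stdio.h>

// Fired whenever the device's VAD state flips (speech started/stopped).
static OSStatus vadListener(AudioObjectID device, UInt32 nAddrs,
                            const AudioObjectPropertyAddress *addrs,
                            void *clientData) {
    AudioObjectPropertyAddress stateAddr = {
        kAudioDevicePropertyVoiceActivityDetectionState,
        kAudioObjectPropertyScopeInput,
        kAudioObjectPropertyElementMain
    };
    UInt32 state = 0, size = sizeof(state);
    AudioObjectGetPropertyData(device, &stateAddr, 0, NULL, &size, &state);
    printf(state ? "speech started\n" : "speech stopped\n");
    return noErr;
}

static void enableDeviceVAD(AudioObjectID inputDevice) {
    // Turn VAD on for the input device...
    AudioObjectPropertyAddress enableAddr = {
        kAudioDevicePropertyVoiceActivityDetectionEnable,
        kAudioObjectPropertyScopeInput,
        kAudioObjectPropertyElementMain
    };
    UInt32 enable = 1;
    AudioObjectSetPropertyData(inputDevice, &enableAddr, 0, NULL,
                               sizeof(enable), &enable);

    // ...then listen for state changes (speech started/stopped events).
    AudioObjectPropertyAddress stateAddr = {
        kAudioDevicePropertyVoiceActivityDetectionState,
        kAudioObjectPropertyScopeInput,
        kAudioObjectPropertyElementMain
    };
    AudioObjectAddPropertyListener(inputDevice, &stateAddr, vadListener, NULL);
}
```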
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Coreaudio-api mailing list (Coreaudio-api@lists.apple.com)
Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/coreaudio-api/archive%40mail-archive.com
This email sent to arch...@mail-archive.com