alex stone wrote:
>
> On Mon, Mar 9, 2009 at 2:59 PM, Olivier Guilyardi <[email protected]
> <mailto:[email protected]>> wrote:
>
>     alex stone wrote:
>
>     > If you're intent on automating a speech analysis, voice noise
>     > removal device of some sort, then you might do well to start with
>     > a 'pre and post' framework. Things like lipsmacking, glottal and
>     > nasal noise for the end of phrase, etc, are fairly easy to
>     > identify, and generally occur pre and post. So that may well be a
>     > decent percentage of any cleanup done quickly. (Dependent of
>     > course on language. Cleaning up Russian would be a different
>     > 'module' to cleaning up French, or Finnish.)
>
>     That sounds encouraging. What do you mean by "pre and post" (sorry
>     if that's an obvious question to you)?
>
> [...]
>
> Pre and post meaning the start and finish of a recorded wav or region.
> Example being the first few, and the last few, milliseconds or so. Most
> of this would be obvious to the ear, so I can imagine a means to edit
> this could be mechanised in some way. (Being careful, of course, not to
> dehumanise the original recording too far.)
Alright, got that. We've run some experiments. On a 3-minute speech
recording, we found 53 noises, with the following distribution:

    inspire:        37
    expiration:      2
    lips:            6
    nose:            3
    glottal:         1
    inspire+lips:    4

That means "inspire" noises (breathing in, between two phrases or groups
of words) make up 70% of the total. These can simply be silenced, with no
need for frequency filtering, because they always happen "pre and post"
as you say, and they're apparently always preceded and followed by small
silences.

Here's the spectrogram+waveform of two inspire noises (the cursor, a
white vertical line, is on the noise). Each view shows one noise,
surrounded by speech:

http://www.samalyse.com/code/speechfilter/inspire1.png
http://www.samalyse.com/code/speechfilter/inspire2.png

We've also measured the duration of 14 inspire noises. All but one are
under 1 second. The durations range from 256ms to 1024ms, with an
average of 529ms.

An automatic way of removing these inspire noises may largely satisfy
the users I'm dealing with. Saving 70% of the manual editing time is
anything but marginal. So I'm going to concentrate on that first,
leaving lip, nose and other noises for later.

> Thinking further about modules, you might consider the inclusion
> (should you try this) of a user-definable module, in which the user
> could set parameters. Consider the lone singer at home, or the
> voice-over artist who uses the same 'voice' on a regular basis. They
> would tend to form phrases, and speech, in the same way, more often
> than not, including mouth noise, nasal, etc... (Big generalisation
> here, but to get the point across...)
> If the user can use his or her own template each time as a start
> point, then it might prove more efficient, and definable, as a
> mechanised process.
> (Alex singing module, Olivier talking module, etc)

Correct me if I'm wrong, but if I code this as a plugin which exposes
parameters, I think that presets should be handled by the host, not the
plugin.

Anyway, I'm not sure I could code this as a plugin, because detecting
the inspire noise would require buffering something like 2 seconds of
signal. That might not be such a problem though: there's already plenty
of non RT-capable plugins...

Plus, before removal, visual/auditory review of the detected noises (in
some sort of audio editor) sounds quite important: there's always a risk
of confusing a noise with the end of a phrase or another element of
speech. So I might need to craft a little GUI, or manage to integrate
this detection into ReZound, Ardour, etc.

Anyway, before any of this happens I need to find a way to detect the
noises. A colleague has told me that the best technology in this field
currently involves a database of noise recordings: you then try to find
these noises in the signal by doing a more-or-less tolerant comparison
in the frequency domain.

However, looking at the above spectrograms and waveforms, I think there
could be a more algorithmic way of detecting these noises, avoiding the
need for such a database, given the following facts:

1 - on the waveform: their amplitude is much lower than that of speech
2 - on the spectrogram: the frequencies in the noise seem to spread
    rather homogeneously (maybe a bit like white noise), whereas the
    speech contains noticeable peaks under 1000Hz or so

Do you think I could use these characteristics to detect the noises?

PS: Alex, maybe you could try to improve the way you handle citations
when posting? That's not essential, but it would make your replies more
readable...

--
  Olivier

_______________________________________________
Linux-audio-dev mailing list
[email protected]
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
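[Editor's sketch] The cues discussed in the thread (inspire noises are
much quieter than speech, bordered by silence on both sides, and last
roughly 200ms-1200ms) can be turned into a simple frame-level detector.
The sketch below is only an illustration of that idea, not Olivier's
actual code: the function name, thresholds, frame size and sample rate
are all assumed values that would need tuning on real recordings, and
candidates would still go through the manual review step mentioned above.

```python
import numpy as np

def detect_inhale_candidates(signal, rate, frame_ms=20,
                             silence_db=-50.0, speech_db=-25.0,
                             min_ms=200, max_ms=1200):
    """Return (start, end) times, in seconds, of segments whose
    short-time level sits between the silence floor and typical speech
    level, last roughly min_ms..max_ms, and are bordered by silence on
    both sides (the "pre and post" pattern). All thresholds are
    illustrative guesses, not measured values from the thread."""
    frame = int(rate * frame_ms / 1000)
    n = len(signal) // frame
    # per-frame RMS level, converted to dB (epsilon avoids log of zero)
    rms = np.sqrt(np.mean(signal[:n * frame].reshape(n, frame) ** 2,
                          axis=1))
    db = 20.0 * np.log10(rms + 1e-10)
    # classify each frame: 0 = silence, 1 = breath-level, 2 = speech-level
    cls = np.where(db < silence_db, 0, np.where(db < speech_db, 1, 2))
    candidates = []
    i = 0
    while i < n:
        if cls[i] != 1:
            i += 1
            continue
        j = i                       # extend the run of breath-level frames
        while j < n and cls[j] == 1:
            j += 1
        dur_ms = (j - i) * frame_ms
        silent_before = (i == 0) or cls[i - 1] == 0
        silent_after = (j == n) or cls[j] == 0
        if min_ms <= dur_ms <= max_ms and silent_before and silent_after:
            candidates.append((i * frame / rate, j * frame / rate))
        i = j
    return candidates
```

A second pass could then use point 2 above to reject false positives:
candidates whose spectrum shows strong peaks below 1000Hz (unlike the
rather flat, white-noise-like spectrum of a breath) are probably speech,
not noise.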
