Ross Vandegrift wrote:
> My goal is not to determine how things sound.  See below for more explanation.
> 

> By using auditory phonetics.  The field of auditory phonetics attempts to
> analyze speech sounds as they travel through the air in waveforms.  Given
> a waveform, one can calculate a spectrographic image of the waveform.
> 

Given a waveform, many things are possible.


> Currently, it is well known how features of speech show up in a spectrographic
> plot of that speech's waveform.  What is unknown is what MP3/OggVorbis do to
> this spectrogram (i.e., the speech data).

A spectrogram is only one representation of speech data.
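To make that concrete, here is a minimal sketch of how a magnitude
spectrogram is computed from PCM samples, using only the Python
standard library.  The frame length, hop size, and the synthetic
test tone are assumptions chosen for illustration, not values from
this thread.  Note that taking abs() discards the phase of every
bin, which is one reason the spectrogram is not the whole signal.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..n/2 of one windowed frame (naive DFT)."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hann(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc))  # phase is thrown away here
    return spectrum


def spectrogram(samples, frame_len=64, hop=32):
    """List of magnitude spectra, one per overlapping frame."""
    starts = range(0, len(samples) - frame_len + 1, hop)
    return [magnitude_spectrum(samples[s:s + frame_len]) for s in starts]


# Synthetic tone with exactly 8 cycles per 64-sample frame,
# standing in for real speech data.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
frames = spectrogram(tone)

# Index of the strongest bin in each frame.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in frames]
print(len(frames), peak_bins)
```

The naive DFT is slow but easy to read; a real analysis would use
an FFT routine instead.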

The system of filter banks used by MP3 audio does not realize a
perfectly reversible transform.  After the initial transform,
both encoders (MP3 and Ogg Vorbis) send the data through a
psychoacoustic blender anyway.  You will lose inaudible parts of
the speech signal that analysis might otherwise extract.


> Perhaps (and I suspect) the loss is totally negligible for some bitrate n.
> Then the result of encoding a speech sample at bitrate n preserves all of
> the features a linguist needs to do auditory phonetic analysis.  The
> "interesting" auditory features of a speech waveform happen independent of
> the psychoacoustics.  I'm attempting to examine if MP3/OggVorbis
> psychoacoustics fit speech analysis well enough.

Perceptual encoding adds noise and distortion, with
the goal that anything added (or removed) is inaudible.  Other
applications that have different goals may well tolerate more
noise, or may demand less distortion.


> For example, if LAME at bitrate n causes much attenuation of white
> noise in the lower frequency bands, it's completely unsuitable for
> storing speech samples destined for linguistic analysis - low-band
> noise is a major distinguisher of certain phonemes.

A perceptual encoder should not strip dominant frequencies
that are audible.  Less powerful frequencies in the shadows (with
respect to frequency and time) of the dominant may be stripped
away if they are judged inaudible.


> I'm not examining "why".  In part I wish I were.  Unfortunately I'm an
> undergraduate Mathematics student, not a graduate linguistics student, and I
> only have so much time.  The immediate question that my professor would like
> answered is: "If I want to publish a library of speech online, is there some
> reasonable format I can use that will keep the necessary speech features, and
> preserve the academic viability of said speech?  Cause it would be nice if I
> didn't have to post gigantic .wav files...."  Perhaps if I ever go on to
> grad school for linguistics I'd take the tack you propose.  Just not now ::-).

What is deemed viable must be precisely defined.  If human ears
are the only target, then MP3 audio is probably a good compromise.
If further mathematical analysis is required, MP3 may not be
suitable due to the waveform-dependent artifacts introduced.
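One way to pin down "viable" is to pick an objective distance
between the original and the decoded audio and state a tolerance
for it.  The sketch below computes a log-spectral distance (in dB)
between two equal-length, time-aligned frames.  The metric, the
function names, the power floor, and the synthetic signals are all
assumptions made for illustration; a linguist may well need a
different measure.

```python
import cmath
import math


def power_spectrum(frame):
    """Power of DFT bins 0..n/2 of one frame (naive DFT, no window)."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(abs(acc) ** 2)
    return out


def log_spectral_distance(reference, decoded, floor=1e-12):
    """RMS difference of the two log power spectra, in dB.

    `floor` clamps tiny powers so that log10 never sees zero.
    """
    p_ref = power_spectrum(reference)
    p_dec = power_spectrum(decoded)
    squared = [
        (10 * math.log10(max(a, floor)) - 10 * math.log10(max(b, floor))) ** 2
        for a, b in zip(p_ref, p_dec)
    ]
    return math.sqrt(sum(squared) / len(squared))


n = 64
# A strong tone plus a much weaker high-frequency component.
reference = [math.sin(2 * math.pi * 4 * t / n)
             + 0.01 * math.sin(2 * math.pi * 20 * t / n)
             for t in range(n)]
# Hypothetical "decoded" frame: the weak component has been removed,
# as a stand-in for what a perceptual coder might do.
stripped = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]

same = log_spectral_distance(reference, reference)
changed = log_spectral_distance(reference, stripped)
print(same, changed)
```

The change is inaudible by construction, yet the distance is far
from zero, which is exactly the gap between "sounds the same" and
"analyzes the same".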

Sticking to PCM waveforms for research data eliminates any
questions that the MP3 audio may introduce.  Specialized lossless
audio compression is available.  The general data compressor "RAR"
also does a good job for PCM audio (unlike "ZIP" compression).
Very roughly, the compressor will cut the size in half.
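A small experiment shows why a general-purpose compressor tends to
struggle with PCM.  The sketch below compresses 16-bit samples with
zlib (DEFLATE, the method commonly used inside ZIP files) twice:
once as-is, and once after first-order delta coding, a very crude
form of the prediction that specialized lossless audio coders
generally rely on.  The test signal and all names are assumptions
for illustration, and the sizes it prints say nothing definite
about real speech or about any particular compressor.

```python
import math
import random
import struct
import zlib


def wrap16(value):
    """Wrap an integer into the signed 16-bit range."""
    return (value + 32768) % 65536 - 32768


def delta_encode(samples):
    """Replace each sample by its difference from the previous one."""
    prev = 0
    out = []
    for s in samples:
        out.append(wrap16(s - prev))
        prev = s
    return out


def delta_decode(deltas):
    """Exact inverse of delta_encode."""
    prev = 0
    out = []
    for d in deltas:
        prev = wrap16(prev + d)
        out.append(prev)
    return out


def pack16(values):
    """Little-endian signed 16-bit PCM bytes."""
    return struct.pack("<%dh" % len(values), *values)


# One second of a synthetic low-frequency signal plus a little noise,
# standing in for a real recording.
rng = random.Random(0)
rate = 44100
pcm = [int(8000 * math.sin(2 * math.pi * 180 * t / rate)
           + 3000 * math.sin(2 * math.pi * 410.5 * t / rate))
       + rng.randint(-8, 8)
       for t in range(rate)]

original_size = len(pack16(pcm))
raw_size = len(zlib.compress(pack16(pcm), 9))
delta_size = len(zlib.compress(pack16(delta_encode(pcm)), 9))
print(original_size, raw_size, delta_size)
```

Because delta_decode() restores every sample exactly, nothing is
lost; only the representation handed to the compressor changes.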


> Any qualitative analysis?  My professor insists no one has done a rigorous
> study of speech preservation through MPEG/OggVorbis compression.  Can I tell
> him he's wrong?

Information is lost with perceptual coding.  The goal is not to
preserve speech as it is uttered, but speech as it is perceived.

To this end of preserving perceived audio, many studies have been
performed, but most are based on empirical data.  Some interesting
papers examine physical models for the ear and simplistic models
for aural neurons.  I've yet to find an investigation that also
models the neural network involved in post-processing and
perception.


Kind regards,

- John
_______________________________________________
mp3encoder mailing list
[EMAIL PROTECTED]
http://minnie.tuhs.org/mailman/listinfo/mp3encoder
