>
> Hi,
>
> I have just committed changes to add the --nspsytune option.
> With this option, vbrtest.wav is encoded perfectly, and
> encoded sound becomes more natural to my ear though encoded file size
> increases.
>
> The --nspsytune command line option turns on the following things:
>
> 1. Addition of simultaneous masking.
> 2. MAXNOISE is selected if max/avg exceeds predefined threshold.
> 3. long block fft window function is changed to blackmann window.
> 4. new tonality measure is introduced, since it seems that old tonality
> is not working correctly.
> 5. Usage of tonality is changed. If tonality exceeds predefined
> threshold, masking made by that partition is suppressed by 10dB.
>
> --
> Naoki Shibata e-mail: [EMAIL PROTECTED]
>
> --
Hi Naoki,
I spent some time looking at all these changes. Here are some
comments:
1. This is just a coding issue. This code:
    if (gfp->exp_nspsytune) {
        for ( k = gfc->s3ind[b][0]; k <= gfc->s3ind[b][1]; k++ ) {
            gfc->s3_l[b][k] *= 0.7;
        }
    }
is probably not in the right place. Right now, the spreading function
is normalized so that (for example) convolving s3 with a constant
function will not remove any energy and return the same constant.
After the spreading function is applied, you can then adjust the
strength based on the tonality measure, and if you want a uniform
reduction by 0.7 (1.5dB), it can be incorporated later.
(In fact, it looks like you later increase the masking by 2dB, so all of
this could be done at that point?)
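To make the ordering concrete, here is a minimal sketch, not the actual LAME code. The flat row-major layout and the names normalize_spreading and spread_with_gain are made up for illustration. Each row of the spreading matrix is normalized to sum to 1, so spreading a constant returns the same constant, and the uniform 0.7 factor is applied afterwards as a separate gain:

```c
#include <assert.h>
#include <stddef.h>

/* Normalize each row of an n-by-n spreading matrix (row-major) so that it
   sums to 1. Spreading a constant energy then returns the same constant. */
void normalize_spreading(double *s3, size_t n)
{
    for (size_t b = 0; b < n; b++) {
        double sum = 0.0;
        for (size_t k = 0; k < n; k++)
            sum += s3[b * n + k];
        assert(sum > 0.0);
        for (size_t k = 0; k < n; k++)
            s3[b * n + k] /= sum;
    }
}

/* Spread the partition energies eb[], then apply a uniform gain
   (for example 0.7, about -1.5 dB) as a separate final step. */
void spread_with_gain(const double *s3, const double *eb, size_t n,
                      double gain, double *out)
{
    for (size_t b = 0; b < n; b++) {
        double acc = 0.0;
        for (size_t k = 0; k < n; k++)
            acc += s3[b * n + k] * eb[k];
        out[b] = gain * acc;
    }
}
```

With the gain kept out of the matrix, the normalization property stays easy to check, and the 0.7 factor can be merged with any later adjustment of the masking strength.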
2. tonality measure:
I can't really understand your tonality measure: it looks like a
comparison between peak and average energy within a partition band?
This is based on the theory that noise usually has a flat
spectrum, while pure tones would have sharp peaks?
The ISO measure of tonality is based on how stationary a signal is
in time. Thus the ISO formula is based on measuring the change in
energy and phase over 3 granules: if they don't change much over these
3 granules, the signal is considered very tone-like, and if they
change a lot, noise-like.
3. Simultaneous masking: This is based on the theory that
two maskers, when added together, can (but not always?) give
more masking than the sum of their individual maskings.
I haven't looked at Zwicker's book (Lincoln's reference for this)
but I imagine it is based on tests with just 1 or 2 maskers.
In MPEG, every signal is considered a masker, and they are
being more conservative by just adding them.
4. Your point #5 above is very similar to the ISO formula
which is implemented via "minval" threshold. The strength
of the maskings (computed based on tonality) is not allowed
to exceed a certain threshold. The ISO formula is a little
more complicated in that this threshold depends on frequency:
for low frequencies, minval is more restrictive (resulting
in less masking than would be used w/o minval).
In AAC, minval was dropped, although it may have been dropped
from the spec but still be used in the commercial encoder!
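As a rough sketch of how such a limit works, assuming the ISO-style form in which the required signal-to-mask ratio is interpolated between a tone-masking-noise and a noise-masking-tone offset and then floored at minval: the function name and the two offset values below are illustrative assumptions, not numbers taken from this thread or from the committed code.

```c
#include <assert.h>

/* Illustrative offsets in dB; assumed values for this sketch only. */
#define TMN_DB 29.0 /* tone masking noise */
#define NMT_DB 6.0  /* noise masking tone */

/* Required signal-to-mask ratio (dB) for a partition with tonality tb in
   [0, 1]. minval_db is a per-partition floor: the ratio may not drop
   below it, which caps the masking credited to that partition. */
double required_snr_db(double tb, double minval_db)
{
    assert(tb >= 0.0 && tb <= 1.0);
    double snr = tb * TMN_DB + (1.0 - tb) * NMT_DB;
    return snr > minval_db ? snr : minval_db;
}
```

A larger minval at low frequencies means a noise-like partition there is still required to keep a high ratio, i.e. it gets less masking than the tonality alone would give it.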
5. MAXNOISE: This is probably the biggest effect?
Switch to a maxnoise formulation and you raise the bitrate
by 20kbps or so, which will improve vbrtest.wav. But
if, for example, else3.wav (which sounds fine with VBR)
also has an increase in bitrate for the same quality setting,
then nothing has really changed except the VBR scale.
The real goal is to find something that will increase
the bitrate of vbrtest.wav, without disturbing the bitrate
of other samples which sound ok with VBR.
I've played with MAXNOISE and do not really like it since it is based
on inaccurate energy estimates of single MDCT coefficients, rather
than some kind of averaging. For example, take a signal with a very
large N'th coefficient. A tiny change in this signal can easily move
the energy so that it is now 50%/50% between N and N+1 coefficients.
The thing that doesn't change is the total energy. Thus I think some
type of smoothing needs to be done. A better solution might be to
take the maximum of a moving average of 5 coefficients over all the
coefficients in the band.
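A minimal sketch of that suggestion, with hypothetical function names and a plain array of MDCT coefficients standing in for one band (this is not code from LAME):

```c
#include <assert.h>
#include <stddef.h>

/* Largest single-coefficient energy in the band (a MAXNOISE-style peak). */
double max_energy(const double *xr, size_t n)
{
    assert(n > 0);
    double best = 0.0;
    for (size_t i = 0; i < n; i++) {
        double e = xr[i] * xr[i];
        if (e > best)
            best = e;
    }
    return best;
}

/* Largest moving average of energy over `width` consecutive coefficients.
   If the band is shorter than the window, the average over the whole
   band is returned. */
double max_smoothed_energy(const double *xr, size_t n, size_t width)
{
    assert(n > 0 && width > 0);
    if (width > n)
        width = n;
    double sum = 0.0;
    for (size_t i = 0; i < width; i++)
        sum += xr[i] * xr[i];
    double best = sum;
    for (size_t i = width; i < n; i++) {
        /* slide the window one coefficient to the right */
        sum += xr[i] * xr[i] - xr[i - width] * xr[i - width];
        if (sum > best)
            best = sum;
    }
    return best / (double)width;
}
```

For a band with all the energy in coefficient N, and the same band with the energy split 50%/50% between N and N+1, max_energy drops by half while max_smoothed_energy with width 5 stays the same, which is the stability argued for above.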
6. Blackman window: (I think this is "Blackman", even
though the name is usually spelled Blackmann.)
There is a minor bug in your formula: the window should be centered
over the input data, going to zero at each end, and be periodic with
period 1024. Thus window[N]=window[N+1024], and the zero should
occur between sample 1023 and 1024 (sample 1024 = sample 0).
Of course the FFT only takes data sampled at window[0..1023], but those
are the sample values of a function satisfying the above constraints.
Mark
--
MP3 ENCODER mailing list ( http://geek.rcc.se/mp3encoder/ )