On 2015-08-31, [email protected] wrote:
Real-time auralization of dynamic sound sources in Virtual Environments would be one application. Coming more from the graphics/interaction side myself, I see our fellow acoustician colleagues do this in order to couple arbitrarily moving sound sources with a potentially dynamic room acoustics simulation.
That is a good point. But then, in the past many spatial audio folks have also optimized such computations by relying on head-related transfer functions, which are precomputed into the Fourier domain, and then even relied on linear interpolation between them. That's been feasible since e.g. the KEMAR dummy-head sets of HRTFs are pretty dense, which brings the interpolation error down considerably.
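Roughly, in toy Python (my own made-up names, and glossing over the phase trouble you get when linearly interpolating complex spectra), the scheme looks like this:

import numpy as np

def interpolated_hrtf(azimuths, H, az):
    # azimuths: sorted measured azimuths in degrees; H: one row of rfft
    # bins per measured direction, for one ear -- stand-ins for a dense
    # KEMAR-style set
    az = az % 360.0
    hi = np.searchsorted(azimuths, az) % len(azimuths)
    lo = (hi - 1) % len(azimuths)
    span = (azimuths[hi] - azimuths[lo]) % 360.0 or 360.0
    w = ((az - azimuths[lo]) % 360.0) / span
    # linear interpolation between the two nearest measured directions
    return (1.0 - w) * H[lo] + w * H[hi]

The interpolated spectrum then just multiplies the source block's FFT, so the per-block cost stays constant no matter how the source moves.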
As such, I'm not too sure using brute-force FIR is really necessary here. But since I haven't gone through the papers yet, do correct me if I'm wrong. Why precisely did they choose straight time-domain FIR, instead of Fourier-mediated convolution?
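Just to make the trade-off I'm asking about concrete, here's a quick numpy check (mine, not from their papers): brute-force time-domain FIR and FFT-mediated convolution give the same output, the FFT route just costs O(N log N) instead of O(N*M) for an M-tap filter.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)    # dry signal block
h = rng.standard_normal(512)     # stand-in for an HRIR or room response

y_fir = np.convolve(x, h)        # straight time-domain FIR

nfft = len(x) + len(h) - 1       # full linear-convolution length
y_fft = np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)

assert np.allclose(y_fir, y_fft)

In a real-time setting you'd do that blockwise with overlap-add or a partitioned scheme, but the arithmetic advantage stays the same.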
If I have a position-tracked audience wearing headphones and able to move during a musical performance, or e.g. the moving actors in a theatre performance as virtual sound sources, and I want to place them in arbitrary virtual acoustic settings, the techniques cited above would probably apply as well.
Now headtracking, that's interesting to me. Besides this list, I'm a long time participant on the sursound list, and as such a bit of an ambisonic freak.
One of my best, novel ideas there was how to do zero-delay directional head tracking. Of course at considerable computational cost, but I think it's quite possible on modern multicore hardware, thanks to ambisonics' inherently parallel arithmetic. Nobody's implemented that idea as of now, but...
If you want to try it out, conceptualize the binaural ambisonic framework a bit differently from how you normally do it. Typically you'd have a number of spherical harmonics pulsing around, which you sample with a conceptual (virtual) speaker array, and then you'd project the speaker feeds down onto two static headphone channels. Instead of doing that, go with the original formulation:
What you have is a number of rotationally symmetrical fields which add up to a whole soundfield over a sphere. Now sample the whole thing at once with a KEMAR set, and you're left with a simple binaural rendering of the field, now with time structure: the convolution we've been talking about.
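Computationally that sampling is the familiar virtual-speaker route; in toy Python, first order and horizontal-only, with made-up names:

import numpy as np

def binaural_from_foa(W, X, Y, speaker_az, hrir_l, hrir_r):
    # W, X, Y: first-order horizontal ambisonic signals; speaker_az:
    # azimuths (radians) of the conceptual speaker ring; hrir_l/hrir_r:
    # per-direction HRIRs from a KEMAR-style set
    left = right = 0.0
    for k, az in enumerate(speaker_az):
        # one common basic first-order decode; gain conventions vary
        feed = (np.sqrt(2.0) * W + np.cos(az) * X + np.sin(az) * Y) / len(speaker_az)
        left = left + np.convolve(feed, hrir_l[k])
        right = right + np.convolve(feed, hrir_r[k])
    return left, right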
But then that problem is very much over-determined when you rotate your head around. Because of the rotational symmetry of any ambisonic system of a given order, no matter how many directional samples you have in the set, all rotations of the sample set will at most give you the same number of independent degrees of freedom. So, in fact, you can reduce the whole KEMAR set to whatever degree of ambisonic representation you want by just integrating the response over the sphere of possible rotations.
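That reduction is just a projection of the measured set onto the harmonic basis. A least-squares version in toy Python, horizontal-only circular harmonics to keep it short (the full spherical case swaps in the Y_lm):

import numpy as np

def hrirs_to_harmonic_filters(meas_az, hrirs, order):
    # meas_az: measured azimuths in radians; hrirs: (n_dirs, taps) for one
    # ear; returns (2*order+1, taps) harmonic-domain filters
    cols = [np.ones_like(meas_az)]
    for m in range(1, order + 1):
        cols += [np.cos(m * meas_az), np.sin(m * meas_az)]
    B = np.stack(cols, axis=1)
    # least-squares projection: solve B @ F ~= hrirs for F
    F, *_ = np.linalg.lstsq(B, hrirs, rcond=None)
    return F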
After that, you can do the funniest thing: because of the linearity of the spherical Fourier integral of the ambisonic system, and because of the linearity of sampling it via the HRTF set (KEMAR in this case), it's legitimate to exchange the order of the two operations. Even if the HRTF set has time structure, because that structure is separable from direction.
If you do that, you no longer 1) rotate a sound source into position, 2) convolve with two or more HRTFs, and 3) reduce into binaural sound. Instead you 1) render onto ambisonics of a given order at a given angle of arrival, 2) apply an invariant many-by-many convolution which transmits the sound from your source to the sphere on which the listener's ears can lie, and 3) then you just sample that sphere at two points.
Sure, it's a heavier calculation. But within the ambisonic framework it's guaranteed to be perfect, with constant computational load, and when you calculate it in that order, it's absolutely zero-delay. Absent Doppler products of your head turning really fast -- which too can be mimicked at low extra cost -- this sort of thing ought to be a gamer's *dream*. 8)
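To make the ordering concrete, here's a horizontal-only toy of the whole chain in Python; it's my sketch of the idea, not a reference implementation, and it reuses the harmonic-domain ear filters F from the projection above (sign conventions glossed over). Nothing in steps 1 and 2 depends on head orientation; the tracked head angle only enters as per-sample gains at the very end, which is exactly where the zero delay comes from.

import numpy as np

def render_channels(src, src_az, F, order):
    # steps 1+2: encode the source at its azimuth and run the invariant,
    # head-independent convolutions (F row 0 is the m=0 filter, then
    # cos/sin pairs per order m)
    a = [np.convolve(src, F[0])]
    b = [np.zeros_like(a[0])]
    for m in range(1, order + 1):
        X = np.cos(m * src_az) * src                        # step 1: encode
        Y = np.sin(m * src_az) * src
        fc, fs = F[2 * m - 1], F[2 * m]
        a.append(np.convolve(X, fc) + np.convolve(Y, fs))   # step 2
        b.append(np.convolve(Y, fc) - np.convolve(X, fs))
    return a, b

def sample_ear(a, b, head_az):
    # step 3: "sample the sphere" at the ear; head rotation is gains only,
    # so updating head_az adds no latency at all
    out = a[0].copy()
    for m in range(1, len(a)):
        out += np.cos(m * head_az) * a[m] + np.sin(m * head_az) * b[m]
    return out

You'd run render_channels once per ear's filter set and call sample_ear with the freshly tracked head angle on every block, or every sample if you like.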
Whether the same degree of realism is really required for a musical performance is probably debatable, but if I want to accurately reproduce the room-acoustic properties of dynamic scenes, convolving (probably very many) sound sources with very long, dynamically generated impulse responses is definitely something you would do in the context of musical DSP.
Yet, can you really hear that difference? This goes rapidly into psychoacoustical territory, I know. Obviously if you want to do everything perfectly, you have to utilize a bunch of nasty, expensive methods to do so. At worst the kinds which call for supercomputers, weeks and petabytes, running a high-end solver for the wave equation. (In my favourite techno, the *nonlinear* wave equation as well, because of what happens with high-level bass.)
I'm not too sure that is the relevant margin we should be thinking about in musical DSP, though. Isn't the definition of music something we acutely hear, and find pleasing? If so, maybe we should actually speak more about how we hear and feel, with respect to our algorithms? And not so much about how to make an acoustical simulation just right? ;)
(And yes, sorry again, I have a tendency to get carried away a bit. No harm, no foul, right...)
--
Sampo Syreeni, aka decoy - [email protected], http://decoy.iki.fi/front
+358-40-3255353, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
_______________________________________________
dupswapdrop: music-dsp mailing list
[email protected]
https://lists.columbia.edu/mailman/listinfo/music-dsp
