On Thu, Jul 27, 2017 at 9:43 PM, Marco Domingues <marcodomingue...@gmail.com
> wrote:
> I’ve uploaded the patch with the normals fixed as you asked. (
> https://sourceforge.net/p/brlcad/patches/472/)
>
> Meanwhile, I’ve been working on optimizing the process of building the
> regiontable, by precomputing the table with the regions involved in each
> primitive instead of doing this for each partition in the rt_boolfinal
> kernel. This resulted in a significant improvement in performance! I
> will include a table with the new times in my log post tonight for
> reference.
>
Excellent results! 2.72 seconds now (with the optimization) vs 13.53
seconds before for 'havoc.g'!
This level of performance should be good enough already to be of use in a
lot of practical cases.
The BRL-CAD OpenCL librt module is up to 3x faster than the ANSI C librt
module on those test scenes. However, in some scenes, like havoc.g, it is
still slower than the ANSI C module... But I expect performance to improve
further once we implement the optimization I suggested, which the ANSI C
code already does: skipping invalid partitions during the render phase.
Which x86 CPU do you have? AFAIK BRL-CAD ANSI C librt uses scalar SSE2, so
it should be able to issue about 2 DP FLOPS/cycle per core. With vector
instructions it is possible to issue 4, 8, or 16 DP FLOPS/cycle,
depending on the processor.
So in theory, if we exploit the vector instructions well, we can expect
up to an 8x performance improvement (16 vs. 2 DP FLOPS/cycle).
We will also have to test performance against stock ANSI C librt. Remember
that you are using an experimental simplified bool_eval() path that I
implemented. It could perform worse than the stock bool_eval() function
because it doesn't short-circuit expression evaluation, but it is a lot
simpler to evaluate and uses less memory. Besides that, the only other
optimization I can think of that the ANSI C code does and we don't is that
it evaluates rt_boolfinal() in a partial fashion in rt_shootray(), while
your code always does a full evaluation. It might be worthwhile to measure
the performance impact of that as well.
-[
In case my simplified bool_eval() proves to be slower than the stock
bool_eval() function in the ANSI C code, it is possible to optimize this by
storing the tree as a linear array with skip pointers to avoid the gotos.
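
As a rough sketch of the linear-array idea, here is one possible reading
of it, not the actual librt code. All names are hypothetical. It assumes
the tree only contains union, intersection and subtraction (no XOR) and
that the solids in a partition are given as a 32-bit mask. Each leaf
becomes one step with a "skip" target for the true and false outcomes, so
evaluation is a single forward loop that short-circuits without recursion
or gotos:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical names throughout; these are not the librt structures. */
enum bop { B_LEAF, B_UNION, B_INTERSECT, B_SUBTRACT };

struct bnode {                  /* ordinary pointer-based boolean tree */
    enum bop op;
    int solid;                  /* leaves only: bit index of the solid */
    const struct bnode *l, *r;  /* operators only */
};

enum { RES_TRUE = -1, RES_FALSE = -2 };  /* terminal "skip" targets */

struct bstep {                  /* one leaf test in the linear array */
    int solid;
    int on_true;                /* next index, or RES_TRUE/RES_FALSE */
    int on_false;
};

int count_leaves(const struct bnode *n)
{
    return n->op == B_LEAF ? 1 : count_leaves(n->l) + count_leaves(n->r);
}

/* Flatten subtree n into out[start..]; returns the index after the last
 * step written. on_true/on_false say where to continue once the value
 * of this subtree is known. */
int flatten(const struct bnode *n, int start, int on_true, int on_false,
            struct bstep *out)
{
    int mid;

    if (n->op == B_LEAF) {
        out[start].solid = n->solid;
        out[start].on_true = on_true;
        out[start].on_false = on_false;
        return start + 1;
    }
    mid = start + count_leaves(n->l);   /* first step of right operand */
    switch (n->op) {
    case B_UNION:       /* left true decides; else try right */
        flatten(n->l, start, on_true, mid, out);
        return flatten(n->r, mid, on_true, on_false, out);
    case B_INTERSECT:   /* left false decides; else try right */
        flatten(n->l, start, mid, on_false, out);
        return flatten(n->r, mid, on_true, on_false, out);
    default:            /* B_SUBTRACT: left AND NOT right */
        flatten(n->l, start, mid, on_false, out);
        return flatten(n->r, mid, on_false, on_true, out);
    }
}

/* Evaluate the flattened tree; bit i of 'inside' is set when solid i is
 * in the partition. All targets point forward, so the loop terminates. */
int eval_steps(const struct bstep *steps, uint32_t inside)
{
    int i = 0;

    while (i >= 0)
        i = ((inside >> steps[i].solid) & 1u) ? steps[i].on_true
                                              : steps[i].on_false;
    return i == RES_TRUE;
}
```

The flattening would be done once per region, e.g. with
flatten(root, 0, RES_TRUE, RES_FALSE, steps), and only eval_steps() would
run per partition. Whether this beats the stock bool_eval() would have to
be measured.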
I also think there are a lot of possibilities to trim fat in the data
structures, to reduce the memory footprint or increase the locality of
memory accesses. Some of the elements in 'struct hit' are only used during
the render phase, so it is not clear we even need to store them at some
stages. In particular, the hit_normal only seems to be used in the render
phase, while the hit_point can be computed from the ray origin, the ray
direction and the hit_dist, so we are likely better off recomputing these
values rather than caching them. Plus, like I said before, we can use
double[3] instead of double3 (an OpenCL double3 takes up the space of a
double4). The only data accessed in 'struct partition's inhit and outhit
are the hit_dists, so we don't even need to store anything else... Also,
we can stuff the bools as bits someplace instead of using a whole char to
store each of them.
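
To make the data structure ideas above concrete, here is a minimal sketch
in plain C. The struct layout, field names and flag names are all
hypothetical and are not the ones in the current code. It keeps only the
two hit distances per partition, packs the boolean fields into bits of one
byte, and recomputes the hit point from the ray when it is needed:

```c
#include <stdint.h>

/* Hypothetical layout; names are illustrative, not the librt ones. */
struct ray3 {
    double orig[3];     /* ray origin */
    double dir[3];      /* ray direction */
};

enum {                  /* boolean fields packed as bits */
    PT_FLAG_INFLIP  = 1u << 0,
    PT_FLAG_OUTFLIP = 1u << 1
};

struct compact_partition {
    double in_dist;     /* hit_dist of the in hit */
    double out_dist;    /* hit_dist of the out hit */
    uint8_t flags;      /* PT_FLAG_* bits */
};

/* Recompute the hit point on demand: orig + dist * dir. */
void hit_point(const struct ray3 *r, double dist, double out[3])
{
    int i;

    for (i = 0; i < 3; i++)
        out[i] = r->orig[i] + dist * r->dir[i];
}

/* Set or clear one flag bit without touching the others. */
void part_set_flag(struct compact_partition *p, unsigned flag, int on)
{
    if (on)
        p->flags |= (uint8_t)flag;
    else
        p->flags &= (uint8_t)~flag;
}

int part_get_flag(const struct compact_partition *p, unsigned flag)
{
    return (p->flags & flag) != 0;
}
```

The render phase would call hit_point() with in_dist (and compute the
normal there too) instead of reading cached values. Whether the extra
arithmetic is cheaper than the memory traffic it saves is something to
measure, not assume.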
-]
Still, even with all these limitations, the performance is quite excellent.
The fact that it even exceeds the mature ANSI C code's performance on some
scenes at this stage in the rt_boolweave() and rt_boolfinal() functions is
certainly nothing to sneeze at.
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal