On Fri, Aug 11, 2017 at 12:20 AM, Vasco Alexandre da Silva Costa <
> On Thu, Aug 10, 2017 at 11:36 PM, Marco Domingues <
> marcodomingue...@gmail.com> wrote:
>> Thanks for reviewing my code and making the adjustments, Vasco! I’ve
>> integrated the changes in my patch.
>> I’ve finished the port of the new bool_eval() function to OpenCL, and
>> despite the improved performance, it wasn’t enough to outperform the ANSI
>> C code with the Release build.
>> For the havoc scene, I got 1.56sec now vs 2.10sec before, when running
>> the OpenCL code on my GPU. (command ‘rt -z1 -l5 -s1024’). For reference,
>> the same scene renders in 0.63sec with the ANSI C code currently in trunk.
>> Despite that, when I ran the OpenCL code in my CPU, I got 0.64sec now vs
>> 2.79sec before. (command ‘rt -z1 -l5 -s1024’).
> So let me get this straight. The OpenCL backend is slower in your GPU than
> the CPU-based trunk/ ANSI C backend. That's not totally unexpected. You
> have a consumer GPU with nerfed double-precision floating-point (DP FP)
> performance.
> What I want you to do tomorrow is to compare the trunk/ ANSI C backend
> with the OpenCL backend over your CPU with the AMD and Intel OpenCL
> implementations. I also want you to time the results with the single-hit
> mode if you have the time for that.
> Why are the 'rt -s1024' times in your July 27 post different from the
> times in your August 7 post?
>> I am a little intrigued with this, because smaller scenes like the
>> operators.g are clearly faster when using the GPU, (0.06 sec gpu vs 0.16sec
>> cpu). Any explanation?
> Those scenes are fillrate limited with little depth or scene complexity.
>> Other thing that caught my attention was how close the lines RTFM and
>> wallclock from the ‘rt’ output are when running the OpenCL code in the CPU,
>> compared with the same lines from running the OpenCL code in the GPU. (i.e
>> 0.60 and 0.65 sec - cpu vs 0.32 and 1.65 sec - gpu).
>> Couldn’t the big difference on the GPU side be caused from transfers
>> between CPU-GPU and not by performing ray-intersections, boolean evaluation
>> and shading operations? Is there a way to investigate this?
> The best way is to use a profiler like the ones I mentioned before.
> Alternatively one can time the transfers vs the computations by timing the
> appropriate CL calls making sure to clFinish() the queue before you measure
> the time.
>> Tomorrow I will update the previous tables that I shared before on my
>> document, now using Release builds. And will also include side by side
>> image comparisons between the ANSI C and OpenCL results, for each scene.
> Ok. Make sure to try with both the AMD and Intel OpenCL SDKs over the CPU.
> I'm not interested in your GPU results right now as it would only
> complicate the comparison.
PS: If you really want to know what I think is happening, I suspect that in
complex scenes, where the bitvectors are really sparse, we would have been
better off using lists like the ANSI C code does instead. But the thing is,
I don't like the fact that the ANSI C code uses bu_ptbl_ins_unique()
either. I suspect we are better off emitting all the entries in a list and
removing the duplicates later. The other issue that should be causing
slowdowns is that the ANSI C code in rt_shootray() intersects one solid at
a time and calls rt_boolfinal() on the partial results before continuing.
In scenes with high depth complexity this means it will terminate earlier.
So if you want an apples-to-apples comparison you'll have to force the ANSI
C code to also process all the hits.
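To make the emit-all-then-deduplicate idea concrete, here is a minimal serial C sketch of the alternative to per-insert uniqueness checks: append every entry, then sort and compact once. The dedup() and cmp_ptr() names are mine, not anything in libbu; on the GPU the same idea would map onto a radix sort plus stream compaction pass.

```c
/* "Emit everything, deduplicate later": instead of paying a uniqueness
 * check on every insert (as bu_ptbl_ins_unique() does), append all
 * entries, then sort and remove adjacent duplicates in one pass. */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Order pointers by address (via uintptr_t to keep the comparison
 * well-defined for pointers into different objects). */
static int cmp_ptr(const void *a, const void *b)
{
    uintptr_t pa = (uintptr_t)*(void * const *)a;
    uintptr_t pb = (uintptr_t)*(void * const *)b;
    return (pa > pb) - (pa < pb);
}

/* Sort the pointer list and compact duplicates in place;
 * returns the deduplicated length. */
static size_t dedup(void **list, size_t n)
{
    size_t i, out;

    if (n == 0) return 0;
    qsort(list, n, sizeof(list[0]), cmp_ptr);
    for (out = 1, i = 1; i < n; i++)
        if (list[i] != list[out - 1])
            list[out++] = list[i];
    return out;
}
```

The trade-off is doing one O(n log n) pass at the end instead of an O(n) scan per insert, which parallelizes far better.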
To solve the first problem means we'll need more compute kernels which we
don't have currently (e.g. prefix sum, reduce, radix sort) and to bound the
memory used. To solve the second problem means we need to rethink how to
process the primitives and change the rendering loop.
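For reference, a serial version of the prefix-sum kernel mentioned above is tiny; the point of having it on the GPU is that an exclusive scan over per-ray hit counts gives each ray a write offset into one shared output buffer, which is what bounds the memory and enables compaction without per-insert synchronization. The function name here is illustrative, not an existing kernel.

```c
/* Serial reference for an exclusive prefix sum ("scan").  Given per-ray
 * hit counts, offsets[i] becomes the position where ray i starts writing
 * in a shared output buffer; the return value is the total buffer size.
 * A GPU version would be a work-efficient parallel scan kernel. */
#include <stddef.h>

static size_t exclusive_scan(const size_t *counts, size_t *offsets, size_t n)
{
    size_t i, total = 0;

    for (i = 0; i < n; i++) {
        offsets[i] = total;   /* where element i starts writing */
        total += counts[i];
    }
    return total;             /* total output entries needed */
}
```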
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
BRL-CAD Developer mailing list