On Wed, Jul 19, 2017 at 5:56 AM, Christopher Sean Morrison <brl...@mac.com>
wrote:
>
>> Also your code to iterate the bitarray is still inefficient. You check
>> the leading zeros in a uint, but then you step through its bits one by
>> one to visit all the set bits. This should be faster:
>>
>
> Using bit arrays seems like such an antiquated method. I mean, it’s
> certainly somewhat memory-friendly (in that it’s compact, with minimal L2
> contention), avoids pointers, etc., but what are some alternatives? I
> could certainly imagine a lock-free version that has nice properties and
> is still essentially a bit array for book-keeping, but I don’t have
> broader intuition. I have to imagine they ported it simply because that’s
> what the old code does/did.
>
> Getting it working is certainly first priority, but I’m just curious what
> the thoughts are on them.
>
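For reference, the faster loop alluded to above (the actual snippet got
dropped in the quote) is presumably along these lines; a minimal C sketch
with an illustrative function name, where __builtin_ctz() is the GCC/Clang
intrinsic and OpenCL C has comparable bit-manipulation built-ins:

#include <stdint.h>
#include <stdio.h>

static void visit_set_bits(uint32_t w, uint32_t base)
{
    while (w) {
        /* index of the lowest set bit */
        uint32_t b = (uint32_t)__builtin_ctz(w);
        printf("bit %u is set\n", base + b);
        w &= w - 1;  /* clear the lowest set bit */
    }
}

That way the loop runs once per set bit instead of once per bit position.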
Well, the code has two phases. In one you determine which segments are
in each partition (without duplicates), and in the other you have to iterate
over the segments in a partition. The ANSI C code uses
bu_ptbl_ins_unique() in the first phase, with an O(N) cost per insertion,
and in the second phase iterates a doubly-linked list with pointer chasing.
So a bitarray in the first phase and an ordered array in the second phase
would actually be an improvement (at least in the theoretical number of
ops). Memory usage and memory coherency might still be issues on the GPU,
though. But we need to examine the access patterns and GPU utilization more
closely to determine whether we need a more advanced data structure, or at
least to better understand the traits of the one we would need to use.
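To make that concrete, here is a rough sketch of the two phases, assuming
segment ids in a known range [0, n) and helper names that are illustrative,
not from the tree: phase 1 marks membership in a bit array (O(1) per insert,
no duplicates possible), and phase 2 compacts the set bits into an ascending
array that can then be scanned linearly.

#include <stddef.h>
#include <stdint.h>

#define WORD_BITS 32u

/* Phase 1: O(1), duplicate-free insertion. */
static void bitset_insert(uint32_t *bits, uint32_t id)
{
    bits[id / WORD_BITS] |= 1u << (id % WORD_BITS);
}

/* Phase 2: compact the set bits into an ascending array.
 * Returns the number of ids written to out. */
static size_t bitset_to_array(const uint32_t *bits, size_t nwords,
                              uint32_t *out)
{
    size_t i, k = 0;
    for (i = 0; i < nwords; i++) {
        uint32_t w = bits[i];
        while (w) {
            out[k++] = (uint32_t)(i * WORD_BITS)
                     + (uint32_t)__builtin_ctz(w);
            w &= w - 1;  /* clear the lowest set bit */
        }
    }
    return k;
}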
It is basically a collection of dictionaries of integers in a well-known
range that you initially treat like a set, but that afterwards you need to
iterate. An alternative implementation for the dictionaries would be an
ordered list; a sketch follows.
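A minimal sketch of that ordered-list variant for one dictionary, again with
an illustrative helper name: binary-search for the slot, then shift, so
lookup is O(log N) but insertion is still O(N) worst case because of the
memmove().

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Insert id into the sorted array ids[0..n), skipping duplicates.
 * Returns the new element count. */
static size_t ordered_insert_unique(uint32_t *ids, size_t n, uint32_t id)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {  /* lower-bound binary search */
        size_t mid = lo + (hi - lo) / 2;
        if (ids[mid] < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < n && ids[lo] == id)
        return n;  /* already present */
    memmove(&ids[lo + 1], &ids[lo], (n - lo) * sizeof(uint32_t));
    ids[lo] = id;
    return n + 1;
}

The upside is that phase 2 then gets an iteration-ready, cache-friendly
layout for free.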
Or we could just build the list unsorted, with duplicates, sort the list
(e.g. with a radix sort), and then remove the duplicates, which is what is
typically done in GPU algorithms because it is more memory-coherent. But for
that we need to determine an upper bound on the memory used before the
algorithm begins, and make sure that the bound is reasonable. We haven't
done that yet. In fact, when I glanced at it, it seemed like the worst-case
total memory used was just too large.
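On the host side that is just sort-then-unique; here is a sketch using
qsort() for clarity (on the GPU the sort would be a parallel radix sort
followed by a stream compaction, and the function names are illustrative):

#include <stdint.h>
#include <stdlib.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sort the id list, then squeeze out adjacent duplicates in place.
 * Returns the number of unique ids. */
static size_t sort_and_dedup(uint32_t *ids, size_t n)
{
    size_t i, k = 1;
    if (n == 0)
        return 0;
    qsort(ids, n, sizeof(uint32_t), cmp_u32);
    for (i = 1; i < n; i++)
        if (ids[i] != ids[k - 1])
            ids[k++] = ids[i];
    return k;
}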
We would also need an efficient sorting algorithm, which would take
considerable time to write from scratch (i.e. you could spend an entire GSoC
on it). There is source code for a radix sort algorithm from Thrust in
PyOCL that we could use. But that is Apache License 2.0, which you guys had
concerns about before, even though I think it should be 100% compatible
with LGPL v2.1: it is like BSD with a patent clause, and LGPL v2.1 also
has a patent clause. But IANAL.
There is nothing wrong with KISS on a GPU. It is typically an order of
magnitude (and that's an underestimate) more complex to implement advanced
data structures on a GPU and keep the device 100% utilized at all times.
Regards,
--
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal