Histograms are looking like a good collection of examples for experimenting
with ParallelJS
Here are some histogram options
1) dense histograms where some bins have a zero entry and
2) sparse histograms where only the non-zero bins are stored
Sometimes you know the number of bins beforehand but in general, you need to
look at the data to find the min and max
For example given the data set [2 1 0 0 2 2 1 1 1 1 4] contains 2 zeros, 5
ones, and 3 twos and 1 four, is[2 5 3 0 1]
using the dense method and[(0,2), (1,5), (2,3), (4,1)] using the sparse method.
Since there are no threes, the sparse histogram representation does not contain
a bin for that value
One way to store the sparse histogram is as two separate arrays one of keys and
one of bin counts
[0 1 2 4] and [2 5 3 1] (The soa format )
If the number of bins is relatively small compared to the input size, then
binary search based dense might be the best performing while if the number of
bins is near the input size then the sparse version might be better
If I just add prims, here are two ways to calculate histograms
Dense
Sort the data
Find the number of bins, equal to max value plus one
Allocate the space to hold the histogram, and fill it with 0,1,2,3.. max value
Initially set each histogram bin to upperbound
Upper bound is a useful prim
Given a sorted array, and a set of values use binary search to find
the last position in the sorted array where each data value could be inserted,
all the binary search ops happen in parallel
(I'm not sure if there is a nice way to express this op in ParallelArrays)
n At this point we have a vector of positions the pair hist[i], hist[i+1]
are the indexes in the sorted data that go into bin i
In parallel do all the adds, no need for atomics
Your alternative -
find the max
Allocate the bins
Scatter each element to the bin number, handle collisions by an add,
works but I suspect it will be a lot slower, because each add has to be atomic
and will need synchronization
Another alternative for dense histograms would require knowing a lot of machine
details
If you knew you were on a cpu not a gpu, you could form a partial histogram
per core and then serially combine the results but on a gpu with massive
parallelism, the space for all the partial histograms would be problematic so
you would need to lay this out carefully - like share some of the partial
histograms over blocks of threads
Sparse
Sort the data
Need to find the number of bins, which is the number of unique values
Reduce using sum ( Map if data [i] = data[j] then 0, else 1) - counts the
number of values that are different from the prior
Allocate two arrays keys and counts
Reduce by key - this is a generalization of reduce, Reduce each element with
the same key returning the set of keys and the reduced results which sets up
both arrays, (another tricky prim to express)
From: Herhut, Stephan A [mailto:[email protected]]
Sent: Friday, April 19, 2013 12:45 AM
To: Norm Rubin; [email protected]
Cc: [email protected]
Subject: RE: parallel arrays and sorting
Thanks for the overview of thrust. I had completely missed the issue of
stability before. This needs careful consideration and I have not yet made up
my mind. I understand the desire to have unstable sort for performance reasons,
but it also introduces a form of non-determinism.
Histogram can also be expressed using scatter in out API. Assuming some array A
of color values, you can do
A.map((v) => 1).scatter(A.map((v) => bucket(v)), 0, (a,b) => a+b,
numberOfBuckets)
So, essentially create a vector of 1's and then scatter those to your buckets
based on their bucket value, using addition as the conflict resolution. Sorting
the scatter indices indeed makes a lot of sense and an implementation would be
free to first sort the vector of scatter indices to reduce synchronization
cost. I also thought about an explicit version directly using sort and some
other primitives of our API but I could not come up with a formulation that
would benefit from sorting.
The array of structures (AoS) vs. structures of arrays (SoA) transformation is
problematic in JavaScript. In C like languages one has a lot of structural
information, in particular the fields and types of a structure are statically
known and homogeneous across an array. In JavaScript, on the other hand,
objects can be used like structures but in general there may be a lot more
dynamism: Arrays may contain objects with varying labels, properties may have
different types and, probably worst of all, the programmer can add new
properties to objects at any time. So even though a structure may start out as
a homogeneous array, this can change during its lifetime. Analyzing these data
structures and tracking the 'anomalies' seems not straight forward to me and I
am not convinced that this can be implemented efficiently.
However, there is also good news. The binary data proposal (see
http://wiki.ecmascript.org/doku.php?id=harmony:binary_data) brings C like data
types, including arrays of structures, to JavaScript and we have proposed an
extension to Parallel JavaScript that makes use of binary data as storage
format (see
http://wiki.ecmascript.org/doku.php?id=strawman:data_parallelism#support_for_binary_data).
In such a setting, doing automatic AoS to SoA transformations is possible and
our programming model does not preclude this.
Stephan
From: Norm Rubin [mailto:[email protected]]
Sent: Thursday, April 11, 2013 11:36 AM
To: Herhut, Stephan A; [email protected]<mailto:[email protected]>
Cc:
[email protected]<mailto:[email protected]>
Subject: RE: parallel arrays and sorting
Good set of questions about sort, thanks for looking at this carefully. You
guys have been great about responding to my emails.
One sorting application that I would expect is physics particle simulation,
where the code might sort objects based on distance from the viewer. This
could be used so that far away objects get rendered smaller.
Another might be computing a histogram over a visual image - sort the data
first so nearby values will go into the same bucket, without needing atomic
synchronization.
Sorting with an optional compare seems ok to me- even if implementations can
only optimize for some comparisons
One interesting question you brought up as part of sorting by key is pretty
tricky.
In general parallel data arrays would like to be stored in transposed order
from classic cpu styles.
We usually call this array of structures or structures of arrays
Do we sort an array of objects each with multiple fields stored one object
after another or do we sort multiple arrays, with all the values of field1
followed by all the values of field2 etc
I'd hope we allow implementations to reorder the data within a parallelArray
as they like without needing to expose the layout to developers. Thrust, on
the other hand, pushes the layout to the developer and thus has a significant
set of transpose routines.
Just for a point of reference there are 8 versions of sort in the thrust
library. As you can see, Thrust never aimed to be a minimal set it just gained
operations as applications appeared. AMD has a similar library called bolt
which uses the same interface.
Two different parallel libraries with the same routines does suggest that this
set is enough to do useful work.
Internally thrust notices that the compare is < on primitive types and
uses radix sort (on the gpu).
Thurst includes:
Stable and unstable forms
Ascending only or with a comparison function
Single array or a pair of keys and values
1) Sort(array) - sorts the array into ascending order, not guaranteed to
be stable
2) Sort (array)- with a comparison function
3) Sort by key (keys, values) returns both the reordered keys and the
reordered values, sorted by the array keys. This one is kind of confusing so
here is an example
// an example of key sorting
#include <thrust/sort.h<http://docs.thrust.googlecode.com/hg/sort_8h.html>>
const int N = 6;
int keys[N] = { 1, 4, 2, 8, 5, 7};
char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};
thrust::sort_by_key<http://docs.thrust.googlecode.com/hg/group__sorting.html#ga2bb765aeef19f6a04ca8b8ba11efff24>(keys,
keys + N, values,
thrust::greater<int><http://docs.thrust.googlecode.com/hg/structthrust_1_1greater.html>());
// keys is now { 8, 7, 5, 4, 2, 1}
// values is now {'d', 'f', 'e', 'b', 'c', 'a'}
4) Sort by key with a comparison function
5-8 the stable forms
Based on the thrust primitives, one parallel array sort might be
Sort an array of objects
Besides the array the arguments might be
- a flag indicating if the sort must be stable
- an optional user compare function that defaulted to < on primitive types
From: Herhut, Stephan A [mailto:[email protected]]
Sent: Thursday, April 11, 2013 1:09 PM
To: Norm Rubin; [email protected]<mailto:[email protected]>
Subject: RE: parallel arrays and sorting
Rick is travelling, so let me chime in.
We have discussed this back and forth but have not come to a conclusion.
Generally, we agree that adding a sort primitive makes a lot of sense, in
particular as the Array object in JavaScript already has a sort method. Also,
as you too mentioned, implementing an efficient sort as a library function
without knowing the details of the parallel hardware used is difficult, to say
the least. So sort ticks all the boxes to become a primitive.
The other, and arguably more difficult question, is what a sort method should
look like. If we take JavaScript's existing Array.sort, the sort method would
get an (optional) comparator function. However, using a comparator would
preclude the use of radix sort.
An alternative would be to implement sorting of primitive types only. This
brings back more choice in sort algorithms but limits use. For such a design,
we considered a function as optional argument to sort that, given a value from
the ParallelArray to be sorted, returns a key used for comparison, which again
needs to be a primitive. This would at least enable sorting of objects by a
field and, at some runtime cost, sorting of general data.
The tradeoff between these two approaches, and probably other designs, is hard
to judge without knowing what sort is used for. So we decided to wait for some
good use cases before deciding on a specific design.
Sorry, no answers only further questions.
Stephan
From: [email protected]<mailto:[email protected]>
[mailto:[email protected]] On Behalf Of Norm Rubin
Sent: Monday, April 08, 2013 7:46 AM
To: [email protected]<mailto:[email protected]>
Subject: parallel arrays and sorting
In comparing ParallelArrays (rivertrail) to the cuda thrust library, I noticed
that Sorting using parallelArrays looks like a missing primitive operation.
Sorting is special because the good (high performance) algorithm on a gpu is
radix sort, while the good (high performance) algorithm on the cpu is parallel
merge sort, The other way around radix sort of cpu, or parallel merge sort of
a gpu is very slow and often worse than serial implementations
Sadly it is well past a jit to take the code for one flavor of sort and
transform it into the other, while it would be pretty simple for a run-time
to pick a good sort, if only it knew that a sort was going on. Run times would
not be required to do so here since they can always pick a slow sort.
I know this is a slippery road, since once you add another prim, adding more
prims becomes ever easier but sorting seems pretty important. And array
already has a sort, so adding a version that works on ParallelArrays does not
seem so bad.
What do you guys think?
________________________________
This email message is for the sole use of the intended recipient(s) and may
contain confidential information. Any unauthorized review, use, disclosure or
distribution is prohibited. If you are not the intended recipient, please
contact the sender by reply email and destroy all copies of the original
message.
________________________________
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss