Re: My ACCU 2016 keynote video available online

2016-05-20 Thread Jens Müller via Digitalmars-d-announce

On Friday, 20 May 2016 at 20:04:35 UTC, Andrei Alexandrescu wrote:

On 5/20/16 2:13 PM, Jens Müller wrote:
No, it doesn't work, because you need to break in the last case. Consider the case when the last element of a is equal to an element in b: on the next iteration you overrun a.


I'm not that Bright :o).


Me neither.

So you'd need one more test, but you still save the other test and the test on one of the three branches, so 2 out of 4. -- Andrei


Yes. The current version reduces it from 4 to 2.
I've read that people use SSE to speed this up; maybe I'll consider that later.
If we want to improve on similar vectors (which is a great idea), we just reorder the cases, I guess. Just did it: it improves on my test data, but on mostly dissimilar input I expect it to be worse.
These optimizations always depend on the expected input, which is a pity when optimizing library functions: those must work in all cases and should only be less efficient for very good reason.

I'm looking forward to Fastware.

Jens


Re: My ACCU 2016 keynote video available online

2016-05-20 Thread Jens Müller via Digitalmars-d-announce

On Friday, 20 May 2016 at 14:14:18 UTC, Andrei Alexandrescu wrote:

On 05/19/2016 06:50 PM, Jens Müller wrote:
What if you stomped over an index in a that has an equal index in b (it could be anywhere in b)?


Hmmm, you're right. So that doesn't work, or at least not 
efficiently (the fixup would entail a binary search in b).


How about this idea: arrange things such that a.back.idx <= 
b.back.idx (i.e. swap them if not so). Then stomp b.back.idx 
with size_t.max. Your kernel is:


if (a[i].idx < b[j].idx) {
  if (i == imax) break; // check needed
  i++;
} else if (a[i].idx > b[j].idx) {
  j++; // no check needed
} else {
  // no check needed
  r += a[i++].val * b[j++].val;
}

The fixup needs only to account for the case a.back.idx == b.back.idx. In fact I like this a lot better than the others because it works really fast for identical vectors (taking the norm) or similar vectors (often the case). Works?


No, it doesn't work, because you need to break in the last case. Consider the case when the last element of a is equal to an element in b: on the next iteration you overrun a.
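For the record, the stamp-and-fixup scheme can be made correct if the fixup searches b for the stomped index afterwards (the binary search mentioned above). Here is a C sketch of that variant; the names (Pair, sparse_dot_sentinel) are mine, and the function temporarily mutates its input, as the sentinel trick requires:

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t idx; double val; } Pair;  /* hypothetical */

/* Sentinel sparse dot product: stamp the larger back index over the
 * smaller one, run the kernel with a bounds test only in the equality
 * branch, then restore the input and fix up. */
static double sparse_dot_sentinel(Pair *a, size_t la, Pair *b, size_t lb)
{
    if (la == 0 || lb == 0) return 0.0;
    if (a[la - 1].idx > b[lb - 1].idx) {   /* ensure a.back.idx <= b.back.idx */
        Pair *tp = a; a = b; b = tp;
        size_t tl = la; la = lb; lb = tl;
    }
    Pair saved = a[la - 1];
    a[la - 1].idx = b[lb - 1].idx;         /* stamp the sentinel index */

    double sum = 0.0;
    size_t i = 0, j = 0;
    for (;;) {
        if (a[i].idx < b[j].idx) i++;      /* no check: a now ends on the max index */
        else if (a[i].idx > b[j].idx) j++; /* no check: b ends on the max index */
        else {
            sum += a[i].val * b[j].val;
            if (i == la - 1) break;        /* the only bounds test */
            i++; j++;
        }
    }
    a[la - 1] = saved;                     /* restore the input */
    if (saved.idx == b[lb - 1].idx)        /* backs were equal: last product genuine */
        return sum;
    sum -= saved.val * b[lb - 1].val;      /* drop the spurious sentinel product */

    /* Fixup: the stomped index may match an element anywhere in b. */
    size_t lo = 0, hi = lb;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (b[mid].idx < saved.idx) lo = mid + 1; else hi = mid;
    }
    if (lo < lb && b[lo].idx == saved.idx)
        sum += saved.val * b[lo].val;
    return sum;
}
```

On the counterexample a = [2], b = [2, 3], the loop adds a spurious product at the sentinel, which the fixup removes before the search recovers the real match at index 2.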

But optimizing for similar vectors is interesting.

Jens


Re: My ACCU 2016 keynote video available online

2016-05-19 Thread Jens Müller via Digitalmars-d-announce
On Thursday, 19 May 2016 at 22:04:56 UTC, Andrei Alexandrescu 
wrote:

On 05/19/2016 05:36 PM, Jens Müller wrote:
I removed the code that optimizes for large gaps, because it is only confusing. I may generate some benchmark data with larger gaps later to see whether it is worthwhile for such data.


For skipping large gaps quickly, check galloping search (google 
for it, we also have it in phobos). -- Andrei


Sure, I've already seen this. It's nice. But you have to include it in the sparse dot product (or list intersection) algorithm somehow. Then you require random access, and galloping is only beneficial if the gaps are large. As a library writer this is a difficult position, because it easily turns into over-engineering. Optimally one just exposes the primitives and the user plugs them together, ideally without too many knobs per algorithm.
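For reference, the galloping primitive itself is small. A C sketch (names mine): to find the first element >= a target, double the step from the current cursor, then binary-search the bracketed range, so skipping a gap of length g costs O(log g) comparisons instead of O(g):

```c
#include <assert.h>
#include <stddef.h>

/* Galloping (exponential) search: first position p in the sorted array
 * xs[lo .. n) with xs[p] >= target, or n if there is none. */
static size_t gallop_lower_bound(const size_t *xs, size_t n,
                                 size_t lo, size_t target)
{
    if (lo >= n || xs[lo] >= target) return lo;
    size_t step = 1, prev = lo;
    while (lo + step < n && xs[lo + step] < target) {  /* double the step */
        prev = lo + step;
        step *= 2;
    }
    size_t hi = (lo + step < n) ? lo + step : n;
    lo = prev + 1;
    while (lo < hi) {                  /* binary search in the bracket */
        size_t mid = lo + (hi - lo) / 2;
        if (xs[mid] < target) lo = mid + 1; else hi = mid;
    }
    return lo;
}
```

In a dot product loop one would call this with the current cursor as lo, which is where the random-access requirement comes from.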


Jens


Re: My ACCU 2016 keynote video available online

2016-05-19 Thread Jens Müller via Digitalmars-d-announce
On Thursday, 19 May 2016 at 22:02:53 UTC, Andrei Alexandrescu 
wrote:

On 05/19/2016 05:36 PM, Jens Müller wrote:

I'm not seeing it. Let me explain.
Consider the input a = [1] and b = [2, 3] (I only write the indices). The smallest back index is 1, i.e., a.back is the chosen sentinel.


Nonono, you stamp the largest index over the smaller index. So 
you overwrite a = [3] and you leave b = [2, 3] as it is.


Now you know that you're multiplying two correct sparse vectors 
in which _definitely_ the last elements have equal indexes. So 
the inner loop is:


if (a[i].idx < b[j].idx) {
  i++; // no check needed
} else if (a[i].idx > b[j].idx) {
  j++; // no check needed
} else {
  // check needed
  r += a[i].val * b[j].val;
  if (i == imax || j == jmax) break;
  ++i;
  ++j;
}

At the end you need a fixup to make sure you account for the 
last index that you overwrote (which of course you need to save 
first).


Makes sense?


What if you stomped over an index in a that has an equal index in b (it could be anywhere in b)? After the loop finishes you restore the index in a. But how do you access the value for the stomped-over index if needed?

For instance test it on
a = [2]
b = [2,3]
Note the 2 in b could be anywhere.

I think you can check for
if (a[i].idx == sentinelIdx) break;
instead of
if (i == imax || j == jmax) break;

Jens


Re: My ACCU 2016 keynote video available online

2016-05-19 Thread Jens Müller via Digitalmars-d-announce
On Thursday, 19 May 2016 at 12:04:31 UTC, Andrei Alexandrescu 
wrote:

On 5/19/16 4:12 AM, Jens Müller wrote:

---
if (a.length == 0 || b.length == 0)
    return 0;
const amax = a.length - 1, bmax = b.length - 1;
size_t i, j = 0;
double sum = 0;
for (;;)
{
    if (a[i].index < b[j].index) {
        if (i++ == amax) break;
    }
    else if (a[i].index > b[j].index) {
        bumpJ: if (j++ == bmax) break;
    }
    else
    {
        assert(a[i].index == b[j].index);
        sum += a[i].value * b[j].value;
        if (i++ == amax) break;
        goto bumpJ;
    }
}
return sum;
---

Then if you add the sentinel you only need the bounds tests in 
the third case.


I'm not seeing it. Let me explain.
Consider the input a = [1] and b = [2, 3] (I only write the indices). The smallest back index is 1, i.e., a.back is the chosen sentinel. Now I assume that we set b.back to a.back, restoring it after the loop. In the case a[i].index < b[j].index I have to check whether a[i].index == a.back.index to break, because otherwise i is incremented (since a[i].index = 1 and b[j].index = 2, for i = 0 and j = 0 respectively). In the last case I only check a[i].index == a.back.index, since this implies b[j].index == a.back.index.
So in sum I have two bounds tests. But I think this is not what you were thinking of.

This does not look right.
Here are the plots for the implementations.
https://www.scribd.com/doc/313204510/Running-Time
https://www.scribd.com/doc/313204526/Speedup

dot1 is my baseline, which is indeed worse than your baseline (dot2), but only on gdc. I chose dot2 as the baseline for computing the speedup. dot3 is the sentinel version.


I removed the code that optimizes for large gaps, because it was only confusing. I may generate some benchmark data with larger gaps later to see whether the optimization is worthwhile for such data.

It looks much more regular now (ldc is still strange).

Jens


Re: My ACCU 2016 keynote video available online

2016-05-19 Thread Jens Müller via Digitalmars-d-announce
On Thursday, 19 May 2016 at 12:04:31 UTC, Andrei Alexandrescu 
wrote:

On 5/19/16 4:12 AM, Jens Müller wrote:



What test data did you use?


An instance for benchmarking is generated as follows, given nnz, the total number of non-zero entries in the input vectors a and b combined.


auto lengthA = uniform!"[]"(0, nnz, gen);
auto lengthB = nnz - lengthA;

auto a = iota(0, nnz).randomSample(lengthA, gen)
    .map!(i => Pair(i, 10)).array();
auto b = iota(0, nnz).randomSample(lengthB, gen)
    .map!(i => Pair(i, 10)).array();


So I take a random sample of {0, ..., nnz - 1} for each input. Any better ideas? I've seen that people generate vectors with larger gaps.
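For what it's worth, the same kind of instance can be generated outside D with sequential selection sampling (Knuth's Algorithm S). A C sketch with hypothetical names, drawing a uniformly random sorted subset of {0, ..., n - 1} and filling the pairs with value 10 as above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { size_t idx; double val; } Pair;  /* hypothetical */

/* Sequential selection sampling (Knuth's Algorithm S): fills out[0 .. k)
 * with a uniformly random, sorted k-subset of {0, ..., n - 1}. */
static void sample_sorted(size_t n, size_t k, size_t *out)
{
    size_t filled = 0;
    for (size_t i = 0; i < n && filled < k; i++) {
        /* Select i with probability (k - filled) / (n - i). */
        double u = (double)rand() / ((double)RAND_MAX + 1.0);
        if (u * (double)(n - i) < (double)(k - filled))
            out[filled++] = i;
    }
}

/* Build a sparse vector the way the benchmark does: sampled sorted
 * indices, all values set to 10. */
static void make_vector(size_t nnz, size_t len, Pair *v, size_t *scratch)
{
    sample_sorted(nnz, len, scratch);
    for (size_t i = 0; i < len; i++) {
        v[i].idx = scratch[i];
        v[i].val = 10.0;
    }
}
```

Since the indices come out already sorted, no extra sort pass is needed before running the dot product.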


10%-20% win on dot product is significant because for many 
algorithms dot product is a kernel and dominates everything 
else. For those any win goes straight to the bottom line.


Sure. Still, I wasn't sure whether I had gotten the idea from your talk right. So maybe there is more to it.


The baseline (dot1 in the graphs) is the straightforward version


---
size_t i, j = 0;
double sum = 0;
while (i < a.length && j < b.length)
{
    if (a[i].index < b[j].index) i++;
    else if (a[i].index > b[j].index) j++;
    else
    {
        assert(a[i].index == b[j].index);
        sum += a[i].value * b[j].value;
        i++;
        j++;
    }
}
return sum;
---


That does redundant checking. There's a better baseline:

---
if (a.length == 0 || b.length == 0)
    return 0;
const amax = a.length - 1, bmax = b.length - 1;
size_t i, j = 0;
double sum = 0;
for (;;)
{
    if (a[i].index < b[j].index) {
        if (i++ == amax) break;
    }
    else if (a[i].index > b[j].index) {
        bumpJ: if (j++ == bmax) break;
    }
    else
    {
        assert(a[i].index == b[j].index);
        sum += a[i].value * b[j].value;
        if (i++ == amax) break;
        goto bumpJ;
    }
}
return sum;
---


I'll check that.

Then if you add the sentinel you only need the bounds tests in 
the third case.


I'll post the sentinel code later. Probably there is something to improve there as well.



BTW, the effects vary greatly across compilers. For example, with dmd the optimized version is slowest and the baseline is best. Weird. With gdc the optimized version is best, and gdc's code is always faster than dmd's. With ldc it's really strange: slower than dmd. I assume I'm doing something wrong here.

Used compiler flags
dmd v2.071.0
-wi -dw -g -O -inline -release -noboundscheck
gdc (crosstool-NG 203be35 - 20160205-2.066.1-e95a735b97) 5.2.0
-Wall  -g -O3 -fomit-frame-pointer -finline-functions -frelease
-fno-bounds-check -ffast-math
ldc (0.15.2-beta2) based on DMD v2.066.1 and LLVM 3.6.1
-wi -dw -g -O3 -enable-inlining -release -boundscheck=off

Am I missing some flags?


These look reasonable.


But ldc looks so bad. Any comments from ldc users or developers? I see this in many other measurements as well. I would love to have another compiler that produces code as efficient as gdc's.

For example, what's the equivalent of gdc's -ffast-math in ldc?


I uploaded my plots.
- running time 
https://www.scribd.com/doc/312951947/Running-Time

- speed up https://www.scribd.com/doc/312951964/Speedup


What is dot2? Could you please redo the experiments with the 
modified code as well?


dot2 is an optimization for jumping over gaps more quickly, replacing the first two if statements with while loops. My benchmark data has no large gaps, but interestingly it still makes things worse.
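As described, dot2 presumably looks something like this C sketch (names mine), where the first two branches become skip loops:

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t idx; double val; } Pair;  /* hypothetical */

/* Gap-skipping variant (dot2-style): the first two branches of the
 * straightforward kernel become while loops that run ahead over gaps. */
static double sparse_dot_skip(const Pair *a, size_t la,
                              const Pair *b, size_t lb)
{
    double sum = 0.0;
    size_t i = 0, j = 0;
    while (i < la && j < lb) {
        while (i < la && a[i].idx < b[j].idx) i++;   /* skip gap in a */
        if (i == la) break;
        while (j < lb && b[j].idx < a[i].idx) j++;   /* skip gap in b */
        if (j == lb) break;
        if (a[i].idx == b[j].idx) {                  /* genuine match */
            sum += a[i].val * b[j].val;
            i++; j++;
        }
    }
    return sum;
}
```

On gapless data each skip loop runs at most one step per outer iteration, so it just adds an extra bounds comparison per element, which would explain the slowdown.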


Jens


Re: My ACCU 2016 keynote video available online

2016-05-19 Thread Jens Müller via Digitalmars-d-announce

On Monday, 16 May 2016 at 13:46:11 UTC, Andrei Alexandrescu wrote:
Uses D for examples, showcases Design by Introspection, and 
rediscovers a fast partition routine. It was quite well 
received. https://www.youtube.com/watch?v=AxnotgLql0k


Andrei


Nice presentation.

The code applying the sentinel optimization assumes mutability of the input, which needs to be checked for. That's fine for partition, because it is assumed to work in place, but for other algorithms it's not so obvious. It's sad that the optimization only works for non-const input: it conflicts with the advice to make input const if the function doesn't change it, which makes the optimization less likely to be applicable. One might relax the const requirement to mean "the input is identical at return of the function to what it was at its beginning", but that's a different story, I guess. Coming up with another implementation might also work, using chain or so. But typically the sentinel optimization assumes mutability.

I didn't get the idea behind sentinels for the sparse dot product. I picked the smaller of the last elements (so you need bidirectional ranges) and fix up as needed. With gdc I get a speedup (baseline over new implementation) of 1.2 in the best case and >1.0 in the worst case; on average it's about 1.1, I would say. I expected more. How would you approach sentinels for the sparse dot product? Can you elaborate the idea from the video? I didn't get it.

The baseline (dot1 in the graphs) is the straightforward version

---
size_t i, j = 0;
double sum = 0;
while (i < a.length && j < b.length)
{
    if (a[i].index < b[j].index) i++;
    else if (a[i].index > b[j].index) j++;
    else
    {
        assert(a[i].index == b[j].index);
        sum += a[i].value * b[j].value;
        i++;
        j++;
    }
}
return sum;
---

BTW, the effects vary greatly across compilers. For example, with dmd the optimized version is slowest and the baseline is best. Weird. With gdc the optimized version is best, and gdc's code is always faster than dmd's. With ldc it's really strange: slower than dmd. I assume I'm doing something wrong here.

Used compiler flags
dmd v2.071.0
-wi -dw -g -O -inline -release -noboundscheck
gdc (crosstool-NG 203be35 - 20160205-2.066.1-e95a735b97) 5.2.0
-Wall  -g -O3 -fomit-frame-pointer -finline-functions -frelease 
-fno-bounds-check -ffast-math

ldc (0.15.2-beta2) based on DMD v2.066.1 and LLVM 3.6.1
-wi -dw -g -O3 -enable-inlining -release -boundscheck=off

Am I missing some flags?

I uploaded my plots.
- running time https://www.scribd.com/doc/312951947/Running-Time
- speed up https://www.scribd.com/doc/312951964/Speedup

*Disclaimer*
I hope most of this makes sense but take it with a grain of salt.

Jens

PS
It seems the mailing list interface does not work; I cannot send replies via mail anymore. I wrote to Brad Roberts but have had no answer yet.