Hi Ben,

** Short Answer: you might want to think about porting PLN to it.  But you
may want to rethink how PLN works, first.  And we actually have another
(very?) interesting option.

** Long Answer (with a quasi-happy ending):
What they call a "graph" is not the same thing as what we call a "graph".
What they call a "graph" is actually "memory-access, multiply-add, another
memory access, repeat" (see pages 9 and 12 of the PDF you attached).  By
contrast, what we call a graph is "memory-access, integer compare, another
memory access, repeat".  Which maybe sounds similar, but it's really very
different.
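To make the contrast concrete, here is a purely illustrative sketch (not the actual atomspace or Poplar code, which are C++) of the two inner loops:

```python
def ipu_style_step(weights, activations):
    """The IPU's "graph": memory-access, multiply-add, repeat."""
    total = 0.0
    for w, a in zip(weights, activations):  # sequential memory access
        total += w * a                      # multiply-add, no branching
    return total

def atomspace_style_step(vertex, target_label, graph):
    """Our graph: memory-access, integer compare, memory-access, repeat.
    graph maps a vertex to a list of (label, destination) edges."""
    return [dest for (label, dest) in graph[vertex]  # pointer-chase
            if label == target_label]                # integer compare, no floats
```

The first loop is bandwidth-bound arithmetic; the second is branchy pointer-chasing with zero floating-point work. Same word "graph", very different hardware profile.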

** The IPU in review:
So, consider neural-net gradient descent, implemented via back-propagation.
Those algos march over memory in a quasi-regular way (they are vectors,
after all), doing lots of multiplies and adds, and maybe a few
if-statements (aka "integer compares") and a small handful of other integer
operations (jump, increment some loop counter, test to see if done).  Very,
very few subroutine calls.  The PDF says "300MB In-Processor-Memory(TM)",
so it sounds like if your NN weight matrix fits in 300MB, you can do
gradient descent at... RAM speed. Whatever that is. RAM speed is always a
bottleneck, and it depends on how they designed it. 300MB is in the
ballpark of an L2 cache, and L2 cache tends to be SRAM, not DRAM, running
at about 1/10th the speed of the CPU core. So much, much faster than DRAM,
but slower than the CPU. Anyway, really freaking fast if you can make it
fit.   If it doesn't fit, then you have a very impressive switching fabric,
which is really nice, because sooner or later you do have to move training
data in and out. When things don't fit, the switch fabric gets the missing
parts into place pretty damned fast.  The chart on page 15 looks honest and
100% believable to me: dark blue is the multiply-add loop, and
yellow/light-blue is moving in the next block of training data (moving
around the weight vectors?  Whatever.)
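For flavor, here's the generic shape of the loop this hardware is built for. Nothing below is Graphcore's Poplar API; it's a toy single-neuron SGD step, just to show that the working set is "the weight vector plus one block of training data", with almost no branching:

```python
def sgd_step(W, x, y, lr=0.01):
    """One gradient-descent step on a linear model y_hat = W . x.
    Both passes are multiply-add loops marching over memory."""
    # forward pass: multiply-add loop
    y_hat = sum(w * xi for w, xi in zip(W, x))
    err = y_hat - y
    # backward pass: another multiply-add loop, W -= lr * err * x
    return [w - lr * err * xi for w, xi in zip(W, x)]
```

If `W` fits in the 300MB, both loops run out of on-chip memory; if not, the switch fabric has to stream the missing pieces in, which is what the yellow/light-blue bands on page 15 appear to show.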

** The MMU and why the IPU doesn't have one:
It's called an IPU and not a CPU because there's no MMU. Which is good and
bad. It's good for hardware, because MMUs are big, complex, bulky
bottlenecks between the CPU and slow, slow memory (because DRAM is
slow. Fact of life. Deal with it). It's bad because MMUs make programming
really, really easy: programmers do not have to think about the physical
location of their data. Any idiot with --visual basic-- python skills can
write a program. Like teenagers. Smart pre-teens. The good news is that
several decades of industry experience with GPUs means that the industry
possesses really clever compilers and really clever libraries that can hide
most of the pain of not having an MMU. It's almost fully automated. But it
still requires pros who know what they're doing.  TL;DR: the pros only have
to port tensorflow to it once. The hardware savings on MMU complexity are
huge! It eliminates lots of awkward bus and data-transfer silicon design
with lots of ugly timing chains (google up Spectre and Meltdown for a
flavor of modern MMU design mad skillz).

The IPU is a super-duper neural net machine, no doubt. I'm quite sure the
M87-Sag A* Event-Horizon astronomers will love it too. And lots of
supercomputer, weather-simulation, nuclear-bomb buffs, too.

** The pattern matcher in review:
Compare this to the pattern matcher, which is pure-integer, with no
floating-point in it. The pattern matcher is conceptually very simple
(practical considerations make it complex). It compares two graphs,
side by side. At every vertex, it needs to compare edges -- all possible
permutations of edges ("choices", actually, not permutations; there are
usually fewer choices, but still more than one).  Since any one of those
choices might be the right one, one must save (push) the current state onto
the stack, and then continue exploration at the next vertex.  Repeat until
there's a match or no more choices, then pop the stack once and explore the
next choice.  So almost all CPU cycles are spent pushing and popping, and
each push-pop is at least one subroutine call, if not 5 or 8.  Each
push-pop is a lot of memory access, to save the current state.
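That control flow can be sketched as follows. This is a deliberately minimal, hypothetical toy (the real pattern matcher is far more complex, and handles variables vs. constants, types, etc.); it only shows where the cycles go -- integer compares and saving/restoring search state:

```python
def match(pattern, graph, binding=None):
    """Backtracking subgraph match.  pattern and graph are lists of
    (vertex, edge_label, vertex) triples; pattern vertices are variables.
    Returns a variable binding on success, None on failure."""
    binding = binding or {}
    if not pattern:                    # nothing left to match: success
        return binding
    (pv, label, pw), rest = pattern[0], pattern[1:]
    for (gv, glabel, gw) in graph:     # every choice at this step
        if glabel != label:
            continue                   # integer compare: reject
        if binding.get(pv, gv) != gv or binding.get(pw, gw) != gw:
            continue                   # inconsistent with saved state
        saved = dict(binding)          # "push": copy current state
        saved.update({pv: gv, pw: gw})
        result = match(rest, graph, saved)  # explore next vertex
        if result is not None:
            return result
        # failure falls through: an implicit "pop" to the next choice
    return None
```

Note the copy-the-dict "push" on every candidate edge: that is exactly the save-the-current-state memory traffic described above, and it dominates the cycle count.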

Can this run fast on the IPU? Sure, maybe even really fast, if it fits into
300 MB.  And doing 1216 of these in parallel is kick-butt. What happens if
our graphs don't fit in 300MB? I have no clue, because I don't know where
the rest of the graph is located, I don't know how to find it, and I don't
know how to stuff it into 300MB. Maybe this is solvable. Maybe. If I can't
hallucinate a good solution in 5 minutes, that means it's hard.  It
requires some hard thought and clever invention.

Sadly, there's no floating-point in the pattern matcher, so all those
whizzy floating-point units in the IPU are sitting idle. Which is a shame.
Really a shame.

What about PLN? Well, today's PLN, built on the pattern matcher, will run
thousands of CPU cycles and then do a small handful of floating-point ops.
The good news: almost all PLN pattern matches are really quite small
(that's good: it should be fast) and of course all the current PLN demos
fit into 300MB.  I currently have no clue how to apply PLN to large
datasets.

Are there other ways to think of PLN that do less searching and more
multiplying? Can you swap inner and outer loops somehow? Nil might like to
ponder this.
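One speculative way to picture "swapping the loops" (a hypothetical sketch, not a claim about how PLN actually works): instead of searching anew for each rule application and then doing a couple of float ops, precompute an index of which pairs interact, once, and then run a tight float loop over that index:

```python
def search_then_multiply(rules, facts, strength):
    """Today's shape: outer loop searches; inner does a few float ops."""
    out = {}
    for (a, b) in rules:
        if a in facts and b in facts:               # the "search"
            out[(a, b)] = strength[a] * strength[b] # the float op
    return out

def multiply_over_index(index, strength):
    """Swapped shape: the search is done once up front and stored as an
    index of interacting pairs; the hot loop is pure multiply-adds."""
    return {(a, b): strength[a] * strength[b] for (a, b) in index}
```

If something like the second shape were possible for PLN, the IPU's float units would no longer be idle. Whether PLN's semantics survive that transformation is exactly the question for Nil.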

** An alternative that I personally find exciting
The language-learning pipeline I built doesn't use the pattern matcher. At
all.  What it does do is ... large quantities of multiply-adds. Immense
quantities of them. Lots of vector and matrix multiplies. Lots of
summations (multiply-adds in a loop).  So it's more-or-less a totally
boring vector-matrix library, with one huge difference: the matrices are
extremely sparse -- one-in-a-billion non-zero entries. And that means it is
impossible to store those zeroes in any conventional form. One MUST store
only the non-zero values.  In graphs!  Which is why I wrote it for the
atomspace, instead of SciPy or GnuR or PyWhatever.  My clustering code, the
code that tries to do word-sense disambiguation, is totally bottlenecked
doing multiply-adds, plus a small handful of pointer-chases to find the
next numbers to multiply-add.
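The bottleneck loop looks roughly like this (a hedged sketch: here the non-zeros live in a dict-of-dicts; in the real pipeline they live as Values on Atoms):

```python
def sparse_matvec(rows, vec):
    """Sparse matrix-vector multiply storing only non-zero entries.
    rows: {row_id: {col_id: value}}, vec: {col_id: value}.
    Returns {row_id: dot product of that row with vec}."""
    out = {}
    for r, cols in rows.items():           # pointer-chase to the row
        total = 0.0
        for c, v in cols.items():          # pointer-chase to each non-zero
            total += v * vec.get(c, 0.0)   # the multiply-add
        out[r] = total
    return out
```

At one-in-a-billion density, the multiply-adds dominate and the pointer-chases are the "small handful" -- which is exactly the ratio the IPU's float units want.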

I'm pretty sure it could run pretty well on the IPU. Maybe even really
well. I think it would be "easy" to slice it to make each slice fit in
300MB. And have it work without thrashing.  I'm optimistic.

** To conclude:
I would recommend porting this to the IPU immediately, except for one minor
gotcha:  so far, I've personally failed to convince you personally that my
algos really can do word-sense disambiguation, and that the resulting
grammars really are correct, and I'm distracted by other concerns and
making zero progress on such a proof.  Meanwhile, Anton's team is still
quite lost, dazed and confused about the issues. I think they've made some
good forward progress, but so far they've only worked on the easy stuff.
The hard stuff is still ahead of them, and I don't think they understand
this yet, and I don't think they are prepared to tackle it. They're at the
foothills of a mountain. They struggled to get up the foothills, they
haven't gotten past the treeline, and don't yet realize what's up there.
So, without any actual proof that my claims are true, I'm 95% sure you're
not willing to commit to this (i.e. porting ultra-super-sparse matrix-math
to the IPUs).

(To be clear, that would be my proposal: port ultra-super-sparse
matrix-math to the IPUs.  I find it convenient to store those numbers as
Values on Atoms, but that is a convenience, not a necessity. So I am NOT
proposing a port of the atomspace to the IPUs. Just to be clear about that.
(Although I would want a bulk-copy of the floats to/from the atomspace,
because the atomspace is still the correct data-structure for a long-term
knowledge store.))

So that is my happy ending.  We can leverage IPUs, just not in the
traditional opencog architecture.

-- Linas




On Sun, Apr 7, 2019 at 9:36 AM Ben Goertzel <[email protected]> wrote:

> I'm reading about Graphcore and their IPU graph processors...
>
> I wonder if it would be a good idea to make Atomspace and the Pattern
> Matcher run on their IPUs using their Poplar libraries?   A short rundown
> on the IPU and Poplar are attached to this email...
>
> Obviously there are a lot of other priorities.   What I'm asking now is,
> for those who know more than me about hardware, do you think that we would
> get dramatic speedups by porting Atomspace/PM to this hardware
> [understanding it would be a lot of work to do so...]
>
> To tell for sure would require more digging obviously, but I'm trying to
> form a clear opinion of whether that digging is even worthwhile...
>
> thanks
> Ben
>  NeurIPS 2018 Poplar Graph Toolchain.pdf
> <https://drive.google.com/file/d/1p_4OXDquYSl6SuDCajk10W3N5nf7bInp/view?usp=drive_web>
>
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "Listen: This world is the lunatic's sphere,  /  Don't always agree it's
> real.  /  Even with my feet upon it / And the postman knowing my door / My
> address is somewhere else." -- Hafiz
>


-- 
cassette tapes - analog TV - film cameras - you
