CC'ing Xabush to answer the question at the bottom ...

On Sun, Aug 16, 2020 at 5:24 PM Predrag Radović <[email protected]> wrote:
> Hi Linas,
>
> On Fri, Aug 07, 2020 at 01:30:14PM -0500, Linas Vepstas wrote:
> > To get scientific about it, you'd want to create a heat-map -- load up
> > some large datasets, say, some of the genomics datasets, run one of their
> > standard work-loads as a bench-mark, and then see which pages are hit the
> > most often. I mean -- what is the actual working-set size of the genomics
> > processing? No one knows -- we know that during graph traversal, memory is
> > hit "randomly" ... but what is the distribution? It's surely not uniform.
> > Maybe 90% of the work is done on 10% of the pages? (Maybe it's Zipfian?
> > I'd love to see those charts...)
> >
> > The "query-loop" is a subset/sample from one of the agi-bio datasets. It's
> > a good one to experiment with, since it will never change, so you can
> > compare before-and-after results. The agi-bio datasets change all the
> > time, as they add and remove new features, new data sources, etc. They're
> > bigger, but not stable.
>
> The VM page heat-map for the query-loop benchmark is here:
>
> https://github.com/crackleware/opencog-experiments/tree/c0cc508dc5757635ce6c069b20f8ae13ccf8ef8a/mmapped-atomspace

Wow! I wasn't really expecting much to happen, so very definitely wow!

> Everything is getting dirty during loading. There is a "hot" subset of
> pages being referenced during the processing stage. The total size of the
> pages referenced during the processing stage is around ~150MB out of 1.6GB
> (total allocation). The heat-map is very crude, because it groups pages in
> linear order, which is probably a bad grouping. I may experiment with page
> grouping to get more informative graphs (could be useful chunking
> research).

OK, some assorted random, disconnected remarks:

* After the initial data load is completed, run the benchmark for 10
  seconds, sort the pages by hits, and then monitor to see how that
  changes over time...

* There's a way to monitor guile, while running.
  In the main guile shell, say:

      (use-modules (opencog cogserver))
      (start-cogserver)

  which should print:

      Listening on port 17001
      $1 = "Started CogServer"

  Then, from somewhere else, you can run `rlwrap telnet localhost 17001`
  and type `scm`, which prints:

      Trying 127.0.0.1...
      Connected to localhost.
      Escape character is '^]'.
      opencog> scm
      Entering scheme shell; use ^D or a single . on a line by itself to exit.

  and then any scheme is valid. An interesting one is (gc-stats), which
  prints info about guile's garbage collector. It burns through a huge
  amount of RAM during data load (no surprise), but then settles down to a
  working-set size of 10 MBytes (also no surprise); there is very little
  guile usage after the initial load.

* We have newer load procedures that don't use guile for loading, so the
  initial hugeness should come down.

* Using the non-guile load is surely a benefit, as it will mean that
  atomspace RAM and guile RAM allocations are far less likely to interleave
  and fragment one another. Less fragmentation means that the guile GC is
  less likely to invade every page just to scan a few hundred bytes of
  guile heap. (I'm not clear on how it actually works, but I think the
  fragmentation is a valid concern.)

* I have no clue about the 150 MBytes vs. the long tail. It is possible
  that, during the file load, all of the gene data ended up on a 150MB
  subset of RAM, and the protein and reactome data fills the rest. There
  are 30K genes and 150K proteins, so that is a 5x difference, but you are
  seeing a 10x difference between hot and luke-warm ... hmm.

* In principle, the long tail does not surprise me: the access patterns are
  "very random". So, first of all, the genes are likely to get splattered
  over most pages (depending on how the files load), and the various links
  connecting genes together might get splattered onto even more pages. The
  "triangle" benchmark is looking for three genes that interact pair-wise,
  thus forming a triangle.
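To make the workload concrete, here is a hypothetical plain-Python sketch of what the triangle search computes. This is NOT the actual Atomese/pattern-matcher query; the gene names and the edge set are invented purely for illustration:

```python
# Hypothetical sketch of the "triangle" search -- not the real Atomese
# query.  Gene names and interaction edges below are made up.
from itertools import combinations

# Pair-wise gene interactions, stored as an undirected edge set.
edges = {("geneA", "geneB"), ("geneB", "geneC"),
         ("geneA", "geneC"), ("geneC", "geneD")}

def interacts(g1, g2):
    """True if the two genes interact, in either order."""
    return (g1, g2) in edges or (g2, g1) in edges

genes = sorted({g for edge in edges for g in edge})

# A triangle is any three genes that all interact pair-wise.  Note the
# membership tests: every candidate trio chases edges scattered across
# memory, which is why the page accesses look so random.
triangles = [trio for trio in combinations(genes, 3)
             if all(interacts(a, b) for a, b in combinations(trio, 2))]

print(triangles)  # [('geneA', 'geneB', 'geneC')]
```

Counting how many triangles each gene participates in, and then ranking the genes by that count, is what produces the distribution discussed next.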
  These have a very "fat tail": the distribution is square-root-Zipfian.
  That is, if you sort genes by the number of triangles they appear in, and
  then rank them, the distribution is 1/sqrt(rank) -- much, much fatter
  than the classic Zipf tail of 1/rank. I also looked at tetragons, and
  it's fatter still. (I back-burnered that work, but I have detailed graphs
  for this stuff at
  https://github.com/linas/biome-distribution/blob/master/paper/biome-distributions.pdf
  which I need to finish...) All this has implications for which RAM pages
  get hit.

I've never thought about it before, but maybe there are some tricks where
we could somehow force more locality during the data load, e.g. by having
the Atomspace allocate out of a different pool than, say, where-ever other
random allocations are being done. Or some other clever locality stunts,
like asking for related atoms to be placed near each other, the way modern
file systems allocate blocks... Is there a file-system-like allocator for
RAM? One where I can ask for RAM that is "as near as possible to this", and
otherwise far away, leaving gaps for growth (like what ext2 does, as
opposed to what DOS FAT did)?

Well, the thing to do here is to stop using guile for file loading, and see
if that fixes the long tail ... that long tail might just be the guile GC
touching every page, because the guile heap got fragmented everywhere ...
Xabush will explain how to use the "fast file loader" on the new datasets.

> I also did several experimental runs where I used swap-space on NFS and
> NBD (network block device). 2 cores, 1GB RAM, 2GB swap. Performance was
> not very good (~10%). The CPU is too fast for this amount of memory. :-)
>
> Intermittent peaks are probably garbage collections.

Yes. And not using guile to load data may avoid having the GC touch all
RAM.

> All in all, I expect much better performance with very concurrent
> workloads, hundreds of threads.
> When a processing thread hits a page which is not yet in physical RAM, it
> blocks. A request for that page from storage is queued. Other threads
> continue to work, and after some time they will block too, waiting for
> some of their pages to load. The storage layer will collect multiple
> requests and deliver data in batches, introducing latency. That's why,
> when they benchmark SSDs, there are graphs for various queue depths: the
> deeper the queue, the better the throughput.

OK, so these tests are "easily" parallelized, with an appropriate
definition of "easy". Each search is conducted on each gene separately, so
these can be run in parallel. That's the good news.

The bad news is that doing this with guile threads seems to fail; there is
some kind of live-lock problem in guile that I have not been able to
isolate. Pure guile multi-threads well, but not guile+atomspace ... it is
1.5x faster for two threads, and running three threads is like running one.
Running four threads is slower than one. Yech.

The good news is that I believe the atomspace's pure C++ code threads well.
The other good news is that Atomese has several actual Atoms that run
multiple threads -- one called ParallelLink, the other called
JoinThreadLink. I've never tried them with the pattern matcher before. I
will try now ... Anyway, I think it's possible to do pure-C++ threading
without having to write any new C++ code... Separate email when I have
something to say...

> The query-loop benchmark is single-threaded. I would like to run a more
> concurrent workload with bigger datasets. Any suggestions?

Xabush, what are your biggest datasets? How do you load them? How do you
run them?

Meanwhile, I'll explore concurrency with the ParallelLink ... see if I can
make that practical.

-- Linas

-- 
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

-- 
You received this message because you are subscribed to the Google Groups
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/opencog/CAHrUA36%2BJJsAMvpvcYRxtorb6%3Dc%2Bo5%3Dz7%2BwNQWybLtwV4shnnQ%40mail.gmail.com.
