> Xabush, what are your biggest datasets? How do you load them?
>
> Meanwhile, I'll explore concurrency with the ParallelLink ... see if I can
> make that practical.
I load the following datasets:

[0] - https://mozi.ai/datasets/current_2020-04-30.tar.gz
[1] - https://mozi.ai/datasets/string_dataset_2020-04-01.tar.gz
[2] - https://mozi.ai/datasets/go-plus-2020-07-08.tar.gz
[3] - https://mozi.ai/datasets/go-plus-with-definition-2020-07-08.tar.gz

Loading all the datasets takes 84.8 seconds with 4.5 GB of RAM usage on my
machine. When excluding the string dataset, it takes only 25.8 seconds to
load, with ~2 GB of RAM usage. I use the sexpr code
(https://github.com/opencog/atomspace/tree/master/opencog/persist/sexpr) to
load them, which is much faster than using guile's primitive-load.

> How do you run them?

I didn't understand this question.

--
Regards,
Abdulrahman Semrie

> On Monday, Aug 17, 2020 at 4:43 AM, Linas Vepstas <[email protected]> wrote:
>
> CC'ing Xabush to answer the question at the bottom ..
>
> On Sun, Aug 16, 2020 at 5:24 PM Predrag Radović <[email protected]> wrote:
> > Hi Linas,
> >
> > On Fri, Aug 07, 2020 at 01:30:14PM -0500, Linas Vepstas wrote:
> > >
> > > > To get scientific about it, you'd want to create a heat-map -- load up
> > > > some large datasets, say, some of the genomics datasets, run one of
> > > > their standard work-loads as a bench-mark, and then see which pages
> > > > are hit the most often. I mean -- what is the actual working-set size
> > > > of the genomics processing? No one knows -- we know that during graph
> > > > traversal, memory is hit "randomly" .. but what is the distribution?
> > > > It's surely not uniform. Maybe 90% of the work is done on 10% of the
> > > > pages? (Maybe it's Zipfian? I'd love to see those charts...)
> > >
> > > The "query-loop" is a subset/sample from one of the agi-bio datasets.
> > > It's a good one to experiment with, since it will never change, so you
> > > can compare before-and-after results.
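For reference, invoking the fast s-expression loader from guile looks roughly like this. This is only a sketch: the `(opencog persist-file)` module name and the `load-file` call reflect my reading of the persist/sexpr code linked above, and the dataset path is a hypothetical placeholder for wherever the tarballs get unpacked.

```scheme
;; Sketch: bulk-load a dataset of Atomese s-expressions with the fast
;; file loader, bypassing guile's primitive-load entirely.
(use-modules (opencog) (opencog persist-file))

;; Hypothetical path to one unpacked .scm file from the datasets above.
(load-file "/path/to/current_2020-04-30/dataset.scm")
```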
> > > The agi-bio datasets change all the time, as they add and remove new
> > > features, new data sources, etc. They're bigger, but not stable.
> >
> > A VM page heat-map for the query-loop benchmark is here:
> >
> > https://github.com/crackleware/opencog-experiments/tree/c0cc508dc5757635ce6c069b20f8ae13ccf8ef8a/mmapped-atomspace
>
> Wow! I wasn't really expecting much to happen, so very definitely wow!
>
> > Everything is getting dirty during loading. There is a "hot" subset of
> > pages being referenced during the processing stage. The total size of the
> > pages referenced in the processing stage is around ~150 MB of 1.6 GB
> > (total allocation). The heat-map is very crude because it groups pages in
> > linear order, which is probably bad grouping. I may experiment with page
> > grouping to get more informative graphs (could be useful chunking
> > research).
>
> OK, some assorted random, disconnected remarks:
>
> * After the initial data load is completed, run the benchmark for 10
> seconds, sort the pages by hits, and then monitor to see how that changes
> over time...
>
> * There's a way to monitor guile while it is running. In the main guile
> shell, say `(use-modules (opencog cogserver))` and `(start-cogserver)`,
> which should print `Listening on port 17001` and `$1 = "Started
> CogServer"`. Then, from somewhere else, you can `rlwrap telnet localhost
> 17001` and type `scm`, which prints `Entering scheme shell; use ^D or a
> single . on a line by itself to exit.`, and then any scheme is valid. An
> interesting one is `(gc-stats)`, which prints info about guile's garbage
> collector. It burns through a huge amount of RAM during the data load (no
> surprise), but then settles down to a working-set size of 10 MBytes (also
> no surprise); there is very little guile usage after the initial load.
>
> * We have newer load procedures that don't use guile for loading, so the
> initial hugeness should come down.
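Condensing the monitoring recipe described above into one snippet (everything here is taken directly from the steps in the email; nothing new is assumed beyond the comments):

```scheme
;; In the main guile shell: expose a network REPL for live monitoring.
(use-modules (opencog cogserver))
(start-cogserver)   ; prints: Listening on port 17001 / "Started CogServer"

;; Then, from another terminal:
;;   rlwrap telnet localhost 17001
;; type `scm` at the opencog> prompt to enter the scheme shell.
;; Any scheme is valid there; an interesting probe is:
(gc-stats)          ; statistics from guile's garbage collector
```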
>
> * Using the non-guile load is surely a benefit, as it will mean that
> atomspace RAM and guile RAM allocations are far less likely to interleave
> and fragment one another. Less fragmentation means that the guile GC is
> less likely to invade every page just to scan a few hundred bytes of guile
> heap. (I'm not clear on how it actually works, but I think the
> fragmentation is a valid concern.)
>
> * I have no clue about the 150 MBytes vs. the long tail. It is possible
> that, during the file load, all of the gene data ended up on a 150 MB
> subset of RAM, and the protein and reactome data fills the rest. There are
> 30K genes and 150K proteins, so that is a 5x difference, but you are seeing
> a 10x difference between hot and luke-warm ... hmm.
>
> * In principle, the long tail does not surprise me: the access patterns are
> "very random". So, first of all, the genes are likely to get splattered
> over most pages (depending on how the files load), and the various links
> connecting genes together might get splattered onto even more pages.
>
> The "triangle" benchmark is looking for three genes that interact
> pair-wise, thus forming a triangle. These have a very "fat tail": the
> distribution is square-root-Zipfian. That is, if you sort genes by the
> number of triangles they appear in, and then rank them, the distribution is
> 1/sqrt(rank), so much, much fatter than the classic Zipf tail of 1/rank. I
> also looked at tetragons, and it's fatter still. (I back-burnered that
> work, but have detailed graphs for this stuff at
> https://github.com/linas/biome-distribution/blob/master/paper/biome-distributions.pdf
> which I need to finish...)
>
> ... all this has implications for which RAM pages get hit. ...
>
> ... I've never-ever thought about it before, but maybe there are some
> tricks where we could somehow force more locality during the data load,
> e.g. by having the Atomspace allocate out of a different pool than, say,
> wherever other random allocations are being done.
> Or some other clever locality stunts, like asking for related atoms to be
> placed near each other, e.g. the way modern file systems allocate blocks...
> Is there a file-system-like allocator for RAM, where I can ask for RAM that
> is "as near as possible to this", and otherwise far away, leaving gaps for
> growth (like what ext2 does, as opposed to what DOS FAT did)?
>
> Well, the thing to do here is to stop using guile for file loading, and see
> if that fixes the long tail ... that long tail might just be the guile GC
> touching every page, because the guile heap got fragmented everywhere ...
> Xabush will explain how to use the "fast file loader" on the new datasets.
>
> > I also did several experimental runs where I used swap-space on NFS and
> > NBD (network block device): 2 cores, 1 GB RAM, 2 GB swap. Performance was
> > not very good (~10%). The CPU is too fast for this amount of memory. :-)
> >
> > The intermittent peaks are probably garbage collections.
>
> Yes. And not using guile to load data may avoid having the GC touch all
> RAM.
>
> > All in all, I expect much better performance with very concurrent
> > workloads, hundreds of threads. When a processing thread hits a page
> > which is not yet in physical RAM, it blocks. The request for that page
> > from storage is queued. Other threads continue to work, and after some
> > time they will block too, waiting for some of their pages to load. The
> > storage layer will collect multiple requests and deliver the data in
> > batches, introducing latency. That's why, when they benchmark SSDs, there
> > are graphs for various queue depths. Deeper queue, better throughput.
>
> OK, so these tests are "easily" parallelized, with an appropriate
> definition of "easy". Each search is conducted on each gene separately, so
> these can be run in parallel. That's the good news. The bad news is that
> doing this with guile threads seems to fail; there is some kind of
> live-lock problem in guile that I have not been able to isolate.
> Pure guile multi-threads well, but not guile+atomspace ... it is 1.5x
> faster for two threads, and running three threads is like running one.
> Running four threads is slower than one. Yech.
>
> The good news is that I believe that pure-C++ atomspace code threads well.
> The other good news is that Atomese has several actual Atoms that run
> multiple threads -- one called ParallelLink, the other called
> JoinThreadLink. I've never-ever tried them with the pattern matcher before.
> I will try now ... anyway, I think it's possible to do pure-C++ threading
> without having to write any new C++ code... separate email when I have
> something to say...
>
> > The query-loop benchmark is single-threaded. I would like to run a more
> > concurrent workload with bigger datasets. Any suggestions?
>
> Xabush, what are your biggest datasets? How do you load them? How do you
> run them?
>
> Meanwhile, I'll explore concurrency with the ParallelLink ... see if I can
> make that practical.
>
> -- Linas
>
> --
> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
> --Peter da Silva

--
You received this message because you are subscribed to the Google Groups
"opencog" group. To unsubscribe from this group and stop receiving emails
from it, send an email to [email protected]. To view this
discussion on the web visit
https://groups.google.com/d/msgid/opencog/8550dc8a-7a16-4020-beb6-ae9c5ab521c3%40Canary.
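[Editor's footnote on the ParallelLink idea discussed above: wrapping independent per-gene searches in a ParallelLink might look roughly like the sketch below. Everything here is an assumption on my part, not something from the thread: the `interacts_with` predicate and gene names are illustrative, the queries use plain GetLink/ConceptNode rather than the agi-bio types, and the exact scheme name of the thread-joining Atom (the email calls it JoinThreadLink) should be checked against the atomspace sources.]

```scheme
;; Sketch: run two independent pattern-matcher searches concurrently,
;; one thread per wrapped query, using the thread-spawning Atoms named
;; in the email. Query shapes and names are illustrative assumptions.
(define query-a
  (Get (Variable "$x")
       (Evaluation (Predicate "interacts_with")
                   (List (Concept "gene-A") (Variable "$x")))))

(define query-b
  (Get (Variable "$y")
       (Evaluation (Predicate "interacts_with")
                   (List (Concept "gene-B") (Variable "$y")))))

;; ParallelLink spawns one thread per wrapped Atom; the join-style
;; variant mentioned in the email additionally waits for completion.
(cog-execute! (Parallel query-a query-b))
```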
