CC'ing Xabush to answer the question at the bottom ...

On Sun, Aug 16, 2020 at 5:24 PM Predrag Radović <[email protected]> wrote:
> Hi Linas,
>
> On Fri, Aug 07, 2020 at 01:30:14PM -0500, Linas Vepstas wrote:
> > To get scientific about it, you'd want to create a heat-map -- load up
> > some large datasets, say, some of the genomics datasets, run one of their
> > standard work-loads as a bench-mark, and then see which pages are hit the
> > most often. I mean -- what is the actual working-set size of the genomics
> > processing? No one knows -- we know that during graph traversal, memory is
> > hit "randomly" ... but what is the distribution? It's surely not uniform.
> > Maybe 90% of the work is done on 10% of the pages? (Maybe it's Zipfian?
> > I'd love to see those charts...)
> >
> > The "query-loop" is a subset/sample from one of the agi-bio datasets. It's
> > a good one to experiment with, since it will never change, so you can
> > compare before-and-after results. The agi-bio datasets change all the
> > time, as they add and remove new features, new data sources, etc. They're
> > bigger, but not stable.
>
> The VM page heat-map for the query-loop benchmark is here:
>
> https://github.com/crackleware/opencog-experiments/tree/c0cc508dc5757635ce6c069b20f8ae13ccf8ef8a/mmapped-atomspace

Wow! I wasn't really expecting much to happen, so very definitely wow!

> Everything is getting dirty during loading. There is a "hot" subset of
> pages being referenced during the processing stage. The total size of the
> pages referenced during the processing stage is around ~150MB out of 1.6GB
> (total allocation). The heat-map is very crude, because it groups pages in
> linear order, which is probably a bad grouping. I may experiment with page
> grouping to get more informative graphs (could be useful chunking
> research).

OK, some assorted random, disconnected remarks:

* After the initial data load is completed, run the benchmark for 10
  seconds, sort the pages by hits, and then monitor to see how that
  changes over time...

* There's a way to monitor guile, while running.
  In the main guile shell, say:

      (use-modules (opencog cogserver))
      (start-cogserver)

  which should print:

      Listening on port 17001
      $1 = "Started CogServer"

  Then, from somewhere else, you can run `rlwrap telnet localhost 17001`
  and type `scm`, which prints:

      Trying 127.0.0.1...
      Connected to localhost.
      Escape character is '^]'.
      opencog> scm
      Entering scheme shell; use ^D or a single . on a line by itself to exit.

  and then any scheme is valid. An interesting one is (gc-stats), which
  prints info about guile's garbage collector. It burns through a huge
  amount of RAM during data load (no surprise), but then settles down to a
  working-set size of 10 MBytes (also no surprise); there is very little
  guile usage after the initial load.

* We have newer load procedures that don't use guile for loading, so the
  initial hugeness should come down.

* Using the non-guile load is surely a benefit, as it will mean that
  atomspace RAM and guile RAM allocations are far less likely to interleave
  and fragment one another. Less fragmentation means that the guile GC is
  less likely to invade every page just to scan a few hundred bytes of
  guile heap. (I'm not clear on how it actually works, but I think the
  fragmentation is a valid concern.)

* I have no clue about the 150 MBytes vs. the long tail. It is possible
  that, during the file load, all of the gene data ended up on a 150MB
  subset of RAM, and the protein and reactome data fills the rest. There
  are 30K genes and 150K proteins, so that is a 5x difference, but you are
  seeing a 10x difference between hot and luke-warm ... hmm.

* In principle, the long tail does not surprise me: the access patterns are
  "very random". So, first of all, the genes are likely to get splattered
  over most pages (depending on how the files load), and the various links
  connecting genes together might get splattered onto even more pages. The
  "triangle" benchmark is looking for three genes that interact pair-wise,
  thus forming a triangle.
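To make the workload concrete, here is a hypothetical plain-Python sketch of what the triangle search computes. This is NOT the actual Atomese/pattern-matcher query; the gene names and the edge set are invented purely for illustration:

```python
# Hypothetical sketch of the "triangle" search -- not the real Atomese
# query.  Gene names and interaction edges below are made up.
from itertools import combinations

# Pair-wise gene interactions, stored as an undirected edge set.
edges = {("geneA", "geneB"), ("geneB", "geneC"),
         ("geneA", "geneC"), ("geneC", "geneD")}

def interacts(g1, g2):
    """True if the two genes interact, in either order."""
    return (g1, g2) in edges or (g2, g1) in edges

genes = sorted({g for edge in edges for g in edge})

# A triangle is any three genes that all interact pair-wise.  Note the
# membership tests: every candidate trio chases edges scattered across
# memory, which is why the page accesses look so random.
triangles = [trio for trio in combinations(genes, 3)
             if all(interacts(a, b) for a, b in combinations(trio, 2))]

print(triangles)  # [('geneA', 'geneB', 'geneC')]
```

Counting how many triangles each gene participates in, and then ranking the genes by that count, is what produces the distribution discussed next.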
  These have a very "fat tail": the distribution is square-root-Zipfian.
  That is, if you sort genes by the number of triangles they appear in, and
  then rank them, the distribution is 1/sqrt(rank) -- much, much fatter
  than the classic Zipf tail of 1/rank. I also looked at tetragons, and
  it's fatter still. (I back-burnered that work, but I have detailed graphs
  for this stuff at
  https://github.com/linas/biome-distribution/blob/master/paper/biome-distributions.pdf
  which I need to finish...) All this has implications for which RAM pages
  get hit.

I've never thought about it before, but maybe there are some tricks where
we could somehow force more locality during the data load, e.g. by having
the Atomspace allocate out of a different pool than, say, where-ever other
random allocations are being done. Or some other clever locality stunts,
like asking for related atoms to be placed near each other, the way modern
file systems allocate blocks... Is there a file-system-like allocator for
RAM? One where I can ask for RAM that is "as near as possible to this", and
otherwise far away, leaving gaps for growth (like what ext2 does, as
opposed to what DOS FAT did)?

Well, the thing to do here is to stop using guile for file loading, and see
if that fixes the long tail ... that long tail might just be the guile GC
touching every page, because the guile heap got fragmented everywhere ...
Xabush will explain how to use the "fast file loader" on the new datasets.

> I also did several experimental runs where I used swap-space on NFS and
> NBD (network block device). 2 cores, 1GB RAM, 2GB swap. Performance was
> not very good (~10%). The CPU is too fast for this amount of memory. :-)
>
> Intermittent peaks are probably garbage collections.

Yes. And not using guile to load data may avoid having the GC touch all
RAM.

> All in all, I expect much better performance with very concurrent
> workloads, hundreds of threads.
> When a processing thread hits a page which is not yet in physical RAM, it
> blocks. A request for that page from storage is queued. Other threads
> continue to work, and after some time they will block too, waiting for
> some of their pages to load. The storage layer will collect multiple
> requests and deliver data in batches, introducing latency. That's why,
> when they benchmark SSDs, there are graphs for various queue depths: the
> deeper the queue, the better the throughput.

OK, so these tests are "easily" parallelized, with an appropriate
definition of "easy". Each search is conducted on each gene separately, so
these can be run in parallel. That's the good news.

The bad news is that doing this with guile threads seems to fail; there is
some kind of live-lock problem in guile that I have not been able to
isolate. Pure guile multi-threads well, but not guile+atomspace ... it is
1.5x faster for two threads, and running three threads is like running one.
Running four threads is slower than one. Yech.

The good news is that I believe the atomspace's pure C++ code threads well.
The other good news is that Atomese has several actual Atoms that run
multiple threads -- one called ParallelLink, the other called
JoinThreadLink. I've never tried them with the pattern matcher before. I
will try now ... Anyway, I think it's possible to do pure-C++ threading
without having to write any new C++ code... Separate email when I have
something to say...

> The query-loop benchmark is single-threaded. I would like to run a more
> concurrent workload with bigger datasets. Any suggestions?

Xabush, what are your biggest datasets? How do you load them? How do you
run them?

Meanwhile, I'll explore concurrency with the ParallelLink ... see if I can
make that practical.

-- Linas

-- 
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

-- 
You received this message because you are subscribed to the Google Groups
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/opencog/CAHrUA36%2BJJsAMvpvcYRxtorb6%3Dc%2Bo5%3Dz7%2BwNQWybLtwV4shnnQ%40mail.gmail.com.
