> Xabush, what are your biggest datasets? How do you load them?
>
> Meanwhile, I'll explore concurrency with the ParallelLink ... see if I can
> make that practical.
I load the following datasets:

[0] - https://mozi.ai/datasets/current_2020-04-30.tar.gz
[1] - https://mozi.ai/datasets/string_dataset_2020-04-01.tar.gz
[2] - https://mozi.ai/datasets/go-plus-2020-07-08.tar.gz
[3] - https://mozi.ai/datasets/go-plus-with-definition-2020-07-08.tar.gz

Loading all the datasets takes 84.8 seconds with 4.5 GB of RAM usage on my
machine. When excluding the string dataset, it takes only 25.8 seconds to
load, with ~2 GB of RAM usage. I use the sexpr code
(https://github.com/opencog/atomspace/tree/master/opencog/persist/sexpr) to
load them, which is much faster than using guile's primitive-load.

> How do you run them?

I didn't understand this question.

--
Regards,
Abdulrahman Semrie

> On Monday, Aug 17, 2020 at 4:43 AM, Linas Vepstas <[email protected]> wrote:
>
> CC'ing Xabush to answer the question at the bottom ..
>
> On Sun, Aug 16, 2020 at 5:24 PM Predrag Radović <[email protected]> wrote:
> > Hi Linas,
> >
> > On Fri, Aug 07, 2020 at 01:30:14PM -0500, Linas Vepstas wrote:
> > >
> > > > To get scientific about it, you'd want to create a heat-map -- load up
> > > > some large datasets, say, some of the genomics datasets, run one of
> > > > their standard work-loads as a bench-mark, and then see which pages
> > > > are hit the most often. I mean -- what is the actual working-set size
> > > > of the genomics processing? No one knows -- we know that during graph
> > > > traversal, memory is hit "randomly" .. but what is the distribution?
> > > > It's surely not uniform. Maybe 90% of the work is done on 10% of the
> > > > pages? (Maybe it's Zipfian? I'd love to see those charts...)
> > >
> > > The "query-loop" is a subset/sample from one of the agi-bio datasets.
> > > It's a good one to experiment with, since it will never change, so you
> > > can compare before-and-after results.
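For reference, invoking the fast s-expression loader from guile looks roughly like this. This is only a sketch: the `(opencog persist-file)` module name and the `load-file` call reflect my reading of the persist/sexpr code linked above, and the dataset path is a hypothetical placeholder for wherever the tarballs get unpacked.

```scheme
;; Sketch: bulk-load a dataset of Atomese s-expressions with the fast
;; file loader, bypassing guile's primitive-load entirely.
(use-modules (opencog) (opencog persist-file))

;; Hypothetical path to one unpacked .scm file from the datasets above.
(load-file "/path/to/current_2020-04-30/dataset.scm")
```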
> > > The agi-bio datasets change all the time, as they add and remove new
> > > features, new data sources, etc. They're bigger, but not stable.
> >
> > A VM page heat-map for the query-loop benchmark is here:
> >
> > https://github.com/crackleware/opencog-experiments/tree/c0cc508dc5757635ce6c069b20f8ae13ccf8ef8a/mmapped-atomspace
>
> Wow! I wasn't really expecting much to happen, so very definitely wow!
>
> > Everything is getting dirty during loading. There is a "hot" subset of
> > pages being referenced during the processing stage. The total size of the
> > pages referenced in the processing stage is around ~150 MB of 1.6 GB
> > (total allocation). The heat-map is very crude because it groups pages in
> > linear order, which is probably bad grouping. I may experiment with page
> > grouping to get more informative graphs (could be useful chunking
> > research).
>
> OK, some assorted random, disconnected remarks:
>
> * After the initial data load is completed, run the benchmark for 10
> seconds, sort the pages by hits, and then monitor to see how that changes
> over time...
>
> * There's a way to monitor guile while it is running. In the main guile
> shell, say `(use-modules (opencog cogserver))` and `(start-cogserver)`,
> which should print `Listening on port 17001` and `$1 = "Started
> CogServer"`. Then, from somewhere else, you can `rlwrap telnet localhost
> 17001` and type `scm`, which prints `Entering scheme shell; use ^D or a
> single . on a line by itself to exit.`, and then any scheme is valid. An
> interesting one is `(gc-stats)`, which prints info about guile's garbage
> collector. It burns through a huge amount of RAM during the data load (no
> surprise), but then settles down to a working-set size of 10 MBytes (also
> no surprise); there is very little guile usage after the initial load.
>
> * We have newer load procedures that don't use guile for loading, so the
> initial hugeness should come down.
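Condensing the monitoring recipe described above into one snippet (everything here is taken directly from the steps in the email; nothing new is assumed beyond the comments):

```scheme
;; In the main guile shell: expose a network REPL for live monitoring.
(use-modules (opencog cogserver))
(start-cogserver)   ; prints: Listening on port 17001 / "Started CogServer"

;; Then, from another terminal:
;;   rlwrap telnet localhost 17001
;; type `scm` at the opencog> prompt to enter the scheme shell.
;; Any scheme is valid there; an interesting probe is:
(gc-stats)          ; statistics from guile's garbage collector
```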
>
> * Using the non-guile load is surely a benefit, as it will mean that
> atomspace RAM and guile RAM allocations are far less likely to interleave
> and fragment one another. Less fragmentation means that the guile GC is
> less likely to invade every page just to scan a few hundred bytes of guile
> heap. (I'm not clear on how it actually works, but I think the
> fragmentation is a valid concern.)
>
> * I have no clue about the 150 MBytes vs. the long tail. It is possible
> that, during the file load, all of the gene data ended up on a 150 MB
> subset of RAM, and the protein and reactome data fills the rest. There are
> 30K genes and 150K proteins, so that is a 5x difference, but you are seeing
> a 10x difference between hot and luke-warm ... hmm.
>
> * In principle, the long tail does not surprise me: the access patterns are
> "very random". So, first of all, the genes are likely to get splattered
> over most pages (depending on how the files load), and the various links
> connecting genes together might get splattered onto even more pages.
>
> The "triangle" benchmark is looking for three genes that interact
> pair-wise, thus forming a triangle. These have a very "fat tail": the
> distribution is square-root-Zipfian. That is, if you sort genes by the
> number of triangles they appear in, and then rank them, the distribution is
> 1/sqrt(rank), so much, much fatter than the classic Zipf tail of 1/rank. I
> also looked at tetragons, and it's fatter still. (I back-burnered that
> work, but have detailed graphs for this stuff at
> https://github.com/linas/biome-distribution/blob/master/paper/biome-distributions.pdf
> which I need to finish...)
>
> ... all this has implications for which RAM pages get hit. ...
>
> ... I've never-ever thought about it before, but maybe there are some
> tricks where we could somehow force more locality during the data load,
> e.g. by having the Atomspace allocate out of a different pool than, say,
> wherever other random allocations are being done.
> Or some other clever locality stunts, like asking for related atoms to be
> placed near each other, e.g. the way modern file systems allocate blocks...
> Is there a file-system-like allocator for RAM, where I can ask for RAM that
> is "as near as possible to this", and otherwise far away, leaving gaps for
> growth (like what ext2 does, as opposed to what DOS FAT did)?
>
> Well, the thing to do here is to stop using guile for file loading, and see
> if that fixes the long tail ... that long tail might just be the guile GC
> touching every page, because the guile heap got fragmented everywhere ...
> Xabush will explain how to use the "fast file loader" on the new datasets.
>
> > I also did several experimental runs where I used swap-space on NFS and
> > NBD (network block device): 2 cores, 1 GB RAM, 2 GB swap. Performance was
> > not very good (~10%). The CPU is too fast for this amount of memory. :-)
> >
> > The intermittent peaks are probably garbage collections.
>
> Yes. And not using guile to load data may avoid having the GC touch all
> RAM.
>
> > All in all, I expect much better performance with very concurrent
> > workloads, hundreds of threads. When a processing thread hits a page
> > which is not yet in physical RAM, it blocks. The request for that page
> > from storage is queued. Other threads continue to work, and after some
> > time they will block too, waiting for some of their pages to load. The
> > storage layer will collect multiple requests and deliver the data in
> > batches, introducing latency. That's why, when they benchmark SSDs, there
> > are graphs for various queue depths. Deeper queue, better throughput.
>
> OK, so these tests are "easily" parallelized, with an appropriate
> definition of "easy". Each search is conducted on each gene separately, so
> these can be run in parallel. That's the good news. The bad news is that
> doing this with guile threads seems to fail; there is some kind of
> live-lock problem in guile that I have not been able to isolate.
> Pure guile multi-threads well, but not guile+atomspace ... it is 1.5x
> faster for two threads, and running three threads is like running one.
> Running four threads is slower than one. Yech.
>
> The good news is that I believe that pure-C++ atomspace code threads well.
> The other good news is that Atomese has several actual Atoms that run
> multiple threads -- one called ParallelLink, the other called
> JoinThreadLink. I've never-ever tried them with the pattern matcher before.
> I will try now ... anyway, I think it's possible to do pure-C++ threading
> without having to write any new C++ code... separate email when I have
> something to say...
>
> > The query-loop benchmark is single-threaded. I would like to run a more
> > concurrent workload with bigger datasets. Any suggestions?
>
> Xabush, what are your biggest datasets? How do you load them? How do you
> run them?
>
> Meanwhile, I'll explore concurrency with the ParallelLink ... see if I can
> make that practical.
>
> -- Linas
>
> --
> Verbogeny is one of the pleasurettes of a creatific thinkerizer.
> --Peter da Silva

--
You received this message because you are subscribed to the Google Groups
"opencog" group. To unsubscribe from this group and stop receiving emails
from it, send an email to [email protected]. To view this
discussion on the web visit
https://groups.google.com/d/msgid/opencog/8550dc8a-7a16-4020-beb6-ae9c5ab521c3%40Canary.
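[Editor's footnote on the ParallelLink idea discussed above: wrapping independent per-gene searches in a ParallelLink might look roughly like the sketch below. Everything here is an assumption on my part, not something from the thread: the `interacts_with` predicate and gene names are illustrative, the queries use plain GetLink/ConceptNode rather than the agi-bio types, and the exact scheme name of the thread-joining Atom (the email calls it JoinThreadLink) should be checked against the atomspace sources.]

```scheme
;; Sketch: run two independent pattern-matcher searches concurrently,
;; one thread per wrapped query, using the thread-spawning Atoms named
;; in the email. Query shapes and names are illustrative assumptions.
(define query-a
  (Get (Variable "$x")
       (Evaluation (Predicate "interacts_with")
                   (List (Concept "gene-A") (Variable "$x")))))

(define query-b
  (Get (Variable "$y")
       (Evaluation (Predicate "interacts_with")
                   (List (Concept "gene-B") (Variable "$y")))))

;; ParallelLink spawns one thread per wrapped Atom; the join-style
;; variant mentioned in the email additionally waits for completion.
(cog-execute! (Parallel query-a query-b))
```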
