Hi Linas,

On Fri, Aug 07, 2020 at 01:30:14PM -0500, Linas Vepstas wrote:
> >
> > To get scientific about it, you'd want to create a heat-map -- load up
> > some large datasets, say, some of the genomics datasets, run one of their
> > standard work-loads as a bench-mark, and then see which pages are hit the
> > most often. I mean -- what is the actual working-set size of the genomics
> > processing? No one knows -- we know that during graph traversal, memory is
> > hit "randomly" .. but what is the distribution? It's surely not uniform.
> > Maybe 90% of the work is done on 10% of the pages? (Maybe it's Zipfian?
> > I'd love to see those charts...)
> 
> The "query-loop" is a subset/sample from one of the agi-bio datasets. It's
> a good one to experiment with, since it will never change, so you can
> compare before-and-after results.  The agi-bio datasets change all the
> time, as they add and remove new features, new data sources, etc. They're
> bigger, but not stable.

A VM page heat-map for the query-loop benchmark is here:

https://github.com/crackleware/opencog-experiments/tree/c0cc508dc5757635ce6c069b20f8ae13ccf8ef8a/mmapped-atomspace

Everything gets dirtied during loading. There is a "hot" subset of pages 
referenced during the processing stage: the total size of the pages referenced 
while processing is around 150MB out of 1.6GB (total allocation). The heat-map 
is very crude because it groups pages in linear address order, which is 
probably a bad grouping. I may experiment with other page groupings to get 
more informative graphs (could be useful for chunking research).
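For anyone who wants to reproduce this kind of measurement: on Linux, the kernel exposes per-mapping "Referenced" sizes in /proc/<pid>/smaps, and writing "1" to /proc/<pid>/clear_refs resets the referenced bits, so you can bracket a single phase. A minimal sketch (not the script used for the graphs above, just the same idea):

```python
# Sketch: sum the "Referenced:" fields from /proc/<pid>/smaps.
# Clearing /proc/<pid>/clear_refs before a phase and reading smaps
# afterwards gives the working set touched during that phase.
import os

def referenced_pages_kb(pid="self"):
    """Total size (in kB) of pages the kernel marked as referenced."""
    total = 0
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            if line.startswith("Referenced:"):
                total += int(line.split()[1])  # field is in kB
    return total

# Usage: echo 1 > /proc/$PID/clear_refs, run the workload, then:
print(referenced_pages_kb())
```

Splitting smaps entries by address range instead of summing them gives the per-region data needed for a heat-map.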

I also did several experimental runs with swap-space on NFS and on NBD 
(network block device): 2 cores, 1GB RAM, 2GB swap. Performance was not very 
good (~10%). The CPU is simply too fast for this amount of memory. :-)

The intermittent peaks are probably garbage collections.

All in all, I expect much better performance with highly concurrent workloads, 
hundreds of threads. When a processing thread hits a page that is not yet in 
physical RAM, it blocks, and a request for that page is queued to storage. 
Other threads continue to work, and after some time they block too, waiting 
for pages of their own. The storage layer can then collect multiple requests 
and deliver the data in batches, trading some latency for throughput. That's 
why SSD benchmarks show graphs for various queue depths: the deeper the queue, 
the better the throughput.
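The effect is easy to see in miniature: many threads each issuing independent reads keep the device's request queue full, while a single thread leaves it nearly empty. A toy sketch (the data file and sizes are made up for illustration, not from the benchmark):

```python
# Sketch: N reader threads issue random page-sized preads against one
# file, standing in for faulting threads in an mmapped atomspace.
# Each blocked pread is one outstanding request in the device queue.
import os
import random
import tempfile
import threading

PAGE = 4096
NPAGES = 256  # 1 MB toy file; a real store would be gigabytes

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(PAGE * NPAGES))
    path = f.name

def worker(fd, nreads=64):
    for _ in range(nreads):
        off = random.randrange(NPAGES) * PAGE
        data = os.pread(fd, PAGE, off)  # blocks until the page arrives
        assert len(data) == PAGE

fd = os.open(path, os.O_RDONLY)
threads = [threading.Thread(target=worker, args=(fd,)) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
os.close(fd)
os.unlink(path)
print("done")
```

With 16 threads in flight the kernel and device can merge and reorder requests; with one thread, queue depth never exceeds 1, which is roughly the situation the single-threaded query-loop benchmark is in.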

The query-loop benchmark is single-threaded. I would like to run a more 
concurrent workload with bigger datasets. Any suggestions?

 
--pedja

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/20200816222406.GA1557615%40intelnuc.localdomain.
