Follow-up: also, you should use `--gc:arc` if you are not already (it sounds like 
you are not). Just with the stdlib `Table` & `HashSet` (with an init size hint of 
`3`), I get 2.57 sec, 1313 MB with the default GC and 1.73 sec, 843 MB with 
`--gc:arc`; adding `-d:nimIntHash1`, this falls to 1.08 sec, 780 MB, and with gcc 
PGO it falls further to 0.95 sec, 895 MB. So, it does seem like you should listen 
to both Araq & ElegantBeef after all. :-)
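
In case it helps, the kind of build lines I mean look roughly like this (the 
program name and `-d:danger` are placeholders, and the PGO part is just plain gcc 
`-fprofile-generate`/`-fprofile-use` passed through via `--passC`/`--passL`; 
adjust to your own setup):

    # ordinary optimized build with ARC & the simpler integer hash
    nim c -d:danger --gc:arc -d:nimIntHash1 invidx.nim
    # gcc PGO: build instrumented, run a representative workload, then rebuild
    nim c -d:danger --gc:arc -d:nimIntHash1 --passC:-fprofile-generate --passL:-fprofile-generate invidx.nim
    ./invidx
    nim c -d:danger --gc:arc -d:nimIntHash1 --passC:-fprofile-use --passL:-fprofile-use invidx.nim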

Then with adix, an identity hash, & hash code elision (code shown below), I get 
1.00 sec, 480 MB, and with PGO 0.82 sec, 563 MB.
    
    
    import adix/[lptabz, althash], random
    # `Hash` is std/hashes' type; if it is not visible via the adix imports,
    # also `import std/hashes`.
    proc hash(x: int): Hash = hashIdentity(x)  # identity hash: no bit mixing for int keys
    # The trailing 0 generic param makes key 0 the empty-slot sentinel, eliding
    # stored hash codes; the 16 is the probe-depth resize trigger (see below).
    var mem = initLPTabz[int, LPSetz[int,int,0], int, 0](0, 16, 1)
    for docid in 1..2554381: # 0 used as hash key sentinel!
        if not mem.contains(docid):
            mem[docid] = initLPSetz[int,int,0](0, 16, 1)
        for y in 0..10:
            mem[docid].incl(rand(115000))
    
    

The 16 controls table resizing: once a linear-probe search depth counter hits 16 
deep, the table resizes (unless it is already below a sparsity threshold). I did 
try the compact mode, but it was quite a bit slower. I didn't play with having 
the table be one mode and the sets another.

Parenthetically, on your broader context, it seems like you are working out a 
search-engine-style inverted index. A hash set, while conceptually spot on, is 
both a space- & time-inefficient representation for the "posting list" of 
documents containing a term (the value in your `Table`). You can also do union 
& intersection efficiently on ordered lists. Keeping the list ordered is not so 
hard if you always process your documents in docId order in the first place, 
which is usually easy since you can arrange for docId to be the ordinal number 
of the document. In fact, because posting lists can be large, fancy search 
engines may even do "delta encodings" of ordered docId lists (i.e. encoding the 
difference to the next docId, not the docId itself) as a kind of fast 
online/context-specific data compression. To start I would recommend a 
`seq[tuple[docId, freq: uint32]]` and writing a couple of tiny 
union/intersection procs on those, along the lines of the sketch below.
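
Here is a minimal sketch of the kind of procs I mean (the `Posting` name and the 
choice to sum frequencies on a match are just illustrative assumptions; both 
inputs must already be sorted by docId):

    type Posting = tuple[docId, freq: uint32]   # one posting-list entry

    proc intersect(a, b: seq[Posting]): seq[Posting] =
      ## Docs present in both lists (classic two-pointer merge).
      var i, j = 0
      while i < a.len and j < b.len:
        if a[i].docId < b[j].docId:
          inc i
        elif a[i].docId > b[j].docId:
          inc j
        else:
          # combine freqs however your scoring needs; summing is a placeholder
          result.add((docId: a[i].docId, freq: a[i].freq + b[j].freq))
          inc i
          inc j

    proc union(a, b: seq[Posting]): seq[Posting] =
      ## Docs present in either list (same merge, keeping everything).
      var i, j = 0
      while i < a.len and j < b.len:
        if a[i].docId < b[j].docId:
          result.add a[i]
          inc i
        elif a[i].docId > b[j].docId:
          result.add b[j]
          inc j
        else:
          result.add((docId: a[i].docId, freq: a[i].freq + b[j].freq))
          inc i
          inc j
      result.add a[i ..< a.len]   # at most one of these tails is non-empty
      result.add b[j ..< b.len]

And the delta-encoding idea, should you ever want it, is just a running 
difference/sum over the sorted docIds:

    proc deltaEncode(ids: seq[uint32]): seq[uint32] =
      var prev = 0'u32
      for id in ids:          # ids sorted ascending => gaps stay non-negative
        result.add id - prev
        prev = id

    proc deltaDecode(gaps: seq[uint32]): seq[uint32] =
      var run = 0'u32
      for g in gaps:
        run += g
        result.add run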

There is a pretty good book on all this by Witten, Moffat & Bell, originally 
called _Managing Megabytes_ { that became _Gigabytes_ and, were it updated 
today, would surely be called _Terabytes_ or maybe _Petabytes_ :-) }.
