Follow-up: you should also use `--gc:arc` if you are not already (sounds like you are not). Just with stdlib `Table` & `HashSet` (with an init size hint of `3`), I get 2.57 sec, 1313 MB with the default GC and 1.73 sec, 843 MB with `--gc:arc`; adding `-d:nimIntHash1` drops that to 1.08 sec, 780 MB, and with gcc PGO it falls further to 0.95 sec, 895 MB. So, it does seem like you should listen to both Araq & ElegantBeef after all. :-)
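Roughly, the stdlib variant is shaped like this (a minimal sketch, not the exact benchmark; it reuses the same docid range and `rand(115000)` bound as the adix code below, and your real loop may of course differ):

```nim
# Minimal sketch of the stdlib Table & HashSet variant; the loop shape
# mirrors the adix code below and only approximates the benchmark.
# Compile with e.g.: nim c -d:release --gc:arc -d:nimIntHash1 bench.nim
import std/[tables, sets, random]

var mem = initTable[int, HashSet[int]]()
for docid in 1 .. 2554381:
  if docid notin mem:
    mem[docid] = initHashSet[int](3)   # the small init size hint
  for y in 0 .. 10:
    mem[docid].incl rand(115000)
```

The gcc PGO step is just the usual two-pass `-fprofile-generate` / `-fprofile-use` build, with the flags passed through via `--passC`/`--passL`.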
Then with adix & identity hash & hash code elision (code shown later), I get 1.00sec, 480MB and with PGO 0.82 sec, 563MB. import adix/[lptabz, althash], random proc hash(x: int): Hash = hashIdentity(x) var mem = initLPTabz[int, LPSetz[int,int,0], int, 0](0, 16, 1) for docid in 1..2554381: # 0 used as hash key sentinel! if not mem.contains(docid): mem[docid] = initLPSetz[int,int,0](0, 16, 1) for y in 0..10: mem[docid].incl(rand(115000)) Run The 16 controls table resize - once a linear probe search depth counter hits 16 deep, the table resizes (unless it is already below a sparsity threshold). I did try the compact mode but it was quite a bit slower. I didn't play with having the table be one mode and the sets another. Parenthetically, on your broader context, it seems like you are working out a search engine style inverted index. A hash set, while conceptually spot on, is both a space & time inefficient representation for the "posting list" of documents containing a term (the value in your `Table`). You can also do union & intersection efficiently on ordered lists. Ensuring order to the list is not so hard if your always process your documents in docId order in the first place. This is usually easy since you can arrange for docId to be the order number of the document. In fact, because posting lists can be large, fancy search engines may even do "delta encodings" of ordered docId lists (i.e. encoding the difference to the next docId, not the docId itself) as a kind of fast online/context specific data compression. To start I would recommend a `seq[tuple[docId, freq: uint32]]` and writing a couple of tiny union/intersection procs on those. There is a pretty good book on all this originally called _Managing Megabytes_ { that became _Gigabytes_ and were it updated today would surely be called Terabytes or maybe Petabytes :-) }.