Re: [OT] Efficient file structure for very large lookup tables?

Craig Dillabaugh Tue, 17 Dec 2013 12:22:07 -0800

On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:

Another OT thread to pick your brains. :)
What's a good, efficient file structure for storing extremely large lookup tables? (Extremely large as in > 10 million entries, with keys and values roughly about 100 bytes each.) The structure must support efficient adding and lookup of entries, as these two operations will be very frequent.
I did some online research, and it seems that hashtables perform poorly on disk, because the usual hash functions cause random scattering of related data (which are likely to be accessed with higher temporal locality), which incurs lots of disk seeks.
I thought about B-trees, but they have high overhead (and are a pain to implement), and also only exhibit good locality if table entries are accessed sequentially; the problem is I'm working with high-dimensional data and the order of accesses is unlikely to be sequential. However, they do exhibit good spatial locality in higher-dimensional space (i.e., if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space). Does anybody know of a good data structure that can take advantage of this fact to minimize disk accesses?


T

As a first attempt, could you use a key-value database (like Redis, if you have enough memory to fit everything in)? Or is that out of the question?

Another question: can your queries be batched? If so, and your data is bigger than your available memory, then try Googling "Lars Arge Buffer Tree", which might work well. However, if you thought implementing a B-tree was going to be painful, that might not appeal to you. If you don't want to implement that yourself, you could look at TPIE:


http://www.madalgo.au.dk/tpie/

Although it is in C++.
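
On the batching question: even without a buffer tree, batching already pays off if you group the pending lookups by the bucket they fall in and read each bucket only once. The sketch below is not a buffer tree, just that grouping idea. The names batched_lookup, bucket_of and load_bucket are made up for illustration, and it assumes bucket ids sort in roughly the same order as the buckets sit on disk.

```python
from collections import defaultdict


def batched_lookup(queries, bucket_of, load_bucket):
    """Answer many lookups while loading each bucket at most once.

    bucket_of(key) -> bucket id and load_bucket(bucket_id) -> dict are
    hypothetical callbacks standing in for whatever on-disk layout is used.
    """
    # Group the pending keys by the bucket they belong to.
    by_bucket = defaultdict(list)
    for key in queries:
        by_bucket[bucket_of(key)].append(key)

    results = {}
    # Visit buckets in sorted order (assumed to match on-disk order).
    for bucket_id in sorted(by_bucket):
        bucket = load_bucket(bucket_id)  # one read per bucket, however many keys hit it
        for key in by_bucket[bucket_id]:
            results[key] = bucket.get(key)  # None when the key is absent
    return results
```

With 1000 queued lookups that land in 20 buckets, this does 20 bucket reads instead of up to 1000 seeks. The price is latency: nothing is answered until the batch is processed.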

If I had to design something quick on the spot, my first guess would be to use a grid on the first two dimensions, bin the 'points' (keys) within each grid square, and build a simpler structure on those. This won't work so well, though, for really high-dimensional data or if the 'points' are randomly distributed.
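
To make the grid idea concrete, here is a rough in-memory sketch. It assumes keys are numeric tuples, and the names GridIndex, cell_of and CELL_SIZE are just placeholders I picked. Each bucket is a plain dict here; on disk it would be a file or a run of pages per grid square.

```python
from collections import defaultdict

CELL_SIZE = 10.0  # hypothetical cell width; tune so a bucket fits in a few disk blocks


def cell_of(key, cell_size=CELL_SIZE):
    """Map a point to its grid cell using only the first two coordinates."""
    return (int(key[0] // cell_size), int(key[1] // cell_size))


class GridIndex:
    """In-memory stand-in for a grid of on-disk buckets."""

    def __init__(self, cell_size=CELL_SIZE):
        self.cell_size = cell_size
        self.buckets = defaultdict(dict)  # cell -> {full key: value}

    def add(self, key, value):
        # The full key is stored inside the bucket; only the cell uses two dimensions.
        self.buckets[cell_of(key, self.cell_size)][tuple(key)] = value

    def lookup(self, key):
        # .get() avoids creating an empty bucket for a cell that was never populated.
        bucket = self.buckets.get(cell_of(key, self.cell_size), {})
        return bucket.get(tuple(key))
```

Keys that are close in the first two dimensions end up in the same bucket, so a run of nearby lookups touches one bucket rather than many. Closeness in the remaining dimensions is ignored, which is why this degrades for really high-dimensional data.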


Also, what exactly do you mean by "in that space" when you say:

"if entry X is accessed first, then the next entry Y is quite likely to be close to X in that space".

Do you mean that the value of Y in the next dimension is numerically close (or expected to be) to X?


Cheers,

Craig
