http://www.cc.gatech.edu/~bader/COURSES/UNM/ece637-Fall2003/papers/LFN02.pdf
On Sat, Apr 30, 2016 at 4:40 PM, G. Wade Johnson via Houston <[email protected]> wrote:

> On Sat, 30 Apr 2016 19:14:24 +0200
> Reini Urban via Houston <[email protected]> wrote:
>
> > No, the 1st cache argument is to skip the 2 unnecessary left and
> > right pointers and use a trie instead; we do that even in perl5.
> > The 2nd cache argument is to use the van Emde Boas layout and not
> > the simple binary search tree layout. The 3rd cache argument is to
> > use a good radix trie, and not a BST at all.
> >
> > If memory is a problem, a bloom filter in front of it will beat
> > everything. My bloom filter deduper for 1G files used 100x less
> > memory than all the others.
>
> Doing this on disk for a real system would benefit from a more
> powerful data structure (I've seen B+ trees used effectively). But
> for understanding the problem, a binary tree is easy to grasp and
> serves as a good springboard for exploration.
>
> In trying to learn the trade-offs in storing and retrieving this
> kind of data, there is a lot of benefit in working through the
> problem at this level.
>
> Chris, you should keep the group posted as you explore the problem
> and potential solutions.
>
> G. Wade
>
> > Reini Urban
> > [email protected]
> >
> > On Apr 30, 2016, at 7:09 PM, Mev412 <[email protected]> wrote:
> >
> > > Can't say this was meant to be "state of the art". Todd mentioned
> > > the perfect hashes, but the original problem was how to lower the
> > > memory footprint as much as possible, so any data structure stored
> > > completely in memory wouldn't be ideal. That was the motivation to
> > > use something where searching could happen by seeking through the
> > > file rather than in memory. As far as the L1 cache goes, the keys
> > > are stored sequentially on disk, so this should utilize the cache
> > > a lot better than chasing random in-memory pointers.
> > >
> > > Best Regards,
> > > Christopher Mevissen
> > >
> > > On Sat, Apr 30, 2016 at 1:51 AM, Reini Urban <[email protected]> wrote:
> > >
> > > > Ordinary BSTs are not really state of the art any more on modern
> > > > CPUs. The overhead of the two absolute pointers trashes the L1
> > > > cache, the very same problem as with our perl5 op tree overhead;
> > > > it is one of the reasons why perl5 is so slow. Ditto linked
> > > > lists.
> > > >
> > > > I also saw you trying it with Config, and there you can easily
> > > > see how my gperf (a static perfect hash) outperforms the BST:
> > > >
> > > > https://github.com/toddr/ConfigDat vs
> > > > https://github.com/perl11/p5-Config
> > > >
> > > > And the gperf hash is not always the best method; I just haven't
> > > > had enough time to finish my Perfect::Hash module, which comes up
> > > > with better characteristics than gperf in some cases. bulk88
> > > > optimized the hell out of it lately.
> > > > https://github.com/rurban/Perfect-Hash#benchmarks
> > > >
> > > > State of the art, besides properly implemented hash tables (i.e.
> > > > NOT perl5 hash tables), are van Emde Boas binary search tries,
> > > > which perform much better than ordinary binary search tries.
> > > > Note: trie != tree; no left/right pointers needed. But even with
> > > > the advantage of a trie, the traditional binary search layout is
> > > > not optimal anymore.
> > > >
> > > > https://en.wikipedia.org/wiki/Van_Emde_Boas_tree
> > > >
> > > > Radix trees with optimizations on word sizes (Patricia tries)
> > > > also perform much better, e.g. Judy or the HAT-trie. A good
> > > > HAT-trie is as fast as a proper hash table, esp. for smaller
> > > > sizes.
> > > >
> > > > Some links:
> > > > search for "Cache Oblivious Search Tree"
> > > > nice maps:
> > > > https://www.cs.utexas.edu/~pingali/CS395T/2013fa/lectures/MemoryOptimizations_2013.pdf
> > > >
> > > > Reini Urban
> > > > [email protected]
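
The pointer-free layout argument above is easy to demonstrate in Perl:
lay the sorted keys into an implicit breadth-first (Eytzinger) array,
where the children of slot k are simply slots 2k and 2k+1, so the tree
structure is encoded by position alone. A minimal sketch (the van Emde
Boas layout goes further and recursively blocks the levels, but the
no-stored-pointers idea is the same):

    use strict;
    use warnings;

    # In-order fill of the implicit tree: slot 1 is the root, the children
    # of slot $k are slots 2*$k and 2*$k + 1. Slot 0 stays unused so the
    # arithmetic is 1-based. No left/right pointers are stored anywhere.
    sub build_implicit {
        my ($sorted, $tree, $k, $next) = @_;
        return if $k > @$sorted;
        build_implicit($sorted, $tree, 2 * $k, $next);
        $tree->[$k] = $sorted->[$$next++];
        build_implicit($sorted, $tree, 2 * $k + 1, $next);
    }

    # Each step computes the next slot from $k alone, and the hot top
    # levels of the tree sit packed together at the front of the array.
    sub implicit_search {
        my ($tree, $key) = @_;
        my $k = 1;
        while ($k <= $#$tree) {
            my $cmp = $key cmp $tree->[$k];
            return $tree->[$k] if $cmp == 0;
            $k = 2 * $k + ($cmp < 0 ? 0 : 1);
        }
        return undef;    # not found
    }

    my @sorted = sort qw(ant bee cat dog elk fox gnu);
    my @tree;
    my $next = 0;
    build_implicit(\@sorted, \@tree, 1, \$next);
    print implicit_search(\@tree, "dog") // "absent", "\n";    # dog
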
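Reini's bloom-filter point from the top of the thread is also cheap to
prototype. A minimal sketch, assuming a vec() bit vector and four probe
positions carved out of a single MD5 digest; the bit-vector size and
hash count here are illustrative, not taken from his deduper:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my $BITS   = 8 * 1024 * 1024;    # 1 MiB bit vector; size to the key count
    my $filter = "\0" x ($BITS / 8);

    # Four probe positions per key, from the four 32-bit words of one MD5.
    sub positions {
        my ($key) = @_;
        return map { $_ % $BITS } unpack "V4", md5($key);
    }

    sub bloom_add {
        my ($key) = @_;
        vec($filter, $_, 1) = 1 for positions($key);
    }

    # False means "definitely absent", so a miss never touches the big
    # on-disk structure; true means "probably present" (tunable error rate).
    sub bloom_check {
        my ($key) = @_;
        for my $pos (positions($key)) {
            return 0 unless vec($filter, $pos, 1);
        }
        return 1;
    }

    bloom_add($_) for qw(alpha beta gamma);
    print bloom_check("beta")    ? "maybe\n" : "no\n";   # maybe
    print bloom_check("missing") ? "maybe\n" : "no\n";   # almost certainly no
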
> > > > On Apr 30, 2016, at 5:31 AM, Mev412 via Houston <[email protected]> wrote:
> > > >
> > > > > The conversation at the last meeting sparked my interest, so I
> > > > > implemented the file-based binary search tree:
> > > > >
> > > > > https://github.com/despertargz/tree-binary-search-file/blob/master/lib/Tree/Binary/Search/File.pm
> > > > >
> > > > > Build the file:
> > > > >
> > > > >   my $file = Tree::Binary::Search::File->new("/tmp/test-bst");
> > > > >   $file->write_file({ test => "blah" });
> > > > >
> > > > > Reading values:
> > > > >
> > > > >   my $file = Tree::Binary::Search::File->new("/tmp/test-bst");
> > > > >   my $value = $file->get("test");
> > > > >
> > > > > It performed well against a file-based linear search. You can
> > > > > see how the linear search time doubles as the record count
> > > > > doubles. I haven't measured how close to O(log n) it is, but it
> > > > > appears to do well; it barely flinches when going from 1 to 2
> > > > > million records.
> > > > >
> > > > > Time is seconds to locate a single record (worst-case scenario):
> > > > >
> > > > >   # of records  binary-search-tree    linear
> > > > >   1024          4.48226928710938e-05  0.000963926315307617
> > > > >   2048          4.31537628173828e-05  0.00278782844543457
> > > > >   4096          3.38554382324219e-05  0.00162196159362793
> > > > >   8192          5.07831573486328e-05  0.0121698379516602
> > > > >   16384         4.60147857666016e-05  0.0115268230438232
> > > > >   32768         6.58035278320312e-05  0.0142660140991211
> > > > >   65536         0.000729084014892578  0.0285739898681641
> > > > >   131072        0.00218009948730469   0.0539009571075439
> > > > >   262144        0.00141692161560059   0.1079261302948
> > > > >   524288        0.0019831657409668    0.214764833450317
> > > > >   1048576       0.00240302085876465   0.434930086135864
> > > > >   2097152       0.00240802764892578   0.875269889831543
> > > > >
> > > > > The header format is:
> > > > >
> > > > >   [left-node position][right-node position][value length][key][value]
> > > > >
> > > > > It currently uses a static key size, so it can read in the key
> > > > > along with the rest of the header. This takes up more disk
> > > > > space but should be faster than an extra read. If there's any
> > > > > natural buffering of the file, though, then this may not incur
> > > > > a performance penalty, so I'll have to benchmark.
> > > > >
> > > > > This is the main search logic:
> > > > >
> > > > >   my $header;
> > > > >   read $fh, $header, $HEADER_SIZE;   # assumes $fh is already at this node's offset
> > > > >
> > > > >   # Header layout: [left][right][value length][fixed-width key]
> > > > >   my $left     = unpack("V", substr($header, 0, 4));
> > > > >   my $right    = unpack("V", substr($header, 4, 4));
> > > > >   my $val_len  = unpack("V", substr($header, 8, 4));
> > > > >   my $file_key = substr($header, 12, $KEY_SIZE);
> > > > >
> > > > >   # NB: with a fixed-width key field, $key and $file_key must be
> > > > >   # padded/trimmed consistently before the comparison.
> > > > >   my $comp = $key cmp $file_key;
> > > > >
> > > > >   if ($comp == 0) {
> > > > >       # Hit: the value bytes follow the header directly.
> > > > >       my $val;
> > > > >       read $fh, $val, $val_len;
> > > > >       return $val;
> > > > >   }
> > > > >   elsif ($comp < 0) {
> > > > >       return undef if $left == 0;    # 0 marks a missing child
> > > > >       return $self->find_key($key, $left);    # return, so the value propagates up
> > > > >   }
> > > > >   else {
> > > > >       return undef if $right == 0;
> > > > >       return $self->find_key($key, $right);
> > > > >   }
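
To make the record format above concrete, here is a hypothetical
pack/unpack round trip for one node, assuming a 32-byte fixed key
field; $KEY_SIZE and the helper names are illustrative, but the "V"
(32-bit little-endian) templates match the search snippet:

    use strict;
    use warnings;

    my $KEY_SIZE    = 32;
    my $HEADER_SIZE = 12 + $KEY_SIZE;   # left(4) + right(4) + value length(4) + key

    # [left-node position][right-node position][value length][key][value]
    sub pack_node {
        my ($left, $right, $key, $value) = @_;
        die "key exceeds $KEY_SIZE bytes" if length($key) > $KEY_SIZE;
        return pack("V3", $left, $right, length $value)
             . pack("a$KEY_SIZE", $key)   # NUL-padded to the fixed width
             . $value;
    }

    sub unpack_header {
        my ($header) = @_;
        my ($left, $right, $val_len) = unpack "V3", $header;
        my $key = unpack "Z$KEY_SIZE", substr($header, 12);   # "Z" drops the padding
        return ($left, $right, $val_len, $key);
    }

    my $rec = pack_node(0, 0, "test", "blah");
    my ($left, $right, $val_len, $key) = unpack_header(substr($rec, 0, $HEADER_SIZE));
    print "$key => ", substr($rec, $HEADER_SIZE, $val_len), "\n";   # test => blah

One wrinkle the fixed-width field introduces: the key on disk is
NUL-padded, so the cmp in the search has to compare like with like,
either padding the probe key the same way or trimming the stored one,
as the comment in the snippet above notes.
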
> > > > > Writing the file sorts the key/value pairs and builds a BST;
> > > > > while building the BST, a "flat" list of the nodes is built
> > > > > along with the positions of their left and right children.
> > > > > Recording the position of each node itself made writing the
> > > > > file easier; this list is then fed to a method which writes
> > > > > each node to the file.
> > > > >
> > > > > The writing of the file is not memory-efficient, as it builds
> > > > > the BST in memory for simplicity, though this cost is only
> > > > > incurred once, when the file is written. Being able to both
> > > > > insert into and balance the file-based tree would be ideal, so
> > > > > I'll have to look into some ways to do that.
> > > > >
> > > > > Another consideration would be storing all the values at the
> > > > > end of the file so the headers run sequentially. Especially if
> > > > > the values are long, this could improve cache hits / buffering.
> > > > >
> > > > > It's a work in progress: I need to make some methods private,
> > > > > make the key size configurable, and add documentation and
> > > > > tests; then I might see if I can upload it to CPAN.
> > > > >
> > > > > Anyways, just wanted to share. Let me know what you think.
> > > > > Always enjoy the talks and the technical discussions that
> > > > > ensue :)
> > > > >
> > > > > Best Regards,
> > > > > Christopher Mevissen
>
> --
> Fortune knocks but once, but misfortune has much more patience.
>   -- Laurence J. Peter
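
Coming back to Chris's write-side description (sort the pairs, build a
balanced tree, record each node's position, then emit the records): it
can be sketched without materializing node objects at all. A
hypothetical variant, assuming the same fixed-width header as above,
that assigns offsets with a breadth-first midpoint walk so the root
lands at offset 0 and 0 keeps doubling as the "no child" marker:

    use strict;
    use warnings;

    my $KEY_SIZE    = 32;
    my $HEADER_SIZE = 12 + $KEY_SIZE;

    sub write_bst_file {
        my ($path, $pairs) = @_;
        my @keys = sort keys %$pairs;
        return unless @keys;

        # Pass 1: breadth-first midpoint walk over index ranges. The
        # middle key of each range is that subtree's root; visiting the
        # ranges in BFS order puts the overall root at offset 0 and keeps
        # the hot top levels of the tree sequential in the file.
        my (@order, %offset);
        my $pos   = 0;
        my @queue = ([0, $#keys]);
        while (my $range = shift @queue) {
            my ($lo, $hi) = @$range;
            my $mid = int(($lo + $hi) / 2);
            $offset{$mid} = $pos;
            $pos += $HEADER_SIZE + length $pairs->{ $keys[$mid] };
            push @order, [$mid, $lo, $hi];
            push @queue, [$lo, $mid - 1] if $lo < $mid;
            push @queue, [$mid + 1, $hi] if $mid < $hi;
        }

        # Pass 2: emit the records; every child offset is known by now.
        open my $fh, ">:raw", $path or die "open $path: $!";
        for my $node (@order) {
            my ($mid, $lo, $hi) = @$node;
            my $left  = $lo < $mid ? $offset{ int(($lo + $mid - 1) / 2) } : 0;
            my $right = $mid < $hi ? $offset{ int(($mid + 1 + $hi) / 2) } : 0;
            my $value = $pairs->{ $keys[$mid] };
            print {$fh} pack("V3", $left, $right, length $value),
                        pack("a$KEY_SIZE", $keys[$mid]),   # keys must fit in $KEY_SIZE
                        $value;
        }
        close $fh or die "close $path: $!";
    }

    write_bst_file("/tmp/test-bst", { test => "blah", more => "data" });

Balancing comes for free because every range splits at its midpoint,
and the van Emde Boas layout Reini mentions would change only the visit
order in pass 1.
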
_______________________________________________
Houston mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/houston
Website: http://houston.pm.org/
