http://www.cc.gatech.edu/~bader/COURSES/UNM/ece637-Fall2003/papers/LFN02.pdf
On Sat, Apr 30, 2016 at 4:40 PM, G. Wade Johnson via Houston <[email protected]> wrote:

> On Sat, 30 Apr 2016 19:14:24 +0200
> Reini Urban via Houston <[email protected]> wrote:
>
> > No, the 1st cache argument is to skip the 2 unnecessary left and
> > right pointers and use a trie instead; we do that even in perl5.
> > The 2nd cache argument is to use the van Emde Boas layout and not
> > the simple binary search tree layout. The 3rd cache argument is to
> > use a good radix trie, and not a BST at all.
> >
> > If memory is a problem, a bloom filter in front of it will beat
> > everything. My bloom filter deduper for 1G files used 100x less
> > memory than all the others.
>
> Doing this on disk for a real system would benefit from a more
> powerful data structure (I've seen B+ trees used effectively). But
> for understanding the problem, a binary tree is easy to grasp and
> serves as a good springboard for exploration.
>
> In trying to learn the trade-offs in storing and retrieving this
> kind of data, there is a lot of benefit in working through the
> problem at this level.
>
> Chris, you should keep the group posted as you explore the problem
> and potential solutions.
>
> G. Wade
>
> > Reini Urban
> > [email protected]
> >
> > On Apr 30, 2016, at 7:09 PM, Mev412 <[email protected]> wrote:
> >
> > > Can't say this was meant to be "state of the art". Todd mentioned
> > > the perfect hashes, but the original problem was how to lower the
> > > memory footprint as much as possible, so any data structure stored
> > > completely in memory wouldn't be ideal. That was the motivation to
> > > use something where searching could happen by seeking through the
> > > file rather than in memory. As far as the L1 cache goes, the keys
> > > are stored sequentially on disk, so this should utilize the cache
> > > a lot better than chasing random in-memory pointers.
> > >
> > > Best Regards,
> > > Christopher Mevissen
> > >
> > > On Sat, Apr 30, 2016 at 1:51 AM, Reini Urban <[email protected]> wrote:
> > >
> > > > Ordinary BSTs are not really state of the art any more on modern
> > > > CPUs. The overhead of the two absolute pointers trashes the L1
> > > > cache, the very same problem as with our perl5 op tree overhead;
> > > > it is one of the reasons why perl5 is so slow. Ditto linked
> > > > lists.
> > > >
> > > > I also saw you trying it with Config, and there you can easily
> > > > see how my gperf (a static perfect hash) outperforms the BST:
> > > >
> > > > https://github.com/toddr/ConfigDat vs
> > > > https://github.com/perl11/p5-Config
> > > >
> > > > And the gperf hash is not always the best method; I just haven't
> > > > had enough time to finish my Perfect::Hash module, which comes up
> > > > with better characteristics than gperf in some cases. bulk88
> > > > optimized the hell out of it lately.
> > > > https://github.com/rurban/Perfect-Hash#benchmarks
> > > >
> > > > State of the art, besides properly implemented hash tables (i.e.
> > > > NOT perl5 hash tables), are van Emde Boas binary search tries,
> > > > which perform much better than ordinary binary search tries.
> > > > Note: trie != tree; no left/right pointers needed. But even with
> > > > the advantage of a trie, the traditional binary search layout is
> > > > not optimal anymore.
> > > >
> > > > https://en.wikipedia.org/wiki/Van_Emde_Boas_tree
> > > >
> > > > Radix trees with optimizations on word sizes (Patricia tries)
> > > > also perform much better, e.g. Judy or the HAT-trie. A good
> > > > HAT-trie is as fast as a proper hash table, esp. for smaller
> > > > sizes.
> > > >
> > > > Some links:
> > > > search for "Cache Oblivious Search Tree"
> > > > nice maps:
> > > > https://www.cs.utexas.edu/~pingali/CS395T/2013fa/lectures/MemoryOptimizations_2013.pdf
> > > >
> > > > Reini Urban
> > > > [email protected]
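
The pointer-free layout argument above is easy to demonstrate in Perl:
lay the sorted keys into an implicit breadth-first (Eytzinger) array,
where the children of slot k are simply slots 2k and 2k+1, so the tree
structure is encoded by position alone. A minimal sketch (the van Emde
Boas layout goes further and recursively blocks the levels, but the
no-stored-pointers idea is the same):

    use strict;
    use warnings;

    # In-order fill of the implicit tree: slot 1 is the root, the children
    # of slot $k are slots 2*$k and 2*$k + 1. Slot 0 stays unused so the
    # arithmetic is 1-based. No left/right pointers are stored anywhere.
    sub build_implicit {
        my ($sorted, $tree, $k, $next) = @_;
        return if $k > @$sorted;
        build_implicit($sorted, $tree, 2 * $k, $next);
        $tree->[$k] = $sorted->[$$next++];
        build_implicit($sorted, $tree, 2 * $k + 1, $next);
    }

    # Each step computes the next slot from $k alone, and the hot top
    # levels of the tree sit packed together at the front of the array.
    sub implicit_search {
        my ($tree, $key) = @_;
        my $k = 1;
        while ($k <= $#$tree) {
            my $cmp = $key cmp $tree->[$k];
            return $tree->[$k] if $cmp == 0;
            $k = 2 * $k + ($cmp < 0 ? 0 : 1);
        }
        return undef;    # not found
    }

    my @sorted = sort qw(ant bee cat dog elk fox gnu);
    my @tree;
    my $next = 0;
    build_implicit(\@sorted, \@tree, 1, \$next);
    print implicit_search(\@tree, "dog") // "absent", "\n";    # dog
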
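Reini's bloom-filter point from the top of the thread is also cheap to
prototype. A minimal sketch, assuming a vec() bit vector and four probe
positions carved out of a single MD5 digest; the bit-vector size and
hash count here are illustrative, not taken from his deduper:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my $BITS   = 8 * 1024 * 1024;    # 1 MiB bit vector; size to the key count
    my $filter = "\0" x ($BITS / 8);

    # Four probe positions per key, from the four 32-bit words of one MD5.
    sub positions {
        my ($key) = @_;
        return map { $_ % $BITS } unpack "V4", md5($key);
    }

    sub bloom_add {
        my ($key) = @_;
        vec($filter, $_, 1) = 1 for positions($key);
    }

    # False means "definitely absent", so a miss never touches the big
    # on-disk structure; true means "probably present" (tunable error rate).
    sub bloom_check {
        my ($key) = @_;
        for my $pos (positions($key)) {
            return 0 unless vec($filter, $pos, 1);
        }
        return 1;
    }

    bloom_add($_) for qw(alpha beta gamma);
    print bloom_check("beta")    ? "maybe\n" : "no\n";   # maybe
    print bloom_check("missing") ? "maybe\n" : "no\n";   # almost certainly no
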
> > > > On Apr 30, 2016, at 5:31 AM, Mev412 via Houston <[email protected]> wrote:
> > > >
> > > > > The conversation at the last meeting sparked my interest, so I
> > > > > implemented the file-based binary search tree:
> > > > >
> > > > > https://github.com/despertargz/tree-binary-search-file/blob/master/lib/Tree/Binary/Search/File.pm
> > > > >
> > > > > Build the file:
> > > > >
> > > > >   my $file = Tree::Binary::Search::File->new("/tmp/test-bst");
> > > > >   $file->write_file({ test => "blah" });
> > > > >
> > > > > Reading values:
> > > > >
> > > > >   my $file = Tree::Binary::Search::File->new("/tmp/test-bst");
> > > > >   my $value = $file->get("test");
> > > > >
> > > > > It performed well against a file-based linear search. You can
> > > > > see how the linear search time doubles as the record count
> > > > > doubles. I haven't measured how close to O(log n) it is, but it
> > > > > appears to do well; it barely flinches when going from 1 to 2
> > > > > million records.
> > > > >
> > > > > Time is seconds to locate a single record (worst-case scenario):
> > > > >
> > > > >   # of records  binary-search-tree    linear
> > > > >   1024          4.48226928710938e-05  0.000963926315307617
> > > > >   2048          4.31537628173828e-05  0.00278782844543457
> > > > >   4096          3.38554382324219e-05  0.00162196159362793
> > > > >   8192          5.07831573486328e-05  0.0121698379516602
> > > > >   16384         4.60147857666016e-05  0.0115268230438232
> > > > >   32768         6.58035278320312e-05  0.0142660140991211
> > > > >   65536         0.000729084014892578  0.0285739898681641
> > > > >   131072        0.00218009948730469   0.0539009571075439
> > > > >   262144        0.00141692161560059   0.1079261302948
> > > > >   524288        0.0019831657409668    0.214764833450317
> > > > >   1048576       0.00240302085876465   0.434930086135864
> > > > >   2097152       0.00240802764892578   0.875269889831543
> > > > >
> > > > > The header format is:
> > > > >
> > > > >   [left-node position][right-node position][value length][key][value]
> > > > >
> > > > > It currently uses a static key size, so it can read in the key
> > > > > along with the rest of the header. This takes up more disk
> > > > > space but should be faster than an extra read. If there's any
> > > > > natural buffering of the file, though, then this may not incur
> > > > > a performance penalty, so I'll have to benchmark.
> > > > >
> > > > > This is the main search logic:
> > > > >
> > > > >   my $header;
> > > > >   read $fh, $header, $HEADER_SIZE;   # assumes $fh is already at this node's offset
> > > > >
> > > > >   # Header layout: [left][right][value length][fixed-width key]
> > > > >   my $left     = unpack("V", substr($header, 0, 4));
> > > > >   my $right    = unpack("V", substr($header, 4, 4));
> > > > >   my $val_len  = unpack("V", substr($header, 8, 4));
> > > > >   my $file_key = substr($header, 12, $KEY_SIZE);
> > > > >
> > > > >   # NB: with a fixed-width key field, $key and $file_key must be
> > > > >   # padded/trimmed consistently before the comparison.
> > > > >   my $comp = $key cmp $file_key;
> > > > >
> > > > >   if ($comp == 0) {
> > > > >       # Hit: the value bytes follow the header directly.
> > > > >       my $val;
> > > > >       read $fh, $val, $val_len;
> > > > >       return $val;
> > > > >   }
> > > > >   elsif ($comp < 0) {
> > > > >       return undef if $left == 0;    # 0 marks a missing child
> > > > >       return $self->find_key($key, $left);    # return, so the value propagates up
> > > > >   }
> > > > >   else {
> > > > >       return undef if $right == 0;
> > > > >       return $self->find_key($key, $right);
> > > > >   }
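
To make the record format above concrete, here is a hypothetical
pack/unpack round trip for one node, assuming a 32-byte fixed key
field; $KEY_SIZE and the helper names are illustrative, but the "V"
(32-bit little-endian) templates match the search snippet:

    use strict;
    use warnings;

    my $KEY_SIZE    = 32;
    my $HEADER_SIZE = 12 + $KEY_SIZE;   # left(4) + right(4) + value length(4) + key

    # [left-node position][right-node position][value length][key][value]
    sub pack_node {
        my ($left, $right, $key, $value) = @_;
        die "key exceeds $KEY_SIZE bytes" if length($key) > $KEY_SIZE;
        return pack("V3", $left, $right, length $value)
             . pack("a$KEY_SIZE", $key)   # NUL-padded to the fixed width
             . $value;
    }

    sub unpack_header {
        my ($header) = @_;
        my ($left, $right, $val_len) = unpack "V3", $header;
        my $key = unpack "Z$KEY_SIZE", substr($header, 12);   # "Z" drops the padding
        return ($left, $right, $val_len, $key);
    }

    my $rec = pack_node(0, 0, "test", "blah");
    my ($left, $right, $val_len, $key) = unpack_header(substr($rec, 0, $HEADER_SIZE));
    print "$key => ", substr($rec, $HEADER_SIZE, $val_len), "\n";   # test => blah

One wrinkle the fixed-width field introduces: the key on disk is
NUL-padded, so the cmp in the search has to compare like with like,
either padding the probe key the same way or trimming the stored one,
as the comment in the snippet above notes.
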
> > > > > Writing the file sorts the key/value pairs and builds a BST;
> > > > > while building the BST, a "flat" list of the nodes is built
> > > > > along with the positions of their left and right children.
> > > > > Recording the position of each node itself made writing the
> > > > > file easier; this list is then fed to a method which writes
> > > > > each node to the file.
> > > > >
> > > > > The writing of the file is not memory-efficient, as it builds
> > > > > the BST in memory for simplicity, though this cost is only
> > > > > incurred once, when the file is written. Being able to both
> > > > > insert into and balance the file-based tree would be ideal, so
> > > > > I'll have to look into some ways to do that.
> > > > >
> > > > > Another consideration would be storing all the values at the
> > > > > end of the file so the headers run sequentially. Especially if
> > > > > the values are long, this could improve cache hits / buffering.
> > > > >
> > > > > It's a work in progress: I need to make some methods private,
> > > > > make the key size configurable, and add documentation and
> > > > > tests; then I might see if I can upload it to CPAN.
> > > > >
> > > > > Anyways, just wanted to share. Let me know what you think.
> > > > > Always enjoy the talks and the technical discussions that
> > > > > ensue :)
> > > > >
> > > > > Best Regards,
> > > > > Christopher Mevissen
>
> --
> Fortune knocks but once, but misfortune has much more patience.
>   -- Laurence J. Peter
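
Coming back to Chris's write-side description (sort the pairs, build a
balanced tree, record each node's position, then emit the records): it
can be sketched without materializing node objects at all. A
hypothetical variant, assuming the same fixed-width header as above,
that assigns offsets with a breadth-first midpoint walk so the root
lands at offset 0 and 0 keeps doubling as the "no child" marker:

    use strict;
    use warnings;

    my $KEY_SIZE    = 32;
    my $HEADER_SIZE = 12 + $KEY_SIZE;

    sub write_bst_file {
        my ($path, $pairs) = @_;
        my @keys = sort keys %$pairs;
        return unless @keys;

        # Pass 1: breadth-first midpoint walk over index ranges. The
        # middle key of each range is that subtree's root; visiting the
        # ranges in BFS order puts the overall root at offset 0 and keeps
        # the hot top levels of the tree sequential in the file.
        my (@order, %offset);
        my $pos   = 0;
        my @queue = ([0, $#keys]);
        while (my $range = shift @queue) {
            my ($lo, $hi) = @$range;
            my $mid = int(($lo + $hi) / 2);
            $offset{$mid} = $pos;
            $pos += $HEADER_SIZE + length $pairs->{ $keys[$mid] };
            push @order, [$mid, $lo, $hi];
            push @queue, [$lo, $mid - 1] if $lo < $mid;
            push @queue, [$mid + 1, $hi] if $mid < $hi;
        }

        # Pass 2: emit the records; every child offset is known by now.
        open my $fh, ">:raw", $path or die "open $path: $!";
        for my $node (@order) {
            my ($mid, $lo, $hi) = @$node;
            my $left  = $lo < $mid ? $offset{ int(($lo + $mid - 1) / 2) } : 0;
            my $right = $mid < $hi ? $offset{ int(($mid + 1 + $hi) / 2) } : 0;
            my $value = $pairs->{ $keys[$mid] };
            print {$fh} pack("V3", $left, $right, length $value),
                        pack("a$KEY_SIZE", $keys[$mid]),   # keys must fit in $KEY_SIZE
                        $value;
        }
        close $fh or die "close $path: $!";
    }

    write_bst_file("/tmp/test-bst", { test => "blah", more => "data" });

Balancing comes for free because every range splits at its midpoint,
and the van Emde Boas layout Reini mentions would change only the visit
order in pass 1.
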
_______________________________________________
Houston mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/houston
Website: http://houston.pm.org/
