Ordinary BSTs are not really state of the art any more on modern CPUs.
The overhead of the two absolute child pointers per node trashes the L1
cache, the very same problem as with our perl5 op tree overhead, and one
of the reasons why perl5 is so slow. Ditto linked lists.
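
For contrast, a sorted array needs no child pointers at all, so a lookup
touches only key data. A minimal Python sketch (my illustration, not code
from any of the modules discussed here):

```python
import bisect

# A pointer-based BST node carries two machine words of pointers per key;
# a sorted array stores only the keys, and binary search finds the same
# answer by halving index ranges instead of chasing pointers.
keys = sorted([17, 3, 42, 8, 23, 5])

def contains(a, x):
    """Binary search over a plain sorted array: no pointers to chase."""
    i = bisect.bisect_left(a, x)
    return i < len(a) and a[i] == x

print(contains(keys, 23))  # True
print(contains(keys, 4))   # False
```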

I also saw you trying it with Config, and there you can easily see how my 
gperf (a static perfect hash) outperforms the BST.

https://github.com/toddr/ConfigDat vs https://github.com/perl11/p5-Config

And the gperf hash is not always the best method; I just haven’t had
enough time to finish my Perfect::Hash module, which comes up with better
characteristics than gperf in some cases.
bulk88 optimized the hell out of it lately.
https://github.com/rurban/Perfect-Hash#benchmarks
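
For illustration only (this is neither gperf’s algorithm nor
Perfect::Hash’s), the core idea of a *static* perfect hash can be
sketched in a few lines of Python: since the key set is fixed, search
offline for a salt that makes the hash collision-free, and then a lookup
is one hash plus one string compare, with no probing:

```python
# Toy static perfect hash: all names here are mine, for illustration.
def h(salt, key, n):
    """Simple salted string hash into n slots."""
    v = salt
    for ch in key:
        v = (v * 31 + ord(ch)) & 0xFFFFFFFF
    return v % n

def build_perfect(keys):
    """Brute-force a salt that maps the fixed key set without collisions."""
    n = len(keys)
    for salt in range(1, 100000):
        if len({h(salt, k, n) for k in keys}) == n:   # perfect for this set
            table = [None] * n
            for k in keys:
                table[h(salt, k, n)] = k
            return salt, table
    raise RuntimeError("no salt found; widen the table")

keys = ["archname", "osname", "cc", "prefix", "version"]
salt, table = build_perfect(keys)

def lookup(key):
    # One hash, one compare; the compare rejects keys outside the set.
    return table[h(salt, key, len(table))] == key

print(lookup("cc"))      # True
print(lookup("nosuch"))  # False
```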

State of the art, besides properly implemented hash tables (i.e. NOT
perl5 hash tables), are van Emde Boas binary search tries, which perform
much better than ordinary binary search trees.
Note: trie != tree, so no left/right pointers are needed. But even with
the advantage of a trie, the traditional binary search layout is not
optimal anymore.

https://en.wikipedia.org/wiki/Van_Emde_Boas_tree
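
The Wikipedia page covers the recursive vEB layout. As a sketch of the
same idea (an implicit, pointer-free tree in one flat array), here is the
simpler BFS a.k.a. Eytzinger layout in Python; names are mine, and the
real vEB recursion is more involved:

```python
# Cache-friendly implicit tree: node k has children at 2k and 2k+1, so
# the whole tree lives in one array with no child pointers at all.
def eytzinger(sorted_keys):
    n = len(sorted_keys)
    e = [None] * (n + 1)          # 1-based; e[0] unused
    it = iter(sorted_keys)
    def fill(k):
        if k <= n:
            fill(2 * k)           # left subtree gets the smaller keys
            e[k] = next(it)       # in-order visit assigns sorted order
            fill(2 * k + 1)       # right subtree gets the larger keys
    fill(1)
    return e

def search(e, x):
    k = 1
    while k < len(e):
        if e[k] == x:
            return True
        k = 2 * k + (1 if e[k] < x else 0)   # branch-free style descent
    return False

e = eytzinger([2, 3, 5, 7, 11, 13, 17])
print(search(e, 11))   # True
print(search(e, 4))    # False
```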

Radix trees with optimizations on word sizes (Patricia tries) also
perform much better, e.g. Judy arrays or the HAT-trie.
A good HAT-trie is as fast as a proper hash table, esp. at smaller sizes.

Some links:
search for "Cache Oblivious Search Tree"; nice maps in:
https://www.cs.utexas.edu/~pingali/CS395T/2013fa/lectures/MemoryOptimizations_2013.pdf


Reini Urban
[email protected]



> On Apr 30, 2016, at 5:31 AM, Mev412 via Houston <[email protected]> wrote:
> 
> The conversation at the last meeting sparked my interest, so I 
> implemented a file-based binary search tree. 
> 
> https://github.com/despertargz/tree-binary-search-file/blob/master/lib/Tree/Binary/Search/File.pm
> 
> Build the file:
> my $file = Tree::Binary::Search::File->new("/tmp/test-bst");
> $file->write_file({ test => "blah" });
> 
> Reading values:
> my $file = Tree::Binary::Search::File->new("/tmp/test-bst");
> my $value = $file->get("test");
> 
> It performed well against a file-based linear search. You can see how 
> the linear search time doubles as the record count doubles. I haven't 
> measured how close to O(log n) it is, but it appears to do well; it 
> barely flinches when going from 1 to 2 million records.
> 
> Time is seconds to locate single record (worst-case-scenario)
> 
> # of records, binary-search-tree, linear
> 1024, 4.48226928710938e-05, 0.000963926315307617
> 2048, 4.31537628173828e-05, 0.00278782844543457
> 4096, 3.38554382324219e-05, 0.00162196159362793
> 8192, 5.07831573486328e-05, 0.0121698379516602
> 16384, 4.60147857666016e-05, 0.0115268230438232
> 32768, 6.58035278320312e-05, 0.0142660140991211
> 65536, 0.000729084014892578, 0.0285739898681641
> 131072, 0.00218009948730469, 0.0539009571075439
> 262144, 0.00141692161560059, 0.1079261302948
> 524288, 0.0019831657409668, 0.214764833450317
> 1048576, 0.00240302085876465, 0.434930086135864
> 2097152, 0.00240802764892578, 0.875269889831543
> 
> The header format is
> [left-node position][right-node position][value length][key][value]
> 
> It currently uses a static key size, so it can read the key along with 
> the rest of the header. This takes more disk space but should be faster 
> than an extra read. If the file is naturally buffered, though, the extra 
> read may not incur a performance penalty, so I'll have to benchmark.
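
Assuming little-endian 32-bit fields (Perl's pack "V") and a KEY_SIZE of
12 bytes (my guess; the module's actual constant may differ), the header
described above can be sketched in Python with struct:

```python
import struct

# Node header per the described format, with the key padded to KEY_SIZE:
#   [left u32][right u32][value-length u32][key bytes]
KEY_SIZE = 12                                   # assumption, not the module's value
HEADER = struct.Struct("<III%ds" % KEY_SIZE)    # "<I" == Perl pack "V"

def pack_node(left, right, key, value):
    """Serialize one node: fixed-size header followed by the value bytes."""
    padded = key.encode().ljust(KEY_SIZE, b"\0")
    return HEADER.pack(left, right, len(value), padded) + value

def unpack_header(buf):
    """Inverse of the header part of pack_node."""
    left, right, vlen, key = HEADER.unpack(buf[:HEADER.size])
    return left, right, vlen, key.rstrip(b"\0").decode()

rec = pack_node(0, 64, "test", b"blah")
print(unpack_header(rec))   # (0, 64, 4, 'test')
```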
> 
> This is the main search logic
> 
>     # Read one fixed-size node header at the current file position.
>     my $header;
>     read $fh, $header, $HEADER_SIZE;
> 
>     # Header layout: [left u32][right u32][value-length u32][key]
>     my $file_key = substr($header, 12, $KEY_SIZE);
>     my $val_len  = unpack("V", substr($header, 8, 4));
>     my $right    = unpack("V", substr($header, 4, 4));
>     my $left     = unpack("V", substr($header, 0, 4));
> 
>     my $comp = $key cmp $file_key;
> 
>     if ($comp == 0) {
>         # Found it: the value follows the header directly.
>         my $val;
>         read $fh, $val, $val_len;
>         return $val;
>     }
>     elsif ($comp < 0) {
>         # Key sorts before this node: descend left; 0 means no child.
>         return undef if $left == 0;
> 
>         return $self->find_key($key, $left);
>     }
>     else {
>         # Key sorts after this node: descend right.
>         return undef if $right == 0;
> 
>         return $self->find_key($key, $right);
>     }
> 
> Writing the file sorts the key/value pairs and builds a BST; while 
> building the BST, a 'flat' list of the nodes is built along with the 
> positions of their left and right children. Recording the position of 
> each node itself made writing the file easier; this list is then fed to 
> a method which writes each node to the file. 
> 
> Writing the file is not memory-efficient, as it builds the BST in 
> memory for simplicity, though this cost is only incurred once when the 
> file is written. Ideally it could both insert into and rebalance the 
> file-based tree in place, so I'll have to look into some ways to do that. 
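
One way to sketch that flat layout (my own Python illustration, not the
module's actual code): recurse over the sorted keys, hand out byte
offsets as nodes are visited, and record each node's offset together with
its children's offsets, with 0 meaning "no child" as in the module. The
record size is fixed here for simplicity; real records would vary with
the value length.

```python
# Build the flat node list for a balanced BST without keeping linked nodes.
REC_SIZE = 32                        # assumed fixed record size

def layout(sorted_keys):
    nodes = []                       # (offset, left, right, key)
    next_off = [REC_SIZE]            # reserve 0 here so it can mean "no
                                     # child" (a convention of this sketch)
    def place(lo, hi):
        if lo > hi:
            return 0                 # no child
        off = next_off[0]
        next_off[0] += REC_SIZE
        mid = (lo + hi) // 2         # midpoint keeps the tree balanced
        left = place(lo, mid - 1)
        right = place(mid + 1, hi)
        nodes.append((off, left, right, sorted_keys[mid]))
        return off
    place(0, len(sorted_keys) - 1)
    return sorted(nodes)             # file order: ascending offsets

for off, l, r, key in layout(["a", "b", "c", "d", "e"]):
    print(off, l, r, key)            # root "c" comes first, at offset 32
```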
> 
> Another consideration would be storing all the values at the end of the 
> file so the headers run sequentially. Especially if the values are long, 
> this could improve cache hits / buffering.
> 
> It's a work in progress: I need to make some methods private, make the 
> key size configurable, and add documentation and tests; then I might see 
> if I can upload it to CPAN. 
> 
> Anyways, just wanted to share. Let me know what you think. Always enjoy the 
> talks and the technical discussions that ensue :)
> 
> 
> Best Regards,
> Christopher Mevissen
> 
> _______________________________________________
> Houston mailing list
> [email protected]
> http://mail.pm.org/mailman/listinfo/houston
> Website: http://houston.pm.org/
