> As a rule of thumb how large an array are you dealing with where managing
> them like this starts to matter?
>
> I'm thinking thousands+ of elements but maybe it's a noticeable bump on
> smaller ones?

Hey Kirk, I haven't the faintest idea. (But check the bottom of this
message for more details.) It's going to depend on what you're doing, but
that may not be the point. To set the scene, I'm looking at 4D's C_OBJECT
and related commands these days. I'm trying to sort through the best
options available for various situations. One big limitation in these
commands is that you can't address array elements by position (relative or
absolute) - the only way to get at an individual element is to pull the whole
array out with OB GET ARRAY. I don't actually know if this is a
real-world problem, but it bugs me. (I haven't run speed tests.) I'm just
experimenting.

<fair_warning>I go into full-on nerd mode below. There are some good points
and a nice twist on the available object-access strategies in 4D. But,
well, I'm going full nerd.</fair_warning>

So what does this have to do with binary search? Well, I'm dealing with
items with some unique property - or perhaps two unique properties. That
means that some kind of index can be built, and that index can be ordered.
I asked about techniques on the list and three came up. And remember, I'm
dealing with unique values - my example uses values that aren't likely to
be unique in real life...but they're easy to read.

1) Build a parallel array with the index values. So, you store

  [Anderson]
  [Berkowitz]
  [Fredrickson]

...and those three elements match up with larger objects stored in an
object array. If I want to find "Anderson", I look at the tiny text array
and then know if a) Anderson is defined at all and b) where in the big
object array to look for it.
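If it helps to see the shape of idea #1 outside of 4D, here's a tiny Python
sketch (an analogy only - not 4D code, and the names and data are made up):

```python
# Idea #1: a small parallel index array alongside an array of big objects.
index = ["Anderson", "Berkowitz", "Fredrickson"]   # cheap text index
objects = [
    {"last_name": "Anderson", "city": "Tucson"},
    {"last_name": "Berkowitz", "city": "Boston"},
    {"last_name": "Fredrickson", "city": "Fargo"},
]

def find_by_index(name):
    """Scan the tiny index; its position maps straight into the big array."""
    if name in index:
        return objects[index.index(name)]
    return None  # definitive "not there" without touching the object array
```

The point is the position: the index array and the object array are kept in
the same order, so one lookup in the small array locates the big object.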

2) Another idea: Skip the hassle and space of storing a secondary array and
just load up the object array and iterate over it. The downsides here are
obvious:

-- If an item doesn't exist, you have to pay the maximum price: Load the
full object array and scan all of its items. With the index array, all you
need to do is scan the text array. Even the full price (scan all items) is
smaller, and you still get a definitive answer. At this point, you can also
do a binary search and (generally) optimize the lookup.

3) The third idea, from Rob and Justin, is to use the unique value as a key
name. This is a pretty awesome idea for 4D, as it turns out. Unfortunately,
it's not an obvious one, given the names of the commands, and it makes some
specific cases of processing the JSON a bit more difficult. (I'd guess that
most people won't run into this situation, and it isn't a flaw of any
kind.) Imagine that you're storing product definitions with a product code:

{
   "products":[
      {
         "id":"ABC",
         "price":1.23
      },
      {
         "id":"DEF",
         "price":4.56
      }
   ]
}

This is just made-up JSON I've typed in by hand, but hopefully you get the
idea. Imagine that there are a stack of properties, not just a price. So, a
big object. Here's the version where the ID is used as the key:

{
   "ABC":{"price":1.23},
   "DEF":{"price":4.56}
}

Again, made-up JSON and imagine that there are a bunch of properties. The
deal here is that you can get 4D to grab items using the ID:

$product:=OB Get($object;"ABC")

You don't have to iterate over everything (not that that's straightforward
to start with in 4D), and you don't have to put everything into an
extractable array...and then pull the whole array out.

So, #3 is kind of a kick-ass solution in the 4D-specific universe. (With a
complete command set, you would not necessarily end up doing things this
way.) Oh, the reason this is so good in 4D is because of how object access
works *inside* of 4D. How does it work? All we know is that it's a hash
table - nothing more. There are a lot of ways to deal with the specifics of
implementing a hash table, and they have a great deal in common with the
tools and trade-offs used in building a variety of index operations. LR has
worked with these sorts of questions for a very, very long time and is
obviously very good at it by now. So, take it as read that the underlying
hash table implementation behind the C_OBJECT and object field commands is
first rate. That's why
random-access-by-using-the-unique-value-as-an-actual-key kicks ass: the
keys are supported by an underlying data structure that's optimized for
random access to unordered values.
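Here's the same idea sketched with a Python dict, which happens to be
hash-backed too - an analogy, not 4D code, and the data is invented:

```python
# Idea #3: the unique value IS the key, so lookup is one hash probe.
products = {
    "ABC": {"price": 1.23},
    "DEF": {"price": 4.56},
}

# No iteration, no extracted array - the hash table does the work:
price = products["ABC"]["price"]
missing = products.get("XYZ")  # None: a definitive "not there", still no scan
```

Notice that the "not found" case is just as cheap as the "found" case, which
is exactly what a sequential scan can't give you.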

But there are a couple of limits on the approach above. One is that you
only get the one index. If you need to search on a couple of properties, it
doesn't help. Also, what if you have a good reason to keep all of the data
in a single large object? That's good for serialization, archiving, and
transmission (as a parameter, record, or message of any sort).

So what about combining #1 (a supporting array of unique values) and #3
(the unique value as the key of a key:value pair, as in
unique_value:{definition})? Sure you can. Take this JSON as an example
(again, not generated with code, so apologies for glitches):

{
   "person_id_index":"[1,3,5,8,28]",
   "person_last_name_index":"[\"Leavens\",\"Smith\",\"Carr\",\"Harris\",\"Stewart\"]",
   "person_definitions":[
      {
         "id":1,
         "last_name":"Leavens",
         "first_name":"Justin",
         "country":"USA"
      },
      {
         "id":3,
         "last_name":"Smith",
         "first_name":"Cannon",
         "country":"Canada"
      },
      {
         "id":5,
         "last_name":"Carr",
         "first_name":"Justin",
         "country":"Australia"
      },
      {
         "id":8,
         "last_name":"Harris",
         "first_name":"Welsh",
         "country":"USA"
      },
      {
         "id":28,
         "last_name":"Stewart",
         "first_name":"Wayne",
         "country":"Australia"
      }
   ]
}

You've got five people with a few attributes. Pretend that ID is unique and
that last name is also unique. The objects are ordered by ID. What then
are person_last_name_index and person_id_index? They're indexes into the
objects. Let's say you want to find out if the person with ID 28 is in the
array of people definition objects. Naive approach:

ARRAY OBJECT($people_ao;0)
OB GET ARRAY($object;"person_definitions";$people_ao)
$match:=0
For ($index;1;Size of array($people_ao))
   If (OB Get($people_ao{$index};"id")=28)
      $match:=$index
      $index:=Size of array($people_ao)+1  // Break
   End if
End for

That's off the top of my head (I'm in BBEdit now), so apologies for
glitches.

There's nothing wrong with this approach, but what if you have tens or
hundreds of thousands of items? Justin Leavens had such a situation in the
real world, which led him to the idea of using the main key as an actual
key. Okay, back to the example. What help do we get from person_id_index in
this case?

ARRAY LONGINT($ids_al;0)
JSON PARSE ARRAY(OB Get($object;"person_id_index");$ids_al)

....assuming that I got the syntax right, this takes a simple JSON text
block:

   "person_id_index":"[1,3,5,8,28]",

and turns it into a longint 4D array:

[1]
[3]
[5]
[8]
[28]

It's ordered, so we can use a binary search to find out if the ID we want
is present or not, and, if present, where exactly. How does this help?
Imagine you've got 100,000 objects and the ID you want to match is not
found. If you have to load and scan the entire array of objects
*sequentially*, that's a lot of work. A non-matching element is always
worst-case time because you have to scan the entire array from start to
finish to prove that your item isn't there. No binary search is possible
because there isn't any order. (They're just a bunch of objects.) The
use-the-key-as-the-key trick works in 4D because it's got a
well-implemented hash table underneath it all. Okay, now imagine that
instead of searching the 100,000 objects, you can scan 100,000 longints.
What does that buy you? A lot. In the worst-case situation before, you
scanned the entire object array. In the new case:

* You do a binary search on the longint array and don't ever load the
object array, let alone scan the 100K items fruitlessly. Huge savings! How
huge? I've got a table below, but scanning 1,000,000 items takes a max of
20 comparisons for a binary search instead of 1,000,000 for a sequential
scan. Big savings.

* If you do need to load up the whole array and retrieve the object, you
don't need to find its position - you've already got it. That's another big
win. No need to scan the contents of the items until you find what you
need. The position in the index array = the position in the larger object
array that you're 'indexing'. So, again, a big win at a small cost.

* You've had to pay a price, loading the person_id_index and scanning it.
But, again, that's not terribly expensive relative to the benefit, in this
case. In the real world, you might want a parallel array that doesn't need
constant rehydration before use. If you eventually need to inject the array
into your object for storage, etc., you could do that. (As a related
example, I've got an ErrorStack object where I keep an ARRAY OBJECT going
to make counting, pushing, and popping easy. I only convert it to a single
big object when I need it serialized.)
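To make the search-the-small-index idea concrete, here's a Python sketch
(again an analogy, not 4D - the stdlib bisect module does the binary
search, and the ids list mirrors person_id_index from the example):

```python
import bisect

# The small, sorted index array - person_id_index from the example.
ids = [1, 3, 5, 8, 28]

def index_of(target):
    """Binary-search ids; return the 0-based position, or -1 if absent.

    A hit tells you exactly where to look in the big object array;
    a miss means you never have to load the object array at all.
    """
    pos = bisect.bisect_left(ids, target)
    if pos < len(ids) and ids[pos] == target:
        return pos
    return -1
```

bisect_left finds the leftmost insertion point in O(log n) comparisons, so
`index_of(28)` lands on the last slot and `index_of(2)` comes back -1
without a full scan.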

Now, in the example so far, you'd still be better off setting the ID as
the key name because of 4D's underlying hash table system. (You need to use
a legal JSON name, so it might be "ID_1":{"price":1.23}.) But what about
searching by a second key? My last name example is terrible, but imagine a
product with a public SKU and a different internal ID. Not that rare a
situation in a lot of tables. Anyway, this is where my example uses

"person_last_name_index":"[\"Leavens\",\"Smith\",\"Carr\",\"Harris\",\"Stewart\"]",

These values are ordered in the same way as the base objects, so they're
not ordered in a useful way by themselves. But, their order does allow you
to relatively inexpensively recover the position of the larger object. In
this case, I'd be using Find in array...but you could even make a small
object structure using technique #3 described earlier.
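Sketched in Python (an analogy again, with invented data mirroring the
example above), the second index works the same way - a position found in
the small index carries over to the object array:

```python
# A second index, sharing the ordering of the object array.
last_name_index = ["Leavens", "Smith", "Carr", "Harris", "Stewart"]
people = [
    {"id": 1, "last_name": "Leavens", "country": "USA"},
    {"id": 3, "last_name": "Smith", "country": "Canada"},
    {"id": 5, "last_name": "Carr", "country": "Australia"},
    {"id": 8, "last_name": "Harris", "country": "USA"},
    {"id": 28, "last_name": "Stewart", "country": "Australia"},
]

def find_by_last_name(name):
    """A Find in array-style scan of the small index; the position it
    returns is also the position in the larger object array."""
    if name in last_name_index:
        return people[last_name_index.index(name)]
    return None
```

Since this index isn't sorted (it follows the ID ordering), it's a
sequential scan rather than a binary search - but it's still a scan of
small text values instead of big objects.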

So that's what I'm thinking about when it comes to binary searches. I'm
using it for something like the person_id_index because, well, why not?

Speaking of why not, your original point was about the break-even point for
using a binary search instead of find in array. I could say "it depends",
but that's a poor answer, in this case. First take the extreme cases.
You've got an array of 1,000,000 unique items:

* Your target value is in element 1.
* Your target value is in element 1,000,000.
* Your target value is not in the array.

With a find in array, you scan the array from 1 to 1,000,000, so:

* Your target value is in element 1: 1 comparison. You win!
* Your target value is in element 1,000,000: 1,000,000 comparisons. You
lose, bigly.
* Your target value is not in the array: 1,000,000 comparisons. You lose,
bigly.

With a binary search, the number of required comparisons is *massively*
reduced. I had to look up the formula for calculating the maximum number
based on the size of the sorted array, and it is log2(n), rounded up to the
next whole integer. Here's how this breaks down, according to the table I
found in "Explorations in Computing: An Introduction to Computer Science
and Python" (I just did a Google search and it came up in Google Books):

             Size    Max comparisons
                2     1
                4     2
                8     3
               16     4
            1,000    10
            2,000    11
            4,000    12
        1,000,000    20
    1,000,000,000    30
1,000,000,000,000    40
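If you want to compute that ceiling-of-log2 figure yourself, it's a
one-liner (Python here, just for illustration):

```python
import math

def max_comparisons(n):
    """Worst-case comparisons for a binary search over n sorted items."""
    return math.ceil(math.log2(n))
```

Plug in the table's sizes and you get the same numbers back - e.g.
1,000,000 items tops out at 20 comparisons.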

That's what you call "scales well." If you've got a situation where you can
benefit from binary search, why wouldn't you use it? It's just
substantially better and doesn't take meaningfully more work. For my cases,
I'd use it anywhere I've got an array of unique values where I'm
controlling insertion. At that point, I'm potentially getting a gain at
really no cost. Where will it show up as a benefit in the real world? Now
that totally depends. Sometimes it won't - it depends on your
data/platform/situation. But if you're just talking algorithms, binary
search is superior.

Don't get me wrong! I'm not about micro-optimizations and I'm not about
going to extra work to optimize "just in case." Micro-optimization and
optimize-in-advance have gotten me into *way* more trouble down the years
than not optimizing something and having to go back. But when you get to
pick a design and one approach is broadly superior without costing extra
work, why not? Just seems sensible.

Warning: Notice from the earlier example that you can construct edge cases
where a sequential scan of an array takes fewer comparisons than a binary
search. So what. It's a stupid example. Seriously. I've seen way, way, way
too many 4D demos down the years that show "massive" performance gains
that only worked in minority cases. By definition these are the cases of
least importance! If anyone wants to optimize something, make damn sure
they can prove it works more of the time, not only in weird (sometimes
entirely artificial) cases. This is why, for example, I've long been unable
to reproduce virtually any of the "optimizations" I've heard about in 4D.
They only work in edge cases. LR and the team are doing a great job of
optimizing server operations and I don't see any reason to second-guess
them.
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[email protected]
**********************************************************************
