Hello all, I've been interested in this topic for some time but have never
taken the time to run any tests. I don't have the time now (for sure), but
I took some anyway. What I did was grab a huge unique word file, clear out
words that are obviously illegal JSON key names and tried doing lookups
three ways:

* Sequential Find in array
* Binary Find in sorted array
* Object lookup

For reference, here's a link to the original file full of words:
https://github.com/dwyl/english-words/blob/master/words.zip

I tried not to bias the tests as what I'd like are useful results. Still, I
didn't test a whole lot of different ways and bias is nearly impossible to
avoid. Even if my test is totally fair, there's no way that it's complete -
results always depend on the data under test. This is part of why
algorithms are described using big O terminology. You have a way of talking
about the performance envelope around the algorithm under various sorts of
conditions. (That's a terrible description, but so be it.) As an example,
it's very easy to compare s sequential find in array with a binary search.
There are only a couple of cases where sequential is faster, no matter the
size of the array. With 4D's object lookups, we just don't know.  Even if
they are a hash table (likely but not confirmed), this doesn't tell us
much. (Hash tables have a whole lot of components in their implementation,
some of which can behave in weird ways, depending on your data set+hashing
function. It also matters what you use to find actual values, not just hash
bins.)

Anyway, here are a set of results in a compiled system with ~465,000
words/keys:

Words: 466,474
Tests: 10,000
Sequential: 107,777
Binary: 153
Object: 9

The three times are in milliseconds. As in, "Searching for 10,000 different
words in an array of 466,474 unique words took a little over 1/10th of a
second using a sorted array." That's roughly 1/2 of a blink. (Not kidding.)

Comments and take-aways:

* Binary search is great.

* Object search is great.

* Sequential search is not so great, but it still only took about 11
seconds.

* I noticed that setting up the sorted array took no time and that setting
up the object took time that I could feel. I didn't do timing results on
this. But if it's true, the *overall* time (including setup) for the object
was *unfavorable.*

Conclusion: I'll use objects when I need them and sorted arrays when I need
them. The performance difference is too small to be a factor, it will come
down to other properties of these data structures.

++ To Justin on the whole 'use the lookup value as the key' tip (Rob has
mentioned this too.) I sue that all of the time in objects, it's a really
excellent practice.

If anyone wants to re-run the tests or check my code for logic errors, dumb
errors, bias, etc., here's the code with comments:

If (False)
 // https://github.com/dwyl/english-words/blob/master/words.zip
 // Imported into a new table in 4D.
 // Cleared ones starting with numbers or punctuation.
 // 466,475 words left.
End if

  //----------------------------------------------------------------
  // Setup
  //----------------------------------------------------------------
ALL RECORDS([word])
ORDER BY([word];[word]word) // 4D indexed sort

ARRAY TEXT($words_at;0)
SELECTION TO ARRAY([word]word;$words_at)  // Sorted arraa of 464K+ words

C_OBJECT($words_object)
C_LONGINT($words_count)
C_LONGINT($word_index)
$words_object:=JSON Parse("{}")
$words_count:=Size of array($words_at)

For ($word_index;1;$words_count)
C_TEXT($word)
$word:=$words_at{$word_index}
 // {"hello":"HELLO"} - no reason for the lower/upper other than to make it
read in the Debugger.
OB SET($words_object;Lowercase($word);Uppercase($word))
End for

  // Let's build an array of random words from the main array of words.
C_LONGINT($test_words_count)
$test_words_count:=10000  // Note: Cannot be larger than the $words_count


  // Hmmm. Not getting a good distribution of indexes from Random.
  // Instead, I'll pick words from different positions along the array.

ARRAY TEXT($test_words_at;$test_words_count)
$test_words_at{1}:=$words_at{1}  // Best case for a sequential scan
$test_words_at{2}:=$words_at{$words_count}  // Worst case for a sequential
scan

C_LONGINT($interval)
  // Now we want to fill in the rest of the test array.
  // The selected words are grabbed from even intervals along the array.

  // The speed difference for a sequential search should be linear.

  // The speed difference for a binary search should be very small amongst
words.
  // It should take up to about ~18 reads to find the word.

  // The speed difference for the object? No clue, we don't know how
they're implemented.
  // With a fast hash and a large hash table, it could be very quick. Hard
to say.
$interval:=$words_count\$test_words_count

C_LONGINT($test_word_index)
For ($test_word_index;3;$test_words_count)  // Start at 3 because we just
filled in 1 & 2 by hand.
C_LONGINT($word_index)
$word_index:=$interval*$test_word_index
$test_words_at{$test_word_index}:=$words_at{$word_index}
End for

  //----------------------------------------------------------------
  // Tests
  //----------------------------------------------------------------
  // And we're good to go.

  //---------------------------
  // Sequential search in array
  //---------------------------
C_LONGINT($sequential_start)
$sequential_start:=Milliseconds

C_LONGINT($test_word_index)
For ($test_word_index;1;$test_words_count)
C_TEXT($word)
C_LONGINT($match_index)
$word:=$test_words_at{$test_word_index}
$match_index:=Find in array($words_at;$word)
End for

C_LONGINT($sequential_elapsed)
$sequential_elapsed:=Milliseconds-$sequential_start

  //---------------------------
  // Binary search in array
  //---------------------------
C_LONGINT($binary_start)
$binary_start:=Milliseconds

C_LONGINT($test_word_index)
For ($test_word_index;1;$test_words_count)
C_TEXT($word)
C_LONGINT($match_index)
C_BOOLEAN($found)
$word:=$test_words_at{$test_word_index}
$found:=Find in sorted array($words_at;$word;>;$match_index)
End for

C_LONGINT($binary_elapsed)
$binary_elapsed:=Milliseconds-$binary_start

  //---------------------------
  // Object lookup
  //---------------------------
C_LONGINT($object_start)
$object_start:=Milliseconds

C_LONGINT($test_word_index)
For ($test_word_index;1;$test_words_count)
C_TEXT($word)
C_TEXT($word_returned)
$word:=$test_words_at{$test_word_index}
$word_returned:=OB Get($words_object;$word)
End for

C_LONGINT($object_elapsed)
$object_elapsed:=Milliseconds-$object_start

  //---------------------------
  // Results summary
  //---------------------------
C_TEXT($tab)
C_TEXT($cr)
$tab:=Char(Tab)
$cr:=Char(Carriage return)

C_TEXT($results_text)
$results_text:=""
$results_text:=$results_text+"Words:"+$tab+String($words_count;"###,###,###,##0")+$cr
$results_text:=$results_text+"Tests:"+$tab+String($test_words_count;"###,###,###,##0")+$cr
$results_text:=$results_text+"Sequential:"+$tab+String($sequential_elapsed;"###,###,###,##0")+$cr
$results_text:=$results_text+"Binary:"+$tab+String($binary_elapsed;"###,###,###,##0")+$cr
$results_text:=$results_text+"Object:"+$tab+String($object_elapsed;"###,###,###,##0")+$cr

SET TEXT TO PASTEBOARD($results_text)

ALERT($results_text)
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:4d_tech-unsubscr...@lists.4d.com
**********************************************************************

Reply via email to