[REBOL] Essay on Associative Data Stores

joel . neely Sun, 17 Sep 2000 12:49:42 -0700
Short version:

1.  A general-purpose Associative Data Store (ADS) is not
    directly implemented by  append  and  select  for a
variety of reasons.

1.1.  Keys and values are not distinguised.
1.2.  Storage management/economy require extra code to
      handle replacement of values for existing keys.
1.3.  The  block!  type doesn't appear to be supported as a key
      in a manner consistent with other data types.

2.  The path notation does not support a general purpose ADS
    implementation for a variety of reasons.

2.1.  It restricts the range of possible key types.
2.2.  It does not support the addition of values, requiring
      distinct syntax for addition of new associations versus
      updating or retrieving existing associations.
2.2.  The overloading of "/" in REBOL provides a potentially
      rich source of bugs (or at least confusion).

Conclusion:

Applications that need a general-purpose ADS will require
additional code to provide that functionality.  This note is
already too long; I'll provide a (further-modified) possible
implementation in a separate email later.

Note:

I'm not saying that blocks and the native operations on blocks
are "bad", just that they aren't sufficient for the general
case.

-jn-


Long-winded version:

Greetings, fellow Rebolutionaries!

There has been a good bit of discussion regarding the concept
and possible REBOL implementation, of an "Associative Data
Store" (which I'll abbreviate hereafter as ADS, aka "associative
array", "associative list", "a-list", "lookup table",
"dictionary", etc.)  From my perspective, there are some issues
in this discussion that have become confusingly entangled; as
I am a bear of small brain, only able to think of one thing at
a time, I'd like to disentangle them.

Let me begin with a definition that lets me address the
different features I believe a completely general-purpose ADS
would possess.

An ADS is a mechanism for storing and retrieving associations
between a collection of (distinct!) "keys" and their
corresponding "value"s.  It can be visualized as a table of
two columns: each row holds an association between the key in
the left-hand column and the value in the right-hand column.

An association is added/updated by ensuring that the desired
key is in a left-hand entry (inserting it if it is not already
present) and that the corresponding right-hand entry contains
the desired value (inserting or replacing, depending on whether
the key was already present).

An association may be retrieved for a given "target" (if the
target occurs among the keys) by locating the target in the
left-hand column and returning the value in the corresponding
position of the right-hand column.  If the target does not
occur in the left-hand column, we might simply declare an error
or we might return a default "not found" value -- such as  none
in REBOL.

Note that I've said nothing about what types of data may serve
as keys or values.  An obvious consequence of this description 
is that ANY data type whatsoever can be used as a value, and any
data type that supports simple (in)equality testing can be used
as a key.

[At this point, let me apologize if I have insulted anybody's
intelligence.  I certainly didn't intend to; I just wanted to
make my assumptions about an ADS completely explicit.]

Now let's look at some of the categories of suggestions that
have been discussed on the list in response to the question,
"How can I implement an ADS in REBOL?"


1.  "Why not just use block/append/find/select?"

We can see several reasons why block/append/find/select do not
provide an implementation of the general-purpose ADS described
above.  (I'm not saying here that they can't be USED IN an
implementation of ADS, just that they don't support all of the
stated requirements without additional coding.)

1.1.  Keys and values are not distinguished by  select , which
      allows "false positive" results to be obtained.  E.g.:

    >> append blk ["a" "b"]
    == ["a" "b"]
    >> append blk ["e" "f"]
    == ["a" "b" "e" "f"]
    >> append blk ["i" "j"]
    == ["a" "b" "e" "f" "i" "j"]
    >> append blk ["o" "p"]
    == ["a" "b" "e" "f" "i" "j" "o" "p"]
    >> append blk ["u" "v"]
    == ["a" "b" "e" "f" "i" "j" "o" "p" "u" "v"]

In other words,  blk  is intended as an ADS associating each
vowel with the following consonant.  This sometimes works:

    >> select blk "i"
    == "j"

and sometimes doesn't:

    >> select blk "b"
    == "e"

because  select  doesn't understand which entries were intended
as keys and which as values.  In addition to the above false
positive result, this problem can also mask the existance of
a legitimate association, as in:

    >> blk2: [14 79  12 86  53 92  79 104]
    == [14 79 12 86 53 92 79 104]
    >> select blk2 79
    == 12

1.2.  Simply appending a new pair on the end of a block is not
      sufficient for defining an association, as it doesn't
deal with the possibility that the key already exists (either
as a key or as a value...)  In such cases, the newly-added
pair will never be found, unless additional code is written to
deal with this situation.  Of course, this could be sidestepped
by using  insert  to place the new pair at the head of the
block.  This would provide functionally correct behavior (if we
ignore 1.1 for a moment) but would have the undesirable side
effect of consuming progressively more memory as new entries
were added to replace associations on previous keys.

1.3.  It appears that  find  and  select  don't treat  block!
      data consistenly with other data types when used as keys
and search targets, even though equality is defined for  block!
data.  Consider the function:

    >> test-select: func [b [block!] k [any-type!] v [any-type!]] [
    [    append b reduce [k v]
    [    select b k
    [    ]
    >> blk: []
    == []

A naive analysis would say, "We're going to put k and v into the
block and then look up the value associated with k.  Therefore,
(except for the problem discussed under 1.1) we should always
get back v."  Let's try it:

    >> test-select blk "a" "b"
    == "b"
    >> test-select blk 2 14
    == 14
    >> test-select blk true 1.2.3.4
    == 1.2.3.4
    >> test-select blk 1.2.3.5 "www.gorp.org"
    == "www.gorp.org"

So far, so good.  Now...

    >> test-select blk [2 3] 6
    == none

Hmmm.  That didn't work.  However it gets more puzzling (to me,
at least) as we continue to poke around!

    >> blk
    == ["a" "b" 2 14 true 1.2.3.4 1.2.3.5 "www.gorp.org" [2 3] 6]

The "[2 3]" entry seems to be present, which we can confirm by
the aliasing bug:

    >> select blk "www.gorp.org"
    == [2 3]

but it doesn't seem to be recognized by  select  as a key

    >> select blk [2 3]
    == none
    >> select blk reduce [2 3]
    == none
    >> select blk reduce [[2 3]]
    == 6

Now THAT was a surprise!  I'll be glad to hear any explanations
anyone can offer for that last case!  Whatever the explantion
may be, it appears that  select  is NOT using the normal
meaning of equality, as demonstrated by:

    >> pick blk 9
    == [2 3]
    >> [2 3] = pick blk 9
    == true
    >> select blk [2 3]
    == none


2.  "Why not use the built-in path notation?"

The path notation is even less suitable for a general purpose
ADS implementation for a variety of reasons.

2.1.  It restricts the range of possible key types.

Let's start small, with our vowels example:

    >> blk: ["a" "b" "e" "f" "i" "j" "o" "p" "u" "v"]
    == ["a" "b" "e" "f" "i" "j" "o" "p" "u" "v"]

To use pathing as an access notation, we are FORCED to create
additional words, instead of being able to access data via
literal keys.

    >> blk/a
    ** Script Error: Invalid path value: a.
    ** Where: blk/a
    >> blk/:("a")
    ** Syntax Error: Invalid word-get -- :.
    ** Where: (line 1) blk/:("a")

That didn't work because there isn't a  word!  value of  a
in the dictionary block, and because the colon prefix isn't an
operator.  Since our keys here are  string!  data, we have to
create a word (variable) to hold the key, and use the associated
get-word!  as a path element, as in:

    >> key: "a"
    == "a"
    >> blk/:key
    == "b"

However, if our dictionary starts to get more interesting, we
get some more surprises!

    >> blk: [
    [    "a" "b" "e" "f" "i" "j" "o" "p" "u" "v"
    [    2 1  3 2  5 3  7 8  11 55
    [    true "yep"  false "nope"
    [    1.2.3.4  "castor.romefounders.org"
    [    1.2.3.5  "pollux.romefounders.org"
    [    ]
    == [
        "a" "b" "e" "f" "i" "j" "o" "p" "u" "v"
        2 1 3 2 5 3 7 8 11 55
        true "yep" false "nope"
        1.2.3.4 "castor.romef...

Now we have a more interesting variety of types for keys and
associated values.  Does the create-a-word-and-use-its-get-word
trick still work?

    >> key: 1.2.3.4
    == 1.2.3.4
    >> blk/:key
    == "castor.romefounders.org"

It would appear so, until we try

    >> key: true
    == true
    >> blk/:key
    == "a"

Whoa!  Did I set  key  to what I thought I did?

    >> key
    == true

Yep.  It appears there's another weirdism here!  And it also
bites the other  logic!  value when used as a key.
    
    >> key: false
    == false
    >> blk/:key
    == "b"

Purely speculation on my part, but I'd guess this relates to
what we'll cover in point 2.3 below.

To be fair, I should point out that this example is really due
to TWO issues.  The  logic!  keys are giving false positives,
and the dictionary doesn't actually contain any  logic!  keys.
Consider this:

    >> pick blk 21
    == true
    >> type? pick blk 21
    == word!
    >> true = pick blk 21
    == false
    >> 'true = pick blk 21
    == true

The twenty-first entry in blk (intended as the eleventh key) is
actually a  word!  value, and not a  logic!  value, as  blk  was
initialized with a literal (and therefore unevaluated) block.
However, we can show that this is not what caused the retrieval
failure very simply:

    >> blk: reduce blk
    == [
        "a" "b" "e" "f" "i" "j" "o" "p" "u" "v"
        2 1 3 2 5 3 7 8 11 55 true "yep" false "nope"
        1.2.3.4 "castor.romefounde...
    >> pick blk 21
    == true
    >> type? pick blk 21
    == logic!
    >> true = pick blk 21
    == true
    >> 'true = pick blk 21
    == false
    >> key: true
    == true
    >> blk/:key
    == "a"

So, even after the effort of using a  get-word!  path element,
we have some problems.

2.2.  It does not support the addition of values, requiring
      distinct syntax for addition of new associations versus
updating or retrieving existing associations.

    >> blk: [a "first"  b "second"  c "third"]
    == [a "first" b "second" c "third"]
    >> blk/a
    == "first"
    >> blk/a: "primo"
    == [a "primo" b "second" c "third"]

So far, so good.  However, we can't add new values this way:

    >> blk/d: "fourth"
    ** Script Error: Invalid path value: d.
    ** Where: blk/d: "fourth"

and we can't use "calculated" keys for updating:

    >> key: 'b
    == b
    >> blk/:key
    == "second"
    >> blk/:key: "alternate"
    ** Syntax Error: Invalid word -- :key:.
    ** Where: (line 1) blk/:key: "alternate"

All of these are leading up to the last point on paths.

2.3.  The overloading of "/" in REBOL provides a potentially
      rich source of bugs (or, at the very least, confusion).

The "/" is used for several distinct concepts in REBOL:

- Refinements - a combination boolean argument and switch
  that MAY signal the presence of other arguments.
- Paths - SYMBOLIC selection from within a block, similar to
  (but more fragile than) using  select  with a  word!  key.
- Indexing - POSITIONAL selection from within a block.

The "/" notation breaks referential transparency (which is
very odd in a language billed as "Expression Based".)  For
those not jargon-saturated, referential transparency is just
a complicated way of saying that equal-valued expressions
should be freely interchangeable.  For example, all expressions
evaluating to  2  can be substituted for each other, as in:

    >> tangent 2
    == 3.49207694917477E-2
    >> a: 2
    == 2
    >> tangent a
    == 3.49207694917477E-2
    >> (1 + 1)
    == 2
    >> tangent (1 + 1)
    == 3.49207694917477E-2
    >> b: 1
    == 1
    >> tangent (3 * b - 1)
    == 3.49207694917477E-2

This doesn't work with "/", as demonstrated by:

    >> now
    == 16-Sep-2000/15:13:33-5:00
    >> now/date
    == 16-Sep-2000
    >> jetzt: now
    == 16-Sep-2000/15:13:44-5:00
    >> jetzt/date
    == 16-Sep-2000

So far, so good.  A word referring to a  date!  and a function
that returns a  date!  start off looking similar.  But:

    >> jetzt/date/year
    == 2000
    >> jetzt/date/month
    == 9
    >> now/date/year
    ** Script Error: Too many refinements.
    ** Where: now/date/year
    >> now/date/month
    ** Script Error: Too many refinements.
    ** Where: now/date/month

The slashes after  jetzt  are all interpreted as structure
accessors, but the slashes after  now  are all interpreted as
refinements.  There isn't a notation to say, "Take the value
from this function call and access its components." without
storing it somewhere first.

    >> (now/date)/year
    == /year
    >> yuck: now/date
    == 16-Sep-2000
    >> yuck/year
    == 2000

With respect to the last two uses of "/", it's not always
obvious, even from reading your own source code, which one
you'll get.

    >> blk: [a "first" b "second" c "third" 1 "one" 2 "two" 3 "three"]
    == [a "first" b "second" c "third" 1 "one" 2 "two" 3 "three"]
    >> key: 'a
    == a
    >> blk/:key
    == "first"
    >> key: 1
    == 1
    >> blk/:key
    == a
    >> key: 2
    == 2
    >> blk/:key
    == "first"

If the value of  key  is a word, the slash retrieves a value
by symbolic lookup, but if the value of  key  is numeric, the
slash retrieves a value by position.  Therefore, it cannot be
used as a notational substitute for

    >> select blk 1
    == "one"

Bear in mind that we might have something similar to

    key: some-complicated-function with-some-arguments

which could require significantly more effort to determine
the meaning of

    val: blk/:key

in the general case.

That's (more than) enough for now!

-jn-
[REBOL] Essay on Associative Data Stores

Reply via email to