Re: [lucy-dev] Clownfish Hash API

Marvin Humphrey Sun, 19 Apr 2015 21:37:06 -0700

On Sat, Apr 18, 2015 at 8:39 AM, Nick Wellnhofer <[email protected]> wrote:
> On 13/04/2015 20:35, Nick Wellnhofer wrote:


>> The first one is CLOWNFISH-2 "Create iterator for Hash":
>
>> Then there's CLOWNFISH-7 "String-only keys for Hash":
>
> These issues are resolved now.

Yeeha!  Thank you, Nick!

I missed my chance to work on those because of a busy week at ApacheCon
Austin -- but now Lucy will need to be adapted for these changes, and I'm
happy to pick up that task.

>> If we only support string keys, should Obj#Hash_Sum be removed? If not,
>> hash sums should be of type size_t.
>
> LFReg uses string keys now, so Obj#Hash_Sum could be removed. OTOH, we could
> just keep it without making it public. I don't have a strong opinion on
> this.

I agree that Obj#Hash_Sum should be removed.  Perhaps one day it will
resurface as an abstract method in a `Hashable` interface or something like
that.

>> What's the Find_Key method for? I think it can be removed.
>
> This method only seems useful for testing. I lean towards removing it.

+1

Once upon a time, Find_Key was used for uniquing values in SortFieldWriter.
That's no longer the case as of last July:

    http://s.apache.org/6Ue (link to github.com)

However, this brings up an interesting topic: Aside from Find_Key, Hash
doesn't have have an easy way of differentiating a missing key from a key
mapped to NULL.

There are numerous ways of providing such functionality.  The simplest is to
provide a routine which returns a boolean indicating whether the key is
present:

    name in names    # Python
    name in names    // JavaScript
    exists $names{$name}  # Perl
    names.has_key?(name)  # Ruby
    names.containsKey(name) // Java
    names.ContainsKey(name) // C#

The downside of that idiom is that testing followed by fetching requires
looking up the key twice.

    if (Hash_Has_Key(hash, key)) { // look up the key
        Obj *value = Hash_Fetch(hash, key); // look up the key again
        // ...
    }

But it's such a common API and maps so easily to multiple hosts that providing
it is a no-brainer.

My favorite alternative comes from Go's maps, where a lookup returns a tuple
of the value and a boolean indicating whether the key is present.

    value, isPresent := names[name] // Go

Unfortunately, multiple return values don't work well with C and some other
hosts.

Other techniques include Python throwing KeyErrors (blech), setting a non-null
default value (Python defaultdict, Ruby Hash#default=) on the object, or
adding an extra parameter to Fetch().

    Obj *got = Hash_Fetch(hash, key, default_value);
    if (got != default_value) {
        // ...
    }

I still like `Has_Key` the best.

>> Using unsigned 32-bit values for hash size and capacity is OK in my
>> opinion.

Is that best?  Java has had problems because of array indexes being 32-bit.
Will a 32-bit integer always be large enough?  Will we want to use a size_t or
uint64_t for Clownfish's String, and then use that same type elsewhere for
consistency?

> Another thing: We should implement Clone and Shallow_Copy methods for Hash
> just like the ones in VArray.

This is probably a debate for later, but those method names may not be ideal.
`Clone` is ambiguous as to whether it is deep or shallow.

    http://www.csharp411.com/c-object-clone-wars/
    http://en.wikipedia.org/wiki/Object_copy#Methods_of_copying

I think the reasoning in the docs for Python's `copy` module is compelling:

    https://docs.python.org/3/library/copy.html

    Two problems often exist with deep copy operations that don’t exist with
    shallow copy operations:

    *   Recursive objects (compound objects that, directly or indirectly,
        contain a reference to themselves) may cause a recursive loop.
    *   Because deep copy copies everything it may copy too much, e.g.,
        administrative data structures that should be shared even between
        copies.

That would imply making Clone() shallow, and implementing deep copying as
Deep_Clone().

Marvin Humphrey

Re: [lucy-dev] Clownfish Hash API

Reply via email to