[jira] [Comment Edited] (IGNITE-4011) Automatically compute hash codes for newly built binary objects

Alexander Paschenko (JIRA) Thu, 06 Oct 2016 13:37:32 -0700

    [ 
https://issues.apache.org/jira/browse/IGNITE-4011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553094#comment-15553094
 ]


Alexander Paschenko edited comment on IGNITE-4011 at 10/6/16 8:37 PM:
----------------------------------------------------------------------

All right, the first version of patch for this issue is being tested on TC, and 
therefore it's time to describe the design that has ultimately been implemented 
and showcase the examples of configuration.

h2. Preface

Here are the main ideas:

- Leave the design as simple and clean as possible.
- Make all configuration changes optional. The only users that will need to 
change anything will be those who wish to use new DML features in binary mode, 
and only for keys without classes. For those who don't care about DML or don't 
use binary keys, there'll be nothing to worry about.
- Make possible the cases where no additional coding will be needed from the 
user's side.

Of course, if there's anyone who wanted to use binary classless keys outside of 
DML context, they also will benefit from this change.

h2. API changes

The only configuration/public API related class changed is 
{{CacheKeyConfiguration}}. It has three fields added:

{code:java}
/** Key hashing mode. */
private BinaryKeyHashingMode binHashingMode;

/** Fields to build binary objects' hash code upon. */
private List<String> binHashCodeFields;

/** Class name for hash code resolver to automatically compute hash codes for 
newly built binary objects. */
private String binHashCodeRslvrClsName;
{code}

h2. Hashing mode

The latter two params are meaningful only for certain values of the first 
one, so let's review it first. A new enum, {{BinaryKeyHashingMode}}, has been 
introduced to control hashing of binary classless keys. It's declared as 
follows - I left javadocs intact so that possible options are clear:

{code:java}
/**
 * Mode of generating hash codes for keys created with {@link BinaryObjectBuilder}.
 */
public enum BinaryKeyHashingMode {
    /**
     * Default (also legacy pre 1.8) mode. Use this mode if you use no SQL DML
     * commands (INSERT, UPDATE, DELETE, MERGE), in other words, if you put data
     * to cache NOT via SQL. Choosing this mode has the same effect as omitting
     * mode settings from key configuration altogether.
     */
    DEFAULT,

    /**
     * Generate hash code based on serialized representation of binary object
     * fields, namely, the byte array constructed by {@link BinaryObjectBuilder}.
     * Use this mode if you are NOT planning to retrieve data from cache via
     * ordinary cache methods like {@link IgniteCache#get(Object)},
     * {@link IgniteCache#getAll(Set)}, etc., or if you have no particular classes
     * for keys on either client or server. It's a convenient way to manipulate
     * and retrieve binary data in cache only via full-scale SQL features,
     * with no configuration overhead beyond choosing this mode.
     */
    BYTES_HASH,

    /**
     * Generate hash code based on the list of fields declared in
     * {@link BinaryObjectBuilder} (not in {@link BinaryObject}, as hash code has
     * to be computed <b>before</b> {@link BinaryObject} is fully built).
     * This mode requires that you set
     * {@link CacheKeyConfiguration#binHashCodeFields} for it to work.
     */
    FIELDS_HASH,

    /**
     * Generate hash code arbitrarily based on {@link BinaryObjectBuilder} using
     * the specified class implementing {@link BinaryObjectHashCodeResolver}.
     * This mode requires that you set
     * {@link CacheKeyConfiguration#binHashCodeRslvrClsName} for it to work.
     */
    CUSTOM;
}
{code}

h2. Hashing modes explained

So, there are four options, as discussed on the dev list:
- don't change any behavior
- hash byte array of fields set in builder
- hash particular subset of fields in builder
- provide custom logic to hash field values in builder in arbitrary way

The dev list also suggested that we introduce the interface 
{{BinaryObjectHashCodeResolver}}.

However, in order to keep this interface simple to understand and implement, 
its usage is limited to the last two modes in the above list - fields subset 
hashing and custom hashing - while byte array hashing works without it (as the 
byte array is not a part of binary builder).

Let's focus on the latter two. Correct hashing is of little use without correct 
implementation of {{equals}} - even if we manage to maintain uniqueness of hash 
codes, we have to have mechanism of comparing objects for equality, or 
otherwise we won't be able to retrieve from the cache what we've put there.

The current implementation of {{equals}} in {{BinaryObjectExImpl}} is based on 
contents of the arrays. Therefore, this behavior is unchanged for 
{{BYTES_HASH}} mode - if byte arrays of objects are equal, then their portions 
that correspond to fields are the same as well.
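
To make the byte array approach concrete, here is a minimal self-contained sketch. It does not touch Ignite classes: {{BytesHashSketch}} is a hypothetical name, the arrays stand in for the serialized field portion produced by the builder, and {{Arrays.hashCode}} is just one plausible hash function, not necessarily the one the patch uses.

{code:java}
import java.util.Arrays;

/** Hypothetical stand-in for byte-array based hashing; not part of the patch. */
final class BytesHashSketch {
    /** Hash code derived purely from the serialized bytes of the fields. */
    static int hash(byte[] fieldBytes) {
        return Arrays.hashCode(fieldBytes);
    }

    /** Equality consistent with hash(): same bytes means same key. */
    static boolean equal(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }
}
{code}

Since both operations look only at the bytes, two keys match only if they were serialized identically, which is why this mode fits SQL-only access best.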

As mentioned above, {{FIELDS_HASH}} and {{CUSTOM}} modes utilize 
{{BinaryObjectHashCodeResolver}} for hashing and equality comparison.

h2. Resolver interface and implementation

This interface looks as follows:

{code:java}
package org.apache.ignite.binary;

import org.apache.ignite.internal.binary.BinaryObjectExImpl;

/**
 * Method to compute hash codes for new binary objects.
 */
public interface BinaryObjectHashCodeResolver {
    /**
     * Compute hash code for the object being built.
     *
     * @param builder Binary object builder.
     * @return Hash code value.
     */
    public int hash(BinaryObjectBuilder builder);

    /**
     * Compare binary objects for equality consistently with how hash code is computed.
     *
     * @param o1 First object.
     * @param o2 Second object.
     * @return {@code true} if the objects are considered equal.
     */
    public boolean equals(BinaryObjectExImpl o1, BinaryObjectExImpl o2);
}
{code}

For {{FIELDS_HASH}}, the list of fields is set as a property of 
{{CacheKeyConfiguration}}, and the hash code resolver is built from that list. 
Therefore, this mode requires no additional coding.
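
As an illustration of what a resolver built from a field list might do, here is a minimal sketch that runs without Ignite. {{FieldsHashSketch}} is a hypothetical name, a plain {{Map}} stands in for the field values held by the builder, and the 31-based combination is an assumption, not necessarily the formula used in the patch.

{code:java}
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical field-list based hashing; a Map stands in for builder fields. */
final class FieldsHashSketch {
    /** Names of the fields that take part in hashing and comparison. */
    private final List<String> fields;

    FieldsHashSketch(List<String> fields) {
        this.fields = fields;
    }

    /** Combines hash codes of the configured fields, in configured order. */
    int hash(Map<String, Object> key) {
        int h = 1;

        for (String f : fields)
            h = 31 * h + Objects.hashCode(key.get(f));

        return h;
    }

    /** Compares only the configured fields, so it stays consistent with hash(). */
    boolean equal(Map<String, Object> a, Map<String, Object> b) {
        for (String f : fields) {
            if (!Objects.equals(a.get(f), b.get(f)))
                return false;
        }

        return true;
    }
}
{code}

Fields outside the configured list are ignored by both methods, so two keys that differ only in such fields are treated as the same key.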

For {{CUSTOM}}, the resolver class name is set as a property of 
{{CacheKeyConfiguration}}. This mode obliges the user to implement 
{{BinaryObjectHashCodeResolver}} and specify the class name of the implementation.
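
A custom resolver could look roughly like the sketch below. To keep it runnable without Ignite, {{ResolverSketch}} is a hypothetical stand-in that mirrors the shape of {{BinaryObjectHashCodeResolver}} but takes a plain {{Map}} of field values; the {{name}} field and the case-insensitive rule are made up for illustration.

{code:java}
import java.util.Locale;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in mirroring the shape of the resolver interface. */
interface ResolverSketch {
    int hash(Map<String, Object> key);

    boolean equals(Map<String, Object> a, Map<String, Object> b);
}

/** Treats keys as equal when their "name" fields match ignoring case. */
final class CaseInsensitiveNameResolver implements ResolverSketch {
    /** Normalized form that both hash() and equals() are derived from. */
    private static String norm(Map<String, Object> key) {
        Object v = key.get("name");

        return v == null ? null : v.toString().toLowerCase(Locale.ROOT);
    }

    @Override public int hash(Map<String, Object> key) {
        return Objects.hashCode(norm(key));
    }

    @Override public boolean equals(Map<String, Object> a, Map<String, Object> b) {
        return Objects.equals(norm(a), norm(b));
    }
}
{code}

Deriving both methods from the same normalized value is one simple way to keep hashing and equality consistent with each other.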

h2. Per mode configuration examples

h3. {{BYTES_HASH}}

{code:xml}
<bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- ...other properties... -->

    <property name="cacheKeyConfiguration">
        <list>
            <bean class="org.apache.ignite.cache.CacheKeyConfiguration">
                <property name="typeName" value="bytes_hashed_type" />

                <property name="affKeyFieldName" value="someAffField" />

                <property name="binHashingMode" value="BYTES_HASH" />
            </bean>
        </list>
    </property>
</bean>
{code}

No coding, no other settings - just set the mode, and you can do all your 
MERGEs and INSERTs. However, doing {{get}} s will probably be perilous as 
you'll have to create your keys with builder. This minimalistic configuration 
suits setups when the user wishes to interact with some portion of data in 
cache solely via SQL.

h3. {{FIELDS_HASH}}

{code:xml}
<bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- ...other properties... -->

    <property name="cacheKeyConfiguration">
        <list>
            <bean class="org.apache.ignite.cache.CacheKeyConfiguration">
                <property name="typeName" value="fields_hashed_type" />

                <property name="affKeyFieldName" value="someAffField" />

                <property name="binHashingMode" value="FIELDS_HASH" />

                <property name="binHashCodeFields">
                    <list>
                        <value>someHashField</value>
                        <value>anotherHashField</value>
                    </list>
                </property>
            </bean>
        </list>
    </property>
</bean>
{code}

Aside from setting the mode, you have to list the fields to hash. This suits 
setups where the client node has key classes and data nodes don't, while data 
gets into the cache via SQL INSERT/MERGE.

h3. {{CUSTOM}}

{code:xml}
<bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- ...other properties... -->

    <property name="cacheKeyConfiguration">
        <list>
            <bean class="org.apache.ignite.cache.CacheKeyConfiguration">
                <property name="typeName" value="CustomHashedBinaryType" />

                <property name="affKeyFieldName" value="someAffField" />

                <property name="binHashingMode" value="CUSTOM" />

                <property name="binHashCodeRslvrClsName" 
value="com.company.ignite.binary.SomeCustomHasher" />
            </bean>
        </list>
    </property>
</bean>
{code}

Aside from setting the mode, you have to implement 
{{BinaryObjectHashCodeResolver}} in the specified class. This suits setups 
where the client node has key classes and data nodes don't, while data gets 
into the cache via SQL INSERT/MERGE.

h2. Existing key classes with {{FIELDS_HASH}} and {{CUSTOM}} hashing modes

There is an important aspect of binary object handling: what if we wish to 
perform a {{get}} on cache that contains a key
- for which the class *is* present on client node
- and the class *is not* present on data nodes
- and key was put to cache not by calling {{put}} but by SQL INSERT or MERGE?

What then? In this case the user's class already has {{hashCode}} and 
{{equals}} implemented, but data nodes don't have the class, and {{get}} s 
obviously still have to work. Therefore, the logic of 
{{BinaryObjectHashCodeResolver}} should match that declared in the key's class 
(which data nodes don't have).

For the cases when {{hashCode}} / {{equals}} logic is trivial and generated by 
an IDE, fields based hashing and equality comparisons are sufficient. Therefore, 
{{FIELDS_HASH}} works, and the only thing to maintain is consistency between 
the field list used in the key class (which data nodes don't have) *AND* the 
field list in config files on data nodes.
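
The sketch below shows why the two field lists have to agree, order included. {{PersonKey}} and {{FieldOrderSketch}} are hypothetical, and the sketch assumes the fields based hash combines values the same way as {{Objects.hash}} (31-based); whether the patch uses that exact formula is not stated here.

{code:java}
import java.util.Objects;

/** Hypothetical client-side key class with typical generated methods. */
final class PersonKey {
    final int id;
    final String city;

    PersonKey(int id, String city) {
        this.id = id;
        this.city = city;
    }

    @Override public int hashCode() {
        return Objects.hash(id, city);
    }

    @Override public boolean equals(Object o) {
        if (this == o)
            return true;

        if (!(o instanceof PersonKey))
            return false;

        PersonKey other = (PersonKey)o;

        return id == other.id && Objects.equals(city, other.city);
    }
}

/** Hypothetical data-node side: hashes field values in configured order. */
final class FieldOrderSketch {
    static int fieldsHash(Object... valuesInConfiguredOrder) {
        int h = 1;

        for (Object v : valuesInConfiguredOrder)
            h = 31 * h + Objects.hashCode(v);

        return h;
    }
}
{code}

With the fields listed as {{id}}, {{city}}, the two hashes agree for the same values; listing them in the opposite order generally produces a different hash, so a {{get}} with a key built from the class would miss the entry.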

For the cases when {{hashCode}} / {{equals}} logic is not trivial, user will 
have to implement custom {{BinaryObjectHashCodeResolver}} which will have to 
mimic the logic of key hashing/comparing in the class.

Rationale behind this design is as follows:
- If the user does not care about automatic keys hashing (= does not use DML 
features), then he or she is probably happy and does not want to configure or, 
God forbid, code anything. All that works has to work without new 
coding/configuration.
- If the user wishes to hash binary classless keys automatically (from SQL 
INSERT/MERGE) *AND* have key classes on client nodes (= perform {{get}} with 
key serialized by, say, {{IgniteBinary.toBinary(Object)}} and *NOT* constructed 
with binary builder), he or she will have to maintain integrity between hashing 
modes on client and server nodes. However, forcing the user to change the code 
of existing classes does not seem right, so the only burden is re-configuring 
data nodes. (And, optionally, writing custom resolver if original class is 
hashed/compared in some weird way).

h2. Any ways to avoid having to do anything at all?
^(aka who won't have to do any changes in code or configuration?)^

Sure thing.
- Don't use DML.
- Don't use binary keys without classes. *(Everything written above affects 
only cases with non trivial classless keys.)*



> Automatically compute hash codes for newly built binary objects
> ---------------------------------------------------------------
>
>                 Key: IGNITE-4011
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4011
>             Project: Ignite
>          Issue Type: Task
>          Components: binary, cache
>            Reporter: Alexander Paschenko
>            Assignee: Alexander Paschenko
>             Fix For: 1.8
>
>
> For binary keys built automatically inside SQL engine during INSERT or MERGE, 
> we need to compute hash codes automatically because in this case the user 
> does not interact with any builders and can't set hash code explicitly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
