Re: Default hash code generation strategy for new binary objects

Alexey Goncharuk Thu, 29 Sep 2016 09:15:03 -0700

In the assumption that we do not have 'non-equals' fields, binary array
comparison generally should be faster, because even  if we did a
field-by-field comparison, we would read the same amount of data anyway,
but also would need to do some byte jiggling for type conversion.


2016-09-29 19:07 GMT+03:00 Dmitriy Setrakyan <[email protected]>:

> Alexey, in your opinion, what will be faster, the binary array comparison
> or field-by-field comparison?
>
> On Thu, Sep 29, 2016 at 8:39 AM, Alexey Goncharuk <
> [email protected]> wrote:
>
> > Folks, let me point out a few obvious (or not) things
> >
> > A set of fields participating in hashCode and equals is impossible to
> > change without cluster restart. Imagine a new client adding a field F to
> > key structure (A, B) so that new key is (A, B, F). In this case key (1,
> 2,
> > 0) will be treated as a different key w/respect to (1, 2, 3), while the
> new
> > client expects them to be the same. I think the set of fields defining
> the
> > key _must_ be used in both hash code and equals calculation and cannot
> > change over time. Having said that, we can use binary array comparison as
> > object equality test. Presence of new 'garbage' fields in key looks
> useless
> > to me.
> >
> > So, we should support:
> > 1) All the things Dmitriy pointed out, in other words - multi-field key
> > with affinity key fields
> > 2) Ability to instantiate the key using a text-only environment
> > 3) Ability to define the structure of a key using XML or DDL
> >
> > I think all of these can be done via combination of HashCodeResolver
> > interface + default resolver which can read type configuration.
> >
> > 2016-09-29 3:54 GMT+03:00 Dmitriy Setrakyan <[email protected]>:
> >
> > > Guys,
> > >
> > > We need to look at 3 cases:
> > >
> > > a) key is just one field
> > > b) key is multiple fields
> > > c) key is one or multiple fields, with possibility of an alternate
> > affinity
> > > key
> > >
> > > For (a) and (b), whenever a type is defined in XML, and further in
> DML, a
> > > user will specify which fields are part of the key. In that case, we
> > should
> > > just grab those fields and calculate the hashcode automatically.
> > >
> > > For (c), a user should specify in XML, and further in DML, which fields
> > > should be used for affinity, in addition to the key fields. In that
> case,
> > > we again should grab those fields and calculate the hashcodes for the
> > > primary key and the affinity key.
> > >
> > > I really am not sure if there are other ways of doing it. Am I missing
> > > something?
> > >
> > > D.
> > >
> > >
> > > On Wed, Sep 28, 2016 at 5:14 PM, Denis Magda <[email protected]>
> > wrote:
> > >
> > > > Let me show the picture I have in my mind:
> > > >
> > > > Primary key is a must for all INSERT and MERGE operations. If it’s
> not
> > > set
> > > > then an INSERT/MERGE fails.
> > > > If a primary key is a boxed/unboxed primitive (int, Integer, String,
> > > Date,
> > > > etc.) then the key value is used for hashCode calculation. At the
> same
> > > time
> > > > the key will be an affinity key.
> > > > If a primary key is a custom object then it’s value can be passed as
> a
> > > > param directly from Java, .Net, C++ in a way like “INSERT (_key,
> > field1,
> > > > field2) VALUES (?, val1, val2)”. In this scenario we will call
> hashCode
> > > > directly on the key's value. In addition, we will be able to get an
> > > > affinity key since we know key’s type class descriptor
> > > > (BinaryClassDescriptor).
> > > > If a primary key is still a custom key and we want to insert its
> value
> > > > from an SQL console, PHP, Tableu, etc. then we can’t pass the key’s
> > value
> > > > as is. Here we’re trying to apply a workaround by listing key's
> fields
> > in
> > > > INSERT/MERGE and the task is to properly re-construct the key on our
> > side
> > > > using only specific fields.
> > > >
> > > > Is my understanding correct for all the bullets above?
> > > >
> > > > If so then, yes, I would agree that we need to list these fields in a
> > > > configuration and the default hash code resolver will use them as
> well.
> > > > Moreover, we have to pin point an affinity field. So, the question is
> > > what
> > > > the configuration we should use.
> > > >
> > > > Community, any other thoughts/ideas?
> > > >
> > > > —
> > > > Denis
> > > >
> > > > > On Sep 28, 2016, at 4:16 PM, Alexander Paschenko <
> > > > [email protected]> wrote:
> > > > >
> > > > > Also MERGE.
> > > > >
> > > > > 2016-09-29 2:10 GMT+03:00 Denis Magda <[email protected]>:
> > > > >> You need a hash code only for INSERT operation, right?
> > > > >>
> > > > >> —
> > > > >> Denis
> > > > >>
> > > > >>> On Sep 28, 2016, at 3:47 PM, Alexander Paschenko <
> > > > [email protected]> wrote:
> > > > >>>
> > > > >>> But what if the user works from some kind of console and just
> types
> > > > >>> the queries as text in full and does not bind params via JDBC or
> > > > >>> something alike? What if there's no binary object? I don't see
> why
> > we
> > > > >>> should keep the user from usual cache gets in this case. I really
> > > like
> > > > >>> the idea of supplying the values of distinct fields, thus freeing
> > the
> > > > >>> user of the need to mess with objects and builders, AND then just
> > > > >>> calculating hash code as suggested before - say, via explicitly
> > > > >>> listing participating fields in XML or by marking them with
> > transient
> > > > >>> keyword or some annotation.
> > > > >>> Actually, I believe that's the only case when we need to generate
> > any
> > > > >>> hash codes - when the class is present, we can just get hash code
> > > from
> > > > >>> its implementation of its method. When there's no class, we
> > generate.
> > > > >>> And all that is solely for SQL. For the rest - just throw an
> > > exception
> > > > >>> when there's no hash code manually set for binary object. I don't
> > see
> > > > >>> why we should try to generate anything when the user already is
> > using
> > > > >>> Ignite in full, not just via limited interface of SQL.
> > > > >>>
> > > > >>> 2016-09-29 0:31 GMT+03:00 Denis Magda <[email protected]>:
> > > > >>>> Hmm, this is a good question.
> > > > >>>>
> > > > >>>> If a user doesn’t provide a _key when an INSERT is executed for
> me
> > > it
> > > > means that he is not going to use the key later for cache.get/put,
> > > DELETE,
> > > > UPDATE and other possible operation simply because he doesn’t know
> how
> > to
> > > > reconstruct the key back in his code. If he wants to use the primary
> > key
> > > in
> > > > the rest of operations then he must provide it at INSERT time.
> > > > >>>>
> > > > >>>> Do we need this key only for a case when an object is being
> > inserted
> > > > into a cache? If it’s so I would auto-generate a key using ‘long’ as
> a
> > > key
> > > > type. I do remember that we provided the auto-generation for Spark
> > module
> > > > in a some way that may be useful here.
> > > > >>>>
> > > > >>>> —
> > > > >>>> Denis
> > > > >>>>
> > > > >>>>> On Sep 28, 2016, at 9:53 AM, Alexander Paschenko <
> > > > [email protected]> wrote:
> > > > >>>>>
> > > > >>>>> Denis,
> > > > >>>>>
> > > > >>>>> That's not what I was asking about.
> > > > >>>>> Currently DML implementation allows for dymanic instantiation
> of
> > > > keys,
> > > > >>>>> in other words, user does not have to provide value for
> > > object-typed
> > > > >>>>> _key column - instead, he may supply just field values based on
> > > which
> > > > >>>>> _key will be dynamically instantiated/binary built. And that's
> > the
> > > > >>>>> whole point of this discussion as I see it: what to do when
> we've
> > > > >>>>> binary built classless key that we build ourselves inside SQL
> > > engine
> > > > >>>>> and don't know how to compute hash code for it?
> > > > >>>>>
> > > > >>>>> - Alex
> > > > >>>>>
> > > > >>>>> 2016-09-28 19:48 GMT+03:00 Denis Magda <[email protected]>:
> > > > >>>>>> Alexander,
> > > > >>>>>>
> > > > >>>>>> As I guess if we have a key without a class then it will be
> > > > constructed using a BinaryBuilder instance and it’s user
> responsibility
> > > to
> > > > set the has code at the end with BinaryBuilder.hasCode method. Sure,
> > all
> > > > this cases must be well-documented in both Java Doc API and Apache
> > Ignite
> > > > documentation.
> > > > >>>>>>
> > > > >>>>>> —
> > > > >>>>>> Denis
> > > > >>>>>>
> > > > >>>>>>> On Sep 28, 2016, at 9:33 AM, Alexander Paschenko <
> > > > [email protected]> wrote:
> > > > >>>>>>>
> > > > >>>>>>> Dmitry, Denis,
> > > > >>>>>>>
> > > > >>>>>>> OK, but I think it's necessary to address also the cases when
> > > > there's
> > > > >>>>>>> no actual class for the key, and its fields are simply
> declared
> > > in
> > > > >>>>>>> XML. In this case, there are no fields to be marked
> transient.
> > > > What do
> > > > >>>>>>> we do then? List transient fields in XML separately?
> > > > >>>>>>>
> > > > >>>>>>> - Alex
> > > > >>>>>>>
> > > > >>>>>>> 2016-09-28 4:15 GMT+03:00 Dmitriy Setrakyan <
> > > [email protected]
> > > > >:
> > > > >>>>>>>> Agree with Denis.
> > > > >>>>>>>>
> > > > >>>>>>>> - by default, all non-transient key fields should
> participate
> > in
> > > > the
> > > > >>>>>>>> hashcode generation
> > > > >>>>>>>> - when working on DDL, then the primary key fields should
> > > > participate in
> > > > >>>>>>>> the hashcode
> > > > >>>>>>>> - we should add a resolver to override the default behavior
> > > > (please
> > > > >>>>>>>> propose the interface in Jira)
> > > > >>>>>>>> - we should print out a warning, only once per type, the the
> > > > hashcode
> > > > >>>>>>>> has been automatically generated based on which fields and
> > which
> > > > formula
> > > > >>>>>>>>
> > > > >>>>>>>> D.
> > > > >>>>>>>>
> > > > >>>>>>>> On Tue, Sep 27, 2016 at 5:42 PM, Denis Magda <
> > > [email protected]>
> > > > wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Hi Alexander,
> > > > >>>>>>>>>
> > > > >>>>>>>>> Vladimir’s proposal sounds reasonable to me. However we
> must
> > > > keep in mind
> > > > >>>>>>>>> one important thing. Binary objects were designed to
> address
> > > the
> > > > following
> > > > >>>>>>>>> disadvantages a regular serializer, like optimized
> > marshaller,
> > > > has:
> > > > >>>>>>>>> necessity to deserialize an object on a server side every
> > time
> > > > it’s needed.
> > > > >>>>>>>>> necessity to hold an object in both serialized and
> > deserialized
> > > > forms on
> > > > >>>>>>>>> the server node.
> > > > >>>>>>>>> necessity to restart the whole cluster each time an object
> > > > version is
> > > > >>>>>>>>> changed (new field is added or an old one is removed).
> > > > >>>>>>>>> If it will be needed to perform step 3 for a default
> > > > implementation of the
> > > > >>>>>>>>> binary resolver just because the resolver has to consider
> new
> > > > fields or
> > > > >>>>>>>>> ignore old ones then such an implementation sucks. Overall,
> > the
> > > > default
> > > > >>>>>>>>> implementation should use the reflection coming over all
> the
> > > > fields a key
> > > > >>>>>>>>> has ignoring the ones that are marked with “transient”
> > keyword.
> > > > If a user
> > > > >>>>>>>>> wants to control the default resolver's logic then he can
> > label
> > > > all the
> > > > >>>>>>>>> fields that mustn’t be of a final has code value with
> > > > “transient”. This has
> > > > >>>>>>>>> to be well-documented for sure.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Makes sense?
> > > > >>>>>>>>>
> > > > >>>>>>>>> —
> > > > >>>>>>>>> Denis
> > > > >>>>>>>>>
> > > > >>>>>>>>>> On Sep 26, 2016, at 12:40 PM, Alexander Paschenko <
> > > > >>>>>>>>> [email protected]> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Hello Igniters,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> As DML support is near, it's critical that we agree on how
> > we
> > > > generate
> > > > >>>>>>>>>> hash codes for new keys in presence of binary marshaller.
> > > > Actually,
> > > > >>>>>>>>>> this discussion isn't new - please see its beginning here:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> http://apache-ignite-developers.2346864.n4.nabble.
> > > > >>>>>>>>> com/All-BinaryObjects-created-
> by-BinaryObjectBuilder-stored-
> > > > >>>>>>>>> at-the-same-partition-by-default-td8042.html
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Still, I'm creating this new thread to make getting to the
> > > final
> > > > >>>>>>>>>> solution as simple and fast as possible.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I remind everyone that the approach that has got the least
> > > > critics was
> > > > >>>>>>>>>> the one proposed by Vladimir Ozerov:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> <quote>
> > > > >>>>>>>>>> I think we can do the following:
> > > > >>>>>>>>>> 1) Add "has hash code" flag as Denis suggested.
> > > > >>>>>>>>>> 2) If object without a hash code is put to cache, throw an
> > > > exception.
> > > > >>>>>>>>>> 3) Add *BinaryEqualsHashCodeResolver *interface.
> > > > >>>>>>>>>> 4) Add default implementation which will auto-generate
> hash
> > > > code. *Print
> > > > >>>>>>>>> a
> > > > >>>>>>>>>> warning when auto-generation occurs*, so that user is
> aware
> > > > that he is
> > > > >>>>>>>>>> likely to have problems with normal GETs/PUTs.
> > > > >>>>>>>>>> 5) Add another implementation which will use encoded
> string
> > to
> > > > calculate
> > > > >>>>>>>>> a
> > > > >>>>>>>>>> hash code. E.g. *new BinaryEqualsHashCodeResolver("{a} *
> > 31 +
> > > > {b}")*.
> > > > >>>>>>>>>> Originally proposed by Yakov some time ago.
> > > > >>>>>>>>>> </quote>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> After that, Sergi suggested that instead of a "formula" we
> > > keep
> > > > just a
> > > > >>>>>>>>>> list of the "fields" that participate in hash code
> > evaluation,
> > > > and
> > > > >>>>>>>>>> with that list, we simply calculate hash code just like
> IDE
> > > > does -
> > > > >>>>>>>>>> with all its bit shifts and additions.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I'm planning on settling down with this combined
> Vlad-Sergi
> > > > approach.
> > > > >>>>>>>>>> Any objections?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> And an extra question I have: Vlad, you suggest that we
> both
> > > > throw an
> > > > >>>>>>>>>> exception on cache code absence and that we might generate
> > it
> > > > as the
> > > > >>>>>>>>>> last resort. Do I understand you correctly that you
> suggest
> > > > generating
> > > > >>>>>>>>>> random code only in context of SQL, but throw exception
> for
> > > keys
> > > > >>>>>>>>>> without codes on ordinary put?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> And yes, built-in hash codes for JDK types are supported
> as
> > > > well as
> > > > >>>>>>>>>> items 1-2 from Vlad's list (there's already fixed issue of
> > > > IGNITE-3633
> > > > >>>>>>>>>> for the flag and its presence check).
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> - Alex
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>
> > > >
> > > >
> > >
> >
>

Re: Default hash code generation strategy for new binary objects

Reply via email to