In the assumption that we do not have 'non-equals' fields, binary array comparison generally should be faster, because even if we did a field-by-field comparison, we would read the same amount of data anyway, but also would need to do some byte jiggling for type conversion.
2016-09-29 19:07 GMT+03:00 Dmitriy Setrakyan <dsetrak...@apache.org>: > Alexey, in your opinion, what will be faster, the binary array comparison > or field-by-field comparison? > > On Thu, Sep 29, 2016 at 8:39 AM, Alexey Goncharuk < > alexey.goncha...@gmail.com> wrote: > > > Folks, let me point out a few obvious (or not) things > > > > A set of fields participating in hashCode and equals is impossible to > > change without cluster restart. Imagine a new client adding a field F to > > key structure (A, B) so that new key is (A, B, F). In this case key (1, > 2, > > 0) will be treated as a different key w/respect to (1, 2, 3), while the > new > > client expects them to be the same. I think the set of fields defining > the > > key _must_ be used in both hash code and equals calculation and cannot > > change over time. Having said that, we can use binary array comparison as > > object equality test. Presence of new 'garbage' fields in key looks > useless > > to me. > > > > So, we should support: > > 1) All the things Dmitriy pointed out, in other words - multi-field key > > with affinity key fields > > 2) Ability to instantiate the key using a text-only environment > > 3) Ability to define the structure of a key using XML or DDL > > > > I think all of these can be done via combination of HashCodeResolver > > interface + default resolver which can read type configuration. > > > > 2016-09-29 3:54 GMT+03:00 Dmitriy Setrakyan <dsetrak...@apache.org>: > > > > > Guys, > > > > > > We need to look at 3 cases: > > > > > > a) key is just one field > > > b) key is multiple fields > > > c) key is one or multiple fields, with possibility of an alternate > > affinity > > > key > > > > > > For (a) and (b), whenever a type is defined in XML, and further in > DML, a > > > user will specify which fields are part of the key. In that case, we > > should > > > just grab those fields and calculate the hashcode automatically. > > > > > > For (c), a user should specify in XML, and further in DML, which fields > > > should be used for affinity, in addition to the key fields. In that > case, > > > we again should grab those fields and calculate the hashcodes for the > > > primary key and the affinity key. > > > > > > I really am not sure if there are other ways of doing it. Am I missing > > > something? > > > > > > D. > > > > > > > > > On Wed, Sep 28, 2016 at 5:14 PM, Denis Magda <dma...@gridgain.com> > > wrote: > > > > > > > Let me show the picture I have in my mind: > > > > > > > > Primary key is a must for all INSERT and MERGE operations. If it’s > not > > > set > > > > then an INSERT/MERGE fails. > > > > If a primary key is a boxed/unboxed primitive (int, Integer, String, > > > Date, > > > > etc.) then the key value is used for hashCode calculation. At the > same > > > time > > > > the key will be an affinity key. > > > > If a primary key is a custom object then it’s value can be passed as > a > > > > param directly from Java, .Net, C++ in a way like “INSERT (_key, > > field1, > > > > field2) VALUES (?, val1, val2)”. In this scenario we will call > hashCode > > > > directly on the key's value. In addition, we will be able to get an > > > > affinity key since we know key’s type class descriptor > > > > (BinaryClassDescriptor). > > > > If a primary key is still a custom key and we want to insert its > value > > > > from an SQL console, PHP, Tableu, etc. then we can’t pass the key’s > > value > > > > as is. Here we’re trying to apply a workaround by listing key's > fields > > in > > > > INSERT/MERGE and the task is to properly re-construct the key on our > > side > > > > using only specific fields. > > > > > > > > Is my understanding correct for all the bullets above? > > > > > > > > If so then, yes, I would agree that we need to list these fields in a > > > > configuration and the default hash code resolver will use them as > well. > > > > Moreover, we have to pin point an affinity field. So, the question is > > > what > > > > the configuration we should use. > > > > > > > > Community, any other thoughts/ideas? > > > > > > > > — > > > > Denis > > > > > > > > > On Sep 28, 2016, at 4:16 PM, Alexander Paschenko < > > > > alexander.a.pasche...@gmail.com> wrote: > > > > > > > > > > Also MERGE. > > > > > > > > > > 2016-09-29 2:10 GMT+03:00 Denis Magda <dma...@gridgain.com>: > > > > >> You need a hash code only for INSERT operation, right? > > > > >> > > > > >> — > > > > >> Denis > > > > >> > > > > >>> On Sep 28, 2016, at 3:47 PM, Alexander Paschenko < > > > > alexander.a.pasche...@gmail.com> wrote: > > > > >>> > > > > >>> But what if the user works from some kind of console and just > types > > > > >>> the queries as text in full and does not bind params via JDBC or > > > > >>> something alike? What if there's no binary object? I don't see > why > > we > > > > >>> should keep the user from usual cache gets in this case. I really > > > like > > > > >>> the idea of supplying the values of distinct fields, thus freeing > > the > > > > >>> user of the need to mess with objects and builders, AND then just > > > > >>> calculating hash code as suggested before - say, via explicitly > > > > >>> listing participating fields in XML or by marking them with > > transient > > > > >>> keyword or some annotation. > > > > >>> Actually, I believe that's the only case when we need to generate > > any > > > > >>> hash codes - when the class is present, we can just get hash code > > > from > > > > >>> its implementation of its method. When there's no class, we > > generate. > > > > >>> And all that is solely for SQL. For the rest - just throw an > > > exception > > > > >>> when there's no hash code manually set for binary object. I don't > > see > > > > >>> why we should try to generate anything when the user already is > > using > > > > >>> Ignite in full, not just via limited interface of SQL. > > > > >>> > > > > >>> 2016-09-29 0:31 GMT+03:00 Denis Magda <dma...@gridgain.com>: > > > > >>>> Hmm, this is a good question. > > > > >>>> > > > > >>>> If a user doesn’t provide a _key when an INSERT is executed for > me > > > it > > > > means that he is not going to use the key later for cache.get/put, > > > DELETE, > > > > UPDATE and other possible operation simply because he doesn’t know > how > > to > > > > reconstruct the key back in his code. If he wants to use the primary > > key > > > in > > > > the rest of operations then he must provide it at INSERT time. > > > > >>>> > > > > >>>> Do we need this key only for a case when an object is being > > inserted > > > > into a cache? If it’s so I would auto-generate a key using ‘long’ as > a > > > key > > > > type. I do remember that we provided the auto-generation for Spark > > module > > > > in a some way that may be useful here. > > > > >>>> > > > > >>>> — > > > > >>>> Denis > > > > >>>> > > > > >>>>> On Sep 28, 2016, at 9:53 AM, Alexander Paschenko < > > > > alexander.a.pasche...@gmail.com> wrote: > > > > >>>>> > > > > >>>>> Denis, > > > > >>>>> > > > > >>>>> That's not what I was asking about. > > > > >>>>> Currently DML implementation allows for dymanic instantiation > of > > > > keys, > > > > >>>>> in other words, user does not have to provide value for > > > object-typed > > > > >>>>> _key column - instead, he may supply just field values based on > > > which > > > > >>>>> _key will be dynamically instantiated/binary built. And that's > > the > > > > >>>>> whole point of this discussion as I see it: what to do when > we've > > > > >>>>> binary built classless key that we build ourselves inside SQL > > > engine > > > > >>>>> and don't know how to compute hash code for it? > > > > >>>>> > > > > >>>>> - Alex > > > > >>>>> > > > > >>>>> 2016-09-28 19:48 GMT+03:00 Denis Magda <dma...@gridgain.com>: > > > > >>>>>> Alexander, > > > > >>>>>> > > > > >>>>>> As I guess if we have a key without a class then it will be > > > > constructed using a BinaryBuilder instance and it’s user > responsibility > > > to > > > > set the has code at the end with BinaryBuilder.hasCode method. Sure, > > all > > > > this cases must be well-documented in both Java Doc API and Apache > > Ignite > > > > documentation. > > > > >>>>>> > > > > >>>>>> — > > > > >>>>>> Denis > > > > >>>>>> > > > > >>>>>>> On Sep 28, 2016, at 9:33 AM, Alexander Paschenko < > > > > alexander.a.pasche...@gmail.com> wrote: > > > > >>>>>>> > > > > >>>>>>> Dmitry, Denis, > > > > >>>>>>> > > > > >>>>>>> OK, but I think it's necessary to address also the cases when > > > > there's > > > > >>>>>>> no actual class for the key, and its fields are simply > declared > > > in > > > > >>>>>>> XML. In this case, there are no fields to be marked > transient. > > > > What do > > > > >>>>>>> we do then? List transient fields in XML separately? > > > > >>>>>>> > > > > >>>>>>> - Alex > > > > >>>>>>> > > > > >>>>>>> 2016-09-28 4:15 GMT+03:00 Dmitriy Setrakyan < > > > dsetrak...@apache.org > > > > >: > > > > >>>>>>>> Agree with Denis. > > > > >>>>>>>> > > > > >>>>>>>> - by default, all non-transient key fields should > participate > > in > > > > the > > > > >>>>>>>> hashcode generation > > > > >>>>>>>> - when working on DDL, then the primary key fields should > > > > participate in > > > > >>>>>>>> the hashcode > > > > >>>>>>>> - we should add a resolver to override the default behavior > > > > (please > > > > >>>>>>>> propose the interface in Jira) > > > > >>>>>>>> - we should print out a warning, only once per type, the the > > > > hashcode > > > > >>>>>>>> has been automatically generated based on which fields and > > which > > > > formula > > > > >>>>>>>> > > > > >>>>>>>> D. > > > > >>>>>>>> > > > > >>>>>>>> On Tue, Sep 27, 2016 at 5:42 PM, Denis Magda < > > > dma...@gridgain.com> > > > > wrote: > > > > >>>>>>>> > > > > >>>>>>>>> Hi Alexander, > > > > >>>>>>>>> > > > > >>>>>>>>> Vladimir’s proposal sounds reasonable to me. However we > must > > > > keep in mind > > > > >>>>>>>>> one important thing. Binary objects were designed to > address > > > the > > > > following > > > > >>>>>>>>> disadvantages a regular serializer, like optimized > > marshaller, > > > > has: > > > > >>>>>>>>> necessity to deserialize an object on a server side every > > time > > > > it’s needed. > > > > >>>>>>>>> necessity to hold an object in both serialized and > > deserialized > > > > forms on > > > > >>>>>>>>> the server node. > > > > >>>>>>>>> necessity to restart the whole cluster each time an object > > > > version is > > > > >>>>>>>>> changed (new field is added or an old one is removed). > > > > >>>>>>>>> If it will be needed to perform step 3 for a default > > > > implementation of the > > > > >>>>>>>>> binary resolver just because the resolver has to consider > new > > > > fields or > > > > >>>>>>>>> ignore old ones then such an implementation sucks. Overall, > > the > > > > default > > > > >>>>>>>>> implementation should use the reflection coming over all > the > > > > fields a key > > > > >>>>>>>>> has ignoring the ones that are marked with “transient” > > keyword. > > > > If a user > > > > >>>>>>>>> wants to control the default resolver's logic then he can > > label > > > > all the > > > > >>>>>>>>> fields that mustn’t be of a final has code value with > > > > “transient”. This has > > > > >>>>>>>>> to be well-documented for sure. > > > > >>>>>>>>> > > > > >>>>>>>>> Makes sense? > > > > >>>>>>>>> > > > > >>>>>>>>> — > > > > >>>>>>>>> Denis > > > > >>>>>>>>> > > > > >>>>>>>>>> On Sep 26, 2016, at 12:40 PM, Alexander Paschenko < > > > > >>>>>>>>> alexander.a.pasche...@gmail.com> wrote: > > > > >>>>>>>>>> > > > > >>>>>>>>>> Hello Igniters, > > > > >>>>>>>>>> > > > > >>>>>>>>>> As DML support is near, it's critical that we agree on how > > we > > > > generate > > > > >>>>>>>>>> hash codes for new keys in presence of binary marshaller. > > > > Actually, > > > > >>>>>>>>>> this discussion isn't new - please see its beginning here: > > > > >>>>>>>>>> > > > > >>>>>>>>>> http://apache-ignite-developers.2346864.n4.nabble. > > > > >>>>>>>>> com/All-BinaryObjects-created- > by-BinaryObjectBuilder-stored- > > > > >>>>>>>>> at-the-same-partition-by-default-td8042.html > > > > >>>>>>>>>> > > > > >>>>>>>>>> Still, I'm creating this new thread to make getting to the > > > final > > > > >>>>>>>>>> solution as simple and fast as possible. > > > > >>>>>>>>>> > > > > >>>>>>>>>> I remind everyone that the approach that has got the least > > > > critics was > > > > >>>>>>>>>> the one proposed by Vladimir Ozerov: > > > > >>>>>>>>>> > > > > >>>>>>>>>> <quote> > > > > >>>>>>>>>> I think we can do the following: > > > > >>>>>>>>>> 1) Add "has hash code" flag as Denis suggested. > > > > >>>>>>>>>> 2) If object without a hash code is put to cache, throw an > > > > exception. > > > > >>>>>>>>>> 3) Add *BinaryEqualsHashCodeResolver *interface. > > > > >>>>>>>>>> 4) Add default implementation which will auto-generate > hash > > > > code. *Print > > > > >>>>>>>>> a > > > > >>>>>>>>>> warning when auto-generation occurs*, so that user is > aware > > > > that he is > > > > >>>>>>>>>> likely to have problems with normal GETs/PUTs. > > > > >>>>>>>>>> 5) Add another implementation which will use encoded > string > > to > > > > calculate > > > > >>>>>>>>> a > > > > >>>>>>>>>> hash code. E.g. *new BinaryEqualsHashCodeResolver("{a} * > > 31 + > > > > {b}")*. > > > > >>>>>>>>>> Originally proposed by Yakov some time ago. > > > > >>>>>>>>>> </quote> > > > > >>>>>>>>>> > > > > >>>>>>>>>> After that, Sergi suggested that instead of a "formula" we > > > keep > > > > just a > > > > >>>>>>>>>> list of the "fields" that participate in hash code > > evaluation, > > > > and > > > > >>>>>>>>>> with that list, we simply calculate hash code just like > IDE > > > > does - > > > > >>>>>>>>>> with all its bit shifts and additions. > > > > >>>>>>>>>> > > > > >>>>>>>>>> I'm planning on settling down with this combined > Vlad-Sergi > > > > approach. > > > > >>>>>>>>>> Any objections? > > > > >>>>>>>>>> > > > > >>>>>>>>>> And an extra question I have: Vlad, you suggest that we > both > > > > throw an > > > > >>>>>>>>>> exception on cache code absence and that we might generate > > it > > > > as the > > > > >>>>>>>>>> last resort. Do I understand you correctly that you > suggest > > > > generating > > > > >>>>>>>>>> random code only in context of SQL, but throw exception > for > > > keys > > > > >>>>>>>>>> without codes on ordinary put? > > > > >>>>>>>>>> > > > > >>>>>>>>>> And yes, built-in hash codes for JDK types are supported > as > > > > well as > > > > >>>>>>>>>> items 1-2 from Vlad's list (there's already fixed issue of > > > > IGNITE-3633 > > > > >>>>>>>>>> for the flag and its presence check). > > > > >>>>>>>>>> > > > > >>>>>>>>>> - Alex > > > > >>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>> > > > > >>>> > > > > >> > > > > > > > > > > > > > >