Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

Ron Reynolds Wed, 18 May 2011 12:08:45 -0700

the corner-stone of Hash* containers is:
 (A.equals(B)) => (A.hashCode() == B.hashCode()) for all A, B.


tho it's not explicitly stated it would seem to be implied that is within a 
single JVM.
not sure if the code in question maintains that rule within a JVM (if not 
that's 
a big deal).  
if so that would seem sufficient for all but the most distributed of Hash* 
containers, such as where a client which is remote from the storage (i.e, in 
another JVM) determines the bucket (based on hashCode()) to find that the 
element in question has been placed into another bucket because the hashCode() 
within the containing JVM has evaluated to another value.  that's a pretty 
far-fetched, but not unimaginable, situation.



________________________________
From: Jason Hsueh <[email protected]>
To: Jay Booth <[email protected]>
Cc: Protocol Buffers <[email protected]>
Sent: Wed, May 18, 2011 12:00:19 PM
Subject: Re: [protobuf] Re: Generated hashcode() returns different values 
across 
JVM instances?

Jumping in late to the thread, and I'm not really a Java person, so I may be 
misunderstanding something here. But as far as I can tell, you are asking for 
hashCode() to be a 'consistent' hash across processes. hashCode() as 
implemented 
is still useful within a single JVM, allowing you to use protobufs in HashMaps 
based on content rather than object identity. That was the intended use case.


On Wed, May 18, 2011 at 11:48 AM, Jay Booth <[email protected]> wrote:

Well, that's your prerogative, I guess, but why even implement
>hashcode at all then?  Just inherit from object and you're getting
>effectively the same behavior.  Is that what you're intending?
>
>
>On May 16, 10:03 am, Pherl Liu <[email protected]> wrote:
>> We discussed internally and decided not to make the hashCode()
>> return deterministic result. If you need consistent hashcode in different
>> runs, use toByteString().hashCode().
>>
>> Quoted from Kenton:
>>
>> Hashing the content of the descriptor would actually be incorrect, because
>> two descriptors with exactly the same content are still considered different
>> types.  Descriptors are compared by identity, hence they are hashed by
>> pointer.
>>
>> Removing the descriptor from the calculation would indeed make hashCode()
>> consistent between two runs of the same binary, and probably insignificant
>> runtime cost.  Of course, once you do that, you will never be able to
>> introduce non-determinism again because people will depend on it.
>>
>> But there's a much bigger risk.  People may actually start depending on
>> hashCode() returning consistent results between two different versions of
>> the binary, or two completely separate binaries that compile in the same
>> protocol, or -- most dangerously -- two different versions of the same
>> protocol (e.g. with fields added or removed).  I think it would be very
>> difficult and limiting to make these guarantees, so I would be extremely
>> cautious about this.
>>
>> Certainly, there is no implementation of hashCode() that would be any safer
>> than .toByteString().hashCode().  So, I'd advise steering people to the
>> latter.  Note that if unknown fields are present, the results may still be
>> inconsistent.  However, there is no reasonable way to implement a hashCode()
>> that is consistent in the presence of unknown fields.
>>
>>
>>
>>
>>
>>
>>
>
>> On Thu, May 12, 2011 at 5:32 AM, Ben Wright <[email protected]> wrote:
>> > I think we wrote those replies at the same time : )
>>
>> > You're right, at the cost of some additional hash collisions, the
>> > simplest solution is to simply not include the type / descriptor in
>> > the hash calculation at all.
>>
>> > The best / least-collision solutions with good performance would be
>> > what I wrote in my previous post, but that requires that someone
>> > (presumably a current committer) with sufficient knowledge of the
>> > Descriptor types to have enough time to update the compiler and java
>> > libraries accordingly.
>>
>> > Any input from a committer for this issue?  Seems the simple solution
>> > would take less than an hour to push into the stream and could make it
>> > into the next release.
>>
>> > On May 11, 5:25 pm, Ben Wright <[email protected]> wrote:
>> > > Alternatively... instead of putting the onus on the compiler, the
>> > > hashcode could be computed by the JVM at initialization time for the
>> > > Descriptor instance, (which would also help performance of dynamically
>> > > parsed Descriptor instance hashcode calls).
>>
>> > > i.e.
>>
>> > > private final int computedHashcode;
>>
>> > > public Descriptor() {
>> > >    //initialization
>>
>> > >   computedHashcode = do_compute_hashCode();
>>
>> > > }
>>
>> > > public int hashCode() {
>> > >     return computedHashcode;
>>
>> > > }
>>
>> > > punlic int do_compute_hashCode(){
>> > >   return // compute hashcode
>>
>> > > }
>>
>> > > This is all talking towards optimum performance implementation... the
>> > > real problem is the need for a hashCode implementation for Descriptor
>> > > based on the actual Descriptor's content...
>>
>> > > On May 11, 4:54 pm, Ben Wright <[email protected]> wrote:
>>
>> > > > Jay:
>>
>> > > > Using the class name to generate the hashcode is logically incorrect
>> > > > because the class name can be derived by the options java_package_
>> > > > name and java_outer_classname.
>>
>> > > > Additionally (although less likely to matter), separate protocol
>> > > > buffer files can define an identical class names with different
>> > > > protocol buffers.
>>
>> > > > Lastly, and most importantly...
>>
>> > > > If the same Message is being used with generated code and with dynamic
>> > > > code, the hash code for the descriptor would still be identical if
>> > > > generated from the descriptor instance, whereas the dynamic usage does
>> > > > not have a classname from which to derive a hashcode.  While in your
>> > > > case this should not matter, it does matter for other users of
>> > > > protobuf.  The hashcode function would be better served by being
>> > > > implemented correctly from state data for the descriptor.
>> > > > Additionally, in generated code it seems that this hashcode could be
>> > > > pre-computed by the compiler and Descriptor.hashcode() could return a
>> > > > constant integer - which would be much more efficient than any other
>> > > > method.
>>
>> > > > On May 11, 3:02 pm, Jay Booth <[email protected]> wrote:
>>
>> > > > > It can be legitimate, especially in the case of Object.hashCode(),
>> > but
>> > > > > it's supposed to be in sync with equals() by contract.  As it stands,
>> > > > > two objects which are equal() will produce different hashes, or the
>> > > > > same logical object will produce different hashes across JVMs.  That
>> > > > > breaks the contract..  if the equals() method simply did return
>> > (other
>> > > > > == this), then it'd be fine, albeit a little useless.
>>
>> > > > > I created an issue and posted a 1-liner patch that would eliminate
>> > the
>> > > > > problem by using getClass().getName().hashCode() to incorporate type
>> > > > > information into the hashCode without depending on a Descriptor
>> > > > > object's memory address.
>>
>> > > > > On May 11, 12:01 am, Dmitriy Ryaboy <[email protected]> wrote:
>>
>> > > > > > Hi Jay,
>>
>> > > > > > I encountered that before. Unfortunately this is a legitimate thing
>> > to
>> > > > > > do, as documented in Object.hashCode()
>>
>> > > > > > I have a write-up of the problem and how we wound up solving it
>> > (not
>> > > > > > elegant.. suggestions welcome) here:
>> >http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hash...
>>
>> > > > > > D
>>
>> > > > > > On Mon, May 9, 2011 at 8:25 AM, Jay Booth <[email protected]>
>> > wrote:
>> > > > > > > I'm testing an on-disk hashtable with Protobufs and noticed that
>> > with
>> > > > > > > the java generated hashcode function, it seems to return a
>> > different
>> > > > > > > hashcode across JVM invocations for the same logically equivalent
>> > > > > > > object (tested with a single string protobuf, same string for
>> > both
>> > > > > > > instances).
>>
>> > > > > > > Is this known behavior?  Bit busy right now backporting this to
>> > work
>> > > > > > > with String keys instead but I could provide a bit of command
>> > line
>> > > > > > > code that demonstrates the issue when I get a chance.
>>
>> > > > > > > Glancing at the generated hashcode() function, it looks like the
>> > > > > > > difference comes from etiher getDescriptorForType().hashCode() or
>> > > > > > > getUnknownFields().hashCode(), both of which are incorporated.
>>
>> > > > > > > --
>> > > > > > > You received this message because you are subscribed to the
>> > Google Groups "Protocol Buffers" group.
>> > > > > > > To post to this group, send email to [email protected].
>> > > > > > > To unsubscribe from this group, send email to
>> > [email protected].
>> > > > > > > For more options, visit this group athttp://
>> > groups.google.com/group/protobuf?hl=en.
>>
>> > --
>> > You received this message because you are subscribed to the Google Groups
>> > "Protocol Buffers" group.
>> > To post to this group, send email to [email protected].
>> > To unsubscribe from this group, send email to
>> > [email protected].
>> > For more options, visit this group at
>> >http://groups.google.com/group/protobuf?hl=en.
>
>--
>You received this message because you are subscribed to the Google Groups 
>"Protocol Buffers" group.
>To post to this group, send email to [email protected].
>To unsubscribe from this group, send email to 
>[email protected].
>For more options, visit this group at 
>http://groups.google.com/group/protobuf?hl=en.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.

-- 
You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.

Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

Reply via email to