[protobuf] Re: Generated hashcode() returns different values across JVM instances?

Jay Booth Wed, 18 May 2011 12:19:30 -0700

Right, I forgot that the Descriptor object is statically initialized
so it'll be consistent within the JVM.


I still think that in the context of serializable objects, (i.e.
objects intended to be transported between JVMs in some way, shape or
form) a consistent hashCode would be useful for a lot of cases (mine
included).  If it can't be consistent in the presence of unknown
fields, perhaps a well-documented caution to that (relatively
uncommon?) case would be more useful than completely punting on all
cases.

On May 18, 3:08 pm, Ron Reynolds <[email protected]> wrote:
> the corner-stone of Hash* containers is:
>  (A.equals(B)) => (A.hashCode() == B.hashCode()) for all A, B.  
>
> tho it's not explicitly stated it would seem to be implied that is within a
> single JVM.
> not sure if the code in question maintains that rule within a JVM (if not 
> that's
> a big deal).  
> if so that would seem sufficient for all but the most distributed of Hash*
> containers, such as where a client which is remote from the storage (i.e, in
> another JVM) determines the bucket (based on hashCode()) to find that the
> element in question has been placed into another bucket because the hashCode()
> within the containing JVM has evaluated to another value.  that's a pretty
> far-fetched, but not unimaginable, situation.
>
> ________________________________
> From: Jason Hsueh <[email protected]>
> To: Jay Booth <[email protected]>
> Cc: Protocol Buffers <[email protected]>
> Sent: Wed, May 18, 2011 12:00:19 PM
> Subject: Re: [protobuf] Re: Generated hashcode() returns different values 
> across
> JVM instances?
>
> Jumping in late to the thread, and I'm not really a Java person, so I may be
> misunderstanding something here. But as far as I can tell, you are asking for
> hashCode() to be a 'consistent' hash across processes. hashCode() as 
> implemented
> is still useful within a single JVM, allowing you to use protobufs in HashMaps
> based on content rather than object identity. That was the intended use case.
>
> On Wed, May 18, 2011 at 11:48 AM, Jay Booth <[email protected]> wrote:
>
> Well, that's your prerogative, I guess, but why even implement
>
>
>
>
>
>
>
>
>
> >hashcode at all then?  Just inherit from object and you're getting
> >effectively the same behavior.  Is that what you're intending?
>
> >On May 16, 10:03 am, Pherl Liu <[email protected]> wrote:
> >> We discussed internally and decided not to make the hashCode()
> >> return deterministic result. If you need consistent hashcode in different
> >> runs, use toByteString().hashCode().
>
> >> Quoted from Kenton:
>
> >> Hashing the content of the descriptor would actually be incorrect, because
> >> two descriptors with exactly the same content are still considered 
> >> different
> >> types.  Descriptors are compared by identity, hence they are hashed by
> >> pointer.
>
> >> Removing the descriptor from the calculation would indeed make hashCode()
> >> consistent between two runs of the same binary, and probably insignificant
> >> runtime cost.  Of course, once you do that, you will never be able to
> >> introduce non-determinism again because people will depend on it.
>
> >> But there's a much bigger risk.  People may actually start depending on
> >> hashCode() returning consistent results between two different versions of
> >> the binary, or two completely separate binaries that compile in the same
> >> protocol, or -- most dangerously -- two different versions of the same
> >> protocol (e.g. with fields added or removed).  I think it would be very
> >> difficult and limiting to make these guarantees, so I would be extremely
> >> cautious about this.
>
> >> Certainly, there is no implementation of hashCode() that would be any safer
> >> than .toByteString().hashCode().  So, I'd advise steering people to the
> >> latter.  Note that if unknown fields are present, the results may still be
> >> inconsistent.  However, there is no reasonable way to implement a 
> >> hashCode()
> >> that is consistent in the presence of unknown fields.
>
> >> On Thu, May 12, 2011 at 5:32 AM, Ben Wright <[email protected]> wrote:
> >> > I think we wrote those replies at the same time : )
>
> >> > You're right, at the cost of some additional hash collisions, the
> >> > simplest solution is to simply not include the type / descriptor in
> >> > the hash calculation at all.
>
> >> > The best / least-collision solutions with good performance would be
> >> > what I wrote in my previous post, but that requires that someone
> >> > (presumably a current committer) with sufficient knowledge of the
> >> > Descriptor types to have enough time to update the compiler and java
> >> > libraries accordingly.
>
> >> > Any input from a committer for this issue?  Seems the simple solution
> >> > would take less than an hour to push into the stream and could make it
> >> > into the next release.
>
> >> > On May 11, 5:25 pm, Ben Wright <[email protected]> wrote:
> >> > > Alternatively... instead of putting the onus on the compiler, the
> >> > > hashcode could be computed by the JVM at initialization time for the
> >> > > Descriptor instance, (which would also help performance of dynamically
> >> > > parsed Descriptor instance hashcode calls).
>
> >> > > i.e.
>
> >> > > private final int computedHashcode;
>
> >> > > public Descriptor() {
> >> > >    //initialization
>
> >> > >   computedHashcode = do_compute_hashCode();
>
> >> > > }
>
> >> > > public int hashCode() {
> >> > >     return computedHashcode;
>
> >> > > }
>
> >> > > punlic int do_compute_hashCode(){
> >> > >   return // compute hashcode
>
> >> > > }
>
> >> > > This is all talking towards optimum performance implementation... the
> >> > > real problem is the need for a hashCode implementation for Descriptor
> >> > > based on the actual Descriptor's content...
>
> >> > > On May 11, 4:54 pm, Ben Wright <[email protected]> wrote:
>
> >> > > > Jay:
>
> >> > > > Using the class name to generate the hashcode is logically incorrect
> >> > > > because the class name can be derived by the options java_package_
> >> > > > name and java_outer_classname.
>
> >> > > > Additionally (although less likely to matter), separate protocol
> >> > > > buffer files can define an identical class names with different
> >> > > > protocol buffers.
>
> >> > > > Lastly, and most importantly...
>
> >> > > > If the same Message is being used with generated code and with 
> >> > > > dynamic
> >> > > > code, the hash code for the descriptor would still be identical if
> >> > > > generated from the descriptor instance, whereas the dynamic usage 
> >> > > > does
> >> > > > not have a classname from which to derive a hashcode.  While in your
> >> > > > case this should not matter, it does matter for other users of
> >> > > > protobuf.  The hashcode function would be better served by being
> >> > > > implemented correctly from state data for the descriptor.
> >> > > > Additionally, in generated code it seems that this hashcode could be
> >> > > > pre-computed by the compiler and Descriptor.hashcode() could return a
> >> > > > constant integer - which would be much more efficient than any other
> >> > > > method.
>
> >> > > > On May 11, 3:02 pm, Jay Booth <[email protected]> wrote:
>
> >> > > > > It can be legitimate, especially in the case of Object.hashCode(),
> >> > but
> >> > > > > it's supposed to be in sync with equals() by contract.  As it 
> >> > > > > stands,
> >> > > > > two objects which are equal() will produce different hashes, or the
> >> > > > > same logical object will produce different hashes across JVMs.  
> >> > > > > That
> >> > > > > breaks the contract..  if the equals() method simply did return
> >> > (other
> >> > > > > == this), then it'd be fine, albeit a little useless.
>
> >> > > > > I created an issue and posted a 1-liner patch that would eliminate
> >> > the
> >> > > > > problem by using getClass().getName().hashCode() to incorporate 
> >> > > > > type
> >> > > > > information into the hashCode without depending on a Descriptor
> >> > > > > object's memory address.
>
> >> > > > > On May 11, 12:01 am, Dmitriy Ryaboy <[email protected]> wrote:
>
> >> > > > > > Hi Jay,
>
> >> > > > > > I encountered that before. Unfortunately this is a legitimate 
> >> > > > > > thing
> >> > to
> >> > > > > > do, as documented in Object.hashCode()
>
> >> > > > > > I have a write-up of the problem and how we wound up solving it
> >> > (not
> >> > > > > > elegant.. suggestions welcome) here:
> >> >http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hash...
>
> >> > > > > > D
>
> >> > > > > > On Mon, May 9, 2011 at 8:25 AM, Jay Booth <[email protected]>
> >> > wrote:
> >> > > > > > > I'm testing an on-disk hashtable with Protobufs and noticed 
> >> > > > > > > that
> >> > with
> >> > > > > > > the java generated hashcode function, it seems to return a
> >> > different
> >> > > > > > > hashcode across JVM invocations for the same logically 
> >> > > > > > > equivalent
> >> > > > > > > object (tested with a single string protobuf, same string for
> >> > both
> >> > > > > > > instances).
>
> >> > > > > > > Is this known behavior?  Bit busy right now backporting this to
> >> > work
> >> > > > > > > with String keys instead but I could provide a bit of command
> >> > line
> >> > > > > > > code that demonstrates the issue when I get a chance.
>
> >> > > > > > > Glancing at the generated hashcode() function, it looks like 
> >> > > > > > > the
> >> > > > > > > difference comes from etiher getDescriptorForType().hashCode() 
> >> > > > > > > or
> >> > > > > > > getUnknownFields().hashCode(), both of which are incorporated.
>
> >> > > > > > > --
> >> > > > > > > You received this message because you are subscribed to the
> >> > Google Groups "Protocol Buffers" group.
> >> > > > > > > To post to this group, send email to [email protected].
> >> > > > > > > To unsubscribe from this group, send email to
> >> > [email protected].
> >> > > > > > > For more options, visit this group athttp://
> >> > groups.google.com/group/protobuf?hl=en.
>
> >> > --
> >> > You received this message because you are subscribed to the Google Groups
> >> > "Protocol Buffers" group.
> >> > To post to this group, send email to [email protected].
> >> > To unsubscribe from this group, send email to
> >> > [email protected].
> >> > For more options, visit this group at
> >> >http://groups.google.com/group/protobuf?hl=en.
>
> >--
> >You received this message because you are subscribed to the Google Groups
> >"Protocol Buffers" group.
> >To post to this group, send email to [email protected].
> >To unsubscribe from this group, send email to
> >[email protected].
> >For more options, visit this group at
> >http://groups.google.com/group/protobuf?hl=en.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Protocol Buffers" group.
> To post to this group, send email to ...
>
> read more »

-- 
You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.

[protobuf] Re: Generated hashcode() returns different values across JVM instances?

Reply via email to