Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

Jason Hsueh Wed, 18 May 2011 14:43:47 -0700

On Wed, May 18, 2011 at 12:44 PM, Jay Booth <[email protected]> wrote:


> That definitely seems acceptable.  Any reason not to make the default
> implementation return toByteString().hashCode() and get the best of
> both worlds?  It's a little more expensive probably, but
> Object.hashCode() is already deceptively expensive to begin with.
>

I'm not well-enough versed in Java to say with authority, but as far as I
can tell, hashCode() is intended for HashMap/HashSet, and assuming that
Object.hashCode() provides some consistent hash is incorrect. Yes, protobuf
could attempt to provide a consistent hash with toByteString().hashCode(),
except we already know that there are caveats to that, so it's not like we
can provide any contractual guarantee. Applications like yours that need a
stable hash could implement a wrapper object that overrides hashCode() to
use this approach, or perhaps simply use a different method altogether to
express the stability requirement.


> I personally switched my particular problem to use Strings as keys,
>

I think you want to avoid using Strings to store the serializations since
this is binary data, and String will coerce that to UTF.


> since that works for the use case here and is easier for corresponding
> command-line tools, I just generally see a consistent hashcode as a
> good idea if supportable.
>
> On Wed, May 18, 2011 at 3:33 PM, Jason Hsueh <[email protected]> wrote:
> > Well, hashing the serialization as Pherl/Kenton suggested gives you that,
> > and seems cleaner and more direct than making hashCode() of the in-memory
> > object assume that you are doing something across JVMs.
> > (Also, unknown fields are pretty common when you may have a jungle of
> > binaries, which is true here.)
> >
> > On Wed, May 18, 2011 at 12:19 PM, Jay Booth <[email protected]> wrote:
> >>
> >> Right, I forgot that the Descriptor object is statically initialized
> >> so it'll be consistent within the JVM.
> >>
> >> I still think that in the context of serializable objects, (i.e.
> >> objects intended to be transported between JVMs in some way, shape or
> >> form) a consistent hashCode would be useful for a lot of cases (mine
> >> included).  If it can't be consistent in the presence of unknown
> >> fields, perhaps a well-documented caution to that (relatively
> >> uncommon?) case would be more useful than completely punting on all
> >> cases.
> >>
> >> On May 18, 3:08 pm, Ron Reynolds <[email protected]> wrote:
> >> > the corner-stone of Hash* containers is:
> >> >  (A.equals(B)) => (A.hashCode() == B.hashCode()) for all A, B.
> >> >
> >> > tho it's not explicitly stated it would seem to be implied that is
> >> > within a
> >> > single JVM.
> >> > not sure if the code in question maintains that rule within a JVM (if
> >> > not that's
> >> > a big deal).
> >> > if so that would seem sufficient for all but the most distributed of
> >> > Hash*
> >> > containers, such as where a client which is remote from the storage
> >> > (i.e, in
> >> > another JVM) determines the bucket (based on hashCode()) to find that
> >> > the
> >> > element in question has been placed into another bucket because the
> >> > hashCode()
> >> > within the containing JVM has evaluated to another value.  that's a
> >> > pretty
> >> > far-fetched, but not unimaginable, situation.
> >> >
> >> > ________________________________
> >> > From: Jason Hsueh <[email protected]>
> >> > To: Jay Booth <[email protected]>
> >> > Cc: Protocol Buffers <[email protected]>
> >> > Sent: Wed, May 18, 2011 12:00:19 PM
> >> > Subject: Re: [protobuf] Re: Generated hashcode() returns different
> >> > values across
> >> > JVM instances?
> >> >
> >> > Jumping in late to the thread, and I'm not really a Java person, so I
> >> > may be
> >> > misunderstanding something here. But as far as I can tell, you are
> >> > asking for
> >> > hashCode() to be a 'consistent' hash across processes. hashCode() as
> >> > implemented
> >> > is still useful within a single JVM, allowing you to use protobufs in
> >> > HashMaps
> >> > based on content rather than object identity. That was the intended
> use
> >> > case.
> >> >
> >> > On Wed, May 18, 2011 at 11:48 AM, Jay Booth <[email protected]>
> wrote:
> >> >
> >> > Well, that's your prerogative, I guess, but why even implement
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > >hashcode at all then?  Just inherit from object and you're getting
> >> > >effectively the same behavior.  Is that what you're intending?
> >> >
> >> > >On May 16, 10:03 am, Pherl Liu <[email protected]> wrote:
> >> > >> We discussed internally and decided not to make the hashCode()
> >> > >> return deterministic result. If you need consistent hashcode in
> >> > >> different
> >> > >> runs, use toByteString().hashCode().
> >> >
> >> > >> Quoted from Kenton:
> >> >
> >> > >> Hashing the content of the descriptor would actually be incorrect,
> >> > >> because
> >> > >> two descriptors with exactly the same content are still considered
> >> > >> different
> >> > >> types.  Descriptors are compared by identity, hence they are hashed
> >> > >> by
> >> > >> pointer.
> >> >
> >> > >> Removing the descriptor from the calculation would indeed make
> >> > >> hashCode()
> >> > >> consistent between two runs of the same binary, and probably
> >> > >> insignificant
> >> > >> runtime cost.  Of course, once you do that, you will never be able
> to
> >> > >> introduce non-determinism again because people will depend on it.
> >> >
> >> > >> But there's a much bigger risk.  People may actually start
> depending
> >> > >> on
> >> > >> hashCode() returning consistent results between two different
> >> > >> versions of
> >> > >> the binary, or two completely separate binaries that compile in the
> >> > >> same
> >> > >> protocol, or -- most dangerously -- two different versions of the
> >> > >> same
> >> > >> protocol (e.g. with fields added or removed).  I think it would be
> >> > >> very
> >> > >> difficult and limiting to make these guarantees, so I would be
> >> > >> extremely
> >> > >> cautious about this.
> >> >
> >> > >> Certainly, there is no implementation of hashCode() that would be
> any
> >> > >> safer
> >> > >> than .toByteString().hashCode().  So, I'd advise steering people to
> >> > >> the
> >> > >> latter.  Note that if unknown fields are present, the results may
> >> > >> still be
> >> > >> inconsistent.  However, there is no reasonable way to implement a
> >> > >> hashCode()
> >> > >> that is consistent in the presence of unknown fields.
> >> >
> >> > >> On Thu, May 12, 2011 at 5:32 AM, Ben Wright <
> [email protected]>
> >> > >> wrote:
> >> > >> > I think we wrote those replies at the same time : )
> >> >
> >> > >> > You're right, at the cost of some additional hash collisions, the
> >> > >> > simplest solution is to simply not include the type / descriptor
> in
> >> > >> > the hash calculation at all.
> >> >
> >> > >> > The best / least-collision solutions with good performance would
> be
> >> > >> > what I wrote in my previous post, but that requires that someone
> >> > >> > (presumably a current committer) with sufficient knowledge of the
> >> > >> > Descriptor types to have enough time to update the compiler and
> >> > >> > java
> >> > >> > libraries accordingly.
> >> >
> >> > >> > Any input from a committer for this issue?  Seems the simple
> >> > >> > solution
> >> > >> > would take less than an hour to push into the stream and could
> make
> >> > >> > it
> >> > >> > into the next release.
> >> >
> >> > >> > On May 11, 5:25 pm, Ben Wright <[email protected]> wrote:
> >> > >> > > Alternatively... instead of putting the onus on the compiler,
> the
> >> > >> > > hashcode could be computed by the JVM at initialization time
> for
> >> > >> > > the
> >> > >> > > Descriptor instance, (which would also help performance of
> >> > >> > > dynamically
> >> > >> > > parsed Descriptor instance hashcode calls).
> >> >
> >> > >> > > i.e.
> >> >
> >> > >> > > private final int computedHashcode;
> >> >
> >> > >> > > public Descriptor() {
> >> > >> > >    //initialization
> >> >
> >> > >> > >   computedHashcode = do_compute_hashCode();
> >> >
> >> > >> > > }
> >> >
> >> > >> > > public int hashCode() {
> >> > >> > >     return computedHashcode;
> >> >
> >> > >> > > }
> >> >
> >> > >> > > punlic int do_compute_hashCode(){
> >> > >> > >   return // compute hashcode
> >> >
> >> > >> > > }
> >> >
> >> > >> > > This is all talking towards optimum performance
> implementation...
> >> > >> > > the
> >> > >> > > real problem is the need for a hashCode implementation for
> >> > >> > > Descriptor
> >> > >> > > based on the actual Descriptor's content...
> >> >
> >> > >> > > On May 11, 4:54 pm, Ben Wright <[email protected]> wrote:
> >> >
> >> > >> > > > Jay:
> >> >
> >> > >> > > > Using the class name to generate the hashcode is logically
> >> > >> > > > incorrect
> >> > >> > > > because the class name can be derived by the options
> >> > >> > > > java_package_
> >> > >> > > > name and java_outer_classname.
> >> >
> >> > >> > > > Additionally (although less likely to matter), separate
> >> > >> > > > protocol
> >> > >> > > > buffer files can define an identical class names with
> different
> >> > >> > > > protocol buffers.
> >> >
> >> > >> > > > Lastly, and most importantly...
> >> >
> >> > >> > > > If the same Message is being used with generated code and
> with
> >> > >> > > > dynamic
> >> > >> > > > code, the hash code for the descriptor would still be
> identical
> >> > >> > > > if
> >> > >> > > > generated from the descriptor instance, whereas the dynamic
> >> > >> > > > usage does
> >> > >> > > > not have a classname from which to derive a hashcode.  While
> in
> >> > >> > > > your
> >> > >> > > > case this should not matter, it does matter for other users
> of
> >> > >> > > > protobuf.  The hashcode function would be better served by
> >> > >> > > > being
> >> > >> > > > implemented correctly from state data for the descriptor.
> >> > >> > > > Additionally, in generated code it seems that this hashcode
> >> > >> > > > could be
> >> > >> > > > pre-computed by the compiler and Descriptor.hashcode() could
> >> > >> > > > return a
> >> > >> > > > constant integer - which would be much more efficient than
> any
> >> > >> > > > other
> >> > >> > > > method.
> >> >
> >> > >> > > > On May 11, 3:02 pm, Jay Booth <[email protected]> wrote:
> >> >
> >> > >> > > > > It can be legitimate, especially in the case of
> >> > >> > > > > Object.hashCode(),
> >> > >> > but
> >> > >> > > > > it's supposed to be in sync with equals() by contract.  As
> it
> >> > >> > > > > stands,
> >> > >> > > > > two objects which are equal() will produce different
> hashes,
> >> > >> > > > > or the
> >> > >> > > > > same logical object will produce different hashes across
> >> > >> > > > > JVMs.  That
> >> > >> > > > > breaks the contract..  if the equals() method simply did
> >> > >> > > > > return
> >> > >> > (other
> >> > >> > > > > == this), then it'd be fine, albeit a little useless.
> >> >
> >> > >> > > > > I created an issue and posted a 1-liner patch that would
> >> > >> > > > > eliminate
> >> > >> > the
> >> > >> > > > > problem by using getClass().getName().hashCode() to
> >> > >> > > > > incorporate type
> >> > >> > > > > information into the hashCode without depending on a
> >> > >> > > > > Descriptor
> >> > >> > > > > object's memory address.
> >> >
> >> > >> > > > > On May 11, 12:01 am, Dmitriy Ryaboy <[email protected]>
> >> > >> > > > > wrote:
> >> >
> >> > >> > > > > > Hi Jay,
> >> >
> >> > >> > > > > > I encountered that before. Unfortunately this is a
> >> > >> > > > > > legitimate thing
> >> > >> > to
> >> > >> > > > > > do, as documented in Object.hashCode()
> >> >
> >> > >> > > > > > I have a write-up of the problem and how we wound up
> >> > >> > > > > > solving it
> >> > >> > (not
> >> > >> > > > > > elegant.. suggestions welcome) here:
> >> > >>
> >> > >> > >
> http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hash...
> >> >
> >> > >> > > > > > D
> >> >
> >> > >> > > > > > On Mon, May 9, 2011 at 8:25 AM, Jay Booth
> >> > >> > > > > > <[email protected]>
> >> > >> > wrote:
> >> > >> > > > > > > I'm testing an on-disk hashtable with Protobufs and
> >> > >> > > > > > > noticed that
> >> > >> > with
> >> > >> > > > > > > the java generated hashcode function, it seems to
> return
> >> > >> > > > > > > a
> >> > >> > different
> >> > >> > > > > > > hashcode across JVM invocations for the same logically
> >> > >> > > > > > > equivalent
> >> > >> > > > > > > object (tested with a single string protobuf, same
> string
> >> > >> > > > > > > for
> >> > >> > both
> >> > >> > > > > > > instances).
> >> >
> >> > >> > > > > > > Is this known behavior?  Bit busy right now backporting
> >> > >> > > > > > > this to
> >> > >> > work
> >> > >> > > > > > > with String keys instead but I could provide a bit of
> >> > >> > > > > > > command
> >> > >> > line
> >> > >> > > > > > > code that demonstrates the issue when I get a chance.
> >> >
> >> > >> > > > > > > Glancing at the generated hashcode() function, it looks
> >> > >> > > > > > > like the
> >> > >> > > > > > > difference comes from etiher
> >> > >> > > > > > > getDescriptorForType().hashCode() or
> >> > >> > > > > > > getUnknownFields().hashCode(), both of which are
> >> > >> > > > > > > incorporated.
> >> >
> >> > >> > > > > > > --
> >> > >> > > > > > > You received this message because you are subscribed to
> >> > >> > > > > > > the
> >> > >> > Google Groups "Protocol Buffers" group.
> >> > >> > > > > > > To post to this group, send email to
> >> > >> > > > > > > [email protected].
> >> > >> > > > > > > To unsubscribe from this group, send email to
> >> > >> > [email protected].
> >> > >> > > > > > > For more options, visit this group athttp://
> >> > >> > groups.google.com/group/protobuf?hl=en.
> >> >
> >> > >> > --
> >> > >> > You received this message because you are subscribed to the
> Google
> >> > >> > Groups
> >> > >> > "Protocol Buffers" group.
> >> > >> > To post to this group, send email to [email protected].
> >> > >> > To unsubscribe from this group, send email to
> >> > >> > [email protected].
> >> > >> > For more options, visit this group at
> >> > >> >http://groups.google.com/group/protobuf?hl=en.
> >> >
> >> > >--
> >> > >You received this message because you are subscribed to the Google
> >> > > Groups
> >> > >"Protocol Buffers" group.
> >> > >To post to this group, send email to [email protected].
> >> > >To unsubscribe from this group, send email to
> >> > >[email protected].
> >> > >For more options, visit this group at
> >> > >http://groups.google.com/group/protobuf?hl=en.
> >> >
> >> > --
> >> > You received this message because you are subscribed to the Google
> >> > Groups
> >> > "Protocol Buffers" group.
> >> > To post to this group, send email to ...
> >> >
> >> > read more »
> >>
> >> --
> >> You received this message because you are subscribed to the Google
> Groups
> >> "Protocol Buffers" group.
> >> To post to this group, send email to [email protected].
> >> To unsubscribe from this group, send email to
> >> [email protected].
> >> For more options, visit this group at
> >> http://groups.google.com/group/protobuf?hl=en.
> >>
> >
> >
>

-- 
You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.

Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

Reply via email to