On Wed, May 18, 2011 at 12:44 PM, Jay Booth <[email protected]> wrote:
> That definitely seems acceptable. Any reason not to make the default > implementation return toByteString().hashCode() and get the best of > both worlds? It's a little more expensive probably, but > Object.hashCode() is already deceptively expensive to begin with. > I'm not well-enough versed in Java to say with authority, but as far as I can tell, hashCode() is intended for HashMap/HashSet, and assuming that Object.hashCode() provides some consistent hash is incorrect. Yes, protobuf could attempt to provide a consistent hash with toByteString().hashCode(), except we already know that there are caveats to that, so it's not like we can provide any contractual guarantee. Applications like yours that need a stable hash could implement a wrapper object that overrides hashCode() to use this approach, or perhaps simply use a different method altogether to express the stability requirement. > I personally switched my particular problem to use Strings as keys, > I think you want to avoid using Strings to store the serializations since this is binary data, and String will coerce that to UTF. > since that works for the use case here and is easier for corresponding > command-line tools, I just generally see a consistent hashcode as a > good idea if supportable. > > On Wed, May 18, 2011 at 3:33 PM, Jason Hsueh <[email protected]> wrote: > > Well, hashing the serialization as Pherl/Kenton suggested gives you that, > > and seems cleaner and more direct than making hashCode() of the in-memory > > object assume that you are doing something across JVMs. > > (Also, unknown fields are pretty common when you may have a jungle of > > binaries, which is true here.) > > > > On Wed, May 18, 2011 at 12:19 PM, Jay Booth <[email protected]> wrote: > >> > >> Right, I forgot that the Descriptor object is statically initialized > >> so it'll be consistent within the JVM. > >> > >> I still think that in the context of serializable objects, (i.e. > >> objects intended to be transported between JVMs in some way, shape or > >> form) a consistent hashCode would be useful for a lot of cases (mine > >> included). If it can't be consistent in the presence of unknown > >> fields, perhaps a well-documented caution to that (relatively > >> uncommon?) case would be more useful than completely punting on all > >> cases. > >> > >> On May 18, 3:08 pm, Ron Reynolds <[email protected]> wrote: > >> > the corner-stone of Hash* containers is: > >> > (A.equals(B)) => (A.hashCode() == B.hashCode()) for all A, B. > >> > > >> > tho it's not explicitly stated it would seem to be implied that is > >> > within a > >> > single JVM. > >> > not sure if the code in question maintains that rule within a JVM (if > >> > not that's > >> > a big deal). > >> > if so that would seem sufficient for all but the most distributed of > >> > Hash* > >> > containers, such as where a client which is remote from the storage > >> > (i.e, in > >> > another JVM) determines the bucket (based on hashCode()) to find that > >> > the > >> > element in question has been placed into another bucket because the > >> > hashCode() > >> > within the containing JVM has evaluated to another value. that's a > >> > pretty > >> > far-fetched, but not unimaginable, situation. > >> > > >> > ________________________________ > >> > From: Jason Hsueh <[email protected]> > >> > To: Jay Booth <[email protected]> > >> > Cc: Protocol Buffers <[email protected]> > >> > Sent: Wed, May 18, 2011 12:00:19 PM > >> > Subject: Re: [protobuf] Re: Generated hashcode() returns different > >> > values across > >> > JVM instances? > >> > > >> > Jumping in late to the thread, and I'm not really a Java person, so I > >> > may be > >> > misunderstanding something here. But as far as I can tell, you are > >> > asking for > >> > hashCode() to be a 'consistent' hash across processes. hashCode() as > >> > implemented > >> > is still useful within a single JVM, allowing you to use protobufs in > >> > HashMaps > >> > based on content rather than object identity. That was the intended > use > >> > case. > >> > > >> > On Wed, May 18, 2011 at 11:48 AM, Jay Booth <[email protected]> > wrote: > >> > > >> > Well, that's your prerogative, I guess, but why even implement > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > >hashcode at all then? Just inherit from object and you're getting > >> > >effectively the same behavior. Is that what you're intending? > >> > > >> > >On May 16, 10:03 am, Pherl Liu <[email protected]> wrote: > >> > >> We discussed internally and decided not to make the hashCode() > >> > >> return deterministic result. If you need consistent hashcode in > >> > >> different > >> > >> runs, use toByteString().hashCode(). > >> > > >> > >> Quoted from Kenton: > >> > > >> > >> Hashing the content of the descriptor would actually be incorrect, > >> > >> because > >> > >> two descriptors with exactly the same content are still considered > >> > >> different > >> > >> types. Descriptors are compared by identity, hence they are hashed > >> > >> by > >> > >> pointer. > >> > > >> > >> Removing the descriptor from the calculation would indeed make > >> > >> hashCode() > >> > >> consistent between two runs of the same binary, and probably > >> > >> insignificant > >> > >> runtime cost. Of course, once you do that, you will never be able > to > >> > >> introduce non-determinism again because people will depend on it. > >> > > >> > >> But there's a much bigger risk. People may actually start > depending > >> > >> on > >> > >> hashCode() returning consistent results between two different > >> > >> versions of > >> > >> the binary, or two completely separate binaries that compile in the > >> > >> same > >> > >> protocol, or -- most dangerously -- two different versions of the > >> > >> same > >> > >> protocol (e.g. with fields added or removed). I think it would be > >> > >> very > >> > >> difficult and limiting to make these guarantees, so I would be > >> > >> extremely > >> > >> cautious about this. > >> > > >> > >> Certainly, there is no implementation of hashCode() that would be > any > >> > >> safer > >> > >> than .toByteString().hashCode(). So, I'd advise steering people to > >> > >> the > >> > >> latter. Note that if unknown fields are present, the results may > >> > >> still be > >> > >> inconsistent. However, there is no reasonable way to implement a > >> > >> hashCode() > >> > >> that is consistent in the presence of unknown fields. > >> > > >> > >> On Thu, May 12, 2011 at 5:32 AM, Ben Wright < > [email protected]> > >> > >> wrote: > >> > >> > I think we wrote those replies at the same time : ) > >> > > >> > >> > You're right, at the cost of some additional hash collisions, the > >> > >> > simplest solution is to simply not include the type / descriptor > in > >> > >> > the hash calculation at all. > >> > > >> > >> > The best / least-collision solutions with good performance would > be > >> > >> > what I wrote in my previous post, but that requires that someone > >> > >> > (presumably a current committer) with sufficient knowledge of the > >> > >> > Descriptor types to have enough time to update the compiler and > >> > >> > java > >> > >> > libraries accordingly. > >> > > >> > >> > Any input from a committer for this issue? Seems the simple > >> > >> > solution > >> > >> > would take less than an hour to push into the stream and could > make > >> > >> > it > >> > >> > into the next release. > >> > > >> > >> > On May 11, 5:25 pm, Ben Wright <[email protected]> wrote: > >> > >> > > Alternatively... instead of putting the onus on the compiler, > the > >> > >> > > hashcode could be computed by the JVM at initialization time > for > >> > >> > > the > >> > >> > > Descriptor instance, (which would also help performance of > >> > >> > > dynamically > >> > >> > > parsed Descriptor instance hashcode calls). > >> > > >> > >> > > i.e. > >> > > >> > >> > > private final int computedHashcode; > >> > > >> > >> > > public Descriptor() { > >> > >> > > //initialization > >> > > >> > >> > > computedHashcode = do_compute_hashCode(); > >> > > >> > >> > > } > >> > > >> > >> > > public int hashCode() { > >> > >> > > return computedHashcode; > >> > > >> > >> > > } > >> > > >> > >> > > punlic int do_compute_hashCode(){ > >> > >> > > return // compute hashcode > >> > > >> > >> > > } > >> > > >> > >> > > This is all talking towards optimum performance > implementation... > >> > >> > > the > >> > >> > > real problem is the need for a hashCode implementation for > >> > >> > > Descriptor > >> > >> > > based on the actual Descriptor's content... > >> > > >> > >> > > On May 11, 4:54 pm, Ben Wright <[email protected]> wrote: > >> > > >> > >> > > > Jay: > >> > > >> > >> > > > Using the class name to generate the hashcode is logically > >> > >> > > > incorrect > >> > >> > > > because the class name can be derived by the options > >> > >> > > > java_package_ > >> > >> > > > name and java_outer_classname. > >> > > >> > >> > > > Additionally (although less likely to matter), separate > >> > >> > > > protocol > >> > >> > > > buffer files can define an identical class names with > different > >> > >> > > > protocol buffers. > >> > > >> > >> > > > Lastly, and most importantly... > >> > > >> > >> > > > If the same Message is being used with generated code and > with > >> > >> > > > dynamic > >> > >> > > > code, the hash code for the descriptor would still be > identical > >> > >> > > > if > >> > >> > > > generated from the descriptor instance, whereas the dynamic > >> > >> > > > usage does > >> > >> > > > not have a classname from which to derive a hashcode. While > in > >> > >> > > > your > >> > >> > > > case this should not matter, it does matter for other users > of > >> > >> > > > protobuf. The hashcode function would be better served by > >> > >> > > > being > >> > >> > > > implemented correctly from state data for the descriptor. > >> > >> > > > Additionally, in generated code it seems that this hashcode > >> > >> > > > could be > >> > >> > > > pre-computed by the compiler and Descriptor.hashcode() could > >> > >> > > > return a > >> > >> > > > constant integer - which would be much more efficient than > any > >> > >> > > > other > >> > >> > > > method. > >> > > >> > >> > > > On May 11, 3:02 pm, Jay Booth <[email protected]> wrote: > >> > > >> > >> > > > > It can be legitimate, especially in the case of > >> > >> > > > > Object.hashCode(), > >> > >> > but > >> > >> > > > > it's supposed to be in sync with equals() by contract. As > it > >> > >> > > > > stands, > >> > >> > > > > two objects which are equal() will produce different > hashes, > >> > >> > > > > or the > >> > >> > > > > same logical object will produce different hashes across > >> > >> > > > > JVMs. That > >> > >> > > > > breaks the contract.. if the equals() method simply did > >> > >> > > > > return > >> > >> > (other > >> > >> > > > > == this), then it'd be fine, albeit a little useless. > >> > > >> > >> > > > > I created an issue and posted a 1-liner patch that would > >> > >> > > > > eliminate > >> > >> > the > >> > >> > > > > problem by using getClass().getName().hashCode() to > >> > >> > > > > incorporate type > >> > >> > > > > information into the hashCode without depending on a > >> > >> > > > > Descriptor > >> > >> > > > > object's memory address. > >> > > >> > >> > > > > On May 11, 12:01 am, Dmitriy Ryaboy <[email protected]> > >> > >> > > > > wrote: > >> > > >> > >> > > > > > Hi Jay, > >> > > >> > >> > > > > > I encountered that before. Unfortunately this is a > >> > >> > > > > > legitimate thing > >> > >> > to > >> > >> > > > > > do, as documented in Object.hashCode() > >> > > >> > >> > > > > > I have a write-up of the problem and how we wound up > >> > >> > > > > > solving it > >> > >> > (not > >> > >> > > > > > elegant.. suggestions welcome) here: > >> > >> > >> > >> > > > http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hash... > >> > > >> > >> > > > > > D > >> > > >> > >> > > > > > On Mon, May 9, 2011 at 8:25 AM, Jay Booth > >> > >> > > > > > <[email protected]> > >> > >> > wrote: > >> > >> > > > > > > I'm testing an on-disk hashtable with Protobufs and > >> > >> > > > > > > noticed that > >> > >> > with > >> > >> > > > > > > the java generated hashcode function, it seems to > return > >> > >> > > > > > > a > >> > >> > different > >> > >> > > > > > > hashcode across JVM invocations for the same logically > >> > >> > > > > > > equivalent > >> > >> > > > > > > object (tested with a single string protobuf, same > string > >> > >> > > > > > > for > >> > >> > both > >> > >> > > > > > > instances). > >> > > >> > >> > > > > > > Is this known behavior? Bit busy right now backporting > >> > >> > > > > > > this to > >> > >> > work > >> > >> > > > > > > with String keys instead but I could provide a bit of > >> > >> > > > > > > command > >> > >> > line > >> > >> > > > > > > code that demonstrates the issue when I get a chance. > >> > > >> > >> > > > > > > Glancing at the generated hashcode() function, it looks > >> > >> > > > > > > like the > >> > >> > > > > > > difference comes from etiher > >> > >> > > > > > > getDescriptorForType().hashCode() or > >> > >> > > > > > > getUnknownFields().hashCode(), both of which are > >> > >> > > > > > > incorporated. > >> > > >> > >> > > > > > > -- > >> > >> > > > > > > You received this message because you are subscribed to > >> > >> > > > > > > the > >> > >> > Google Groups "Protocol Buffers" group. > >> > >> > > > > > > To post to this group, send email to > >> > >> > > > > > > [email protected]. > >> > >> > > > > > > To unsubscribe from this group, send email to > >> > >> > [email protected]. > >> > >> > > > > > > For more options, visit this group athttp:// > >> > >> > groups.google.com/group/protobuf?hl=en. > >> > > >> > >> > -- > >> > >> > You received this message because you are subscribed to the > Google > >> > >> > Groups > >> > >> > "Protocol Buffers" group. > >> > >> > To post to this group, send email to [email protected]. > >> > >> > To unsubscribe from this group, send email to > >> > >> > [email protected]. > >> > >> > For more options, visit this group at > >> > >> >http://groups.google.com/group/protobuf?hl=en. > >> > > >> > >-- > >> > >You received this message because you are subscribed to the Google > >> > > Groups > >> > >"Protocol Buffers" group. > >> > >To post to this group, send email to [email protected]. > >> > >To unsubscribe from this group, send email to > >> > >[email protected]. > >> > >For more options, visit this group at > >> > >http://groups.google.com/group/protobuf?hl=en. > >> > > >> > -- > >> > You received this message because you are subscribed to the Google > >> > Groups > >> > "Protocol Buffers" group. > >> > To post to this group, send email to ... > >> > > >> > read more ยป > >> > >> -- > >> You received this message because you are subscribed to the Google > Groups > >> "Protocol Buffers" group. > >> To post to this group, send email to [email protected]. > >> To unsubscribe from this group, send email to > >> [email protected]. > >> For more options, visit this group at > >> http://groups.google.com/group/protobuf?hl=en. > >> > > > > > -- You received this message because you are subscribed to the Google Groups "Protocol Buffers" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.
