I’m mainly a lurker on PDFBox these days, as I have moved on to focus on other things (though ironically my latest tasking does indirectly make use of PDFBox again!). I am not familiar with the details of the code, but I would offer the following advice on this:
If two objects:

a) are the same type,
b) have the same value (i.e., obj1.equals(obj2) == true), and
c) are immutable (i.e., their value cannot be changed once constructed),

then they should have the same hashcode. Conceptually such objects represent the same exact, immutable information. This is why two String objects that both hold the same character sequence have the same hashcode. These sorts of immutable objects are considered interchangeable when they have the same data value. Code execution is exactly the same regardless of which object instance you use for a given sequence of code.

Numbers such as Integers, Longs, Floats and Doubles, as well as Booleans, also all represent immutable information, and the same rules apply. The number “5” is informationally identical throughout the universe, and indeed all references to “5” are really references to the same immutable information.

A huge advantage of treating such equal-value, same-type objects interchangeably (by giving them identical hashcodes) is that they can be used with Object Pooling to reduce memory and improve performance.

If the object types are not immutable, however (i.e., if it is possible for their values to be modified, such as with setters or other mutators), then whether they should have the same hashcode depends on how they are used. Do they have other data fields that are not being taken into consideration when the hash is calculated? Do they have transient fields that are not maintained across serialization?

Hashcodes usually (not always) should be persistent through the life cycle of the object. If you put an object in a hashmap, the internal bucket it gets dropped into will be based on the hash, and you (normally) don’t want that changing while the object reference is stored in the map.

I cannot recall enough detail about the PDFBox codebase or the COSxxxx wrappers in particular to be able to assert how these points apply, so I am just offering these concepts here for folks to keep in mind.
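The two cases above (interchangeable immutable values, and a hash that changes while the object sits in a hash-based collection) can be sketched roughly as follows. ImmutableValue and MutableValue are hypothetical classes made up for this illustration; they are not PDFBox types and say nothing about how the COSxxxx wrappers are actually implemented.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical classes for illustration only; they are not part of PDFBox.
final class HashCodeSketch {

    /** Immutable value: equals and hashCode depend only on a final field. */
    static final class ImmutableValue {
        private final long value;
        ImmutableValue(long value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof ImmutableValue && ((ImmutableValue) o).value == value;
        }
        @Override public int hashCode() { return Long.hashCode(value); }
    }

    /** Mutable value: the hash changes whenever the setter is called. */
    static final class MutableValue {
        private long value;
        MutableValue(long value) { this.value = value; }
        void setValue(long value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof MutableValue && ((MutableValue) o).value == value;
        }
        @Override public int hashCode() { return Long.hashCode(value); }
    }

    /** Two distinct but equal immutable instances are interchangeable as map keys. */
    static boolean immutableKeysAreInterchangeable() {
        Map<ImmutableValue, String> map = new HashMap<>();
        map.put(new ImmutableValue(5), "five");
        return "five".equals(map.get(new ImmutableValue(5)));
    }

    /** Mutating an element after insertion means lookups use a hash it was not stored under. */
    static boolean mutatedKeyIsStillFound() {
        Set<MutableValue> set = new HashSet<>();
        MutableValue key = new MutableValue(5);
        set.add(key);
        key.setValue(6); // hash changes while the object is stored in the set
        return set.contains(key);
    }

    public static void main(String[] args) {
        System.out.println(immutableKeysAreInterchangeable()); // prints "true"
        System.out.println(mutatedKeyIsStillFound());          // prints "false"
    }
}
```

The second result is the reason a value-based hash is only safe when the fields it is computed from cannot change while the object is held in a map or set.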
I’m sure you guys will make the right design decision. And thanks a ton for all the work you guys have done over the years.

Mel

Dr. Mel Martinez
m.marti...@ll.mit.edu

On Mar 5, 2022, at 10:30 AM, Andreas Lehmkuehler <andr...@lehmi.de> wrote:

Hi,

I'm not sure if we discussed that topic in the past or if I simply mixed it up with a discussion about "equals" and "==".

However, PDFBOX-5286 shows that we have an issue with objects which aren't the same but are treated as the same because of the same hash. This is true for all simple objects such as COSInteger, COSFloat, COSBoolean and COSName.

Think about the following two indirect /Length objects

100 0 obj
512
endobj

200 0 obj
512
endobj

* there are two different COSObjects "100 0" and "200 0"
* both COSObjects have different hashes
* both COSObjects are referencing a COSInteger holding the same value "512"
* both COSIntegers are different objects
* both COSIntegers have the SAME hash, as the current implementation of hashCode is based on the value of the COSInteger

Or some pseudo code

COSObject(100,0) != COSObject(200,0)
COSInteger(100,0) != COSInteger(200,0)
COSObject(100,0).hashCode != COSObject(200,0).hashCode
COSInteger(100,0).hashCode == COSInteger(200,0).hashCode
COSInteger(100,0).equals(COSInteger(200,0)) == true

IMHO we should change the implementation of hashCode so that different objects will have different hashCodes.
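As a rough illustration of the situation in the pseudo code, the sketch below uses a hypothetical IntValue class as a stand-in for a wrapper whose equals and hashCode are based on its value. It is not PDFBox code. It shows how a value-based HashSet collapses two distinct instances holding 512 into one entry, whereas the JDK's IdentityHashMap, which compares keys by reference, keeps them apart. IdentityHashMap is used here only for contrast; it is not the change proposed in this message.

```java
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a value-hashed wrapper; not PDFBox code.
final class IdentityVsValueSketch {

    static final class IntValue {
        private final long value;
        IntValue(long value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof IntValue && ((IntValue) o).value == value;
        }
        @Override public int hashCode() { return Long.hashCode(value); }
    }

    /** A HashSet uses equals/hashCode, so two equal instances count as one element. */
    static int valueBasedCount() {
        Set<IntValue> set = new HashSet<>();
        set.add(new IntValue(512)); // think of the value of "100 0 obj"
        set.add(new IntValue(512)); // think of the value of "200 0 obj"
        return set.size();
    }

    /** An IdentityHashMap compares keys with ==, so distinct instances stay distinct. */
    static int identityBasedCount() {
        Map<IntValue, Boolean> seen = new IdentityHashMap<>();
        seen.put(new IntValue(512), Boolean.TRUE);
        seen.put(new IntValue(512), Boolean.TRUE);
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(valueBasedCount());    // prints "1"
        System.out.println(identityBasedCount()); // prints "2"
    }
}
```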
I expect some side effects

* we are using a lot of hash-based collections and I'm afraid there may be some cases where having the same hash for different objects is wanted (knowingly or not)
* we have to remove the static instances for COSInteger values in a range from -100 to 256, which will result in an increased number of COSInteger instances
* there are just two static instances of COSBoolean ("true" and "false") which have to be replaced too
* COSName is caching a lot of values as static instances as well, which should be removed too
* looks like COSFloat shouldn't be a problem

WDYT? Should we simply start with COSFloat and COSInteger and see how it ends up?

Andreas