As a third option as @melloware said in the pull request comments the original implementation came from Apache Hive. The current hashes could be named hash31Hive and hash128Hive
On Mon, Nov 4, 2019 at 12:53 AM Claude Warren <[email protected]> wrote: > I think the way to prove they behave correctly is to test the results > against the original "C" implementation. With this in mind I proposed the > changes. > > I agree that there should be a way to revert to the old code as there is > code in the wild that depends on the earlier implementation. > > I know that there are broken implementations in other Apache projects, but > those are not libraries (i.e. Cassandra). > > I would suggest that perhaps the changes that I provided be renamed as > hash128Standard and hash32Standard (or some other designator to show that > it is different from the original CODEC implementation). > > Alternatively hash128 and hash32 could be deprecated and replaced with > hashx86_32Old, and hashx64_128Old and provide the corrected versions for > hashx86_32 and hashx64_128. Obviously this leaves the door open to a > hashx86_128 implementation. > > My opinion is that we should provide Murmur3 hash implementations that > produce the same results as the "C"/C++ implementations for cross > application compatibility. > > Claude > > On Sun, Nov 3, 2019 at 11:34 PM sebb <[email protected]> wrote: > >> As I see it, there are two use cases here. >> >> 1) Commons Codec code is used exclusively, in which case it is >> essential that the output does not change for a given seed. >> >> 2) Commons Codec is used in conjunction with other implementations, in >> which case it is essential that Codec is aligned with other >> implementations >> >> Both use cases seem valid to me, so I think it's important to allow >> both old and new algorithms to co-exist going forward. >> >> However: >> Should the default behaviour change (with option to revert)? >> Or would it be better to make the corrected behaviour optional? >> >> Or deliberately break the API to force users to choose between old and >> new behaviour? >> >> Note that Wikipedia says that the x86 and x64 versions of the >> algorithm generate different results in the case of the 128 bit hash. >> However Commons Codec does not have different implementations for x86 >> and x64; it is not clear which has been implemented. >> The JavaDoc also fails to mention that there is a 64bit hash. >> >> The class clearly needs additional test data; it's possible that there >> are discrepancies in other hash sizes as well. >> >> The code has multiple implementations for each hash size depending on >> the parameter types, which increases the number of tests needed. >> >> Note that the original C++ implementation only has 3 hash methods: >> MurmurHash3_x86_32 >> MurmurHash3_x86_128 >> MurmurHash3_x64_128 >> >> AFAICT these all take the same parameters: >> - unsigned byte array >> - length >> - unsigned 32 bit seed >> - pointer to output buffer >> >> No idea why the Codec version allows additional input such as one (or >> two) longs, etc. >> Unless there is a defined standard for how these should be handled, it >> will be impossible to prove that they behave correctly. >> >> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <[email protected]> >> wrote: >> > >> > >> > >> > > On 3 Nov 2019, at 21:45, Gary Gregory <[email protected]> wrote: >> > > >> > > I feel like I am missing something basic in the assumption of this >> issue: >> > > there is no such thing as an unsigned int in Java and the ticket talks >> > > about (C?) unsigned ints. Please help me understand how or why we >> should >> > > care about C vs. Java ints. Are we comparing apples to oranges here? >> > >> > When a byte is converted to an int there is sign extension if it is >> negative. A negative byte will have all bits above the 8th bit set to 1. So >> if the byte is negative then when converted to an int for bit shift and xor >> operations the raw bits are not the same. >> > >> > These are not the same: >> > >> > byte b = -1; >> > (int) b != (b & 0xff); >> > b << 8 != (b & 0xff) << 8; >> > b << 16 != (b & 0xff) << 16; >> > >> > The original code has the use of the 0xff mask for most of the murmur3 >> algorithm. It has been missed from the final steps applied the the last 3 >> bytes in the hash32 algorithm variant. >> > >> > Alex >> > >> > >> > >> > > >> > > Thank you, >> > > Gary >> > > >> > > On Sun, Nov 3, 2019, 07:59 Claude Warren <[email protected]> wrote: >> > > >> > >> There is an error in the current Murmur3 code introduced by sign >> extension >> > >> errors. This is documented in CODEC-264.[1] >> > >> >> > >> I have created a pull request to fix it.[2] >> > >> >> > >> While the code changes did not change any of the existing Murmur3 >> tests, I >> > >> did add new tests that failed until the changes were applied. >> > >> >> > >> However, I am concerned that the commons-codec Murmur3 code is in >> the wild >> > >> and this change will impact some hashes. >> > >> >> > >> Therefore, I am asking for a discussion concerning how to apply any >> patches >> > >> like CODEC-264 to hashing algorithms that are in the wild. >> > >> >> > >> Thx, >> > >> Claude >> > >> >> > >> [1] https://issues.apache.org/jira/browse/CODEC-264 >> > >> [2] https://github.com/apache/commons-codec/pull/27 >> > >> >> > >> > >> > --------------------------------------------------------------------- >> > To unsubscribe, e-mail: [email protected] >> > For additional commands, e-mail: [email protected] >> > >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> >> > > -- > I like: Like Like - The likeliest place on the web > <http://like-like.xenei.com> > LinkedIn: http://www.linkedin.com/in/claudewarren > -- I like: Like Like - The likeliest place on the web <http://like-like.xenei.com> LinkedIn: http://www.linkedin.com/in/claudewarren
