Re: [CODEC] Sign Extension Error in Murmur3

Claude Warren Sun, 03 Nov 2019 16:57:53 -0800

As a third option as @melloware said in the pull request comments the
original implementation came from Apache Hive.  The current hashes could be
named hash31Hive and hash128Hive


On Mon, Nov 4, 2019 at 12:53 AM Claude Warren <[email protected]> wrote:

> I think the way to prove they behave correctly is to test the results
> against the original "C" implementation.  With this in mind I proposed the
> changes.
>
> I agree that there should be a way to revert to the old code as there is
> code in the wild that depends on the earlier implementation.
>
> I know that there are broken implementations in other Apache projects, but
> those are not libraries (i.e. Cassandra).
>
> I would suggest that perhaps the changes that I provided be renamed as
> hash128Standard and hash32Standard (or some other designator to show that
> it is different from the original CODEC implementation).
>
> Alternatively hash128 and hash32 could be deprecated  and replaced with
> hashx86_32Old, and hashx64_128Old and provide the corrected versions for
> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
> hashx86_128 implementation.
>
> My opinion is that we should provide Murmur3 hash implementations that
> produce the same results as the "C"/C++ implementations for cross
> application compatibility.
>
> Claude
>
> On Sun, Nov 3, 2019 at 11:34 PM sebb <[email protected]> wrote:
>
>> As I see it, there are two use cases here.
>>
>> 1) Commons Codec code is used exclusively, in which case it is
>> essential that the output does not change for a given seed.
>>
>> 2) Commons Codec is used in conjunction with other implementations, in
>> which case it is essential that Codec is aligned with other
>> implementations
>>
>> Both use cases seem valid to me, so I think it's important to allow
>> both old and new algorithms to co-exist going forward.
>>
>> However:
>> Should the default behaviour change (with option to revert)?
>> Or would it be better to make the corrected behaviour optional?
>>
>> Or deliberately break the API to force users to choose between old and
>> new behaviour?
>>
>> Note that Wikipedia says that the x86 and x64 versions of the
>> algorithm generate different results in the case of the 128 bit hash.
>> However Commons Codec does not have different implementations for x86
>> and x64; it is not clear which has been implemented.
>> The JavaDoc also fails to mention that there is a 64bit hash.
>>
>> The class clearly needs additional test data; it's possible that there
>> are discrepancies in other hash sizes as well.
>>
>> The code has multiple implementations for each hash size depending on
>> the parameter types, which increases the number of tests needed.
>>
>> Note that the original C++ implementation only has 3 hash methods:
>> MurmurHash3_x86_32
>> MurmurHash3_x86_128
>> MurmurHash3_x64_128
>>
>> AFAICT these all take the same parameters:
>> - unsigned byte array
>> - length
>> - unsigned 32 bit seed
>> - pointer to output buffer
>>
>> No idea why the Codec version allows additional input such as one (or
>> two) longs, etc.
>> Unless there is a defined standard for how these should be handled, it
>> will be impossible to prove that they behave correctly.
>>
>> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <[email protected]>
>> wrote:
>> >
>> >
>> >
>> > > On 3 Nov 2019, at 21:45, Gary Gregory <[email protected]> wrote:
>> > >
>> > > I feel like I am missing something basic in the assumption of this
>> issue:
>> > > there is no such thing as an unsigned int in Java and the ticket talks
>> > > about (C?) unsigned ints. Please help me understand how or why we
>> should
>> > > care about C vs. Java ints. Are we comparing apples to oranges here?
>> >
>> > When a byte is converted to an int there is sign extension if it is
>> negative. A negative byte will have all bits above the 8th bit set to 1. So
>> if the byte is negative then when converted to an int for bit shift and xor
>> operations the raw bits are not the same.
>> >
>> > These are not the same:
>> >
>> > byte b = -1;
>> > (int) b != (b & 0xff);
>> > b << 8 != (b & 0xff) << 8;
>> > b << 16 != (b & 0xff) << 16;
>> >
>> > The original code has the use of the 0xff mask for most of the murmur3
>> algorithm. It has been missed from the final steps applied the the last 3
>> bytes in the hash32 algorithm variant.
>> >
>> > Alex
>> >
>> >
>> >
>> > >
>> > > Thank you,
>> > > Gary
>> > >
>> > > On Sun, Nov 3, 2019, 07:59 Claude Warren <[email protected]> wrote:
>> > >
>> > >> There is an error in the current Murmur3 code introduced by sign
>> extension
>> > >> errors.  This is documented in CODEC-264.[1]
>> > >>
>> > >> I have created a pull request to fix it.[2]
>> > >>
>> > >> While the code changes did not change any of the existing Murmur3
>> tests, I
>> > >> did add new tests that failed until the changes were applied.
>> > >>
>> > >> However, I am concerned that the commons-codec Murmur3 code is in
>> the wild
>> > >> and this change will impact some hashes.
>> > >>
>> > >> Therefore, I am asking for a discussion concerning how to apply any
>> patches
>> > >> like CODEC-264 to hashing algorithms that are in the wild.
>> > >>
>> > >> Thx,
>> > >> Claude
>> > >>
>> > >> [1] https://issues.apache.org/jira/browse/CODEC-264
>> > >> [2] https://github.com/apache/commons-codec/pull/27
>> > >>
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [email protected]
>> > For additional commands, e-mail: [email protected]
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [CODEC] Sign Extension Error in Murmur3

Reply via email to