Alex Herbert created CODEC-268:
----------------------------------

             Summary: Deprecate methods in MurmurHash3 with no equivalent in 
the reference c++ source.
                 Key: CODEC-268
                 URL: https://issues.apache.org/jira/browse/CODEC-268
             Project: Commons Codec
          Issue Type: Wish
    Affects Versions: 1.13
            Reporter: Alex Herbert
            Assignee: Alex Herbert


The following methods have no equivalent in the [MurmurHash3 c++ source 
code|https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp]
{code:java}
    public static long hash64(final long data) {
    public static long hash64(final int data) {
    public static long hash64(final short data) {
    public static long hash64(final byte[] data) {
    public static long hash64(final byte[] data, final int offset, final int 
length) {
    public static long hash64(final byte[] data, final int offset, final int 
length, final int seed) {
{code}
They are documented to return the upper 64-bits of the 128-bit hash method. 
This is false as the code neglects to mix in alternating blocks of 64-bits with 
the corresponding other hash that would be built in the 128-bit method. Thus 
this passes using a NotEquals check:
{code:java}
@Test
public void testHash64InNotEqualToHash128() {
    for (int i = 0; i < 32; i++) {
        final byte[] bytes = Arrays.copyOf(RANDOM_BYTES, i);
        final long h1 = MurmurHash3.hash64(bytes);
        final long[] hash = MurmurHash3.hash128(bytes);
        Assert.assertNotEquals("Did not expect hash64 to match upper bits of 
hash128", hash[0], h1);
        Assert.assertNotEquals("Did not expect hash64 to match lower bits of 
hash128", hash[1], h1);
    }
}
{code}
The following methods are convenience methods that use an arbitrary default 
seed:
{code:java}
    public static final int DEFAULT_SEED = 104729;
    public static int hash32(final long data1, final long data2) {
    public static int hash32(final long data1, final long data2, final int 
seed) {
    public static int hash32(final long data) {
    public static int hash32(final long data, final int seed) {
{code}
Although they match calling hash32() with the equivalent bytes this assumes 
big-endian format. The reference c++ code is designed for little endian. If you 
hash the corresponding values using a Google Guava implementation with the same 
seed then they do not match as Guava faithfully converts primitives as little 
endian. Also note that the available methods for hashing primitives are not the 
same as those in hash64.

These methods thus make very little sense and should be removed as an unwanted 
legacy from the port from Apache Hive.

The following methods are convenience methods that use default encoding via 
String.getBytes():
{code:java}
    public static int hash32(final String data) {
    public static long[] hash128(final String data) {
{code}
These effectively do nothing and save 0 lines of code for the caller:
{code:java}
    String s;
    int hash1 = MurmurHash3.hash32(s);
    int hash2 = MurmurHash3.hash32(s.getBytes());
    assert hash1 == hash2;
{code}
This is not even used:
{code:java}
    public static final long NULL_HASHCODE = 2862933555777941757L;
{code}
The following methods for hash32, hash64 and hash128 are not consistent with 
the available arguments:
{code:java}
    public static int hash32(final byte[] data) {
    public static int hash32(final String data) {
    public static int hash32(final byte[] data, final int length) {
    public static int hash32(final byte[] data, final int length, final int 
seed) {
    public static int hash32(final byte[] data, final int offset, final int 
length, final int seed) {

    public static long hash64(final byte[] data) {
    // Two int arguments specify (offset, length) not (length, seed)
    public static long hash64(final byte[] data, final int offset, final int 
length) {
    public static long hash64(final byte[] data, final int offset, final int 
length, final int seed) {

    public static long[] hash128(final byte[] data) {
    public static long[] hash128(final String data) {
    // No other (offset, length, seed) combinations
    public static long[] hash128(final byte[] data, final int offset, final int 
length, final int seed) {
{code}
I would suggest deprecating:
 * All hash64 methods
 * The default seed
 * The null hashcode
 * The helper methods that hash primitives
 * The helper methods that hash String using default encoding
 * Those methods that accept various combinations of offset, length and seed

This would leave a cleaner API:
{code:java}
    public static int hash32(final byte[] data) {
    public static int hash32(final byte[] data, final int offset, final int 
length, final int seed) {

    public static long[] hash128(final byte[] data) {
    public static long[] hash128(final byte[] data, final int offset, final int 
length, final int seed) {
{code}
There are outstanding bugs in CODEC-264 and CODEC-267 for sign extension errors 
in hash32 and hash128. These should be fixed using a minimal API with names 
that reference the x86 and x64 names from the MurmurHash3 reference c++ code:

I would argue that the default seed should be zero (it is zero in Google Guava 
and Python mmh3 implementations of MurmurHash3), not an arbitrary random int.
{code:java}
    public static int hash32x86(final byte[] data) {
    public static int hash32x86(final byte[] data, final int offset, final int 
length, final int seed) {

    public static long[] hash128x64(final byte[] data) {
    public static long[] hash128x64(final byte[] data, final int offset, final 
int length, final int seed) {
{code}
All deprecated methods should be referred to use these instead.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to