Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5725#discussion_r29221599
  
    --- Diff: unsafe/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java ---
    @@ -0,0 +1,552 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.unsafe.map;
    +
    +import java.lang.Override;
    +import java.lang.UnsupportedOperationException;
    +import java.util.Iterator;
    +import java.util.LinkedList;
    +import java.util.List;
    +
    +import org.apache.spark.unsafe.*;
    +import org.apache.spark.unsafe.array.ByteArrayMethods;
    +import org.apache.spark.unsafe.array.LongArray;
    +import org.apache.spark.unsafe.bitset.BitSet;
    +import org.apache.spark.unsafe.hash.Murmur3_x86_32;
    +import org.apache.spark.unsafe.memory.*;
    +
    +/**
    + * An append-only hash map where keys and values are contiguous regions of bytes.
    + * <p>
    + * This is backed by a power-of-2-sized hash table, using quadratic probing with triangular numbers,
    + * which is guaranteed to exhaust the space.
    + * <p>
    + * Note that even though we use long for indexing, the map can support up to 2^31 keys because
    + * we use 32 bit MurmurHash. In either case, if the key cardinality is so high, you should probably
    + * be using sorting instead of hashing for better cache locality.
    + * <p>
    + * This class is not thread safe.
    + */
    +public final class BytesToBytesMap {
    +
    +  private static final Murmur3_x86_32 HASHER = new Murmur3_x86_32(0);
    +
    +  private static final HashMapGrowthStrategy growthStrategy = HashMapGrowthStrategy.DOUBLING;
    +
    +  private final MemoryManager memoryManager;
    +
    +  /**
    +   * A linked list for tracking all allocated data pages so that we can free all of our memory.
    +   */
    +  private final List<MemoryBlock> dataPages = new LinkedList<MemoryBlock>();
    +
    +  /**
    +   * The data page that will be used to store keys and values for new hashtable entries. When this
    +   * page becomes full, a new page will be allocated and this pointer will change to point to that
    +   * new page.
    +   */
    +  private MemoryBlock currentDataPage = null;
    +
    +  /**
    +   * Offset into `currentDataPage` that points to the location where new data can be inserted into
    +   * the page.
    +   */
    +  private long pageCursor = 0;
    +
    +  /**
    +   * The size of the data pages that hold key and value data. Map entries cannot span multiple
    +   * pages, so this limits the maximum entry size.
    +   */
    +  private static final long PAGE_SIZE_BYTES = 1L << 26; // 64 megabytes
    +
    +  // This choice of page table size and page size means that we can address up to 500 gigabytes
    +  // of memory.
    +
    +  /**
    +   * A single array to store the key and value.
    +   *
    +   * Position {@code 2 * i} in the array is used to track a pointer to the key at index {@code i},
    +   * while position {@code 2 * i + 1} in the array holds key's full 32-bit hashcode.
    +   */
    +  private LongArray longArray;
    +  // TODO: we're wasting 32 bits of space here; we can probably store fewer bits of the hashcode
    +  // and exploit word-alignment to use fewer bits to hold the address.  This might let us store
    +  // only one long per map entry, increasing the chance that this array will fit in cache at the
    +  // expense of maybe performing more lookups if we have hash collisions.  Say that we stored only
    +  // 27 bits of the hashcode and 37 bits of the address.  37 bits is enough to address 1 terabyte
    +  // of RAM given word-alignment.  If we use 13 bits of this for our page table, that gives us a
    +  // maximum page size of 2^24 * 8 = ~134 megabytes per page. This change will require us to store
    +  // full base addresses in the page table for off-heap mode so that we can reconstruct the full
    +  // absolute memory addresses.
    +
    +  /**
    +   * A {@link BitSet} used to track location of the map where the key is set.
    +   * Size of the bitset should be half of the size of the long array.
    +   */
    +  private BitSet bitset;
    +
    +  private final double loadFactor;
    +
    +  /**
    +   * Number of keys defined in the map.
    +   */
    +  private int size;
    +
    +  /**
    +   * The map will be expanded once the number of keys exceeds this threshold.
    +   */
    +  private int growthThreshold;
    +
    +  /**
    +   * Mask for truncating hashcodes so that they do not exceed the long array's size.
    --- End diff --
    
    call out explicitly the strength reduction here. Essentially a mod operation, but done using a bit mask since this is a power-of-2-sized hash map.
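
    A minimal sketch of the strength reduction being suggested (illustrative only, not code from this patch; the class name MaskVsModDemo and the sample values are made up): because the table capacity is a power of two, ANDing the hashcode with capacity - 1 gives the same bucket index as a floor-mod by the capacity, without a division.

        // Illustrative only: shows why masking with (capacity - 1) equals a floor-mod
        // when the capacity is a power of two. Not code from the patch under review.
        public class MaskVsModDemo {
          public static void main(String[] args) {
            final int capacity = 1 << 6;   // any power of two, e.g. 64 buckets
            final int mask = capacity - 1; // low bits set: 0b111111

            int[] sampleHashes = { 0, 1, 42, 12345, -7, Integer.MIN_VALUE, Integer.MAX_VALUE };
            for (int h : sampleHashes) {
              int masked = h & mask;                   // single AND instruction
              int modded = Math.floorMod(h, capacity); // mathematical mod, always non-negative
              System.out.printf("h=%d -> masked=%d, floorMod=%d%n", h, masked, modded);
            }
          }
        }

    Note that Java's % operator can return a negative value for a negative hashcode, so the mask is not only cheaper than a division, it also sidesteps the sign handling that a plain % would need.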

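    A related note on the probing scheme described in the class javadoc ("quadratic probing with triangular numbers, which is guaranteed to exhaust the space"): for a power-of-two table, the first `capacity` triangular offsets 0, 1, 3, 6, 10, ... are pairwise distinct modulo the capacity, so a probe sequence touches every slot within that many steps and cannot cycle while free slots remain. A small, hypothetical sketch (TriangularProbingDemo is not part of the patch) that checks this for a 16-slot table:

        import java.util.HashSet;
        import java.util.Set;

        // Illustrative only: counts the distinct slots reached by triangular-number
        // probing in a power-of-two table.
        public class TriangularProbingDemo {
          public static void main(String[] args) {
            final int capacity = 1 << 4;  // 16 slots; any power of two works
            final int mask = capacity - 1;
            final int hashcode = 12345;   // arbitrary starting hash

            Set<Integer> visited = new HashSet<>();
            int pos = hashcode & mask;    // initial bucket, using the mask-as-mod trick
            int step = 1;
            for (int i = 0; i < capacity; i++) {
              visited.add(pos);
              pos = (pos + step) & mask;  // cumulative steps 1, 2, 3, ... give offsets 1, 3, 6, ...
              step++;
            }
            System.out.println("distinct slots visited: " + visited.size() + " of " + capacity);
          }
        }

    The printout reports 16 of 16 for this table size, which is the "guaranteed to exhaust the space" property the javadoc relies on.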

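    Finally, on the TODO about storing only one long per entry: a hypothetical sketch of that packing (PackedEntryDemo, HASH_BITS, and ADDR_BITS are made-up names; the patch does not do this) using the split the comment mentions, 27 bits of hashcode plus 37 bits of 8-byte-aligned address:

        // Illustrative only: packs a 27-bit slice of the hashcode and a 37-bit
        // word-aligned address (covering up to 1 TB) into a single long.
        public class PackedEntryDemo {
          private static final int HASH_BITS = 27;
          private static final int ADDR_BITS = 37;
          private static final long ADDR_MASK = (1L << ADDR_BITS) - 1;

          // The address is assumed to be 8-byte aligned and below 1 TB.
          static long pack(int hashcode, long address) {
            long hashPart = hashcode & ((1 << HASH_BITS) - 1); // keep 27 bits of the hash
            long addrPart = (address >>> 3) & ADDR_MASK;       // alignment frees the low 3 bits
            return (hashPart << ADDR_BITS) | addrPart;
          }

          static int packedHash(long packed) { return (int) (packed >>> ADDR_BITS); }

          static long packedAddress(long packed) { return (packed & ADDR_MASK) << 3; }

          public static void main(String[] args) {
            long address = 0x000000ABCDEF0120L; // sample aligned address under 1 TB
            int hashcode = 0x5EADBEEF;
            long packed = pack(hashcode, address);
            System.out.println(packedAddress(packed) == address);              // true
            System.out.println(packedHash(packed) == (hashcode & 0x07FFFFFF)); // true
          }
        }

    As the TODO already notes, off-heap mode would then need full base addresses in the page table to turn these packed values back into absolute addresses.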
