omalley commented on a change in pull request #651:
URL: https://github.com/apache/orc/pull/651#discussion_r600838359



##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+  private DictionaryUtils() {
+    // Utility class does nothing in constructor
+  }
+
+  public static void getTextInternal(Text result, int position, DynamicIntArray keyOffsets, DynamicByteArray byteArray) {
+    int offset = keyOffsets.get(position);
+    int length;
+    if (position + 1 == keyOffsets.size()) {
+      length = byteArray.size() - offset;
+    } else {
+      length = keyOffsets.get(position + 1) - offset;
+    }
+    byteArray.setText(result, offset, length);
+  }
+
+  static class VisitorContextImpl implements Dictionary.VisitorContext {

Review comment:
       I think a common class (perhaps named DictionaryStorage) that 
implements VisitorContext and holds the byteArray and keyOffsets would make 
sense. Each dictionary could then keep a reference to it in a field.

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;

Review comment:
       I suspect we'd get better performance from linear probing into a 
single DynamicIntArray, although we might want to lower the maximum load 
factor.
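
A rough sketch of the linear-probing idea, under stated assumptions. It is not 
the patch's code: a plain `int[]` stands in for the single `DynamicIntArray`, 
keys are kept as `String`s in a list instead of UTF-8 bytes plus offsets, and 
the 0.5 maximum load factor is an assumed value chosen only to illustrate 
"lower than 0.75".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LinearProbingDictionarySketch {
  private static final int EMPTY = -1;
  // Assumed value; open addressing usually wants a lower load factor than chaining.
  private static final float MAX_LOAD_FACTOR = 0.5f;

  private final List<String> keys = new ArrayList<>(); // stand-in for byteArray + keyOffsets
  private int[] slots;                                 // each slot holds a key id or EMPTY

  LinearProbingDictionarySketch(int initialCapacity) {
    slots = new int[Math.max(2, initialCapacity)];
    Arrays.fill(slots, EMPTY);
  }

  /** Returns the id of the key, adding it if it is not present yet. */
  int add(String key) {
    // Grow first so that at least half of the slots stay empty and probing terminates.
    if (keys.size() + 1 > slots.length * MAX_LOAD_FACTOR) {
      resize(slots.length * 2 + 1);
    }
    int slot = Math.floorMod(key.hashCode(), slots.length);
    while (slots[slot] != EMPTY) {
      if (keys.get(slots[slot]).equals(key)) {
        return slots[slot];
      }
      slot = (slot + 1) % slots.length; // linear probe: try the next slot, wrapping around
    }
    int id = keys.size();
    keys.add(key);
    slots[slot] = id;
    return id;
  }

  /** Rehashes every existing key id into a larger slot array. */
  private void resize(int newCapacity) {
    int[] fresh = new int[newCapacity];
    Arrays.fill(fresh, EMPTY);
    for (int id = 0; id < keys.size(); id++) {
      int slot = Math.floorMod(keys.get(id).hashCode(), newCapacity);
      while (fresh[slot] != EMPTY) {
        slot = (slot + 1) % newCapacity;
      }
      fresh[slot] = id;
    }
    slots = fresh;
  }

  int size() {
    return keys.size();
  }

  String get(int id) {
    return keys.get(id);
  }
}
```

Compared with one `DynamicIntArray` per bucket, this keeps all slots in one 
contiguous array and avoids allocating an object per bucket. The sketch ignores 
integer overflow when the capacity grows very large.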

##########
File path: java/core/src/java/org/apache/orc/impl/DictionaryUtils.java
##########
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.io.Text;
+
+
+public class DictionaryUtils {
+  private DictionaryUtils() {

Review comment:
       Actually, I don't mind this style for utility classes, since it prevents 
anyone from accidentally creating a useless instance. That said, I suspect 
this class should be refactored.

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();

Review comment:
       Yeah, that is too large. If we have more than a handful of collisions, 
the table is too small or the hash function isn't good.

##########
File path: java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java
##########
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Using HashTable to represent a dictionary. The strings are stored as UTF-8 bytes
+ * and an offset for each entry. It is using chaining for collision resolution.
+ *
+ * This implementation is not thread-safe. It also assumes there's no reduction in the size of hash-table
+ * as it shouldn't happen in the use cases for this class.
+ */
+public class StringHashTableDictionary implements Dictionary {
+
+  private final DynamicByteArray byteArray = new DynamicByteArray();
+  // starting offset of key-in-byte in the byte array for the i-th key.
+  // Two things combined stores the key array.
+  private final DynamicIntArray keyOffsets;
+
+  private final Text newKey = new Text();
+
+  private DynamicIntArray[] hashArray;
+
+  private int capacity;
+
+  private int threshold;
+
+  private float loadFactor;
+
+  private static float DEFAULT_LOAD_FACTOR = 0.75f;
+
+  private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;
+
+  public StringHashTableDictionary(int initialCapacity) {
+    this(initialCapacity, DEFAULT_LOAD_FACTOR);
+  }
+
+  public StringHashTableDictionary(int initialCapacity, float loadFactor) {
+    this.capacity = initialCapacity;
+    this.loadFactor = loadFactor;
+    this.keyOffsets = new DynamicIntArray(initialCapacity);
+    this.hashArray = initHashArray(initialCapacity);
+    this.threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+  }
+
+  private DynamicIntArray[] initHashArray(int capacity) {
+    DynamicIntArray[] bucket = new DynamicIntArray[capacity];
+    for (int i = 0; i < capacity; i++) {
+      bucket[i] = new DynamicIntArray();
+    }
+    return bucket;
+  }
+
+  @Override
+  public void visit(Visitor visitor)
+      throws IOException {
+    traverse(visitor, new DictionaryUtils.VisitorContextImpl(this.byteArray, this.keyOffsets));
+  }
+
+  private void traverse(Visitor visitor, DictionaryUtils.VisitorContextImpl context) throws IOException {
+    for (DynamicIntArray intArray : hashArray) {
+      for (int i = 0; i < intArray.size() ; i ++) {
+        context.setPosition(intArray.get(i));
+        visitor.visit(context);
+      }
+    }
+  }
+
+  @Override
+  public void clear() {
+    byteArray.clear();
+    keyOffsets.clear();
+    Arrays.fill(hashArray, null);
+  }
+
+  @Override
+  public void getText(Text result, int position) {
+    DictionaryUtils.getTextInternal(result, position, this.keyOffsets, this.byteArray);
+  }
+
+  @Override
+  public int add(byte[] bytes, int offset, int length) {
+    resizeIfNeeded();
+    newKey.set(bytes, offset, length);
+    return add(newKey);
+  }
+
+  public int add(Text text) {
+    resizeIfNeeded();
+
+    int index = getIndex(text);
+    DynamicIntArray candidateArray = hashArray[index];
+
+    newKey.set(text);
+
+    Text tmpText = new Text();
+    for (int i = 0; i < candidateArray.size(); i++) {
+      getText(tmpText, candidateArray.get(i));
+      if (tmpText.equals(newKey)) {
+        return candidateArray.get(i);
+      }
+    }
+
+    // if making it here, it means no match.
+    int len = newKey.getLength();
+    int currIdx = keyOffsets.size();
+    keyOffsets.add(byteArray.add(newKey.getBytes(), 0, len));
+    candidateArray.add(currIdx);
+    return currIdx;
+  }
+
+  private void resizeIfNeeded() {
+    if (keyOffsets.size() >= threshold) {
+      int oldCapacity = keyOffsets.size();
+      int newCapacity = (oldCapacity << 1) + 1;
+      doResize(newCapacity);
+      this.threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
+    }
+  }
+
+  @Override
+  public int size() {
+    return keyOffsets.size();
+  }
+
+  /**
+   * Compute the hash value and find the corresponding index.
+   *
+   */
+  int getIndex(Text text) {
+    return (text.hashCode() & 0x7FFFFFFF) % capacity;

Review comment:
       I'd be tempted to use Math.floorMod here, which is always non-negative 
for a positive capacity.
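
For illustration, a small sketch comparing the two ways of turning a hash into 
a bucket index. The class and method names are hypothetical, and a plain `int` 
hash stands in for `text.hashCode()`.

```java
final class HashIndexSketch {
  private HashIndexSketch() {
    // Utility class, no instances.
  }

  /** Mirrors the expression in the patch: clear the sign bit, then take the remainder. */
  static int maskedIndex(int hash, int capacity) {
    return (hash & 0x7FFFFFFF) % capacity;
  }

  /** Alternative: floorMod takes the sign of the divisor, so a positive capacity gives [0, capacity). */
  static int floorModIndex(int hash, int capacity) {
    return Math.floorMod(hash, capacity);
  }
}
```

Both variants stay within `[0, capacity)` when `capacity` is positive. They can 
map a negative hash to different buckets, which is harmless as long as one 
variant is used consistently.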




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

