danepitkin commented on code in PR #38423:
URL: https://github.com/apache/arrow/pull/38423#discussion_r1385614601


##########
java/vector/src/main/java/org/apache/arrow/vector/dictionary/BatchedDictionary.java:
##########
@@ -0,0 +1,270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.dictionary;
+
+import java.io.Closeable;
+import java.io.IOException;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.util.hash.MurmurHasher;
+import org.apache.arrow.util.VisibleForTesting;
+import org.apache.arrow.vector.BaseIntVector;
+import org.apache.arrow.vector.BaseVariableWidthVector;
+import org.apache.arrow.vector.FieldVector;
+import org.apache.arrow.vector.ipc.ArrowWriter;
+import org.apache.arrow.vector.types.pojo.ArrowType;
+import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
+import org.apache.arrow.vector.types.pojo.FieldType;
+
+/**
+ * A dictionary implementation that can be used when writing batches of data to
+ * a stream or file. Supports delta or replacement encoding.
+ */
+public class BatchedDictionary implements Closeable, BaseDictionary {
+
+  private final DictionaryEncoding encoding;
+
+  private final BaseVariableWidthVector dictionary;
+
+  private final BaseIntVector indexVector;
+
+  private final DictionaryHashTable hashTable;
+
+  private final boolean forFileIPC;
+
+  private int deltaIndex;
+
+  private int dictionaryIndex;
+
+  private boolean wasReset;

Review Comment:
   If its not required, I would consider deleting it to reduce state managed 
inside this object.



##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowReader.java:
##########
@@ -243,6 +248,8 @@ protected void loadDictionary(ArrowDictionaryBatch 
dictionaryBatch) {
       }
       return;
     }
+    // TODO - If the super class is a file reader it may be good to throw an 
exception here
+    // the dictionary failed to satisfy the spec (i.e. being a replacement 
dictionary)

Review Comment:
   Can we enforce this in the reader (instead of a todo?)



##########
java/vector/src/main/java/org/apache/arrow/vector/dictionary/BatchedDictionary.java:
##########
@@ -0,0 +1,270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector.dictionary;
+
+import java.io.Closeable;
+import java.io.IOException;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.util.hash.MurmurHasher;
+import org.apache.arrow.util.VisibleForTesting;
+import org.apache.arrow.vector.BaseIntVector;
+import org.apache.arrow.vector.BaseVariableWidthVector;
+import org.apache.arrow.vector.FieldVector;
+import org.apache.arrow.vector.ipc.ArrowWriter;
+import org.apache.arrow.vector.types.pojo.ArrowType;
+import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
+import org.apache.arrow.vector.types.pojo.FieldType;
+
+/**
+ * A dictionary implementation that can be used when writing batches of data to
+ * a stream or file. Supports delta or replacement encoding.
+ */
+public class BatchedDictionary implements Closeable, BaseDictionary {
+
+  private final DictionaryEncoding encoding;
+
+  private final BaseVariableWidthVector dictionary;
+
+  private final BaseIntVector indexVector;
+
+  private final DictionaryHashTable hashTable;
+
+  private final boolean forFileIPC;

Review Comment:
   Could we instead have the IPC stream interface check and fail if a 
replacement dictionary is sent? I think it would be nice if we don't maintain 
this state inside the dictionary itself.



##########
java/memory/memory-core/src/main/java/org/apache/arrow/memory/util/hash/MurmurHasher.java:
##########
@@ -106,6 +111,36 @@ public static int hashCode(long address, long length, int 
seed) {
     return finalizeHashCode(hash, length);
   }
 
+  /**
+   * Calculates the hash code for a byte array.
+   * @param buffer the non-null buffer to read.
+   * @param offset an offset into the byte array.
+   * @param length length of the memory region.
+   * @param seed the seed.
+   * @return the hash code.
+   */
+  public static int hashCode(byte[] buffer, int offset, int length, int seed) {

Review Comment:
   Why are we adding support for `byte[]` buffers instead of using `ArrowBuf`?



##########
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java:
##########
@@ -130,14 +130,24 @@ protected void endInternal(WriteChannel out) throws 
IOException {
   protected void ensureDictionariesWritten(DictionaryProvider provider, 
Set<Long> dictionaryIdsUsed)
       throws IOException {
     if (dictionariesWritten) {
+      for (long id : dictionaryIdsUsed) {
+        BaseDictionary dictionary = provider.lookup(id);
+        if (dictionary.getEncoding().isDelta()) {
+          writeDictionaryBatch(dictionary, false);
+        }
+        // TODO - It would be useful to throw an exception here if a 
replacement dictionary was found
+        // with modifications. Replacements are not currently allowed in 
files. For now, we just drop it
+        // and throw an exception in the BatchedDictionary and hope users use 
that class.
+      }

Review Comment:
   I'd prefer we enforce this here in the writer than in the dictionary itself.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to