omalley commented on a change in pull request #637:
URL: https://github.com/apache/orc/pull/637#discussion_r572311006
##########
File path: java/core/src/java/org/apache/orc/OrcConf.java
##########
@@ -105,6 +105,9 @@
"(default 10000 rows) else dictionary check will happen before\n" +
"writing first stripe. In both cases, the decision to use\n" +
"dictionary or not will be retained thereafter."),
+ DICTIONARY_FACTORY_CLASS_KEY("orc.dictionary.class.key", null,
Review comment:
Let's make this a simple string with values of "hash" or "rbtree".
##########
File path: java/core/src/java/org/apache/orc/impl/Dictionary.java
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Interface to define the dictionary used for encoding value in columns of
specific types like string, char, varchar, etc.
+ */
+public interface Dictionary {
+ int INITIAL_DICTIONARY_SIZE = 4096;
+
+ interface DictionaryFactory {
+ Dictionary createDict(Configuration conf);
+ }
+
+ /**
+ * Traverse the whole dictionary and apply the action.
+ */
+ void visit(Visitor visitor) throws IOException;
+
+ void clear();
+
+ /**
+ * Given the position index, return the original string before being encoded.
+ */
+ void getText(Text result, int originalPosition);
+
+ int add(String value);
Review comment:
We shouldn't need the add(String).
##########
File path: java/core/src/java/org/apache/orc/impl/Dictionary.java
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.orc.impl;
+
+import java.io.IOException;
+import java.io.OutputStream;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+
+
+/**
+ * Interface to define the dictionary used for encoding value in columns of
specific types like string, char, varchar, etc.
+ */
+public interface Dictionary {
+ int INITIAL_DICTIONARY_SIZE = 4096;
+
+ interface DictionaryFactory {
Review comment:
I don't think I'd bother defining a factory interface.
##########
File path:
java/core/src/java/org/apache/orc/impl/writer/StringBaseTreeWriter.java
##########
@@ -64,6 +64,15 @@
WriterEncryptionVariant encryption,
WriterContext writer) throws IOException {
super(schema, encryption, writer);
+ Configuration conf = writer.getConfiguration();
+
Review comment:
I'd make a function that looks at the configuration value and builds the
hash or red-black dictionary directly. I can't see a need to make the
dictionary mechanism pluggable.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]