GitHub user wgtmac commented on a diff in the pull request:
https://github.com/apache/orc/pull/304#discussion_r213744153
--- Diff: java/core/src/test/org/apache/orc/TestStringDictionary.java ---
@@ -409,4 +411,77 @@ public void testTooManyDistinctV11AlwaysDictionary()
throws Exception {
}
+ /**
+ * Test that dictionaries can be disabled, per column. In this test, we want to disable DICTIONARY_V2 for the
+ * `longString` column (presumably for a low hit-ratio), while preserving DICTIONARY_V2 for `shortString`.
+ * @throws Exception on unexpected failure
+ */
+ @Test
+ public void testDisableDictionaryForSpecificColumn() throws Exception {
+ final String SHORT_STRING_VALUE = "foo";
+ final String LONG_STRING_VALUE = "BAAAAAAAAR!!";
+
+ TypeDescription schema =
+ TypeDescription.fromString("struct<shortString:string,longString:string>");
+
+ Writer writer = OrcFile.createWriter(
+ testFilePath,
+ OrcFile.writerOptions(conf).setSchema(schema)
+ .compress(CompressionKind.NONE)
+ .bufferSize(10000)
+ .directEncodingColumns("longString"));
--- End diff --
That makes sense. I will also port the current dictionary encoding to the C++
writer shortly.
BTW, we plan to do some testing of a global dictionary that is shared by
all stripes in a file. Can we come up with a design for ORC V2? I can propose
a prototype once I have gathered some experimental results.
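To make the global dictionary idea concrete, here is a rough sketch of what I mean. It is only an illustration, not a proposed ORC design: the class `GlobalDictionarySketch` and its methods are hypothetical and do not exist in the ORC code base. The point is that every stripe encodes its strings against one shared value-to-id table, so a value seen in an earlier stripe keeps its id in later stripes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of the ORC API. */
public class GlobalDictionarySketch {
    // One table for the whole file, instead of one per stripe.
    private final Map<String, Integer> idsByValue = new HashMap<>();
    private final List<String> valuesById = new ArrayList<>();

    /** Returns the existing id for a value, or assigns the next free id. */
    public int idFor(String value) {
        Integer id = idsByValue.get(value);
        if (id == null) {
            id = valuesById.size();
            idsByValue.put(value, id);
            valuesById.add(value);
        }
        return id;
    }

    /** Looks up the string stored under a previously assigned id. */
    public String valueOf(int id) {
        return valuesById.get(id);
    }

    /** Number of distinct values seen across all stripes so far. */
    public int size() {
        return valuesById.size();
    }

    /** Encodes one stripe's values against the shared dictionary. */
    public int[] encodeStripe(List<String> stripeValues) {
        int[] ids = new int[stripeValues.size()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = idFor(stripeValues.get(i));
        }
        return ids;
    }
}
```

In this sketch, a second stripe that repeats a value from the first stripe reuses the same id and adds no new dictionary entry. How such a table would be stored in the file, and how readers would load it, is exactly what would need a design.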
---