GitHub user wgtmac commented on a diff in the pull request:
https://github.com/apache/orc/pull/304#discussion_r213542675
--- Diff: java/core/src/test/org/apache/orc/TestStringDictionary.java ---
@@ -409,4 +411,77 @@ public void testTooManyDistinctV11AlwaysDictionary()
throws Exception {
}
+ /**
+ * Test that dictionaries can be disabled, per column. In this test, we want to disable DICTIONARY_V2 for the
+ * `longString` column (presumably for a low hit-ratio), while preserving DICTIONARY_V2 for `shortString`.
+ * @throws Exception on unexpected failure
+ */
+ @Test
+ public void testDisableDictionaryForSpecificColumn() throws Exception {
+ final String SHORT_STRING_VALUE = "foo";
+ final String LONG_STRING_VALUE = "BAAAAAAAAR!!";
+
+ TypeDescription schema =
+ TypeDescription.fromString("struct<shortString:string,longString:string>");
+
+ Writer writer = OrcFile.createWriter(
+ testFilePath,
+ OrcFile.writerOptions(conf).setSchema(schema)
+ .compress(CompressionKind.NONE)
+ .bufferSize(10000)
+ .directEncodingColumns("longString"));
--- End diff --
Would it be better to support specifying which columns use dictionary encoding?
---