[GitHub] orc pull request #304: ORC-397. Allow selective disabling of dictionary enco...

wgtmac Wed, 29 Aug 2018 09:18:02 -0700

Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/304#discussion_r213744153
  
    --- Diff: java/core/src/test/org/apache/orc/TestStringDictionary.java ---
    @@ -409,4 +411,77 @@ public void testTooManyDistinctV11AlwaysDictionary() throws Exception {
     
       }
     
    +  /**
    +   * Test that dictionaries can be disabled, per column. In this test, we want to disable DICTIONARY_V2 for the
    +   * `longString` column (presumably for a low hit-ratio), while preserving DICTIONARY_V2 for `shortString`.
    +   * @throws Exception on unexpected failure
    +   */
    +  @Test
    +  public void testDisableDictionaryForSpecificColumn() throws Exception {
    +    final String SHORT_STRING_VALUE = "foo";
    +    final String  LONG_STRING_VALUE = "BAAAAAAAAR!!";
    +
    +    TypeDescription schema =
    +        TypeDescription.fromString("struct<shortString:string,longString:string>");
    +
    +    Writer writer = OrcFile.createWriter(
    +        testFilePath,
    +        OrcFile.writerOptions(conf).setSchema(schema)
    +            .compress(CompressionKind.NONE)
    +            .bufferSize(10000)
    +            .directEncodingColumns("longString"));
    --- End diff --
    
    That makes sense. I will also port the current dictionary encoding to the
    C++ writer shortly.
    BTW, we plan to test a global dictionary that is shared by all stripes in a
    file. Can we come up with a design for this in ORC V2? I can propose a
    prototype once we have gathered some experimental results.
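
    To make the global dictionary idea concrete, here is a minimal sketch. It is
    purely illustrative: it is not ORC code and not a proposed design, and the
    class name `GlobalDictionarySketch` and its methods are hypothetical. It only
    shows the core property, namely that a value seen in several stripes is
    stored once and keeps the same id in every stripe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a file-wide string dictionary; not part of ORC. */
public class GlobalDictionarySketch {
  // Maps each distinct value to the id it was assigned on first sight.
  private final Map<String, Integer> ids = new HashMap<>();
  // Reverse lookup: position in this list is the id.
  private final List<String> values = new ArrayList<>();

  /** Returns the id for the value, assigning the next free id if it is new. */
  public int encode(String value) {
    Integer id = ids.get(value);
    if (id == null) {
      id = values.size();
      ids.put(value, id);
      values.add(value);
    }
    return id;
  }

  /** Returns the value that was assigned the given id. */
  public String decode(int id) {
    return values.get(id);
  }

  /** Number of distinct values seen so far, across all stripes. */
  public int size() {
    return values.size();
  }

  /** Encodes one stripe's values against the shared dictionary. */
  public int[] encodeStripe(List<String> stripe) {
    int[] encoded = new int[stripe.size()];
    for (int i = 0; i < encoded.length; i++) {
      encoded[i] = encode(stripe.get(i));
    }
    return encoded;
  }
}
```

    With per-stripe dictionaries, a value that appears in two stripes is written
    into two dictionaries. In the sketch, the second stripe reuses the ids from
    the first. Whether that saves space in practice depends on how much the
    stripes overlap, which is what the experiments are meant to measure.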

---
