[jira] [Commented] (KYLIN-4810) TrieDictionary is not correctly build

ASF subversion and git services (Jira) Mon, 23 Nov 2020 19:25:07 -0800


    [ 
https://issues.apache.org/jira/browse/KYLIN-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237826#comment-17237826
 ]


ASF subversion and git services commented on KYLIN-4810:
--------------------------------------------------------

Commit 3e12b6d621fe8e5c747a5783f64bc535618c8035 in kylin's branch 
refs/heads/master from zhengshengjun
[ https://gitbox.apache.org/repos/asf?p=kylin.git;h=3e12b6d ]

FIX KYLIN-4810, Add some tips and test case


> TrieDictionary is not correctly build
> -------------------------------------
>
>                 Key: KYLIN-4810
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4810
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v2.3.2
>            Reporter: ShengJun Zheng
>            Assignee: ShengJun Zheng
>            Priority: Critical
>              Labels: Dictionary
>             Fix For: v3.1.2
>
>
> Hi, recently I ran into a problem in our production environment: segments 
> failed to merge because a TrieDictionaryForest was out of order:
> {code:java}
> java.lang.IllegalStateException: Invalid input data. Unordered data cannot be 
> split into multi trees
>     at 
> org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:92)
>     at 
> org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:78)
>     at 
> org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.addValue(DictionaryGenerator.java:214)
>     at 
> org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:81)
>     at 
> org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:65)
>     at 
> org.apache.kylin.dict.DictionaryGenerator.mergeDictionaries(DictionaryGenerator.java:106)
> {code}
> After some analysis, we found that when a dict-encoded column contains very 
> long values, iterating over a single TrieDictionaryTree returns unordered 
> data.
>  
>  Digging into the source code, the root cause is as follows: 
>  # Kylin splits a TrieTree node into two parts when a single node's value is 
> longer than 255 bytes.
>  # These two parts are then added as values when building the TrieTree. In 
> fact, the two split parts should not be added to the TrieTree as new values.
>  # Step 2 causes the TrieDictionaryTree to build more leaf nodes, and the 
> extra leaf nodes become 'end-values' of the dictionary tree.
>  # This does not affect the correctness of the dict tree itself, apart from 
> adding some extra nodes.
>  # But if the split falls inside a UTF-8 encoded character, iterating over 
> the tree returns unordered data (this has to do with Java's UTF-8 String 
> serialization/deserialization; please refer to sun.nio.cs.UTF_8 in the JDK).
> How to reproduce? Run this test code:
> {code:java}
> TrieDictionaryForestBuilder builder = new TrieDictionaryForestBuilder(new 
> StringBytesConverter());
> String longUrl = 
> "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx你好~~~";
> builder.addValue(longUrl);
> TrieDictionaryForest<String> dict = builder.build();
> TrieDictionaryForestBuilder mergeBuild = new TrieDictionaryForestBuilder(new 
> StringBytesConverter());
> for (int i = dict.getMinId(); i <= dict.getMaxId(); i++) {
>     String str = dict.getValueFromId(i);
>     System.out.println("add value into merge tree");
>     mergeBuild.addValue(str);
> }
> {code}
> The log output of this test code is:
> {code}
> add value into merge tree
> add value into merge tree
> 16:59:36 [main] INFO 
> org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:127)
>  values not in ascending order, previous 
> 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xEF\xBF\xBD',
>  current 
> 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xE4\xBD\xA0\xE5\xA5\xBD~~~'
> {code}
> We can see from the test code's output:
>  # We added only 1 value, but the trie dictionary tree turns out to have 2 
> end values.
>  # Iterating over the TrieDictionary tree returned unordered data.
> We addressed this problem by
>  # distinguishing values that are whole column values from values that are 
> split parts,
>  # not marking split parts as end-values of a TrieTree node.
> I wonder if I have got something wrong here. Thanks.
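
The ordering flip described in step 5 of the quoted report can be illustrated without Kylin. The following is a minimal sketch, not Kylin code: the class name Utf8SplitDemo and both helper methods are hypothetical, and the sample value is shortened to two 'x' characters instead of a value longer than 255 bytes. It assumes the split leaves an incomplete UTF-8 sequence at the end of the first part. Decoding that part to a Java String replaces the dangling byte with U+FFFD, which re-encodes as EF BF BD and therefore sorts after the original character's lead byte E4.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8SplitDemo {
    // Lexicographic comparison of two byte arrays, treating bytes as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Decode bytes to a String and encode them again, as happens when a value
    // is read from one dictionary and added to another builder.
    static byte[] roundTrip(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "xx" followed by the two CJK characters from the report and "~~~".
        byte[] full = "xx\u4f60\u597d~~~".getBytes(StandardCharsets.UTF_8);
        // Cut after the first byte (E4) of the first CJK character (E4 BD A0).
        byte[] part = Arrays.copyOf(full, 3);

        // As raw bytes the split part is a proper prefix, so it sorts first.
        System.out.println("raw order kept: " + (compareUnsigned(part, full) < 0));
        // After a String round trip the dangling E4 has become EF BF BD.
        System.out.println("order after round trip kept: "
                + (compareUnsigned(roundTrip(part), roundTrip(full)) < 0));
    }
}
```

This matches the shape of the log above, where the previous value ends in \xEF\xBF\xBD and the current value continues with \xE4\xBD\xA0. It only demonstrates the comparison flip; it does not model the trie itself.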



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
