Hi Kumar Vishal, I'll create task to trace this issue. Thanks for your suggestions.
Regards, He Xiaoqiao On Sun, Nov 27, 2016 at 1:41 AM, Kumar Vishal <[email protected]> wrote: > Hi Xiaoqiao He, > > You can go ahead with DAT implementation, based on the result. > I will look forward for you PR. > > Please let me know if you need any support:). > > -Regards > KUmar Vishal > > On Fri, Nov 25, 2016 at 11:22 PM, Xiaoqiao He <[email protected]> wrote: > > > Hi Liang, Kumar Vishal, > > > > I has done a standard benchmark about multiply data structures for > > Dictionary following your suggestions. Based on the test results, I think > > DAT may be the best choice for CarbonData. > > > > *1. Here are 2 test results:* > > ----------------------------------------------------------------------- > > Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for > Dictionary > > HashMap : java.util.HashMap > > DAT (Double Array Trie): > > https://github.com/komiya-atsushi/darts-java > > RadixTree: > > https://github.com/npgall/concurrent-trees > > TrieDict (Dictionary in Kylin): > > http://kylin.apache.org/blog/2015/08/13/kylin-dictionary > > Dictionary Source (Traditional Chinese): > > https://raw.githubusercontent.com/fxsjy/jieba/master/extra_ > > dict/dict.txt.big > > ================Test Result================ > > a. Dictionary Size:584429 > > -------- > > b. Build Time (ms) : > > DAT : 5714 > > HashMap : 110 > > RadixTree : 22044 > > TrieDict : 855 > > -------- > > c. Memory footprint in 64-bit JVM (bytes) : > > DAT : 16779752 > > HashMap : 32196592 > > RadixTree : 46130584 > > TrieDict : 10443608 > > -------- > > d. Retrieval Performance for 9935293 query times (ms) : > > DAT : 585 > > HashMap : 1010 > > RadixTree : 417639 > > TrieDict : 8664 > > ================Test Result================ > > > > ================Test Result================ > > a. Dictionary Size:584429 > > -------- > > b. Build Time (ms) : > > DAT : 5867 > > HashMap : 100 > > RadixTree : 22082 > > TrieDict : 840 > > -------- > > c. Memory footprint in 64-bit JVM (bytes) : > > DAT : 16779752 > > HashMap : 32196592 > > RadixTree : 46130584 > > TrieDict : 10443608 > > -------- > > d. Retrieval Performance for 9935293 query times (ms) : > > DAT : 593 > > HashMap : 821 > > RadixTree : 422297 > > TrieDict : 8752 > > ================Test Result================ > > > > *2. Conclusion:* > > a. TrieDict is good for building tree and less memory footprint overhead, > > but worst retrieval performance, > > b. DAT is a good tradeoff between memory footprint and retrieval > > performance, > > c. RadixTree has the worst performance in different aspects. > > > > *3. Result Analysis:* > > a. With Trie the memory footprint of the TrieDict mapping is kinda > > minimized if compared to HashMap, in order to improve performance there > is > > a cache layer overlays on top of Trie. > > b. Because a large number of duplicate prefix data, the total memory > > footprint is more than trie, meanwhile i think calculating string hash > code > > of traditional Chinese consume considerable time overhead, so the > > performance is not the best. > > c. DAT is a better tradeoff. > > d. I have no idea why RadixTree has the worst performance in terms of > > memory, retrieval and building tree. > > > > > > On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <[email protected]> > > wrote: > > > > > Hi xiaoqiao > > > > > > ok, look forward to seeing your test result. > > > Can you take this task for this improvement? Please let me know if you > > need > > > any support :) > > > > > > Regards > > > Liang > > > > > > > > > hexiaoqiao wrote > > > > Hi Kumar Vishal, > > > > > > > > Thanks for your suggestions. As you said, choose Trie replace HashMap > > we > > > > can get better memory footprint and also good performance. Of course, > > DAT > > > > is not only choice, and I will do test about DAT vs Radix Trie and > > > release > > > > the test result as soon as possible. Thanks your suggestions again. > > > > > > > > Regards, > > > > Xiaoqiao > > > > > > > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal < > > > > > > > kumarvishal1802@ > > > > > > > > > > > > wrote: > > > > > > > >> Hi XIaoqiao He, > > > >> +1, > > > >> For forward dictionary case it will be very good optimisation, as > our > > > >> case > > > >> is very specific storing byte array to int mapping[data to surrogate > > key > > > >> mapping], I think we will get much better memory footprint and > > > >> performance > > > >> will be also good(2x). We can also try radix tree(radix trie), it is > > > more > > > >> optimise for storage. > > > >> > > > >> -Regards > > > >> Kumar Vishal > > > >> > > > >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen < > > > > > > > chenliang6136@ > > > > > > > > > > > >> wrote: > > > >> > > > >> > Hi xiaoqiao > > > >> > > > > >> > For the below example, 600K dictionary data: > > > >> > It is to say that using "DAT" can save 36M memory against > > > >> > "ConcurrentHashMap", whereas the performance just lost less > > (1718ms) ? > > > >> > > > > >> > One more question:if increases the dictionary data size, what's > the > > > >> > comparison results "ConcurrentHashMap" VS "DAT" > > > >> > > > > >> > Regards > > > >> > Liang > > > >> > ------------------------------------------------------------ > > > >> > ------------------------------------------ > > > >> > a. memory footprint (approximate quantity) in 64-bit JVM: > > > >> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*) > > > >> > > > > >> > b. retrieval performance: total time(ms) of 500 million query: > > > >> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*) > > > >> > > > > >> > Regards > > > >> > Liang > > > >> > > > > >> > hexiaoqiao wrote > > > >> > > hi Liang, > > > >> > > > > > >> > > Thanks for your reply, i need to correct the experiment result > > > >> because > > > >> > > it's > > > >> > > wrong order NO.1 column of result data table. > > > >> > > > > > >> > > In order to compare performance between Trie and HashMap, Two > > > >> different > > > >> > > structures are constructed using the same dictionary data which > > size > > > >> is > > > >> > > 600K and each item's length is between 2 and 50 bytes. > > > >> > > > > > >> > > ConcurrentHashMap (structure which is used in CarbonData > > currently) > > > >> vs > > > >> > > Double > > > >> > > Array Trie (one implementation of Trie Structures) > > > >> > > > > > >> > > a. memory footprint (approximate quantity) in 64-bit JVM: > > > >> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*) > > > >> > > > > > >> > > b. retrieval performance: total time(ms) of 500 million query: > > > >> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*) > > > >> > > > > > >> > > Regards, > > > >> > > He Xiaoqiao > > > >> > > > > > >> > > > > > >> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen < > > > >> > > > > >> > > chenliang6136@ > > > >> > > > > >> > > > wrote: > > > >> > > > > > >> > >> Hi xiaoqiao > > > >> > >> > > > >> > >> This improvement looks great! > > > >> > >> Can you please explain the below data, what does it mean? > > > >> > >> ---------- > > > >> > >> ConcurrentHashMap > > > >> > >> ~68MB 14543 > > > >> > >> Double Array Trie > > > >> > >> ~104MB 12825 > > > >> > >> > > > >> > >> Regards > > > >> > >> Liang > > > >> > >> > > > >> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He < > > > >> > > > > >> > > xq.he2009@ > > > >> > > > > >> > > >: > > > >> > >> > > > >> > >> > Hi All, > > > >> > >> > > > > >> > >> > I would like to propose Dictionary improvement which using > Trie > > > in > > > >> > >> place > > > >> > >> of > > > >> > >> > HashMap. > > > >> > >> > > > > >> > >> > In order to speedup aggregation, reduce run-time memory > > > footprint, > > > >> > >> enable > > > >> > >> > fast > > > >> > >> > distinct count etc, CarbonData encodes data using dictionary > at > > > >> file > > > >> > >> level > > > >> > >> > or table level based on cardinality. It is a general and > > > efficient > > > >> way > > > >> > >> in > > > >> > >> > many big data systems, but when apply ConcurrentHashMap > > > >> > >> > to maintain Dictionary in CarbonData currently, memory > overhead > > > of > > > >> > >> > Driver is very huge since it has to load whole Dictionary to > > > >> decode > > > >> > >> actual > > > >> > >> > data value, especially column cardinality is a large number. > > and > > > >> > >> CarbonData > > > >> > >> > will not do dictionary if cardinality > 1 million at default > > > >> behavior. > > > >> > >> > > > > >> > >> > I propose using Trie in place of HashMap for the following > > three > > > >> > >> reasons: > > > >> > >> > (1) Trie is a proper structure for Dictionary, > > > >> > >> > (2) Reduce memory footprint, > > > >> > >> > (3) Not impact retrieval performance > > > >> > >> > > > > >> > >> > The experimental results show that Trie is able to meet the > > > >> > >> requirement. > > > >> > >> > a. ConcurrentHashMap vs Double Array Trie > > > >> > >> > <https://linux.thai.net/~thep/datrie/datrie.html>(one > > > >> > >> implementation of > > > >> > >> > Trie Structures) > > > >> > >> > b. Dictionary size: 600K > > > >> > >> > c. Memory footprint and query time > > > >> > >> > - memory footprint (64-bit JVM) 500 million query time(ms) > > > >> > >> > ConcurrentHashMap > > > >> > >> > ~68MB 14543 > > > >> > >> > Double Array Trie > > > >> > >> > ~104MB 12825 > > > >> > >> > > > > >> > >> > Please share your suggestions about the proposed improvement > of > > > >> > >> Dictionary. > > > >> > >> > > > > >> > >> > Regards > > > >> > >> > He Xiaoqiao > > > >> > >> > > > > >> > >> > > > >> > >> > > > >> > >> > > > >> > >> -- > > > >> > >> Regards > > > >> > >> Liang > > > >> > >> > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > -- > > > >> > View this message in context: http://apache-carbondata- > > > >> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use- > > > >> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of- > > > >> > Dictionary-tp3132p3143.html > > > >> > Sent from the Apache CarbonData Mailing List archive mailing list > > > >> archive > > > >> > at Nabble.com. > > > >> > > > > >> > > > > > > > > > > > > > > > > > > -- > > > View this message in context: http://apache-carbondata- > > > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use- > > > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of- > > > Dictionary-tp3132p3186.html > > > Sent from the Apache CarbonData Mailing List archive mailing list > archive > > > at Nabble.com. > > > > > >
