Hi Xiaoqiao He, You can go ahead with DAT implementation, based on the result. I will look forward for you PR.
Please let me know if you need any support:). -Regards KUmar Vishal On Fri, Nov 25, 2016 at 11:22 PM, Xiaoqiao He <[email protected]> wrote: > Hi Liang, Kumar Vishal, > > I has done a standard benchmark about multiply data structures for > Dictionary following your suggestions. Based on the test results, I think > DAT may be the best choice for CarbonData. > > *1. Here are 2 test results:* > ----------------------------------------------------------------------- > Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for Dictionary > HashMap : java.util.HashMap > DAT (Double Array Trie): > https://github.com/komiya-atsushi/darts-java > RadixTree: > https://github.com/npgall/concurrent-trees > TrieDict (Dictionary in Kylin): > http://kylin.apache.org/blog/2015/08/13/kylin-dictionary > Dictionary Source (Traditional Chinese): > https://raw.githubusercontent.com/fxsjy/jieba/master/extra_ > dict/dict.txt.big > ================Test Result================ > a. Dictionary Size:584429 > -------- > b. Build Time (ms) : > DAT : 5714 > HashMap : 110 > RadixTree : 22044 > TrieDict : 855 > -------- > c. Memory footprint in 64-bit JVM (bytes) : > DAT : 16779752 > HashMap : 32196592 > RadixTree : 46130584 > TrieDict : 10443608 > -------- > d. Retrieval Performance for 9935293 query times (ms) : > DAT : 585 > HashMap : 1010 > RadixTree : 417639 > TrieDict : 8664 > ================Test Result================ > > ================Test Result================ > a. Dictionary Size:584429 > -------- > b. Build Time (ms) : > DAT : 5867 > HashMap : 100 > RadixTree : 22082 > TrieDict : 840 > -------- > c. Memory footprint in 64-bit JVM (bytes) : > DAT : 16779752 > HashMap : 32196592 > RadixTree : 46130584 > TrieDict : 10443608 > -------- > d. Retrieval Performance for 9935293 query times (ms) : > DAT : 593 > HashMap : 821 > RadixTree : 422297 > TrieDict : 8752 > ================Test Result================ > > *2. Conclusion:* > a. TrieDict is good for building tree and less memory footprint overhead, > but worst retrieval performance, > b. DAT is a good tradeoff between memory footprint and retrieval > performance, > c. RadixTree has the worst performance in different aspects. > > *3. Result Analysis:* > a. With Trie the memory footprint of the TrieDict mapping is kinda > minimized if compared to HashMap, in order to improve performance there is > a cache layer overlays on top of Trie. > b. Because a large number of duplicate prefix data, the total memory > footprint is more than trie, meanwhile i think calculating string hash code > of traditional Chinese consume considerable time overhead, so the > performance is not the best. > c. DAT is a better tradeoff. > d. I have no idea why RadixTree has the worst performance in terms of > memory, retrieval and building tree. > > > On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <[email protected]> > wrote: > > > Hi xiaoqiao > > > > ok, look forward to seeing your test result. > > Can you take this task for this improvement? Please let me know if you > need > > any support :) > > > > Regards > > Liang > > > > > > hexiaoqiao wrote > > > Hi Kumar Vishal, > > > > > > Thanks for your suggestions. As you said, choose Trie replace HashMap > we > > > can get better memory footprint and also good performance. Of course, > DAT > > > is not only choice, and I will do test about DAT vs Radix Trie and > > release > > > the test result as soon as possible. Thanks your suggestions again. > > > > > > Regards, > > > Xiaoqiao > > > > > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal < > > > > > kumarvishal1802@ > > > > > > > > > wrote: > > > > > >> Hi XIaoqiao He, > > >> +1, > > >> For forward dictionary case it will be very good optimisation, as our > > >> case > > >> is very specific storing byte array to int mapping[data to surrogate > key > > >> mapping], I think we will get much better memory footprint and > > >> performance > > >> will be also good(2x). We can also try radix tree(radix trie), it is > > more > > >> optimise for storage. > > >> > > >> -Regards > > >> Kumar Vishal > > >> > > >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen < > > > > > chenliang6136@ > > > > > > > > >> wrote: > > >> > > >> > Hi xiaoqiao > > >> > > > >> > For the below example, 600K dictionary data: > > >> > It is to say that using "DAT" can save 36M memory against > > >> > "ConcurrentHashMap", whereas the performance just lost less > (1718ms) ? > > >> > > > >> > One more question:if increases the dictionary data size, what's the > > >> > comparison results "ConcurrentHashMap" VS "DAT" > > >> > > > >> > Regards > > >> > Liang > > >> > ------------------------------------------------------------ > > >> > ------------------------------------------ > > >> > a. memory footprint (approximate quantity) in 64-bit JVM: > > >> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*) > > >> > > > >> > b. retrieval performance: total time(ms) of 500 million query: > > >> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*) > > >> > > > >> > Regards > > >> > Liang > > >> > > > >> > hexiaoqiao wrote > > >> > > hi Liang, > > >> > > > > >> > > Thanks for your reply, i need to correct the experiment result > > >> because > > >> > > it's > > >> > > wrong order NO.1 column of result data table. > > >> > > > > >> > > In order to compare performance between Trie and HashMap, Two > > >> different > > >> > > structures are constructed using the same dictionary data which > size > > >> is > > >> > > 600K and each item's length is between 2 and 50 bytes. > > >> > > > > >> > > ConcurrentHashMap (structure which is used in CarbonData > currently) > > >> vs > > >> > > Double > > >> > > Array Trie (one implementation of Trie Structures) > > >> > > > > >> > > a. memory footprint (approximate quantity) in 64-bit JVM: > > >> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*) > > >> > > > > >> > > b. retrieval performance: total time(ms) of 500 million query: > > >> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*) > > >> > > > > >> > > Regards, > > >> > > He Xiaoqiao > > >> > > > > >> > > > > >> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen < > > >> > > > >> > > chenliang6136@ > > >> > > > >> > > > wrote: > > >> > > > > >> > >> Hi xiaoqiao > > >> > >> > > >> > >> This improvement looks great! > > >> > >> Can you please explain the below data, what does it mean? > > >> > >> ---------- > > >> > >> ConcurrentHashMap > > >> > >> ~68MB 14543 > > >> > >> Double Array Trie > > >> > >> ~104MB 12825 > > >> > >> > > >> > >> Regards > > >> > >> Liang > > >> > >> > > >> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He < > > >> > > > >> > > xq.he2009@ > > >> > > > >> > > >: > > >> > >> > > >> > >> > Hi All, > > >> > >> > > > >> > >> > I would like to propose Dictionary improvement which using Trie > > in > > >> > >> place > > >> > >> of > > >> > >> > HashMap. > > >> > >> > > > >> > >> > In order to speedup aggregation, reduce run-time memory > > footprint, > > >> > >> enable > > >> > >> > fast > > >> > >> > distinct count etc, CarbonData encodes data using dictionary at > > >> file > > >> > >> level > > >> > >> > or table level based on cardinality. It is a general and > > efficient > > >> way > > >> > >> in > > >> > >> > many big data systems, but when apply ConcurrentHashMap > > >> > >> > to maintain Dictionary in CarbonData currently, memory overhead > > of > > >> > >> > Driver is very huge since it has to load whole Dictionary to > > >> decode > > >> > >> actual > > >> > >> > data value, especially column cardinality is a large number. > and > > >> > >> CarbonData > > >> > >> > will not do dictionary if cardinality > 1 million at default > > >> behavior. > > >> > >> > > > >> > >> > I propose using Trie in place of HashMap for the following > three > > >> > >> reasons: > > >> > >> > (1) Trie is a proper structure for Dictionary, > > >> > >> > (2) Reduce memory footprint, > > >> > >> > (3) Not impact retrieval performance > > >> > >> > > > >> > >> > The experimental results show that Trie is able to meet the > > >> > >> requirement. > > >> > >> > a. ConcurrentHashMap vs Double Array Trie > > >> > >> > <https://linux.thai.net/~thep/datrie/datrie.html>(one > > >> > >> implementation of > > >> > >> > Trie Structures) > > >> > >> > b. Dictionary size: 600K > > >> > >> > c. Memory footprint and query time > > >> > >> > - memory footprint (64-bit JVM) 500 million query time(ms) > > >> > >> > ConcurrentHashMap > > >> > >> > ~68MB 14543 > > >> > >> > Double Array Trie > > >> > >> > ~104MB 12825 > > >> > >> > > > >> > >> > Please share your suggestions about the proposed improvement of > > >> > >> Dictionary. > > >> > >> > > > >> > >> > Regards > > >> > >> > He Xiaoqiao > > >> > >> > > > >> > >> > > >> > >> > > >> > >> > > >> > >> -- > > >> > >> Regards > > >> > >> Liang > > >> > >> > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > -- > > >> > View this message in context: http://apache-carbondata- > > >> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use- > > >> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of- > > >> > Dictionary-tp3132p3143.html > > >> > Sent from the Apache CarbonData Mailing List archive mailing list > > >> archive > > >> > at Nabble.com. > > >> > > > >> > > > > > > > > > > > > -- > > View this message in context: http://apache-carbondata- > > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use- > > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of- > > Dictionary-tp3132p3186.html > > Sent from the Apache CarbonData Mailing List archive mailing list archive > > at Nabble.com. > > >
