Hi Liang,

Generally, yes: dictionary items that share a prefix store that prefix only once in the DAT, so the more data there is, the better the result.
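To illustrate the prefix sharing, here is a minimal sketch. It assumes a plain pointer-based trie, which is not a Double Array Trie and not CarbonData code; the class name PrefixSharingSketch and the sample items are made up for illustration. It only counts how many nodes get created, to show that items with a common prefix reuse the same nodes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a simple trie, not the DAT used in the experiment.
public class PrefixSharingSketch {
    /** One trie node; children are keyed by the next character. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a dictionary item ends at this node
    }

    private final Node root = new Node();
    private int nodeCount = 0; // nodes created so far, excluding the root

    /** Inserts an item, creating nodes only for characters not already on the path. */
    void insert(String item) {
        Node cur = root;
        for (int i = 0; i < item.length(); i++) {
            char c = item.charAt(i);
            Node next = cur.children.get(c);
            if (next == null) {
                next = new Node();
                cur.children.put(c, next);
                nodeCount++;
            }
            cur = next;
        }
        cur.terminal = true;
    }

    /** Returns true only if the exact item was inserted (a bare prefix does not count). */
    boolean contains(String item) {
        Node cur = root;
        for (int i = 0; i < item.length(); i++) {
            cur = cur.children.get(item.charAt(i));
            if (cur == null) {
                return false;
            }
        }
        return cur.terminal;
    }

    int nodeCount() {
        return nodeCount;
    }

    public static void main(String[] args) {
        String[] items = {"shanghai", "shandong", "shanxi", "shenzhen"}; // made-up sample data
        PrefixSharingSketch trie = new PrefixSharingSketch();
        int totalChars = 0;
        for (String s : items) {
            trie.insert(s);
            totalChars += s.length();
        }
        System.out.println("characters if stored separately: " + totalChars);
        System.out.println("trie nodes: " + trie.nodeCount());
    }
}
```

With these four items, 30 characters end up in 20 trie nodes, because "shan" and "sh" are stored only once. A DAT packs the same structure into base and check arrays, so its real memory behaviour differs from this sketch.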
Actually, the main cost of DAT is building the tree, and I don't think we need to consider it, since this cost is paid only once, when the data is loaded.

FYI.

Regards,
Xiaoqiao

On Thu, Nov 24, 2016 at 2:42 PM, Liang Chen <chenliang6...@gmail.com> wrote:

> Hi xiaoqiao
>
> For the below example, 600K dictionary data:
> It is to say that using "DAT" can save 36M memory against
> "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
>
> One more question:if increases the dictionary data size, what's the
> comparison results "ConcurrentHashMap" VS "DAT"
>
> Regards
> Liang
> ------------------------------------------------------------
> a. memory footprint (approximate quantity) in 64-bit JVM:
> ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
>
> b. retrieval performance: total time(ms) of 500 million query:
> 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
>
> Regards
> Liang
>
> hexiaoqiao wrote
> > hi Liang,
> >
> > Thanks for your reply, i need to correct the experiment result because
> > it's wrong order NO.1 column of result data table.
> >
> > In order to compare performance between Trie and HashMap, Two different
> > structures are constructed using the same dictionary data which size is
> > 600K and each item's length is between 2 and 50 bytes.
> >
> > ConcurrentHashMap (structure which is used in CarbonData currently) vs
> > Double Array Trie (one implementation of Trie Structures)
> >
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards,
> > He Xiaoqiao
> >
> >
> > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <chenliang6136@> wrote:
> >
> >> Hi xiaoqiao
> >>
> >> This improvement looks great!
> >> Can you please explain the below data, what does it mean?
> >> ----------
> >> ConcurrentHashMap
> >> ~68MB 14543
> >> Double Array Trie
> >> ~104MB 12825
> >>
> >> Regards
> >> Liang
> >>
> >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <xq.he2009@>:
> >>
> >> > Hi All,
> >> >
> >> > I would like to propose Dictionary improvement which using Trie in
> >> > place of HashMap.
> >> >
> >> > In order to speedup aggregation, reduce run-time memory footprint,
> >> > enable fast distinct count etc, CarbonData encodes data using
> >> > dictionary at file level or table level based on cardinality. It is a
> >> > general and efficient way in many big data systems, but when apply
> >> > ConcurrentHashMap to maintain Dictionary in CarbonData currently,
> >> > memory overhead of Driver is very huge since it has to load whole
> >> > Dictionary to decode actual data value, especially column cardinality
> >> > is a large number. and CarbonData will not do dictionary if
> >> > cardinality > 1 million at default behavior.
> >> >
> >> > I propose using Trie in place of HashMap for the following three
> >> > reasons:
> >> > (1) Trie is a proper structure for Dictionary,
> >> > (2) Reduce memory footprint,
> >> > (3) Not impact retrieval performance
> >> >
> >> > The experimental results show that Trie is able to meet the
> >> > requirement.
> >> > a. ConcurrentHashMap vs Double Array Trie
> >> > <https://linux.thai.net/~thep/datrie/datrie.html> (one implementation
> >> > of Trie Structures)
> >> > b. Dictionary size: 600K
> >> > c. Memory footprint and query time
> >> > -                   memory footprint (64-bit JVM)   500 million query time(ms)
> >> > ConcurrentHashMap   ~68MB                           14543
> >> > Double Array Trie   ~104MB                          12825
> >> >
> >> > Please share your suggestions about the proposed improvement of
> >> > Dictionary.
> >> >
> >> > Regards
> >> > He Xiaoqiao
> >> >
> >>
> >> --
> >> Regards
> >> Liang
> >>
>
> --
> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary-tp3132p3143.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
>