Hi xiaoqiao ok, look forward to seeing your test result. Can you take this task for this improvement? Please let me know if you need any support :)
Regards Liang hexiaoqiao wrote > Hi Kumar Vishal, > > Thanks for your suggestions. As you said, choose Trie replace HashMap we > can get better memory footprint and also good performance. Of course, DAT > is not only choice, and I will do test about DAT vs Radix Trie and release > the test result as soon as possible. Thanks your suggestions again. > > Regards, > Xiaoqiao > > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal < > kumarvishal1802@ > > > wrote: > >> Hi XIaoqiao He, >> +1, >> For forward dictionary case it will be very good optimisation, as our >> case >> is very specific storing byte array to int mapping[data to surrogate key >> mapping], I think we will get much better memory footprint and >> performance >> will be also good(2x). We can also try radix tree(radix trie), it is more >> optimise for storage. >> >> -Regards >> Kumar Vishal >> >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen < > chenliang6136@ > > >> wrote: >> >> > Hi xiaoqiao >> > >> > For the below example, 600K dictionary data: >> > It is to say that using "DAT" can save 36M memory against >> > "ConcurrentHashMap", whereas the performance just lost less (1718ms) ? >> > >> > One more question:if increases the dictionary data size, what's the >> > comparison results "ConcurrentHashMap" VS "DAT" >> > >> > Regards >> > Liang >> > ------------------------------------------------------------ >> > ------------------------------------------ >> > a. memory footprint (approximate quantity) in 64-bit JVM: >> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*) >> > >> > b. retrieval performance: total time(ms) of 500 million query: >> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*) >> > >> > Regards >> > Liang >> > >> > hexiaoqiao wrote >> > > hi Liang, >> > > >> > > Thanks for your reply, i need to correct the experiment result >> because >> > > it's >> > > wrong order NO.1 column of result data table. >> > > >> > > In order to compare performance between Trie and HashMap, Two >> different >> > > structures are constructed using the same dictionary data which size >> is >> > > 600K and each item's length is between 2 and 50 bytes. >> > > >> > > ConcurrentHashMap (structure which is used in CarbonData currently) >> vs >> > > Double >> > > Array Trie (one implementation of Trie Structures) >> > > >> > > a. memory footprint (approximate quantity) in 64-bit JVM: >> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*) >> > > >> > > b. retrieval performance: total time(ms) of 500 million query: >> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*) >> > > >> > > Regards, >> > > He Xiaoqiao >> > > >> > > >> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen < >> > >> > > chenliang6136@ >> > >> > > > wrote: >> > > >> > >> Hi xiaoqiao >> > >> >> > >> This improvement looks great! >> > >> Can you please explain the below data, what does it mean? >> > >> ---------- >> > >> ConcurrentHashMap >> > >> ~68MB 14543 >> > >> Double Array Trie >> > >> ~104MB 12825 >> > >> >> > >> Regards >> > >> Liang >> > >> >> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He < >> > >> > > xq.he2009@ >> > >> > > >: >> > >> >> > >> > Hi All, >> > >> > >> > >> > I would like to propose Dictionary improvement which using Trie in >> > >> place >> > >> of >> > >> > HashMap. >> > >> > >> > >> > In order to speedup aggregation, reduce run-time memory footprint, >> > >> enable >> > >> > fast >> > >> > distinct count etc, CarbonData encodes data using dictionary at >> file >> > >> level >> > >> > or table level based on cardinality. It is a general and efficient >> way >> > >> in >> > >> > many big data systems, but when apply ConcurrentHashMap >> > >> > to maintain Dictionary in CarbonData currently, memory overhead of >> > >> > Driver is very huge since it has to load whole Dictionary to >> decode >> > >> actual >> > >> > data value, especially column cardinality is a large number. and >> > >> CarbonData >> > >> > will not do dictionary if cardinality > 1 million at default >> behavior. >> > >> > >> > >> > I propose using Trie in place of HashMap for the following three >> > >> reasons: >> > >> > (1) Trie is a proper structure for Dictionary, >> > >> > (2) Reduce memory footprint, >> > >> > (3) Not impact retrieval performance >> > >> > >> > >> > The experimental results show that Trie is able to meet the >> > >> requirement. >> > >> > a. ConcurrentHashMap vs Double Array Trie >> > >> > <https://linux.thai.net/~thep/datrie/datrie.html>(one >> > >> implementation of >> > >> > Trie Structures) >> > >> > b. Dictionary size: 600K >> > >> > c. Memory footprint and query time >> > >> > - memory footprint (64-bit JVM) 500 million query time(ms) >> > >> > ConcurrentHashMap >> > >> > ~68MB 14543 >> > >> > Double Array Trie >> > >> > ~104MB 12825 >> > >> > >> > >> > Please share your suggestions about the proposed improvement of >> > >> Dictionary. >> > >> > >> > >> > Regards >> > >> > He Xiaoqiao >> > >> > >> > >> >> > >> >> > >> >> > >> -- >> > >> Regards >> > >> Liang >> > >> >> > >> > >> > >> > >> > >> > -- >> > View this message in context: http://apache-carbondata- >> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use- >> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of- >> > Dictionary-tp3132p3143.html >> > Sent from the Apache CarbonData Mailing List archive mailing list >> archive >> > at Nabble.com. >> > >> -- View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary-tp3132p3186.html Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
