Hi xiaoqiao This improvement looks great! Can you please explain the below data, what does it mean? ---------- ConcurrentHashMap ~68MB 14543 Double Array Trie ~104MB 12825
Regards Liang 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <[email protected]>: > Hi All, > > I would like to propose Dictionary improvement which using Trie in place of > HashMap. > > In order to speedup aggregation, reduce run-time memory footprint, enable > fast > distinct count etc, CarbonData encodes data using dictionary at file level > or table level based on cardinality. It is a general and efficient way in > many big data systems, but when apply ConcurrentHashMap > to maintain Dictionary in CarbonData currently, memory overhead of > Driver is very huge since it has to load whole Dictionary to decode actual > data value, especially column cardinality is a large number. and CarbonData > will not do dictionary if cardinality > 1 million at default behavior. > > I propose using Trie in place of HashMap for the following three reasons: > (1) Trie is a proper structure for Dictionary, > (2) Reduce memory footprint, > (3) Not impact retrieval performance > > The experimental results show that Trie is able to meet the requirement. > a. ConcurrentHashMap vs Double Array Trie > <https://linux.thai.net/~thep/datrie/datrie.html>(one implementation of > Trie Structures) > b. Dictionary size: 600K > c. Memory footprint and query time > - memory footprint (64-bit JVM) 500 million query time(ms) > ConcurrentHashMap > ~68MB 14543 > Double Array Trie > ~104MB 12825 > > Please share your suggestions about the proposed improvement of Dictionary. > > Regards > He Xiaoqiao > -- Regards Liang
