Hi All, I would like to propose Dictionary improvement which using Trie in place of HashMap.
In order to speedup aggregation, reduce run-time memory footprint, enable fast distinct count etc, CarbonData encodes data using dictionary at file level or table level based on cardinality. It is a general and efficient way in many big data systems, but when apply ConcurrentHashMap to maintain Dictionary in CarbonData currently, memory overhead of Driver is very huge since it has to load whole Dictionary to decode actual data value, especially column cardinality is a large number. and CarbonData will not do dictionary if cardinality > 1 million at default behavior. I propose using Trie in place of HashMap for the following three reasons: (1) Trie is a proper structure for Dictionary, (2) Reduce memory footprint, (3) Not impact retrieval performance The experimental results show that Trie is able to meet the requirement. a. ConcurrentHashMap vs Double Array Trie <https://linux.thai.net/~thep/datrie/datrie.html>(one implementation of Trie Structures) b. Dictionary size: 600K c. Memory footprint and query time - memory footprint (64-bit JVM) 500 million query time(ms) ConcurrentHashMap ~68MB 14543 Double Array Trie ~104MB 12825 Please share your suggestions about the proposed improvement of Dictionary. Regards He Xiaoqiao