Hi All,

I would like to propose a Dictionary improvement: using a Trie in place of a
HashMap.

In order to speed up aggregation, reduce the run-time memory footprint, enable
fast distinct counts, etc., CarbonData encodes data using a dictionary at file
level or table level based on cardinality. This is a general and efficient
technique in many big data systems, but CarbonData currently maintains the
Dictionary with a ConcurrentHashMap, so the memory overhead on the Driver is
very high: the Driver has to load the whole Dictionary to decode actual data
values, which is especially costly when column cardinality is large. By
default, CarbonData does not apply dictionary encoding at all when cardinality
exceeds 1 million.
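
For illustration, here is a minimal sketch of the current HashMap-based
approach: the whole value <-> surrogate-key mapping must be resident in Driver
memory before any value can be encoded or decoded. The class and method names
below are hypothetical, not CarbonData's actual API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a ConcurrentHashMap-based dictionary (not CarbonData's real classes).
public class HashMapDictionarySketch {
    // surrogate key -> original value, used when decoding query results
    private final Map<Integer, String> keyToValue = new ConcurrentHashMap<>();
    // original value -> surrogate key, used when encoding incoming data
    private final Map<String, Integer> valueToKey = new ConcurrentHashMap<>();

    public void put(int surrogateKey, String value) {
        keyToValue.put(surrogateKey, value);
        valueToKey.put(value, surrogateKey);
    }

    public String decode(int surrogateKey) {
        return keyToValue.get(surrogateKey);  // full map must already be loaded
    }

    public Integer encode(String value) {
        return valueToKey.get(value);         // null if the value is not in the dictionary
    }
}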

I propose using a Trie in place of the HashMap for the following three reasons:
(1) a Trie is a natural structure for a Dictionary (see the sketch after this list),
(2) it reduces the memory footprint, because values that share prefixes share nodes,
(3) it does not noticeably impact retrieval performance.
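
As a simple illustration of reason (1), a character-trie dictionary stores each
value along a path of shared nodes and keeps the surrogate key only at the
terminal node. The sketch below uses hypothetical names and a plain
pointer-based trie, not the Double Array Trie itself.

import java.util.HashMap;
import java.util.Map;

// Hypothetical pointer-based trie dictionary, for illustration only.
public class TrieDictionarySketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        Integer surrogateKey;   // null unless a dictionary value ends at this node
    }

    private final Node root = new Node();

    public void put(String value, int surrogateKey) {
        Node node = root;
        for (char c : value.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.surrogateKey = surrogateKey;
    }

    public Integer encode(String value) {
        Node node = root;
        for (char c : value.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;    // value not present in the dictionary
            }
        }
        return node.surrogateKey;
    }
}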

The experimental results below show that a Trie is able to meet these requirements.
a. ConcurrentHashMap vs Double Array Trie
<https://linux.thai.net/~thep/datrie/datrie.html> (one implementation of Trie structures)
b. Dictionary size: 600K
c. Memory footprint and query time
                     memory footprint (64-bit JVM)   time for 500 million queries (ms)
ConcurrentHashMap    ~68MB                           14543
Double Array Trie    ~104MB                          12825
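
For reference, the Double Array Trie linked above packs all trie transitions
into two flat integer arrays, base and check, which is where its compactness
comes from. Below is a hedged sketch of just the transition rule, with
illustrative field names rather than the datrie library's actual layout: from
state s on input code c, the candidate next state is base[s] + c, and the move
is valid only if the check entry of that slot points back to s.

// Hedged sketch of the Double Array Trie transition step (illustrative only).
public class DoubleArrayTrieSketch {
    private final int[] base;    // base[s] + inputCode gives the candidate next state
    private final int[] check;   // check[t] records the state that owns slot t

    public DoubleArrayTrieSketch(int[] base, int[] check) {
        this.base = base;
        this.check = check;
    }

    // Returns the next state for (state, inputCode), or -1 if no such transition exists.
    public int walk(int state, int inputCode) {
        int next = base[state] + inputCode;
        if (next >= 0 && next < check.length && check[next] == state) {
            return next;
        }
        return -1;
    }
}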

Please share your suggestions on the proposed Dictionary improvement.

Regards
He Xiaoqiao
