Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Xiaoqiao He Wed, 23 Nov 2016 19:31:03 -0800

hi Liang,

Thanks for your reply, i need to correct the experiment result because it's
wrong order NO.1 column of result data table.


In order to compare performance between Trie and HashMap, Two different
structures are constructed using the same dictionary data which size is
600K and each item's length is between 2 and 50 bytes.

ConcurrentHashMap (structure which is used in CarbonData currently) vs Double
Array Trie (one implementation of Trie Structures)

a. memory footprint (approximate quantity) in 64-bit JVM:
~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)

b. retrieval performance: total time(ms) of 500 million query:
12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)

Regards,
He Xiaoqiao


On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <[email protected]> wrote:

> Hi xiaoqiao
>
> This improvement looks great!
> Can you please explain the below data, what does it mean?
> ----------
> ConcurrentHashMap
> ~68MB 14543
> Double Array Trie
> ~104MB 12825
>
> Regards
> Liang
>
> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <[email protected]>:
>
> >  Hi All,
> >
> > I would like to propose Dictionary improvement which using Trie in place
> of
> > HashMap.
> >
> > In order to speedup aggregation, reduce run-time memory footprint, enable
> > fast
> > distinct count etc, CarbonData encodes data using dictionary at file
> level
> > or table level based on cardinality. It is a general and efficient way in
> > many big data systems, but when apply ConcurrentHashMap
> > to maintain Dictionary in CarbonData currently, memory overhead of
> > Driver is very huge since it has to load whole Dictionary to decode
> actual
> > data value, especially column cardinality is a large number. and
> CarbonData
> > will not do dictionary if cardinality > 1 million at default behavior.
> >
> > I propose using Trie in place of HashMap for the following three reasons:
> > (1) Trie is a proper structure for Dictionary,
> > (2) Reduce memory footprint,
> > (3) Not impact retrieval performance
> >
> > The experimental results show that Trie is able to meet the requirement.
> > a. ConcurrentHashMap vs Double Array Trie
> > <https://linux.thai.net/~thep/datrie/datrie.html>(one implementation of
> > Trie Structures)
> > b. Dictionary size: 600K
> > c. Memory footprint and query time
> > - memory footprint (64-bit JVM) 500 million query time(ms)
> > ConcurrentHashMap
> > ~68MB 14543
> > Double Array Trie
> > ~104MB 12825
> >
> > Please share your suggestions about the proposed improvement of
> Dictionary.
> >
> > Regards
> > He Xiaoqiao
> >
>
>
>
> --
> Regards
> Liang
>

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Reply via email to