Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Liang Chen Thu, 24 Nov 2016 19:28:41 -0800

Hi xiaoqiao

ok, look forward to seeing your test result.
Can you take this task for this improvement? Please let me know if you need
any support :)


Regards
Liang


hexiaoqiao wrote
> Hi Kumar Vishal,
> 
> Thanks for your suggestions. As you said, choose Trie replace HashMap we
> can get better memory footprint and also good performance. Of course, DAT
> is not only choice, and I will do test about DAT vs Radix Trie and release
> the test result as soon as possible. Thanks your suggestions again.
> 
> Regards,
> Xiaoqiao
> 
> On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal &lt;

> kumarvishal1802@

> &gt;
> wrote:
> 
>> Hi XIaoqiao He,
>> +1,
>> For forward dictionary case it will be very good optimisation, as our
>> case
>> is very specific storing byte array to int mapping[data to surrogate key
>> mapping], I think we will get much better memory footprint and
>> performance
>> will be also good(2x). We can also try radix tree(radix trie), it is more
>> optimise for storage.
>>
>> -Regards
>> Kumar Vishal
>>
>> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen &lt;

> chenliang6136@

> &gt;
>> wrote:
>>
>> > Hi xiaoqiao
>> >
>> > For the below example, 600K dictionary data:
>> > It is to say that using "DAT" can save 36M memory against
>> > "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
>> >
>> > One more question:if increases the dictionary data size, what's the
>> > comparison results "ConcurrentHashMap" VS "DAT"
>> >
>> > Regards
>> > Liang
>> > ------------------------------------------------------------
>> > ------------------------------------------
>> > a. memory footprint (approximate quantity) in 64-bit JVM:
>> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
>> >
>> > b. retrieval performance: total time(ms) of 500 million query:
>> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
>> >
>> > Regards
>> > Liang
>> >
>> > hexiaoqiao wrote
>> > > hi Liang,
>> > >
>> > > Thanks for your reply, i need to correct the experiment result
>> because
>> > > it's
>> > > wrong order NO.1 column of result data table.
>> > >
>> > > In order to compare performance between Trie and HashMap, Two
>> different
>> > > structures are constructed using the same dictionary data which size
>> is
>> > > 600K and each item's length is between 2 and 50 bytes.
>> > >
>> > > ConcurrentHashMap (structure which is used in CarbonData currently)
>> vs
>> > > Double
>> > > Array Trie (one implementation of Trie Structures)
>> > >
>> > > a. memory footprint (approximate quantity) in 64-bit JVM:
>> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
>> > >
>> > > b. retrieval performance: total time(ms) of 500 million query:
>> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
>> > >
>> > > Regards,
>> > > He Xiaoqiao
>> > >
>> > >
>> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen &lt;
>> >
>> > > chenliang6136@
>> >
>> > > &gt; wrote:
>> > >
>> > >> Hi xiaoqiao
>> > >>
>> > >> This improvement looks great!
>> > >> Can you please explain the below data, what does it mean?
>> > >> ----------
>> > >> ConcurrentHashMap
>> > >> ~68MB 14543
>> > >> Double Array Trie
>> > >> ~104MB 12825
>> > >>
>> > >> Regards
>> > >> Liang
>> > >>
>> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He &lt;
>> >
>> > > xq.he2009@
>> >
>> > > &gt;:
>> > >>
>> > >> >  Hi All,
>> > >> >
>> > >> > I would like to propose Dictionary improvement which using Trie in
>> > >> place
>> > >> of
>> > >> > HashMap.
>> > >> >
>> > >> > In order to speedup aggregation, reduce run-time memory footprint,
>> > >> enable
>> > >> > fast
>> > >> > distinct count etc, CarbonData encodes data using dictionary at
>> file
>> > >> level
>> > >> > or table level based on cardinality. It is a general and efficient
>> way
>> > >> in
>> > >> > many big data systems, but when apply ConcurrentHashMap
>> > >> > to maintain Dictionary in CarbonData currently, memory overhead of
>> > >> > Driver is very huge since it has to load whole Dictionary to
>> decode
>> > >> actual
>> > >> > data value, especially column cardinality is a large number. and
>> > >> CarbonData
>> > >> > will not do dictionary if cardinality > 1 million at default
>> behavior.
>> > >> >
>> > >> > I propose using Trie in place of HashMap for the following three
>> > >> reasons:
>> > >> > (1) Trie is a proper structure for Dictionary,
>> > >> > (2) Reduce memory footprint,
>> > >> > (3) Not impact retrieval performance
>> > >> >
>> > >> > The experimental results show that Trie is able to meet the
>> > >> requirement.
>> > >> > a. ConcurrentHashMap vs Double Array Trie
>> > >> > &lt;https://linux.thai.net/~thep/datrie/datrie.html&gt;(one
>> > >> implementation of
>> > >> > Trie Structures)
>> > >> > b. Dictionary size: 600K
>> > >> > c. Memory footprint and query time
>> > >> > - memory footprint (64-bit JVM) 500 million query time(ms)
>> > >> > ConcurrentHashMap
>> > >> > ~68MB 14543
>> > >> > Double Array Trie
>> > >> > ~104MB 12825
>> > >> >
>> > >> > Please share your suggestions about the proposed improvement of
>> > >> Dictionary.
>> > >> >
>> > >> > Regards
>> > >> > He Xiaoqiao
>> > >> >
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> Regards
>> > >> Liang
>> > >>
>> >
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-carbondata-
>> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
>> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
>> > Dictionary-tp3132p3143.html
>> > Sent from the Apache CarbonData Mailing List archive mailing list
>> archive
>> > at Nabble.com.
>> >
>>





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary-tp3132p3186.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Reply via email to