Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Xiaoqiao He Thu, 24 Nov 2016 04:47:11 -0800

Hi Kumar Vishal,

Thanks for your suggestions. As you said, choose Trie replace HashMap we
can get better memory footprint and also good performance. Of course, DAT
is not only choice, and I will do test about DAT vs Radix Trie and release
the test result as soon as possible. Thanks your suggestions again.


Regards,
Xiaoqiao

On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal <[email protected]>
wrote:

> Hi XIaoqiao He,
> +1,
> For forward dictionary case it will be very good optimisation, as our case
> is very specific storing byte array to int mapping[data to surrogate key
> mapping], I think we will get much better memory footprint and performance
> will be also good(2x). We can also try radix tree(radix trie), it is more
> optimise for storage.
>
> -Regards
> Kumar Vishal
>
> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen <[email protected]>
> wrote:
>
> > Hi xiaoqiao
> >
> > For the below example, 600K dictionary data:
> > It is to say that using "DAT" can save 36M memory against
> > "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
> >
> > One more question:if increases the dictionary data size, what's the
> > comparison results "ConcurrentHashMap" VS "DAT"
> >
> > Regards
> > Liang
> > ------------------------------------------------------------
> > ------------------------------------------
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards
> > Liang
> >
> > hexiaoqiao wrote
> > > hi Liang,
> > >
> > > Thanks for your reply, i need to correct the experiment result because
> > > it's
> > > wrong order NO.1 column of result data table.
> > >
> > > In order to compare performance between Trie and HashMap, Two different
> > > structures are constructed using the same dictionary data which size is
> > > 600K and each item's length is between 2 and 50 bytes.
> > >
> > > ConcurrentHashMap (structure which is used in CarbonData currently) vs
> > > Double
> > > Array Trie (one implementation of Trie Structures)
> > >
> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > >
> > > b. retrieval performance: total time(ms) of 500 million query:
> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > >
> > > Regards,
> > > He Xiaoqiao
> > >
> > >
> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen &lt;
> >
> > > chenliang6136@
> >
> > > &gt; wrote:
> > >
> > >> Hi xiaoqiao
> > >>
> > >> This improvement looks great!
> > >> Can you please explain the below data, what does it mean?
> > >> ----------
> > >> ConcurrentHashMap
> > >> ~68MB 14543
> > >> Double Array Trie
> > >> ~104MB 12825
> > >>
> > >> Regards
> > >> Liang
> > >>
> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He &lt;
> >
> > > xq.he2009@
> >
> > > &gt;:
> > >>
> > >> >  Hi All,
> > >> >
> > >> > I would like to propose Dictionary improvement which using Trie in
> > >> place
> > >> of
> > >> > HashMap.
> > >> >
> > >> > In order to speedup aggregation, reduce run-time memory footprint,
> > >> enable
> > >> > fast
> > >> > distinct count etc, CarbonData encodes data using dictionary at file
> > >> level
> > >> > or table level based on cardinality. It is a general and efficient
> way
> > >> in
> > >> > many big data systems, but when apply ConcurrentHashMap
> > >> > to maintain Dictionary in CarbonData currently, memory overhead of
> > >> > Driver is very huge since it has to load whole Dictionary to decode
> > >> actual
> > >> > data value, especially column cardinality is a large number. and
> > >> CarbonData
> > >> > will not do dictionary if cardinality > 1 million at default
> behavior.
> > >> >
> > >> > I propose using Trie in place of HashMap for the following three
> > >> reasons:
> > >> > (1) Trie is a proper structure for Dictionary,
> > >> > (2) Reduce memory footprint,
> > >> > (3) Not impact retrieval performance
> > >> >
> > >> > The experimental results show that Trie is able to meet the
> > >> requirement.
> > >> > a. ConcurrentHashMap vs Double Array Trie
> > >> > &lt;https://linux.thai.net/~thep/datrie/datrie.html&gt;(one
> > >> implementation of
> > >> > Trie Structures)
> > >> > b. Dictionary size: 600K
> > >> > c. Memory footprint and query time
> > >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> > >> > ConcurrentHashMap
> > >> > ~68MB 14543
> > >> > Double Array Trie
> > >> > ~104MB 12825
> > >> >
> > >> > Please share your suggestions about the proposed improvement of
> > >> Dictionary.
> > >> >
> > >> > Regards
> > >> > He Xiaoqiao
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Regards
> > >> Liang
> > >>
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > Dictionary-tp3132p3143.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

Reply via email to