Hi Liang,

Generally, yes, because the same prefix of items in dictionary does not
require to repeat in DAT, and more data better result.

Actually the cost of DAT is building Tree, and i don't think we need to
consider it since this cost appears only once when load data.

FYI.

Regards,
Xiaoqiao

On Thu, Nov 24, 2016 at 2:42 PM, Liang Chen <chenliang6...@gmail.com> wrote:

> Hi xiaoqiao
>
> For the below example, 600K dictionary data:
> It is to say that using "DAT" can save 36M memory against
> "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
>
> One more question:if increases the dictionary data size, what's the
> comparison results "ConcurrentHashMap" VS "DAT"
>
> Regards
> Liang
> ------------------------------------------------------------
> ------------------------------------------
> a. memory footprint (approximate quantity) in 64-bit JVM:
> ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
>
> b. retrieval performance: total time(ms) of 500 million query:
> 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
>
> Regards
> Liang
>
> hexiaoqiao wrote
> > hi Liang,
> >
> > Thanks for your reply, i need to correct the experiment result because
> > it's
> > wrong order NO.1 column of result data table.
> >
> > In order to compare performance between Trie and HashMap, Two different
> > structures are constructed using the same dictionary data which size is
> > 600K and each item's length is between 2 and 50 bytes.
> >
> > ConcurrentHashMap (structure which is used in CarbonData currently) vs
> > Double
> > Array Trie (one implementation of Trie Structures)
> >
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards,
> > He Xiaoqiao
> >
> >
> > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen &lt;
>
> > chenliang6136@
>
> > &gt; wrote:
> >
> >> Hi xiaoqiao
> >>
> >> This improvement looks great!
> >> Can you please explain the below data, what does it mean?
> >> ----------
> >> ConcurrentHashMap
> >> ~68MB 14543
> >> Double Array Trie
> >> ~104MB 12825
> >>
> >> Regards
> >> Liang
> >>
> >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He &lt;
>
> > xq.he2009@
>
> > &gt;:
> >>
> >> >  Hi All,
> >> >
> >> > I would like to propose Dictionary improvement which using Trie in
> >> place
> >> of
> >> > HashMap.
> >> >
> >> > In order to speedup aggregation, reduce run-time memory footprint,
> >> enable
> >> > fast
> >> > distinct count etc, CarbonData encodes data using dictionary at file
> >> level
> >> > or table level based on cardinality. It is a general and efficient way
> >> in
> >> > many big data systems, but when apply ConcurrentHashMap
> >> > to maintain Dictionary in CarbonData currently, memory overhead of
> >> > Driver is very huge since it has to load whole Dictionary to decode
> >> actual
> >> > data value, especially column cardinality is a large number. and
> >> CarbonData
> >> > will not do dictionary if cardinality > 1 million at default behavior.
> >> >
> >> > I propose using Trie in place of HashMap for the following three
> >> reasons:
> >> > (1) Trie is a proper structure for Dictionary,
> >> > (2) Reduce memory footprint,
> >> > (3) Not impact retrieval performance
> >> >
> >> > The experimental results show that Trie is able to meet the
> >> requirement.
> >> > a. ConcurrentHashMap vs Double Array Trie
> >> > &lt;https://linux.thai.net/~thep/datrie/datrie.html&gt;(one
> >> implementation of
> >> > Trie Structures)
> >> > b. Dictionary size: 600K
> >> > c. Memory footprint and query time
> >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> >> > ConcurrentHashMap
> >> > ~68MB 14543
> >> > Double Array Trie
> >> > ~104MB 12825
> >> >
> >> > Please share your suggestions about the proposed improvement of
> >> Dictionary.
> >> >
> >> > Regards
> >> > He Xiaoqiao
> >> >
> >>
> >>
> >>
> >> --
> >> Regards
> >> Liang
> >>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-maili
> ng-list-archive.1130556.n5.nabble.com/Improvement-Use-Trie-
> in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary
> -tp3132p3143.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>

Reply via email to