A year or so ago I gave a presentation at the MapR sales kickoff that covers the memory characteristics of operators. Unfortunately, I no longer have access to the content, but someone internal to MapR should have it (maybe Ellen or Neeraja).
Approximately (from memory):

  total hash aggregate size = entries * (links (4 bytes) + hash code (4 bytes)
      + aggregate key size + aggregate workspace variable size)

  aggregate key size = fixed value size of all keys + variable value size of all keys

  fixed value size = fixed width size (e.g. 4 bytes for a four-byte int)
      + nullability (1 or 0 bytes)

  variable value size = offset (4 bytes) + length of data + nullability (1 or 0 bytes)

  aggregate workspace variable size = sum over the aggregate function fields of
      each field's value size

Note that the entry count is actually rounded up to the nearest power of two.
Additionally, every vector is also rounded up to the nearest power of two (this
includes the key vectors, the workspace vectors, and the links and hash code
vectors). A small illustrative sketch of this arithmetic is included after the
quoted thread below.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, May 27, 2016 at 11:21 AM, Aman Sinha <amansi...@apache.org> wrote:

> Rahul, can you send me the query profile separately? Also, can you try
> group-by on fixed-width columns instead of Varchar?
> With a single group, the hash table itself should be consuming a relatively
> small amount of memory.
>
> On Fri, May 27, 2016 at 11:14 AM, Zelaine Fong <zf...@maprtech.com> wrote:
>
> > My guess would be that for hashing, a hash table is pre-allocated based
> > on the number of keys in the hash. That would explain why with more keys,
> > the memory usage grows. But that's just my guess. Someone who really
> > understands how this works should chime in :).
> >
> > -- Zelaine
> >
> > On Fri, May 27, 2016 at 10:36 AM, rahul challapalli <
> > challapallira...@gmail.com> wrote:
> >
> > > Any inputs on this one?
> > >
> > > On Wed, May 25, 2016 at 7:51 PM, rahul challapalli <
> > > challapallira...@gmail.com> wrote:
> > >
> > > > It's using hash aggregation.
> > > > On May 25, 2016 7:48 PM, "Zelaine Fong" <zf...@maprtech.com> wrote:
> > > >
> > > >> What does the explain plan show? I.e., is the group by being done
> > > >> via a hash agg or a streaming agg? If it's a streaming agg, then you
> > > >> still have to sort the entire data set before you reduce it down to
> > > >> a single group. That would explain the increase in memory as you add
> > > >> group by keys.
> > > >>
> > > >> -- Zelaine
> > > >>
> > > >> On Wed, May 25, 2016 at 5:50 PM, rahul challapalli <
> > > >> challapallira...@gmail.com> wrote:
> > > >>
> > > >> > I am trying to understand the memory usage patterns for hash
> > > >> > aggregate. The below query completes in 9.163 seconds and uses
> > > >> > 24 MB of memory for hash-aggregate (according to the profile):
> > > >> >
> > > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > > >> > 'kfjhl' c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5 from mem_heavy1) d
> > > >> > group by d.c1, d.c2, d.c3, d.c4, d.c5;
> > > >> >
> > > >> > Adding one more constant column to the group by, the below query
> > > >> > takes 11.638 seconds and uses 29 MB of RAM:
> > > >> >
> > > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > > >> > 'kfjhl' c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6 from
> > > >> > mem_heavy1) d group by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6;
> > > >> >
> > > >> > The below query, with one more constant column added to the group
> > > >> > by, takes 14.622 seconds and uses 33 MB of memory:
> > > >> >
> > > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > > >> > 'kfjhl' c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6, 'ciuh' c7
> > > >> > from mem_heavy1) d group by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6,
> > > >> > d.c7;
> > > >> >
> > > >> > As you can see, there is only one distinct group in all the above
> > > >> > cases. It looks like the memory usage is proportional to the number
> > > >> > of elements in the group by clause. Is this expected?
> > > >> >
> > > >> > Is the increase in time expected between the above queries? (As we
> > > >> > did not introduce any new groups)
> > > >> >
> > > >> > - Rahul
> > > >> >
> > > >>
> > >
> >
>
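For anyone who wants to plug numbers into the formula Jacques gives above, here
is a minimal Java sketch of that arithmetic. It is only an illustration, not
Drill's actual HashAgg allocation code: the class, method, and parameter names
are invented, it rounds only the entry count (not each individual vector) to a
power of two, and it ignores the batch-based pre-allocation the operator does.

public class HashAggMemoryEstimate {

    // Round a count up to the next power of two. The note above says both the
    // entry count and every value vector are sized this way; for simplicity
    // this sketch only applies the rounding to the entry count.
    static long roundUpToPowerOfTwo(long n) {
        long p = 1;
        while (p < n) {
            p <<= 1;
        }
        return p;
    }

    // Per-entry formula from the message above. All parameter names are made
    // up for illustration.
    static long estimateBytes(long entries,
                              long fixedKeyBytes,    // fixed-width key bytes (+1 per nullable key)
                              int numVarKeys,        // variable-width keys: each adds a 4-byte offset
                              long varKeyDataBytes,  // data bytes of the variable-width keys per entry
                              long workspaceBytes) { // aggregate workspace bytes per entry
        long roundedEntries = roundUpToPowerOfTwo(entries);
        long perEntry = 4                            // link
                + 4                                  // hash code
                + fixedKeyBytes
                + numVarKeys * 4L + varKeyDataBytes
                + workspaceBytes;
        return roundedEntries * perEntry;
    }

    public static void main(String[] args) {
        // Very rough stand-in for Rahul's first query: a single group, five
        // varchar keys of roughly 5 characters each, and an assumed 8 bytes
        // of max() workspace per entry.
        long bytes = estimateBytes(1, 0, 5, 5 * 5, 8);
        System.out.println("rough per-group estimate: " + bytes + " bytes");
    }
}

Run this way, the single group accounts for only a handful of bytes, which
suggests the tens of megabytes reported in the profiles come mostly from the
per-vector power-of-two rounding and batch pre-allocation (which scale with the
number of group-by columns) rather than from the group data itself.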