[jira] [Commented] (ASTERIXDB-1556) Hash Table used by External hash group-by doesn't conform to the budget.

Taewoo Kim (JIRA) Thu, 08 Sep 2016 14:36:42 -0700

    [ 
https://issues.apache.org/jira/browse/ASTERIXDB-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475096#comment-15475096
 ]


Taewoo Kim commented on ASTERIXDB-1556:
---------------------------------------

Another discussion with [~buyingyi] regarding the hash table size estimation:

This is Yingyi's idea: rather than letting the system admin set a parameter, the 
compiler can provide a more reasonable number using a worst-case scenario.

Given the group-memory budget, and given that each tuple in the data table 
consists of at least 9 bytes of overhead (tuple offset, field offset, and type 
tag) plus X bytes of real data payload, the compiler can assign a memory budget 
to the hash table. The details are:

Assume each tuple in the data table has only one field:

4 bytes for tuple offset
4 bytes for field offset
X bytes for payload
1 byte for type tag

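If the data table occupies 32MB, the hash table needs the following size:
{code}
Min(32M/(8+X+1), 2^(8X)) * 8 + Min(32M/(8+X+1), 2^(8X)) * 32
{code}
Here Min(...) is the worst-case number of hash table entries: it is bounded both 
by the number of tuples that fit in the data table and by the number of distinct 
X-byte values. Each entry costs 8 + 32 = 40 bytes.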

1 byte: 256 * 40 / 1000 = 10KB
2 bytes: 65536 * 40 = 2.6MB
3 bytes: (32M/12) * 40 = 106MB
4 bytes: (32M/13) * 40 = 98MB
5 bytes: (32M/14) * 40 = 91MB
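
The table above can be reproduced with a small script. This is only a sketch of the formula as written in this comment, not code from AsterixDB: the function name hash_table_bytes and its parameters are made up for illustration, "32M" is taken as 32,000,000 bytes, and the 8 + 32 = 40 bytes per entry come straight from the formula. Evaluated this way, the 2-byte case is capped by 2^16 distinct values and comes out at about 2.6MB.

```python
def hash_table_bytes(data_table_bytes, payload_bytes,
                     overhead_bytes=9, entry_bytes=40):
    """Worst-case hash table size for one-field tuples (illustrative only).

    overhead_bytes: 4 (tuple offset) + 4 (field offset) + 1 (type tag).
    entry_bytes:    8 + 32, the two per-entry costs in the formula.
    """
    # Number of tuples that fit into the data table.
    tuples_that_fit = data_table_bytes // (overhead_bytes + payload_bytes)
    # Number of distinct values an X-byte payload can take.
    distinct_values = 2 ** (8 * payload_bytes)
    # The hash table never needs more entries than either bound.
    return min(tuples_that_fit, distinct_values) * entry_bytes


if __name__ == "__main__":
    DATA_TABLE = 32_000_000  # assumption: "32M" read as decimal megabytes
    for x in range(1, 6):
        size = hash_table_bytes(DATA_TABLE, x)
        print(f"{x} byte payload: {size / 1e6:.2f} MB")
```

The peak sits at the smallest payload size for which the number of distinct values no longer caps the entry count, which is 3 bytes here.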

So, 106 MB is the maximal value. Taking the 4-byte case (98 MB), the ratio of 
the hash table is 98 / (32 + 98) = 0.75. Even if we change the budget, this 
ratio doesn't change, because both sizes scale with the budget. So, for any 
one-field tuple, we can assign 75% of the group-memory budget to the hash table.

Similarly, for multiple-field tuples:
2 fields (in the grouping result):
4 bytes for tuple offset
8 bytes for field offset
2X bytes for payload
2 bytes for type tag

58/(32+58) = 0.64

3 fields (in the grouping result):
4 bytes for tuple offset
16 bytes for field offset
3X bytes for payload
3 bytes for type tag

36/(32+36) = 0.53

We can calculate this ratio for any number of fields. In conclusion: we can set 
the ratio based on the number of fields and the group-memory budget. 
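
The per-field-count ratios can be checked the same way. The sketch below is illustrative and is not AsterixDB code; the names tuple_bytes and hash_table_ratio are hypothetical. It assumes X = 4 bytes of payload per field (the ratios quoted above appear to correspond to that value), 40 bytes per hash table entry, the field-offset sizes exactly as listed in this comment (4, 8, and 16 bytes), and that the 2^(8X) cap on distinct values does not bind.

```python
# Field-offset bytes per field count, copied from the layouts in this comment.
FIELD_OFFSET_BYTES = {1: 4, 2: 8, 3: 16}


def tuple_bytes(num_fields, payload_per_field=4):
    """Bytes one tuple occupies in the data table (assumed layout)."""
    tuple_offset = 4
    type_tags = num_fields  # 1 byte per field
    payload = num_fields * payload_per_field
    return tuple_offset + FIELD_OFFSET_BYTES[num_fields] + payload + type_tags


def hash_table_ratio(data_table_bytes, num_fields, entry_bytes=40):
    """Share of (data table + hash table) memory taken by the hash table."""
    entries = data_table_bytes // tuple_bytes(num_fields)
    hash_bytes = entries * entry_bytes
    return hash_bytes / (data_table_bytes + hash_bytes)


if __name__ == "__main__":
    for budget in (32_000_000, 64_000_000):
        for fields in (1, 2, 3):
            ratio = hash_table_ratio(budget, fields)
            print(f"data table {budget} bytes, {fields} field(s): {ratio:.3f}")
```

Under these assumptions the ratio is approximately 40 / (tuple size + 40), so it depends on the tuple layout but not on the budget. The unrounded value for 2 fields is about 0.645; the 0.64 above comes from using the rounded 58 MB figure.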
 

> Hash Table used by External hash group-by doesn't conform to the budget.
> ------------------------------------------------------------------------
>
>                 Key: ASTERIXDB-1556
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1556
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>            Assignee: Taewoo Kim
>            Priority: Critical
>              Labels: soon
>         Attachments: 2wayjoin.pdf, 2wayjoin.rtf, 2wayjoinplan.rtf, 
> 3wayjoin.pdf, 3wayjoin.rtf, 3wayjoinplan.rtf
>
>
> When we enable prefix-based fuzzy-join and apply a multi-way fuzzy-join (more 
> than 2-way), the system generates an out-of-memory exception. 
> Since a fuzzy-join is written in 30-40 lines of AQL code and this AQL is 
> translated into a massive number of operators (more than 200 operators in the 
> plan for a 3-way fuzzy join), it could generate an out-of-memory exception.
> /// Update: as the discussion went on, we found that the hash table in the 
> external hash group-by doesn't conform to the frame limit. So, an out-of-memory 
> exception happens during the execution of an external hash group-by operator.



