[ https://issues.apache.org/jira/browse/ASTERIXDB-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475096#comment-15475096 ]
Taewoo Kim commented on ASTERIXDB-1556:
---------------------------------------

Another discussion with [~buyingyi] regarding the hash table size estimation:

This is Yingyi's idea: rather than letting the system admin set a parameter, the compiler can provide a more reasonable number using a worst-case scenario. Given the group-memory budget, and given that each tuple in the data table consists of at least 9 bytes (tuple offset, field offset, and type tag) + X bytes (real data payload), the compiler can assign a memory budget to the hash table. The details are:

Assume each tuple in the data table has only one field:
4 bytes for tuple offset
4 bytes for field offset
X bytes for payload
1 byte for type tag

If the data table occupies 32MB, the hash table needs the following size:
{code}
Min(32M/(8+X+1), 2^(8X)) * 8 + Min(32M/(8+X+1), 2^(8X)) * 32
{code}

1 byte: 256 * 40 / 1000 = 10KB
2 bytes: 0.066M * 40 = 2.6MB
3 bytes: (32M/12) * 40 = 106MB
4 bytes: (32M/13) * 40 = 98MB
5 bytes: (32M/14) * 40 = 91MB

So, 106MB (the 3-byte payload) is the maximal value. Taking the 4-byte case, the ratio of the hash table is 98 / (32 + 98) = 0.75. Even if we change the budget, this ratio doesn't change. So, for any one-field tuple, we can assign 75% of the group-memory budget to the hash table.

Similarly for multiple-field tuple cases:

2 fields (in the grouping result):
4 bytes for tuple offset
8 bytes for field offset
2X bytes for payload
2 bytes for type tag
58 / (32 + 58) = 0.64

3 fields (in the grouping result):
4 bytes for tuple offset
16 bytes for field offset
3X bytes for payload
3 bytes for type tag
36 / (32 + 36) = 0.53

We can calculate this ratio. In conclusion: we can set a ratio based on the number of fields and the group-memory budget.

> Hash Table used by External hash group-by doesn't conform to the budget.
> ------------------------------------------------------------------------
>
>                 Key: ASTERIXDB-1556
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1556
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>            Assignee: Taewoo Kim
>            Priority: Critical
>              Labels: soon
>         Attachments: 2wayjoin.pdf, 2wayjoin.rtf, 2wayjoinplan.rtf, 3wayjoin.pdf, 3wayjoin.rtf, 3wayjoinplan.rtf
>
>
> When we enable prefix-based fuzzy-join and apply a multi-way fuzzy-join (> 2), the system generates an out-of-memory exception.
> Since a fuzzy-join is written in 30-40 lines of AQL code and this AQL is translated into a massive number of operators (more than 200 operators in the plan for a 3-way fuzzy join), it could generate an out-of-memory exception.
> /// Update: as the discussion went on, we found that the hash table in the external hash group-by doesn't conform to the frame limit. So, an out-of-memory exception happens during the execution of an external hash group-by operator.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
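
The worst-case estimate in the comment above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not AsterixDB code: the function names and parameters are hypothetical, the 40 bytes per hash table entry (8 + 32) come from the formula in the comment, and the cap of 2^(8 * X * fields) distinct keys for multi-field tuples is an assumption, since the comment gives the formula only for a single field.

```python
# Sketch of the hash table size estimate from the comment (not AsterixDB code).
ENTRY_BYTES = 8 + 32  # per-entry hash table cost, from the formula in the comment


def tuple_bytes(payload_bytes, num_fields=1, field_offset_bytes=4):
    """Tuple size: 4-byte tuple offset + field offsets + (payload + 1-byte tag) per field."""
    return 4 + field_offset_bytes + num_fields * (payload_bytes + 1)


def hash_table_bytes(data_table_bytes, payload_bytes, num_fields=1, field_offset_bytes=4):
    """Worst-case hash table size for a data table of the given size."""
    # Upper bound from how many tuples fit in the data table.
    max_tuples = data_table_bytes // tuple_bytes(payload_bytes, num_fields, field_offset_bytes)
    # Upper bound from how many distinct keys the payload can encode
    # (assumed to be 2^(8 * X * fields) for multi-field tuples).
    max_distinct = 2 ** (8 * payload_bytes * num_fields)
    return min(max_tuples, max_distinct) * ENTRY_BYTES


def hash_table_ratio(data_table_bytes, payload_bytes, num_fields=1, field_offset_bytes=4):
    """Share of the combined budget (data table + hash table) taken by the hash table."""
    ht = hash_table_bytes(data_table_bytes, payload_bytes, num_fields, field_offset_bytes)
    return ht / (data_table_bytes + ht)


if __name__ == "__main__":
    budget = 32 * 1024 * 1024  # 32MB data table, as in the comment
    for x in range(1, 6):
        print(x, "byte payload:", hash_table_bytes(budget, x), "bytes")
    # Field offset sizes (4, 8, 16) are the ones listed in the comment.
    for fields, offsets in ((1, 4), (2, 8), (3, 16)):
        print(fields, "field(s):", hash_table_ratio(budget, 4, fields, offsets))
```

When the tuple count is the limiting term, the ratio reduces to 40 / (40 + tuple size), which is why it does not depend on the budget.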