[
https://issues.apache.org/jira/browse/ASTERIXDB-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408840#comment-15408840
]
Taewoo Kim edited comment on ASTERIXDB-1556 at 8/5/16 4:16 AM:
---------------------------------------------------------------
My analysis so far on external-groupby:
The hash table consists of header frames and content frames (the content
frames store the tuple pointers to the real tuples). Both header and content
frames can be allocated incrementally, though the maximum number of header
frames is limited to (initial entry size in bytes * 2 / frame size). The
number of content frames can grow indefinitely.
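As a worked example of the header-frame cap, here is a minimal Python sketch.
The function name, the choice to round up, and the sample numbers are
assumptions made for illustration; they are not taken from the AsterixDB code.

```python
def max_header_frames(initial_entry_bytes, frame_size):
    """Cap on header frames: initial entry size in bytes * 2 / frame size.

    Rounding up is an assumption; the analysis only states the ratio.
    """
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    # Ceiling division using integers only.
    return -(-(initial_entry_bytes * 2) // frame_size)


# Hypothetical numbers: 1 MiB of initial entries, 32 KiB frames.
print(max_header_frames(1024 * 1024, 32 * 1024))  # 64
```

With these numbers the header part of the hash table stops growing at 64
frames, while the content frames have no such cap.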
The data table is bounded by a limit that is calculated from the user
configuration setting. So, once an insertion into the data table fails, a
partition is spilled to disk. In this case, we currently reset the
corresponding entries in the hash table.
So, we need to set up a policy regarding the proportion between the hash table
and the data table. And, when allocating a frame for either the hash table or
the data table fails, a spill should happen.
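The "spill when either allocation fails" rule could look roughly like the
Python sketch below. FrameBudget, allocate_or_spill and
spill_largest_partition are hypothetical names, the budget is counted in
whole frames, and the spill is a stand-in that only frees frames; resetting
the hash table entries of the spilled partition is not modeled.

```python
class FrameBudget:
    """Hypothetical budget, counted in frames, shared by both tables."""

    def __init__(self, total_frames):
        self.total = total_frames
        self.used = {"hash": 0, "data": 0}

    def try_allocate(self, owner):
        # Refuse the request once the combined usage reaches the budget.
        if sum(self.used.values()) >= self.total:
            return False
        self.used[owner] += 1
        return True

    def release(self, owner, frames):
        self.used[owner] -= frames


def allocate_or_spill(budget, owner, spill):
    """Allocate one frame for `owner`; on failure, spill once and retry."""
    if budget.try_allocate(owner):
        return True
    spill(budget)
    return budget.try_allocate(owner)


def spill_largest_partition(budget):
    # Stand-in for writing a partition to disk: frees up to two data frames.
    budget.release("data", min(2, budget.used["data"]))


budget = FrameBudget(total_frames=4)
for _ in range(4):
    allocate_or_spill(budget, "data", spill_largest_partition)

# The budget is full, so a hash table allocation now triggers a spill.
ok = allocate_or_spill(budget, "hash", spill_largest_partition)
print(ok, budget.used)  # True {'hash': 1, 'data': 2}
```

The point of the sketch is only that the hash table and the data table go
through the same failure path, so neither can grow without eventually forcing
a spill.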
{quote}
(3) We need to come up with a strategy. Possible choices are: 1) Data and
hash-table dynamically share the entire budget. 2) have a global budget, and
let DATA and HASH-TABLE have pre-defined proportion (e.g., data -80%, hash
table - 20%). Do not let each overgrow beyond the proportion. 3) have a two
separate budget and let DATA and HASH-TABLE stick to it.
{quote}
So, what percentage should we allocate to the hash table and what percentage
to the data table? We need this at least initially, since we have to decide the
number of partitions in the data table and the hash table.
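For choice 2) in the quote, a fixed-proportion split of a global budget can be
computed as in this Python sketch. The function name, the one-frame minimum
for the hash table, and the use of an integer percentage are assumptions; the
20% figure is just the example from the quote, not a recommendation.

```python
def split_budget(total_frames, hash_percent=20):
    """Return (hash_frames, data_frames) for a fixed-proportion split."""
    if total_frames < 2:
        raise ValueError("need at least one frame for each table")
    if not 0 < hash_percent < 100:
        raise ValueError("hash_percent must be between 1 and 99")
    # Integer arithmetic avoids floating-point rounding surprises.
    hash_frames = max(1, total_frames * hash_percent // 100)
    data_frames = total_frames - hash_frames
    return hash_frames, data_frames


print(split_budget(100))  # (20, 80)
```

Under this policy, neither table may allocate beyond its own share, so an
allocation failure on either side is what triggers a spill.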
> Prefix-based multi-way Fuzzy-join generates an exception.
> ---------------------------------------------------------
>
> Key: ASTERIXDB-1556
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-1556
> Project: Apache AsterixDB
> Issue Type: Bug
> Reporter: Taewoo Kim
> Assignee: Taewoo Kim
> Attachments: 2wayjoin.pdf, 2wayjoin.rtf, 2wayjoinplan.rtf,
> 3wayjoin.pdf, 3wayjoin.rtf, 3wayjoinplan.rtf
>
>
> When we enable prefix-based fuzzy-join and apply a multi-way fuzzy-join (more
> than 2 ways), the system generates an out-of-memory exception.
> Since a fuzzy-join is written in 30-40 lines of AQL code and this AQL is
> translated into a massive number of operators (more than 200 operators in the
> plan for a 3-way fuzzy join), it can generate an out-of-memory exception.