[ https://issues.apache.org/jira/browse/ASTERIXDB-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15459716#comment-15459716 ]
Taewoo Kim commented on ASTERIXDB-1628: --------------------------------------- After having a discussion with [~che...@gmail.com] and [~javierjia], we have concluded that the above code could be revised as follows: {code} private int getNumOfPartitions(int nubmerOfFramesForDataAndHashTable, int frameLimit) { if (frameLimit >= nubmerOfFramesForDataAndHashTable * FUDGE_FACTOR) { return 1; // all in memory, we will create a big partition } int numberOfPartitions = (int) (Math .ceil((nubmerOfFramesForDataAndHashTable * FUDGE_FACTOR - frameLimit) / (frameLimit - 1))); // Actually, in this case, this is not a in-memory hash (#frames required > #frameLimit) // so we guarantee that the number of partition is at least two. numberOfPartitions = Math.max(2, numberOfPartitions); if (numberOfPartitions > frameLimit) { numberOfPartitions = (int) Math.ceil(Math.sqrt(nubmerOfFramesForDataAndHashTable * FUDGE_FACTOR)); return Math.max(2, Math.min(numberOfPartitions, frameLimit)); } return numberOfPartitions; } {code} > The number of partitions in External Hash-Groupby is calculated improperly > for smaller data size. > ------------------------------------------------------------------------------------------------- > > Key: ASTERIXDB-1628 > URL: https://issues.apache.org/jira/browse/ASTERIXDB-1628 > Project: Apache AsterixDB > Issue Type: Bug > Reporter: Taewoo Kim > Assignee: Taewoo Kim > Labels: soon > > If the number of frames required for a data (e.g., external file), say A, is > slightly larger than the number of available frames (= memory budget), say B, > then the number of partitions may be calculated as 1 and it will cause the > infinite cycles during the merge phase. > If the number of partition is 1, the current code assumes that there is no > spilling due to the out of memory budget and the output of the build phase is > directly generated as the final output. > But, if A > B, then a spill would happen and once a partition is spilled to > the disk, it can't be generated as the final output. So, the merge process > goes to the next round that just creates only one partition again and tries > to generate some as final output. But, it can't. Thus, an infinite cycle > begins. > The resolution is that if A > B, we should not set the number of partition as > one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)