[ 
https://issues.apache.org/jira/browse/ASTERIXDB-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15459716#comment-15459716
 ] 

Taewoo Kim commented on ASTERIXDB-1628:
---------------------------------------

After having a discussion with [~che...@gmail.com] and [~javierjia], we have 
concluded that the above code could be revised as follows:

{code}
    private int getNumOfPartitions(int nubmerOfFramesForDataAndHashTable, int 
frameLimit) {
        if (frameLimit >= nubmerOfFramesForDataAndHashTable * FUDGE_FACTOR) {
            return 1; // all in memory, we will create a big partition
        }
        int numberOfPartitions = (int) (Math
                .ceil((nubmerOfFramesForDataAndHashTable * FUDGE_FACTOR - 
frameLimit) / (frameLimit - 1)));
        // Actually, in this case, this is not a in-memory hash (#frames 
required > #frameLimit)
        // so we guarantee that the number of partition is at least two.
        numberOfPartitions = Math.max(2, numberOfPartitions);

        if (numberOfPartitions > frameLimit) {
            numberOfPartitions = (int) 
Math.ceil(Math.sqrt(nubmerOfFramesForDataAndHashTable * FUDGE_FACTOR));
            return Math.max(2, Math.min(numberOfPartitions, frameLimit));
        }
        return numberOfPartitions;
    }
{code}

> The number of partitions in External Hash-Groupby is calculated improperly 
> for smaller data size.
> -------------------------------------------------------------------------------------------------
>
>                 Key: ASTERIXDB-1628
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1628
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>            Assignee: Taewoo Kim
>              Labels: soon
>
> If the number of frames required for a data (e.g., external file), say A,  is 
> slightly larger than the number of available frames (= memory budget), say B, 
> then the number of partitions may be calculated as 1 and it will cause the 
> infinite cycles during the merge phase.
> If the number of partition is 1, the current code assumes that there is no 
> spilling due to the out of memory budget and the output of the build phase is 
> directly generated as the final output. 
> But, if A > B, then a spill would happen and once a partition is spilled to 
> the disk, it can't be generated as the final output. So, the merge process 
> goes to the next round that just creates only one partition again and tries 
> to generate some as final output. But, it can't. Thus, an infinite cycle 
> begins.
> The resolution is that if A > B, we should not set the number of partition as 
> one.   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to