[
https://issues.apache.org/jira/browse/ASTERIXDB-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403898#comment-15403898
]
Taewoo Kim edited comment on ASTERIXDB-1556 at 8/2/16 5:01 PM:
---------------------------------------------------------------
It seems that the compiler doesn't set the hash table size for
external-group-by, in-memory-hash-join, and hash-group-by during
APIFramework.compileQuery(). Only the following settings are applied.
{code:title=APIFramework.java|borderStyle=solid}
AsterixCompilerProperties compilerProperties =
        AsterixAppContextInfo.getInstance().getCompilerProperties();
int frameSize = compilerProperties.getFrameSize();
int sortFrameLimit = (int) (compilerProperties.getSortMemorySize() / frameSize);
int groupFrameLimit = (int) (compilerProperties.getGroupMemorySize() / frameSize);
int joinFrameLimit = (int) (compilerProperties.getJoinMemorySize() / frameSize);
OptimizationConfUtil.getPhysicalOptimizationConfig().setFrameSize(frameSize);
OptimizationConfUtil.getPhysicalOptimizationConfig().setMaxFramesExternalSort(sortFrameLimit);
OptimizationConfUtil.getPhysicalOptimizationConfig().setMaxFramesExternalGroupBy(groupFrameLimit);
OptimizationConfUtil.getPhysicalOptimizationConfig().setMaxFramesForJoin(joinFrameLimit);
{code}
Here, only the frame limits are set. The hash table size, however, always stays
at 10,485,767, the default assigned in the PhysicalOptimizationConfig()
constructor:
{code:title=PhysicalOptimizationConfig.java|borderStyle=solid}
public PhysicalOptimizationConfig() {
    int frameSize = 32768;
    setInt(FRAMESIZE, frameSize);
    setInt(MAX_FRAMES_EXTERNAL_SORT, (int) (((long) 32 * MB) / frameSize));
    setInt(MAX_FRAMES_EXTERNAL_GROUP_BY, (int) (((long) 32 * MB) / frameSize));
    // use http://www.rsok.com/~jrm/printprimes.html to find prime numbers
    setInt(DEFAULT_HASH_GROUP_TABLE_SIZE, 10485767);
    setInt(DEFAULT_EXTERNAL_GROUP_TABLE_SIZE, 10485767);
    setInt(DEFAULT_IN_MEM_HASH_JOIN_TABLE_SIZE, 10485767);
}
{code}
Although there are three methods that could change the default table sizes,
they have no callers:
{code:title=PhysicalOptimizationConfig.java|borderStyle=solid}
public void setExternalGroupByTableSize(int tableSize) {
    setInt(DEFAULT_EXTERNAL_GROUP_TABLE_SIZE, tableSize);
}

public void setInMemHashJoinTableSize(int tableSize) {
    setInt(DEFAULT_IN_MEM_HASH_JOIN_TABLE_SIZE, tableSize);
}

public void setHashGroupByTableSize(int tableSize) {
    setInt(DEFAULT_HASH_GROUP_TABLE_SIZE, tableSize);
}
{code}
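As a rough illustration (a standalone sketch, not existing AsterixDB code; the class and method names are made up), the compiler could derive a table size from the configured memory budget instead of leaving the hard-coded prime, given that each slot later costs INT_SIZE * 2 bytes of header:

```java
// Hypothetical sketch: pick a hash table size from a memory budget instead of
// the fixed default. Names and sizing policy here are assumptions.
public class TableSizeSketch {
    static final int INT_SIZE = 4; // bytes per int, as in the Hyracks code

    // Each table slot costs INT_SIZE * 2 bytes of header space in
    // SerializableHashTable, so a budget of B bytes supports B / 8 slots.
    static int tableSizeForBudget(long headerBudgetBytes) {
        return (int) (headerBudgetBytes / (INT_SIZE * 2));
    }

    public static void main(String[] args) {
        long budget = 32L * 1024 * 1024; // e.g. a 32 MB group-by budget
        System.out.println(tableSizeForBudget(budget)); // 4194304 slots
    }
}
```

With such a policy, a 32 MB budget would cap the table at about 4.2 million slots rather than the 10.5 million default.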
I checked the hybrid-hash-join path, and there the callers that create a hash
table do adjust its size properly based on the number of tuples (i.e., the
file size). For group-by, however, there is no such adjustment. So
HashSpillableTableFactory.buildSpillableTable() always allocates 8 bytes
(INT_SIZE * 2) per entry of the default table size (10,485,767), which is
8 * 10,485,767 bytes, about 80 MB of header space. For example, with 8
partitions the engine always assigns 80 * 8 = 640 MB for each group-by
operator.
{code:title=SerializableHashTable.java|borderStyle=solid}
public SerializableHashTable(int tableSize, final IHyracksFrameMgrContext ctx)
        throws HyracksDataException {
    this.ctx = ctx;
    int frameSize = ctx.getInitialFrameSize();
    int residual = tableSize * INT_SIZE * 2 % frameSize == 0 ? 0 : 1;
    int headerSize = tableSize * INT_SIZE * 2 / frameSize + residual;
    headers = new IntSerDeBuffer[headerSize];
    IntSerDeBuffer frame = new IntSerDeBuffer(ctx.allocateFrame().array());
    contents.add(frame);
    frameCurrentIndex.add(0);
    frameCapacity = frame.capacity();
}
{code}
> Prefix-based multi-way Fuzzy-join generates an exception.
> ---------------------------------------------------------
>
> Key: ASTERIXDB-1556
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-1556
> Project: Apache AsterixDB
> Issue Type: Bug
> Reporter: Taewoo Kim
> Assignee: Taewoo Kim
> Attachments: 2wayjoin.pdf, 2wayjoin.rtf, 2wayjoinplan.rtf,
> 3wayjoin.pdf, 3wayjoin.rtf, 3wayjoinplan.rtf
>
>
> When we enable prefix-based fuzzy-join and apply a multi-way fuzzy-join (more
> than 2-way), the system generates an out-of-memory exception.
> Since a fuzzy-join is created from 30-40 lines of AQL and this AQL is
> translated into a massive number of operators (more than 200 operators in the
> plan for a 3-way fuzzy join), it can generate an out-of-memory exception.