[ https://issues.apache.org/jira/browse/FLINK-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616423#comment-14616423 ]

ASF GitHub Bot commented on FLINK-2240:
---------------------------------------

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/888#issuecomment-119136367
  
    This is a very nice idea, thank you for the contribution! The numbers look
    quite encouraging.

    I need to look into this carefully, as it touches a very sensitive part of
    the system. It will probably take me a bit of time.
    
    Here are some initial comments:
    
      - The integration tests seem to be failing; this change apparently
        triggers a stack overflow at some point. Have a look at the logs of the
        Travis CI build.
    
      - Can we add a flag to the hash table to enable/disable the bloom
        filters? That would make future comparisons easier.
    
      - Could you include a standalone mini benchmark, similar to the one you
        ran to produce the numbers you posted here? A simple standalone Java
        executable that creates the hash table and feeds some generated records
        through it (with bloom filters activated and deactivated). It would not
        start a full Flink cluster, but only test the HashJoin in isolation.
        We like to include some of those mini benchmarks for
        performance-critical parts, and re-run them once in a while to
        determine how the performance behaves at that point.
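
A rough sketch of what such a standalone mini benchmark could look like is
given here. It does not use the real Flink hash table: the `JoinRun`
interface, the `toyJoin` workload and all other names are hypothetical
stand-ins, and the "filter" in the toy is a single-hash bit set rather than a
real bloom filter. Only the shape is meant to carry over: generated records, a
flag that toggles the filter, a few warm-up rounds, and the same join result in
both modes.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical benchmark skeleton; a real one would wrap the actual hash table. */
final class HashJoinMiniBenchmarkSketch {

    /** Stand-in for the join under test (assumption, not a Flink interface). */
    interface JoinRun {
        long run(boolean bloomFiltersEnabled, int numBuildRecords, int numProbeRecords);
    }

    /** Keeps results alive so the JIT cannot discard the timed work. */
    static volatile long sink;

    /** Toy workload: even keys on the build side, consecutive keys on the probe side. */
    static long toyJoin(boolean bloomFiltersEnabled, int numBuildRecords, int numProbeRecords) {
        final int mask = (1 << 16) - 1;
        Set<Integer> buildSide = new HashSet<>();
        BitSet filter = new BitSet(mask + 1);
        for (int i = 0; i < numBuildRecords; i++) {
            int key = 2 * i;
            buildSide.add(key);
            filter.set(key & mask); // single-hash pre-filter, no false negatives
        }
        long matches = 0;
        for (int key = 0; key < numProbeRecords; key++) {
            if (bloomFiltersEnabled && !filter.get(key & mask)) {
                continue; // definitely no match, skip the expensive lookup
            }
            if (buildSide.contains(key)) {
                matches++;
            }
        }
        return matches;
    }

    /** Runs some warm-up rounds, then returns the best wall-clock time in nanoseconds. */
    static long bestTimeNanos(JoinRun join, boolean bloom, int build, int probe,
                              int warmups, int rounds) {
        for (int i = 0; i < warmups; i++) {
            sink += join.run(bloom, build, probe);
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            sink += join.run(bloom, build, probe);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        for (boolean bloom : new boolean[] {false, true}) {
            long nanos = bestTimeNanos(HashJoinMiniBenchmarkSketch::toyJoin,
                    bloom, 100_000, 1_000_000, 3, 5);
            System.out.println("bloom filters " + (bloom ? "on" : "off")
                    + ": " + nanos / 1_000_000 + " ms");
        }
    }
}
```

Timings from a hand-rolled loop like this are only indicative; a harness such
as JMH would give more trustworthy numbers.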


> Use BloomFilter to minimize probe side records which are spilled to disk in 
> Hybrid-Hash-Join
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-2240
>                 URL: https://issues.apache.org/jira/browse/FLINK-2240
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chengxiang Li
>            Assignee: Chengxiang Li
>            Priority: Minor
>
> In Hybrid-Hash-Join, when the small table does not fit into memory, part of
> the small table data is spilled to disk, and the corresponding partitions of
> the big table data are spilled to disk in the probe phase as well. If we build
> a BloomFilter while spilling the small table to disk during the build phase,
> and use it to filter the big table records that would otherwise be spilled to
> disk, this may greatly reduce the size of the spilled big table files and
> save the disk I/O cost of writing and later re-reading them.
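
The idea in the description can be illustrated with a small, self-contained
sketch. This is not the code from the pull request: the class
`SpillBloomFilterSketch`, its methods, the hash mixing and the sizes used are
all assumptions chosen for illustration. Build-side keys that go to a spilled
partition are added to the filter; a probe-side key is only spilled if the
filter says it might have a match.

```java
import java.util.BitSet;

/** Toy Bloom filter over int join keys. Hypothetical sketch, not a Flink class. */
final class SpillBloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SpillBloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Derives the bit index for hash function number `seed` by mixing key and seed. */
    private int index(int key, int seed) {
        int h = key + seed * 0x9E3779B9;
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return Math.floorMod(h, numBits);
    }

    /** Build phase: called for every build-side key written to a spilled partition. */
    void add(int key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    /** Probe phase: false means the key is definitely absent from the spilled build data. */
    boolean mightContain(int key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    /** Counts the probe keys 0..numProbeKeys-1 that would still be spilled after filtering. */
    static int countProbeKeysToSpill(SpillBloomFilterSketch filter, int numProbeKeys) {
        int spilled = 0;
        for (int key = 0; key < numProbeKeys; key++) {
            if (filter.mightContain(key)) {
                spilled++;
            }
        }
        return spilled;
    }

    public static void main(String[] args) {
        SpillBloomFilterSketch filter = new SpillBloomFilterSketch(1 << 14, 3);
        for (int key = 0; key < 1_000; key++) {
            filter.add(key); // spilled build-side keys
        }
        int spilled = countProbeKeysToSpill(filter, 10_000);
        System.out.println("probe keys spilled with filter: " + spilled + " of 10000");
    }
}
```

Because a Bloom filter has no false negatives, every probe key with a real
match is still spilled and joined later; only probe keys without a match can be
dropped early, and a small fraction of those slip through as false positives.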



