[ https://issues.apache.org/jira/browse/FLINK-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661203#comment-14661203 ]
ASF GitHub Bot commented on FLINK-2240:
---------------------------------------

Github user ChengXiangLi commented on the pull request:

    https://github.com/apache/flink/pull/888#issuecomment-128563750

    Thanks for the review, @StephanEwen. I'm very interested in this project, and I would like to contribute more.
    @vasia, I think Stephan has already answered the question: the most important reason is that I want to reuse the memory occupied by the hash table buckets. Besides, since this is a performance-sensitive issue, I tried to make this bloom filter as simple and efficient as I can. For example, the hash code of the join key is already generated and stored in the hybrid hash join, so I just reuse that hash code instead of generating it again from the join key value inside the bloom filter.


> Use BloomFilter to minimize probe side records which are spilled to disk in Hybrid-Hash-Join
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-2240
>                 URL: https://issues.apache.org/jira/browse/FLINK-2240
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chengxiang Li
>            Assignee: Chengxiang Li
>            Priority: Minor
>             Fix For: 0.10
>
>
> In Hybrid-Hash-Join, when the small table does not fit into memory, part of the
> small table data is spilled to disk, and the corresponding partition of the
> big table data is spilled to disk in the probe phase as well. If we build a
> BloomFilter while spilling the small table to disk during the build phase, and use it to
> filter the big table records that would otherwise be spilled to disk, this may
> greatly reduce the size of the spilled big table file and save the disk IO cost
> of writing and later reading it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
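
For illustration, the idea discussed above (record the hash codes of build-side records as they are spilled, then test probe-side records against the filter before spilling them) might look roughly like the sketch below. This is a minimal sketch and not the code from the pull request: the class name `HashCodeBloomFilter`, its methods, and the way bit positions are derived from a single 32-bit hash code are all assumptions made for the example. It also uses a `java.util.BitSet` for simplicity, whereas the comment above describes reusing the memory occupied by the hash table buckets.

```java
import java.util.BitSet;

/**
 * Hypothetical Bloom filter that accepts precomputed 32-bit hash codes,
 * so the join key does not have to be hashed a second time.
 */
final class HashCodeBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashFunctions;

    HashCodeBloomFilter(int numBits, int numHashFunctions) {
        if (numBits <= 0 || numHashFunctions <= 0) {
            throw new IllegalArgumentException("numBits and numHashFunctions must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashFunctions = numHashFunctions;
    }

    /** Build phase: record the hash code of a build-side record that is being spilled. */
    void addHash(int hash32) {
        for (int i = 0; i < numHashFunctions; i++) {
            bits.set(position(hash32, i));
        }
    }

    /**
     * Probe phase: false means no spilled build-side record has this hash code,
     * so the probe-side record cannot match and need not be spilled.
     * True means "possibly present" (false positives are possible).
     */
    boolean mightContain(int hash32) {
        for (int i = 0; i < numHashFunctions; i++) {
            if (!bits.get(position(hash32, i))) {
                return false;
            }
        }
        return true;
    }

    /** Derives the i-th bit position from one hash code (an assumed mixing scheme). */
    private int position(int hash32, int i) {
        // Second hash derived from the first; forced odd so the positions differ.
        int h2 = Integer.rotateLeft(hash32 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(hash32 + i * h2, numBits);
    }
}
```

In this sketch, the build phase would call `addHash` with the already stored hash code of each record written to a spilled partition, and the probe phase would call `mightContain` before writing a probe-side record to the corresponding spill file. A Bloom filter never reports a false negative, so no matching record is lost; a false positive only means a record is spilled that later turns out to have no match.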