[
https://issues.apache.org/jira/browse/FLINK-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614780#comment-14614780
]
ASF GitHub Bot commented on FLINK-2240:
---------------------------------------
Github user ChengXiangLi commented on the pull request:
https://github.com/apache/flink/pull/888#issuecomment-118786914
I did a simple test on a single node; here is the related information:
1 task manager, 1 slot, 1G RAM assigned, probe table 5G, build table 1G,
half of the build table partitions are spilled to disk.
FOF: the percentage of probe-side records filtered out during the probe phase.
FOF | without bloom filter(s) | with bloom filter(s)
------------- | ------------- | -------------------
90% | 210 | 147
50% | 214 | 187
0% | 236 | 252
> Use BloomFilter to minimize probe side records which are spilled to disk in
> Hybrid-Hash-Join
> --------------------------------------------------------------------------------------------
>
> Key: FLINK-2240
> URL: https://issues.apache.org/jira/browse/FLINK-2240
> Project: Flink
> Issue Type: Improvement
> Components: Core
> Reporter: Chengxiang Li
> Assignee: Chengxiang Li
> Priority: Minor
>
> In Hybrid-Hash-Join, when the small table does not fit into memory, part of the
> small table data is spilled to disk, and the counterpart partitions of the big
> table are spilled to disk during the probe phase as well. If we build a
> BloomFilter while spilling the small table to disk during the build phase, and
> use it to filter out the big table records that would otherwise be spilled to
> disk, this may greatly reduce the size of the spilled big table files and save
> the disk I/O cost of writing them and reading them back.
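
To make the idea above concrete, here is a minimal, self-contained Java sketch of the approach: while a build-side (small table) partition is spilled, the hashes of its keys are recorded in a Bloom filter, and during the probe phase a probe record headed for that partition is spilled only if the filter reports a possible match. All class and method names below (SimpleBloomFilter, mightContain, ...) are illustrative and are not Flink's actual internals.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch of the spilling optimization described above, not
 * Flink's actual implementation.
 */
public class BloomFilteredSpillSketch {

    /** A very small Bloom filter over int key hashes. */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int size;
        private final int numHashes;

        SimpleBloomFilter(int size, int numHashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.numHashes = numHashes;
        }

        void add(int keyHash) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(indexFor(keyHash, i));
            }
        }

        boolean mightContain(int keyHash) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(indexFor(keyHash, i))) {
                    return false; // definitely not on the build side
                }
            }
            return true; // possibly on the build side
        }

        private int indexFor(int keyHash, int i) {
            // Derive the i-th hash from the key hash; good enough for a sketch.
            int h = keyHash * (i * 0x9E3779B1 + 1);
            return Math.abs(h % size);
        }
    }

    public static void main(String[] args) {
        // Build phase: keys of a spilled build-side partition go to disk,
        // and their hashes are recorded in that partition's Bloom filter.
        int[] spilledBuildKeys = {1, 5, 9, 42};
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
        for (int key : spilledBuildKeys) {
            filter.add(Integer.hashCode(key));
        }

        // Probe phase: only probe records that might match a spilled build
        // record need to be spilled; the rest are dropped immediately,
        // saving the write and the later re-read.
        int[] probeKeys = {1, 2, 3, 42, 100};
        for (int key : probeKeys) {
            if (filter.mightContain(Integer.hashCode(key))) {
                System.out.println("spill probe record with key " + key);
            } else {
                System.out.println("drop probe record with key " + key);
            }
        }
    }
}
```

A false positive only causes a probe record to be spilled unnecessarily, so join correctness is unaffected; the benefit grows with the share of probe records that have no build-side match, which is exactly the FOF dimension varied in the test above.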