GitHub user ChengXiangLi opened a pull request:
https://github.com/apache/flink/pull/888
[FLINK-2240] Use BloomFilter to filter probe records in Hybrid-Hash-Join
In Hybrid-Hash-Join, when the small table does not fit into memory, part of
the small table data is spilled to disk, and the corresponding partitions of
the big table are spilled to disk during the probe phase as well. If we build a
BloomFilter while spilling the small table to disk during the build phase, and
use it to filter the big table records that would otherwise be spilled to disk,
this may greatly reduce the size of the spilled big table files and save the
disk IO cost of writing them and reading them back later.
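
The idea can be illustrated with a minimal sketch. This is not the code in this
pull request: the class name SpillBloomFilterSketch, its methods, and the
double-hashing scheme are assumptions chosen only for illustration. During the
build phase, the hash of every small table record written to a spilled
partition is added to the filter. During the probe phase, a big table record
that maps to a spilled partition is written to disk only if the filter reports
that its key might be present.

```java
import java.util.BitSet;

/** Hypothetical illustration only; not the class added by this pull request. */
public class SpillBloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SpillBloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Derives the i-th bit position from a single 32-bit key hash (double hashing). */
    private int position(int keyHash, int i) {
        int second = Integer.rotateLeft(keyHash, 16) | 1; // odd, so it is never zero
        return Math.floorMod(keyHash + i * second, numBits);
    }

    /** Build phase: call for each build-side record that goes to a spilled partition. */
    public void add(int keyHash) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(keyHash, i));
        }
    }

    /**
     * Probe phase: false means the key was definitely never added, so the probe
     * record cannot match anything in the spilled partition. True means the key
     * may be present (false positives are possible), so the record must be kept.
     */
    public boolean mightContain(int keyHash) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(keyHash, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Because a Bloom filter never produces false negatives, skipping a probe record
for which mightContain returns false does not lose any join match. This is safe
only where an unmatched probe record produces no output, as in an inner join.
False positives only mean that some non-matching records are still spilled.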
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ChengXiangLi/flink hj-bloomfilter
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/888.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #888
----
commit 78c59d6ee52a00fd4964001cbce81437c38d86cb
Author: chengxiang li <[email protected]>
Date: 2015-07-03T15:53:47Z
add bloom filter for spilled partitions in hashtable.
commit cacaa9a15a5330c6130306841ef73958490cf69d
Author: chengxiang li <[email protected]>
Date: 2015-07-06T07:15:39Z
fix previous get buckets method
commit 6bbbb27d4935da72ae44ec404f884a74de7bbc4c
Author: chengxiang li <[email protected]>
Date: 2015-07-06T08:07:30Z
fix some format issues.
commit b7fee8d26445db4bba7928bfff8a9dd5ada8cd03
Author: chengxiang li <[email protected]>
Date: 2015-07-06T08:08:52Z
Merge remote-tracking branch 'upstream/master' into hj-bloomfilter
commit d352c090b9c06baf701235809f7dfd0b4e9b87af
Author: Li <[email protected]>
Date: 2015-07-06T08:44:13Z
add tab as indent of blank line.
commit edacfb3ae17beeb84630d73f8452629d3e19b66b
Author: Li <[email protected]>
Date: 2015-07-06T08:48:56Z
fix tab indent for blank lines.
----