GitHub user StephanEwen commented on the pull request:
https://github.com/apache/flink/pull/888#issuecomment-119136367
This is a very nice idea, thank you for the contribution! The numbers look
quite encouraging.
I need to look into this carefully, as it touches a very sensitive part of
the system. It will probably take me a bit of time.
Here are some initial comments:
- The integration tests seem to be failing; this change apparently
triggers a stack overflow at some point. Have a look at the logs of the Travis
CI build.
- Can we add a flag to the hash table to enable/disable the
Bloom filters? That would make future comparisons easier.
- Could you include a standalone mini benchmark similar to the one you
used to produce the numbers posted here? A simple standalone Java executable that
creates the hash table and feeds some generated records through it (with Bloom
filters activated and deactivated)? It would not start a full Flink cluster,
but only test the HashJoin in isolation.
We like to include some of those mini benchmarks for performance-critical
parts, and re-run them once in a while to see how performance has changed
over time.
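
A minimal sketch of the shape such a standalone benchmark could take is below. It is only an illustration: it does not use Flink's hash table at all, but a plain `HashMap` with a hand-rolled Bloom filter as a stand-in, and the names `BloomFilterJoinBenchmark`, `SimpleBloomFilter`, `runJoin`, `generateKeys`, and the `useBloomFilter` flag are all hypothetical. The real benchmark would construct the actual hash table and toggle the actual flag instead.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical standalone micro-benchmark; a stand-in, not Flink's hash table. */
final class BloomFilterJoinBenchmark {

    /** Minimal Bloom filter over long keys using two hash probes. */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int numBits;

        SimpleBloomFilter(int numBits) {
            this.numBits = numBits;
            this.bits = new BitSet(numBits);
        }

        private int index(long key, long seed) {
            long h = (key ^ seed) * 0x9E3779B97F4A7C15L; // multiplicative mixing
            h ^= (h >>> 32);
            return (int) Math.floorMod(h, (long) numBits);
        }

        void add(long key) {
            bits.set(index(key, 0x1234L));
            bits.set(index(key, 0xABCDEFL));
        }

        boolean mightContain(long key) {
            return bits.get(index(key, 0x1234L)) && bits.get(index(key, 0xABCDEFL));
        }
    }

    /** Joins probe keys against build keys and returns the number of matching pairs. */
    static long runJoin(long[] buildKeys, long[] probeKeys, boolean useBloomFilter) {
        Map<Long, Integer> table = new HashMap<>();
        SimpleBloomFilter filter =
                useBloomFilter ? new SimpleBloomFilter(Math.max(64, buildKeys.length * 16)) : null;
        for (long k : buildKeys) {
            table.merge(k, 1, Integer::sum); // count duplicates on the build side
            if (filter != null) {
                filter.add(k);
            }
        }
        long matches = 0;
        for (long k : probeKeys) {
            if (filter != null && !filter.mightContain(k)) {
                continue; // definitely not in the build side, skip the table lookup
            }
            Integer count = table.get(k);
            if (count != null) {
                matches += count;
            }
        }
        return matches;
    }

    /** Generates n pseudo-random keys in [0, bound) from a fixed seed. */
    static long[] generateKeys(int n, int bound, long seed) {
        Random rnd = new Random(seed);
        long[] keys = new long[n];
        for (int i = 0; i < n; i++) {
            keys[i] = rnd.nextInt(bound);
        }
        return keys;
    }

    public static void main(String[] args) {
        long[] build = generateKeys(100_000, 1_000_000, 42L);
        long[] probe = generateKeys(2_000_000, 10_000_000, 43L);
        for (boolean useBloomFilter : new boolean[] {false, true}) {
            long start = System.nanoTime();
            long matches = runJoin(build, probe, useBloomFilter);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("bloomFilter=" + useBloomFilter
                    + " matches=" + matches + " timeMs=" + millis);
        }
    }
}
```

Both runs must report the same number of matches, since a Bloom filter can produce false positives but never false negatives. A single timed pass like this is sensitive to JIT warm-up, so a real benchmark would probably want warm-up iterations and several measured runs.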