GitHub user fhueske commented on the pull request:
https://github.com/apache/flink/pull/1941#issuecomment-216232502
Thanks for the PR @fpompermaier.
I think the new format is tailored too closely to one kind of query
template (a `BETWEEN` predicate on an integer column). Modifying the queries
that users provide is also a bit risky, IMO.
To make it more general I would propose to:
- Accept query templates with markers, similar to parameter markers in
prepared statements: `SELECT address FROM people WHERE name = ? AND birthday
BETWEEN ? AND ?`.
- Let users explicitly provide bounds. Users should know their data best
and can provide bounds that take skewed distributions into account. Parameter
values can be provided as `Object[]`, one array for each parameter. We can
provide some utility methods to help users generate uniformly distributed
parameter values.
- Let the `InputSplit` carry an index into the parameter value arrays instead
of two bound values, so that each instance can build its query by substituting
the values at that index for the markers.
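
To make the proposal concrete, here is a rough sketch in plain Java with no Flink or JDBC dependency. It reads "one array for each parameter" as an `Object[][]` in which entry `p` holds the values for the `p`-th marker and a split index selects one value from each array. The class name `ParameterValuesSketch` and the methods `uniformRanges` and `valuesForSplit` are made up for illustration and are not part of the PR or of any existing API.

```java
import java.util.Arrays;

// Hypothetical sketch: names and structure are assumptions, not existing Flink API.
final class ParameterValuesSketch {

    /**
     * Hypothetical utility: cuts the inclusive range [min, max] into numSplits
     * contiguous sub-ranges of near-equal size. Returns two arrays, the lower
     * bounds and the (inclusive) upper bounds, i.e. one array per parameter.
     */
    static Object[][] uniformRanges(long min, long max, int numSplits) {
        long total = max - min + 1;
        if (numSplits < 1 || total < numSplits) {
            throw new IllegalArgumentException("need 1 <= numSplits <= range size");
        }
        Object[] lower = new Object[numSplits];
        Object[] upper = new Object[numSplits];
        long base = total / numSplits;
        long remainder = total % numSplits;
        long start = min;
        for (int i = 0; i < numSplits; i++) {
            // The first `remainder` ranges get one extra element.
            long size = base + (i < remainder ? 1 : 0);
            lower[i] = start;
            upper[i] = start + size - 1;
            start += size;
        }
        return new Object[][] {lower, upper};
    }

    /** Selects the values for one split: entry splitIndex of each per-parameter array. */
    static Object[] valuesForSplit(Object[][] perParameterValues, int splitIndex) {
        Object[] row = new Object[perParameterValues.length];
        for (int p = 0; p < row.length; p++) {
            row[p] = perParameterValues[p][splitIndex];
        }
        return row;
    }

    public static void main(String[] args) {
        // Example template with two markers; table and column names are illustrative.
        String template = "SELECT address FROM people WHERE id BETWEEN ? AND ?";
        Object[][] params = uniformRanges(0, 99, 4);
        for (int split = 0; split < 4; split++) {
            // A parallel instance would bind row[p] to marker p + 1,
            // e.g. with java.sql.PreparedStatement#setObject(p + 1, row[p]).
            Object[] row = valuesForSplit(params, split);
            System.out.println(template + " <- " + Arrays.toString(row));
        }
    }
}
```

Users with skewed data could skip `uniformRanges` and pass hand-picked arrays of the same shape; the split would still only need to carry its index.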
What do you think?