Github user chiwanpark commented on the pull request:
https://github.com/apache/flink/pull/1220#issuecomment-171852976
@danielblazevski, I think we can use the `crossWithTiny` and `crossWithHuge`
methods to reduce the shuffle cost. The best approach would be to count the
elements of both datasets and choose the cross method automatically, but for
now we can simply add a parameter that decides it, like the following:
```scala
import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint

class KNN {
  // ...
  def setSizeHint(sizeHint: CrossHint): KNN = {
    parameters.add(SizeHint, sizeHint)
    this
  }
  // ...
}

object KNN {
  // ...
  case object SizeHint extends Parameter[CrossHint] {
    val defaultValue: Option[CrossHint] = None
  }
  // ...
}
```
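To see how the `Parameter`/`ParameterMap` pattern behaves with an optional default, here is a minimal, self-contained sketch that runs without Flink. The `SimpleParameterMap` class and the `CrossHint` stub below are hypothetical stand-ins for illustration only, not Flink's actual classes:

```scala
// Minimal stand-in for FlinkML's Parameter trait (simplified, hypothetical).
trait Parameter[T] {
  val defaultValue: Option[T]
}

// Stub of the CrossHint values, for illustration only.
sealed trait CrossHint
object CrossHint {
  case object FIRST_IS_SMALL extends CrossHint
  case object SECOND_IS_SMALL extends CrossHint
}

// Parameter object with no default: reading an unset hint yields None.
case object SizeHint extends Parameter[CrossHint] {
  val defaultValue: Option[CrossHint] = None
}

// Simplified parameter map: falls back to the parameter's default when unset.
class SimpleParameterMap {
  private var store = Map.empty[Parameter[_], Any]

  def add[T](p: Parameter[T], v: T): SimpleParameterMap = {
    store += (p -> v)
    this
  }

  def get[T](p: Parameter[T]): Option[T] =
    store.get(p).map(_.asInstanceOf[T]).orElse(p.defaultValue)
}

val params = new SimpleParameterMap
assert(params.get(SizeHint).isEmpty) // unset: falls back to the None default
params.add(SizeHint, CrossHint.FIRST_IS_SMALL)
assert(params.get(SizeHint).contains(CrossHint.FIRST_IS_SMALL))
```

Because `defaultValue` is `None`, the predictor can distinguish "no hint set" from an explicit hint and fall back to a plain `cross`.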
Then we can use the parameter in the `predictValues` method:
```scala
val crossTuned = sizeHint match {
  case Some(hint) if hint == CrossHint.FIRST_IS_SMALL =>
    trainingSet.crossWithHuge(inputSplit)
  case Some(hint) if hint == CrossHint.SECOND_IS_SMALL =>
    trainingSet.crossWithTiny(inputSplit)
  case _ => trainingSet.cross(inputSplit)
}

val crossed = crossTuned.mapPartition {
  // ...
}
// ...
```
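The dispatch logic itself can be checked in isolation. Here is a hedged sketch with stub `CrossHint` values and no Flink dependency (the `chooseCross` helper is hypothetical), showing which cross variant each hint selects when the training set is the first operand:

```scala
// Stub of the CrossHint values, for illustration only (the real enum is
// org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint).
sealed trait CrossHint
case object FIRST_IS_SMALL extends CrossHint
case object SECOND_IS_SMALL extends CrossHint

// Mirrors the match in predictValues: FIRST_IS_SMALL means the first operand
// (the training set) is small, so the huge side is the input split.
def chooseCross(sizeHint: Option[CrossHint]): String = sizeHint match {
  case Some(FIRST_IS_SMALL)  => "crossWithHuge" // training set is the small side
  case Some(SECOND_IS_SMALL) => "crossWithTiny" // input split is the small side
  case _                     => "cross"         // no hint: default strategy
}

assert(chooseCross(Some(FIRST_IS_SMALL)) == "crossWithHuge")
assert(chooseCross(Some(SECOND_IS_SMALL)) == "crossWithTiny")
assert(chooseCross(None) == "cross")
```

Note the asymmetry: `crossWithHuge` hints that the *second* operand is huge, which is why it pairs with `FIRST_IS_SMALL`, and vice versa for `crossWithTiny`.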
We still have to decide on the name of the added parameter (`SizeHint`) and add
documentation explaining which dataset is first (training) and which is second
(testing).
By the way, there is no documentation for k-NN yet. Could you add documentation
to the `docs/ml` directory?