GitHub user chiwanpark commented on the pull request:

    https://github.com/apache/flink/pull/1220#issuecomment-171852976
  
    @danielblazevski, I think we can use the `crossWithTiny` and `crossWithHuge` methods to reduce the shuffle cost. The best approach would be to count the elements in both datasets and choose the cross method accordingly, but for now we can simply add a parameter to decide this, like the following:
    
    ```scala
    import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
    
    class KNN {
      // ...
    
      def setSizeHint(sizeHint: CrossHint): KNN = {
        parameters.add(SizeHint, sizeHint)
        this
      }
    
      // ...
    }
    
    object KNN {
      // ...
    
      case object SizeHint extends Parameter[CrossHint] {
        val defaultValue: Option[CrossHint] = None
      }
    
      // ...
    }
    ```
    
    And we can use the parameter in the `predictValues` method:
    
    ```scala
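    // `sizeHint` is assumed to be the resolved parameter value here, e.g.
    // val sizeHint = resultingParameters.get(SizeHint)  // Option[CrossHint]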
    val crossTuned = sizeHint match {
      case Some(hint) if hint == CrossHint.FIRST_IS_SMALL =>
        trainingSet.crossWithHuge(inputSplit)
      case Some(hint) if hint == CrossHint.SECOND_IS_SMALL =>
        trainingSet.crossWithTiny(inputSplit)
      case _ => trainingSet.cross(inputSplit)
    }
    
    val crossed = crossTuned.mapPartition {
      // ...
    }
    
    // ...
    ```
    
    We have to decide on the name of the added parameter (`SizeHint`) and add documentation explaining which dataset is the first (training) and which is the second (testing).
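
    From the user's point of view, usage could then look roughly like this (just a sketch, assuming the usual FlinkML `apply()`/`fit`/`predict` API and this PR's `setK` setter):
    
    ```scala
    import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
    
    // hint that the testing set (the second input of the cross) is the small one
    val knn = KNN()
      .setK(3)
      .setSizeHint(CrossHint.SECOND_IS_SMALL)
    
    knn.fit(trainingSet)                      // trainingSet: DataSet[Vector]
    val predictions = knn.predict(testingSet)
    ```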
    
    By the way, there is no documentation for k-NN yet. Could you add documentation to the `docs/ml` directory?

