Github user chiwanpark commented on the pull request:
https://github.com/apache/flink/pull/1220#issuecomment-145353515
I just reviewed quickly. I have some comments for this pull request.
1. This pull request cannot pass checking Scala code style. The Scala code
should not exceed 100 characters in each line. I also found some wrong
indentations such as line 200-211 in `KNN.scala` file. Please reformat your
code.
2. The meaning of `BruteOrQuad` is ambiguous. From my understand, if
`BruteOrQuad` is true, we use quadtree implementation, right? So the other name
such as `useQuadTree` would be better.
3. Please remove blank line after Scala doc comments.
4. Your implementation assumes that the type of given `Vector` is
`DenseVector`. But basically, there is no guarantee of the type of `Vector`. We
should generalize the implementation.
5. We should be able to configure the use of quadtree implementation.
Please add configure parameter into `KNN` object.
6. Currently, `predictDataSet` is too big and complex because of large code
in `mapPartition` operation. How about split the operation into `mapPartition`
for raw and quadtree?
7. Please avoid calculation of `BigDecimal`. Maybe we can check whether we
use quadtree or not without `BigDecimal` like following:
* `(4 ^ dim) * Ntest * log(Ntrain) < Ntest * Ntrain` is same as `dim +
log_4(log(Ntrain)) < log_4(Ntrain)`
This is great start! If you have some question about my comments, please
feel free to reply.
P.S.: How about create training quadtree before cross operation? Is there
any intention to create quadtree after cross operation?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---