[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694338#comment-14694338 ]

ASF GitHub Bot commented on FLINK-1745:
---------------------------------------

Github user kno10 commented on the pull request:

    https://github.com/apache/flink/pull/696#issuecomment-130469648
  
    R-trees are hard to parallelize.
    For distributed and gigabyte-scale data, an approximate approach is preferable, like the one we discuss in this article:
    
    E. Schubert, A. Zimek, H.-P. Kriegel
    Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles
    In Proceedings of the 20th International Conference on Database Systems for Advanced Applications (DASFAA), Hanoi, Vietnam: 19–36, 2015.
    
    We discuss an approach that is easy to parallelize. It needs sorting and a sliding window (or blocks), so it is not strictly MapReduce, but it should be a good match for Flink. The hardest part is getting the different space-filling curves right and efficient. The other components (random projections to reduce dimensionality, an ensemble to improve quality, and list inversions to also build reverse kNN, which then allow accelerating methods such as LOF) are much easier.
    
    The main drawback of most of these kNN-join approaches (including ours) is that they only work with Minkowski norms. There are far more interesting distance functions than those...
    
    We also discuss why the space-filling curves appear to give better results for kNN, while LSH etc. work better for radius joins. LSH is another option, but it cannot guarantee finding k neighbors, and parameter tuning is tricky. So you may want to have a look at this recent ensemble approach instead.
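The sort-plus-sliding-window idea can be sketched roughly as follows. This is a toy Python illustration, not the paper's implementation: the `morton_key` helper, the fixed window size, and the 2-D non-negative integer coordinates are all simplifying assumptions, and the actual approach uses several space-filling curves, random projections, and an ensemble to union candidates.

```python
from heapq import nsmallest
from math import dist

def morton_key(x, y, bits=16):
    """Interleave the bits of two non-negative integer coordinates
    to get a Z-order (Morton) curve position."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return key

def knn_candidates(points, k, window=4):
    """Approximate kNN via a single Z-order curve: sort points by their
    Morton key, then consider only a sliding window of curve neighbors
    as kNN candidates for each point."""
    order = sorted(range(len(points)), key=lambda i: morton_key(*points[i]))
    result = {}
    for pos, i in enumerate(order):
        lo, hi = max(0, pos - window), min(len(order), pos + window + 1)
        cands = [j for j in order[lo:hi] if j != i]
        result[i] = nsmallest(k, cands, key=lambda j: dist(points[i], points[j]))
    return result

pts = [(1, 1), (2, 2), (3, 1), (10, 10), (11, 9), (12, 12)]
print(knn_candidates(pts, k=2))
```

Because the window only scans a constant number of sorted neighbors per point, the sort and the windowed pass both parallelize well; quality then comes from repeating this with multiple shifted or rotated curves and merging the candidate sets.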


> Add exact k-nearest-neighbours algorithm to machine learning library
> --------------------------------------------------------------------
>
>                 Key: FLINK-1745
>                 URL: https://issues.apache.org/jira/browse/FLINK-1745
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>              Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial,
> it is still used as a means to classify data and to do regression. This issue
> focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as
> proposed in [2].
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
