[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396306#comment-14396306
 ] 

Sen Fang commented on SPARK-2336:
---------------------------------

I'm tentatively going to give the hybrid spilltree implementation described in 
OP's post a try. Specifically I'm going to follow the implementation described 
in: http://dx.doi.org/10.1109/WACV.2007.18

According to the paper, the algorithm scales well in terms of number of 
observations, bounded by the available memory across clusters (billions in 
paper's example). The original hybrid spilltree paper claims the algorithm 
scales much better than other alternatives (LSH, metric tree) when it comes to 
number of features (up to hundreds). Furthermore, random projection is often 
used to reduce the dimensionality of feature space. Random projection is out of 
scope in my implementation and should probably implemented as a separate 
dimension reduction technique. (in the paper 4000 features are projected down 
to 100)

The average runtime for training and prediction is generally O(log n). The 
training runtime will suffer if we turn up the buffer size. The search is only 
approximate on overlapping node and the higher the buffer size the more 
accurate the search is. The prediction runtime will suffer when backtracking 
search is needed in a non-overlap node (which is more likely as balance 
threshold increases). 

More specifically, metric tree promises accurate but less performant search 
while spill tree creates trees with overlapping nodes and uses more efficient 
defeatist search whose accuracy is related to the buffer size *tau* (the larger 
it is the more accurate the search but deeper tree). Hybrid spill tree is 
constructed with both overlapping (spill tree) and non-overlapping (metric 
tree) node and uses balance threshold *rho* to balances the tree depth and 
search efficiency.


A high level overview of the algorithm is as follows:
1. Sample M data points (M = number of partitions)
2. Build the top level metric tree
3. Repartition RDD by assigning each point to leaf node of the above tree
4. Build a hybrid spill tree at each partition 
This concludes the training phase of kNN.

Prediction is achieved by identify the subtree it falls into then run 
prediction on that subtree.


Let me know if anyone has any thoughts, concerns or corrections on the things I 
said above.


> Approximate k-NN Models for MLLib
> ---------------------------------
>
>                 Key: SPARK-2336
>                 URL: https://issues.apache.org/jira/browse/SPARK-2336
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Brian Gawalt
>            Priority: Minor
>              Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within from each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to