[
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396306#comment-14396306
]
Sen Fang commented on SPARK-2336:
---------------------------------
I'm tentatively going to give the hybrid spilltree implementation described in
OP's post a try. Specifically I'm going to follow the implementation described
in: http://dx.doi.org/10.1109/WACV.2007.18
According to the paper, the algorithm scales well in terms of number of
observations, bounded by the available memory across clusters (billions in
paper's example). The original hybrid spilltree paper claims the algorithm
scales much better than other alternatives (LSH, metric tree) when it comes to
number of features (up to hundreds). Furthermore, random projection is often
used to reduce the dimensionality of feature space. Random projection is out of
scope in my implementation and should probably implemented as a separate
dimension reduction technique. (in the paper 4000 features are projected down
to 100)
The average runtime for training and prediction is generally O(log n). The
training runtime will suffer if we turn up the buffer size. The search is only
approximate on overlapping node and the higher the buffer size the more
accurate the search is. The prediction runtime will suffer when backtracking
search is needed in a non-overlap node (which is more likely as balance
threshold increases).
More specifically, metric tree promises accurate but less performant search
while spill tree creates trees with overlapping nodes and uses more efficient
defeatist search whose accuracy is related to the buffer size *tau* (the larger
it is the more accurate the search but deeper tree). Hybrid spill tree is
constructed with both overlapping (spill tree) and non-overlapping (metric
tree) node and uses balance threshold *rho* to balances the tree depth and
search efficiency.
A high level overview of the algorithm is as follows:
1. Sample M data points (M = number of partitions)
2. Build the top level metric tree
3. Repartition RDD by assigning each point to leaf node of the above tree
4. Build a hybrid spill tree at each partition
This concludes the training phase of kNN.
Prediction is achieved by identify the subtree it falls into then run
prediction on that subtree.
Let me know if anyone has any thoughts, concerns or corrections on the things I
said above.
> Approximate k-NN Models for MLLib
> ---------------------------------
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Brian Gawalt
> Priority: Minor
> Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to
> also offer approximate k-Nearest Neighbor. A promising approach would involve
> building a kd-tree variant within from each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with
> much lower latency than the plain-vanilla kNN versions.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]