[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087301#comment-15087301
 ] 

mustafa elbehery commented on SPARK-5226:
-----------------------------------------

I tried Aliaksei's implementation on 500MB of GPS trajectories, and the 
algorithm never finished, although his implementation worked very well on the 
provided sample data. 

When I created scatter plots of both datasets (the sample data and the 
trajectory data), I found that his data's distribution was Gaussian, while 
mine was very skewed. Moreover, this implementation has a bottleneck: 
basically all the partitions are merged together in a reduce step, which 
turns the algorithm into a serial one again. 
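
To make the bottleneck concrete, here is a minimal sketch in plain Python. It 
is not Aliaksei's code and it does not use Spark; the function name 
merge_cluster_sets and the per-partition results are hypothetical and only 
there for illustration. It assumes each partition returns its local clusters 
as sets of point ids, with border points appearing in more than one partition.

```python
from functools import reduce


def merge_cluster_sets(left, right):
    """Combine two lists of clusters, joining any clusters that share a point."""
    merged = [set(c) for c in left]
    for cluster in right:
        cluster = set(cluster)
        # Find every cluster collected so far that overlaps the incoming one.
        overlapping = [c for c in merged if c & cluster]
        for c in overlapping:
            merged.remove(c)
            cluster |= c
        merged.append(cluster)
    return merged


# Hypothetical output of a local DBSCAN run on three partitions.
# Points 3, 5 and 11 are border points seen by two partitions each.
partition_results = [
    [{1, 2, 3}, {10, 11}],
    [{3, 4, 5}],
    [{5, 6}, {11, 12}],
]

# One reduce over all partitions: every local cluster is compared against
# the clusters merged so far, one after another.
global_clusters = reduce(merge_cluster_sets, partition_results)
print(sorted(sorted(c) for c in global_clusters))
```

In this sketch the merge has to look at every cluster from every partition in 
sequence, so adding partitions speeds up the local clustering but not this 
final step.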

I have posted a better implementation below that avoids this bottleneck; I 
hope it will be more helpful.

> Add DBSCAN Clustering Algorithm to MLlib
> ----------------------------------------
>
>                 Key: SPARK-5226
>                 URL: https://issues.apache.org/jira/browse/SPARK-5226
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Muhammad-Ali A'rabi
>            Priority: Minor
>              Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. I think DBSCAN is the first candidate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
