Hi all,

I am trying to use some machine learning algorithms that are not included in MLlib, such as mixture models and LDA (Latent Dirichlet Allocation). I am using PySpark and Spark SQL.
My problem is: I have some scripts that implement these algorithms, but I am not sure which parts I need to change so that they scale to big data.

- Some very simple calculations can take a long time when the data is large, but constructing an RDD or a SQLContext table also has real overhead. I am not sure whether I should use map() and reduce() every time I need to make a calculation.

- There are also matrix/array-level calculations that cannot easily be expressed with map() and reduce() alone, so functions from the NumPy package have to be used. When the data is very large, will calling NumPy functions directly take too much time? (I have put a small sketch of the pattern I mean in the P.S. below.)

I have found some scripts that are not from MLlib and were created by other developers (credits to Meethu Mathew from Flytxt, thanks for giving me insights! :))

Many thanks, and I look forward to your feedback!

Best,
Danqing

GMMSpark.py (7K) <http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/9964/0/GMMSpark.py>
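P.S. For concreteness, here is a minimal sketch of the pattern I am asking about, not code from the attached script: the model parameters are broadcast, and mapPartitions() hands NumPy a whole partition at once so the vectorized work happens in bulk, followed by one cheap reduce() over per-partition sufficient statistics. All names are illustrative, and the spherical unit-variance Gaussians are a deliberate simplification.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="GMMSketch")

    # Toy data: an RDD of points in R^d.
    points = sc.parallelize(list(np.random.randn(10000, 3)))

    # Current model parameters, broadcast once so every task reads
    # them locally instead of shipping them with each record.
    weights = sc.broadcast(np.array([0.5, 0.5]))        # mixing weights, shape (k,)
    means = sc.broadcast(np.array([[0.0, 0.0, 0.0],
                                   [1.0, 1.0, 1.0]]))   # component means, shape (k, d)

    def partition_stats(iterator):
        # Collect the partition into one NumPy array so the per-point
        # work is vectorized instead of done record by record in map().
        x = np.array(list(iterator))                    # shape (n, d)
        if x.size == 0:
            return iter([])
        # Unnormalized responsibilities under spherical unit-variance
        # Gaussians -- a simplification just for this sketch.
        d2 = ((x[:, None, :] - means.value[None, :, :]) ** 2).sum(axis=2)
        r = weights.value * np.exp(-0.5 * d2)           # shape (n, k)
        r /= r.sum(axis=1, keepdims=True)
        # Per-partition sufficient statistics for the M-step.
        n_k = r.sum(axis=0)                             # soft counts, shape (k,)
        s_k = r.T.dot(x)                                # weighted sums, shape (k, d)
        return iter([(n_k, s_k)])

    # One pass over the data: heavy NumPy work inside each partition,
    # then a cheap elementwise reduce() across partitions.
    n_k, s_k = points.mapPartitions(partition_stats).reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]))

    new_means = s_k / n_k[:, None]                      # M-step mean update on the driver

My question is essentially whether this mapPartitions()-plus-NumPy pattern is the right way to scale such computations, or whether the conversion between RDD records and NumPy arrays will dominate the running time.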