GSoC projects related to Spark
Hello all, I am really interested in Spark. I have been doing small machine-learning projects using Spark and would love to do a Spark project in this year's GSoC. Can anyone tell me whether there are any Spark-related projects for this year's GSoC?
Re: Regularized Logistic regression
I used CrossValidator for tuning the parameters. My code is here:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    lr = LogisticRegression(maxIter=500)
    paramGrid = (ParamGridBuilder()
                 .addGrid(lr.regParam, [0.02, 0.01, 0.2, 0.1, 1.0, 2.0, 10.0, 15.0, 20.0, 100.0])
                 .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
                 .build())
    crossval = CrossValidator(estimator=lr,
                              estimatorParamMaps=paramGrid,
                              evaluator=BinaryClassificationEvaluator(),
                              numFolds=3)
    model = crossval.fit(data_train_df)

And finally predicted the values:

    prediction = model.transform(data_test_df)
    prediction.show()

    +-----+----------+
    |label|prediction|
    +-----+----------+
    |  1.0|       1.0|
    |  1.0|       1.0|
    |  1.0|       1.0|
    |  1.0|       1.0|
    |  1.0|       1.0|
    |  0.0|       1.0|
    |  0.0|       1.0|
    |  0.0|       1.0|
    |  0.0|       1.0|
    |  0.0|       1.0|
    |  0.0|       1.0|
    |  0.0|       1.0|
    |  0.0|       1.0|
    |  0.0|       1.0|
    |  0.0|       1.0|
    |  0.0|       1.0|
    |  0.0|       1.0|
    +-----+----------+

Why am I getting the wrong predictions?
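One way to dig into what the cross-validation chose is to score the held-out predictions with the same metric CrossValidator optimized, and, on Spark versions where CrossValidatorModel exposes avgMetrics, line the per-grid-point scores up against the parameter maps. A minimal sketch, assuming the prediction DataFrame, paramGrid, and model from above:

    # Score the test predictions with the evaluator CrossValidator used
    # (BinaryClassificationEvaluator reports areaUnderROC by default).
    evaluator = BinaryClassificationEvaluator()
    print(evaluator.evaluate(prediction))

    # Where avgMetrics is available, each entry is the mean
    # cross-validation metric for the corresponding grid point.
    for params, metric in zip(paramGrid, model.avgMetrics):
        print(params, metric)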
RE: Regularized Logistic regression
OK, so I tried setting regParam and lowering it. How do I evaluate which regParam is best? Do I have to do it by trial and error? I am currently calculating the log loss for the model. Is that a good way to find the best regParam value? Here is my code:

    from math import exp, log

    epsilon = 1e-16

    def sigmoid_log_loss(w, intercept, x):
        # Sigmoid of the linear score, including the model's intercept.
        ans = float(1.0 / (1.0 + exp(-(w.dot(x.features) + intercept))))
        # Clamp away from 0 and 1 so the logs stay finite.
        if ans == 0:
            ans = ans + epsilon
        if ans == 1:
            ans = ans - epsilon
        log_loss = -(x.label * log(ans) + (1 - x.label) * log(1 - ans))
        return ((ans, x.label), log_loss)

    reg = 0.02
    from pyspark.ml.classification import LogisticRegression
    lr = LogisticRegression(regParam=reg, maxIter=500,
                            standardization=True, elasticNetParam=0.5)
    model = lr.fit(data_train_df)
    w = model.coefficients
    intercept = model.intercept
    # (on Spark 2.x+ use data_val_df.rdd.map instead of data_val_df.map)
    data_predicted_df = data_val_df.map(lambda x: sigmoid_log_loss(w, intercept, x))
    log_loss = data_predicted_df.map(lambda x: x[1]).mean()
    print(log_loss)
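For comparison, one way to pick regParam systematically rather than by hand is to loop over candidate values and keep the one with the lowest validation log loss. A sketch, assuming the sigmoid_log_loss helper and the data_train_df / data_val_df DataFrames above; the candidate grid is only illustrative:

    from pyspark.ml.classification import LogisticRegression

    best_reg, best_loss = None, float("inf")
    for reg in [0.001, 0.01, 0.1, 1.0, 10.0]:   # illustrative grid
        lr = LogisticRegression(regParam=reg, maxIter=500,
                                standardization=True, elasticNetParam=0.5)
        m = lr.fit(data_train_df)
        w, b = m.coefficients, m.intercept
        # Mean validation log loss for this regParam.
        loss = data_val_df.map(lambda x: sigmoid_log_loss(w, b, x)[1]).mean()
        if loss < best_loss:
            best_reg, best_loss = reg, loss
    print(best_reg, best_loss)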
RE: Regularized Logistic regression
Thank you, Anurag Verma, for replying. I tried increasing the iterations; however, I still get underfitted results. I am checking the model's predictions by counting how many pairs of labels and predictions it gets right:

    from pyspark.sql.functions import col

    data_predict_with_model = best_model.transform(data_test_df)
    final_pred_df = data_predict_with_model.select(col('label'), col('prediction'))
    ans = (final_pred_df.map(lambda x: ((x[0], x[1]), 1))
                        .reduceByKey(lambda a, b: a + b)
                        .toDF())
    ans.show()

    +---------+---+
    |       _1| _2|
    +---------+---+
    |[1.0,1.0]|  5|
    |[0.0,1.0]| 12|
    +---------+---+

Do you know any other methods by which I can check the model, and what is it that I am doing wrong? I have filtered the data and arranged it into a features column and a label column, so I guess only the model-creation part is wrong. Can anyone help me, please? I am still learning machine learning.
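Two other quick checks worth trying, sketched under the assumption that final_pred_df and data_predict_with_model are the DataFrames from above:

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # Contingency table of label vs. prediction: a full confusion matrix,
    # including the (label, prediction) pairs that never occur above.
    final_pred_df.crosstab('label', 'prediction').show()

    # Overall accuracy; on older Spark releases the metric name may be
    # 'precision' rather than 'accuracy'.
    evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
    print(evaluator.evaluate(data_predict_with_model))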
Regularized Logistic regression
Hello, I am trying to solve a problem using regularized logistic regression in Spark. I am using the model created by LogisticRegression():

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(regParam=10.0, maxIter=10, standardization=True)
    model = lr.fit(data_train_df)
    data_predict_with_model = model.transform(data_test_df)

However, I am not able to get proper results. Can anyone tell me whether we have to pass any other parameters to the model?
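For reference, these are the constructor parameters that usually matter for this estimator; all of them are real LogisticRegression params, but the values below are only illustrative, not a recommendation for this dataset:

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(
        regParam=0.01,           # large values such as 10.0 can underfit badly
        elasticNetParam=0.0,     # 0.0 = pure L2 penalty, 1.0 = pure L1
        maxIter=100,             # 10 iterations may stop before convergence
        standardization=True,
        featuresCol='features',  # defaults, shown explicitly
        labelCol='label')
    model = lr.fit(data_train_df)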
Contribution to Apache Spark
Hello, I am Aditya Vyas and I am currently in the third year of my BTech in engineering. I know Python and a little bit of Java. I want to start contributing to Apache Spark; this is my first time in the field of Big Data. Can someone please help me get started? Which resources should I look at?