Hi Arijit,

Can you please put timing counters around the code below to understand the 20-30 seconds you observe:

1. Creation of the SparkContext:

sc = SparkContext("local[*]", "test")

2. Converting the pandas data frames to PySpark DataFrames:

> train_data = pd.read_csv("data1.csv")
> test_data = pd.read_csv("data2.csv")
> train_data = sqlCtx.createDataFrame(pd.DataFrame(train_data))
> test_data = sqlCtx.createDataFrame(pd.DataFrame(test_data))
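A minimal sketch of such timing counters, using only the Python standard library (the commented-out Spark calls are the ones from your script and assume pyspark is on the path):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Prints wall-clock seconds spent inside the "with" block.
    start = time.perf_counter()
    try:
        yield
    finally:
        print("%s: %.2f s" % (label, time.perf_counter() - start))

# Usage around the suspect steps (calls taken from the script below):
# with timed("SparkContext creation"):
#     sc = SparkContext("local[*]", "test")
# with timed("createDataFrame(train)"):
#     train_data = sqlCtx.createDataFrame(train_data)
# with timed("createDataFrame(test)"):
#     test_data = sqlCtx.createDataFrame(test_data)
# with timed("ml.execute"):
#     beta = ml.execute(script).get("check_func").toNumPy()
```

On a local[*] setup, JVM and Spark startup in the first block often dominates; the per-step numbers will show whether that is the case here.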
Also, you can pass the pandas data frame directly to MLContext :)

Thanks,
Niketan

> On May 10, 2017, at 10:31 AM, arijit chakraborty <ak...@hotmail.com> wrote:
>
> Hi,
>
> I'm creating a process in SystemML and running it through Spark. I'm running
> the code in the following way:
>
> # Spark Specifications:
>
> import os
> import sys
> import pandas as pd
> import numpy as np
>
> spark_path = "C:\spark"
> os.environ['SPARK_HOME'] = spark_path
> os.environ['HADOOP_HOME'] = spark_path
>
> sys.path.append(spark_path + "/bin")
> sys.path.append(spark_path + "/python")
> sys.path.append(spark_path + "/python/pyspark/")
> sys.path.append(spark_path + "/python/lib")
> sys.path.append(spark_path + "/python/lib/pyspark.zip")
> sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")
>
> from pyspark import SparkContext
> from pyspark import SparkConf
>
> sc = SparkContext("local[*]", "test")
>
> # SystemML Specifications:
>
> from pyspark.sql import SQLContext
> import systemml as sml
> sqlCtx = SQLContext(sc)
> ml = sml.MLContext(sc)
>
> # Importing the data
>
> train_data = pd.read_csv("data1.csv")
> test_data = pd.read_csv("data2.csv")
>
> train_data = sqlCtx.createDataFrame(pd.DataFrame(train_data))
> test_data = sqlCtx.createDataFrame(pd.DataFrame(test_data))
>
> # Finally, executing the code:
>
> scriptUrl = "C:/systemml-0.13.0-incubating-bin/scripts/model_code.dml"
>
> script = sml.dml(scriptUrl).input(bdframe_train=train_data, bdframe_test=test_data).output("check_func")
>
> beta = ml.execute(script).get("check_func").toNumPy()
>
> pd.DataFrame(beta).head(1)
>
> The data sizes are 1000 and 100 rows for train and test respectively. I'm
> testing it on a small dataset during development; later I will test on a
> larger dataset. I'm running on my local system with 4 cores.
>
> The problem is, if I run the model in R, it takes a fraction of a second. But
> when I run it like this, it takes around 20-30 seconds.
> Could anyone please suggest how I can improve the execution speed? Is there
> any other way I can execute the code that would improve it?
>
> Also, thank you all for releasing the 0.14 version. There are a few
> improvements we found extremely helpful.
>
> Thank you!
> Arijit
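Following up on the suggestion above, passing the pandas data frames straight to MLContext could look roughly like this — a sketch, not tested here, reusing the names and paths from the quoted script; it drops the SQLContext/createDataFrame round-trip entirely (note also that pd.read_csv already returns a DataFrame, so the extra pd.DataFrame(...) wrapping was redundant):

```python
import pandas as pd
import systemml as sml

# Read the CSVs as before.
train_data = pd.read_csv("data1.csv")
test_data = pd.read_csv("data2.csv")

# MLContext's input() accepts pandas DataFrames directly, so the
# conversion to PySpark DataFrames can be skipped.
ml = sml.MLContext(sc)  # sc is the existing SparkContext
scriptUrl = "C:/systemml-0.13.0-incubating-bin/scripts/model_code.dml"
script = (sml.dml(scriptUrl)
          .input(bdframe_train=train_data, bdframe_test=test_data)
          .output("check_func"))
beta = ml.execute(script).get("check_func").toNumPy()
```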