Hi Arijit,

Can you please put timing counters around the code below to understand where 
the 20-30 seconds you observe are spent:
1. Creation of SparkContext: 
sc = SparkContext("local[*]", "test")
2. Converting the pandas DataFrames to PySpark DataFrames:
> train_data = pd.read_csv("data1.csv")
> test_data = pd.read_csv("data2.csv")
> train_data = sqlCtx.createDataFrame(pd.DataFrame(train_data))
> test_data = sqlCtx.createDataFrame(pd.DataFrame(test_data))
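
For example, a minimal wall-clock timer along these lines (plain Python 
stdlib; the commented-out usage assumes the `sc` / `sqlCtx` / `train_data` 
names from your script) would show where the time goes:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print elapsed wall-clock time for the enclosed block.
    start = time.time()
    yield
    print("%s took %.2f s" % (label, time.time() - start))

# Hypothetical placement around the two suspect steps:
# with timed("SparkContext creation"):
#     sc = SparkContext("local[*]", "test")
# with timed("pandas -> Spark DataFrame"):
#     train_data = sqlCtx.createDataFrame(train_data)

# Self-contained demo so the snippet runs as-is:
with timed("example sleep"):
    time.sleep(0.05)
```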


Also, you can pass the pandas DataFrames directly to MLContext :)
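
A rough sketch of what that could look like (the `build_script` helper is my 
own name for illustration, not part of the SystemML API; the input/output 
names follow your script, and the import is guarded since SystemML may not be 
installed everywhere):

```python
try:
    import systemml as sml  # only available where SystemML is installed
except ImportError:
    sml = None

def build_script(train_data, test_data, script_url):
    # train_data / test_data can be plain pandas DataFrames; MLContext's
    # converters accept them directly, so the sqlCtx.createDataFrame()
    # step can be skipped entirely.
    return (sml.dml(script_url)
               .input(bdframe_train=train_data, bdframe_test=test_data)
               .output("check_func"))
```

That removes the pandas-to-Spark conversion from your timing entirely, which 
is one of the two steps I suspect above.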

Thanks 

Niketan 

> On May 10, 2017, at 10:31 AM, arijit chakraborty <ak...@hotmail.com> wrote:
> 
> Hi,
> 
> 
> I'm creating a process in SystemML and running it through Spark. I'm running 
> the code in the following way:
> 
> 
> # Spark Specifications:
> 
> 
> import os
> import sys
> import pandas as pd
> import numpy as np
> 
> spark_path = "C:\spark"
> os.environ['SPARK_HOME'] = spark_path
> os.environ['HADOOP_HOME'] = spark_path
> 
> sys.path.append(spark_path + "/bin")
> sys.path.append(spark_path + "/python")
> sys.path.append(spark_path + "/python/pyspark/")
> sys.path.append(spark_path + "/python/lib")
> sys.path.append(spark_path + "/python/lib/pyspark.zip")
> sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")
> 
> from pyspark import SparkContext
> from pyspark import SparkConf
> 
> sc = SparkContext("local[*]", "test")
> 
> 
> # SystemML Specifications:
> 
> 
> from pyspark.sql import SQLContext
> import systemml as sml
> sqlCtx = SQLContext(sc)
> ml = sml.MLContext(sc)
> 
> 
> # Importing the data
> 
> 
> train_data = pd.read_csv("data1.csv")
> test_data = pd.read_csv("data2.csv")
> 
> 
> 
> train_data = sqlCtx.createDataFrame(pd.DataFrame(train_data))
> test_data = sqlCtx.createDataFrame(pd.DataFrame(test_data))
> 
> 
> # Finally executing the code:
> 
> 
> scriptUrl = "C:/systemml-0.13.0-incubating-bin/scripts/model_code.dml"
> 
> script = sml.dml(scriptUrl).input(bdframe_train=train_data, 
> bdframe_test=test_data).output("check_func")
> 
> beta = ml.execute(script).get("check_func").toNumPy()
> 
> pd.DataFrame(beta).head(1)
> 
> The data sizes are 1000 and 100 rows for train and test respectively. I'm 
> testing on a small dataset during development and will test on a larger 
> dataset later. I'm running on my local system with 4 cores.
> 
> The problem is, if I run the model in R, it takes a fraction of a second. But 
> when I run it like this, it takes around 20-30 seconds.
> 
> Could anyone please suggest how to improve the execution speed, or any other 
> way to execute the code that would run faster?
> 
> Also, thank you all for releasing the 0.14 version. There are a few 
> improvements we found extremely helpful.
> 
> Thank you!
> Arijit
> 
