Hi,

I'm creating a process in SystemML and running it through Spark. I'm running
the code in the following way:


# Spark Specifications:


import os
import sys
import pandas as pd
import numpy as np

spark_path = "C:\spark"
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path

sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")

from pyspark import SparkContext
from pyspark import SparkConf

sc = SparkContext("local[*]", "test")
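
# Note: SparkConf is imported above but never used. For completeness, a sketch
# of how the same context could be built with an explicit configuration instead
# (the memory value below is just a placeholder, not a tuned setting):

# conf = SparkConf().setMaster("local[*]").setAppName("test")
# conf = conf.set("spark.driver.memory", "4g")  # placeholder value
# sc = SparkContext(conf=conf)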


# SystemML Specifications:


from pyspark.sql import SQLContext
import systemml as sml
sqlCtx = SQLContext(sc)
ml = sml.MLContext(sc)


# Importing the data


train_data = pd.read_csv("data1.csv")
test_data = pd.read_csv("data2.csv")



# read_csv already returns a pandas DataFrame, so no extra pd.DataFrame wrap is needed
train_data = sqlCtx.createDataFrame(train_data)
test_data = sqlCtx.createDataFrame(test_data)


# Finally executing the code:


scriptUrl = "C:/systemml-0.13.0-incubating-bin/scripts/model_code.dml"

script = sml.dml(scriptUrl).input(bdframe_train=train_data,
                                  bdframe_test=test_data).output("check_func")

beta = ml.execute(script).get("check_func").toNumPy()

pd.DataFrame(beta).head(1)

The data sizes are 1,000 and 100 rows for the train and test sets respectively.
I'm testing on a small dataset during development and will move to a larger one
later. I'm running on my local machine with 4 cores.

The problem is that the same model takes a fraction of a second when I run it
in R, but running it this way takes around 20-30 seconds.
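
I haven't yet separated the one-time Spark/JVM startup cost from the script
execution itself; a quick sketch of how I could time just the execute call:

import time

start = time.time()
result = ml.execute(script)  # time only the DML execution, not context startup
print("execute took %.2f seconds" % (time.time() - start))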

Could anyone please suggest how to improve the execution speed, or whether
there is another way to execute the code that would be faster?
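
One variant I'm wondering about: whether the Spark DataFrame conversion step is
needed at all. A minimal sketch of what I mean, assuming MLContext accepts
pandas DataFrames directly (train_pd and test_pd here stand for the original
pandas frames from pd.read_csv, before the createDataFrame calls):

# Hypothetical variant: feed the pandas frames straight to the script,
# skipping sqlCtx.createDataFrame entirely.
script = sml.dml(scriptUrl).input(bdframe_train=train_pd,
                                  bdframe_test=test_pd).output("check_func")
beta = ml.execute(script).get("check_func").toNumPy()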

Also, thank you all for releasing version 0.14. There are a few improvements
we found extremely helpful.

Thank you!
Arijit
