Hi Niketan,

Thank you for your suggestion!


I tried what you suggested.


# Changed it here:


from pyspark.sql import SQLContext
import systemml as sml
sqlCtx = SQLContext(sc)
ml = sml.MLContext(sc).setStatistics(True)


# And then:


scriptUrl = "C:/systemml-0.13.0-incubating-bin/scripts/model_code.dml"

%%time
script = sml.dml(scriptUrl).input(bdframe_train=train_data, bdframe_test=test_data).output("check_func")

beta = ml.execute(script).get("check_func").toNumPy()

pd.DataFrame(beta).head(1)



It gave me this output:


Wall time: 16.3 s



But how can I get the breakdown of whether the time is spent in converters or in 
some instruction in SystemML?


Just want to add that I'm running this code through a Jupyter notebook.


Thanks again!


Arijit

________________________________
From: Niketan Pansare <npan...@us.ibm.com>
Sent: Friday, May 12, 2017 2:02:52 AM
To: dev@systemml.incubator.apache.org
Subject: Re: Improve SystemML execution speed in Spark

OK, then the next step would be to enable statistics:
>> ml = sml.MLContext(sc).setStatistics(True)

It will help you identify whether the time is spent in converters or in some 
instruction in SystemML.

Also, since dataframe creation is lazy, you may want to call persist() followed by 
an action such as count() to ensure you are measuring it correctly.
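The lazy-evaluation pitfall above can be sketched with a pure-Python analog (the generator here is illustrative, not SystemML or Spark API; the PySpark equivalent is only shown in comments):

```python
import time

# Pure-Python analog of the lazy-evaluation pitfall: building a lazy
# pipeline returns instantly, and only an action pays the real cost --
# the same reason a createDataFrame() call looks "free" until count().
lazy = (x * x for x in range(1_000_000))  # lazy, like createDataFrame()

start = time.time()
materialized = list(lazy)                 # action, like persist() + count()
elapsed = time.time() - start             # this is the cost you must exclude

# In PySpark, the warm-up before timing the DML run would be roughly:
#   df = sqlCtx.createDataFrame(pandas_df)
#   df.persist()
#   df.count()   # forces the conversion/caching up front
```

With the conversion forced up front, a later `%%time` around `ml.execute(script)` measures only the DML execution, not the dataframe conversion.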

> On May 11, 2017, at 1:27 PM, arijit chakraborty <ak...@hotmail.com> wrote:
>
> Thank you Niketan for your reply! I was actually putting the timer around the DML 
> code part. The rest of the portions were almost instantaneous; the DML code part 
> was taking the time, and I could not figure out why.
>
>
> Thanks again!
>
> Arijit
>
> ________________________________
> From: Niketan Pansare <npan...@us.ibm.com>
> Sent: Thursday, May 11, 2017 1:33:15 AM
> To: dev@systemml.incubator.apache.org
> Subject: Re: Improve SystemML execution speed in Spark
>
> Hi Arijit,
>
> Can you please put timing counters around the code below to understand the 
> 20-30 seconds you observe:
> 1. Creation of SparkContext:
> sc = SparkContext("local[*]", "test")
> 2. Converting pandas to Pyspark dataframe:
>> train_data= pd.read_csv("data1.csv")
>> test_data     = pd.read_csv("data2.csv")
>> train_data = sqlCtx.createDataFrame(pd.DataFrame(train_data))
>> test_data  = sqlCtx.createDataFrame(pd.DataFrame(test_data))
>
>
> Also, you can pass a pandas data frame directly to MLContext :)
>
> Thanks
>
> Niketan
>
>> On May 10, 2017, at 10:31 AM, arijit chakraborty <ak...@hotmail.com> wrote:
>>
>> Hi,
>>
>>
>> I'm creating a process in SystemML and running it through Spark. I'm 
>> running the code in the following way:
>>
>>
>> # Spark Specifications:
>>
>>
>> import os
>> import sys
>> import pandas as pd
>> import numpy as np
>>
>> spark_path = r"C:\spark"
>> os.environ['SPARK_HOME'] = spark_path
>> os.environ['HADOOP_HOME'] = spark_path
>>
>> sys.path.append(spark_path + "/bin")
>> sys.path.append(spark_path + "/python")
>> sys.path.append(spark_path + "/python/pyspark/")
>> sys.path.append(spark_path + "/python/lib")
>> sys.path.append(spark_path + "/python/lib/pyspark.zip")
>> sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")
>>
>> from pyspark import SparkContext
>> from pyspark import SparkConf
>>
>> sc = SparkContext("local[*]", "test")
>>
>>
>> # SystemML Specifications:
>>
>>
>> from pyspark.sql import SQLContext
>> import systemml as sml
>> sqlCtx = SQLContext(sc)
>> ml = sml.MLContext(sc)
>>
>>
>> # Importing the data
>>
>>
>> train_data = pd.read_csv("data1.csv")
>> test_data  = pd.read_csv("data2.csv")
>>
>>
>>
>> train_data = sqlCtx.createDataFrame(pd.DataFrame(train_data))
>> test_data  = sqlCtx.createDataFrame(pd.DataFrame(test_data))
>>
>>
>> # Finally executing the code:
>>
>>
>> scriptUrl = "C:/systemml-0.13.0-incubating-bin/scripts/model_code.dml"
>>
>> script = sml.dml(scriptUrl).input(bdframe_train=train_data, bdframe_test=test_data).output("check_func")
>>
>> beta = ml.execute(script).get("check_func").toNumPy()
>>
>> pd.DataFrame(beta).head(1)
>>
>> The dataset sizes are 1000 and 100 rows for train and test respectively. I'm 
>> testing on a small dataset during development and will test on a larger dataset 
>> later. I'm running on my local system with 4 cores.
>>
>> The problem is that if I run the model in R, it takes a fraction of a second, but 
>> when I run it like this, it takes around 20-30 seconds.
>>
>> Could anyone please suggest how to improve the execution speed, or any other 
>> way to execute the code that would improve it?
>>
>> Also, thank you all for releasing the 0.14 version. There are a few 
>> improvements we found extremely helpful.
>>
>> Thank you!
>> Arijit
>>
>
