Hi Srinath/Nirmal

I managed to get the $subject working. I connected an iPython/Jupyter
Notebook to pyspark, and pyspark submits the job to the remote Spark
cluster (created by DAS). One advantage of using a Notebook is that
a user can load the data in DAS tables as Spark dataframes and work on
it interactively.

But it also has the following limitations:

   - The client side needs a Spark distribution (to use pyspark).
   - We have to limit the cores allocated to the Spark App used by DAS
   (CarbonAnalytics), so that the Spark App created by pyspark can run in
   parallel.
   - We have to set the Spark classpath at the client side, with the jars
   used by DAS, so that once the job is submitted, the Spark executor knows
   where to look for the classes.
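
The client-side settings behind the last two points could look roughly
like the sketch below. It is only an illustration under assumptions: the
master URL, the jar directory, the app name and the helper names are
hypothetical placeholders, and it assumes a standalone Spark master,
where spark.cores.max caps the cores an application takes. The core limit
for the DAS app itself (CarbonAnalytics) is set on the DAS side and is not
shown here.

```python
import os


def build_client_conf(master_url, das_jar_dir, max_cores):
    """Collect Spark properties for a pyspark client that shares a cluster with DAS.

    All three arguments are placeholders chosen for this sketch.
    """
    # Every jar in the (hypothetical) DAS jar directory goes on the classpath.
    jars = sorted(
        os.path.join(das_jar_dir, name)
        for name in os.listdir(das_jar_dir)
        if name.endswith(".jar")
    )
    classpath = ":".join(jars)  # assumes Unix-style classpath separators
    return {
        "spark.master": master_url,
        "spark.app.name": "notebook-client",  # hypothetical app name
        # Cap this app's cores so it can run next to the DAS app.
        "spark.cores.max": str(max_cores),
        # Tell the executors and the driver where the DAS classes live.
        "spark.executor.extraClassPath": classpath,
        "spark.driver.extraClassPath": classpath,
    }


def make_spark_context(props):
    """Create a SparkContext from the properties (needs a local Spark distribution)."""
    from pyspark import SparkConf, SparkContext  # imported lazily on purpose

    conf = SparkConf().setAll(list(props.items()))
    return SparkContext(conf=conf)
```

In a notebook this would be used along the lines of
make_spark_context(build_client_conf("spark://<das-host>:7077",
"<dir-with-DAS-jars>", 2)), with both placeholders replaced by the values
of the actual deployment. Note that the executor classpath entries must
also exist at the same paths on the worker nodes.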


*Training Models:*

As we discussed offline, for large datasets, we can directly use the
algorithms in Spark's mllib and ml. This is very straightforward, as the
data we get from DAS is a Spark dataframe, and hence we can train models
on top of the dataframe (or convert it to an RDD).
For small and medium datasets, we can convert the Spark dataframe to a
pandas dataframe using df.toPandas(), which will load all the data into
memory, and then train sklearn algorithms on top of that.
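
For the small/medium case, the flow could be sketched as below. This is a
minimal illustration, not the code from the notebook at [1]: the function
name, the label column name and the choice of a decision tree are
assumptions made for the example, and any sklearn estimator could be
swapped in.

```python
from sklearn.tree import DecisionTreeClassifier


def train_small(spark_df, label_col):
    """Train an sklearn model on a Spark dataframe that fits in memory.

    spark_df is anything with a toPandas() method (a pyspark DataFrame);
    label_col is the hypothetical name of the column to predict.
    """
    # toPandas() collects the whole dataframe into local memory, so this
    # is only suitable for small and medium datasets.
    pdf = spark_df.toPandas()

    features = pdf.drop(columns=[label_col])
    labels = pdf[label_col]

    # The estimator is an arbitrary choice for this sketch.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(features, labels)
    return model
```

The returned model is a regular sklearn estimator, so model.predict() can
then be called on any pandas dataframe with the same feature columns.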

A sample Python script can be found at [1].

[1]
https://github.com/SupunS/play-ground/blob/master/pyspark/PySpark-Sample.ipynb

-- 
*Supun Sethunga*
Senior Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324
Blog: http://supunsetunga.blogspot.com
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
