Hi Srinath/Nirmal, I managed to get the $subject working. Here I connected an IPython/Jupyter Notebook to pyspark, and pyspark submits the job to the remote Spark cluster (created by DAS). One advantage of using the Notebook is that a user can load the data in DAS tables as a Spark DataFrame, and work on it interactively.
However, it also has the following limitations:
- The client side needs a Spark distribution (to use pyspark).
- The cores allocated to the Spark app used by DAS (CarbonAnalytics) have to be limited, so that the Spark app created by pyspark can run in parallel.
- The Spark classpath has to be set on the client side with the jars used by DAS, so that once the job is submitted, the Spark executors know where to look for the classes.

*Training Models:*

As we discussed offline, for large datasets we can directly use the algorithms in Spark's mllib and ml packages. This is very straightforward, as the data we get from DAS is a Spark DataFrame, so we can train models on top of the DataFrame (or convert it to an RDD). For small and medium datasets, we can convert the Spark DataFrame to a pandas DataFrame using df.toPandas(), which loads all the data into memory, and then train sklearn algorithms on top of that.

A sample python script can be found at [1].

[1] https://github.com/SupunS/play-ground/blob/master/pyspark/PySpark-Sample.ipynb

--
*Supun Sethunga*
Senior Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324
Blog: http://supunsetunga.blogspot.com
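P.S. The toPandas + sklearn path for small datasets can be sketched as follows. The tiny frame here is only a stand-in for the result of df.toPandas(), and LinearRegression is used just as an example sklearn algorithm.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# In the Notebook the pandas frame would come from the Spark DataFrame:
#   pdf = df.toPandas()   # pulls the full dataset into driver memory
# A small hard-coded frame stands in for that result so the sketch runs stand-alone.
pdf = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0],
                    "y":  [2.1, 3.9, 6.2, 7.8]})

# Train an sklearn model on the in-memory data.
model = LinearRegression()
model.fit(pdf[["x1"]], pdf["y"])

# Predict for an unseen point.
pred = model.predict(pd.DataFrame({"x1": [5.0]}))[0]
```

Note that toPandas() collects the whole dataset to the driver, so this only works when the data fits in the client's memory.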
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
