Hi Aishwarya, For the error message, that just means that the SystemML jar isn't being found. Can you add a `--driver-class-path $SYSTEMML_HOME/target/SystemML.jar` to the invocation of Jupyter? I.e. `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --jars $SYSTEMML_HOME/target/SystemML.jar --driver-class-path $SYSTEMML_HOME/target/SystemML.jar`. There was a PySpark bug that was supposed to have been fixed in Spark 2.x, but it's possible that it is still an issue.
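Once Jupyter is up with that invocation, a quick sanity check in the first cell would be something like the following (just a minimal sketch; `ml = MLContext(sc)` is the same call that failed for you before, so if the jar is now on the driver classpath it should succeed):

```
from systemml import MLContext, dml

ml = MLContext(sc)  # previously failed when the SystemML jar wasn't found on the driver classpath
ml.execute(dml('print("SystemML is reachable")'))  # trivial DML script, just to confirm the JVM side works
```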
As for the output, the notebook will create SystemML `Matrix` objects for all of the weights and biases of the trained models. To save, please convert each one to a DataFrame, e.g. `Wc1.toDF()`, repeating for each matrix, and then simply save the DataFrames. This could be done all at once like this for a SystemML Matrix object `Wc1`: `Wc1.toDF().write.save("path/to/save/Wc1.parquet", format="parquet")`. (There is a slightly fuller sketch of this at the very end of this message.) Just repeat for each matrix returned by the "Train" code for the algorithms. At that point, you will have a set of saved DataFrames representing a trained SystemML model, and these can be used in downstream classification tasks in a similar manner to the "Eval" sections. -Mike -- Mike Dusenberry GitHub: github.com/dusenberrymw LinkedIn: linkedin.com/in/mikedusenberry Sent from my iPhone. > On Apr 24, 2017, at 3:07 AM, Aishwarya Chaurasia <aishwarya2...@gmail.com> > wrote: > > Further more : > What is the output of MachineLearning.ipynb you're obtaining sir? > We are actually nearing our deadline for our problem. > Thanks a lot. > > On 24-Apr-2017 2:58 PM, "Aishwarya Chaurasia" <aishwarya2...@gmail.com> > wrote: > > Hello sir, > > Thanks a lot for replying sir. But unfortunately it did not work. Although > the NameError did not appear this time but another error came about : > > https://paste.fedoraproject.org/paste/TUMtSIb88Q73FYekwJmM7V5M1UNdIGYhyRLivL9gydE= > > This error was obtained after executing the second block of code of > MachineLearning.py in terminal. ( ml = MLContext(sc) ) > > We have installed the bleeding-edge version of systemml only and the > installation was done correctly. We are in a fix now. :/ > Kindly look into the matter asap > > On 24-Apr-2017 12:15 PM, "Mike Dusenberry" <dusenberr...@gmail.com> wrote: > > Hi Aishwarya, > > Glad to hear that the preprocessing stage was successful! As for the > `MachineLearning.ipynb` notebook, here is a general guide: > > > - The `MachineLearning.ipynb` notebook essentially (1) loads in the > training and validation DataFrames from the preprocessing step, (2) > converts them to normalized & one-hot encoded SystemML matrices for > consumption by the ML algorithms, and (3) explores training a couple of > models. > - To run, you'll need to start Jupyter in the context of PySpark via > `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter > PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --jars > $SYSTEMML_HOME/target/SystemML.jar`. Note that if you have installed > SystemML with pip from PyPI (`pip3 install systemml`), this will install > our 0.13 release, and the `--jars $SYSTEMML_HOME/target/SystemML.jar` > will > not be necessary. If you instead have installed a bleeding-edge version > of > SystemML locally (git clone locally, maven build, `pip3 install -e > src/main/python` as listed in `projects/breast_cancer/README.md`), the > `--jars $SYSTEMML_HOME/target/SystemML.jar` part *is* necessary. We are > about to release 0.14, and for this project, I *would* recommend using a > bleeding edge install. > - Once Jupyter has been started in the context of PySpark, the `sc` > SparkContext object should be available. Please let me know if you > continue to see this issue. > - The "Read in train & val data" section simply reads in the training > and validation data generated in the preprocessing stage. Be sure that > the > `size` setting is the same as the preprocessing size. The percentage `p` > setting determines whether the full or sampled DataFrames are loaded. If > you set `p = 1`, the full DataFrames will be used.
If you instead would > prefer to use the smaller sampled DataFrames while getting started, > please > set it to the same value as used in the preprocessing to generate the > smaller sampled DataFrames. > - The `Extract X & Y matrices` section splits each of the train and > validation DataFrames into effectively X & Y matrices (still as DataFrame > types), with X containing the images, and Y containing the labels. > - The `Convert to SystemML Matrices` section passes the X & Y DataFrames > into a SystemML script that performs some normalization of the images & > one-hot encoding of the labels, and then returns SystemML `Matrix` types. > These are now ready to be passed into the subsequent algorithms. > - The "Trigger Caching" and "Save Matrices" are experimental features, > and not necessary to execute. > - Next comes the two algorithms being explored in this notebook. The > "Softmax Classifier" is just a multi-class logistic regression model, and > is simply there to serve as a baseline comparison with the subsequent > convolutional neural net model. You may wish to simply skip this softmax > model and move to the latter convnet model further down in the notebook. > - The actual softmax model is located at [ > https://github.com/apache/incubator-systemml/blob/master/ > projects/breast_cancer/softmax_clf.dml], > and the notebook calls functions from that file. > - The softmax sanity check just ensures that the model is able to > completely overfit when given a tiny sample size. This should yield > ~100% > training accuracy if the sample size in this section is small enough. > This > is just a check to ensure that nothing else is wrong with the math or the > data. > - The softmax "Train" section will train a softmax model and return the > weights (`W`) and biases (`b`) of the model as SystemML `Matrix` objects. > Please adjust the hyperparameters in this section to your problem. > - The softmax "Eval" section takes the trained weights and biases and > evaluates the training and validation performance. > - The next model is a LeNet-like convnet model. The actual model is > located at [ > https://github.com/apache/incubator-systemml/blob/master/ > projects/breast_cancer/convnet.dml], > and the notebook simply calls functions from that file. > - Once again, there is an initial sanity check for the ability to > overfit on a small amount of data. > - The "Hyperparameter Search" contains a script to sample different > hyperparams for the convnet, and save the hyperparams + validation > accuracy > of each set after a single epoch of training. These string files will be > saved to HDFS. Please feel free to adjust the range of the > hyperparameters > for your problem. Please also feel free to try using the `parfor` > (parallel for-loop) instead of the while loop to speed up this section. > Note that this is still a work in progress. The hyperparameter tuning in > this section makes use of random search (as opposed to grid search), > which > has been promoted by Bengio et al. to speed up the search time. > - The "Train" section trains the convnet and returns the weights and > biases as SystemML `Matrix` types. In this section, please replace the > hyperparameters with the best ones from above, and please increase the > number of epochs given your time constraints. > - The "Eval" section evaluates the performance of the trained convnet. > - Although it is not shown in the notebook yet, to save the weights and > biases, please use the `toDF()` method on each weight and biases (i.e. 
> `Wc1.toDF()`) to convert to a Spark DataFrame, and then simply save the > DataFrame as desired. > - Finally, please feel free to extend the model in `convnet.dml` for > your particular problem! The LeNet-like model just serves as a simple > convnet, but there are much richer models currently, such as resnets, > that > we are experimenting with. To make larger models such as resnets easier > to > define, we are also working on other tools for converting model > definitions > + pretrained weights from other systems into SystemML. > > > Also, please keep in mind that the deep learning support in SystemML is > still a work in progress. Therefore, if you run into issues, please let us > know and we'll do everything possible to help get things running! > > > Thanks! > > - Mike > > > -- > > Michael W. Dusenberry > GitHub: github.com/dusenberrymw > LinkedIn: linkedin.com/in/mikedusenberry > > On Sat, Apr 22, 2017 at 4:49 AM, Aishwarya Chaurasia < > aishwarya2...@gmail.com> wrote: > >> Hey, >> >> Thank you so much for your help sir. We were finally able to run >> preprocess.py without any errors. And the results obtained were >> satisfactory i.e we got five set of data frames like you said we would. >> >> But alas! when we tried to run MachineLearning.ipynb the same NameError >> came : https://paste.fedoraproject.org/paste/l3LFJreg~ >> vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE= >> >> Could you guide us again as to how to proceed now? >> Also, could you please provide an overview of the process >> MachineLearning.ipynb is following to train the samples. >> >> Thanks a lot! >> >>> On 20-Apr-2017 12:16 AM, <dusenberr...@gmail.com> wrote: >>> >>> Hi Aishwarya, >>> >>> Looks like you've just encountered an out of memory error on one of the >>> executors. Therefore, you just need to adjust the >> `spark.executor.memory` >>> and `spark.driver.memory` settings with higher amounts of RAM. What is >>> your current setup? I.e. are you using a cluster of machines, or a >> single >>> machine? We generally use a large driver on one machine, and then a >> single >>> large executor on each other machine. I would give a sizable amount of >>> memory to the driver, and about half the possible memory on the > executors >>> so that the Python processes have enough memory as well. PySpark has > JVM >>> and Python components, and the Spark memory settings only pertain to the >>> JVM side, thus the need to save about half the executor memory for the >>> Python side. >>> >>> Thanks! >>> >>> - Mike >>> >>> -- >>> >>> Mike Dusenberry >>> GitHub: github.com/dusenberrymw >>> LinkedIn: linkedin.com/in/mikedusenberry >>> >>> Sent from my iPhone. >>> >>> >>>> On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia < >>> aishwarya2...@gmail.com> wrote: >>>> >>>> Hello sir, >>>> >>>> We also wanted to ensure that the spark-submit command we're using is >> the >>>> correct one for running 'preprocess.py'. >>>> Command : /home/new/sparks/bin/spark-submit preprocess.py >>>> >>>> >>>> Thank you. >>>> Aishwarya Chaurasia. >>>> >>>> On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <aishwarya2...@gmail.com >>> >>>> wrote: >>>> >>>> Hello sir, >>>> On running the file preprocess.py we are getting the following error : >>>> >>>> https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIG >>>> YhyRLivL9gydE= >>>> >>>> Can you please help us by looking into the error and kindly tell us > the >>>> solution for it. >>>> Thanks a lot. 
>>>> Aishwarya Chaurasia >>>> >>>>> On 19-Apr-2017 12:43 AM, <dusenberr...@gmail.com> wrote: >>>>> >>>>> Hi Aishwarya, >>>>> >>>>> Certainly, here is some more detailed information >> about `preprocess.py`: >>>>> >>>>> * The preprocessing Python script is located at >>>>> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/preprocess.py. Note that this is different >> than >>>>> the library module at https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py. >>>>> * This script is used to preprocess a set of histology slide images, >>>>> which are `.svs` files in our case, and `.tiff` files in your case. >>>>> * Lines 63-79 contain "settings" such as the output image sizes, >> folder >>>>> paths, etc. Of particular interest, line 72 has the folder path for >> the >>>>> original slide images that should be commonly accessible from all >>> machines >>>>> being used, and lines 74-79 contain the names of the output > DataFrames >>> that >>>>> will be saved. >>>>> * Line 82 performs the actual preprocessing and creates a Spark >>>>> DataFrame with the following columns: slide number, tumor score, >>> molecular >>>>> score, sample. The "sample" in this case is the actual small, >>> chopped-up >>>>> section of the image that has been extracted and flattened into a row >>>>> Vector. For test images without labels (`training=false`), only the >>> slide >>>>> number and sample will be contained in the DataFrame (i.e. no > labels). >>>>> This calls the `preprocess(...)` function located on line 371 of >>>>> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py, which is a >>>>> different file. >>>>> * Line 87 simply saves the above DataFrame to HDFS with the name > from >>>>> line 74. >>>>> * Line 93 splits the above DataFrame row-wise into separate >> "training" >>>>> and "validation" DataFrames, based on the split percentage from line >> 70 >>>>> (`train_frac`); see the short sketch just after this list. This is performed so that downstream machine > learning >>>>> tasks can learn from the training set, and validate performance and >>>>> hyperparameter choices on the validation set. These DataFrames will >>> start >>>>> with the same columns as the above DataFrame. If `add_row_indices` >> from >>>>> line 69 is true, then an additional row index column (`__INDEX`) will >> be >>>>> prepended. This is useful for SystemML in downstream machine > learning >>>>> tasks as it gives the DataFrame row numbers like a real matrix would >>> have, >>>>> and SystemML is built to operate on matrices. >>>>> * Lines 97 & 98 simply save the training and validation DataFrames >>> using >>>>> the names defined on lines 76 & 78. >>>>> * Lines 103-137 create smaller train and validation DataFrames by >>> taking >>>>> small row-wise samples of the full train and validation DataFrames. >> The >>>>> percentage of the sample is defined on line 111 (`p=0.01` for a 1% >>>>> sample). This is generally useful for quicker downstream tasks >> without >>>>> having to load in the larger DataFrames, assuming you have a large >>> amount >>>>> of data. For us, we have ~7TB of data, so having 1% sampled >> DataFrames >>> is >>>>> useful for quicker downstream tests. Once again, the same columns >> from >>> the >>>>> larger train and validation DataFrames will be used. >>>>> * Lines 146 & 147 simply save these sampled train and validation >>>>> DataFrames.
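>>>>> To make the split on line 93 concrete, the idea is roughly the following (just a sketch; the actual code in `preprocess.py` may differ slightly):
>>>>>
>>>>> ```
>>>>> # Row-wise split of the full DataFrame `df` into train/validation,
>>>>> # using the fraction from line 70 (`train_frac`, e.g. 0.8).
>>>>> train_df, val_df = df.randomSplit([train_frac, 1.0 - train_frac], seed=42)
>>>>> ```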
>>>>> >>>>> As a summary, after running `preprocess.py`, you will be left with > the >>>>> following saved DataFrames in HDFS: >>>>> * Full DataFrame >>>>> * Training DataFrame >>>>> * Validation DataFrame >>>>> * Sampled training DataFrame >>>>> * Sampled validation DataFrame >>>>> >>>>> As for visualization, you may visualize a "sample" (i.e. small, >>> chopped-up >>>>> section of original image) from a DataFrame by using the ` >>>>> breastcancer.visualization.visualize_sample(...)` function. You will >>>>> need to do this after creating the DataFrames. Here is a snippet to >>>>> visualize the first row sample in a DataFrame, where `df` is one of >> the >>>>> DataFrames from above: >>>>> >>>>> ``` >>>>> from breastcancer.visualization import visualize_sample >>>>> visualize_sample(df.first().sample) >>>>> ``` >>>>> >>>>> Please let me know if you have any additional questions. >>>>> >>>>> Thanks! >>>>> >>>>> - Mike >>>>> >>>>> -- >>>>> >>>>> Mike Dusenberry >>>>> GitHub: github.com/dusenberrymw >>>>> LinkedIn: linkedin.com/in/mikedusenberry >>>>> >>>>> Sent from my iPhone. >>>>> >>>>> >>>>>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia < >>>>> aishwarya2...@gmail.com> wrote: >>>>>> >>>>>> Hello sir, >>>>>> Can you please elaborate more on what output we would be getting >>> because >>>>> we >>>>>> tried executing the preprocess.py file using spark submit it keeps > on >>>>>> adding the tiles in rdd and while running the visualisation.py file >> it >>>>>> isn't showing any output. Can you please help us out asap stating > the >>>>>> output we will be getting and the sequence of execution of files. >>>>>> Thank you. >>>>>> >>>>>>> On 07-Apr-2017 5:54 AM, <dusenberr...@gmail.com> wrote: >>>>>>> >>>>>>> Hi Aishwarya, >>>>>>> >>>>>>> Thanks for sharing more info on the issue! >>>>>>> >>>>>>> To facilitate easier usage, I've updated the preprocessing code by >>>>> pulling >>>>>>> out most of the logic into a `breastcancer/preprocessing.py` >> module, >>>>>>> leaving just the execution in the `Preprocessing.ipynb` notebook. >>>>> There is >>>>>>> also a `preprocess.py` script with the same contents as the > notebook >>> for >>>>>>> use with `spark-submit`. The choice of the notebook or the script >> is >>>>> just >>>>>>> a matter of convenience, as they both import from the same >>>>>>> `breastcancer/preprocessing.py` package. >>>>>>> >>>>>>> As part of the updates, I've added an explicit SparkSession >> parameter >>>>>>> (`spark`) to the `preprocess(...)` function, and updated the body > to >>> use >>>>>>> this SparkSession object rather than the older SparkContext `sc` >>> object. >>>>>>> Previously, the `preprocess(...)` function accessed the `sc` object >>> that >>>>>>> was pulled in from the enclosing scope, which would work while all >> of >>>>> the >>>>>>> code was colocated within the notebook, but not if the code was >>>>> extracted >>>>>>> and imported. The explicit parameter now allows for the code to be >>>>>>> imported. >>>>>>> >>>>>>> Can you please try again with the latest updates? We are currently >>>>> using >>>>>>> Spark 2.x with Python 3. If you use the notebook, the pyspark >> kernel >>>>>>> should have a `spark` object available that can be supplied to the >>>>>>> functions (as is done now in the notebook), and if you use the >>>>>>> `preprocess.py` script with `spark-submit`, the `spark` object will >> be >>>>>>> created explicitly by the script. 
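>>>>>>> In other words, the `preprocess.py` script now does roughly the following (a sketch; everything other than the explicit `spark` parameter is abbreviated):
>>>>>>>
>>>>>>> ```
>>>>>>> from pyspark.sql import SparkSession
>>>>>>> from breastcancer.preprocessing import preprocess
>>>>>>>
>>>>>>> # Create the SparkSession explicitly and hand it to the preprocessing code,
>>>>>>> # rather than relying on an `sc` object pulled in from the enclosing scope.
>>>>>>> spark = SparkSession.builder.appName("breastcancer-preprocessing").getOrCreate()  # app name is illustrative
>>>>>>> # preprocess(..., spark=spark)  # remaining arguments omitted here
>>>>>>> ```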
>>>>>>> >>>>>>> For a bit of context to others, Aishwarya initially reached out to >>> find >>>>>>> out if our breast cancer project could be applied to TIFF images, >>> rather >>>>>>> than the SVS images we are currently using (the answer is "yes" so >>> long >>>>> as >>>>>>> they are "generic tiled TIFF images", according to the OpenSlide >>>>>>> documentation), and then followed up with Spark issues related to >> the >>>>>>> preprocessing code. This conversation has been promptly moved to >> the >>>>>>> mailing list so that others in the community can benefit. >>>>>>> >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> -Mike >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Mike Dusenberry >>>>>>> GitHub: github.com/dusenberrymw >>>>>>> LinkedIn: linkedin.com/in/mikedusenberry >>>>>>> >>>>>>> Sent from my iPhone. >>>>>>> >>>>>>> >>>>>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia < >>>>> aishwarya2...@gmail.com> >>>>>>> wrote: >>>>>>>> >>>>>>>> Hey, >>>>>>>> >>>>>>>> The object sc is already defined in pyspark and yet this name > error >>>>> keeps >>>>>>>> occurring. We are using spark 2.* >>>>>>>> >>>>>>>> Here is the link to error that we are getting : >>>>>>>> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE= >>>>>>> >>>>> >>> >>
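P.S. Since saving the trained model came up at the top of this message, here is that save step written out as a loop (a minimal sketch; the exact set of matrices depends on which "Train" section you ran, so the names below are only examples):

```
# Save each trained SystemML Matrix object as a Parquet file (names are illustrative).
trained = {"Wc1": Wc1, "bc1": bc1, "W": W, "b": b}  # whatever your "Train" section returned
for name, matrix in trained.items():
    matrix.toDF().write.save("path/to/save/{}.parquet".format(name), format="parquet")
```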