Hello sir, the NameError is occurring again. Why does it keep resurfacing?
Attaching the screenshot of the error.

On 25-Apr-2017 2:50 AM, <dusenberr...@gmail.com> wrote:

> Hi Aishwarya,
>
> For the error message, that just means that the SystemML jar isn't being found. Can you add a `--driver-class-path $SYSTEMML_HOME/target/SystemML.jar` to the invocation of Jupyter? I.e. `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --jars $SYSTEMML_HOME/target/SystemML.jar --driver-class-path $SYSTEMML_HOME/target/SystemML.jar`. There was a PySpark bug that was supposed to have been fixed in Spark 2.x, but it's possible that it is still an issue.
>
> As for the output, the notebook will create SystemML `Matrix` objects for all of the weights and biases of the trained models. To save, please convert each one to a DataFrame, i.e. `Wc1.toDF()`, repeated for each matrix, and then simply save the DataFrames. This could be done all at once like this for a SystemML Matrix object `Wc1`: `Wc1.toDF().write.save("path/to/save/Wc1.parquet", format="parquet")`. Just repeat for each matrix returned by the "Train" code for the algorithms. At that point, you will have a set of saved DataFrames representing a trained SystemML model, and these can be used in downstream classification tasks in a similar manner to the "Eval" sections.
>
> -Mike
>
> --
>
> Mike Dusenberry
> GitHub: github.com/dusenberrymw
> LinkedIn: linkedin.com/in/mikedusenberry
>
> Sent from my iPhone.
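For reference, a minimal sketch of the saving step described in the message above, assuming the "Train" sections have already produced SystemML `Matrix` objects such as `Wc1` and `bc1` and that a `spark` session is available; the matrix names and output paths here are illustrative, not taken from the notebook:

```
# Save every trained SystemML Matrix as a Parquet file (names/paths are illustrative).
weights = {"Wc1": Wc1, "bc1": bc1}  # extend with each matrix returned by the "Train" code

for name, mat in weights.items():
    # Matrix.toDF() converts the SystemML matrix into a Spark DataFrame.
    mat.toDF().write.save("models/convnet/{}.parquet".format(name), format="parquet")

# The saved DataFrames can later be reloaded for downstream classification/eval:
Wc1_df = spark.read.load("models/convnet/Wc1.parquet", format="parquet")
```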
> On Apr 24, 2017, at 3:07 AM, Aishwarya Chaurasia <aishwarya2...@gmail.com> wrote:
>
> > Furthermore:
> > What output are you obtaining from MachineLearning.ipynb, sir?
> > We are actually nearing our deadline for our problem.
> > Thanks a lot.
> >
> > On 24-Apr-2017 2:58 PM, "Aishwarya Chaurasia" <aishwarya2...@gmail.com> wrote:
> >
> > Hello sir,
> >
> > Thanks a lot for replying sir. But unfortunately it did not work. Although the NameError did not appear this time, another error came about:
> >
> > https://paste.fedoraproject.org/paste/TUMtSIb88Q73FYekwJmM7V5M1UNdIGYhyRLivL9gydE=
> >
> > This error was obtained after executing the second block of code of MachineLearning.py in the terminal (`ml = MLContext(sc)`).
> >
> > We have installed the bleeding-edge version of SystemML only, and the installation was done correctly. We are in a fix now. :/ Kindly look into the matter asap.
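The `ml = MLContext(sc)` call mentioned above wraps the existing SparkContext. As a minimal sanity check of the setup (a sketch, assuming the SystemML Python package is installed and `sc` is the SparkContext provided by the pyspark shell or Jupyter kernel), the following should run once the SystemML jar is on the classpath:

```
from systemml import MLContext, dml

ml = MLContext(sc)  # fails with the jar/classpath error discussed above if SystemML isn't visible

# Run a tiny DML script end-to-end to confirm SystemML is wired up correctly.
script = dml("s = sum(seq(1, 10))").output("s")
s = ml.execute(script).get("s")
print(s)  # expect 55.0
```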
> > On 24-Apr-2017 12:15 PM, "Mike Dusenberry" <dusenberr...@gmail.com> wrote:
> >
> > Hi Aishwarya,
> >
> > Glad to hear that the preprocessing stage was successful! As for the `MachineLearning.ipynb` notebook, here is a general guide:
> >
> > - The `MachineLearning.ipynb` notebook essentially (1) loads in the training and validation DataFrames from the preprocessing step, (2) converts them to normalized & one-hot encoded SystemML matrices for consumption by the ML algorithms, and (3) explores training a couple of models.
> > - To run, you'll need to start Jupyter in the context of PySpark via `PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --jars $SYSTEMML_HOME/target/SystemML.jar`. Note that if you have installed SystemML with pip from PyPI (`pip3 install systemml`), this will install our 0.13 release, and the `--jars $SYSTEMML_HOME/target/SystemML.jar` will not be necessary. If you instead have installed a bleeding-edge version of SystemML locally (git clone locally, maven build, `pip3 install -e src/main/python` as listed in `projects/breast_cancer/README.md`), the `--jars $SYSTEMML_HOME/target/SystemML.jar` part *is* necessary. We are about to release 0.14, and for this project, I *would* recommend using a bleeding-edge install.
> > - Once Jupyter has been started in the context of PySpark, the `sc` SparkContext object should be available. Please let me know if you continue to see this issue.
> > - The "Read in train & val data" section simply reads in the training and validation data generated in the preprocessing stage. Be sure that the `size` setting is the same as the preprocessing size. The percentage `p` setting determines whether the full or sampled DataFrames are loaded. If you set `p = 1`, the full DataFrames will be used. If you instead would prefer to use the smaller sampled DataFrames while getting started, please set it to the same value as used in the preprocessing to generate the smaller sampled DataFrames.
> > - The `Extract X & Y matrices` section splits each of the train and validation DataFrames into effectively X & Y matrices (still as DataFrame types), with X containing the images, and Y containing the labels.
> > - The `Convert to SystemML Matrices` section passes the X & Y DataFrames into a SystemML script that performs some normalization of the images & one-hot encoding of the labels, and then returns SystemML `Matrix` types. These are now ready to be passed into the subsequent algorithms.
> > - The "Trigger Caching" and "Save Matrices" sections are experimental features, and not necessary to execute.
> > - Next come the two algorithms being explored in this notebook. The "Softmax Classifier" is just a multi-class logistic regression model, and is simply there to serve as a baseline comparison with the subsequent convolutional neural net model. You may wish to simply skip this softmax model and move to the latter convnet model further down in the notebook.
> > - The actual softmax model is located at https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/softmax_clf.dml, and the notebook calls functions from that file.
> > - The softmax sanity check just ensures that the model is able to completely overfit when given a tiny sample size. This should yield ~100% training accuracy if the sample size in this section is small enough. This is just a check to ensure that nothing else is wrong with the math or the data.
> > - The softmax "Train" section will train a softmax model and return the weights (`W`) and biases (`b`) of the model as SystemML `Matrix` objects. Please adjust the hyperparameters in this section to your problem.
> > - The softmax "Eval" section takes the trained weights and biases and evaluates the training and validation performance.
> > - The next model is a LeNet-like convnet model. The actual model is located at https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/convnet.dml, and the notebook simply calls functions from that file.
> > - Once again, there is an initial sanity check for the ability to overfit on a small amount of data.
> > - The "Hyperparameter Search" section contains a script to sample different hyperparams for the convnet, and save the hyperparams + validation accuracy of each set after a single epoch of training. These string files will be saved to HDFS. Please feel free to adjust the range of the hyperparameters for your problem. Please also feel free to try using the `parfor` (parallel for-loop) instead of the while loop to speed up this section. Note that this is still a work in progress. The hyperparameter tuning in this section makes use of random search (as opposed to grid search), which has been promoted by Bengio et al. to speed up the search time.
> > - The "Train" section trains the convnet and returns the weights and biases as SystemML `Matrix` types. In this section, please replace the hyperparameters with the best ones from above, and please increase the number of epochs given your time constraints.
> > - The "Eval" section evaluates the performance of the trained convnet.
> > - Although it is not shown in the notebook yet, to save the weights and biases, please use the `toDF()` method on each weight and bias matrix (i.e. `Wc1.toDF()`) to convert it to a Spark DataFrame, and then simply save the DataFrame as desired.
> > - Finally, please feel free to extend the model in `convnet.dml` for your particular problem! The LeNet-like model just serves as a simple convnet, but there are much richer models currently, such as resnets, that we are experimenting with. To make larger models such as resnets easier to define, we are also working on other tools for converting model definitions + pretrained weights from other systems into SystemML.
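The "Hyperparameter Search" item in the list above uses random search. As a rough illustration of that idea (the hyperparameter names and ranges here are assumptions, not the notebook's actual values), each trial samples hyperparameters independently, trains briefly, and records the validation accuracy:

```
import random

def sample_hyperparams():
    # Learning rate and regularization sampled log-uniformly; batch size from a small set.
    return {
        "lr": 10 ** random.uniform(-5, -1),
        "reg": 10 ** random.uniform(-6, -2),
        "batch_size": random.choice([32, 64, 128]),
    }

for trial in range(20):
    hp = sample_hyperparams()
    print(trial, hp)
    # In the notebook, one epoch of training would run here with `hp`, and the
    # resulting validation accuracy + hyperparams would be written out (e.g. to HDFS).
```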
> > Also, please keep in mind that the deep learning support in SystemML is still a work in progress. Therefore, if you run into issues, please let us know and we'll do everything possible to help get things running!
> >
> > Thanks!
> >
> > - Mike
> >
> > --
> >
> > Michael W. Dusenberry
> > GitHub: github.com/dusenberrymw
> > LinkedIn: linkedin.com/in/mikedusenberry
> >
> > On Sat, Apr 22, 2017 at 4:49 AM, Aishwarya Chaurasia <aishwarya2...@gmail.com> wrote:
> >
> >> Hey,
> >>
> >> Thank you so much for your help sir. We were finally able to run preprocess.py without any errors. And the results obtained were satisfactory, i.e. we got five sets of data frames like you said we would.
> >>
> >> But alas, when we tried to run MachineLearning.ipynb the same NameError came:
> >> https://paste.fedoraproject.org/paste/l3LFJreg~vnYEDTSTQH73l5M1UNdIGYhyRLivL9gydE=
> >>
> >> Could you guide us again as to how to proceed now?
> >> Also, could you please provide an overview of the process MachineLearning.ipynb is following to train the samples.
> >>
> >> Thanks a lot!
> >>
> >>> On 20-Apr-2017 12:16 AM, <dusenberr...@gmail.com> wrote:
> >>>
> >>> Hi Aishwarya,
> >>>
> >>> Looks like you've just encountered an out-of-memory error on one of the executors. Therefore, you just need to adjust the `spark.executor.memory` and `spark.driver.memory` settings with higher amounts of RAM. What is your current setup? I.e. are you using a cluster of machines, or a single machine? We generally use a large driver on one machine, and then a single large executor on each other machine. I would give a sizable amount of memory to the driver, and about half the possible memory on the executors so that the Python processes have enough memory as well. PySpark has JVM and Python components, and the Spark memory settings only pertain to the JVM side, thus the need to save about half the executor memory for the Python side.
> >>>
> >>> Thanks!
> >>>
> >>> - Mike
> >>>
> >>> --
> >>>
> >>> Mike Dusenberry
> >>> GitHub: github.com/dusenberrymw
> >>> LinkedIn: linkedin.com/in/mikedusenberry
> >>>
> >>> Sent from my iPhone.
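A minimal sketch of one way to apply those memory settings when a script builds its own SparkSession; the figures are placeholders, sized so that roughly half of each executor's RAM is left unclaimed by Spark for the Python workers. Note that `spark.driver.memory` usually has to be supplied on the `spark-submit` command line or in `spark-defaults.conf`, since the driver JVM is already running by the time this code executes:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("breastcancer-preprocessing")
         .config("spark.executor.memory", "50g")  # JVM half of each executor's RAM (placeholder)
         .config("spark.driver.memory", "30g")    # see the note above about the driver
         .getOrCreate())
```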
> >>>> On Apr 19, 2017, at 5:53 AM, Aishwarya Chaurasia <aishwarya2...@gmail.com> wrote:
> >>>>
> >>>> Hello sir,
> >>>>
> >>>> We also wanted to ensure that the spark-submit command we're using is the correct one for running 'preprocess.py'.
> >>>> Command: /home/new/sparks/bin/spark-submit preprocess.py
> >>>>
> >>>> Thank you.
> >>>> Aishwarya Chaurasia.
> >>>>
> >>>> On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <aishwarya2...@gmail.com> wrote:
> >>>>
> >>>> Hello sir,
> >>>> On running the file preprocess.py we are getting the following error:
> >>>>
> >>>> https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIGYhyRLivL9gydE=
> >>>>
> >>>> Can you please help us by looking into the error and kindly tell us the solution for it.
> >>>> Thanks a lot.
> >>>> Aishwarya Chaurasia
> >>>>
> >>>>> On 19-Apr-2017 12:43 AM, <dusenberr...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Aishwarya,
> >>>>>
> >>>>> Certainly, here is some more detailed information about `preprocess.py`:
> >>>>>
> >>>>> * The preprocessing Python script is located at https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/preprocess.py. Note that this is different than the library module at https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py.
> >>>>> * This script is used to preprocess a set of histology slide images, which are `.svs` files in our case, and `.tiff` files in your case.
> >>>>> * Lines 63-79 contain "settings" such as the output image sizes, folder paths, etc. Of particular interest, line 72 has the folder path for the original slide images that should be commonly accessible from all machines being used, and lines 74-79 contain the names of the output DataFrames that will be saved.
> >>>>> * Line 82 performs the actual preprocessing and creates a Spark DataFrame with the following columns: slide number, tumor score, molecular score, sample. The "sample" in this case is the actual small, chopped-up section of the image that has been extracted and flattened into a row Vector. For test images without labels (`training=false`), only the slide number and sample will be contained in the DataFrame (i.e. no labels). This calls the `preprocess(...)` function located on line 371 of https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py, which is a different file.
> >>>>> * Line 87 simply saves the above DataFrame to HDFS with the name from line 74.
> >>>>> * Line 93 splits the above DataFrame row-wise into separate "training" and "validation" DataFrames, based on the split percentage from line 70 (`train_frac`). This is performed so that downstream machine learning tasks can learn from the training set, and validate performance and hyperparameter choices on the validation set. These DataFrames will start with the same columns as the above DataFrame. If `add_row_indices` from line 69 is true, then an additional row index column (`__INDEX`) will be prepended. This is useful for SystemML in downstream machine learning tasks as it gives the DataFrame row numbers like a real matrix would have, and SystemML is built to operate on matrices.
> >>>>> * Lines 97 & 98 simply save the training and validation DataFrames using the names defined on lines 76 & 78.
> >>>>> * Lines 103-137 create smaller train and validation DataFrames by taking small row-wise samples of the full train and validation DataFrames. The percentage of the sample is defined on line 111 (`p=0.01` for a 1% sample). This is generally useful for quicker downstream tasks without having to load in the larger DataFrames, assuming you have a large amount of data. For us, we have ~7TB of data, so having 1% sampled DataFrames is useful for quicker downstream tests. Once again, the same columns from the larger train and validation DataFrames will be used.
> >>>>> * Lines 146 & 147 simply save these sampled train and validation DataFrames.
> >>>>>
> >>>>> As a summary, after running `preprocess.py`, you will be left with the following saved DataFrames in HDFS:
> >>>>> * Full DataFrame
> >>>>> * Training DataFrame
> >>>>> * Validation DataFrame
> >>>>> * Sampled training DataFrame
> >>>>> * Sampled validation DataFrame
> >>>>>
> >>>>> As for visualization, you may visualize a "sample" (i.e. a small, chopped-up section of the original image) from a DataFrame by using the `breastcancer.visualization.visualize_sample(...)` function. You will need to do this after creating the DataFrames. Here is a snippet to visualize the first row sample in a DataFrame, where `df` is one of the DataFrames from above:
> >>>>>
> >>>>> ```
> >>>>> from breastcancer.visualization import visualize_sample
> >>>>> visualize_sample(df.first().sample)
> >>>>> ```
> >>>>>
> >>>>> Please let me know if you have any additional questions.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> - Mike
> >>>>>
> >>>>> --
> >>>>>
> >>>>> Mike Dusenberry
> >>>>> GitHub: github.com/dusenberrymw
> >>>>> LinkedIn: linkedin.com/in/mikedusenberry
> >>>>>
> >>>>> Sent from my iPhone.
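A short usage sketch building on the snippet in the email above: load one of the sampled DataFrames saved by `preprocess.py` and visualize its first sample. The Parquet path is illustrative; use whichever output name was configured on lines 74-79 of `preprocess.py`, and assume `spark` is the active SparkSession.

```
from breastcancer.visualization import visualize_sample

# Load a saved DataFrame (path is a placeholder) and show the first row's image tile.
df = spark.read.load("data/train_sample.parquet", format="parquet")
visualize_sample(df.first().sample)
```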
> >>>>>> On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <aishwarya2...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hello sir,
> >>>>>> Can you please elaborate more on what output we should be getting? We tried executing the preprocess.py file using spark-submit, and it keeps on adding the tiles to an RDD, and while running the visualisation.py file it isn't showing any output. Can you please help us out asap, stating the output we will be getting and the sequence of execution of the files.
> >>>>>> Thank you.
> >>>>>>
> >>>>>>> On 07-Apr-2017 5:54 AM, <dusenberr...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Aishwarya,
> >>>>>>>
> >>>>>>> Thanks for sharing more info on the issue!
> >>>>>>>
> >>>>>>> To facilitate easier usage, I've updated the preprocessing code by pulling out most of the logic into a `breastcancer/preprocessing.py` module, leaving just the execution in the `Preprocessing.ipynb` notebook. There is also a `preprocess.py` script with the same contents as the notebook for use with `spark-submit`. The choice of the notebook or the script is just a matter of convenience, as they both import from the same `breastcancer/preprocessing.py` package.
> >>>>>>>
> >>>>>>> As part of the updates, I've added an explicit SparkSession parameter (`spark`) to the `preprocess(...)` function, and updated the body to use this SparkSession object rather than the older SparkContext `sc` object. Previously, the `preprocess(...)` function accessed the `sc` object that was pulled in from the enclosing scope, which would work while all of the code was colocated within the notebook, but not if the code was extracted and imported. The explicit parameter now allows for the code to be imported.
> >>>>>>>
> >>>>>>> Can you please try again with the latest updates? We are currently using Spark 2.x with Python 3. If you use the notebook, the pyspark kernel should have a `spark` object available that can be supplied to the functions (as is done now in the notebook), and if you use the `preprocess.py` script with `spark-submit`, the `spark` object will be created explicitly by the script.
> >>>>>>>
> >>>>>>> For a bit of context to others, Aishwarya initially reached out to find out if our breast cancer project could be applied to TIFF images, rather than the SVS images we are currently using (the answer is "yes" so long as they are "generic tiled TIFF images," according to the OpenSlide documentation), and then followed up with Spark issues related to the preprocessing code. This conversation has been promptly moved to the mailing list so that others in the community can benefit.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> -Mike
> >>>>>>>
> >>>>>>> --
> >>>>>>>
> >>>>>>> Mike Dusenberry
> >>>>>>> GitHub: github.com/dusenberrymw
> >>>>>>> LinkedIn: linkedin.com/in/mikedusenberry
> >>>>>>>
> >>>>>>> Sent from my iPhone.
> >>>>>>>
> >>>>>>>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <aishwarya2...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hey,
> >>>>>>>>
> >>>>>>>> The object sc is already defined in pyspark and yet this NameError keeps occurring. We are using Spark 2.*
> >>>>>>>>
> >>>>>>>> Here is the link to the error that we are getting:
> >>>>>>>> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE=
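Regarding the `sc` NameError in the message above: a script launched with `spark-submit` does not get a predefined `sc` or `spark` object the way the pyspark shell or Jupyter kernel does, which is why the updated `preprocess.py` creates the session itself and passes it to `preprocess(...)` explicitly. A minimal sketch of that pattern (the app name is illustrative):

```
from pyspark.sql import SparkSession

# Create the SparkSession explicitly when running outside the pyspark shell/kernel.
spark = SparkSession.builder.appName("breastcancer-preprocessing").getOrCreate()
sc = spark.sparkContext  # only needed if an older SparkContext-based API is used
```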