Hello sir,

We also wanted to ensure that the spark-submit command we're using is the correct one for running 'preprocess.py'.

Command: /home/new/sparks/bin/spark-submit preprocess.py

Thank you.
Aishwarya Chaurasia

On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <aishwarya2...@gmail.com> wrote:

Hello sir,

On running the file preprocess.py we are getting the following error:
https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIGYhyRLivL9gydE=

Can you please help us by looking into the error and telling us the solution for it?

Thanks a lot.
Aishwarya Chaurasia

On 19-Apr-2017 12:43 AM, <dusenberr...@gmail.com> wrote:

> Hi Aishwarya,
>
> Certainly, here is some more detailed information about `preprocess.py`:
>
> * The preprocessing Python script is located at https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/preprocess.py. Note that this is different from the library module at https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py.
> * This script is used to preprocess a set of histology slide images, which are `.svs` files in our case, and `.tiff` files in your case.
> * Lines 63-79 contain "settings" such as the output image sizes, folder paths, etc. Of particular interest, line 72 has the folder path for the original slide images that should be commonly accessible from all machines being used, and lines 74-79 contain the names of the output DataFrames that will be saved.
> * Line 82 performs the actual preprocessing and creates a Spark DataFrame with the following columns: slide number, tumor score, molecular score, sample. The "sample" in this case is the actual small, chopped-up section of the image that has been extracted and flattened into a row Vector. For test images without labels (`training=False`), only the slide number and sample will be contained in the DataFrame (i.e. no labels). This line calls the `preprocess(...)` function located on line 371 of https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py, which is a different file.
> * Line 87 simply saves the above DataFrame to HDFS with the name from line 74.
> * Line 93 splits the above DataFrame row-wise into separate "training" and "validation" DataFrames, based on the split percentage from line 70 (`train_frac`). This is performed so that downstream machine learning tasks can learn from the training set, and validate performance and hyperparameter choices on the validation set. These DataFrames will start with the same columns as the above DataFrame. If `add_row_indices` from line 69 is true, then an additional row index column (`__INDEX`) will be prepended. This is useful for SystemML in downstream machine learning tasks as it gives the DataFrame row numbers like a real matrix would have, and SystemML is built to operate on matrices.
> * Lines 97 & 98 simply save the training and validation DataFrames using the names defined on lines 76 & 78.
> * Lines 103-137 create smaller train and validation DataFrames by taking small row-wise samples of the full train and validation DataFrames. The percentage of the sample is defined on line 111 (`p=0.01` for a 1% sample). This is generally useful for quicker downstream tasks without having to load in the larger DataFrames, assuming you have a large amount of data. For us, we have ~7TB of data, so having 1% sampled DataFrames is useful for quicker downstream tests. Once again, the same columns from the larger train and validation DataFrames will be used.
> * Lines 146 & 147 simply save these sampled train and validation DataFrames.
>
> As a summary, after running `preprocess.py`, you will be left with the following saved DataFrames in HDFS:
> * Full DataFrame
> * Training DataFrame
> * Validation DataFrame
> * Sampled training DataFrame
> * Sampled validation DataFrame
>
> As for visualization, you may visualize a "sample" (i.e. small, chopped-up section of original image) from a DataFrame by using the `breastcancer.visualization.visualize_sample(...)` function. You will need to do this after creating the DataFrames. Here is a snippet to visualize the first row sample in a DataFrame, where `df` is one of the DataFrames from above:
>
> ```
> from breastcancer.visualization import visualize_sample
> visualize_sample(df.first().sample)
> ```
>
> Please let me know if you have any additional questions.
>
> Thanks!
>
> - Mike
>
> --
>
> Mike Dusenberry
> GitHub: github.com/dusenberrymw
> LinkedIn: linkedin.com/in/mikedusenberry
>
> Sent from my iPhone.
>
> > On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia <aishwarya2...@gmail.com> wrote:
> >
> > Hello sir,
> > Can you please elaborate more on what output we should be getting? We tried executing the preprocess.py file using spark-submit, but it keeps on adding the tiles to the RDD, and running the visualisation.py file doesn't show any output. Can you please help us out asap by stating the output we should be getting and the sequence of execution of the files?
> > Thank you.
> >
> >> On 07-Apr-2017 5:54 AM, <dusenberr...@gmail.com> wrote:
> >>
> >> Hi Aishwarya,
> >>
> >> Thanks for sharing more info on the issue!
> >>
> >> To facilitate easier usage, I've updated the preprocessing code by pulling out most of the logic into a `breastcancer/preprocessing.py` module, leaving just the execution in the `Preprocessing.ipynb` notebook. There is also a `preprocess.py` script with the same contents as the notebook for use with `spark-submit`. The choice of the notebook or the script is just a matter of convenience, as they both import from the same `breastcancer/preprocessing.py` module.
> >>
> >> As part of the updates, I've added an explicit SparkSession parameter (`spark`) to the `preprocess(...)` function, and updated the body to use this SparkSession object rather than the older SparkContext `sc` object. Previously, the `preprocess(...)` function accessed the `sc` object that was pulled in from the enclosing scope, which would work while all of the code was colocated within the notebook, but not if the code was extracted and imported. The explicit parameter now allows for the code to be imported.
> >>
> >> Can you please try again with the latest updates? We are currently using Spark 2.x with Python 3. If you use the notebook, the pyspark kernel should have a `spark` object available that can be supplied to the functions (as is done now in the notebook), and if you use the `preprocess.py` script with `spark-submit`, the `spark` object will be created explicitly by the script.
> >>
> >> For a bit of context to others, Aishwarya initially reached out to find out if our breast cancer project could be applied to TIFF images, rather than the SVS images we are currently using (the answer is "yes" so long as they are "generic tiled TIFF" images, according to the OpenSlide documentation), and then followed up with Spark issues related to the preprocessing code. This conversation has been promptly moved to the mailing list so that others in the community can benefit.
> >>
> >> Thanks!
> >>
> >> -Mike
> >>
> >> --
> >>
> >> Mike Dusenberry
> >> GitHub: github.com/dusenberrymw
> >> LinkedIn: linkedin.com/in/mikedusenberry
> >>
> >> Sent from my iPhone.
> >>
> >>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia <aishwarya2...@gmail.com> wrote:
> >>>
> >>> Hey,
> >>>
> >>> The object sc is already defined in pyspark and yet this name error keeps occurring. We are using Spark 2.*
> >>>
> >>> Here is the link to the error that we are getting:
> >>> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE=
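
A note on the `sc` NameError discussed in this thread: the behaviour can be reproduced without Spark at all. The sketch below is only an illustration under stated assumptions. `FakeSession`, `read_paths`, `preprocess_implicit`, and `preprocess_explicit` are hypothetical stand-ins invented here; they are not the project's code and not part of Spark's API. It shows that a function defined in a separate module looks up `sc` in that module's own globals, so an `sc` that exists in the caller's scope (as in a pyspark shell or notebook) is not visible to it, whereas a session passed as an explicit argument works.

```python
import types

# Source of a simulated library module (playing the role of an imported
# preprocessing module). All names here are hypothetical.
library_source = '''
def preprocess_implicit(paths):
    # `sc` is resolved in THIS module's globals, not in the caller's scope.
    return sc.read_paths(paths)

def preprocess_explicit(spark, paths):
    # The session is passed in, so no global lookup is needed.
    return spark.read_paths(paths)
'''

# Build the module object and execute the source inside its namespace,
# which mimics what `import` does for a real file.
library = types.ModuleType("library")
exec(library_source, library.__dict__)


class FakeSession:
    """Stand-in for a Spark entry point; it just echoes what it is given."""

    def read_paths(self, paths):
        return list(paths)


# Defined in the caller's scope, the way a pyspark shell predefines `sc`.
sc = FakeSession()

# Implicit version: fails, because `sc` does not exist inside `library`.
try:
    library.preprocess_implicit(["slide_1.tiff"])
    implicit_error = None
except NameError as err:
    implicit_error = str(err)

# Explicit version: works, because the session is handed over as an argument.
explicit_result = library.preprocess_explicit(sc, ["slide_1.tiff"])

print("implicit:", implicit_error)
print("explicit:", explicit_result)
```

This matches the reasoning given in the thread for adding the explicit `spark` parameter to `preprocess(...)`: code that relied on a name from the enclosing notebook scope stops working once it is extracted into a module and imported.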