Re: Regarding incubator systemml/breast_cancer project

Aishwarya Chaurasia Wed, 19 Apr 2017 05:54:02 -0700

Hello sir,

We also wanted to ensure that the spark-submit command we're using is the
correct one for running 'preprocess.py'.
Command :  /home/new/sparks/bin/spark-submit preprocess.py



Thank you.
Aishwarya Chaurasia.

On 19-Apr-2017 3:55 PM, "Aishwarya Chaurasia" <aishwarya2...@gmail.com>
wrote:

Hello sir,
On running the file preprocess.py we are getting the following error :

https://paste.fedoraproject.org/paste/IAvqiiyJChSC0V9eeETe2F5M1UNdIGYhyRLivL9gydE=

Could you please look into the error and tell us how to resolve it?
Thanks a lot.
Aishwarya Chaurasia


On 19-Apr-2017 12:43 AM, <dusenberr...@gmail.com> wrote:

> Hi Aishwarya,
>
> Certainly, here is some more detailed information about `preprocess.py`:
>
>   * The preprocessing Python script is located at
> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/preprocess.py.
> Note that this is different than the library module at
> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py.
>   * This script is used to preprocess a set of histology slide images,
> which are `.svs` files in our case, and `.tiff` files in your case.
>   * Lines 63-79 contain "settings" such as the output image sizes, folder
> paths, etc.  Of particular interest, line 72 has the folder path for the
> original slide images that should be commonly accessible from all machines
> being used, and lines 74-79 contain the names of the output DataFrames that
> will be saved.
>   * Line 82 performs the actual preprocessing and creates a Spark
> DataFrame with the following columns: slide number, tumor score, molecular
> score, sample.  The "sample" in this case is the actual small, chopped-up
> section of the image that has been extracted and flattened into a row
> Vector.  For test images without labels (`training=false`), only the slide
> number and sample will be contained in the DataFrame (i.e. no labels).
> This calls the `preprocess(...)` function located on line 371 of
> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/breastcancer/preprocessing.py,
> which is a different file.
>   * Line 87 simply saves the above DataFrame to HDFS with the name from
> line 74.
>   * Line 93 splits the above DataFrame row-wise into separate "training"
> and "validation" DataFrames, based on the split percentage from line 70
> (`train_frac`).  This is performed so that downstream machine learning
> tasks can learn from the training set, and validate performance and
> hyperparameter choices on the validation set.  These DataFrames will start
> with the same columns as the above DataFrame.  If `add_row_indices` from
> line 69 is true, then an additional row index column (`__INDEX`) will be
> prepended.  This is useful for SystemML in downstream machine learning
> tasks as it gives the DataFrame row numbers like a real matrix would have,
> and SystemML is built to operate on matrices.
>   * Lines 97 & 98 simply save the training and validation DataFrames using
> the names defined on lines 76 & 78.
>   * Lines 103-137 create smaller train and validation DataFrames by taking
> small row-wise samples of the full train and validation DataFrames.  The
> percentage of the sample is defined on line 111 (`p=0.01` for a 1%
> sample).  This is generally useful for quicker downstream tasks without
> having to load in the larger DataFrames, assuming you have a large amount
> of data.  For us, we have ~7TB of data, so having 1% sampled DataFrames is
> useful for quicker downstream tests.  Once again, the same columns from the
> larger train and validation DataFrames will be used.
>   * Lines 146 & 147 simply save these sampled train and validation
> DataFrames.
>
> As a summary, after running `preprocess.py`, you will be left with the
> following saved DataFrames in HDFS:
>   * Full DataFrame
>   * Training DataFrame
>   * Validation DataFrame
>   * Sampled training DataFrame
>   * Sampled validation DataFrame
>
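The split, indexing, and sampling steps described above can be sketched in plain Python. This is only a stand-in to show the shape of the five outputs, not the project's Spark code: the helper names, the `train_frac=0.8` value, the seed, the toy rows, and the 1-based index are all assumptions made for illustration.

```python
import random


def train_val_split(rows, train_frac, seed):
    """Send each row to the training set with probability train_frac,
    otherwise to the validation set (a row-wise split)."""
    rng = random.Random(seed)
    train, val = [], []
    for row in rows:
        (train if rng.random() < train_frac else val).append(row)
    return train, val


def add_row_index(rows, start=1):
    """Prepend a row index to every row, standing in for the `__INDEX`
    column.  Starting at 1 is an assumption of this sketch."""
    return [(i, *row) for i, row in enumerate(rows, start=start)]


def sample_rows(rows, p, seed):
    """Keep each row independently with probability p (p=0.01 is a ~1% sample)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]


# Toy "full" dataset: (slide_num, tumor_score, molecular_score, sample)
full = [(n, 1 + n % 3, 0.1 * (n % 5), [0.0] * 4) for n in range(1, 1001)]

train, val = train_val_split(full, train_frac=0.8, seed=42)
train = add_row_index(train)
val = add_row_index(val)

# The sampled sets keep the same columns as the larger sets.
train_sampled = sample_rows(train, p=0.01, seed=42)
val_sampled = sample_rows(val, p=0.01, seed=42)
```

In this sketch the sampled sets inherit the index values of the rows they were drawn from; whether the real script renumbers them is not something the sketch tries to reproduce.
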
> As for visualization, you may visualize a "sample" (i.e. small, chopped-up
> section of original image) from a DataFrame by using the
> `breastcancer.visualization.visualize_sample(...)` function.  You will
> need to do this after creating the DataFrames.  Here is a snippet to
> visualize the first row sample in a DataFrame, where `df` is one of the
> DataFrames from above:
>
> ```
> from breastcancer.visualization import visualize_sample
> visualize_sample(df.first().sample)
> ```
>
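To make the "flattened into a row Vector" idea concrete, here is a minimal plain-Python sketch of flattening a small tile and restoring it. The height x width x channels ordering and the helper names are assumptions for illustration only; the project may store the values in a different order, so treat `visualize_sample` as the authoritative way to view a real sample.

```python
def flatten_tile(tile):
    """Flatten a height x width x channels nested list into one row vector
    (row-major order; this layout is an assumption of the sketch)."""
    return [value for row in tile for pixel in row for value in pixel]


def unflatten_tile(vector, height, width, channels):
    """Rebuild the nested height x width x channels structure from a row vector."""
    if len(vector) != height * width * channels:
        raise ValueError("vector length does not match the requested shape")
    values = iter(vector)
    return [
        [[next(values) for _ in range(channels)] for _ in range(width)]
        for _ in range(height)
    ]


# A toy 2 x 3 tile with 3 channels per pixel.
tile = [[[r, c, r + c] for c in range(3)] for r in range(2)]
vector = flatten_tile(tile)
restored = unflatten_tile(vector, height=2, width=3, channels=3)
```
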
> Please let me know if you have any additional questions.
>
> Thanks!
>
> - Mike
>
> --
>
> Mike Dusenberry
> GitHub: github.com/dusenberrymw
> LinkedIn: linkedin.com/in/mikedusenberry
>
> Sent from my iPhone.
>
>
> > On Apr 15, 2017, at 4:38 AM, Aishwarya Chaurasia
> > <aishwarya2...@gmail.com> wrote:
> >
> > Hello sir,
> > Can you please elaborate on what output we should expect?  We tried
> > executing the preprocess.py file using spark-submit, but it keeps on
> > adding the tiles to an RDD, and running the visualisation.py file
> > isn't showing any output.  Can you please help us out ASAP by stating
> > the output we should be getting and the sequence in which to execute
> > the files?
> > Thank you.
> >
> >> On 07-Apr-2017 5:54 AM, <dusenberr...@gmail.com> wrote:
> >>
> >> Hi Aishwarya,
> >>
> >> Thanks for sharing more info on the issue!
> >>
> >> To facilitate easier usage, I've updated the preprocessing code by
> >> pulling out most of the logic into a `breastcancer/preprocessing.py`
> >> module, leaving just the execution in the `Preprocessing.ipynb`
> >> notebook.  There is also a `preprocess.py` script with the same contents
> >> as the notebook for use with `spark-submit`.  The choice of the notebook
> >> or the script is just a matter of convenience, as they both import from
> >> the same `breastcancer/preprocessing.py` module.
> >>
> >> As part of the updates, I've added an explicit SparkSession parameter
> >> (`spark`) to the `preprocess(...)` function, and updated the body to use
> >> this SparkSession object rather than the older SparkContext `sc` object.
> >> Previously, the `preprocess(...)` function accessed the `sc` object that
> >> was pulled in from the enclosing scope, which would work while all of
> >> the code was colocated within the notebook, but not if the code was
> >> extracted and imported.  The explicit parameter now allows for the code
> >> to be imported.
> >>
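The scoping issue described above can be reproduced without Spark. The sketch below builds two tiny in-memory modules: one whose function relies on a global `sc`, and one that takes the session as an explicit parameter. `FakeSession`, its `distribute` method, `load_module`, and the module names are all hypothetical stand-ins and not part of any Spark or SystemML API.

```python
import types


class FakeSession:
    """Hypothetical stand-in for a Spark session object."""

    def distribute(self, items):
        # Placeholder for "hand the work to the cluster".
        return list(items)


def load_module(name, source):
    """Create a module from source text, mimicking code moved into its own file."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module


# Library function that silently depends on a global named `sc`.
implicit_lib = load_module("implicit_lib", """
def preprocess(slide_nums):
    return sc.distribute(slide_nums)
""")

# Library function that receives the session as an explicit parameter.
explicit_lib = load_module("explicit_lib", """
def preprocess(spark, slide_nums):
    return spark.distribute(slide_nums)
""")

# `sc` exists in the caller's namespace, as it does in a notebook or shell...
sc = FakeSession()

# ...but an imported function looks up globals in its own module, not the caller's.
try:
    implicit_lib.preprocess([1, 2, 3])
    implicit_error = None
except NameError as exc:
    implicit_error = str(exc)

# Passing the session explicitly works regardless of where the code lives.
explicit_result = explicit_lib.preprocess(sc, [1, 2, 3])
```

Here `implicit_error` ends up holding a `NameError` message about `sc` even though the caller defined `sc`, which is the same kind of failure reported earlier in this thread.
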
> >> Can you please try again with the latest updates?  We are currently
> >> using Spark 2.x with Python 3.  If you use the notebook, the pyspark
> >> kernel should have a `spark` object available that can be supplied to
> >> the functions (as is done now in the notebook), and if you use the
> >> `preprocess.py` script with `spark-submit`, the `spark` object will be
> >> created explicitly by the script.
> >>
> >> For a bit of context to others, Aishwarya initially reached out to find
> >> out if our breast cancer project could be applied to TIFF images, rather
> >> than the SVS images we are currently using (the answer is "yes" so long
> >> as they are "generic tiled TIFF" images, according to the OpenSlide
> >> documentation), and then followed up with Spark issues related to the
> >> preprocessing code.  This conversation has been promptly moved to the
> >> mailing list so that others in the community can benefit.
> >>
> >>
> >> Thanks!
> >>
> >> -Mike
> >>
> >> --
> >>
> >> Mike Dusenberry
> >> GitHub: github.com/dusenberrymw
> >> LinkedIn: linkedin.com/in/mikedusenberry
> >>
> >> Sent from my iPhone.
> >>
> >>
> >>> On Apr 6, 2017, at 5:09 AM, Aishwarya Chaurasia
> >>> <aishwarya2...@gmail.com> wrote:
> >>>
> >>> Hey,
> >>>
> >>> The object sc is already defined in pyspark and yet this name error
> >>> keeps occurring. We are using spark 2.*
> >>>
> >>> Here is the link to error that we are getting :
> >>> https://paste.fedoraproject.org/paste/89iQODxzpNZVbSfgwocH8l5M1UNdIGYhyRLivL9gydE=
> >>
>
