[
https://issues.apache.org/jira/browse/SYSTEMML-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mike Dusenberry updated SYSTEMML-1185:
--------------------------------------
Description:
h1. Predicting Breast Cancer Proliferation Scores with Apache Spark and Apache
SystemML
h3. Overview
The [Tumor Proliferation Assessment Challenge 2016 (TUPAC16) |
http://tupac.tue-image.nl/] is a "Grand Challenge" that was created for the
[2016 Medical Image Computing and Computer Assisted Intervention (MICCAI 2016)
| http://miccai2016.org/en/] conference. In this challenge, the goal is to
develop state-of-the-art algorithms for automatic prediction of tumor
proliferation scores from whole-slide histopathology images of breast tumors.
h3. Background
Breast cancer is the leading cause of cancer death in women in
less-developed countries, and is the second leading cause of cancer death in
developed countries, accounting for 29% of all cancers in women within the
U.S. \[1]. Survival rates improve with earlier detection, giving pathologists
and the medical world at large an incentive to develop improved methods for
even earlier detection \[2]. There are many forms of breast cancer,
including Ductal Carcinoma in Situ (DCIS), Invasive Ductal Carcinoma (IDC),
Tubular Carcinoma of the Breast, Medullary Carcinoma of the Breast, Invasive
Lobular Carcinoma, Inflammatory Breast Cancer, and several others \[3]. Across
all of these forms of breast cancer, the rate at which breast cancer cells grow
(proliferation) is a strong indicator of a patient’s prognosis. Although there
are many means of determining the presence of breast cancer, tumor
proliferation speed has been shown to help pathologists determine the
appropriate treatment for the patient. The most common technique for determining the
proliferation speed is through mitotic count (mitotic index) estimates, in
which a pathologist counts the dividing cell nuclei in hematoxylin and eosin
(H&E) stained slide preparations to determine the number of mitotic bodies.
Based on this count, the pathologist produces a proliferation score of 1, 2, or
3, ranging from better to worse prognosis \[4]. Unfortunately, this approach is
known to have reproducibility problems due to the variability in counting, as
well as the difficulty in distinguishing between different grades.
References:
\[1] http://emedicine.medscape.com/article/1947145-overview#a3
\[2] http://emedicine.medscape.com/article/1947145-overview#a7
\[3] http://emedicine.medscape.com/article/1954658-overview
\[4] http://emedicine.medscape.com/article/1947145-workup#c12
h3. Goal & Approach
In an effort to automate the process of classification, this project aims to
develop a large-scale deep learning approach for predicting tumor scores
directly from the pixels of whole-slide histopathology images. Our proposed
approach is based on a recent research paper from Stanford \[1]. Starting with
500 extremely high-resolution tumor slide images with accompanying score
labels, we aim to make use of Apache Spark in a preprocessing step to cut and
filter the images into smaller square samples, generating 4.7 million samples
for a total of ~7TB of data \[2]. We then utilize Apache SystemML on top of
Spark to develop and train a custom, large-scale, deep convolutional neural
network on these samples, making use of the familiar linear algebra syntax and
automatically-distributed execution of SystemML \[3]. Our model takes as input
the pixel values of the individual samples, and is trained to predict the
correct tumor score classification for each one. In addition to distributed
linear algebra, we aim to exploit task-parallelism via parallel for-loops for
hyperparameter optimization, as well as hardware acceleration for faster
training via a GPU-backed runtime. Ultimately, we aim to develop a model that
meaningfully outperforms existing approaches for the task of breast cancer
tumor proliferation score classification.
References:
\[1] https://web.stanford.edu/group/rubinlab/pubs/2243353.pdf
\[2] See [{{Preprocessing.ipynb}} |
https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/Preprocessing.ipynb].
\[3] See [{{MachineLearning.ipynb}} |
https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/MachineLearning.ipynb],
[{{softmax_clf.dml}} |
https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/softmax_clf.dml],
and [{{convnet.dml}} |
https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/convnet.dml].
!approach.svg!
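The tiling idea behind the preprocessing step can be sketched in plain Python. The {{tile_image}} helper and the 256-pixel tile size below are hypothetical choices for illustration; in the actual pipeline a per-image function of this kind would be applied across slides via Spark (see {{Preprocessing.ipynb}} for the real implementation).

```python
import numpy as np

def tile_image(img, tile_size=256):
    """Cut an (H, W, C) image array into non-overlapping square tiles.

    Tiles that would extend past the image border are discarded, loosely
    mirroring the cutting and filtering of samples in the preprocessing step.
    """
    h, w = img.shape[:2]
    tiles = []
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            tiles.append(img[y:y + tile_size, x:x + tile_size])
    return tiles

# A tiny stand-in for one high-resolution slide: 1000x1000 RGB pixels.
slide = np.zeros((1000, 1000, 3), dtype=np.uint8)
samples = tile_image(slide, tile_size=256)
# floor(1000 / 256) = 3 full tiles fit along each axis, so 9 tiles total.
print(len(samples))  # -> 9
```

In the real pipeline each whole-slide image is orders of magnitude larger, which is why the per-slide work is distributed with Spark rather than run on a single machine.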
----
h2. Systems Tasks
From a systems perspective, we aim to support multi-node, multi-GPU
distributed SGD training in order to enable large-scale experiments for the
specific breast cancer use case.
To achieve this goal, the following steps are necessary:
# Single-node, CPU mini-batch SGD training (1 mini-batch at a time).
# Single-node, single-GPU mini-batch SGD training (1 mini-batch at a time).
# Single-node, multi-GPU data-parallel mini-batch SGD training (`n` parallel
mini-batches for `n` GPUs at a time).
# Multi-node, CPU data-parallel mini-batch SGD training (`n` parallel
mini-batches for `n` parallel tasks at a time).
# Multi-node, single-GPU data-parallel mini-batch SGD training (`n` parallel
mini-batches for `n` total GPUs across the cluster at a time).
# Multi-node, multi-GPU data-parallel mini-batch SGD training (`n` parallel
mini-batches for `n` total GPUs across the cluster at a time).
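As a rough illustration of the data-parallel pattern in the steps above (not SystemML's actual runtime), here is a minimal NumPy sketch: each of `n` simulated workers computes a gradient on its own mini-batch of a toy least-squares problem, and the gradients are averaged before a single parameter update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: features X, targets y, linear model w.
X = rng.standard_normal((64, 5))
true_w = np.arange(5, dtype=float)
y = X @ true_w

def gradient(w, Xb, yb):
    """Gradient of the mean squared error 0.5 * ||Xb w - yb||^2 / len(yb)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

def data_parallel_sgd_step(w, batches, lr=0.1):
    """One data-parallel step: n mini-batches -> n gradients -> averaged update."""
    grads = [gradient(w, Xb, yb) for Xb, yb in batches]  # one per "GPU"/task
    return w - lr * np.mean(grads, axis=0)

w = np.zeros(5)
for _ in range(200):
    # Split the data into 4 parallel mini-batches (simulating 4 workers).
    idx = rng.permutation(len(X)).reshape(4, -1)
    batches = [(X[i], y[i]) for i in idx]
    w = data_parallel_sgd_step(w, batches)

print(np.round(w, 2))  # converges toward [0., 1., 2., 3., 4.]
```

The point of the sketch is the communication pattern: gradient computation parallelizes cleanly across mini-batches, and only the averaged gradient is needed for the shared update, which is what makes the multi-GPU and multi-node variants above natural extensions of the single-node case.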
----
Here is a list of past and present JIRA epics and issues that have blocked, or
are currently blocking, progress on the breast cancer project.
Overall Deep Learning Epic
* https://issues.apache.org/jira/browse/SYSTEMML-540
** This is the overall "Deep Learning" JIRA epic, with all issues either
within or related to the epic.
Past
* https://issues.apache.org/jira/browse/SYSTEMML-633
* https://issues.apache.org/jira/browse/SYSTEMML-951
** Issue that completely blocked mini-batch training approaches.
* https://issues.apache.org/jira/browse/SYSTEMML-914
** Epic containing issues related to input DataFrame conversions that blocked
getting data into the system entirely. Most of the issues specifically refer
to existing, internal converters. 993 was a particularly large issue, and
triggered a large body of work related to internal memory estimates that were
incorrect. Also see 919, 946, & 994.
* https://issues.apache.org/jira/browse/SYSTEMML-1076
* https://issues.apache.org/jira/browse/SYSTEMML-1077
* https://issues.apache.org/jira/browse/SYSTEMML-948
Present
* https://issues.apache.org/jira/browse/SYSTEMML-1160
** Current open blocker to efficiently using a stochastic gradient descent
approach.
* https://issues.apache.org/jira/browse/SYSTEMML-1078
** Current open blocker to training even an initial deep learning model for
the project. This is another example of an internal compiler bug.
* https://issues.apache.org/jira/browse/SYSTEMML-686
** We need distributed convolution and max pooling operators.
* https://issues.apache.org/jira/browse/SYSTEMML-1159
** This is the main issue that discusses the need for the `parfor` construct
to support efficient, parallel hyperparameter tuning on a cluster with large
datasets. The broken remote parfor in 1129 blocked this issue, which in turn
blocked any meaningful work on training a deep neural net for the project.
* https://issues.apache.org/jira/browse/SYSTEMML-1142
** This was one of the blockers to doing hyperparameter tuning.
* https://issues.apache.org/jira/browse/SYSTEMML-1129
** This is an epic for the issue in which the `parfor` construct was broken
for remote Spark cases, and was one of the blockers for doing hyperparameter
tuning.
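The task-parallel hyperparameter tuning that 1159 calls for can be sketched in plain Python as an analogy to DML's `parfor` (this is not SystemML's construct, just the same pattern): every candidate configuration is evaluated independently, so the evaluations can run in parallel. The {{evaluate}} function below is a hypothetical stand-in for one full training run.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(lr):
    """Stand-in for one training run: in the real project this would train a
    model with the given learning rate and return its validation accuracy.
    Here, a toy objective with a known best value at lr = 0.1."""
    return -(lr - 0.1) ** 2

learning_rates = [0.001, 0.01, 0.1, 1.0]

# Evaluate every candidate independently and in parallel -- the same
# independence that lets `parfor` distribute iterations across a cluster.
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(evaluate, learning_rates))

best_lr = learning_rates[scores.index(max(scores))]
print(best_lr)  # -> 0.1
```

Because each iteration touches only its own configuration, a working remote `parfor` lets SystemML run these trials as parallel Spark tasks over the full dataset, which is exactly what the issues above were blocking.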
was:
This issue tracks the new SystemML breast cancer project!
> SystemML Breast Cancer Project
> ------------------------------
>
> Key: SYSTEMML-1185
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1185
> Project: SystemML
> Issue Type: New Feature
> Reporter: Mike Dusenberry
> Assignee: Mike Dusenberry
> Attachments: approach.svg
>
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)