[
https://issues.apache.org/jira/browse/SYSTEMML-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090865#comment-16090865
]
Mike Dusenberry commented on SYSTEMML-1185:
-------------------------------------------
Update: I merged a large number of changes that were in the [experimental
branch |
https://github.com/dusenberrymw/systemml/tree/breast_cancer_experimental2/projects/breast_cancer]
into master in [commit 532da1b |
https://github.com/apache/systemml/commit/532da1bc51fed65cd6c329b1c99c1926fe4cf2cd].
This includes our Keras experiments, updates to the preprocessing, shell
script helpers, file reorganization, and documentation improvements.
> SystemML Breast Cancer Project
> ------------------------------
>
> Key: SYSTEMML-1185
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1185
> Project: SystemML
> Issue Type: Epic
> Reporter: Mike Dusenberry
> Assignee: Mike Dusenberry
> Attachments: approach.svg
>
>
> h1. Predicting Breast Cancer Proliferation Scores with Apache Spark and
> Apache SystemML
> h3. Overview
> The [Tumor Proliferation Assessment Challenge 2016 (TUPAC16) |
> http://tupac.tue-image.nl/] is a "Grand Challenge" that was created for the
> [2016 Medical Image Computing and Computer Assisted Intervention (MICCAI
> 2016) | http://miccai2016.org/en/] conference. In this challenge, the goal
> is to develop state-of-the-art algorithms for automatic prediction of tumor
> proliferation scores from whole-slide histopathology images of breast tumors.
> h3. Background
> Breast cancer is the leading cause of cancer death in women in
> less-developed countries, and is the second leading cause of cancer death
> in developed countries, accounting for 29% of all cancers in women within the
> U.S. \[1]. Survival rates improve with earlier detection, giving
> pathologists and the medical community at large an incentive to develop
> improved methods for even earlier detection \[2]. There are many forms of
> breast cancer, including Ductal Carcinoma in Situ (DCIS), Invasive Ductal
> Carcinoma (IDC), Tubular Carcinoma of the Breast, Medullary Carcinoma of the
> Breast, Invasive Lobular Carcinoma, Inflammatory Breast Cancer, and several
> others \[3]. Across all of these forms, the rate at which breast cancer
> cells grow (proliferation) is a strong indicator of a patient's prognosis.
> Although there are many means of determining the presence of breast cancer,
> tumor proliferation speed has been shown to help pathologists determine the
> treatment for a patient. The most common technique for estimating
> proliferation speed is the mitotic count (mitotic index), in which a
> pathologist counts the dividing cell nuclei in hematoxylin and eosin (H&E)
> stained slide preparations to determine the number of mitotic bodies. From
> this count, the pathologist produces a proliferation score of 1, 2, or 3,
> ranging from better to worse prognosis \[4]. Unfortunately, this approach is
> known to have reproducibility problems due to the variability in counting,
> as well as the difficulty in distinguishing between different grades.
> References:
> \[1] http://emedicine.medscape.com/article/1947145-overview#a3
> \[2] http://emedicine.medscape.com/article/1947145-overview#a7
> \[3] http://emedicine.medscape.com/article/1954658-overview
> \[4] http://emedicine.medscape.com/article/1947145-workup#c12
> h3. Goal & Approach
> In an effort to automate the process of classification, this project aims to
> develop a large-scale deep learning approach for predicting tumor scores
> directly from the pixels of whole-slide histopathology images. Our proposed
> approach is based on a recent research paper from Stanford \[1]. Starting
> with 500 extremely high-resolution tumor slide images with accompanying score
> labels, we aim to make use of Apache Spark in a preprocessing step to cut and
> filter the images into smaller square samples, generating 4.7 million samples
> for a total of ~7TB of data \[2]. We then utilize Apache SystemML on top of
> Spark to develop and train a custom, large-scale, deep convolutional neural
> network on these samples, making use of the familiar linear algebra syntax
> and automatically-distributed execution of SystemML \[3]. Our model takes as
> input the pixel values of the individual samples, and is trained to predict
> the correct tumor score classification for each one. In addition to
> distributed linear algebra, we aim to exploit task-parallelism via parallel
> for-loops for hyperparameter optimization, as well as hardware acceleration
> for faster training via a GPU-backed runtime. Ultimately, we aim to develop
> a model that clearly outperforms existing approaches to breast cancer tumor
> proliferation score classification. (A rough sketch of the preprocessing
> tiling step follows the figure below.)
> References:
> \[1] https://web.stanford.edu/group/rubinlab/pubs/2243353.pdf
> \[2] See [{{Preprocessing.ipynb}} |
> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/Preprocessing.ipynb].
>
> \[3] See [{{MachineLearning.ipynb}} |
> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/MachineLearning.ipynb],
> [{{softmax_clf.dml}} |
> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/softmax_clf.dml],
> and [{{convnet.dml}} |
> https://github.com/apache/incubator-systemml/blob/master/projects/breast_cancer/convnet.dml].
>
> !approach.svg!
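> As a rough illustration of the preprocessing phase (see
> {{Preprocessing.ipynb}} for the actual pipeline), the sketch below cuts each
> slide into non-overlapping square tiles and filters out mostly-background
> tiles with Spark. The tile size, the background threshold, and the
> {{load_slide_as_array}} helper (a stand-in for a real slide reader such as
> openslide) are illustrative assumptions, not the project's actual values or
> helpers.
> {code:python}
> import numpy as np
> from pyspark import SparkContext
>
> sc = SparkContext.getOrCreate()
>
> TILE_SIZE = 256  # illustrative tile size; the real pipeline picks its own
>
> def load_slide_as_array(path):
>     # Hypothetical stand-in for a real slide reader (e.g. openslide);
>     # returns a dummy RGB array so this sketch runs end to end.
>     return np.random.randint(0, 256, size=(1024, 1024, 3), dtype=np.uint8)
>
> def tile_slide(slide_path):
>     # Slice one slide into non-overlapping TILE_SIZE x TILE_SIZE samples and
>     # keep only tiles that are not mostly white background.
>     img = load_slide_as_array(slide_path)
>     h, w, _ = img.shape
>     tiles = []
>     for y in range(0, h - TILE_SIZE + 1, TILE_SIZE):
>         for x in range(0, w - TILE_SIZE + 1, TILE_SIZE):
>             tile = img[y:y + TILE_SIZE, x:x + TILE_SIZE, :]
>             if tile.mean() < 240:  # crude background filter (illustrative threshold)
>                 tiles.append((slide_path, tile))
>     return tiles
>
> # Distribute the per-slide tiling across the cluster with Spark.
> slide_paths = ["slide_001.svs", "slide_002.svs"]  # placeholder paths
> samples = sc.parallelize(slide_paths).flatMap(tile_slide)
> print(samples.count())
> {code}
> In the actual pipeline the samples are further processed and saved for
> training (see {{Preprocessing.ipynb}} above); this sketch only shows the
> basic cut-and-filter pattern.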
> ----
> h2. Systems Tasks
> From a systems perspective, we aim to support multi-node, multi-GPU
> distributed SGD training in order to enable large-scale experiments for this
> specific breast cancer use case.
> To achieve this goal, the following steps are necessary (a conceptual sketch
> of the data-parallel pattern follows the list):
> # Single-node, CPU mini-batch SGD training (1 mini-batch at a time).
> # Single-node, single-GPU mini-batch SGD training (1 mini-batch at a time).
> # Single-node, multi-GPU data-parallel mini-batch SGD training (`n` parallel
> mini-batches for `n` GPUs at a time).
> # Multi-node, CPU data-parallel mini-batch SGD training (`n` parallel
> mini-batches for `n` parallel tasks at a time).
> # Multi-node, single-GPU data-parallel mini-batch SGD training (`n` parallel
> mini-batches for `n` total GPUs across the cluster at a time).
> # Multi-node, multi-GPU data-parallel mini-batch SGD training (`n` parallel
> mini-batches for `n` total GPUs across the cluster at a time).
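> The NumPy sketch below illustrates the data-parallel pattern shared by steps
> 3-6: `n` workers each compute a gradient on their own mini-batch, the
> gradients are averaged, and a single synchronous update is applied. It uses
> a toy linear model and synthetic data purely for illustration; it is not the
> SystemML runtime or the project's actual training loop.
> {code:python}
> import numpy as np
>
> def gradient(W, X, y):
>     # Gradient of mean squared error for a linear model; stands in for the
>     # convnet's backward pass in the real training loop.
>     return X.T @ (X @ W - y) / X.shape[0]
>
> rng = np.random.default_rng(0)
> W = np.zeros((10, 1))          # model parameters
> lr, n_workers = 0.01, 4        # learning rate and number of parallel workers
>
> for step in range(100):
>     grads = []
>     for k in range(n_workers):
>         # Each "worker" draws its own mini-batch; in SystemML these would be
>         # parallel tasks (e.g. one per GPU or per cluster node).
>         Xk = rng.standard_normal((32, 10))
>         yk = Xk @ np.ones((10, 1))           # synthetic targets
>         grads.append(gradient(W, Xk, yk))
>     # Synchronous update with the averaged gradient.
>     W -= lr * np.mean(grads, axis=0)
> {code}
> In SystemML, the inner loop over workers would map to parallel tasks across
> the cluster (and onto one or more GPUs per node in steps 3, 5, and 6) rather
> than a Python for loop.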
> ----
> Here is a list of past and present JIRA epics and issues that have blocked,
> or are currently blocking, progress on the breast cancer project.
>
> Overall Deep Learning Epic
> * https://issues.apache.org/jira/browse/SYSTEMML-540
> ** This is the overall "Deep Learning" JIRA epic, with all issues either
> within or related to the epic.
> Past
> * https://issues.apache.org/jira/browse/SYSTEMML-633
> * https://issues.apache.org/jira/browse/SYSTEMML-951
> ** Issue that completely blocked mini-batch training approaches.
> * https://issues.apache.org/jira/browse/SYSTEMML-914
> ** Epic containing issues related to input DataFrame conversions that
> blocked getting data into the system entirely. Most of the issues refer
> specifically to existing, internal converters. SYSTEMML-993 was a
> particularly large issue, and it triggered a large body of work on incorrect
> internal memory estimates. Also see SYSTEMML-919, SYSTEMML-946, &
> SYSTEMML-994.
> * https://issues.apache.org/jira/browse/SYSTEMML-1076
> * https://issues.apache.org/jira/browse/SYSTEMML-1077
> * https://issues.apache.org/jira/browse/SYSTEMML-948
> Present
> * https://issues.apache.org/jira/browse/SYSTEMML-1160
> ** Current open blocker to efficiently using a stochastic gradient descent
> approach.
> * https://issues.apache.org/jira/browse/SYSTEMML-1078
> ** Current open blocker to training even an initial deep learning model for
> the project. This is another example of an internal compiler bug.
> * https://issues.apache.org/jira/browse/SYSTEMML-686
> ** We need distributed convolution and max pooling operators.
> * https://issues.apache.org/jira/browse/SYSTEMML-1159
> ** This is the main issue discussing the need for the `parfor` construct to
> support efficient, parallel hyperparameter tuning on a cluster with large
> datasets (an illustrative sketch of this pattern follows this list). The
> broken remote parfor in SYSTEMML-1129 blocked this issue, which in turn
> blocked any meaningful work on training a deep neural net for the project.
> * https://issues.apache.org/jira/browse/SYSTEMML-1142
> ** This was one of the blockers to doing hyperparameter tuning.
> * https://issues.apache.org/jira/browse/SYSTEMML-1129
> ** This is an epic for the issue in which the `parfor` construct was broken
> for remote Spark cases, and was one of the blockers for doing hyperparameter
> tuning.
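> For reference, here is an illustrative sketch of the parfor-based
> hyperparameter search pattern that SYSTEMML-1159 and SYSTEMML-1129 are
> about, written against the {{MLContext}} Python API as used in
> {{MachineLearning.ipynb}} (method names such as {{toNumPy}} are from memory
> of that API and may differ slightly by SystemML version). The DML body is a
> toy placeholder that just records the learning rate instead of a real
> validation accuracy; only the parfor pattern itself is the point, and the
> candidate values are illustrative assumptions.
> {code:python}
> from pyspark import SparkContext
> from systemml import MLContext, dml
>
> sc = SparkContext.getOrCreate()
> ml = MLContext(sc)
>
> script_str = """
> lrs = matrix("0.001 0.01 0.1", rows=3, cols=1)   # candidate learning rates (illustrative)
> accs = matrix(0, rows=nrow(lrs), cols=1)
> parfor (i in 1:nrow(lrs)) {
>   lr = as.scalar(lrs[i, 1])
>   # ... train and validate a model with learning rate lr here ...
>   accs[i, 1] = lr   # placeholder "accuracy" so the sketch runs end to end
> }
> """
>
> script = dml(script_str).output("accs")
> accs = ml.execute(script).get("accs").toNumPy()
> print(accs)
> {code}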
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)