[ https://issues.apache.org/jira/browse/SYSTEMML-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mike Dusenberry updated SYSTEMML-1185:
--------------------------------------
    Description:
This issue tracks the new SystemML breast cancer project!

From a systems perspective, we aim to support multi-node, multi-GPU distributed SGD training to support large-scale experiments for the specific breast cancer use case. To achieve this goal, the following steps are necessary:

# Single-node, CPU mini-batch SGD training (1 mini-batch at a time).
# Single-node, single-GPU mini-batch SGD training (1 mini-batch at a time).
# Single-node, multi-GPU data-parallel mini-batch SGD training (`n` parallel mini-batches for `n` GPUs at a time).
# Multi-node, CPU data-parallel mini-batch SGD training (`n` parallel mini-batches for `n` parallel tasks at a time).
# Multi-node, single-GPU data-parallel mini-batch SGD training (`n` parallel mini-batches for `n` total GPUs across the cluster at a time).
# Multi-node, multi-GPU data-parallel mini-batch SGD training (`n` parallel mini-batches for `n` total GPUs across the cluster at a time).

----

Here is a list of past and present JIRA epics and issues that have blocked, or are currently blocking, progress on the breast cancer project.

Overall Deep Learning Epic
* https://issues.apache.org/jira/browse/SYSTEMML-540
** This is the overall "Deep Learning" JIRA epic, with all issues either within or related to the epic.

Past
* https://issues.apache.org/jira/browse/SYSTEMML-633
* https://issues.apache.org/jira/browse/SYSTEMML-951
** Issue that completely blocked mini-batch training approaches.
* https://issues.apache.org/jira/browse/SYSTEMML-914
** Epic containing issues related to input DataFrame conversions that blocked getting data into the system entirely. Most of the issues specifically refer to existing, internal converters. 993 was a particularly large issue, and triggered a large body of work related to internal memory estimates that were incorrect. Also see 919, 946, & 994.
* https://issues.apache.org/jira/browse/SYSTEMML-1076
* https://issues.apache.org/jira/browse/SYSTEMML-1077
* https://issues.apache.org/jira/browse/SYSTEMML-948

Present
* https://issues.apache.org/jira/browse/SYSTEMML-1160
** Current open blocker to efficiently using a stochastic gradient descent approach.
* https://issues.apache.org/jira/browse/SYSTEMML-1078
** Current open blocker to training even an initial deep learning model for the project. This is another example of an internal compiler bug.
* https://issues.apache.org/jira/browse/SYSTEMML-686
** We need distributed convolution and max pooling operators.
* https://issues.apache.org/jira/browse/SYSTEMML-1159
** This is the main issue that discusses the need for the `parfor` construct to support efficient, parallel hyperparameter tuning on a cluster with large datasets. The broken remote parfor in 1129 blocked this issue, which in turn blocked any meaningful work on training a deep neural net for the project.
* https://issues.apache.org/jira/browse/SYSTEMML-1142
** This was one of the blockers to doing hyperparameter tuning.
* https://issues.apache.org/jira/browse/SYSTEMML-1129
** This is an epic for the issue in which the `parfor` construct was broken for remote Spark cases, and was one of the blockers for doing hyperparameter tuning.
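The data-parallel steps in the roadmap above can be sketched as follows. This is a minimal, single-process NumPy illustration (not SystemML DML, and not the project's actual implementation): each of the `n` "workers" computes a gradient on its own mini-batch, and the gradients are averaged before a single model update. The least-squares loss and the names `minibatch_grad` / `data_parallel_sgd_step` are illustrative assumptions, chosen only to make the update rule concrete.

```python
import numpy as np

def minibatch_grad(W, X, y):
    # Least-squares gradient for one mini-batch: d/dW of (1/2m)*||XW - y||^2.
    m = X.shape[0]
    return X.T @ (X @ W - y) / m

def data_parallel_sgd_step(W, batches, lr=0.1):
    # One data-parallel SGD step: `n` parallel mini-batches (simulated
    # sequentially here; in the real system each would run on its own
    # GPU or task), gradients averaged, then one shared model update.
    grads = [minibatch_grad(W, X, y) for X, y in batches]
    return W - lr * np.mean(grads, axis=0)

# Toy usage: recover W_true from noiseless synthetic data with
# n = 4 parallel mini-batches of 8 examples per step.
rng = np.random.default_rng(0)
W_true = np.array([[2.0], [-3.0]])
W = np.zeros((2, 1))
for _ in range(200):
    batches = []
    for _ in range(4):
        X = rng.standard_normal((8, 2))
        batches.append((X, X @ W_true))
    W = data_parallel_sgd_step(W, batches)
```

The multi-node variants in the roadmap follow the same averaging pattern; the difference is only where the `n` gradients are computed (local GPUs vs. tasks spread across the cluster) and how they are aggregated.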
> SystemML Breast Cancer Project
> ------------------------------
>
>                 Key: SYSTEMML-1185
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1185
>             Project: SystemML
>          Issue Type: New Feature
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)