[
https://issues.apache.org/jira/browse/SYSTEMML-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias Boehm updated SYSTEMML-1623:
-------------------------------------
Description:
The current JMLC conversion functions cause a very inefficient and
memory-intensive code path, which leads to unnecessary OOMs that can easily be
avoided. This task aims to add and improve these primitives to allow convenient
data conversions with much better memory efficiency.
For example, consider a scenario of a 500k x 90 input model available as a csv
file in the classpath. The typical code path currently in use looks as follows:
{code}
ResourceStream(model_file)
-> prep
---> StringBuilder -> String [3GB tmp, 1GB]
-> convertToDoubleMatrix
---> byte[] -> ByteInputStream [2GB]
---> MatrixBlock [360MB]
---> double[][] [400MB]
-> setMatrix
---> MatrixBlock [360MB]
{code}
which requires at least 4GB of memory due to strong references to all
intermediates. The goal of this task is to reduce this to the following:
{code}
ResourceStream(model_file)
-> convertToMatrix
---> MatrixBlock [360MB]
-> setMatrix
---> by reference
{code}
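The target path can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the JMLC implementation: the class StreamingCsvReader and its methods are hypothetical names, the input is assumed to be a dense, header-less, comma-separated file with known dimensions, and a flat row-major double[] stands in for the MatrixBlock. The point it shows is that values are parsed straight from the stream into the final buffer, without a full-file String or byte[] copy.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the JMLC API.
final class StreamingCsvReader {
    private StreamingCsvReader() {}

    /** Bytes needed for the raw values of a dense rows x cols double matrix. */
    static long denseSizeInBytes(long rows, long cols) {
        return rows * cols * Double.BYTES;
    }

    /**
     * Parses comma-separated doubles from the stream directly into a
     * row-major array. Only one line of text is held at a time, so no
     * full-file String or byte[] intermediate is created.
     */
    static double[] readDense(InputStream in, int rows, int cols) {
        // multiplyExact guards against int overflow of the array length.
        double[] values = new double[Math.multiplyExact(rows, cols)];
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            for (int r = 0; r < rows; r++) {
                String line = reader.readLine();
                if (line == null) {
                    throw new IllegalArgumentException(
                            "expected " + rows + " rows, found " + r);
                }
                int start = 0;
                for (int c = 0; c < cols; c++) {
                    if (start > line.length()) {
                        throw new IllegalArgumentException(
                                "row " + r + " has fewer than " + cols + " columns");
                    }
                    int end = line.indexOf(',', start);
                    if (end < 0) {
                        end = line.length(); // last field on the line
                    }
                    values[r * cols + c] =
                            Double.parseDouble(line.substring(start, end).trim());
                    start = end + 1;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }
}
```

For the 500k x 90 example, denseSizeInBytes(500_000, 90) gives 360,000,000 bytes, which matches the 360MB MatrixBlock figure. During parsing, the only other live data is the reader's buffer, the current line, and short-lived per-field strings.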
> Memory efficiency JMLC matrix and frame conversions
> ---------------------------------------------------
>
> Key: SYSTEMML-1623
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1623
> Project: SystemML
> Issue Type: Bug
> Reporter: Matthias Boehm