[
https://issues.apache.org/jira/browse/SYSTEMML-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537762#comment-15537762
]
Mike Dusenberry commented on SYSTEMML-993:
------------------------------------------
Excellent! Eagerly looking forward to making use of them, and thanks for the
continued support here -- definitely helping to push forward the project I'm
sprinting on.
> Performance: Improve Vector DataFrame Conversions
> -------------------------------------------------
>
> Key: SYSTEMML-993
> URL: https://issues.apache.org/jira/browse/SYSTEMML-993
> Project: SystemML
> Issue Type: Improvement
> Reporter: Mike Dusenberry
> Priority: Critical
>
> Currently, the performance of vector DataFrame conversions leaves much to be
> desired: the conversions are slow overall and frequently run into OOM errors.
> Scenario:
> * Spark DataFrame:
> ** 3,745,888 rows
> ** Each row contains one {{vector}} column, where the vector is dense and of
> length 65,539. Note: this is a 256x256 pixel image flattened out and appended
> with a 3-column one-hot encoded label, i.e. a forced workaround to get both
> the labels and features into SystemML as efficiently as possible (a rough
> sketch of this construction follows the code block below).
> * SystemML script + MLContext invocation (Python): simply accept the
> DataFrame as input and save it as a SystemML matrix in binary form. Note: I'm
> not retrieving any outputs here, so the matrix really is written out by
> SystemML.
> ** {code}
> # Assumes an existing SparkContext `sc` and the vector DataFrame `train_YX`.
> from systemml import MLContext, dml
> ml = MLContext(sc)
>
> script = """
> write(train, "train", format="binary")
> """
> script = dml(script).input(train=train_YX)
> ml.execute(script)
> {code}
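> For context, here is a minimal, hypothetical sketch of how such a
> [image pixels + one-hot label] vector DataFrame might be constructed. The
> column name "features", the 3-class one-hot encoding, the ordering of pixels
> vs. label, and the toy data are illustrative assumptions, not taken from the
> actual pipeline:
> {code}
> # Hypothetical sketch; assumes PySpark with numpy available.
> import numpy as np
> from pyspark.sql import SparkSession
> from pyspark.ml.linalg import Vectors
>
> spark = SparkSession.builder.getOrCreate()
>
> def encode(label, image):
>     # Flatten the 256x256 image and append a 3-element one-hot label,
>     # yielding one dense vector of length 65,536 + 3 = 65,539.
>     one_hot = np.zeros(3)
>     one_hot[label] = 1.0
>     return (Vectors.dense(np.concatenate([image.ravel(), one_hot])),)
>
> # Toy data: three random "images"; the real dataset has ~3.7M rows.
> rows = [encode(lbl, np.random.rand(256, 256)) for lbl in (0, 1, 2)]
> train_YX = spark.createDataFrame(rows, ["features"])
> {code}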
> I'm seeing large amounts of memory being used in conversion during the
> {{mapPartitionsToPair at RDDConverterUtils.java:311}} stage. For example, in
> one scenario it read 1493.2 GB as "Input" and performed a "Shuffle Write" of
> 2.5 TB. A subsequent {{saveAsHadoopFile at WriteSPInstruction.java:261}} stage
> then did a "Shuffle Read" of 2.5 TB and an "Output" of 1829.1 GB. This was for
> a simple script that took in a DataFrame with a vector column and wrote it to
> disk in binary format. It kept running out of heap space, so I kept increasing
> the executor memory 3x until it finally ran. Additionally, the latter stage
> had a very skewed execution time across the partitions: ~1 hour for the first
> 1,000 partitions (out of 20,000), ~20 minutes for the next 18,000 partitions,
> and ~1 hour for the final 1,000 partitions. The passed-in DataFrame had an
> average of 180 rows per partition, with a max of 215 and a min of 155.
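> As a rough back-of-envelope check (my own arithmetic, not a figure from the
> Spark UI), the dense data alone is on the order of 2 TB, which is in the same
> ballpark as the input/shuffle/output sizes above:
> {code}
> # Approximate size of the dense data, assuming 8-byte doubles per cell.
> rows = 3745888
> cols = 65539
> print(rows * cols * 8 / 1e12)  # ~1.96 TB
> {code}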
> cc [~mboehm7]