[jira] [Created] (SYSTEMML-993) Performance: Improve Vector DataFrame Conversions

Mike Dusenberry (JIRA) Thu, 29 Sep 2016 13:07:47 -0700

Mike Dusenberry created SYSTEMML-993:
----------------------------------------


             Summary: Performance: Improve Vector DataFrame Conversions
                 Key: SYSTEMML-993
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-993
             Project: SystemML
          Issue Type: Improvement
            Reporter: Mike Dusenberry
            Priority: Critical


Currently, the performance of vector DataFrame conversions leaves much to be 
desired, with regards to frequent OOM errors, and overall slow performance.  

Scenario:
* Spark DataFrame:
** 3,745,888 rows
** Each row contains one {{vector}} column, where the vector is dense and of 
length 65,539.  Note: This is a 256x256 pixel image stretched out and appended 
with a 3-column one-hot encoded label. I.e. this is a forced workaround to get 
both the labels and features into SystemML as efficiently as possible.
* SystemML script + MLContext invocation (Python):  Simply accept the DataFrame 
as input and save as a SystemML matrix in binary form.  Note: I'm not grabbing 
the output here, so the matrix will literally be written by SystemML.
** {code}
script = """
write(train, "train", format="binary")
"""
script = dml(script).input(train=train_YX)
ml.execute(script)
{code}

I'm seeing large amounts of memory being used in conversion during the 
{{mapPartitionsToPair at RDDConverterUtils.java:311}} stage.  For example, I 
have a scenario where it read in 1493.2 GB as "Input", and performed a "Shuffle 
Write" of 2.5 TB.  A subsequent stage of {{saveAsHadoopFile at 
WriteSPInstruction.java:261}} then did a "Shuffle Read" of 2.5TB, and "Output" 
1829.1 GB.  This was for a simple script that took in DataFrames with a vector 
column and wrote to disk in binary format.  It kept running out of heap space 
memory, so I kept increasing the executor memory 3x until it finally ran.  
Additionally, the latter stage had a very skewed execution time across the 
partitions, with ~1hour for the first 1000 paritions (out of 20,000), ~20 
minutes for the next 18,000 partitions, and ~1 hour for the final 1000 
partitions.  The passed in DataFrame had an average of 180 rows per partition 
with a max of 215, and a min of 155.

cc [~mboehm7]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (SYSTEMML-993) Performance: Improve Vector DataFrame Conversions

Reply via email to