[ 
https://issues.apache.org/jira/browse/SYSTEMML-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533828#comment-15533828
 ] 

Mike Dusenberry commented on SYSTEMML-946:
------------------------------------------

I think this still needs to be explored further.  I'm seeing large amounts of 
memory being used during conversion in the {{mapPartitionsToPair at 
RDDConverterUtils.java:311}} stage.  For example, I have a scenario where that 
stage read in 1493.2 GB as "Input" and performed a "Shuffle Write" of 2.5 TB.  
A subsequent {{saveAsHadoopFile at WriteSPInstruction.java:261}} stage then did 
a "Shuffle Read" of 2.5 TB and an "Output" of 1829.1 GB.  This was for a simple 
script that took in a DataFrame with a vector column and wrote it to disk in 
binary format.  It kept running out of heap space, so I had to increase the 
executor memory 3x before it finally ran.  Additionally, the latter stage had a 
very skewed execution time across its partitions: ~1 hour for the first 1,000 
partitions (out of 20,000), ~20 minutes for the next 18,000 partitions, and 
~1 hour for the final 1,000 partitions.  The DataFrame passed in had an average 
of 180 rows per partition, with a max of 215 and a min of 155.
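
For context, here's a minimal sketch of the kind of job that hits this (the 
variable names, output path, and use of the Scala MLContext API are 
placeholders for illustration, not my actual setup):

{code:scala}
// Minimal sketch: a DataFrame with a single Spark ML vector column is bound
// to a trivial DML script that just writes it back out in SystemML binary
// block format.  Names and the output path are placeholders.
import org.apache.sysml.api.mlcontext.MLContext
import org.apache.sysml.api.mlcontext.ScriptFactory.dml

val ml = new MLContext(sc)   // sc: SparkContext
val script = dml("""write(X, "hdfs:///tmp/X.binary", format="binary")""")
  .in("X", df)               // df: DataFrame with one vector column, ~20,000 partitions
ml.execute(script)
{code}

The heavy stages above correspond to that DataFrame-to-binary-block conversion 
and the subsequent binary write.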

cc [~mboehm7]

> OOM on spark dataframe-matrix / csv-matrix conversion
> -----------------------------------------------------
>
>                 Key: SYSTEMML-946
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-946
>             Project: SystemML
>          Issue Type: Bug
>          Components: Runtime
>            Reporter: Matthias Boehm
>            Assignee: Matthias Boehm
>             Fix For: SystemML 0.11
>
>         Attachments: mnist_lenet.dml
>
>
> The decision on dense/sparse block allocation in our dataframeToBinaryBlock 
> and csvToBinaryBlock data converters is based purely on the sparsity. This 
> works very well for the common case of tall & skinny matrices. However, in 
> scenarios with dense data but a huge number of columns, a single partition 
> will rarely have 1000 rows to fill an entire row of blocks. This leads to 
> unnecessary allocation and dense-sparse conversion, as well as potential 
> out-of-memory errors, because the temporary memory requirement can be up to 
> 1000x larger than the input partition.
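
As a rough back-of-the-envelope illustration of that "up to 1000x" temporary 
memory requirement (the 1000x1000 block size is the SystemML default; the 
column count and the one-row-per-partition case below are assumptions, purely 
for illustration):

{code:scala}
// Worst-case temporary allocation per partition, assuming the default
// 1000 x 1000 binary block size and dense double (8 byte) cells.
val blockSize  = 1000      // rows per block (SystemML default)
val numCols    = 100000L   // assumed wide, dense input
val rowsInPart = 1L        // pathological case: one row per partition

// Dense blocks are allocated for the full 1000-row block row, even though
// this partition only contributes rowsInPart rows to it.
val allocatedBytes = blockSize * numCols * 8L
val inputBytes     = rowsInPart * numCols * 8L
println(s"temporary allocation is ${allocatedBytes / inputBytes}x the partition input")  // 1000x
{code}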


