[ 
https://issues.apache.org/jira/browse/SYSTEMML-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537245#comment-15537245
 ] 

Matthias Boehm commented on SYSTEMML-993:
-----------------------------------------

Just a quick update with regard to the skew issue: based on a couple of 
experiments, I can confirm that the hash function is causing this problem.

Here are the results comparing the distribution of blocks to partitions for two 
different scenarios (10M x 1K, blocksize 1K, dense; and 3.3M x 65K, blocksize 
1K, dense). The hash functions compared are {{H1: MatrixIndexes.hashCode()}} 
and {{H2}}, defined by the following code, which is similar to 
{{Arrays.hashCode(long[])}}:

{code}
//basic hash mixing of two long hashes (w/o object creation),
//where key1/key2 are the row/column block indexes
int h = (int)(key1 ^ (key1 >>> 32));
return h * 31 + (int)(key2 ^ (key2 >>> 32));
{code}
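
For completeness, here is how H2 could look as a self-contained {{hashCode()}} 
implementation (a minimal sketch only; the class and field names below are 
illustrative and not the actual {{MatrixIndexes}} layout):

{code}
//sketch: H2 as a complete hashCode() over the two long block indexes
//(class/field names are illustrative, not SystemML's actual MatrixIndexes)
public class BlockIndexes {
    private final long rowIndex;    //row block index (key1)
    private final long columnIndex; //column block index (key2)

    public BlockIndexes(long rowIndex, long columnIndex) {
        this.rowIndex = rowIndex;
        this.columnIndex = columnIndex;
    }

    @Override
    public int hashCode() {
        //mix both longs, similar to Arrays.hashCode(long[]);
        //equals(...) omitted for brevity
        int h = (int)(rowIndex ^ (rowIndex >>> 32));
        return h * 31 + (int)(columnIndex ^ (columnIndex >>> 32));
    }
}
{code}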

Scenario 1: 10M x 1K, blocksize 1K, dense
H1: min 16 blocks, max 18 blocks, 596/596 partitions with >= 1 block
H2: min 16 blocks, max 17 blocks, 596/596 partitions with >= 1 block

Scenario 2: 3.3M x 65K, blocksize 1K, dense
H1: min 10 blocks, max 66 blocks, 3456/13182 partitions with >= 1 block
H2: min 14 blocks, max 18 blocks, 13182/13182 partitions with >= 1 block

While both functions perform well on Scenario 1 (tall-and-skinny matrix), our 
existing hash function H1 performs very poorly for wide matrices. This 
explains the huge difference in execution time, as 2/3 of all partitions were 
completely empty.
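
For reference, the per-partition distribution above can be reproduced with a 
small standalone harness along the following lines (a sketch only, assuming a 
{{HashPartitioner}}-style non-negative modulo and the block/partition counts of 
Scenario 2; this is not the exact test driver used):

{code}
//sketch: count blocks per partition for a blocked matrix under hash H2,
//using a HashPartitioner-style non-negative modulo (assumption)
public class HashSkewCheck {
    static int partition(int hash, int numPartitions) {
        int mod = hash % numPartitions;
        return (mod < 0) ? mod + numPartitions : mod;
    }

    //H2: basic hash mixing of the two long block indexes
    static int h2(long ri, long ci) {
        int h = (int)(ri ^ (ri >>> 32));
        return h * 31 + (int)(ci ^ (ci >>> 32));
    }

    public static void main(String[] args) {
        long rows = 3_300_000, cols = 65_000, blen = 1_000; //approx. Scenario 2
        int numPartitions = 13182;                          //as reported above
        long rblks = (rows + blen - 1) / blen;
        long cblks = (cols + blen - 1) / blen;

        long[] counts = new long[numPartitions];
        for (long i = 1; i <= rblks; i++)       //1-based block indexes
            for (long j = 1; j <= cblks; j++)
                counts[partition(h2(i, j), numPartitions)]++;

        long min = Long.MAX_VALUE, max = 0, nonEmpty = 0;
        for (long c : counts) {
            max = Math.max(max, c);
            if (c > 0) { nonEmpty++; min = Math.min(min, c); }
        }
        System.out.println("min=" + min + ", max=" + max
            + ", non-empty=" + nonEmpty + "/" + numPartitions);
    }
}
{code}

Plugging the existing H1 logic into the same loop gives a comparable way to see 
how many partitions stay empty.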

> Performance: Improve Vector DataFrame Conversions
> -------------------------------------------------
>
>                 Key: SYSTEMML-993
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-993
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>            Priority: Critical
>
> Currently, the performance of vector DataFrame conversions leaves much to be 
> desired, with frequent OOM errors and overall slow performance.  
> Scenario:
> * Spark DataFrame:
> ** 3,745,888 rows
> ** Each row contains one {{vector}} column, where the vector is dense and of 
> length 65,539.  Note: This is a 256x256 pixel image stretched out and 
> appended with a 3-column one-hot encoded label. I.e. this is a forced 
> workaround to get both the labels and features into SystemML as efficiently 
> as possible.
> * SystemML script + MLContext invocation (Python):  Simply accept the 
> DataFrame as input and save as a SystemML matrix in binary form.  Note: I'm 
> not grabbing the output here, so the matrix will literally be written by 
> SystemML.
> ** {code}
> script = """
> write(train, "train", format="binary")
> """
> script = dml(script).input(train=train_YX)
> ml.execute(script)
> {code}
> I'm seeing large amounts of memory being used in conversion during the 
> {{mapPartitionsToPair at RDDConverterUtils.java:311}} stage.  For example, I 
> have a scenario where it read in 1493.2 GB as "Input", and performed a 
> "Shuffle Write" of 2.5 TB.  A subsequent stage of {{saveAsHadoopFile at 
> WriteSPInstruction.java:261}} then did a "Shuffle Read" of 2.5 TB, and 
> "Output" of 1829.1 GB.  This was for a simple script that took in DataFrames 
> with a vector column and wrote to disk in binary format.  It kept running out 
> of heap space, so I kept increasing the executor memory 3x until it 
> finally ran.  Additionally, the latter stage had a very skewed execution time 
> across the partitions, with ~1 hour for the first 1,000 partitions (out of 
> 20,000), ~20 minutes for the next 18,000 partitions, and ~1 hour for the 
> final 1,000 partitions.  The passed-in DataFrame had an average of 180 rows 
> per partition, with a max of 215 and a min of 155.
> cc [~mboehm7]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
