[
https://issues.apache.org/jira/browse/SYSTEMML-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15535012#comment-15535012
]
Matthias Boehm commented on SYSTEMML-993:
-----------------------------------------
ok thanks for the details [[email protected]] - here is the list of things
we need to address:
(1) new IPA pass to remove the checkpoint after the persistent read if directly
followed by a persistent write (in your case this unnecessarily occupies 75% of
executor memory)
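As a toy illustration of this rewrite (function and variable names here are hypothetical, not SystemML's actual IPA code): on a linearized op list, a checkpoint whose output feeds only a persistent write caches data that is never reused, so the pass can simply drop it:

```python
# Toy sketch of an IPA-style rewrite: drop a checkpoint that is
# immediately followed by a persistent write of the same variable,
# since the cached RDD would never be read again.

def remove_redundant_checkpoint(program):
    # program: list of (op, var) tuples in execution order
    out = []
    for i, (op, var) in enumerate(program):
        nxt = program[i + 1] if i + 1 < len(program) else None
        if (op == "checkpoint" and nxt is not None
                and nxt[0] == "pwrite" and nxt[1] == var):
            continue  # checkpoint feeds only a persistent write: remove it
        out.append((op, var))
    return out

prog = [("pread", "X"), ("checkpoint", "X"), ("pwrite", "X")]
print(remove_redundant_checkpoint(prog))
# [('pread', 'X'), ('pwrite', 'X')]
```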
(2) investigation of our custom hash function for matrix indexes w.r.t. the
skew issue you've observed (I've been wanting to change this function for a while now; newer
data structures such as our native LongLongDoubleHashMap already use a better
hash function for long-long-pairs)
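To illustrate the skew mechanism (a sketch with hypothetical function names, not SystemML's actual hash code): a naive XOR-style combination of row/column block indexes collapses correlated index patterns (e.g., diagonal blocks) onto a single partition, while a 64-bit mixing finalizer in the spirit of MurmurHash3's fmix64 spreads them evenly:

```python
# Sketch: naive vs. mixed hashing of (row, col) block-index pairs.

def naive_hash(r, c):
    # XOR-style combination: for diagonal patterns (r == c),
    # every key collapses to the same value.
    return (r ^ c) & 0x7FFFFFFF

def mixed_hash(r, c):
    # 64-bit finalizer in the spirit of MurmurHash3's fmix64,
    # applied to the packed long-long pair.
    h = ((r & 0xFFFFFFFF) << 32) | (c & 0xFFFFFFFF)
    h ^= h >> 33
    h = (h * 0xFF51AFD7ED558CCD) & 0xFFFFFFFFFFFFFFFF
    h ^= h >> 33
    h = (h * 0xC4CEB9FE1A85EC53) & 0xFFFFFFFFFFFFFFFF
    h ^= h >> 33
    return h & 0x7FFFFFFF

n_parts = 16
diag = [(i, i) for i in range(1, 1001)]
naive_buckets = {naive_hash(r, c) % n_parts for r, c in diag}
mixed_buckets = {mixed_hash(r, c) % n_parts for r, c in diag}
print(len(naive_buckets), len(mixed_buckets))  # naive collapses to 1 bucket
```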
(3) Fewer allocations on row/vector to sparse-row appends (sparse rows with
estimated nnz use a grow factor of 2x, which still requires log N allocation
steps; we could avoid that by counting the row nnz and allocating once,
which will likely reduce the pressure on GC).
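The allocation-count difference behind this can be sketched as follows (illustrative code, not SystemML's sparse-row implementation):

```python
# Appending nnz values under a 2x grow factor triggers O(log N)
# re-allocations (each with a copy), while counting the row nnz
# first allows a single exact-size allocation.

def allocations_with_doubling(nnz, initial=4):
    cap, allocs = initial, 1
    while cap < nnz:
        cap *= 2     # grow factor of 2x
        allocs += 1  # each growth step re-allocates and copies
    return allocs

def allocations_with_precount(nnz):
    return 1  # count row nnz up front, allocate exactly once

# For a row of length 65,539 (as in this issue's scenario):
print(allocations_with_doubling(65539))  # 16
print(allocations_with_precount(65539))  # 1
```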
(4) Compressed sparse blocks (the reason why you see a size increase is that we
have to represent partial blocks in sparse format, and apparently the default Snappy
compression during shuffle does not lead to a good compression ratio; we should
introduce a custom serialization format for small sparse blocks, where we can
encode the nnz per row and column indexes as short instead of int).
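A rough sketch of the size effect (an assumption for illustration, not SystemML's actual wire format): for small blocks whose dimensions fit in 16 bits, encoding per-row nnz and column indexes as shorts instead of ints shrinks the index overhead per entry from 4 bytes to 2:

```python
import struct

# Compare serialized sizes of a small sparse block with int vs.
# short encodings for per-row nnz and column indexes.
# rows: list of rows, each a list of (col_index, value) pairs.

def serialize_int(rows):
    buf = b""
    for row in rows:
        buf += struct.pack(">i", len(row))       # nnz as 4-byte int
        for col, val in row:
            buf += struct.pack(">id", col, val)  # int col + double value
    return buf

def serialize_short(rows):
    buf = b""
    for row in rows:
        buf += struct.pack(">h", len(row))       # nnz as 2-byte short
        for col, val in row:
            buf += struct.pack(">hd", col, val)  # short col + double value
    return buf

rows = [[(c, 1.0) for c in range(0, 100, 7)] for _ in range(50)]
print(len(serialize_int(rows)), len(serialize_short(rows)))  # 9200 7600
```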
Before our next 0.11 RC, i.e., tomorrow, I'll resolve (1) and (3) - the other
issues will be addressed by our 1.0 release.
> Performance: Improve Vector DataFrame Conversions
> -------------------------------------------------
>
> Key: SYSTEMML-993
> URL: https://issues.apache.org/jira/browse/SYSTEMML-993
> Project: SystemML
> Issue Type: Improvement
> Reporter: Mike Dusenberry
> Priority: Critical
>
> Currently, the performance of vector DataFrame conversions leaves much to be
> desired, with frequent OOM errors and overall slow performance.
> Scenario:
> * Spark DataFrame:
> ** 3,745,888 rows
> ** Each row contains one {{vector}} column, where the vector is dense and of
> length 65,539. Note: This is a 256x256 pixel image stretched out and
> appended with a 3-column one-hot encoded label. I.e., this is a forced
> workaround to get both the labels and features into SystemML as efficiently
> as possible.
> * SystemML script + MLContext invocation (Python): Simply accept the
> DataFrame as input and save as a SystemML matrix in binary form. Note: I'm
> not grabbing the output here, so the matrix will literally be written by
> SystemML.
> ** {code}
> script = """
> write(train, "train", format="binary")
> """
> script = dml(script).input(train=train_YX)
> ml.execute(script)
> {code}
> I'm seeing large amounts of memory being used in conversion during the
> {{mapPartitionsToPair at RDDConverterUtils.java:311}} stage. For example, I
> have a scenario where it read in 1493.2 GB as "Input", and performed a
> "Shuffle Write" of 2.5 TB. A subsequent stage of {{saveAsHadoopFile at
> WriteSPInstruction.java:261}} then did a "Shuffle Read" of 2.5 TB, and
> "Output" 1829.1 GB. This was for a simple script that took in DataFrames
> with a vector column and wrote to disk in binary format. It kept running out
> of heap space memory, so I kept increasing the executor memory 3x until it
> finally ran. Additionally, the latter stage had a very skewed execution time
> across the partitions, with ~1 hour for the first 1000 partitions (out of
> 20,000), ~20 minutes for the next 18,000 partitions, and ~1 hour for the
> final 1000 partitions. The passed in DataFrame had an average of 180 rows
> per partition with a max of 215, and a min of 155.
> cc [~mboehm7]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)