[
https://issues.apache.org/jira/browse/SYSTEMML-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias Boehm updated SYSTEMML-1025:
-------------------------------------
Priority: Blocker (was: Major)
Description:
During many runs of our entire performance testsuite, we've seen considerable
performance variability, especially for scenario L dense (80GB), where Spark
operations dominate end-to-end performance. These issues showed up across all
algorithms and configurations, but especially for multinomial classification
and parfor scripts.
Take, for example, Naive Bayes over the dense 10M x 1K input with 20
classes. Below are the results of 7 consecutive runs:
{code}
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 67
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 362
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 484
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 64
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 310
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 91
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 68
{code}
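To quantify the variability, a quick sketch (plain Python, using the timings listed above) computing the mean and sample standard deviation of the 7 runs:

```python
import statistics

# Measured results of the 7 consecutive NaiveBayes train runs (from above)
times = [67, 362, 484, 64, 310, 91, 68]

mean = statistics.mean(times)    # arithmetic mean over all runs
stdev = statistics.stdev(times)  # sample standard deviation

print(f"mean={mean:.1f}, stdev={stdev:.1f}, min={min(times)}, max={max(times)}")
# -> mean=206.6, stdev=175.2, min=64, max=484
```

The standard deviation is close to the mean itself, i.e., the spread between the best (64) and worst (484) run is roughly 7.5x.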
After a detailed investigation, imbalance, garbage collection, and poor data
locality appear to be the main causes:
* First, we generated the inputs with our Spark backend. The rand operations
caused imbalance due to garbage collection on some nodes. However, this is a
realistic scenario, as we cannot always assume perfect balance.
* Second, especially for multinomial classification and parfor scripts, the
intermediates are not just vectors but larger matrices, or there are simply
more intermediates. This again led to more garbage collection.
* Third, the 3s scheduler delay for pending tasks was exceeded due to garbage
collection in other tasks, leading to remote (non-local) execution, which
significantly slowed down overall execution.
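One possible mitigation for the third point, assuming the 3s delay corresponds to Spark's data-locality wait (spark.locality.wait defaults to 3s): raising it makes the scheduler wait longer for a local slot instead of falling back to remote execution. A hypothetical spark-submit invocation (the script name and 10s value are illustrative, not tuned):

```shell
# Sketch: increase Spark's locality wait so tasks pending behind a GC pause
# are less likely to fall back to a remote (non-local) executor.
# spark.locality.wait defaults to 3s; 10s is an illustrative value only.
spark-submit \
  --conf spark.locality.wait=10s \
  --class org.apache.sysml.api.DMLScript \
  SystemML.jar -f naive-bayes.dml ...
```

This trades extra scheduling delay for better locality, so it only helps when the GC pauses are shorter than the configured wait.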
> Perftest: Large performance variability on scenario L dense (80GB)
> ------------------------------------------------------------------
>
> Key: SYSTEMML-1025
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1025
> Project: SystemML
> Issue Type: Bug
> Reporter: Matthias Boehm
> Priority: Blocker
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)