[
https://issues.apache.org/jira/browse/SYSTEMML-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias Boehm updated SYSTEMML-1025:
-------------------------------------
Description:
During many runs of our entire performance testsuite, we've seen considerable
performance variability, especially for scenario L dense (80GB), where Spark
operations are the dominating factor for end-to-end performance. These issues
showed up across all algorithms and configurations, but especially for
multinomial classification and parfor scripts.
Take, for example, Naive Bayes over the dense 10M x 1K input with 20 classes.
Below are the results of 7 consecutive runs:
{code}
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 67
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 362
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 484
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 64
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 310
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 91
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 68
{code}
After a detailed investigation, it seems that imbalance, garbage collection,
and poor data locality are the root causes:
* First, we generated the inputs with our Spark backend. Apparently, the rand
operation caused imbalance due to garbage collection on some nodes. However,
this is a very realistic scenario, as we cannot always assume perfect balance.
* Second, especially for multinomial classification and parfor scripts, the
intermediates are not just vectors but larger matrices, or there are simply
more intermediates. This again led to more garbage collection.
* Third, the scheduler delay of 3s for pending tasks was exceeded due to
garbage collection, leading to remote (non-local) task execution, which
significantly slowed down the overall execution.
To resolve these issues, we should make the following two changes:
* (1) A more conservative configuration of spark.locality.wait in SystemML's
preferred Spark configuration, which we have not considered at all so far (see
the configuration sketch right after this list).
* (2) Improvements to reduce-all operations, which currently create unnecessary
intermediate pair outputs and hence unnecessary Tuple2 and MatrixIndexes
objects (see the sketch after the results below).
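For illustration, here is a minimal sketch of change (1). The
spark.locality.wait property and its 3s default are standard Spark; the app
name, local master, and SparkConf-based setup are placeholder scaffolding for a
runnable example, not SystemML's actual configuration code (the same setting
can equally go into spark-defaults.conf or be passed via --conf to
spark-submit).
{code}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalityWaitConfig {
  public static void main(String[] args) {
    // Raise spark.locality.wait from the 3s default to a more conservative 5s,
    // so tasks briefly stalled by GC are not immediately shipped to remote
    // (non-local) executors.
    SparkConf conf = new SparkConf()
        .setAppName("systemml-perftest")    // placeholder app name
        .setMaster("local[2]")              // placeholder master for a local run
        .set("spark.locality.wait", "5s");  // Spark default: 3s

    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      System.out.println(sc.getConf().get("spark.locality.wait")); // prints 5s
    }
  }
}
{code}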
With a scheduler delay of 5s (as the new default in SystemML's preferred Spark
configuration) instead of Spark's default 3s, as well as improved reduce-all
operations for mapmm, groupedagg, tsmm, tsmm2, zipmm, and uagg, we got the
following promising results (which include Spark context creation and the
initial read):
{code}
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 52
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 45
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 44
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 44
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 51
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 50
NaiveBayes train on mbperftest/multinomial/X10M_1k_dense_k150: 47
{code}
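For illustration, below is a minimal, self-contained sketch of the reduce-all
pattern from change (2). This is not SystemML's actual operator code: Long and
Double stand in for MatrixIndexes and MatrixBlock, and the partial aggregation
is reduced to a plain sum. The point is the allocation pattern: folding over
the pair RDD allocates a new Tuple2 (and key object) per merge, whereas
dropping the keys first folds over the values directly.
{code}
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ReduceAllSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("reduce-all-sketch").setMaster("local[2]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // stand-in for an RDD of (MatrixIndexes, MatrixBlock) pairs
      JavaPairRDD<Long, Double> blocks = sc.parallelizePairs(Arrays.asList(
          new Tuple2<>(1L, 1.0), new Tuple2<>(2L, 2.0), new Tuple2<>(3L, 3.0)));

      // before: fold over pairs -- every merge allocates a new Tuple2
      // (and, in the real operators, a new MatrixIndexes) object
      Tuple2<Long, Double> sumPair = blocks.fold(
          new Tuple2<>(-1L, 0.0),
          (a, b) -> new Tuple2<>(-1L, a._2() + b._2()));

      // after: drop the keys, fold over values only -- no intermediate pairs
      double sumVal = blocks.values().fold(0.0, (a, b) -> a + b);

      System.out.println(sumPair._2() + " == " + sumVal); // 6.0 == 6.0
    }
  }
}
{code}
For the actual operators (mapmm, tsmm, etc.), the same value-only folding
avoids one Tuple2 and one MatrixIndexes allocation per partial aggregate.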
cc [~reinwald] [~niketanpansare] [~freiss]
> Perftest: Large performance variability on scenario L dense (80GB)
> ------------------------------------------------------------------
>
> Key: SYSTEMML-1025
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1025
> Project: SystemML
> Issue Type: Bug
> Reporter: Matthias Boehm
> Priority: Blocker