[jira] [Commented] (SYSTEMML-2376) Preparation of baseline experiments

Matthias Boehm (JIRA) Thu, 14 Jun 2018 22:09:26 -0700


    [ 
https://issues.apache.org/jira/browse/SYSTEMML-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513347#comment-16513347
 ]


Matthias Boehm commented on SYSTEMML-2376:
------------------------------------------

That is a great start - thanks for the automated scripts. I made a couple of 
modifications:
* You can simply put the {{SystemML-config.xml}} into the same directory as 
SystemML.jar - then it will be picked up.
* There is no need to include the log4j properties or nn library - the log4j 
configuration is picked up from Spark's config directory and we already package 
our own scripts into SystemML.jar
* I added a {{./sparkDML2.sh}} script, which includes the memory configurations 
and the Spark submission. With the local parameter server, we simply run in 
Spark's driver process, which is equivalent to a standalone invocation but lets 
us use one consistent setup later for distributed operations as well.
* I modified the invocation script to run only the max number of epochs and 
workers to get reasonable runtimes because running all combinations for workers 
(up to 80) and epochs would take way too long. I also changed the batch sizes 
to a more common parameterization.

Furthermore, I tried to run with mkl, but in the environment used I ran into 
the following core dump (which might need a fix similar to the issue you 
encountered with openblas - let us double check the instruction parallelism 
in use):

{code}
2018-06-14 21:07:31 INFO  NativeHelper:185 - Using native blas: mkl
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGFPE (0x8) at pc=0x00007fbed61902d6, pid=352940, tid=0x00007fd886176700
#
# JRE version: OpenJDK Runtime Environment (8.0_161-b14) (build 1.8.0_161-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.161-b14 mixed mode linux-amd64 )
# Problematic frame:
# C  [libmkl_avx512.so+0x206d2d6]  mkl_dnn_avx512_bkdGemmDirectConv_F64+0x276
{code}

Most importantly, however, the experiments currently get stuck already on the 
first combination. This requires some work. In the meantime, there are a 
couple of observations that should be addressed:
* In BSP, the execution hangs (before stats) independent of the number of 
epochs (tried with 1 and 2); no work is performed anymore.
* Double check the accuracy value (which is reported as 0 for mnist60k, at 
least with ASP), which indicates some issue. Could we please include a test that 
checks for similar outputs of the parameter server compared to basic mini-batch 
execution?
* Currently, the number of workers is set to k-1, which is confusing when a 
user explicitly specifies k.
* Having an optional flag for reporting progress would be nice (e.g., current 
epoch / current batch, maybe max for ASP).
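
As a rough illustration of the requested similarity check, here is a minimal 
sketch in Python. It is not SystemML code: the function name 
{{outputs_similar}} and the tolerances are hypothetical, and the idea is only 
that the outputs of a parameter server run (e.g., accuracy or model values) 
are compared element-wise, within a tolerance, against a basic mini-batch run.

```python
import math


def outputs_similar(ps_values, minibatch_values, rel_tol=1e-2, abs_tol=1e-6):
    """Return True if both runs produced element-wise similar values.

    ps_values / minibatch_values are flat sequences of floats collected from
    the two runs. The names and tolerance defaults are assumptions made for
    this sketch, not values taken from SystemML.
    """
    # Outputs of different length cannot be compared element-wise.
    if len(ps_values) != len(minibatch_values):
        return False
    # Each pair must agree within the relative or the absolute tolerance.
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(ps_values, minibatch_values)
    )
```

With such a helper, an accuracy of 0 from the parameter server run would fail 
the comparison against any mini-batch run that reports a clearly non-zero 
accuracy, so the issue above would show up as a test failure.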


> Preparation of baseline experiments
> -----------------------------------
>
>                 Key: SYSTEMML-2376
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2376
>             Project: SystemML
>          Issue Type: Technical task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
