[ 
https://issues.apache.org/jira/browse/SYSTEMML-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513618#comment-16513618
 ] 

LI Guobao commented on SYSTEMML-2397:
-------------------------------------

 [~mboehm7], sure, I will analyse it.

> Paramserv ASP failing w/ OOM (too many threads)
> -----------------------------------------------
>
>                 Key: SYSTEMML-2397
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2397
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Priority: Major
>
> Paramserv ASP with 2 epochs, 80 workers, update per EPOCH failing due to OOM 
> despite 200GB max heap. [~Guobao] could you please have a look? I suspect 
> that the degree of parallelism of instructions is not set correctly leading 
> to 80x80 concurrent threads. The easiest way to debug would be to use 
> {{Explain.explain}} to the worker instructions and check that every 
> instruction has an assigned degree of parallelism of 1.
> {code}
> 2018-06-14 22:31:16 ERROR DMLScript:543 - Failed to execute DML script.
> org.apache.sysml.runtime.DMLRuntimeException: 
> org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program 
> block generated from statement block between lines 0 and 71 -- Error 
> evaluating instruction: 
> CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN
>       at 
> org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123)
>       at 
> org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:100)
>       at org.apache.sysml.api.DMLScript.execute(DMLScript.java:746)
>       at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:517)
>       at org.apache.sysml.api.DMLScript.main(DMLScript.java:248)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>       at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
>       at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
>       at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
>       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
>       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error 
> in program block generated from statement block between lines 0 and 71 -- 
> Error evaluating instruction: 
> CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN
>       at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:282)
>       at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
>       at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
>       at 
> org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116)
>       ... 14 more
> Caused by: org.apache.sysml.runtime.DMLRuntimeException: 
> ParamservBuiltinCPInstruction: some error occurred: 
>       at 
> org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:163)
>       at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
>       ... 17 more
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.OutOfMemoryError: unable to create new native thread
>       at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>       at 
> org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:158)
>       ... 18 more
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
>       at java.lang.Thread.start0(Native Method)
>       at java.lang.Thread.start(Thread.java:717)
>       at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>       at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
>       at 
> java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:238)
>       at 
> org.apache.sysml.runtime.util.CommonThreadPool.invokeAll(CommonThreadPool.java:76)
>       at 
> org.apache.sysml.runtime.matrix.data.LibMatrixDNN.execute(LibMatrixDNN.java:755)
>       at 
> org.apache.sysml.runtime.matrix.data.LibMatrixDNN.reluBackward(LibMatrixDNN.java:284)
>       at 
> org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processReluBackwardInstruction(ConvolutionCPInstruction.java:298)
>       at 
> org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processInstruction(ConvolutionCPInstruction.java:465)
>       at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
>       at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
>       at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
>       at 
> org.apache.sysml.runtime.controlprogram.FunctionProgramBlock.execute(FunctionProgramBlock.java:116)
>       at 
> org.apache.sysml.runtime.instructions.cp.FunctionCallCPInstruction.processInstruction(FunctionCallCPInstruction.java:152)
>       at 
> org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeGradients(LocalPSWorker.java:170)
>       at 
> org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeEpoch(LocalPSWorker.java:79)
>       at 
> org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:58)
>       at 
> org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:35)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to