[
https://issues.apache.org/jira/browse/SYSTEMML-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513997#comment-16513997
]
Matthias Boehm commented on SYSTEMML-2397:
------------------------------------------
Ok, just to make sure we are on the same page:
* Instruction parallelism: With 80 vcores and 80 workers, we need to set
the parallelism per instruction to 1; with 40 workers, to 2; and so on.
* To avoid these issues, try to use the parfor function copy as is. It
decides the names and places the copies into the program accordingly, so you
only need to adopt the parfor naming scheme of functions. This also ensures
there are no side effects between different workers (which might cause the
accuracy issues).
* Before calling the function copy, please set the degree of parallelism on
the hops of the original program and recompile it to instructions.
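
The arithmetic in the first point can be sketched as follows. This is only an
illustration under assumptions: the class and method names are hypothetical and
not part of the SystemML API, and integer division with a floor of 1 is assumed
for worker counts that do not divide the vcores evenly.

```java
public class InstructionParallelism {
    // Hypothetical helper (not SystemML API): number of threads each
    // instruction may use so that workers * k stays within the vcores.
    public static int perInstructionParallelism(int vcores, int workers) {
        if (vcores < 1 || workers < 1)
            throw new IllegalArgumentException("vcores and workers must be positive");
        // Assumption: round down, but never below one thread per instruction.
        return Math.max(1, vcores / workers);
    }

    // Upper bound on concurrently running threads if every worker executes
    // a multi-threaded instruction at the same time.
    public static long peakThreads(int workers, int perInstruction) {
        return (long) workers * perInstruction;
    }

    public static void main(String[] args) {
        int vcores = 80;
        for (int workers : new int[] {80, 40}) {
            int k = perInstructionParallelism(vcores, workers);
            System.out.println(workers + " workers: k=" + k
                + ", peak threads=" + peakThreads(workers, k));
        }
        // Suspected misconfiguration: 80 workers, each instruction with k=80.
        System.out.println("misconfigured peak threads=" + peakThreads(80, 80));
    }
}
```

With 80 workers and a per-instruction parallelism of 80, the bound is 6400
threads, which would match the suspected 80x80 concurrent threads.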
> Paramserv ASP failing w/ OOM (too many threads)
> -----------------------------------------------
>
> Key: SYSTEMML-2397
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2397
> Project: SystemML
> Issue Type: Bug
> Reporter: Matthias Boehm
> Priority: Major
>
> Paramserv ASP with 2 epochs, 80 workers, and updates per EPOCH fails with an
> OOM despite 200GB max heap. [~Guobao] could you please have a look? I suspect
> that the degree of parallelism of instructions is not set correctly, leading
> to 80x80 concurrent threads. The easiest way to debug would be to apply
> {{Explain.explain}} to the worker instructions and check that every
> instruction has an assigned degree of parallelism of 1.
> {code}
> 2018-06-14 22:31:16 ERROR DMLScript:543 - Failed to execute DML script.
> org.apache.sysml.runtime.DMLRuntimeException: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 0 and 71 -- Error evaluating instruction: CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN
> at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123)
> at org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:100)
> at org.apache.sysml.api.DMLScript.execute(DMLScript.java:746)
> at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:517)
> at org.apache.sysml.api.DMLScript.main(DMLScript.java:248)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 0 and 71 -- Error evaluating instruction: CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN
> at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:282)
> at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
> at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
> at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116)
> ... 14 more
> Caused by: org.apache.sysml.runtime.DMLRuntimeException: ParamservBuiltinCPInstruction: some error occurred:
> at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:163)
> at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
> ... 17 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: unable to create new native thread
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:158)
> ... 18 more
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:238)
> at org.apache.sysml.runtime.util.CommonThreadPool.invokeAll(CommonThreadPool.java:76)
> at org.apache.sysml.runtime.matrix.data.LibMatrixDNN.execute(LibMatrixDNN.java:755)
> at org.apache.sysml.runtime.matrix.data.LibMatrixDNN.reluBackward(LibMatrixDNN.java:284)
> at org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processReluBackwardInstruction(ConvolutionCPInstruction.java:298)
> at org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processInstruction(ConvolutionCPInstruction.java:465)
> at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
> at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
> at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
> at org.apache.sysml.runtime.controlprogram.FunctionProgramBlock.execute(FunctionProgramBlock.java:116)
> at org.apache.sysml.runtime.instructions.cp.FunctionCallCPInstruction.processInstruction(FunctionCallCPInstruction.java:152)
> at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeGradients(LocalPSWorker.java:170)
> at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeEpoch(LocalPSWorker.java:79)
> at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:58)
> at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:35)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}