[
https://issues.apache.org/jira/browse/SYSTEMML-869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15486037#comment-15486037
]
Mike Dusenberry commented on SYSTEMML-869:
------------------------------------------
cc [~mboehm7] Unfortunately, I've run into this issue again with a different
set of scripts via MLContext.
The setup follows the same pattern:
* Three different scripts run in succession via MLContext.
* First script takes in DataFrames, does some matrix math, and returns
{{Matrix}} objects {{X}}, {{X_val}}, {{Y}}, & {{Y_val}}.
* Second script takes in {{X}}, {{X_val}}, {{Y}}, & {{Y_val}}, trains a model,
and returns model coefficient {{Matrix}} objects {{W}}, {{b}}.
* Third script takes in all {{Matrix}} objects, {{X}}, {{X_val}}, {{Y}},
{{Y_val}}, {{W}}, & {{b}}, and returns a couple of scalar values evaluating
performance of the model.
* The first two scripts run just fine, but the third script fails with a
{{Caused by: java.io.IOException: File
scratch_space/_p40317_9.30.110.134/_t0/temp2648_51476 does not exist on
HDFS/LFS.}} error.
* If I instead copy the math from the first script into both the second & third
scripts and pass the original DataFrames into the second & third scripts (thus
performing the conversions of the original data both times), the second and
third scripts run fine.
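For reference, the failing pattern looks roughly like the sketch below. The DML bodies ({{prep_code}}, {{train_code}}, {{eval_code}}) and the input DataFrame names are placeholders, not the actual project code; it assumes the SystemML Python MLContext API and an existing SparkContext {{sc}}.

```python
# Hypothetical sketch of the three-script pattern described above.
# The DML strings and DataFrame names are placeholders for the real
# (private) project scripts.
from systemml import MLContext, dml

ml = MLContext(sc)  # sc: an existing SparkContext

# Script 1: DataFrames in, Matrix objects out.
prep = dml(prep_code).input(train=train_df, val=val_df) \
                     .output("X", "X_val", "Y", "Y_val")
X, X_val, Y, Y_val = ml.execute(prep).get("X", "X_val", "Y", "Y_val")

# Script 2: train on the returned Matrix objects, return coefficients.
train = dml(train_code).input(X=X, X_val=X_val, Y=Y, Y_val=Y_val) \
                       .output("W", "b")
W, b = ml.execute(train).get("W", "b")

# Script 3: evaluate the model -- this is the step that fails with
# "File scratch_space/... does not exist on HDFS/LFS."
evaluate = dml(eval_code).input(X=X, X_val=X_val, Y=Y, Y_val=Y_val,
                                W=W, b=b) \
                         .output("acc", "acc_val")
acc, acc_val = ml.execute(evaluate).get("acc", "acc_val")
```

The workaround noted above amounts to inlining the script-1 DML into scripts 2 and 3 and passing the original DataFrames each time, so no {{Matrix}} handles are reused across {{execute}} calls.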
{code}
: org.apache.sysml.api.mlcontext.MLContextException: Exception when executing
script
at org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:301)
at org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:271)
at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.sysml.api.mlcontext.MLContextException: Exception
occurred while executing runtime program
at
org.apache.sysml.api.mlcontext.ScriptExecutor.executeRuntimeProgram(ScriptExecutor.java:381)
at
org.apache.sysml.api.mlcontext.ScriptExecutor.execute(ScriptExecutor.java:324)
at org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:293)
... 11 more
Caused by: org.apache.sysml.runtime.DMLRuntimeException:
org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program
block generated from statement block between lines 1 and 18 -- Error evaluating
instruction: CP°-°0·SCALAR·INT·true°Y·MATRIX·DOUBLE°_mVar5272·MATRIX·DOUBLE
at
org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:152)
at
org.apache.sysml.api.mlcontext.ScriptExecutor.executeRuntimeProgram(ScriptExecutor.java:379)
... 13 more
Caused by: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error
in program block generated from statement block between lines 1 and 18 -- Error
evaluating instruction:
CP°-°0·SCALAR·INT·true°Y·MATRIX·DOUBLE°_mVar5272·MATRIX·DOUBLE
at
org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:335)
at
org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:224)
at
org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:168)
at
org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:145)
... 14 more
Caused by: org.apache.sysml.runtime.controlprogram.caching.CacheException:
Reading of scratch_space//_p40317_9.30.110.134//_t0/temp2648_51476 (_mVar4813)
failed.
at
org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireRead(CacheableData.java:476)
at
org.apache.sysml.runtime.controlprogram.context.ExecutionContext.getMatrixInput(ExecutionContext.java:241)
at
org.apache.sysml.runtime.instructions.cp.ScalarMatrixArithmeticCPInstruction.processInstruction(ScalarMatrixArithmeticCPInstruction.java:49)
at
org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:305)
... 17 more
Caused by: java.io.IOException: File
scratch_space/_p40317_9.30.110.134/_t0/temp2648_51476 does not exist on
HDFS/LFS.
at
org.apache.sysml.runtime.io.MatrixReader.checkValidInputFile(MatrixReader.java:147)
at
org.apache.sysml.runtime.io.ReaderBinaryBlockParallel.readMatrixFromHDFS(ReaderBinaryBlockParallel.java:67)
at
org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:291)
at
org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:250)
at
org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:476)
at
org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:61)
at
org.apache.sysml.runtime.controlprogram.caching.CacheableData.readBlobFromHDFS(CacheableData.java:998)
at
org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireRead(CacheableData.java:455)
... 20 more
{code}
[~mboehm7] Can you take another look and see if there are any more changes that
might need to be made in the API related to this issue? This is currently a
blocker for the project I'm working on.
> Error converting Matrix to Spark DataFrame with MLContext After Subsequent
> Executions
> -------------------------------------------------------------------------------------
>
> Key: SYSTEMML-869
> URL: https://issues.apache.org/jira/browse/SYSTEMML-869
> Project: SystemML
> Issue Type: Bug
> Components: APIs
> Reporter: Mike Dusenberry
> Assignee: Matthias Boehm
> Priority: Blocker
> Fix For: SystemML 0.11
>
>
> Running the LeNet deep learning example notebook with the new {{MLContext}}
> API in Python results in the below error when converting the resulting
> {{Matrix}} to a Spark {{DataFrame}} via the {{toDF()}} call. This only
> occurs with the large LeNet example, and not for the similar "Softmax
> Classifier" example that has a smaller model.
> {code}
> Py4JJavaError: An error occurred while calling o34.asDataFrame.
> : org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/Users/mwdusenb/Documents/Code/systemML/deep_learning/examples/scratch_space/_p85157_9.31.116.142/_t0/temp816_133
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
> at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:642)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:642)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at
> org.apache.spark.rdd.PairRDDFunctions.groupByKey(PairRDDFunctions.scala:641)
> at org.apache.spark.api.java.JavaPairRDD.groupByKey(JavaPairRDD.scala:538)
> at
> org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt.binaryBlockToDataFrame(RDDConverterUtilsExt.java:502)
> at
> org.apache.sysml.api.mlcontext.MLContextConversionUtil.matrixObjectToDataFrame(MLContextConversionUtil.java:762)
> at org.apache.sysml.api.mlcontext.Matrix.asDataFrame(Matrix.java:111)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:209)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> To set up, I used the instructions [here |
> https://github.com/dusenberrymw/systemml-nn/tree/master/examples], running
> the {{Example - MNIST LeNet.ipynb}} notebook. Additionally, to speed up the
> actual training time, I modified [lines 84 & 85 of mnist_lenet.dml |
> https://github.com/dusenberrymw/systemml-nn/blob/master/examples/mnist_lenet.dml#L84]
> to set {{epochs = 1}} and {{iters = 1}}.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)