[ 
https://issues.apache.org/jira/browse/SYSTEMML-869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15486037#comment-15486037
 ] 

Mike Dusenberry commented on SYSTEMML-869:
------------------------------------------

cc [~mboehm7] Unfortunately, I've run into this issue again with a different 
set of scripts via MLContext.

Same idea though: 
* Three different scripts run in succession via MLContext.
* First script takes in DataFrames, does some matrix math, and returns 
{{Matrix}} objects {{X}}, {{X_val}}, {{Y}}, & {{Y_val}}.
* Second script takes in {{X}}, {{X_val}}, {{Y}}, & {{Y_val}}, trains a model, 
and returns model coefficient {{Matrix}} objects {{W}}, {{b}}.
* Third script takes in all {{Matrix}} objects, {{X}}, {{X_val}}, {{Y}}, 
{{Y_val}}, {{W}}, & {{b}}, and returns a couple of scalar values evaluating 
performance of the model.
* The first two scripts run just fine, but the third script fails with a 
{{Caused by: java.io.IOException: File 
scratch_space/_p40317_9.30.110.134/_t0/temp2648_51476 does not exist on 
HDFS/LFS.}} error.
* If I instead copy the math from the first script into both the second & third 
scripts and pass the original DataFrames into the second & third scripts (thus 
performing the conversions of the original data both times), the second and third 
scripts run fine.
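
For reference, the failing pattern looks roughly like the following sketch. This is illustrative only: the DML bodies ({{prep_dml}}, {{train_dml}}, {{eval_dml}}) and the variable names are placeholders, not the actual project code, and it assumes the standard Python {{MLContext}} API ({{dml(...).input(...).output(...)}} with {{ml.execute(...).get(...)}}):

{code}
# Illustrative sketch only -- DML strings and names below are placeholders.
from systemml import MLContext, dml

ml = MLContext(sc)  # sc: an existing SparkContext

# Script 1: takes DataFrames, does some matrix math, returns Matrix objects.
s1 = dml(prep_dml).input(train=train_df, val=val_df) \
                  .output("X", "X_val", "Y", "Y_val")
X, X_val, Y, Y_val = ml.execute(s1).get("X", "X_val", "Y", "Y_val")

# Script 2: trains on those Matrix objects, returns coefficients W, b.
s2 = dml(train_dml).input(X=X, X_val=X_val, Y=Y, Y_val=Y_val).output("W", "b")
W, b = ml.execute(s2).get("W", "b")

# Script 3: evaluation -- this is where the "File scratch_space/... does not
# exist on HDFS/LFS" error surfaces, as if an intermediate from script 1 was
# evicted to scratch_space and its backing file later removed.
s3 = dml(eval_dml).input(X=X, X_val=X_val, Y=Y, Y_val=Y_val, W=W, b=b) \
                  .output("train_metric", "val_metric")
train_metric, val_metric = ml.execute(s3).get("train_metric", "val_metric")
{code}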


{code}
: org.apache.sysml.api.mlcontext.MLContextException: Exception when executing script
        at org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:301)
        at org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:271)
        at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.sysml.api.mlcontext.MLContextException: Exception occurred while executing runtime program
        at org.apache.sysml.api.mlcontext.ScriptExecutor.executeRuntimeProgram(ScriptExecutor.java:381)
        at org.apache.sysml.api.mlcontext.ScriptExecutor.execute(ScriptExecutor.java:324)
        at org.apache.sysml.api.mlcontext.MLContext.execute(MLContext.java:293)
        ... 11 more
Caused by: org.apache.sysml.runtime.DMLRuntimeException: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 1 and 18 -- Error evaluating instruction: CP°-°0·SCALAR·INT·true°Y·MATRIX·DOUBLE°_mVar5272·MATRIX·DOUBLE
        at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:152)
        at org.apache.sysml.api.mlcontext.ScriptExecutor.executeRuntimeProgram(ScriptExecutor.java:379)
        ... 13 more
Caused by: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 1 and 18 -- Error evaluating instruction: CP°-°0·SCALAR·INT·true°Y·MATRIX·DOUBLE°_mVar5272·MATRIX·DOUBLE
        at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:335)
        at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:224)
        at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:168)
        at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:145)
        ... 14 more
Caused by: org.apache.sysml.runtime.controlprogram.caching.CacheException: Reading of scratch_space//_p40317_9.30.110.134//_t0/temp2648_51476 (_mVar4813) failed.
        at org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireRead(CacheableData.java:476)
        at org.apache.sysml.runtime.controlprogram.context.ExecutionContext.getMatrixInput(ExecutionContext.java:241)
        at org.apache.sysml.runtime.instructions.cp.ScalarMatrixArithmeticCPInstruction.processInstruction(ScalarMatrixArithmeticCPInstruction.java:49)
        at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:305)
        ... 17 more
Caused by: java.io.IOException: File scratch_space/_p40317_9.30.110.134/_t0/temp2648_51476 does not exist on HDFS/LFS.
        at org.apache.sysml.runtime.io.MatrixReader.checkValidInputFile(MatrixReader.java:147)
        at org.apache.sysml.runtime.io.ReaderBinaryBlockParallel.readMatrixFromHDFS(ReaderBinaryBlockParallel.java:67)
        at org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:291)
        at org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:250)
        at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:476)
        at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:61)
        at org.apache.sysml.runtime.controlprogram.caching.CacheableData.readBlobFromHDFS(CacheableData.java:998)
        at org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireRead(CacheableData.java:455)
        ... 20 more
{code}

[~mboehm7] Can you take another look and see if there are any more changes that 
might need to be made in the API related to this issue? This is currently a 
blocker for the project I'm working on.

> Error converting Matrix to Spark DataFrame with MLContext After Subsequent 
> Executions
> -------------------------------------------------------------------------------------
>
>                 Key: SYSTEMML-869
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-869
>             Project: SystemML
>          Issue Type: Bug
>          Components: APIs
>            Reporter: Mike Dusenberry
>            Assignee: Matthias Boehm
>            Priority: Blocker
>             Fix For: SystemML 0.11
>
>
> Running the LeNet deep learning example notebook with the new {{MLContext}} 
> API in Python results in the below error when converting the resulting 
> {{Matrix}} to a Spark {{DataFrame}} via the {{toDF()}} call.  This only 
> occurs with the large LeNet example, and not for the similar "Softmax 
> Classifier" example that has a smaller model. 
> {code}
> Py4JJavaError: An error occurred while calling o34.asDataFrame.
> : org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/mwdusenb/Documents/Code/systemML/deep_learning/examples/scratch_space/_p85157_9.31.116.142/_t0/temp816_133
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
>     at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>     at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
>     at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:642)
>     at org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$3.apply(PairRDDFunctions.scala:642)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>     at org.apache.spark.rdd.PairRDDFunctions.groupByKey(PairRDDFunctions.scala:641)
>     at org.apache.spark.api.java.JavaPairRDD.groupByKey(JavaPairRDD.scala:538)
>     at org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt.binaryBlockToDataFrame(RDDConverterUtilsExt.java:502)
>     at org.apache.sysml.api.mlcontext.MLContextConversionUtil.matrixObjectToDataFrame(MLContextConversionUtil.java:762)
>     at org.apache.sysml.api.mlcontext.Matrix.asDataFrame(Matrix.java:111)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:497)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>     at py4j.Gateway.invoke(Gateway.java:259)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:209)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> To set up, I used the instructions [here | 
> https://github.com/dusenberrymw/systemml-nn/tree/master/examples], running 
> the {{Example - MNIST LeNet.ipynb}} notebook.  Additionally, to speed up the 
> actual training time, I modified [lines 84 & 85 of mnist_lenet.dml | 
> https://github.com/dusenberrymw/systemml-nn/blob/master/examples/mnist_lenet.dml#L84]
>  to set {{epochs = 1}} and {{iters = 1}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
