OK, this is interesting:

Scenario 1

I slightly modified 'sample.dml' to add statements that print the dimensions of SM, P and iX, and ran it on the same data. The dimensions AND the output were correct. That is, subsets '1' and '2' contain roughly 80% and 20% of the original data.

Please see attached:
sample-debug.dml: sample.dml with 3 print functions inserted
train-test-debug_1.mtd
train-test-debug_2.mtd: metadata of outputs. Note 'rows' are correct.

Scenario 2

This is confusing, so I commented out the 'print' statements in 'sample.dml' and ran it on the same data, and the output was INCORRECT. That is, subsets '1' and '2' contain the same number of rows as the original data.

Please see attached:
sample-debug-noprint.dml: 3 print functions were commented out
train-test-debug-noprint_1.mtd
train-test-debug-noprint_2.mtd: metadata of outputs. Note 'rows' are incorrect.

There were no errors in either trial.

Ethan

On Thu, Apr 14, 2016 at 4:37 PM, Ethan Xu <[email protected]> wrote:

> Hello,
>
> I encountered an unexpected behavior from 'sample.dml' on a dataset on
> Hadoop. Instead of splitting the data, it replaced rows of original data
> with 0's. Here are the details:
>
> I called sample.dml in attempt to split is a 35 million by 2396 numeric
> matrix to two 80% and 20% subsets. The two outcome subsets '1' and '2' both
> still contain 35 million rows, instead of 35*80% and 35*20% rows.
>
> However it looks like 20% of the rows in '1' are replaced with 0's (but
> not removed). It is as if line 66 of sample.dml (
> https://github.com/apache/incubator-systemml/blob/master/scripts/utils/sample.dml)
> that calls removeEmpty() doesn't exist.
>
> Here is the submission script:
>
> printf "0.8\n0.2" | hadoop fs -put - /path/split-perc.csv
> echo '{"data_type": "matrix", "value_type":"double", "rows": 2, "cols": 1,
> "format": "csv"}' | hadoop fs -put - /path/split-perc.csv.mtd
>
> ## Split file to training and test sets
> hadoop jar $sysJar /path.to.systemML/scripts/utils/sample.dml
> -config=$sysConfCust -nvargs X=/path/originalData.csv
> sv=/path/split-perc.csv O=/path/train-test ofmt=csv
>
>
> There was no error messages and all MR jobs were executed successfully.
> What other information can I provide to diagnose the issue?
>
> Thanks,
>
> Ethan
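
For anyone who wants to repeat the 'rows' comparison from the two scenarios, here is a rough shell sketch for reading the "rows" field out of a .mtd file and checking whether a subset was actually reduced. The helper names mtd_rows and check_subset are hypothetical, and the HDFS paths in the usage comment are assumptions based on O=/path/train-test (including the existence of an .mtd file next to the original data), so adjust them as needed.

```shell
# Extract the "rows" value from .mtd JSON read on stdin.
# Assumes the value is a plain non-negative integer.
mtd_rows() {
  sed -n 's/.*"rows"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Compare a subset's row count with the original's.
# usage: check_subset TOTAL_ROWS SUBSET_ROWS
check_subset() {
  if [ "$2" -eq "$1" ]; then
    # Same symptom as Scenario 2: nothing was removed.
    echo "NOT SPLIT: subset still has all $1 rows"
  else
    echo "split: $2 of $1 rows"
  fi
}

# Hypothetical usage (paths are assumptions, not taken from the actual run):
#   total=$(hadoop fs -cat /path/originalData.csv.mtd | mtd_rows)
#   for i in 1 2; do
#     rows=$(hadoop fs -cat /path/train-test_$i.mtd | mtd_rows)
#     check_subset "$total" "$rows"
#   done
```

This only inspects the metadata, so it would not reveal whether rows were zeroed out inside a file; it just flags the case where both subsets report the full row count.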
