'sample.dml' replaces rows with 0's

Ethan Xu Thu, 14 Apr 2016 13:37:58 -0700

Hello,

I encountered unexpected behavior from 'sample.dml' on a dataset on
Hadoop. Instead of splitting the data, it replaced rows of the original
data with 0's. Here are the details:


I called sample.dml in an attempt to split a 35 million by 2396 numeric
matrix into two subsets of 80% and 20%. The two output subsets '1' and '2'
both still contain 35 million rows, instead of 28 million (80%) and
7 million (20%) rows.

However, it looks like 20% of the rows in '1' are replaced with 0's (but
not removed). It is as if the removeEmpty() call on line 66 of sample.dml (
https://github.com/apache/incubator-systemml/blob/master/scripts/utils/sample.dml)
were never executed.
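
One way to quantify this is to count the all-zero rows in each output. The
following is only a minimal sketch and is not taken from sample.dml:
count_zero_rows is a hypothetical helper name, and it assumes the output is
plain CSV with no header line.

```shell
# Hypothetical helper: reads CSV on stdin and prints
# "<total rows> <all-zero rows>".
count_zero_rows() {
  awk -F',' '
    NF == 0 { next }                 # skip blank lines
    {
      total++
      all_zero = 1
      for (i = 1; i <= NF; i++) {
        # "$i + 0" forces a numeric comparison, so 0, 0.0 and 0e0 all match
        if ($i + 0 != 0) { all_zero = 0; break }
      }
      if (all_zero) zero++
    }
    END { printf "%d %d\n", total, zero }
  '
}

# Small local check: 4 rows, 2 of which are all zeros
printf '1,2\n0,0\n0.0,0\n3,0\n' | count_zero_rows   # prints: 4 2
```

On the cluster this could be fed with something like
"hadoop fs -cat /path/train-test/1/* | count_zero_rows", assuming output '1'
is written as a directory of CSV part files; that layout is an assumption.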

Here is the submission script:

printf "0.8\n0.2" | hadoop fs -put - /path/split-perc.csv
echo '{"data_type": "matrix", "value_type":"double", "rows": 2, "cols": 1,
"format": "csv"}' | hadoop fs -put - /path/split-perc.csv.mtd

## Split file to training and test sets
hadoop jar $sysJar /path.to.systemML/scripts/utils/sample.dml \
  -config=$sysConfCust -nvargs X=/path/originalData.csv \
  sv=/path/split-perc.csv O=/path/train-test ofmt=csv


There were no error messages, and all MR jobs executed successfully.
What other information can I provide to diagnose the issue?

Thanks,

Ethan
