Hello, I encountered unexpected behavior from 'sample.dml' on a dataset on Hadoop. Instead of splitting the data, it replaced rows of the original data with 0's. Here are the details:
I called sample.dml in an attempt to split a 35 million by 2396 numeric matrix into two subsets of 80% and 20%. The two output subsets '1' and '2' both still contain 35 million rows, instead of roughly 28 million (80%) and 7 million (20%) rows. However, it looks like 20% of the rows in '1' have been replaced with 0's (but not removed). It is as if line 66 of sample.dml (https://github.com/apache/incubator-systemml/blob/master/scripts/utils/sample.dml), which calls removeEmpty(), doesn't exist.

Here is the submission script:

    printf "0.8\n0.2" | hadoop fs -put - /path/split-perc.csv
    echo '{"data_type": "matrix", "value_type":"double", "rows": 2, "cols": 1, "format": "csv"}' | hadoop fs -put - /path/split-perc.csv.mtd

    ## Split file to training and test sets
    hadoop jar $sysJar /path.to.systemML/scripts/utils/sample.dml -config=$sysConfCust -nvargs X=/path/originalData.csv sv=/path/split-perc.csv O=/path/train-test ofmt=csv

There were no error messages, and all MR jobs executed successfully.

What other information can I provide to help diagnose the issue?

Thanks,
Ethan
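
P.S. In case it helps with reproducing the symptom, here is a minimal sketch of how the row count and the number of all-zero rows in an output subset could be measured. It is not part of SystemML: the function name count_rows and the script name used below are hypothetical, and it assumes the output is dense numeric CSV with no header line.

```python
import csv
import sys


def count_rows(lines):
    """Return (total_rows, all_zero_rows) for an iterable of CSV lines.

    Assumes every field is numeric (dense CSV, no header line).
    """
    total = 0
    zero = 0
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        total += 1
        # A row counts as "all zero" only if every field parses to 0.
        if all(float(v) == 0.0 for v in fields):
            zero += 1
    return total, zero


if __name__ == "__main__":
    total, zero = count_rows(sys.stdin)
    print(f"rows={total} all_zero_rows={zero}")
```

Saved as, say, count_zero_rows.py, it could be fed from HDFS with something like "hadoop fs -cat /path/train-test/1 | python count_zero_rows.py". The output path /path/train-test/1 is my assumption about where subset '1' is written; if it is a directory of part files, the cat would need to cover all of them. For a correct 80% split I would expect roughly 28 million rows in '1' and few or no all-zero rows.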
