Thank you, Shirish and Matthias, for looking into this issue. I have some small updates from additional runs.
Shirish: Hmm, my browser told me that the scripts were attached, so there must have been a connection issue. I attached them again to this email; hope they get through this time. I also tested the same scripts on small toy data in local mode, and they behaved correctly.

Matthias: you mentioned that in your testsuite the metadata was incorrect but the dataset itself looked OK. In my case both the metadata and the data seem to be incorrect. Here is how this was confirmed:

The output of sample-debug-noprint.dml (attached) contains 4 files: "1", "1.mtd" (attached as train-test-debug-noprint-1.mtd), "2", and "2.mtd" (attached as train-test-debug-noprint-2.mtd). The auto-generated metadata indicates there are 35478061 rows in "1".

1. I replaced the automatically generated metadata file of "1" with a generic one (attached as 1-generic.mtd) which does not specify the number of rows.
2. I ran a script (attached as countzeros.dml) to find the number of rows, as well as the number of 0's in each column of "1". The script reported 35479057 rows in "1", which is 996 more than what's shown in the metadata (???).
3. I ran the same script to count the rows and 0's of the original dataset on which 'sample-debug-print.dml' was run. The number of rows was 35478061.
4. I computed the difference in the number of 0's (by column) between the original data and "1". The columns that contained no 0's in the original dataset had 7099710 zeros in "1", which is roughly 20% of the row count.
5. Therefore it still looks like 'sample-debug-noprint.dml' randomly replaced 20% of the rows with 0's but didn't remove them.

Also, the sizes of the original data and "1" are 178G and 186.3G on HDFS, respectively. I used a custom configuration for all the submissions. The configuration file is also attached.
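For reference, the row/zero-count check in step 2 can be sketched in DML roughly as follows. This is a hypothetical reconstruction, not the attached countzeros.dml; `$X` and `$out` are assumed nvargs for the input and output paths:

```dml
# Hypothetical sketch of a countzeros-style check (not the attached script).
# ppred(X, 0, "==") yields a 0/1 indicator matrix marking the zero cells.
X = read($X);                              # the generic .mtd need not specify rows
print("rows: " + nrow(X));                 # row count as actually read from HDFS
zeroCounts = colSums(ppred(X, 0, "=="));   # number of 0's per column
write(zeroCounts, $out, format="csv");
```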
Thanks,
Ethan

On Fri, Apr 15, 2016 at 12:41 AM, Matthias Boehm <[email protected]> wrote:

> well, it looks like an issue of incorrect meta data propagation (wrong
> propagation of dimensions through mr pmm instructions). The data itself
> looks good if I write a 20% sample to textcell (what is used in our
> testsuite).
>
> @Shirish: thanks for looking into it. Just fyi, while testing this on an
> ultra-sparse scenario, I also encountered a runtime issue of deep copying
> sparse rows (fix will be available tomorrow), so for now don't worry about
> it if you encounter the same issue.
>
> Regards,
> Matthias
>
> From: Shirish Tatikonda <[email protected]>
> To: [email protected]
> Date: 04/14/2016 08:43 PM
> Subject: Re: 'sample.dml' replaces rows with 0's
> ------------------------------
>
> Hi Ethan,
>
> I just tried the script on a toy data and I could reproduce this erroneous
> behavior when run in Hadoop mode -- both local and Spark modes are good. I
> will look into it.
>
> BTW, you forgot to attach the scripts.
>
> Shirish
>
> On Thu, Apr 14, 2016 at 5:02 PM, Ethan Xu <[email protected]> wrote:
>
> > OK this is interesting:
> >
> > Scenario 1
> > I slightly modified 'sample.dml' to add statements to print dimensions of
> > SM, P and iX, and ran it on the same data. The dimensions AND the output
> > were correct. That is, subsets '1' and '2' contain roughly 80% and 20% of
> > the original data.
> >
> > Please see attached:
> > sample-debug.dml:
> >   sample.dml with 3 print functions inserted
> > train-test-debug_1.mtd
> > train-test-debug_2.mtd:
> >   meta data of outputs. Note 'rows' are correct.
> >
> > Scenario 2
> > This is confusing, so I commented out the 'print' statements in
> > 'sample.dml' and ran it on the same data, and the output was INCORRECT.
> > That is, subsets '1' and '2' contain the same rows as the original data.
> >
> > Please see attached:
> > sample-debug-noprint.dml:
> >   3 print functions were commented out
> > train-test-debug-noprint_1.mtd
> > train-test-debug-noprint_2.mtd
> >   meta data of outputs. Note 'rows' are incorrect.
> >
> > There were no errors in either trial.
> >
> > Ethan
> >
> > On Thu, Apr 14, 2016 at 4:37 PM, Ethan Xu <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> I encountered an unexpected behavior from 'sample.dml' on a dataset on
> >> Hadoop. Instead of splitting the data, it replaced rows of the original
> >> data with 0's. Here are the details:
> >>
> >> I called sample.dml in an attempt to split a 35 million by 2396 numeric
> >> matrix into two subsets of 80% and 20%. The two resulting subsets '1' and
> >> '2' both still contain 35 million rows, instead of 35M*80% and 35M*20% rows.
> >>
> >> However, it looks like 20% of the rows in '1' were replaced with 0's (but
> >> not removed). It is as if line 66 of sample.dml (
> >> https://github.com/apache/incubator-systemml/blob/master/scripts/utils/sample.dml )
> >> that calls removeEmpty() doesn't exist.
> >>
> >> Here is the submission script:
> >>
> >> printf "0.8\n0.2" | hadoop fs -put - /path/split-perc.csv
> >> echo '{"data_type": "matrix", "value_type": "double", "rows": 2, "cols":
> >> 1, "format": "csv"}' | hadoop fs -put - /path/split-perc.csv.mtd
> >>
> >> ## Split file to training and test sets
> >> hadoop jar $sysJar /path.to.systemML/scripts/utils/sample.dml
> >> -config=$sysConfCust -nvargs X=/path/originalData.csv
> >> sv=/path/split-perc.csv O=/path/train-test ofmt=csv
> >>
> >> There were no error messages and all MR jobs executed successfully.
> >> What other information can I provide to diagnose the issue?
> >>
> >> Thanks,
> >>
> >> Ethan
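For context, the split idiom that sample.dml builds on can be sketched in DML roughly as below. This is a simplified illustration under assumed names, not the actual script (which uses a permutation-matrix multiply); it shows why skipping the removeEmpty() step would leave zeroed rows in place, exactly matching the observed output:

```dml
# Simplified sketch of a sample/split idiom (assumed, not copied from sample.dml).
X  = read($X);
R  = rand(rows=nrow(X), cols=1, min=0.0, max=1.0);
S1 = ppred(R, 0.8, "<=");     # 0/1 row selector for the ~80% subset
P1 = X * S1;                  # zero out the non-selected rows (column-vector broadcast)
Y1 = removeEmpty(target=P1, margin="rows");  # drop the all-zero rows
write(Y1, $O + "/1", format=$ofmt);
```

Without the removeEmpty() call, Y1 would keep the full row count and contain ~20% all-zero rows, which is what the buggy Hadoop-mode run produces.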
<!--
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
-->
<root>
   <mapred.child.java.opts>-Xms904m -Xmx904m -Xmn90m -XX:+UseParallelGC -XX:-UseParallelOldGC</mapred.child.java.opts>
   <mapred.job.shuffle.merge.percent>0.15</mapred.job.shuffle.merge.percent>
   <mapred.job.shuffle.input.buffer.percent>0.15</mapred.job.shuffle.input.buffer.percent>
   <mapred.job.reduce.input.buffer.percent>0</mapred.job.reduce.input.buffer.percent>

   <!-- local fs tmp working directory -->
   <localtmpdir>/usr/local/explorys/files/ethan.xu/tmp/systemml</localtmpdir>

   <!-- hdfs tmp working directory -->
   <scratch>scratch_space</scratch>

   <!-- compiler optimization level, valid values: 0 | 1 | 2 | 3 | 4, default: 2 -->
   <optlevel>2</optlevel>

   <!-- default number of reduce tasks per MR job, default: 2 x number of nodes -->
   <numreducers>10</numreducers>

   <!-- override jvm reuse flag for specific MR jobs, valid values: true | false -->
   <jvmreuse>false</jvmreuse>

   <!-- default block dim for binary block files -->
   <defaultblocksize>1000</defaultblocksize>

   <!-- run systemml control program as yarn appmaster, in case of MR1 always falls back to client, please disable for debug mode -->
   <dml.yarn.appmaster>false</dml.yarn.appmaster>

   <!-- maximum jvm heap size of the dml yarn appmaster in MB, the requested memory is 1.5x this parameter -->
   <dml.yarn.appmaster.mem>2048</dml.yarn.appmaster.mem>

   <!-- maximum jvm heap size of the map/reduce tasks in MB, the requested memory is 1.5x this parameter, negative values ignored -->
   <dml.yarn.mapreduce.mem>2048</dml.yarn.mapreduce.mem>

   <!-- yarn application submission queue, relevant for default capacity scheduler -->
   <dml.yarn.app.queue>default</dml.yarn.app.queue>

   <!-- enables multi-threaded matrix multiplications in singlenode control program -->
   <cp.parallel.matrixmult>true</cp.parallel.matrixmult>

   <!-- enables multi-threaded read/write of text formats in singlenode control program -->
   <cp.parallel.textio>true</cp.parallel.textio>
</root>
