What kind of Vector is failing to set() in that code? About the state enum, what if (for whatever reason, not multi-threaded-ness) there are multiple iterators to that vector? Something like a reference count (how many iterators point to it) would probably be needed, and keeping it sane would only be possible in one thread. Although this seems kind of brittle.
+1 for numNonDefault. On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <[email protected]> wrote: > Another behavior difference. > > The numNonDefaultElement for a DenseVector returns the total length. This > causes Pearson Correlation Similarity to differ from if it was implemented > using on of the SparseVector. > I am proposing to fix the numNonDefaultElement to correctly iterate over > the dense vector to figure out non zero values ? Sounds ok > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. > > > On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <[email protected]> wrote: > >> Found the bug PearsonCorrelationSimilarity was trying to mutate the >> object while iterating. >> >> >> 1. while (it.hasNext()) { >> 2. Vector.Element e = it.next(); >> 3. *vector.set(e.index(),* e.get() - average); >> 4. } >> >> This has a side effect of causing the underlying hash-map or object to >> change. >> >> The right behavior is to set the value of the index while iterating. >> >> 1. while (it.hasNext()) { >> 2. Vector.Element e = it.next(); >> 3. *e.set(e.get()* - average); >> 4. } >> >> I am sure we are incorrectly doing the first style across the code at >> many places. >> >> I am proposing this >> >> When iterating, we lock the set interface on the vector using a State >> enum. If anyone tries to mutate, we throw an exception. >> We flip the state when we complete iterating (hasNext = false) or when we >> explicitly close the iterator (adding a close method on the iterator). >> >> Again this is all a single thread fix. if a vector is being mutated and >> iterated across multiple threads, all hell can break loose. >> >> Robin >> >> >> >> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <[email protected]>wrote: >> >>> Spoke too soon still failure. I am uploading the latest patch. These >>> are the current failing tests. >>> >>> >>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88 >>> not expecting cluster:{0:1.0,1:1.0} >>> >>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88 >>> not expecting cluster:{0:1.0,1:1.0} >>> >>> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 >>> null >>> >>> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 >>> null >>> >>> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88 >>> expected:<0.5303300858899108> but was:<0.38729833462074176> >>> >>> >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>> >>> >>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <[email protected]>wrote: >>> >>>> Found it, fixed it. I am submitting soon. >>>> >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>>> >>>> >>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <[email protected]>wrote: >>>> >>>>> Robin, >>>>> >>>>> Can you make sure that the patches are somewhere that Dan can pick up >>>>> this >>>>> work? He is in GMT+2 and is probably about to appear on the scene. >>>>> >>>>> >>>>> >>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <[email protected]> >>>>> wrote: >>>>> >>>>> > Strike that there are still failures. Investigating. if I cant fix >>>>> it in >>>>> > the next hour, I will submit them sometime in the evening tomorrow. >>>>> > >>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>>>> > >>>>> > >>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <[email protected]> >>>>> wrote: >>>>> > >>>>> > > Tests pass. Submitting the patches. >>>>> > > >>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>>>> > > >>>>> > > >>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <[email protected] >>>>> > >>>>> > wrote: >>>>> > > >>>>> > >> Added a few more tests. Throw NoSuchElementException like Java >>>>> > >> Collections when iterating past the end. Things look solid, >>>>> performance >>>>> > is >>>>> > >> 2x. All Math tests pass. I am now waiting for the entire test >>>>> suites to >>>>> > run >>>>> > >> before submitting. >>>>> > >> >>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>>>> > >> >>>>> > >> >>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <[email protected] >>>>> > >>>>> > wrote: >>>>> > >> >>>>> > >>> I am not sure what I did. But removing Guava Abstract iterator >>>>> actually >>>>> > >>> sped up the dot, cosine, euclidean by another 60%. Things are >>>>> now 2x >>>>> > faster >>>>> > >>> than trunk. While also correcting the behavior (I hope) >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>>>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1 >>>>> > >>> >>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>>>> > >>> >>>>> > >>> >>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil < >>>>> [email protected] >>>>> > >wrote: >>>>> > >>> >>>>> > >>>> Also note that this is code gen, I have to create >>>>> > Element$keyType$Value >>>>> > >>>> for each and every combination not just int double. and also >>>>> update >>>>> > all >>>>> > >>>> callers to user ElementIntDouble instead of Element. Is it >>>>> worth it ? >>>>> > >>>> >>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>>>> > >>>> >>>>> > >>>> >>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning < >>>>> [email protected] >>>>> > >wrote: >>>>> > >>>> >>>>> > >>>>> Collections (no longer colt collections) are now part of >>>>> mahout math. >>>>> > >>>>> No >>>>> > >>>>> need to keep them separate. The lower iterator can reference >>>>> > >>>>> Vector.Element >>>>> > >>>>> >>>>> > >>>>> >>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil < >>>>> [email protected]> >>>>> > >>>>> wrote: >>>>> > >>>>> >>>>> > >>>>> > I would have loved to but Element is a sub interface in >>>>> Vector. If >>>>> > >>>>> we want >>>>> > >>>>> > to keep colt collections separate we have to keep this >>>>> separation. >>>>> > >>>>> > >>>>> > >>>>> >>>>> > >>>> >>>>> > >>>> >>>>> > >>> >>>>> > >> >>>>> > > >>>>> > >>>>> >>>> >>>> >>> >> >
