Committed. I moved the getNumNonDefaultElements() change to a new method getNumNonZeroElements() and used that for Pearson Correlation. The getNumNonDefaultElements() will work as before, that really needs to be a private method.
http://svn.apache.org/viewvc?view=revision&revision=1468209 Sync, let me know if you find any issues with this. Robin On Mon, Apr 15, 2013 at 2:30 PM, Robin Anil <[email protected]> wrote: > Iterable is a safer interface, you can implement non-zero-ness check > easily. Iterator is not. > > I think I have fixed all the failing tests (They were failing because the > asFormatString order seems to have changed with the new iterators) > > https://reviews.apache.org/r/10455/diff/6/ > > > > On Mon, Apr 15, 2013 at 2:21 PM, Jake Mannix <[email protected]>wrote: > >> On Mon, Apr 15, 2013 at 12:14 PM, Robin Anil <[email protected]> >> wrote: >> >> > Another crazy idea for the future is to kill the usage of >> > OpenIntDoubleHashMap entirely and copy parts of it inside RASV which >> will >> > only deal with nonzero keys and non zero values. RASV can then keep >> track >> > of non-zero elements in a variable to speed up those lookups. >> > >> > >> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >> > >> > >> > On Mon, Apr 15, 2013 at 2:11 PM, Robin Anil <[email protected]> >> wrote: >> > >> > > The point 3 is coming from the philosophy that all Vectors behave the >> > same >> > > way and numNonDefaultElements of a DenseVector is same as that of a >> > > SparseVector. Eg, if PersonSimilarity relies upon it for document >> length, >> > > it should be behave the same way. >> > > >> > > The point 4 can be solved by killing the iterator interface entirely >> and >> > > creating forEachNonZero(function()) method which will only call if the >> > > element is nonzero. >> > >> >> Killing iteration would be really really bad, from a useability >> standpoint. >> In fact, >> I've been moving in the other direction: >> https://reviews.apache.org/r/9867/ >> adds iterators to the basic collection interface! >> >> >> >> > > >> > > >> > > >> > > On Mon, Apr 15, 2013 at 2:08 PM, Jake Mannix <[email protected] >> > >wrote: >> > > >> > >> On Mon, Apr 15, 2013 at 11:58 AM, Robin Anil <[email protected]> >> > >> wrote: >> > >> >> > >> > This is what I propose: >> > >> > >> > >> > 1) Allow setting value to zero while iterating (e.set(0.0)). >> > >> > >> > >> >> > >> This is in addition to the fact that we already allow setting nonzero >> > >> values >> > >> while iterating, right? >> > >> >> > >> >> > >> > 2) Do not allow callers to use vector.set(index, 0.0) during >> > iterating). >> > >> > This can cause re-hashing. (Can set a dirty bit in the hashmap >> during >> > >> > rehash to throw a concurrent modified exception) >> > >> > >> > >> >> > >> Agreed - this is a commonly accepted requirement: I think in fact we >> > >> should pro-actively throw ConcurrentModificationException if someone >> > >> tries to call vector.set / vector.assign while iterating. >> > >> >> > >> >> > >> > 3) Update the numNonDefaultElements to iterate over the array to >> > >> discount >> > >> > 0.0 instead of returning the hashMap values. >> > >> > 4) IterateNonZero may iterate over a few zeros if you did set the >> > >> dimension >> > >> > to 0. Most of the statistics code should handle 0 values correctly. >> > >> > >> > >> >> > >> Yeah, are we really strict about getNumNonDefaultElements really >> always >> > >> returning exactly the number of nonzeroes? I was under the >> impression >> > >> that >> > >> for e.g. DenseVector, it would give the overal size, even if some >> were >> > 0, >> > >> and that it was basically tracking the amount of space the vector was >> > >> taking >> > >> up. But I can see the argument that it really should return what it >> > says >> > >> it >> > >> returns, if that is relied upon. >> > >> >> > >> >> > >> > >> > >> > >> > >> > >> > >> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >> > >> > >> > >> > >> > >> > On Mon, Apr 15, 2013 at 1:50 PM, Jake Mannix < >> [email protected]> >> > >> > wrote: >> > >> > >> > >> > > Ah, this was the one corner case I was worried about - we do >> > >> special-case >> > >> > > setting to 0, >> > >> > > as meaning remove from the hashmap, yes. >> > >> > > >> > >> > > What's the TL;DR of what you did to work around this? Should we >> > allow >> > >> > > this? Even >> > >> > > if it's through the Vector.Element instance, should it be ok? If >> > so, >> > >> how >> > >> > > to handle? >> > >> > > >> > >> > > >> > >> > > On Mon, Apr 15, 2013 at 11:04 AM, Robin Anil < >> [email protected]> >> > >> > wrote: >> > >> > > >> > >> > > > I am adding the tests and updating the patch. >> > >> > > > >> > >> > > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >> > >> > > > >> > >> > > > >> > >> > > > On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil < >> [email protected] >> > > >> > >> > > wrote: >> > >> > > > >> > >> > > > > You can re-iterate if the state is in iteration. But you >> cannot >> > >> > write. >> > >> > > > > >> > >> > > > > This is what is happening: >> > >> > > > > >> > >> > > > > One of the values are becoming 0. So Vector tries to remove >> it >> > >> from >> > >> > the >> > >> > > > > underlying hashmap. This changes the layout, if a vector has >> to >> > be >> > >> > > > mutated >> > >> > > > > while iterating, we have to set 0 value in the hashmap and >> not >> > >> remove >> > >> > > it >> > >> > > > > like what the Vector layer is doing. This adds another >> > complexity, >> > >> > the >> > >> > > > > vector iterator has to deal with skipping over elements with >> 0 >> > >> value. >> > >> > > > > >> > >> > > > > >> > >> > > > > Try this >> > >> > > > > >> > >> > > > > Create a vector of length 13 and set the following values. >> > >> > > > > >> > >> > > > > >> > >> > > > > 1. double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, >> 6, >> > >> 0, 1, >> > >> > > 1, >> > >> > > > > 2, 1 }; >> > >> > > > > 2. for (int i = 0; i < val.length; ++i) { >> > >> > > > > 3. vector.set(i, val[i]); >> > >> > > > > 4. } >> > >> > > > > >> > >> > > > > Iterate again and while iterating set one of the values as >> zero. >> > >> > > > > >> > >> > > > > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon < >> > >> > > > [email protected] >> > >> > > > > > wrote: >> > >> > > > > >> > >> > > > >> What kind of Vector is failing to set() in that code? >> > >> > > > >> >> > >> > > > >> About the state enum, what if (for whatever reason, not >> > >> > > > >> multi-threaded-ness) there are multiple iterators to that >> > vector? >> > >> > > > >> Something like a reference count (how many iterators point >> to >> > it) >> > >> > > would >> > >> > > > >> probably be needed, and keeping it sane would only be >> possible >> > in >> > >> > one >> > >> > > > >> thread. Although this seems kind of brittle. >> > >> > > > >> >> > >> > > > >> +1 for numNonDefault. >> > >> > > > >> >> > >> > > > >> >> > >> > > > >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil < >> > >> [email protected]> >> > >> > > > wrote: >> > >> > > > >> >> > >> > > > >>> Another behavior difference. >> > >> > > > >>> >> > >> > > > >>> The numNonDefaultElement for a DenseVector returns the >> total >> > >> > length. >> > >> > > > >>> This causes Pearson Correlation Similarity to differ from >> if >> > it >> > >> was >> > >> > > > >>> implemented using on of the SparseVector. >> > >> > > > >>> I am proposing to fix the numNonDefaultElement to correctly >> > >> iterate >> > >> > > > over >> > >> > > > >>> the dense vector to figure out non zero values ? Sounds ok >> > >> > > > >>> >> > >> > > > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google >> > Inc. >> > >> > > > >>> >> > >> > > > >>> >> > >> > > > >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil < >> > >> [email protected] >> > >> > > > >wrote: >> > >> > > > >>> >> > >> > > > >>>> Found the bug PearsonCorrelationSimilarity was trying to >> > mutate >> > >> > the >> > >> > > > >>>> object while iterating. >> > >> > > > >>>> >> > >> > > > >>>> >> > >> > > > >>>> 1. while (it.hasNext()) { >> > >> > > > >>>> 2. Vector.Element e = it.next(); >> > >> > > > >>>> 3. *vector.set(e.index(),* e.get() - average); >> > >> > > > >>>> 4. } >> > >> > > > >>>> >> > >> > > > >>>> This has a side effect of causing the underlying hash-map >> or >> > >> > object >> > >> > > to >> > >> > > > >>>> change. >> > >> > > > >>>> >> > >> > > > >>>> The right behavior is to set the value of the index while >> > >> > iterating. >> > >> > > > >>>> >> > >> > > > >>>> 1. while (it.hasNext()) { >> > >> > > > >>>> 2. Vector.Element e = it.next(); >> > >> > > > >>>> 3. *e.set(e.get()* - average); >> > >> > > > >>>> 4. } >> > >> > > > >>>> >> > >> > > > >>>> I am sure we are incorrectly doing the first style across >> the >> > >> code >> > >> > > at >> > >> > > > >>>> many places. >> > >> > > > >>>> >> > >> > > > >>>> I am proposing this >> > >> > > > >>>> >> > >> > > > >>>> When iterating, we lock the set interface on the vector >> > using a >> > >> > > State >> > >> > > > >>>> enum. If anyone tries to mutate, we throw an exception. >> > >> > > > >>>> We flip the state when we complete iterating (hasNext = >> > false) >> > >> or >> > >> > > when >> > >> > > > >>>> we explicitly close the iterator (adding a close method on >> > the >> > >> > > > iterator). >> > >> > > > >>>> >> > >> > > > >>>> Again this is all a single thread fix. if a vector is >> being >> > >> > mutated >> > >> > > > and >> > >> > > > >>>> iterated across multiple threads, all hell can break >> loose. >> > >> > > > >>>> >> > >> > > > >>>> Robin >> > >> > > > >>>> >> > >> > > > >>>> >> > >> > > > >>>> >> > >> > > > >>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil < >> > >> > [email protected] >> > >> > > > >wrote: >> > >> > > > >>>> >> > >> > > > >>>>> Spoke too soon still failure. I am uploading the latest >> > >> patch. >> > >> > > These >> > >> > > > >>>>> are the current failing tests. >> > >> > > > >>>>> >> > >> > > > >>>>> >> > >> > > > >> > >> > > >> > >> > >> > >> >> > >> >> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88 >> > >> > > > >>>>> not expecting cluster:{0:1.0,1:1.0} >> > >> > > > >>>>> >> > >> > > > >>>>> >> > >> > > > >> > >> > > >> > >> > >> > >> >> > >> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88 >> > >> > > > >>>>> not expecting cluster:{0:1.0,1:1.0} >> > >> > > > >>>>> >> > >> > > > >>>>> >> > >> > > > >> > >> > > >> > >> > >> > >> >> > >> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 >> > >> > > > >>>>> null >> > >> > > > >>>>> >> > >> > > > >>>>> >> > >> > > > >> > >> > > >> > >> > >> > >> >> > >> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 >> > >> > > > >>>>> null >> > >> > > > >>>>> >> > >> > > > >>>>> >> > >> > > > >> > >> > > >> > >> > >> > >> >> > >> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88 >> > >> > > > >>>>> expected:<0.5303300858899108> but >> was:<0.38729833462074176> >> > >> > > > >>>>> >> > >> > > > >>>>> >> > >> > > > >>>>> Robin Anil | Software Engineer | +1 312 869 2602 | >> Google >> > >> Inc. >> > >> > > > >>>>> >> > >> > > > >>>>> >> > >> > > > >>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil < >> > >> > [email protected] >> > >> > > > >wrote: >> > >> > > > >>>>> >> > >> > > > >>>>>> Found it, fixed it. I am submitting soon. >> > >> > > > >>>>>> >> > >> > > > >>>>>> Robin Anil | Software Engineer | +1 312 869 2602 | >> Google >> > >> Inc. >> > >> > > > >>>>>> >> > >> > > > >>>>>> >> > >> > > > >>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning < >> > >> > > > [email protected]>wrote: >> > >> > > > >>>>>> >> > >> > > > >>>>>>> Robin, >> > >> > > > >>>>>>> >> > >> > > > >>>>>>> Can you make sure that the patches are somewhere that >> Dan >> > >> can >> > >> > > pick >> > >> > > > >>>>>>> up this >> > >> > > > >>>>>>> work? He is in GMT+2 and is probably about to appear >> on >> > the >> > >> > > scene. >> > >> > > > >>>>>>> >> > >> > > > >>>>>>> >> > >> > > > >>>>>>> >> > >> > > > >>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil < >> > >> > > [email protected]> >> > >> > > > >>>>>>> wrote: >> > >> > > > >>>>>>> >> > >> > > > >>>>>>> > Strike that there are still failures. Investigating. >> if >> > I >> > >> > cant >> > >> > > > fix >> > >> > > > >>>>>>> it in >> > >> > > > >>>>>>> > the next hour, I will submit them sometime in the >> > evening >> > >> > > > tomorrow. >> > >> > > > >>>>>>> > >> > >> > > > >>>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | >> > Google >> > >> > Inc. >> > >> > > > >>>>>>> > >> > >> > > > >>>>>>> > >> > >> > > > >>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil < >> > >> > > > [email protected]> >> > >> > > > >>>>>>> wrote: >> > >> > > > >>>>>>> > >> > >> > > > >>>>>>> > > Tests pass. Submitting the patches. >> > >> > > > >>>>>>> > > >> > >> > > > >>>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | >> > >> Google >> > >> > > Inc. >> > >> > > > >>>>>>> > > >> > >> > > > >>>>>>> > > >> > >> > > > >>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil < >> > >> > > > >>>>>>> [email protected]> >> > >> > > > >>>>>>> > wrote: >> > >> > > > >>>>>>> > > >> > >> > > > >>>>>>> > >> Added a few more tests. Throw >> NoSuchElementException >> > >> like >> > >> > > Java >> > >> > > > >>>>>>> > >> Collections when iterating past the end. Things >> look >> > >> > solid, >> > >> > > > >>>>>>> performance >> > >> > > > >>>>>>> > is >> > >> > > > >>>>>>> > >> 2x. All Math tests pass. I am now waiting for the >> > >> entire >> > >> > > test >> > >> > > > >>>>>>> suites to >> > >> > > > >>>>>>> > run >> > >> > > > >>>>>>> > >> before submitting. >> > >> > > > >>>>>>> > >> >> > >> > > > >>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602| >> > >> Google >> > >> > > > Inc. >> > >> > > > >>>>>>> > >> >> > >> > > > >>>>>>> > >> >> > >> > > > >>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil < >> > >> > > > >>>>>>> [email protected]> >> > >> > > > >>>>>>> > wrote: >> > >> > > > >>>>>>> > >> >> > >> > > > >>>>>>> > >>> I am not sure what I did. But removing Guava >> > Abstract >> > >> > > > iterator >> > >> > > > >>>>>>> actually >> > >> > > > >>>>>>> > >>> sped up the dot, cosine, euclidean by another >> 60%. >> > >> Things >> > >> > > are >> > >> > > > >>>>>>> now 2x >> > >> > > > >>>>>>> > faster >> > >> > > > >>>>>>> > >>> than trunk. While also correcting the behavior (I >> > >> hope) >> > >> > > > >>>>>>> > >>> >> > >> > > > >>>>>>> > >>> >> > >> > > > >>>>>>> > >>> >> > >> > > > >>>>>>> > >> > >> > > > >>>>>>> >> > >> > > > >> > >> > > >> > >> > >> > >> >> > >> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1 >> > >> > > > >>>>>>> > >>> >> > >> > > > >>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602| >> > >> > Google >> > >> > > > Inc. >> > >> > > > >>>>>>> > >>> >> > >> > > > >>>>>>> > >>> >> > >> > > > >>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil < >> > >> > > > >>>>>>> [email protected] >> > >> > > > >>>>>>> > >wrote: >> > >> > > > >>>>>>> > >>> >> > >> > > > >>>>>>> > >>>> Also note that this is code gen, I have to >> create >> > >> > > > >>>>>>> > Element$keyType$Value >> > >> > > > >>>>>>> > >>>> for each and every combination not just int >> double. >> > >> and >> > >> > > also >> > >> > > > >>>>>>> update >> > >> > > > >>>>>>> > all >> > >> > > > >>>>>>> > >>>> callers to user ElementIntDouble instead of >> > Element. >> > >> Is >> > >> > it >> > >> > > > >>>>>>> worth it ? >> > >> > > > >>>>>>> > >>>> >> > >> > > > >>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 >> 2602 | >> > >> > Google >> > >> > > > >>>>>>> Inc. >> > >> > > > >>>>>>> > >>>> >> > >> > > > >>>>>>> > >>>> >> > >> > > > >>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning < >> > >> > > > >>>>>>> [email protected] >> > >> > > > >>>>>>> > >wrote: >> > >> > > > >>>>>>> > >>>> >> > >> > > > >>>>>>> > >>>>> Collections (no longer colt collections) are >> now >> > >> part >> > >> > of >> > >> > > > >>>>>>> mahout math. >> > >> > > > >>>>>>> > >>>>> No >> > >> > > > >>>>>>> > >>>>> need to keep them separate. The lower iterator >> > can >> > >> > > > reference >> > >> > > > >>>>>>> > >>>>> Vector.Element >> > >> > > > >>>>>>> > >>>>> >> > >> > > > >>>>>>> > >>>>> >> > >> > > > >>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil < >> > >> > > > >>>>>>> [email protected]> >> > >> > > > >>>>>>> > >>>>> wrote: >> > >> > > > >>>>>>> > >>>>> >> > >> > > > >>>>>>> > >>>>> > I would have loved to but Element is a sub >> > >> interface >> > >> > in >> > >> > > > >>>>>>> Vector. If >> > >> > > > >>>>>>> > >>>>> we want >> > >> > > > >>>>>>> > >>>>> > to keep colt collections separate we have to >> > keep >> > >> > this >> > >> > > > >>>>>>> separation. >> > >> > > > >>>>>>> > >>>>> > >> > >> > > > >>>>>>> > >>>>> >> > >> > > > >>>>>>> > >>>> >> > >> > > > >>>>>>> > >>>> >> > >> > > > >>>>>>> > >>> >> > >> > > > >>>>>>> > >> >> > >> > > > >>>>>>> > > >> > >> > > > >>>>>>> > >> > >> > > > >>>>>>> >> > >> > > > >>>>>> >> > >> > > > >>>>>> >> > >> > > > >>>>> >> > >> > > > >>>> >> > >> > > > >>> >> > >> > > > >> >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > > >> > >> > > >> > >> > > -- >> > >> > > >> > >> > > -jake >> > >> > > >> > >> > >> > >> >> > >> >> > >> >> > >> -- >> > >> >> > >> -jake >> > >> >> > > >> > > >> > >> >> >> >> -- >> >> -jake >> > >
