Another crazy idea for the future is to kill the usage of OpenIntDoubleHashMap entirely and copy parts of it inside RASV which will only deal with nonzero keys and non zero values. RASV can then keep track of non-zero elements in a variable to speed up those lookups.
Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Mon, Apr 15, 2013 at 2:11 PM, Robin Anil <[email protected]> wrote: > The point 3 is coming from the philosophy that all Vectors behave the same > way and numNonDefaultElements of a DenseVector is same as that of a > SparseVector. Eg, if PersonSimilarity relies upon it for document length, > it should be behave the same way. > > The point 4 can be solved by killing the iterator interface entirely and > creating forEachNonZero(function()) method which will only call if the > element is nonzero. > > > > On Mon, Apr 15, 2013 at 2:08 PM, Jake Mannix <[email protected]>wrote: > >> On Mon, Apr 15, 2013 at 11:58 AM, Robin Anil <[email protected]> >> wrote: >> >> > This is what I propose: >> > >> > 1) Allow setting value to zero while iterating (e.set(0.0)). >> > >> >> This is in addition to the fact that we already allow setting nonzero >> values >> while iterating, right? >> >> >> > 2) Do not allow callers to use vector.set(index, 0.0) during iterating). >> > This can cause re-hashing. (Can set a dirty bit in the hashmap during >> > rehash to throw a concurrent modified exception) >> > >> >> Agreed - this is a commonly accepted requirement: I think in fact we >> should pro-actively throw ConcurrentModificationException if someone >> tries to call vector.set / vector.assign while iterating. >> >> >> > 3) Update the numNonDefaultElements to iterate over the array to >> discount >> > 0.0 instead of returning the hashMap values. >> > 4) IterateNonZero may iterate over a few zeros if you did set the >> dimension >> > to 0. Most of the statistics code should handle 0 values correctly. >> > >> >> Yeah, are we really strict about getNumNonDefaultElements really always >> returning exactly the number of nonzeroes? I was under the impression >> that >> for e.g. DenseVector, it would give the overal size, even if some were 0, >> and that it was basically tracking the amount of space the vector was >> taking >> up. But I can see the argument that it really should return what it says >> it >> returns, if that is relied upon. >> >> >> > >> > >> > >> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >> > >> > >> > On Mon, Apr 15, 2013 at 1:50 PM, Jake Mannix <[email protected]> >> > wrote: >> > >> > > Ah, this was the one corner case I was worried about - we do >> special-case >> > > setting to 0, >> > > as meaning remove from the hashmap, yes. >> > > >> > > What's the TL;DR of what you did to work around this? Should we allow >> > > this? Even >> > > if it's through the Vector.Element instance, should it be ok? If so, >> how >> > > to handle? >> > > >> > > >> > > On Mon, Apr 15, 2013 at 11:04 AM, Robin Anil <[email protected]> >> > wrote: >> > > >> > > > I am adding the tests and updating the patch. >> > > > >> > > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >> > > > >> > > > >> > > > On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <[email protected]> >> > > wrote: >> > > > >> > > > > You can re-iterate if the state is in iteration. But you cannot >> > write. >> > > > > >> > > > > This is what is happening: >> > > > > >> > > > > One of the values are becoming 0. So Vector tries to remove it >> from >> > the >> > > > > underlying hashmap. This changes the layout, if a vector has to be >> > > > mutated >> > > > > while iterating, we have to set 0 value in the hashmap and not >> remove >> > > it >> > > > > like what the Vector layer is doing. This adds another complexity, >> > the >> > > > > vector iterator has to deal with skipping over elements with 0 >> value. >> > > > > >> > > > > >> > > > > Try this >> > > > > >> > > > > Create a vector of length 13 and set the following values. >> > > > > >> > > > > >> > > > > 1. double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, >> 0, 1, >> > > 1, >> > > > > 2, 1 }; >> > > > > 2. for (int i = 0; i < val.length; ++i) { >> > > > > 3. vector.set(i, val[i]); >> > > > > 4. } >> > > > > >> > > > > Iterate again and while iterating set one of the values as zero. >> > > > > >> > > > > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon < >> > > > [email protected] >> > > > > > wrote: >> > > > > >> > > > >> What kind of Vector is failing to set() in that code? >> > > > >> >> > > > >> About the state enum, what if (for whatever reason, not >> > > > >> multi-threaded-ness) there are multiple iterators to that vector? >> > > > >> Something like a reference count (how many iterators point to it) >> > > would >> > > > >> probably be needed, and keeping it sane would only be possible in >> > one >> > > > >> thread. Although this seems kind of brittle. >> > > > >> >> > > > >> +1 for numNonDefault. >> > > > >> >> > > > >> >> > > > >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil < >> [email protected]> >> > > > wrote: >> > > > >> >> > > > >>> Another behavior difference. >> > > > >>> >> > > > >>> The numNonDefaultElement for a DenseVector returns the total >> > length. >> > > > >>> This causes Pearson Correlation Similarity to differ from if it >> was >> > > > >>> implemented using on of the SparseVector. >> > > > >>> I am proposing to fix the numNonDefaultElement to correctly >> iterate >> > > > over >> > > > >>> the dense vector to figure out non zero values ? Sounds ok >> > > > >>> >> > > > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >> > > > >>> >> > > > >>> >> > > > >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil < >> [email protected] >> > > > >wrote: >> > > > >>> >> > > > >>>> Found the bug PearsonCorrelationSimilarity was trying to mutate >> > the >> > > > >>>> object while iterating. >> > > > >>>> >> > > > >>>> >> > > > >>>> 1. while (it.hasNext()) { >> > > > >>>> 2. Vector.Element e = it.next(); >> > > > >>>> 3. *vector.set(e.index(),* e.get() - average); >> > > > >>>> 4. } >> > > > >>>> >> > > > >>>> This has a side effect of causing the underlying hash-map or >> > object >> > > to >> > > > >>>> change. >> > > > >>>> >> > > > >>>> The right behavior is to set the value of the index while >> > iterating. >> > > > >>>> >> > > > >>>> 1. while (it.hasNext()) { >> > > > >>>> 2. Vector.Element e = it.next(); >> > > > >>>> 3. *e.set(e.get()* - average); >> > > > >>>> 4. } >> > > > >>>> >> > > > >>>> I am sure we are incorrectly doing the first style across the >> code >> > > at >> > > > >>>> many places. >> > > > >>>> >> > > > >>>> I am proposing this >> > > > >>>> >> > > > >>>> When iterating, we lock the set interface on the vector using a >> > > State >> > > > >>>> enum. If anyone tries to mutate, we throw an exception. >> > > > >>>> We flip the state when we complete iterating (hasNext = false) >> or >> > > when >> > > > >>>> we explicitly close the iterator (adding a close method on the >> > > > iterator). >> > > > >>>> >> > > > >>>> Again this is all a single thread fix. if a vector is being >> > mutated >> > > > and >> > > > >>>> iterated across multiple threads, all hell can break loose. >> > > > >>>> >> > > > >>>> Robin >> > > > >>>> >> > > > >>>> >> > > > >>>> >> > > > >>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil < >> > [email protected] >> > > > >wrote: >> > > > >>>> >> > > > >>>>> Spoke too soon still failure. I am uploading the latest >> patch. >> > > These >> > > > >>>>> are the current failing tests. >> > > > >>>>> >> > > > >>>>> >> > > > >> > > >> > >> >> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88 >> > > > >>>>> not expecting cluster:{0:1.0,1:1.0} >> > > > >>>>> >> > > > >>>>> >> > > > >> > > >> > >> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88 >> > > > >>>>> not expecting cluster:{0:1.0,1:1.0} >> > > > >>>>> >> > > > >>>>> >> > > > >> > > >> > >> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 >> > > > >>>>> null >> > > > >>>>> >> > > > >>>>> >> > > > >> > > >> > >> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 >> > > > >>>>> null >> > > > >>>>> >> > > > >>>>> >> > > > >> > > >> > >> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88 >> > > > >>>>> expected:<0.5303300858899108> but was:<0.38729833462074176> >> > > > >>>>> >> > > > >>>>> >> > > > >>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google >> Inc. >> > > > >>>>> >> > > > >>>>> >> > > > >>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil < >> > [email protected] >> > > > >wrote: >> > > > >>>>> >> > > > >>>>>> Found it, fixed it. I am submitting soon. >> > > > >>>>>> >> > > > >>>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google >> Inc. >> > > > >>>>>> >> > > > >>>>>> >> > > > >>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning < >> > > > [email protected]>wrote: >> > > > >>>>>> >> > > > >>>>>>> Robin, >> > > > >>>>>>> >> > > > >>>>>>> Can you make sure that the patches are somewhere that Dan >> can >> > > pick >> > > > >>>>>>> up this >> > > > >>>>>>> work? He is in GMT+2 and is probably about to appear on the >> > > scene. >> > > > >>>>>>> >> > > > >>>>>>> >> > > > >>>>>>> >> > > > >>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil < >> > > [email protected]> >> > > > >>>>>>> wrote: >> > > > >>>>>>> >> > > > >>>>>>> > Strike that there are still failures. Investigating. if I >> > cant >> > > > fix >> > > > >>>>>>> it in >> > > > >>>>>>> > the next hour, I will submit them sometime in the evening >> > > > tomorrow. >> > > > >>>>>>> > >> > > > >>>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google >> > Inc. >> > > > >>>>>>> > >> > > > >>>>>>> > >> > > > >>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil < >> > > > [email protected]> >> > > > >>>>>>> wrote: >> > > > >>>>>>> > >> > > > >>>>>>> > > Tests pass. Submitting the patches. >> > > > >>>>>>> > > >> > > > >>>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | >> Google >> > > Inc. >> > > > >>>>>>> > > >> > > > >>>>>>> > > >> > > > >>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil < >> > > > >>>>>>> [email protected]> >> > > > >>>>>>> > wrote: >> > > > >>>>>>> > > >> > > > >>>>>>> > >> Added a few more tests. Throw NoSuchElementException >> like >> > > Java >> > > > >>>>>>> > >> Collections when iterating past the end. Things look >> > solid, >> > > > >>>>>>> performance >> > > > >>>>>>> > is >> > > > >>>>>>> > >> 2x. All Math tests pass. I am now waiting for the >> entire >> > > test >> > > > >>>>>>> suites to >> > > > >>>>>>> > run >> > > > >>>>>>> > >> before submitting. >> > > > >>>>>>> > >> >> > > > >>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | >> Google >> > > > Inc. >> > > > >>>>>>> > >> >> > > > >>>>>>> > >> >> > > > >>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil < >> > > > >>>>>>> [email protected]> >> > > > >>>>>>> > wrote: >> > > > >>>>>>> > >> >> > > > >>>>>>> > >>> I am not sure what I did. But removing Guava Abstract >> > > > iterator >> > > > >>>>>>> actually >> > > > >>>>>>> > >>> sped up the dot, cosine, euclidean by another 60%. >> Things >> > > are >> > > > >>>>>>> now 2x >> > > > >>>>>>> > faster >> > > > >>>>>>> > >>> than trunk. While also correcting the behavior (I >> hope) >> > > > >>>>>>> > >>> >> > > > >>>>>>> > >>> >> > > > >>>>>>> > >>> >> > > > >>>>>>> > >> > > > >>>>>>> >> > > > >> > > >> > >> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1 >> > > > >>>>>>> > >>> >> > > > >>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | >> > Google >> > > > Inc. >> > > > >>>>>>> > >>> >> > > > >>>>>>> > >>> >> > > > >>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil < >> > > > >>>>>>> [email protected] >> > > > >>>>>>> > >wrote: >> > > > >>>>>>> > >>> >> > > > >>>>>>> > >>>> Also note that this is code gen, I have to create >> > > > >>>>>>> > Element$keyType$Value >> > > > >>>>>>> > >>>> for each and every combination not just int double. >> and >> > > also >> > > > >>>>>>> update >> > > > >>>>>>> > all >> > > > >>>>>>> > >>>> callers to user ElementIntDouble instead of Element. >> Is >> > it >> > > > >>>>>>> worth it ? >> > > > >>>>>>> > >>>> >> > > > >>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | >> > Google >> > > > >>>>>>> Inc. >> > > > >>>>>>> > >>>> >> > > > >>>>>>> > >>>> >> > > > >>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning < >> > > > >>>>>>> [email protected] >> > > > >>>>>>> > >wrote: >> > > > >>>>>>> > >>>> >> > > > >>>>>>> > >>>>> Collections (no longer colt collections) are now >> part >> > of >> > > > >>>>>>> mahout math. >> > > > >>>>>>> > >>>>> No >> > > > >>>>>>> > >>>>> need to keep them separate. The lower iterator can >> > > > reference >> > > > >>>>>>> > >>>>> Vector.Element >> > > > >>>>>>> > >>>>> >> > > > >>>>>>> > >>>>> >> > > > >>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil < >> > > > >>>>>>> [email protected]> >> > > > >>>>>>> > >>>>> wrote: >> > > > >>>>>>> > >>>>> >> > > > >>>>>>> > >>>>> > I would have loved to but Element is a sub >> interface >> > in >> > > > >>>>>>> Vector. If >> > > > >>>>>>> > >>>>> we want >> > > > >>>>>>> > >>>>> > to keep colt collections separate we have to keep >> > this >> > > > >>>>>>> separation. >> > > > >>>>>>> > >>>>> > >> > > > >>>>>>> > >>>>> >> > > > >>>>>>> > >>>> >> > > > >>>>>>> > >>>> >> > > > >>>>>>> > >>> >> > > > >>>>>>> > >> >> > > > >>>>>>> > > >> > > > >>>>>>> > >> > > > >>>>>>> >> > > > >>>>>> >> > > > >>>>>> >> > > > >>>>> >> > > > >>>> >> > > > >>> >> > > > >> >> > > > > >> > > > >> > > >> > > >> > > >> > > -- >> > > >> > > -jake >> > > >> > >> >> >> >> -- >> >> -jake >> > >
