This is what I propose: 1) Allow setting value to zero while iterating (e.set(0.0)). 2) Do not allow callers to use vector.set(index, 0.0) during iterating). This can cause re-hashing. (Can set a dirty bit in the hashmap during rehash to throw a concurrent modified exception) 3) Update the numNonDefaultElements to iterate over the array to discount 0.0 instead of returning the hashMap values. 4) IterateNonZero may iterate over a few zeros if you did set the dimension to 0. Most of the statistics code should handle 0 values correctly.
Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Mon, Apr 15, 2013 at 1:50 PM, Jake Mannix <[email protected]> wrote: > Ah, this was the one corner case I was worried about - we do special-case > setting to 0, > as meaning remove from the hashmap, yes. > > What's the TL;DR of what you did to work around this? Should we allow > this? Even > if it's through the Vector.Element instance, should it be ok? If so, how > to handle? > > > On Mon, Apr 15, 2013 at 11:04 AM, Robin Anil <[email protected]> wrote: > > > I am adding the tests and updating the patch. > > > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. > > > > > > On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <[email protected]> > wrote: > > > > > You can re-iterate if the state is in iteration. But you cannot write. > > > > > > This is what is happening: > > > > > > One of the values are becoming 0. So Vector tries to remove it from the > > > underlying hashmap. This changes the layout, if a vector has to be > > mutated > > > while iterating, we have to set 0 value in the hashmap and not remove > it > > > like what the Vector layer is doing. This adds another complexity, the > > > vector iterator has to deal with skipping over elements with 0 value. > > > > > > > > > Try this > > > > > > Create a vector of length 13 and set the following values. > > > > > > > > > 1. double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, > 1, > > > 2, 1 }; > > > 2. for (int i = 0; i < val.length; ++i) { > > > 3. vector.set(i, val[i]); > > > 4. } > > > > > > Iterate again and while iterating set one of the values as zero. > > > > > > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon < > > [email protected] > > > > wrote: > > > > > >> What kind of Vector is failing to set() in that code? > > >> > > >> About the state enum, what if (for whatever reason, not > > >> multi-threaded-ness) there are multiple iterators to that vector? > > >> Something like a reference count (how many iterators point to it) > would > > >> probably be needed, and keeping it sane would only be possible in one > > >> thread. Although this seems kind of brittle. > > >> > > >> +1 for numNonDefault. > > >> > > >> > > >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <[email protected]> > > wrote: > > >> > > >>> Another behavior difference. > > >>> > > >>> The numNonDefaultElement for a DenseVector returns the total length. > > >>> This causes Pearson Correlation Similarity to differ from if it was > > >>> implemented using on of the SparseVector. > > >>> I am proposing to fix the numNonDefaultElement to correctly iterate > > over > > >>> the dense vector to figure out non zero values ? Sounds ok > > >>> > > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. > > >>> > > >>> > > >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <[email protected] > > >wrote: > > >>> > > >>>> Found the bug PearsonCorrelationSimilarity was trying to mutate the > > >>>> object while iterating. > > >>>> > > >>>> > > >>>> 1. while (it.hasNext()) { > > >>>> 2. Vector.Element e = it.next(); > > >>>> 3. *vector.set(e.index(),* e.get() - average); > > >>>> 4. } > > >>>> > > >>>> This has a side effect of causing the underlying hash-map or object > to > > >>>> change. > > >>>> > > >>>> The right behavior is to set the value of the index while iterating. > > >>>> > > >>>> 1. while (it.hasNext()) { > > >>>> 2. Vector.Element e = it.next(); > > >>>> 3. *e.set(e.get()* - average); > > >>>> 4. } > > >>>> > > >>>> I am sure we are incorrectly doing the first style across the code > at > > >>>> many places. > > >>>> > > >>>> I am proposing this > > >>>> > > >>>> When iterating, we lock the set interface on the vector using a > State > > >>>> enum. If anyone tries to mutate, we throw an exception. > > >>>> We flip the state when we complete iterating (hasNext = false) or > when > > >>>> we explicitly close the iterator (adding a close method on the > > iterator). > > >>>> > > >>>> Again this is all a single thread fix. if a vector is being mutated > > and > > >>>> iterated across multiple threads, all hell can break loose. > > >>>> > > >>>> Robin > > >>>> > > >>>> > > >>>> > > >>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <[email protected] > > >wrote: > > >>>> > > >>>>> Spoke too soon still failure. I am uploading the latest patch. > These > > >>>>> are the current failing tests. > > >>>>> > > >>>>> > > > > ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88 > > >>>>> not expecting cluster:{0:1.0,1:1.0} > > >>>>> > > >>>>> > > > ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88 > > >>>>> not expecting cluster:{0:1.0,1:1.0} > > >>>>> > > >>>>> > > > ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 > > >>>>> null > > >>>>> > > >>>>> > > > ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 > > >>>>> null > > >>>>> > > >>>>> > > > VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88 > > >>>>> expected:<0.5303300858899108> but was:<0.38729833462074176> > > >>>>> > > >>>>> > > >>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. > > >>>>> > > >>>>> > > >>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <[email protected] > > >wrote: > > >>>>> > > >>>>>> Found it, fixed it. I am submitting soon. > > >>>>>> > > >>>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. > > >>>>>> > > >>>>>> > > >>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning < > > [email protected]>wrote: > > >>>>>> > > >>>>>>> Robin, > > >>>>>>> > > >>>>>>> Can you make sure that the patches are somewhere that Dan can > pick > > >>>>>>> up this > > >>>>>>> work? He is in GMT+2 and is probably about to appear on the > scene. > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil < > [email protected]> > > >>>>>>> wrote: > > >>>>>>> > > >>>>>>> > Strike that there are still failures. Investigating. if I cant > > fix > > >>>>>>> it in > > >>>>>>> > the next hour, I will submit them sometime in the evening > > tomorrow. > > >>>>>>> > > > >>>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil < > > [email protected]> > > >>>>>>> wrote: > > >>>>>>> > > > >>>>>>> > > Tests pass. Submitting the patches. > > >>>>>>> > > > > >>>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google > Inc. > > >>>>>>> > > > > >>>>>>> > > > > >>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil < > > >>>>>>> [email protected]> > > >>>>>>> > wrote: > > >>>>>>> > > > > >>>>>>> > >> Added a few more tests. Throw NoSuchElementException like > Java > > >>>>>>> > >> Collections when iterating past the end. Things look solid, > > >>>>>>> performance > > >>>>>>> > is > > >>>>>>> > >> 2x. All Math tests pass. I am now waiting for the entire > test > > >>>>>>> suites to > > >>>>>>> > run > > >>>>>>> > >> before submitting. > > >>>>>>> > >> > > >>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | Google > > Inc. > > >>>>>>> > >> > > >>>>>>> > >> > > >>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil < > > >>>>>>> [email protected]> > > >>>>>>> > wrote: > > >>>>>>> > >> > > >>>>>>> > >>> I am not sure what I did. But removing Guava Abstract > > iterator > > >>>>>>> actually > > >>>>>>> > >>> sped up the dot, cosine, euclidean by another 60%. Things > are > > >>>>>>> now 2x > > >>>>>>> > faster > > >>>>>>> > >>> than trunk. While also correcting the behavior (I hope) > > >>>>>>> > >>> > > >>>>>>> > >>> > > >>>>>>> > >>> > > >>>>>>> > > > >>>>>>> > > > https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1 > > >>>>>>> > >>> > > >>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google > > Inc. > > >>>>>>> > >>> > > >>>>>>> > >>> > > >>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil < > > >>>>>>> [email protected] > > >>>>>>> > >wrote: > > >>>>>>> > >>> > > >>>>>>> > >>>> Also note that this is code gen, I have to create > > >>>>>>> > Element$keyType$Value > > >>>>>>> > >>>> for each and every combination not just int double. and > also > > >>>>>>> update > > >>>>>>> > all > > >>>>>>> > >>>> callers to user ElementIntDouble instead of Element. Is it > > >>>>>>> worth it ? > > >>>>>>> > >>>> > > >>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google > > >>>>>>> Inc. > > >>>>>>> > >>>> > > >>>>>>> > >>>> > > >>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning < > > >>>>>>> [email protected] > > >>>>>>> > >wrote: > > >>>>>>> > >>>> > > >>>>>>> > >>>>> Collections (no longer colt collections) are now part of > > >>>>>>> mahout math. > > >>>>>>> > >>>>> No > > >>>>>>> > >>>>> need to keep them separate. The lower iterator can > > reference > > >>>>>>> > >>>>> Vector.Element > > >>>>>>> > >>>>> > > >>>>>>> > >>>>> > > >>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil < > > >>>>>>> [email protected]> > > >>>>>>> > >>>>> wrote: > > >>>>>>> > >>>>> > > >>>>>>> > >>>>> > I would have loved to but Element is a sub interface in > > >>>>>>> Vector. If > > >>>>>>> > >>>>> we want > > >>>>>>> > >>>>> > to keep colt collections separate we have to keep this > > >>>>>>> separation. > > >>>>>>> > >>>>> > > > >>>>>>> > >>>>> > > >>>>>>> > >>>> > > >>>>>>> > >>>> > > >>>>>>> > >>> > > >>>>>>> > >> > > >>>>>>> > > > > >>>>>>> > > > >>>>>>> > > >>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > > > > > > > -- > > -jake >
