I am adding the tests and updating the patch. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <[email protected]> wrote: > You can re-iterate if the state is in iteration. But you cannot write. > > This is what is happening: > > One of the values are becoming 0. So Vector tries to remove it from the > underlying hashmap. This changes the layout, if a vector has to be mutated > while iterating, we have to set 0 value in the hashmap and not remove it > like what the Vector layer is doing. This adds another complexity, the > vector iterator has to deal with skipping over elements with 0 value. > > > Try this > > Create a vector of length 13 and set the following values. > > > 1. double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1, > 2, 1 }; > 2. for (int i = 0; i < val.length; ++i) { > 3. vector.set(i, val[i]); > 4. } > > Iterate again and while iterating set one of the values as zero. > > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon <[email protected] > > wrote: > >> What kind of Vector is failing to set() in that code? >> >> About the state enum, what if (for whatever reason, not >> multi-threaded-ness) there are multiple iterators to that vector? >> Something like a reference count (how many iterators point to it) would >> probably be needed, and keeping it sane would only be possible in one >> thread. Although this seems kind of brittle. >> >> +1 for numNonDefault. >> >> >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <[email protected]> wrote: >> >>> Another behavior difference. >>> >>> The numNonDefaultElement for a DenseVector returns the total length. >>> This causes Pearson Correlation Similarity to differ from if it was >>> implemented using on of the SparseVector. >>> I am proposing to fix the numNonDefaultElement to correctly iterate over >>> the dense vector to figure out non zero values ? Sounds ok >>> >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>> >>> >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <[email protected]>wrote: >>> >>>> Found the bug PearsonCorrelationSimilarity was trying to mutate the >>>> object while iterating. >>>> >>>> >>>> 1. while (it.hasNext()) { >>>> 2. Vector.Element e = it.next(); >>>> 3. *vector.set(e.index(),* e.get() - average); >>>> 4. } >>>> >>>> This has a side effect of causing the underlying hash-map or object to >>>> change. >>>> >>>> The right behavior is to set the value of the index while iterating. >>>> >>>> 1. while (it.hasNext()) { >>>> 2. Vector.Element e = it.next(); >>>> 3. *e.set(e.get()* - average); >>>> 4. } >>>> >>>> I am sure we are incorrectly doing the first style across the code at >>>> many places. >>>> >>>> I am proposing this >>>> >>>> When iterating, we lock the set interface on the vector using a State >>>> enum. If anyone tries to mutate, we throw an exception. >>>> We flip the state when we complete iterating (hasNext = false) or when >>>> we explicitly close the iterator (adding a close method on the iterator). >>>> >>>> Again this is all a single thread fix. if a vector is being mutated and >>>> iterated across multiple threads, all hell can break loose. >>>> >>>> Robin >>>> >>>> >>>> >>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <[email protected]>wrote: >>>> >>>>> Spoke too soon still failure. I am uploading the latest patch. These >>>>> are the current failing tests. >>>>> >>>>> >>>>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88 >>>>> not expecting cluster:{0:1.0,1:1.0} >>>>> >>>>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88 >>>>> not expecting cluster:{0:1.0,1:1.0} >>>>> >>>>> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 >>>>> null >>>>> >>>>> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86 >>>>> null >>>>> >>>>> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88 >>>>> expected:<0.5303300858899108> but was:<0.38729833462074176> >>>>> >>>>> >>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>>>> >>>>> >>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <[email protected]>wrote: >>>>> >>>>>> Found it, fixed it. I am submitting soon. >>>>>> >>>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>>>>> >>>>>> >>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning >>>>>> <[email protected]>wrote: >>>>>> >>>>>>> Robin, >>>>>>> >>>>>>> Can you make sure that the patches are somewhere that Dan can pick >>>>>>> up this >>>>>>> work? He is in GMT+2 and is probably about to appear on the scene. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>> > Strike that there are still failures. Investigating. if I cant fix >>>>>>> it in >>>>>>> > the next hour, I will submit them sometime in the evening tomorrow. >>>>>>> > >>>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>>>>>> > >>>>>>> > >>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <[email protected]> >>>>>>> wrote: >>>>>>> > >>>>>>> > > Tests pass. Submitting the patches. >>>>>>> > > >>>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>>>>>> > > >>>>>>> > > >>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil < >>>>>>> [email protected]> >>>>>>> > wrote: >>>>>>> > > >>>>>>> > >> Added a few more tests. Throw NoSuchElementException like Java >>>>>>> > >> Collections when iterating past the end. Things look solid, >>>>>>> performance >>>>>>> > is >>>>>>> > >> 2x. All Math tests pass. I am now waiting for the entire test >>>>>>> suites to >>>>>>> > run >>>>>>> > >> before submitting. >>>>>>> > >> >>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil < >>>>>>> [email protected]> >>>>>>> > wrote: >>>>>>> > >> >>>>>>> > >>> I am not sure what I did. But removing Guava Abstract iterator >>>>>>> actually >>>>>>> > >>> sped up the dot, cosine, euclidean by another 60%. Things are >>>>>>> now 2x >>>>>>> > faster >>>>>>> > >>> than trunk. While also correcting the behavior (I hope) >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>>>>>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1 >>>>>>> > >>> >>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil < >>>>>>> [email protected] >>>>>>> > >wrote: >>>>>>> > >>> >>>>>>> > >>>> Also note that this is code gen, I have to create >>>>>>> > Element$keyType$Value >>>>>>> > >>>> for each and every combination not just int double. and also >>>>>>> update >>>>>>> > all >>>>>>> > >>>> callers to user ElementIntDouble instead of Element. Is it >>>>>>> worth it ? >>>>>>> > >>>> >>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google >>>>>>> Inc. >>>>>>> > >>>> >>>>>>> > >>>> >>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning < >>>>>>> [email protected] >>>>>>> > >wrote: >>>>>>> > >>>> >>>>>>> > >>>>> Collections (no longer colt collections) are now part of >>>>>>> mahout math. >>>>>>> > >>>>> No >>>>>>> > >>>>> need to keep them separate. The lower iterator can reference >>>>>>> > >>>>> Vector.Element >>>>>>> > >>>>> >>>>>>> > >>>>> >>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil < >>>>>>> [email protected]> >>>>>>> > >>>>> wrote: >>>>>>> > >>>>> >>>>>>> > >>>>> > I would have loved to but Element is a sub interface in >>>>>>> Vector. If >>>>>>> > >>>>> we want >>>>>>> > >>>>> > to keep colt collections separate we have to keep this >>>>>>> separation. >>>>>>> > >>>>> > >>>>>>> > >>>>> >>>>>>> > >>>> >>>>>>> > >>>> >>>>>>> > >>> >>>>>>> > >> >>>>>>> > > >>>>>>> > >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> >
