Found the bug PearsonCorrelationSimilarity was trying to mutate the object
while iterating.
1. while (it.hasNext()) {
2. Vector.Element e = it.next();
3. *vector.set(e.index(),* e.get() - average);
4. }
This has a side effect of causing the underlying hash-map or object to
change.
The right behavior is to set the value of the index while iterating.
1. while (it.hasNext()) {
2. Vector.Element e = it.next();
3. *e.set(e.get()* - average);
4. }
I am sure we are incorrectly doing the first style across the code at many
places.
I am proposing this
When iterating, we lock the set interface on the vector using a State enum.
If anyone tries to mutate, we throw an exception.
We flip the state when we complete iterating (hasNext = false) or when we
explicitly close the iterator (adding a close method on the iterator).
Again this is all a single thread fix. if a vector is being mutated and
iterated across multiple threads, all hell can break loose.
Robin
On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <[email protected]> wrote:
> Spoke too soon still failure. I am uploading the latest patch. These are
> the current failing tests.
>
>
> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> not expecting cluster:{0:1.0,1:1.0}
>
> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> not expecting cluster:{0:1.0,1:1.0}
>
> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> null
>
> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> null
>
> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
> expected:<0.5303300858899108> but was:<0.38729833462074176>
>
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <[email protected]> wrote:
>
>> Found it, fixed it. I am submitting soon.
>>
>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>
>>
>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <[email protected]>wrote:
>>
>>> Robin,
>>>
>>> Can you make sure that the patches are somewhere that Dan can pick up
>>> this
>>> work? He is in GMT+2 and is probably about to appear on the scene.
>>>
>>>
>>>
>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <[email protected]>
>>> wrote:
>>>
>>> > Strike that there are still failures. Investigating. if I cant fix it
>>> in
>>> > the next hour, I will submit them sometime in the evening tomorrow.
>>> >
>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>> >
>>> >
>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <[email protected]>
>>> wrote:
>>> >
>>> > > Tests pass. Submitting the patches.
>>> > >
>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>> > >
>>> > >
>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <[email protected]>
>>> > wrote:
>>> > >
>>> > >> Added a few more tests. Throw NoSuchElementException like Java
>>> > >> Collections when iterating past the end. Things look solid,
>>> performance
>>> > is
>>> > >> 2x. All Math tests pass. I am now waiting for the entire test
>>> suites to
>>> > run
>>> > >> before submitting.
>>> > >>
>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>> > >>
>>> > >>
>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <[email protected]>
>>> > wrote:
>>> > >>
>>> > >>> I am not sure what I did. But removing Guava Abstract iterator
>>> actually
>>> > >>> sped up the dot, cosine, euclidean by another 60%. Things are now
>>> 2x
>>> > faster
>>> > >>> than trunk. While also correcting the behavior (I hope)
>>> > >>>
>>> > >>>
>>> > >>>
>>> >
>>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>>> > >>>
>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>> > >>>
>>> > >>>
>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <[email protected]
>>> > >wrote:
>>> > >>>
>>> > >>>> Also note that this is code gen, I have to create
>>> > Element$keyType$Value
>>> > >>>> for each and every combination not just int double. and also
>>> update
>>> > all
>>> > >>>> callers to user ElementIntDouble instead of Element. Is it worth
>>> it ?
>>> > >>>>
>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>> > >>>>
>>> > >>>>
>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <
>>> [email protected]
>>> > >wrote:
>>> > >>>>
>>> > >>>>> Collections (no longer colt collections) are now part of mahout
>>> math.
>>> > >>>>> No
>>> > >>>>> need to keep them separate. The lower iterator can reference
>>> > >>>>> Vector.Element
>>> > >>>>>
>>> > >>>>>
>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <
>>> [email protected]>
>>> > >>>>> wrote:
>>> > >>>>>
>>> > >>>>> > I would have loved to but Element is a sub interface in
>>> Vector. If
>>> > >>>>> we want
>>> > >>>>> > to keep colt collections separate we have to keep this
>>> separation.
>>> > >>>>> >
>>> > >>>>>
>>> > >>>>
>>> > >>>>
>>> > >>>
>>> > >>
>>> > >
>>> >
>>>
>>
>>
>