Another behavior difference.

The numNonDefaultElement for a DenseVector returns the total length. This
causes Pearson Correlation Similarity to differ from if it was implemented
using on of the SparseVector.
I am proposing to fix the numNonDefaultElement to correctly iterate over
the dense vector to figure out non zero values ? Sounds ok

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <[email protected]> wrote:

> Found the bug PearsonCorrelationSimilarity was trying to mutate the
> object while iterating.
>
>
>    1.     while (it.hasNext()) {
>    2.       Vector.Element e = it.next();
>    3.       *vector.set(e.index(),* e.get() - average);
>    4.     }
>
> This has a side effect of causing the underlying hash-map or object to
> change.
>
> The right behavior is to set the value of the index while iterating.
>
>    1.     while (it.hasNext()) {
>    2.       Vector.Element e = it.next();
>    3.       *e.set(e.get()* - average);
>    4.     }
>
> I am sure we are incorrectly doing the first style across the code at many
> places.
>
> I am proposing this
>
> When iterating, we lock the set interface on the vector using a State
> enum. If anyone tries to mutate, we throw an exception.
> We flip the state when we complete iterating (hasNext = false) or when we
> explicitly close the iterator (adding a close method on the iterator).
>
> Again this is all a single thread fix. if a vector is being mutated and
> iterated across multiple threads, all hell can break loose.
>
> Robin
>
>
>
> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <[email protected]> wrote:
>
>> Spoke too soon still failure.  I am uploading the latest patch. These are
>> the current failing tests.
>>
>>  
>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
>> not expecting cluster:{0:1.0,1:1.0}
>>
>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
>> not expecting cluster:{0:1.0,1:1.0}
>>
>> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
>> null
>>
>> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
>> null
>>
>> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
>> expected:<0.5303300858899108> but was:<0.38729833462074176>
>>
>>
>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>
>>
>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <[email protected]>wrote:
>>
>>> Found it, fixed it. I am submitting soon.
>>>
>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>
>>>
>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <[email protected]>wrote:
>>>
>>>> Robin,
>>>>
>>>> Can you make sure that the patches are somewhere that Dan can pick up
>>>> this
>>>> work?  He is in GMT+2 and is probably about to appear on the scene.
>>>>
>>>>
>>>>
>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <[email protected]>
>>>> wrote:
>>>>
>>>> > Strike that there are still failures. Investigating. if I cant fix it
>>>> in
>>>> > the next hour, I will submit them sometime in the evening tomorrow.
>>>> >
>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>> >
>>>> >
>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <[email protected]>
>>>> wrote:
>>>> >
>>>> > > Tests pass. Submitting the patches.
>>>> > >
>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>> > >
>>>> > >
>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <[email protected]>
>>>> > wrote:
>>>> > >
>>>> > >> Added a few more tests. Throw NoSuchElementException like Java
>>>> > >> Collections when iterating past the end. Things look solid,
>>>> performance
>>>> > is
>>>> > >> 2x. All Math tests pass. I am now waiting for the entire test
>>>> suites to
>>>> > run
>>>> > >> before submitting.
>>>> > >>
>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>> > >>
>>>> > >>
>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <[email protected]>
>>>> > wrote:
>>>> > >>
>>>> > >>> I am not sure what I did. But removing Guava Abstract iterator
>>>> actually
>>>> > >>> sped up the dot, cosine, euclidean by another 60%. Things are now
>>>> 2x
>>>> > faster
>>>> > >>> than trunk. While also correcting the behavior (I hope)
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> >
>>>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>>>> > >>>
>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>> > >>>
>>>> > >>>
>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <[email protected]
>>>> > >wrote:
>>>> > >>>
>>>> > >>>> Also note that this is code gen, I have to create
>>>> > Element$keyType$Value
>>>> > >>>> for each and every combination not just int double. and also
>>>> update
>>>> > all
>>>> > >>>> callers to user ElementIntDouble instead of Element. Is it worth
>>>> it ?
>>>> > >>>>
>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>> > >>>>
>>>> > >>>>
>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <
>>>> [email protected]
>>>> > >wrote:
>>>> > >>>>
>>>> > >>>>> Collections (no longer colt collections) are now part of mahout
>>>> math.
>>>> > >>>>>  No
>>>> > >>>>> need to keep them separate.  The lower iterator can reference
>>>> > >>>>> Vector.Element
>>>> > >>>>>
>>>> > >>>>>
>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <
>>>> [email protected]>
>>>> > >>>>> wrote:
>>>> > >>>>>
>>>> > >>>>> > I would have loved to but Element is a sub interface in
>>>> Vector. If
>>>> > >>>>> we want
>>>> > >>>>> > to keep colt collections separate we have to keep this
>>>> separation.
>>>> > >>>>> >
>>>> > >>>>>
>>>> > >>>>
>>>> > >>>>
>>>> > >>>
>>>> > >>
>>>> > >
>>>> >
>>>>
>>>
>>>
>>
>

Reply via email to