What kind of Vector is failing to set() in that code?

About the state enum, what if (for whatever reason, not
multi-threaded-ness) there are multiple iterators to that vector?
Something like a reference count (how many iterators point to it) would
probably be needed, and keeping it sane would only be possible in one
thread. Although this seems kind of brittle.

+1 for numNonDefault.


On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <[email protected]> wrote:

> Another behavior difference.
>
> The numNonDefaultElement for a DenseVector returns the total length. This
> causes Pearson Correlation Similarity to differ from if it was implemented
> using on of the SparseVector.
> I am proposing to fix the numNonDefaultElement to correctly iterate over
> the dense vector to figure out non zero values ? Sounds ok
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <[email protected]> wrote:
>
>> Found the bug PearsonCorrelationSimilarity was trying to mutate the
>> object while iterating.
>>
>>
>>    1.     while (it.hasNext()) {
>>    2.       Vector.Element e = it.next();
>>    3.       *vector.set(e.index(),* e.get() - average);
>>    4.     }
>>
>> This has a side effect of causing the underlying hash-map or object to
>> change.
>>
>> The right behavior is to set the value of the index while iterating.
>>
>>    1.     while (it.hasNext()) {
>>    2.       Vector.Element e = it.next();
>>    3.       *e.set(e.get()* - average);
>>    4.     }
>>
>> I am sure we are incorrectly doing the first style across the code at
>> many places.
>>
>> I am proposing this
>>
>> When iterating, we lock the set interface on the vector using a State
>> enum. If anyone tries to mutate, we throw an exception.
>> We flip the state when we complete iterating (hasNext = false) or when we
>> explicitly close the iterator (adding a close method on the iterator).
>>
>> Again this is all a single thread fix. if a vector is being mutated and
>> iterated across multiple threads, all hell can break loose.
>>
>> Robin
>>
>>
>>
>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <[email protected]>wrote:
>>
>>> Spoke too soon still failure.  I am uploading the latest patch. These
>>> are the current failing tests.
>>>
>>>  
>>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
>>> not expecting cluster:{0:1.0,1:1.0}
>>>
>>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
>>> not expecting cluster:{0:1.0,1:1.0}
>>>
>>> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
>>> null
>>>
>>> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
>>> null
>>>
>>> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
>>> expected:<0.5303300858899108> but was:<0.38729833462074176>
>>>
>>>
>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>
>>>
>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <[email protected]>wrote:
>>>
>>>> Found it, fixed it. I am submitting soon.
>>>>
>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>
>>>>
>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <[email protected]>wrote:
>>>>
>>>>> Robin,
>>>>>
>>>>> Can you make sure that the patches are somewhere that Dan can pick up
>>>>> this
>>>>> work?  He is in GMT+2 and is probably about to appear on the scene.
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <[email protected]>
>>>>> wrote:
>>>>>
>>>>> > Strike that there are still failures. Investigating. if I cant fix
>>>>> it in
>>>>> > the next hour, I will submit them sometime in the evening tomorrow.
>>>>> >
>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>> >
>>>>> >
>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <[email protected]>
>>>>> wrote:
>>>>> >
>>>>> > > Tests pass. Submitting the patches.
>>>>> > >
>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>> > >
>>>>> > >
>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <[email protected]
>>>>> >
>>>>> > wrote:
>>>>> > >
>>>>> > >> Added a few more tests. Throw NoSuchElementException like Java
>>>>> > >> Collections when iterating past the end. Things look solid,
>>>>> performance
>>>>> > is
>>>>> > >> 2x. All Math tests pass. I am now waiting for the entire test
>>>>> suites to
>>>>> > run
>>>>> > >> before submitting.
>>>>> > >>
>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>> > >>
>>>>> > >>
>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <[email protected]
>>>>> >
>>>>> > wrote:
>>>>> > >>
>>>>> > >>> I am not sure what I did. But removing Guava Abstract iterator
>>>>> actually
>>>>> > >>> sped up the dot, cosine, euclidean by another 60%. Things are
>>>>> now 2x
>>>>> > faster
>>>>> > >>> than trunk. While also correcting the behavior (I hope)
>>>>> > >>>
>>>>> > >>>
>>>>> > >>>
>>>>> >
>>>>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>>>>> > >>>
>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>> > >>>
>>>>> > >>>
>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <
>>>>> [email protected]
>>>>> > >wrote:
>>>>> > >>>
>>>>> > >>>> Also note that this is code gen, I have to create
>>>>> > Element$keyType$Value
>>>>> > >>>> for each and every combination not just int double. and also
>>>>> update
>>>>> > all
>>>>> > >>>> callers to user ElementIntDouble instead of Element. Is it
>>>>> worth it ?
>>>>> > >>>>
>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>> > >>>>
>>>>> > >>>>
>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <
>>>>> [email protected]
>>>>> > >wrote:
>>>>> > >>>>
>>>>> > >>>>> Collections (no longer colt collections) are now part of
>>>>> mahout math.
>>>>> > >>>>>  No
>>>>> > >>>>> need to keep them separate.  The lower iterator can reference
>>>>> > >>>>> Vector.Element
>>>>> > >>>>>
>>>>> > >>>>>
>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <
>>>>> [email protected]>
>>>>> > >>>>> wrote:
>>>>> > >>>>>
>>>>> > >>>>> > I would have loved to but Element is a sub interface in
>>>>> Vector. If
>>>>> > >>>>> we want
>>>>> > >>>>> > to keep colt collections separate we have to keep this
>>>>> separation.
>>>>> > >>>>> >
>>>>> > >>>>>
>>>>> > >>>>
>>>>> > >>>>
>>>>> > >>>
>>>>> > >>
>>>>> > >
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>

Reply via email to