I am adding the tests and updating the patch.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <[email protected]> wrote:

> You can re-iterate if the state is in iteration. But you cannot write.
>
> This is what is happening:
>
> One of the values are becoming 0. So Vector tries to remove it from the
> underlying hashmap. This changes the layout, if a vector has to be mutated
> while iterating, we have to set 0 value in the hashmap and not remove it
> like what the Vector layer is doing. This adds another complexity, the
> vector iterator has to deal with skipping over elements with 0 value.
>
>
> Try this
>
> Create a vector of length 13 and set the following values.
>
>
>    1.     double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1,
>    2, 1 };
>    2.     for (int i = 0; i < val.length; ++i) {
>    3.       vector.set(i, val[i]);
>    4.     }
>
> Iterate again and while iterating set one of the values as zero.
>
> On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon <[email protected]
> > wrote:
>
>> What kind of Vector is failing to set() in that code?
>>
>> About the state enum, what if (for whatever reason, not
>> multi-threaded-ness) there are multiple iterators to that vector?
>> Something like a reference count (how many iterators point to it) would
>> probably be needed, and keeping it sane would only be possible in one
>> thread. Although this seems kind of brittle.
>>
>> +1 for numNonDefault.
>>
>>
>> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <[email protected]> wrote:
>>
>>> Another behavior difference.
>>>
>>> The numNonDefaultElement for a DenseVector returns the total length.
>>> This causes Pearson Correlation Similarity to differ from if it was
>>> implemented using on of the SparseVector.
>>> I am proposing to fix the numNonDefaultElement to correctly iterate over
>>> the dense vector to figure out non zero values ? Sounds ok
>>>
>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>
>>>
>>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <[email protected]>wrote:
>>>
>>>> Found the bug PearsonCorrelationSimilarity was trying to mutate the
>>>> object while iterating.
>>>>
>>>>
>>>>    1.     while (it.hasNext()) {
>>>>    2.       Vector.Element e = it.next();
>>>>    3.       *vector.set(e.index(),* e.get() - average);
>>>>    4.     }
>>>>
>>>> This has a side effect of causing the underlying hash-map or object to
>>>> change.
>>>>
>>>> The right behavior is to set the value of the index while iterating.
>>>>
>>>>    1.     while (it.hasNext()) {
>>>>    2.       Vector.Element e = it.next();
>>>>    3.       *e.set(e.get()* - average);
>>>>    4.     }
>>>>
>>>> I am sure we are incorrectly doing the first style across the code at
>>>> many places.
>>>>
>>>> I am proposing this
>>>>
>>>> When iterating, we lock the set interface on the vector using a State
>>>> enum. If anyone tries to mutate, we throw an exception.
>>>> We flip the state when we complete iterating (hasNext = false) or when
>>>> we explicitly close the iterator (adding a close method on the iterator).
>>>>
>>>> Again this is all a single thread fix. if a vector is being mutated and
>>>> iterated across multiple threads, all hell can break loose.
>>>>
>>>> Robin
>>>>
>>>>
>>>>
>>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <[email protected]>wrote:
>>>>
>>>>> Spoke too soon still failure.  I am uploading the latest patch. These
>>>>> are the current failing tests.
>>>>>
>>>>>  
>>>>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
>>>>> not expecting cluster:{0:1.0,1:1.0}
>>>>>
>>>>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
>>>>> not expecting cluster:{0:1.0,1:1.0}
>>>>>
>>>>> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
>>>>> null
>>>>>
>>>>> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
>>>>> null
>>>>>
>>>>> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
>>>>> expected:<0.5303300858899108> but was:<0.38729833462074176>
>>>>>
>>>>>
>>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>>
>>>>>
>>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <[email protected]>wrote:
>>>>>
>>>>>> Found it, fixed it. I am submitting soon.
>>>>>>
>>>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>>>
>>>>>>
>>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning 
>>>>>> <[email protected]>wrote:
>>>>>>
>>>>>>> Robin,
>>>>>>>
>>>>>>> Can you make sure that the patches are somewhere that Dan can pick
>>>>>>> up this
>>>>>>> work?  He is in GMT+2 and is probably about to appear on the scene.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> > Strike that there are still failures. Investigating. if I cant fix
>>>>>>> it in
>>>>>>> > the next hour, I will submit them sometime in the evening tomorrow.
>>>>>>> >
>>>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>>>> >
>>>>>>> >
>>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <[email protected]>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > > Tests pass. Submitting the patches.
>>>>>>> > >
>>>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>>>> > >
>>>>>>> > >
>>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <
>>>>>>> [email protected]>
>>>>>>> > wrote:
>>>>>>> > >
>>>>>>> > >> Added a few more tests. Throw NoSuchElementException like Java
>>>>>>> > >> Collections when iterating past the end. Things look solid,
>>>>>>> performance
>>>>>>> > is
>>>>>>> > >> 2x. All Math tests pass. I am now waiting for the entire test
>>>>>>> suites to
>>>>>>> > run
>>>>>>> > >> before submitting.
>>>>>>> > >>
>>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>>>> > >>
>>>>>>> > >>
>>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <
>>>>>>> [email protected]>
>>>>>>> > wrote:
>>>>>>> > >>
>>>>>>> > >>> I am not sure what I did. But removing Guava Abstract iterator
>>>>>>> actually
>>>>>>> > >>> sped up the dot, cosine, euclidean by another 60%. Things are
>>>>>>> now 2x
>>>>>>> > faster
>>>>>>> > >>> than trunk. While also correcting the behavior (I hope)
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> >
>>>>>>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>>>>>>> > >>>
>>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <
>>>>>>> [email protected]
>>>>>>> > >wrote:
>>>>>>> > >>>
>>>>>>> > >>>> Also note that this is code gen, I have to create
>>>>>>> > Element$keyType$Value
>>>>>>> > >>>> for each and every combination not just int double. and also
>>>>>>> update
>>>>>>> > all
>>>>>>> > >>>> callers to user ElementIntDouble instead of Element. Is it
>>>>>>> worth it ?
>>>>>>> > >>>>
>>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
>>>>>>> Inc.
>>>>>>> > >>>>
>>>>>>> > >>>>
>>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <
>>>>>>> [email protected]
>>>>>>> > >wrote:
>>>>>>> > >>>>
>>>>>>> > >>>>> Collections (no longer colt collections) are now part of
>>>>>>> mahout math.
>>>>>>> > >>>>>  No
>>>>>>> > >>>>> need to keep them separate.  The lower iterator can reference
>>>>>>> > >>>>> Vector.Element
>>>>>>> > >>>>>
>>>>>>> > >>>>>
>>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <
>>>>>>> [email protected]>
>>>>>>> > >>>>> wrote:
>>>>>>> > >>>>>
>>>>>>> > >>>>> > I would have loved to but Element is a sub interface in
>>>>>>> Vector. If
>>>>>>> > >>>>> we want
>>>>>>> > >>>>> > to keep colt collections separate we have to keep this
>>>>>>> separation.
>>>>>>> > >>>>> >
>>>>>>> > >>>>>
>>>>>>> > >>>>
>>>>>>> > >>>>
>>>>>>> > >>>
>>>>>>> > >>
>>>>>>> > >
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Reply via email to