Ah, this was the one corner case I was worried about - we do special-case
setting to 0,
as meaning remove from the hashmap, yes.

What's the TL;DR of what you did to work around this?  Should we allow
this?  Even
if it's through the Vector.Element instance, should it be ok?  If so, how
to handle?


On Mon, Apr 15, 2013 at 11:04 AM, Robin Anil <[email protected]> wrote:

> I am adding the tests and updating the patch.
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <[email protected]> wrote:
>
> > You can re-iterate if the state is in iteration. But you cannot write.
> >
> > This is what is happening:
> >
> > One of the values are becoming 0. So Vector tries to remove it from the
> > underlying hashmap. This changes the layout, if a vector has to be
> mutated
> > while iterating, we have to set 0 value in the hashmap and not remove it
> > like what the Vector layer is doing. This adds another complexity, the
> > vector iterator has to deal with skipping over elements with 0 value.
> >
> >
> > Try this
> >
> > Create a vector of length 13 and set the following values.
> >
> >
> >    1.     double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1,
> >    2, 1 };
> >    2.     for (int i = 0; i < val.length; ++i) {
> >    3.       vector.set(i, val[i]);
> >    4.     }
> >
> > Iterate again and while iterating set one of the values as zero.
> >
> > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon <
> [email protected]
> > > wrote:
> >
> >> What kind of Vector is failing to set() in that code?
> >>
> >> About the state enum, what if (for whatever reason, not
> >> multi-threaded-ness) there are multiple iterators to that vector?
> >> Something like a reference count (how many iterators point to it) would
> >> probably be needed, and keeping it sane would only be possible in one
> >> thread. Although this seems kind of brittle.
> >>
> >> +1 for numNonDefault.
> >>
> >>
> >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <[email protected]>
> wrote:
> >>
> >>> Another behavior difference.
> >>>
> >>> The numNonDefaultElement for a DenseVector returns the total length.
> >>> This causes Pearson Correlation Similarity to differ from if it was
> >>> implemented using on of the SparseVector.
> >>> I am proposing to fix the numNonDefaultElement to correctly iterate
> over
> >>> the dense vector to figure out non zero values ? Sounds ok
> >>>
> >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >>>
> >>>
> >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <[email protected]
> >wrote:
> >>>
> >>>> Found the bug PearsonCorrelationSimilarity was trying to mutate the
> >>>> object while iterating.
> >>>>
> >>>>
> >>>>    1.     while (it.hasNext()) {
> >>>>    2.       Vector.Element e = it.next();
> >>>>    3.       *vector.set(e.index(),* e.get() - average);
> >>>>    4.     }
> >>>>
> >>>> This has a side effect of causing the underlying hash-map or object to
> >>>> change.
> >>>>
> >>>> The right behavior is to set the value of the index while iterating.
> >>>>
> >>>>    1.     while (it.hasNext()) {
> >>>>    2.       Vector.Element e = it.next();
> >>>>    3.       *e.set(e.get()* - average);
> >>>>    4.     }
> >>>>
> >>>> I am sure we are incorrectly doing the first style across the code at
> >>>> many places.
> >>>>
> >>>> I am proposing this
> >>>>
> >>>> When iterating, we lock the set interface on the vector using a State
> >>>> enum. If anyone tries to mutate, we throw an exception.
> >>>> We flip the state when we complete iterating (hasNext = false) or when
> >>>> we explicitly close the iterator (adding a close method on the
> iterator).
> >>>>
> >>>> Again this is all a single thread fix. if a vector is being mutated
> and
> >>>> iterated across multiple threads, all hell can break loose.
> >>>>
> >>>> Robin
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <[email protected]
> >wrote:
> >>>>
> >>>>> Spoke too soon still failure.  I am uploading the latest patch. These
> >>>>> are the current failing tests.
> >>>>>
> >>>>>
>  
> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> >>>>> not expecting cluster:{0:1.0,1:1.0}
> >>>>>
> >>>>>
> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> >>>>> not expecting cluster:{0:1.0,1:1.0}
> >>>>>
> >>>>>
> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> >>>>> null
> >>>>>
> >>>>>
> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> >>>>> null
> >>>>>
> >>>>>
> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
> >>>>> expected:<0.5303300858899108> but was:<0.38729833462074176>
> >>>>>
> >>>>>
> >>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >>>>>
> >>>>>
> >>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <[email protected]
> >wrote:
> >>>>>
> >>>>>> Found it, fixed it. I am submitting soon.
> >>>>>>
> >>>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <
> [email protected]>wrote:
> >>>>>>
> >>>>>>> Robin,
> >>>>>>>
> >>>>>>> Can you make sure that the patches are somewhere that Dan can pick
> >>>>>>> up this
> >>>>>>> work?  He is in GMT+2 and is probably about to appear on the scene.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> > Strike that there are still failures. Investigating. if I cant
> fix
> >>>>>>> it in
> >>>>>>> > the next hour, I will submit them sometime in the evening
> tomorrow.
> >>>>>>> >
> >>>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <
> [email protected]>
> >>>>>>> wrote:
> >>>>>>> >
> >>>>>>> > > Tests pass. Submitting the patches.
> >>>>>>> > >
> >>>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >>>>>>> > >
> >>>>>>> > >
> >>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <
> >>>>>>> [email protected]>
> >>>>>>> > wrote:
> >>>>>>> > >
> >>>>>>> > >> Added a few more tests. Throw NoSuchElementException like Java
> >>>>>>> > >> Collections when iterating past the end. Things look solid,
> >>>>>>> performance
> >>>>>>> > is
> >>>>>>> > >> 2x. All Math tests pass. I am now waiting for the entire test
> >>>>>>> suites to
> >>>>>>> > run
> >>>>>>> > >> before submitting.
> >>>>>>> > >>
> >>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | Google
> Inc.
> >>>>>>> > >>
> >>>>>>> > >>
> >>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <
> >>>>>>> [email protected]>
> >>>>>>> > wrote:
> >>>>>>> > >>
> >>>>>>> > >>> I am not sure what I did. But removing Guava Abstract
> iterator
> >>>>>>> actually
> >>>>>>> > >>> sped up the dot, cosine, euclidean by another 60%. Things are
> >>>>>>> now 2x
> >>>>>>> > faster
> >>>>>>> > >>> than trunk. While also correcting the behavior (I hope)
> >>>>>>> > >>>
> >>>>>>> > >>>
> >>>>>>> > >>>
> >>>>>>> >
> >>>>>>>
> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
> >>>>>>> > >>>
> >>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
> Inc.
> >>>>>>> > >>>
> >>>>>>> > >>>
> >>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <
> >>>>>>> [email protected]
> >>>>>>> > >wrote:
> >>>>>>> > >>>
> >>>>>>> > >>>> Also note that this is code gen, I have to create
> >>>>>>> > Element$keyType$Value
> >>>>>>> > >>>> for each and every combination not just int double. and also
> >>>>>>> update
> >>>>>>> > all
> >>>>>>> > >>>> callers to user ElementIntDouble instead of Element. Is it
> >>>>>>> worth it ?
> >>>>>>> > >>>>
> >>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
> >>>>>>> Inc.
> >>>>>>> > >>>>
> >>>>>>> > >>>>
> >>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <
> >>>>>>> [email protected]
> >>>>>>> > >wrote:
> >>>>>>> > >>>>
> >>>>>>> > >>>>> Collections (no longer colt collections) are now part of
> >>>>>>> mahout math.
> >>>>>>> > >>>>>  No
> >>>>>>> > >>>>> need to keep them separate.  The lower iterator can
> reference
> >>>>>>> > >>>>> Vector.Element
> >>>>>>> > >>>>>
> >>>>>>> > >>>>>
> >>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <
> >>>>>>> [email protected]>
> >>>>>>> > >>>>> wrote:
> >>>>>>> > >>>>>
> >>>>>>> > >>>>> > I would have loved to but Element is a sub interface in
> >>>>>>> Vector. If
> >>>>>>> > >>>>> we want
> >>>>>>> > >>>>> > to keep colt collections separate we have to keep this
> >>>>>>> separation.
> >>>>>>> > >>>>> >
> >>>>>>> > >>>>>
> >>>>>>> > >>>>
> >>>>>>> > >>>>
> >>>>>>> > >>>
> >>>>>>> > >>
> >>>>>>> > >
> >>>>>>> >
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>



-- 

  -jake

Reply via email to