This is what I propose:

1) Allow setting value to zero while iterating (e.set(0.0)).
2) Do not allow callers to use vector.set(index, 0.0) during iterating).
This can cause re-hashing. (Can set a dirty bit in the hashmap during
rehash to throw a concurrent modified exception)
3) Update the numNonDefaultElements to iterate over the array to discount
0.0 instead of returning the hashMap values.
4) IterateNonZero may iterate over a few zeros if you did set the dimension
to 0. Most of the statistics code should handle 0 values correctly.



Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Mon, Apr 15, 2013 at 1:50 PM, Jake Mannix <[email protected]> wrote:

> Ah, this was the one corner case I was worried about - we do special-case
> setting to 0,
> as meaning remove from the hashmap, yes.
>
> What's the TL;DR of what you did to work around this?  Should we allow
> this?  Even
> if it's through the Vector.Element instance, should it be ok?  If so, how
> to handle?
>
>
> On Mon, Apr 15, 2013 at 11:04 AM, Robin Anil <[email protected]> wrote:
>
> > I am adding the tests and updating the patch.
> >
> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >
> >
> > On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <[email protected]>
> wrote:
> >
> > > You can re-iterate if the state is in iteration. But you cannot write.
> > >
> > > This is what is happening:
> > >
> > > One of the values are becoming 0. So Vector tries to remove it from the
> > > underlying hashmap. This changes the layout, if a vector has to be
> > mutated
> > > while iterating, we have to set 0 value in the hashmap and not remove
> it
> > > like what the Vector layer is doing. This adds another complexity, the
> > > vector iterator has to deal with skipping over elements with 0 value.
> > >
> > >
> > > Try this
> > >
> > > Create a vector of length 13 and set the following values.
> > >
> > >
> > >    1.     double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1,
> 1,
> > >    2, 1 };
> > >    2.     for (int i = 0; i < val.length; ++i) {
> > >    3.       vector.set(i, val[i]);
> > >    4.     }
> > >
> > > Iterate again and while iterating set one of the values as zero.
> > >
> > > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon <
> > [email protected]
> > > > wrote:
> > >
> > >> What kind of Vector is failing to set() in that code?
> > >>
> > >> About the state enum, what if (for whatever reason, not
> > >> multi-threaded-ness) there are multiple iterators to that vector?
> > >> Something like a reference count (how many iterators point to it)
> would
> > >> probably be needed, and keeping it sane would only be possible in one
> > >> thread. Although this seems kind of brittle.
> > >>
> > >> +1 for numNonDefault.
> > >>
> > >>
> > >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <[email protected]>
> > wrote:
> > >>
> > >>> Another behavior difference.
> > >>>
> > >>> The numNonDefaultElement for a DenseVector returns the total length.
> > >>> This causes Pearson Correlation Similarity to differ from if it was
> > >>> implemented using on of the SparseVector.
> > >>> I am proposing to fix the numNonDefaultElement to correctly iterate
> > over
> > >>> the dense vector to figure out non zero values ? Sounds ok
> > >>>
> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > >>>
> > >>>
> > >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <[email protected]
> > >wrote:
> > >>>
> > >>>> Found the bug PearsonCorrelationSimilarity was trying to mutate the
> > >>>> object while iterating.
> > >>>>
> > >>>>
> > >>>>    1.     while (it.hasNext()) {
> > >>>>    2.       Vector.Element e = it.next();
> > >>>>    3.       *vector.set(e.index(),* e.get() - average);
> > >>>>    4.     }
> > >>>>
> > >>>> This has a side effect of causing the underlying hash-map or object
> to
> > >>>> change.
> > >>>>
> > >>>> The right behavior is to set the value of the index while iterating.
> > >>>>
> > >>>>    1.     while (it.hasNext()) {
> > >>>>    2.       Vector.Element e = it.next();
> > >>>>    3.       *e.set(e.get()* - average);
> > >>>>    4.     }
> > >>>>
> > >>>> I am sure we are incorrectly doing the first style across the code
> at
> > >>>> many places.
> > >>>>
> > >>>> I am proposing this
> > >>>>
> > >>>> When iterating, we lock the set interface on the vector using a
> State
> > >>>> enum. If anyone tries to mutate, we throw an exception.
> > >>>> We flip the state when we complete iterating (hasNext = false) or
> when
> > >>>> we explicitly close the iterator (adding a close method on the
> > iterator).
> > >>>>
> > >>>> Again this is all a single thread fix. if a vector is being mutated
> > and
> > >>>> iterated across multiple threads, all hell can break loose.
> > >>>>
> > >>>> Robin
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <[email protected]
> > >wrote:
> > >>>>
> > >>>>> Spoke too soon still failure.  I am uploading the latest patch.
> These
> > >>>>> are the current failing tests.
> > >>>>>
> > >>>>>
> >
>  
> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> > >>>>> not expecting cluster:{0:1.0,1:1.0}
> > >>>>>
> > >>>>>
> >
> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> > >>>>> not expecting cluster:{0:1.0,1:1.0}
> > >>>>>
> > >>>>>
> >
> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> > >>>>> null
> > >>>>>
> > >>>>>
> >
> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> > >>>>> null
> > >>>>>
> > >>>>>
> >
> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
> > >>>>> expected:<0.5303300858899108> but was:<0.38729833462074176>
> > >>>>>
> > >>>>>
> > >>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > >>>>>
> > >>>>>
> > >>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <[email protected]
> > >wrote:
> > >>>>>
> > >>>>>> Found it, fixed it. I am submitting soon.
> > >>>>>>
> > >>>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > >>>>>>
> > >>>>>>
> > >>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <
> > [email protected]>wrote:
> > >>>>>>
> > >>>>>>> Robin,
> > >>>>>>>
> > >>>>>>> Can you make sure that the patches are somewhere that Dan can
> pick
> > >>>>>>> up this
> > >>>>>>> work?  He is in GMT+2 and is probably about to appear on the
> scene.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <
> [email protected]>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>> > Strike that there are still failures. Investigating. if I cant
> > fix
> > >>>>>>> it in
> > >>>>>>> > the next hour, I will submit them sometime in the evening
> > tomorrow.
> > >>>>>>> >
> > >>>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > >>>>>>> >
> > >>>>>>> >
> > >>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <
> > [email protected]>
> > >>>>>>> wrote:
> > >>>>>>> >
> > >>>>>>> > > Tests pass. Submitting the patches.
> > >>>>>>> > >
> > >>>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google
> Inc.
> > >>>>>>> > >
> > >>>>>>> > >
> > >>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <
> > >>>>>>> [email protected]>
> > >>>>>>> > wrote:
> > >>>>>>> > >
> > >>>>>>> > >> Added a few more tests. Throw NoSuchElementException like
> Java
> > >>>>>>> > >> Collections when iterating past the end. Things look solid,
> > >>>>>>> performance
> > >>>>>>> > is
> > >>>>>>> > >> 2x. All Math tests pass. I am now waiting for the entire
> test
> > >>>>>>> suites to
> > >>>>>>> > run
> > >>>>>>> > >> before submitting.
> > >>>>>>> > >>
> > >>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | Google
> > Inc.
> > >>>>>>> > >>
> > >>>>>>> > >>
> > >>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <
> > >>>>>>> [email protected]>
> > >>>>>>> > wrote:
> > >>>>>>> > >>
> > >>>>>>> > >>> I am not sure what I did. But removing Guava Abstract
> > iterator
> > >>>>>>> actually
> > >>>>>>> > >>> sped up the dot, cosine, euclidean by another 60%. Things
> are
> > >>>>>>> now 2x
> > >>>>>>> > faster
> > >>>>>>> > >>> than trunk. While also correcting the behavior (I hope)
> > >>>>>>> > >>>
> > >>>>>>> > >>>
> > >>>>>>> > >>>
> > >>>>>>> >
> > >>>>>>>
> >
> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
> > >>>>>>> > >>>
> > >>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
> > Inc.
> > >>>>>>> > >>>
> > >>>>>>> > >>>
> > >>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <
> > >>>>>>> [email protected]
> > >>>>>>> > >wrote:
> > >>>>>>> > >>>
> > >>>>>>> > >>>> Also note that this is code gen, I have to create
> > >>>>>>> > Element$keyType$Value
> > >>>>>>> > >>>> for each and every combination not just int double. and
> also
> > >>>>>>> update
> > >>>>>>> > all
> > >>>>>>> > >>>> callers to user ElementIntDouble instead of Element. Is it
> > >>>>>>> worth it ?
> > >>>>>>> > >>>>
> > >>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
> > >>>>>>> Inc.
> > >>>>>>> > >>>>
> > >>>>>>> > >>>>
> > >>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <
> > >>>>>>> [email protected]
> > >>>>>>> > >wrote:
> > >>>>>>> > >>>>
> > >>>>>>> > >>>>> Collections (no longer colt collections) are now part of
> > >>>>>>> mahout math.
> > >>>>>>> > >>>>>  No
> > >>>>>>> > >>>>> need to keep them separate.  The lower iterator can
> > reference
> > >>>>>>> > >>>>> Vector.Element
> > >>>>>>> > >>>>>
> > >>>>>>> > >>>>>
> > >>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <
> > >>>>>>> [email protected]>
> > >>>>>>> > >>>>> wrote:
> > >>>>>>> > >>>>>
> > >>>>>>> > >>>>> > I would have loved to but Element is a sub interface in
> > >>>>>>> Vector. If
> > >>>>>>> > >>>>> we want
> > >>>>>>> > >>>>> > to keep colt collections separate we have to keep this
> > >>>>>>> separation.
> > >>>>>>> > >>>>> >
> > >>>>>>> > >>>>>
> > >>>>>>> > >>>>
> > >>>>>>> > >>>>
> > >>>>>>> > >>>
> > >>>>>>> > >>
> > >>>>>>> > >
> > >>>>>>> >
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >
> >
>
>
>
> --
>
>   -jake
>

Reply via email to