Committed. I moved the getNumNonDefaultElements() change to a new method
getNumNonZeroElements() and used that for Pearson Correlation. The
getNumNonDefaultElements() will work as before, that really needs to be a
private method.

http://svn.apache.org/viewvc?view=revision&revision=1468209

Sync, let me know if you find any issues with this.

Robin


On Mon, Apr 15, 2013 at 2:30 PM, Robin Anil <[email protected]> wrote:

> Iterable is a safer interface, you can implement non-zero-ness check
> easily. Iterator is not.
>
> I think I have fixed all the failing tests (They were failing because the
> asFormatString order seems to have changed with the new iterators)
>
> https://reviews.apache.org/r/10455/diff/6/
>
>
>
> On Mon, Apr 15, 2013 at 2:21 PM, Jake Mannix <[email protected]>wrote:
>
>> On Mon, Apr 15, 2013 at 12:14 PM, Robin Anil <[email protected]>
>> wrote:
>>
>> > Another crazy idea for the future is to kill the usage of
>> > OpenIntDoubleHashMap entirely and copy parts of it inside RASV which
>> will
>> > only deal with nonzero keys and non zero values. RASV can then keep
>> track
>> > of non-zero elements in a variable to speed up those lookups.
>> >
>> >
>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>> >
>> >
>> > On Mon, Apr 15, 2013 at 2:11 PM, Robin Anil <[email protected]>
>> wrote:
>> >
>> > > The point 3 is coming from the philosophy that all Vectors behave the
>> > same
>> > > way and numNonDefaultElements of a DenseVector is same as that of a
>> > > SparseVector. Eg, if PersonSimilarity relies upon it for document
>> length,
>> > > it should be behave the same way.
>> > >
>> > > The point 4 can be solved by killing the iterator interface entirely
>> and
>> > > creating forEachNonZero(function()) method which will only call if the
>> > > element is nonzero.
>> >
>>
>> Killing iteration would be really really bad, from a useability
>> standpoint.
>>  In fact,
>> I've been moving in the other direction:
>> https://reviews.apache.org/r/9867/
>> adds iterators to the basic collection interface!
>>
>>
>>
>> >  >
>> > >
>> > >
>> > > On Mon, Apr 15, 2013 at 2:08 PM, Jake Mannix <[email protected]
>> > >wrote:
>> > >
>> > >> On Mon, Apr 15, 2013 at 11:58 AM, Robin Anil <[email protected]>
>> > >> wrote:
>> > >>
>> > >> > This is what I propose:
>> > >> >
>> > >> > 1) Allow setting value to zero while iterating (e.set(0.0)).
>> > >> >
>> > >>
>> > >> This is in addition to the fact that we already allow setting nonzero
>> > >> values
>> > >> while iterating, right?
>> > >>
>> > >>
>> > >> > 2) Do not allow callers to use vector.set(index, 0.0) during
>> > iterating).
>> > >> > This can cause re-hashing. (Can set a dirty bit in the hashmap
>> during
>> > >> > rehash to throw a concurrent modified exception)
>> > >> >
>> > >>
>> > >> Agreed - this is a commonly accepted requirement: I think in fact we
>> > >> should pro-actively throw ConcurrentModificationException if someone
>> > >> tries to call vector.set / vector.assign while iterating.
>> > >>
>> > >>
>> > >> > 3) Update the numNonDefaultElements to iterate over the array to
>> > >> discount
>> > >> > 0.0 instead of returning the hashMap values.
>> > >> > 4) IterateNonZero may iterate over a few zeros if you did set the
>> > >> dimension
>> > >> > to 0. Most of the statistics code should handle 0 values correctly.
>> > >> >
>> > >>
>> > >> Yeah, are we really strict about getNumNonDefaultElements really
>> always
>> > >> returning exactly the number of nonzeroes?  I was under the
>> impression
>> > >> that
>> > >> for e.g. DenseVector, it would give the overal size, even if some
>> were
>> > 0,
>> > >> and that it was basically tracking the amount of space the vector was
>> > >> taking
>> > >> up.  But I can see the argument that it really should return what it
>> > says
>> > >> it
>> > >> returns, if that is relied upon.
>> > >>
>> > >>
>> > >> >
>> > >> >
>> > >> >
>> > >> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>> > >> >
>> > >> >
>> > >> > On Mon, Apr 15, 2013 at 1:50 PM, Jake Mannix <
>> [email protected]>
>> > >> > wrote:
>> > >> >
>> > >> > > Ah, this was the one corner case I was worried about - we do
>> > >> special-case
>> > >> > > setting to 0,
>> > >> > > as meaning remove from the hashmap, yes.
>> > >> > >
>> > >> > > What's the TL;DR of what you did to work around this?  Should we
>> > allow
>> > >> > > this?  Even
>> > >> > > if it's through the Vector.Element instance, should it be ok?  If
>> > so,
>> > >> how
>> > >> > > to handle?
>> > >> > >
>> > >> > >
>> > >> > > On Mon, Apr 15, 2013 at 11:04 AM, Robin Anil <
>> [email protected]>
>> > >> > wrote:
>> > >> > >
>> > >> > > > I am adding the tests and updating the patch.
>> > >> > > >
>> > >> > > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>> > >> > > >
>> > >> > > >
>> > >> > > > On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <
>> [email protected]
>> > >
>> > >> > > wrote:
>> > >> > > >
>> > >> > > > > You can re-iterate if the state is in iteration. But you
>> cannot
>> > >> > write.
>> > >> > > > >
>> > >> > > > > This is what is happening:
>> > >> > > > >
>> > >> > > > > One of the values are becoming 0. So Vector tries to remove
>> it
>> > >> from
>> > >> > the
>> > >> > > > > underlying hashmap. This changes the layout, if a vector has
>> to
>> > be
>> > >> > > > mutated
>> > >> > > > > while iterating, we have to set 0 value in the hashmap and
>> not
>> > >> remove
>> > >> > > it
>> > >> > > > > like what the Vector layer is doing. This adds another
>> > complexity,
>> > >> > the
>> > >> > > > > vector iterator has to deal with skipping over elements with
>> 0
>> > >> value.
>> > >> > > > >
>> > >> > > > >
>> > >> > > > > Try this
>> > >> > > > >
>> > >> > > > > Create a vector of length 13 and set the following values.
>> > >> > > > >
>> > >> > > > >
>> > >> > > > >    1.     double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0,
>> 6,
>> > >> 0, 1,
>> > >> > > 1,
>> > >> > > > >    2, 1 };
>> > >> > > > >    2.     for (int i = 0; i < val.length; ++i) {
>> > >> > > > >    3.       vector.set(i, val[i]);
>> > >> > > > >    4.     }
>> > >> > > > >
>> > >> > > > > Iterate again and while iterating set one of the values as
>> zero.
>> > >> > > > >
>> > >> > > > > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon <
>> > >> > > > [email protected]
>> > >> > > > > > wrote:
>> > >> > > > >
>> > >> > > > >> What kind of Vector is failing to set() in that code?
>> > >> > > > >>
>> > >> > > > >> About the state enum, what if (for whatever reason, not
>> > >> > > > >> multi-threaded-ness) there are multiple iterators to that
>> > vector?
>> > >> > > > >> Something like a reference count (how many iterators point
>> to
>> > it)
>> > >> > > would
>> > >> > > > >> probably be needed, and keeping it sane would only be
>> possible
>> > in
>> > >> > one
>> > >> > > > >> thread. Although this seems kind of brittle.
>> > >> > > > >>
>> > >> > > > >> +1 for numNonDefault.
>> > >> > > > >>
>> > >> > > > >>
>> > >> > > > >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <
>> > >> [email protected]>
>> > >> > > > wrote:
>> > >> > > > >>
>> > >> > > > >>> Another behavior difference.
>> > >> > > > >>>
>> > >> > > > >>> The numNonDefaultElement for a DenseVector returns the
>> total
>> > >> > length.
>> > >> > > > >>> This causes Pearson Correlation Similarity to differ from
>> if
>> > it
>> > >> was
>> > >> > > > >>> implemented using on of the SparseVector.
>> > >> > > > >>> I am proposing to fix the numNonDefaultElement to correctly
>> > >> iterate
>> > >> > > > over
>> > >> > > > >>> the dense vector to figure out non zero values ? Sounds ok
>> > >> > > > >>>
>> > >> > > > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
>> > Inc.
>> > >> > > > >>>
>> > >> > > > >>>
>> > >> > > > >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <
>> > >> [email protected]
>> > >> > > > >wrote:
>> > >> > > > >>>
>> > >> > > > >>>> Found the bug PearsonCorrelationSimilarity was trying to
>> > mutate
>> > >> > the
>> > >> > > > >>>> object while iterating.
>> > >> > > > >>>>
>> > >> > > > >>>>
>> > >> > > > >>>>    1.     while (it.hasNext()) {
>> > >> > > > >>>>    2.       Vector.Element e = it.next();
>> > >> > > > >>>>    3.       *vector.set(e.index(),* e.get() - average);
>> > >> > > > >>>>    4.     }
>> > >> > > > >>>>
>> > >> > > > >>>> This has a side effect of causing the underlying hash-map
>> or
>> > >> > object
>> > >> > > to
>> > >> > > > >>>> change.
>> > >> > > > >>>>
>> > >> > > > >>>> The right behavior is to set the value of the index while
>> > >> > iterating.
>> > >> > > > >>>>
>> > >> > > > >>>>    1.     while (it.hasNext()) {
>> > >> > > > >>>>    2.       Vector.Element e = it.next();
>> > >> > > > >>>>    3.       *e.set(e.get()* - average);
>> > >> > > > >>>>    4.     }
>> > >> > > > >>>>
>> > >> > > > >>>> I am sure we are incorrectly doing the first style across
>> the
>> > >> code
>> > >> > > at
>> > >> > > > >>>> many places.
>> > >> > > > >>>>
>> > >> > > > >>>> I am proposing this
>> > >> > > > >>>>
>> > >> > > > >>>> When iterating, we lock the set interface on the vector
>> > using a
>> > >> > > State
>> > >> > > > >>>> enum. If anyone tries to mutate, we throw an exception.
>> > >> > > > >>>> We flip the state when we complete iterating (hasNext =
>> > false)
>> > >> or
>> > >> > > when
>> > >> > > > >>>> we explicitly close the iterator (adding a close method on
>> > the
>> > >> > > > iterator).
>> > >> > > > >>>>
>> > >> > > > >>>> Again this is all a single thread fix. if a vector is
>> being
>> > >> > mutated
>> > >> > > > and
>> > >> > > > >>>> iterated across multiple threads, all hell can break
>> loose.
>> > >> > > > >>>>
>> > >> > > > >>>> Robin
>> > >> > > > >>>>
>> > >> > > > >>>>
>> > >> > > > >>>>
>> > >> > > > >>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <
>> > >> > [email protected]
>> > >> > > > >wrote:
>> > >> > > > >>>>
>> > >> > > > >>>>> Spoke too soon still failure.  I am uploading the latest
>> > >> patch.
>> > >> > > These
>> > >> > > > >>>>> are the current failing tests.
>> > >> > > > >>>>>
>> > >> > > > >>>>>
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>>  
>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
>> > >> > > > >>>>> not expecting cluster:{0:1.0,1:1.0}
>> > >> > > > >>>>>
>> > >> > > > >>>>>
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
>> > >> > > > >>>>> not expecting cluster:{0:1.0,1:1.0}
>> > >> > > > >>>>>
>> > >> > > > >>>>>
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
>> > >> > > > >>>>> null
>> > >> > > > >>>>>
>> > >> > > > >>>>>
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
>> > >> > > > >>>>> null
>> > >> > > > >>>>>
>> > >> > > > >>>>>
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
>> > >> > > > >>>>> expected:<0.5303300858899108> but
>> was:<0.38729833462074176>
>> > >> > > > >>>>>
>> > >> > > > >>>>>
>> > >> > > > >>>>> Robin Anil | Software Engineer | +1 312 869 2602 |
>> Google
>> > >> Inc.
>> > >> > > > >>>>>
>> > >> > > > >>>>>
>> > >> > > > >>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <
>> > >> > [email protected]
>> > >> > > > >wrote:
>> > >> > > > >>>>>
>> > >> > > > >>>>>> Found it, fixed it. I am submitting soon.
>> > >> > > > >>>>>>
>> > >> > > > >>>>>> Robin Anil | Software Engineer | +1 312 869 2602 |
>> Google
>> > >> Inc.
>> > >> > > > >>>>>>
>> > >> > > > >>>>>>
>> > >> > > > >>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <
>> > >> > > > [email protected]>wrote:
>> > >> > > > >>>>>>
>> > >> > > > >>>>>>> Robin,
>> > >> > > > >>>>>>>
>> > >> > > > >>>>>>> Can you make sure that the patches are somewhere that
>> Dan
>> > >> can
>> > >> > > pick
>> > >> > > > >>>>>>> up this
>> > >> > > > >>>>>>> work?  He is in GMT+2 and is probably about to appear
>> on
>> > the
>> > >> > > scene.
>> > >> > > > >>>>>>>
>> > >> > > > >>>>>>>
>> > >> > > > >>>>>>>
>> > >> > > > >>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <
>> > >> > > [email protected]>
>> > >> > > > >>>>>>> wrote:
>> > >> > > > >>>>>>>
>> > >> > > > >>>>>>> > Strike that there are still failures. Investigating.
>> if
>> > I
>> > >> > cant
>> > >> > > > fix
>> > >> > > > >>>>>>> it in
>> > >> > > > >>>>>>> > the next hour, I will submit them sometime in the
>> > evening
>> > >> > > > tomorrow.
>> > >> > > > >>>>>>> >
>> > >> > > > >>>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 |
>> > Google
>> > >> > Inc.
>> > >> > > > >>>>>>> >
>> > >> > > > >>>>>>> >
>> > >> > > > >>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <
>> > >> > > > [email protected]>
>> > >> > > > >>>>>>> wrote:
>> > >> > > > >>>>>>> >
>> > >> > > > >>>>>>> > > Tests pass. Submitting the patches.
>> > >> > > > >>>>>>> > >
>> > >> > > > >>>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 |
>> > >> Google
>> > >> > > Inc.
>> > >> > > > >>>>>>> > >
>> > >> > > > >>>>>>> > >
>> > >> > > > >>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <
>> > >> > > > >>>>>>> [email protected]>
>> > >> > > > >>>>>>> > wrote:
>> > >> > > > >>>>>>> > >
>> > >> > > > >>>>>>> > >> Added a few more tests. Throw
>> NoSuchElementException
>> > >> like
>> > >> > > Java
>> > >> > > > >>>>>>> > >> Collections when iterating past the end. Things
>> look
>> > >> > solid,
>> > >> > > > >>>>>>> performance
>> > >> > > > >>>>>>> > is
>> > >> > > > >>>>>>> > >> 2x. All Math tests pass. I am now waiting for the
>> > >> entire
>> > >> > > test
>> > >> > > > >>>>>>> suites to
>> > >> > > > >>>>>>> > run
>> > >> > > > >>>>>>> > >> before submitting.
>> > >> > > > >>>>>>> > >>
>> > >> > > > >>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602|
>> > >> Google
>> > >> > > > Inc.
>> > >> > > > >>>>>>> > >>
>> > >> > > > >>>>>>> > >>
>> > >> > > > >>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <
>> > >> > > > >>>>>>> [email protected]>
>> > >> > > > >>>>>>> > wrote:
>> > >> > > > >>>>>>> > >>
>> > >> > > > >>>>>>> > >>> I am not sure what I did. But removing Guava
>> > Abstract
>> > >> > > > iterator
>> > >> > > > >>>>>>> actually
>> > >> > > > >>>>>>> > >>> sped up the dot, cosine, euclidean by another
>> 60%.
>> > >> Things
>> > >> > > are
>> > >> > > > >>>>>>> now 2x
>> > >> > > > >>>>>>> > faster
>> > >> > > > >>>>>>> > >>> than trunk. While also correcting the behavior (I
>> > >> hope)
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> >
>> > >> > > > >>>>>>>
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602|
>> > >> > Google
>> > >> > > > Inc.
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <
>> > >> > > > >>>>>>> [email protected]
>> > >> > > > >>>>>>> > >wrote:
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>>> Also note that this is code gen, I have to
>> create
>> > >> > > > >>>>>>> > Element$keyType$Value
>> > >> > > > >>>>>>> > >>>> for each and every combination not just int
>> double.
>> > >> and
>> > >> > > also
>> > >> > > > >>>>>>> update
>> > >> > > > >>>>>>> > all
>> > >> > > > >>>>>>> > >>>> callers to user ElementIntDouble instead of
>> > Element.
>> > >> Is
>> > >> > it
>> > >> > > > >>>>>>> worth it ?
>> > >> > > > >>>>>>> > >>>>
>> > >> > > > >>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869
>> 2602 |
>> > >> > Google
>> > >> > > > >>>>>>> Inc.
>> > >> > > > >>>>>>> > >>>>
>> > >> > > > >>>>>>> > >>>>
>> > >> > > > >>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <
>> > >> > > > >>>>>>> [email protected]
>> > >> > > > >>>>>>> > >wrote:
>> > >> > > > >>>>>>> > >>>>
>> > >> > > > >>>>>>> > >>>>> Collections (no longer colt collections) are
>> now
>> > >> part
>> > >> > of
>> > >> > > > >>>>>>> mahout math.
>> > >> > > > >>>>>>> > >>>>>  No
>> > >> > > > >>>>>>> > >>>>> need to keep them separate.  The lower iterator
>> > can
>> > >> > > > reference
>> > >> > > > >>>>>>> > >>>>> Vector.Element
>> > >> > > > >>>>>>> > >>>>>
>> > >> > > > >>>>>>> > >>>>>
>> > >> > > > >>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <
>> > >> > > > >>>>>>> [email protected]>
>> > >> > > > >>>>>>> > >>>>> wrote:
>> > >> > > > >>>>>>> > >>>>>
>> > >> > > > >>>>>>> > >>>>> > I would have loved to but Element is a sub
>> > >> interface
>> > >> > in
>> > >> > > > >>>>>>> Vector. If
>> > >> > > > >>>>>>> > >>>>> we want
>> > >> > > > >>>>>>> > >>>>> > to keep colt collections separate we have to
>> > keep
>> > >> > this
>> > >> > > > >>>>>>> separation.
>> > >> > > > >>>>>>> > >>>>> >
>> > >> > > > >>>>>>> > >>>>>
>> > >> > > > >>>>>>> > >>>>
>> > >> > > > >>>>>>> > >>>>
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>
>> > >> > > > >>>>>>> > >
>> > >> > > > >>>>>>> >
>> > >> > > > >>>>>>>
>> > >> > > > >>>>>>
>> > >> > > > >>>>>>
>> > >> > > > >>>>>
>> > >> > > > >>>>
>> > >> > > > >>>
>> > >> > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> > >
>> > >> > >
>> > >> > > --
>> > >> > >
>> > >> > >   -jake
>> > >> > >
>> > >> >
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >>
>> > >>   -jake
>> > >>
>> > >
>> > >
>> >
>>
>>
>>
>> --
>>
>>   -jake
>>
>
>

Reply via email to