The number of features being asked of Vector and Matrix is at the edge of practicality. I would just ban equals()/hashCode() entirely and have them throw exceptions. If you must keep them, I would maintain a running XOR hash value, updated on each set() operation. Supporting sparsity is possible.
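As a rough illustration of that running-XOR idea: because XOR is its own inverse, a per-element hash can be removed and re-added in O(1) on every set(), so hashCode() never has to walk the vector. A minimal standalone sketch, not Mahout's actual API (the class and method names here are hypothetical):

import java.util.Arrays;

// Sketch only: a vector that maintains a running XOR hash, patched on
// every set(), so hashCode() is O(1) yet still consistent with equals().
public class XorHashedVector {
  private final double[] values;
  private int hash; // XOR of elementHash(i, values[i]) over non-zero entries

  public XorHashedVector(int size) {
    this.values = new double[size];
    this.hash = 0; // an all-zero vector contributes nothing
  }

  private static int elementHash(int index, double value) {
    long bits = Double.doubleToLongBits(value);
    return (31 * index) ^ (int) (bits ^ (bits >>> 32));
  }

  public void set(int index, double value) {
    double old = values[index];
    if (old != 0.0) {
      hash ^= elementHash(index, old);   // XOR removes the old contribution
    }
    if (value != 0.0) {
      hash ^= elementHash(index, value); // ... and adds the new one
    }
    values[index] = value;
  }

  public double get(int index) {
    return values[index];
  }

  @Override
  public int hashCode() {
    return hash; // O(1); maintained incrementally by set()
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof XorHashedVector)) {
      return false;
    }
    XorHashedVector other = (XorHashedVector) o;
    // Cheap hash precheck first; the full comparison is still O(size).
    return hash == other.hash && Arrays.equals(values, other.values);
  }
}

Because zero entries contribute nothing to the hash, the same bookkeeping would work over a sparse backing store, which is what makes sparsity supportable.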
On Thu, Feb 23, 2012 at 5:26 PM, Jake Mannix <[email protected]> wrote:
> On Thu, Feb 23, 2012 at 4:38 PM, Ted Dunning <[email protected]> wrote:
>
>> D,
>>
>> I think that the current idea on the table is to do essentially nothing
>> except carefully examine and possibly remove cases of tables or sets with
>> Vector keys.
>
> Yes, that's what I'm suggesting. If you grep through our code for
> Map<Vector, or Map<DenseVector, or Map<SparseVector, there are a few
> cases, and we might want to see if any of these are sources of
> inefficiency in this sneaky way.
>
> -jake
>
>> What specifically are you arguing against?
>>
>> On Fri, Feb 24, 2012 at 12:25 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> > Right. OK.
>> >
>> > Just in case, here's my updated list of cons:
>> >
>> > 1) If people explicitly write a.equals(b) where they could just as
>> > easily have written a == b if they were OK with identity comparison,
>> > it pretty surely means they expect equals-by-value, not identity
>> > comparison.
>> >
>> > 2) The same goes for HashMap vs. IdentityHashMap (though much less
>> > obviously).
>> >
>> > 3) We probably don't want to prohibit using vectors as MR map output
>> > keys (although I don't immediately see use cases for it), which means
>> > we have to support comparable-by-value, and that would be
>> > contractually incoherent with equals-by-identity at the same time.
>> >
>> > 4) What Ted said about precaching hash codes for Strings I can easily
>> > turn into another con: apparently best practice is to go the extra
>> > step of optimizing hash coding rather than dropping equals-by-value
>> > and hashCode-by-value. There are techniques to improve it (e.g. the
>> > identity shortcut still applies, since if a == b then a.equals(b) is
>> > also true). Hash codes can indeed be cached (with the caveat that
>> > since Vector is not immutable, it needs lazy caching and cache
>> > invalidation similar to the GregorianCalendar stuff, but that's
>> > probably a hassle).
>> >
>> > I guess we can look into some hash-coding optimization if it seems
>> > compelling that we do so. Well, I guess Sean is saying that too.
>> >
>> > On Thu, Feb 23, 2012 at 2:31 PM, Sean Owen <[email protected]> wrote:
>> > > I think equals() and hashCode() ought to be retained, at least for
>> > > consistency. There is no sin in implementing these per se. If
>> > > hashCode() is too slow in some context, then it can be cached if it
>> > > is so important. This is what String does.
>> > >
>> > > However, I doubt these should be used as keys in general - that is
>> > > the issue to be fixed, if anything, if there is a performance
>> > > problem.
>> > >
>> > > Do you want people to use equals()? Dunno, it's up to the caller
>> > > really.
>> > >
>> > > Sean
>> > >
>> > > On Feb 23, 2012 5:25 PM, "Jake Mannix" <[email protected]> wrote:
>> > >
>> > >> Hey Devs,
>> > >>
>> > >> I was prototyping some stuff in Mahout last night, and noticed
>> > >> something I'm not sure we've talked about before: because equals()
>> > >> on Vector instances returns true iff the numeric values of the
>> > >> vectors are equal, and we also have a consistent hashCode(),
>> > >> anytime you have a HashMap<Vector, Anything>, all the typical
>> > >> operations you think are O(1) are really O(vector.numNonZeroes()).
>> > >> I tried to look through the codebase to see where we hang onto
>> > >> maps with Vector keys, and we do it sometimes. Maybe we shouldn't?
>> > >> Most Vectors have identities (clusterId, documentId, topicId,
>> > >> etc.) which we could normalize away... or maybe we should be using
>> > >> IdentityHashMap, to ensure strict object identity and avoid doing
>> > >> this calculation? This could be really slow if these are big dense
>> > >> vectors, for instance.
>> > >>
>> > >> This looks like it could be a really easy place to accidentally
>> > >> add heavy complexity to things. Do we really want people to be
>> > >> checking *mathematical* equals() on vectors which have
>> > >> floating-point precision?
>> > >>
>> > >> -jake

--
Lance Norskog
[email protected]
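For reference, a small sketch of the cost Jake describes versus the IdentityHashMap alternative. It assumes Mahout's org.apache.mahout.math types; the class name and map contents are placeholders:

import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class VectorKeyCost {
  public static void main(String[] args) {
    Vector v = new DenseVector(1000000); // one big dense vector

    // Value-based map: put() and get() each recompute hashCode() over the
    // whole vector, so the nominally O(1) operations are really O(size).
    Map<Vector, String> byValue = new HashMap<Vector, String>();
    byValue.put(v, "doc-42");
    byValue.get(v);

    // Identity-based map: hashes the reference itself in O(1), but a
    // lookup only succeeds with this exact object, never an equal copy.
    Map<Vector, String> byIdentity = new IdentityHashMap<Vector, String>();
    byIdentity.put(v, "doc-42");
    byIdentity.get(v);
  }
}

The tradeoff is exactly Dmitriy's point 1 above: the identity map is fast, but callers writing map.get(v) almost always expect value semantics, so switching silently would surprise them.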
