1. New story for Canopy Clustering: the clustering algorithm must support arbitrary user-defined distance metrics.
2. From a clustering perspective, scaling the point and weighting the distance are equivalent. Scaling the point, however, changes it in the output as well as in the clustering computations. That is what I want to avoid in my application.

3. Installable transformers seem like a reasonable general-purpose facility that would satisfy a whole family of stories yet to be written. Every time I've over-generalized the solution to a particular story I have regretted it. Here I'm being a devout minimalist. (Sketches of both ideas follow the quoted thread below.)

Jeff

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Friday, February 15, 2008 1:57 PM
To: [email protected]
Subject: Re: Weighted Manhattan Distance Metric

It is vital that any clustering algorithm support user-defined metrics.

It should be mentioned that weighting the distance computation is the same as scaling the inputs for most metrics of the type you are describing.

The big win for installable transformers is saving the extra passes over the data.

On 2/15/08 1:51 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:

> In my application, I don't really want to scale the input data. What I
> want is to weight the clustering distance calculation and pass the input
> data through without adjustment.
>
> This seems to be a lot more mechanism than this single story implies. In
> an effort to avoid premature abstraction, what I had in mind was more
> like:
>
> - add a configure(JobConf job) method to the DistanceMeasure interface
>   so that implementations could be configured (e.g. by loading a weight
>   vector)
> - use the class loader to instantiate arbitrary user-defined distance
>   measures vs. hard-coded choices.
>
> This way, I could invent my own application-specific RealisticManhattan
> and use it without requiring extensions to the library.
>
> Jeff
>
> -----Original Message-----
> From: Ted Dunning [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 14, 2008 6:15 PM
> To: [email protected]
> Subject: Re: Weighted Manhattan Distance Metric
>
> This comes under the heading of scaling and transforming inputs.
>
> This can be done as a separate MR step with no reduce.
>
> It would probably make a very significant difference to performance,
> though, if simple transformations and selections could be included in
> the larger-scale process in a systematic way. That would eliminate many
> passes over the data.
>
> Thus, rather than adding the capability for scaling to each clustering
> and classification algorithm, it would be much better to be able to add
> transformer and selector objects to any algorithm.
>
> The obvious interface that suggests itself is something like this:
>
>   interface RecordTransformer<T extends Writable> {
>     /**
>      * Returns a new record that is the transformation of the input
>      * record, or updates the input record in place and returns null.
>      */
>     T transform(T record);
>   }
>
>   interface RecordSelector<T extends Writable> {
>     /**
>      * Returns true if the record should be retained.
>      */
>     boolean shouldProcess(T record);
>   }
>
> And the universal classifier interface should advertise:
>
>   /**
>    * Register a transformer. First added is first applied.
>    */
>   void addTransformer(RecordTransformer<T> t);
>
>   /**
>    * Register a selector. Only records that pass ALL selectors will be
>    * retained.
>    */
>   void addSelector(RecordSelector<T> s);
>
> There should also be an abstract class that adds the desired behavior to
> any data processing algorithm that wants it.
>
> On 2/14/08 5:35 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:
>
>> In Manhattan and elsewhere, the streets and avenues are not symmetric:
>> the avenues are much farther apart than are the streets. This means that
>> W 53rd St. & 8th Ave is much farther from W 53rd & 7th than it is from
>> W 52nd & 8th. A distance metric that treats all dimensions as equal
>> would be off by a factor of about 2.5. A weighted distance metric that
>> knew of this difference would produce distance values - and hence
>> clusters - that more closely matched the real world.
>>
>> Generalizing this to n-d, the new distance metric might look like this:
>>
>>   distance = sum(abs(p2[i] - p1[i]) * s[i]), where s is a vector of
>>   (positive) scale factors
>>
>> Would this be an appropriate new clustering feature?
>>
>> Jeff
>>
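A minimal sketch of the two ideas in Jeff's proposal above (a configure(JobConf) hook on DistanceMeasure plus a per-dimension weight vector implementing distance = sum(abs(p2[i] - p1[i]) * s[i])) might look roughly like the following. The simplified DistanceMeasure interface shown here (plain double[] points) and the "distance.weights" property name are illustrative assumptions, not existing library code:

  import org.apache.hadoop.mapred.JobConf;

  // Assumed, simplified version of the DistanceMeasure interface:
  // configure() lets the job pass parameters (e.g. a weight vector)
  // to any user-defined measure loaded via the class loader.
  interface DistanceMeasure {
    void configure(JobConf job);
    double distance(double[] p1, double[] p2);
  }

  // Sketch of a weighted Manhattan measure implementing
  // distance = sum(abs(p2[i] - p1[i]) * s[i]) as quoted above.
  class WeightedManhattanDistanceMeasure implements DistanceMeasure {

    private double[] weights; // the (positive) scale factors s[i]

    public void configure(JobConf job) {
      // "distance.weights" is an assumed property name, e.g. "2.5,1.0"
      String spec = job.get("distance.weights");
      if (spec == null || spec.length() == 0) {
        return; // nothing configured; fall back to plain Manhattan
      }
      String[] parts = spec.split(",");
      weights = new double[parts.length];
      for (int i = 0; i < parts.length; i++) {
        weights[i] = Double.parseDouble(parts[i].trim());
      }
    }

    public double distance(double[] p1, double[] p2) {
      double sum = 0.0;
      for (int i = 0; i < p1.length; i++) {
        // dimensions without a configured weight default to 1.0
        double s = (weights != null && i < weights.length) ? weights[i] : 1.0;
        sum += Math.abs(p2[i] - p1[i]) * s;
      }
      return sum;
    }
  }

The clustering job could then instantiate any such measure by class name (Class.forName(...).newInstance()), so an application-specific RealisticManhattan would need no changes to the library itself.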
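Similarly, the abstract class Ted mentions at the end of his message (one that any data-processing algorithm could extend to pick up transformer/selector support) might look roughly like the sketch below. The class name and the applyPipeline() method are made up for illustration; only the RecordTransformer/RecordSelector interfaces and the add methods come from the thread above:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.Writable;

  // Illustrative base class; an algorithm extends it and calls
  // applyPipeline() on each input record before doing its real work.
  abstract class TransformingProcessor<T extends Writable> {

    private final List<RecordTransformer<T>> transformers =
        new ArrayList<RecordTransformer<T>>();
    private final List<RecordSelector<T>> selectors =
        new ArrayList<RecordSelector<T>>();

    /** Register a transformer. First added is first applied. */
    public void addTransformer(RecordTransformer<T> t) {
      transformers.add(t);
    }

    /** Register a selector. Only records passing ALL selectors are kept. */
    public void addSelector(RecordSelector<T> s) {
      selectors.add(s);
    }

    /**
     * Runs all selectors, then all transformers in registration order.
     * Returns the (possibly transformed) record, or null if any selector
     * rejects it.
     */
    protected T applyPipeline(T record) {
      for (RecordSelector<T> s : selectors) {
        if (!s.shouldProcess(record)) {
          return null; // record filtered out
        }
      }
      for (RecordTransformer<T> t : transformers) {
        T replaced = t.transform(record);
        if (replaced != null) {
          record = replaced; // transformer returned a new record
        }
        // a null return means the transformer updated the record in place
      }
      return record;
    }
  }

This would keep the extra scaling/selection work out of separate map-only jobs, which is the saving of passes over the data that Ted describes.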
