In my application, I don't really want to scale the input data. What I
want is to weight the clustering distance calculation and pass the input
data through without adjustment.

This seems to be a lot more mechanism than this single story implies. In
an effort to avoid premature abstraction, what I had in mind was more
like:

- add a configure(JobConf job) method to the DistanceMeasure interface
so that implementations could be configured (e.g. by loading a weight
vector)
- use the class loader to instantiate arbitrary user-defined distance
measures instead of hard-coded choices.

This way, I could invent my own application-specific RealisticManhattan
and use it without requiring extensions to the library.
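To make the idea concrete, here is a minimal sketch, with a plain Map standing in for JobConf so it is self-contained; the measure name and the "distance.weights" key are just illustrations:

```java
import java.util.Map;

// Hypothetical sketch of a DistanceMeasure interface extended with a
// configure() hook. A plain Map stands in for Hadoop's JobConf here so
// the example compiles on its own.
interface DistanceMeasure {
    void configure(Map<String, String> conf); // stands in for configure(JobConf)
    double distance(double[] p1, double[] p2);
}

// Application-specific weighted Manhattan measure: each dimension's
// difference is scaled by a weight loaded at configure time.
class RealisticManhattan implements DistanceMeasure {
    private double[] weights;

    @Override
    public void configure(Map<String, String> conf) {
        // e.g. conf.get("distance.weights") == "2.5,1.0"
        String[] parts = conf.get("distance.weights").split(",");
        weights = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            weights[i] = Double.parseDouble(parts[i]);
        }
    }

    @Override
    public double distance(double[] p1, double[] p2) {
        // distance = sum(abs(p2[i] - p1[i]) * s[i])
        double sum = 0.0;
        for (int i = 0; i < p1.length; i++) {
            sum += Math.abs(p2[i] - p1[i]) * weights[i];
        }
        return sum;
    }
}
```

With the class loader instantiating the measure by name, the library never needs to know RealisticManhattan exists.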

Jeff

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 14, 2008 6:15 PM
To: [email protected]
Subject: Re: Weighted Manhattan Distance Metric



This comes under the heading of scaling and transforming inputs.

This can be done as a separate MR step with no reduce.

It would probably make a very significant difference to performance,
though, if simple transformations and selections could be included in
the larger-scale process in a systematic way.  That would eliminate many
passes over the data.

Thus, rather than adding the capability for scaling to each clustering
and classification algorithm, it would be much better to be able to add
transformer and selector objects to any algorithm.

The obvious interface that suggests itself is something like this:

interface RecordTransformer<T extends Writable> {
    /**
     * Returns a new record that is the transformation of the input
     * record, or updates the input record in place and returns null.
     */
    T transform(T record);
}

interface RecordSelector<T extends Writable> {
    /**
     * Returns true if the record should be retained.
     */
    boolean shouldProcess(T record);
}
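As a sketch of what implementations might look like (the class names are my own, and I've dropped the Writable bound so the example stands alone):

```java
// Hypothetical implementations of the two interfaces above. A real
// version would keep the <T extends Writable> bound.
interface RecordTransformer<T> {
    T transform(T record);
}

interface RecordSelector<T> {
    boolean shouldProcess(T record);
}

// Scales every component of a vector record by per-dimension factors,
// updating it in place and returning null per the interface contract.
class ScalingTransformer implements RecordTransformer<double[]> {
    private final double[] scale;

    ScalingTransformer(double[] scale) { this.scale = scale; }

    @Override
    public double[] transform(double[] record) {
        for (int i = 0; i < record.length; i++) {
            record[i] *= scale[i];
        }
        return null; // updated in place
    }
}

// Retains only records whose components are all finite.
class FiniteSelector implements RecordSelector<double[]> {
    @Override
    public boolean shouldProcess(double[] record) {
        for (double v : record) {
            if (!Double.isFinite(v)) return false;
        }
        return true;
    }
}
```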

And the universal classifier interface should advertise:

    /**
     * Register a transformer.  First added is first applied.
     */
    void addTransformer(RecordTransformer<T> t);

    /**
     * Register a selector.  Only records that pass ALL selectors will
     * be retained.
     */
    void addSelector(RecordSelector<T> s);


There should also be an abstract class that adds the desired behavior to
any data processing algorithm that wants it.
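Something like the following, perhaps (the class name and the preprocess() hook are my own invention, and Writable is again elided for brevity):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interfaces matching the sketch above, minus the
// Writable bound so the example is self-contained.
interface RecordTransformer<T> { T transform(T record); }
interface RecordSelector<T> { boolean shouldProcess(T record); }

// Hypothetical abstract base class that gives any data-processing
// algorithm transformer/selector support.
abstract class TransformingProcessor<T> {
    private final List<RecordTransformer<T>> transformers = new ArrayList<>();
    private final List<RecordSelector<T>> selectors = new ArrayList<>();

    /** Register a transformer.  First added is first applied. */
    public void addTransformer(RecordTransformer<T> t) { transformers.add(t); }

    /** Register a selector.  Only records passing ALL selectors are retained. */
    public void addSelector(RecordSelector<T> s) { selectors.add(s); }

    /** Applies all selectors, then all transformers in registration order;
     *  returns null if the record is filtered out. */
    public T preprocess(T record) {
        for (RecordSelector<T> s : selectors) {
            if (!s.shouldProcess(record)) return null;
        }
        for (RecordTransformer<T> t : transformers) {
            T replaced = t.transform(record);
            if (replaced != null) record = replaced; // else updated in place
        }
        return record;
    }

    /** The algorithm-specific work, fed only preprocessed records. */
    protected abstract void process(T record);
}
```

A concrete clustering or classification algorithm would then extend this and run each input record through preprocess() before doing its own work.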

On 2/14/08 5:35 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:

> In Manhattan and elsewhere, the streets and avenues are not symmetric:
> the avenues are much farther apart than are the streets. This means
> that W 53rd St. & 8th Ave is much farther from W 53rd & 7th than it is
> from W 52nd & 8th. A distance metric that treats all dimensions as
> equal would be off by a factor of about 2.5. A weighted distance
> metric that knew of this difference would produce distance values -
> and hence clusters - that more closely matched the real world.
> 
> Generalizing this to n-d, the new distance metric might look like
> this:
> 
> distance = sum(abs(p2[i] - p1[i]) * s[i]) where s = a vector of
> (positive) scale factors.
> 
> Would this be an appropriate new clustering feature?
> 
> Jeff
> 
