This comes under the heading of scaling and transforming inputs.

This can be done as a separate MapReduce step with no reduce phase.

It would probably make a very significant difference to performance, though,
if simple transformations and selections could be folded into the larger-scale
process in a systematic way.  That would eliminate many passes over
the data.

Thus, rather than adding the capability for scaling to each clustering and
classification algorithm, it would be much better to be able to add
transformer and selector objects to any algorithm.

The obvious interface that suggests itself is something like this:

interface RecordTransformer<T extends Writable> {
    /**
     * Returns a new record that is the transformation of the input record,
     * or updates the input record in place and returns null.
     */
    T transform(T record);
}

interface RecordSelector<T extends Writable> {
    /**
     * Returns true if the record should be retained.
     */
    boolean shouldProcess(T record);
}
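As a sketch, here is what implementations of these interfaces might look like. The class names are illustrative, not proposed API, and the Writable bound is dropped so the snippet compiles without Hadoop on the classpath:

```java
// Repeated from above, minus the Writable bound, so this compiles standalone.
interface RecordTransformer<T> {
    T transform(T record);
}

interface RecordSelector<T> {
    boolean shouldProcess(T record);
}

// Hypothetical transformer: log-scales every element of a vector in place
// and returns null, per the in-place contract described above.
class LogScaleTransformer implements RecordTransformer<double[]> {
    public double[] transform(double[] record) {
        for (int i = 0; i < record.length; i++) {
            record[i] = Math.log1p(record[i]);
        }
        return null;  // updated in place
    }
}

// Hypothetical selector: drop records containing NaN values.
class FiniteSelector implements RecordSelector<double[]> {
    public boolean shouldProcess(double[] record) {
        for (double v : record) {
            if (Double.isNaN(v)) {
                return false;
            }
        }
        return true;
    }
}
```

Note the two ways a transformer can signal its result: returning a fresh record, or mutating the input and returning null so callers can avoid an allocation per record.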

And the universal classifier interface should advertise:

    /**
     * Register a transformer.  First added is first applied.
     */
    void addTransformer(RecordTransformer<T> t);

    /**
     * Register a selector.  Only records that pass ALL selectors will be
     * retained.
     */
    void addSelector(RecordSelector<T> s);


There should also be an abstract class that adds the desired behavior to any
data processing algorithm that wants it.
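A minimal sketch of what that abstract class could look like. The name TransformingProcessor and the accept/process split are hypothetical, and the Writable bound is again dropped so the snippet is self-contained:

```java
// Repeated from above, minus the Writable bound, so this compiles standalone.
interface RecordTransformer<T> {
    T transform(T record);
}

interface RecordSelector<T> {
    boolean shouldProcess(T record);
}

abstract class TransformingProcessor<T> {
    private final java.util.List<RecordTransformer<T>> transformers =
            new java.util.ArrayList<RecordTransformer<T>>();
    private final java.util.List<RecordSelector<T>> selectors =
            new java.util.ArrayList<RecordSelector<T>>();

    public void addTransformer(RecordTransformer<T> t) {
        transformers.add(t);
    }

    public void addSelector(RecordSelector<T> s) {
        selectors.add(s);
    }

    /**
     * Applies all selectors (record must pass every one), then all
     * transformers in registration order, then hands the surviving
     * record to the concrete algorithm.  Returns true if processed.
     */
    protected boolean accept(T record) {
        for (RecordSelector<T> s : selectors) {
            if (!s.shouldProcess(record)) {
                return false;  // failed a selector: skip this record
            }
        }
        for (RecordTransformer<T> t : transformers) {
            T result = t.transform(record);
            if (result != null) {
                record = result;  // transformer returned a new record
            }                     // else: record was updated in place
        }
        process(record);
        return true;
    }

    /** Implemented by the concrete clustering/classification algorithm. */
    protected abstract void process(T record);
}
```

Any mapper that extends this class gets filtering and scaling for free, which is exactly the "one pass over the data" win described above.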

On 2/14/08 5:35 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:

> In Manhattan and elsewhere, the streets and avenues are not symmetric:
> the avenues are much farther apart than are the streets. This means that
> W 53rd St. & 8th Ave is much farther from W 53rd & 7th than it is from W
> 52nd & 8th. A distance metric that treats all dimensions as equal would
> be off by a factor of about 2.5. A weighted distance metric that knew of
> this difference would produce distance values - and hence clusters -
> that more closely matched the real world.
> 
>  
> 
> Generalizing this to n-d, the new distance metric might look like this:
> 
>  
> 
> distance = sum(abs(p2[i] - p1[i]) * s[i] ) where S = a vector of
> (positive) scale factors.
> 
>  
> 
> Would this be an appropriate new clustering feature?
> 
> Jeff
> 
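Jeff's weighted distance translates directly into code. A sketch (the class name is illustrative, not an existing Mahout API):

```java
// Weighted Manhattan distance as described above:
// distance = sum(abs(p2[i] - p1[i]) * s[i]), with s[i] > 0.
class WeightedManhattanDistance {
    private final double[] scale;  // per-dimension scale factors, s

    WeightedManhattanDistance(double[] scale) {
        this.scale = scale;
    }

    double distance(double[] p1, double[] p2) {
        double sum = 0.0;
        for (int i = 0; i < p1.length; i++) {
            sum += Math.abs(p2[i] - p1[i]) * scale[i];
        }
        return sum;
    }
}
```

With points as (street, avenue) and scale factors {1.0, 2.5}, moving one avenue over costs 2.5 while moving one street over costs 1.0, matching the Manhattan example. Note that this is exactly the kind of scaling a RecordTransformer could instead apply once to the input, letting the plain unweighted metric be used downstream.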
