On Sun, Apr 18, 2010 at 12:11 AM, Drew Farris <drew.far...@gmail.com> wrote:

> On Sat, Apr 17, 2010 at 2:23 PM, Sean Owen <sro...@gmail.com> wrote:
> >
> > At the moment I want to understand how to patch up the fuzzy k-means
> > code in this regard -- will probably switch to something slightly less
> > state-dependent than asFormatString() as a key and be done with it for
> > the moment.
>
> After looking at it a bit, it seems like the most expedient solution
> would be to add 'name' back into the Vector class. Whether it needs to
> be part of equals(), I don't really know at this point, but I suspect
> not.
>
> It doesn't appear that asFormatString() will do the job simply because
> it's just an alternate representation of the entire vector, not an
> identifier. Not sure what the history with this is here, but why
> asFormatString() as opposed to toString()?
>
> It seems that the decorator alternative would involve something like a
> NamedVector class that adds an id, implements Vector and holds any
> type of Vector to which it delegates all calls. This might work well,
> but would require more extensive modifications to the clustering code.
> Does anyone else think this is an approach worth exploring?
>
> Does the Vector really need a String name or could it simply hold an
> integer or long id?
>
I think a long id would do, as most gigantic tables are indexed these days
by a BIGINT (in MySQL). It is easy to assign random ids to documents/clusters
in a single map/reduce job by partitioning the int64 space across the
mappers. But changing that at the moment would modify a lot of things
(all clustering algorithms, clusterdumper).
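
A rough sketch of the partitioning idea, not anything that exists in the
code today: IdRange, forMapper() and nextId() are made-up names, and for
simplicity it hands out sequential ids inside each mapper's slice of the
non-negative long range rather than random ones.

```java
/** Hypothetical helper: one disjoint slice of the non-negative long range per mapper. */
final class IdRange {
    private final long end;  // exclusive upper bound of this mapper's slice
    private long next;       // next id to hand out

    private IdRange(long start, long end) {
        this.next = start;
        this.end = end;
    }

    /** Builds the slice for mapper number mapperIndex out of numMappers (0-based). */
    static IdRange forMapper(int mapperIndex, int numMappers) {
        if (numMappers <= 0 || mapperIndex < 0 || mapperIndex >= numMappers) {
            throw new IllegalArgumentException("bad mapper index or count");
        }
        // Integer division keeps every slice inside [0, Long.MAX_VALUE].
        long sliceSize = Long.MAX_VALUE / numMappers;
        long start = mapperIndex * sliceSize;
        return new IdRange(start, start + sliceSize);
    }

    /** Returns the next unused id in this slice; slices never overlap. */
    long nextId() {
        if (next >= end) {
            throw new IllegalStateException("id slice exhausted");
        }
        return next++;
    }
}
```

Each mapper would build its own IdRange from its task index and the total
number of map tasks, so no coordination between mappers is needed.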

For this bug, let's put the id back in and remove it from the
comparator/equals. Let's focus on getting the document structure correct.
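
To illustrate what I mean by keeping the id out of equals, here is a
minimal sketch. IdentifiedVector is a hypothetical stand-in backed by a
plain double[], not the real Vector class; the only point is that
equals() and hashCode() look at the values and ignore the id.

```java
/** Hypothetical stand-in for a vector that carries an id outside its equality contract. */
final class IdentifiedVector {
    private final long id;
    private final double[] values;

    IdentifiedVector(long id, double[] values) {
        this.id = id;
        this.values = values.clone(); // defensive copy
    }

    long getId() {
        return id;
    }

    double get(int index) {
        return values[index];
    }

    int size() {
        return values.length;
    }

    /** Two vectors are equal when their values match; the id is deliberately ignored. */
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof IdentifiedVector)) {
            return false;
        }
        return java.util.Arrays.equals(values, ((IdentifiedVector) other).values);
    }

    /** Kept consistent with equals(): derived from the values only. */
    @Override
    public int hashCode() {
        return java.util.Arrays.hashCode(values);
    }
}
```

With this shape, code that needs an identifier calls getId(), and nothing
has to use a rendering of the whole vector as a key.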

Robin

> Drew
>
