On Sun, Apr 18, 2010 at 12:11 AM, Drew Farris <drew.far...@gmail.com> wrote:
> On Sat, Apr 17, 2010 at 2:23 PM, Sean Owen <sro...@gmail.com> wrote:
> >
> > At the moment I want to understand how to patch up the fuzzy k-means
> > code in this regard -- will probably switch to something slightly less
> > state-dependent than asFormatString() as a key and be done with it for
> > the moment.
>
> After looking at it a bit, it seems like the most expedient solution
> would be to add 'name' back into the Vector class. Whether it needs to
> be part of equals(), I don't really know at this point, but I suspect
> not.
>
> It doesn't appear that asFormatString() will do the job simply because
> it's just an alternate representation of the entire vector, not an
> identifier. Not sure what the history with this is here, but why
> asFormatString() as opposed to toString()?
>
> It seems that the decorator alternative would involve something like a
> NamedVector class that adds an id, implements Vector and holds any
> type of Vector to which it delegates all calls. This might work
> well, but would require more extensive modifications to the clustering
> code. Does anyone else think this is an approach worth exploring?
>
> Does the Vector really need a String name or could it simply hold an
> integer or long id?
>

I think a long id would do, since most gigantic tables these days are
indexed by a BIGINT (in MySQL). It is easy to assign random ids to
documents/clusters in a single map/reduce job by partitioning the int64
space by the number of mappers.

But changing that at the moment would touch a lot of things (all the
clustering algorithms, clusterdumper).

For this bug, let's put the id back in and remove it from the
comparator/equals. Let's focus on getting the document structure correct.

Robin

> Drew
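
To make the id assignment idea above concrete, here is a rough sketch in Java. MapperIdAllocator is a hypothetical name, not an existing Mahout or Hadoop class. It assumes each mapper knows its own index and the total number of mappers, uses only the non-negative half of the int64 space, and hands out sequential (not random) ids within its block, which is enough to keep ids unique across mappers without any coordination.

```java
/**
 * Hypothetical helper (not part of Mahout): gives each mapper a disjoint,
 * contiguous block of the non-negative long range to draw ids from.
 */
final class MapperIdAllocator {
    private final long endExclusive; // first id that belongs to the next mapper
    private long next;               // next id this mapper will hand out

    MapperIdAllocator(int mapperIndex, int numMappers) {
        if (numMappers <= 0 || mapperIndex < 0 || mapperIndex >= numMappers) {
            throw new IllegalArgumentException(
                "bad mapper index " + mapperIndex + " of " + numMappers);
        }
        // Integer division: every mapper gets the same block size, and
        // numMappers * blockSize never exceeds Long.MAX_VALUE.
        long blockSize = Long.MAX_VALUE / numMappers;
        this.next = mapperIndex * blockSize;
        this.endExclusive = this.next + blockSize;
    }

    /** Returns the next unused id in this mapper's block. */
    long nextId() {
        if (next >= endExclusive) {
            throw new IllegalStateException("id block exhausted");
        }
        return next++;
    }

    public static void main(String[] args) {
        // Assumed setup for illustration: 4 mappers, each assigning 2 ids.
        int numMappers = 4;
        for (int i = 0; i < numMappers; i++) {
            MapperIdAllocator ids = new MapperIdAllocator(i, numMappers);
            System.out.println("mapper " + i + ": " + ids.nextId() + ", " + ids.nextId());
        }
    }
}
```

How a mapper learns its index and the mapper count is left open here; in a real job that would have to come from the job configuration or task metadata.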