Re: Clustering from DB

Grant Ingersoll Sun, 26 Jul 2009 18:20:34 -0700

That does indeed look like a problem.  I'll fix.

On Jul 26, 2009, at 2:37 PM, nfantone wrote:

While (still) experiencing performance issues and inspecting kMeans
code, I found this lying around SquaredEuclideanDistanceMeasure.java:

  public double distance(double centroidLengthSquare, Vector centroid, Vector v) {
    if (centroid.size() != centroid.size()) {
      throw new CardinalityException();
    }
    ...
  }

I bet someone meant to compare centroid and v sizes and didn't notice.
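
For illustration, the intended check would presumably compare the two vectors against each other rather than the centroid against itself. The sketch below is not Mahout's code: it uses plain double[] arrays instead of Vector, and the class name CardinalityCheckSketch, the method squaredDistance, and the use of IllegalArgumentException in place of CardinalityException are all assumptions made to keep it self-contained.

```java
// Hypothetical stand-in for the distance method quoted above (not the Mahout source).
class CardinalityCheckSketch {

    /** Squared Euclidean distance; rejects vectors of different cardinality. */
    static double squaredDistance(double[] centroid, double[] v) {
        // Compare the two arguments with each other. Comparing centroid.length
        // with itself, as in the quoted code, can never be true.
        if (centroid.length != v.length) {
            throw new IllegalArgumentException(
                "cardinality mismatch: " + centroid.length + " vs " + v.length);
        }
        double sum = 0.0;
        for (int i = 0; i < centroid.length; i++) {
            double d = centroid[i] - v[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        // 3^2 + 4^2, so this prints 25.0
        System.out.println(squaredDistance(new double[] {0.0, 0.0}, new double[] {3.0, 4.0}));
    }
}
```

With the self-comparison, a mismatched pair of vectors would slip past the guard and fail (or silently misbehave) later in the loop; with the corrected comparison it is rejected up front.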


On Fri, Jul 24, 2009 at 12:38 PM, nfantone<[email protected]> wrote:

Well, as it turned out, it didn't have anything to do with my
performance issue, but I found out that writing a Cluster (with a
single vector as its center) to a file and then reading it back requires
the center to be added as a point; otherwise, you won't be able to
retrieve it as you should. Therefore, one should do:

// Writing
String id = "someID";
Vector v = new SparseVector();
Cluster c = new Cluster(v);
c.addPoint(v);
seqWriter.append(new Text(id), c);

// Reading
Writable key = (Writable) seqReader.getKeyClass().newInstance();
Cluster value = (Cluster) seqReader.getValueClass().newInstance();
while (seqReader.next(key, value)) {
...
Vector centroid = value.getCenter();
...
}

This way, 'key' corresponds to 'id' and 'v' to 'centroid'. I think
this shouldn't happen. Then again, it's not that relevant, I guess.

Sorry for bringing different subjects to the same thread.

On Fri, Jul 24, 2009 at 9:14 AM, nfantone<[email protected]> wrote:

I've been using RandomSeedGenerator to generate initial clusters for
kMeans and while checking its code I stumbled upon this:

     while (reader.next(key, value)) {
       Cluster newCluster = new Cluster(value);
       newCluster.addPoint(value);
       ....
     }

I can see it adds the vector to the newly created cluster, even though
it already sets it as the center in the constructor. Wasn't this
corrected in a past revision? I thought this was not necessary
anymore. I'll look into it a little bit more and see if this has
something to do with the poor performance on my dataset.

On Thu, Jul 23, 2009 at 3:45 PM, nfantone<[email protected]> wrote:

Perhaps a larger convergence value might help (-d, I believe).

I'll try that.

There was no significant change while modifying the convergence value.
At least, none was observed during the first three iterations, which
lasted about the same amount of time as before.

Is there any chance your data is publicly shareable? Come to think of
it, with the vector representations, as long as you don't publish the key
(which terms map to which index), I would think almost all data is publicly
shareable.

I'm sorry, I don't quite understand what you're asking. Publicly
shareable? As in user-permissions to access/read/write the data?

As in post a copy of the SequenceFile somewhere for download, assuming you
can.  Then others could presumably try it out.

My bad. Of course it is:

http://cringer.3kh.net/web/user-dataset.data.tar.bz2

That's the ~62 MB SequenceFile sample I've been using, in <Text,
SparseVector> logical format.

That does seem like an awfully long time for 62 MB on a 6-node cluster. How many iterations are running?

I'm running the whole thing with a 20-iteration cap. Every iteration,
EXCEPT the first one, which oddly lasted just two minutes, took
around 3 hours to complete:

Hadoop job_200907221734_0001
Finished in: 1mins, 42sec

Hadoop job_200907221734_0004
Finished in: 2hrs, 34mins, 3sec

Hadoop job_200907221734_0005
Finished in: 2hrs, 59mins, 34sec

How did you generate your initial clusters?

I generated the initial clusters via the RandomSeedGenerator, setting a 'k' value of 200. This is what I did to initiate the process for the
first time:

./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
./bin/hadoop jar ~/mahout-core-0.2.jar
org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
init -o output -r 32 -d 0.01 -k 200
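
As a side note on the -D dfs.block.size=4194304 setting above: 4194304 bytes is 4 MiB, so a file of roughly 62 MB is stored in about 16 blocks, and with default file-based input splitting that is roughly the number of map tasks to expect per iteration. The sketch below only does that arithmetic; the class and method names are hypothetical, and the exact file size is an assumption taken from the approximate figure quoted in this thread.

```java
// Illustrative arithmetic only; not part of Hadoop or Mahout.
class BlockCountSketch {

    /** Number of fixed-size blocks needed to hold fileBytes (ceiling division). */
    static long blocksFor(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative, block size positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long blockBytes = 4194304L;          // value passed to dfs.block.size above (4 MiB)
        long fileBytes = 62L * 1024 * 1024;  // assumed: "~62 MB" read as 62 MiB
        System.out.println(blocksFor(fileBytes, blockBytes)); // prints 16
    }
}
```

If the file is closer to 62,000,000 bytes, the same calculation gives 15 blocks, so the figure is only approximate.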

Where are the iteration jobs spending most of their time (map vs. reduce)?

I'm tempted to say map here, but the time spent is rather
comparable, actually. Reduce attempts are taking an hour and a half to end (on average), and so are Map attempts. Here are some representative
examples from the web UI:

reduce
attempt_200907221734_0002_r_000006_0
22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)

map
attempt_200907221734_0002_m_000000_0
22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)

Perhaps there's some problem in the way I create the
SequenceFile? I could share the Java code as well, if required.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search
