Well, as it turned out, it didn't have anything to do with my
performance issue, but I found out that writing a Cluster (with a
single vector as its center) to a file and then reading it back
requires the center to be added as a point as well; otherwise, you
won't be able to retrieve it properly. Therefore, one should do:
// Writing
String id = "someID";
Vector v = new SparseVector();
Cluster c = new Cluster(v);
c.addPoint(v); // without this, the center doesn't survive the round trip
seqWriter.append(new Text(id), c);

// Reading
Writable key = (Writable) seqReader.getKeyClass().newInstance();
Cluster value = (Cluster) seqReader.getValueClass().newInstance();
while (seqReader.next(key, value)) {
    ...
    Vector centroid = value.getCenter();
    ...
}
This way, 'key' corresponds to 'id' and 'centroid' to 'v'. I don't
think this should be necessary, though. Then again, it's not that
relevant, I guess.
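For reference, here's the whole round trip as a self-contained
snippet. I'm assuming Mahout 0.2 package names (org.apache.mahout.matrix
for the vector classes) and the plain Hadoop SequenceFile API; the
path, id and cardinality below are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.matrix.SparseVector;
import org.apache.mahout.matrix.Vector;

public class ClusterRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("clusters/part-00000"); // placeholder path

    // Writing: a cluster whose center is also added as a point.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, Cluster.class);
    Vector v = new SparseVector(10); // placeholder cardinality
    v.set(3, 1.0);
    Cluster c = new Cluster(v);
    c.addPoint(v); // the workaround: add the center as a point, too
    writer.append(new Text("someID"), c);
    writer.close();

    // Reading: recover the centroid of each stored cluster.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Writable key = (Writable) reader.getKeyClass().newInstance();
    Cluster value = (Cluster) reader.getValueClass().newInstance();
    while (reader.next(key, value)) {
      Vector centroid = value.getCenter();
      System.out.println(key + " -> " + centroid.asFormatString());
    }
    reader.close();
  }
}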
Sorry for bringing different subjects to the same thread.
On Fri, Jul 24, 2009 at 9:14 AM, nfantone<[email protected]> wrote:
> I've been using RandomSeedGenerator to generate initial clusters for
> kMeans and while checking its code I stumbled upon this:
>
> while (reader.next(key, value)) {
>   Cluster newCluster = new Cluster(value);
>   newCluster.addPoint(value);
>   ...
> }
>
> I can see it adds the vector to the newly created cluster, even though
> it is already set as the center in the constructor. Wasn't this
> corrected in a past revision? I thought it was no longer necessary.
> I'll look into it a bit more and see if it has something to do with
> the poor performance I'm seeing on my dataset.
>
> On Thu, Jul 23, 2009 at 3:45 PM, nfantone<[email protected]> wrote:
>>>>> Perhaps a larger convergence value might help (-d, I believe).
>>>>
>>>> I'll try that.
>>
>> There was no significant change after modifying the convergence value.
>> At least, none was observed during the first three iterations, which
>> lasted roughly the same amount of time as before.
>>
>>>>> Is there any chance your data is publicly shareable? Come to think of
>>>>> it,
>>>>> with the vector representations, as long as you don't publish the key
>>>>> (which
>>>>> terms map to which index), I would think most all data is publicly
>>>>> shareable.
>>>>
>>>> I'm sorry, I don't quite understand what you're asking. Publicly
>>>> shareable? As in user-permissions to access/read/write the data?
>>>
>>> As in post a copy of the SequenceFile somewhere for download, assuming you
>>> can. Then others could presumably try it out.
>>
>> My bad. Of course it is:
>>
>> http://cringer.3kh.net/web/user-dataset.data.tar.bz2
>>
>> That's the ~62MB SequenceFile sample I've been using, in <Text,
>> SparseVector> logical format.
>>
>>> That does seem like an awfully long time for 62 MB on a 6-node
>>> cluster. How many iterations are running?
>>
>> I'm running the whole thing with a 20-iteration cap. Every iteration,
>> except the first one (which, oddly, lasted just two minutes), took
>> around three hours to complete:
>>
>> Hadoop job_200907221734_0001
>> Finished in: 1mins, 42sec
>>
>> Hadoop job_200907221734_0004
>> Finished in: 2hrs, 34mins, 3sec
>>
>> Hadoop job_200907221734_0005
>> Finished in: 2hrs, 59mins, 34sec
>>
>>> How did you generate your initial clusters?
>>
>> I generate the initial clusters via the RandomSeedGenerator setting a
>> 'k' value of 200. This is what I did to initiate the process for the
>> first time:
>>
>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
>> ./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
>> ./bin/hadoop jar ~/mahout-core-0.2.jar
>> org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c
>> init -o output -r 32 -d 0.01 -k 200
>>
>>> Where are the iteration jobs spending most of their time (map vs. reduce)?
>>
>> I'm tempted to say map here, but the time spent in each is actually
>> quite comparable. Reduce attempts are taking an hour and a half to
>> finish on average, and so are map attempts. Here are some
>> representative examples from the web UI:
>>
>> reduce
>> attempt_200907221734_0002_r_000006_0
>> 22-Jul-2009 21:15:01 (1hrs, 55mins, 55sec)
>>
>> map
>> attempt_200907221734_0002_m_000000_0
>> 22-Jul-2009 20:52:27 (2hrs, 16mins, 12sec)
>>
>> Perhaps there's some problem in the way I create the SequenceFile? I
>> could share the Java code as well, if required.
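>> Roughly, it does something like this (a simplified sketch from
>> memory, not the actual code; the record type, field names and
>> cardinality are placeholders):
>>
>> Configuration conf = new Configuration();
>> FileSystem fs = FileSystem.get(conf);
>> SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
>>     new Path("user.data"), Text.class, SparseVector.class);
>> for (UserRecord record : records) { // placeholder domain type
>>   SparseVector vector = new SparseVector(numFeatures); // placeholder cardinality
>>   for (Map.Entry<Integer, Double> e : record.getFeatures().entrySet()) {
>>     vector.set(e.getKey(), e.getValue()); // index/weight pairs
>>   }
>>   writer.append(new Text(record.getId()), vector);
>> }
>> writer.close();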
>>
>