>>> Perhaps a larger convergence value might help (-d, I believe).
>>
>> I'll try that.
There was no significant change after modifying the convergence value. At least, none was observed during the first three iterations, which lasted roughly the same amount of time as before.

>>> Is there any chance your data is publicly shareable? Come to think of
>>> it, with the vector representations, as long as you don't publish the
>>> key (which terms map to which index), I would think most all data is
>>> publicly shareable.
>>
>> I'm sorry, I don't quite understand what you're asking. Publicly
>> shareable? As in user-permissions to access/read/write the data?
>
> As in post a copy of the SequenceFile somewhere for download, assuming
> you can. Then others could presumably try it out.

My bad. Of course it is:

http://cringer.3kh.net/web/user-dataset.data.tar.bz2

That's the ~62 MB SequenceFile sample I've been using, in <Text, SparseVector> logical format.

> That does seem like an awfully long time for 62 MB on a 6 node cluster.
> How many iterations are running?

I'm running the whole thing with a 20-iteration cap. Every iteration (except the first one, which, oddly, lasted just two minutes) took around 3 hours to complete:

Hadoop job_200907221734_0001  Finished in: 1mins, 42sec
Hadoop job_200907221734_0004  Finished in: 2hrs, 34mins, 3sec
Hadoop job_200907221734_0005  Finished in: 2hrs, 59mins, 34sec

> How did you generate your initial clusters?

I generate the initial clusters via the RandomSeedGenerator, setting a 'k' value of 200. This is what I did to initiate the process for the first time:

./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data input/user.data
./bin/hadoop dfs -D dfs.block.size=4194304 -put ~/user.data init/user.data
./bin/hadoop jar ~/mahout-core-0.2.jar org.apache.mahout.clustering.kmeans.KMeansDriver -i input/user.data -c init -o output -r 32 -d 0.01 -k 200

> Where are the iteration jobs spending most of their time (map vs. reduce)?

I'm tempted to say map here, but the time they spend is rather comparable, actually.
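For what it's worth, that result is consistent with what the -d flag does: it is a convergence threshold checked after each iteration, so it decides how many iterations run before clusters are declared converged, not how long a single iteration takes. A minimal sketch of the idea, assuming Euclidean distance (class and method names here are illustrative, not Mahout's actual code):

```java
// Illustrative sketch of the role of the -d (convergence delta) flag:
// after an iteration, a cluster counts as converged when its centroid
// moved less than the delta. NOT Mahout's actual implementation.
public class ConvergenceSketch {

    // Euclidean distance between the old and the recomputed centroid.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    static boolean isConverged(double[] oldCentroid, double[] newCentroid, double delta) {
        return distance(oldCentroid, newCentroid) <= delta;
    }

    public static void main(String[] args) {
        double[] before = {1.0, 2.0};
        double[] after  = {1.005, 2.0};
        // With -d 0.01 this centroid barely moved, so it is converged;
        // with a tighter delta it is not. Either way, the per-iteration
        // cost (assigning every vector to a cluster) is unchanged.
        System.out.println(isConverged(before, after, 0.01));  // true
        System.out.println(isConverged(before, after, 0.001)); // false
    }
}
```

So a larger delta can only end the run earlier; it cannot shorten the 3-hour iterations themselves.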
Reduce attempts are taking an hour and a half on average to finish, and so are map attempts. Here are some representative examples from the web UI:

reduce  attempt_200907221734_0002_r_000006_0  22-Jul-2009 21:15:01  (1hrs, 55mins, 55sec)
map     attempt_200907221734_0002_m_000000_0  22-Jul-2009 20:52:27  (2hrs, 16mins, 12sec)

Perhaps there's something wrong with the way I create the SequenceFile? I could share the Java code as well, if required.
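One more data point that may matter: with dfs.block.size=4194304 (4 MB), the ~62 MB file should split into roughly 16 blocks, and under the usual one-map-task-per-block rule of thumb (an assumption; the actual split count depends on the InputFormat) that caps the map-side parallelism at about 16 tasks. A quick back-of-the-envelope check:

```java
public class SplitEstimate {
    // Assumption: one map task per HDFS block (ceiling division).
    static long estimateSplits(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long fileBytes = 62L * 1024 * 1024; // the ~62 MB SequenceFile
        long blockBytes = 4_194_304L;       // dfs.block.size used above
        System.out.println(estimateSplits(fileBytes, blockBytes)); // 16
    }
}
```

If that estimate holds, the -r 32 setting requests more reducers than there are map outputs to feed on a 6-node cluster, though that alone would not explain multi-hour task attempts.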
