I like the second option for refactoring the code. I think it is doable.
And where is your code on GitHub?
On Sun, Jun 1, 2014 at 1:06 PM, Maxence Ahlouche <maxence.ahlou...@gmail.com
> Hi all!
> I've pushed my report for this week on my repo. Here is a copy!
> Attached is the patch containing my work for this week.
> Week 2 - 2014/01/01
> This week, I have worked on the beginning of the kmedoids module.
> Unfortunately, I was supposed to have something working for last Wednesday,
> and it is still not ready, mostly because I've lost time this week by being
> sick, and by packing all my stuff in preparation for relocation.
> The good news now: this week is my last school (exam) week, and that means
> full-time GSoC starting next Monday! Also, I've studied the kmeans module
> quite thoroughly, and I can finally understand how it all goes on, with the
> exception of one bit: the enormous SQL query used to update the
> For kmedoids, I've abandoned the idea of making the loop by myself and
> have decided instead to stick to copying kmeans as much as possible, as it
> seems easier than doing it all by myself. The only part that remains to be
> adapted is that big SQL query I haven't totally understood yet. I've asked
> Atri for help, but surely the help of an experienced MADlib hacker would
> speed things up :) Atri and I would also like to deal with this through a
> voip meeting, to ease communication. If anyone wants to join, you're
> As for the technology we'll use, I have a Mumble server running somewhere,
> if that suits everyone. Otherwise, suggest something!
> I am available from Monday 2 at 3 p.m. (UTC) to Wednesday 4 at 10 a.m.
> (exam weeks are quite light).
> This week, I have also faced the first design decisions I have to make.
> For kmedoids, the centroids are points of the dataset. So, if I wanted to
> identify them precisely, I'd need to use their ids, but that would mean
> having a prototype different than the kmeans one. So, for now, I've decided
> to use the points' coordinates only, hoping I will not run into trouble. If
> I ever do, switching to ids shouldn't be too hard. Also, if the user wants
> to input initial medoids, he can input whatever points he wants, be they
> part of the dataset or not. After the first iteration, the centroids will
> anyway be points of the dataset (maybe I could just select the points
> nearest to the coordinates they input as initial centroids).
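
To make the "select the nearest points" idea concrete, here is a minimal
sketch in plain Python. It is not MADlib code: the function name
snap_to_dataset and the use of Euclidean distance are assumptions made only
for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def snap_to_dataset(points, initial_centroids):
    """Replace each user-supplied centroid by the nearest dataset point.

    Hypothetical helper: Euclidean distance is assumed here, the real
    module may use another metric.
    """
    return [min(points, key=lambda p: dist(p, c)) for c in initial_centroids]


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
print(snap_to_dataset(points, [(0.4, 0.2), (5.6, 5.1)]))
# prints [(0.0, 0.0), (6.0, 5.0)]
```

Note that two initial centroids may snap to the same dataset point, so a
real implementation would probably need to check for duplicates.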
> Second, I'll need to refactor the code in kmeans and kmedoids, as these
> two modules are very similar. There are several options for this:
> 1. One big "clustering" module containing everything
> clustering-related (ugly but easy option);
> 2. A "clustering" module and "kmeans", "kmedoids", "optics", "utils"
> submodules (the best imo, but I'm not sure it's doable);
> 3. A "clustering_utils" module at the same level as the others (less
> ugly than the first one, but easy too).
> Any opinions?
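
One way to picture what the shared part of such a refactoring could hold is
the rough sketch below, in plain Python rather than MADlib's SQL-driven code.
All names (assign, mean_update, medoid_update, cluster_points) are
hypothetical, and Euclidean distance is assumed. The point is only that the
iteration loop is common and the two algorithms differ in the update step.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centers):
    """Group points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def mean_update(cluster):
    """kmeans step: the new center is the coordinate-wise mean."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def medoid_update(cluster):
    """kmedoids step: the new center is the member of the cluster that
    minimizes the total distance to the other members."""
    return min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))


def cluster_points(points, centers, update, max_iter=20):
    """Shared loop: only the `update` function differs between algorithms."""
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        new_centers = [update(c) if c else old
                       for c, old in zip(clusters, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


data = [(0.0, 0.0), (0.0, 1.0), (0.0, 5.0), (10.0, 0.0), (10.0, 1.0)]
start = [(0.0, 0.0), (10.0, 0.0)]
print(cluster_points(data, start, mean_update))
print(cluster_points(data, start, medoid_update))
```

With this split, the loop and the assignment step would live in the shared
"utils" part, and each submodule would only supply its own update function.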
> Next week, I'll get a working kmedoids module, do some refactoring, and
> then add the extra methods, similar to what's done in kmeans, for the
> different seedings. Once that's done, I'll make it compatible with all
> three ports (I'm currently producing Postgres-only code, as it's the
> easiest for me to test), and write the tests and doc. The deadline for this
> last step is in two weeks; I don't know yet if I'll be on time by then or
> not. It will depend on how fast I can get kmedoids working, and how fast
> I'll go once I'm full-time on GSoC.
> Finally, don't hesitate to tell me if you think my decisions are wrong,
> I'm glad to learn :)
>  http://git.viod.eu/viod/gsoc_2014/blob/master/reports.rst
> Maxence Ahlouche
> 06 06 66 97 00