I like the second option for refactoring the code. I think it is doable. Also, where is your code on GitHub?
Hai

--
*Pivotal <http://www.gopivotal.com/>*
A new platform for a new era

On Sun, Jun 1, 2014 at 1:06 PM, Maxence Ahlouche <[email protected]> wrote:

> Hi all!
>
> I've pushed my report for this week on my repo [0]. Here is a copy!
> Attached is the patch containing my work for this week.
>
> Week 2 - 2014/06/01
>
> This week, I have worked on the beginning of the kmedoids module.
> Unfortunately, I was supposed to have something working by last Wednesday,
> and it is still not ready, mostly because I lost time this week to being
> sick and to packing all my stuff in preparation for relocation.
>
> The good news now: this week is my last school (exam) week, and that means
> full-time GSoC starting next Monday! Also, I've studied the kmeans module
> quite thoroughly, and I can finally understand how it all works, with the
> exception of one bit: the enormous SQL query used to update the
> IterationController.
>
> For kmedoids, I've abandoned the idea of writing the loop myself and have
> decided instead to stick to copying kmeans as much as possible, as it
> seems easier than doing it all by myself. The only part that remains to be
> adapted is that big SQL query I haven't totally understood yet. I've asked
> Atri for help, but surely the help of an experienced MADlib hacker would
> speed things up :) Atri and I would also like to deal with this through a
> VoIP meeting, to ease communication. If anyone wants to join, you're
> welcome!
>
> As for the technology we'll use, I have a Mumble server running somewhere,
> if that suits everyone. Otherwise, suggest something!
>
> I am available from Monday 2 at 3 p.m. (UTC) to Wednesday 4 at 10 a.m.
> (exam weeks are quite light).
>
> This week, I have also faced the first design decisions I have to make.
> For kmedoids, the centroids are points of the dataset. So, if I wanted to
> identify them precisely, I'd need to use their ids, but that would mean
> having a prototype different from the kmeans one. So, for now, I've decided
> to use the points' coordinates only, hoping I will not run into trouble. If
> I ever do, switching to ids shouldn't be too hard. Also, if users want to
> input initial medoids, they can input whatever points they want, be they
> part of the dataset or not. After the first iteration, the centroids will
> be points of the dataset anyway (maybe I could just select the points
> nearest to the coordinates they input as initial centroids).
>
> Second, I'll need to refactor the code in kmeans and kmedoids, as these
> two modules are very similar. There are several options for this:
>
>    1. One big "clustering" module containing everything
>       clustering-related (ugly but easy option);
>    2. A "clustering" module and "kmeans", "kmedoids", "optics", "utils"
>       submodules (the best imo, but I'm not sure it's doable);
>    3. A "clustering_utils" module at the same level as the others (less
>       ugly than the first one, but easy too).
>
> Any opinions?
>
> Next week, I'll get a working kmedoids module, do some refactoring, and
> then add the extra methods, similar to what's done in kmeans, for the
> different seedings. Once that's done, I'll make it compatible with all
> three ports (I'm currently producing Postgres-only code, as it's the
> easiest for me to test), and write the tests and doc. The deadline for this
> last step is in two weeks; I don't know yet whether I'll be on time. It
> will depend on how fast I can get kmedoids working, and how fast I'll go
> once I'm on GSoC full time.
>
> Finally, don't hesitate to tell me if you think my decisions are wrong,
> I'm glad to learn :)
>
> [0] http://git.viod.eu/viod/gsoc_2014/blob/master/reports.rst
>
> --
> Maxence Ahlouche
> 06 06 66 97 00
>
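
On the idea in the report of selecting the dataset points nearest to the coordinates a user inputs as initial medoids, here is a minimal sketch of what that could look like, together with the per-cluster medoid update that keeps every later centroid inside the dataset. It is plain Python for illustration only, not MADlib code: the function names `snap_to_dataset` and `update_medoid`, the Euclidean distance, and the sample points are all assumptions of mine.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def snap_to_dataset(points, seeds):
    """For each user-supplied seed, return the dataset point closest to it.

    Hypothetical helper: seeds may lie outside the dataset, the result never does.
    """
    return [min(points, key=lambda p: dist(p, seed)) for seed in seeds]


def update_medoid(cluster):
    """Return the member of `cluster` with the smallest total distance to the others.

    Hypothetical helper: this is the step that makes every medoid a dataset point.
    """
    return min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))


# Made-up sample data, identified by coordinates only (no ids).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
seeds = [(0.4, 0.1), (9.0, 9.0)]  # arbitrary user input, not in the dataset

initial_medoids = snap_to_dataset(points, seeds)
first_cluster_medoid = update_medoid(points[:3])
```

Because the medoids are identified by coordinates here, duplicate points in the dataset would be indistinguishable, which is the case where switching to ids would start to matter.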
