Re: [HACKERS] [GSoC] Clustering in MADlib - status update

Hai Qian Mon, 02 Jun 2014 12:31:37 -0700

I like the second option for refactoring the code. I think it is doable.

And where is your code on Github?


Hai

--
*Pivotal <http://www.gopivotal.com/>*
A new platform for a new era


On Sun, Jun 1, 2014 at 1:06 PM, Maxence Ahlouche <[email protected]> wrote:

> Hi all!
>
> I've pushed my report for this week on my repo [0]. Here is a copy!
> Attached is the patch containing my work for this week.
> Week 2 - 2014/01/01
>
> This week, I have worked on the beginning of the kmedoids module.
> Unfortunately, I was supposed to have something working for last Wednesday,
> and it is still not ready, mostly because I've lost time this week by being
> sick, and by packing all my stuff in preparation for relocation.
>
> The good news now: this week is my last school (exam) week, and that means
> full-time GSoC starting next Monday! Also, I've studied the kmeans module
> quite thoroughly, and I can finally understand how it all works, with the
> exception of one bit: the enormous SQL query used to update the
> IterationController.
>
> For kmedoids, I've abandoned the idea of writing the loop myself and
> have decided instead to stick to copying kmeans as much as possible, as it
> seems easier than doing it all from scratch. The only part that remains to
> be adapted is that big SQL query I haven't totally understood yet. I've
> asked Atri for help, but surely the help of an experienced MADlib hacker
> would speed things up :) Atri and I would also like to deal with this
> through a VoIP meeting, to ease communication. If anyone wants to join,
> you're welcome!
>
> As for the technology we'll use, I have a Mumble server running somewhere,
> if that suits everyone. Otherwise, suggest something!
>
> I am available from Monday 2 at 3 p.m. (UTC) to Wednesday 4 at 10 a.m.
> (exam weeks are quite light).
>
> This week, I have also faced the first design decisions I have to make.
> For kmedoids, the centroids are points of the dataset. So, if I wanted to
> identify them precisely, I'd need to use their ids, but that would mean
> having a prototype different from the kmeans one. So, for now, I've decided
> to use the points' coordinates only, hoping I will not run into trouble. If
> I ever do, switching to ids shouldn't be too hard. Also, if users want
> to input initial medoids, they can input whatever points they want, be they
> part of the dataset or not. After the first iteration, the centroids will
> anyway be points of the dataset (maybe I could just select the points
> nearest to the coordinates they input as initial centroids).
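
The idea in the parenthesis above (replace each user-supplied initial
centroid with the nearest point of the dataset) could look roughly like the
sketch below. It is plain in-memory Python with Euclidean distance, not
MADlib code; the function name snap_to_dataset and the sample data are
hypothetical.

```python
import math

def snap_to_dataset(points, initial_centroids):
    """Replace each user-supplied centroid with the nearest dataset point.

    Hypothetical helper for illustration only; uses Euclidean distance.
    """
    snapped = []
    for c in initial_centroids:
        # Pick the dataset point with the smallest distance to c.
        nearest = min(points, key=lambda p: math.dist(p, c))
        snapped.append(nearest)
    return snapped

points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
print(snap_to_dataset(points, [(0.4, 0.2), (5.4, 5.1)]))
# prints [(0.0, 0.0), (5.0, 5.0)]
```

Note that two input centroids can snap to the same dataset point, so a real
implementation would probably need to detect and resolve duplicates.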
>
> Second, I'll need to refactor the code in kmeans and kmedoids, as these
> two modules are very similar. There are several options for this:
>
>    1. One big "clustering" module containing everything
>    clustering-related (ugly but easy option);
>    2. A "clustering" module and "kmeans", "kmedoids", "optics", "utils"
>    submodules (the best imo, but I'm not sure it's doable);
>    3. A "clustering_utils" module at the same level as the others (less
>    ugly than the first one, but easy too).
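
To illustrate why shared code between kmeans and kmedoids seems plausible,
here is a minimal sketch in which the two algorithms share the assignment
step and the loop, and differ only in how a cluster's center is recomputed.
This is an in-memory Python toy under stated assumptions (Euclidean
distance, hypothetical function names); the actual MADlib modules drive the
iteration through SQL, which this does not attempt to reproduce.

```python
import math

def assign(points, centers):
    """Group points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        clusters[i].append(p)
    return clusters

def mean_update(cluster):
    """kmeans-style step: coordinate-wise mean of the cluster."""
    n = len(cluster)
    return tuple(sum(xs) / n for xs in zip(*cluster))

def medoid_update(cluster):
    """kmedoids-style step: the member with the smallest total distance."""
    return min(cluster, key=lambda c: sum(math.dist(c, p) for p in cluster))

def run_clustering(points, centers, update, max_iter=20):
    """Shared loop; 'update' is the only algorithm-specific part."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # Keep the old center if a cluster ends up empty.
        new_centers = [update(cl) if cl else c
                       for cl, c in zip(clusters, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
init = [(0.0, 0.0), (10.0, 10.0)]
print(run_clustering(data, init, mean_update))
print(run_clustering(data, init, medoid_update))
```

With a "utils" submodule, assign and run_clustering would be the shared
part, and each algorithm's submodule would only supply its update function.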
>
> Any opinions?
>
> Next week, I'll get a working kmedoids module, do some refactoring, and
> then add the extra methods, similar to what's done in kmeans, for the
> different seedings. Once that's done, I'll make it compatible with all
> three ports (I'm currently producing Postgres-only code, as it's the
> easiest for me to test), and write the tests and doc. The deadline for this
> last step is in two weeks; I don't know yet if I'll be on time by then or
> not. It will depend on how fast I can get kmedoids working, and how fast
> I'll go once I'm working on GSoC full time.
>
> Finally, don't hesitate to tell me if you think my decisions are wrong;
> I'm glad to learn :)
> [0] http://git.viod.eu/viod/gsoc_2014/blob/master/reports.rst
>
>
> --
> Maxence Ahlouche
> 06 06 66 97 00
>
