Isabel,

Let me give you the background on this. The code in question is a
second-generation implementation (by the same author) of this very hard
algorithm. The author is a professional who now works for Basis (where I
work) and who was, at the time, at an academic institution. I have
opened negotiations with the relevant professor to see if I can't find a way
to make it open source.

Last year, before I knew anything about this implementation, I built one.
Mine, written in C++, can turn 111921 words with 8899115 distinct bigrams
into 1000 clusters in about 14 hours on a 4-core system. While I'm a fairly
experienced C++ code tuner, I am not much of a mathematician, and so I
missed some mathematical shortcuts.

I used OpenMP to get some parallelization into the code. However, it seems
to me that some major rethinking would be required to cast the problem in a
form that Java and Hadoop could do anything with. Then again, I could just
be thick-headed.
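
To illustrate what I mean (a toy sketch only, with made-up data and a
placeholder similarity measure, not the real code), the kind of loop I
handed to OpenMP looks roughly like this: a pairwise-similarity pass over
cluster profiles, split across threads with a parallel for.

// Toy sketch: OpenMP over a pairwise-similarity pass.
// Compile with e.g. g++ -fopenmp sketch.cpp
#include <cstdio>
#include <vector>
#include <cmath>

int main() {
    const int n = 1000;  // number of clusters (illustrative)
    std::vector<std::vector<double>> profile(n, std::vector<double>(64, 1.0));
    std::vector<double> bestSim(n, -1.0);

    // Each thread owns a slice of i, and only writes bestSim[i] for its
    // own i, so there is no shared-write contention.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            double dot = 0.0, ni = 0.0, nj = 0.0;
            for (size_t k = 0; k < profile[i].size(); ++k) {
                dot += profile[i][k] * profile[j][k];
                ni  += profile[i][k] * profile[i][k];
                nj  += profile[j][k] * profile[j][k];
            }
            double sim = dot / std::sqrt(ni * nj);   // cosine similarity
            if (sim > bestSim[i]) bestSim[i] = sim;
        }
    }
    std::printf("best similarity for cluster 0: %f\n", bestSim[0]);
    return 0;
}

At this level the parallelism is embarrassingly easy; the awkward part is
that the real algorithm updates shared state as it goes, which is exactly
what doesn't map cleanly onto map/reduce without that rethinking.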

If I succeed in getting the code open-sourced, I'd be very interested in
seeing whether any of you Hadoop artists can suggest an approach that could
produce comparable or better clock time without having to apply a gigantic
amount of hardware. Given such an approach, the author or I might find it
interesting to attempt an implementation as a contribution to Mahout.

--benson


On Sun, Oct 4, 2009 at 7:54 AM, Isabel Drost <[email protected]> wrote:

> On Saturday 03 October 2009 18:45:12 Sean Owen wrote:
> > Let me however revive my suggestion that Mahout include a 'sandbox'
> > module of sorts to host anything at all. This neatly allows for
> > incorporation of anything, in any state, without confusing users as to
> > what should be expected of Mahout 'proper', which should be a
> > reasonably high bar come version 1.0.
>
> +1 Until that is realized, I would suggest to not scare away people just
> because they used the "wrong" programming language/lib/...
>
> Benson, do you think there might be a tiny chance that you can motivate the
> student to contribute his implementation as a JIRA issue and work together
> with the community to make it run on Hadoop? Does that even make sense for
> the algorithm implemented?
>
> Isabel
>
>
> --
>  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>  /,`.-'`'    -.  ;-;;,_
>  |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_) (fL)  IM:  <xmpp://[email protected]>
>
>
