How do you think these various libraries fit into Hadoop? Does it
make sense to just build what we need using HBase? I see that
http://wiki.apache.org/hadoop/Matrix does some matrix things, but it
has a Groovy overlay, so I don't think it is quite what we want.
Perhaps we should just think about, and push up to Hadoop if we can,
our own set of Hadoop-based matrix libraries. Starting off, we need a
decent way to create a matrix and populate it, then also basic matrix
things like addition, multiplication, etc. Then we can add other
things as we need them. For instance, I am interested in TextRank
(search for Mihalcea and TextRank) and it essentially comes down to
doing an iterative algorithm over a matrix. I was thinking I might,
as a way to get deeper into the latest Hadoop, use it as a sample
of a useful algorithm. It's not specifically ML, but it does have
interesting results and it is fairly easy to implement.
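A minimal sketch of the kind of iteration TextRank comes down to, in
plain Python rather than anything Hadoop-specific. It assumes a small
weighted graph held as a dense list-of-lists matrix; the function name
`textrank`, its parameters, and the damping value of 0.85 are
illustrative assumptions, not part of any existing library.

```python
def textrank(adjacency, damping=0.85, tol=1e-6, max_iter=200):
    """Score graph nodes by repeated matrix-vector style updates.

    adjacency[j][i] is the weight of the edge from node j to node i
    (0 means no edge). Hypothetical helper for illustration only.
    """
    n = len(adjacency)
    # Total outgoing weight per node, used to normalise each contribution.
    out_weight = [sum(row) for row in adjacency]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if adjacency[j][i] and out_weight[j]:
                    rank += adjacency[j][i] / out_weight[j] * scores[j]
            new_scores.append((1.0 - damping) + damping * rank)
        # Stop once the scores have stopped moving between iterations.
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


if __name__ == "__main__":
    # Node 0 is linked to nodes 1 and 2, which are not linked to each other.
    graph = [
        [0, 1, 1],
        [1, 0, 0],
        [1, 0, 0],
    ]
    print(textrank(graph))
```

Each pass reads every row of the matrix once. That is the part a
Hadoop-based matrix library would have to distribute, since a real
text graph will not fit in a list of lists.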
Should we just lay out a page on the Wiki where we can start thinking
about matrix needs? Using other libraries is definitely an option,
but I am not sure if they will be optimal in the Hadoop environment.
-Grant
On Feb 6, 2008, at 12:18 PM, Ted Dunning wrote:
There are unfortunately many choices for linear algebra on the JVM,
none particularly satisfactory.
Colt is the one I use. It has a very odd syntax, but gives good
performance. The structure is such that it is very hard to extend to,
say, sparse matrices. The licensing on Colt isn't particularly easy,
either, and I have been unable to contact the author to see about
liberalizing it.
Jama is now essentially defunct. It had a very simple API but not
very high performance. Extending it to additional matrix types is also
not feasible, because the design exposes the matrix's internal
structure as a doubly indexed array. The licensing on Jama is very
open.
MTJ is reportedly high performance and has a less strange API than
Colt, but I haven't used it, so I can't say much about its performance
first-hand. I get the impression it would be difficult to extend, but
I could well be wrong about that.
Commons math uses an extension of Jama, I think. I haven't used it.
The last time I looked seriously at commons math, the committers had
some very odd agendas going on, so I dropped it from consideration. It
looks like it has come quite a ways since then, but I haven't dug into
it deeply since my first evaluation.
On 2/6/08 12:45 AM, "Paul Elschot" <[EMAIL PROTECTED]> wrote:
On Wednesday 06 February 2008 05:23:31, Markus Weimer wrote:
Hi,
One of my contributions to Elefant is an adapter to the Java version
of UIMA which allows you to pipe Python strings through a UIMA
annotation engine and get feature vectors back to work with. This was
done using JPype <http://jpype.sourceforge.net/>, a tool which links
the JVM to the CPython VM.
I chose this non-obvious approach because we use native-code Python
extensions for the matrix operations, an area where Java regrettably
lags far behind native code. So Jython was out of the question, as I
don't know of any way to access a CPython extension from Jython. I
found JPype to do the job and to do it well (the overhead per cross-VM
call was around 1 ms on my laptop). So for those craving a
state-of-the-art Python with decent extensions and access to Java
code, this might be an option.
Well, one of my favourite Java libraries made it into the email
address of this list, and I must say, I was hoping to get some good
solutions to the problem of linear algebra in a JVM here. Has this
problem been discussed beforehand?
I have only used linear algebra packages well before there was Java,
so I wonder how to go about it now.
Regards,
Paul Elschot
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ