[Jprogramming] text mining benchmark

Wed, 05 Jan 2011 11:53:08 -0800

Hello all:

Given a list of strings, k (hundreds of thousands of phrases),
e.g. ("hello j";"software community";...)
and a single massive string, v (tens of millions of words),
e.g. "hello j software community ..."


What is the fastest way to calculate the indices of all exact matches of
phrases from k in v?

Brute-force string searches (even in parallel) perform poorly at this task.
A simple .NET string search loop would take 2 weeks on my test data.
Even in q, running ss in parallel takes several hours.

I wrote a function in q that brought the runtime down to 50 seconds, and it
could run even faster if it were not bound by the 32-bit developer edition.
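
The q function itself is not reproduced here. Purely as an illustration of
how a non-brute-force approach can look, below is a minimal Python sketch.
It is not the q function mentioned above. It assumes that phrases and text
are split on whitespace, that a match means an exact run of whole words, and
that the reported indices are word positions in v rather than character
offsets. The name find_phrases and the sample data are hypothetical.

```python
from collections import defaultdict


def find_phrases(phrases, text):
    """Return {phrase: [word index in text where the phrase starts, ...]}.

    Assumption: words are whitespace-delimited and indices are word
    positions, not character offsets.
    """
    words = text.split()

    # Index the text once: map each word to every position where it occurs.
    positions = defaultdict(list)
    for i, w in enumerate(words):
        positions[w].append(i)

    hits = {}
    for phrase in phrases:
        pw = phrase.split()
        if not pw:
            continue
        n = len(pw)
        # Only positions holding the phrase's first word can start a match,
        # so the text is never rescanned in full for each phrase.
        starts = [i for i in positions.get(pw[0], ()) if words[i:i + n] == pw]
        if starts:
            hits[phrase] = starts
    return hits


# Hypothetical sample data in the spirit of the example above.
k = ["hello j", "software community", "not present"]
v = "hello j software community hello j"
result = find_phrases(k, v)
```

The text is indexed once, and each phrase is then checked only at the
positions of its first word, instead of scanning all of v for every phrase
in k.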

Here is the post on that topic in the kdb+ Google group:
http://groups.google.com/group/personal-kdbplus/browse_thread/thread/76d99ba880a8db58
If anyone wants test data, shoot me an email.

I am curious as to how well J performs this task, given that the 64-bit
version is available, and more generally, as to how J compares to k/q in
terms of speed.

Thanks,

`k.os
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
