> Dawid Weiss wrote:
> > You could also try splitting the document into paragraphs and use Carrot2's
> > Lingo algorithm (www.carrot2.org) on a paragraph-level to extract clusters.
> > Labelling routine in Lingo should extract 'key' phrases; this analysis is
> > heavily frequency-based, but... you know, you may want to try it.
>
> Just to make sure I'm following...
>
> So you're suggesting splitting the document into paragraphs, then
> treating each paragraph as if it were a Carrot2 search result,
> performing the clustering, then looking at the label Lingo chooses for
> each cluster, and treating that label as the "key phrase"?

I tried it. The results weren't great, but perhaps I'm doing it wrong.
Here's my code. The input is a text file in which each (long) line holds
an ID number followed by one paragraph -- standard textual paragraphs.
I'm running it on a corpus of technical papers.

Bill
-----------------------------------------------------------------
import org.carrot2.filter.lingo.common.*;
import org.carrot2.filter.lingo.lsicluster.*;
import java.io.*;

public class test {
    public static void main (String[] argv) {
        DefaultClusteringContext context = new DefaultClusteringContext();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader r = new BufferedReader(new FileReader(argv[0]))) {
            String line;
            while ((line = r.readLine()) != null) {
                // a limit of 2 splits off the first token (the ID) and
                // leaves the rest of the line intact as the body
                String[] parts = line.trim().split("\\s+", 2);
                if (parts.length > 1) {
                    String id = parts[0];
                    String body = parts[1];
                    // each paragraph goes in as a snippet with an empty title
                    context.addSnippet(new Snippet(id, "", body));
                }
            }
        } catch (Exception x) {
            x.printStackTrace(System.err);
        }
        // there is no real search query in this setup
        context.setQuery("");
        Cluster[] clusters = context.cluster();
        // print each cluster's labels, then the paragraphs assigned to it
        for (int i = 0; i < clusters.length; i++) {
            System.out.println("Cluster --");
            String[] labels = clusters[i].getLabels();
            for (int j = 0; j < labels.length; j++) {
                System.out.println("  Label: " + labels[j]);
            }
            Snippet[] snippets = clusters[i].getSnippets();
            System.out.println("  " + snippets.length + " snippets:");
            for (int j = 0; j < snippets.length; j++) {
                System.out.println("    " + snippets[j].getSnippetId() +
                                   " -- " + snippets[j].getText());
            }
        }
    }
}
----------------------------------------------------------------
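
For anyone who wants to reproduce this, here is a minimal sketch of how an
input file in that format could be produced from a plain-text document. It
assumes that paragraphs are separated by one or more blank lines and simply
numbers them 1, 2, 3, ... to serve as IDs. The ParagraphLines class and its
toLines method are made-up names for illustration, not part of Carrot2.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Carrot2): turns a plain-text document
// into "ID paragraph" lines, one paragraph per line.
class ParagraphLines {

    // Assumption: paragraphs are separated by one or more blank lines.
    static List<String> toLines(String document) {
        List<String> lines = new ArrayList<>();
        int id = 1;
        // \R matches any line break; a blank line is two line breaks
        // with optional whitespace in between.
        for (String paragraph : document.split("\\R\\s*\\R")) {
            // Collapse line breaks and runs of whitespace so that the
            // whole paragraph fits on a single line.
            String body = paragraph.trim().replaceAll("\\s+", " ");
            if (!body.isEmpty()) {
                lines.add(id + " " + body);
                id++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String document =
            "First paragraph, wrapped\nover two lines.\n\n"
            + "Second paragraph.\n";
        // Print one "ID paragraph" line per paragraph.
        for (String line : toLines(document)) {
            System.out.println(line);
        }
    }
}
```

Redirecting the printed lines to a file gives something that can be passed
as argv[0] to the program above. Sequential numbers are only one choice of
ID; any token without whitespace would do.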