Keywords indexing, "top words", and co-occurrence

Kaspar Fischer Fri, 18 Dec 2009 06:11:15 -0800

Hi everybody,

I need to do some text analysis and am looking for a software library (in Java, 
preferably) to use for this. Lucene came to my mind first, but I actually hope 
that there is some library (based on Lucene, for example) that solves the 
problems directly.


What I want to do is the following:

1. In documents that get added to the system I need to find keywords from a 
predefined, fixed set of keywords. For example, the user will make a query for 
all documents containing the word "traffic" (this word need not be a keyword) 
and I want to show the number of keyword hits in all documents that contain 
"traffic":

- car, cars, automobile, automobiles (3)
- - Mercedes (2)
- - Ferrari (2)
- train, trains (4) // one doc contains "TGV", 3 contain "train" or "trains"
- - TGV (1)
- - ICE (0)
- plane, planes (5)
- - Boeing (4)
- - Airbus (1)

In short: I want to count keyword hits in the documents returned by some query. 
Notice that the keywords are hierarchically organized and may have synonyms 
("car" = "cars" = "automobile").
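
To make point 1 concrete, here is a minimal sketch in plain Java of the counting I have in mind. It uses no Lucene at all, and every name in it (KeywordHitCounter, Keyword, countHits, demo) is made up for illustration. It assumes that the numbers are document counts and that a hit on a child keyword such as "TGV" also counts for its parent, as in the "train" line above. Matching is on lowercased whole words, so "ICE" would also match the ordinary word "ice".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not a Lucene API: all names here are invented.
final class KeywordHitCounter {

    /** One node of the keyword hierarchy: its synonyms plus narrower child keywords. */
    static final class Keyword {
        final String label;
        final Set<String> synonyms = new HashSet<>();
        final List<Keyword> children = new ArrayList<>();

        Keyword(String... words) {
            label = String.join(", ", words);
            for (String w : words) synonyms.add(w.toLowerCase(Locale.ROOT));
        }

        Keyword child(Keyword k) { children.add(k); return this; }
    }

    /** Lowercases the text and returns its distinct words. */
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** True if the document contains a synonym of k or of any descendant of k. */
    static boolean matches(Keyword k, Set<String> tokens) {
        for (String s : k.synonyms) if (tokens.contains(s)) return true;
        for (Keyword c : k.children) if (matches(c, tokens)) return true;
        return false;
    }

    /** Records, for k and all its descendants, how many documents match. */
    static void count(Keyword k, List<Set<String>> docs, Map<String, Integer> out) {
        int n = 0;
        for (Set<String> d : docs) if (matches(k, d)) n++;
        out.put(k.label, n);
        for (Keyword c : k.children) count(c, docs, out);
    }

    /** Counts keyword hits only in the documents that contain the query word. */
    static Map<String, Integer> countHits(List<Keyword> roots, List<String> docs, String query) {
        String q = query.toLowerCase(Locale.ROOT);
        List<Set<String>> matching = new ArrayList<>();
        for (String doc : docs) {
            Set<String> tokens = tokenize(doc);
            if (tokens.contains(q)) matching.add(tokens);
        }
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (Keyword root : roots) count(root, matching, hits);
        return hits;
    }

    /** Tiny made-up data set, only to show the intended behaviour. */
    static Map<String, Integer> demo() {
        List<Keyword> roots = Arrays.asList(
            new Keyword("car", "cars", "automobile", "automobiles")
                .child(new Keyword("Mercedes")).child(new Keyword("Ferrari")),
            new Keyword("train", "trains")
                .child(new Keyword("TGV")).child(new Keyword("ICE")));
        List<String> docs = Arrays.asList(
            "Traffic was bad: the TGV was late.",
            "Heavy traffic, many cars and one train.",
            "Nothing relevant about traffic here.",
            "A train and a Ferrari, but not the query word.");
        return countHits(roots, docs, "traffic");
    }

    public static void main(String[] args) {
        demo().forEach((label, n) -> System.out.println("- " + label + " (" + n + ")"));
    }
}
```

With the four made-up documents in demo(), only the first three contain "traffic", so "train, trains" gets 2 (one document through "TGV", one through "train") and "Ferrari" gets 0. A linear scan like this would probably be acceptable for the sizes I mention below, but I would rather not reinvent it if a library already does it.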

2. If the user queries for free-input word A ("hamburger", say) I want to find 
all keywords (from the above hierarchy) that are close to "hamburger" in some 
sense (word-distance or some similar measure of distance in text) and order 
them by number of occurrences.
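
One possible reading of "close", sketched in plain Java: a keyword occurrence counts if the query word appears within a fixed window of words around it. This is only my assumption of a workable distance measure, not something Lucene prescribes. The names (KeywordProximity, nearbyKeywords, demo) and the window size are invented, and synonyms and the hierarchy are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not a Lucene API: all names here are invented.
final class KeywordProximity {

    /** Lowercases the text and returns its words in document order. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /**
     * Counts keyword occurrences that have the query word at most `window`
     * words away, and returns the keywords ordered by descending count.
     */
    static Map<String, Integer> nearbyKeywords(List<String> docs, String query,
                                               Set<String> keywords, int window) {
        String q = query.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            List<String> tokens = tokenize(doc);
            for (int i = 0; i < tokens.size(); i++) {
                String word = tokens.get(i);
                if (!keywords.contains(word)) continue;
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.size() - 1, i + window);
                for (int j = from; j <= to; j++) {
                    if (tokens.get(j).equals(q)) {
                        counts.merge(word, 1, Integer::sum);
                        break; // count each keyword occurrence at most once
                    }
                }
            }
        }
        // Sort by count (highest first), then alphabetically for stable output.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> ordered = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) ordered.put(e.getKey(), e.getValue());
        return ordered;
    }

    /** Tiny made-up data set, only to show the intended behaviour. */
    static Map<String, Integer> demo() {
        Set<String> keywords = new HashSet<>(Arrays.asList("car", "train", "plane"));
        List<String> docs = Arrays.asList(
            "I ate a hamburger in the car on the way to the train.",
            "Hamburger stand next to the train station; a plane flew over.",
            "The car wash sells a hamburger.");
        return nearbyKeywords(docs, "hamburger", keywords, 5);
    }

    public static void main(String[] args) {
        demo().forEach((keyword, n) -> System.out.println(keyword + " (" + n + ")"));
    }
}
```

With a window of 5 words, the made-up documents in demo() rank "car" (2 nearby occurrences) ahead of "train" (1), and "plane" is dropped because it is 8 words away from "hamburger".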

Can this be done in Lucene? Or do you know of any frameworks that achieve such 
results?

Regarding size, I expect the queries (for "traffic" in 1., or "hamburger" in 
2.) to return at most 500 documents and each document to contain at most 50 
keywords.

Many thanks,
Kaspar
