Hi,
I'm about to write an application that does very simple text analysis,
namely dictionary-based entity extraction. One alternative is to do
in-memory matching with substring search:
String text; // could be any size, but normally newspaper-article length
List<String> matches = new ArrayList<>();
for (String wordOrPhrase : dictionary) {
    // indexOf() returns -1 when the entry does not occur in the text
    if (text.indexOf(wordOrPhrase) >= 0) {
        matches.add(wordOrPhrase);
    }
}
I am concerned that the above code will be quite CPU intensive; it is also
case sensitive and leaves no room for fuzzy matching.
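For the case issue I suppose I could lowercase both the text and the
dictionary entries up front, roughly like this (just a quick sketch; it
still does nothing for fuzzy matching, and the explicit Locale is only
there to avoid surprises from the default locale):

String lowered = text.toLowerCase(Locale.ROOT);
List<String> matches = new ArrayList<>();
for (String wordOrPhrase : dictionary) {
    if (lowered.indexOf(wordOrPhrase.toLowerCase(Locale.ROOT)) >= 0) {
        matches.add(wordOrPhrase);
    }
}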
I thought this task could also be solved by indexing each piece of text that
is to be analyzed, and then executing a query per dictionary entry:
(pseudo)
lucene.index(text)
List matches
for( String wordOrPhrase : dictionary ) {
    if( lucene.search( wordOrPhrase, text_id ) gives a hit ) {
        matches.add(wordOrPhrase)
    }
}
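Concretely, I imagine it would look something like the sketch below. This
assumes a recent Lucene version with lucene-core and lucene-queryparser on
the classpath; the class name DictionaryMatcher, the field name "body" and
the use of an in-memory ByteBuffersDirectory are just my guesses at how to
wire it up:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class DictionaryMatcher {

    public static List<String> findMatches(String text, List<String> dictionary) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory();

        // Index the text to analyze as a single in-memory document.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", text, Field.Store.NO));
            writer.addDocument(doc);
        }

        List<String> matches = new ArrayList<>();
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("body", analyzer);
            // One query per dictionary entry; quoting makes multi-word
            // entries match as phrases rather than as separate terms.
            for (String wordOrPhrase : dictionary) {
                String quoted = "\"" + QueryParser.escape(wordOrPhrase) + "\"";
                if (searcher.count(parser.parse(quoted)) > 0) {
                    matches.add(wordOrPhrase);
                }
            }
        }
        return matches;
    }
}

If that is roughly right, the analyzer would at least take care of case
(StandardAnalyzer lowercases tokens), and I gather the query parser's
word~ syntax could give some fuzzy matching.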
I have not used Lucene very much, so I don't know whether it is a good idea
to use it for this task at all. Could anyone please share their thoughts on
this?
Thanks,
Geir