RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

Chong, Herb Mon, 17 Nov 2003 06:57:18 -0800

i have a program written in Icon that does basic sentence splitting. with about 5 
heuristics and one small lookup table, i can get well over 90% accuracy doing sentence 
boundary detection on email. for well edited English text, like newswires, i can 
manage closer to 99%. this is all that is needed for significantly improving a search 
engine's performance when the query engine respects sentence boundaries. incidentally, 
the GATE Information Extraction framework cites some references that indicate that for 
named entity feature extraction, their system can exceed the ability of trained humans 
to detect and classify named entities if only one person does the detection. 
collaborating humans are still better, but no-one has the time in practical 
applications.


you probably know, since you know about Markov chains, that within sentence term 
correlation, and hence the language model, is different than across sentences. 
linguists have known this for a very long time. it isn't hard to put this capability 
into a search engine, but it absolutely breaks down unless there is sentence boundary 
information stored for use at query time.

Herb....

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 5:54 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]


Well ... Sure, nothing can replace a human mind. But believe it or not, 
there are studies which show that even human experts can significantly 
differ in their opinions on what are key-phrases for a given text. So, 
the results are never clear cut with humans either...

So, in this sense a heuristic tool for sentence splitting and key-phrase 
detection can go long ways. For example, the application I mentioned, 
uses quite a few heuristic rules (+ Markov chains as a heavier 
ammunition :-), and it comes up with the following phrases for your 
email discussion (the text quoted below):

(lang=EN): NLP, trainable rule-based tagging, natural language 
processing, apache, NLP expert

Now, this set of key-phrases does reflect the main noun-phrases in the 
text... which means I have a practical and tangible benefit from NLP. 
QED ;-)

Best regards,
Andrzej

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

Reply via email to