Re: advice on integrating NLP engine during indexing

Grant Ingersoll Thu, 20 Dec 2007 06:56:09 -0800

FYI: you will get a broader audience on java-user; this list is mostly for discussion of higher-level Lucene things that affect two or more of the Lucene projects.

That being said, a custom analyzer is the way to go to redact the appropriate information. If you have your files in some sort of markup, you can easily create fields to contain the various metadata that you have generated (e.g., history of violence). One new thing that I have been intrigued with for use in NLP applications is the new TeeTokenFilter and SinkTokenizer, which can be used to siphon off interesting tokens for other fields based on the tokens of an existing field. This can save on the need to reanalyze content over and over for different analysis needs. This is, however, advanced usage for now (although I hope it will become more common).
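
To make the idea concrete, here is a rough sketch in plain Java. It deliberately does not use the Lucene API: the class, field, and tag names (NlpIndexingSketch, TaggedToken, "PERSON", "NEGATED", "O") are all made up for illustration, and I am assuming your NLP engine can hand back one label per token. It only shows the shape of the logic: one pass over the tagged tokens that drops identifying terms, keeps negated terms out of the main field, and siphons them into a side list that could feed a separate field.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: plain Java, not Lucene classes. All names are hypothetical.
final class NlpIndexingSketch {

    /** One token plus the label an assumed NLP tagger gave it: "PERSON", "NEGATED", or "O". */
    static final class TaggedToken {
        final String text;
        final String tag;

        TaggedToken(String text, String tag) {
            this.text = text;
            this.tag = tag;
        }
    }

    /** Result of one pass: tokens for the searchable body, and tokens siphoned to a side field. */
    static final class Fields {
        final List<String> body = new ArrayList<>();
        final List<String> negated = new ArrayList<>();
    }

    /** Walks the tokens once, redacting names and diverting negated terms. */
    static Fields split(List<TaggedToken> tokens) {
        Fields out = new Fields();
        for (TaggedToken t : tokens) {
            if ("PERSON".equals(t.tag)) {
                continue;                 // identifying term: never indexed
            }
            if ("NEGATED".equals(t.tag)) {
                out.negated.add(t.text);  // kept out of the body, saved for a side field
                continue;
            }
            out.body.add(t.text);         // everything else goes to the main field
        }
        return out;
    }
}
```

For the "Mr. Smith has no history of violence" example, feeding in smith/PERSON, history/O, violence/NEGATED would leave only "history" in the body list and put "violence" in the negated list. In a real setup I would expect this logic to live in a custom TokenFilter inside your analyzer, with the siphoning handled by TeeTokenFilter and SinkTokenizer rather than by hand.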


Cheers
Grant

On Dec 20, 2007, at 9:48 AM, 1world1love wrote:

Greetings all. I am new to Lucene and am looking for a little advice/direction/feedback on what I am trying to do. I want to index and query millions of documents that are unstructured and resemble crime/police/psychiatric reports; no problem, Lucene is perfect for this.

The trick is that I need to exclude certain terms from the index, such as those terms that are negated or information that could potentially identify people. I have a collection of natural language processing tools that are able to tag or remove/replace such terms.

I need to design the indexing such that I can feed each document through these tools and then incorporate the results into the indexing strategy.

As an example, if I have a report that has the phrase: "Mr. Smith has no history of violence against women prior to this event"

The NLP engine would recognize the name Smith and the negation of the term "violence" and would tag them as such. I would then like to exclude those terms from the indexing as seems prudent.

Another strategy I would like to look at is to include the tags in the index to incorporate them into the search engine. That is to say, whether a subject "likely" has a history of violence, "may" have a history of violence, or "does not" have a history of violence.

I assume that I will need to design a custom analyzer to do this, but I was hoping to solicit any comments, advice, or general suggestions before I get started.

Thanks in advance,

j


--
View this message in context: 
http://www.nabble.com/advice-on-integrating-NLP-engine-during-indexing-tp14437913p14437913.html
Sent from the Lucene - General mailing list archive at Nabble.com.


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
