FYI: you will get a broader audience on java-user. This list is mostly for discussion of higher-level Lucene things that affect two or more of the Lucene projects.

That being said, a custom analyzer is the way to go to redact the appropriate information. If you have your files in some sort of markup, you can easily create fields to contain the various metadata that you have generated (e.g., history of violence). One new thing that I have been intrigued with for use in NLP applications is the new TeeTokenFilter and SinkTokenizer, which can be used to siphon off interesting tokens for other fields based on the tokens of an existing field. This saves you from reanalyzing the same content over and over for different analysis needs. This is, however, advanced usage for now (although I hope it will become more common).
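
To make the siphoning idea concrete, here is a rough sketch in plain Java. It does not use the actual Lucene classes; TaggedToken, Routed, and route are hypothetical names made up for illustration, and the two boolean flags stand in for whatever your NLP engine emits. It only shows the general shape: one pass over the tokens, dropping identifying terms and diverting negated terms into a separate list that could feed another field.

```java
import java.util.ArrayList;
import java.util.List;

public class TeeSinkSketch {

    /** A token plus the labels a (hypothetical) NLP engine attached to it. */
    static final class TaggedToken {
        final String text;
        final boolean identifying; // e.g. a person's name
        final boolean negated;     // e.g. "violence" in "no history of violence"

        TaggedToken(String text, boolean identifying, boolean negated) {
            this.text = text;
            this.identifying = identifying;
            this.negated = negated;
        }
    }

    /** Result of a single pass: terms for the main field, plus siphoned-off terms. */
    static final class Routed {
        final List<String> body = new ArrayList<>();
        final List<String> negatedTerms = new ArrayList<>();
    }

    /** Walks the tokens once, redacting identifying terms and diverting negated ones. */
    static Routed route(List<TaggedToken> tokens) {
        Routed out = new Routed();
        for (TaggedToken t : tokens) {
            if (t.identifying) {
                continue;                     // redact: never reaches any field
            }
            if (t.negated) {
                out.negatedTerms.add(t.text); // goes to the side list only
                continue;
            }
            out.body.add(t.text);             // ordinary term for the main field
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed tagging for "Smith has no history of violence".
        List<TaggedToken> tokens = new ArrayList<>();
        tokens.add(new TaggedToken("smith", true, false));
        tokens.add(new TaggedToken("has", false, false));
        tokens.add(new TaggedToken("no", false, false));
        tokens.add(new TaggedToken("history", false, false));
        tokens.add(new TaggedToken("of", false, false));
        tokens.add(new TaggedToken("violence", false, true));

        Routed r = route(tokens);
        System.out.println("body field terms:    " + r.body);
        System.out.println("negated field terms: " + r.negatedTerms);
    }
}
```

In a real analyzer the same routing would happen inside the token stream rather than over a list, so the content is still only analyzed once.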

Cheers
Grant

On Dec 20, 2007, at 9:48 AM, 1world1love wrote:


Greetings all. I am new to Lucene and am looking for a little
advice/direction/feedback on what I am trying to do. I want to index and
query millions of documents that are unstructured and resemble
crime/police/psychiatric reports; no problem, Lucene is perfect for this.

The trick is that I need to exclude certain terms from the index, such as terms that are negated or information that could potentially identify people. I have a collection of natural language processing tools that are
able to tag or remove/replace such terms.

I need to design the indexing such that I can feed each document through these tools and then incorporate the results into the indexing strategy.

As an example, if I have a report that has the phrase: "Mr. Smith has no
history of violence against women prior to this event"

The NLP engine would recognize the name Smith and the negation of the term "violence" and would tag them as such. I would then like to exclude those
terms from the indexing as seems prudent.

Another strategy I would like to look at is to include the tags in the index to incorporate them into the search engine. That is to say, whether a subject "likely" has a history of violence, "may" have a history of violence, or
"does not" have a history of violence.

I assume that I will need to design a custom analyzer to do this, but I was hoping to solicit any comments, advice, or general suggestions before I get
started.

Thanks in advance,

j


--
View this message in context: 
http://www.nabble.com/advice-on-integrating-NLP-engine-during-indexing-tp14437913p14437913.html
Sent from the Lucene - General mailing list archive at Nabble.com.


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


