FYI: you will get a broader audience on java-user; this list is mostly
for discussion of higher-level Lucene things that affect two or more
of the Lucene projects.
That being said, a custom analyzer is the way to go to redact the
appropriate information. If you have your files in some sort of
markup, you can easily create fields to contain the various metadata
that you have generated (e.g. history of violence). One new thing
that I have been intrigued with for use in NLP applications is the new
TeeTokenFilter and SinkTokenizer that can be used to siphon off
interesting tokens for other fields based on the tokens of an existing
field. This can save on the need to reanalyze content over and over
for different analysis needs. This is, however, advanced usage for
now (although I hope it will become more common).
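To make the idea concrete, here is a rough, library-independent sketch in
plain Java of the decision a custom token filter would make. It does not use
the Lucene API at all: the Tag enum, TaggedToken, Split, and split() are all
hypothetical names invented for illustration, and the tags stand in for
whatever your NLP tools emit. Identifying tokens are dropped, negated tokens
are siphoned into a separate list (roughly the role the tee/sink pair would
play for a second field), and everything else goes to the main field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RedactionSketch {
    /** Hypothetical tags an NLP engine might assign; not part of Lucene. */
    enum Tag { NONE, PERSON, NEGATED }

    /** A token's text plus the tag assigned to it (assumed input shape). */
    static final class TaggedToken {
        final String text;
        final Tag tag;
        TaggedToken(String text, Tag tag) { this.text = text; this.tag = tag; }
    }

    /** Terms for the main field, and negated terms siphoned off for a second field. */
    static final class Split {
        final List<String> body = new ArrayList<>();
        final List<String> negated = new ArrayList<>();
    }

    /** One pass over the tokens: drop PERSON, siphon NEGATED, keep the rest. */
    static Split split(List<TaggedToken> tokens) {
        Split out = new Split();
        for (TaggedToken t : tokens) {
            switch (t.tag) {
                case PERSON:
                    break;                    // redacted: never reaches any field
                case NEGATED:
                    out.negated.add(t.text);  // kept out of the main field
                    break;
                default:
                    out.body.add(t.text);     // ordinary term for the main field
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<TaggedToken> tokens = Arrays.asList(
            new TaggedToken("smith", Tag.PERSON),
            new TaggedToken("history", Tag.NONE),
            new TaggedToken("violence", Tag.NEGATED));
        Split s = split(tokens);
        System.out.println("body field:    " + s.body);
        System.out.println("negated field: " + s.negated);
    }
}
```

In a real analyzer the same branch would sit inside a TokenFilter, and the
two lists would become two fields on the document. Whether negated terms
should be dropped entirely or kept in their own field is a design choice
that depends on what you want to be searchable.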
Cheers
Grant
On Dec 20, 2007, at 9:48 AM, 1world1love wrote:
Greetings all. I am new to Lucene and am looking for a little
advice/direction/feedback on what I am trying to do. I want to index and
query millions of documents that are unstructured and resemble
crime/police/psychiatric reports; no problem, Lucene is perfect for this.
The trick is that I need to exclude certain terms from the index, such as
those terms that are negated or information that could potentially identify
people. I have a collection of natural language processing tools that are
able to tag or remove/replace such terms.
I need to design the indexing such that I can feed each document through
these tools and then incorporate the results into the indexing strategy.
As an example, suppose I have a report that has the phrase: "Mr. Smith has
no history of violence against women prior to this event."
The NLP engine would recognize the name Smith and the negation of the term
"violence" and would tag them as such. I would then like to exclude those
terms from the indexing as seems prudent.
Another strategy I would like to look at is to include the tags in the index
to incorporate them into the search engine. That is to say, whether a subject
"likely" has a history of violence, "may" have a history of violence, or
"does not" have a history of violence.
I assume that I will need to design a custom analyzer to do this, but I was
hoping to solicit any comments, advice, or general suggestions before I get
started.
Thanks in advance,
j
--
View this message in context:
http://www.nabble.com/advice-on-integrating-NLP-engine-during-indexing-tp14437913p14437913.html
Sent from the Lucene - General mailing list archive at Nabble.com.
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ