This is actually a tough problem in general: word sense disambiguation (polysemy). In your case, though, I think
you'll more likely need some named entity resolution to differentiate
"George Washington" from "George Washington Carver", as they are two different
entities.

Do you have a list of all the entity names in your corpus (either manually curated or built by some
pattern matching)? If you do, one thing you can do is write a tokenizer that emits one token for
each entity. So, for example, the string "George Washington" emits a token like
_George_Washington_, "George Washington Carver" emits _George_Washington_Carver_, etc.
As long as the tokenizer prefers the longest matching name, a query for the
_George_Washington_ token would then match your documents 2 and 3 but not 1.
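
To make the idea concrete, here is a minimal sketch in plain Python (not Lucene code, and not a drop-in analyzer). It assumes you already have an entity list; ENTITIES, build_entity_pattern and tokenize are made-up names for illustration. Longer names are tried first, so "George Washington Carver" is never split into "George Washington" plus "Carver".

```python
import re

# Hypothetical entity list; in practice it would come from your own corpus.
ENTITIES = ["George Washington", "George Washington Carver"]


def build_entity_pattern(entities):
    # Try longer names first so a long entity wins over its own prefix.
    ordered = sorted(entities, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Either a known entity on word boundaries, or any single word.
    return re.compile(r"\b(?P<entity>" + alternatives + r")\b|\w+")


def tokenize(text, pattern):
    tokens = []
    for match in pattern.finditer(text):
        entity = match.group("entity")
        if entity is not None:
            # One token per entity, e.g. _George_Washington_
            tokens.append("_" + entity.replace(" ", "_") + "_")
        else:
            tokens.append(match.group(0))
    return tokens


pattern = build_entity_pattern(ENTITIES)

docs = {
    1: "George Washington Carver blah blah blah.",
    2: "George Washington blah blah blah.",
    3: "George Washington Carver blah blah blah. George Washington blah blah blah.",
}

# Documents whose token stream contains the exact entity token.
matching = [doc_id for doc_id, text in docs.items()
            if "_George_Washington_" in tokenize(text, pattern)]
```

With the three example documents, matching should contain only 2 and 3. In Lucene the same longest-match logic would have to live in your own analysis chain at both index and query time; the sketch only shows the matching rule itself.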

There are some open source NLP libraries that do this, but the quality
varies, as it will most likely depend on your domain and training data set.

Hope this helps,
Tri

On Jul 11, 2014, at 07:20 AM, Michael Ryan <mr...@moreover.com> wrote:

I'm trying to solve the following problem...

I have 3 documents that contain the following contents:
1: "George Washington Carver blah blah blah."
2: "George Washington blah blah blah."
3: "George Washington Carver blah blah blah. George Washington blah blah blah."

I want to create a query that matches documents 2 and 3, but not 1. That is, I want to find documents that 
mention "George Washington". It's okay if they also mention "George Washington Carver", 
but I don't want documents that only mention "George Washington Carver". So simply doing something 
like this does not solve it:
"George Washington" NOT "George Washington Carver"

Is there a Query type that does this out of the box? I've looked at the various 
types of span queries, but none of them seem to do this. I think it should be 
theoretically possible given the position data that Lucene stores...

-Michael
