Have you considered evaluating doc-score thresholds for limiting your results? Since the perfect answers to these situations lie in the constant tweaking and twiddling of analysis and tokenization, one way I've found to help is to evaluate result scores. In your "Ontario CA" example, limiting results to two vs. three vs. four is not necessarily so important as getting the first results accurate.
Without seeing your test code, I'm guessing that a search for "Ontario CA" returns [3] and [4] ahead of any other results. You could try looking at the score for each returned hit, and based on some threshold (80-20 rule, maybe) display what's returned. Just a thought. -- j On 1/27/06, Colin Young <[EMAIL PROTECTED]> wrote: > > The reason I only want 2 hits is because [2] is more "specific" in my > domain -- I could also have Toronto, Ontario; Kingston, Ontario etc. > which would take the hits up to 5 now. > > What I'm really after is finding a way to index and search that would > make [2] an invalid retrieval. > > My latest attempt is like this (field name: value): > > Type: city > Name: london ontario canada > Name: london on canada > Name: london ontario ca > Name: london on ca > Primary-name: london > > So the new list of documents is something like this (<type>: <name > entries> {<primary-name>}): > > [1] city: London, United Kingdom {london} > [2] city: London, Ontario, Canada {london} > [3] city: Ontario, California, United States {ontario} > [4] state: Ontario, Canada {ontario} > [5] city: Vancouver, Washington, United States {vancouver} > [6] city: Vancouver, British Columbia, Canada {vancouver} > [7] city: Washington, DC, United States {washington} > [8] state: Washington, United States {washington} > > I realize that I'm adding a lot of duplicate info -- I haven't got to > the refactoring stage yet, so I'm trying to keep my unit test setup very > explicit. The final analysis process will be pulling the geographic > entities from a database so I'll have all the synonyms, types (city, > state, country), etc. at that point and can write custom routines for > documents of each type (city, state, country). > > The idea here is to filter the results so that only documents where the > primary-name appears in the user's query string come back. i.e. if the > user typed "Ontario, CA", so only [3, 4] are valid results now since [2] > has a primary-name of "london" which does not appear in the user's > query, while [3, 4] both have a primary-name of "ontario". Now I'm just > having some trouble creating a filter (I've managed so far to filter out > _everything_). I can't quite sort out how to do a (displaying my SQL > background here) "where <term> in <query string>". I'm including my > current search code at the end of this response. > > Unfortunately I can't just assume the first term in the user's query is > the primary-name since it could be more than one word (e.g. for "St. > Louis du Ha Ha Quebec", "St. Louis du Ha Ha" is the primary-name). > > Thanks > > Colin > > // sample call: > // Hits hits = GeographySearch.Search(searcher, "any", new String[] > {"Ontario", "CA"}); > > public static Hits Search(Searcher searcher, String typeToFind, String[] > queryString) > throws IOException, ParseException > { > TermQuery entityType = new TermQuery(new Term("class", > typeToFind)); > BooleanQuery filterQuery = new BooleanQuery(); > PhraseQuery query = new PhraseQuery(); > query.setSlop(1); > > for (int i = 0; i < queryString.length; i++) > { > query.add(new Term("name", > queryString[i].toLowerCase())); > filterQuery.add( > new TermQuery(new Term("primary-name", > queryString[i])), > BooleanClause.Occur.SHOULD); > } > > BooleanQuery geographyQuery = new BooleanQuery(); > if (typeToFind != "any") geographyQuery.add(entityType, > BooleanClause.Occur.MUST); > geographyQuery.add(query, BooleanClause.Occur.MUST); > > QueryFilter filter = new QueryFilter(filterQuery); > > Hits hits = searcher.search(geographyQuery, filter); > return hits; > } > > -----Original Message----- > From: Rajesh Munavalli [mailto:[EMAIL PROTECTED] > Sent: 27 January, 2006 14:28 > To: java-user@lucene.apache.org > Subject: Re: Help with indexing and query strategy > > Hi Colin, > Even assuming you came up with a good way of indexing, the > example query "Ontario, CA" should yield 3 hits. All 2, 3 and 4 are > valid retrievals. Could you please justify which 2 hits you want and > why? > > Thanks, > > Rajesh Munavalli > > Colin Young wrote: > > I'm having some trouble coming up with a good search strategy for > geographical data. e.g., given: > > > > [1] city: London, United Kingdom > > [2] city: London, Ontario, Canada > > [3] city: Ontario, California, United States [4] state: Ontario, > > Canada [5] city: Vancouver, Washington, United States [6] city: > > Vancouver, British Columbia, Canada [7] city: Washington, DC, United > > States [8] state: Washington, United States > > > > and also given the following synonyms: > > > > Ontario = ON > > California = CA > > Washington = WA > > Canada = CA > > United States = US = America = United States of America United Kingdom > > > = UK = Great Britain = England > > > > for the following queries, I want the listed number of hits '()' from > matching '[]': > > > > i. Ontario (2) [3, 4] > > ii. London (2) [1, 2] > > iii. Ontario, Canada (1) [4] > > iv. Ontario, California (1) [3] > > v. Ontario, CA (2) [3, 4] > > vi. Ontario, US (1) [3] > > vii. Vancouver (2) [5, 6] > > viii. Washington (2) [7, 8] > > ix. Washington, DC (1) [7] > > x. Vancouver, CA (1) [6] > > xi. Vancouver, WA (1) [5] > > > > How do I index and store the input (assume that I know the mechanics > so I'm not looking for specific java syntax or how to generate synonyms > during analysis) so that I get the desired results. My current attempt > indexes strings like "London Ontario Canada", "London ON Canada", > "London Ontario CA", "London ON CA" -- i.e. every combination of entity > name and corresponding code -- in a content field and creates a type > field containing "city" (or "state" or "country" as appropriate to > identify the type of entity being indexed) and uses a phrase query with > a slop of 1 which works really well except e.g. "Ontario CA" for which > I'd like 2 hits, but given the above data gives 3 hits (from 2, 3 and 4, > and the problem will only get worse as I add more cities in Ontario > since each results in a hit). The slop of 1 is required since not all > countries customarily use states, and I need to support the user > optionally dropping the state as in the above example of "Ontario, CA" > where we don't know if the user intended the "CA" to represent the state > of California or the country of Canada, while "London, UK" would be > unambiguous. > > > > The major problem as I see it is that at parse time I don't know if > the user is searching for a city, state or country, and I don't want to > force them to specify that. > > > > Does anyone have any good ideas to help me solve this problem? > > > > Thanks. > > > > Colin Young > > > > > > Notice: This email message is for the sole use of the intended > recipient(s) and may contain confidential and privileged information. > Any unauthorized review, use, disclosure or distribution is prohibited. > If you are not the intended recipient, please contact the sender by > reply email and destroy all copies of the original message. > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > Notice: This email message is for the sole use of the intended > recipient(s) and may contain confidential and privileged information. Any > unauthorized review, use, disclosure or distribution is prohibited. If you > are not the intended recipient, please contact the sender by reply email and > destroy all copies of the original message. > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >