Re: Help with indexing and query strategy

Rajesh Munavalli Mon, 30 Jan 2006 10:16:01 -0800

For now, the best I could come up with is the following scheme

SAMPLE DOCUMENTS:
------------------------------------
Lets say there are four documents:


Doc1: st louis, missouri, usa
Doc2: st louis du ha ha, quebec, canada
Doc3: new york, NY, united states of america
Doc4: ny, usa

INDEX PHASE:
-----------------------
Index the documents as follows:
* Lucene index will have two fields
    - Primary: To hold the primary information
    - Secondary: To hold the secondary information

Doc: PRIMARY:<primary_info> SECONDARY:<secondary_info>

Doc1:    PRIMARY:    0 st louis
    SECONDARY:    0 missouri
            0 ms
            1 usa
            1 united states of america

Doc 2:    PRIMARY:    0 st louis du ha ha
    SECONDARY    0 quebec
            0 qb
            1 canada
            1 ca

Doc 3:    PRIMARY    0 new york
            0 nyc
    SECONDARY    0 ny
            0 new york
            1 united states of america
            1 usa

Doc 4:    PRIMARY    0 ny
    SECONDARY    0 usa
            0 united states of america


QUERY PHASE:
-------------------------
At the query time split the query of "n" words as follows

q1 q2 q3.....qn

(primary:"q1"^1 AND secondary:"q2 q3 ....qn"^SLOPE 1) OR (primary:"q1 q2"^2
AND secondary:"q3 q4....qn"^SLOPE 1) .........(primary:"q1 q2...qn-1"^n-1
secondary:"qn") OR (primary:"q1 q2...qn"^n)


SAMPLE QUERY:
---------------------------
Query1:
 "st louis du ha ha qb ca" (NOTE: not separated by any delimiters ",")

Expanded query:
------------------------
(primary:"st"^1 AND secondary:"louis du ha ha qb ca"^SLOPE 1) OR
(primary:"st louis"^2 AND secondary:"du ha ha qb ca"^SLOPE 1) OR
(primary:"st louis du"^3 AND secondary:"ha ha qb ca"^SLOPE 1) OR
(primary:"st louis du ha"^4 AND secondary:"ha qb ca"^SLOPE 1) OR
(primary:"st louis du ha ha"^5 AND secondary:"qb ca"^SLOPE 1) OR
(primary:"st louis du ha ha qb"^6 AND secondary:"ca") OR
(primary:"st louis du ha ha qb ca")

Result: This would retrieve Doc 2 only
NOTE:
----------
Eventhough the query matches "st louis" in Doc 1" the secondary information
does not match. Hence it wont be retrieved.


Query 2:
"ny united states of america"

Expanded query:
------------------------
(primary:"ny"^1 AND secondary:"united states of america"^SLOPE 1) OR
(primary:"ny united"^2 AND secondary:"states of america"^SLOPE 1) OR
(primary:ny "united states"^3 AND secondary:"of america"^SLOPE 1) OR
(primary:"ny united states of"^4 AND secondary:"america"^SLOPE 1) OR
(primary:"ny united states of america"^5)

Result: This would rertrieve Doc 4 only.

I hope this helps....:)

Good luck,

Rajesh Munavalli

On 1/27/06, Colin Young <[EMAIL PROTECTED]> wrote:
>
> 1) Yes. One location per document.
>
> 2) Using the SimpleAnalyzer (for now). I have city, state and country as
> separate fields, so I could tokenize each as a single token if that
> would work better. I think that avoids the need for a delimiter at index
> time.
>
> 3) I am not making any assumptions now at query time, but the goal is
> that we should support commas and spaces (i.e. "London, Ontario, Canada"
> or "London Ontario Canada" are equivalent). My unit tests are supplying
> the query assuming it's been tokenized already (i.e. I'm sending in
> String[] for the query terms).
>
> 4) We don't want to return Albany unless the user has Albany in the
> query.
>
> Thanks again for looking at this.
>
> Colin
>

Re: Help with indexing and query strategy

Reply via email to