RE: intra-word delimiters

Rajesh Munavalli Mon, 15 Aug 2005 21:06:55 -0700

How about this....

(1) Follow the first three steps mentioned by Marvin
(2) For step 4 instead of trying to come up with all concatenation, create 
concatenations of 3 words

For example an entity of length n, W1 W2 W3....Wn will have the index positions 
as follows

Index Position   Value
0                W1W2W3
                 W1W2
                 W1
1                W2W3W4
                 W2W3
                 W4

and so on...

(3) At the time of querying follow the steps 1 and 2 to create the query terms 
with respective BOOST levels as follows....
QUERY: Power Shot SD 500
PROCESSED QUERY: power shot sd 500

Query Terms ORed:
powershotsd^3, powershot^2, power
shotsd500^3, shotsd^2, shot
sd500^2, sd
500

QUERY: CanonPowerShotSD500
PROCESSED QUERY: canonpowershotsd 500
Query Terms ORed:
canonpowershotsd500^2, canonpowershotsd

QUERY: Canon-Powershot-SD 500
PROCESSED QUERY: canon powershot sd 500
Query Terms ORed:
canonpowershotsd^3, canonpowershot^2, canon
powershotsd500^3, powershotsd^2, powershot
sd500^2, sd
500

Hopefully 3 grams should be good enough to capture enough information to 
retrieve useful documents. You can adjust the boost levels and assess the 
performance.

Hope it helps...

Rajesh Munavalli

-----Original Message-----
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Mon 8/15/2005 7:47 PM
To: [email protected]
Subject: Re: intra-word delimiters

That was the plan, but step (4) really seems problematic.

- term expansion this way can lead to a lot of false matches
- phrase queries with many bordering words break
- settingt term positions such that phrase queries work on all combos
of subwords is non-trivial.

It seems like a better approach might be a new query type that can
handle things like this.

As an example, consider a-b-c-d (4 subwords)... one way of indexing
the tokens would be:

Pos0: a
Pos1: b,  ab,  a
Pos2: c,  bc,  abc,  cd
Pos3: d,  abcd

There are only 10 uniqe tokens n(n/2+1/2), but I needed to index 11 in
order to satisfy all possible phrase queries of catenated subwords. 
Notice how many other things will now match though (ac, aab,
aababcabcd, etc).

In addition, any algorithm I come up with to generate those term
position uses even more terms than the hand-generated one above.

By using index expansion in this manner, we have lost info about the
original ordering.  A new type of fuzzy phrase query seems like it
might be able to do a better job in many circumstances.

-Yonik

On 8/15/05, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> 1) Lowercase.
> 2) Convert non-alphanumeric characters to spaces.
> 3) Introduce a space at every boundary between a letter and a number.
> 4) concatenate all 1, 2, 3 .. n term combinations and index them.
> 5) Don't stem.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: intra-word delimiters

Reply via email to