How about this....
(1) Follow the first three steps mentioned by Marvin
(2) For step 4 instead of trying to come up with all concatenation, create
concatenations of 3 words
For example an entity of length n, W1 W2 W3....Wn will have the index positions
as follows
Index Position Value
0 W1W2W3
W1W2
W1
1 W2W3W4
W2W3
W4
and so on...
(3) At the time of querying follow the steps 1 and 2 to create the query terms
with respective BOOST levels as follows....
QUERY: Power Shot SD 500
PROCESSED QUERY: power shot sd 500
Query Terms ORed:
powershotsd^3, powershot^2, power
shotsd500^3, shotsd^2, shot
sd500^2, sd
500
QUERY: CanonPowerShotSD500
PROCESSED QUERY: canonpowershotsd 500
Query Terms ORed:
canonpowershotsd500^2, canonpowershotsd
QUERY: Canon-Powershot-SD 500
PROCESSED QUERY: canon powershot sd 500
Query Terms ORed:
canonpowershotsd^3, canonpowershot^2, canon
powershotsd500^3, powershotsd^2, powershot
sd500^2, sd
500
Hopefully 3 grams should be good enough to capture enough information to
retrieve useful documents. You can adjust the boost levels and assess the
performance.
Hope it helps...
Rajesh Munavalli
-----Original Message-----
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Mon 8/15/2005 7:47 PM
To: [email protected]
Subject: Re: intra-word delimiters
That was the plan, but step (4) really seems problematic.
- term expansion this way can lead to a lot of false matches
- phrase queries with many bordering words break
- settingt term positions such that phrase queries work on all combos
of subwords is non-trivial.
It seems like a better approach might be a new query type that can
handle things like this.
As an example, consider a-b-c-d (4 subwords)... one way of indexing
the tokens would be:
Pos0: a
Pos1: b, ab, a
Pos2: c, bc, abc, cd
Pos3: d, abcd
There are only 10 uniqe tokens n(n/2+1/2), but I needed to index 11 in
order to satisfy all possible phrase queries of catenated subwords.
Notice how many other things will now match though (ac, aab,
aababcabcd, etc).
In addition, any algorithm I come up with to generate those term
position uses even more terms than the hand-generated one above.
By using index expansion in this manner, we have lost info about the
original ordering. A new type of fuzzy phrase query seems like it
might be able to do a better job in many circumstances.
-Yonik
On 8/15/05, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> 1) Lowercase.
> 2) Convert non-alphanumeric characters to spaces.
> 3) Introduce a space at every boundary between a letter and a number.
> 4) concatenate all 1, 2, 3 .. n term combinations and index them.
> 5) Don't stem.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]