Christoph Mangold wrote:

Hi Andrzej,

I found this email of yours on the Lucene newsgroup.

Hi Christoph,

You need to remember that what I described is more a hack than a proper solution. However, it worked well enough for me, without the need to modify the core of Lucene...

There are other tricks you can use here, too... In one of my projects I
had a need to store a list of weighted keywords. No problem storing
multiple tokens under the same field name, as Erik explained above.
However, in Lucene you can only apply a single boost value to a field. I
ended up encoding the keywords like "10.0 keyword" and then writing an
analyzer which skips the initial numbers when processing this particular
field (which was stored, indexed and tokenized).


I have exactly the problem you describe. Could you provide some details on
this? In particular the following points are not clear to me:

- Is 10.0 supposed to be the boost of the keyword?

Yes. However, this boost cannot really be used during scoring, only during query formulation, because these boost values are not indexed, only stored as text. In other words: the value "10.0" is ignored during searching, but since in my scenario I would formulate the query based on the keywords from the current document, I could use the values stored in that document to boost terms in the query.
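
To make this concrete, here is a minimal sketch of the query-formulation step in plain Java, with no Lucene classes involved. It assumes the stored values look like "10.0 house", that the keyword contains no characters that are special in the query syntax (such as quotes), and that the resulting string is later handed to a query parser, which understands the field:term^boost notation. The class and method names are hypothetical, made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns stored "<weight> <keyword>" values into a
// boosted query string. It does not touch the index or the analyzer.
class WeightedKeywordQuery {

    /** Converts one stored value such as "10.0 house" into keyword:"house"^10.0. */
    static String toBoostedClause(String field, String storedValue) {
        String trimmed = storedValue.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException(
                    "expected '<weight> <keyword>' but got: " + storedValue);
        }
        // The leading number is the weight; everything after it is the keyword.
        float weight = Float.parseFloat(trimmed.substring(0, space));
        String keyword = trimmed.substring(space + 1).trim();
        // Quote the keyword so that a multi-word keyword is treated as a phrase.
        return field + ":\"" + keyword + "\"^" + weight;
    }

    /** Joins the boosted clauses for all stored values of one document. */
    static String buildQuery(String field, List<String> storedValues) {
        List<String> clauses = new ArrayList<>();
        for (String value : storedValues) {
            clauses.add(toBoostedClause(field, value));
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("10.0 house", "2.5 garden");
        // prints: keyword:"house"^10.0 keyword:"garden"^2.5
        System.out.println(buildQuery("keyword", stored));
    }
}
```

The weights only influence the query built from the current document; the matching documents' own stored weights still play no role in scoring.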


- Do you name the field "10.0 keyword: house" or the keyword itself (like
"keyword: 10.0 house")?

The latter.

- In either case: If you enhance the analyzer to skip the value, I don't
understand how the Scorer can use the boost information afterwards.

The Scorer cannot - that's the drawback. I can use it, however, when constructing the query based on the current document's keywords.



It would be great if you would provide some clues about which classes should be enhanced etc.

Another approach you may want to investigate is to artificially add more identical tokens to the field, to increase the number of index postings, which will result in an increased score. E.g.


keyword: "hello world SEPARATOR hello world"

would automatically give you a boost of 2.0 for all of "hello", "world" and "hello world", and would correctly not give a boost for "world hello" (because of the inserted SEPARATOR token). If you don't have to store the values of these fields then the index size would increase only minimally, because the indexed tokens are stored only once, and then only their postings (occurrences) are stored.
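
A minimal sketch of how such a field value could be generated, again in plain Java without any Lucene classes. It assumes integer weights (the number of copies), and that the literal token SEPARATOR never occurs as a real keyword and is not removed by the analyzer. The class and method names are hypothetical.

```java
// Hypothetical helper: repeats a keyword phrase so that it is indexed
// several times, with a separator token between the copies.
class RepeatedKeywordField {

    // Assumed never to appear as a real keyword in the data.
    static final String SEPARATOR = "SEPARATOR";

    /** Returns the phrase repeated 'copies' times, separated by SEPARATOR. */
    static String repeat(String phrase, int copies) {
        if (copies < 1) {
            throw new IllegalArgumentException("copies must be >= 1: " + copies);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < copies; i++) {
            if (i > 0) {
                // The separator keeps the end of one copy and the start of the
                // next from forming an unintended phrase such as "world hello".
                sb.append(' ').append(SEPARATOR).append(' ');
            }
            sb.append(phrase);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: hello world SEPARATOR hello world
        System.out.println(repeat("hello world", 2));
    }
}
```

How much the score actually rises per extra copy depends on how the similarity in use weighs term frequency, so the number of copies is best treated as a rough knob rather than an exact multiplier.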

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

