Other indexing strategies:
- AFAIK, you could cheat by repeating the header tokens in the indexed
text, which raises their term frequency and thus affects the scoring.
For example:
<h1>hello world</h1> <p> foo bar </p>
content -> hello world hello world foo bar
This is not very tweakable though.
- As Tate suggests, you can also use multiple fields and apply your search
on all of them:
<h1>hello world</h1> <p> foo bar </p>
content-> hello world foo bar
headers-> hello world
or even
<h1>hello world</h1> <h2> foo bar </h2>
content-> hello world foo bar
header1-> hello world
header2-> foo bar
The result is that you get fine-grained control over the different
fields. At this point, you can boost at indexing time or at search time.
I personally opt for search time because it is more open to tweaking, as
opposed to reindexing everything whenever you want to change a boost
factor.
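
To make the two indexing strategies above concrete, here is a minimal
sketch in plain Java (no Lucene calls). HeaderFields, split and
repeatHeaders are names I made up for illustration, and the regex assumes
simple, well-formed, non-nested header tags; real HTML would need a proper
parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: prepares per-field text from simple HTML.
class HeaderFields {
    // Matches <h1>..</h1> through <h6>..</h6>; assumes well-formed, non-nested headers.
    private static final Pattern HEADER = Pattern.compile(
        "<h([1-6])[^>]*>(.*?)</h\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Drops all tags and collapses runs of whitespace.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Multi-field strategy: "content" holds all text, "headerN" holds level-N header text.
    static Map<String, String> split(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("content", stripTags(html));
        Matcher m = HEADER.matcher(html);
        while (m.find()) {
            String name = "header" + m.group(1);
            fields.merge(name, stripTags(m.group(2)), (a, b) -> a + " " + b);
        }
        return fields;
    }

    // Token-repetition trick: header text appears `factor` times in one single field.
    static String repeatHeaders(String html, int factor) {
        Matcher m = HEADER.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder repeated = new StringBuilder();
            for (int i = 0; i < factor; i++) {
                repeated.append(' ').append(m.group(2));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(repeated.toString()));
        }
        m.appendTail(out);
        return stripTags(out.toString());
    }
}
```

Each entry returned by split would then go into its own indexed field,
while the repeatHeaders output would go into a single content field.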
As for the complexities that Tate mentions for query parsing, he's right
that it's a pain when using the built-in query parser, but you can always
use the API directly to build whatever queries you need.
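
One way to keep the multi-field expansion manageable is to expand each
term across the fields and then AND the per-term groups, which matches the
same documents as enumerating every field combination (the scores may
differ). Below is a rough sketch in plain Java. MultiFieldExpander and
expandAll are hypothetical names, the field names and boosts are only
examples, and terms are assumed to be already analyzed and free of special
characters. The result is printed in Lucene query syntax just to make the
structure visible; with the API you would nest the equivalent boolean
clauses instead of building a string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class MultiFieldExpander {
    // For each term, builds (field1:term OR field2:term^boost ...), then ANDs the groups.
    static String expandAll(List<String> terms, Map<String, Float> fieldBoosts) {
        List<String> groups = new ArrayList<>();
        for (String term : terms) {
            List<String> clauses = new ArrayList<>();
            for (Map.Entry<String, Float> e : fieldBoosts.entrySet()) {
                String clause = e.getKey() + ":" + term;
                // Only emit a boost suffix when it differs from the default of 1.0.
                if (e.getValue() != 1.0f) {
                    clause += "^" + e.getValue();
                }
                clauses.add(clause);
            }
            groups.add("(" + String.join(" OR ", clauses) + ")");
        }
        return String.join(" AND ", groups);
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("content", 1.0f);
        boosts.put("headline", 3.0f); // search-time boost, changeable without reindexing
        List<String> terms = new ArrayList<>();
        terms.add("a");
        terms.add("b");
        System.out.println(expandAll(terms, boosts));
    }
}
```

Since the boosts live in the map passed at search time, changing them
does not require touching the index.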
HTH,
sv
On Fri, 13 Aug 2004, Tate Avery wrote:
>
> Well, as far as I know you can boost 3 different things:
>
> - Field
> - Document
> - Query
>
> So, I think you need to craft a solution using one of those.
>
> Here are some possibilities for each:
>
> 1) Field
> - make a keyword field which is alongside your content field
> - boost your keyword field during indexing
> - expand user queries to search 'content' and 'keywords'
>
> 2) Document
> - I don't really think this one helps you in any way
>
> 3) Query
> - Scan a user query and selectively boost words that are known keywords
> - This requires a keyword list and is not really scalable
>
> That is all that comes to mind, at first glance. So, IMO, the winner IS #1.
>
> For example:
>
> // Index the headline as its own field and boost it at indexing time.
> Field _headline = Field.Text("headline", "...");
> _headline.setBoost(3.0f);
>
> Field _content = Field.Text("content", "...");
>
> _document.add(_headline);
> _document.add(_content);
>
>
> But, the tricky part is modifying queries to use both fields. If a user
> enters "virus", it is easy (i.e. "content:(virus) OR headline:(virus)").
> But, it quickly gets more complicated with more complex queries
> (especially boolean queries with AND and such). You would probably need
> something roughly like this: "a AND b" = "content:(a AND b) OR
> headline:(a AND b) OR (content:a AND headline:b) OR (headline:a AND
> content:b)", and so on.
>
> That's my 2 cents.
>
> T
>
>
>
> -----Original Message-----
> From: news [mailto:[EMAIL PROTECTED] Behalf Of Leos Literak
> Sent: Friday, August 13, 2004 8:52 AM
> To: [EMAIL PROTECTED]
> Subject: Re: boost keywords
>
>
> Gerard Sychay wrote:
> > Well, there is always the Lucene wiki. There's not a patterns page per
> > se, but you could start one..
>
> of course I could. If I had something to add :-)
>
> but back to my issue. No reaction? So many people using
> Lucene and no one knows? I would be grateful for any
> advice. Thanks
>
> Leos
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]