Re: Boosting results

2008-11-11 Thread Erik Hatcher


On Nov 11, 2008, at 8:32 AM, Stefan Trcek wrote:


On Tuesday 11 November 2008 02:18:39 Erik Hatcher wrote:


The integration won't be too painful... the main thing is that Solr
requires* some configuration files, literally on the filesystem, in
order to fire up and be happy.  And you'll need to craft Solr's
schema.xml to jive with how you indexed with pure Lucene.


Thanks Erik, I will give Solr a try. A list of files and classes I have
to use or supply to Solr will be appreciated. For now it is
- EmbeddedSolrServer
- SolrQuery
- schema.xml


Yeah, it'll look something like this: http://svn.apache.org/repos/asf/lucene/solr/branches/solr-ruby-refactoring/examples/solrjruby.rb 



That's JRuby code, but is easily translatable into pure Java.

Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Boosting results

2008-11-11 Thread Stefan Trcek
On Monday 10 November 2008 14:58:15 Mark Miller wrote:
  But: it's slow to load a field for the first time.  LUCENE-1231
  (column-stride fields) aims to greatly speed up the load time.

 Test it out though. In some recent testing I was doing it was *way*
 faster than I thought it would be based on what I had been reading.
 Of course if every term is unique, it's going to be worse, but even
 with like 10 mil docs and a few hundred thousand uniques, either I
 was doing something wrong, or even on my 4200rpm laptop hd, it loaded
 like nothing (of course even a second load and then a search is much
 slower than just a warmed search though). Was hoping to see some
 advantage with a payload implementation with LUCENE-831, but really
 didn't seem to...

Currently I have 50 mil docs maximum, but usually 5 mil or less, so this 
seems to work for me, too.

Stefan




Re: Boosting results

2008-11-11 Thread Stefan Trcek
On Tuesday 11 November 2008 02:18:39 Erik Hatcher wrote:

 The integration won't be too painful... the main thing is that Solr
 requires* some configuration files, literally on the filesystem, in
 order to fire up and be happy.  And you'll need to craft Solr's
 schema.xml to jive with how you indexed with pure Lucene.

Thanks Erik, I will give Solr a try. A list of files and classes I have 
to use or supply to Solr will be appreciated. For now it is
- EmbeddedSolrServer
- SolrQuery
- schema.xml

 That'll do the job, without a servlet engine.  But a servlet engine  
 can be mighty handy when you need to go to distributed search,  
 replication, etc.  But one can use Solr very much like using Lucene,
 API-only (but with config files).

Yes - for additional tasks you may use additional software or services, 
but I do not like to bloat the project for nothing.

Stefan




Re: Boosting results

2008-11-10 Thread Mark Miller

Michael McCandless wrote:


But: it's slow to load a field for the first time.  LUCENE-1231
(column-stride fields) aims to greatly speed up the load time.

Test it out though. In some recent testing I was doing it was *way*
faster than I thought it would be based on what I had been reading. Of
course if every term is unique, it's going to be worse, but even with
like 10 mil docs and a few hundred thousand uniques, either I was doing
something wrong, or even on my 4200rpm laptop hd, it loaded like nothing
(of course even a second load and then a search is much slower than just
a warmed search though). Was hoping to see some advantage with a payload
implementation with LUCENE-831, but really didn't seem to...


It's also memory-consuming.

Finally, you might want to instead look at Solr, which provides facet 
counting out of the box, rather than roll your own...


Mike

Stefan Trcek wrote:


On Friday 07 November 2008 18:46:17 Michael McCandless wrote:


Sorting populates the field cache (internal to Lucene) for that
field,   meaning it loads all values for all docs and holds them in
memory. This makes the first query slow, and, consumes RAM, in
proportion to how large your index is.


Can you direct me to the API how to access these cached values?
I'd like to have a function like: List all unique values of the
categories (A, B, C...) for documents that match this query.

i.e. for a query text:john show up categories=(A,B)

Doc 1: category=A text=john
Doc 2: category=B text=mary
Doc 3: category=B text=john
Doc 4: category=C text=mary

This is intended for search refinement (I use about 200 categories).
Sorry for hijacking this thread.

Stefan




Re: Boosting results

2008-11-10 Thread Michael McCandless


Well .. the FieldCache API is documented here (for 2.4.0):


http://lucene.apache.org/java/2_4_0/api/core/org/apache/lucene/search/FieldCache.html

E.g., you can load ints like this:

FieldCache.DEFAULT.getInts(reader, myfield);

This returns an array mapping docID -> int value for that field.  You
need to ensure that field has only 1 token per document (and that it  
parses to an int, for this example).
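
A minimal sketch of how such a docID-indexed array could answer the "unique categories for matching documents" question quoted in this message. It uses a plain Java array in place of a real FieldCache lookup (for a String field the analogous call would be FieldCache.DEFAULT.getStrings(reader, field)). The class name, the sample data, and the hard-coded hit list are assumptions for illustration only.

```java
import java.util.Map;
import java.util.TreeMap;

public class CategoryFacets {
    // Counts how often each category occurs among the matching documents.
    // categoryByDoc[docId] holds the single category value of that document,
    // mirroring the docID -> value array described above.
    static Map<String, Integer> countCategories(String[] categoryByDoc, int[] matchingDocs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int docId : matchingDocs) {
            counts.merge(categoryByDoc[docId], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Docs 1..4 from the example, stored at array indexes 0..3.
        String[] categoryByDoc = {"A", "B", "B", "C"};
        // Hypothetical hits for text:john (Doc 1 and Doc 3).
        int[] hits = {0, 2};
        System.out.println(countCategories(categoryByDoc, hits).keySet()); // prints [A, B]
    }
}
```

The loop touches only the matching documents, so the per-query cost depends on the number of hits; the memory cost is the full array, one entry per document in the index.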


But: it's slow to load a field for the first time.  LUCENE-1231  
(column-stride fields) aims to greatly speed up the load time.


It's also memory-consuming.

Finally, you might want to instead look at Solr, which provides facet  
counting out of the box, rather than roll your own...


Mike

Stefan Trcek wrote:


On Friday 07 November 2008 18:46:17 Michael McCandless wrote:


Sorting populates the field cache (internal to Lucene) for that
field,   meaning it loads all values for all docs and holds them in
memory. This makes the first query slow, and, consumes RAM, in
proportion to how large your index is.


Can you direct me to the API how to access these cached values?
I'd like to have a function like: List all unique values of the
categories (A, B, C...) for documents that match this query.

i.e. for a query text:john show up categories=(A,B)

Doc 1: category=A text=john
Doc 2: category=B text=mary
Doc 3: category=B text=john
Doc 4: category=C text=mary

This is intended for search refinement (I use about 200 categories).
Sorry for hijacking this thread.

Stefan




Re: Boosting results

2008-11-10 Thread Stefan Trcek
On Friday 07 November 2008 18:46:17 Michael McCandless wrote:

 Sorting populates the field cache (internal to Lucene) for that
 field,   meaning it loads all values for all docs and holds them in
 memory. This makes the first query slow, and, consumes RAM, in
 proportion to how large your index is.

Can you direct me to the API for accessing these cached values?
I'd like to have a function like: list all unique values of the
categories (A, B, C...) for documents that match this query.

E.g., for the query text:john, show categories=(A,B)

Doc 1: category=A text=john
Doc 2: category=B text=mary
Doc 3: category=B text=john
Doc 4: category=C text=mary

This is intended for search refinement (I use about 200 categories).
Sorry for hijacking this thread.

Stefan




Re: Boosting results

2008-11-10 Thread Stefan Trcek
On Monday 10 November 2008 13:55:31 Michael McCandless wrote:

 Finally, you might want to instead look at Solr, which provides facet
 counting out of the box, rather than roll your own...

Doooh - new api, but its facet counting sounds good.

Any starting points for moving from plain lucene to Solr in a smooth 
way? I doubt whether it is possible to integrate the facet counting 
part of Solr into my plain lucene application?

For searching: Do I have to have a Solr server (servlet engine) running 
or will EmbeddedSolrServer and SolrQuery do the job?

For indexing: Can I use a ready to use lucene index in Solr?

Stefan




Re: Boosting results

2008-11-10 Thread Erik Hatcher


On Nov 10, 2008, at 2:42 PM, Stefan Trcek wrote:

On Monday 10 November 2008 13:55:31 Michael McCandless wrote:


Finally, you might want to instead look at Solr, which provides facet
counting out of the box, rather than roll your own...


Doooh - new api, but its facet counting sounds good.

Any starting points for moving from plain lucene to Solr in a smooth
way? I doubt whether it is possible to integrate the facet counting
part of Solr into my plain lucene application?


The integration won't be too painful... the main thing is that Solr  
requires* some configuration files, literally on the filesystem, in  
order to fire up and be happy.  And you'll need to craft Solr's  
schema.xml to jibe with how you indexed with pure Lucene.


For searching: Do I have to have a Solr server (servlet engine) running
or will EmbeddedSolrServer and SolrQuery do the job?


That'll do the job, without a servlet engine.  But a servlet engine  
can be mighty handy when you need to go to distributed search,  
replication, etc.  But one can use Solr very much like using Lucene,  
API-only (but with config files).



For indexing: Can I use a ready to use lucene index in Solr?


Yup, see above.

Erik





Re: Boosting results

2008-11-07 Thread Erick Erickson
D'oh, sorting. I absolutely love it when I overlook the obvious <G>.

[EMAIL PROTECTED]

On Fri, Nov 7, 2008 at 4:58 AM, Michael McCandless 
[EMAIL PROTECTED] wrote:


 Couldn't you just do a single Query that sorts first by category and second
 by relevance?

 Mike


 Erick Erickson wrote:

  It seems to me that the easiest thing would be to fire two queries and
 then just concatenate the results

 category:A AND body:fred

 category:B AND body:fred


 If you really, really didn't want to fire two queries, you could create
 filters on category A and category B and make a couple of
 passes through your results seeing if the returned documents were in
 the filter, but you'd still concatenate the results. Actually in your
 specific example you could make one filter on A.

 You could also consider a custom scorer that, added 1,000,000 to every
 category A document.

 How much were you boosting by? What happens if you boost by a very large
 factor?
 As in ridiculously large?

 Best
 Erick

 On Thu, Nov 6, 2008 at 7:42 PM, Scott Smith [EMAIL PROTECTED]
 wrote:

  I'm interested in comments on the following problem.



 I have a set of documents.  They fall into 3 categories.  Call these
 categories A, B, and C.  Each document has an indexed, non-tokenized
 field called category which contains A, B, or C (they are mutually
 exclusive categories).



 All of the documents contain a field called body which contains a
 bunch of text.  This field is indexed and tokenized.



 So, I want to do a search which looks something like:



 (category:A OR category:B) AND body:fred



 I want all of the category A documents to come before the category B
 documents.  Effectively, I want to have the category A documents first
 (sorted by relevancy) and then the category B documents after (sorted by
 relevancy).



 I thought I could do this by boosting the category portion of the query,
 but that doesn't seem to work consistently.  I was setting the boost on
 the category A term to 1.0 and the boost on the category B term to 0.0.



 Any thoughts how to skin this?



 Scott








Re: Boosting results

2008-11-07 Thread Michael McCandless


Couldn't you just do a single Query that sorts first by category and  
second by relevance?
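
A rough sketch of the ordering this produces. In Lucene 2.4 it would presumably be expressed with Sort and SortField (a STRING SortField on category followed by SortField.FIELD_SCORE); the block below is only a plain-Java model of the resulting order, not Lucene code. The Hit class and the sample scores are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CategoryThenScore {
    // Hypothetical stand-in for a search hit: its category and relevance score.
    static final class Hit {
        final String category;
        final float score;
        Hit(String category, float score) { this.category = category; this.score = score; }
        @Override public String toString() { return category + ":" + score; }
    }

    // Primary key: category ascending (A before B).
    // Secondary key: score descending (most relevant first).
    static final Comparator<Hit> CATEGORY_THEN_SCORE =
        Comparator.comparing((Hit h) -> h.category)
                  .thenComparing(Comparator.comparingDouble((Hit h) -> h.score).reversed());

    // Returns a sorted copy and leaves the input list untouched.
    static List<Hit> sorted(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(CATEGORY_THEN_SCORE);
        return copy;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("B", 0.9f), new Hit("A", 0.2f), new Hit("A", 0.7f), new Hit("B", 0.4f));
        System.out.println(sorted(hits)); // prints [A:0.7, A:0.2, B:0.9, B:0.4]
    }
}
```

Because category is the primary key, the score never has to outweigh anything; it only orders documents inside each category.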


Mike

Erick Erickson wrote:


It seems to me that the easiest thing would be to fire two queries and
then just concatenate the results

category:A AND body:fred

category:B AND body:fred


If you really, really didn't want to fire two queries, you could create
filters on category A and category B and make a couple of
passes through your results seeing if the returned documents were in
the filter, but you'd still concatenate the results. Actually in your
specific example you could make one filter on A.

You could also consider a custom scorer that, added 1,000,000 to every
category A document.

How much were you boosting by? What happens if you boost by a very large
factor?
As in ridiculously large?

Best
Erick

On Thu, Nov 6, 2008 at 7:42 PM, Scott Smith [EMAIL PROTECTED] wrote:



I'm interested in comments on the following problem.



I have a set of documents.  They fall into 3 categories.  Call these
categories A, B, and C.  Each document has an indexed, non-tokenized
field called category which contains A, B, or C (they are mutually
exclusive categories).



All of the documents contain a field called body which contains a
bunch of text.  This field is indexed and tokenized.



So, I want to do a search which looks something like:



(category:A OR category:B) AND body:fred



I want all of the category A documents to come before the category B
documents.  Effectively, I want to have the category A documents first
(sorted by relevancy) and then the category B documents after (sorted by
relevancy).



I thought I could do this by boosting the category portion of the query,
but that doesn't seem to work consistently.  I was setting the boost on
the category A term to 1.0 and the boost on the category B term to 0.0.




Any thoughts how to skin this?



Scott








Re: Boosting results

2008-11-07 Thread Matthew DeLoria
This actually brings up an interesting question, and something I have been
curious about.

In this case, does it make more sense to do Boosting by Category, or to do
sorting? From what I understand, Lucene sorting involves putting the
relevant fields into memory, and then executing a sort.

Is this how sorting actually works in Lucene? If so, is it even a good idea
considering the large data sets in Lucene? What would really be the
difference between sorting and boosting?

M

On Fri, Nov 7, 2008 at 7:59 AM, Erick Erickson [EMAIL PROTECTED] wrote:

 dh, sorting. I absolutely love it when I overlook the obvious G.

 [EMAIL PROTECTED]

 On Fri, Nov 7, 2008 at 4:58 AM, Michael McCandless 
 [EMAIL PROTECTED] wrote:

 
  Couldn't you just do a single Query that sorts first by category and
 second
  by relevance?
 
  Mike
 
 
  Erick Erickson wrote:
 
   It seems to me that the easiest thing would be to fire two queries and
  then just concatenate the results
 
  category:A AND body:fred
 
  category:B AND body:fred
 
 
  If you really, really didn't want to fire two queries, you could create
  filters on category A and category B and make a couple of
  passes through your results seeing if the returned documents were in
  the filter, but you'd still concatenate the results. Actually in your
  specific example you could make one filter on A.
 
  You could also consider a custom scorer that, added 1,000,000 to every
  category A document.
 
  How much were you boosting by? What happens if you boost by a very large
  factor?
  As in ridiculously large?
 
  Best
  Erick
 
  On Thu, Nov 6, 2008 at 7:42 PM, Scott Smith [EMAIL PROTECTED]
  wrote:
 
   I'm interested in comments on the following problem.
 
 
 
  I have a set of documents.  They fall into 3 categories.  Call these
  categories A, B, and C.  Each document has an indexed, non-tokenized
  field called category which contains A, B, or C (they are mutually
  exclusive categories).
 
 
 
  All of the documents contain a field called body which contains a
  bunch of text.  This field is indexed and tokenized.
 
 
 
  So, I want to do a search which looks something like:
 
 
 
  (category:A OR category:B) AND body:fred
 
 
 
  I want all of the category A documents to come before the category B
  documents.  Effectively, I want to have the category A documents first
  (sorted by relevancy) and then the category B documents after (sorted
 by
  relevancy).
 
 
 
  I thought I could do this by boosting the category portion of the
 query,
  but that doesn't seem to work consistently.  I was setting the boost on
  the category A term to 1.0 and the boost on the category B term to 0.0.
 
 
 
  Any thoughts how to skin this?
 
 
 
  Scott
 
 
 
 
 
 




-- 
Matthew P. DeLoria
[EMAIL PROTECTED]


RE: Boosting results

2008-11-07 Thread Scott Smith
Well, it's not like sorting hadn't occurred to me.  Unfortunately, what
I recalled was that you could only sort results on one field (I do date
sorted searches all the time in my application).  I should have gone
back and looked.  My memory failed me as I can see that you can sort on
multiple fields and score (aka relevancy) is one of the pseudo fields.
That'll work.

Thanks.

Scott

-----Original Message-----
From: Erick Erickson [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 07, 2008 5:59 AM
To: java-user@lucene.apache.org
Subject: Re: Boosting results

dh, sorting. I absolutely love it when I overlook the obvious G.

[EMAIL PROTECTED]

On Fri, Nov 7, 2008 at 4:58 AM, Michael McCandless 
[EMAIL PROTECTED] wrote:


 Couldn't you just do a single Query that sorts first by category and
second
 by relevance?

 Mike


 Erick Erickson wrote:

  It seems to me that the easiest thing would be to fire two queries
and
 then just concatenate the results

 category:A AND body:fred

 category:B AND body:fred


 If you really, really didn't want to fire two queries, you could
create
 filters on category A and category B and make a couple of
 passes through your results seeing if the returned documents were in
 the filter, but you'd still concatenate the results. Actually in your
 specific example you could make one filter on A.

 You could also consider a custom scorer that, added 1,000,000 to
every
 category A document.

 How much were you boosting by? What happens if you boost by a very
large
 factor?
 As in ridiculously large?

 Best
 Erick

 On Thu, Nov 6, 2008 at 7:42 PM, Scott Smith
[EMAIL PROTECTED]
 wrote:

  I'm interested in comments on the following problem.



 I have a set of documents.  They fall into 3 categories.  Call these
 categories A, B, and C.  Each document has an indexed, non-tokenized
 field called category which contains A, B, or C (they are mutually
 exclusive categories).



 All of the documents contain a field called body which contains a
 bunch of text.  This field is indexed and tokenized.



 So, I want to do a search which looks something like:



 (category:A OR category:B) AND body:fred



 I want all of the category A documents to come before the category B
 documents.  Effectively, I want to have the category A documents
first
 (sorted by relevancy) and then the category B documents after
(sorted by
 relevancy).



 I thought I could do this by boosting the category portion of the
query,
 but that doesn't seem to work consistently.  I was setting the boost
on
 the category A term to 1.0 and the boost on the category B term to
0.0.



 Any thoughts how to skin this?



 Scott







Re: Boosting results

2008-11-07 Thread Michael McCandless


This is a good point.

Sorting populates the field cache (internal to Lucene) for that field,
meaning it loads all values for all docs and holds them in memory.
This makes the first query slow and consumes RAM in proportion to
how large your index is.


Whereas boosting should be able to achieve the use case without these  
limitations.


Mike

Matthew DeLoria wrote:

This actually brings up an interesting question, and something I have been
curious about.

In this case, does it make more sense to do Boosting by Category, or to do
sorting? From what I understand, Lucene sorting involves putting the
relevant fields into memory, and then executing a sort.

Is this how sorting actually works in Lucene? If so, is it even a good idea
considering the large data sets in Lucene? What would really be the
difference between sorting and boosting?

M

On Fri, Nov 7, 2008 at 7:59 AM, Erick Erickson [EMAIL PROTECTED] wrote:

D'oh, sorting. I absolutely love it when I overlook the obvious <G>.


[EMAIL PROTECTED]

On Fri, Nov 7, 2008 at 4:58 AM, Michael McCandless 
[EMAIL PROTECTED] wrote:



Couldn't you just do a single Query that sorts first by category and
second by relevance?

Mike


Erick Erickson wrote:

It seems to me that the easiest thing would be to fire two queries and
then just concatenate the results

category:A AND body:fred

category:B AND body:fred


If you really, really didn't want to fire two queries, you could create
filters on category A and category B and make a couple of
passes through your results seeing if the returned documents were in
the filter, but you'd still concatenate the results. Actually in your
specific example you could make one filter on A.

You could also consider a custom scorer that, added 1,000,000 to every
category A document.

How much were you boosting by? What happens if you boost by a very large
factor?
As in ridiculously large?

Best
Erick

On Thu, Nov 6, 2008 at 7:42 PM, Scott Smith [EMAIL PROTECTED] wrote:


I'm interested in comments on the following problem.




I have a set of documents.  They fall into 3 categories.  Call these
categories A, B, and C.  Each document has an indexed, non-tokenized
field called category which contains A, B, or C (they are mutually
exclusive categories).

All of the documents contain a field called body which contains a
bunch of text.  This field is indexed and tokenized.

So, I want to do a search which looks something like:

(category:A OR category:B) AND body:fred

I want all of the category A documents to come before the category B
documents.  Effectively, I want to have the category A documents first
(sorted by relevancy) and then the category B documents after (sorted by
relevancy).

I thought I could do this by boosting the category portion of the query,
but that doesn't seem to work consistently.  I was setting the boost on
the category A term to 1.0 and the boost on the category B term to 0.0.




Any thoughts how to skin this?



Scott













--
Matthew P. DeLoria
[EMAIL PROTECTED]






Re: Boosting results

2008-11-07 Thread Peter Keegan
If you sort first by score, keep in mind that the raw scores are very
precise and you could see many unique values in the result set. The
secondary sort field would only be used to break equal scores. We had to use
a custom comparator to 'smooth out' the scores to allow the second field to
take effect.
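
One way such smoothing might look, sketched in plain Java: round each raw score into a coarse bucket before comparing, so nearly equal scores tie and the secondary field decides. This is not the comparator Peter describes, only an illustration of the idea; the Hit class, the date field used as the secondary key, and the 0.1 bucket width are all assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SmoothedScoreSort {
    // Hypothetical hit: an id, a raw relevance score, and a secondary sort key.
    static final class Hit {
        final String id;
        final float score;
        final long date;
        Hit(String id, float score, long date) { this.id = id; this.score = score; this.date = date; }
        @Override public String toString() { return id; }
    }

    // Collapses raw scores into buckets of the given width so that
    // nearly equal scores compare as equal.
    static int bucket(float score, float width) {
        return (int) Math.floor(score / width);
    }

    // Higher bucket first; within a bucket, newer date first.
    static Comparator<Hit> smoothed(float width) {
        return Comparator.comparingInt((Hit h) -> bucket(h.score, width)).reversed()
                .thenComparing(Comparator.comparingLong((Hit h) -> h.date).reversed());
    }

    // Returns a sorted copy and leaves the input list untouched.
    static List<Hit> sorted(List<Hit> hits, float width) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(smoothed(width));
        return copy;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a", 0.81f, 300L));
        hits.add(new Hit("b", 0.83f, 100L));
        hits.add(new Hit("c", 0.52f, 900L));
        // With raw scores b would lead; with 0.1-wide buckets a and b tie
        // and the newer document a comes first.
        System.out.println(sorted(hits, 0.1f)); // prints [a, b, c]
    }
}
```

The bucket width is the tuning knob: wider buckets let the secondary field decide more often, narrower ones stay closer to pure relevance order.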

Peter


On Fri, Nov 7, 2008 at 11:17 AM, Scott Smith [EMAIL PROTECTED] wrote:

 Well, it's not like sorting hadn't occurred to me.  Unfortunately, what
 I recalled was that you could only sort results on one field (I do date
 sorted searches all the time in my application).  I should have gone
 back and looked.  My memory failed me as I can see that you can sort on
 multiple fields and score (aka relevancy) is one of the pseudo fields.
 That'll work.

 Thanks.

 Scott

 -----Original Message-----
 From: Erick Erickson [mailto:[EMAIL PROTECTED]
 Sent: Friday, November 07, 2008 5:59 AM
 To: java-user@lucene.apache.org
 Subject: Re: Boosting results

 dh, sorting. I absolutely love it when I overlook the obvious G.

 [EMAIL PROTECTED]

 On Fri, Nov 7, 2008 at 4:58 AM, Michael McCandless 
 [EMAIL PROTECTED] wrote:

 
  Couldn't you just do a single Query that sorts first by category and
 second
  by relevance?
 
  Mike
 
 
  Erick Erickson wrote:
 
   It seems to me that the easiest thing would be to fire two queries
 and
  then just concatenate the results
 
  category:A AND body:fred
 
  category:B AND body:fred
 
 
  If you really, really didn't want to fire two queries, you could
 create
  filters on category A and category B and make a couple of
  passes through your results seeing if the returned documents were in
  the filter, but you'd still concatenate the results. Actually in your
  specific example you could make one filter on A.
 
  You could also consider a custom scorer that, added 1,000,000 to
 every
  category A document.
 
  How much were you boosting by? What happens if you boost by a very
 large
  factor?
  As in ridiculously large?
 
  Best
  Erick
 
  On Thu, Nov 6, 2008 at 7:42 PM, Scott Smith
 [EMAIL PROTECTED]
  wrote:
 
   I'm interested in comments on the following problem.
 
 
 
  I have a set of documents.  They fall into 3 categories.  Call these
  categories A, B, and C.  Each document has an indexed, non-tokenized
  field called category which contains A, B, or C (they are mutually
  exclusive categories).
 
 
 
  All of the documents contain a field called body which contains a
  bunch of text.  This field is indexed and tokenized.
 
 
 
  So, I want to do a search which looks something like:
 
 
 
  (category:A OR category:B) AND body:fred
 
 
 
  I want all of the category A documents to come before the category B
  documents.  Effectively, I want to have the category A documents
 first
  (sorted by relevancy) and then the category B documents after
 (sorted by
  relevancy).
 
 
 
  I thought I could do this by boosting the category portion of the
 query,
  but that doesn't seem to work consistently.  I was setting the boost
 on
  the category A term to 1.0 and the boost on the category B term to
 0.0.
 
 
 
  Any thoughts how to skin this?
 
 
 
  Scott
 
 
 
 




Boosting results

2008-11-06 Thread Scott Smith
I'm interested in comments on the following problem.  

 

I have a set of documents.  They fall into 3 categories.  Call these
categories A, B, and C.  Each document has an indexed, non-tokenized
field called category which contains A, B, or C (they are mutually
exclusive categories).  

 

All of the documents contain a field called body which contains a
bunch of text.  This field is indexed and tokenized.

 

So, I want to do a search which looks something like:

 

(category:A OR category:B) AND body:fred

 

I want all of the category A documents to come before the category B
documents.  Effectively, I want to have the category A documents first
(sorted by relevancy) and then the category B documents after (sorted by
relevancy).

 

I thought I could do this by boosting the category portion of the query,
but that doesn't seem to work consistently.  I was setting the boost on
the category A term to 1.0 and the boost on the category B term to 0.0.

 

Any thoughts how to skin this?

 

Scott



Re: Boosting results

2008-11-06 Thread Erick Erickson
It seems to me that the easiest thing would be to fire two queries and
then just concatenate the results

category:A AND body:fred

category:B AND body:fred


If you really, really didn't want to fire two queries, you could create
filters on category A and category B and make a couple of
passes through your results seeing if the returned documents were in
the filter, but you'd still concatenate the results. Actually in your
specific example you could make one filter on A.

You could also consider a custom scorer that added 1,000,000 to every
category A document.

How much were you boosting by? What happens if you boost by a very large
factor?
As in ridiculously large?
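
To illustrate why the size of the boost matters, here is a deliberately simplified model in which a document's total score is its category contribution plus its body score. This is not Lucene's actual scoring formula (which also involves normalization and coordination factors), and all names and numbers below are hypothetical.

```java
public class BoostVersusOffset {
    // Simplified additive model: total = category contribution + body relevance.
    static double total(double categoryPart, double bodyScore) {
        return categoryPart + bodyScore;
    }

    // True if the category A document ranks strictly above the category B document.
    static boolean aRanksFirst(double aCategoryPart, double aBody,
                               double bCategoryPart, double bBody) {
        return total(aCategoryPart, aBody) > total(bCategoryPart, bBody);
    }

    public static void main(String[] args) {
        // Hypothetical body:fred scores: a weak match in A, a strong match in B.
        double weakA = 0.3;
        double strongB = 2.5;
        // Small category contribution: the strong B match overtakes the weak A match.
        System.out.println(aRanksFirst(1.0, weakA, 0.0, strongB));         // prints false
        // Offset larger than any body score: A comes first regardless.
        System.out.println(aRanksFirst(1_000_000.0, weakA, 0.0, strongB)); // prints true
    }
}
```

Under this model the A-before-B order only holds when the category contribution exceeds the largest difference in body scores, which would explain boosting that works on some queries and not others.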

Best
Erick

On Thu, Nov 6, 2008 at 7:42 PM, Scott Smith [EMAIL PROTECTED] wrote:

 I'm interested in comments on the following problem.



 I have a set of documents.  They fall into 3 categories.  Call these
 categories A, B, and C.  Each document has an indexed, non-tokenized
 field called category which contains A, B, or C (they are mutually
 exclusive categories).



 All of the documents contain a field called body which contains a
 bunch of text.  This field is indexed and tokenized.



 So, I want to do a search which looks something like:



 (category:A OR category:B) AND body:fred



 I want all of the category A documents to come before the category B
 documents.  Effectively, I want to have the category A documents first
 (sorted by relevancy) and then the category B documents after (sorted by
 relevancy).



 I thought I could do this by boosting the category portion of the query,
 but that doesn't seem to work consistently.  I was setting the boost on
 the category A term to 1.0 and the boost on the category B term to 0.0.



 Any thoughts how to skin this?



 Scott