Re: Aggregating category hits

2006-06-14 Thread Peter Keegan

The performance results in my previous posting were based on an
implementation that performs two searches: one for getting 'Hits' and
another for getting the BitSet. I reimplemented this as one search using the
code in 'SolrIndexSearcher.getDocListAndSetNC', and I'm now getting
throughput of 350-375 qps.
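
For readers who have not looked at the Solr code mentioned above, here is a
minimal sketch of the single-pass idea: one HitCollector that fills a BitSet
for the full match set while keeping a top-N heap for the ranked list. The
class and field names are illustrative; this is not Solr's actual
getDocListAndSetNC implementation.

    import java.util.BitSet;
    import java.util.Comparator;
    import java.util.PriorityQueue;
    import org.apache.lucene.search.HitCollector;

    public class DocListAndSetCollector extends HitCollector {
        static final class ScoredDoc {
            final int doc; final float score;
            ScoredDoc(int doc, float score) { this.doc = doc; this.score = score; }
        }

        private final BitSet bits;            // full match set (the "DocSet")
        private final int topN;               // ranked docs to keep (the "DocList")
        private final PriorityQueue<ScoredDoc> queue;  // min-heap by score

        public DocListAndSetCollector(int maxDoc, int topN) {
            this.bits = new BitSet(maxDoc);
            this.topN = topN;
            this.queue = new PriorityQueue<ScoredDoc>(topN, new Comparator<ScoredDoc>() {
                public int compare(ScoredDoc a, ScoredDoc b) {
                    return Float.compare(a.score, b.score);
                }
            });
        }

        public void collect(int doc, float score) {
            bits.set(doc);                    // every hit joins the set
            if (queue.size() < topN) {
                queue.add(new ScoredDoc(doc, score));
            } else if (score > queue.peek().score) {
                queue.poll();                 // evict the weakest of the top N
                queue.add(new ScoredDoc(doc, score));
            }
        }

        public BitSet getDocSet() { return bits; }
        public PriorityQueue<ScoredDoc> getDocList() { return queue; }
    }

A single searcher.search(query, collector) call then yields both the ranked
page and the set used for facet counting, instead of two searches.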

This is great stuff, Solr guys! I'd love to see the DocSet and DocList
features added to Lucene's IndexSearcher.

Peter

On 6/12/06, Peter Keegan [EMAIL PROTECTED] wrote:
[...]



Re: Aggregating category hits

2006-06-12 Thread Peter Keegan

I'm seeing query throughput of approx. 290 qps with OpenBitSet vs. 270 with
BitSet. I had to reduce the max HashDocSet size to 2K-3K (from 10K-20K) to
get the optimal tradeoff.

no. docs in index: 730,000
average no. results returned: 40
average response time: 50 msec (15-20 for counting facets)
no. facets: 100 on every query

I'm not using the Solr server, as we have already developed our own
infrastructure.

Peter


On 6/10/06, Yonik Seeley [EMAIL PROTECTED] wrote:
[...]




Re: Aggregating category hits

2006-06-10 Thread zzzzz shalev
Hi Peter,

Two quick questions:

1. Could you let me know what kind of response time you were getting with
Solr (as well as the size of the data and the result sizes)?

2. I took a really, really quick look at DocSetHitCollector and saw the
dreaded

  if (bits==null) bits = new BitSet(maxDoc);

line of code.

Since I rewrote some Lucene code to support 64-bit search instances, I have
indexes that may reach quite a few GBs; allocating bitsets (arrays of longs)
is quite expensive memory-wise, and I am still a little skeptical about
performance with large result sets.

I did some testing of my facet impl, and after an overnight webload session
I got about a 500 ms average response time for full faceting (with result
sets from a few thousand to over 100,000).

Would really like to hear your views.

Thanks,
  

Peter Keegan [EMAIL PROTECTED] wrote:
[...]

Re: Aggregating category hits

2006-06-10 Thread Yonik Seeley

On 6/10/06, z shalev [EMAIL PROTECTED] wrote:

  1. Could you let me know what kind of response time you were getting with
Solr (as well as the size of the data and the result sizes)?


I can tell you a little bit about ours... on one CNET faceted browsing
implementation using Solr, the number of facets to check per request
averages somewhere between 100 and 200 (the total number of unique
facets is much larger though).  The median request time is 3ms (and I
don't think the majority of that time is spent calculating set
intersections).

We actually don't have the LRUCaches set large enough to achieve a
100% hit rate, but performance is still fine.


  2. I took a really, really quick look at DocSetHitCollector and saw the
dreaded

  if (bits==null) bits = new BitSet(maxDoc);


Yes, DocSets can be memory intensive.  A BitSet is only used when the
number of results gets larger than a threshold... below that, a
HashDocSet is used that is O(n) rather than O(maxDoc).  So the memory
footprint also depends on the cardinality of the sets.
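
A minimal sketch of that size-based choice, assuming the hit doc ids have
been collected into an int[]. Solr's real classes are HashDocSet and
BitDocSet (both behind a common DocSet interface); maxHashSize here plays
the role of the HashDocSet MAX_SIZE cutoff discussed in this thread.

    import java.util.BitSet;
    import java.util.HashSet;
    import java.util.Set;

    // Returns a HashSet for small results and a BitSet for large ones.
    // (In Solr both variants implement one DocSet interface.)
    static Object makeDocSet(int[] docIds, int numDocs, int maxDoc, int maxHashSize) {
        if (numDocs <= maxHashSize) {
            // small result: O(numDocs) memory, independent of index size
            Set<Integer> small = new HashSet<Integer>(numDocs * 2);
            for (int i = 0; i < numDocs; i++) small.add(Integer.valueOf(docIds[i]));
            return small;
        }
        // large result: one bit per doc in the index, maxDoc/8 bytes
        BitSet bits = new BitSet(maxDoc);
        for (int i = 0; i < numDocs; i++) bits.set(docIds[i]);
        return bits;
    }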


  Since I rewrote some Lucene code to support 64-bit search instances, I have
indexes that may reach quite a few GBs,


GBs of index size, or actually billions of documents?  It's the number
of documents that matters in this case.


  Allocating bitsets (arrays of longs) is quite expensive memory-wise, and I
am still a little skeptical about performance with large result sets.


I just checked in a replacement for BitSet that takes intersection
counts much faster.
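
The intersection-count trick, sketched here under the assumption of two
long[] bit vectors with the same word layout: AND corresponding words and
popcount them, without materializing the intersection. This shows the idea,
not OpenBitSet's actual code.

    // Count bits common to two bit vectors without building a result set.
    static long intersectionCount(long[] a, long[] b) {
        int n = Math.min(a.length, b.length);
        long count = 0;
        for (int i = 0; i < n; i++) {
            count += Long.bitCount(a[i] & b[i]);  // popcount of shared bits
        }
        return count;
    }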


  I did some testing of my facet impl, and after an overnight webload session
I got about a 500 ms average response time for full faceting (with result
sets from a few thousand to over 100,000).


How many documents was that with, and how many facets per document?

I certainly am interested in more memory efficient faceted browsing,
and have been meaning to try some alternatives.  So far, we've had
good results using cached DocSets though.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: Aggregating category hits

2006-06-10 Thread Yonik Seeley

On 6/9/06, Peter Keegan [EMAIL PROTECTED] wrote:

However, my throughput testing shows that the Solr method is at least 50%
faster than mine. I'm seeing a big win with the use of the HashDocSet for
lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K seems
to provide optimal performance.


Interesting... how many documents are in your collection?
It would probably be nice to make the HashDocSet cut-off dynamic rather
than fixed.
Are you using Solr, or just some of its code?


 I'm looking forward to trying this with
OpenBitSet.


I checked in the OpenBitSet changes today.  I imagine this will lower
the optimal max HashDocSet size for performance a little.  You might
not see much performance improvement if most of the intersections
involved a HashDocSet... the OpenBitSet improvements only kick in with
bitset-bitset intersection counts.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: Aggregating category hits

2006-06-10 Thread Chris Hostetter

: I can tell you a little bit about ours... on one CNET faceted browsing
: implementation using Solr, the number of facets to check per request
: averages somewhere between 100 and 200 (the total number of unique
: facets is much larger though).  The median request time is 3ms (and I
: don't think the majority of that time is spent calculating set
: intersections).

I've also posted some other numbers for a different CNET page which
doesn't seem very faceted but actually does many more DocSet
intersections to rank categories (1500-2000 per request), and does a
lot more complicated main search queries (really big nested
Boolean/DisjunctionMax/Phrase queries).  The average request time there
was 30ms ... but more importantly (to me and the people I work with,
anyway) the 99.9th percentile was only 456ms.

http://wiki.apache.org/solr/SolrPerformanceData

(note: this was before the OpenBitSet changes Yonik just committed)


-Hoss





Re: Aggregating category hits

2006-06-10 Thread zzzzz shalev
Hi Yonik,

Thanks for the thorough reply.

A few more quick questions...

  the number of facets to check per request
  averages somewhere between 100 and 200 (the total number of unique
  facets is much larger though).

You mean 100-200 different categories to facet?

I ran the test on a 600,000 doc index; however, the cool thing about my
solution is that the total doc count is not too relevant. I will be checking
this with much larger indexes, probably 10x the size of my initial testing,
and algorithmically I don't expect too much of a performance dropoff,
because response time is affected by the result set size and not by the
index size (since I cache all faceted values on startup).

As for the 500 ms, this is basically what I do in that time:

1. In each search instance: initially send a query and return the top 100
docs. Start a separate thread to collect full facet values (I do this by
resending the same query with maxDoc as the number of results to return;
can I save this requerying somehow?)

2. Then merge all instances' docs using a custom parallel multi-searcher.

3. For the top 100 docs, calculate which doc came from which instance.

4. Send the doc ids back to each instance and have each instance create
facets on its docs from the top 100.

5. Each instance returns this info; I then go back to the instances and pass
them the top 20 terms of each facet for the actual facet counts...

I do this so that the facet counts I display are from good docs. I am trying
to avoid a situation where I receive 5,000 results and 4,500 of them, with
awful rankings, have the same facet values, and therefore the facets
displayed in the UI come from badly ranked docs.

Confusing?

However, I will look into your impl; it sounds solid. I am currently on
Lucene 1.4.3 (which classes should I look into in Solr?)

Comments welcomed.

Thanks in advance!

Yonik Seeley [EMAIL PROTECTED] wrote:
[...]

Re: Aggregating category hits

2006-06-10 Thread Yonik Seeley

On 6/10/06, z shalev [EMAIL PROTECTED] wrote:

  the number of facets to check per request
averages somewhere between 100 and 200 (the total number of unique
facets is much larger though).

  You mean 100-200 different categories to facet?


I was going by memory, but 100 to 200 set intersections (specific
counts to gather) per individual request.


  I ran the test on a 600,000 doc index; however, the cool thing about my
solution is that the total doc count is not too relevant.


Yeah, you can always get faster if approximations are OK.


  I do this so that the facet counts I display are from good docs. I am
trying to avoid a situation where I receive 5,000 results and 4,500 of them,
with awful rankings, have the same facet values, and therefore the facets
displayed in the UI come from badly ranked docs.

  Confusing?


Actually, that sounds like it can make a lot of sense, as long as you
can approximate what is good vs. bad.


  However, I will look into your impl; it sounds solid. I am currently on
Lucene 1.4.3 (which classes should I look into in Solr?)


Solr doesn't do faceted browsing for you yet.  It provides caching of
sets and fast set intersections.  You currently need code to tell Solr
what intersection counts to get.

DocSetHitCollector and HashDocSet + BitDocSet would only help you if
you cache sets of docids.
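
Roughly the per-request code a caller supplies, sketched assuming
SolrIndexSearcher.getDocSet(Query) and a DocSet intersection-size operation;
treat the exact names and signatures as approximations for the 2006
codebase.

    import org.apache.lucene.search.Query;
    import org.apache.solr.search.DocSet;
    import org.apache.solr.search.SolrIndexSearcher;

    // For each facet query, intersect its cached DocSet with the main
    // query's DocSet and record the cardinality.
    static int[] facetCounts(SolrIndexSearcher searcher, Query mainQuery,
                             Query[] facetQueries) throws java.io.IOException {
        DocSet queryDocs = searcher.getDocSet(mainQuery);   // cached by Solr
        int[] counts = new int[facetQueries.length];
        for (int i = 0; i < facetQueries.length; i++) {
            counts[i] = queryDocs.intersectionSize(searcher.getDocSet(facetQueries[i]));
        }
        return counts;
    }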


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: Aggregating category hits

2006-06-09 Thread Peter Keegan

I compared Solr's DocSetHitCollector and counting bitset intersections to
get facet counts with a different approach that uses a custom hit collector
that tests each docid hit (bit) against each facet's bitset and increments a
count in a histogram. My assumption was that for queries with few hits, this
would be much faster than always doing bitset intersections/cardinality for
every facet all the time.
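
A hedged reconstruction of the collector described above; the names are
illustrative, not Peter's actual code.

    import java.util.BitSet;
    import org.apache.lucene.search.HitCollector;

    public class FacetHistogramCollector extends HitCollector {
        private final BitSet[] facetBits;  // one preloaded BitSet per facet
        private final int[] histogram;     // hit count per facet

        public FacetHistogramCollector(BitSet[] facetBits) {
            this.facetBits = facetBits;
            this.histogram = new int[facetBits.length];
        }

        public void collect(int doc, float score) {
            for (int f = 0; f < facetBits.length; f++) {
                if (facetBits[f].get(doc)) {
                    histogram[f]++;        // this hit belongs to facet f
                }
            }
        }

        public int[] getHistogram() { return histogram; }
    }

The cost is one membership test per hit per facet, i.e. O(hits * facets),
which is why the intersection approach can win once hit counts grow.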

However, my throughput testing shows that the Solr method is at least 50%
faster than mine. I'm seeing a big win with the use of the HashDocSet for
lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K seems
to provide optimal performance. I'm looking forward to trying this with
OpenBitSet.

Peter




On 5/29/06, z shalev [EMAIL PROTECTED] wrote:
[...]


Re: Aggregating category hits

2006-05-29 Thread zzzzz shalev
I know I'm a little late replying to this thread, but, in my humble opinion,
the best way to aggregate values (not necessarily terms, but whole values in
fields) is as follows:

Startup stage:

For each field you would like to aggregate, create a hashmap.

Open an index reader and run through all the docs.

Get the values to be aggregated from the fields of each doc.

Create a hashcode for each value from each field collected. The hashcode
should have some sort of prefix indicating which field it's from (for
example: 1 = author, 2 = ...) and hence which hash it is stored in (at
retrieval time, this prefix can be used to easily retrieve the value from
the correct hash).

Place the hashcode/value in the appropriate hash.

Create an arraylist.

At index X in the arraylist, place an int array of all the hashcodes
associated with doc id X.

So for example: if doc id 0 contains the value "william shakespeare" and the
value 1797, the arraylist at index 0 will have an int array containing 2
values (the 2 hashcodes of "shakespeare" and 1797).

Run time:

At run time, receive the hits and iterate through the doc ids, aggregating
the values with direct access into the arraylist (for doc id 10, go to index
10 in the arraylist to retrieve the array of hashcodes) and lookups into the
hashmaps.

I tested this today on a small index of approx. 400,000 docs (1GB of data),
but I ran queries returning over 100,000 results.

My response time was about 550 milliseconds on large (over 100,000) result
sets.

Another point: this method should be scalable for much larger indexes as
well, as it is linear in the result set size and not the index size (which
is a HUGE bonus).

If anyone wants the code, let me know.
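
Since the code was offered above, here is a hedged sketch of the scheme as
described. All names are illustrative; for simplicity the field prefix is
folded into the hashcode's high bits, one lookup map serves all fields, and
hash collisions between values are ignored.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ValueAggregator {
        private final Map<Integer, String> valueByCode = new HashMap<Integer, String>();
        private final List<int[]> codesByDoc = new ArrayList<int[]>();  // index = docId

        // Startup stage: build a prefixed hashcode for one field value
        // (e.g. fieldPrefix 1 = author) and remember code -> value.
        public int codeFor(int fieldPrefix, String value) {
            int code = (fieldPrefix << 28) | (value.hashCode() & 0x0FFFFFFF);
            valueByCode.put(Integer.valueOf(code), value);
            return code;
        }

        // Startup stage: store all of a doc's value codes at index docId.
        public void setDocCodes(int docId, int[] codes) {
            while (codesByDoc.size() <= docId) codesByDoc.add(null);
            codesByDoc.set(docId, codes);
        }

        // Run time: linear in the result set size, not the index size.
        public Map<String, Integer> aggregate(int[] hitDocIds) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (int i = 0; i < hitDocIds.length; i++) {
                int[] codes = codesByDoc.get(hitDocIds[i]);  // direct access by docId
                if (codes == null) continue;
                for (int j = 0; j < codes.length; j++) {
                    String value = valueByCode.get(Integer.valueOf(codes[j]));
                    Integer old = counts.get(value);
                    counts.put(value, Integer.valueOf(old == null ? 1 : old.intValue() + 1));
                }
            }
            return counts;
        }
    }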
   
   
  

Marvin Humphrey [EMAIL PROTECTED] wrote:
[...]

Re: Aggregating category hits

2006-05-22 Thread Kapil Chhabra

Hi Jelda,
Is there any way I can achieve sorting of the search results while also
overriding the collect method of the HitCollector in this case?

I have been using:

srch.search(query, sort);

If I replace it with srch.search(query, new HitCollector() { /* impl of the
collect method to collect counts */ }), I will have no way to sort my
results.

Any pointers?

Regards,
kapilChhabra

Kapil Chhabra wrote:
[...]


RE: Aggregating category hits

2006-05-22 Thread Ramana Jelda
I think if you dig a little into what Lucene does when asked to sort, you
will find the information you are looking for.

Here is some help.
Lucene uses TopFieldDocCollector for sorting (look at the implementation of
IndexSearcher). So your HitCollector can extend TopFieldDocCollector, doing
whatever work you want while also letting TopFieldDocCollector do its work
(sorting). I don't think I need to explain more.

Then you are done.

Have fun,
Jelda
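
A sketch of that suggestion, assuming the TopFieldDocCollector(reader, sort,
numHits) constructor of this era's Lucene and an int[] that maps doc ids to
category ordinals (e.g. derived from FieldCache.DEFAULT.getStringIndex);
both assumptions should be checked against your Lucene version.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.TopFieldDocCollector;

    public class SortingFacetCollector extends TopFieldDocCollector {
        private final int[] docId2CategoryOrd;  // docId -> category ordinal
        private final int[] categoryCounts;

        public SortingFacetCollector(IndexReader reader, Sort sort, int numHits,
                                     int[] docId2CategoryOrd, int numCategories)
                throws IOException {
            super(reader, sort, numHits);
            this.docId2CategoryOrd = docId2CategoryOrd;
            this.categoryCounts = new int[numCategories];
        }

        public void collect(int doc, float score) {
            super.collect(doc, score);                 // keep the sorting work
            categoryCounts[docId2CategoryOrd[doc]]++;  // count this hit's category
        }

        public int[] getCategoryCounts() { return categoryCounts; }
    }

After searcher.search(query, collector), topDocs() should still return the
sorted hits while getCategoryCounts() holds the per-category totals.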




-----Original Message-----
From: Kapil Chhabra
Subject: Re: Aggregating category hits
[...]

RE: Aggregating category hits

2006-05-16 Thread Ramana Jelda
But this BitSet strategy consumes more memory, mainly if you have
documents in the millions and categories in the thousands.
So in my project I preferred the FieldCache strategy.

Jelda

-----Original Message-----
From: Kapil Chhabra
Subject: Re: Aggregating category hits
[...]



Re: Aggregating category hits

2006-05-16 Thread Erik Hatcher


On May 16, 2006, at 1:37 AM, Kapil Chhabra wrote:

Even I am doing the same in my application.
Once a day, all the filters [for different categories] are
initialized. Each time a query is fired, the query BitSet is ANDed
with the BitSet of each filter. The cardinality obtained is the
desired output.
@Erik: I would like to know more about the implementation with
DocSet in place of BitSet.


I'm still in the process of migrating to Solr's infrastructure.
Regarding the other issue Jelda brought up about memory: another
handy thing about Solr is that it caches things in a configurable
LRU cache, which allows memory usage to be controlled at the expense
of performance on cache misses.


Erik







RE: Aggregating category hits

2006-05-16 Thread Ramana Jelda
Hi Kapil,
As I remember, FieldCache has been in the Lucene API since 1.4.
OK. Anyhow, here is pseudocode that can help:

// 1. On opening the reader, initialize the docId -> categoryId relation as
//    below. Depending on your requirement you can also use getStringIndex();
//    I use StringIndex in my project.
String[] docId2CategoryIdRelation =
    FieldCache.DEFAULT.getStrings(reader, categoryFieldName);

// 2. Cache it.
// 3. Search as usual with your Query, providing your own HitCollector.
// 4. Use docId2CategoryIdRelation to retrieve the category id for each
//    result document:
String yourCategoryId = docId2CategoryIdRelation[resultDocId];
// 5. Increment the count for yourCategoryId (do lazy initialization of the
//    categoryCounts holder).
// 6. You are done. :)

All the best,
Jelda




-----Original Message-----
From: Kapil Chhabra
Subject: Re: Aggregating category hits
[...]



Re: Aggregating category hits

2006-05-16 Thread Kapil Chhabra

Thanks a lot Jelda.
I'll try this and get back with the performance comparison chart.

Regards,
kapilChhabra

Ramana Jelda wrote:
[...]


Re: Aggregating category hits

2006-05-16 Thread Marvin Humphrey


Thanks, all.

The field cache and the bitsets both seem like good options until the  
collection grows too large, provided that the index does not need to  
be updated very frequently.  Then for large collections, there's  
statistical sampling.  Any of those options seems preferable to  
retrieving all docs all the time.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Aggregating category hits

2006-05-15 Thread Andrzej Bialecki

Marvin Humphrey wrote:

Greets,

If you needed to know not just the total number of hits, but the 
number of hits in each category, how would you handle that?


For instance, a search for "egg" would have to produce the 20 most
relevant documents for "egg", but also a list like this:

Holiday & Seasonal / Easter   75
Books / Cooking               52
Miscellaneous                 44
Kitchen Collectibles          43
Hobbies / Crafts              17
[...]

It seems to me that you'd have to retrieve each hit's stored fields 
and examine the contents of a category field.  That's a lot of 
overhead.  Is there another way?


Statistical sampling of results and estimation, that's what I use ...
for large result sets it works reasonably well; for small result sets I
just pay the penalty of retrieving all docs.
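
An illustrative sketch of the sampling idea: count categories over a uniform
sample of the hits and scale up, falling back to exact counting for small
result sets. The hit doc ids and the docId-to-category-ordinal array (e.g.
from a field cache) are assumed inputs.

    import java.util.Random;

    static int[] estimateCategoryCounts(int[] hitDocs, int[] categoryOfDoc,
                                        int numCategories, int sampleSize, Random rnd) {
        int[] counts = new int[numCategories];
        if (hitDocs.length <= sampleSize) {        // small result set: exact
            for (int i = 0; i < hitDocs.length; i++) {
                counts[categoryOfDoc[hitDocs[i]]]++;
            }
            return counts;
        }
        for (int i = 0; i < sampleSize; i++) {     // uniform sample with replacement
            int doc = hitDocs[rnd.nextInt(hitDocs.length)];
            counts[categoryOfDoc[doc]]++;
        }
        double scale = (double) hitDocs.length / sampleSize;
        for (int c = 0; c < numCategories; c++) {
            counts[c] = (int) Math.round(counts[c] * scale);  // scale to estimate
        }
        return counts;
    }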


Also, take a look at the following:
http://www2005.org/cdrom/docs/p245.pdf, "Sampling Search-Engine Results",
A. Anagnostopoulos, A. Z. Broder, and D. Carmel.

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






Re: Aggregating category hits

2006-05-15 Thread Erik Hatcher


On May 15, 2006, at 5:07 PM, Marvin Humphrey wrote:
If you needed to know not just the total number of hits, but the  
number of hits in each category, how would you handle that?


For instance, a search for "egg" would have to produce the 20 most
relevant documents for "egg", but also a list like this:

Holiday & Seasonal / Easter   75
Books / Cooking               52
Miscellaneous                 44
Kitchen Collectibles          43
Hobbies / Crafts              17
[...]

It seems to me that you'd have to retrieve each hit's stored fields  
and examine the contents of a category field.  That's a lot of  
overhead.  Is there another way?


My first implementation of faceted browsing uses BitSets that get
pre-loaded for each category value (each unique term in a category
field, for example).  And to intersect that with an actual Query, it
gets run through the QueryFilter to get its BitSet and then ANDed
together with each of the category BitSets.  Sounds like a lot, but
for my applications there are not tons of these BitSets and the
performance has been outstanding.  Now that I'm doing more with Solr,
I'm beginning to leverage its amazing caching infrastructure and
replacing BitSets with DocSets.
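
A hedged sketch of the approach Erik describes, written against the Lucene
1.9-era Filter.bits(IndexReader) API that returns a java.util.BitSet; the
field name and variable names are illustrative.

    import java.io.IOException;
    import java.util.BitSet;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    // Startup: preload one BitSet per unique category term.
    static Map<String, BitSet> loadCategoryBits(IndexReader reader, String[] categories)
            throws IOException {
        Map<String, BitSet> byCategory = new LinkedHashMap<String, BitSet>();
        for (int i = 0; i < categories.length; i++) {
            QueryFilter f = new QueryFilter(
                new TermQuery(new Term("category", categories[i])));
            byCategory.put(categories[i], f.bits(reader));
        }
        return byCategory;
    }

    // Query time: AND the query's bits with each category's bits and count.
    static Map<String, Integer> categoryCounts(IndexReader reader, Query query,
            Map<String, BitSet> byCategory) throws IOException {
        BitSet queryBits = new QueryFilter(query).bits(reader);
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, BitSet> e : byCategory.entrySet()) {
            BitSet tmp = (BitSet) queryBits.clone();  // keep the cached sets intact
            tmp.and(e.getValue());
            counts.put(e.getKey(), tmp.cardinality());
        }
        return counts;
    }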


Erik





Re: Aggregating category hits

2006-05-15 Thread Kapil Chhabra

Even I am doing the same in my application.
Once a day, all the filters [for different categories] are
initialized. Each time a query is fired, the query BitSet is ANDed with
the BitSet of each filter. The cardinality obtained is the desired output.
@Erik: I would like to know more about the implementation with DocSet in
place of BitSet.


Regards,
kapilChhabra


Erik Hatcher wrote:
[...]