Re: Aggregating category hits
The performance results in my previous posting were based on an implementation that performs two searches: one to get 'Hits' and another to get the BitSet. I reimplemented this as a single search using the code in 'SolrIndexSearcher.getDocListAndSetNC', and I'm now getting throughput of 350-375 qps. This is great stuff, Solr guys! I'd love to see the DocSet and DocList features added to Lucene's IndexSearcher.

Peter

On 6/12/06, Peter Keegan [EMAIL PROTECTED] wrote:

> I'm seeing query throughput of approx. 290 qps with OpenBitSet vs. 270 with BitSet. I had to reduce the max HashDocSet size to 2K-3K (from 10K-20K) to get the optimal tradeoff.
> no. docs in index: 730,000
> average no. results returned: 40
> average response time: 50 msec (15-20 for counting facets)
> no. facets: 100 on every query
> I'm not using the Solr server, as we have already developed an infrastructure.
> Peter

On 6/10/06, Yonik Seeley [EMAIL PROTECTED] wrote:

> On 6/9/06, Peter Keegan [EMAIL PROTECTED] wrote:
> > However, my throughput testing shows that the Solr method is at least 50% faster than mine. I'm seeing a big win with the use of the HashDocSet for lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K seems to provide optimal performance.
>
> Interesting... how many documents are in your collection? It would probably be nice to make the HashDocSet cutoff dynamic rather than fixed. Are you using Solr, or just some of its code?
>
> > I'm looking forward to trying this with OpenBitSet.
>
> I checked in the OpenBitSet changes today. I imagine this will lower the optimal max HashDocSet size for performance a little. You might not see much performance improvement if most of the intersections involved a HashDocSet... the OpenBitSet improvements only kick in with bitset-bitset intersection counts.
>
> -Yonik
> http://incubator.apache.org/solr -- Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
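For readers following along, the single-search idea can be sketched without any Solr dependency: one pass over the hits fills both the ranked top-N list and the full match bitset. The class and method names below are invented for illustration; this is a sketch of the technique, not the actual SolrIndexSearcher.getDocListAndSetNC code:

```java
import java.util.BitSet;
import java.util.PriorityQueue;

/** Sketch: gather the top-N ranked hits (a "DocList") and the complete
 *  match set (a "DocSet") in a single pass over the hits, instead of
 *  running two searches. Illustrative only; not the Solr implementation. */
public class DocListAndSet {
    public final int[] topDocs;  // doc ids of the N best hits, best first
    public final BitSet docSet;  // every matching doc id

    DocListAndSet(int[] topDocs, BitSet docSet) {
        this.topDocs = topDocs;
        this.docSet = docSet;
    }

    /** Each hit is {docId, score}; scores are ints here for simplicity. */
    public static DocListAndSet collect(int[][] hits, int maxDoc, int n) {
        BitSet bits = new BitSet(maxDoc);
        // min-heap on score: the worst of the current top-N sits on top
        PriorityQueue<int[]> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
        for (int[] hit : hits) {
            bits.set(hit[0]);               // DocSet side: record every match
            queue.offer(hit);               // DocList side: keep the N best
            if (queue.size() > n) queue.poll();
        }
        int[] top = new int[queue.size()];
        for (int i = top.length - 1; i >= 0; i--) top[i] = queue.poll()[0];
        return new DocListAndSet(top, bits);
    }
}
```

The point of the single pass is that the hit-collection cost is paid once, which is where the jump from ~290 to ~350-375 qps plausibly comes from.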
Re: Aggregating category hits
I'm seeing query throughput of approx. 290 qps with OpenBitSet vs. 270 with BitSet. I had to reduce the max HashDocSet size to 2K-3K (from 10K-20K) to get the optimal tradeoff.

no. docs in index: 730,000
average no. results returned: 40
average response time: 50 msec (15-20 for counting facets)
no. facets: 100 on every query

I'm not using the Solr server, as we have already developed an infrastructure.

Peter

On 6/10/06, Yonik Seeley [EMAIL PROTECTED] wrote:

> On 6/9/06, Peter Keegan [EMAIL PROTECTED] wrote:
> > However, my throughput testing shows that the Solr method is at least 50% faster than mine. I'm seeing a big win with the use of the HashDocSet for lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K seems to provide optimal performance.
>
> Interesting... how many documents are in your collection? It would probably be nice to make the HashDocSet cutoff dynamic rather than fixed. Are you using Solr, or just some of its code?
>
> > I'm looking forward to trying this with OpenBitSet.
>
> I checked in the OpenBitSet changes today. I imagine this will lower the optimal max HashDocSet size for performance a little. You might not see much performance improvement if most of the intersections involved a HashDocSet... the OpenBitSet improvements only kick in with bitset-bitset intersection counts.
>
> -Yonik
> http://incubator.apache.org/solr -- Solr, the open-source Lucene search server
Re: Aggregating category hits
hi peter, two quick questions:

1. could you let me know what kind of response time you were getting with solr (as well as the size of the data and result sizes)?

2. i took a really quick look at DocSetHitCollector and saw the dreaded

    if (bits == null) bits = new BitSet(maxDoc);

line of code. since i rewrote some lucene code to support 64-bit search instances, i have indexes that may reach quite a few GBs. allocating bitsets (arrays of longs) is quite expensive memory-wise, and i am still a little skeptical about performance with large result sets. i did some testing of my facet impl and, after an overnight webload session, received about a 500 ms average response time for full faceting (with result sets from a few thousand to over 100,000).

would really like to hear your views. thanks,

Peter Keegan [EMAIL PROTECTED] wrote:

> I compared Solr's DocSetHitCollector and counting bitset intersections to get facet counts with a different approach that uses a custom hit collector that tests each docid hit (bit) against each facet's bitset and increments a count in a histogram. My assumption was that for queries with few hits, this would be much faster than always doing bitset intersections/cardinality for every facet all the time. However, my throughput testing shows that the Solr method is at least 50% faster than mine. I'm seeing a big win with the use of the HashDocSet for lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K seems to provide optimal performance. I'm looking forward to trying this with OpenBitSet.
> Peter
>
> On 5/29/06, z shalev wrote:
> > i know i'm a little late replying to this thread, but, in my humble opinion, the best way to aggregate values (not necessarily terms, but whole values in fields) is as follows:
> >
> > startup stage:
> > - for each field you would like to aggregate, create a hashmap
> > - open an index reader and run through all the docs
> > - get the values to be aggregated from the fields of each doc
> > - create a hashcode for each value from each field collected; the hashcode should have some sort of prefix indicating which field it's from (for example: 1 = author, 2 = ...) and hence which hash it is stored in (at retrieval time, this prefix can be used to easily retrieve the value from the correct hash)
> > - place the hashcode/value in the appropriate hash
> > - create an arraylist; at index X in the arraylist, place an int array of all the hashcodes associated with doc id X
> >
> > so, for example: if doc id 0 contains the values "william shakespeare" and "1797", the arraylist at index 0 will have an int array containing 2 values (the 2 hashcodes of shakespeare and 1797)
> >
> > run time:
> > - receive the hits and iterate through the doc ids
> > - aggregate the values with direct access into the arraylist (for doc id 10, go to index 10 in the arraylist to retrieve the array of hashcodes) and lookups into the hashmaps
> >
> > i tested this today on a small index, approx 400,000 docs (1GB of data), but i ran queries returning over 100,000 results. my response time was about 550 milliseconds on large (over 100,000) result sets. another point: this method should be scalable for much larger indexes as well, as it is linear in the result set size and not the index size (which is a HUGE bonus). if anyone wants the code, let me know.
> >
> > Marvin Humphrey wrote:
> > > Thanks, all. The field cache and the bitsets both seem like good options until the collection grows too large, provided that the index does not need to be updated very frequently. Then for large collections, there's statistical sampling. Any of those options seems preferable to retrieving all docs all the time.
> > > Marvin Humphrey, Rectangular Research, http://www.rectangular.com/
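The startup/run-time scheme z shalev describes can be sketched as follows. This is an illustrative reconstruction (invented class and method names, and a simple field-prefix-in-the-high-bits encoding), not the poster's actual code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the aggregation scheme described above: at startup, each field
 *  value is interned to an int code whose high bits identify the field, and
 *  each doc id maps to an array of its codes. At query time, facet counts
 *  are gathered by walking only the hit doc ids, so cost is linear in the
 *  result set size, not the index size. */
public class ValueAggregator {
    private final Map<Integer, String> codeToValue = new HashMap<>();
    private final Map<String, Integer> valueToCode = new HashMap<>();
    private final List<int[]> docCodes = new ArrayList<>();

    /** fieldPrefix plays the role of the "1 = author, 2 = ..." prefix. */
    public int intern(int fieldPrefix, String value) {
        String key = fieldPrefix + ":" + value;
        Integer code = valueToCode.get(key);
        if (code == null) {
            // field prefix in the high byte, a running counter below it
            code = (fieldPrefix << 24) | (valueToCode.size() & 0xFFFFFF);
            valueToCode.put(key, code);
            codeToValue.put(code, value);
        }
        return code;
    }

    /** Startup: record the codes for doc id == docCodes.size(). */
    public void addDoc(int[] codes) {
        docCodes.add(codes);
    }

    /** Run time: direct array access per hit, hashmap lookup per code. */
    public Map<String, Integer> countFacets(int[] hitDocIds) {
        Map<String, Integer> counts = new HashMap<>();
        for (int docId : hitDocIds)
            for (int code : docCodes.get(docId))
                counts.merge(codeToValue.get(code), 1, Integer::sum);
        return counts;
    }
}
```

A real version would also need to handle multi-valued fields and value collisions across fields; this sketch keys the final counts by the raw value string for simplicity.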
Re: Aggregating category hits
On 6/10/06, z shalev [EMAIL PROTECTED] wrote:
> 1. could you let me know what kind of response time you were getting with solr (as well as the size of data and result sizes)

I can tell you a little bit about ours... on one CNET faceted browsing implementation using Solr, the number of facets to check per request averages somewhere between 100 and 200 (the total number of unique facets is much larger, though). The median request time is 3ms (and I don't think the majority of that time is calculating set intersections). We actually don't have the LRUCaches set large enough to achieve a 100% hit rate, but performance is still fine.

> 2. i took a really really quick look at DocSetHitCollector and saw the dreaded if (bits==null) bits = new BitSet(maxDoc);

Yes, DocSets can be memory intensive. A BitSet is only used when the number of results gets larger than a threshold... below that, a HashDocSet is used that is O(n) rather than O(maxDoc). So the memory footprint also depends on the cardinality of the sets.

> since i rewrote some lucene code to support 64-bit search instances i have indexes that may reach quite a few GB's

GBs of index size, or actually billions of documents? It's the number of documents that matters in this case.

> allocating bitset's (arrays of long's) is quite expensive memory wise and i am still a little skeptical about performance with large result sets

I just checked in a replacement for BitSet that takes intersection counts much faster.

> i did some testing of my facet impl and after an overnight webload session received about a 500 milli response time average for full faceting (with result sets from a few thousand to over 100,000)

How many documents was that with, and how many facets per document? I certainly am interested in more memory-efficient faceted browsing, and have been meaning to try some alternatives. So far, we've had good results using cached DocSets, though.
-Yonik
http://incubator.apache.org/solr -- Solr, the open-source Lucene search server
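Yonik's threshold description (a hash-based small set below a cutoff, a bitset above it) and the two kinds of intersection counts can be sketched as follows. A sorted int[] stands in for HashDocSet and java.util.BitSet for BitDocSet; the class name and cutoff value are illustrative, not Solr's actual code:

```java
import java.util.BitSet;

/** Sketch: intersection counts for the two DocSet representations
 *  described above. Small sets cost O(n) memory and O(n) to intersect;
 *  bitsets cost O(maxDoc) memory but intersect a word at a time. */
public class DocSets {
    /** Illustrative cutoff; cf. the 2K-3K vs 10K-20K tuning in the thread. */
    static final int HASH_DOC_SET_CUTOFF = 3000;

    /** Small-set vs bitset: probe each member, independent of maxDoc. */
    public static int intersectionCount(int[] smallSet, BitSet bits) {
        int count = 0;
        for (int docId : smallSet)
            if (bits.get(docId)) count++;
        return count;
    }

    /** Bitset vs bitset: this is the case the OpenBitSet work speeds up,
     *  by counting bits word-at-a-time without materializing the result. */
    public static int intersectionCount(BitSet a, BitSet b) {
        BitSet tmp = (BitSet) a.clone();
        tmp.and(b);
        return tmp.cardinality();
    }
}
```

java.util.BitSet has no fused and-then-count, so the second method clones; OpenBitSet's gain comes precisely from avoiding that intermediate allocation.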
Re: Aggregating category hits
On 6/9/06, Peter Keegan [EMAIL PROTECTED] wrote:
> However, my throughput testing shows that the Solr method is at least 50% faster than mine. I'm seeing a big win with the use of the HashDocSet for lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K seems to provide optimal performance.

Interesting... how many documents are in your collection? It would probably be nice to make the HashDocSet cutoff dynamic rather than fixed. Are you using Solr, or just some of its code?

> I'm looking forward to trying this with OpenBitSet.

I checked in the OpenBitSet changes today. I imagine this will lower the optimal max HashDocSet size for performance a little. You might not see much performance improvement if most of the intersections involved a HashDocSet... the OpenBitSet improvements only kick in with bitset-bitset intersection counts.

-Yonik
http://incubator.apache.org/solr -- Solr, the open-source Lucene search server
Re: Aggregating category hits
: I can tell you a little bit about ours... on one CNET faceted browsing
: implementation using Solr, the number of facets to check per request
: averages somewhere between 100 and 200 (the total number of unique
: facets is much larger though). The median request time is 3ms (and I
: don't think the majority of that time is calculating set
: intersections).

I've also posted some other numbers for a different CNET page which doesn't seem very faceted but actually does many more DocSet intersections to rank categories (1500-2000 per request), and does a lot more complicated main search queries (really big nested Boolean/DisjunctionMax/Phrase queries). The average request time there was 30ms... but more importantly (to me and the people i work with, anyway) the 99.9th percentile was only 456ms.

http://wiki.apache.org/solr/SolrPerformanceData

(note: this was before the OpenBitSet changes Yonik just committed)

-Hoss
Re: Aggregating category hits
hi yonik, thanks for the thorough reply... a few more quick questions.

> the number of facets to check per request average somewhere between 100 and 200 (the total number of unique facets is much larger though).

you mean 100-200 different categories to facet?

i ran the test on a 600,000 doc index. however, the cool thing about my solution is that the total doc count is not too relevant. i will be checking this with much larger indexes, probably 10x the size of my initial testing, and algorithmically i don't expect too much of a performance dropoff, due to the fact that response time is affected by the result set size and not by the index size (since i cache all faceted values on startup).

as for the 500 ms, this is basically what i do in that time:

1. in each search instance: initially send a query and return the top 100 docs. start a separate thread to collect full facet values (i do this by resending the same query with maxDoc as the number of results to return; can i save this requerying somehow?)
2. merge all instances' docs using a custom parallel multi-searcher
3. for the top 100 docs, calculate which doc came from which instance
4. send the doc ids back to each instance and have each instance create facets on its docs from the top 100
5. each instance returns this info; i then go back to the instances and pass to them the top 20 terms of each facet for the actual facet counts... i do this so that the facet counts i display are from good docs. i am trying to avoid a situation where i receive 5,000 results and 4,500 of them, with awful rankings, have the same facet values, so that the facets displayed in the UI come from badly ranked docs (confusing).

however, i will look into your impl, it sounds solid. i am currently on lucene 1.4.3 (which classes should i look into in solr?). comments welcomed. thanks in advance!

Yonik Seeley [EMAIL PROTECTED] wrote:

> On 6/10/06, z shalev wrote:
> > 1. could you let me know what kind of response time you were getting with solr (as well as the size of data and result sizes)
>
> I can tell you a little bit about ours... on one CNET faceted browsing implementation using Solr, the number of facets to check per request averages somewhere between 100 and 200 (the total number of unique facets is much larger though). The median request time is 3ms (and I don't think the majority of that time is calculating set intersections). We actually don't have the LRUCaches set large enough to achieve a 100% hit rate, but performance is still fine.
>
> > 2. i took a really really quick look at DocSetHitCollector and saw the dreaded if (bits==null) bits = new BitSet(maxDoc);
>
> Yes, DocSets can be memory intensive. A BitSet is only used when the number of results gets larger than a threshold... below that, a HashDocSet is used that is O(n) rather than O(maxDoc). So the memory footprint also depends on the cardinality of the sets.
>
> > since i rewrote some lucene code to support 64-bit search instances i have indexes that may reach quite a few GB's
>
> GBs of index size, or actually billions of documents? It's the number of documents that matters in this case.
>
> > allocating bitset's (arrays of long's) is quite expensive memory wise and i am still a little skeptical about performance with large result sets
>
> I just checked in a replacement for BitSet that takes intersection counts much faster.
>
> > i did some testing of my facet impl and after an overnight webload session received about a 500 milli response time average for full faceting (with result sets from a few thousand to over 100,000)
>
> How many documents was that with, and how many facets per document? I certainly am interested in more memory-efficient faceted browsing, and have been meaning to try some alternatives. So far, we've had good results using cached DocSets, though.
>
> -Yonik
> http://incubator.apache.org/solr -- Solr, the open-source Lucene search server
Re: Aggregating category hits
On 6/10/06, z shalev [EMAIL PROTECTED] wrote:
> > the number of facets to check per request average somewhere between 100 and 200 (the total number of unique facets is much larger though).
>
> you mean 100-200 different categories to facet?

I was going by memory, but 100 to 200 set intersections (specific counts to gather) per individual request.

> i ran the test on a 600,000 doc index, however the cool thing about my solution is that the total doc count is not too relevant

Yeah, you can always get faster if approximations are OK.

> i do this so that the facet counts i display are from good docs, i am trying to avoid a situation where i receive 5,000 results and 4,500 of them with awful rankings have the same facet values and therefore the facets displayed in the UI are of badly ranked docs

Actually, that sounds like it can make a lot of sense, as long as you can approximate what is good vs bad.

> however, i will look into your impl, it sounds solid, i am currently on lucene 1.4.3 (which classes should i look into in solr?)

Solr doesn't do faceted browsing for you yet. It provides caching of sets and fast set intersections. You currently need code to tell Solr what intersection counts to get. DocSetHitCollector and HashDocSet + BitDocSet would only help you if you cache sets of docids.

-Yonik
http://incubator.apache.org/solr -- Solr, the open-source Lucene search server
Re: Aggregating category hits
I compared Solr's DocSetHitCollector and counting bitset intersections to get facet counts with a different approach that uses a custom hit collector that tests each docid hit (bit) against each facet's bitset and increments a count in a histogram. My assumption was that for queries with few hits, this would be much faster than always doing bitset intersections/cardinality for every facet all the time. However, my throughput testing shows that the Solr method is at least 50% faster than mine. I'm seeing a big win with the use of the HashDocSet for lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K seems to provide optimal performance. I'm looking forward to trying this with OpenBitSet.

Peter

On 5/29/06, z shalev [EMAIL PROTECTED] wrote:
> i know i'm a little late replying to this thread, but, in my humble opinion, the best way to aggregate values (not necessarily terms, but whole values in fields) is as follows:
>
> startup stage:
> - for each field you would like to aggregate, create a hashmap
> - open an index reader and run through all the docs
> - get the values to be aggregated from the fields of each doc
> - create a hashcode for each value from each field collected; the hashcode should have some sort of prefix indicating which field it's from (for example: 1 = author, 2 = ...) and hence which hash it is stored in (at retrieval time, this prefix can be used to easily retrieve the value from the correct hash)
> - place the hashcode/value in the appropriate hash
> - create an arraylist; at index X in the arraylist, place an int array of all the hashcodes associated with doc id X
>
> so, for example: if doc id 0 contains the values "william shakespeare" and "1797", the arraylist at index 0 will have an int array containing 2 values (the 2 hashcodes of shakespeare and 1797)
>
> run time:
> - receive the hits and iterate through the doc ids
> - aggregate the values with direct access into the arraylist (for doc id 10, go to index 10 in the arraylist to retrieve the array of hashcodes) and lookups into the hashmaps
>
> i tested this today on a small index, approx 400,000 docs (1GB of data), but i ran queries returning over 100,000 results. my response time was about 550 milliseconds on large (over 100,000) result sets. another point: this method should be scalable for much larger indexes as well, as it is linear in the result set size and not the index size (which is a HUGE bonus). if anyone wants the code, let me know.
>
> Marvin Humphrey [EMAIL PROTECTED] wrote:
> > Thanks, all. The field cache and the bitsets both seem like good options until the collection grows too large, provided that the index does not need to be updated very frequently. Then for large collections, there's statistical sampling. Any of those options seems preferable to retrieving all docs all the time.
> > Marvin Humphrey, Rectangular Research, http://www.rectangular.com/
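Peter's histogram approach (test each collected doc id against every facet's bitset and bump a counter) can be sketched like this. The class name and the standalone collect method are illustrative; a real version would sit inside a Lucene HitCollector:

```java
import java.util.BitSet;

/** Sketch of the per-hit histogram approach compared above: each collected
 *  doc id is tested against every facet's bitset and a counter incremented.
 *  Cost is O(hits * facets), whereas the intersection approach pays a
 *  roughly fixed cost per facet regardless of hit count, which explains
 *  the crossover the post describes. */
public class HistogramFacetCounter {
    private final BitSet[] facetBits;  // one preloaded bitset per facet
    private final int[] histogram;     // running count per facet

    public HistogramFacetCounter(BitSet[] facetBits) {
        this.facetBits = facetBits;
        this.histogram = new int[facetBits.length];
    }

    /** Would be called once per hit from HitCollector.collect(doc, score). */
    public void collect(int doc) {
        for (int f = 0; f < facetBits.length; f++)
            if (facetBits[f].get(doc)) histogram[f]++;
    }

    public int[] counts() {
        return histogram;
    }
}
```

With 100 facets per query, even a modest 40-hit result set costs 4,000 bit probes here, which is consistent with the intersection method winning in Peter's measurements.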
Re: Aggregating category hits
i know i'm a little late replying to this thread, but, in my humble opinion, the best way to aggregate values (not necessarily terms, but whole values in fields) is as follows:

startup stage:
- for each field you would like to aggregate, create a hashmap
- open an index reader and run through all the docs
- get the values to be aggregated from the fields of each doc
- create a hashcode for each value from each field collected; the hashcode should have some sort of prefix indicating which field it's from (for example: 1 = author, 2 = ...) and hence which hash it is stored in (at retrieval time, this prefix can be used to easily retrieve the value from the correct hash)
- place the hashcode/value in the appropriate hash
- create an arraylist; at index X in the arraylist, place an int array of all the hashcodes associated with doc id X

so, for example: if doc id 0 contains the values "william shakespeare" and "1797", the arraylist at index 0 will have an int array containing 2 values (the 2 hashcodes of shakespeare and 1797)

run time:
- receive the hits and iterate through the doc ids
- aggregate the values with direct access into the arraylist (for doc id 10, go to index 10 in the arraylist to retrieve the array of hashcodes) and lookups into the hashmaps

i tested this today on a small index, approx 400,000 docs (1GB of data), but i ran queries returning over 100,000 results. my response time was about 550 milliseconds on large (over 100,000) result sets. another point: this method should be scalable for much larger indexes as well, as it is linear in the result set size and not the index size (which is a HUGE bonus). if anyone wants the code, let me know.

Marvin Humphrey [EMAIL PROTECTED] wrote:
> Thanks, all. The field cache and the bitsets both seem like good options until the collection grows too large, provided that the index does not need to be updated very frequently. Then for large collections, there's statistical sampling. Any of those options seems preferable to retrieving all docs all the time.
> Marvin Humphrey, Rectangular Research, http://www.rectangular.com/
Re: Aggregating category hits
Hi Jelda,

Is there any way by which I can achieve sorting of search results along with overriding the collect method of the HitCollector in this case? I have been using srch.search(query, sort). If I replace it with srch.search(query, new HitCollector() { /* impl of the collect method to collect counts */ }), I will have no way to sort my results. Any pointers?

Regards,
kapilChhabra

Kapil Chhabra wrote:
> Thanks a lot Jelda. I'll try this and get back with the performance comparison chart.
> Regards,
> kapilChhabra

Ramana Jelda wrote:
> Hi Kapil,
> As I remember, FieldCache has been in the lucene api since 1.4. Anyhow, here is pseudocode that can help:
>
> // 1. On opening the reader, initialize the docId-to-categoryId relation as below.
> //    Depending on your requirement you can instead use getStringIndex(); I get a StringIndex in my project.
> String[] docId2CategoryIdRelation = FieldCache.DEFAULT.getStrings(reader, categoryFieldName);
> // 2. Cache it.
> // 3. Search as usual with your Query, providing your own HitCollector.
> // 4. Use docId2CategoryIdRelation to retrieve the category id for each result document:
> String yourCategoryId = docId2CategoryIdRelation[resultDocId];
> // 5. Increment the count for yourCategoryId (do lazy initialization of the categoryCounts holder).
> // 6. You are done.. :)
>
> All the best,
> Jelda

-----Original Message-----
From: Kapil Chhabra [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 16, 2006 11:50 AM
To: java-user@lucene.apache.org
Subject: Re: Aggregating category hits

Hi Jelda,
I have not yet migrated to Lucene 1.9, and I guess FieldCache has been introduced in that release. Can you please give me a pointer to your FieldCache strategy?
Thanks & Regards,
Kapil Chhabra

Ramana Jelda wrote:
> But this BitSet strategy is more memory consuming, mainly if you have documents in the millions and categories in the thousands. So I preferred the FieldCache strategy in my project.
> Jelda

-----Original Message-----
From: Kapil Chhabra [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 16, 2006 7:38 AM
To: java-user@lucene.apache.org
Subject: Re: Aggregating category hits

Even I am doing the same in my application. Once a day, all the filters [for different categories] are initialized. Each time a query is fired, the query BitSet is ANDed with the BitSet of each filter. The cardinality obtained is the desired output.
@Erik: I would like to know more about the implementation with DocSet in place of BitSet.

Regards,
kapilChhabra

Erik Hatcher wrote:
> On May 15, 2006, at 5:07 PM, Marvin Humphrey wrote:
> > If you needed to know not just the total number of hits, but the number of hits in each category, how would you handle that? For instance, a search for "egg" would have to produce the 20 most relevant documents for "egg", but also a list like this:
> >
> > Holiday Seasonal / Easter   75
> > Books / Cooking             52
> > Miscellaneous               44
> > Kitchen Collectibles        43
> > Hobbies / Crafts            17
> > [...]
> >
> > It seems to me that you'd have to retrieve each hit's stored fields and examine the contents of a category field. That's a lot of overhead. Is there another way?
>
> My first implementation of faceted browsing uses BitSets that get pre-loaded for each category value (each unique term in a category field, for example). And to intersect that with an actual Query, it gets run through the QueryFilter to get its BitSet, which is then ANDed together with each of the category BitSets. Sounds like a lot, but for my applications there are not tons of these BitSets and the performance has been outstanding. Now that I'm doing more with Solr, I'm beginning to leverage its amazing caching infrastructure and replacing BitSets with DocSets.
>
> Erik
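Jelda's FieldCache pseudocode boils down to a counting loop over a docId-to-category array. Here is a minimal standalone sketch, with a plain String[] standing in for what FieldCache.DEFAULT.getStrings(reader, field) would return; there is no Lucene dependency and the names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the FieldCache strategy described above: an array mapping
 *  doc id to category id, built once per reader, is walked per hit to
 *  accumulate category counts. Memory is one String reference per doc,
 *  rather than one bitset per category. */
public class FieldCacheFacets {
    /**
     * @param docToCategory stand-in for FieldCache.getStrings(reader, field)
     * @param hits          doc ids collected by the HitCollector
     */
    public static Map<String, Integer> count(String[] docToCategory, int[] hits) {
        // lazy per-category initialization, as the pseudocode suggests
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : hits)
            counts.merge(docToCategory[doc], 1, Integer::sum);
        return counts;
    }
}
```

This is the memory tradeoff Jelda argues for: with millions of documents and thousands of categories, one array of length maxDoc beats thousands of maxDoc-sized bitsets.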
RE: Aggregating category hits
I think if you dig a little bit into what Lucene does when asked to sort, you will find the information you are looking for. Here is some help: Lucene uses TopFieldDocCollector for sorting purposes (look at the implementation of IndexSearcher). So your HitCollector can extend TopFieldDocCollector, doing the extra work you want in collect() while also letting TopFieldDocCollector do its work (the sorting). I don't think I need to explain more. Then you are done.

Have fun,
Jelda

-----Original Message-----
From: Kapil Chhabra [mailto:[EMAIL PROTECTED]]
Sent: Monday, May 22, 2006 2:07 AM
To: java-user@lucene.apache.org
Subject: Re: Aggregating category hits

> Hi Jelda,
> Is there any way by which I can achieve sorting of search results along with overriding the collect method of the HitCollector in this case? I have been using srch.search(query, sort). If I replace it with srch.search(query, new HitCollector() { /* impl of the collect method to collect counts */ }), I will have no way to sort my results. Any pointers?
> Regards,
> kapilChhabra
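A closely related way to realize Jelda's suggestion, without subclassing Lucene's collector, is a decorator: wrap the sorting collector so each hit both feeds the sort and bumps a category count. The Collector interface below is an invented stand-in for Lucene's HitCollector/TopFieldDocCollector, so this is a sketch of the pattern rather than working Lucene code:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: one pass yields both sorted results and category counts.
 *  The delegate plays the role of TopFieldDocCollector (sorting);
 *  the wrapper adds the FieldCache-based counting. */
public class CountingCollector {
    /** Stand-in for Lucene's HitCollector callback. */
    public interface Collector {
        void collect(int doc, float score);
    }

    private final Collector delegate;      // e.g. the sorting collector
    private final String[] docToCategory;  // from the FieldCache
    public final Map<String, Integer> counts = new HashMap<>();

    public CountingCollector(Collector delegate, String[] docToCategory) {
        this.delegate = delegate;
        this.docToCategory = docToCategory;
    }

    public void collect(int doc, float score) {
        counts.merge(docToCategory[doc], 1, Integer::sum);  // facet count
        delegate.collect(doc, score);                       // keep sorting
    }
}
```

Whether you subclass (as Jelda suggests) or wrap (as here), the key point is the same: the counting piggybacks on the collect() call the sort already makes, so sorting and facet counting cost a single pass.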
RE: Aggregating category hits
But this BitSet strategy is more memory consuming mainly if you have documents in million numbers and categories in thousands. So I preferred in my project FieldCache strategy. Jelda -Original Message- From: Kapil Chhabra [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 16, 2006 7:38 AM To: java-user@lucene.apache.org Subject: Re: Aggregating category hits Even I am doing the same in my application. Once in a day, all the filters [for different categories] are initialized. Each time a query is fired, the Query BitSet is ANDed with the BitSet of each filter. The cardinality obtained is the desired output. @Eric: I would like to know more about the implementation with DocSet in place of Bitset. Regards, kapilChhabra Erik Hatcher wrote: On May 15, 2006, at 5:07 PM, Marvin Humphrey wrote: If you needed to know not just the total number of hits, but the number of hits in each category, how would you handle that? For instance, a search for egg would have to produce the 20 most relevant documents for egg, but also a list like this: Holiday Seasonal / Easter 75 Books / Cooking 52 Miscellaneous 44 Kitchen Collectibles 43 Hobbies / Crafts 17 [...] It seems to me that you'd have to retrieve each hit's stored fields and examine the contents of a category field. That's a lot of overhead. Is there another way? My first implementation of faceted browsing uses BitSet's that get pre-loaded for each category value (each unique term in a category field, for example). And to intersect that with an actual Query, it gets run through the QueryFilter to get its BitSet and then AND'd together with each of the category BitSet's. Sounds like a lot, but for my applications there are not tons of these BitSet's and the performance has been outstanding. Now that I'm doing more with Solr, I'm beginning to leverage its amazing caching infrastructure and replacing BitSet's with DocSet's. 
Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Aggregating category hits
On May 16, 2006, at 1:37 AM, Kapil Chhabra wrote: Even I am doing the same in my application. Once a day, all the filters [for different categories] are initialized. Each time a query is fired, the Query BitSet is ANDed with the BitSet of each filter. The cardinality obtained is the desired output. @Erik: I would like to know more about the implementation with DocSet in place of BitSet.

I'm still in progress with the migration to Solr's infrastructure. Regarding the other issue Jelda brought up about memory - another handy thing about Solr is that it caches things in a configurable LRU cache, which allows memory usage to be controlled at the expense of performance on cache misses. Erik

Regards, kapilChhabra

Erik Hatcher wrote: On May 15, 2006, at 5:07 PM, Marvin Humphrey wrote: If you needed to know not just the total number of hits, but the number of hits in each category, how would you handle that? For instance, a search for egg would have to produce the 20 most relevant documents for egg, but also a list like this: Holiday Seasonal / Easter 75 Books / Cooking 52 Miscellaneous 44 Kitchen Collectibles 43 Hobbies / Crafts 17 [...] It seems to me that you'd have to retrieve each hit's stored fields and examine the contents of a category field. That's a lot of overhead. Is there another way?

My first implementation of faceted browsing uses BitSet's that get pre-loaded for each category value (each unique term in a category field, for example). And to intersect that with an actual Query, it gets run through the QueryFilter to get its BitSet and then AND'd together with each of the category BitSet's. Sounds like a lot, but for my applications there are not tons of these BitSet's and the performance has been outstanding. Now that I'm doing more with Solr, I'm beginning to leverage its amazing caching infrastructure and replacing BitSet's with DocSet's.
Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Aggregating category hits
Hi Kapil, As I remember, FieldCache has been in the Lucene API since 1.4. Ok. Anyhow, here is pseudo code that can help:

//1. On opening the reader, initialize the documentId-to-categoryId relation as below. Depending on your requirement you can instead use getStringIndex(); I use a StringIndex in my project.
String[] docId2CategoryIdRelation = FieldCache.DEFAULT.getStrings(reader, categoryFieldName);
//2. Cache it.
//3. Search as usual with your Query, providing your own HitCollector.
//4. Use docId2CategoryIdRelation to retrieve the category id for each result document:
String yourCategoryId = docId2CategoryIdRelation[resultDocId];
//5. Increment the count for yourCategoryId (do lazy initialization of the categoryCounts holder).
//6. You are done.. :)

All the best, Jelda

-Original Message- From: Kapil Chhabra [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 16, 2006 11:50 AM To: java-user@lucene.apache.org Subject: Re: Aggregating category hits

Hi Jelda, I have not yet migrated to Lucene 1.9 and I guess FieldCache has been introduced in this release. Can you please give me a pointer to your strategy of FieldCache? Thanks Regards, Kapil Chhabra

Ramana Jelda wrote: But this BitSet strategy is more memory consuming, mainly if you have documents in the millions and categories in the thousands. So I preferred the FieldCache strategy in my project. Jelda

-Original Message- From: Kapil Chhabra [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 16, 2006 7:38 AM To: java-user@lucene.apache.org Subject: Re: Aggregating category hits

Even I am doing the same in my application. Once a day, all the filters [for different categories] are initialized. Each time a query is fired, the Query BitSet is ANDed with the BitSet of each filter. The cardinality obtained is the desired output. @Erik: I would like to know more about the implementation with DocSet in place of BitSet.
Regards, kapilChhabra

Erik Hatcher wrote: On May 15, 2006, at 5:07 PM, Marvin Humphrey wrote: If you needed to know not just the total number of hits, but the number of hits in each category, how would you handle that? For instance, a search for egg would have to produce the 20 most relevant documents for egg, but also a list like this: Holiday Seasonal / Easter 75 Books / Cooking 52 Miscellaneous 44 Kitchen Collectibles 43 Hobbies / Crafts 17 [...] It seems to me that you'd have to retrieve each hit's stored fields and examine the contents of a category field. That's a lot of overhead. Is there another way? My first implementation of faceted browsing uses BitSet's that get pre-loaded for each category value (each unique term in a category field, for example). And to intersect that with an actual Query, it gets run through the QueryFilter to get its BitSet and then AND'd together with each of the category BitSet's. Sounds like a lot, but for my applications there are not tons of these BitSet's and the performance has been outstanding. Now that I'm doing more with Solr, I'm beginning to leverage its amazing caching infrastructure and replacing BitSet's with DocSet's. Erik

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
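Jelda's pseudocode boils down to one array lookup plus one counter increment per collected hit. Here is a minimal self-contained sketch of that counting step; the docId-to-category array is hard-coded for illustration, whereas in a real Lucene 1.9+ setup it would be loaded once per IndexReader via FieldCache.DEFAULT.getStrings(reader, categoryFieldName) and reused across queries:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the FieldCache counting strategy from Jelda's pseudocode.
// The docId2Category array stands in for the result of
// FieldCache.DEFAULT.getStrings(reader, categoryFieldName); the result
// doc ids stand in for what a custom HitCollector would have collected.
public class FieldCacheFacets {

    // Steps 4-5 of the pseudocode: array lookup + counter increment per hit.
    static Map<String, Integer> countByCategory(int[] resultDocIds,
                                                String[] docId2Category) {
        Map<String, Integer> counts = new HashMap<>();
        for (int docId : resultDocIds) {
            String category = docId2Category[docId];
            counts.merge(category, 1, Integer::sum); // lazy init of the counts holder
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical 4-document index and a query that hit docs 0, 2, 3.
        String[] docId2Category = {"cooking", "crafts", "cooking", "easter"};
        int[] hits = {0, 2, 3};
        System.out.println(countByCategory(hits, docId2Category));
    }
}
```

The memory trade-off Jelda mentions is visible here: one String reference per document, independent of the number of categories, versus one bit per document per category for the BitSet approach.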
Re: Aggregating category hits
Thanks a lot Jelda. I'll try this and get back with the performance comparison chart. Regards, kapilChhabra

Ramana Jelda wrote: Hi Kapil, As I remember, FieldCache has been in the Lucene API since 1.4. Ok. Anyhow, here is pseudo code that can help:

//1. On opening the reader, initialize the documentId-to-categoryId relation as below. Depending on your requirement you can instead use getStringIndex(); I use a StringIndex in my project.
String[] docId2CategoryIdRelation = FieldCache.DEFAULT.getStrings(reader, categoryFieldName);
//2. Cache it.
//3. Search as usual with your Query, providing your own HitCollector.
//4. Use docId2CategoryIdRelation to retrieve the category id for each result document:
String yourCategoryId = docId2CategoryIdRelation[resultDocId];
//5. Increment the count for yourCategoryId (do lazy initialization of the categoryCounts holder).
//6. You are done.. :)

All the best, Jelda

-Original Message- From: Kapil Chhabra [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 16, 2006 11:50 AM To: java-user@lucene.apache.org Subject: Re: Aggregating category hits

Hi Jelda, I have not yet migrated to Lucene 1.9 and I guess FieldCache has been introduced in this release. Can you please give me a pointer to your strategy of FieldCache? Thanks Regards, Kapil Chhabra

Ramana Jelda wrote: But this BitSet strategy is more memory consuming, mainly if you have documents in the millions and categories in the thousands. So I preferred the FieldCache strategy in my project. Jelda

-Original Message- From: Kapil Chhabra [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 16, 2006 7:38 AM To: java-user@lucene.apache.org Subject: Re: Aggregating category hits

Even I am doing the same in my application. Once a day, all the filters [for different categories] are initialized. Each time a query is fired, the Query BitSet is ANDed with the BitSet of each filter. The cardinality obtained is the desired output. @Erik: I would like to know more about the implementation with DocSet in place of BitSet.
Regards, kapilChhabra

Erik Hatcher wrote: On May 15, 2006, at 5:07 PM, Marvin Humphrey wrote: If you needed to know not just the total number of hits, but the number of hits in each category, how would you handle that? For instance, a search for egg would have to produce the 20 most relevant documents for egg, but also a list like this: Holiday Seasonal / Easter 75 Books / Cooking 52 Miscellaneous 44 Kitchen Collectibles 43 Hobbies / Crafts 17 [...] It seems to me that you'd have to retrieve each hit's stored fields and examine the contents of a category field. That's a lot of overhead. Is there another way? My first implementation of faceted browsing uses BitSet's that get pre-loaded for each category value (each unique term in a category field, for example). And to intersect that with an actual Query, it gets run through the QueryFilter to get its BitSet and then AND'd together with each of the category BitSet's. Sounds like a lot, but for my applications there are not tons of these BitSet's and the performance has been outstanding. Now that I'm doing more with Solr, I'm beginning to leverage its amazing caching infrastructure and replacing BitSet's with DocSet's. Erik

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Aggregating category hits
Thanks, all. The field cache and the bitsets both seem like good options until the collection grows too large, provided that the index does not need to be updated very frequently. Then for large collections, there's statistical sampling. Any of those options seems preferable to retrieving all docs all the time. Marvin Humphrey Rectangular Research http://www.rectangular.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Aggregating category hits
Marvin Humphrey wrote: Greets, If you needed to know not just the total number of hits, but the number of hits in each category, how would you handle that? For instance, a search for egg would have to produce the 20 most relevant documents for egg, but also a list like this: Holiday Seasonal / Easter 75 Books / Cooking 52 Miscellaneous 44 Kitchen Collectibles 43 Hobbies / Crafts 17 [...] It seems to me that you'd have to retrieve each hit's stored fields and examine the contents of a category field. That's a lot of overhead. Is there another way?

Statistical sampling of results and estimation, that's what I use. It works reasonably well for large result sets; for small result sets I just pay the penalty of retrieving all docs. Also, take a look at the following: http://www2005.org/cdrom/docs/p245.pdf, Sampling Search-Engine Results, A. Anagnostopoulos, A. Z. Broder, D. Carmel.

-- Best regards, Andrzej Bialecki Information Retrieval, Semantic Web Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
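As a rough illustration of the sampling idea (a naive scale-up, not the estimators from the cited paper): examine the category field of only a sample of the hits and multiply the sample counts by totalHits / sampleSize. All names and data below are invented:

```java
import java.util.HashMap;
import java.util.Map;

// Naive sketch of count estimation by sampling: count categories over a
// sample of the hits, then scale each count by the inverse sampling rate.
// The sampled categories and total-hit count here are made up; the paper
// Andrzej cites discusses more careful sampling and estimation schemes.
public class SampledFacets {

    static Map<String, Integer> estimateCounts(String[] sampleCategories,
                                               int totalHits) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : sampleCategories) {
            counts.merge(c, 1, Integer::sum);
        }
        // Scale: each sampled hit stands for totalHits / sampleSize hits.
        double scale = (double) totalHits / sampleCategories.length;
        counts.replaceAll((category, n) -> (int) Math.round(n * scale));
        return counts;
    }

    public static void main(String[] args) {
        // 4 sampled hits standing in for a 1000-hit result set:
        // estimates come out as cooking=500, crafts=250, easter=250.
        String[] sample = {"cooking", "cooking", "crafts", "easter"};
        System.out.println(estimateCounts(sample, 1000));
    }
}
```

This matches Marvin's and Andrzej's trade-off: accuracy degrades for rare categories and small result sets, which is why Andrzej falls back to retrieving all docs when the result set is small.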
Re: Aggregating category hits
On May 15, 2006, at 5:07 PM, Marvin Humphrey wrote: If you needed to know not just the total number of hits, but the number of hits in each category, how would you handle that? For instance, a search for egg would have to produce the 20 most relevant documents for egg, but also a list like this: Holiday Seasonal / Easter 75 Books / Cooking 52 Miscellaneous 44 Kitchen Collectibles 43 Hobbies / Crafts 17 [...] It seems to me that you'd have to retrieve each hit's stored fields and examine the contents of a category field. That's a lot of overhead. Is there another way?

My first implementation of faceted browsing uses BitSet's that get pre-loaded for each category value (each unique term in a category field, for example). And to intersect that with an actual Query, it gets run through the QueryFilter to get its BitSet and then AND'd together with each of the category BitSet's. Sounds like a lot, but for my applications there are not tons of these BitSet's and the performance has been outstanding. Now that I'm doing more with Solr, I'm beginning to leverage its amazing caching infrastructure and replacing BitSet's with DocSet's.

Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
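In stdlib terms, Erik's scheme amounts to cloning each category's pre-loaded BitSet, AND-ing the clone with the query's BitSet, and reading off cardinality(). A minimal sketch with java.util.BitSet; the documents, categories, and query hits below are invented, and in Lucene the query BitSet would come from QueryFilter rather than being built by hand:

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of per-category hit counting with pre-loaded BitSets, as Erik
// describes: one BitSet per category value, intersected with the query's
// BitSet. Cloning keeps the cached category BitSets intact across queries.
public class FacetCounts {

    static Map<String, Integer> countFacets(BitSet queryBits,
                                            Map<String, BitSet> categoryBits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : categoryBits.entrySet()) {
            BitSet intersection = (BitSet) e.getValue().clone();
            intersection.and(queryBits);                      // AND with query hits
            counts.put(e.getKey(), intersection.cardinality()); // facet count
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical 8-document index: even doc ids are "Books / Cooking",
        // odd doc ids are "Hobbies / Crafts".
        BitSet cooking = new BitSet(8);
        for (int i = 0; i < 8; i += 2) cooking.set(i);
        BitSet crafts = new BitSet(8);
        for (int i = 1; i < 8; i += 2) crafts.set(i);

        Map<String, BitSet> categories = new LinkedHashMap<>();
        categories.put("Books / Cooking", cooking);
        categories.put("Hobbies / Crafts", crafts);

        // Pretend the query matched docs 0..4 inclusive.
        BitSet queryBits = new BitSet(8);
        queryBits.set(0, 5);

        System.out.println(countFacets(queryBits, categories));
        // {Books / Cooking=3, Hobbies / Crafts=2}
    }
}
```

This is also where Jelda's memory concern shows up: the category BitSets cost one bit per document per category, so millions of documents times thousands of categories becomes expensive.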
Re: Aggregating category hits
Even I am doing the same in my application. Once a day, all the filters [for different categories] are initialized. Each time a query is fired, the Query BitSet is ANDed with the BitSet of each filter. The cardinality obtained is the desired output. @Erik: I would like to know more about the implementation with DocSet in place of BitSet. Regards, kapilChhabra

Erik Hatcher wrote: On May 15, 2006, at 5:07 PM, Marvin Humphrey wrote: If you needed to know not just the total number of hits, but the number of hits in each category, how would you handle that? For instance, a search for egg would have to produce the 20 most relevant documents for egg, but also a list like this: Holiday Seasonal / Easter 75 Books / Cooking 52 Miscellaneous 44 Kitchen Collectibles 43 Hobbies / Crafts 17 [...] It seems to me that you'd have to retrieve each hit's stored fields and examine the contents of a category field. That's a lot of overhead. Is there another way? My first implementation of faceted browsing uses BitSet's that get pre-loaded for each category value (each unique term in a category field, for example). And to intersect that with an actual Query, it gets run through the QueryFilter to get its BitSet and then AND'd together with each of the category BitSet's. Sounds like a lot, but for my applications there are not tons of these BitSet's and the performance has been outstanding. Now that I'm doing more with Solr, I'm beginning to leverage its amazing caching infrastructure and replacing BitSet's with DocSet's. Erik

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]