Re: docMap array in SegmentMergeInfo
On a multi-CPU system, the loop that builds the docMap array can cause severe thread thrashing because of the synchronized method 'isDeleted'. I have observed this on an index with over 1 million documents (containing a few thousand deleted docs) when multiple threads perform a search with either a sort field or a range query. A stack dump shows all threads here:

waiting for monitor entry [0x6d2cf000..0x6d2cfd6c]
at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:241)
- waiting to lock 0x04e40278

The performance worsens as the number of threads increases; the searches may take minutes to complete. If only a single thread issues the search, it completes fairly quickly.

I also noticed from looking at the code that the docMap doesn't appear to be used in these cases; it seems only to be used for merging segments. If the index is in 'search/read-only' mode, is there a way around this bottleneck?

Thanks,
Peter

On 7/13/05, Doug Cutting [EMAIL PROTECTED] wrote:
Lokesh Bajaj wrote:
For a very large index where we might want to delete/replace some documents, this would require a lot of memory (for 100 million documents, this would need 381 MB of memory). Is there any reason why this was implemented this way?

In practice this has not been an issue. A single index with 100M documents is usually quite slow to search. When collections get this big, folks tend to instead search multiple indexes in parallel in order to keep response times acceptable. Also, 381 MB of RAM is often not a problem for folks with 100M documents. But this is not to say that it could never be a problem. For folks with limited RAM and/or lots of small documents it could indeed be an issue.

It seems like this could be implemented as a much smaller array that only keeps track of the deleted document numbers, and it would still be very efficient to calculate the new document number by using this much smaller array.
Has this been done by anyone else or been considered for change in the Lucene code? Please submit a patch to the java-dev list. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
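As an aside for readers of this thread, the smaller-array idea proposed above can be sketched in plain Java. This is a hypothetical illustration (DocRemap and remap are invented names, not Lucene code): given a sorted array of deleted doc numbers, the remapped doc number is the original minus the count of deleted docs below it, found by binary search.

```java
import java.util.Arrays;

// Sketch of the space-saving idea from the thread: instead of a full
// docMap[] with one entry per document, keep only a sorted array of
// deleted doc numbers and compute the remapped number on demand.
public class DocRemap {
    // Assumes 'deleted' is sorted ascending and 'doc' itself is not deleted.
    static int remap(int doc, int[] deleted) {
        int pos = Arrays.binarySearch(deleted, doc);
        // binarySearch returns -(insertionPoint) - 1 when not found;
        // the insertion point equals the count of deleted docs below 'doc'.
        int deletedBelow = -pos - 1;
        return doc - deletedBelow;
    }

    public static void main(String[] args) {
        int[] deleted = {2, 5, 6};
        System.out.println(remap(0, deleted)); // 0
        System.out.println(remap(3, deleted)); // 2
        System.out.println(remap(7, deleted)); // 4
    }
}
```

For a few thousand deletions this array costs kilobytes rather than the hundreds of megabytes a per-document map would need at 100M docs, at the price of an O(log n) lookup per remap.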
Re: docMap array in SegmentMergeInfo
Here is one stack trace:

Full thread dump Java HotSpot(TM) Client VM (1.5.0_03-b07 mixed mode):
Thread-6 prio=5 tid=0x6cf7a7f0 nid=0x59e50 waiting for monitor entry [0x6d2cf000..0x6d2cfd6c]
at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:241)
- waiting to lock 0x04e40278 (a org.apache.lucene.index.SegmentReader)
at org.apache.lucene.index.SegmentMergeInfo.init(SegmentMergeInfo.java:43)
at org.apache.lucene.index.MultiTermEnum.init(MultiReader.java:277)
at org.apache.lucene.index.MultiReader.terms(MultiReader.java:186)
at org.apache.lucene.search.RangeQuery.rewrite(RangeQuery.java:75)
at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:243)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:166)
at org.apache.lucene.search.Query.weight(Query.java:84)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:158)
at org.apache.lucene.search.Searcher.search(Searcher.java:67)
at org.apache.lucene.search.QueryFilter.bits(QueryFilter.java:62)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:121)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
at org.apache.lucene.search.Hits.init(Hits.java:51)
at org.apache.lucene.search.Searcher.search(Searcher.java:49)

I've also seen it happen during sorting from: FieldSortedHitQueue.comparatorAuto -> FieldCacheImpl.getAuto() -> MultiReader.terms() -> MultiTermEnum.init() -> SegmentMergeInfo.init() -> SegmentReader.isDeleted()

Peter

On 10/11/05, Yonik Seeley [EMAIL PROTECTED] wrote:
We've been using this in production for a while and it fixed the extremely slow searches when there are deleted documents.

Who was the caller of isDeleted()? There may be an opportunity for an easy optimization to grab the BitVector and reuse it instead of repeatedly calling isDeleted() on the IndexReader.

-Yonik
Now hiring -- http://tinyurl.com/7m67g
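Yonik's suggested optimization - grab the deleted-docs bit set once instead of calling synchronized isDeleted() per document - might look roughly like the sketch below. It uses java.util.BitSet as a stand-in for Lucene's internal BitVector, and the class and method names are invented for illustration:

```java
import java.util.BitSet;

// Sketch: one synchronized snapshot of the deleted-docs bits, then an
// unsynchronized tight loop, instead of N synchronized isDeleted() calls.
public class DeletedSnapshot {
    private final Object lock = new Object();
    private final BitSet deletedDocs = new BitSet();

    void delete(int doc) {
        synchronized (lock) { deletedDocs.set(doc); }
    }

    // One synchronized access to grab a stable copy...
    BitSet snapshot() {
        synchronized (lock) { return (BitSet) deletedDocs.clone(); }
    }

    // ...then a loop over all docs that takes no lock at all.
    int countLive(int maxDoc) {
        BitSet snap = snapshot();
        int live = 0;
        for (int d = 0; d < maxDoc; d++) {
            if (!snap.get(d)) live++;
        }
        return live;
    }
}
```

The point is the lock is taken once per scan rather than once per document, which is what turns the per-document monitor contention seen in the stack dump into a single cheap acquisition.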
Re: docMap array in SegmentMergeInfo
Hi Yonik,

Your patch has corrected the thread thrashing problem on multi-CPU systems. I've tested it with both 1.4.3 and 1.9. I haven't seen the 100X performance gain, but that's because I'm caching QueryFilters and Lucene is caching the sort fields. Thanks for the fast response!

btw, I had previously tried Chris's fix (replace the synchronized method with a snapshot reference), but I was getting errors trying to fetch stored fields from the Hits. I didn't chase it down, but the errors went away when I reverted that specific patch.

Peter

On 10/12/05, Yonik Seeley [EMAIL PROTECTED] wrote:
Here's the patch: http://issues.apache.org/jira/browse/LUCENE-454
It resulted in quite a performance boost indeed!

On 10/12/05, Yonik Seeley [EMAIL PROTECTED] wrote:
Thanks for the trace Peter, and great catch! It certainly does look like avoiding the construction of the docMap for a MultiTermEnum will be a significant optimization.

-Yonik
Now hiring -- http://tinyurl.com/7m67g
Re: Throughput doesn't increase when using more concurrent threads
This is just fyi - in my stress tests on an 8-cpu box (that's 8 real cpus), the maximum throughput occurred with just 4 query threads. The query throughput decreased with fewer than 4 or more than 4 query threads. The entire index was most likely in the file system cache, too. Periodic snapshots of stack traces showed most threads blocked in the synchronization in FSIndexInput.readInternal() when the thread count exceeded 4.

Peter

On 11/22/05, Oren Shir [EMAIL PROTECTED] wrote:
Hi,
There are two synchronization points: on the stream and on the reader. Using different FSDirectories and IndexReaders should solve this. I'll let you know once I code it. Right now I'm checking if making my Documents store less data will move the bottleneck to some other place.
Thanks again,
Oren Shir

On 11/21/05, Doug Cutting [EMAIL PROTECTED] wrote:
Jay Booth wrote:
I had a similar problem with threading; the problem turned out to be that in the back end of the FSDirectory class, I believe it was, there was a synchronized block on the actual RandomAccessFile resource when reading a block of data from it... high-concurrency situations caused threads to stack up in front of this synchronized block, and our CPU time wound up being spent thrashing between blocked threads instead of doing anything useful.

This is correct. In Lucene, multiple streams per file are created by cloning, and all clones of an FSDirectory input stream share a RandomAccessFile and must synchronize input from it. MMapDirectory does not have this limitation. If your indexes are less than a few GB or you are using 64-bit hardware, then MMapDirectory should work well for you. Otherwise it would be simple to write an nio-based Directory that does not use mmap and is also unsynchronized. Such a contribution would be welcome.
Making multiple IndexSearchers and FSDirectories didn't help because in the back end, Lucene consults a singleton HashMap of some kind (I don't remember the implementation) that maintains a single FSDirectory for any given index being accessed from the JVM... multiple calls to FSDirectory.getDirectory actually return the same FSDirectory object, with synchronization at the same point.

This does not make sense to me. FSDirectory does keep a cache of FSDirectory instances, but i/o should not be synchronized on these. One should be able to open multiple input streams on the same file from an FSDirectory. But this would not be a great solution, since file handle limits would soon become a problem.

Doug
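A sketch of the unsynchronized nio-based read Doug describes is below. It relies on the positional FileChannel.read(ByteBuffer, long), which does not touch a shared file pointer and is safe to call concurrently from many threads, unlike RandomAccessFile's seek()-then-read(). (This sketch uses the java.nio.file API, which postdates this thread; the class and method names are illustrative, not from any patch.)

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: positional reads on a shared FileChannel need no lock,
// so many query threads can read the same index file concurrently.
public class PositionalRead {
    static byte[] readAt(FileChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        // Positional read: each call passes its own offset explicitly,
        // so there is no shared file-pointer state to synchronize on.
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos + buf.position());
            if (n < 0) break; // hit end of file
        }
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("demo", ".bin");
        Files.write(p, "hello world".getBytes());
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            System.out.println(new String(readAt(ch, 6, 5))); // world
        }
        Files.delete(p);
    }
}
```

This is exactly the property the synchronized RandomAccessFile path lacks: with pread-style calls, concurrency is limited by the OS and disk, not by a monitor in front of every read.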
Re: Throughput doesn't increase when using more concurrent threads
It's a 3GHz Intel box with Xeon processors, 64GB RAM :)

Peter

On 1/25/06, Yonik Seeley [EMAIL PROTECTED] wrote:
Thanks Peter, that's useful info. Just out of curiosity, what kind of box is this? What CPUs?
-Yonik

On 1/25/06, Peter Keegan [EMAIL PROTECTED] wrote:
This is just fyi - in my stress tests on an 8-cpu box (that's 8 real cpus), the maximum throughput occurred with just 4 query threads. [...]
Re: Throughput doesn't increase when using more concurrent threads
Yes, it's hyperthreaded (16 CPUs show up in Task Manager - the box is running Windows 2003). I plan to turn off hyperthreading to see if it has any effect.

Peter

On 1/25/06, Yonik Seeley [EMAIL PROTECTED] wrote:
On 1/25/06, Peter Keegan [EMAIL PROTECTED] wrote:
It's a 3GHz Intel box with Xeon processors, 64GB ram :)

Nice! Xeon processors are normally hyperthreaded. On a Linux box, if you cat /proc/cpuinfo, you will see 8 processors for a 4-physical-CPU system. Are you positive you have 8 physical Xeon processors?
-Yonik
Re: Throughput doesn't increase when using more concurrent threads
Paul,

I tried this, but it ran out of memory trying to read the 500MB .fdt file. I tried various values for MAX_BBUF, but it still ran out of memory. (I'm using -Xmx1600M, which is the JVM's maximum value (v1.5).) I'll give NioFSDirectory a try.

Thanks,
Peter

On 1/26/06, Paul Elschot [EMAIL PROTECTED] wrote:
On Wednesday 25 January 2006 20:51, Peter Keegan wrote:
The index is non-compound format and optimized. Yes, I did try MMapDirectory, but the index is too big - 3.5GB (1.3GB is term vectors).
Peter

You could also give this a try: http://issues.apache.org/jira/browse/LUCENE-283
Regards,
Paul Elschot
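For context, the MAX_BBUF idea in LUCENE-283 works around the fact that a single MappedByteBuffer cannot exceed Integer.MAX_VALUE bytes by mapping a large file as a series of fixed-size chunks. Below is a minimal sketch of that chunking; the actual patch surely differs in detail, and ChunkedMmap / mapInChunks are invented names:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: cover a large file with an array of memory-mapped chunks,
// since one MappedByteBuffer is limited to 2GB (an int-sized index).
public class ChunkedMmap {
    static MappedByteBuffer[] mapInChunks(FileChannel ch, int chunkSize)
            throws IOException {
        long size = ch.size();
        int nChunks = (int) ((size + chunkSize - 1) / chunkSize);
        MappedByteBuffer[] bufs = new MappedByteBuffer[nChunks];
        for (int i = 0; i < nChunks; i++) {
            long off = (long) i * chunkSize;
            long len = Math.min(chunkSize, size - off);
            bufs[i] = ch.map(FileChannel.MapMode.READ_ONLY, off, len);
        }
        return bufs;
    }
}
```

A reader then resolves position p to chunk p / chunkSize at offset p % chunkSize. Note that chunking sidesteps the per-buffer size limit, not the overall address-space limit: on a 32-bit JVM the mappings still compete with the heap for virtual address space, which is consistent with the out-of-memory behavior reported above.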
Re: Throughput doesn't increase when using more concurrent threads
Ray,

The throughput is worse with NioFSDirectory than with FSDirectory (patched and unpatched). The bottleneck still seems to be synchronization, this time in NioFile.getChannel (7 of the 8 threads were blocked there during one snapshot). I tried this with 4 and 8 channels. The throughput with the patched FSDirectory was about the same as before the patch.

Thanks,
Peter

On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:
Speaking of NioFSDirectory, I thought there was one posted a while ago; is this something that can be used? http://issues.apache.org/jira/browse/LUCENE-414
ray,

On 11/22/05, Doug Cutting [EMAIL PROTECTED] wrote:
[...]
Re: Throughput doesn't increase when using more concurrent threads
I'd love to try this, but I'm not aware of any 64-bit JVMs for Windows on Intel. If you know of any, please let me know. Linux may be an option, too.

btw, I'm getting a sustained rate of 135 queries/sec with 4 threads, which is pretty impressive. Another way around the concurrency limit is to run multiple JVMs. The throughput of each is less, but the aggregate throughput is higher.

Peter

On 1/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:
Hmmm, can you run the 64-bit version of Windows (and hence a 64-bit JVM)? We're running with heap sizes up to 8GB (RH Linux 64-bit, Opterons, Sun Java 1.5).
-Yonik

On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
Paul, I tried this but it ran out of memory trying to read the 500MB .fdt file. [...]
Re: Throughput doesn't increase when using more concurrent threads
Dumb question: does the 64-bit compiler (javac) generate different code than the 32-bit version, or is it just the JVM that matters? My reported speedups were solely from using the 64-bit JVM with jar files from the 32-bit compiler.

Peter

On 1/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:
Nice speedup! The extra registers in 64-bit mode may have helped a little too.
-Yonik

On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
Correction: make that 285 qps :)
Re: Throughput doesn't increase when using more concurrent threads
Ray,

The short answer is that you can make Lucene blazingly fast by using the advice and design principles mentioned in this forum and, of course, by reading 'Lucene in Action'. For example: use a 'content' field for searching all fields (vs. multi-field search), put all your stored data in one field, and understand the cost of numeric search and sorting. On the platform side, go multi-CPU and of course 64-bit if possible :)

Also, I would venture to guess that a lot of search bottlenecks have nothing to do with Lucene, but rather with the infrastructure around it. For example, how does your client interface to the search engine? My results use a plain socket interface between client and server (one connection for queries, another for results), using a simple query/results data format. Introducing other web infrastructures invites degradation in performance, too.

I've a bit of experience with search engines, but I'm obviously still learning thanks to this group.

Peter

On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:
Peter,
Wow, the speedup is impressive! But may I ask what you did to achieve 135 queries/sec prior to the JVM switch?
ray,

On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
Correction: make that 285 qps :)

On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
I tried the AMD 64-bit JVM from Sun with MMapDirectory and I'm now getting 250 queries/sec and excellent CPU utilization (equal concurrency on all CPUs)!! Yonik, thanks for the pointer to the 64-bit JVM. I wasn't aware of it. Thanks all very much.
Peter

On 1/26/06, Doug Cutting [EMAIL PROTECTED] wrote:
Doug Cutting wrote:
A 64-bit JVM with NioDirectory would really be optimal for this.
Oops. I meant MMapDirectory, not NioDirectory.
Doug
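The 'content' field advice above amounts to concatenating every searchable field's text into one catch-all field at index time, so queries hit a single field instead of rewriting across many. A plain-Java sketch of that preprocessing step follows (no Lucene API, since Field constructors vary across versions; all names here are invented for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: before handing fields to the indexer, add a combined
// "content" field holding the text of every searchable field.
public class CatchAllField {
    static Map<String, String> withContentField(Map<String, String> fields) {
        StringBuilder all = new StringBuilder();
        for (String v : fields.values()) {
            all.append(v).append(' ');
        }
        Map<String, String> out = new LinkedHashMap<>(fields);
        out.put("content", all.toString().trim());
        return out;
    }
}
```

At search time a single TermQuery against "content" then replaces a multi-field boolean query, which is the cost the advice is trying to avoid; the trade-off is a somewhat larger index and the loss of per-field scoring.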
Re: Throughput doesn't increase when using more concurrent threads
Ray,

The 135 qps rate was using the standard FSDirectory in 1.9.

Peter

On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:
Paul,
Thanks for the advice! But for the 100+ queries/sec on a 32-bit platform, did you end up applying other patches? Or use different FSDirectory implementations? Thanks!
ray,

On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
[...]
Re: Throughput doesn't increase when using more concurrent threads
I cranked up the dial on my query tester and was able to get the rate up to 325 qps. Unfortunately, the machine died shortly thereafter (memory errors :-( ). Hopefully, it was just a coincidence. I haven't measured 64-bit indexing speed yet.

Peter

On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote:
Peter Keegan wrote:
I tried the AMD 64-bit JVM from Sun and with MMapDirectory and I'm now getting 250 queries/sec and excellent cpu utilization (equal concurrency on all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't aware of it.

Wow. That's fast. Out of interest, does indexing time speed up much on 64-bit hardware? I'm particularly interested in this side of things because for our own application, any query response under half a second is good enough, but the indexing side could always be faster. :-)

Daniel
--
Daniel Noll
Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699 Fax: (02) 9212 6902

This message is intended only for the named recipient. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this message or attachment is strictly prohibited.
Re: Throughput doesn't increase when using more concurrent threads
We discovered that the kernel was only using 8 CPUs. After recompiling for 16 (8 + hyperthreads), it looks like the query rate will settle in around 280-300 qps. Much better, although still quite a bit slower than the Opteron.

Peter

On 2/22/06, Yonik Seeley [EMAIL PROTECTED] wrote:
Hmmm, not sure what that could be. You could try using the default FSDir instead of MMapDir to see if the differences are there. Some things that could be different:
- thread scheduling (shouldn't make too much of a difference though)
- synchronization workings
- page replacement policy... how to figure out what pages to swap in and which to swap out, esp. of the memory mapped files.
You could also try a profiler on both platforms to try and see where the difference is.
-Yonik

On 2/22/06, Peter Keegan [EMAIL PROTECTED] wrote:
I am doing a performance comparison of Lucene on Linux vs Windows. I have 2 identically configured servers (8 real CPUs x 3GHz Xeon processors, 64GB RAM). One is running CentOS 4 Linux, the other is running Windows Server 2003 Enterprise Edition x64. Both have 64-bit JVMs from Sun. The Lucene server is using MMapDirectory. I'm running the JVM with -Xmx16000M. Peak memory usage of the JVM is about 6GB on Linux and 7.8GB on Windows. I'm observing query rates of 330 queries/sec on the Wintel server, but only 200 qps on the Linux box. At first, I suspected a network bottleneck, but when I 'short-circuited' Lucene, the query rates were identical. I suspect that there are some things to be tuned in Linux, but I'm not sure what. Any advice would be appreciated.

Peter

On 1/30/06, Peter Keegan [EMAIL PROTECTED] wrote:
I cranked up the dial on my query tester and was able to get the rate up to 325 qps. [...]
Re: Throughput doesn't increase when using more concurrent threads
Chris,

I tried JRockit a while back on 8-CPU/Windows and it was slower than Sun's. Since I seem to be CPU-bound right now, I'll be trying a 16-CPU system next (32 with hyperthreading), on LinTel. I may give JRockit another go-around then.

Thanks,
Peter

On 2/23/06, Chris Lamprecht [EMAIL PROTECTED] wrote:
Peter,
Have you given the JRockit JVM a try? I've seen it help throughput compared to Sun's JVM on a dual Xeon/Linux machine, especially with concurrency (up to 6 concurrent searches happening). I'm curious to see if it makes a difference for you.
-chris

On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:
We discovered that the kernel was only using 8 CPUs. After recompiling for 16 (8 + hyperthreads), it looks like the query rate will settle in around 280-300 qps. [...]
Re: Throughput doesn't increase when using more concurrent threads
Yonik,

We're investigating both approaches. Yes, the resources (and permutations) are dizzying!

Peter

On 2/23/06, Yonik Seeley [EMAIL PROTECTED] wrote:
Wow, some resources! Would it be cheaper / more scalable to copy the index to multiple boxes and load-balance requests across them?
-Yonik

On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:
Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system next (32 with hyperthreading), on LinTel. I may give JRockit another go around then.
Thanks,
Peter
Re: Throughput doesn't increase when using more concurrent threads
I ran a query performance tester against 8-CPU and 16-CPU Xeon servers (16/32 CPUs hyperthreaded) on Linux. Here are the results:

8-cpu: 275 qps
16-cpu: 305 qps

(the dual-core Opteron servers are still faster)

Here is the stack trace of 8 of the 16 query threads during the test:

at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:281)
- waiting to lock 0x002adf5b2110 (a org.apache.lucene.index.SegmentReader)
at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:83)
at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:146)
at org.apache.lucene.search.Hits.doc(Hits.java:103)

SegmentReader.document is a synchronized method. I have one stored field (binary, uncompressed) with an average length of 0.5KB. The retrieval of this stored field is within this synchronized code. Since I am using MMapDirectory, does this retrieval need to be synchronized?

Peter

On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:
Yonik,
We're investigating both approaches. Yes, the resources (and permutations) are dizzying!
Peter

On 2/23/06, Yonik Seeley [EMAIL PROTECTED] wrote:
Wow, some resources! Would it be cheaper / more scalable to copy the index to multiple boxes and load-balance requests across them?
-Yonik

On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:
Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system next (32 with hyperthreading), on LinTel. I may give JRockit another go around then.
Thanks,
Peter
Re: Throughput doesn't increase when using more concurrent threads
3. Use the ThreadLocal's FieldReader in the document() method. As I understand it, this means that the document() method no longer needs to be synchronized, right?

I've made these changes and they do appear to improve performance. Random snapshots of the stack traces show only an occasional lock in 'isDeleted'. Mostly, though, the threads are busy scoring and adding results to priority queues, which is great. I've included some sample stacks below. I'll report the new query rates after it has run for at least overnight, and I'd be happy to submit these changes to the Lucene committers, if interested.

Peter

Sample stack traces:

QueryThread group 1,#8 prio=1 tid=0x002ce48eeb80 nid=0x6b87 runnable [0x43887000..0x43887bb0]
  at org.apache.lucene.search.FieldSortedHitQueue.lessThan(FieldSortedHitQueue.java:108)
  at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:61)
  at org.apache.lucene.search.FieldSortedHitQueue.insert(FieldSortedHitQueue.java:85)
  at org.apache.lucene.search.FieldSortedHitQueue.insert(FieldSortedHitQueue.java:92)
  at org.apache.lucene.search.TopFieldDocCollector.collect(TopFieldDocCollector.java:51)
  at org.apache.lucene.search.TermScorer.score(TermScorer.java:75)
  at org.apache.lucene.search.TermScorer.score(TermScorer.java:60)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
  at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:225)
  at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
  at org.apache.lucene.search.Hits.init(Hits.java:52)
  at org.apache.lucene.search.Searcher.search(Searcher.java:62)

QueryThread group 1,#5 prio=1 tid=0x002ce4d659f0 nid=0x6b84 runnable [0x43584000..0x43584d30]
  at org.apache.lucene.search.TermScorer.score(TermScorer.java:75)
  at org.apache.lucene.search.TermScorer.score(TermScorer.java:60)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
  at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:225)
  at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
  at org.apache.lucene.search.Hits.init(Hits.java:52)
  at org.apache.lucene.search.Searcher.search(Searcher.java:62)

QueryThread group 1,#4 prio=1 tid=0x002ce10afd50 nid=0x6b83 runnable [0x43483000..0x43483db0]
  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.readByte(MMapDirectory.java:46)
  at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:56)
  at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:101)
  at org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:194)
  at org.apache.lucene.search.TermScorer.skipTo(TermScorer.java:144)
  at org.apache.lucene.search.ConjunctionScorer.doNext(ConjunctionScorer.java:56)
  at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:51)
  at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:290)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
  at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:225)
  at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
  at org.apache.lucene.search.Hits.init(Hits.java:52)
  at org.apache.lucene.search.Searcher.search(Searcher.java:62)

QueryThread group 1,#3 prio=1 tid=0x002ce48959f0 nid=0x6b82 runnable [0x43382000..0x43382e30]
  at java.util.LinkedList.listIterator(LinkedList.java:523)
  at java.util.AbstractList.listIterator(AbstractList.java:349)
  at java.util.AbstractSequentialList.iterator(AbstractSequentialList.java:250)
  at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:80)
  at org.apache.lucene.search.BooleanScorer2$2.score(BooleanScorer2.java:186)
  at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:327)
  at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:291)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
  at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:225)
  at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
  at org.apache.lucene.search.Hits.init(Hits.java:52)
  at org.apache.lucene.search.Searcher.search(Searcher.java:62)

On 3/7/06, Doug Cutting [EMAIL PROTECTED] wrote: Peter Keegan wrote: I ran a query performance tester against 8-cpu and 16-cpu Xeon servers (16/32 cpus hyperthreaded) on Linux. Here are the results: 8-cpu: 275 qps; 16-cpu: 305 qps (the dual-core Opteron servers are still faster). Here is the stack trace of 8 of the 16 query
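The ThreadLocal change above can be sketched in plain Java. This is a simplified, self-contained model of the pattern (not Lucene's actual FieldsReader): instead of synchronizing every document() call on one shared file pointer, each thread keeps its own position into the shared, read-only data, so reads need no lock.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadLocalReaderDemo {
    // Shared, read-only backing data (stands in for the mmapped .fdt file).
    static final byte[] DATA = {10, 20, 30, 40, 50};

    // Per-thread "stream" state: an independent position into the shared data.
    static final ThreadLocal<int[]> POSITION =
        ThreadLocal.withInitial(() -> new int[]{0});

    // Unsynchronized read: safe because the position is thread-local and
    // DATA is never mutated after startup.
    static byte readAt(int index) {
        int[] pos = POSITION.get();
        pos[0] = index;          // seek, private to this thread
        return DATA[pos[0]];     // read, no lock needed
    }

    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(4);
        Future<Byte> a = exec.submit(() -> readAt(1));
        Future<Byte> b = exec.submit(() -> readAt(3));
        System.out.println(a.get() + " " + b.get()); // 20 40
        exec.shutdown();
    }
}
```

The trade-off is one clone of the stream state per thread, which is cheap compared to the monitor contention it removes.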
Re: Throughput doesn't increase when using more concurrent threads
Chris,

Should this patch work against the current code base? I'm getting this error:

  D:\lucene-1.9>patch -b -p0 -i nio-lucene-1.9.patch
  patching file src/java/org/apache/lucene/index/CompoundFileReader.java
  patching file src/java/org/apache/lucene/index/FieldsReader.java
  missing header for unified diff at line 45 of patch
  can't find file to patch at input line 45
  Perhaps you used the wrong -p or --strip option?
  The text leading up to this was:
  --
  | +47,9 @@
  |     fieldsStream = d.openInput(segment + ".fdt");
  |     indexStream = d.openInput(segment + ".fdx");
  |
  |+    fstream = new ThreadStream(fieldsStream);
  |+    istream = new ThreadStream(indexStream);
  |+
  |     size = (int)(indexStream.length() / 8);
  |   }
  --

Thanks,
Peter

On 3/10/06, Chris Lamprecht [EMAIL PROTECTED] wrote:

Peter,

I think this is similar to the patch in this bugzilla task: http://issues.apache.org/bugzilla/show_bug.cgi?id=35838 — the patch itself is http://issues.apache.org/bugzilla/attachment.cgi?id=15757 (BTW, does JIRA have a way to display the patch diffs?)

The above patch also has a change to SegmentReader to avoid synchronization on isDeleted(). However, with that patch, you no longer have the guarantee that one thread will immediately see deletions by another thread. This was fine for my purposes, and resulted in a big performance boost when there were deleted documents, but it may not be correct for others' needs.

-chris
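Chris's caveat can be made concrete with a small sketch. The patch itself simply drops the synchronization and accepts stale reads; the copy-on-write snapshot below is an illustrative alternative (not the patch's code, and not Lucene's SegmentReader) that keeps the read path lock-free while giving writers a clear publication point.

```java
import java.util.BitSet;

public class UnsyncDeletedDocs {
    // A volatile snapshot of the deleted-docs bitset, swapped on each write.
    private volatile BitSet deleted = new BitSet();

    // Lock-free read path: no monitor entry, so no thread thrashing under
    // heavy concurrent search load. A reader may briefly see a snapshot
    // that predates another thread's delete.
    public boolean isDeleted(int doc) {
        return deleted.get(doc);
    }

    // Writer path: copy-on-write, so readers never observe a half-updated set.
    public synchronized void delete(int doc) {
        BitSet copy = (BitSet) deleted.clone();
        copy.set(doc);
        deleted = copy; // volatile write publishes the new snapshot
    }
}
```

For a read-only or read-mostly index this relaxed visibility is usually acceptable, which is exactly the trade-off Chris describes.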
Re: Throughput doesn't increase when using more concurrent threads
Chris,

My apologies - this error was apparently caused by a file format mismatch (probably line endings).

Thanks,
Peter
Re: Good MMapDirectory performance
- I read from Peter Keegan's recent postings: The Lucene server is using MMapDirectory. I'm running the jvm with -Xmx16000M. Peak memory usage of the jvm on Linux is about 6GB and 7.8GB on Windows.
- We don't have nearly as much memory as Peter but I wonder whether he is gaining anything with such a large heap.

My application gets better throughput with more VM, but that is probably due to heavy use of ByteBuffers in the application, not VM for Lucene.

Peter

On 3/12/06, kent.fitch [EMAIL PROTECTED] wrote:

I thought I'd post some good news about MMapDirectory, as the comments in the release notes are quite downbeat about its performance. In some environments MMapDirectory provides a big improvement.

Our test application is an index of 11.4 million documents derived from MARC (bibliographic) catalogue records. Our aim is to build a system to demonstrate relevance ranking and result clustering for library union catalogue searching (a union catalogue accumulates/merges records from multiple libraries).

Our main index component sizes:

  fdt  17GB
  fdx  91MB
  tis  82MB
  frq  45MB
  prx  11MB
  tii  1.2MB

We have a separate Lucene index (not discussed further) which stores the MARC records. Each document has many fields. We'll probably reduce the number after we decide on the best search strategies, but lots of fields gives us lots of flexibility while testing search and ranking strategies.

Stored and unindexed fields, used for summary results: display title, display author, display publication details, holdingsCount (number of libraries holding).

Tokenized indices: title, author, subject, genre, keyword (all text).

Keyword (untokenized) indices: title, author, subject, genre, audience, Dewey/LC classification, language, isbn/issn, publication date (date range code), unique bibliographic id.

Wildcard tokenized indices, created by a custom stub analyzer which reduces a term to its first few characters: title, author, subject, keyword.

Field boosts are set for some fields. For example, title, sub title, series title, and component title are all stored as title but with different field boosts (as a match on normal title is deemed more relevant than a match on series title). The document boost is set to the sqrt of the holdingsCount (favouring popular resources).

The user interface supports searching and refining searches on specific fields, but the most common search is created from a single google-style search box. Here's a typical query generated from a 2-word search:

  +(titleWords:franz kafka^4.0 authorWords:franz kafka^3.0 subjectWords:franz kafka^3.0 keywords:franz kafka^1.4 title:franz kafka^4.0 (+titleWords:franz +titleWords:kafka^3.0) author:franz kafka^3.0 +authorWords:franz +authorWords:kafka^2.0) subject:franz kafka^3.0 (+subjectWords:franz +subjectWords:kafka^1.5) (+genreWords:franz +genreWords:kafka^2.0) (+keywords:franz +keywords:kafka) (+titleWildcard:fra +titleWildcard:kaf^0.7) (+authorWildcard:fra +authorWildcard:kaf^0.7) (+subjectWildcard:fra +subjectWildcard:kaf^0.7) (+keywordWildcard:fra +keywordWildcard:kaf^0.2) )

It generated 1635 hits. We then read the first 700 documents in the hit list and extract the date, subject, author, genre, Dewey/LC classification and audience fields for each, accumulating the popularity of each. Using this data, for each of the subject, author, genre, Dewey/LC and audience categories, we find the 30 most popular field values, and for each of these we query the index to find their frequency in the entire index.

We then render the first 100 document results (title, author, publication details, holdings) and the top 30 for each of subject, author, genre, Dewey/LC and audience, ordering each list by the popularity of the term in the hit results (sample of the first 700) and rendering the size of the text based on the frequency of the term in the entire database (a bit like the Flickr tag popularity lists). We also render a graph of hit results by date range.

The initial search is very quick - typically a small number of tens of milliseconds. The clustering takes much longer - reading up to 700 records, extracting all those fields, sorting to get the top 30 of each field category, and looking up the frequency of each term in the database.

The test machine was a SunFire 440 with 2 x 1.593GHz UltraSPARC-IIIi processors and 8GB of memory, running Solaris 9, Java 1.5 in 64-bit mode, and Jetty. The Lucene data directory is stored on a local 10K SCSI disk.

The benchmark consisted of running 13,142 representative and unique search phrases collected from another system. The search phrases are unsorted. The client (testing) system is run on another unloaded computer and was configured to run a varying number of threads representing different loads. The results discussed here were produced with 3
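The clustering step Kent describes — accumulate field-value popularity over a sample of hits, then keep the 30 most popular values per category — can be sketched in a few lines. This is a self-contained model of the counting step only; the names are illustrative and none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FacetTopN {
    // Count how often each field value occurs in the sampled hits
    // (the first 700 docs in Kent's case), then return the n most popular.
    public static List<String> topN(List<String> fieldValues, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : fieldValues) counts.merge(v, 1, Integer::sum);
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> subjects = new ArrayList<>(
            List.of("Fiction", "Drama", "Fiction", "Poetry", "Fiction", "Drama"));
        System.out.println(topN(subjects, 2)); // [Fiction, Drama]
    }
}
```

Looking up each winning value's frequency in the whole index (for the Flickr-style text sizing) would then be one term-frequency query per value, as the post describes.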
Re: Throughput doesn't increase when using more concurrent threads
I did some additional testing with Chris's patch and mine (based on Doug's note) vs. no patch, and found that all 3 produced the same throughput - about 330 qps - over a longer period. So, there seems to be a point of diminishing returns in adding more cpus. The dual-core Opterons (8 cpu) still win handily at 400 qps.

Peter
Re: Non scoring search
I experimented with this by using a Similarity class that returns a constant (1) for all values and found that it had no noticeable effect on query performance.

Peter

On 12/6/05, Chris Hostetter [EMAIL PROTECTED] wrote:

: I was wondering if there is a standard way to retrieve documents WITHOUT
: scoring and sorting them. I need a list of documents that contain certain
: terms but I do not need them sorted or scored.

Using Filters directly (ie: constructing them, and then calling the bits() method yourself) is the most straightforward way I know of to achieve what you describe.

-Hoss

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
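Hoss's suggestion — call the filter's bits() yourself and skip scoring entirely — can be modeled with a plain java.util.BitSet standing in for the result of Filter.bits(reader). This is an illustrative sketch, not Lucene code: it just walks the set bits to collect matching doc ids, with no scorer or priority queue involved.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class FilterOnlySearch {
    // Collect matching doc ids directly from a bitset, with no scoring.
    // In Lucene 1.x terms, 'bits' would come from Filter.bits(reader).
    public static List<Integer> matchingDocs(BitSet bits) {
        List<Integer> docs = new ArrayList<>();
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            docs.add(doc);
        }
        return docs;
    }
}
```

This is why it beats a constant Similarity: a constant score still pays for scorer iteration and hit collection, whereas the filter path never computes scores at all.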
Re: Throughput doesn't increase when using more concurrent threads
Out of interest, does indexing time speed up much on 64-bit hardware?

I was able to speed up indexing on a 64-bit platform by taking advantage of the larger address space to parallelize the indexing process. One thread creates index segments with a set of RAMDirectories and another thread merges the segments to disk with 'addIndexes'. This resulted in a speed improvement of 27%.

Peter

On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote:

Peter Keegan wrote:
: I tried the AMD64 JVM from Sun with MMapDirectory and I'm now getting 250 queries/sec and excellent cpu utilization (equal concurrency on all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't aware of it.

Wow. That's fast.

Out of interest, does indexing time speed up much on 64-bit hardware? I'm particularly interested in this side of things because for our own application, any query response under half a second is good enough, but the indexing side could always be faster. :-)

Daniel

--
Daniel Noll
Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699  Fax: (02) 9212 6902

This message is intended only for the named recipient. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this message or attachment is strictly prohibited.
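Peter's pipelined scheme — one thread building in-memory segments while another merges completed segments to disk — is a classic producer/consumer handoff. The sketch below models it with strings standing in for RAMDirectory segments and a list standing in for writer.addIndexes; all names are illustrative, none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PipelinedIndexing {
    private static final String DONE = "__done__"; // end-of-stream marker

    // One thread "builds" segments and hands them off; a second thread
    // "merges" them, so CPU-bound building overlaps disk-bound merging.
    public static List<String> run(List<String> segments) {
        BlockingQueue<String> handoff = new LinkedBlockingQueue<>();
        List<String> merged = Collections.synchronizedList(new ArrayList<>());

        Thread merger = new Thread(() -> {
            try {
                String seg;
                while (!(seg = handoff.take()).equals(DONE)) {
                    merged.add(seg); // stands in for addIndexes(segment)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        merger.start();

        try {
            for (String seg : segments) handoff.put(seg); // build + hand off
            handoff.put(DONE);
            merger.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return merged;
    }
}
```

The 27% figure Peter reports suggests building and merging overlap substantially but are not perfectly balanced; the slower stage still bounds total throughput.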
Re: MultiReader and MultiSearcher
Yonik,

Could you explain why an IndexSearcher constructed from multiple readers is faster than a MultiSearcher constructed from the same readers?

Thanks,
Peter

On 4/10/06, Yonik Seeley [EMAIL PROTECTED] wrote: On 4/10/06, oramas martín [EMAIL PROTECTED] wrote:
: Is there any performance (or other) difference between using an IndexSearcher initialized with a MultiReader instead of using a MultiSearcher?

Yes, the IndexSearcher(MultiReader) solution will be faster.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
Re: MultiReader and MultiSearcher
Does this mean that MultiReader doesn't merge the search results and sort them as if there were only one index? If not, does it simply concatenate the results?

Peter

On 4/11/06, Yonik Seeley [EMAIL PROTECTED] wrote: On 4/11/06, Peter Keegan [EMAIL PROTECTED] wrote:
: Could you explain why an IndexSearcher constructed from multiple readers is faster than a MultiSearcher constructed from the same readers?

The convergence layer is a level lower for a MultiReader vs a MultiSearcher. A MultiReader is an IndexReader, and Queries (Scorers) run directly against it, since it has efficient TermEnum and TermDocs implementations. A MultiSearcher must do independent searches against subsearchers, retrieving the top n matches, and maintain an additional priority queue to merge the results to get the global top n matches. The implementation of createWeight is also heavier (heh..). I've never measured the performance difference, and it's probably relatively small for most queries.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
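The extra merge step Yonik attributes to MultiSearcher can be sketched as merging several independently score-sorted top-n lists through one more priority queue. This is a simplified model using raw scores, not Lucene's ScoreDoc machinery; a MultiReader avoids this layer entirely because a single search runs over one logical doc space.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNMerge {
    // Each sub-searcher contributes its own score-sorted top-n hits; a
    // max-heap merges them into the global top n, which is the additional
    // work a MultiSearcher does on every search.
    public static List<Double> mergeTopN(List<List<Double>> perSearcher, int n) {
        PriorityQueue<Double> heap = new PriorityQueue<>(Comparator.reverseOrder());
        for (List<Double> hits : perSearcher) heap.addAll(hits);
        List<Double> top = new ArrayList<>();
        for (int i = 0; i < n && !heap.isEmpty(); i++) top.add(heap.poll());
        return top;
    }
}
```

For most queries this merge is cheap relative to scoring, which matches Yonik's remark that the measured difference is probably small.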
Re: MultiReader and MultiSearcher
Correction: the doc order is fine. My test was based on the existing 'TestMultiSearcher', and I hadn't noticed the swapping of the index order here:

  // VITAL STEP: adding the searcher for the empty index first, before
  // the searcher for the populated index
  searchers[0] = new IndexSearcher(indexStoreB);
  searchers[1] = new IndexSearcher(indexStoreA);

Sorry about that,
Peter

On 4/11/06, Doug Cutting [EMAIL PROTECTED] wrote: Peter Keegan wrote:
: Oops. I meant to say: Does this mean that an IndexSearcher constructed from a MultiReader doesn't merge the search results and sort the results as if there was only one index?

It doesn't have to, since a MultiReader *is* a single index.

: A quick test indicates that it does merge the results properly, however there is a difference in the order of documents with equal score. The MultiSearcher returns the higher doc first, but the IndexSearcher returns the lowest doc first.

I think docs of equal score are supposed to be returned in the order they were indexed (lower doc id first). If that's the case it is a bug. If you can reproduce this in a standalone test, please submit it to Jira.

Doug
Re: question about custom sort method
Suppose I have a custom sorting 'DocScoreComparator' for computing distances on each search hit from a specified coordinate (similar to the DistanceComparatorSource example in LIA). Assume that the 'specified coordinate' is different for each query. This means a new custom comparator must be created for each query, which is ok. However, Lucene caches the comparator even though it will never be reused. This could result in heavy memory usage if many queries are performed before the IndexReader is updated. Is there any way to avoid having lucene cache the custom sorting objects?
Re: MMapDirectory vs RAMDirectory
I was able to improve the behavior by setting the mapped ByteBuffer to null in the close method of MMapIndexInput. This seems to be a strong enough 'suggestion' to the gc, as I can see the references go away with process explorer, and the index files can usually be deleted. Occasionally, a reference to the '.tis' file remains.

Peter

On 6/5/06, Daniel Noll [EMAIL PROTECTED] wrote: Peter Keegan wrote:
: There is no 'unmap' method, so my understanding is that the file mapping is valid until the underlying buffer is garbage-collected. However, forcing the gc doesn't help.

You're half right. The file mapping is indeed valid until the underlying buffer is garbage collected, but you can't force the GC -- there is no API which does that. Note the wording in the Javadoc for System.gc(): Calling the gc method **suggests** that the Java Virtual Machine expend effort toward recycling unused objects in order to make the memory they currently occupy available for quick reuse. When control returns from the method call, the Java Virtual Machine has made a best effort to reclaim space from all discarded objects.

: The file deletes don't fail on Linux, but I'm wondering if there is still a memory leak?

Linux allows you to delete a file while someone has the file descriptor open, but the file descriptor will remain valid (i.e. the delete doesn't actually occur) until everyone releases the file descriptor. I ran into similar issues when working on other things, and eventually ended up switching to using a RandomAccessFile, as those can be closed. Otherwise you're right -- the workaround is to routinely try to delete the file.

Daniel
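Peter's workaround hinges on a simple point: if the input object holds the only strong reference to its mapped buffer, nulling that field in close() makes the mapping collectable (and the file deletable) much sooner. The sketch below models this in pure Java, with a byte[] standing in for the MappedByteBuffer; the class and field names are illustrative, not Lucene's.

```java
public class ClosableMapped {
    // Stands in for the MappedByteBuffer held by MMapIndexInput.
    private byte[] buffer;

    public ClosableMapped(byte[] data) {
        this.buffer = data;
    }

    public byte read(int i) {
        if (buffer == null) throw new IllegalStateException("closed");
        return buffer[i];
    }

    // Dropping the strong reference is the whole trick: the GC can now
    // reclaim the mapping on its own schedule; nothing forces it.
    public void close() {
        buffer = null;
    }
}
```

As Daniel notes, this is still only a suggestion to the GC — there is no API to force an unmap — which is consistent with Peter's observation that a stray '.tis' reference occasionally survives.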
Re: Aggregating category hits
I compared Solr's DocSetHitCollector (counting bitset intersections to get facet counts) with a different approach that uses a custom hit collector which tests each docid hit (bit) against each facet's bitset and increments a count in a histogram. My assumption was that for queries with few hits, this would be much faster than always doing bitset intersections/cardinality for every facet. However, my throughput testing shows that the Solr method is at least 50% faster than mine. I'm seeing a big win from the use of the HashDocSet for lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K seems to provide optimal performance. I'm looking forward to trying this with OpenBitSet.

Peter

On 5/29/06, z shalev [EMAIL PROTECTED] wrote:

I know I'm a little late replying to this thread, but, in my humble opinion, the best way to aggregate values (not necessarily terms, but whole values in fields) is as follows:

Startup stage:
- For each field you would like to aggregate, create a hashmap.
- Open an index reader and run through all the docs, getting the values to be aggregated from the fields of each doc.
- Create a hashcode for each value from each field collected. The hashcode should have some sort of prefix indicating which field it's from (for example: 1 = author, 2 = ) and hence which hash it is stored in (at retrieval time, this prefix can be used to easily retrieve the value from the correct hash). Place the hashcode/value in the appropriate hash.
- Create an arraylist. At index X in the arraylist, place an int array of all the hashcodes associated with doc id X. So, for example: if I have doc id 0 which contains the value "william shakespeare" and the value "1797", the arraylist at index 0 will have an int array containing 2 values (the 2 hashcodes of shakespeare and 1797).

Run time:
- Receive the hits and iterate through the doc ids, aggregating the values with direct access into the arraylist (for doc id 10, go to index 10 in the arraylist to retrieve the array of hashcodes) and lookups into the hashmaps.

I tested this today on a small index of approx 400,000 docs (1GB of data), but I ran queries returning over 100,000 results. My response time was about 550 milliseconds on large (over 100,000) result sets. Another point: this method should be scalable for much larger indexes as well, as it is linear in the result set size and not the index size (which is a HUGE bonus).

If anyone wants the code let me know,

Marvin Humphrey [EMAIL PROTECTED] wrote:

Thanks, all. The field cache and the bitsets both seem like good options until the collection grows too large, provided that the index does not need to be updated very frequently. Then for large collections, there's statistical sampling. Any of those options seems preferable to retrieving all docs all the time.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
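The two facet-counting strategies Peter compares can be modeled side by side with java.util.BitSet (Solr's DocSet/OpenBitSet classes are richer, but the arithmetic is the same): the Solr-style approach intersects the query's doc set with each facet's set and takes the cardinality, while the histogram approach walks each hit and increments a counter per matching facet.

```java
import java.util.BitSet;

public class FacetCounts {
    // Solr-style: one intersection + cardinality per facet. Cost scales
    // with index size (bitset length), not with the number of hits.
    public static int intersectionCount(BitSet hits, BitSet facet) {
        BitSet tmp = (BitSet) hits.clone(); // clone so 'hits' is reusable
        tmp.and(facet);
        return tmp.cardinality();
    }

    // Histogram-style: one facet-membership test per (hit, facet) pair.
    // Cost scales with hits x facets, which looks cheaper for small result
    // sets but loses to word-at-a-time bitset intersection in practice.
    public static int[] histogramCounts(BitSet hits, BitSet[] facets) {
        int[] counts = new int[facets.length];
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            for (int f = 0; f < facets.length; f++) {
                if (facets[f].get(doc)) counts[f]++;
            }
        }
        return counts;
    }
}
```

The intersection path works 64 documents per machine word, which helps explain why it beat the per-hit histogram even on queries with modest hit counts, and why a HashDocSet (sparse representation) wins only below a small-size threshold.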
Re: Aggregating category hits
I'm seeing query throughput of approx. 290 qps with OpenBitSet vs. 270 qps with BitSet. I had to reduce the max HashDocSet size to 2K-3K (from 10K-20K) to get the optimal tradeoff.

no. docs in index: 730,000
average no. results returned: 40
average response time: 50 msec (15-20 for counting facets)
no. facets: 100 on every query

I'm not using the Solr server, as we have already developed an infrastructure.

Peter

On 6/10/06, Yonik Seeley [EMAIL PROTECTED] wrote: On 6/9/06, Peter Keegan [EMAIL PROTECTED] wrote: However, my throughput testing shows that the Solr method is at least 50% faster than mine. I'm seeing a big win with the use of the HashDocSet for lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K seems to provide optimal performance. Interesting... how many documents are in your collection? It would probably be nice to make the HashDocSet cut-off dynamic rather than fixed. Are you using Solr, or just some of its code? I'm looking forward to trying this with OpenBitSet. I checked in the OpenBitSet changes today. I imagine this will lower the optimal max HashDocSet size for performance a little. You might not see much performance improvement if most of the intersections involved a HashDocSet... the OpenBitSet improvements only kick in with bitset-bitset intersection counts. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server
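The two counting strategies being compared can be sketched with plain java.util.BitSet standing in for Solr's DocSet/OpenBitSet (illustrative only; Solr's actual classes and cost profiles differ). Intersection counting does work proportional to maxDoc per facet regardless of hit count, while the histogram approach is proportional to numHits * numFacets, which is why it looks attractive for sparse result sets:

```java
import java.util.BitSet;

// Two ways to compute per-facet hit counts for one query.
public class FacetCount {
    // Solr-style: intersect the query's doc bitset with each facet's bitset
    static int[] byIntersection(BitSet queryDocs, BitSet[] facets) {
        int[] counts = new int[facets.length];
        for (int f = 0; f < facets.length; f++) {
            BitSet tmp = (BitSet) queryDocs.clone();
            tmp.and(facets[f]);                 // intersection
            counts[f] = tmp.cardinality();      // count of common docs
        }
        return counts;
    }

    // Histogram-style: test each hit docid against each facet's bitset
    static int[] byHistogram(int[] hitDocIds, BitSet[] facets) {
        int[] counts = new int[facets.length];
        for (int doc : hitDocIds)
            for (int f = 0; f < facets.length; f++)
                if (facets[f].get(doc)) counts[f]++;
        return counts;
    }
}
```

Both produce identical counts; the throughput difference reported in the thread comes from constant factors (word-at-a-time AND/popcount vs. per-hit bit tests), not from the results.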
Re: Does more memory help Lucene?
See my note about overlapping indexing documents with merging: http://www.gossamer-threads.com/lists/lucene/java-user/34188?search_string=%2Bkeegan%20%2Baddindexes;#34188 Peter

On 6/12/06, Michael D. Curtin [EMAIL PROTECTED] wrote: Nadav Har'El wrote: Otis Gospodnetic [EMAIL PROTECTED] wrote on 12/06/2006 04:36:45 PM: Nadav, Look up one of my onjava.com Lucene articles, where I talk about this. You may also want to tell Lucene to merge segments on disk less frequently, which is what mergeFactor does.

Thanks. Can you please point me to the appropriate article (I found one from March 2003, but I'm not sure if it's the one you meant). About mergeFactor() - thanks for the hint, I'll try changing it too (I used 20 so far), and see if it helps performance. Still, there is one thing about mergeFactor(), and the merge process, that I don't understand: does having more memory help this process at all? Does having a large mergeFactor() actually require more memory? The reason I'm asking is that I'm still trying to figure out whether having a machine with huge RAM actually helps Lucene, or not.

I'm using 1.4.3, so I don't know if things are the same in 2.0. Anyhow, I found a significant performance benefit from changing minMergeDocs and mergeFactor from their defaults of 10 and 10 to 1,000 and 70, respectively. The improvement seems to come from a reduction in the number of merges as the index is created. Each merge involves reading and writing a bunch of data already indexed, sometimes everything indexed so far, so it's easy to see how reducing the number of merges reduces the overall indexing time. I can't remember why, but I also saw little benefit to increasing minMergeDocs beyond 1,000. A lot of time was being spent in the first merge, which takes a bunch of one-document segments in a RAMDirectory and merges them into the first-level segments on disk.
I hacked the code to make this first merge (and ONLY the first merge) operate on minMergeDocs * mergeFactor documents instead, which greatly increased the RAM consumption but reduced the indexing time. In detail, what I started with was:

a. read minMergeDocs of docs, creating one-doc segments in RAM
b. read those one-doc RAM segments and merge them
c. write the merged results to a disk segment
...
i. read mergeFactor first-level disk segments and merge them
j. write second-level segments to disk
...
p. normal disk-based merging thereafter, as necessary

And what I ended up with was:

A. read minMergeDocs * mergeFactor docs, and remember them in RAM
B. write a segment from all the remembered RAM docs (a modified merge)
...
F. normal disk-based merging thereafter, as necessary

In essence, I eliminated that first-level merge, one that involved lots and lots of teeny-weeny I/O operations that were very inefficient. In my case, steps A and B worked on 70,000 documents instead of 1,000. Remembering all those docs required a lot of RAM (almost 2GB), but it almost tripled indexing performance. Later, I had to knock the 70 down to 35 (maybe because my docs got a lot bigger, but I don't remember now), but you get the idea. I couldn't use a mergeFactor of 70,000 because that's way more file descriptors than I could have without recompiling the kernel (I seem to remember my limit being 1,024, and each segment took 14 file descriptors). Hope it helps. --MDC
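The "fewer merges means less rewriting" argument can be made concrete with a toy model. This is not Lucene's actual merge policy, just a rough simulation of the level-by-level pattern described above: flush a segment every minMergeDocs documents, and whenever mergeFactor equal-sized segments accumulate, merge them (rewriting every doc they contain):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of logarithmic merging: returns the total number of docs
// rewritten by merges while building an index, a rough proxy for merge I/O.
public class MergeCost {
    static long docsRewritten(long totalDocs, int minMergeDocs, int mergeFactor) {
        List<Long> segs = new ArrayList<>();
        long written = 0;
        for (long d = 0; d < totalDocs; d += minMergeDocs) {
            segs.add((long) minMergeDocs);          // flush a buffered segment
            // cascade: merge runs of mergeFactor equal-sized segments
            while (segs.size() >= mergeFactor && lastEqual(segs, mergeFactor)) {
                long sum = 0;
                for (int i = 0; i < mergeFactor; i++)
                    sum += segs.remove(segs.size() - 1);
                segs.add(sum);
                written += sum;                     // every doc gets rewritten
            }
        }
        return written;
    }

    private static boolean lastEqual(List<Long> segs, int k) {
        long size = segs.get(segs.size() - 1);
        for (int i = segs.size() - k; i < segs.size(); i++)
            if (segs.get(i) != size) return false;
        return true;
    }
}
```

Under this model, 1,000 docs with the old 1.4 defaults (minMergeDocs=10, mergeFactor=10) rewrite 2,000 docs during merging, while minMergeDocs=1,000 rewrites none until the buffer spills, which matches the intuition in the post that buffering more docs in RAM before the first merge saves a large amount of small-segment I/O.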
Re: Aggregating category hits
The performance results in my previous posting were based on an implementation that performs 2 searches, one for getting 'Hits' and another for getting the BitSet. I reimplemented this in one search using the code in 'SolrIndexSearcher.getDocListAndSetNC' and I'm now getting throughput of 350-375 qps. This is great stuff Solr guys! I'd love to see the DocSet and DocList features added to Lucene's IndexSearcher. Peter On 6/12/06, Peter Keegan [EMAIL PROTECTED] wrote: I'm seeing query throughput of approx. 290 qps with OpenBitSet vs. 270 with BitSet. I had to reduce the max. HashDocSet size to 2K - 3K (from 10K-20K) to get optimal tradeoff. no. docs in index: 730,000 average no. results returned: 40 average response time: 50 msec (15-20 for counting facets) no. facets: 100 on every query I'm not using the Solr server as we have already developed an infrastructure. Peter On 6/10/06, Yonik Seeley [EMAIL PROTECTED] wrote: On 6/9/06, Peter Keegan [EMAIL PROTECTED] wrote: However, my throughput testing shows that the Solr method is at least 50% faster than mine. I'm seeing a big win with the use of the HashDocSet for lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K seems to provide optimal performance. Interesting... how many documents are in your collection? It would prob be nice to make the HashDocSet cutt-off dynamic rather than fixed. Are you using Solr, or just some of it's code? I'm looking forward to trying this with OpenBitSet. I checked in the OpenBitSet changes today. I imagine this will lower the optimal max HashDocSet size for performance a little. You might not see much performance improvement if most of the intersections involved a HashDocSet... the OpenBitSet improvements only kick in with bitset-bitset intersection counts. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene 2.0.1 release date
This makes it relatively safe for people to grab a snapshot of the trunk with less concern about latent bugs. I think the concern is that if we start doing this stuff on trunk now, people that are accustomed to snapping from the trunk might be surprised, and not in a good way.

+1 on this. There are some great performance improvements in 2.0.1.

Peter

On 10/17/06, Steven Parkes [EMAIL PROTECTED] wrote: I think the idea is that 2.0.1 would be a patch-fix release from the branch created at the 2.0 release. This release would incorporate only back-ported high-impact patches, where high-impact is defined by the community. Certainly security vulnerabilities would be included. As Otis said, to date, nobody seems to have raised any issues to that level. 2.1 will include all the patches and new features that have been committed since 2.0; there've been a number of these. But releases are done pretty ad hoc at this point and there hasn't been anyone who has expressed strong interest in (i.e., lobbied for) a release. There was a little discussion on this topic at the ApacheCon BOF. For a number of reasons, the Lucene Java trunk has been kept pretty stable, with relatively few large changes. This makes it relatively safe for people to grab a snapshot of the trunk with less concern about latent bugs. I don't know how many people/projects are doing this rather than sticking with 2.0. Keeping the trunk stable doesn't provide an obvious place to start working on things that people may want to work on and share but at the same time want to allow to percolate for a while. I think the concern is that if we start doing this stuff on trunk now, people that are accustomed to snapping from the trunk might be surprised, and not in a good way. Nobody wants that. So releases can be about both what people want (getting features out) and allowing a bit more instability in trunk. That is, if the community wants that. Food for thought and/or discussion?
-----Original Message----- From: George Aroush [mailto:[EMAIL PROTECTED] Sent: Sunday, October 15, 2006 5:15 PM To: java-user@lucene.apache.org Subject: RE: Lucene 2.0.1 release date

Thanks for the reply Otis. I looked at the CHANGES.txt file and saw quite a bit of changes. For my port from Java to C#, I can't rely on the trunk code, as it (to my knowledge) changes on a monthly basis if not weekly. What I need is an official release so that I can use it as the port point. Regards, -- George Aroush

-----Original Message----- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Sunday, October 15, 2006 12:41 AM To: java-user@lucene.apache.org Subject: Re: Lucene 2.0.1 release date

I'd have to check CHANGES.txt, but I don't think that many bugs have been fixed and not that many new features added that anyone is itching for a new release. Otis

----- Original Message ----- From: George Aroush [EMAIL PROTECTED] To: java-dev@lucene.apache.org; java-user@lucene.apache.org Sent: Saturday, October 14, 2006 10:32:47 AM Subject: RE: Lucene 2.0.1 release date

Hi folks, Sorry for reposting this question (see original email below), this time to both mailing lists. If anyone can tell me what the plan is for the Lucene 2.0.1 release, I would appreciate it very much. As some of you may know, I am the porter of Lucene to Lucene.Net; knowing when 2.0.1 will be released will help me plan things out. Regards, -- George Aroush

-----Original Message----- From: George Aroush [mailto:[EMAIL PROTECTED] Sent: Thursday, October 12, 2006 12:07 AM To: java-dev@lucene.apache.org Subject: Lucene 2.0.1 release date

Hi folks, What's the plan for Lucene 2.0.1 release date? Thanks!
-- George Aroush
Announcement: Lucene powering Monster job search index (Beta)
I am pleased to announce the launch of Monster's new job search Beta web site, powered by Lucene, at: http://jobsearch.beta.monster.com (notice the Lucene logo at the bottom of the page!). The jobs index is implemented with Java Lucene 2.0 on 64-bit Windows (AMD and Intel processors). Here are some of the new features:

1. 'Improve your search by'... The job search results page allows you to browse and 'drill down' through the results by job category, status, type and salary. The number of matching jobs in each facet is displayed. There will likely be many more facets to browse by in the future. This feature is currently implemented with a custom HitCollector and the DocSet class from Solr.

2. 'More like this' Find more jobs like the one you see by clicking on the 'MORE LIKE THIS' link, which is visible when you hover the mouse over the job title. This feature is implemented with Lucene's term vectors and the 'MoreLikeThis' contribution class. If you are in 'detailed view', the term vectors from the job description are used. In 'brief' view, the job title's term vectors are used.

3. 'Related Titles' When you do a 'keywords' search, click on a 'related titles' link to filter your search by similar job titles. This feature is implemented via a separate Lucene.Net index.

4. Sort by 'Miles' Find jobs close to you via zip code/radius search. In the search results page, click on the 'Miles' column to sort the results by distance from your zip code/radius. This custom sorting feature is implemented via Lucene's 'SortComparatorSource' interface.

5. Search by date, salary, distance. Find jobs posted in the last day (or 2, 3, etc.) or by salary range or distance. Numeric range search is one of Lucene's weak points (performance-wise), so we have implemented this with a custom HitCollector and an extension to the Lucene index files that stores the numeric field values for all documents.
It is important to point out that this has all been implemented with the stock Lucene 2.0 library. No code changes were made to the Lucene core. If you have any feedback regarding the UI, please use the link on the web page (send us your feedback). You can hit me with any other questions/comments. Peter
Re: Announcement: Lucene powering Monster job search index (Beta)
On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote: Hi, Peter, Really great job! Thanks. (I'll tell the team) I am interested to know how you implemented 4. Sort by 'Miles'. For example, if starting from a zip code, how to match items within 20 miles? I can tell you how we use Lucene to accomplish this. At indexing time, each job's location is indexed as a special field. How you represent the location is up to you. Each time a new index is built the location data for all documents in the index are fetched via TermEnum and TermDocs. This is practical because the searcher refresh is done at predictable times. At query time, a custom SortComparatorSource is created, using the 'reference' location (the zip/radius). The 'compare' method performs the calculation to compare the 2 documents' location values (saved from above) to the reference location. I believe this can also be accomplished with Solr's FunctionQuery, but I haven't tried that yet. Peter -- Chris Lu - Instant Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com On 10/27/06, Peter Keegan [EMAIL PROTECTED] wrote: I am pleased to announce the launch of Monster's new job search Beta web site, powered by Lucene, at: http://jobsearch.beta.monster.com (notice the Lucene logo at the bottom of the page!). The jobs index is implemented with Java Lucene 2.0 on 64-bit Windows (AMD and Intel processors) Here are some of the new features: 1. 'Improve your search by'... The job search results page allows you to browse and 'drill down' through the results by job category, status, type and salary. The number of matching jobs in each facet is displayed. There will likely be many more facets to browse by in the future. This feature is currently implemented with a custom HitCollector and the DocSet class from Solr. 2. 'More like this' Find more jobs like the one you see by clicking on the 'MORE LIKE THIS' link, which is visible when you hover the mouse over the job title. 
This feature is implemented with Lucene's term vectors and the 'MoreLikeThis' contribution class. If you are in 'detailed view', the term vectors from the job description are used. In 'brief' view, the job title's term vectors are used. 3. 'Related Titles' When you do a 'keywords' search, click on a 'related titles' link to filter you search by similar job titles. This feature is implemented via a separate Lucene.Net index. 4. Sort by 'Miles' Find jobs close to you via zip code/radius search. In the search results page, click on the 'Miles' column to sort the results by distance from your zip code/radius. This custom sorting feature is implemented via Lucene's 'SortComparatorSource' interface. 5. Search by date, salary, distance. Find jobs posted in the last day (or 2,3, etc) or by salary range or distance. Numeric range search is one of Lucene's weak points (performance-wise) so we have implemented this with a custom HitCollector and an extension to the Lucene index files that stores the numeric field values for all documents. It is important to point out that this has all been implemented with the stock Lucene 2.0 library. No code changes were made to the Lucene core. If you have any feedback regarding the UI, please use the link on the web page (send us your feedback). You can hit me with any other questions/comments. Peter - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
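The sort-by-distance mechanics Peter describes (per-doc locations cached at index-load time, then a comparator against the reference point at query time) can be sketched without the Lucene SortComparatorSource plumbing. Names are mine, and the coordinates are arbitrary planar units, not a real geo representation:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: docLoc[docId] holds the cached (x, y) location for each doc,
// loaded once per index refresh; hits are then ordered by distance to the
// query's reference point. Squared distance is sufficient for ordering,
// so the sqrt is skipped.
public class DistanceSort {
    static Integer[] sortByDistance(double[][] docLoc, double refX, double refY,
                                    Integer[] hits) {
        Comparator<Integer> byDist = Comparator.comparingDouble(doc -> {
            double dx = docLoc[doc][0] - refX;
            double dy = docLoc[doc][1] - refY;
            return dx * dx + dy * dy;           // squared distance
        });
        Integer[] sorted = hits.clone();
        Arrays.sort(sorted, byDist);
        return sorted;
    }
}
```

In the real implementation the comparator would be wrapped in a SortComparatorSource whose compare method does exactly this lookup-and-compare; score and date would act as tie-breakers, as described later in the thread.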
Re: Announcement: Lucene powering Monster job search index (Beta)
Otis,

The Lucene components for this beta are running on 4 dual-core AMD Opteron (2.6GHz) processors, for a total of 8 CPUs. It has 32GB RAM, although 16GB would probably suffice. The query rate is currently quite low, probably because of the low visibility of the beta page. We haven't measured QPS rates for this configuration yet, but if you look at some of my previous posts, you'll see some QPS data on somewhat similar hardware. I think that actual rates will be lower, though, because the complexity of the queries, counting, sorting, etc. has increased.

Peter

On 10/28/06, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi, --- Peter Keegan [EMAIL PROTECTED] wrote: On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote: Hi, Peter, Really great job! Thanks. (I'll tell the team) If it's not a secret, can you tell us a bit more about what's behind the search in terms of hardware, and how much pounding that hardware takes in terms of QPS? People always ask about this stuff. Thanks, Otis
Re: Announcement: Lucene powering Monster job search index (Beta)
Alex,

I like your suggestion (I've found myself wondering what the last search was, too), and I've forwarded it to the UI developer. Thanks, Peter

On 10/29/06, Alexandru Popescu [EMAIL PROTECTED] wrote: Peter, it looks impressive. Congrats! A small suggestion, though: after performing a search, the filtering criteria are not displayed anywhere. I guess it would make sense to write them in a read-only form somewhere on the result pages: Jobs 1-50 of 7896 matches to Jobs 1-50 of 7896 matching criteria (a small hidden element showing the criteria). ./alex -- .w( the_mindstorm )p. On 10/29/06, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi, --- Peter Keegan [EMAIL PROTECTED] wrote: On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote: Hi, Peter, Really great job! Thanks. (I'll tell the team) If it's not a secret, can you tell us a bit more about what's behind the search in terms of hardware, and how much pounding that hardware takes in terms of QPS? People always ask about this stuff. Thanks, Otis
Re: Announcement: Lucene powering Monster job search index (Beta)
Joe,

Fields with numeric values are stored in a separate file as binary values in an internal format. Lucene is unaware of this file and unaware of the range expression in the query. The range expression is parsed outside of Lucene and used in a custom HitCollector to filter out documents that aren't in the requested range(s). A goal was to do this without having to modify Lucene. Our scheme is pretty efficient, but it is not very general-purpose in its current form.

Peter

On 10/30/06, Joe Shaw [EMAIL PROTECTED] wrote: Hi Peter, On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote: Numeric range search is one of Lucene's weak points (performance-wise) so we have implemented this with a custom HitCollector and an extension to the Lucene index files that stores the numeric field values for all documents. It is important to point out that this has all been implemented with the stock Lucene 2.0 library. No code changes were made to the Lucene core. Can you give some technical details on the extension to the Lucene index files? How did you do it without making any changes to the Lucene core? Thanks, Joe
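A minimal sketch of the idea, with hypothetical names and no Lucene dependency: the numeric field values live in a plain int[] loaded from the side file, and any hit whose value falls outside the externally parsed range is dropped before it is collected:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of range filtering inside a hit collector: fieldValues is the
// side array of per-doc numeric values (one slot per doc id), and [lo, hi]
// is the inclusive range parsed from the query outside of Lucene.
public class RangeFilteringCollector {
    private final int[] fieldValues;            // indexed by doc id
    private final int lo, hi;                   // inclusive bounds
    private final List<Integer> kept = new ArrayList<>();

    RangeFilteringCollector(int[] fieldValues, int lo, int hi) {
        this.fieldValues = fieldValues;
        this.lo = lo;
        this.hi = hi;
    }

    // would be invoked once per matching doc by the search loop
    void collect(int doc) {
        int v = fieldValues[doc];
        if (v >= lo && v <= hi) kept.add(doc);  // integer compare, no index access
    }

    List<Integer> hits() { return kept; }
}
```

The appeal of this layout is that the per-hit work is a single array lookup plus two integer comparisons, which is why it can beat a term-based range query over many distinct values.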
Re: Announcement: Lucene powering Monster job search index (Beta)
KEGan,

When you search by 4. Sort by Miles, I suppose the sorting by relevance (of the search keyword) is lost? Since this is implemented using a custom SortComparatorSource.

Sorting by miles becomes the primary sort key; score and date become secondary sort fields (in the case of ties).

Also, I suppose, if FunctionQuery were used, we can make job distance by miles part of the relevancy of the search results?

Yes, this is my understanding of the power of FunctionQuery.

Peter

On 10/30/06, KEGan [EMAIL PROTECTED] wrote: Peter, Congratulations on the beta launch :) If you don't mind, I would like to ask you more about the feature 4. Sort by Miles. When you search by 4. Sort by Miles, I suppose the sorting by relevance (of the search keyword) is lost? Since this is implemented using a custom SortComparatorSource. Also, I suppose, if FunctionQuery were used, we can make job distance by miles part of the relevancy of the search results? Could you comment on or confirm my assertion? Thanks :) On 10/28/06, Peter Keegan [EMAIL PROTECTED] wrote: On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote: Hi, Peter, Really great job! Thanks. (I'll tell the team) I am interested to know how you implemented 4. Sort by 'Miles'. For example, if starting from a zip code, how to match items within 20 miles? I can tell you how we use Lucene to accomplish this. At indexing time, each job's location is indexed as a special field. How you represent the location is up to you. Each time a new index is built the location data for all documents in the index are fetched via TermEnum and TermDocs. This is practical because the searcher refresh is done at predictable times. At query time, a custom SortComparatorSource is created, using the 'reference' location (the zip/radius). The 'compare' method performs the calculation to compare the 2 documents' location values (saved from above) to the reference location. I believe this can also be accomplished with Solr's FunctionQuery, but I haven't tried that yet.
Peter -- Chris Lu - Instant Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com On 10/27/06, Peter Keegan [EMAIL PROTECTED] wrote: I am pleased to announce the launch of Monster's new job search Beta web site, powered by Lucene, at: http://jobsearch.beta.monster.com(notice the Lucene logo at the bottom of the page!). The jobs index is implemented with Java Lucene 2.0 on 64-bit Windows (AMD and Intel processors) Here are some of the new features: 1. 'Improve your search by'... The job search results page allows you to browse and 'drill down' through the results by job category, status, type and salary. The number of matching jobs in each facet is displayed. There will likely be many more facets to browse by in the future. This feature is currently implemented with a custom HitCollector and the DocSet class from Solr. 2. 'More like this' Find more jobs like the one you see by clicking on the 'MORE LIKE THIS' link, which is visible when you hover the mouse over the job title. This feature is implemented with Lucene's term vectors and the 'MoreLikeThis' contribution class. If you are in 'detailed view', the term vectors from the job description are used. In 'brief' view, the job title's term vectors are used. 3. 'Related Titles' When you do a 'keywords' search, click on a 'related titles' link to filter you search by similar job titles. This feature is implemented via a separate Lucene.Net index. 4. Sort by 'Miles' Find jobs close to you via zip code/radius search. In the search results page, click on the 'Miles' column to sort the results by distance from your zip code/radius. This custom sorting feature is implemented via Lucene's 'SortComparatorSource' interface. 5. Search by date, salary, distance. Find jobs posted in the last day (or 2,3, etc) or by salary range or distance. 
Numeric range search is one of Lucene's weak points (performance-wise) so we have implemented this with a custom HitCollector and an extension to the Lucene index files that stores the numeric field values for all documents. It is important to point out that this has all been implemented with the stock Lucene 2.0 library. No code changes were made to the Lucene core. If you have any feedback regarding the UI, please use the link on the web page (send us your feedback). You can hit me with any other questions/comments. Peter - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Announcement: Lucene powering Monster job search index (Beta)
Paramasivam,

Take a look at Solr, in particular the DocSetHitCollector class. The collector simply sets a bit in a BitSet, or saves the docIds in an array (for low hit counts). Solr's BitSet was optimized (by Yonik, I believe) to be faster than Java's BitSet, so this HitCollector is very fast. This is essentially what we are doing for counting.

Peter

On 11/2/06, Paramasivam Srinivasan [EMAIL PROTECTED] wrote: Hi Peter, When I use the custom HitCollector, it affects the application performance. Also, how do you accomplish grouping the results without affecting performance? If possible, give some code snippet for a custom HitCollector. TIA Sri Peter Keegan [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Joe, Fields with numeric values are stored in a separate file as binary values in an internal format. Lucene is unaware of this file and unaware of the range expression in the query. The range expression is parsed outside of Lucene and used in a custom HitCollector to filter out documents that aren't in the requested range(s). A goal was to do this without having to modify Lucene. Our scheme is pretty efficient, but not very general purpose in its current form, though. Peter On 10/30/06, Joe Shaw [EMAIL PROTECTED] wrote: Hi Peter, On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote: Numeric range search is one of Lucene's weak points (performance-wise) so we have implemented this with a custom HitCollector and an extension to the Lucene index files that stores the numeric field values for all documents. It is important to point out that this has all been implemented with the stock Lucene 2.0 library. No code changes were made to the Lucene core. Can you give some technical details on the extension to the Lucene index files? How did you do it without making any changes to the Lucene core?
Thanks, Joe
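The array-then-bitset pattern Peter points at can be sketched as follows. This is a rough approximation with plain java.util.BitSet; Solr's actual DocSetHitCollector and its optimized bitset differ in detail:

```java
import java.util.BitSet;

// Sketch: hits are appended to a small int[] until a cutoff is exceeded,
// then spilled into a BitSet. Small result sets stay in the compact array
// (the HashDocSet case), large ones pay the one-time BitSet allocation.
public class DocSetCollector {
    private final int[] small;    // doc ids for small result sets
    private int n = 0;
    private BitSet bits;          // allocated lazily for big result sets
    private final int maxDoc;

    DocSetCollector(int cutoff, int maxDoc) {
        this.small = new int[cutoff];
        this.maxDoc = maxDoc;
    }

    void collect(int doc) {
        if (bits == null) {
            if (n < small.length) {
                small[n++] = doc;
                return;
            }
            bits = new BitSet(maxDoc);          // cutoff exceeded: switch
            for (int i = 0; i < n; i++) bits.set(small[i]);
        }
        bits.set(doc);
    }

    int size() { return bits == null ? n : bits.cardinality(); }
}
```

The per-hit cost is one array store or one bit set, which is why this style of collector adds so little overhead to the search loop.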
Re: Announcement: Lucene powering Monster job search index (Beta)
Daniel,

Yes, this is correct if you happen to be doing a radius search and sorting by mileage.

Peter

On 11/3/06, Daniel Rosher [EMAIL PROTECTED] wrote: Hi Peter, Does this mean you are calculating the euclidean distance twice ... once for the HitCollector to filter 'out of range' documents, and then again for the custom Comparator to sort the returned documents? Especially since the filtering is done outside Lucene? Regards, Dan

Joe, Fields with numeric values are stored in a separate file as binary values in an internal format. Lucene is unaware of this file and unaware of the range expression in the query. The range expression is parsed outside of Lucene and used in a custom HitCollector to filter out documents that aren't in the requested range(s). A goal was to do this without having to modify Lucene. Our scheme is pretty efficient, but not very general purpose in its current form, though. Peter On 10/30/06, Joe Shaw [EMAIL PROTECTED] wrote: Hi Peter, On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote: Numeric range search is one of Lucene's weak points (performance-wise) so we have implemented this with a custom HitCollector and an extension to the Lucene index files that stores the numeric field values for all documents. It is important to point out that this has all been implemented with the stock Lucene 2.0 library. No code changes were made to the Lucene core. Can you give some technical details on the extension to the Lucene index files? How did you do it without making any changes to the Lucene core? Thanks, Joe
Re: Announcement: Lucene powering Monster job search index (Beta)
Correction: we only do the euclidean computation during sorting. For filtering, a simple bounding box is computed to approximate the radius, and 2 range comparisons are made to exclude documents. Because these comparisons are done outside of Lucene as integer comparisons, it is pretty fast. With 13,000 results, the search time with distance sort is about 200 msec (compared to 30 msec for a simple non-radius, date-sorted keyword search).

Peter

On 1/27/07, no spam [EMAIL PROTECTED] wrote: Isn't it extremely inefficient to do the euclidean distance twice? Perhaps not a huge deal for a small search result set. I at times have 13,000 results that match my search terms in an index with 1.2 million docs. Can't you do some simple radian math first to ensure it's way out of bounds, then do the euclidean distance for the subset within bounds? I'm currently only doing the distance calc once (post hit collector). I don't have any performance numbers for the double vs. single distance calc. I'm still working out the sort by radius myself. Mark On 11/3/06, Peter Keegan [EMAIL PROTECTED] wrote: Daniel, Yes, this is correct if you happen to be doing a radius search and sorting by mileage. Peter
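The two-stage test being discussed can be sketched with abstract integer grid coordinates (the thread does not specify the actual location encoding, and real geo distance would need proper projection; the post filters with the bounding box alone, with the exact check shown here as the optional refinement Mark suggests):

```java
// Stage 1 is the cheap bounding-box approximation (integer comparisons
// only); stage 2 is the exact in-circle test, done in squared units to
// avoid the sqrt. Coordinates and radius are abstract integer units.
public class RadiusFilter {
    static boolean inRadius(int x, int y, int refX, int refY, int radius) {
        // stage 1: bounding box -- two range comparisons per axis
        if (x < refX - radius || x > refX + radius) return false;
        if (y < refY - radius || y > refY + radius) return false;
        // stage 2: exact check on the survivors
        long dx = x - refX, dy = y - refY;
        return dx * dx + dy * dy <= (long) radius * radius;
    }
}
```

A point like (radius, radius) passes the box but fails the circle, which is exactly the corner-of-the-box error the approximation accepts when stage 2 is skipped.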
Re: Announcement: Lucene powering Monster job search index (Beta)
Mark,

I'm sorry to hear that you weren't able to get to the job search site today. I heard of a problem, but I can assure you that it had nothing to do with Lucene and our back-end tiers. Can you tell me what you think is lacking for job search among the big boards? There is clearly a lot of room for improvement. How is the performance of your distance search and sort?

Peter

On 1/30/07, no spam [EMAIL PROTECTED] wrote: This is very similar to what I do. I use a hit collector to gather the results, then filter outside a bounding box, then calculate the euclidean distance. Last time I tried to check your search it was down. We were talking the other day at work about how job search was lacking among the big boards. I'm excited to check out your new page. Mark On 1/28/07, Peter Keegan [EMAIL PROTECTED] wrote: Correction: we only do the euclidean computation during sorting. For filtering, a simple bounding box is computed to approximate the radius, and 2 range comparisons are made to exclude documents. Because these comparisons are done outside of Lucene as integer comparisons, it is pretty fast. With 13,000 results, the search time with distance sort is about 200 msec (compared to 30 msec for a simple non-radius, date-sorted keyword search). Peter On 1/27/07, no spam [EMAIL PROTECTED] wrote: Isn't it extremely inefficient to do the euclidean distance twice? Perhaps not a huge deal for a small search result set. I at times have 13,000 results that match my search terms in an index with 1.2 million docs. Can't you do some simple radian math first to ensure it's way out of bounds, then do the euclidean distance for the subset within bounds? I'm currently only doing the distance calc once (post hit collector). I don't have any performance numbers for the double vs. single distance calc. I'm still working out the sort by radius myself. Mark On 11/3/06, Peter Keegan [EMAIL PROTECTED] wrote: Daniel, Yes, this is correct if you happen to be doing a radius search and sorting by mileage.
Peter
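The bounding-box-then-exact-distance pattern discussed in this thread can be sketched in plain Java. This is a minimal illustration of the idea (all names here are hypothetical, not the actual Monster code): a cheap box test using only integer comparisons excludes most documents, and the exact euclidean distance is computed only for the survivors, e.g. when sorting them by distance.

```java
public class RadiusFilter {
    // True if (x, y) falls inside the square bounding box of side
    // 2 * radius centered on (cx, cy) -- integer comparisons only,
    // analogous to the two range comparisons described above.
    static boolean inBoundingBox(int cx, int cy, int radius, int x, int y) {
        return x >= cx - radius && x <= cx + radius
            && y >= cy - radius && y <= cy + radius;
    }

    // Exact euclidean distance, computed only for documents that
    // passed the bounding-box test (i.e. during sorting).
    static double distance(int cx, int cy, int x, int y) {
        double dx = x - cx, dy = y - cy;
        return Math.sqrt(dx * dx + dy * dy);
    }
}
```

Note that the box admits points in its corners that are slightly outside the true radius, which is why the thread discusses whether the exact distance should also be applied during filtering, not just sorting.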
bad queryparser bug
I have discovered a serious bug in QueryParser. The following query: contents:sales contents:marketing || contents:industrial contents:sales is parsed as: +contents:sales +contents:marketing +contents:industrial +contents:sales The same parsed query occurs even with parentheses: (contents:sales contents:marketing) || (contents:industrial contents:sales) Is there any way around this bug? Thanks, Peter
Re: bad queryparser bug
Correction: The query parser produces the correct query with the parentheses. But, I'm still looking for a fix for this. I could use some advice on where to look in QueryParser to fix this. Thanks, Peter On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote: I have discovered a serious bug in QueryParser. The following query: contents:sales contents:marketing || contents:industrial contents:sales is parsed as: +contents:sales +contents:marketing +contents:industrial +contents:sales The same parsed query occurs even with parentheses: (contents:sales contents:marketing) || (contents:industrial contents:sales) Is there any way around this bug? Thanks, Peter
Re: bad queryparser bug
OK, I see that I'm not the first to discover this behavior of QueryParser. Can anyone vouch for the integrity of the PrecedenceQueryParser here: http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/miscellaneous/src/java/org/apache/lucene/queryParser/precedence/ Thanks, Peter On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote: Correction: The query parser produces the correct query with the parentheses. But, I'm still looking for a fix for this. I could use some advice on where to look in QueryParser to fix this. Thanks, Peter On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote: I have discovered a serious bug in QueryParser. The following query: contents:sales contents:marketing || contents:industrial contents:sales is parsed as: +contents:sales +contents:marketing +contents:industrial +contents:sales The same parsed query occurs even with parentheses: (contents:sales contents:marketing) || (contents:industrial contents:sales) Is there any way around this bug? Thanks, Peter
Re: bad queryparser bug
(If i could go back in time and stop the AND/OR/NOT/&&/|| aliases from being added to the QueryParser -- i would) Yes, this is the cause of the confusion. Our users are accustomed to the boolean logic syntax from a legacy search engine (also common to many other engines). We'll have to convert them into native QueryParser syntax as best we can. Sorry for the cross post. Thanks, Peter On 2/2/07, Chris Hostetter [EMAIL PROTECTED] wrote: : The query parser produces the correct query with the parentheses. : But, I'm still looking for a fix for this. I could use some advice on where : to look in QueryParser to fix this. the best advice i can give you: don't use the binary operators. * Lucene is not a boolean logic system * BooleanQuery does not implement boolean logic * QueryParser is not a boolean language parser (If i could go back in time and stop the AND/OR/NOT/&&/|| aliases from being added to the QueryParser -- i would) -Hoss
Re: relevancy buckets and secondary searching
Hi Erick, The timing of your posting is ironic because I'm currently working on the same issue. Here's a solution that I'm going to try: Use a HitCollector with a PriorityQueue to sort all hits by raw Lucene score, ignoring the secondary sort field. After the search, re-sort just the hits from the queue above (500 in your case) with a FieldSortedHitQueue that sorts on score, then the secondary field (title in your case), but 'normalize' the score to your 'user visible' scores before re-sorting. If your 'normalized' score is computed properly, this should force the secondary sort to occur and produce the 'proper' sorting that the user expects. I think the trick here is in computing the proper normalized score from Lucene's raw scores, which will vary depending on boosts, etc. I agree with you that this special relevancy sort is a real hack to implement! Peter On 2/5/07, Erick Erickson [EMAIL PROTECTED] wrote: Am I missing anything obvious here and/or what would folks suggest... Conceptually, I want to normalize the scores of my documents during a search BUT BEFORE SORTING into 5 discrete values, say 0.1, 0.3, 0.5, 0.7, 0.9 and apply a secondary sort when two documents have the same score. Applying the secondary sort is easy, it's massaging the scores that has me stumped. We have a bunch of documents (30K). Books actually. We only display to the user 5 different relevance scores, with 5 being the most relevant. So far, so good. Within each quintile, we want to sort by title. So, suppose the following three books score a hit:

relevance  title
0.98       z
0.94       c
0.79       a

The proper display would be:

5  c
5  z
4  a

It's easy enough to do a secondary sort, but that would not give me what I want. In this case, I'd get:

5  z
5  c
4  a

because the secondary sort only matters if the primary sort is equal. The user is left scratching her head asking "why did two books with the same relevancy have the titles out of order?".
If I could massage my scores *before* sorts are done, things would be hunky-dory, but I'm not seeing how to do that. One problem is that until the top N documents have been collected, I don't know what the maximum relevance is, therefore I don't know how to normalize raw scores. I followed Hoss's thread where he talks about FakeNorms, but don't see how that applies to my problem. My result sets are strictly limited to 500, so it's not unreasonable to just get the TopDocs back and aggregate my buckets at that point and sort them. But of course I only care about this when I am using relevancy as my primary sort. For sorting on any other fields, I would just let Lucene take care of it all. So post-sorting myself leads to really ugly stuff like if (it's my special relevancy sort) do one thing else don't do that thing. repeated wherever I have to sort. Yuck. And since I'm talking about 500 docs, I don't want to wait until after I have a Hits object because I'll have to re-query several times. On an 8G index (and growing). This almost looks like a HitCollector, but not quite. This almost looks like a custom Similarity, but not quite since I want to just let Lucene compute relevance and put that into a bucket. This almost looks like FakeNorms, but not quite. This almost looks like about 8 things I tried to make work, but not quite <g> So, somebody out there needs to tell me what part of the manual I overlooked <g>... Thanks Erick
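The bucketize-then-resort approach discussed in this thread can be sketched in plain Java. This is a simplified illustration, not the actual list code: the bucket count (5) and the rounding rule (round the score's ratio to the max score onto 1..5) are assumptions for the example, and the hypothetical `rank` helper stands in for the TopDocs/FieldSortedHitQueue plumbing.

```java
import java.util.*;

public class Bucketizer {
    static final int BUCKETS = 5; // assumption: five user-visible quintiles

    // Map a raw score onto 1..BUCKETS relative to the best hit
    // (assumed rounding rule for this sketch).
    static int bucket(float score, float maxScore) {
        return Math.max(1, Math.round(score / maxScore * BUCKETS));
    }

    // Re-sort hits by bucket (descending), then title (ascending);
    // because scores are now discrete, the secondary sort takes hold
    // within each bucket. Returns "bucket title" strings for display.
    static List<String> rank(Map<String, Float> titleToScore) {
        float max = Collections.max(titleToScore.values());
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, Float> e : titleToScore.entrySet())
            rows.add(new String[]{
                String.valueOf(bucket(e.getValue(), max)), e.getKey()});
        rows.sort(Comparator.comparing((String[] r) -> Integer.parseInt(r[0]))
                            .reversed()
                            .thenComparing(r -> r[1]));
        List<String> out = new ArrayList<>();
        for (String[] r : rows) out.add(r[0] + " " + r[1]);
        return out;
    }
}
```

With Erick's example (z at 0.98, c at 0.94, a at 0.79), this rounding rule puts z and c in bucket 5 and a in bucket 4, so the title sort reorders c before z, producing the "5 c / 5 z / 4 a" display the user expects.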
Re: Sorting by Score
Suppose one wanted to use this custom rounding score comparator on all fields and all queries. How would you get it plugged in most efficiently, given that SortField requires a non-null field name? Peter On 2/1/06, Chris Hostetter [EMAIL PROTECTED] wrote: : I've not used the sorting code yet, but it looks like you have to : provide some custom ScoreDocComparator by adding a SortField using the : SortField(String field, SortComparatorSource comparator) constructor. : I'm just not certain what you should specify for the field value since : you really want to just round off the score. : : Could someone with more experience using the Sort API clarify whether : this is possible? yes, it should be possible, and yes your description of a solution sounds right ... the only odd thing is you'd be writing a SortComparatorSource/ScoreDocComparator that would be ignoring the field it's given, but there's nothing wrong with that. Round your number to the desired precision, then compare them, and return 0 if they are equal so that the secondary sort (on date in this case) can take effect. -Hoss
Re: Sorting by Score
I'm building up the Sort object for the search with 2 SortFields - first is for the custom rounded scoring, second is for date. This Sort object is used to construct a FieldSortedHitQueue which is used with a custom HitCollector. And yes, this comparator ignores the field name. hmmm, actually i see now that SortField(String,SortComparatorSource) says it cannot be null ... not sure if that's actually enforced or not. The constructor doesn't complain, but FieldSortedHitQueue expects a field name when it tries to locate the comparator from the cache: at org.apache.lucene.search.FieldCacheImpl$Entry.<init>( FieldCacheImpl.java:60) at org.apache.lucene.search.FieldSortedHitQueue.lookup( FieldSortedHitQueue.java:157) at org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator( FieldSortedHitQueue.java:185) at org.apache.lucene.search.FieldSortedHitQueue.<init>( FieldSortedHitQueue.java:58) Peter On 2/27/07, Chris Hostetter [EMAIL PROTECTED] wrote: : Suppose one wanted to use this custom rounding score comparator on all : fields and all queries. How would you get it plugged in most efficiently, : given that SortField requires a non-null field name? i'm not sure i understand the first part of the question .. this custom SortComparatorSource would deal only with the score, it wouldn't matter what other fields you'd want to make SortFields on to do secondary sorting. .. You as the client have to specify the Sort object when executing the search, and you can build that Sort object up any way you want. Yes the SortField class has a constructor arg for field, but as you can see from the javadocs, it can be null in many circumstances (consider SortField#FIELD_SCORE and SortField#FIELD_DOC for instance) ... hmmm, actually i see now that SortField(String,SortComparatorSource) says it cannot be null ... not sure if that's actually enforced or not, but it's no bother -- all that matters is that you don't make any attempt to use the field name in your SortComparatorSource.
-Hoss
Re: Sorting by Score
can't you pick any arbitrary marker field name (that's not a real field name) and use that? Yes, I could. I guess you're saying that the field name doesn't matter, except that it's used for caching the comparator, right? ... he wants the bucketing to happen as part of the scoring so that the secondary sort will determine the ordering within the bucket. Yes, exactly. Couldn't I just do this rounding in the HitCollector, before inserting it into the FieldSortedHitQueue? On 2/28/07, Chris Hostetter [EMAIL PROTECTED] wrote: : The first part was just to iterate through the TopDocs that's available to : me and normalize the scores right in the ScoreDocs. Like this... Won't that be done after Lucene does the hitcollecting/sorting? ... he wants the bucketing to happen as part of the scoring so that the secondary sort will determine the ordering within the bucket. (or am i missing something about your description?) -Hoss
Re: Sorting by Score
Erick, Yes, this seems to be the simplest way to implement score 'bucketization', but wouldn't it be more efficient to do this with a custom ScoreComparator? That way, you'd do the bucketizing and sorting in one 'step' (compare()). Maybe the savings isn't measurable, though. A comparator might also allow one to do a more sophisticated rounding or bucketizing since you'd be getting 2 scores at a time. Peter On 2/28/07, Erick Erickson [EMAIL PROTECTED] wrote: Empirically, when I insert the elements in the FieldSortedHitQueue they get sorted according to the Sort object. The original query that gives me a TopDocs applied no secondary sorting, only relevancy. Since I normalized all the scores into one of only 5 discrete values, secondary sorting was applied to all docs with the same score when I inserted them in the FieldSortedHitQueue. Now popping things off the FieldSortedHitQueue is ordered the way I want. You could just operate on the FieldSortedHitQueue at this point, but I decided the rest of my code would be simpler if I stuffed them back into the TopDocs, so there's some explanation below that you can just skip if I've cleared things up already. * The step I left out is moving the documents from the FieldSortedHitQueue back to topDocs.scoreDocs. So the steps are as follows.. 1 bucketize the scores. That is, go through the TopDocs.scoreDocs and adjust each raw score into one of my buckets. This is made easy by the existence of topDocs.getMaxScore. TopDocs has had no sorting other than relevancy applied so far. 2 assemble the FieldSortedHitQueue by inserting each element from scoreDocs into it, with a suitable Sort object, relevance is the first field (SortField.FIELD_SCORE). 3 pop the entries off the FieldSortedHitQueue, overwriting the elements in topDocs.scoreDocs. I left out step 3, although I suppose you could operate directly on the FieldSortedHitQueue. NOTE: in my case, I just put everything back in the scoreDocs without attempting any efficiencies.
If I needed more performance, I'd only put as many items back as I needed to display. But as I wrote yesterday, performance isn't an issue so there's no point. Although I know one place to look if we need to squeeze more QPS. How efficient this is is an open question. But it's fast enough and relatively simple so I stopped looking for more efficiencies. Erick On 2/28/07, Chris Hostetter [EMAIL PROTECTED] wrote: : The first part was just to iterate through the TopDocs that's available to : me and normalize the scores right in the ScoreDocs. Like this... Won't that be done after Lucene does the hitcollecting/sorting? ... he wants the bucketing to happen as part of the scoring so that the secondary sort will determine the ordering within the bucket. (or am i missing something about your description?) -Hoss
Re: Sorting by Score
Erick, I think you're right because you wouldn't know the max score before the comparisons. I'm just thinking about a rounding algorithm that involves comparing the raw scores to the theoretical maximum score, which I think could be computed from the Similarity class and knowing the max boost value used during indexing. Peter On 3/1/07, Erick Erickson [EMAIL PROTECTED] wrote: Peter: About a custom ScoreComparator. The problem I couldn't get past was that I needed to know the max score of all the docs in order to divide the raw scores into quintiles since I was dealing with raw scores. I didn't see how to make that work with ScoreComparator, but I confess that I didn't look very hard after someone on the list turned me on to FieldSortedHitQueue Erick On 2/28/07, Erick Erickson [EMAIL PROTECTED] wrote: It may well be, but as I said this is efficient enough for my needs so I didn't pursue it. One of my pet peeves is spending time making things more efficient when there's no need, and my index isn't going to grow enough larger to worry about that now <g>... Erick On 2/28/07, Peter Keegan [EMAIL PROTECTED] wrote: Erick, Yes, this seems to be the simplest way to implement score 'bucketization', but wouldn't it be more efficient to do this with a custom ScoreComparator? That way, you'd do the bucketizing and sorting in one 'step' (compare()). Maybe the savings isn't measurable, though. A comparator might also allow one to do a more sophisticated rounding or bucketizing since you'd be getting 2 scores at a time. Peter On 2/28/07, Erick Erickson [EMAIL PROTECTED] wrote: Empirically, when I insert the elements in the FieldSortedHitQueue they get sorted according to the Sort object. The original query that gives me a TopDocs applied no secondary sorting, only relevancy. Since I normalized all the scores into one of only 5 discrete values, secondary sorting was applied to all docs with the same score when I inserted them in the FieldSortedHitQueue.
Now popping things off the FieldSortedHitQueue is ordered the way I want. You could just operate on the FieldSortedHitQueue at this point, but I decided the rest of my code would be simpler if I stuffed them back into the TopDocs, so there's some explanation below that you can just skip if I've cleared things up already. * The step I left out is moving the documents from the FieldSortedHitQueue back to topDocs.scoreDocs. So the steps are as follows.. 1 bucketize the scores. That is, go through the TopDocs.scoreDocs and adjust each raw score into one of my buckets. This is made easy by the existence of topDocs.getMaxScore. TopDocs has had no sorting other than relevancy applied so far. 2 assemble the FieldSortedHitQueue by inserting each element from scoreDocs into it, with a suitable Sort object, relevance is the first field ( SortField.FIELD_SCORE). 3 pop the entries off the FieldSortedHitQueue, overwriting the elements in topDocs.scoreDocs. I left out step 3, although I suppose you could operate directly on the FieldSortedHitQueue. NOTE: in my case, I just put everything back in the scoreDocs without attempting any efficiencies. If I needed more performance, I'd only put as many items back as I needed to display. But as I wrote yesterday, performance isn't an issue so there's no point. Although I know one place to look if we need to squeeze more QPS. How efficient this is is an open question. But it's fast enough and relatively simple so I stopped looking for more efficiencies. Erick On 2/28/07, Chris Hostetter [EMAIL PROTECTED] wrote: : The first part was just to iterate through the TopDocs that's available to : me and normalize the scores right in the ScoreDocs. Like this... Won't that be done after Lucene does the hitcollecting/sorting? ... he wants the bucketing to happen as part of the scoring so that the secondary sort will determine the ordering within the bucket. (or am i missing something about your description?)
-Hoss
Re: Lucene Ranking/scoring
I'm looking at how ReciprocalFloatFunction and ReverseOrdFieldSource can be used to rank documents by score and date (solr.search.function contains great stuff!). The values in the date field that are used for the ValueSource are not actually used as 'floats', but rather their ordinal term values from the FieldCache string index. This means that if the 'date' field has 3000 unique string 'values' in the index, the values for 'x' in ReciprocalFloatFunction could be 0-2999. So if I want the most recent 'date' to return a score of 1.0, one could set 'a' and 'b' in the function to 2999. Do I have this right? I got a bit confused at first because I assumed that the actual field values were being used in the computation, but you really need to know the unique term count in order to get the score 'right'. By the way, as I try to get my head around the Score, Weight, and Boolean* classes (and next(), skipTo()), I nominate these for discussion in Lucene In Action II. Peter On 3/9/06, Yonik Seeley [EMAIL PROTECTED] wrote: On 3/9/06, Yang Sun [EMAIL PROTECTED] wrote: Hi Yonik, Thanks very much for your suggestion. The query boost works great for keyword matching. But in my case, I need to rank the results by date and title. For example, title:foo^2 abstract:foo^1.5 date:2004^3 will only boost the document with date=2004. What I need is boosting the distance from the specified date If all you need to do is boost more recent documents (and a single fixed boost will always work), then you can do that boosting at index time. which means 2003 will have a better ranking than 2002, 2002 better than 2001, etc. I implemented a customized ScoreDocComparator class which works fine for one field. But I met some trouble when trying to combine other fields together. I'm still looking at FunctionQuery. Don't know if I can figure out something.
FunctionQuery support is integrated into Solr (or currently hacked-in, as the case may be), and can be useful for debugging and trying out query types even if you don't use it for your runtime. ReciprocalFloatFunction might meet your needs for increasing the score of more recent documents: http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/ReciprocalFloatFunction.html The SolrQueryParser can make ReciprocalFloatFunction(new ReverseOrdFieldSource(my_date),1,1000,1000) out of _val_:recip(rord(my_date),1,1000,1000) -Yonik http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server
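The reciprocal function behind ReciprocalFloatFunction is just recip(x, m, a, b) = a / (m*x + b), where x here is the reverse ordinal of the date term (0 for the most recent date). A tiny standalone sketch makes Peter's reading easy to check: with m=1 and a = b = 2999 (the unique term count minus one), the newest document scores exactly 1.0.

```java
public class Recip {
    // recip(x, m, a, b) = a / (m * x + b), the formula implemented by
    // Solr's ReciprocalFloatFunction. With x = rord(date), x is 0 for
    // the most recent date and grows for older ones.
    static float recip(float x, float m, float a, float b) {
        return a / (m * x + b);
    }
}
```

So recip(0, 1, 2999, 2999) = 1.0 for the newest date, and recip(2999, 1, 2999, 2999) = 0.5 for the oldest of 3000 unique dates: the score decays smoothly with age, which is exactly the "boost more recent documents" behavior discussed above.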
Re: Announcement: Lucene powering Monster job search index (Beta)
Dan, The filtering is done in the HitCollector by the bounding box, so the only hits that get collected are those that match the keywords, the bounding box, and some Lucene filters (BitSets) (I'm probably overloading the word 'filter' a bit). So, the only hits from the collector that need to be sorted are those that are roughly within the search radius. When the search radius gets larger, a new bounding box is computed for that query. Make sense? Peter On 3/16/07, Daniel Rosher [EMAIL PROTECTED] wrote: Hi Peter, Shouldn't the search perform the euclidean distance during filtering as well though, otherwise you will obtain perhaps highly relevant hits reported to the user outside the range they specified? Particularly as the search radius gets larger. Cheers, Dan On 1/28/07, Peter Keegan [EMAIL PROTECTED] wrote: Correction: We only do the euclidean computation during sorting. For filtering, a simple bounding box is computed to approximate the radius, and 2 range comparisons are made to exclude documents. Because these comparisons are done outside of Lucene as integer comparisons, it is pretty fast. With 13000 results, the search time with distance sort is about 200 msec (compared to 30 ms for a simple non-radius, date-sorted keyword search). Peter On 1/27/07, no spam [EMAIL PROTECTED] wrote: Isn't it extremely inefficient to do the euclidean distance twice? Perhaps not a huge deal if a small search result set. I at times have 13,000 results that match my search terms of an index with 1.2 million docs. Can't you do some simple radian math first to ensure it's way out of bounds, then do the euclidean distance for the subset within bounds? I'm currently only doing the distance calc once (post hit collector). I don't have any performance numbers with the double vs single distance calc. I'm still working out the sort by radius myself.
Mark On 11/3/06, Peter Keegan [EMAIL PROTECTED] wrote: Daniel, Yes, this is correct if you happen to be doing a radius search and sorting by mileage. Peter
Re: Announcement: Lucene powering Monster job search index (Beta)
Note: this is a reply to a posting to java-dev --Peter Eric, Now that it is live, is performance pretty good? Performance is outstanding. Each server can easily handle well over 100 qps on an index of over 800K documents. There are several servers (4 dual core (8 CPU) Opteron) supporting the query load and we have backup servers for disaster recovery. For a few hours one day, all job search query traffic for the entire site was being handled by a single server - with no noticeable latency! Are you using dotLucene or a webservice tier and java? We are using Java Lucene on dedicated servers. How did you implement your bounding box for the searching? It sounds like you do this outside of lucene and return a custom hitcollector. The 'bounding box' is merely the conjunction of 2 numeric range searches. It's really not that hard to do - I think there has been discussion of this elsewhere in this group. We use (not 'return') a custom HitCollector to exclude hits that aren't in the bounding box. I tried to explain this in a reply earlier today, but if I failed let me know. Why not use a rangequery or functionquery for the basic bounding before sorting Basically, 'RangeQuery' doesn't offer sufficient performance. We have implemented our own 'numeric value' search 'next to Lucene' (I think I like this better than 'outside of Lucene' ;-)). FunctionQuery could be used if you wanted the jobs sorted by a combination of keywords and distance. Our users (apparently) expect the jobs to be sorted strictly by distance on a radius search. Peter Hello Peter, Now that the monster lucene search is live, is performance pretty good? Are you still running it on a single 8 core server? Can you give me a rough idea on the number of queries you can handle/second and the number of docs in the index? Are you using dotLucene or a webservice tier and java? How did you implement your bounding box for the searching? It sounds like you do this outside of lucene and return a custom hitcollector.
Why not use a rangequery or functionquery for the basic bounding before sorting? Thanks, Eric
Re: Lucene search performance: linear?
On a similar topic, has anybody measured query performance as a function of index size? Well, I did and the results surprised me. I measured query throughput on 8 indexes that varied in size from 55,000 to 4.4 million documents. When plotted on a graph, there is a distinct hyperbolic curve (1/x). I expected to see more of a linear curve with a sharp drop-off at some point. Interesting Peter On 12/5/06, Zhang, Lisheng [EMAIL PROTECTED] wrote: Hi Soeren, Thanks very much for explanations, yes, there is no linear relation when searching a keyword which is only in a few docs. Best regards, Lisheng -Original Message- From: Soeren Pekrul [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 05, 2006 10:37 AM To: java-user@lucene.apache.org Subject: Re: Lucene search performance: linear? Hello Lisheng, a search process usually has to do two things. First it has to find the term in the index. I don't know the implementation of finding a term in Lucene. I hope that the index is at least a sorted list or a binary tree, so it can do a binary search. The time to find a term depends on the number of terms n_t. With binary search the complexity is approximately log(n_t). The search time should be better than linear. Second it has to collect the documents for a term. This depends on the number of documents n_d for a term. It has to go through the list of documents for a term. The time should be proportional to the number of documents for a term even if it doesn't calculate the similarity. Usually the number of documents for a single term is less than the total number of documents in the collection and less than the total number of terms in the index. If the number of documents for a single term is less than the total number of documents, the search process for a single term including process one (finding the term) and process two (collecting the documents and calculating the score) should be better than linear in the number of documents.
I indexed first 220,000, all with a special keyword, I did a simple query and only fetched 5 docs, with Hits.length()=220,000. Then I indexed 440,000 docs, with the same keyword, query it again and fetched a few docs, with Hits.length()=440,000. In your case the query term is contained in all documents. The number of documents for a single term equals the total number of documents in your collection. The hit collector has to collect all documents. The collecting process is proportional to the number of documents to collect. So the search for all documents should be at least linear to the total number of documents. Sören Zhang, Lisheng schrieb: Hi, I indexed first 220,000, all with a special keyword, I did a simple query and only fetched 5 docs, with Hits.length()=220,000. Then I indexed 440,000 docs, with the same keyword, query it again and fetched a few docs, with Hits.length()=440,000. I found that search time is about linear: 2nd time is about 2 times longer than 1st query. I would like to understand: Does the linear relation come from score calculation, since we have to calculate score one by one? Or other reason? If we have a B-tree index I would naively expect better scalability? Thanks very much for your helps, Lisheng
Re: FieldSortedHitQueue enhancement
The duplicate check would just be on the doc ID. I'm using TreeSet to detect duplicates with no noticeable effect on performance. The PQ only has to be checked for a previous value IFF the element about to be inserted is actually inserted and not dropped because it's less than the least value already in there. So, the TreeSet is never bigger than the size of the PQ (typically 25 to a few hundred items), not the size of all hits. Peter On 3/29/07, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hm, removing duplicates (as determined by a value of a specified document field) from the results would be nice. How would your addition affect performance, considering it has to check the PQ for a previous value for every candidate hit? Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Peter Keegan [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, March 29, 2007 9:39:13 AM Subject: FieldSortedHitQueue enhancement This is a request for an enhancement to FieldSortedHitQueue/PriorityQueue that would prevent duplicate documents from being inserted, or alternatively, allow the application to prevent this (reason explained below). I can do this today by making the 'lessThan' method public and checking the queue before inserting like this: if (hq.size() < maxSize) { // doc will be inserted into queue - check for duplicate before inserting } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top())) { // doc will be inserted into queue - check for duplicate before inserting } else { // doc will not be inserted - no check needed } However, this is just replicating existing code in PriorityQueue.insert().
An alternative would be to have a method like: public boolean wouldBeInserted(ScoreDoc doc) // returns true if doc would be inserted, without inserting The reason for this is that I have some queries that get expanded into multiple searches and the resulting hits are OR'd together. The queries contain 'terms' that are not seen by Lucene but are handled by a HitCollector that uses external data for each document to evaluate hits. The results from the priority queue should contain no duplicate documents (first or last doc wins). Do any of these suggestions seem reasonable? So far, I've been able to use Lucene without any modifications, and hope to continue this way. Peter
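The scheme Peter describes can be sketched in plain Java: a bounded priority queue of hits plus a TreeSet of the doc IDs currently in the queue, so the set never grows past the queue size and is only consulted when an element would actually be inserted. This is an illustrative sketch, not the Lucene PriorityQueue itself; the `DedupQueue` class and integer scores are assumptions for the example, and it implements the "first doc wins" policy.

```java
import java.util.*;

public class DedupQueue {
    private final int maxSize;
    // [docId, score] pairs, lowest score on top (the "least value").
    private final PriorityQueue<int[]> pq =
        new PriorityQueue<>(Comparator.comparingInt(h -> h[1]));
    // Doc IDs currently in pq -- never bigger than maxSize.
    private final TreeSet<Integer> docIds = new TreeSet<>();

    DedupQueue(int maxSize) { this.maxSize = maxSize; }

    // Returns true if the hit was inserted.
    boolean insert(int docId, int score) {
        if (docIds.contains(docId)) return false;   // duplicate: first doc wins
        if (pq.size() < maxSize) {                  // queue not full yet
            pq.add(new int[]{docId, score});
            docIds.add(docId);
            return true;
        }
        if (score > pq.peek()[1]) {                 // beats current least value
            docIds.remove(pq.poll()[0]);            // evict least, keep set in sync
            pq.add(new int[]{docId, score});
            docIds.add(docId);
            return true;
        }
        return false;                               // dropped -- no check needed
    }

    int size() { return pq.size(); }
}
```

The duplicate check runs before the capacity test here for simplicity; Peter's point is that in the common case (hit dropped because it is below the least value) no set lookup is needed at all, which the capacity-first ordering in his pseudocode achieves.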
Re: FieldSortedHitQueue enhancement
Yes, my custom query processor can sometimes make 2 Lucene search calls which may result in duplicate docs being inserted on the same PQ. The simplest solution is to make lessThan public. I'm curious to know if anyone else is performing multiple searches under the covers. Peter On 3/29/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 3/29/07, Otis Gospodnetic [EMAIL PROTECTED] wrote: Ah, I see. This is less attractive to me personally, but maybe it helps others. One thing I don't understand is why/how you'd get duplicate documents with the same doc ID in there. Isn't insert(FieldDoc fdoc) called only once for each doc? Yes, for any Lucene search method. From Peter's first message, it looks like it's his custom code that can result in duplicates. -Yonik
Re: FieldSortedHitQueue enhancement
Peter, how did you achieve 'last wins' as you must presumably remove first from the PQ? I implemented 'first wins' because the score is less important than other fields (distance, in our case), but you make a good point since score may be more important. How did you implement remove()? Peter On 3/29/07, Antony Bowesman [EMAIL PROTECTED] wrote: I've got a similar duplicate case, but my duplicates are based on an external ID rather than Doc id so occurs for a single Query. It's using a custom HitCollector but score based, not field sorted. If my duplicate contains a higher score than one on the PQ I need to update the stored score with the higher one, so PQ needs a replace() method where the stored object.equals() can be used to find the object to delete. I'm not sure if there's a way to find the object efficiently in this case other than a linear search. I implemented remove(). Peter, how did you achieve 'last wins' as you must presumably remove first from the PQ? Antony Peter Keegan wrote: The duplicate check would just be on the doc ID. I'm using TreeSet to detect duplicates with no noticeable effect on performance. The PQ only has to be checked for a previous value IFF the element about to be inserted is actually inserted and not dropped because it's less than the least value already in there. So, the TreeSet is never bigger than the size of the PQ (typically 25 to a few hundred items), not the size of all hits. Peter On 3/29/07, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hm, removing duplicates (as determined by a value of a specified document field) from the results would be nice. How would your addition affect performance, considering it has to check the PQ for a previous value for every candidate hit? Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Peter Keegan [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, March 29, 2007 9:39:13 AM Subject: FieldSortedHitQueue enhancement This is a request for an enhancement to FieldSortedHitQueue/PriorityQueue that would prevent duplicate documents from being inserted, or alternatively, allow the application to prevent this (reason explained below). I can do this today by making the 'lessThan' method public and checking the queue before inserting like this: if (hq.size() < maxSize) { // doc will be inserted into queue - check for duplicate before inserting } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top())) { // doc will be inserted into queue - check for duplicate before inserting } else { // doc will not be inserted - no check needed } However, this is just replicating existing code in PriorityQueue.insert(). An alternative would be to have a method like: public boolean wouldBeInserted(ScoreDoc doc) // returns true if doc would be inserted, without inserting The reason for this is that I have some queries that get expanded into multiple searches and the resulting hits are OR'd together. The queries contain 'terms' that are not seen by Lucene but are handled by a HitCollector that uses external data for each document to evaluate hits. The results from the priority queue should contain no duplicate documents (first or last doc wins). Do any of these suggestions seem reasonable? So far, I've been able to use Lucene without any modifications, and hope to continue this way. Peter - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
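The scheme Peter describes, a TreeSet of doc IDs shadowing a bounded priority queue and consulted only when a hit would actually be inserted, can be sketched in plain Java. This is an illustrative stand-in, not Lucene's PriorityQueue API; the class and method names are hypothetical:

```java
import java.util.PriorityQueue;
import java.util.TreeSet;

// Bounded min-queue of (docId, score) hits. A TreeSet of docIds, never
// larger than the queue itself, rejects duplicates before insertion
// ("first doc wins"). All names here are illustrative.
class DedupingHitQueue {
    private final int maxSize;
    private final PriorityQueue<int[]> queue; // elements are {docId, score}, min-heap on score
    private final TreeSet<Integer> docIds = new TreeSet<>();

    DedupingHitQueue(int maxSize) {
        this.maxSize = maxSize;
        this.queue = new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
    }

    /** Returns true if the hit was inserted; duplicates and low scores are dropped. */
    boolean insert(int docId, int score) {
        // Mirror of the "would be inserted" check: room left, or beats the least value.
        boolean wouldBeInserted =
            queue.size() < maxSize
                || (!queue.isEmpty() && score > queue.peek()[1]);
        if (!wouldBeInserted || !docIds.add(docId)) {
            return false; // dropped: below the least value, or a duplicate
        }
        if (queue.size() == maxSize) {
            docIds.remove(queue.poll()[0]); // evict the least hit and forget its docId
        }
        queue.add(new int[] {docId, score});
        return true;
    }

    int size() { return queue.size(); }
}
```

Because the duplicate check runs only for hits that would survive insertion, the TreeSet stays bounded by the queue size, matching Peter's observation that it never grows to the size of all hits.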
Re: Sorting on a field that can have null values
excluding them completely is a slightly different task, you don't need to index a special marker value, you can just use a RangeFilter (or ConstantScoreRangeQuery) to ensure you only get docs with a value for that field (ie: field:[* TO *]) Excellent, this is a much better solution. BTW, adding a ConstantScoreRangeQuery clause to the query works fine, but building the RangeFilter from the query string field:[* TO *] doesn't work. The reason is that the terms expanded from the lowerTerm wildcard are compared to 'upperTerm' which is literally '*', which is incorrect. This would appear to be a bug in QueryParser as it ought to set lowerTerm = upperTerm = null in this case. Thanks, Peter On 4/12/07, Chris Hostetter [EMAIL PROTECTED] wrote: : If I remember correctly (you'll have to test this) sorting on a field : which doesn't exist for every doc does what you would want (docs with : values are listed before docs without) : The actual behavior is different than described above. I modified : TestSort.java: : The actual order of the results is: ZJI. I believe this happens because : the field string cache 'order' array contains 0's for all the documents that : don't contain the field and thus sort first. I guess I wasn't precise enough in that old thread, what I meant was that not having a value results in the docs sorting the same as if they had a value lower than the lowest existing value -- so they sort at the end of the list if you are doing a descending sort, and at the beginning of the list if you do an ascending sort. If you want to always have them come last regardless of order, there is a SortComparator for that purpose in Solr... https://issues.apache.org/jira/browse/LUCENE-406 http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/search/MissingStringLastComparatorSource.java?view=log : Suppose I want to exclude documents from being collected if they don't : contain the sort field. 
One way to do this is to index a unique : 'empty_value' value for those documents and add a MUST_NOT boolean clause to : the query, for example: (query terms -field:empty_value). But this seems : inefficient. Is there a better way? excluding them completely is a slightly different task, you don't need to index a special marker value, you can just use a RangeFilter (or ConstantScoreRangeQuery) to ensure you only get docs with a value for that field (ie: field:[* TO *]) -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
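The "missing values always sort last regardless of direction" contract of the Solr comparator Hoss links can be shown in plain Java, using null to stand in for a document with no value in the sort field. The class name is illustrative:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of the MissingStringLastComparatorSource contract: missing
// values (nulls here) sort after all real values in both directions.
class MissingLastExample {
    // Ascending over real values, nulls last.
    static final Comparator<String> ASC =
        Comparator.nullsLast(Comparator.<String>naturalOrder());
    // Reverse only the non-null ordering, so nulls still land last.
    static final Comparator<String> DESC =
        Comparator.nullsLast(Comparator.<String>naturalOrder().reversed());

    static String[] sorted(Comparator<String> cmp, String... values) {
        String[] copy = values.clone();
        Arrays.sort(copy, cmp);
        return copy;
    }
}
```

The key detail is reversing only the inner comparator; reversing the whole nullsLast comparator would flip the nulls to the front, which is exactly the default Lucene behavior described in the thread.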
Re: optimization behaviour
Of course, that doesn't have to be the case. It would be a trivial change to merge segments and not remove the deleted docs. That usecase could be useful in conjunction with ParallelReader. If the behavior of deleted docs during merging or optimization ever changes, please make this configurable. Our application uses the Lucene docid as a key into our numeric values 'extension' file, and it depends on the simple behavior described in the previous posts. Thanks, Peter On 5/10/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 5/10/07, Yonik Seeley [EMAIL PROTECTED] wrote: Deleted documents are removed on segment merges (for documents marked as deleted in those segments). Of course, that doesn't have to be the case. It would be a trivial change to merge segments and not remove the deleted docs. That usecase could be useful in conjunction with ParallelReader. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Payloads and PhraseQuery
I'm looking at the new Payload api and would like to use it in the following manner. Meta-data is indexed as a special phrase (all terms at same position) and a payload is stored with the first term of each phrase. I would like to create a custom query class that extends PhraseQuery and uses its PhraseScorer to find matching documents. The custom query class then reads the payload from the first term of the matching query and uses it to produce a new score. However, I don't see how to get the payload from the PhraseScorer's TermPositions. Is this possible? Peter
Re: Payloads and PhraseQuery
I tried to subclass PhraseScorer, but discovered that it's an abstract class and its subclasses (ExactPhraseScorer and SloppyPhraseScorer) are final classes. So instead, I extended Scorer with my custom scorer and extended PhraseWeight (after making it public). My scorer's constructor is passed the instance of PhraseScorer created by PhraseQuery.scorer(). My scorer's 'next' and 'skipTo' methods call the PhraseScorer's methods first and if the result is 'true', the payload is loaded and used to determine whether or not the PhraseScorer's doc is a hit. If not, PhraseScorer.next() or skipTo() is called again. In order to get the payload, I modified PhraseQuery to save the TermPositions array it creates for its scorers and added a 'get' method. The diff is included below. This is probably not the best solution, but at least a starting point for further discussion. Here's the diff:

Index: PhraseQuery.java
===================================================================
--- PhraseQuery.java (revision 551992)
+++ PhraseQuery.java (working copy)
@@ -36,7 +36,8 @@
   private Vector terms = new Vector();
   private Vector positions = new Vector();
   private int slop = 0;
-
+  private TermPositions[] tps;
+
   /** Constructs an empty phrase query.
    */
   public PhraseQuery() {}
@@ -104,7 +105,7 @@
     return result;
   }
-  private class PhraseWeight implements Weight {
+  public class PhraseWeight implements Weight {
     private Similarity similarity;
     private float value;
     private float idf;
@@ -138,7 +139,7 @@
     if (terms.size() == 0)  // optimize zero-term case
       return null;
-    TermPositions[] tps = new TermPositions[terms.size()];
+    tps = new TermPositions[terms.size()];
     for (int i = 0; i < terms.size(); i++) {
       TermPositions p = reader.termPositions((Term)terms.elementAt(i));
       if (p == null)
@@ -155,7 +156,9 @@
       reader.norms(field));
   }
-
+  public TermPositions[] getTermPositions() {
+    return tps;
+  }
   public Explanation explain(IndexReader reader, int doc) throws IOException {

On 6/27/07, Mark Miller [EMAIL PROTECTED] wrote: You cannot do it because TermPositions is read in the PhraseWeight.scorer(IndexReader) method (or MultiPhraseWeight) and loaded into an array which is passed to PhraseScorer. Extending the Weight as well and passing the payload to the Scorer is a possibility. - Mark Peter Keegan wrote: I'm looking at the new Payload api and would like to use it in the following manner. Meta-data is indexed as a special phrase (all terms at same position) and a payload is stored with the first term of each phrase. I would like to create a custom query class that extends PhraseQuery and uses its PhraseScorer to find matching documents. The custom query class then reads the payload from the first term of the matching query and uses it to produce a new score. However, I don't see how to get the payload from the PhraseScorer's TermPositions. Is this possible? Peter - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Payloads and PhraseQuery
I'm now looking at using payloads with SpanNearQuery but I don't see any clear way of getting the payload(s) from the matching span terms. The term positions for the payloads seem to be buried beneath SpanCells in the NearSpansOrdered and NearSpansUnordered classes, which are not public. I'd be content to be able to get the payload from just the first term of the span. Can anyone suggest an approach for making payloads work with SpanNearQuery? Peter On 6/27/07, Grant Ingersoll [EMAIL PROTECTED] wrote: Could you get what you need combining the BoostingTermQuery with a SpanNearQuery to produce a score? Just guessing here.. At some point, I would like to see more Query classes around the payload stuff, so please submit patches/feedback if and when you get a solution On Jun 27, 2007, at 10:45 AM, Peter Keegan wrote: I'm looking at the new Payload api and would like to use it in the following manner. Meta-data is indexed as a special phrase (all terms at same position) and a payload is stored with the first term of each phrase. I would like to create a custom query class that extends PhraseQuery and uses its PhraseScorer to find matching documents. The custom query class then reads the payload from the first term of the matching query and uses it to produce a new score. However, I don't see how to get the payload from the PhraseScorer's TermPositions. Is this possible? Peter -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org/tech/lucene.asp Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Payloads and PhraseQuery
I'm looking for Spans.getPositions(), as shown in BoostingTermQuery, but neither NearSpansOrdered nor NearSpansUnordered (which are the Spans provided by SpanNearQuery) provide this method and it's not clear to me how to add it. Peter On 7/11/07, Chris Hostetter [EMAIL PROTECTED] wrote: : I'm now looking at using payloads with SpanNearQuery but I don't see any : clear way of getting the payload(s) from the matching span terms. The term : positions for the payloads seem to be buried beneath SpanCells in the Isn't Spans.start() and Spans.end() what you are looking for? -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Payloads and PhraseQuery
Grant, If/when you have an implementation for SpanNearQuery, I'd be happy to test it. Peter On 7/12/07, Grant Ingersoll [EMAIL PROTECTED] wrote: Yep, totally agree. One way to handle this, initially at least, is to have isPayloadAvailable() only return true for the SpanTermQuery. The other option is to come up with some modification of the suggested methods below to return all the payloads in a span. I have a basic implementation for just the SpanTermQuery (i.e. via TermSpans) in the works. I will take a crack at fleshing out the rest at some point soon. -Grant On Jul 12, 2007, at 1:22 PM, Paul Elschot wrote: On Thursday 12 July 2007 14:50, Grant Ingersoll wrote: That is off of the TermSpans class. BTQ (BoostingTermQuery) is implemented to extend SpanQuery, thus SpanNearQuery isn't, w/o modification, going to have access to these things. However, if you look at the SpanTermQuery, you will see that its implementation of Spans is indeed the TermSpans class. So, I think you could cast to it or handle it through instanceof. I am not completely sure here, but it seems like we may need an efficient way to access the TermPositions for each document. That is, the Spans class doesn't provide this and maybe it should somehow. Again, I am just thinking out loud here. SpanQueries can be nested, so the relationship between a span and a term position can also be one to many, not only one to one. For example a matching span in the Spans of a SpanNearQuery can be based on two matching (near enough to match) term positions. Thus, if we modified Spans to have the following methods: byte[] getPayload(byte[] data, int offset) boolean isPayloadAvailable() I think this would be useful. Perhaps this should be discussed on dev. And the same holds for the payloads, there may be more than one for a single Span. 
Regards, Paul Elschot Cheers, Grant On Jul 12, 2007, at 8:20 AM, Peter Keegan wrote: I'm looking for Spans.getPositions(), as shown in BoostingTermQuery, but neither NearSpansOrdered nor NearSpansUnordered (which are the Spans provided by SpanNearQuery) provide this method and it's not clear to me how to add it. Peter On 7/11/07, Chris Hostetter [EMAIL PROTECTED] wrote: : I'm now looking at using payloads with SpanNearQuery but I don't see any : clear way of getting the payload(s) from the matching span terms. The term : positions for the payloads seem to be buried beneath SpanCells in the Isn't Spans.start() and Spans.end() what you are looking for? -Hoss --- -- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org/tech/lucene.asp Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/ LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Grant Ingersoll http://www.grantingersoll.com/ http://lucene.grantingersoll.com http://www.paperoftheweek.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: encoding question.
The source data for my index is already in standard UTF-8 and available as a simple byte array. I need to do some simple tokenization of the data (check for whitespace and special characters that control position increment). What is the most efficient way to index this data and avoid unnecessary conversions to/from java Strings or char arrays? Looking at DocumentsWriter, I see that all terms are eventually converted to char arrays and written in modified UTF-8, so there doesn't seem to be much advantage to having the source data in standard UTF-8. Peter On 2/14/07, Chris Hostetter [EMAIL PROTECTED] wrote: Internally Lucene deals with pure Java Strings; when writing those strings to and reading those strings back from disk, Lucene always uses the stock Java modified UTF-8 format, regardless of what your file.encoding system property may be. Typically when people have encoding problems in their lucene applications, the origin of the problem is in the way they fetch the data before indexing it ... if you can make a String object, and System.out.println that string and see what you expect, then handing that string to Lucene as a field value should work fine. What exactly is the value object you are calling getBytes on? ... if it's another String, then you've already got serious problems -- I can't imagine any situation where fetching the bytes from a String in one charset and using those bytes to construct another string (either in a different charset, or in the system default charset) would make any sense at all. Wherever your original binary data is coming from (files on disk, network socket, etc...) that's when you should be converting those bytes into chars using whatever charset you know those bytes represent. : Date: Wed, 14 Feb 2007 09:16:58 +0330 : From: Mohammad Norouzi [EMAIL PROTECTED] : Reply-To: java-user@lucene.apache.org : To: java-user@lucene.apache.org : Subject: encoding question. 
: : Hi : I want to index data with utf-8 encoding, so when adding a field to a document : I am using the code new String(value.getBytes("utf-8")) : on the other hand, when I am going to search I was using the same snippet of : code to convert to utf-8 but it did not work, so finally I found somewhere : it said to use new String(valueToSearch.getBytes("cp1252"), "UTF8") : and it worked fine but I still have some problems. : first, some characters are weird when I get a result from lucene; it seems it : is in cp1252 encoding. : second, if the java environment property file.encoding is not cp1252, the : result is completely in incorrect encoding, so I must change this property : using System.setProperty("file.encoding","cp1252") : : does lucene neglect my utf-8 encoding and proceed to index data using cp1252? : how can I correct the weird characters I receive by searching? : : Thank you very much in advance. : -- : Regards, : Mohammad : -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
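Hoss's advice, decode the original bytes exactly once using the charset they were actually written in, can be shown in a self-contained sketch. The class and method names are illustrative; the second method reproduces the round-trip anti-pattern from the question:

```java
import java.nio.charset.StandardCharsets;

// Decode the source bytes once, with the charset they were written in;
// never round-trip a String through getBytes() in a different charset.
class EncodingExample {
    /** The one correct conversion: raw UTF-8 bytes -> String. */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** The anti-pattern: re-decoding a String's bytes in another charset
     *  corrupts any non-ASCII character. */
    static String roundTripWrong(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }
}
```

Handing the correctly decoded String to a Lucene Field is then all that's needed; Lucene's internal modified UTF-8 is an on-disk detail the application never sees.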
Re: Payloads and PhraseQuery
I guess this also ties in with 'getPositionIncrementGap', which is relevant to fields with multiple occurrences. Peter On 7/27/07, Peter Keegan [EMAIL PROTECTED] wrote: I have a question about the way fields are analyzed and inverted by the index writer. Currently, if a field has multiple occurrences in a document, each occurrence is analyzed separately (see DocumentsWriter.processField). Is it safe to assume that this behavior won't change in the future? The reason I ask is that my custom analyzer's 'tokenStream' method creates a custom filter which produces a payload based on the existence of each field occurrence. However, if DocumentsWriter was changed and combined all the occurrences before inversion, my scheme wouldn't work. Since payloads are created by filters/tokenizers, it helps to keep things flexible. Thanks, Peter On 7/12/07, Grant Ingersoll [EMAIL PROTECTED] wrote: On Jul 12, 2007, at 6:12 PM, Chris Hostetter wrote: Hmm... okay so the issue is that in order to get the payload data, you have to have a TermPositions instance. instead of adding getPayload methods to the Spans class (which as Paul points out, can have nesting issues) perhaps more general solutions would be: a) a more high level getPayload API that lets you get a payload arbitrarily for a doc/position (perhaps as part of the TermDocs API?) ... then for Spans you could use this new API with Spans.start() and Spans.end(). (and all the positions in between) Not sure I follow this. I don't see the fit w/ TermDocs. b) add a variation of the TermPositions class to allow people to iterate through the terms of a TermDoc in position order (TermPosition first iterates over the Terms and then over the positions) ... then you could seek(span.start()) to get the Payload data c) add methods to the Spans API to get the subspans (if any) ... this would be the Spans corollary to getTerms() and would always return TermSpans which would have TermPositions for getting payload data. 
This could be a good alternative. When we first talked about payloads we wondered if we could just make all Queries into SpanQueries by passing TermPositions instead of term docs, but in the end decided not to do it because of performance issues (some of which are lessened by lazy loading of TermPositions. The thing is, I think, that the Spans is already moving you along in the term positions, so it just seems like a natural fit to have it there, even if there is nesting. It doesn't seem like it would be that hard to then return back the nesting stuff b/c you are just collating the results from the underlying SpanTermQuery. Having said that, I haven't looked into the actual code, so take that w/ a grain of salt. I will try to do some more investigation, as others are welcome to do. Perhaps we should move this to dev? Cheers, Grant - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: LUCENE-843 Release
I've built a production index with this patch and done some query stress testing with no problems. I'd give it a thumbs up. Peter On 7/30/07, testn [EMAIL PROTECTED] wrote: Hi guys, Do you think LUCENE-843 is stable enough? If so, do you think it's worth to release it with probably LUCENE 2.2.1? It would be nice so that people can take the advantage of it right away without risking other breaking changes in the HEAD branch or waiting until 2.3 release. Thanks, -- View this message in context: http://www.nabble.com/LUCENE-843-Release-tf4170191.html#a11863644 Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Mixing SpanQuery and BooleanQuery
I'm trying to create a fairly complex SpanQuery from a binary parse tree. I create SpanOrQueries from SpanTermQueries and combine SpanOrQueries into BooleanQueries. So far, so good. The problem is that I don't see how to create a SpanNotQuery from a BooleanQuery and a SpanTermQuery. I want the BooleanQuery to be the 'include' span and the SpanTermQuery to be the 'exclude' span. Unfortunately, the BooleanQuery cannot be cast to a SpanQuery. I thought that SpanQuery and BooleanQuery could be freely intermixed, but this doesn't seem to be the case. It seems that what's really needed is a 'SpanAndQuery'. Is there another way to build this type of query? Thanks, Peter
Re: Mixing SpanQuery and BooleanQuery
Even without 'interesting' slops, it does appear that SpanNearQuery is a logical AND of all its clauses. I was distracted by the BooleanQuery examples in the javadocs :) thanks, Peter On 8/6/07, Erick Erickson [EMAIL PROTECTED] wrote: Isn't a SpanAndQuery the same as a SpanNearQuery? Perhaps with interesting slops.. Erick On 8/6/07, Peter Keegan [EMAIL PROTECTED] wrote: I'm trying to create a fairly complex SpanQuery from a binary parse tree. I create SpanOrQueries from SpanTermQueries and combine SpanOrQueries into BooleanQueries. So far, so good. The problem is that I don't see how to create a SpanNotQuery from a BooleanQuery and a SpanTermQuery. I want the BooleanQuery to be the 'include' span and the SpanTermQuery to be the 'exclude' span. Unfortunately, the BooleanQuery cannot be cast to a SpanQuery. I thought that SpanQuery and BooleanQuery could be freely intermixed, but this doesn't seem to be the case. It seems that what's really needed is a 'SpanAndQuery'. Is there another way to build this type of query? Thanks, Peter
SpanQuery and database join
I've been experimenting with using SpanQuery to perform what is essentially a limited type of database 'join'. Each document in the index contains 1 or more 'rows' of meta data from another 'table'. The meta data are simple tokens representing a column name/value pair (e.g. color$red or location$123). Each row is represented by a span with a maximum token length equal to the maximum number of meta data columns. If a column has multiple values, they are all indexed at the same position (e.g. color$red, color$blue). All rows are added to a single field. The spans are 'separated' from each other by introducing a position gap between them via 'Analyzer.getPositionIncrementGap'. This gap should be greater than the number of columns in each span. At query time, a SpanNearQuery is constructed to represent the meta data to join. The 'slop' value is set to the maximum number of meta data columns (minus 1). Using a simple Antlr parser, boolean span queries with AND, OR, NOT can be constructed fairly easily. The SpanQuery is AND'd to the main query to build the final query. This approach is flexible and pretty efficient because no stored fields or external data are accessed at query time. Span queries are more expensive than other queries, though. We measure performance via throughput (as opposed to the response time for a single query), and the addition of a SpanQuery reduced throughput by 5X for ordered spans and 10X for unordered spans. Still, this may be acceptable for some applications, especially if spans are not used on every query. I thought this might interest some of you. Peter
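The row-span layout described above can be sketched as plain-Java position bookkeeping. The class, the ROW_GAP constant, and the token@position output format are illustrative stand-ins, not Lucene APIs; in Lucene the gap would come from Analyzer.getPositionIncrementGap:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Position layout for the metadata "rows": each row is a map of
// column name -> values. All values of one column share a position,
// and a gap larger than the column count separates rows.
class MetaRowPositions {
    static final int ROW_GAP = 100; // must exceed the maximum column count

    /** Returns "token@position" strings, e.g. "color$red@0". */
    static List<String> layout(List<Map<String, List<String>>> rows) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (int r = 0; r < rows.size(); r++) {
            if (r > 0) {
                pos += ROW_GAP; // separate this row's span from the previous one
            }
            for (Map.Entry<String, List<String>> col : rows.get(r).entrySet()) {
                for (String v : col.getValue()) {
                    out.add(col.getKey() + "$" + v + "@" + pos); // multi-values share a position
                }
                pos++; // next column gets the next position
            }
        }
        return out;
    }
}
```

A SpanNearQuery with slop of (max columns - 1) then matches only within one row, since the inter-row gap is too wide for the slop to bridge.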
Re: SpanQuery and database join
I suppose it could go under performance or HowTo/Interesting uses of SpanQuery. Peter On 8/13/07, Erick Erickson [EMAIL PROTECTED] wrote: Thanks for writing this up. Do you think this is an appropriate subject for the Wiki performance page? Erick On 8/13/07, Peter Keegan [EMAIL PROTECTED] wrote: I've been experimenting with using SpanQuery to perform what is essentially a limited type of database 'join'. Each document in the index contains 1 or more 'rows' of meta data from another 'table'. The meta data are simple tokens representing a column name/value pair ( e.g. color$red or location$123). Each row is represented by a span with a maximum token length equal to the maximum number of meta data columns. If a column has multiple values, they are all indexed at the same position ( e.g. color$red, color$blue). All rows are added to a single field. The spans are 'separated' from each other by introducing a position gap between them via ' Analyzer.getPositionIncrementGap'. This gap should be greater than the number of columns in each span. At query time, a SpanNearQuery is constructed to represent the meta data to join. The 'slop' value is set to the maximum number of meta data columns (minus 1). Using a simple Antlr parser, boolean span queries with AND, OR, NOT can be constructed fairly easily. The SpanQuery is And'd to the main query to build the final query. This approach is flexible and pretty efficient because no stored fields or external data are accessed at query time. Span queries are more expensive compared than other queries, though. We measure performance via throughput (as opposed to the response time for a single query), and the addition of a SpanQuery reduced throughput by 5X for ordered spans and 10X for unordered spans. Still, this may be acceptable for some applications, especially if spans are not used on every query. I thought this might interest some of you. Peter
Re: SpanQuery and database join
I added this under Use Cases. Thanks for the suggestion. Peter On 8/13/07, Grant Ingersoll [EMAIL PROTECTED] wrote: There is also a Use Cases item on the Wiki... On Aug 13, 2007, at 3:26 PM, Peter Keegan wrote: I suppose it could go under performance or HowTo/Interesting uses of SpanQuery. Peter On 8/13/07, Erick Erickson [EMAIL PROTECTED] wrote: Thanks for writing this up. Do you think this is an appropriate subject for the Wiki performance page? Erick On 8/13/07, Peter Keegan [EMAIL PROTECTED] wrote: I've been experimenting with using SpanQuery to perform what is essentially a limited type of database 'join'. Each document in the index contains 1 or more 'rows' of meta data from another 'table'. The meta data are simple tokens representing a column name/value pair ( e.g. color$red or location$123). Each row is represented by a span with a maximum token length equal to the maximum number of meta data columns. If a column has multiple values, they are all indexed at the same position ( e.g. color$red, color$blue). All rows are added to a single field. The spans are 'separated' from each other by introducing a position gap between them via ' Analyzer.getPositionIncrementGap'. This gap should be greater than the number of columns in each span. At query time, a SpanNearQuery is constructed to represent the meta data to join. The 'slop' value is set to the maximum number of meta data columns (minus 1). Using a simple Antlr parser, boolean span queries with AND, OR, NOT can be constructed fairly easily. The SpanQuery is And'd to the main query to build the final query. This approach is flexible and pretty efficient because no stored fields or external data are accessed at query time. Span queries are more expensive compared than other queries, though. We measure performance via throughput (as opposed to the response time for a single query), and the addition of a SpanQuery reduced throughput by 5X for ordered spans and 10X for unordered spans. 
Still, this may be acceptable for some applications, especially if spans are not used on every query. I thought this might interest some of you. Peter -- Grant Ingersoll http://lucene.grantingersoll.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Scoring results?!
If I use BoostingTermQuery on a query containing terms without payloads, I get very different results than doing the same query with TermQuery. Presumably, this is because the BoostingSpanScorer/SpanScorer compute scores differently than TermScorer. Is there a way to make BoostingTermQuery behave like TermQuery for terms without payloads? Peter On 5/9/07, Grant Ingersoll [EMAIL PROTECTED] wrote: Hi Eric, On May 9, 2007, at 2:39 AM, supereric wrote: How can I get the tag word score in Lucene? Suppose that you have searched a tag word and 3 hit documents are now found. 1 - How could someone find the number of occurrences in any document so it could sort the results? Span Queries tell you where the matches occur in the document by offset, but I am not sure what your sorting criteria would be. The explain method also can give you information about why a particular document scored a particular way. Also I want to have some other policies for ranking the results. What should I do to handle that? For example, I want to score boldfaced tag words in an html document twice normal texts. Although totally experimental at this stage, the new Payload stuff in the trunk version of Lucene (or nightly builds) is designed for such a scenario. Check out the BoostingTermQuery which can boost term scores based on the contents of a payload located at a particular term. Feedback on the APIs is very much appreciated. 2 - How can I omit some tag words from the index? For example, common words in another language? See the StopFilter token filter and/or the StopwordAnalyzer HTH, Grant -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org/tech/lucene.asp Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
BoostingTermQuery.explain() bugs
There are a couple of minor bugs in BoostingTermQuery.explain(). 1. The computation of the average payload score produces NaN if no payloads were found. It should probably be: float avgPayloadScore = super.score() * (payloadsSeen > 0 ? (payloadScore / payloadsSeen) : 1); 2. If the average payload score is zero, the value of the explanation is 0: result.setValue(nonPayloadExpl.getValue() * avgPayloadScore); If the query is part of a BooleanClause, this results in: no match on required clause... failure to meet condition(s) of required/prohibited clause(s) Let me know if I should open a JIRA issue. Peter
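The first fix can be checked numerically in isolation. This sketch (hypothetical class name) shows why the guard is needed: without it, payloadScore / payloadsSeen is 0/0 = NaN when no payloads were seen, while the guarded form falls back to a neutral multiplier of 1:

```java
// Sketch of the proposed fix for BoostingTermQuery.explain():
// guard the division so a term with no payloads keeps its base score.
class PayloadScoreFix {
    static float avgPayloadScore(float superScore, float payloadScore, int payloadsSeen) {
        return superScore * (payloadsSeen > 0 ? (payloadScore / payloadsSeen) : 1);
    }
}
```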
BoostingTermQuery performance
I have been experimenting with payloads and BoostingTermQuery, which I think are excellent additions to Lucene core. Currently, BoostingTermQuery extends SpanQuery. I would suggest changing this class to extend TermQuery and refactor the current version to something like 'BoostingSpanQuery'. The reason is rooted in performance. In my testing, I compared query throughput using TermQuery against 2 versions of BoostingTermQuery - the current one that extends SpanQuery and one that extends TermQuery (which I've included, below). Here are the results (qps = queries per second): TermQuery:200 qps BoostingTermQuery (extends SpanQuery): 97 qps BoostingTermQuery (extends TermQuery): 130 qps Here is a version of BoostingTermQuery that extends TermQuery. I had to modify TermQuery and TermScorer to make them public. A code review would be in order, and I would appreciate your comments on this suggestion. Peter - import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.Term; import org.apache.lucene.index.TermDocs; import org.apache.lucene.index.TermPositions; import org.apache.lucene.search.*; import java.io.IOException; /** * Copyright 2004 The Apache Software Foundation * p/ * Licensed under the Apache License, Version 2.0 (the License); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * p/ * http://www.apache.org/licenses/LICENSE-2.0 * p/ * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an AS IS BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
*/

/**
 * The BoostingTermQuery is very similar to the {@link org.apache.lucene.search.spans.SpanTermQuery} except
 * that it factors in the value of the payload located at each of the positions where the
 * {@link org.apache.lucene.index.Term} occurs.
 * <p>
 * In order to take advantage of this, you must override
 * {@link org.apache.lucene.search.Similarity#scorePayload(byte[],int,int)} which returns 1 by default.
 * <p>
 * Payload scores are averaged across term occurrences in the document.
 *
 * <p><font color="#FF0000">
 * WARNING: The status of the <b>Payloads</b> feature is experimental.
 * The APIs introduced here might change in the future and will not be
 * supported anymore in such a case.</font>
 *
 * @see org.apache.lucene.search.Similarity#scorePayload(byte[], int, int)
 */
public class BoostingTermQuery extends TermQuery {

  Term term;
  Similarity similarity;

  public BoostingTermQuery(Term term) {
    super(term);
    this.term = term;
  }

  protected Weight createWeight(Searcher searcher) throws IOException {
    this.similarity = getSimilarity(searcher);
    return new BoostingTermWeight(this, searcher);
  }

  protected class BoostingTermWeight extends TermWeight implements Weight {

    public BoostingTermWeight(BoostingTermQuery query, Searcher searcher) throws IOException {
      super(searcher);
    }

    public Scorer scorer(IndexReader reader) throws IOException {
      return new BoostingTermScorer(reader.termDocs(term), reader.termPositions(term),
          this, similarity, reader.norms(term.field()));
    }

    class BoostingTermScorer extends TermScorer {

      //TODO: is this the best way to allocate this?
      byte[] payload = new byte[256];
      private TermPositions positions;
      protected float payloadScore;
      private int payloadsSeen;

      public BoostingTermScorer(TermDocs termDocs, TermPositions termPositions, Weight weight,
                                Similarity similarity, byte[] norms) throws IOException {
        super(weight, termDocs, similarity, norms);
        positions = termPositions;
      }

      /**
       * Go to the next document
       */
      public boolean next() throws IOException {
        boolean result = super.next();
        //set the payload. super.next() properly increments the term positions
        if (result) {
          if (positions.skipTo(super.doc())) {
            positions.nextPosition();
            processPayload(similarity);
          }
        }
        return result;
      }

      public boolean skipTo(int target) throws IOException {
        boolean result = super.skipTo(target);
        if (result) {
          if (positions.skipTo(target)) {
            positions.nextPosition();
            processPayload(similarity);
          }
        }
        return result;
      }

      // protected boolean setFreqCurrentDoc() throws IOException {
      //   if (!more) {
      //     return false;
      //   }
      //   doc = spans.doc();
      //   freq = 0.0f;
      //   payloadScore = 0;
      //   payloadsSeen = 0;
      //   Similarity similarity1 = getSimilarity();
      //
Re: Can I do boosting based on term positions?
This is a nice alternative to using payloads and BoostingTermQuery. Is there any reason not to make this change to SpanFirstQuery, in particular: This modification to SpanFirstQuery would be that the Spans returned by SpanFirstQuery.getSpans() must always return 0 from its start() method. Should I open a Jira issue? Thanks, Peter On Aug 3, 2007 2:11 PM, Paul Elschot [EMAIL PROTECTED] wrote: On Friday 03 August 2007 20:35, Shailendra Sharma wrote: Paul, If I understand Cedric right, he wants to have different boosting depending on search term positions in the document. By using SpanFirstQuery he will only be able to consider terms up to a particular position; he won't be able to do something like the following:

a) Give 100% boosting to matches in the first 100 words.
b) Give 80% boosting to matches in the next 100 words.
c) Give 60% boosting to matches in the next 100 words.

Though it can be done by writing a DisjunctionMaxQuery over multiple SpanFirstQuery clauses with different boosts - I see that as a workaround only, not a direct and efficient solution. You're right, but SpanFirstQuery needs only a minor modification for this to work. This modification to SpanFirstQuery would be that the Spans returned by SpanFirstQuery.getSpans() must always return 0 from its start() method. Then the slop passed to sloppyFreq(slop) would be the distance from the beginning of the indexed field to the end of the Spans of the SpanQuery passed to SpanFirstQuery. Then the following should work:

Term firstTerm = ;
SpanFirstQuery sfq = new SpanFirstQuery(new SpanTermQuery(firstTerm), Integer.MAX_VALUE) {
  ...
  public Similarity getSimilarity() {
    return new Similarity() {
      ...
      float sloppyFreq(int slop) {
        return (slop < 100) ? 1.0f
             : (slop < 200) ? 0.8f
             : (slop < 300) ? 0.6f
             : 0.4f;
      // etc. etc.

Actually, I'm a bit surprised that SpanFirstQuery does not work that way now.
Regards, Paul Elschot Cedric, I am sending the implementation of SpanTermQuery to your gmail account (the lucene mailing list is bouncing email with attachments). I have named the class VSpanTermQuery (I have followed the same package hierarchy as lucene). You also need to extend the VSimilarity class - which requires implementing the method scoreSpan(..). Let me know how it went. Though I did some testing on it, I need to do extensive testing before submitting it to contrib. Thanks, Shailendra On 8/3/07, Paul Elschot [EMAIL PROTECTED] wrote: Cedric, You can choose the end limit for SpanFirstQuery yourself. Regards, Paul Elschot On Friday 03 August 2007 05:38, Cedric Ho wrote: Hi Paul, Doesn't SpanFirstQuery only match those with position less than a certain end position? I am rather looking for a query that would score a document higher for terms that appear near the start, but not totally discard those where the terms appear near the end. Regards, Cedric On 8/2/07, Paul Elschot [EMAIL PROTECTED] wrote: Cedric, SpanFirstQuery could be a solution without payloads. You may want to give it your own Similarity.sloppyFreq() . Regards, Paul Elschot On Thursday 02 August 2007 04:07, Cedric Ho wrote: Thanks for the quick response =) On 8/1/07, Shailendra Sharma [EMAIL PROTECTED] wrote: Yes, it is easily doable through the Payload facility. During the indexing process (mainly tokenization), you need to push this extra information into each token. And then you can use BoostingTermQuery to include the Payload value in the score. You also need to implement a Similarity for this (mainly the scorePayload method). If I store, say, a custom boost factor as a Payload, does it mean that I will store one more byte per term per document in the index file? So the index file would be much larger? Another way would be to extend SpanTermQuery, which already calculates the position of the match. You just need to do something to use this position value in the score calculation.
I see that SpanTermQuery takes a TermPositions from the indexReader and I can get the term position from there. However, I am not sure how to incorporate it into the score calculation. Would you mind giving a little more detail on this? One possible advantage of the SpanTermQuery approach is that you can play around without re-creating indices every time. Thanks, Shailendra Sharma, CTO, Ver se' Innovation Pvt. Ltd. Bangalore, India On 8/1/07, Cedric Ho [EMAIL PROTECTED] wrote: Hi all, I was wondering if it is possible to do boosting by search terms'
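Paul's tiered sloppyFreq idea from earlier in this thread can be sketched as a standalone method. The thresholds are the ones given in the thread; this is an illustration only, not Lucene's Similarity API:

```java
// Position-tiered boost: matches nearer the start of the field score higher.
// Mirrors the sloppyFreq override sketched in the thread, where 'distance'
// is the offset from the beginning of the indexed field to the match.
public class TieredBoost {
    static float sloppyFreq(int distance) {
        return (distance < 100) ? 1.0f   // first 100 words: full boost
             : (distance < 200) ? 0.8f   // next 100 words
             : (distance < 300) ? 0.6f   // next 100 words
             : 0.4f;                     // everything after
    }
}
```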
Re: FieldSortedHitQueue rise in memory
Hi Brian, I ran into something similar a long time ago. My custom sort objects were being cached by Lucene, but there were too many of them because each one had different 'reference values' for different queries. So, I changed the equals and hashCode methods to NOT use any instance data, thus avoiding the caching. Could this be what you're seeing? Peter On Feb 18, 2008 4:20 PM, Brian Doyle [EMAIL PROTECTED] wrote: We've implemented a custom sort class and use it to sort by distance. We have implemented equals and hashCode in the sort comparator. After running for a few hours we're reaching peak memory usage and eventually the server runs out of memory. We did some profiling and noticed that a large chunk of memory is being used in the lucene.search.FieldSortedHitQueue class. Has anyone seen this behavior before or know how we can stop this class from growing in size?
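Peter's workaround can be sketched as follows. This is a hypothetical comparator-key class, not the actual code from the thread; the point is that equals/hashCode deliberately ignore the per-query reference values, so Lucene's FieldSortedHitQueue cache sees every instance as the same key instead of accumulating one entry per query:

```java
// Hypothetical distance-sort key: each query carries its own reference
// point, but equals/hashCode ignore it so all instances are cache-equal
// and the comparator cache stays bounded.
public class DistanceSortKey {
    final double refLat, refLon;   // per-query reference values

    DistanceSortKey(double refLat, double refLon) {
        this.refLat = refLat;
        this.refLon = refLon;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof DistanceSortKey;   // no instance data used
    }

    @Override
    public int hashCode() {
        return DistanceSortKey.class.hashCode();   // constant per class
    }
}
```

The trade-off is that the cache can no longer distinguish two keys with different reference points, which is exactly why it stops growing.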
Re: Swapping between indexes
Sridhar, We have been using approach 2 in our production system with good results. We have separate processes for indexing and searching. The main issue that came up was in deleting old indexes (see: http://tinyurl.com/32q8c4). Most of our production problems occur during indexing, and we are able to fix these without having to interrupt searching at all. This has been a real benefit. Peter On Thu, Mar 6, 2008 at 5:30 AM, Sridhar Raman [EMAIL PROTECTED] wrote: This is my situation. I have an index, which has a lot of search requests coming into it. I use just a single instance of IndexSearcher to process these requests. At the same time, this index is also getting updated by an IndexWriter. And I want these new changes to be reflected _only_ at certain intervals. I have thought of a few ways of doing this. Each has its share of problems and pluses. I would be glad if someone can help me in figuring out the right approach, especially from the performance point of view, as the number of documents that will get indexed is pretty large.

Approach 1: Have just one copy of the index for both Search & Index. At time T, when I need to see the new changes reflected, I close the Searcher, and open it again.
- The re-open of the Searcher might be a bit slow (which I could probably solve by using some warm-up threads).
- Update and Search on the index at the same time - will this affect the performance?
- If the server crashes before time T, the new Searcher would reflect the changes, which is not acceptable. I want the changes to be reflected only at time T. If the server crashes, the index should be the previous T-1 index.
- Possible problems while optimising the index (as Search is also happening).
+ Just one copy of the index being stored.

Approach 2: Keep 2 copies of the index - 1 for Search, 1 for Index. At time T, I just switch the Searcher to a copy of the index that is being updated.
- Before I do the switch to the new index, I need to make a copy of it so that the updates continue to happen on the other index. Is there a convenient way to make this copy? Is it efficient?
- Time taken to create a new Searcher will still be a problem (but this is a problem in the previous approach as well, and we can live with it).
+ Optimise can happen on an index that is not being read; as a result, its resource requirements would be lower, and probably even the speed of optimisation would improve.
+ Faster search, as the index update is happening on a different index.

So, these are the 2 approaches I am contemplating. Any pointers on which would be the better approach? Thanks, Sridhar
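For approach 2, the switch at time T can be as simple as publishing a new searcher through an atomic reference. A sketch with a stand-in type (String here, where real code would use an IndexSearcher over the freshly copied index, and would close the old searcher once in-flight queries drain):

```java
import java.util.concurrent.atomic.AtomicReference;

// Readers always go through 'current'; the indexing side swaps it at time T.
public class SearcherHolder {
    // String stands in for an IndexSearcher in this sketch.
    private final AtomicReference<String> current;

    public SearcherHolder(String initial) {
        current = new AtomicReference<>(initial);
    }

    public String acquire() {            // called per search; never blocks
        return current.get();
    }

    public String swap(String fresh) {   // called at time T; returns the old one
        return current.getAndSet(fresh); // caller closes it once queries drain
    }
}
```

Searches never synchronize on the swap, which also avoids the thread-thrashing problems that come from funneling every search through a lock.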
theoretical maximum score
Is it possible to compute a theoretical maximum score for a given query if constraints are placed on 'tf' and 'lengthNorm'? If so, scores could be compared to a 'perfect score' (a feature request from our customers). Here are some related threads on this: In this thread: http://www.nabble.com/Newbie-questions-re%3A-scoring-td4228776.html#a4228776 Hoss writes: "the only way I can think of to fairly compare scores from queries for foo:bar with queries for yak:baz is to normalize them relative to a maximum possible score across the entire term query space -- but finding that maximum is a pretty complicated problem just for simple term queries ... when you start talking about more complicated query structures you really get messy -- and even then it's only fair as long as the query structures are identical, you can never compare the scores from apples and oranges" And in this thread: http://www.nabble.com/non-relative-scoring-td8956299.html#a8956299 Walt writes: "A tf.idf engine, like Lucene, might not have a maximum score. What if a document contains the word a thousand times? A million times?" It seems that if 'tf' is limited to a max value and 'lengthNorm' is a constant, it might be possible, at least for 'simple' term queries. But Hoss says that things get messy with complicated queries. Could someone elaborate a bit? Does the index contain enough info to do this efficiently? I realize that score values must be interpreted 'carefully', but I'm seeing a push to get more leverage from the absolute values, not just the relative values. Peter
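Under the stated constraints - tf capped at some tfMax and a constant lengthNorm - a per-term upper bound falls out of the classic tf.idf form, score proportional to sqrt(tf) * idf^2 * lengthNorm (ignoring coord and queryNorm). A sketch of the arithmetic with made-up numbers, as a thought experiment rather than a statement about what Lucene computes internally:

```java
// Upper bound on a single term clause when tf is capped and lengthNorm is
// constant: sqrt(tfMax) * idf^2 * lengthNorm. For a flat disjunction of
// such clauses the bounds simply add; nested/complex queries are exactly
// where Hoss says this gets messy.
public class MaxScoreBound {
    static double termBound(double idf, int tfMax, double lengthNorm) {
        return Math.sqrt(tfMax) * idf * idf * lengthNorm;
    }

    static double disjunctionBound(double[] termBounds) {
        double sum = 0;
        for (double b : termBounds) sum += b;
        return sum;
    }
}
```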
Payloads and SpanScorer
If a SpanQuery is constructed from one or more BoostingTermQuery(s), the payloads on the terms are never processed by the SpanScorer. It seems to me that you would want the SpanScorer to score the document both on the spans distance and the payload score. So, either the SpanScorer would have to process the payloads (duplicating the code in BoostingSpanScorer), or perhaps SpanScorer could access the BoostingSpanScorers, or maybe there's another approach. Any thoughts on how to accomplish this? Peter
Re: Payloads and SpanScorer
Suppose I create a SpanNearQuery phrase with the terms long range missiles and some slop factor. Each term is actually a BoostingTermQuery. Currently, the score computed by SpanNearQuery.SpanScorer is based on the sloppy frequency of the terms and their weights (this is fine). But even though each term is actually a BoostingTermQuery, the BoostingTermScorer (and therefore 'processPayload') is never invoked for this type of query. I was looking for a way to have SpanNearQuery (also SpanOrQuery, SpanFirstQuery) recognize that the terms in the phrase should boost the overall score based on the payloads assigned to them. Thus the score from the SpanNearQuery would be higher if : a) the terms have payloads that boost their scores b) the terms are positionally next to each other (minimal slop - as it works now) Does this make sense? Peter On Thu, Jul 10, 2008 at 9:21 AM, Grant Ingersoll [EMAIL PROTECTED] wrote: I'm not fully following what you want. Can you explain a bit more? Thanks, Grant On Jul 9, 2008, at 2:55 PM, Peter Keegan wrote: If a SpanQuery is constructed from one or more BoostingTermQuery(s), the payloads on the terms are never processed by the SpanScorer. It seems to me that you would want the SpanScorer to score the document both on the spans distance and the payload score. So, either the SpanScorer would have to process the payloads (duplicating the code in BoostingSpanScorer), or perhaps SpanScorer could access the BoostingSpanScorers, or maybe there's another approach. Any thoughts on how to accomplish this? Peter -- Grant Ingersoll http://www.lucidimagination.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Payloads and SpanScorer
I may take a crack at this. Any more thoughts you may have on the implementation are welcome, but I don't want to distract you too much. Thanks, Peter On Thu, Jul 10, 2008 at 1:30 PM, Grant Ingersoll [EMAIL PROTECTED] wrote: Makes sense. It was always my intent to implement things like PayloadNearQuery, see http://wiki.apache.org/lucene-java/Payload_Planning I think it would make sense to develop these and I would be happy to help shepherd a patch through, but am not in a position to generate said patch at this moment in time. On Jul 10, 2008, at 9:59 AM, Peter Keegan wrote: Suppose I create a SpanNearQuery phrase with the terms long range missiles and some slop factor. Each term is actually a BoostingTermQuery. Currently, the score computed by SpanNearQuery.SpanScorer is based on the sloppy frequency of the terms and their weights (this is fine). But even though each term is actually a BoostingTermQuery, the BoostingTermScorer (and therefore 'processPayload') is never invoked for this type of query. I was looking for a way to have SpanNearQuery (also SpanOrQuery, SpanFirstQuery) recognize that the terms in the phrase should boost the overall score based on the payloads assigned to them. Thus the score from the SpanNearQuery would be higher if : a) the terms have payloads that boost their scores b) the terms are positionally next to each other (minimal slop - as it works now) Does this make sense? Peter On Thu, Jul 10, 2008 at 9:21 AM, Grant Ingersoll [EMAIL PROTECTED] wrote: I'm not fully following what you want. Can you explain a bit more? Thanks, Grant On Jul 9, 2008, at 2:55 PM, Peter Keegan wrote: If a SpanQuery is constructed from one or more BoostingTermQuery(s), the payloads on the terms are never processed by the SpanScorer. It seems to me that you would want the SpanScorer to score the document both on the spans distance and the payload score. 
So, either the SpanScorer would have to process the payloads (duplicating the code in BoostingSpanScorer), or perhaps SpanScorer could access the BoostingSpanScorers, or maybe there's another approach. Any thoughts on how to accomplish this? Peter
Re: Payloads and SpanScorer
I discovered this post from Karl Wettin in May about SpanNearQuery scoring: http://www.nabble.com/SpanNearQuery-scoring-td17425454.html#a17425454 Karl apparently had the same expectations I had about the usage model of spans and boosts. I also found JIRA issue 533 (SpanQuery scoring: SpanWeight lacks a recursive traversal of the query tree), which addresses the same problem. So, I made an attempt to modify SpanNearQuery to expand a nested BoostingTermQuery, but soon realized while debugging that since BoostingTermQuery loads payloads from all term positions in the document, not just the ones constrained by the outer SpanQuery, the resulting score could be higher than it should be. Next, I followed Grant's idea of providing span classes that read payloads. I implemented a 'BoostingNearQuery' that extends 'SpanNearQuery' that provides term boosts on proximity queries. I will submit a patch to a JIRA later. This patch works but probably needs more work. I don't like the use of 'instanceof', but I didn't want to touch Spans or TermSpans. Also, the payload code is mostly a copy of what's in BoostingTermQuery and could be common-sourced somewhere. Feel free to throw darts at it :) Peter On Thu, Jul 10, 2008 at 2:09 PM, Peter Keegan [EMAIL PROTECTED] wrote: I may take a crack at this. Any more thoughts you may have on the implementation are welcome, but I don't want to distract you too much. Thanks, Peter On Thu, Jul 10, 2008 at 1:30 PM, Grant Ingersoll [EMAIL PROTECTED] wrote: Makes sense. It was always my intent to implement things like PayloadNearQuery, see http://wiki.apache.org/lucene-java/Payload_Planning I think it would make sense to develop these and I would be happy to help shepherd a patch through, but am not in a position to generate said patch at this moment in time. On Jul 10, 2008, at 9:59 AM, Peter Keegan wrote: Suppose I create a SpanNearQuery phrase with the terms long range missiles and some slop factor. Each term is actually a BoostingTermQuery. 
Currently, the score computed by SpanNearQuery.SpanScorer is based on the sloppy frequency of the terms and their weights (this is fine). But even though each term is actually a BoostingTermQuery, the BoostingTermScorer (and therefore 'processPayload') is never invoked for this type of query. I was looking for a way to have SpanNearQuery (also SpanOrQuery, SpanFirstQuery) recognize that the terms in the phrase should boost the overall score based on the payloads assigned to them. Thus the score from the SpanNearQuery would be higher if : a) the terms have payloads that boost their scores b) the terms are positionally next to each other (minimal slop - as it works now) Does this make sense? Peter On Thu, Jul 10, 2008 at 9:21 AM, Grant Ingersoll [EMAIL PROTECTED] wrote: I'm not fully following what you want. Can you explain a bit more? Thanks, Grant On Jul 9, 2008, at 2:55 PM, Peter Keegan wrote: If a SpanQuery is constructed from one or more BoostingTermQuery(s), the payloads on the terms are never processed by the SpanScorer. It seems to me that you would want the SpanScorer to score the document both on the spans distance and the payload score. So, either the SpanScorer would have to process the payloads (duplicating the code in BoostingSpanScorer), or perhaps SpanScorer could access the BoostingSpanScorers, or maybe there's another approach. Any thoughts on how to accomplish this? Peter
BoostingTermQuery scoring
I'm using BoostingTermQuery to boost the score of documents with terms containing payloads (boost value > 1). I'd like to change the scoring behavior such that if a query contains multiple BoostingTermQuery terms (either required or optional), documents containing more matching terms with payloads always score higher than documents with fewer terms with payloads. Currently, if one of the terms has a high IDF weight and contains a boosting payload but no payloads on other matching terms, it may score higher than docs with other matching terms with payloads and lower IDF. I think what I need is a way to increase the weight of a matching term in BoostingSpanScorer.score() if 'payloadsSeen > 0', but I don't see how to do this. Any suggestions? Thanks, Peter
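One way to express the ordering asked for here, assuming each hit could expose the number of matching terms that carried payloads (a hypothetical 'termsWithPayloads' count, not something BoostingTermQuery currently reports): compare on that count first, and let raw relevance score only break ties.

```java
import java.util.Comparator;

// Hits with more payload-bearing matched terms always rank first;
// the relevance score only breaks ties within the same count.
// 'Hit' is a hypothetical stand-in for a collected search result.
public class PayloadCountOrdering {
    record Hit(int termsWithPayloads, float score) {}

    static final Comparator<Hit> ORDER =
        Comparator.comparingInt((Hit h) -> h.termsWithPayloads()).reversed()
                  .thenComparing(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
}
```

This sidesteps the IDF-dominance problem by lifting the payload count out of the score entirely, at the cost of a custom collector or sort.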
Re: BoostingTermQuery scoring
Let me give some background on the problem behind my question. Our index contains many fields (title, body, date, city, etc). Most queries search all fields, but for best performance, we create an additional 'contents' field that contains all terms from all fields so that only one field needs to be searched. Some fields, like title and city, are boosted by a factor of 5. In order to make term boosting work, we create an additional field 'boost' that contains all the terms from the boosted fields (title, city). Then, at search time, a query for "petroleum engineer" gets rewritten to: (+contents:petroleum +contents:engineer) (+boost:petroleum +boost:engineer). Note that the two clauses are OR'd so that a term that exists in both fields will get a higher weight in the 'boost' field. This works quite well at boosting documents with terms that exist in the boosted fields. However, it doesn't work properly if excluded terms are added, for example: (+contents:petroleum +contents:engineer -contents:drilling) (+boost:petroleum +boost:engineer -boost:drilling) If a document contains the term 'drilling' in the 'body' field, but not in the 'title' or 'city' field, a false hit occurs. Enter payloads and 'BoostingTermQuery'. At indexing time, as terms are added to the 'contents' field, they are assigned a payload (value=5) if the term also exists in one of the boosted fields. The 'scorePayload' method in our Similarity class returns the payload value as a score. The query no longer contains the 'boost' fields and is simply: +contents:petroleum +contents:engineer -contents:drilling The goal is to make the payload technique behave similarly to the 'boost' field technique. The problem is that the relevance scores of the top hits are sometimes quite different. The reason is that the IDF values for a given term in the 'boost' field are often much higher than for the same term in the 'contents' field.
This makes sense because the 'boost' field contains a fairly small subset of the 'contents' field. Even with a payload of '5', a low IDF in the 'contents' field usually erases the effect of the payload. I have found a fairly simple (albeit inelegant) solution that seems to work. The 'boost' field is still created as before, but it is only used to compute IDF values for the weight class 'BoostingTermQuery.BoostingTermWeight'. I had to make this class 'public' so that I could override the IDF value as follows:

public class MNSBoostingTermQuery extends BoostingTermQuery {

  public MNSBoostingTermQuery(Term term) {
    super(term);
  }

  protected class MNSBoostingTermWeight extends BoostingTermQuery.BoostingTermWeight {

    public MNSBoostingTermWeight(BoostingTermQuery query, Searcher searcher) throws IOException {
      super(query, searcher);
      java.util.HashSet<Term> newTerms = new java.util.HashSet<Term>();
      // Recompute IDF based on 'boost' field
      Iterator i = terms.iterator();
      Term term = null;
      while (i.hasNext()) {
        term = (Term) i.next();
        newTerms.add(new Term("boost", term.text()));
      }
      this.idf = this.query.getSimilarity(searcher).idf(newTerms, searcher);
    }
  }
}

Any thoughts about a better implementation are welcome. Peter On Thu, Nov 6, 2008 at 8:00 AM, Grant Ingersoll [EMAIL PROTECTED] wrote: Not sure, but it sounds like you are interested in a higher level Query, kind of like the BooleanQuery, but then part of it sounds like it is per document, right? Is it that you want to deal with multiple payloads in a document, or multiple BTQs in a bigger query? On Nov 4, 2008, at 9:42 AM, Peter Keegan wrote: I'm using BoostingTermQuery to boost the score of documents with terms containing payloads (boost value > 1). I'd like to change the scoring behavior such that if a query contains multiple BoostingTermQuery terms (either required or optional), documents containing more matching terms with payloads always score higher than documents with fewer terms with payloads.
Currently, if one of the terms has a high IDF weight and contains a boosting payload but no payloads on other matching terms, it may score higher than docs with other matching terms with payloads and lower IDF. I think what I need is a way to increase the weight of a matching term in BoostingSpanScorer.score() if 'payloadsSeen > 0', but I don't see how to do this. Any suggestions? Thanks, Peter
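The effect Peter describes - the same term having a much higher IDF in the sparse 'boost' field than in the broad 'contents' field - is just the standard idf formula applied to different document frequencies. A small sketch using Lucene's classic idf, log(numDocs / (docFreq + 1)) + 1, with hypothetical counts:

```java
// Classic idf: terms rare in a field get large weights. The same term is
// rarer in the small 'boost' field than in 'contents', so computing idf
// against 'boost' restores the boost-field weighting. Counts are made up.
public class IdfSketch {
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;
        double contentsIdf = idf(200_000, numDocs); // term common in 'contents'
        double boostIdf = idf(5_000, numDocs);      // same term rare in 'boost'
        System.out.println(boostIdf > contentsIdf); // true
    }
}
```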
Re: BoostingTermQuery scoring
I've discovered another flaw in using this technique: (+contents:petroleum +contents:engineer +contents:refinery) (+boost:petroleum +boost:engineer +boost:refinery) It's possible that the first clause will produce a matching doc and none of the terms in the second clause are used to score that doc. Yet another reason to use BoostingTermQuery. Peter On Thu, Nov 6, 2008 at 1:08 PM, Peter Keegan [EMAIL PROTECTED] wrote: Let me give some background on the problem behind my question. Our index contains many fields (title, body, date, city, etc). Most queries search all fields, but for best performance, we create an additional 'contents' field that contains all terms from all fields so that only one field needs to be searched. Some fields, like title and city, are boosted by a factor of 5. In order to make term boosting work, we create an additional field 'boost' that contains all the terms from the boosted fields (title, city). Then, at search time, a query for petroleum engineer gets rewritten to: (+contents:petroleum +contents:engineer) (+boost:petroleum +boost:engineer). Note that the two clauses are OR'd so that a term that exists in both fields will get a higher weight in the 'boost' field. This works quite well at boosting documents with terms that exist in the boosted fields. However, it doesn't work properly if excluded terms are added, for example: (+contents:petroleum +contents:engineer -contents:drilling) (+boost:petroleum +boost:engineer -boost:drilling) If a document contains the term 'drilling' in the 'body' field, but not in the 'title' or 'city' field, a false hit occurs. Enter payloads and 'BoostingTermQuery'. At indexing time, as terms are added to the 'contents' field, they are assigned a payload (value=5) if the term also exists in one of the boosted fields. The 'scorePayload' method in our Similarity class returns the payload value as a score. 
The query no longer contains the 'boost' fields and is simply: +contents:petroleum +contents:engineer -contents:drilling The goal is to make the payload technique behave similarly to the 'boost' field technique. The problem is that the relevance scores of the top hits are sometimes quite different. The reason is that the IDF values for a given term in the 'boost' field are often much higher than for the same term in the 'contents' field. This makes sense because the 'boost' field contains a fairly small subset of the 'contents' field. Even with a payload of '5', a low IDF in the 'contents' field usually erases the effect of the payload. I have found a fairly simple (albeit inelegant) solution that seems to work. The 'boost' field is still created as before, but it is only used to compute IDF values for the weight class 'BoostingTermQuery.BoostingTermWeight'. I had to make this class 'public' so that I could override the IDF value as follows:

public class MNSBoostingTermQuery extends BoostingTermQuery {

  public MNSBoostingTermQuery(Term term) {
    super(term);
  }

  protected class MNSBoostingTermWeight extends BoostingTermQuery.BoostingTermWeight {

    public MNSBoostingTermWeight(BoostingTermQuery query, Searcher searcher) throws IOException {
      super(query, searcher);
      java.util.HashSet<Term> newTerms = new java.util.HashSet<Term>();
      // Recompute IDF based on 'boost' field
      Iterator i = terms.iterator();
      Term term = null;
      while (i.hasNext()) {
        term = (Term) i.next();
        newTerms.add(new Term("boost", term.text()));
      }
      this.idf = this.query.getSimilarity(searcher).idf(newTerms, searcher);
    }
  }
}

Any thoughts about a better implementation are welcome. Peter On Thu, Nov 6, 2008 at 8:00 AM, Grant Ingersoll [EMAIL PROTECTED] wrote: Not sure, but it sounds like you are interested in a higher level Query, kind of like the BooleanQuery, but then part of it sounds like it is per document, right? Is it that you want to deal with multiple payloads in a document, or multiple BTQs in a bigger query?
On Nov 4, 2008, at 9:42 AM, Peter Keegan wrote: I'm using BoostingTermQuery to boost the score of documents with terms containing payloads (boost value > 1). I'd like to change the scoring behavior such that if a query contains multiple BoostingTermQuery terms (either required or optional), documents containing more matching terms with payloads always score higher than documents with fewer terms with payloads. Currently, if one of the terms has a high IDF weight and contains a boosting payload but no payloads on other matching terms, it may score higher than docs with other matching terms with payloads and lower IDF. I think what I need is a way to increase the weight of a matching term in BoostingSpanScorer.score() if 'payloadsSeen > 0', but I don't see how to do this. Any suggestions? Thanks, Peter
Re: Boosting results
If you sort first by score, keep in mind that the raw scores are very precise and you could see many unique values in the result set. The secondary sort field would only be used to break equal scores. We had to use a custom comparator to 'smooth out' the scores to allow the second field to take effect. Peter On Fri, Nov 7, 2008 at 11:17 AM, Scott Smith [EMAIL PROTECTED] wrote: Well, it's not like sorting hadn't occurred to me. Unfortunately, what I recalled was that you could only sort results on one field (I do date-sorted searches all the time in my application). I should have gone back and looked. My memory failed me as I can see that you can sort on multiple fields and score (aka relevancy) is one of the pseudo fields. That'll work. Thanks. Scott -Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Friday, November 07, 2008 5:59 AM To: java-user@lucene.apache.org Subject: Re: Boosting results Duh, sorting. I absolutely love it when I overlook the obvious <G>. [EMAIL PROTECTED] On Fri, Nov 7, 2008 at 4:58 AM, Michael McCandless [EMAIL PROTECTED] wrote: Couldn't you just do a single Query that sorts first by category and second by relevance? Mike Erick Erickson wrote: It seems to me that the easiest thing would be to fire two queries and then just concatenate the results: category:A AND body:fred, category:B AND body:fred If you really, really didn't want to fire two queries, you could create filters on category A and category B and make a couple of passes through your results seeing if the returned documents were in the filter, but you'd still concatenate the results. Actually in your specific example you could make one filter on A. You could also consider a custom scorer that added 1,000,000 to every category A document. How much were you boosting by? What happens if you boost by a very large factor? As in ridiculously large?
Best Erick On Thu, Nov 6, 2008 at 7:42 PM, Scott Smith [EMAIL PROTECTED] wrote: I'm interested in comments on the following problem. I have a set of documents. They fall into 3 categories. Call these categories A, B, and C. Each document has an indexed, non-tokenized field called category which contains A, B, or C (they are mutually exclusive categories). All of the documents contain a field called body which contains a bunch of text. This field is indexed and tokenized. So, I want to do a search which looks something like: (category:A OR category:B) AND body:fred I want all of the category A documents to come before the category B documents. Effectively, I want to have the category A documents first (sorted by relevancy) and then the category B documents after (sorted by relevancy). I thought I could do this by boosting the category portion of the query, but that doesn't seem to work consistently. I was setting the boost on the category A term to 1.0 and the boost on the category B term to 0.0. Any thoughts how to skin this? Scott - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
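The 'smooth out' trick Peter mentions can be sketched without Lucene's API: quantize the raw score into coarse buckets so ties become possible, then break ties on the secondary field. The class names and the 0.01 bucket width below are illustrative assumptions, not from any Lucene class.

```java
import java.util.*;

// Illustrative sketch of "smoothing" scores so a secondary sort field can
// take effect: raw float scores are quantized into coarse buckets before
// comparison, and only bucket ties fall through to the secondary key.
public class SmoothedSort {
    static final float BUCKET = 0.01f;  // scores within ~0.01 compare equal

    public static class Hit {
        public float score;
        public long date;   // secondary sort key (e.g. a date field)
        public Hit(float score, long date) { this.score = score; this.date = date; }
    }

    public static int compare(Hit a, Hit b) {
        int sa = Math.round(a.score / BUCKET);   // quantized score
        int sb = Math.round(b.score / BUCKET);
        if (sa != sb) return sb - sa;            // higher score first
        return Long.compare(b.date, a.date);     // newer date breaks ties
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>(Arrays.asList(
            new Hit(1.2345f, 100L), new Hit(1.2348f, 200L), new Hit(0.9f, 300L)));
        hits.sort(SmoothedSort::compare);
        // The two ~1.23 scores land in the same bucket, so the newer doc wins.
        System.out.println(hits.get(0).date);  // 200
    }
}
```

Without the quantization step, raw float scores are almost never exactly equal, so a secondary SortField rarely fires — which is the behavior Peter describes.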
Re: BoostingTermQuery scoring
boost:(+petroleum +engineer +refinery) (+contents:(+petroleum +engineer +refinery) +((*:* -boost:petroleum) (*:* -boost:engineer) (*:* -boost:refinery))) That's an interesting solution. Would this result in many more documents being visited by the scorer, possibly impacting performance? (I haven't tried it yet). Thanks, Peter On Thu, Nov 6, 2008 at 6:56 PM, Steven A Rowe [EMAIL PROTECTED] wrote: Hi Peter, On 11/06/2008 at 4:25 PM, Peter Keegan wrote: I've discovered another flaw in using this technique: (+contents:petroleum +contents:engineer +contents:refinery) (+boost:petroleum +boost:engineer +boost:refinery) It's possible that the first clause will produce a matching doc and none of the terms in the second clause are used to score that doc. Yet another reason to use BoostingTermQuery. I think you could address this, without BTQ, using something like: boost:(+petroleum +engineer +refinery) (+contents:(+petroleum +engineer +refinery) +((*:* -boost:petroleum) (*:* -boost:engineer) (*:* -boost:refinery))) The last three lines give you the set of documents that are missing at least one of the terms in the boost field. The *:* thingy, indicating a MatchAllDocsQuery, is necessary to get all documents that don't have a given term; Lucene's (sub-)query document exclusion operation needs a non-empty set on which to operate. On 11/06/2008 at 1:08 PM, Peter Keegan wrote: Then, at search time, a query for petroleum engineer gets rewritten to: (+contents:petroleum +contents:engineer) (+boost:petroleum +boost:engineer). Note that the two clauses are OR'd so that a term that exists in both fields will get a higher weight in the 'boost' field. This works quite well at boosting documents with terms that exist in the boosted fields. 
However, it doesn't work properly if excluded terms are added, for example: (+contents:petroleum +contents:engineer -contents:drilling) (+boost:petroleum +boost:engineer -boost:drilling) If a document contains the term 'drilling' in the 'body' field, but not in the 'title' or 'city' field, a false hit occurs. I think you could address this problem like this: +(boost:(+petroleum +engineer) (+contents:(+petroleum +engineer) +((*:* -boost:petroleum) (*:* -boost:engineer)))) -contents:drilling You don't have to include -boost:drilling, because this condition is entailed by -contents:drilling. Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Payloads
Hi Karl, I use payloads for weight only, too, with BoostingTermQuery (see: http://www.nabble.com/BoostingTermQuery-scoring-td20323615.html#a20323615) A custom tokenizer looks for the reserved character '\b' followed by a 2 byte 'boost' value. It then creates a special Token type for a custom filter which sets the payload on the token. Another reserved character '\t' is used to set a position increment value. Since we often boost multiple tokens in the stream, the payload 'boost' value is reapplied to subsequent tokens until a 'boost' value of '0' is encountered, which disables payloads. This is a bit messy and I agree that it would be nice to come up with a nice API for this. Peter On Fri, Dec 26, 2008 at 8:22 PM, Karl Wettin karl.wet...@gmail.com wrote: I would very much like to hear how people use payloads. Personally I use them for weight only. And I use them a lot, almost in all applications. I factor the weight of synonyms, stems, dediacritization and what not. I create huge indices that contain lots of tokens at the same position but with different weights. I might for instance create the stream (1)motörhead^1, (0)motorhead^0.7 and I'll do this at both index and query time, i.e. I use the payload weight to calculate both the payload weight used by the BoostingTermQuery scorer AND to set the boost in the query at the same time. 
In order to handle this I use an interface that looks something like this: public interface PayloadWeightHandler { public void setWeight(Token token, float weight); public float getWeight(Token token); } In order to use this I had to patch pretty much any filter I use and pass down a weight factor, something like: TokenStream ts = analyzer.tokenStream(f, new StringReader("motörhead ace of spades")); ts = new SynonymTokenFilter(ts, synonyms, 0.7f); ts = new StemmerFilter(ts, 0.7f); ts = new ASCIIFoldingFilter(ts, 0.5f); All these filters would, if applicable, create new synonym tokens with slightly less weight than the input rather than replace token content: (1)motörhead^1, (0)motorhead^0.5, (1)ace^1, (1)of^1, (1)spades^1, (1)spad^0.7 I usually use 4 byte floats while creating the stream and then convert them to 8 bit floats in a final filter before adding it to the document. Is anyone else doing something similar? It would be nice to normalize this and perhaps come up with a reusable API for this. It would also be cool if all the existing filters could be rewritten to handle this stuff. I find it to be extremely useful when creating indices with rather niched content such as song titles, names of people, street addresses, etc. For the last year or so I've done several (3) commercial implementations where I try to extend the index with incorrectly typed queries but unique enough that it does not interfere with the quality of the results. It has been very successful, people get great responses in great time even though they enter an incorrect query. On a side note, in these implementations I've completely replaced phrase queries using shingles. ShingleMatrixQuery has some built-in goodies for calculating weight. Combined with SSD I see awesome results with very short response time even in fairly large indices (10M-100M documents). I'm talking about 100ms-500ms for rather complex queries under heavy load. 
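Karl's 4-byte-to-8-bit conversion can be done with the trick Lucene itself uses for norm encoding: keep a few mantissa bits plus a small exponent in one byte. The sketch below is modeled on Lucene's SmallFloat.floatToByte315 shape as I recall it — treat the exact shift constants as an illustrative re-derivation, not the library source.

```java
// Lossy 8-bit float encoding in the style of Lucene's SmallFloat: enough
// resolution for per-token weights while fitting the payload into a single
// byte. Values round down to the nearest representable step (e.g. 0.7 -> 0.625).
public class ByteFloat {
    public static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);               // drop low mantissa bits
        if (small <= ((63 - 15) << 3))              // underflow, zero, or negative
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        if (small >= ((63 - 15) << 3) + 0x100)      // overflow: clamp to max
            return -1;
        return (byte) (small - ((63 - 15) << 3));   // re-bias into one byte
    }

    public static float decode(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;                    // restore exponent bias
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(1.0f)));   // 1.0 (exactly representable)
        System.out.println(decode(encode(0.7f)));   // 0.625 (nearest lower step)
    }
}
```

The quantization error (0.7 stored as 0.625) is the price of the single-byte payload; for relative synonym/stem weighting it is usually acceptable, which is presumably why Karl's final filter does exactly this kind of narrowing.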
karl - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
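Peter's reserved-character scheme earlier in this thread (a '\b' marker carries a boost that sticks until a '\b0' marker disables it) can be modeled without Lucene's TokenStream API. The marker convention follows his description; the class and method names are invented for this sketch, and the real filter would store the boost as a per-token payload rather than a suffix.

```java
import java.util.*;

// Simplified model of the tokenizer/filter pair described above: a token
// beginning with the reserved '\b' character sets a boost value that is
// reapplied to every following token until a '\b0' marker disables it.
public class BoostMarkers {
    public static List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int boost = 0;                       // 0 means "no payload"
        for (String t : tokens) {
            if (t.startsWith("\b")) {        // marker token: set the sticky boost
                boost = Integer.parseInt(t.substring(1));
            } else {
                out.add(t + "^" + boost);    // tag token with the current boost
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("\b5", "sales", "rep", "\b0", "plain");
        System.out.println(apply(in));  // [sales^5, rep^5, plain^0]
    }
}
```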
queryNorm effect on score
The explanation of scores from the same document returned from 2 similar queries differ in an unexpected way. There are 2 fields involved, 'contents' and 'literals'. The 'literals' field has setBoost = 0. As you can see from the explanations below, the total weight of the matching terms from the 'literals' field is 0. However, the weights produced by the matching terms in the 'contents' field are very different, even with the same matching terms. The reason is that the 'queryNorm' value is very different because the 'sumOfSquaredWeights' is very different. Why is this?

First query: +(+contents:sales +contents:representative) +literals:jb$1

Explanation:
32.274593 sum of:
  32.274593 sum of:
    10.336284 weight(contents:sales in 14578), product of:
      0.54963183 queryWeight(contents:sales), product of:
        2.6595461 idf(contents: sales=83179)
        0.20666377 queryNorm
      18.805832 fieldWeight(contents:sales in 14578), product of:
        7.071068 btq, product of:
          1.4142135 tf(phraseFreq=3.0)
          5.0 scorePayload(...)
        2.6595461 idf(contents: sales=83179)
        1.0 fieldNorm(field=contents, doc=14578)
    21.93831 weight(contents:representative in 14578), product of:
      0.8007395 queryWeight(contents:representative), product of:
        3.8746004 idf(contents: representative=24678)
        0.20666377 queryNorm
      27.397562 fieldWeight(contents:representative in 14578), product of:
        7.071068 btq, product of:
          1.4142135 tf(phraseFreq=2.0)
          5.0 scorePayload(...)
        3.8746004 idf(contents: representative=24678)
        1.0 fieldNorm(field=contents, doc=14578)
  0.0 weight(literals:jb$1 in 14578), product of:
    0.23816177 queryWeight(literals:jb$1), product of:
      1.1524118 idf(docFreq=375455, numDocs=436917)
      0.20666377 queryNorm
    0.0 fieldWeight(literals:jb$1 in 14578), product of:
      1.0 tf(termFreq(literals:jb$1)=1)
      1.1524118 idf(docFreq=375455, numDocs=436917)
      0.0 fieldNorm(field=literals, doc=14578)

Second query: +(+contents:sales +contents:representative) +(literals:jb$1 literals:jb$)

Explanation:
10.550879 sum of:
  10.550879 sum of:
    3.3790317 weight(contents:sales in 14578), product of:
      0.17967999 queryWeight(contents:sales), product of:
        2.6595461 idf(contents: sales=83179)
        0.0675604 queryNorm
      18.805832 fieldWeight(contents:sales in 14578), product of:
        7.071068 btq, product of:
          1.4142135 tf(phraseFreq=3.0)
          5.0 scorePayload(...)
        2.6595461 idf(contents: sales=83179)
        1.0 fieldNorm(field=contents, doc=14578)
    7.171847 weight(contents:representative in 14578), product of:
      0.26176953 queryWeight(contents:representative), product of:
        3.8746004 idf(contents: representative=24678)
        0.0675604 queryNorm
      27.397562 fieldWeight(contents:representative in 14578), product of:
        7.071068 btq, product of:
          1.4142135 tf(phraseFreq=2.0)
          5.0 scorePayload(...)
        3.8746004 idf(contents: representative=24678)
        1.0 fieldNorm(field=contents, doc=14578)
  0.0 product of:
    0.0 sum of:
      0.0 weight(literals:jb$1 in 14578), product of:
        0.0778574 queryWeight(literals:jb$1), product of:
          1.1524118 idf(docFreq=375455, numDocs=436917)
          0.0675604 queryNorm
        0.0 fieldWeight(literals:jb$1 in 14578), product of:
          1.0 tf(termFreq(literals:jb$1)=1)
          1.1524118 idf(docFreq=375455, numDocs=436917)
          0.0 fieldNorm(field=literals, doc=14578)
    0.5 coord(1/2)

Peter
Re: queryNorm effect on score
Any comments about this? Is this just the way queryNorm works or is this a bug? Thanks, Peter On Fri, Feb 20, 2009 at 4:03 PM, Peter Keegan peterlkee...@gmail.comwrote: The explanation of scores from the same document returned from 2 similar queries differ in an unexpected way. There are 2 fields involved, 'contents' and 'literals'. The 'literals' field has setBoost = 0. As you an see from the explanations below, the total weight of the matching terms from the 'literal' field is 0. However, the weights produced by the matching terms in the 'contents' field is very different, even with the same matching terms. The reason is that the 'queryNorm' value is very different because the 'sumOfSquaredWeights' is very different. Why is this? First query: +(+contents:sales +contents:representative) +literals:jb$1 Explanation: 32.274593 sum of: 32.274593 sum of: 10.336284 weight(contents:sales in 14578), product of: 0.54963183 queryWeight(contents:sales), product of: 2.6595461 idf(contents: sales=83179) 0.20666377 queryNorm 18.805832 fieldWeight(contents:sales in 14578), product of: 7.071068 btq, product of: 1.4142135 tf(phraseFreq=3.0) 5.0 scorePayload(...) 2.6595461 idf(contents: sales=83179) 1.0 fieldNorm(field=contents, doc=14578) 21.93831 weight(contents:representative in 14578), product of: 0.8007395 queryWeight(contents:representative), product of: 3.8746004 idf(contents: representative=24678) 0.20666377 queryNorm 27.397562 fieldWeight(contents:representative in 14578), product of: 7.071068 btq, product of: 1.4142135 tf(phraseFreq=2.0) 5.0 scorePayload(...) 
3.8746004 idf(contents: representative=24678) 1.0 fieldNorm(field=contents, doc=14578) 0.0 weight(literals:jb$1 in 14578), product of: 0.23816177 queryWeight(literals:jb$1), product of: 1.1524118 idf(docFreq=375455, numDocs=436917) 0.20666377 queryNorm 0.0 fieldWeight(literals:jb$1 in 14578), product of: 1.0 tf(termFreq(literals:jb$1)=1) 1.1524118 idf(docFreq=375455, numDocs=436917) 0.0 fieldNorm(field=literals, doc=14578) Second query: +(+contents:sales +contents:representative) +(literals:jb$1 literals:jb$) Explanation: 10.550879 sum of: 10.550879 sum of: 3.3790317 weight(contents:sales in 14578), product of: 0.17967999 queryWeight(contents:sales), product of: 2.6595461 idf(contents: sales=83179) 0.0675604 queryNorm 18.805832 fieldWeight(contents:sales in 14578), product of: 7.071068 btq, product of: 1.4142135 tf(phraseFreq=3.0) 5.0 scorePayload(...) 2.6595461 idf(contents: sales=83179) 1.0 fieldNorm(field=contents, doc=14578) 7.171847 weight(contents:representative in 14578), product of: 0.26176953 queryWeight(contents:representative), product of: 3.8746004 idf(contents: representative=24678) 0.0675604 queryNorm 27.397562 fieldWeight(contents:representative in 14578), product of: 7.071068 btq, product of: 1.4142135 tf(phraseFreq=2.0) 5.0 scorePayload(...) 3.8746004 idf(contents: representative=24678) 1.0 fieldNorm(field=contents, doc=14578) 0.0 product of: 0.0 sum of: 0.0 weight(literals:jb$1 in 14578), product of: 0.0778574 queryWeight(literals:jb$1), product of: 1.1524118 idf(docFreq=375455, numDocs=436917) 0.0675604 queryNorm 0.0 fieldWeight(literals:jb$1 in 14578), product of: 1.0 tf(termFreq(literals:jb$1)=1) 1.1524118 idf(docFreq=375455, numDocs=436917) 0.0 fieldNorm(field=literals, doc=14578) 0.5 coord(1/2) Peter
Re: queryNorm effect on score
Got it. This is another example of why scores can't be compared between (even similar) queries. (we don't) Thanks. On Fri, Feb 27, 2009 at 11:39 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Fri, Feb 27, 2009 at 9:15 AM, Peter Keegan peterlkee...@gmail.com wrote: Any comments about this? Is this just the way queryNorm works or is this a bug? That's just the way it works... since it's applied to all clauses, it really just changes the range of scores returned, not the relative ordering of documents or anything. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: queryNorm effect on score
in situations where you deal with simple query types, and matching query structures, the queryNorm *can* be used to make scores semi-comparable. Hmm. My example used matching query structures. The only difference was a single term in a field with zero weight that didn't exist in the matching document. But one score was 3X the other. Peter On Sat, Feb 28, 2009 at 12:35 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I guess I don't really understand this comment in the similarity java doc : then: : : http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm : : *queryNorm(q) * is a normalizing factor used to make scores between queries : comparable. that comment should probably be removed ... in situations where you deal with simple query types, and matching query structures, the queryNorm *can* be used to make scores semi-comparable. To be 100% correct about what the queryNorm does in all cases: it normalizes each of the constituent values that are used in the score computation relative to the other constituent values. the main value I've seen from it is that it prevents a loss of floating point accuracy that can result from addition/multiplication of large values. -Hoss - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
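The 3X difference falls directly out of the formula: queryNorm = 1/sqrt(sumOfSquaredWeights), where every query clause contributes (idf * boost)^2 whether or not it matches the document. Plugging in the idf values printed in the explanations earlier in the thread reproduces the first query's 0.20666377; the idf of the extra literals:jb$ clause is not printed, so the ~14 used below is back-solved from the second queryNorm and should be read as an inference, not a quoted value.

```java
// Reproducing the queryNorm values from the score explanations above:
// queryNorm = 1 / sqrt(sumOfSquaredWeights), where each query clause
// contributes (idf * boost)^2 regardless of whether it matches the doc.
public class QueryNormDemo {
    public static double queryNorm(double... idfs) {
        double sumSq = 0;
        for (double idf : idfs) sumSq += idf * idf;  // query boost assumed 1.0
        return 1.0 / Math.sqrt(sumSq);
    }

    public static void main(String[] args) {
        // First query: sales, representative, jb$1 (idfs from the explanation)
        double qn1 = queryNorm(2.6595461, 3.8746004, 1.1524118);
        System.out.println(qn1);  // ~0.20666, matching the explanation output

        // Second query adds the rare term jb$; back-solving its idf from the
        // printed queryNorm of 0.0675604 gives roughly 14 (an inference).
        double qn2 = queryNorm(2.6595461, 3.8746004, 1.1524118, 13.99);
        System.out.println(qn2);  // ~0.0676: every clause weight shrinks ~3x
    }
}
```

This makes Hoss's point concrete: one extra high-idf optional clause rescales every clause's weight, so scores from the two queries differ by 3X even though the document's matching terms are identical.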
Re: queryNorm effect on score
As suggested, I added a query-time boost of 0.0f to the 'literals' field (with the index-time boost still there) and I did get the same scores for both queries :) (there is a subtlety between index-time and query-time boosting that I missed.) I also tried disabling the coord factor, but that had no effect on the score, when combined with the above. This seems ok in this example since the matching terms had boost = 0. Thanks Yonik, Peter On Sat, Feb 28, 2009 at 6:02 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Sat, Feb 28, 2009 at 3:02 PM, Peter Keegan peterlkee...@gmail.com wrote: in situations where you deal with simple query types, and matching query structures, the queryNorm *can* be used to make scores semi-comparable. Hmm. My example used matching query structures. The only difference was a single term in a field with zero weight that didn't exist in the matching document. But one score was 3X the other. But the zero boost was an index-time boost, and the queryNorm takes into account query-time boosts and idfs. You might get closer to what you expect with a query-time boost of 0.0f The other thing affecting the score is the coord factor - the fact that fewer of the optional terms matched (1/2) lowers the score. The coordination factor can be disabled on any BooleanQuery. If you do both of the above, I *think* you would get the same scores for this specific example. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
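The coord factor Yonik mentions is the "0.5 coord(1/2)" line at the bottom of the second explanation: DefaultSimilarity computes it as overlap divided by maxOverlap, i.e. the fraction of optional clauses that matched.

```java
// The coord(overlap, maxOverlap) factor seen as "0.5 coord(1/2)" in the
// second explanation: DefaultSimilarity computes overlap / maxOverlap.
public class CoordDemo {
    public static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // Only jb$1 of the two optional literals terms matched the document:
        System.out.println(coord(1, 2));  // 0.5
    }
}
```

Disabling coord (BooleanQuery supports this) removes that 0.5 multiplier, which is why it is the second half of the fix alongside the query-time boost of 0.0f.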
Re: queryNorm effect on score
If I set the boost=0 at query time and the query contains only terms with boost=0, the scores are NaN (because weight.queryNorm = 1/0 = infinity), instead of 0. Peter On Sun, Mar 1, 2009 at 9:27 PM, Erick Erickson erickerick...@gmail.com wrote: FWIW, Hossman pointed out that the difference between index and query time boosts is that index time boosts on title, for instance, express I care about this document's title more than other documents' titles [when it matches] Query time boosts express I care about matches on the title field more than matches on other fields. Best Erick On Sun, Mar 1, 2009 at 8:57 PM, Peter Keegan peterlkee...@gmail.com wrote: As suggested, I added a query-time boost of 0.0f to the 'literals' field (with the index-time boost still there) and I did get the same scores for both queries :) (there is a subtlety between index-time and query-time boosting that I missed.) I also tried disabling the coord factor, but that had no effect on the score, when combined with the above. This seems ok in this example since the matching terms had boost = 0. Thanks Yonik, Peter On Sat, Feb 28, 2009 at 6:02 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Sat, Feb 28, 2009 at 3:02 PM, Peter Keegan peterlkee...@gmail.com wrote: in situations where you deal with simple query types, and matching query structures, the queryNorm *can* be used to make scores semi-comparable. Hmm. My example used matching query structures. The only difference was a single term in a field with zero weight that didn't exist in the matching document. But one score was 3X the other. But the zero boost was an index-time boost, and the queryNorm takes into account query-time boosts and idfs. You might get closer to what you expect with a query-time boost of 0.0f The other thing affecting the score is the coord factor - the fact that fewer of the optional terms matched (1/2) lowers the score. The coordination factor can be disabled on any BooleanQuery. 
If you do both of the above, I *think* you would get the same scores for this specific example. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
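The NaN Peter reports falls straight out of IEEE float arithmetic: with every query-time boost at 0, sumOfSquaredWeights is 0, queryNorm becomes 1/sqrt(0) = Infinity, and each clause weight of 0 * Infinity is NaN. A minimal reproduction:

```java
// Why an all-zero-boost query scores NaN: queryNorm divides by
// sqrt(sumOfSquaredWeights), which is 0 when every boost is 0.
public class NanDemo {
    public static void main(String[] args) {
        float sumOfSquaredWeights = 0.0f;           // all boosts are 0
        float queryNorm = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
        System.out.println(queryNorm);              // Infinity
        float weight = 0.0f * queryNorm;            // 0 * Infinity
        System.out.println(weight);                 // NaN
    }
}
```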
sloppyFreq question
The DefaultSimilarity class defines sloppyFreq as: public float sloppyFreq(int distance) { return 1.0f / (distance + 1); } For a 'SpanNearQuery', this reduces the effect of the term frequency on the score as the number of terms in the span increases. So, for a simple phrase query (using spans), the longer the phrase, the lower the TF. For a simple SpanTermQuery, the TF is reduced in half (1.0f / (1 + 1)). I'm just wondering why this is the default behavior. For 'SpanTermQuery', I'd expect the TF to reflect the actual number of occurrences of the term. For a SpanNearQuery, wouldn't it still be the number of occurrences of the whole span, not the number of terms in the span? Thanks, Peter
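For reference, here is the formula from the question made executable, with the values it yields — including the distance=1 case that halves the contribution of a single-term span match, which is the behavior Peter is asking about.

```java
// DefaultSimilarity's sloppyFreq and the values it yields: every extra
// position of "distance" in a span match shrinks its contribution.
public class SloppyFreqDemo {
    public static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        System.out.println(sloppyFreq(0));  // 1.0  - tightest possible match
        System.out.println(sloppyFreq(1));  // 0.5  - the halving Peter observes
        System.out.println(sloppyFreq(3));  // 0.25 - longer spans count less
    }
}
```

A custom Similarity subclass overriding sloppyFreq (for example, returning 1.0f when distance is small) would restore raw-TF behavior for exact span matches while still down-weighting genuinely sloppy ones.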