Re: docMap array in SegmentMergeInfo

2005-10-11 Thread Peter Keegan
On a multi-cpu system, this loop to build the docMap array can cause severe
thread thrashing because of the synchronized method 'isDeleted'. I have
observed this on an index with over 1 million documents (which contains a
few thousand deleted docs) when multiple threads perform a search with
either a sort field or a range query. A stack dump shows all threads here:

waiting for monitor entry [0x6d2cf000..0x6d2cfd6c] at
org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:241) -
waiting to lock 0x04e40278
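
For reference, the loop in question builds the docMap roughly like this (a
simplified sketch of the 1.4/1.9-era SegmentMergeInfo constructor, not the
exact source); every pass through it calls the synchronized isDeleted() on the
shared SegmentReader:

  // map old doc numbers to new doc numbers, skipping deleted docs
  if (reader.hasDeletions()) {
    int maxDoc = reader.maxDoc();
    docMap = new int[maxDoc];
    int j = 0;
    for (int i = 0; i < maxDoc; i++) {
      if (reader.isDeleted(i))      // synchronized call, once per document
        docMap[i] = -1;
      else
        docMap[i] = j++;
    }
  }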

The performance worsens as the number of threads increases. The searches
may take minutes to complete.
If only a single thread issues the search, it completes fairly quickly. I
also noticed from looking at the code that the docMap doesn't appear to be
used in these cases. It seems only to be used for merging segments. If the
index is in 'search/read-only' mode, is there a way around this bottleneck?

Thanks,
Peter




On 7/13/05, Doug Cutting [EMAIL PROTECTED] wrote:

 Lokesh Bajaj wrote:
  For a very large index where we might want to delete/replace some
 documents, this would require a lot of memory (for 100 million documents,
 this would need 381 MB of memory). Is there any reason why this was
 implemented this way?

 In practice this has not been an issue. A single index with 100M
 documents is usually quite slow to search. When collections get this
 big folks tend to instead search multiple indexes in parallel in order
 to keep response times acceptable. Also, 381Mb of RAM is often not a
 problem for folks with 100M documents. But this is not to say that it
 could never be a problem. For folks with limited RAM and/or lots of
 small documents it could indeed be an issue.

  It seems like this could be implemented as a much smaller array that
 only keeps track of the deleted document numbers and it would still be very
 efficient to calculate the new document number by using this much smaller
 array. Has this been done by anyone else or been considered for change in
 the Lucene code?
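
 One way to realize that suggestion, sketched here for illustration only (this
 is not the patch that was eventually written): keep the deleted doc numbers
 in a sorted int[] and derive the remapped number with a binary search.

   // deletedDocs holds the sorted doc numbers of all deleted documents
   int remap(int oldDoc) {                // assumes oldDoc itself is not deleted
     int pos = java.util.Arrays.binarySearch(deletedDocs, oldDoc);
     int deletedBefore = (pos >= 0) ? pos : -pos - 1;
     return oldDoc - deletedBefore;       // shift down by deletions below oldDoc
   }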

 Please submit a patch to the java-dev list.

 Doug

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: docMap array in SegmentMergeInfo

2005-10-12 Thread Peter Keegan
Here is one stack trace:

Full thread dump Java HotSpot(TM) Client VM (1.5.0_03-b07 mixed mode):

Thread-6 prio=5 tid=0x6cf7a7f0 nid=0x59e50 waiting for monitor entry
[0x6d2cf000..0x6d2cfd6c]
at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:241)
- waiting to lock 0x04e40278 (a org.apache.lucene.index.SegmentReader)
at org.apache.lucene.index.SegmentMergeInfo.init(SegmentMergeInfo.java:43)
at org.apache.lucene.index.MultiTermEnum.init(MultiReader.java:277)
at org.apache.lucene.index.MultiReader.terms(MultiReader.java:186)
at org.apache.lucene.search.RangeQuery.rewrite(RangeQuery.java:75)
at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:243)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:166)
at org.apache.lucene.search.Query.weight(Query.java:84)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:158)
at org.apache.lucene.search.Searcher.search(Searcher.java:67)
at org.apache.lucene.search.QueryFilter.bits(QueryFilter.java:62)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:121)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
at org.apache.lucene.search.Hits.init(Hits.java:51)
at org.apache.lucene.search.Searcher.search(Searcher.java:49)

I've also seen it happen during sorting from:

FieldSortedHitQueue.comparatorAuto ->
FieldCacheImpl.getAuto() ->
MultiReader.terms() ->
MultiTermEnum.init() ->
SegmentMergeInfo.init() ->
SegmentReader.isDeleted()

Peter

On 10/11/05, Yonik Seeley [EMAIL PROTECTED] wrote:

  We've been using this in production for a while and it fixed the
  extremely slow searches when there are deleted documents.

 Who was the caller of isDeleted()? There may be an opportunity for an easy
 optimization to grab the BitVector and reuse it instead of repeatedly
 calling isDeleted() on the IndexReader.

 -Yonik
 Now hiring -- http://tinyurl.com/7m67g
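
The optimization Yonik suggests amounts to taking the deletions snapshot once
instead of locking per document. A minimal sketch, assuming a one-time
synchronized copy of the deletions is made available to the caller (the
getDeletionsSnapshot() accessor below is illustrative, not an existing Lucene
method):

  BitVector deleted = reader.getDeletionsSnapshot();  // one synchronized copy
  int j = 0;
  for (int i = 0; i < maxDoc; i++) {
    if (deleted != null && deleted.get(i))            // unsynchronized bit test
      docMap[i] = -1;
    else
      docMap[i] = j++;
  }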




Re: docMap array in SegmentMergeInfo

2005-10-13 Thread Peter Keegan
Hi Yonik,

Your patch has corrected the thread thrashing problem on multi-cpu systems.
I've tested it with both 1.4.3 and 1.9. I haven't seen 100X performance
gain, but that's because I'm caching QueryFilters and Lucene is caching the
sort fields.

Thanks for the fast response!

btw, I had previously tried Chris's fix (replace synchronized method with
snapshot reference), but I was getting errors trying to fetch stored fields
from the Hits. I didn't chase it down, but the errors went away when I
reverted that specific patch.

Peter


On 10/12/05, Yonik Seeley [EMAIL PROTECTED] wrote:

 Here's the patch:
 http://issues.apache.org/jira/browse/LUCENE-454

 It resulted in quite a performance boost indeed!

 On 10/12/05, Yonik Seeley [EMAIL PROTECTED] wrote:
 
  Thanks for the trace Peter, and great catch!
  It certainly does look like avoiding the construction of the docMap for
 a
  MultiTermEnum will be a significant optimization.
 
 
 -Yonik
 Now hiring -- http://tinyurl.com/7m67g




Re: Throughput doesn't increase when using more concurrent threads

2006-01-25 Thread Peter Keegan
This is just fyi - in my stress tests on an 8-cpu box (that's 8 real cpus),
the maximum throughput occurred with just 4 query threads. The query
throughput decreased with fewer than 4 or greater than 4 query threads. The
entire index was most likely in the file system cache, too. Periodic
snapshots of stack traces showed most threads blocked in the synchronization
in: FSIndexInput.readInternal(), when the thread count exceeded 4.
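
For reference, the contention corresponds to a pattern like the following in
the FSDirectory input path (a simplified sketch of the 1.9-era FSIndexInput,
not the exact source): all clones of an input stream share one
RandomAccessFile, so every read takes its lock and reseeks:

  protected void readInternal(byte[] b, int offset, int len) throws IOException {
    synchronized (file) {              // 'file' is the shared RandomAccessFile
      file.seek(getFilePointer());     // reposition for this particular clone
      file.readFully(b, offset, len);  // read while holding the lock
    }
  }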

Peter


On 11/22/05, Oren Shir [EMAIL PROTECTED] wrote:

 Hi,

 There are two synchronization points: on the stream and on the reader.
 Using
 different FSDirectories and IndexReaders should solve this. I'll let you
 know
 once I code it. Right now I'm checking if making my Documents store less
 data will move the bottleneck to some other place.

 Thanks again,
 Oren Shir

 On 11/21/05, Doug Cutting [EMAIL PROTECTED] wrote:
 
  Jay Booth wrote:
   I had a similar problem with threading, the problem turned out to be
  that in
   the back end of the FSDirectory class I believe it was, there was a
   synchronized block on the actual RandomAccessFile resource when
 reading
  a
   block of data from it... high-concurrency situations caused threads to
  stack
   up in front of this synchronized block and our CPU time wound up being
  spent
   thrashing between blocked threads instead of doing anything useful.
 
  This is correct. In Lucene, multiple streams per file are created by
  cloning, and all clones of an FSDirectory input stream share a
  RandomAccessFile and must synchronize input from it. MmapDirectory does
  not have this limitation. If your indexes are less than a few GB or you
  are using 64-bit hardware, then MmapDirectory should work well for you.
  Otherwise it would be simple to write an nio-based Directory that does
  not use mmap that is also unsynchronized. Such a contribution would be
  welcome.
 
   Making multiple IndexSearchers and FSDirectories didn't help because
 in
  the
   back end, lucene consults a singleton HashMap of some kind (don't
  remember
   implementation) that maintained a single FSDirectory for any given
 index
   being accessed from the JVM... multiple calls to
  FSDirectory.getDirectory
   actually return the same FSDirectory object with synchronization at
 the
  same
   point.
 
  This does not make sense to me. FSDirectory does keep a cache of
  FSDirectory instances, but i/o should not be synchronized on these. One
  should be able to open multiple input streams on the same file from an
  FSDirectory. But this would not be a great solution, since file handle
  limits would soon become a problem.
 
  Doug
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 




Re: Throughput doesn't increase when using more concurrent threads

2006-01-25 Thread Peter Keegan
It's a 3GHz Intel box with Xeon processors, 64GB ram :)

Peter


On 1/25/06, Yonik Seeley [EMAIL PROTECTED] wrote:

 Thanks Peter, that's useful info.

 Just out of curiosity, what kind of box is this?  what CPUs?

 -Yonik

 On 1/25/06, Peter Keegan [EMAIL PROTECTED] wrote:
  This is just  fyi - in my stress tests on a 8-cpu box (that's 8 real
 cpus),
  the maximum throughput occurred with just 4 query threads. The query
  throughput decreased with fewer than 4 or greater than 4 query threads.
 The
  entire index was most likely in the file system cache, too. Periodic
  snapshots of stack traces showed most threads blocked in the
 synchronization
  in: FSIndexInput.readInternal(), when the thread count exceeded 4.
 
  Peter
 
 
  On 11/22/05, Oren Shir [EMAIL PROTECTED] wrote:
  
   Hi,
  
   There are two sunchronization points: on the stream and on the reader.
   Using
   different FSDirectoriy and IndexReaders should solve this. I'll let
 you
   know
   once I code it. Right now I'm checking if making my Documents store
 less
   data will move the bottleneck to some other place.
  
   Thanks again,
   Oren Shir
  
   On 11/21/05, Doug Cutting [EMAIL PROTECTED] wrote:
   
Jay Booth wrote:
 I had a similar problem with threading, the problem turned out to
 be
that in
 the back end of the FSDirectory class I believe it was, there was
 a
 synchronized block on the actual RandomAccessFile resource when
   reading
a
 block of data from it... high-concurrency situations caused
 threads to
stack
 up in front of this synchronized block and our CPU time wound up
 being
spent
 thrashing between blocked threads instead of doing anything
 useful.
   
This is correct. In Lucene, multiple streams per file are created by
cloning, and all clones of an FSDirectory input stream share a
RandomAccessFile and must synchronize input from it. MmapDirectory
 does
not have this limitation. If your indexes are less than a few GB or
 you
are using 64-bit hardware, then MmapDirectory should work well for
 you.
Otherwise it would be simple to write an nio-based Directory that
 does
not use mmap that is also unsynchronized. Such a contribution would
 be
welcome.
   
 Making multiple IndexSearchers and FSDirectories didn't help
 because
   in
the
 back end, lucene consults a singleton HashMap of some kind (don't
remember
 implementation) that maintained a single FSDirectory for any given
   index
 being accessed from the JVM... multiple calls to
FSDirectory.getDirectory
 actually return the same FSDirectory object with synchronization
 at
   the
same
 point.
   
This does not make sense to me. FSDirectory does keep a cache of
FSDirectory instances, but i/o should not be synchronized on these.
 One
should be able to open multiple input streams on the same file from
 an
FSDirectory. But this would not be a great solution, since file
 handle
limits would soon become a problem.
   
Doug
   
   
 -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-01-25 Thread Peter Keegan
Yes, it's hyperthreaded (16 cpus show up in task manager - the box is
running Windows Server 2003). I plan to turn off hyperthreading to see if it has any
effect.

Peter


On 1/25/06, Yonik Seeley [EMAIL PROTECTED] wrote:

 On 1/25/06, Peter Keegan [EMAIL PROTECTED] wrote:
  It's a 3GHz Intel box with Xeon processors, 64GB ram :)

 Nice!

 Xeon processors are normally hyperthreaded.  On a linux box, if you
 cat /proc/cpuinfo, you will see 8 processors for a 4 physical CPU
 system.  Are you positive you have 8 physical Xeon processors?

 -Yonik

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Paul,

I tried this but it ran out of memory trying to read the 500Mb .fdt file. I
tried various values for MAX_BBUF, but it still ran out of memory (I'm using
-Xmx1600M, which is the jvm's maximum value (v1.5)). I'll give
NioFSDirectory a try.

Thanks,
Peter


On 1/26/06, Paul Elschot [EMAIL PROTECTED] wrote:

 On Wednesday 25 January 2006 20:51, Peter Keegan wrote:
  The index is non-compound format and optimized. Yes, I did try
  MMapDirectory, but the index is too big - 3.5 GB (1.3GB is term vectors)
 
  Peter
 
 You could also give this a try:

 http://issues.apache.org/jira/browse/LUCENE-283

 Regards,
 Paul Elschot

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Ray,

The throughput is worse with NioFSDirectory than with the FSDirectory
(patched and unpatched). The bottleneck still seems to be synchronization,
this time in NioFile.getChannel (7 of the 8 threads were blocked there
during one snapshot).  I tried this with 4 and 8 channels.

The throughput with the patched FSDirectory was about the same as before the
patch.

Thanks,
Peter


On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:

 Speaking of NioFSDirectory, I thought there was one posted a while
 ago, is this something that can be used?
 http://issues.apache.org/jira/browse/LUCENE-414

 ray,

 On 11/22/05, Doug Cutting [EMAIL PROTECTED] wrote:
  Jay Booth wrote:
   I had a similar problem with threading, the problem turned out to be
 that in
   the back end of the FSDirectory class I believe it was, there was a
   synchronized block on the actual RandomAccessFile resource when
 reading a
   block of data from it... high-concurrency situations caused threads to
 stack
   up in front of this synchronized block and our CPU time wound up being
 spent
   thrashing between blocked threads instead of doing anything useful.
 
  This is correct.  In Lucene, multiple streams per file are created by
  cloning, and all clones of an FSDirectory input stream share a
  RandomAccessFile and must synchronize input from it.  MmapDirectory does
  not have this limitation.  If your indexes are less than a few GB or you
  are using 64-bit hardware, then MmapDirectory should work well for you.
Otherwise it would be simple to write an nio-based Directory that does
  not use mmap that is also unsynchronized.  Such a contribution would be
  welcome.
 
   Making multiple IndexSearchers and FSDirectories didn't help because
 in the
   back end, lucene consults a singleton HashMap of some kind (don't
 remember
   implementation) that maintained a single FSDirectory for any given
 index
   being accessed from the JVM... multiple calls to
 FSDirectory.getDirectory
   actually return the same FSDirectory object with synchronization at
 the same
   point.
 
  This does not make sense to me.  FSDirectory does keep a cache of
  FSDirectory instances, but i/o should not be synchronized on these.  One
  should be able to open multiple input streams on the same file from an
  FSDirectory.  But this would not be a great solution, since file handle
  limits would soon become a problem.
 
  Doug
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 



Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
I'd love to try this, but I'm not aware of any 64-bit jvms for Windows on
Intel. If you know of any, please let me know. Linux may be an option, too.

btw, I'm getting a sustained rate of 135 queries/sec with 4 threads, which
is pretty impressive. Another way around the concurrency limit is to run
multiple jvms. The throughput of each is less, but the aggregate throughput
is higher.

Peter


On 1/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:

 Hmmm, can you run the 64 bit version of Windows (and hence a 64 bit JVM?)
 We're running with heap sizes up to 8GB (RH Linux 64 bit, Opterons,
 Sun Java 1.5)

 -Yonik

 On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
  Paul,
 
  I tried this but it ran out of memory trying to read the 500Mb .fdt
 file. I
  tried various values for MAX_BBUF, but it still ran out of memory (I'm
 using
  -Xmx1600M, which is the jvm's maximum value (v1.5))  I'll give
  NioFSDirectory a try.
 
  Thanks,
  Peter
 
 
  On 1/26/06, Paul Elschot [EMAIL PROTECTED] wrote:
  
   On Wednesday 25 January 2006 20:51, Peter Keegan wrote:
The index is non-compound format and optimized. Yes, I did try
MMapDirectory, but the index is too big - 3.5 GB (1.3GB is term
 vectors)
   
Peter
   
   You could also give this a try:
  
   http://issues.apache.org/jira/browse/LUCENE-283
  
   Regards,
   Paul Elschot
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 
 

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Dumb question: does the 64-bit compiler (javac) generate different code than
the 32-bit version, or is it just the jvm that matters? My reported speedups
were solely from using the 64-bit jvm with jar files from the 32-bit
compiler.

Peter


On 1/26/06, Yonik Seeley [EMAIL PROTECTED] wrote:

 Nice speedup!  The extra registers in 64-bit mode may have helped a little
 too.

 -Yonik

 On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
  Correction: make that 285 qps :)

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Ray,

The short answer is that you can make Lucene blazingly fast by using advice
and design principles mentioned in this forum and of course reading 'Lucene
in Action'. For example, use a 'content' field for searching all fields (vs
mutli-field search), put all your stored data in one field, understand the
cost of numeric search and sorting. On the platform side, go multi-CPU and
of course 64-bit if possible :)

Also, I would venture to guess that a lot of search bottlenecks have nothing
to do with Lucene, but rather with the infrastructure around it. For example,
how does your client interface to the search engine? My results use a plain
socket interface between client and server (one connection for queries,
another for results), using a simple query/results data format. Introducing
other web infrastructure invites performance degradation, too.

I've a bit of experience with search engines, but I'm obviously still
learning thanks to this group.

Peter

On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:

 Peter,

 Wow, the speedup is impressive! But may I ask what you did to
 achieve 135 queries/sec prior to the JVM switch?

 ray,

 On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
  Correction: make that 285 qps :)
 
  On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
  
   I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now
   getting 250 queries/sec and excellent cpu utilization (equal
 concurrency on
   all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't
 aware
   of it.
  
   Thanks all very much.
   Peter
  
  
   On 1/26/06, Doug Cutting [EMAIL PROTECTED] wrote:
   
Doug Cutting wrote:
 A 64-bit JVM with NioDirectory would really be optimal for this.
   
Oops.  I meant MMapDirectory, not NioDirectory.
   
Doug
   
   
 -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   
   
  
 
 



Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Ray,

The 135 qps rate was using the standard FSDirectory in 1.9.

Peter


On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:

 Paul,

 Thanks for the advice! But for the 100+ queries/sec on a 32-bit
 platform, did you end up applying other patches? or use different
 FSDirectory implementations?

 Thanks!

 ray,

 On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
  Ray,
 
  The short answer is that you can make Lucene blazingly fast by using
 advice
  and design principles mentioned in this forum and of course reading
 'Lucene
  in Action'. For example, use a 'content' field for searching all fields
 (vs
  mutli-field search), put all your stored data in one field, understand
 the
  cost of numeric search and sorting. On the platform side, go multi-CPU
 and
  of course 64-bit if possible :)
 
  Also, I would venture to guess that a lot of search bottlenecks have
 nothing
  to do with Lucene, but rather in the infrastructure around it. For
 example,
  how does your client interface to the search engine? My results use a
 plain
  socket interface between client and server (one connection for queries,
  another for results), using a simple query/results data format.
 Introducing
  other web infrastructures invites degradation in performance, too.
 
  I've a bit of experience with search engines, but I'm obviously still
  learning thanks to this group.
 
  Peter
 
  On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:
  
   Peter,
  
   Wow, the speed up in impressive! But may I ask what did you do to
   achieve 135 queries/sec prior to the JVM swich?
  
   ray,
  
   On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
Correction: make that 285 qps :)
   
On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:

 I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm
 now
 getting 250 queries/sec and excellent cpu utilization (equal
   concurrency on
 all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I
 wasn't
   aware
 of it.

 Thanks all very much.
 Peter


 On 1/26/06, Doug Cutting [EMAIL PROTECTED] wrote:
 
  Doug Cutting wrote:
   A 64-bit JVM with NioDirectory would really be optimal for
 this.
 
  Oops.  I meant MMapDirectory, not NioDirectory.
 
  Doug
 
 
   -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 

   
   
  
 
 



Re: Throughput doesn't increase when using more concurrent threads

2006-01-30 Thread Peter Keegan
I cranked up the dial on my query tester and was able to get the rate up to
325 qps. Unfortunately, the machine died shortly thereafter (memory errors
:-( ) Hopefully, it was just a coincidence. I haven't measured 64-bit
indexing speed, yet.

Peter

On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote:

 Peter Keegan wrote:
  I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now
  getting 250 queries/sec and excellent cpu utilization (equal concurrency
 on
  all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't
 aware
  of it.
 
 Wow.  That's fast.

 Out of interest, does indexing time speed up much on 64-bit hardware?
 I'm particularly interested in this side of things because for our own
 application, any query response under half a second is good enough, but
 the indexing side could always be faster. :-)

 Daniel

 --
 Daniel Noll

 Nuix Australia Pty Ltd
 Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
 Phone: (02) 9280 0699
 Fax:   (02) 9212 6902

 This message is intended only for the named recipient. If you are not
 the intended recipient you are notified that disclosing, copying,
 distributing or taking any action in reliance on the contents of this
 message or attachment is strictly prohibited.


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
We discovered that the kernel was only using 8 CPUs. After recompiling for
16 (8+hyperthreads), it looks like the query rate will settle in around
280-300 qps. Much better, although still quite a bit slower than the
Opteron.

Peter




On 2/22/06, Yonik Seeley [EMAIL PROTECTED] wrote:

 Hmmm, not sure what that could be.
 You could try using the default FSDir instead of MMapDir to see if the
 differences are there.

 Some things that could be different:
 - thread scheduling (shouldn't make too much of a difference though)
 - synchronization workings
 - page replacement policy... how to figure out what pages to swap in
 and which to swap out, esp of the memory mapped files.

 You could also try a profiler on both platforms to try and see where
 the difference is.

 -Yonik

 On 2/22/06, Peter Keegan [EMAIL PROTECTED] wrote:
  I am doing a performance comparison of Lucene on Linux vs Windows.
 
  I have 2 identically configured servers (8-CPUs (real) x 3GHz Xeon
  processors, 64GB RAM). One is running CentOS 4 Linux, the other is
 running
  Windows server 2003 Enterprise Edition x64. Both have 64-bit JVMs from
 Sun.
  The Lucene server is using MMapDirectory. I'm running the jvm with
  -Xmx16000M. Peak memory usage of the jvm on Linux is about 6GB and 7.8GB on
  Windows.
 
  I'm observing query rates of 330 queries/sec on the Wintel server, but
 only
  200 qps on the Linux box. At first, I suspected a network bottleneck,
 but
  when I 'short-circuited' Lucene, the query rates were identical.
 
  I suspect that there are some things to be tuned in Linux, but I'm not
 sure
  what. Any advice would be appreciated.
 
  Peter
 
 
 
  On 1/30/06, Peter Keegan [EMAIL PROTECTED] wrote:
  
   I cranked up the dial on my query tester and was able to get the rate
 up
   to 325 qps. Unfortunately, the machine died shortly thereafter (memory
   errors :-( ) Hopefully, it was just a coincidence. I haven't measured
 64-bit
   indexing speed, yet.
  
   Peter
  
   On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote:
   
Peter Keegan wrote:
 I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm
 now
 getting 250 queries/sec and excellent cpu utilization (equal
concurrency on
 all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I
 wasn't
aware
 of it.

Wow.  That's fast.
   
Out of interest, does indexing time speed up much on 64-bit
 hardware?
I'm particularly interested in this side of things because for our
 own
application, any query response under half a second is good enough,
 but
the indexing side could always be faster. :-)
   
Daniel
   
--
Daniel Noll
   
Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902
   
This message is intended only for the named recipient. If you are
 not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of
 this
message or attachment is strictly prohibited.
   
   
   
 -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   
   
  
 
 

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
Chris,

I tried JRockit a while back on 8-cpu/windows and it was slower than Sun's.
Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system next
(32 with hyperthreading), on LinTel. I may give JRockit another go around
then.

Thanks,
Peter

On 2/23/06, Chris Lamprecht [EMAIL PROTECTED] wrote:

 Peter,
 Have you given JRockit JVM a try?  I've seen it help throughput
 compared to Sun's JVM on a dual xeon/linux machine, especially with
 concurrency (up to 6 concurrent searches happening).  I'm curious to
 see if it makes a difference for you.

 -chris

 On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:
  We discovered that the kernel was only using 8 CPUs. After recompiling
 for
  16 (8+hyperthreads), it looks like the query rate will settle in around
  280-300 qps. Much better, although still quite a bit slower than the
  opteron.
 
  Peter
 
 
 
 
  On 2/22/06, Yonik Seeley [EMAIL PROTECTED] wrote:
  
   Hmmm, not sure what that could be.
   You could try using the default FSDir instead of MMapDir to see if the
   differences are there.
  
   Some things that could be different:
   - thread scheduling (shouldn't make too much of a difference though)
   - synchronization workings
   - page replacement policy... how to figure out what pages to swap in
   and which to swap out, esp of the memory mapped files.
  
   You could also try a profiler on both platforms to try and see where
   the difference is.
  
   -Yonik
  
   On 2/22/06, Peter Keegan [EMAIL PROTECTED] wrote:
I am doing a performance comparison of Lucene on Linux vs Windows.
   
I have 2 identically configured servers (8-CPUs (real) x 3GHz Xeon
processors, 64GB RAM). One is running CentOS 4 Linux, the other is
   running
Windows server 2003 Enterprise Edition x64. Both have 64-bit JVMs
 from
   Sun.
The Lucene server is using MMapDirectory. I'm running the jvm with
-Xmx16000M. Peak memory usage of the jvm on Linux is about 6GB and
 7.8GBon
windows.
   
I'm observing query rates of 330 queries/sec on the Wintel server,
 but
   only
200 qps on the Linux box. At first, I suspected a network
 bottleneck,
   but
when I 'short-circuited' Lucene, the query rates were identical.
   
I suspect that there are some things to be tuned in Linux, but I'm
 not
   sure
what. Any advice would be appreciated.
   
Peter
   
   
   
On 1/30/06, Peter Keegan [EMAIL PROTECTED] wrote:

 I cranked up the dial on my query tester and was able to get the
 rate
   up
 to 325 qps. Unfortunately, the machine died shortly thereafter
 (memory
 errors :-( ) Hopefully, it was just a coincidence. I haven't
 measured
   64-bit
 indexing speed, yet.

 Peter

 On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote:
 
  Peter Keegan wrote:
   I tried the AMD64-bit JVM from Sun and with MMapDirectory and
 I'm
   now
   getting 250 queries/sec and excellent cpu utilization (equal
  concurrency on
   all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I
   wasn't
  aware
   of it.
  
  Wow.  That's fast.
 
  Out of interest, does indexing time speed up much on 64-bit
   hardware?
  I'm particularly interested in this side of things because for
 our
   own
  application, any query response under half a second is good
 enough,
   but
  the indexing side could always be faster. :-)
 
  Daniel
 
  --
  Daniel Noll
 
  Nuix Australia Pty Ltd
  Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
  Phone: (02) 9280 0699
  Fax:   (02) 9212 6902
 
  This message is intended only for the named recipient. If you
 are
   not
  the intended recipient you are notified that disclosing,
 copying,
  distributing or taking any action in reliance on the contents of
   this
  message or attachment is strictly prohibited.
 
 
 
   -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 

   
   
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 
 

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
Yonik,

We're investigating both approaches.
Yes, the resources (and permutations) are dizzying!

Peter

On 2/23/06, Yonik Seeley [EMAIL PROTECTED] wrote:

 Wow, some resources!
 Would it be cheaper / more scalable to copy the index to multiple
 boxes and loadbalance requests across them?

 -Yonik

 On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:
  Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system
 next
  (32 with hyperthreading), on LinTel. I may give JRockit another go
 around
  then.
 
  Thanks,
  Peter

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-03-07 Thread Peter Keegan
I ran a query performance tester against 8-cpu and 16-cpu Xeon servers
(16/32 cpus with hyperthreading), on Linux. Here are the results:

8-cpu:  275 qps
16-cpu: 305 qps
(the dual-core Opteron servers are still faster)

Here is the stack trace of 8 of the 16 query threads during the test:

at org.apache.lucene.index.SegmentReader.document(SegmentReader.java
:281)
- waiting to lock 0x002adf5b2110 (a
org.apache.lucene.index.SegmentReader)
at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:83)
at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java
:146)
at org.apache.lucene.search.Hits.doc(Hits.java:103)

SegmentReader.document is a synchronized method. I have one stored field
(binary, uncompressed) with an average length of 0.5Kb. The retrieval of
this stored field is within this synchronized code. Since I am using
MMapDirectory, does this retrieval need to be synchronized?

Peter

On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:

 Yonik,

 We're investigating both approaches.
 Yes, the resources (and permutations) are dizzying!

 Peter


 On 2/23/06, Yonik Seeley  [EMAIL PROTECTED] wrote:
 
  Wow, some resources!
  Would it be cheaper / more scalable to copy the index to multiple
  boxes and loadbalance requests across them?
 
  -Yonik
 
  On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:
   Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system
  next
   (32 with hyperthreading), on LinTel. I may give JRockit another go
  around
   then.
  
   Thanks,
   Peter
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 



Re: Throughput doesn't increase when using more concurrent threads

2006-03-10 Thread Peter Keegan
 3. Use the ThreadLocal's FieldReader in the document() method.

As I understand it, this means that the document method no longer needs to
be synchronized, right?

I've made these changes and it does appear to improve performance. Random
snapshots of the stack traces show only an occasional lock in 'isDeleted'.
Mostly, though, the threads are busy scoring and adding results to priority
queues, which is great. I've included some sample stacks, below. I'll report
the new query rates after it has run for at least overnight, and I'd be
happy to submit these changes to the Lucene committers, if interested.
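
For anyone following along, the change boils down to giving each thread its
own clones of the stored-fields streams instead of synchronizing on the shared
ones. A rough sketch (my reading of the suggestion; the field and method names
are illustrative, not the committed code):

  // per-thread clones of the .fdx/.fdt streams inside FieldsReader, so that
  // document() can read stored fields without a shared lock
  private final ThreadLocal perThreadStreams = new ThreadLocal();

  private IndexInput[] getStreams() {
    IndexInput[] streams = (IndexInput[]) perThreadStreams.get();
    if (streams == null) {
      streams = new IndexInput[] {
          (IndexInput) indexStream.clone(),     // .fdx clone for this thread
          (IndexInput) fieldsStream.clone() };  // .fdt clone for this thread
      perThreadStreams.set(streams);
    }
    return streams;
  }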

Peter


Sample stack traces:

QueryThread group 1,#8 prio=1 tid=0x002ce48eeb80 nid=0x6b87 runnable
[0x43887000..0x43887bb0]
at org.apache.lucene.search.FieldSortedHitQueue.lessThan(
FieldSortedHitQueue.java:108)
at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:61)
at org.apache.lucene.search.FieldSortedHitQueue.insert(
FieldSortedHitQueue.java:85)
at org.apache.lucene.search.FieldSortedHitQueue.insert(
FieldSortedHitQueue.java:92)
at org.apache.lucene.search.TopFieldDocCollector.collect(
TopFieldDocCollector.java:51)
at org.apache.lucene.search.TermScorer.score(TermScorer.java:75)
at org.apache.lucene.search.TermScorer.score(TermScorer.java:60)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:225)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
at org.apache.lucene.search.Hits.init(Hits.java:52)
at org.apache.lucene.search.Searcher.search(Searcher.java:62)

QueryThread group 1,#5 prio=1 tid=0x002ce4d659f0 nid=0x6b84 runnable
[0x43584000..0x43584d30]
at org.apache.lucene.search.TermScorer.score(TermScorer.java:75)
at org.apache.lucene.search.TermScorer.score(TermScorer.java:60)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:225)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
at org.apache.lucene.search.Hits.init(Hits.java:52)
at org.apache.lucene.search.Searcher.search(Searcher.java:62)

QueryThread group 1,#4 prio=1 tid=0x002ce10afd50 nid=0x6b83 runnable
[0x43483000..0x43483db0]
at org.apache.lucene.store.MMapDirectory$MMapIndexInput.readByte(
MMapDirectory.java:46)
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:56)
at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java
:101)
at org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java
:194)
at org.apache.lucene.search.TermScorer.skipTo(TermScorer.java:144)
at org.apache.lucene.search.ConjunctionScorer.doNext(
ConjunctionScorer.java:56)
at org.apache.lucene.search.ConjunctionScorer.next(
ConjunctionScorer.java:51)
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java
:290)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:225)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
at org.apache.lucene.search.Hits.init(Hits.java:52)
at org.apache.lucene.search.Searcher.search(Searcher.java:62)

QueryThread group 1,#3 prio=1 tid=0x002ce48959f0 nid=0x6b82 runnable
[0x43382000..0x43382e30]
at java.util.LinkedList.listIterator(LinkedList.java:523)
at java.util.AbstractList.listIterator(AbstractList.java:349)
at java.util.AbstractSequentialList.iterator(AbstractSequentialList.java
:250)
at org.apache.lucene.search.ConjunctionScorer.score(
ConjunctionScorer.java:80)
at org.apache.lucene.search.BooleanScorer2$2.score(BooleanScorer2.java
:186)
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java
:327)
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java
:291)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:132)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:225)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
at org.apache.lucene.search.Hits.init(Hits.java:52)
at org.apache.lucene.search.Searcher.search(Searcher.java:62)


On 3/7/06, Doug Cutting [EMAIL PROTECTED] wrote:

 Peter Keegan wrote:
  I ran a query performance tester against 8-cpu and 16-cpu Xeon servers
  (16/32 cpu hyperthreaded). on Linux. Here are the results:
 
  8-cpu:  275 qps
  16-cpu: 305 qps
  (the dual-core Opteron servers are still faster)
 
  Here is the stack trace of 8 of the 16 query

Re: Throughput doesn't increase when using more concurrent threads

2006-03-13 Thread Peter Keegan
Chris,

Should this patch work against the current code base? I'm getting this
error:

D:\lucene-1.9> patch -b -p0 -i nio-lucene-1.9.patch
patching file src/java/org/apache/lucene/index/CompoundFileReader.java
patching file src/java/org/apache/lucene/index/FieldsReader.java
missing header for unified diff at line 45 of patch
can't find file to patch at input line 45
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
| +47,9 @@
| fieldsStream = d.openInput(segment + .fdt);
| indexStream = d.openInput(segment + .fdx);
|
|+fstream = new ThreadStream(fieldsStream);
|+istream = new ThreadStream(indexStream);
|+
| size = (int)(indexStream.length() / 8);
|   }
|
--

Thanks,
Peter


On 3/10/06, Chris Lamprecht [EMAIL PROTECTED] wrote:

 Peter,

 I think this is similar to the patch in this bugzilla task:

 http://issues.apache.org/bugzilla/show_bug.cgi?id=35838
 the patch itself is
 http://issues.apache.org/bugzilla/attachment.cgi?id=15757

 (BTW does JIRA have a way to display the patch diffs?)

 The above patch also has a change to SegmentReader to avoid
 synchronization on isDeleted().  However, with that patch, you no
 longer have the guarantee that one thread will immediately see
 deletions by another thread.  This was fine for my purposes, and
 resulted in a big performance boost when there were deleted documents,
 but it may not be correct for others' needs.

 -chris
 On 3/10/06, Peter Keegan [EMAIL PROTECTED] wrote:
   3. Use the ThreadLocal's FieldReader in the document() method.
 
  As I understand it, this means that the document method no longer needs
 to
  be synchronized, right?
 
  I've made these changes and it does appear to improve performance.
 Random
  snapshots of the stack traces show only an occasional lock in
 'isDeleted'.
  Mostly, though, the threads are busy scoring and adding results to
 priority
  queues, which is great. I've included some sample stacks, below. I'll
 report
  the new query rates after it has run for at least overnight, and I'd be
  happy submit these changes to the lucene committers, if interested.
 
  Peter
 
 
  Sample stack traces:
 
  QueryThread group 1,#8 prio=1 tid=0x002ce48eeb80 nid=0x6b87
 runnable
  [0x43887000..0x43887bb0]
  at org.apache.lucene.search.FieldSortedHitQueue.lessThan(
  FieldSortedHitQueue.java:108)
  at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java
 :61)
  at org.apache.lucene.search.FieldSortedHitQueue.insert(
  FieldSortedHitQueue.java:85)
  at org.apache.lucene.search.FieldSortedHitQueue.insert(
  FieldSortedHitQueue.java:92)
  at org.apache.lucene.search.TopFieldDocCollector.collect(
  TopFieldDocCollector.java:51)
  at org.apache.lucene.search.TermScorer.score(TermScorer.java:75)
  at org.apache.lucene.search.TermScorer.score(TermScorer.java:60)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java
 :132)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java
 :110)
  at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java
 :225)
  at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
  at org.apache.lucene.search.Hits.init(Hits.java:52)
  at org.apache.lucene.search.Searcher.search(Searcher.java:62)
 
  QueryThread group 1,#5 prio=1 tid=0x002ce4d659f0 nid=0x6b84
 runnable
  [0x43584000..0x43584d30]
  at org.apache.lucene.search.TermScorer.score(TermScorer.java:75)
  at org.apache.lucene.search.TermScorer.score(TermScorer.java:60)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java
 :132)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java
 :110)
  at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java
 :225)
  at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
  at org.apache.lucene.search.Hits.init(Hits.java:52)
  at org.apache.lucene.search.Searcher.search(Searcher.java:62)
 
  QueryThread group 1,#4 prio=1 tid=0x002ce10afd50 nid=0x6b83
 runnable
  [0x43483000..0x43483db0]
  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.readByte(
  MMapDirectory.java:46)
  at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:56)
  at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java
  :101)
  at org.apache.lucene.index.SegmentTermDocs.skipTo(
 SegmentTermDocs.java
  :194)
  at org.apache.lucene.search.TermScorer.skipTo(TermScorer.java:144)
  at org.apache.lucene.search.ConjunctionScorer.doNext(
  ConjunctionScorer.java:56)
  at org.apache.lucene.search.ConjunctionScorer.next(
  ConjunctionScorer.java:51)
  at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java
  :290)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java
 :132)
  at org.apache.lucene.search.IndexSearcher.search

Re: Throughput doesn't increase when using more concurrent threads

2006-03-13 Thread Peter Keegan
Chris,
My apologies - this error was apparently caused by a file format mismatch
(probably line endings).
Thanks,
Peter

On 3/13/06, Peter Keegan [EMAIL PROTECTED] wrote:

 Chris,

 Should this patch work against the current code base? I'm getting this
 error:

 D:\lucene-1.9patch -b -p0 -i nio-lucene-1.9.patch
 patching file src/java/org/apache/lucene/index/CompoundFileReader.java
 patching file src/java/org/apache/lucene/index/FieldsReader.java
 missing header for unified diff at line 45 of patch
 can't find file to patch at input line 45
 Perhaps you used the wrong -p or --strip option?
 The text leading up to this was:
 --
 | +47,9 @@
 | fieldsStream = d.openInput(segment + .fdt);
 | indexStream = d.openInput(segment + .fdx);
 |
 |+fstream = new ThreadStream(fieldsStream);
 |+istream = new ThreadStream(indexStream);
 |+
 | size = (int)(indexStream.length() / 8);
 |   }
 |
 --

 Thanks,
 Peter



 On 3/10/06, Chris Lamprecht [EMAIL PROTECTED] wrote:
 
  Peter,
 
  I think this is similar to the patch in this bugzilla task:
 
  http://issues.apache.org/bugzilla/show_bug.cgi?id=35838
  the patch itself is
  http://issues.apache.org/bugzilla/attachment.cgi?id=15757
 
  (BTW does JIRA have a way to display the patch diffs?)
 
  The above patch also has a change to SegmentReader to avoid
  synchronization on isDeleted().  However, with that patch, you no
  longer have the guarantee that one thread will immediately see
  deletions by another thread.  This was fine for my purposes, and
  resulted in a big performance boost when there were deleted documents,
  but it may not be correct for others' needs.
 
  -chris
  On 3/10/06, Peter Keegan [EMAIL PROTECTED] wrote:
3. Use the ThreadLocal's FieldReader in the document() method.
  
   As I understand it, this means that the document method no longer
  needs to
   be synchronized, right?
  
   I've made these changes and it does appear to improve performance.
  Random
   snapshots of the stack traces show only an occasional lock in
  'isDeleted'.
   Mostly, though, the threads are busy scoring and adding results to
  priority
   queues, which is great. I've included some sample stacks, below. I'll
  report
   the new query rates after it has run for at least overnight, and I'd
  be
   happy submit these changes to the lucene committers, if interested.
  
   Peter
  
  
   Sample stack traces:
  
   QueryThread group 1,#8 prio=1 tid=0x002ce48eeb80 nid=0x6b87
  runnable
   [0x43887000..0x43887bb0]
   at org.apache.lucene.search.FieldSortedHitQueue.lessThan(
   FieldSortedHitQueue.java:108)
   at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:61)
   at org.apache.lucene.search.FieldSortedHitQueue.insert(
   FieldSortedHitQueue.java:85)
   at org.apache.lucene.search.FieldSortedHitQueue.insert(
   FieldSortedHitQueue.java:92)
   at org.apache.lucene.search.TopFieldDocCollector.collect(
   TopFieldDocCollector.java:51)
   at org.apache.lucene.search.TermScorer.score(TermScorer.java:75)
   at org.apache.lucene.search.TermScorer.score (TermScorer.java:60)
   at org.apache.lucene.search.IndexSearcher.search(
  IndexSearcher.java:132)
   at org.apache.lucene.search.IndexSearcher.search(
  IndexSearcher.java:110)
   at org.apache.lucene.search.MultiSearcher.search (
  MultiSearcher.java:225)
   at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
   at org.apache.lucene.search.Hits.init(Hits.java:52)
   at org.apache.lucene.search.Searcher.search (Searcher.java:62)
  
   QueryThread group 1,#5 prio=1 tid=0x002ce4d659f0 nid=0x6b84
  runnable
   [0x43584000..0x43584d30]
   at org.apache.lucene.search.TermScorer.score (TermScorer.java:75)
   at org.apache.lucene.search.TermScorer.score(TermScorer.java:60)
   at org.apache.lucene.search.IndexSearcher.search(
  IndexSearcher.java:132)
   at org.apache.lucene.search.IndexSearcher.search (
  IndexSearcher.java:110)
   at org.apache.lucene.search.MultiSearcher.search(
  MultiSearcher.java:225)
   at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
   at org.apache.lucene.search.Hits .init(Hits.java:52)
   at org.apache.lucene.search.Searcher.search(Searcher.java:62)
  
   QueryThread group 1,#4 prio=1 tid=0x002ce10afd50 nid=0x6b83
  runnable
   [0x43483000..0x43483db0]
   at org.apache.lucene.store.MMapDirectory$MMapIndexInput.readByte(
   MMapDirectory.java:46)
   at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:56)
   at org.apache.lucene.index.SegmentTermDocs.next (
  SegmentTermDocs.java
   :101)
   at org.apache.lucene.index.SegmentTermDocs.skipTo(
  SegmentTermDocs.java
   :194)
   at org.apache.lucene.search.TermScorer.skipTo(TermScorer.java:144)
   at org.apache.lucene.search.ConjunctionScorer.doNext(
   ConjunctionScorer.java:56

Re: Good MMapDirectory performance

2006-03-14 Thread Peter Keegan
- I read from Peter Keegan's recent postings:
- The Lucene server is using MMapDirectory. I'm running
-  the jvm with -Xmx16000M. Peak memory usage of the jvm
-  on Linux is about 6GB and 7.8GB on windows.
- We don't have nearly as much memory as Peter but I
- wonder whether he is gaining anything with such
- a large heap.

My application gets better throughput with more VM, but that is probably due
to heavy use of ByteBuffers in the application, not VM for Lucene.

Peter



On 3/12/06, kent.fitch [EMAIL PROTECTED] wrote:

 I thought I'd post some good news about MMapDirectory as
 the comments in the release notes are quite downbeat about
 its performance.  In some environments MMapDirectory
 provides a big improvement.

 Our test application is an index of 11.4 million
 documents which are derived from MARC (bibliographic)
 catalogue records.  Our aim is to build a system
 to demonstrate relevance ranking and result clustering
 for library union catalogue searching (a union
 catalogue accumulates/merges records from multiple
 libraries).

 Our main index component sizes:
 fdt 17GB
 fdx 91MB
 tis 82MB
 frq 45MB
 prx 11MB
 tii 1.2 MB

 We have a separate Lucence index (not discussed further)
 which stores the MARC records.

 Each document has many fields.   We'll probably reduce the
 number after we decide on the best search strategies, but
 lots of fields gives us lots of flexibility whilst testing
 search and ranking strategies.

 Stored and unindexed fields, used for summary results:
   display title
   display author
   display publication details
   holdingsCount (number of libraries holding)

 Tokenized indices:
   title
   author
   subject
   genre
   keyword (all text)

 Keyword (untokenized) indices:
   title
   author
   subject
   genre
   audience
   Dewey/LC classification
   language
   isbn/issn
   publication date (date range code)
   unique bibliographic id

 Wildcard Tokenized indices created by a custom stub
 analyzer which reduces a term to its first few characters:
   title
   author
   subject
   keyword

 Field boosts are set for some fields.  For example, title
 sub title, series title, component title are all
 stored as title but with different field boosts (as a
 match on normal title is deemed more relevant than a match
 on series title).

 The document boost is set to the sqrt of the holdingsCount
 (favouring popular resources).

 The user interface supports searching and refining searches
 on specific fields but the most common search is created
 from a single google style search box.  Here's a typical
 query generated from a 2 word search:

 +(titleWords:franz kafka^4.0
   authorWords:franz kafka^3.0
   subjectWords:franz kafka^3.0
   keywords:franz kafka^1.4
   title:franz kafka^4.0
   (+titleWords:franz +titleWords:kafka^3.0)
   author:franz kafka^3.0
   +authorWords:franz +authorWords:kafka^2.0)
   subject:franz kafka^3.0
   (+subjectWords:franz +subjectWords:kafka^1.5)
   (+genreWords:franz +genreWords:kafka^2.0)
   (+keywords:franz +keywords:kafka)
   (+titleWildcard:fra +titleWildcard:kaf^0.7)
   (+authorWildcard:fra +authorWildcard:kaf^0.7)
   (+subjectWildcard:fra +subjectWildcard:kaf^0.7)
   (+keywordWildcard:fra +keywordWildcard:kaf^0.2)
 )

 It generated 1635 hits.  We then read the first 700
 documents in the hit list and extract the date, subject,
 author, genre, Dewey/LC classification and audience
 fields for each, accumulating the popularity of each.

 Using this data, for each of the subject, author, genre,
 Dewey/LC and audience categories, we find the 30 most
 popular field values and for each of these we query the
 index to find their frequency in the entire index.
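
 (A sketch of that clustering pass, for illustration only; the 700-document
 sample size follows the description above, while the field name and the rest
 of the code are assumptions:)

   Map counts = new HashMap();                      // value -> count in sample
   int sample = Math.min(700, hits.length());
   for (int i = 0; i < sample; i++) {
     Document doc = hits.doc(i);
     String[] subjects = doc.getValues("subject");  // repeat per category field
     for (int s = 0; subjects != null && s < subjects.length; s++) {
       Integer c = (Integer) counts.get(subjects[s]);
       counts.put(subjects[s], new Integer(c == null ? 1 : c.intValue() + 1));
     }
   }
   // then, for each of the 30 most popular values, its corpus-wide frequency:
   int corpusFreq = reader.docFreq(new Term("subject", value));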

 We then render the first 100 document results (title,
 author, publication details, holdings) and the top 30
 for each of subject, author, genre, Dewey/LC and audience,
 ordering each list by the popularity of the term in the
 hit results (sample of the first 700) and rendering the
 size of the text based on the frequency of the term in
 the entire database (a bit like the Flickr tag popularity
 lists).  We also render a graph of hit results by date
 range.

 The initial search is very quick - typically a small
 number of tens of millsecs.  The clustering takes
 much longer - reading up to 700 records, extracting
 all those fields, sorting to get the top 30 of each
 field category, looking up the frequency of each term
 in the database.

 The test machine was a SunFire440 with 2 x 1.593GHz
 UltraSPARC-IIIi processors and 8GB of memory running
 Solaris 9, Java 1.5 in 64 bit mode, Jetty. The Lucene data
 directory is stored on a local 10K SCSI disk.

 The benchmark consisted of running 13,142 representative
 and unique search phrases collected from another system.
 The search phrases are unsorted.  The client (testing)
 system is run on another unloaded computer and was
 configured to run a varying number of threads representing
 different loads.  The results discussed here were
 produced with 3 

Re: Throughput doesn't increase when using more concurrent threads

2006-03-17 Thread Peter Keegan
I did some additional testing with Chris's patch and mine (based on Doug's
note) vs. no patch and found that all 3 produced the same throughput - about
330 qps - over a longer period. So, there seems to be a point of diminishing
returns to adding more cpus. The dual core Opterons (8 cpu) still win
handily at 400 qps.

Peter


On 3/13/06, Peter Keegan [EMAIL PROTECTED] wrote:

 Chris,
 My apologies - this error was apparently caused by a file format mismatch
 (probably line endings).
 Thanks,
 Peter


 On 3/13/06, Peter Keegan [EMAIL PROTECTED] wrote:
 
  Chris,
 
  Should this patch work against the current code base? I'm getting this
  error:
 
  D:\lucene-1.9patch -b -p0 -i nio-lucene-1.9.patch
  patching file src/java/org/apache/lucene/index/CompoundFileReader.java
  patching file src/java/org/apache/lucene/index/FieldsReader.java
  missing header for unified diff at line 45 of patch
  can't find file to patch at input line 45
  Perhaps you used the wrong -p or --strip option?
  The text leading up to this was:
  --
  | +47,9 @@
  | fieldsStream = d.openInput(segment + .fdt);
  | indexStream = d.openInput(segment + .fdx);
  |
  |+fstream = new ThreadStream(fieldsStream);
  |+istream = new ThreadStream(indexStream);
  |+
  | size = (int)(indexStream.length() / 8);
  |   }
  |
  --
 
  Thanks,
  Peter
 
 
 
  On 3/10/06, Chris Lamprecht [EMAIL PROTECTED] wrote:
  
   Peter,
  
   I think this is similar to the patch in this bugzilla task:
  
   http://issues.apache.org/bugzilla/show_bug.cgi?id=35838
   the patch itself is
   http://issues.apache.org/bugzilla/attachment.cgi?id=15757
  
   (BTW does JIRA have a way to display the patch diffs?)
  
   The above patch also has a change to SegmentReader to avoid
   synchronization on isDeleted().  However, with that patch, you no
   longer have the guarantee that one thread will immediately see
   deletions by another thread.  This was fine for my purposes, and
   resulted in a big performance boost when there were deleted documents,
  
   but it may not be correct for others' needs.
  
   -chris
   On 3/10/06, Peter Keegan [EMAIL PROTECTED]  wrote:
 3. Use the ThreadLocal's FieldReader in the document() method.
   
As I understand it, this means that the document method no longer
   needs to
be synchronized, right?
   
I've made these changes and it does appear to improve performance.
   Random
snapshots of the stack traces show only an occasional lock in
   'isDeleted'.
Mostly, though, the threads are busy scoring and adding results to
   priority
queues, which is great. I've included some sample stacks, below.
   I'll report
the new query rates after it has run for at least overnight, and I'd
   be
happy submit these changes to the lucene committers, if interested.
   
Peter
   
   
Sample stack traces:
   
QueryThread group 1,#8 prio=1 tid=0x002ce48eeb80 nid=0x6b87
   runnable
[0x43887000..0x43887bb0]
at org.apache.lucene.search.FieldSortedHitQueue.lessThan(
FieldSortedHitQueue.java:108)
at org.apache.lucene.util.PriorityQueue.insert(
   PriorityQueue.java :61)
at org.apache.lucene.search.FieldSortedHitQueue.insert(
FieldSortedHitQueue.java:85)
at org.apache.lucene.search.FieldSortedHitQueue.insert(
FieldSortedHitQueue.java:92)
at org.apache.lucene.search.TopFieldDocCollector.collect(
TopFieldDocCollector.java:51)
at org.apache.lucene.search.TermScorer.score(TermScorer.java:75)
at org.apache.lucene.search.TermScorer.score (TermScorer.java
   :60)
at org.apache.lucene.search.IndexSearcher.search(
   IndexSearcher.java:132)
at org.apache.lucene.search.IndexSearcher.search(
   IndexSearcher.java:110)
at org.apache.lucene.search.MultiSearcher.search (
   MultiSearcher.java:225)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
at org.apache.lucene.search.Hits.init(Hits.java:52)
at org.apache.lucene.search.Searcher.search (Searcher.java:62)
   
QueryThread group 1,#5 prio=1 tid=0x002ce4d659f0 nid=0x6b84
   runnable
[0x43584000..0x43584d30]
at org.apache.lucene.search.TermScorer.score (TermScorer.java
   :75)
at org.apache.lucene.search.TermScorer.score(TermScorer.java:60)
at org.apache.lucene.search.IndexSearcher.search(
   IndexSearcher.java:132)
at org.apache.lucene.search.IndexSearcher.search (
   IndexSearcher.java:110)
at org.apache.lucene.search.MultiSearcher.search(
   MultiSearcher.java:225)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
at org.apache.lucene.search.Hits .init(Hits.java:52)
at org.apache.lucene.search.Searcher.search(Searcher.java:62)
   
QueryThread group 1,#4 prio=1 tid=0x002ce10afd50 nid=0x6b83
   runnable
[0x43483000..0x43483db0
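
A rough sketch of the per-thread FieldsReader idea quoted above (not the
actual change; the segment and fieldInfos members and the FieldsReader
constructor are assumed from the 1.9/2.0 sources):

// Inside a modified SegmentReader (package org.apache.lucene.index).
// Each searching thread gets its own FieldsReader, so document(n) no
// longer needs to synchronize on a shared reader.
private final ThreadLocal threadFieldsReader = new ThreadLocal() {
  protected Object initialValue() {
    try {
      return new FieldsReader(directory(), segment, fieldInfos);
    } catch (java.io.IOException e) {
      throw new RuntimeException(e);
    }
  }
};

public org.apache.lucene.document.Document document(int n) throws java.io.IOException {
  if (isDeleted(n))
    throw new IllegalArgumentException("attempt to access a deleted document");
  FieldsReader fields = (FieldsReader) threadFieldsReader.get();
  return fields.doc(n);
}
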

Re: Non scoring search

2006-03-17 Thread Peter Keegan
I experimented with this by using a Similarity class that returns a
constant (1) for all values and found that it had no noticeable effect on query
performance.

Peter
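
For anyone who wants to repeat the experiment, a constant-scoring Similarity
is only a few lines (a sketch against the 2.0 API; install it with
Searcher.setSimilarity()):

import org.apache.lucene.search.DefaultSimilarity;

// Every scoring factor returns 1.0, so all hits score identically and the
// scoring math contributes essentially nothing to query cost.
public class ConstantSimilarity extends DefaultSimilarity {
  public float lengthNorm(String fieldName, int numTokens) { return 1.0f; }
  public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
  public float tf(float freq) { return 1.0f; }
  public float sloppyFreq(int distance) { return 1.0f; }
  public float idf(int docFreq, int numDocs) { return 1.0f; }
  public float coord(int overlap, int maxOverlap) { return 1.0f; }
}
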

On 12/6/05, Chris Hostetter [EMAIL PROTECTED] wrote:


 : I was wondering if there is a standard way to retrive documents WITHOUT
 : scoring and sorting them.  I need a list of documents that contain
 certain
 : terms but I do not need them sorted or scored.

 Using Filters directly (ie: constructing them, and then calling the bits()
 method yourself) is the most straight forward way i know of to achieve
 what you describe.
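
 A minimal sketch of that usage (Lucene 2.0 API):

 import java.util.BitSet;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.QueryFilter;

 // Collect the matching doc ids without scoring or sorting: wrap the query
 // in a filter and read its BitSet directly.
 public static BitSet matchingDocs(IndexReader reader, Query query)
     throws java.io.IOException {
   return new QueryFilter(query).bits(reader);
 }
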



 -Hoss


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Throughput doesn't increase when using more concurrent threads

2006-04-05 Thread Peter Keegan
 Out of interest, does indexing time speed up much on 64-bit hardware?

I was able to speed up indexing on 64-bit platform by taking advantage of
the larger address space to parallelize the indexing process. One thread
creates index segments with a set of RAMDirectories and another thread
merges the segments to disk with 'addIndexes'. This resulted in a speed
improvement of 27%.
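
A rough sketch of that pipeline (class and method names are made up; the
hand-off queue between the two threads is left out):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ParallelIndexSketch {
  private final Analyzer analyzer = new StandardAnalyzer();

  // Indexing thread: build one complete segment entirely in RAM.
  RAMDirectory buildRamSegment(Document[] docs) throws Exception {
    RAMDirectory ram = new RAMDirectory();
    IndexWriter writer = new IndexWriter(ram, analyzer, true);
    for (int i = 0; i < docs.length; i++)
      writer.addDocument(docs[i]);
    writer.close();
    return ram; // handed off to the merge thread via a queue
  }

  // Merge thread: fold a finished RAM segment into the on-disk index.
  // (Opening/closing the disk writer per segment is a simplification.)
  void mergeToDisk(String indexPath, RAMDirectory ram) throws Exception {
    IndexWriter writer = new IndexWriter(indexPath, analyzer, false);
    writer.addIndexes(new Directory[] { ram });
    writer.close();
  }
}
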

Peter


On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote:

 Peter Keegan wrote:
  I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now
  getting 250 queries/sec and excellent cpu utilization (equal concurrency
 on
  all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't
 aware
  of it.
 
 Wow.  That's fast.

 Out of interest, does indexing time speed up much on 64-bit hardware?
 I'm particularly interested in this side of things because for our own
 application, any query response under half a second is good enough, but
 the indexing side could always be faster. :-)

 Daniel

 --
 Daniel Noll

 Nuix Australia Pty Ltd
 Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
 Phone: (02) 9280 0699
 Fax:   (02) 9212 6902

 This message is intended only for the named recipient. If you are not
 the intended recipient you are notified that disclosing, copying,
 distributing or taking any action in reliance on the contents of this
 message or attachment is strictly prohibited.


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: MultiReader and MultiSearcher

2006-04-11 Thread Peter Keegan
Yonik,

Could you explain why an IndexSearcher constructed from multiple readers is
faster than a MultiSearcher constructed from same readers?

Thanks,
Peter



On 4/10/06, Yonik Seeley [EMAIL PROTECTED] wrote:

 On 4/10/06, oramas martín [EMAIL PROTECTED] wrote:
  Is there any performance (or other) difference between using an
  IndexSearcher initialized with a MultiReader instead of using a
  MultiSearcher?

 Yes, the IndexSearcher(MultiReader) solution will be faster.

 -Yonik
 http://incubator.apache.org/solr Solr, The Open Source Lucene Search
 Server

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: MultiReader and MultiSearcher

2006-04-11 Thread Peter Keegan
Does this mean that MultiReader doesn't merge the search results and sort
the results as if there was only one index? If not, does it  simply
concatenate the results?

Peter



On 4/11/06, Yonik Seeley [EMAIL PROTECTED] wrote:

 On 4/11/06, Peter Keegan [EMAIL PROTECTED] wrote:
  Could you explain why an IndexSearcher constructed from multiple readers
 is
  faster than a MultiSearcher constructed from same readers?

 The convergence layer is a level lower for a MultiReader vs a
 MultiSearcher.

 A MultiReader is an IndexReader, and Queries (Scorers) run directly
 against it since it has efficient TermEnum and TermDocs
 implementations.

 A MultiSearcher must do independent searches against subsearchers
 retrieving the top n matches, and maintain an additional priority
 queue to merge the results to get the global top n matches.  The
 implemetation of createWeight is also heavier (heh..)

 I've never measured the performance difference, and it's probably
 relatively small for most queries.

 -Yonik
 http://incubator.apache.org/solr Solr, The Open Source Lucene Search
 Server
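
 For reference, the two constructions being compared are roughly:

 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.MultiReader;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.MultiSearcher;
 import org.apache.lucene.search.Searchable;
 import org.apache.lucene.search.Searcher;

 public class MultiVsMulti {
   public static void main(String[] args) throws Exception {
     IndexReader r1 = IndexReader.open(args[0]);
     IndexReader r2 = IndexReader.open(args[1]);

     // One searcher over a MultiReader: queries see a single virtual index,
     // so there is no per-subsearcher top-n merging.
     Searcher viaReader =
         new IndexSearcher(new MultiReader(new IndexReader[] { r1, r2 }));

     // A MultiSearcher over independent searchers: each sub-search returns
     // its own top-n, merged through an extra priority queue.
     Searcher viaSearcher = new MultiSearcher(
         new Searchable[] { new IndexSearcher(r1), new IndexSearcher(r2) });
   }
 }
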

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: MultiReader and MultiSearcher

2006-04-12 Thread Peter Keegan
Correction: the doc order is fine. My test was based on the existing
'TestMultiSearcher', and I hadn't noticed the swapping of the index order
here:

// VITAL STEP:adding the searcher for the empty index first, before
the searcher for the populated index
searchers[0] = new IndexSearcher(indexStoreB);
searchers[1] = new IndexSearcher(indexStoreA);

Sorry about that,
Peter



On 4/11/06, Doug Cutting [EMAIL PROTECTED] wrote:

 Peter Keegan wrote:
  Oops. I meant to say: Does this mean that an IndexSearcher constructed
 from
  a MultiReader doesn't merge the search results and sort the results as
 if
  there was only one index?

 It doesn't have to, since a MultiReader *is* a single index.

  A quick test indicates that it does merge the results properly, however
  there is a difference in the order of documents with equal score. The
  MultiSearcher returns the higher doc first, but the IndexSearcher
 returns
  the lowest doc first. I think docs of equal score are supposed to be
  returned in the order they were indexed (lower doc id first).

 If that's the case it is a bug.  If you can reproduce this in a
 standalone test, please submit it to Jira.

 Doug

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: question about custom sort method

2006-05-17 Thread Peter Keegan

Suppose I have a custom sorting 'DocScoreComparator' for computing distances
on each search hit from a specified coordinate (similar to the
DistanceComparatorSource example in LIA). Assume that the 'specified
coordinate' is different for each query. This means a new custom comparator
must be created for each query, which is ok. However, Lucene caches the
comparator even though it will never be reused. This could result in heavy
memory usage if many queries are performed before the IndexReader is
updated.

Is there any way to avoid having lucene cache the custom sorting objects?


Re: MMapDirectory vs RAMDirectory

2006-06-07 Thread Peter Keegan

I was able to improve the behavior by setting the mapped ByteBuffer to null
in the close method of MMapIndexInput. This seems to be a strong enough
'suggestion' to the gc, as I can see the references go away with process
explorer, and the index files can be deleted, usually. Occasionally, a
reference to the '.tis' file remains.

Peter
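
The change amounts to something like this (a sketch of a modified copy of
MMapDirectory's inner MMapIndexInput; the field name is assumed):

// Inside a modified copy of MMapDirectory.MMapIndexInput (hypothetical).
// Dropping the reference lets the mapped ByteBuffer become unreachable as
// soon as the input is closed, rather than when the input object itself is.
public void close() throws java.io.IOException {
  buffer = null;
}
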


On 6/5/06, Daniel Noll [EMAIL PROTECTED] wrote:


Peter Keegan wrote:

 There is no 'unmap' method, so my understanding is that the file mapping
is
 valid until the underlying buffer is garbage-collected. However, forcing
 the gc doesn't help.

You're half right.

The file mapping is indeed valid until the underlying buffer is garbage
collected, but you can't force the GC -- there is no API which does
that.

Note the wording in the Javadoc for System.gc():

   Calling the gc method **suggests** that the Java Virtual Machine
expend effort toward recycling unused objects in order to make the
memory they currently occupy available for quick reuse. When control
returns from the method call, the Java Virtual Machine has made a
best effort to reclaim space from all discarded objects.

 The file deletes don't fail on Linux, but I'm wondering if there is
still a
 memory leak?

Linux allows you to delete a file while someone has the file descriptor
open, but the file descriptor will remain valid (i.e. the delete doesn't
actually occur) until everyone releases the file descriptor.

I ran into similar issues as these when working on other things, and
eventually ended up switching to using a RandomAccessFile, as those can
be closed.  Otherwise you're right -- the workaround is to routinely try
to delete the file.

Daniel

--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, AustraliaPh: +61 2 9280 0699
Web: http://www.nuix.com.au/Fax: +61 2 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Aggregating category hits

2006-06-09 Thread Peter Keegan

I compared Solr's DocSetHitCollector and counting bitset intersections to
get facet counts with a different approach that uses a custom hit collector
that tests each docid hit (bit) with each facets' bitset and increments a
count in a histogram. My assumption was that for queries with few hits, this
would be much faster than always doing bitset intersections/cardinality for
every facet all the time.
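
In outline, that collector amounts to something like this (a simplified
sketch; the per-facet BitSets are assumed to be built once per index refresh):

import java.util.BitSet;
import org.apache.lucene.search.HitCollector;

// For each hit, test the doc id against every facet's BitSet and bump a
// counter -- cheap for small result sets, linear in hits * facets overall.
public class FacetCountingCollector extends HitCollector {
  private final BitSet[] facetBits; // one BitSet of doc ids per facet
  private final int[] counts;       // histogram of hits per facet

  public FacetCountingCollector(BitSet[] facetBits) {
    this.facetBits = facetBits;
    this.counts = new int[facetBits.length];
  }

  public void collect(int doc, float score) {
    for (int f = 0; f < facetBits.length; f++) {
      if (facetBits[f].get(doc))
        counts[f]++;
    }
  }

  public int[] getCounts() { return counts; }
}
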

However, my throughput testing shows that the Solr method is at least 50%
faster than mine. I'm seeing a big win with the use of the HashDocSet for
lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K seems
to provide optimal performance. I'm looking forward to trying this with
OpenBitSet.

Peter




On 5/29/06, z shalev [EMAIL PROTECTED] wrote:


I know I'm a little late replying to this thread, but, in my humble opinion,
the best way to aggregate values (not necessarily terms, but whole values in
fields) is as follows:

  startup stage:

  for each field you would like to aggregate create a hashmap

  open an index reader and run through all the docs

  get the values to be aggregated from the fields of each doc

  create a hashcode for each value from each field collected; the hashcode
should have some sort of prefix indicating which field it's from (for example:
1 = author, 2 = ) and hence which hash it is stored in (at retrieval
time, this prefix can be used to easily retrieve the value from the correct
hash)

  place the hashcode/value in the appropriate hash

  create an arraylist

  at index X in the arraylist place an int array of all the hashcodes
associated with doc id X

  so for example: if I have doc id 0 which contains the values William
Shakespeare and 1797, the array list at index 0 will have an int
array containing 2 values (the 2 hashcodes of Shakespeare and 1797)

  run time:

  at run time, receive the hits and iterate through the doc ids, aggregating
the values with direct access into the arraylist (for doc id 10, go to index
10 in the arraylist to retrieve the array of hashcodes) and lookups into the
hashmaps

  I tested this today on a small index of approx 400,000 docs (1GB of data),
but I ran queries returning over 100,000 results.

  My response time was about 550 milliseconds on large (over 100,000)
result sets.

  Another point: this method should be scalable for much larger indexes as
well, as it is linear to the result set size and not the index size (which
is a HUGE bonus).

  If anyone wants the code, let me know.
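
In compressed form, the structure described above looks roughly like this
(names are invented and the field-id prefixing of the hashcodes is simplified
to a plain lookup map):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Startup: for each doc id remember the codes of its field values, plus one
// map that resolves a code back to the original value.
// Run time: for each hit, aggregate by direct index into the per-doc codes.
public class ValueAggregator {
  private final ArrayList codesByDoc = new ArrayList(); // index = doc id -> int[]
  private final Map valueByCode = new HashMap();        // Integer code -> String value

  // Assumes documents are added in doc-id order at startup.
  public void addDocument(String[] values) {
    int[] codes = new int[values.length];
    for (int i = 0; i < values.length; i++) {
      codes[i] = values[i].hashCode(); // the real scheme prefixes a field id
      valueByCode.put(new Integer(codes[i]), values[i]);
    }
    codesByDoc.add(codes);
  }

  // Count occurrences of each value over the hits of one search.
  public Map aggregate(int[] hitDocIds) {
    Map counts = new HashMap(); // String value -> Integer count
    for (int i = 0; i < hitDocIds.length; i++) {
      int[] codes = (int[]) codesByDoc.get(hitDocIds[i]);
      for (int j = 0; j < codes.length; j++) {
        String value = (String) valueByCode.get(new Integer(codes[j]));
        Integer c = (Integer) counts.get(value);
        counts.put(value, new Integer(c == null ? 1 : c.intValue() + 1));
      }
    }
    return counts;
  }
}
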




Marvin Humphrey [EMAIL PROTECTED] wrote:

Thanks, all.

The field cache and the bitsets both seem like good options until the
collection grows too large, provided that the index does not need to
be updated very frequently. Then for large collections, there's
statistical sampling. Any of those options seems preferable to
retrieving all docs all the time.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







Re: Aggregating category hits

2006-06-12 Thread Peter Keegan

I'm seeing query throughput of approx. 290 qps with OpenBitSet vs. 270 with
BitSet. I had to reduce the max. HashDocSet size to 2K - 3K (from 10K-20K)
to get the optimal tradeoff.

no. docs in index: 730,000
average no. results returned: 40
average response time: 50 msec (15-20 for counting facets)
no. facets: 100 on every query

I'm not using the Solr server as we have already developed an
infrastructure.

Peter


On 6/10/06, Yonik Seeley [EMAIL PROTECTED] wrote:


On 6/9/06, Peter Keegan [EMAIL PROTECTED] wrote:
 However, my throughput testing shows that the Solr method is at least
50%
 faster than mine. I'm seeing a big win with the use of the HashDocSet
for
 lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K
seems
 to provide optimal performance.

Interesting... how many documents are in your collection?
It would prob be nice to make the HashDocSet cut-off dynamic rather than
fixed.
Are you using Solr, or just some of its code?

  I'm looking forward to trying this with
 OpenBitSet.

I checked in the OpenBitSet changes today.  I imagine this will lower
the optimal max HashDocSet size for performance a little.  You might
not see much performance improvement if most of the intersections
involved a HashDocSet... the OpenBitSet improvements only kick in with
bitset-bitset intersection counts.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search
server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Does more memory help Lucene?

2006-06-12 Thread Peter Keegan

See my note about overlapping indexing documents with merging:

http://www.gossamer-threads.com/lists/lucene/java-user/34188?search_string=%2Bkeegan%20%2Baddindexes;#34188

Peter

On 6/12/06, Michael D. Curtin [EMAIL PROTECTED] wrote:


Nadav Har'El wrote:

 Otis Gospodnetic [EMAIL PROTECTED] wrote on 12/06/2006
04:36:45
 PM:


Nadav,

Look up one of my onjava.com Lucene articles, where I talk about
this.  You may also want to tell Lucene to merge segments on disk
less frequently, which is what mergeFactor does.


 Thanks. Can you please point me to the appropriate article (I found one
 from March 2003, but I'm not sure if it's the one you meant).

 About mergeFactor() - thanks for the hint, I'll try changing it too (I
used
 20 so far), and see if it helps performance.

 Still, there is one thing about mergeFactor(), and the merge process,
that
 I don't understand: does having more memory help this process at all?
Does
 having a large mergeFactor() actually require more memory? The reason
I'm
 asking this that I'm still trying to figure out whether having a machine
 with huge ram actually helps Lucene, or not.

I'm using 1.4.3, so I don't know if things are the same in 2.0.  Anyhow, I
found a significant performance benefit from changing minMergeDocs and
mergeFactor from their defaults of 10 and 10 to 1,000 and 70,
respectively.
The improvement seems to come from a reduction in the number of merges as
the
index is created.  Each merge involves reading and writing a bunch of data
already indexed, sometimes everything indexed so far, so it's easy to see
how
reducing the number of merges reduces the overall indexing time.
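
For reference, in the 1.4.x API these knobs are (if memory serves) public
fields on IndexWriter, so the values above would be set like this (later
versions expose setters such as setMergeFactor instead):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedWriter {
  public static IndexWriter open(String path) throws Exception {
    IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
    writer.minMergeDocs = 1000; // docs buffered in RAM before the first merge
    writer.mergeFactor = 70;    // number of segments merged at a time
    return writer;
  }
}
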

I can't remember why, but I also saw little benefit to increasing minMergeDocs
beyond 1000.  A lot of time was being spent in the first merge, which
takes a
bunch of one-document segments in a RAMDirectory and merges them into
the
first-level segments on disk.  I hacked the code to make this first merge
(and
ONLY the first merge) operate on minMergeDocs * mergeFactor documents
instead,
which greatly increased the RAM consumption but reduced the indexing
time.  In
detail, what I started with was:
   a.  read minMergeDocs of docs, creating one-doc segments in RAM
   b.  read those one-doc RAM segments and merge them
   c.  write the merged results to a disk segment
   ...
   i.  read mergeFactor first-level disk segments and merge them
   j.  write second-level segments to disk
   ...
   p.  normal disk-based merging thereafter, as necessary

And what I ended up with was:
   A.  read minMergeDocs * mergeFactor docs, and remember them in RAM
   B.  write a segment from all the remembered RAM docs (a modified merge)
   ...
   F.  normal disk-based merging thereafter, as necessary

In essence, I eliminated that first level merge, one that involved lots
and
lots of teeny-weeny I/O operations that were very inefficient.

In my case, steps A and B worked on 70,000 documents instead of 1,000.
Remembering all those docs required a lot of RAM (almost 2GB), but it
almost
tripled indexing performance.  Later, I had to knock the 70 down to 35
(maybe
because my docs got a lot bigger but I don't remember now), but you get
the
idea.  I couldn't use a mergeFactor of 70,000 because that's way more file
descriptors than I could have without recompiling the kernel (I seem to
remember my limit being 1,024, and each segment took 14 file descriptors).

Hope it helps.

--MDC

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Aggregating category hits

2006-06-14 Thread Peter Keegan

The performance results in my previous posting were based on an
implementation that performs 2 searches, one for getting 'Hits' and another
for getting the BitSet. I reimplemented this in one search using the code in
'SolrIndexSearcher.getDocListAndSetNC' and I'm now getting throughput of
350-375 qps.

This is great stuff Solr guys! I'd love to see the DocSet and DocList
features added to Lucene's IndexSearcher.

Peter

On 6/12/06, Peter Keegan [EMAIL PROTECTED] wrote:


I'm seeing query throughput of approx. 290 qps with OpenBitSet vs. 270
with BitSet. I had to reduce the max. HashDocSet size to 2K - 3K (from
10K-20K) to get optimal tradeoff.

no. docs in index: 730,000
average no. results returned: 40
average response time: 50 msec (15-20 for counting facets)
no. facets: 100 on every query

I'm not using the Solr server as we have already developed an
infrastructure.

Peter



On 6/10/06, Yonik Seeley [EMAIL PROTECTED] wrote:

 On 6/9/06, Peter Keegan [EMAIL PROTECTED] wrote:
  However, my throughput testing shows that the Solr method is at least
 50%
  faster than mine. I'm seeing a big win with the use of the HashDocSet
 for
  lower hit counts. On my 64-bit platform, a MAX_SIZE value of 10K-20K
 seems
  to provide optimal performance.

 Interesting... how many documents are in your collection?
 It would prob be nice to make the HashDocSet cutt-off dynamic rather
 than fixed.
 Are you using Solr, or just some of it's code?

   I'm looking forward to trying this with
  OpenBitSet.

 I checked in the OpenBitSet changes today.  I imagine this will lower
 the optimal max HashDocSet size for performance a little.  You might
 not see much performance improvement if most of the intersections
 involved a HashDocSet... the OpenBitSet improvements only kick in with
 bitset-bitset intersection counts.

 -Yonik
 http://incubator.apache.org/solr Solr, the open-source Lucene search
 server

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





Re: Lucene 2.0.1 release date

2006-10-18 Thread Peter Keegan

This makes it relatively safe for people to grab a snapshot of the trunk
with less concern about latent bugs.

I think the concern is that if we start doing this stuff on trunk now,
people that are accustomed to snapping from the trunk might be surprised,
and not in a good way.

+1 on this. There are some great performance improvements in 2.0.1

Peter

On 10/17/06, Steven Parkes [EMAIL PROTECTED] wrote:


I think the idea is that 2.0.1 would be a patch-fix release from the
branch created at 2.0 release. This release would incorporate only
back-ported high-impact patches, where high-impact is defined by the
community. Certainly security vulnerabilities would be included. As Otis
said, to date, nobody seems to have raised any issues to that level.

2.1 will include all the patches and new features that have been
committed since 2.0; there've been a number of these. But releases are
done pretty ad hoc at this point and there hasn't been anyone that has
expressed strong interest in (i.e., lobbied for) a release.

There was a little discussion on this topic at the ApacheCon BOF. For a
number of reasons, the Lucene Java trunk has been kept pretty stable,
with a relatively few number of large changes. This makes it relatively
safe for people to grab a snapshot of the trunk with less concern about
latent bugs. I don't know how many people/projects are doing this rather
than sticking with 2.0.

Keeping the trunk stable doesn't provide an obvious place to start
working on things that people may want to work on and share but at the
same time want to allow to percolate for a while. I think the concern is
that if we start doing this stuff on trunk now, people that are
accustomed to snapping from the trunk might be surprised, and not in a
good way. Nobody wants that.

So releases can be about both what people want (getting features out)
and allowing a bit more instability in trunk. That is, if the community
wants that.

Food for thought and/or discussion?

-Original Message-
From: George Aroush [mailto:[EMAIL PROTECTED]
Sent: Sunday, October 15, 2006 5:15 PM
To: java-user@lucene.apache.org
Subject: RE: Lucene 2.0.1 release date

Thanks for the reply Otis.

I looked at the CHANGES.txt file and saw quite a bit of changes.  For my
port from Java to C#, I can't rely on the trunk code as it (to my knowledge)
changes on a monthly basis if not weekly.  What I need is an official
release so that I can use it as the port point.

Regards,

-- George Aroush


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Sunday, October 15, 2006 12:41 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene 2.0.1 release date

I'd have to check CHANGES.txt, but I don't think that many bugs have
been
fixed and not that many new features added that anyone is itching for a
new
release.

Otis

- Original Message 
From: George Aroush [EMAIL PROTECTED]
To: java-dev@lucene.apache.org; java-user@lucene.apache.org
Sent: Saturday, October 14, 2006 10:32:47 AM
Subject: RE: Lucene 2.0.1 release date

Hi folks,

Sorry for reposting this question (see original email below) and this
time
to both mailing list.

If anyone can tell me what is the plan for Lucene 2.0.1 release, I would
appreciate it very much.

As some of you may know, I am the porter of Lucene to Lucene.Net; knowing
when 2.0.1 will be released will help me plan things out.

Regards,

-- George Aroush


-Original Message-
From: George Aroush [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 12, 2006 12:07 AM
To: java-dev@lucene.apache.org
Subject: Lucene 2.0.1 release date

Hi folks,

What's the plan for Lucene 2.0.1 release date?

Thanks!

-- George Aroush


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Announcement: Lucene powering Monster job search index (Beta)

2006-10-27 Thread Peter Keegan

I am pleased to announce the launch of Monster's new job search Beta web
site, powered by Lucene, at: http://jobsearch.beta.monster.com (notice the
Lucene logo at the bottom of the page!).

The jobs index is implemented with Java Lucene 2.0 on 64-bit Windows (AMD
and Intel processors)

Here are some of the new features:

1. 'Improve your search by'...

The job search results page allows you to browse and 'drill down' through
the results by job category, status, type and salary. The number of matching
jobs in each facet is displayed. There will likely be many more facets to
browse by in the future.

This feature is currently implemented with a custom HitCollector and the
DocSet class from Solr.

2. 'More like this'

Find more jobs like the one you see by clicking on the 'MORE LIKE THIS'
link, which is visible when you hover the mouse over the job title.

This feature is implemented with Lucene's term vectors and the
'MoreLikeThis' contribution class. If you are in 'detailed view', the term
vectors from the job description are used. In 'brief' view, the job title's
term vectors are used.

3. 'Related Titles'

When you do a 'keywords' search, click on a 'related titles' link to filter
your search by similar job titles.

This feature is implemented via a separate Lucene.Net index.

4. Sort by 'Miles'

Find jobs close to you via zip code/radius search. In the search results
page, click on the 'Miles' column to sort the results by distance from your
zip code/radius.

This custom sorting feature is implemented via Lucene's
'SortComparatorSource' interface.

5. Search by date, salary, distance.

Find jobs posted in the last day (or 2,3, etc) or by salary range or
distance.

Numeric range search is one of Lucene's weak points (performance-wise) so we
have implemented this with a custom HitCollector and an extension to the
Lucene index files that stores the numeric field values for all documents.

It is important to point out that this has all been implemented with the
stock Lucene 2.0 library. No code changes were made to the Lucene core.

If you have any feedback regarding the UI, please use the link on the web
page (send us your feedback). You can hit me with any other
questions/comments.

Peter


Re: Announcement: Lucene powering Monster job search index (Beta)

2006-10-27 Thread Peter Keegan

On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote:


Hi, Peter,

Really great job!



Thanks. (I'll tell the team)

I am interested to know how you implemented 4. Sort by 'Miles'. For

example, if starting from a zip code, how to match items within 20
miles?



I can tell you how we use Lucene to accomplish this.
At indexing time, each job's location is indexed as a special field. How you
represent the location is up to you. Each time a new index is built the
location data for all documents in the index are fetched via TermEnum and
TermDocs. This is practical because the searcher refresh is done at
predictable times. At query time, a custom SortComparatorSource is created,
using the 'reference' location (the zip/radius). The 'compare' method
performs the calculation to compare the 2 documents' location values (saved
from above) to the reference location.
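
A sketch of that kind of comparator (the per-doc location arrays and the
distance math are placeholders for whatever is loaded at refresh time):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.search.SortComparatorSource;
import org.apache.lucene.search.SortField;

// Built once per query, around that query's reference location.
public class DistanceComparatorSource implements SortComparatorSource {
  private final int refX, refY; // reference location (e.g. from the zip code)
  private final int[] x, y;     // per-doc locations, loaded at searcher refresh

  public DistanceComparatorSource(int refX, int refY, int[] x, int[] y) {
    this.refX = refX; this.refY = refY; this.x = x; this.y = y;
  }

  public ScoreDocComparator newComparator(IndexReader reader, String field)
      throws IOException {
    return new ScoreDocComparator() {
      public int compare(ScoreDoc i, ScoreDoc j) {
        long di = distSq(i.doc), dj = distSq(j.doc);
        return di < dj ? -1 : (di > dj ? 1 : 0);
      }
      public Comparable sortValue(ScoreDoc i) { return new Long(distSq(i.doc)); }
      public int sortType() { return SortField.CUSTOM; }
    };
  }

  private long distSq(int doc) { // squared distance is enough for ordering
    long dx = x[doc] - refX, dy = y[doc] - refY;
    return dx * dx + dy * dy;
  }
}
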

I believe this can also be accomplished with Solr's FunctionQuery, but I
haven't tried that yet.

Peter

--

Chris Lu
-
Instant Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com

On 10/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
 I am pleased to announce the launch of Monster's new job search Beta web
 site, powered by Lucene, at: http://jobsearch.beta.monster.com (notice
the
 Lucene logo at the bottom of the page!).

 The jobs index is implemented with Java Lucene 2.0 on 64-bit Windows
(AMD
 and Intel processors)

 Here are some of the new features:

 1. 'Improve your search by'...

 The job search results page allows you to browse and 'drill down'
through
 the results by job category, status, type and salary. The number of
matching
 jobs in each facet is displayed. There will likely be many more facets
to
 browse by in the future.

 This feature is currently implemented with a custom HitCollector and the
 DocSet class from Solr.

 2. 'More like this'

 Find more jobs like the one you see by clicking on the 'MORE LIKE THIS'
 link, which is visible when you hover the mouse over the job title.

 This feature is implemented with Lucene's term vectors and the
 'MoreLikeThis' contribution class. If you are in 'detailed view', the
term
 vectors from the job description are used. In 'brief' view, the job
title's
 term vectors are used.

 3. 'Related Titles'

 When you do a 'keywords' search, click on a 'related titles' link to
filter
 you search by similar job titles.

 This feature is implemented via a separate Lucene.Net index.

 4. Sort by 'Miles'

 Find jobs close to you via zip code/radius search. In the search results
 page, click on the 'Miles' column to sort the results by distance from
your
 zip code/radius.

 This custom sorting feature is implemented via Lucene's
 'SortComparatorSource' interface.

 5. Search by date, salary, distance.

 Find jobs posted in the last day (or 2,3, etc) or by salary range or
 distance.

 Numeric range search is one of Lucene's weak points (performance-wise)
so we
 have implemented this with a custom HitCollector and an extension to the
 Lucene index files that stores the numeric field values for all
documents.

 It is important to point out that this has all been implemented with the
 stock Lucene 2.0 library. No code changes were made to the Lucene core.

 If you have any feedback regarding the UI, please use the link on the
web
 page (send us your feedback). You can hit me with any other
 questions/comments.

 Peter



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Announcement: Lucene powering Monster job search index (Beta)

2006-10-30 Thread Peter Keegan

Otis,

The Lucene components for this beta are running on 4 dual core AMD Opteron (
2.6GHZ) processors, for a total of 8 CPUs. It has 32GB RAM, although 16GB
would probably suffice. The query rate is currently quite low probably
because of the low visibility of the beta page. We haven't measured QPS
rates for this configuration, yet, but if you look at some of my previous
posts, you'll see some QPS data on somewhat similar hardware. I think that
actual rates will be lower, though, because the complexity of the queries,
counting, sorting, etc have increased.

Peter

On 10/28/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:


Hi,

--- Peter Keegan [EMAIL PROTECTED] wrote:

 On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote:
 
  Hi, Peter,
 
  Really great job!


 Thanks. (I'll tell the team)

If it's not a secret, can you tell us a bit more about what's behind
the search in terms of hardware, and how much pounding that hardware
takes in terms of QPS?  People always ask about this stuff.

Thanks,
Otis


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Announcement: Lucene powering Monster job search index (Beta)

2006-10-30 Thread Peter Keegan

Alex,

I like your suggestion (I've found myself wondering what the last search
was, too), and I've forwarded it to the UI developer.

Thanks,
Peter


On 10/29/06, Alexandru Popescu [EMAIL PROTECTED] wrote:


Peter, it looks impressive. Congrats! A small suggestion, though: after
performing a search, the filtering criteria are not displayed anywhere.
I guess it would make sense to write them in a read-only form somewhere
on the result pages:

Jobs 1-50 of 7896 matches could become Jobs 1-50 of 7896 matching criteria
(with a small hidden element showing the criteria).

./alex
--
.w( the_mindstorm )p.


On 10/29/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:
 Hi,

 --- Peter Keegan [EMAIL PROTECTED] wrote:

  On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote:
  
   Hi, Peter,
  
   Really great job!
 
 
  Thanks. (I'll tell the team)

 If it's not a secret, can you tell us a bit more about what's behind
 the search in terms of hardware, and how much pounding that hardware
 takes in terms of QPS?  People always ask about this stuff.

 Thanks,
 Otis


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Announcement: Lucene powering Monster job search index (Beta)

2006-10-30 Thread Peter Keegan

Joe,

Fields with numeric values are stored in a separate file as binary values in
an internal format. Lucene is unaware of this file and unaware of the range
expression in the query. The range expression is parsed outside of Lucene
and used in a custom HitCollector to filter out documents that aren't in the
requested range(s). A goal was to do this without having to modify Lucene.
Our scheme is pretty efficient, but not very general purpose in its current
form, though.
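
In outline, the filtering side looks something like this (a sketch; the
per-doc value array stands in for whatever is read from the external file):

import org.apache.lucene.search.HitCollector;

// Keeps only hits whose externally stored numeric value falls in [min, max];
// 'values' is indexed by doc id and loaded from the side file at refresh time.
public class RangeFilteringCollector extends HitCollector {
  private final int[] values;
  private final int min, max;
  private final HitCollector delegate; // e.g. a top-docs or counting collector

  public RangeFilteringCollector(int[] values, int min, int max, HitCollector delegate) {
    this.values = values;
    this.min = min;
    this.max = max;
    this.delegate = delegate;
  }

  public void collect(int doc, float score) {
    int v = values[doc];
    if (v >= min && v <= max)
      delegate.collect(doc, score);
  }
}
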

Peter


On 10/30/06, Joe Shaw [EMAIL PROTECTED] wrote:


Hi Peter,

On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote:
 Numeric range search is one of Lucene's weak points (performance-wise)
so we
 have implemented this with a custom HitCollector and an extension to the
 Lucene index files that stores the numeric field values for all
documents.

 It is important to point out that this has all been implemented with the
 stock Lucene 2.0 library. No code changes were made to the Lucene core.

Can you give some technical details on the extension to the Lucene index
files?  How did you do it without making any changes to the Lucene core?

Thanks,
Joe


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Announcement: Lucene powering Monster job search index (Beta)

2006-10-30 Thread Peter Keegan

KEGan,


When you search by 4. Sort by Miles, I suppose the sorting by relevance
(of the search keyword) is lost? Since this is implemented using a custom
SortComparatorSource.


Sorting by miles becomes the primary sort key, score and date become
secondary sort fields (in the case of ties).


Also, I suppose, if FunctionQuery were used, we can make job distance by
miles part of the relevancy of the search results?


Yes, this is my understanding of the power of FunctionQuery.

Peter

On 10/30/06, KEGan [EMAIL PROTECTED] wrote:


Peter,

Congratulation on the beta launch :)

If you dont mind, I would like to ask you more on the feature 4. Sort by
Miles.

When you search by 4. Sort by Miles, I suppose the sorting by relevance
(of the search keyword) is lost? Since this is implemented using a custom
SortComparatorSource.

Also, I suppose, if FunctionQuery were used, we can make job distance by
miles part of the relevancy of the search results?

Could you comment or confirm my assertion ? Thanks :)


On 10/28/06, Peter Keegan [EMAIL PROTECTED] wrote:

 On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote:
 
  Hi, Peter,
 
  Really great job!


 Thanks. (I'll tell the team)

 I am interested to know how you implemented 4. Sort by 'Miles'. For
  example, if starting from a zip code, how to match items within 20
  miles?


 I can tell you how we use Lucene to accomplish this.
 At indexing time, each job's location is indexed as a special field. How
 you
 represent the location is up to you. Each time a new index is built the
 location data for all documents in the index are fetched via TermEnum
and
 TermDocs. This is practical because the searcher refresh is done at
 predictable times. At query time, a custom SortComparatorSource is
 created,
 using the 'reference' location (the zip/radius). The 'compare' method
 performs the calculation to compare the 2 documents' location values
 (saved
 from above) to the reference location.

 I believe this can also be accomplished with Solr's FunctionQuery, but I
 haven't tried that yet.

 Peter

 --
  Chris Lu
  -
  Instant Full-Text Search On Any Database/Application
  site: http://www.dbsight.net
  demo: http://search.dbsight.com
 
  On 10/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
   I am pleased to announce the launch of Monster's new job search Beta
 web
   site, powered by Lucene, at: http://jobsearch.beta.monster.com(notice
  the
   Lucene logo at the bottom of the page!).
  
   The jobs index is implemented with Java Lucene 2.0 on 64-bit Windows
  (AMD
   and Intel processors)
  
   Here are some of the new features:
  
   1. 'Improve your search by'...
  
   The job search results page allows you to browse and 'drill down'
  through
   the results by job category, status, type and salary. The number of
  matching
   jobs in each facet is displayed. There will likely be many more
facets
  to
   browse by in the future.
  
   This feature is currently implemented with a custom HitCollector and
 the
   DocSet class from Solr.
  
   2. 'More like this'
  
   Find more jobs like the one you see by clicking on the 'MORE LIKE
 THIS'
   link, which is visible when you hover the mouse over the job title.
  
   This feature is implemented with Lucene's term vectors and the
   'MoreLikeThis' contribution class. If you are in 'detailed view',
the
  term
   vectors from the job description are used. In 'brief' view, the job
  title's
   term vectors are used.
  
   3. 'Related Titles'
  
   When you do a 'keywords' search, click on a 'related titles' link to
  filter
   you search by similar job titles.
  
   This feature is implemented via a separate Lucene.Net index.
  
   4. Sort by 'Miles'
  
   Find jobs close to you via zip code/radius search. In the search
 results
   page, click on the 'Miles' column to sort the results by distance
from
  your
   zip code/radius.
  
   This custom sorting feature is implemented via Lucene's
   'SortComparatorSource' interface.
  
   5. Search by date, salary, distance.
  
   Find jobs posted in the last day (or 2,3, etc) or by salary range or
   distance.
  
   Numeric range search is one of Lucene's weak points
(performance-wise)
  so we
   have implemented this with a custom HitCollector and an extension to
 the
   Lucene index files that stores the numeric field values for all
  documents.
  
   It is important to point out that this has all been implemented with
 the
   stock Lucene 2.0 library. No code changes were made to the Lucene
 core.
  
   If you have any feedback regarding the UI, please use the link on
the
  web
   page (send us your feedback). You can hit me with any other
   questions/comments.
  
   Peter
  
  
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 






Re: Announcement: Lucene powering Monster job search index (Beta)

2006-11-03 Thread Peter Keegan

Paramasivam,

Take a look at Solr, in particular the DocSetHitCollector class. The
collector simply sets a bit in a BitSet, or saves the docIds in an array
(for low hit counts). Solr's BitSet was optimized (by Yonik, I believe) to
be faster than Java's BitSet, so this HitCollector is very fast. This is
essentially what we are doing for counting.

Peter

On 11/2/06, Paramasivam Srinivasan [EMAIL PROTECTED] wrote:


Hi Peter

When I use the custom HitCollector, it affects the application performance.
Also, how do you accomplish grouping the results without affecting
performance? And if possible, could you give some code snippet for the custom
hit collector?

TIA

Sri

Peter Keegan [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 Joe,

 Fields with numeric values are stored in a separate file as binary
values
 in
 an internal format. Lucene is unaware of this file and unaware of the
 range
 expression in the query. The range expression is parsed outside of
Lucene
 and used in a custom HitCollector to filter out documents that aren't in
 the
 requested range(s). A goal was to do this without having to modify
Lucene.
 Our scheme is pretty efficient, but not very general purpose in its
 current
 form, though.

 Peter


 On 10/30/06, Joe Shaw [EMAIL PROTECTED] wrote:

 Hi Peter,

 On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote:
  Numeric range search is one of Lucene's weak points
(performance-wise)
 so we
  have implemented this with a custom HitCollector and an extension to
  the
  Lucene index files that stores the numeric field values for all
 documents.
 
  It is important to point out that this has all been implemented with
  the
  stock Lucene 2.0 library. No code changes were made to the Lucene
core.

 Can you give some technical details on the extension to the Lucene
index
 files?  How did you do it without making any changes to the Lucene
core?

 Thanks,
 Joe


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Announcement: Lucene powering Monster job search index (Beta)

2006-11-03 Thread Peter Keegan

Daniel,
Yes, this is correct if you happen to be doing a radius search and sorting
by mileage.
Peter

On 11/3/06, Daniel Rosher [EMAIL PROTECTED] wrote:


Hi Peter,

Does this mean you are calculating the euclidean distance twice ... once for
the HitCollector to filter 'out of range' documents, and then again for the
custom Comparator to sort the returned documents, especially since the
filtering is done outside Lucene?

Regards,
Dan


Joe,

Fields with numeric values are stored in a separate file as binary values
in
an internal format. Lucene is unaware of this file and unaware of the
range
expression in the query. The range expression is parsed outside of Lucene
and used in a custom HitCollector to filter out documents that aren't in
the
requested range(s). A goal was to do this without having to modify
Lucene.
Our scheme is pretty efficient, but not very general purpose in its
current
form, though.

Peter


On 10/30/06, Joe Shaw [EMAIL PROTECTED] wrote:

 Hi Peter,

 On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote:
  Numeric range search is one of Lucene's weak points
(performance-wise)
 so we
  have implemented this with a custom HitCollector and an extension to
the
  Lucene index files that stores the numeric field values for all
 documents.
 
  It is important to point out that this has all been implemented with
the
  stock Lucene 2.0 library. No code changes were made to the Lucene
core.

 Can you give some technical details on the extension to the Lucene
index
 files?  How did you do it without making any changes to the Lucene
core?

 Thanks,
 Joe


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]






Re: Announcement: Lucene powering Monster job search index (Beta)

2007-01-28 Thread Peter Keegan

Correction:
We only do the euclidean computation during sorting. For filtering, a simple
bounding box is computed to approximate the radius, and 2 range comparisons
are made to exclude documents. Because these comparisons are done outside of
Lucene as integer comparisons, it is pretty fast. With 13000 results, the
search time with distance sort is about 200 msec (compared to 30 ms for a
simple non-radius, date-sorted keyword search).
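
In code, the pre-filter is just a pair of cheap comparisons per hit (a
sketch; the box half-width is derived from the requested radius):

// Bounding-box pre-filter: reject anything outside the square enclosing the
// radius; only the survivors ever see the exact distance math in the sort.
static boolean inBoundingBox(int docX, int docY, int refX, int refY, int halfWidth) {
  return Math.abs(docX - refX) <= halfWidth
      && Math.abs(docY - refY) <= halfWidth;
}
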

Peter

On 1/27/07, no spam [EMAIL PROTECTED] wrote:


Isn't it extremely inefficient to do the euclidean distance twice?
Perhaps not a huge deal for a small search result set.  I at times have
13,000 results that match my search terms in an index with 1.2 million
docs.

Can't you do some simple radian math first to ensure it's way out of bounds,
then do the euclidean distance for the subset within bounds?  I'm currently
only doing the distance calc once (post hit collector). I don't have any
performance numbers with the double vs single distance calc.

I'm still working out the sort by radius myself.

Mark

On 11/3/06, Peter Keegan [EMAIL PROTECTED] wrote:

 Daniel,
 Yes, this is correct if you happen to be doing a radius search and
sorting
 by mileage.
 Peter






Re: Announcement: Lucene powering Monster job search index (Beta)

2007-01-30 Thread Peter Keegan

Mark,

I'm sorry to hear that you weren't able to get to the job search site today.
I heard of a problem, but I can assure you that it had nothing to do with
Lucene and our back end tiers. Can you tell me what you think is lacking for
job search among the big boards? There is clearly a lot of room for
improvement.
How is the performance of your distance search and sort?

Peter


On 1/30/07, no spam [EMAIL PROTECTED] wrote:


This is very similar to what I do.  I use a hit collector to gather the
results, then filter outside a bounding box, then calculate the euclidean
distance.

Last time I tried to check your search it was down.  We were talking the
other day at work how job search was lacking among the big boards.  I'm
excited to check out your new page.

Mark

On 1/28/07, Peter Keegan [EMAIL PROTECTED] wrote:

 Correction:
 We only do the euclidan computation during sorting. For filtering, a
 simple
 bounding box is computed to approximate the radius, and 2 range
 comparisons
 are made to exclude documents. Because these comparisons are done
outside
 of
 Lucene as integer comparisons, it is pretty fast. With 13000 results,
the
 seach time with distance sort is about 200 msec (compared to 30 ms for a
 simple non-radius, date-sorted keyword search).

 Peter

 On 1/27/07, no spam [EMAIL PROTECTED] wrote:
 
  Isn't this extremely ineffecient to do the euclidean distance twice?
  Perhaps not a huge deal if a small search result set.  I at times have
  13,000 results that match my search terms of an index with 1.2 million
  docs.
 
  Can't you do some simple radian math first to ensure it's way out of
  bounds,
  then do the euclidian distance for the subset within bounds?  I'm
  currently
  only doing the distance calc once (post hit collector). I don't have
any
  performance numbers with the double vs single distance calc.
 
  I'm still working out the sort by radius myself.
 
  Mark
 
  On 11/3/06, Peter Keegan [EMAIL PROTECTED] wrote:
  
   Daniel,
   Yes, this is correct if you happen to be doing a radius search and
  sorting
   by mileage.
   Peter
  
  
 
 






bad queryparser bug

2007-02-01 Thread Peter Keegan

I have discovered a serious bug in QueryParser. The following query:
contents:sales && contents:marketing || contents:industrial &&
contents:sales

is parsed as:
+contents:sales +contents:marketing +contents:industrial +contents:sales

The same parsed query occurs even with parenthesis:
(contents:sales && contents:marketing) || (contents:industrial &&
contents:sales)

Is there any way around this bug?

Thanks,
Peter


Re: bad queryparser bug

2007-02-01 Thread Peter Keegan

Correction:

The query parser produces the correct query with the parenthesis.
But, I'm still looking for a fix for this. I could use some advice on where
to look in QueryParser to fix this.

Thanks,
Peter

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote:


I have discovered a serious bug in QueryParser. The following query:
contents:sales && contents:marketing || contents:industrial &&
contents:sales

is parsed as:
+contents:sales +contents:marketing +contents:industrial +contents:sales

The same parsed query occurs even with parenthesis:
(contents:sales && contents:marketing) || (contents:industrial &&
contents:sales)

Is there any way around this bug?

Thanks,
Peter




Re: bad queryparser bug

2007-02-01 Thread Peter Keegan

OK, I see that I'm not the first to discover this behavior of QueryParser.
Can anyone vouch for the integrity of the PrecedenceQueryParser here:

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/miscellaneous/src/java/org/apache/lucene/queryParser/precedence/

Thanks,
Peter
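
If it does hold up, usage should mirror the stock parser. A sketch, assuming
the contrib class keeps the same constructor and parse() signatures as
QueryParser:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.precedence.PrecedenceQueryParser;
import org.apache.lucene.search.Query;

public class PrecedenceParseTest {
  public static Query parse(String text) throws Exception {
    PrecedenceQueryParser parser =
        new PrecedenceQueryParser("contents", new StandardAnalyzer());
    return parser.parse(text);
  }
}
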

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote:


Correction:

The query parser produces the correct query with the parenthesis.
But, I'm still looking for a fix for this. I could use some advice on
where to look in QueryParser to fix this.

Thanks,
Peter

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote:

 I have discovered a serious bug in QueryParser. The following query:
 contents:sales && contents:marketing || contents:industrial &&
 contents:sales

 is parsed as:
 +contents:sales +contents:marketing +contents:industrial +contents:sales


 The same parsed query occurs even with parenthesis:
 (contents:sales && contents:marketing) || (contents:industrial &&
 contents:sales)

 Is there any way around this bug?

 Thanks,
 Peter





Re: bad queryparser bug

2007-02-02 Thread Peter Keegan

(If i could go back in time and stop the AND/OR/NOT/&&/|| aliases from
being added to the QueryParser -- i would)


Yes, this is the cause of the confusion. Our users are accustomed to the
boolean logic syntax from a legacy search engine (also common to many other
engines). We'll have to convert these queries into native QueryParser syntax
where possible.

Sorry for the cross post.

Thanks,
Peter

On 2/2/07, Chris Hostetter [EMAIL PROTECTED] wrote:



: The query parser produces the correct query with the parenthesis.
: But, I'm still looking for a fix for this. I could use some advice on
where
: to look in QueryParser to fix this.

the best advice i can give you: don't use the binary operators.

  * Lucene is not a boolean logic system
  * BooleanQuery does not implement boolean logic
  * QueryParser is not a boolean language parser

(If i could go back in time and stop the AND/OR/NOT/&&/|| aliases from
being added to the QueryParser -- i would)



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: relevancy buckets and secondary searching

2007-02-05 Thread Peter Keegan

Hi Erick,

The timing of your posting is ironic because I'm currently working on the
same issue. Here's a solution that I'm going to try:

Use a HitCollector with a PriorityQueue to sort all hits by raw Lucene
score, ignoring the secondary sort field.

After the search, re-sort just the hits from the queue above (500 in your
case) with a FieldSortedHitQueue that sorts on score, then the secondary
field (title in your case), but 'normalize' the score to your 'user visible'
scores before re-sorting. If your 'normalized' score is computed properly,
this should force the secondary sort to occur and produce the 'proper'
sorting that the user expects.

I think the trick here is in computing the proper normalized score from
Lucene's raw scores, which will vary depending on boosts, etc.

I agree with you that this special relevancy sort is a real hack to
implement!
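
A sketch of that re-sort step (the five bucket values and the secondary
'title' field are placeholders, and the title field must be indexed so it is
sortable via the field cache):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.FieldSortedHitQueue;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

public class BucketResort {
  // Round raw scores into five user-visible buckets, then re-sort by
  // (bucketed score, title) so the secondary sort applies within a bucket.
  public static ScoreDoc[] bucketAndResort(IndexReader reader, TopDocs top)
      throws Exception {
    // top is relevance-sorted, so the first hit carries the max raw score
    float max = top.scoreDocs.length > 0 ? top.scoreDocs[0].score : 1.0f;
    SortField[] sort = {
        SortField.FIELD_SCORE,
        new SortField("title", SortField.STRING) };
    FieldSortedHitQueue queue =
        new FieldSortedHitQueue(reader, sort, top.scoreDocs.length);
    for (int i = 0; i < top.scoreDocs.length; i++) {
      ScoreDoc sd = top.scoreDocs[i];
      // normalize into one of 0.1, 0.3, 0.5, 0.7, 0.9
      float bucket = 0.1f + 0.2f * (int) Math.min(4.0f, 5.0f * sd.score / max);
      queue.insert(new FieldDoc(sd.doc, bucket));
    }
    ScoreDoc[] sorted = new ScoreDoc[queue.size()];
    for (int i = sorted.length - 1; i >= 0; i--)
      sorted[i] = (ScoreDoc) queue.pop(); // pop returns the worst hit first
    return sorted;
  }
}
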


Peter

On 2/5/07, Erick Erickson [EMAIL PROTECTED] wrote:


Am I missing anything obvious here and/or what would folks suggest...

Conceptually, I want to normalize the scores of my documents during a
search
BUT BEFORE SORTING into 5 discrete values, say 0.1, 0.3, 0.5, 0.7, 0.9 and
apply a secondary sort when two documents have the same score. Applying
the
secondary sort is easy, it's massaging the scores that has me stumped.

We have a bunch of documents (30K). Books actually. We only display to the
user 5 different relevance scores, with 5 being the most relevant. So
far,
so good.

Within each quintile, we want to sort by title. So, suppose the following
three books score a hit:

relevance  title
0.98  z
0.94  c
0.79  a

The proper display would be

5   c
5   z
4   a


It's easy enough to do a secondary sort, but that would not give me what I
want. In this case, I'd get...

5   z
5   c
4   a

because the secondary sort only matters if the primary sort is equal. The
user is left scratching her head asking why did two books with the same
relevancy have the titles out of order?.

If I could massage my scores *before* sorts are done, things would be
hunky-dory, but I'm not seeing how to do that. One problem is that until
the
top N documents have been collected, I don't know what the maximum
relevance
is, therefore I don't know how to normalize raw scores. I followed Hoss's
thread where he talks about FakeNorms, but don't see how that applies to
my
problem.

My result sets are strictly limited to  500, so it's not unreasonable to
just get the TopDocs back and aggregate my buckets at that point and sort
them. But of course I only care about this when I am using relevancy as my
primary sort. For sorting on any other fields, I would just let Lucene
take
care of it all. So post-sorting myself leads to really ugly stuff like

if (it's my special relevancy sort) do one thing
else don't do that thing.

repeated wherever I have to sort. Yuck.


And since I'm talking about 500 docs, I don't want to wait until after I
have a Hits object because I'll have to re-query several times. On an 8G
index (and growing).


This almost looks like a HitCollector, but not quite.
This almost looks like a custom Similarity, but not quite since I want to
just let Lucene compute relevance and put that into a bucket.
This almost looks like FakeNorms, but not quite.
This almost looks like about 8 things I tried to make work, but not quite
<g>

So, somebody out there needs to tell me what part of the manual I
overlooked
<g>...

Thanks
Erick



Re: Sorting by Score

2007-02-27 Thread Peter Keegan

Suppose one wanted to use this custom rounding score comparator on all
fields and all queries. How would you get it plugged in most efficiently,
given that SortField requires a non-null field name?

Peter

On 2/1/06, Chris Hostetter [EMAIL PROTECTED] wrote:



: I've not used the sorting code yet, but it looks like you have to
: provide some custom ScoreDocComparator by adding a SortField using the
: SortField(String field, SortComparatorSource comparator) constructor.
: I'm just not certain what you should specify for the field value since
: you really want to just round off the score.
:
: Could someone with more experience using the Sort API clarify whether
: this is possible?

yes, it should be possible, and yes your description of a solution sounds
right ... the only odd thing is you'd be writting a
SortComparatorSource/ScoreDocComparator that would be ignoring the field
it's given, but there's nothing wrong with that.

Round your number to the desired precision, then compare them, and return
0 if they are equal so that the secondary sort (on date in this case) can
take affect.




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Sorting by Score

2007-02-27 Thread Peter Keegan

I'm building up the Sort object for the search with 2 SortFields - first is
for the custom rounded scoring, second is for date. This Sort object is used
to construct a FieldSortedHitQueue which is used with a custom HitCollector.
And yes, this comparator ignores the field name.


hmmm, actually i see now that SortField(String,SortComparatorSource) says

it cannot  be null ... not sure if that's actually enforced or not

The constructor doesn't complain, but FieldSortedHitQueue expects a field
name when it tries to locate the comparator from the cache:

   at org.apache.lucene.search.FieldCacheImpl$Entry.init(
FieldCacheImpl.java:60)
   at org.apache.lucene.search.FieldSortedHitQueue.lookup(
FieldSortedHitQueue.java:157)
   at org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(
FieldSortedHitQueue.java:185)
   at org.apache.lucene.search.FieldSortedHitQueue.init(
FieldSortedHitQueue.java:58)

Peter

On 2/27/07, Chris Hostetter [EMAIL PROTECTED] wrote:



: Suppose one wanted to use this custom rounding score comparator on all
: fields and all queries. How would you get it plugged in most
efficiently,
: given that SortField requires a non-null field name?

i'm not sure i understand the first part of question .. this custom
SortComparatorSource would deal only with the score, it wouldn't matter
what other fields you'd want to make SortFields on to do secondary
sorting. .. You as the client have to specify the Sort obejct when
executing the search, and you can build that Sort object up anyway you
want.

Yes the SortField class has a constructor arg for field, but
as you can see from the javadocs, it can be null in many circumstances
(consider SortFiled#FIELD_SCORE and SortField#FIELD_DOC for instance) ...
hmmm, actually i see now that SortField(String,SortComparatorSource) says
it cannot be null ... not sure if that's actually enforced or not, but
it's no bother -- all that matters is that you don't make any attempt to
use the field name in your SortComparatorSource.


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Sorting by Score

2007-02-28 Thread Peter Keegan

can't you pick any arbitrary marker field name (that's not a real field
name) and use that?


Yes, I could. I guess you're saying that the field name doesn't matter,
except that it's used for caching the comparator, right?
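
Concretely, I picture wiring it up something like this (a hedged sketch; the
marker field name is arbitrary and the rounding comparator is along the lines
of the one sketched earlier in the thread):

Sort sort = new Sort(new SortField[] {
    // the field name here is only a cache key - the comparator ignores it
    new SortField("roundedScore", new RoundedScoreComparatorSource(0.1f)),
    new SortField("date", SortField.STRING, true)  // newest first
});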


... he wants the bucketing to happen as part of the scoring so that the
secondary sort will determine the ordering within the bucket.


Yes, exactly. Couldn't I just do this rounding in the HitCollector, before
inserting it into the FieldSortedHitQueue?



On 2/28/07, Chris Hostetter [EMAIL PROTECTED] wrote:



: The first part was just to iterate through the TopDocs that's available
to
: me and normalize the scores right in the ScoreDocs. Like this...

Won't that be done after Lucene does the hit collecting/sorting? ... he
wants the bucketing to happen as part of the scoring so that the
secondary sort will determine the ordering within the bucket.

(or am i missing something about your description?)




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Sorting by Score

2007-02-28 Thread Peter Keegan

Erick,

Yes, this seems to be the simplest way to implement score 'bucketization',
but wouldn't it be more efficient to do this with a custom ScoreComparator?
That way, you'd do the bucketizing and sorting in one 'step' (compare()).
Maybe the savings isn't measurable, though. A comparator might also allow
one to do a more sophisticated rounding or bucketizing since you'd be
getting 2 scores at a time.

Peter


On 2/28/07, Erick Erickson [EMAIL PROTECTED] wrote:


Empirically, when I insert the elements in the FieldSortedHitQueue
they get sorted according to the Sort object. The original query
that gives me a TopDocs applied
no secondary sorting, only relevancy. Since I normalized
all the scores into one of only 5 discrete values, secondary
sorting was applied to all docs with the same score when I inserted
them in the FieldSortedHitQueue.

Now popping things off the FieldSortedHitQueue gives me the ordering
I want.

You could just operate on the FieldSortedHitQueue at this point, but
I decided the rest of my code would be simpler if I stuffed them back
into the TopDocs, so there's some explanation below that you can
just skip if I've cleared things up already.

*
The step I left out is moving the documents from the
FieldSortedHitQueue back to topDocs.scoreDocs.
So the steps are as follows..

1 bucketize the scores. That is, go through the
TopDocs.scoreDocs and adjust each raw score into
one of my buckets. This is made easy by the
existence of topDocs.getMaxScore. TopDocs has
had no sorting other than relevancy applied so far.

2 assemble the FieldSortedHitQueue by inserting
each element from scoreDocs into it, with a suitable
Sort object, relevance is the first field (SortField.FIELD_SCORE).

3 pop the entries off the FieldSortedHitQueue, overwriting
the elements in topDocs.scoreDocs.

I left out step 3, although I suppose you could
operate directly on the FieldSortedHitQueue.

NOTE: in my case, I just put everything back in the
scoreDocs without attempting any efficiencies. If I
needed more performance, I'd only put as many items
back as I needed to display. But as I wrote yesterday,
performance isn't an issue so there's no point. Although
I know one place to look if we need to squeeze more QPS.

How efficient this is is an open question. But it's fast enough
and relatively simple so I stopped looking for more
efficiencies
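
For anyone following along, a rough sketch of those three steps (2.x-era API;
the bucket count, field name, hit count and the 'reader'/'searcher'/'query'
variables are assumptions for illustration):

TopDocs topDocs = searcher.search(query, null, 1000);
float max = topDocs.getMaxScore();  // assumes at least one positive score

// 1) bucketize the raw scores into 5 discrete values (1..5)
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
  ScoreDoc sd = topDocs.scoreDocs[i];
  sd.score = (float) Math.ceil(5.0f * sd.score / max);
}

// 2) re-sort: bucketized relevance first, then date
SortField[] fields = new SortField[] {
    SortField.FIELD_SCORE,
    new SortField("date", SortField.STRING, true) };  // newest first
FieldSortedHitQueue hq =
    new FieldSortedHitQueue(reader, fields, topDocs.scoreDocs.length);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
  hq.insert(new FieldDoc(topDocs.scoreDocs[i].doc, topDocs.scoreDocs[i].score));
}

// 3) pop the entries back into scoreDocs; pop() hands back the worst
//    entry first, so fill the array from the end
for (int i = hq.size() - 1; i >= 0; i--) {
  topDocs.scoreDocs[i] = (ScoreDoc) hq.pop();
}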

Erick

On 2/28/07, Chris Hostetter [EMAIL PROTECTED] wrote:


 : The first part was just to iterate through the TopDocs that's
available
 to
 : me and normalize the scores right in the ScoreDocs. Like this...

 Won't that be done after Lucene does the hit collecting/sorting? ...
he
 wants the bucketing to happen as part of the scoring so that the
 secondary sort will determine the ordering within the bucket.

 (or am i missing something about your description?)




 -Hoss


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





Re: Sorting by Score

2007-03-01 Thread Peter Keegan

Erick,

I think you're right because you wouldn't know the max score before the
comparisons. I'm just thinking about a rounding algorithm that involves
comparing the raw scores to the theoretical maximum score, which I think
could be computed from the Similarity class and knowing the max boost value
used during indexing.

Peter

On 3/1/07, Erick Erickson [EMAIL PROTECTED] wrote:


Peter:

About a custom ScoreComparator. The problem I couldn't get past was that I
needed to know the max score of all the docs in order to divide the raw
scores into quintiles since I was dealing with raw scores. I didn't see
how
to make that work with ScoreComparator, but I confess that I didn't look
very hard after someone on the list turned me on to
FieldSortedHitQueue

Erick

On 2/28/07, Erick Erickson [EMAIL PROTECTED] wrote:

 It may well be, but as I said this is efficient enough for my needs
 so I didn't pursue it. One of my pet peeves is spending time making
 things more efficient when there's no need, and my index isn't
 going to grow enough to worry about that now <G>...

 Erick

 On 2/28/07, Peter Keegan  [EMAIL PROTECTED] wrote:
 
  Erick,
 
  Yes, this seems to be the simplest way to implement score
  'bucketization',
  but wouldn't it be more efficient to do this with a custom
  ScoreComparator?
  That way, you'd do the bucketizing and sorting in one 'step'
  (compare()).
  Maybe the savings isn't measurable, though. A comparator might also
  allow
  one to do a more sophisticated rounding or bucketizing since you'd be
  getting 2 scores at a time.
 
  Peter
 
 
  On 2/28/07, Erick Erickson [EMAIL PROTECTED]  wrote:
  
   Empirically, when I insert the elements in the FieldSortedHitQueue
   they get sorted according to the Sort object. The original query
   that gives me a TopDocs applied
   no secondary sorting, only relevancy. Since I normalized
    all the scores into one of only 5 discrete values, secondary
   sorting was applied to all docs with the same score when I inserted
   them in the FieldSortedHitQueue.
  
    Now popping things off the FieldSortedHitQueue gives me the ordering
    I want.
  
   You could just operate on the FieldSortedHitQueue at this point, but
   I decided the rest of my code would be simpler if I stuffed them
back
   into the TopDocs, so there's some explanation below that you can
   just skip if I've cleared things up already.
  
   *
   The step I left out is moving the documents from the
    FieldSortedHitQueue back to topDocs.scoreDocs.
   So the steps are as follows..
  
   1 bucketize the scores. That is, go through the
   TopDocs.scoreDocs and adjust each raw score into
   one of my buckets. This is made easy by the
   existence of topDocs.getMaxScore . TopDocs has
   had no sorting other than relevancy applied so far.
  
   2 assemble the FieldSortedHitQueue by inserting
   each element from scoreDocs into it, with a suitable
   Sort object, relevance is the first field ( SortField.FIELD_SCORE).
  
   3 pop the entries off the FieldSortedHitQueue, overwriting
   the elements in topDocs.scoreDocs.
  
   I left out step 3, although I suppose you could
   operate directly on the FieldSortedHitQueue.
  
   NOTE: in my case, I just put everything back in the
   scoreDocs without attempting any efficiencies. If I
   needed more performance, I'd only put as many items
   back as I needed to display. But as I wrote yesterday,
   performance isn't an issue so there's no point. Although
   I know one place to look if we need to squeeze more QPS.
  
   How efficient this is is an open question. But it's fast enough
   and relatively simple so I stopped looking for more
   efficiencies
  
   Erick
  
   On 2/28/07, Chris Hostetter [EMAIL PROTECTED]  wrote:
   
   
: The first part was just to iterate through the TopDocs that's
   available
to
 : me and normalize the scores right in the ScoreDocs. Like this...
   
 Won't that be done after Lucene does the
 hit collecting/sorting?
  ...
   he
 wants the bucketing to happen as part of the scoring so that the
secondary sort will determine the ordering within the bucket.
   
(or am i missing something about your description?)
   
   
   
   
-Hoss
   
   
   
  -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   
   
  
 





Re: Lucene Ranking/scoring

2007-03-08 Thread Peter Keegan

I'm looking at how ReciprocalFloatFunction and ReverseOrdFieldSource can be
used to rank documents by score and date (solr.search.function contains
great stuff!). The values in the date field that are used for the
ValueSource are not actually used as 'floats', but rather their ordinal term
values from the FieldCache string index. This means that if the 'date' field
has 3000 unique string 'values' in the index, the values for 'x' in
ReciprocalFloatFunction could be 0-2999. So if I want the most recent 'date'
to return a score of 1.0, one could set 'a' and 'b' in the function to
2999.

Do I have this right? I got a bit confused at first because I assumed that the
actual field values were being used in the computation, but you really need
to know the unique term count in order to get the score 'right'.
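
In other words, assuming ReciprocalFloatFunction computes f(x) = a / (m*x + b)
and rord(date) runs from about 0 for the newest date to 2999 for the oldest
(3000 unique values), the arithmetic works out like this:

float m = 1f, a = 2999f, b = 2999f;
float newest = a / (m * 0f + b);     // 2999 / 2999 = 1.0
float oldest = a / (m * 2999f + b);  // 2999 / 5998 = ~0.5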

By the way, as I try to get my head around the Score, Weight, and Boolean*
classes (and next(), skipTo()), I nominate these for discussion in Lucene In
Action II.

Peter

On 3/9/06, Yonik Seeley [EMAIL PROTECTED] wrote:


On 3/9/06, Yang Sun [EMAIL PROTECTED] wrote:
 Hi Yonik,
 Thanks very much for your suggestion. The query boost works great for
 keyword matching. But in my case, I need to rank the results by date and
 title. For example, title:foo^2 abstract:foo^1.5 date:2004^3 will only
boost
 the document with date=2004. What I need is boosting the distance from
the
 specified date

If all you need to do is boost more recent documents (and a single
fixed boost will always work), then you can do that boosting at index
time.

 which means 2003 will have a better ranking than 2002,
2002>2001, etc.
 I implemented a customized ScoreDocComparator class which works fine for
one
 field. But I met some trouble when trying to combine other fields
together.
 I'm still looking at FunctionQuery. Don't know if I can figure out
 something.

FunctionQuery support is integrated into Solr (or currently hacked-in,
as the case may be),  and can be useful for debugging and trying out
query types even if you don't use it for your runtime.

ReciprocalFloatFunction might meet your needs for increasing the score
of more recent documents:

http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/ReciprocalFloatFunction.html

The SolrQueryParser can make
ReciprocalFloatFunction(new ReverseOrdFieldSource(my_date),1,1000,1000)
out of _val_:recip(rord(my_date),1,1000,1000)

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search
Server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Announcement: Lucene powering Monster job search index (Beta)

2007-03-16 Thread Peter Keegan

Dan,

The filtering is done in the HitCollector by the bounding box, so the only
hits that get collected are those that match the keywords, the bounding box,
and some Lucene filters (BitSets) (I'm probably overloading the word
'filter' a bit). So, the only hits from the collector that need to be sorted
are those that are roughly within the search radius. When the search radius
gets larger, a new bounding box is computed for that query. Make sense?
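
If it helps to see it in code, a hedged fragment of the idea (2.x HitCollector
API; 'coords', the box bounds, 'searcher', 'query' and 'filter' are
hypothetical names standing in for our internals):

final int minX = box.getMinX(), maxX = box.getMaxX();
final int minY = box.getMinY(), maxY = box.getMaxY();
final ArrayList boxed = new ArrayList();
searcher.search(query, filter, new HitCollector() {
  public void collect(int doc, float score) {
    // per-document coordinates kept outside Lucene, keyed by docid
    int x = coords.getX(doc);
    int y = coords.getY(doc);
    // two integer range checks approximate the search radius cheaply
    if (x >= minX && x <= maxX && y >= minY && y <= maxY) {
      boxed.add(new ScoreDoc(doc, score));
    }
  }
});
// exact distances are computed afterwards, only for this smaller set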

Peter

On 3/16/07, Daniel Rosher [EMAIL PROTECTED] wrote:


Hi Peter,

Shouldn't the search perform the euclidean distance check during filtering as
well, though? Otherwise you may report highly relevant hits to the user
outside the range they specified, particularly as the search radius gets
larger.

Cheers,
Dan

On 1/28/07, Peter Keegan [EMAIL PROTECTED] wrote:

 Correction:
 We only do the euclidean computation during sorting. For filtering, a
 simple
 bounding box is computed to approximate the radius, and 2 range
 comparisons
 are made to exclude documents. Because these comparisons are done
outside
 of
 Lucene as integer comparisons, it is pretty fast. With 13000 results,
the
 search time with distance sort is about 200 msec (compared to 30 ms for a
 simple non-radius, date-sorted keyword search).

 Peter

 On 1/27/07, no spam [EMAIL PROTECTED] wrote:
 
  Isn't it extremely inefficient to do the euclidean distance twice?
  Perhaps not a huge deal if a small search result set.  I at times have
  13,000 results that match my search terms of an index with 1.2 million
  docs.
 
  Can't you do some simple radian math first to ensure it's way out of
  bounds,
   then do the euclidean distance for the subset within bounds?  I'm
  currently
  only doing the distance calc once (post hit collector). I don't have
any
  performance numbers with the double vs single distance calc.
 
  I'm still working out the sort by radius myself.
 
  Mark
 
  On 11/3/06, Peter Keegan [EMAIL PROTECTED] wrote:
  
   Daniel,
   Yes, this is correct if you happen to be doing a radius search and
  sorting
   by mileage.
   Peter
  
  
 
 




Re: Announcement: Lucene powering Monster job search index (Beta)

2007-03-16 Thread Peter Keegan

Note: this is a reply to a posting to java-dev  --Peter

Eric,


Now that it is live, is performance pretty good?


Performance is outstanding. Each server can easily handle well over 100 qps
on an index of over 800K documents. There are several servers (4 dual core
(8 CPU) Opteron) supporting the query load and we have backup servers for
disaster recovery. For a few hours one day, all job search query traffic for
the entire site was being handled by a single server - with no noticeable
latency!


Are you using dotLucene or a webservice tier and java?


We are using Java Lucene on dedicated servers.



How did you implement your bounding box for the searching? It sounds like

you do this outside of lucene and return a custom hitcollector.

The 'bounding box' is merely the conjunction of 2 numeric range searches.
It's really not that hard to do - I think there has been discussion of this
elsewhere in this group. We use (not 'return') a custom HitCollector to
exclude hits that aren't in the bounding box. I tried to explain this in a
reply earlier today, but if I failed let me know.


Why not use a rangequery or functionquery for the basic bounding before

sorting

Basically, 'RangeQuery' doesn't offer sufficient performance. We have
implemented our own 'numeric value' search 'next to Lucene' (I think I like
this better than 'outside of Lucene' ;-)).  FunctionQuery could be used if
you wanted the jobs sorted by a combination of keywords and distance. Our
users (apparently) expect the jobs to be sorted strictly by distance on a
radius search.



Peter

Hello Peter,

Now that the monster lucene search is live, is performance pretty good? Are
you still running it on a single 8 core server? Can you give me a rough idea
on the number of queries you can handle/second and the number of docs in the
index? Are you using dotLucene or a webservice tier and java?

How did you implement your bounding box for the searching? It sounds like
you do this outside of lucene and return a custom hitcollector. Why not use
a rangequery or functionquery for the basic bounding before sorting?

Thanks,
Eric


Re: Lucene search performance: linear?

2007-03-21 Thread Peter Keegan

On a similar topic, has anybody measured query performance as a function of
index size?
Well, I did and the results surprised me. I measured query throughput on 8
indexes that varied in size from 55,000 to 4.4 million documents. When
plotted on a graph, there is a distinct hyperbolic curve (1/x). I expected
to see more of a linear curve with a sharp drop-off at some point.
Interesting

Peter

On 12/5/06, Zhang, Lisheng [EMAIL PROTECTED] wrote:


Hi Soeren,

Thanks very much for the explanations; yes, there
is no linear relation when searching for a keyword
which is only in a few docs.

Best regards, Lisheng

-Original Message-
From: Soeren Pekrul [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 05, 2006 10:37 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene search performance: linear?


Hello Lisheng,

a search process usually has to do two things. First it has to find the
term in the index. I don't know the implementation of finding a term in
Lucene. I hope that the index is at least a sorted list or a binary
tree, so it can do a binary search. The time to find a term depends on the
number of terms n_t. With a binary search the complexity is approximately
log(n_t). The search time should be better than linear.

Second it has to collect the documents for a term. This depends on the
number of documents n_d for a term. It has to go through the list of documents
for a term. The time should be proportional to the number of documents
for a term even if it doesn't calculate the similarity. Usually the
number of documents for a single term is less than the total number of
documents in the collection and less than the total number of terms in
the index.

If the number of documents for a single term is less than the total
number of documents, the search process for a single term, including
step one (finding the term) and step two (collecting the documents
and calculating the score), should be better than linear in the number of
documents.

 I indexed first 220,000, all with a special keyword, I did a simple
 query and only fetched 5 docs, with Hits.length()=220,000.

 Then I indexed 440,000 docs, with the same keyword, queried it
 again and fetched a few docs, with Hits.length()=440,000.

In your case the query term is contained in all documents. The number of
documents for a single term is equals the total number of documents in
your collection. The hit collector has to collect all documents. The
collecting process is proportional to the number of documents to
collect. So the search for all documents should be at least linear to
the total number of documents.

Sören

Zhang, Lisheng schrieb:
 Hi,

 I indexed first 220,000, all with a special keyword, I did a simple
 query and only fetched 5 docs, with Hits.length()=220,000.

 Then I indexed 440,000 docs, with the same keyword, queried it
 again and fetched a few docs, with Hits.length()=440,000.

 I found that search time is about linear: 2nd time is about 2 times
 longer than 1st query. I would like to understand:

 Does the linear relation come from score calculation, since we
 have to calculate score one by one? Or other reason?

 If we have a B-tree index I would naively expect better scalability?

 Thanks very much for your helps,

 Lisheng

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: FieldSortedHitQueue enhancement

2007-03-29 Thread Peter Keegan

The duplicate check would just be on the doc ID. I'm using TreeSet to detect
duplicates with no noticeable effect on performance. The PQ only has to be
checked for a previous value IFF the element about to be inserted is
actually inserted and not dropped because it's less than the least value
already in there. So, the TreeSet is never bigger than the size of the PQ
(typically 25 to a few hundred items), not the size of all hits.
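
Roughly, the check looks like this (a hedged sketch; 'seenDocs' is the TreeSet
of Integer doc IDs, and lessThan would need to be made public for this to
compile):

boolean wouldInsert = hq.size() < maxSize
    || (hq.size() > 0 && !hq.lessThan((ScoreDoc) fieldDoc, (ScoreDoc) hq.top()));
if (wouldInsert) {
  if (seenDocs.add(new Integer(fieldDoc.doc))) {
    hq.insert(fieldDoc);
  }
  // else: a duplicate doc ID - the first doc wins and this one is dropped
}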

Peter

On 3/29/07, Otis Gospodnetic [EMAIL PROTECTED] wrote:


Hm, removing duplicates (as determined by a value of a specified document
field) from the results would be nice.
How would your addition affect performance, considering it has to check
the PQ for a previous value for every candidate hit?

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Peter Keegan [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, March 29, 2007 9:39:13 AM
Subject: FieldSortedHitQueue enhancement

This is request for an enhancement to FieldSortedHitQueue/PriorityQueue
that
would prevent duplicate documents from being inserted, or alternatively,
allow the application to prevent this (reason explained below). I can do
this today by making the 'lessThan' method public and checking the queue
before inserting like this:

if (hq.size() < maxSize) {
   // doc will be inserted into queue - check for duplicate before inserting
} else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top())) {
  // doc will be inserted into queue - check for duplicate before inserting
} else {
  // doc will not be inserted - no check needed
}

However, this is just replicating existing code in
PriorityQueue-insert().
An alternative would be to have a method like:

public boolean wouldBeInserted(ScoreDoc doc)
// returns true if doc would be inserted, without inserting

The reason for this is that I have some queries that get expanded into
multiple searches and the resulting hits are OR'd together. The queries
contain 'terms' that are not seen by Lucene but are handled by a
HitCollector that uses external data for each document to evaluate hits.
The
results from the priority queue should contain no duplicate documents
(first
or last doc wins).

Do any of these suggestions seem reasonable? So far, I've been able to
use
Lucene without any modifications, and hope to continue this way.

Peter




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: FieldSortedHitQueue enhancement

2007-03-29 Thread Peter Keegan

Yes, my custom query processor can sometimes make 2 Lucene search calls
which may result in duplicate docs being inserted on the same PQ. The
simplest solution is to make lessThan public. I'm curious to know if anyone
else is performing multiple searches under the covers.

Peter

On 3/29/07, Yonik Seeley [EMAIL PROTECTED] wrote:


On 3/29/07, Otis Gospodnetic [EMAIL PROTECTED] wrote:
 Ah, I see.  This is less attractive to me personally, but maybe it helps
others.  One thing I don't understand is why/how you'd get duplicate
documents with the same doc ID in there.  Isn't insert(FieldDoc fdoc) called
only once for each doc?

Yes, for any Lucene search method.
From Peter's first message, it looks like it's his custom code that
can result in duplicates.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: FieldSortedHitQueue enhancement

2007-03-29 Thread Peter Keegan

Peter, how did you achieve 'last wins' as you must presumably remove first

from the PQ?

I implemented 'first wins' because the score is less important than other
fields (distance, in our case), but you make a good point since score may be
more important. How did you implement remove()?

Peter


On 3/29/07, Antony Bowesman [EMAIL PROTECTED] wrote:


I've got a similar duplicate case, but my duplicates are based on an
external ID
rather than Doc id, so it occurs within a single Query.  It's using a custom
HitCollector but score based, not field sorted.

If my duplicate contains a higher score than one on the PQ I need to
update the
stored score with the higher one, so PQ needs a replace() method where the
stored object.equals() can be used to find the object to delete.  I'm not
sure
if there's a way to find the object efficiently in this case other than a
linear
search.  I implemented remove().

Peter, how did you achieve 'last wins' as you must presumably remove first
from
the PQ?

Antony


Peter Keegan wrote:
 The duplicate check would just be on the doc ID. I'm using TreeSet to
 detect
 duplicates with no noticeable affect on performance. The PQ only has to
be
 checked for a previous value IFF the element about to be inserted is
 actually inserted and not dropped because it's less than the least value
 already in there. So, the TreeSet is never bigger than the size of the
PQ
 (typically 25 to a few hundred items), not the size of all hits.

 Peter

 On 3/29/07, Otis Gospodnetic [EMAIL PROTECTED] wrote:

 Hm, removing duplicates (as determined by a value of a specified
document
 field) from the results would be nice.
 How would your addition affect performance, considering it has to check
 the PQ for a previous value for every candidate hit?

 Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
 Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

 - Original Message 
 From: Peter Keegan [EMAIL PROTECTED]
 To: java-user@lucene.apache.org
 Sent: Thursday, March 29, 2007 9:39:13 AM
 Subject: FieldSortedHitQueue enhancement

 This is request for an enhancement to FieldSortedHitQueue/PriorityQueue
 that
 would prevent duplicate documents from being inserted, or
alternatively,
 allow the application to prevent this (reason explained below). I can
do
 this today by making the 'lessThan' method public and checking the
queue
 before inserting like this:

 if (hq.size() < maxSize) {
    // doc will be inserted into queue - check for duplicate before inserting
 } else if (hq.size() > 0 && !hq.lessThan((ScoreDoc)fieldDoc, (ScoreDoc)hq.top())) {
   // doc will be inserted into queue - check for duplicate before inserting
 } else {
   // doc will not be inserted - no check needed
 }

 However, this is just replicating existing code in
 PriorityQueue-insert().
 An alternative would be to have a method like:

 public boolean wouldBeInserted(ScoreDoc doc)
 // returns true if doc would be inserted, without inserting

 The reason for this is that I have some queries that get expanded into
 multiple searches and the resulting hits are OR'd together. The queries
 contain 'terms' that are not seen by Lucene but are handled by a
 HitCollector that uses external data for each document to evaluate
hits.
 The
 results from the priority queue should contain no duplicate documents
 (first
 or last doc wins).

 Do any of these suggestions seem reasonable? So far, I've been able to
 use
 Lucene without any modifications, and hope to continue this way.

 Peter




 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Sorting on a field that can have null values

2007-04-13 Thread Peter Keegan

excluding them completely is a slightly different task, you don't need to
index a special marker value, you can just use a
RangeFilter (or ConstantScoreRangeQuery) to ensure you only get docs with
a value for that field (ie: field:[* TO *])


Excellent, this is a much better solution. BTW, adding a
ConstantScoreRangeQuery clause to the query works fine, but building the
RangeFilter from the query string field:[* TO *] doesn't work. The reason
is that the terms expanded from the lowerTerm wildcard are compared to
'upperTerm' which is literally '*', which is incorrect. This would appear to
be a bug in QueryParser as it ought to set lowerTerm = upperTerm = null in
this case.
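
For reference, the programmatic form of that clause looks roughly like this (a
hedged sketch; the field and query names are placeholders and the exact
ConstantScoreRangeQuery constructor arguments are from memory):

BooleanQuery q = new BooleanQuery();
q.add(mainQuery, BooleanClause.Occur.MUST);
// open-ended range: matches only docs that have some value in the sort field
q.add(new ConstantScoreRangeQuery("myfield", null, null, false, false),
      BooleanClause.Occur.MUST);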

Thanks,
Peter


On 4/12/07, Chris Hostetter [EMAIL PROTECTED] wrote:



: If i rememebr correctly (you'll have to test this) sorting on a field
: which doesn't exist for every doc does what you would want (docs with
: values are listed before docs without)

: The actual behavior is different than described above. I modified
: TestSort.java:

: The actual order of the results is: ZJI. I believe this happens
because
: the field string cache 'order' array contains 0's for all the documents
that
: don't contain the field and thus sort first.

i guess i wasn't precise enough in that old thread, what i meant was that not
having a value results in the docs sorting the same as if they had a value
lower than the lowest existing value -- so they sort at the end of the
list if you are doing a descending sort, and at the beginning of the list
if you do an ascending sort.  If you want to always have them come last
regardless of order, there is a SortComparator for that purpose in Solr...

https://issues.apache.org/jira/browse/LUCENE-406

http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/search/MissingStringLastComparatorSource.java?view=log

: Suppose I want to exclude documents from being collected if they don't
: contain the sort field. One way to do this is to index a unique
: 'empty_value' value for those documents and add a MUST_NOT boolean
clause to
: the query, for example: query terms -field:empty_value). But this
seems
: inefficient. Is there a better way?

excluding them completely is a slightly different task, you don't need to
index a special marker value, you can just use a
RangeFilter (or ConstantScoreRangeQuery) to ensure you only get docs with
a value for that field (ie: field:[* TO *])



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: optimization behaviour

2007-05-10 Thread Peter Keegan

Of course, that doesn't have to be the case.  It would be a trivial
change to merge segments and not remove the deleted docs.  That
usecase could be useful in conjunction with ParallelReader.


If the behavior of deleted docs during merging or optimization ever changes,
please make this configurable. Our application uses the Lucene docid as a
key into our numeric values 'extension' file, and it depends on the simple
behavior described in the previous posts.

Thanks,
Peter


On 5/10/07, Yonik Seeley [EMAIL PROTECTED] wrote:


On 5/10/07, Yonik Seeley [EMAIL PROTECTED] wrote:
 Deleted documents are removed on segment merges (for documents marked
 as deleted in those segments).

Of course, that doesn't have to be the case.  It would be a trivial
change to merge segments and not remove the deleted docs.  That
usecase could be useful in conjunction with ParallelReader.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Payloads and PhraseQuery

2007-06-27 Thread Peter Keegan

I'm looking at the new Payload api and would like to use it in the following
manner. Meta-data is indexed as a special phrase (all terms at same
position) and a payload is stored with the first term of each phrase. I
would like to create a custom query class that extends PhraseQuery and uses
its PhraseScorer to find matching documents. The custom query class then
reads the payload from the first term of the matching query and uses it to
produce a new score. However, I don't see how to get the payload from the
PhraseScorer's TermPositions. Is this possible?


Peter


Re: Payloads and PhraseQuery

2007-06-29 Thread Peter Keegan

I tried to subclass PhraseScorer, but discovered that it's an abstract class
and its subclasses (ExactPhraseScorer and SloppyPhraseScorer) are final
classes. So instead, I extended Scorer with my custom scorer and extended
PhraseWeight (after making it public). My scorer's constructor is passed the
instance of PhraseScorer created by PhraseQuery.scorer(). My scorer's 'next'
and 'skipTo' methods call the PhraseScorer's methods first and if the result
is 'true', the payload is loaded and used to determine whether or not the
PhraseScorer's doc is a hit. If not, PhraseScorer.next() or skipTo() is
called again. In order to get the payload, I modified PhraseQuery to save
the TermPositions array it creates for its scorers and added a 'get' method.
The diff is included, below.

This is probably not the best solution, but at least a starting point for
further discussion.

Here's the diff:

Index: PhraseQuery.java
===
--- PhraseQuery.java(revision 551992)
+++ PhraseQuery.java(working copy)
@@ -36,7 +36,8 @@
  private Vector terms = new Vector();
  private Vector positions = new Vector();
  private int slop = 0;
-
+  private TermPositions[] tps;
+
  /** Constructs an empty phrase query. */
  public PhraseQuery() {}

@@ -104,7 +105,7 @@
  return result;
  }

-  private class PhraseWeight implements Weight {
+  public class PhraseWeight implements Weight {
private Similarity similarity;
private float value;
private float idf;
@@ -138,7 +139,7 @@
  if (terms.size() == 0)  // optimize zero-term case
return null;

-  TermPositions[] tps = new TermPositions[terms.size()];
+  tps = new TermPositions[terms.size()];
   for (int i = 0; i < terms.size(); i++) {
TermPositions p = reader.termPositions((Term)terms.elementAt(i));
if (p == null)
@@ -155,7 +156,9 @@
 reader.norms(field));

}
-
+public TermPositions[] getTermPositions() {
+return tps;
+}
public Explanation explain(IndexReader reader, int doc)
  throws IOException {
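
For context, the wrapper scorer's advance methods look roughly like this (a
hedged sketch; 'phraseScorer' is the scorer obtained from the modified
PhraseQuery above, and payloadSaysHit() is a stand-in for the payload check):

public boolean next() throws IOException {
  while (phraseScorer.next()) {
    if (payloadSaysHit(phraseScorer.doc())) {
      return true;
    }
  }
  return false;
}

public boolean skipTo(int target) throws IOException {
  if (!phraseScorer.skipTo(target)) {
    return false;
  }
  return payloadSaysHit(phraseScorer.doc()) || next();
}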



On 6/27/07, Mark Miller [EMAIL PROTECTED] wrote:


You cannot do it because TermPositions is read in the
PhraseWeight.scorer(IndexReader) method (or MultiPhraseWeight) and
loaded into an array which is passed to PhraseScorer. Extending the Weight
as well and passing the payload to the Scorer is a possibility.

- Mark

Peter Keegan wrote:
 I'm looking at the new Payload api and would like to use it in the
 following
 manner. Meta-data is indexed as a special phrase (all terms at same
 position) and a payload is stored with the first term of each phrase. I
 would like to create a custom query class that extends PhraseQuery and
 uses
 its PhraseScorer to find matching documents. The custom query class then
 reads the payload from the first term of the matching query and uses
 it to
 produce a new score. However, I don't see how to get the payload from
the
 PhraseScorer's TermPositions. Is this possible?


 Peter


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Payloads and PhraseQuery

2007-07-11 Thread Peter Keegan

I'm now looking at using payloads with SpanNearQuery but I don't see any
clear way of getting the payload(s) from the matching span terms. The term
positions for the payloads seem to be buried beneath SpanCells in the
NearSpansOrdered and NearSpansUnordered classes, which are not public. I'd
be content to be able to get the payload from just the first term of the
span.

Can anyone suggest an approach for making payloads work with SpanNearQuery?

Peter


On 6/27/07, Grant Ingersoll [EMAIL PROTECTED] wrote:


Could you get what you need combining the BoostingTermQuery with a
SpanNearQuery to produce a score?  Just guessing here..

At some point, I would like to see more Query classes around the
payload stuff, so please submit patches/feedback if and when you get
a solution

On Jun 27, 2007, at 10:45 AM, Peter Keegan wrote:

 I'm looking at the new Payload api and would like to use it in the
 following
 manner. Meta-data is indexed as a special phrase (all terms at same
 position) and a payload is stored with the first term of each
 phrase. I
 would like to create a custom query class that extends PhraseQuery
 and uses
 its PhraseScorer to find matching documents. The custom query class
 then
 reads the payload from the first term of the matching query and
 uses it to
 produce a new score. However, I don't see how to get the payload
 from the
 PhraseScorer's TermPositions. Is this possible?


 Peter

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Payloads and PhraseQuery

2007-07-12 Thread Peter Keegan

I'm looking for Spans.getPositions(), as shown in BoostingTermQuery, but
neither NearSpansOrdered nor NearSpansUnordered (which are the Spans
provided by SpanNearQuery) provide this method and it's not clear to me how
to add it.

Peter

On 7/11/07, Chris Hostetter [EMAIL PROTECTED] wrote:



: I'm now looking at using payloads with SpanNearQuery but I don't see any
: clear way of getting the payload(s) from the matching span terms. The
term
: positions for the payloads seem to be buried beneath SpanCells in the

Isn't Spans.start() and Spans.end() what you are looking for?





-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Payloads and PhraseQuery

2007-07-12 Thread Peter Keegan

Grant,

If/when you have an implementation for SpanNearQuery, I'd be happy to test
it.

Peter

On 7/12/07, Grant Ingersoll [EMAIL PROTECTED] wrote:


Yep, totally agree. One way to handle this initially at least is to
have isPayloadAvailable() only return true for the SpanTermQuery.
The other option is to come up with some modification of the
suggested methods below to return all the payloads in a span.

I have a basic implementation for just the SpanTermQuery (i.e. via
TermSpans) in the works.  I will take a crack at fleshing out the
rest at some point soon.

-Grant

On Jul 12, 2007, at 1:22 PM, Paul Elschot wrote:


 On Thursday 12 July 2007 14:50, Grant Ingersoll wrote:
 That is off of the TermSpans class.  BTQ (BoostingTermQuery) is
 implemented to extend SpanQuery, thus SpanNearQuery isn't, w/o
 modification, going to have access to these things.  However, if you
 look at the SpanTermQuery, you will see that it's implementation of
 Spans is indeed the TermSpans class.  So, I think you could cast to
 it or handle it through instanceof.

 I am not completely sure here, but it seems like we may need an
 efficient way to access the TermPositions for each document.  That
 is, the Spans class doesn't provide this and maybe it should
 somehow.  Again, I am just thinking out loud here.

 SpanQueries can be nested, so the relationship between a span
 and a term position can also be one to many, not only one to one.
 For example a matching span in the Spans of a SpanNearQuery
 can be based on two matching (near enough to match) term positions.


 Thus, if we modified Spans to have the following methods:

 byte[] getPayload(byte[] data, int offset)

 boolean isPayloadAvailable()

 I think this would be useful.  Perhaps this should be discussed on
 dev.

 And the same holds for the payloads, there may be more than one
 for a single Span.

 Regards,
 Paul Elschot


 Cheers,
 Grant


 On Jul 12, 2007, at 8:20 AM, Peter Keegan wrote:

 I'm looking for Spans.getPositions(), as shown in
 BoostingTermQuery, but
 neither NearSpansOrdered nor NearSpansUnordered (which are the Spans
 provided by SpanNearQuery) provide this method and it's not clear
 to me how
 to add it.

 Peter

 On 7/11/07, Chris Hostetter [EMAIL PROTECTED] wrote:


 : I'm now looking at using payloads with SpanNearQuery but I don't
 see any
 : clear way of getting the payload(s) from the matching span
 terms. The
 term
 : positions for the payloads seem to be buried beneath SpanCells
 in the

 Isn't Spans.start() and Spans.end() what you are looking for?





 -Hoss


 ---
 --
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



 --
 Grant Ingersoll
 Center for Natural Language Processing
 http://www.cnlp.org/tech/lucene.asp

 Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/
 LuceneFAQ



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


--
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: encoding question.

2007-07-19 Thread Peter Keegan

The source data for my index is already in standard UTF-8 and available as a
simple byte array. I need to do some simple tokenization of the data (check
for whitespace and special characters that control position increment). What
is the most efficient way to index this data and avoid unnecessary
conversions to/from java Strings or char arrays? Looking at DocumentsWriter,
I see that all terms are eventually converted to char arrays and written in
modified-UTF-8, so there doesn't seem to be much advantage to having the
source data in standard UTF-8.

Peter


On 2/14/07, Chris Hostetter [EMAIL PROTECTED] wrote:



Internally Lucene deals with pure Java Strings; when writing those strings
to and reading those strings back from disk, Lucene always uses the stock
Java modified UTF-8 format, regardless of what your file.encoding
system property may be.

typically when people have encoding problems in their lucene applications,
the origin of the problem is in the way they fetch the data before
indexing it ... if you can make a String object, and System.out.println
that string and see what you expect, then handing that string to Lucene as
a field value should work fine.

what exactly is the value object you are calling getBytes on? ... if
it's another String, then you've already got serious problems -- i can't
imagine any situation where fetching the bytes from a String in one
charset and using those bytes to construct another string (either in a
different charset, or in the system default charset) would make any sense
at all.

wherever your original binary data is coming from (files on disk, network
socket, etc...) that's when you should be converting those bytes into
chars using whatever charset you know those bytes represent.



: Date: Wed, 14 Feb 2007 09:16:58 +0330
: From: Mohammad Norouzi [EMAIL PROTECTED]
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: encoding question.
:
: Hi
: I want to index data with utf-8 encoding, so when adding field to a
document
: I am using the code new String(value.getBytes("utf-8"))
: on the other hand, when I am going to search I was using the same
snippet
: code to convert to utf-8 but it did not work, so finally I found
somewhere
: that said to use new String(valueToSearch.getBytes
("cp1252"),"UTF8")
: and it worked fine but I still have some problems.
: first, some characters are weird when I get results from lucene, it seems
it
: is in cp1252 encoding.
: second, if the java environment property file.encoding is not cp1252
the
: result is completely in the wrong encoding, so I must change this
property
: using System.setProperty("file.encoding","cp1252")
:
: is lucene neglecting my utf-8 encoding and indexing data using
cp1252?
: how can I correct the weird characters I receive when searching?
:
: Thank you very much in advance.
: --
: Regards,
: Mohammad
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Payloads and PhraseQuery

2007-07-27 Thread Peter Keegan
I guess this also ties in with 'getPositionIncrementGap', which is relevant
to fields with multiple occurrences.
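
A minimal sketch of the kind of analyzer arrangement I mean (assuming the
2.2-era token APIs; the class names, gap size and payload contents are all
invented for illustration):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.index.Payload;

public class PayloadPerOccurrenceAnalyzer extends Analyzer {

  // attaches a payload to the first token of each field occurrence;
  // a new filter instance is created per occurrence via tokenStream()
  static class FirstTokenPayloadFilter extends TokenFilter {
    private boolean first = true;
    FirstTokenPayloadFilter(TokenStream in) { super(in); }
    public Token next() throws IOException {
      Token t = input.next();
      if (t != null && first) {
        t.setPayload(new Payload(new byte[] { 1 }));  // invented contents
        first = false;
      }
      return t;
    }
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new FirstTokenPayloadFilter(new WhitespaceTokenizer(reader));
  }

  // keep separate occurrences of a field from overlapping position-wise
  public int getPositionIncrementGap(String fieldName) {
    return 100;
  }
}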

Peter

On 7/27/07, Peter Keegan [EMAIL PROTECTED] wrote:

 I have a question about the way fields are analyzed and inverted by the
 index writer. Currently, if a field has multiple occurrences in a document,
 each occurrence is analyzed separately (see DocumentsWriter.processField).
 Is it safe to assume that this behavior won't change in the future? The
 reason I ask is that my custom analyzer's 'tokenStream' method creates a
 custom filter which produces a payload based on the existence of each field
 occurrence. However, if DocumentsWriter was changed and combined all the
 occurrences before inversion, my scheme wouldn't work.  Since payloads are
 created by filters/tokenizers, it helps to keep things flexible.

 Thanks,
 Peter


 On 7/12/07, Grant Ingersoll [EMAIL PROTECTED] wrote:
 
 
  On Jul 12, 2007, at 6:12 PM, Chris Hostetter wrote:
 
 
  
   Hmm... okay so the issue is that in order to get the payload data, you
   have to have a TermPositions instance.
  
   instead of adding getPayload methods to the Spans class (which as Paul
 
   points out, can have nesting issues) perhaps more general solutions
   would
   be:
  
   a) a more high level getPayload API that let's you get a payload
   arbitrarily for a toc/position (perhaps as part of the TernDocs
   API?) ...
   then for Spans you could use this new API with Spans.start() and
   Spans.end(). (and all the positions in between)
 
  Not sure I follow this.  I don't see the fit w/ TermDocs.
  
   b) add a variation of the TermPositions class to allow people to
   iterate
   through the terms of a TermDoc in position order (TermPosition first
   iterates over the Terms and then over the positions) ... then you
   could
   seek(span.start()) to get the Payload data
  
   c) add methods to the Spans API to get the subspans (if any) ... this
   would be the Spans corrilary to getTerms() and would always return
   TermSpans which would have TermPositions for getting payload data.
 
 
  This could be a good alternative.
 
  When we first talked about payloads we wondered if we could just make
  all Queries into SpanQueries by passing TermPositions instead of term
  docs, but in the end decided not to do it because of performance
  issues (some of which are lessened by lazy loading of TermPositions.
 
  The thing is, I think, that the Spans is already moving you along in
  the term positions, so it just seems like a natural fit to have it
  there, even if there is nesting.  It doesn't seem like it would be
  that hard to then return back the nesting stuff b/c you are just
  collating the results from the underlying SpanTermQuery.  Having said
  that, I haven't looked into the actual code, so take that w/ a grain
  of salt.
 
  I will try to do some more investigation, as others are welcome to
  do.  Perhaps we should move this to dev?
 
  Cheers,
  Grant
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 



Re: LUCENE-843 Release

2007-07-30 Thread Peter Keegan
I've built a production index with this patch and done some query stress
testing with no problems.
I'd give it a thumbs up.

Peter

On 7/30/07, testn [EMAIL PROTECTED] wrote:


 Hi guys,

 Do you think LUCENE-843 is stable enough? If so, do you think it's worth
 releasing it, probably as LUCENE 2.2.1? It would be nice so that people can
 take advantage of it right away without risking other breaking changes
 in the HEAD branch or waiting until the 2.3 release.

 Thanks,
 --
 View this message in context:
 http://www.nabble.com/LUCENE-843-Release-tf4170191.html#a11863644
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Mixing SpanQuery and BooleanQuery

2007-08-06 Thread Peter Keegan
I'm trying to create a fairly complex SpanQuery from a binary parse tree.
I create SpanOrQueries from SpanTermQueries and combine SpanOrQueries into
BooleanQueries. So far, so good.
The problem is that I don't see how to create a SpanNotQuery from a
BooleanQuery and a SpanTermQuery. I want the BooleanQuery to be the
'include' span and the SpanTermQuery to be the 'exclude' span.
Unfortunately, the BooleanQuery cannot be cast to a SpanQuery.

I thought that SpanQuery and BooleanQuery could be freely intermixed, but
this doesn't seem to be the case. It seems that what's really needed is a
'SpanAndQuery'.

Is there another way to build this type of query?

Thanks,
Peter


Re: Mixing SpanQuery and BooleanQuery

2007-08-06 Thread Peter Keegan
Even without 'interesting' slops, it does appear that SpanNearQuery is a
logical AND of all its clauses.
I was distracted by the BooleanQuery examples in the javadocs :)

thanks,
Peter

On 8/6/07, Erick Erickson [EMAIL PROTECTED] wrote:

 Isn't a SpanAndQuery the same as a SpanNearQuery? Perhaps
 with interesting slops..

 Erick

 On 8/6/07, Peter Keegan [EMAIL PROTECTED] wrote:
 
  I'm trying to create a fairly complex SpanQuery from a binary parse
 tree.
  I create SpanOrQueries from SpanTermQueries and combine SpanOrQueries
 into
  BooleanQueries. So far, so good.
  The problem is that I don't see how to create a SpanNotQuery from a
  BooleanQuery and a SpanTermQuery. I want the BooleanQuery to be the
  'include' span and the SpanTermQuery to be the 'exclude' span.
  Unfortunately, the BooleanQuery cannot be cast to a SpanQuery.
 
  I thought that SpanQuery and BooleanQuery could be freely intermixed,
 but
  this doesn't seem to be the case. It seems that what's really needed is
 a
  'SpanAndQuery'.
 
  Is there another way to build this type of query?
 
  Thanks,
  Peter
 



SpanQuery and database join

2007-08-13 Thread Peter Keegan
I've been experimenting with using SpanQuery to perform what is essentially
a limited type of database 'join'. Each document in the index contains 1 or
more 'rows' of meta data from another 'table'. The meta data are simple
tokens representing a column name/value pair ( e.g. color$red or
location$123).  Each row is represented by a span with a maximum token
length equal to the maximum number of meta data columns. If a column has
multiple values, they are all indexed at the same position ( e.g. color$red,
color$blue). All rows are added to a single field. The spans are 'separated'
from each other by introducing a position gap between them via '
Analyzer.getPositionIncrementGap'. This gap should be greater than the
number of columns in each span.

At query time, a SpanNearQuery is constructed to represent the meta data to
join. The 'slop' value is set to the maximum number of meta data columns
(minus 1). Using a simple Antlr parser, boolean span queries with AND, OR,
NOT can be constructed fairly easily. The SpanQuery is And'd to the main
query to build the final query.
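
A hedged sketch of what the constructed query looks like (field name, tokens,
MAX_COLUMNS and 'mainQuery' are illustrative placeholders):

// e.g. join on (color = red OR blue) AND (location = 123)
SpanQuery color = new SpanOrQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("meta", "color$red")),
    new SpanTermQuery(new Term("meta", "color$blue")) });
SpanQuery location = new SpanTermQuery(new Term("meta", "location$123"));

// both constraints must fall within one row's span
SpanQuery row = new SpanNearQuery(
    new SpanQuery[] { color, location }, MAX_COLUMNS - 1, false);

// AND the span 'join' with the main keyword query
BooleanQuery full = new BooleanQuery();
full.add(mainQuery, BooleanClause.Occur.MUST);
full.add(row, BooleanClause.Occur.MUST);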

This approach is flexible and pretty efficient because no stored fields or
external data are accessed at query time. Span queries are more expensive
compared than other queries, though. We measure performance via throughput
(as opposed to the response time for a single query), and the addition of a
SpanQuery reduced throughput by 5X for ordered spans and 10X for unordered
spans. Still, this may be acceptable for some applications, especially if
spans are not used on every query.

I thought this might interest some of you.

Peter


Re: SpanQuery and database join

2007-08-13 Thread Peter Keegan
I suppose it could go under performance or HowTo/Interesting uses of
SpanQuery.

Peter

On 8/13/07, Erick Erickson [EMAIL PROTECTED] wrote:

 Thanks for writing this up. Do you think this is an appropriate subject
 for the Wiki performance page?

 Erick

 On 8/13/07, Peter Keegan [EMAIL PROTECTED] wrote:
 
  I've been experimenting with using SpanQuery to perform what is
  essentially
  a limited type of database 'join'. Each document in the index contains 1
  or
  more 'rows' of meta data from another 'table'. The meta data are simple
  tokens representing a column name/value pair ( e.g. color$red or
  location$123).  Each row is represented by a span with a maximum token
  length equal to the maximum number of meta data columns. If a column has
  multiple values, they are all indexed at the same position ( e.g.
  color$red,
  color$blue). All rows are added to a single field. The spans are
  'separated'
  from each other by introducing a position gap between them via '
  Analyzer.getPositionIncrementGap'. This gap should be greater than the
  number of columns in each span.
 
  At query time, a SpanNearQuery is constructed to represent the meta data
  to
  join. The 'slop' value is set to the maximum number of meta data columns
  (minus 1). Using a simple Antlr parser, boolean span queries with AND,
 OR,
  NOT can be constructed fairly easily. The SpanQuery is And'd to the main
  query to build the final query.
 
  This approach is flexible and pretty efficient because no stored fields
 or
  external data are accessed at query time. Span queries are more
 expensive
  than other queries, though. We measure performance via
 throughput
  (as opposed to the response time for a single query), and the addition
 of
  a
  SpanQuery reduced throughput by 5X for ordered spans and 10X for
 unordered
  spans. Still, this may be acceptable for some applications, especially
 if
  spans are not used on every query.
 
  I thought this might interest some of you.
 
  Peter
 



Re: SpanQuery and database join

2007-08-14 Thread Peter Keegan
I added this under Use Cases. Thanks for the suggestion.

Peter


On 8/13/07, Grant Ingersoll [EMAIL PROTECTED] wrote:

 There is also a Use Cases item on the Wiki...

 On Aug 13, 2007, at 3:26 PM, Peter Keegan wrote:

  I suppose it could go under performance or HowTo/Interesting uses of
  SpanQuery.
 
  Peter
 
  On 8/13/07, Erick Erickson [EMAIL PROTECTED] wrote:
 
  Thanks for writing this up. Do you think this is an appropriate
  subject
  for the Wiki performance page?
 
  Erick
 
  On 8/13/07, Peter Keegan [EMAIL PROTECTED] wrote:
 
  I've been experimenting with using SpanQuery to perform what is
  essentially
  a limited type of database 'join'. Each document in the index
  contains 1
  or
  more 'rows' of meta data from another 'table'. The meta data are
  simple
  tokens representing a column name/value pair ( e.g. color$red or
  location$123).  Each row is represented by a span with a maximum
  token
  length equal to the maximum number of meta data columns. If a
  column has
  multiple values, they are all indexed at the same position ( e.g.
  color$red,
  color$blue). All rows are added to a single field. The spans are
  'separated'
  from each other by introducing a position gap between them via '
  Analyzer.getPositionIncrementGap'. This gap should be greater
  than the
  number of columns in each span.
 
  At query time, a SpanNearQuery is constructed to represent the
  meta data
  to
  join. The 'slop' value is set to the maximum number of meta data
  columns
  (minus 1). Using a simple Antlr parser, boolean span queries with
  AND,
  OR,
  NOT can be constructed fairly easily. The SpanQuery is And'd to
  the main
  query to build the final query.
 
  This approach is flexible and pretty efficient because no stored
  fields
  or
  external data are accessed at query time. Span queries are more
  expensive
   than other queries, though. We measure performance via
  throughput
  (as opposed to the response time for a single query), and the
  addition
  of
  a
  SpanQuery reduced throughput by 5X for ordered spans and 10X for
  unordered
  spans. Still, this may be acceptable for some applications,
  especially
  if
  spans are not used on every query.
 
  I thought this might interest some of you.
 
  Peter
 
 

 --
 Grant Ingersoll
 http://lucene.grantingersoll.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Scoring results?!

2007-08-30 Thread Peter Keegan
If I use BoostingTermQuery on a query containing terms without payloads, I
get very different results than doing the same query with TermQuery.
Presumably, this is because the BoostingSpanScorer/SpanScorer compute scores
differently than TermScorer. Is there a way to make BoostingTermQuery behave
like TermQuery for terms without payloads?

Peter


On 5/9/07, Grant Ingersoll [EMAIL PROTECTED] wrote:

 Hi Eric,

 On May 9, 2007, at 2:39 AM, supereric wrote:

 
  How can I get the tag word score in lucene? Suppose that you have
  searched for a
  tag word and 3 hit documents
  are now found.
  1 - How could someone find the number of occurrences in any document so
  it could
  sort the results?

 Span Queries tell you where the matches occur in the document by
 offset, but I am not sure what your sorting criteria would be.  The
 explain method also can give you information about why a particular
 document scored a particular way.


  Also I want to have some other policies for ranking the results.
  What should
  I do to handle that? For example,
  I want to score boldfaced tag words in an html document twice as much as
  normal text.

 Although totally experimental at this stage, the new Payload stuff in
 the trunk version of Lucene (or nightly builds) is designed for such
 a scenario.  Check out the BoostingTermQuery which can boost term
 scores based on the contents of a payload located at a particular
 term.  Feedback on the APIs is very much appreciated.

  2 - How can I omit some tag words from the index?! For example,
  common words
  in another language?

 See the StopFilter token filter and/or the StopwordAnalyzer


 
 

 HTH,
 Grant

 --
 Grant Ingersoll
 Center for Natural Language Processing
 http://www.cnlp.org/tech/lucene.asp

 Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/
 LuceneFAQ



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




BoostingTermQuery.explain() bugs

2007-08-30 Thread Peter Keegan
There are a couple of minor bugs in BoostingTermQuery.explain().

1. The computation of average payload score produces NaN if no payloads were
found. It should probably be:
float avgPayloadScore = super.score() * (payloadsSeen > 0 ? (payloadScore /
payloadsSeen) : 1);

2. If the average payload score is zero, the value of the explanation is 0:
result.setValue(nonPayloadExpl.getValue() * avgPayloadScore);
If the query is part of a BooleanClause, this results in:
no match on required clause...
failure to meet condition(s) of required/prohibited clause(s)

Let me know if I should open a JIRA issue.

Peter


BoostingTermQuery performance

2007-10-02 Thread Peter Keegan
I have been experimenting with payloads and BoostingTermQuery, which I think
are excellent additions to Lucene core. Currently, BoostingTermQuery extends
SpanQuery. I would suggest changing this class to extend TermQuery and
refactor the current version to something like 'BoostingSpanQuery'.

The reason is rooted in performance. In my testing, I compared query
throughput using TermQuery against 2 versions of BoostingTermQuery - the
current one that extends SpanQuery and one that extends TermQuery (which
I've included, below). Here are the results (qps = queries per second):

TermQuery:200 qps
BoostingTermQuery (extends SpanQuery): 97 qps
BoostingTermQuery (extends TermQuery): 130 qps

Here is a version of BoostingTermQuery that extends TermQuery. I had to
modify TermQuery and TermScorer to make them public. A code review would be
in order, and I would appreciate your comments on this suggestion.

Peter

-

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.search.*;


import java.io.IOException;

/**
 * Copyright 2004 The Apache Software Foundation
 * <p/>
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 * <p/>
 * http://www.apache.org/licenses/LICENSE-2.0
 * <p/>
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * The BoostingTermQuery is very similar to the
 * {@link org.apache.lucene.search.spans.SpanTermQuery} except
 * that it factors in the value of the payload located at each of the
 * positions where the {@link org.apache.lucene.index.Term} occurs.
 * <p>
 * In order to take advantage of this, you must override
 * {@link org.apache.lucene.search.Similarity#scorePayload(byte[],int,int)}
 * which returns 1 by default.
 * <p>
 * Payload scores are averaged across term occurrences in the document.
 *
 * <p><font color="#FF0000">
 * WARNING: The status of the <b>Payloads</b> feature is experimental.
 * The APIs introduced here might change in the future and will not be
 * supported anymore in such a case.</font>
 *
 * @see org.apache.lucene.search.Similarity#scorePayload(byte[], int, int)
 */
public class BoostingTermQuery extends TermQuery{
Term term;
Similarity similarity;

  public BoostingTermQuery(Term term) {
super(term);
this.term = term;

  }


  protected Weight createWeight(Searcher searcher) throws IOException {
this.similarity = getSimilarity(searcher);
return new BoostingTermWeight(this, searcher);
  }

  protected class BoostingTermWeight extends TermWeight implements Weight {


public BoostingTermWeight(BoostingTermQuery query, Searcher searcher)
throws IOException {
  super(searcher);
}




public Scorer scorer(IndexReader reader) throws IOException {
  return new BoostingTermScorer(reader.termDocs(term),
reader.termPositions(term), this, similarity,
 reader.norms(term.field()));
}

class BoostingTermScorer extends TermScorer {

  //TODO: is this the best way to allocate this?
  byte[] payload = new byte[256];
  private TermPositions positions;
  protected float payloadScore;
  private int payloadsSeen;

  public BoostingTermScorer(TermDocs termDocs, TermPositions
termPositions, Weight weight,
Similarity similarity, byte[] norms) throws
IOException {
  super(weight, termDocs, similarity, norms);
  positions = termPositions;

  }

  /**
   * Go to the next document
   *
   */
  public boolean next() throws IOException {

boolean result = super.next();
// set the payload.  super.next() properly increments the term positions
if (result) {
  if (positions.skipTo(super.doc())) {
  positions.nextPosition();
  processPayload(similarity);
  }
}

return result;
  }

  public boolean skipTo(int target) throws IOException {
boolean result = super.skipTo(target);

if (result) {
if (positions.skipTo(target)) {
positions.nextPosition();
  processPayload(similarity);
  }
}

return result;
  }

//  protected boolean setFreqCurrentDoc() throws IOException {
//if (!more) {
//  return false;
//}
//doc = spans.doc();
//freq = 0.0f;
//payloadScore = 0;
//payloadsSeen = 0;
//Similarity similarity1 = getSimilarity();
//

Re: Can I do boosting based on term positions?

2007-12-18 Thread Peter Keegan
This is a nice alternative to using payloads and BoostingTermQuery. Is there
any reason not to make this change to SpanFirstQuery, in particular:

This modification to SpanFirstQuery would be that the Spans
returned by SpanFirstQuery.getSpans() must always return 0
from its start() method.

Should I open a Jira issue?

Thanks,
Peter


On Aug 3, 2007 2:11 PM, Paul Elschot [EMAIL PROTECTED] wrote:

 On Friday 03 August 2007 20:35, Shailendra Sharma wrote:
  Paul,
 
  If I understand Cedric right, he wants to have different boosting
 depending
  on search term positions in the document. By using SpanFirstQuery he
 will
  only be able to consider terms up to a particular position;


  but he won't be
  able to do something like following:
a) Give 100% boosting to matching in first 100 words.
b) Give 80% boosting to matching in next 100 words.
c) Give 60% boosting to matching in next 100 words.

  Though it can be done by writing DisjunctionMaxQuery having multiple
  SpanFirstQuery with different boosting - but I see it as a workaround
 only
  and not the direct and efficient solution.

 You're right, but SpanFirstQuery needs only a minor modification
 for this to work.

 This modification to SpanFirstQuery would be that the Spans
 returned by SpanFirstQuery.getSpans() must always return 0
 from its start() method. Then the slop passed to sloppyFreq(slop)
 would be the distance from the beginning of the indexed field
 to the end of the Spans of the SpanQuery passed to SpanFirstQuery.

 Then the following should work:

 Term firstTerm =  ;

 SpanFirstQuery sfq = new SpanFirstQuery(
  new SpanTermQuery( firstTerm),
  Integer.MAX_VALUE) {
 ...
 public Similarity getSimilarity() {
 return new Similarity() {
 ...
 public float sloppyFreq(int slop) {
  return (slop < 100)  ? 1.0f
   : (slop < 200) ? 0.8f
   : (slop < 300) ? 0.6f
   : 0.4f ; // etc. etc.
 


 Actually, I'm a bit surprised that SpanFirstQuery does not work that
 way now.

 Regards,
 Paul Elschot


 
  Cedric,
 
  I am sending you the implementation of SpanTermQuery to your gmail
  account (lucene
  mailing list is bouncing email with attachment). I have named the class
 as
  VSpanTermQuery (I have followed the same package hierarchy as lucene).
 You
  also need to extend VSimilarity class - which would require
 implementation
  of method scoreSpan(..).
 
  Let me know how it went. Though I did some testing for it, before
  submitting to contrib I need to do extensive testing.
 
  Thanks,
  Shailendra
 
  On 8/3/07, Paul Elschot [EMAIL PROTECTED] wrote:
  
   Cedric,
  
   You can choose the end limit for SpanFirstQuery yourself.
  
   Regards,
   Paul Elschot
  
  
   On Friday 03 August 2007 05:38, Cedric Ho wrote:
Hi Paul,
   
Doesn't SpanFirstQuery only match those with position less than a
certain end position?
   
I am rather looking for a query that would score a document higher
 for
terms appear near the start but not totally discard those with terms
appear near the end.
   
Regards,
Cedric
   
On 8/2/07, Paul Elschot [EMAIL PROTECTED] wrote:
 Cedric,

 SpanFirstQuery could be a solution without payloads.
 You may want to give it your own Similarity.sloppyFreq() .

 Regards,
 Paul Elschot

 On Thursday 02 August 2007 04:07, Cedric Ho wrote:
  Thanks for the quick response =)
 
  On 8/1/07, Shailendra Sharma [EMAIL PROTECTED]
 wrote:
   Yes, it is easily doable through Payload facility. During
   indexing
 process
   (mainly tokenization), you need to push this extra information
 in
   each
   token. And then you can use BoostingTermQuery for using
 Payload
   value
   to
   include Payload in the score. You also need to implement
   Similarity
   for
 this
   (mainly scorePayload method).
 
  If I store, say, a custom boost factor as Payload, does it mean
 that
   I
  will store one more byte per term per document in the index
 file? So
  the index file would be much larger?
 
  
   Other way can be to extend SpanTermQuery, this already
 calculates
   the
   position of match. You just need to do something to use this
   position
 value
   in the score calculation.
 
  I see that SpanTermQuery takes a TermPositions from the
 indexReader
  and I can get the term position from there. However I am not
 sure
   how
  to incorporate it into the score calculation. Would you mind
 give a
  little more detail on this?
 
  
   One possible advantage of SpanTermQuery approach is that you
 can
   play
   around, without re-creating indices everytime.
  
   Thanks,
   Shailendra Sharma,
   CTO, Ver se' Innovation Pvt. Ltd.
   Bangalore, India
  
   On 8/1/07, Cedric Ho [EMAIL PROTECTED] wrote:
   
Hi all,
   
I was wondering if it is possible to do boosting by search
   terms'

Re: FieldSortedHitQueue rise in memory

2008-02-19 Thread Peter Keegan
Hi Brian,

I ran into something similar a long time ago. My custom sort objects were
being cached by Lucene, but there were too many of them because each one had
different 'reference values' for different queries. So, I changed the equals
and hashcode methods to NOT use any instance data, thus avoiding the
caching.
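
For what it's worth, the shape of that change looks roughly like this (the
class name and the 'reference' fields are invented, and the real comparator
construction is elided). FieldSortedHitQueue keys its comparator cache partly
on the comparator source's equals/hashCode, so making every instance compare
equal keeps that cache from accumulating per-query entries:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.search.SortComparatorSource;

import java.io.IOException;

class DistanceComparatorSource implements SortComparatorSource {
  private final double refLat, refLon;   // per-query 'reference values'

  DistanceComparatorSource(double refLat, double refLon) {
    this.refLat = refLat;
    this.refLon = refLon;
  }

  public ScoreDocComparator newComparator(IndexReader reader, String fieldname)
      throws IOException {
    // the real distance comparator would be built here from refLat/refLon
    return ScoreDocComparator.INDEXORDER;   // placeholder only
  }

  // No instance data in equals/hashCode: every instance looks the same to the
  // comparator cache, so the cache stops growing with each new query.
  public boolean equals(Object o) { return o instanceof DistanceComparatorSource; }
  public int hashCode() { return DistanceComparatorSource.class.hashCode(); }
}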

Could this be what you're seeing?

Peter


On Feb 18, 2008 4:20 PM, Brian Doyle [EMAIL PROTECTED] wrote:

 We've implemented a custom sort class and use it to sort by distance.   We
 have implemented the equals and hashcode in the sort comparator.   After
 running for a few hours we're reaching peak memory usage and eventually the
 server runs out of memory.   We did some profiling and noticed that a large
 chunk of memory is being used in the lucene.search.FieldSortedHitQueue
 class.   Has anyone seen this behavior before or know how we can stop this
 class from growing in size?



Re: Swapping between indexes

2008-03-06 Thread Peter Keegan
Sridhar,

We have been using approach 2 in our production system with good results. We
have separate processes for indexing and searching. The main issue that came
up was in deleting old indexes (see: http://tinyurl.com/32q8c4). Most of
our production problems occur during indexing, and we are able to fix these
without having to interrupt searching at all. This has been a real benefit.

Peter


On Thu, Mar 6, 2008 at 5:30 AM, Sridhar Raman [EMAIL PROTECTED]
wrote:

 This is my situation.  I have an index, which has a lot of search requests
 coming into it.  I use just a single instance of IndexSearcher to process
 these requests.  At the same time, this index is also getting updated by
 an
 IndexWriter.  And I want these new changes to be reflected _only_ at
 certain
 intervals.  I have thought of a few ways of doing this.  Each has its
 share
 of problems and pluses.  I would be glad if someone can help me in
 figuring
 out the right approach, especially from the performance point of view, as
 the number of documents that will get indexed are pretty large.

 Approach 1:
 Have just one copy of the index for both Search & Index.  At time T, when
 I
 need to see the new changes reflected, I close the Searcher, and open it
 again.
 - The re-open of the Searcher might be a bit slow (which I could probably
 solve by using some warm-up threads).
 - Update and Search on the index at the same time - will this affect the
 performance?
 - If server crashes before time T, the new Searcher would reflect the
 changes, which is not acceptable.  I want the changes to be reflected only
 at time T.  If server crashes, the index should be the previous T-1 index.
 - Possible problems while optimising the index (as Search is also
 happening).
 + Just one copy of the index being stored.

 Approach 2:
 Keep 2 copies of the index - 1 for Search, 1 for Index.  At time T, I just
 switch the Searcher to a copy of the index that is being updated.
 - Before I do the switch to the new index, I need to make a copy of it so
 that the updates continue to happen on the other index.  Is there a
 convenient way to make this copy?  Is it efficient?
 - Time taken to create a new Searcher will still be a problem (but this is
 a
 problem in the previous approach as well, and we can live with it).
 + Optimise can happen on an index that is not being read, as a result, its
 resource requirements would be lesser.  And probably even the speed of
 optimisation.
 + Faster search as the index update is happening on a different index.

 So, these are the 2 approaches I am contemplating about.  Any pointers
 which
 would be the better approach?

 Thanks,
 Sridhar



theoretical maximum score

2008-05-09 Thread Peter Keegan
Is it possible to compute a theoretical maximum score for a given query if
constraints are placed on 'tf' and 'lengthNorm'? If so, scores could be
compared to a 'perfect score' (a feature request from our customers)

Here are some related threads on this:

In this thread:

http://www.nabble.com/Newbie-questions-re%3A-scoring-td4228776.html#a4228776

Hoss writes:

 the only way I can think of to fairly compare scores from queries for
 foo:bar with queries for yak:baz is to normalize them relative a maximum
 possible score across the entire term query space -- but finding that
 maximum is a pretty complicated problem just for simple term queries ...
 when you start talking about more complicated query structures you really
 get messy -- and even then it's only fair as long as the query structures
 are identical, you can never compare the scores from apples and oranges

And in this thread:

http://www.nabble.com/non-relative-scoring-td8956299.html#a8956299

Walt writes:

 A tf.idf engine, like Lucene, might not have a maximum score.
 What if a document contains the word a thousand times?
 A million times?

It seems that if 'tf' is limited to a max value and 'lengthNorm' is a
constant, it might be possible, at least for 'simple' term queries. But Hoss
says that things get messy with complicated queries.

Could someone elaborate a bit? Does the index contain enough info to do this
efficiently?
I realize that score values must be interpreted 'carefully', but I'm seeing
a push to get more leverage from the absolute values, not just the relative
values.

Peter


Payloads and SpanScorer

2008-07-09 Thread Peter Keegan
If a SpanQuery is constructed from one or more BoostingTermQuery(s), the
payloads on the terms are never processed by the SpanScorer. It seems to me
that you would want the SpanScorer to score the document both on the spans
distance and the payload score. So, either the SpanScorer would have to
process the payloads (duplicating the code in BoostingSpanScorer), or
perhaps SpanScorer could access the BoostingSpanScorers, or maybe there's
another approach.

Any thoughts on how to accomplish this?

Peter


Re: Payloads and SpanScorer

2008-07-10 Thread Peter Keegan
Suppose I create a SpanNearQuery phrase with the terms long range missiles
and some slop factor. Each term is actually a BoostingTermQuery. Currently,
the score computed by SpanNearQuery.SpanScorer is based on the sloppy
frequency of the terms and their weights (this is fine). But even though
each term is actually a BoostingTermQuery, the BoostingTermScorer (and
therefore 'processPayload') is never invoked for this type of query.

I was looking for a way to have SpanNearQuery (also SpanOrQuery,
SpanFirstQuery) recognize that the terms in the phrase should boost the
overall score based on the payloads assigned to them. Thus the score from
the SpanNearQuery would be higher if :

a) the terms have payloads that boost their scores
b) the terms are positionally next to each other (minimal slop - as it works
now)


Does this make sense?

Peter

On Thu, Jul 10, 2008 at 9:21 AM, Grant Ingersoll [EMAIL PROTECTED]
wrote:

 I'm not fully following what you want.  Can you explain a bit more?

 Thanks,
 Grant


 On Jul 9, 2008, at 2:55 PM, Peter Keegan wrote:

  If a SpanQuery is constructed from one or more BoostingTermQuery(s), the
 payloads on the terms are never processed by the SpanScorer. It seems to
 me
 that you would want the SpanScorer to score the document both on the spans
 distance and the payload score. So, either the SpanScorer would have to
 process the payloads (duplicating the code in BoostingSpanScorer), or
 perhaps SpanScorer could access the BoostingSpanScorers, or maybe there's
 another approach.

 Any thoughts on how to accomplish this?

 Peter


 --
 Grant Ingersoll
 http://www.lucidimagination.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ








 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Payloads and SpanScorer

2008-07-10 Thread Peter Keegan
I may take a crack at this. Any more thoughts you may have on the
implementation are welcome, but I don't want to distract you too much.

Thanks,
Peter


On Thu, Jul 10, 2008 at 1:30 PM, Grant Ingersoll [EMAIL PROTECTED]
wrote:

 Makes sense.  It was always my intent to implement things like
 PayloadNearQuery, see http://wiki.apache.org/lucene-java/Payload_Planning

 I think it would make sense to develop these and I would be happy to help
 shepherd a patch through, but am not in a position to generate said patch at
 this moment in time.


 On Jul 10, 2008, at 9:59 AM, Peter Keegan wrote:

  Suppose I create a SpanNearQuery phrase with the terms long range
 missiles
 and some slop factor. Each term is actually a BoostingTermQuery.
 Currently,
 the score computed by SpanNearQuery.SpanScorer is based on the sloppy
 frequency of the terms and their weights (this is fine). But even though
 each term is actually a BoostingTermQuery, the BoostingTermScorer (and
 therefore 'processPayload') is never invoked for this type of query.

 I was looking for a way to have SpanNearQuery (also SpanOrQuery,
 SpanFirstQuery) recognize that the terms in the phrase should boost the
 overall score based on the payloads assigned to them. Thus the score from
 the SpanNearQuery would be higher if :

 a) the terms have payloads that boost their scores
 b) the terms are positionally next to each other (minimal slop - as it
 works
 now)


 Does this make sense?

 Peter

 On Thu, Jul 10, 2008 at 9:21 AM, Grant Ingersoll [EMAIL PROTECTED]
 wrote:

  I'm not fully following what you want.  Can you explain a bit more?

 Thanks,
 Grant


 On Jul 9, 2008, at 2:55 PM, Peter Keegan wrote:

 If a SpanQuery is constructed from one or more BoostingTermQuery(s), the

 payloads on the terms are never processed by the SpanScorer. It seems to
 me
 that you would want the SpanScorer to score the document both on the
 spans
 distance and the payload score. So, either the SpanScorer would have to
 process the payloads (duplicating the code in BoostingSpanScorer), or
 perhaps SpanScorer could access the BoostingSpanScorers, or maybe
 there's
 another approach.

 Any thoughts on how to accomplish this?

 Peter


 --
 Grant Ingersoll
 http://www.lucidimagination.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ








 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



 --
 Grant Ingersoll
 http://www.lucidimagination.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ








 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Payloads and SpanScorer

2008-07-19 Thread Peter Keegan
I discovered this post from Karl Wettin in May about SpanNearQuery scoring:
http://www.nabble.com/SpanNearQuery-scoring-td17425454.html#a17425454

Karl apparently had the same expectations I had about the usage model of
spans and boosts. I also found JIRA issue 533 (SpanQuery scoring: SpanWeight
lacks a recursive traversal of the query tree), which addresses the same
problem.

So, I made an attempt to modify SpanNearQuery to expand a nested
BoostingTermQuery, but soon realized while debugging that since
BoostingTermQuery loads payloads from all term positions in the document,
not just the ones constrained by the outer SpanQuery, the resulting score
could be higher than it should be.

Next, I followed Grant's idea of providing span classes that read payloads.
I implemented a 'BoostingNearQuery' that extends 'SpanNearQuery' that
provides term boosts on proximity queries. I will submit a patch to a JIRA
later. This patch works but probably needs more work. I don't like the use
of 'instanceof', but I didn't want to touch Spans or TermSpans. Also, the
payload code is mostly a copy of what's in BoostingTermQuery and could be
common-sourced somewhere. Feel free to throw darts at it :)

Peter



On Thu, Jul 10, 2008 at 2:09 PM, Peter Keegan [EMAIL PROTECTED]
wrote:

 I may take a crack at this. Any more thoughts you may have on the
 implementation are welcome, but I don't want to distract you too much.

 Thanks,
 Peter



 On Thu, Jul 10, 2008 at 1:30 PM, Grant Ingersoll [EMAIL PROTECTED]
 wrote:

 Makes sense.  It was always my intent to implement things like
 PayloadNearQuery, see http://wiki.apache.org/lucene-java/Payload_Planning

 I think it would make sense to develop these and I would be happy to help
 shepherd a patch through, but am not in a position to generate said patch at
 this moment in time.


 On Jul 10, 2008, at 9:59 AM, Peter Keegan wrote:

  Suppose I create a SpanNearQuery phrase with the terms long range
 missiles
 and some slop factor. Each term is actually a BoostingTermQuery.
 Currently,
 the score computed by SpanNearQuery.SpanScorer is based on the sloppy
 frequency of the terms and their weights (this is fine). But even though
 each term is actually a BoostingTermQuery, the BoostingTermScorer (and
 therefore 'processPayload') is never invoked for this type of query.

 I was looking for a way to have SpanNearQuery (also SpanOrQuery,
 SpanFirstQuery) recognize that the terms in the phrase should boost the
 overall score based on the payloads assigned to them. Thus the score from
 the SpanNearQuery would be higher if :

 a) the terms have payloads that boost their scores
 b) the terms are positionally next to each other (minimal slop - as it
 works
 now)


 Does this make sense?

 Peter

 On Thu, Jul 10, 2008 at 9:21 AM, Grant Ingersoll [EMAIL PROTECTED]
 wrote:

  I'm not fully following what you want.  Can you explain a bit more?

 Thanks,
 Grant


 On Jul 9, 2008, at 2:55 PM, Peter Keegan wrote:

 If a SpanQuery is constructed from one or more BoostingTermQuery(s), the

 payloads on the terms are never processed by the SpanScorer. It seems
 to
 me
 that you would want the SpanScorer to score the document both on the
 spans
 distance and the payload score. So, either the SpanScorer would have to
 process the payloads (duplicating the code in BoostingSpanScorer), or
 perhaps SpanScorer could access the BoostingSpanScorers, or maybe
 there's
 another approach.

 Any thoughts on how to accomplish this?

 Peter


 --
 Grant Ingersoll
 http://www.lucidimagination.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ








 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



 --
 Grant Ingersoll
 http://www.lucidimagination.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ








 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





BoostingTermQuery scoring

2008-11-04 Thread Peter Keegan
I'm using BoostingTermQuery to boost the score of documents with terms
containing payloads (boost value > 1). I'd like to change the scoring
behavior such that if a query contains multiple BoostingTermQuery terms
(either required or optional), documents containing more matching terms with
payloads always score higher than documents with fewer terms with payloads.
Currently, if one of the terms has a high IDF weight and contains a boosting
payload but no payloads on other matching terms, it may score higher than
docs with other matching terms with payloads and lower IDF.

I think what I need is a way to increase the weight of a matching term in
BoostingSpanScorer.score() if 'payloadsSeen > 0', but I don't see how to do
this. Any suggestions?

Thanks,
Peter


Re: BoostingTermQuery scoring

2008-11-06 Thread Peter Keegan
Let me give some background on the problem behind my question.

Our index contains many fields (title, body, date, city, etc). Most queries
search all fields, but for best performance, we create an additional
'contents' field that contains all terms from all fields so that only one
field needs to be searched. Some fields, like title and city, are boosted by
a factor of 5. In order to make term boosting work, we create an additional
field 'boost' that contains all the terms from the boosted fields (title,
city).

Then, at search time, a query for petroleum engineer gets rewritten to:
(+contents:petroleum +contents:engineer) (+boost:petroleum +boost:engineer).
Note that the two clauses are OR'd so that a term that exists in both fields
will get a higher weight in the 'boost' field. This works quite well at
boosting documents with terms that exist in the boosted fields. However, it
doesn't work properly if excluded terms are added, for example:

(+contents:petroleum +contents:engineer -contents:drilling)
(+boost:petroleum +boost:engineer -boost:drilling)

If a document contains the term 'drilling' in the 'body' field, but not in
the 'title' or 'city' field, a false hit occurs.

Enter payloads and 'BoostingTermQuery'. At indexing time, as terms are added
to the 'contents' field, they are assigned a payload (value=5) if the term
also exists in one of the boosted fields. The 'scorePayload' method in our
Similarity class returns the payload value as a score. The query no longer
contains the 'boost' fields and is simply:

+contents:petroleum +contents:engineer -contents:drilling
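
Roughly, the two pieces involved look like this. The filter, the precomputed
set of boosted terms, and the class names are illustrative, not our exact
code; the scorePayload signature is the one referenced earlier in this thread:

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;
import org.apache.lucene.search.DefaultSimilarity;

import java.io.IOException;

// Index time: mark 'contents' tokens that also occur in a boosted field
// (title, city) with a one-byte payload of 5.
class BoostPayloadFilter extends TokenFilter {
  private final java.util.Set boostedTerms;   // terms from title/city, assumed precomputed

  BoostPayloadFilter(TokenStream input, java.util.Set boostedTerms) {
    super(input);
    this.boostedTerms = boostedTerms;
  }

  public Token next() throws IOException {
    Token t = input.next();
    if (t != null && boostedTerms.contains(t.termText())) {
      t.setPayload(new Payload(new byte[] { 5 }));
    }
    return t;
  }
}

// Search time: return the stored payload byte as the payload score.
class PayloadSimilarity extends DefaultSimilarity {
  public float scorePayload(byte[] payload, int offset, int length) {
    return length > 0 ? payload[offset] : 1.0f;
  }
}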

The goal is to make the payload technique behavior similar to the 'boost'
field technique. The problem is that relevance scores of the top hits are
sometimes quite different. The reason is that the IDF values for a given
term in the 'boost' field is often much higher than the same term in the
'contents' field. This makes sense because the 'boost' field contains a
fairly small subset of the 'contents' field. Even with a payload of '5', a
low IDF in the 'contents' field usually erases the effect of the payload.

I have found a fairly simple (albeit inelegant) solution that seems to work.
The 'boost' field is still created as before, but it is only used to compute
IDF values for the weight class 'BoostingTermQuery.BoostingTermWeight. I had
to make this class 'public' so that I could override the IDF value as
follows:

public class MNSBoostingTermQuery extends BoostingTermQuery {
  public MNSBoostingTermQuery(Term term) {
super(term);
  }
  protected class MNSBoostingTermWeight extends
BoostingTermQuery.BoostingTermWeight {
public MNSBoostingTermWeight(BoostingTermQuery query, Searcher searcher)
throws IOException {
  super(query, searcher);
  java.util.HashSet<Term> newTerms = new java.util.HashSet<Term>();
  // Recompute IDF based on 'boost' field
  Iterator i = terms.iterator();
  Term term=null;
  while (i.hasNext()) {
term = (Term)i.next();
newTerms.add(new Term(boost, term.text()));
  }
  this.idf = this.query.getSimilarity(searcher).idf(newTerms, searcher);
}
  }
}

Any thoughts about a better implementation are welcome.

Peter




On Thu, Nov 6, 2008 at 8:00 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:

 Not sure, but it sounds like you are interested in a higher level Query,
 kind of like the BooleanQuery, but then part of it sounds like it is per
 document, right?  Is it that you want to deal with multiple payloads in a
 document, or multiple BTQs in a bigger query?

 On Nov 4, 2008, at 9:42 AM, Peter Keegan wrote:

  I'm using BoostingTermQuery to boost the score of documents with terms
 containing payloads (boost value > 1). I'd like to change the scoring
 behavior such that if a query contains multiple BoostingTermQuery terms
 (either required or optional), documents containing more matching terms
 with
 payloads always score higher than documents with fewer terms with
 payloads.
 Currently, if one of the terms has a high IDF weight and contains a
 boosting
 payload but no payloads on other matching terms, it may score higher than
 docs with other matching terms with payloads and lower IDF.

 I think what I need is a way to increase the weight of a matching term in
 BoostingSpanScorer.score() if 'payloadsSeen > 0', but I don't see how to
 do
 this. Any suggestions?

 Thanks,
 Peter


 --
 Grant Ingersoll


 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ










 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: BoostingTermQuery scoring

2008-11-06 Thread Peter Keegan
I've discovered another flaw in using this technique:

(+contents:petroleum +contents:engineer +contents:refinery)
(+boost:petroleum +boost:engineer +boost:refinery)

It's possible that the first clause will produce a matching doc and none of
the terms in the second clause are used to score that doc. Yet another
reason to use BoostingTermQuery.

Peter


On Thu, Nov 6, 2008 at 1:08 PM, Peter Keegan [EMAIL PROTECTED] wrote:

 Let me give some background on the problem behind my question.

 Our index contains many fields (title, body, date, city, etc). Most queries
 search all fields, but for best performance, we create an additional
 'contents' field that contains all terms from all fields so that only one
 field needs to be searched. Some fields, like title and city, are boosted by
 a factor of 5. In order to make term boosting work, we create an additional
 field 'boost' that contains all the terms from the boosted fields (title,
 city).

 Then, at search time, a query for petroleum engineer gets rewritten to:
 (+contents:petroleum +contents:engineer) (+boost:petroleum +boost:engineer).
 Note that the two clauses are OR'd so that a term that exists in both fields
 will get a higher weight in the 'boost' field. This works quite well at
 boosting documents with terms that exist in the boosted fields. However, it
 doesn't work properly if excluded terms are added, for example:

 (+contents:petroleum +contents:engineer -contents:drilling)
 (+boost:petroleum +boost:engineer -boost:drilling)

 If a document contains the term 'drilling' in the 'body' field, but not in
 the 'title' or 'city' field, a false hit occurs.

 Enter payloads and 'BoostingTermQuery'. At indexing time, as terms are
 added to the 'contents' field, they are assigned a payload (value=5) if the
 term also exists in one of the boosted fields. The 'scorePayload' method in
 our Similarity class returns the payload value as a score. The query no
 longer contains the 'boost' fields and is simply:

 +contents:petroleum +contents:engineer -contents:drilling

 The goal is to make the payload technique behavior similar to the 'boost'
 field technique. The problem is that relevance scores of the top hits are
 sometimes quite different. The reason is that the IDF values for a given
 term in the 'boost' field is often much higher than the same term in the
 'contents' field. This makes sense because the 'boost' field contains a
 fairly small subset of the 'contents' field. Even with a payload of '5', a
 low IDF in the 'contents' field usually erases the effect of the payload.

 I have found a fairly simple (albeit inelegant) solution that seems to
 work. The 'boost' field is still created as before, but it is only used to
 compute IDF values for the weight class
 'BoostingTermQuery.BoostingTermWeight. I had to make this class 'public' so
 that I could override the IDF value as follows:

 public class MNSBoostingTermQuery extends BoostingTermQuery {
   public MNSBoostingTermQuery(Term term) {
 super(term);
   }
   protected class MNSBoostingTermWeight extends
 BoostingTermQuery.BoostingTermWeight {
 public MNSBoostingTermWeight(BoostingTermQuery query, Searcher
 searcher) throws IOException {
   super(query, searcher);
   java.util.HashSet<Term> newTerms = new java.util.HashSet<Term>();
   // Recompute IDF based on 'boost' field
   Iterator i = terms.iterator();
   Term term=null;
   while (i.hasNext()) {
 term = (Term)i.next();
 newTerms.add(new Term(boost, term.text()));
   }
   this.idf = this.query.getSimilarity(searcher).idf(newTerms,
 searcher);
 }
   }
 }

 Any thoughts about a better implementation are welcome.

 Peter





 On Thu, Nov 6, 2008 at 8:00 AM, Grant Ingersoll [EMAIL PROTECTED]wrote:

 Not sure, but it sounds like you are interested in a higher level Query,
 kind of like the BooleanQuery, but then part of it sounds like it is per
 document, right?  Is it that you want to deal with multiple payloads in a
 document, or multiple BTQs in a bigger query?

 On Nov 4, 2008, at 9:42 AM, Peter Keegan wrote:

  I'm using BoostingTermQuery to boost the score of documents with terms
 containing payloads (boost value > 1). I'd like to change the scoring
 behavior such that if a query contains multiple BoostingTermQuery terms
 (either required or optional), documents containing more matching terms
 with
 payloads always score higher than documents with fewer terms with
 payloads.
 Currently, if one of the terms has a high IDF weight and contains a
 boosting
 payload but no payloads on other matching terms, it may score higher than
 docs with other matching terms with payloads and lower IDF.

 I think what I need is a way to increase the weight of a matching term in
 BoostingSpanScorer.score() if 'payloadsSeen > 0', but I don't see how to
 do
 this. Any suggestions?

 Thanks,
 Peter


 --
 Grant Ingersoll


 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java

Re: Boosting results

2008-11-07 Thread Peter Keegan
If you sort first by score, keep in mind that the raw scores are very
precise and you could see many unique values in the result set. The
secondary sort field would only be used to break equal scores. We had to use
a custom comparator to 'smooth out' the scores to allow the second field to
take effect.
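
The 'smoothing' comparator was along these lines (the rounding granularity and
the names are invented for illustration; the point is only that scores equal
after rounding tie, so the next SortField breaks the tie):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.search.SortComparatorSource;
import org.apache.lucene.search.SortField;

import java.io.IOException;

// Compares scores rounded to two decimal places so that near-equal scores
// fall through to the secondary sort field.
class SmoothedScoreComparatorSource implements SortComparatorSource {
  public ScoreDocComparator newComparator(IndexReader reader, String fieldname)
      throws IOException {
    return new ScoreDocComparator() {
      public int compare(ScoreDoc i, ScoreDoc j) {
        int a = Math.round(i.score * 100.0f);
        int b = Math.round(j.score * 100.0f);
        return b - a;                         // higher (rounded) score first
      }
      public Comparable sortValue(ScoreDoc i) {
        return new Float(Math.round(i.score * 100.0f) / 100.0f);
      }
      public int sortType() { return SortField.CUSTOM; }
    };
  }
}

// Possible usage (the field name is arbitrary here, the comparator ignores it):
// new Sort(new SortField[] {
//     new SortField("relevance", new SmoothedScoreComparatorSource()),
//     new SortField("date", SortField.STRING, true) });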

Peter


On Fri, Nov 7, 2008 at 11:17 AM, Scott Smith [EMAIL PROTECTED]wrote:

 Well, it's not like sorting hadn't occurred to me.  Unfortunately, what
 I recalled was that you could only sort results on one field (I do date
 sorted searches all the time in my application).  I should have gone
 back and looked.  My memory failed me as I can see that you can sort on
 multiple fields and score (aka relevancy) is one of the pseudo fields.
 That'll work.

 Thanks.

 Scott

 -Original Message-
 From: Erick Erickson [mailto:[EMAIL PROTECTED]
 Sent: Friday, November 07, 2008 5:59 AM
 To: java-user@lucene.apache.org
 Subject: Re: Boosting results

 duh, sorting. I absolutely love it when I overlook the obvious <G>.

 [EMAIL PROTECTED]

 On Fri, Nov 7, 2008 at 4:58 AM, Michael McCandless 
 [EMAIL PROTECTED] wrote:

 
  Couldn't you just do a single Query that sorts first by category and
 second
  by relevance?
 
  Mike
 
 
  Erick Erickson wrote:
 
   It seems to me that the easiest thing would be to fire two queries
 and
  then just concatenate the results
 
  category:A AND body:fred
 
  category:B AND body:fred
 
 
  If you really, really didn't want to fire two queries, you could
 create
  filters on category A and category B and make a couple of
  passes through your results seeing if the returned documents were in
  the filter, but you'd still concatenate the results. Actually in your
  specific example you could make one filter on A.
 
  You could also consider a custom scorer that added 1,000,000 to
 every
  category A document.
 
  How much were you boosting by? What happens if you boost by a very
 large
  factor?
  As in ridiculously large?
 
  Best
  Erick
 
  On Thu, Nov 6, 2008 at 7:42 PM, Scott Smith
 [EMAIL PROTECTED]
  wrote:
 
   I'm interested in comments on the following problem.
 
 
 
  I have a set of documents.  They fall into 3 categories.  Call these
  categories A, B, and C.  Each document has an indexed, non-tokenized
  field called category which contains A, B, or C (they are mutually
  exclusive categories).
 
 
 
  All of the documents contain a field called body which contains a
  bunch of text.  This field is indexed and tokenized.
 
 
 
  So, I want to do a search which looks something like:
 
 
 
  (category:A OR category:B) AND body:fred
 
 
 
  I want all of the category A documents to come before the category B
  documents.  Effectively, I want to have the category A documents
 first
  (sorted by relevancy) and then the category B documents after
 (sorted by
  relevancy).
 
 
 
  I thought I could do this by boosting the category portion of the
 query,
  but that doesn't seem to work consistently.  I was setting the boost
 on
  the category A term to 1.0 and the boost on the category B term to
 0.0.
 
 
 
  Any thoughts how to skin this?
 
 
 
  Scott
 
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: BoostingTermQuery scoring

2008-11-07 Thread Peter Keegan
 boost:(+petroleum +engineer +refinery)
 (+contents:(+petroleum +engineer +refinery)
  +((*:* -boost:petroleum)
(*:* -boost:engineer)
(*:* -boost:refinery)))

That's an interesting solution. Would this result in many more documents
being visited by the scorer, possibly impacting performance? (I haven't
tried it yet).

Thanks,
Peter



On Thu, Nov 6, 2008 at 6:56 PM, Steven A Rowe [EMAIL PROTECTED] wrote:

 Hi Peter,

 On 11/06/2008 at 4:25 PM, Peter Keegan wrote:
  I've discovered another flaw in using this technique:
 
  (+contents:petroleum +contents:engineer +contents:refinery)
  (+boost:petroleum +boost:engineer +boost:refinery)
 
  It's possible that the first clause will produce a matching
  doc and none of the terms in the second clause are used to
  score that doc. Yet another reason to use BoostingTermQuery.

 I think you could address this, without BTQ, using something like:

  boost:(+petroleum +engineer +refinery)
  (+contents:(+petroleum +engineer +refinery)
   +((*:* -boost:petroleum)
 (*:* -boost:engineer)
 (*:* -boost:refinery)))

 The last three lines gives you the set of documents that are missing at
 least one of the terms in the boost field.  The *:* thingy, indicating a
 MatchAllDocsQuery, is necessary to get all documents that don't have a given
 term; Lucene's (sub-)query document exclusion operation needs a non-empty
 set on which to operate.

 On 11/06/2008 at 1:08 PM, Peter Keegan wrote:
  Then, at search time, a query for petroleum engineer gets rewritten
  to: (+contents:petroleum +contents:engineer) (+boost:petroleum
  +boost:engineer). Note that the two clauses are OR'd so that a term that
  exists in both fields will get a higher weight in the 'boost' field.
  This works quite well at boosting documents with terms that exist in the
  boosted fields. However, it doesn't work properly if excluded terms are
  added, for example:
 
  (+contents:petroleum +contents:engineer -contents:drilling)
  (+boost:petroleum +boost:engineer -boost:drilling)
 
  If a document contains the term 'drilling' in the 'body'
  field, but not in the 'title' or 'city' field, a false hit occurs.

 I think you could address this problem like this:

  +(boost:(+petroleum +engineer)
(+contents:(+petroleum +engineer)
 +((*:* -boost:petroleum)
   (*:* -boost:engineer
  -contents:drilling

 You don't have to include -boost:drilling, because this condition is
 entailed by -contents:drilling.

 Steve

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Payloads

2008-12-29 Thread Peter Keegan
Hi Karl,

I use payloads for weight only, too, with BoostingTermQuery (see:
http://www.nabble.com/BoostingTermQuery-scoring-td20323615.html#a20323615)

A custom tokenizer looks for the reserved character '\b' followed by a 2
byte 'boost' value. It then creates a special Token type for a custom filter
which sets the payload on the token. Another reserved character '\t' is used
to set a position increment value. Since we often boost multiple tokens in
the stream, the payload 'boost' value is reapplied to subsequent tokens
until a 'boost' value of '0' is encountered, which disables payloads.
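
The filter half looks roughly like this. The marker token type, and the
assumption that the tokenizer has already turned '\b' plus the boost value
into such a marker token, are illustrative rather than our exact code:

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

import java.io.IOException;

// Remembers the current boost and stamps it on every following token until a
// boost of 0 switches payloads off again.
class BoostMarkerFilter extends TokenFilter {
  private byte currentBoost = 0;             // 0 = payloads disabled

  BoostMarkerFilter(TokenStream input) { super(input); }

  public Token next() throws IOException {
    Token t;
    while ((t = input.next()) != null) {
      if ("boost".equals(t.type())) {        // marker token from the tokenizer, consume it
        currentBoost = (byte) Integer.parseInt(t.termText());
        continue;
      }
      if (currentBoost != 0) {
        t.setPayload(new Payload(new byte[] { currentBoost }));
      }
      return t;
    }
    return null;
  }
}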

This is a bit messy and I agree that it would be nice to come up with a nice
API for this.

Peter

On Fri, Dec 26, 2008 at 8:22 PM, Karl Wettin karl.wet...@gmail.com wrote:

 I would very much like to hear how people use payloads.

 Personally I use them for weight only. And I use them a lot, almost in all
 applications. I factor the weight of synonyms, stems, dediacritization and
 what not. I create huge indices that contain lots of tokens at the same
 position but with different weights. I might for instance create the stream
 (1)motörhead^1, (0)motorhead^0.7 and I'll do this at both index and
 query time, i.e. I use the payload weight to calculate both payload weight
 used by the BoostingTermQuery scorer AND to set the boost in the query at
 the same time.

 In order to handle this I use an interface that looks something like this:

 public interface PayloadWeightHandler {
  public void setWeight(Token token, float weight);
  public float getWeight(Token token);
 }

 In order to use this I had to patch pretty much any filter I use and pass
 down a weight factor, something like:

 TokenStream ts = analyzer.tokenStream(f, new StringReader(motörhead ace of
 spaces));
 ts = new SynonymTokenFilter(ts, synonyms, 0.7f);
 ts = new StemmerFilter(ts, 0.7f);
 ts = new ASCIIFoldingFilter(ts, 0.5f);

 All these filters would, if applicable, create new synonym tokens with
 slightly less weight than the input rather than replace token content:

 (1)mötorhead^1, (0)motorhead^0.5, (1)ace^1, (1)of^1, (1)spades^1,
 (1)spad^0.7

 I usually use 4 byte floats while creating the stream and then convert it
 to 8 bit floats in a final filter before adding it to the document.

 Is anyone else doing something similar? It would be nice to normalize this
 and perhaps come up with a reusable API for this. It would also be cool if
 all the existing filters could be rewritten to handle this stuff.

 I find it to be extemely useful when creating indices with rather niched
 content such as song titles, names of people, street addresses, et c. For
 the last year or so I've done several (3) commercial implementations where I
 try to extend the index with incorrect typed queries but unique enough that
 it does not interfere with the quality of the results. It has been very
 successful, people get great responses in great time even though they enter
 an incorrect query.

 On a side note, in these implementaions I've completely replaced phrase
 queries using shingles. ShingleMatrixQuery has some built in goodies for
 calculating weight. Combined with SSD I see awesome results with very short
 response time even in fairly large indices (10M-100M documents). I'm talking
 about 100ms-500ms for rather complex queries under heavy load.


  karl
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




queryNorm affect on score

2009-02-20 Thread Peter Keegan
The explanation of scores from the same document returned from 2 similar
queries differ in an unexpected way. There are 2 fields involved, 'contents'
and 'literals'. The 'literals' field has setBoost = 0. As you can see from
the explanations below, the total weight of the matching terms from the
'literal' field is 0. However, the weights produced by the matching terms in
the 'contents' field is very different, even with the same matching terms.
The reason is that the 'queryNorm' value is very different because the
'sumOfSquaredWeights' is very different. Why is this?
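
For reference, DefaultSimilarity computes queryNorm as
1/sqrt(sumOfSquaredWeights), and each clause contributes roughly
(idf * boost)^2 to that sum, so the extra 'literals:jb$' clause in the second
query scales every other clause's weight down. A small worked check against
the numbers below (pure arithmetic, not a Lucene API):

class QueryNormDemo {
  // same formula as DefaultSimilarity.queryNorm()
  static float queryNorm(float sumOfSquaredWeights) {
    return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
  }

  public static void main(String[] args) {
    // 0.20666377 and 0.0675604 (from the explains below) correspond to
    // sumOfSquaredWeights of about 23.4 and 219.1 respectively.
    System.out.println(queryNorm(23.4f));    // ~0.2067
    System.out.println(queryNorm(219.1f));   // ~0.0676
  }
}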

First query: +(+contents:sales +contents:representative) +literals:jb$1
Explanation:
32.274593  sum of:
  32.274593  sum of:
10.336284  weight(contents:sales in 14578), product of:
  0.54963183  queryWeight(contents:sales), product of:
2.6595461  idf(contents: sales=83179)
0.20666377  queryNorm
  18.805832  fieldWeight(contents:sales in 14578), product of:
7.071068  btq, product of:
  1.4142135  tf(phraseFreq=3.0)
  5.0  scorePayload(...)
2.6595461  idf(contents: sales=83179)
1.0  fieldNorm(field=contents, doc=14578)
21.93831  weight(contents:representative in 14578), product of:
  0.8007395  queryWeight(contents:representative), product of:
3.8746004  idf(contents: representative=24678)
0.20666377  queryNorm
  27.397562  fieldWeight(contents:representative in 14578), product of:
7.071068  btq, product of:
  1.4142135  tf(phraseFreq=2.0)
  5.0  scorePayload(...)
3.8746004  idf(contents: representative=24678)
1.0  fieldNorm(field=contents, doc=14578)
  0.0  weight(literals:jb$1 in 14578), product of:
0.23816177  queryWeight(literals:jb$1), product of:
  1.1524118  idf(docFreq=375455, numDocs=436917)
  0.20666377  queryNorm
0.0  fieldWeight(literals:jb$1 in 14578), product of:
  1.0  tf(termFreq(literals:jb$1)=1)
  1.1524118  idf(docFreq=375455, numDocs=436917)
  0.0  fieldNorm(field=literals, doc=14578)


Second query: +(+contents:sales +contents:representative) +(literals:jb$1
literals:jb$)
Explanation:
10.550879  sum of:
  10.550879  sum of:
3.3790317  weight(contents:sales in 14578), product of:
  0.17967999  queryWeight(contents:sales), product of:
2.6595461  idf(contents: sales=83179)
0.0675604  queryNorm
  18.805832  fieldWeight(contents:sales in 14578), product of:
7.071068  btq, product of:
  1.4142135  tf(phraseFreq=3.0)
  5.0  scorePayload(...)
2.6595461  idf(contents: sales=83179)
1.0  fieldNorm(field=contents, doc=14578)
7.171847  weight(contents:representative in 14578), product of:
  0.26176953  queryWeight(contents:representative), product of:
3.8746004  idf(contents: representative=24678)
0.0675604  queryNorm
  27.397562  fieldWeight(contents:representative in 14578), product of:
7.071068  btq, product of:
  1.4142135  tf(phraseFreq=2.0)
  5.0  scorePayload(...)
3.8746004  idf(contents: representative=24678)
1.0  fieldNorm(field=contents, doc=14578)
  0.0  product of:
0.0  sum of:
  0.0  weight(literals:jb$1 in 14578), product of:
0.0778574  queryWeight(literals:jb$1), product of:
  1.1524118  idf(docFreq=375455, numDocs=436917)
  0.0675604  queryNorm
0.0  fieldWeight(literals:jb$1 in 14578), product of:
  1.0  tf(termFreq(literals:jb$1)=1)
  1.1524118  idf(docFreq=375455, numDocs=436917)
  0.0  fieldNorm(field=literals, doc=14578)
0.5  coord(1/2)





Peter


Re: queryNorm affect on score

2009-02-27 Thread Peter Keegan
Any comments about this? Is this just the way queryNorm works or is this a
bug?

Thanks,
Peter

On Fri, Feb 20, 2009 at 4:03 PM, Peter Keegan peterlkee...@gmail.comwrote:


 The explanation of scores from the same document returned from 2 similar
 queries differ in an unexpected way. There are 2 fields involved, 'contents'
 and 'literals'. The 'literals' field has setBoost = 0. As you can see from
 the explanations below, the total weight of the matching terms from the
 'literal' field is 0. However, the weights produced by the matching terms in
 the 'contents' field is very different, even with the same matching terms.
 The reason is that the 'queryNorm' value is very different because the
 'sumOfSquaredWeights' is very different. Why is this?

 First query: +(+contents:sales +contents:representative) +literals:jb$1
 Explanation:
 32.274593  sum of:
   32.274593  sum of:
 10.336284  weight(contents:sales in 14578), product of:
   0.54963183  queryWeight(contents:sales), product of:
 2.6595461  idf(contents: sales=83179)
 0.20666377  queryNorm
   18.805832  fieldWeight(contents:sales in 14578), product of:
 7.071068  btq, product of:
   1.4142135  tf(phraseFreq=3.0)
   5.0  scorePayload(...)
 2.6595461  idf(contents: sales=83179)
 1.0  fieldNorm(field=contents, doc=14578)
 21.93831  weight(contents:representative in 14578), product of:
   0.8007395  queryWeight(contents:representative), product of:
 3.8746004  idf(contents: representative=24678)
 0.20666377  queryNorm
   27.397562  fieldWeight(contents:representative in 14578), product of:
 7.071068  btq, product of:
   1.4142135  tf(phraseFreq=2.0)
   5.0  scorePayload(...)
 3.8746004  idf(contents: representative=24678)
 1.0  fieldNorm(field=contents, doc=14578)
   0.0  weight(literals:jb$1 in 14578), product of:
 0.23816177  queryWeight(literals:jb$1), product of:
   1.1524118  idf(docFreq=375455, numDocs=436917)
   0.20666377  queryNorm
 0.0  fieldWeight(literals:jb$1 in 14578), product of:
   1.0  tf(termFreq(literals:jb$1)=1)
   1.1524118  idf(docFreq=375455, numDocs=436917)
   0.0  fieldNorm(field=literals, doc=14578)


 Second query: +(+contents:sales +contents:representative) +(literals:jb$1
 literals:jb$)
 Explanation:
 10.550879  sum of:
   10.550879  sum of:
 3.3790317  weight(contents:sales in 14578), product of:
   0.17967999  queryWeight(contents:sales), product of:
 2.6595461  idf(contents: sales=83179)
 0.0675604  queryNorm
   18.805832  fieldWeight(contents:sales in 14578), product of:
 7.071068  btq, product of:
   1.4142135  tf(phraseFreq=3.0)
   5.0  scorePayload(...)
 2.6595461  idf(contents: sales=83179)
 1.0  fieldNorm(field=contents, doc=14578)
 7.171847  weight(contents:representative in 14578), product of:
   0.26176953  queryWeight(contents:representative), product of:
 3.8746004  idf(contents: representative=24678)
 0.0675604  queryNorm
   27.397562  fieldWeight(contents:representative in 14578), product of:
 7.071068  btq, product of:
   1.4142135  tf(phraseFreq=2.0)
   5.0  scorePayload(...)
 3.8746004  idf(contents: representative=24678)
 1.0  fieldNorm(field=contents, doc=14578)
   0.0  product of:
 0.0  sum of:
   0.0  weight(literals:jb$1 in 14578), product of:
 0.0778574  queryWeight(literals:jb$1), product of:
   1.1524118  idf(docFreq=375455, numDocs=436917)
   0.0675604  queryNorm
 0.0  fieldWeight(literals:jb$1 in 14578), product of:
   1.0  tf(termFreq(literals:jb$1)=1)
   1.1524118  idf(docFreq=375455, numDocs=436917)
   0.0  fieldNorm(field=literals, doc=14578)
 0.5  coord(1/2)





 Peter



Re: queryNorm affect on score

2009-02-27 Thread Peter Keegan
Got it. This is another example of why scores can't be compared between
(even similar) queries.
 (we don't)

Thanks.

On Fri, Feb 27, 2009 at 11:39 AM, Yonik Seeley
yo...@lucidimagination.comwrote:

 On Fri, Feb 27, 2009 at 9:15 AM, Peter Keegan peterlkee...@gmail.com
 wrote:
  Any comments about this? Is this just the way queryNorm works or is this
 a
  bug?

 That's just the way it works... since it's applied to all clauses, it
 really just changes the range of scores returned, not relative
 ordering of documents or anything.

 -Yonik
 http://www.lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: queryNorm affect on score

2009-02-28 Thread Peter Keegan
 in situations where you  deal with simple query types, and matching query
structures, the queryNorm
 *can* be used to make scores semi-comparable.

Hmm. My example used matching query structures. The only difference was a
single term in a field with zero weight that didn't exist in the matching
document. But one score was 3X the other.

Peter

On Sat, Feb 28, 2009 at 12:35 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : I guess I don't really understand this comment in the similarity java doc
 : then:
 :
 :
 http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
 :
 : *queryNorm(q) * is a normalizing factor used to make scores between
 queries
 : comparable.

 that comment should probably be removed ... in situations where you
 deal with simple query types, and matching query structures, the queryNorm
 *can* be used to make scores semi-comparable.

 To be 100% correct about what the queryNorm does in all cases: it
 normalizes each of the constituent values that are used in the score
 computation relative to the other constituent values.  the main value I've
 seen from it is that it prevents a loss of floating point accuracy that
 can result from addition/multiplication of large values.



 -Hoss


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: queryNorm affect on score

2009-03-01 Thread Peter Keegan
As suggested, I added a query-time boost of 0.0f to the 'literals' field
(with index-time boost still there) and I did get the same scores for both
queries :)  (there is a subtlety between index-time and query-time boosting
that I missed.)

I also tried disabling the coord factor, but that had no effect on the
score, when combined with the above. This seems ok in this example since
the matching terms had boost = 0.
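
Spelled out, roughly (terms as in the earlier explains; the only
Lucene-specific bits are the BooleanQuery(true) constructor, which disables
coord, and setBoost(0.0f) on the sub-query):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

class ZeroBoostExample {
  static BooleanQuery build() {
    BooleanQuery contents = new BooleanQuery();
    contents.add(new TermQuery(new Term("contents", "sales")), BooleanClause.Occur.MUST);
    contents.add(new TermQuery(new Term("contents", "representative")), BooleanClause.Occur.MUST);

    // true = disable the coord factor (the 0.5 coord(1/2) seen above)
    BooleanQuery literals = new BooleanQuery(true);
    literals.add(new TermQuery(new Term("literals", "jb$1")), BooleanClause.Occur.SHOULD);
    literals.add(new TermQuery(new Term("literals", "jb$")), BooleanClause.Occur.SHOULD);
    literals.setBoost(0.0f);   // query-time boost of 0 on the whole clause

    BooleanQuery top = new BooleanQuery();
    top.add(contents, BooleanClause.Occur.MUST);
    top.add(literals, BooleanClause.Occur.MUST);
    return top;
  }
}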

Thanks Yonik,
Peter



On Sat, Feb 28, 2009 at 6:02 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Sat, Feb 28, 2009 at 3:02 PM, Peter Keegan peterlkee...@gmail.com
 wrote:
  in situations where you  deal with simple query types, and matching
 query
  structures, the queryNorm
  *can* be used to make scores semi-comparable.
 
  Hmm. My example used matching query structures. The only difference was a
  single term in a field with zero weight that didn't exist in the matching
  document. But one score was 3X the other.

 But the zero boost was an index-time boost, and the queryNorm takes
 into account query-time boosts and idfs.  You might get closer to what
 you expect with a query time boost of 0.0f

 The other thing affecting the score is the coord factor - the fact
 that fewer of the optional terms matched (1/2) lowers the score.  The
 coordination factor can be disabled on any BooleanQuery.

 If you do both of the above, I *think* you would get the same scores
 for this specific example.

 -Yonik
 http://www.lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: queryNorm affect on score

2009-03-02 Thread Peter Keegan
If I set the boost=0 at query time and the query contains only terms with
boost=0, the scores are NaN (because weight.queryNorm = 1/0 = infinity),
instead of 0.

Peter


On Sun, Mar 1, 2009 at 9:27 PM, Erick Erickson erickerick...@gmail.comwrote:

 FWIW, Hossman pointed out that the difference between index and
 query time boosts is that index time boosts on title, for instance,
 express "I care about this document's title more than other documents'
 titles" [when it matches]. Query time boosts express "I care about matches
 on the title field more than matches on other fields."

 Best
 Erick

 On Sun, Mar 1, 2009 at 8:57 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  As suggested, I added a query-time boost of 0.0f to the 'literals' field
  (with index-time boost still there) and I did get the same scores for
 both
  queries :)  (there is a subtlety between index-time and query-time
 boosting
  that I missed.)
 
  I also tried disabling the coord factor, but that had no affect on the
  score, when combined with the above. This seems ok in this example since
  the
  the matching terms had boost = 0.
 
  Thanks Yonik,
  Peter
 
 
 
  On Sat, Feb 28, 2009 at 6:02 PM, Yonik Seeley 
 yo...@lucidimagination.com
  wrote:
 
   On Sat, Feb 28, 2009 at 3:02 PM, Peter Keegan peterlkee...@gmail.com
   wrote:
in situations where you  deal with simple query types, and matching
   query
structures, the queryNorm
*can* be used to make scores semi-comparable.
   
Hmm. My example used matching query structures. The only difference
 was
  a
single term in a field with zero weight that didn't exist in the
  matching
document. But one score was 3X the other.
  
   But the zero boost was an index-time boost, and the queryNorm takes
   into account query-time boosts and idfs.  You might get closer to what
   you expect with a query time boost of 0.0f
  
   The other thing affecting the score is the coord factor - the fact
   that fewer of the optional terms matched (1/2) lowers the score.  The
   coordination factor can be disabled on any BooleanQuery.
  
   If you do both of the above, I *think* you would get the same scores
   for this specific example.
  
   -Yonik
   http://www.lucidimagination.com
  
   -
   To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-user-h...@lucene.apache.org
  
  
 



sloppyFreq question

2009-03-03 Thread Peter Keegan
The DefaultSimilarity class defines sloppyFreq as:

public float sloppyFreq(int distance) {
  return 1.0f / (distance + 1);
}

For a 'SpanNearQuery', this reduces the effect of the term frequency on the
score as the number of terms in the span increases. So, for a simple phrase
query (using spans), the longer the phrase, the lower the TF. For a simple
SpanTermQuery, the TF is reduced in half (1.0f / 1 + 1).

I'm just wondering why this is the default behavior. For 'SpanTermQuery',
I'd expect the TF to reflect the actual number of occurrences of the term.
For a SpanNearQuery, wouldn't it still be the number of occurrences of the
whole span, not the number of terms in the span?

Thanks,
Peter
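
For what it's worth, plugging numbers into that default (assuming the span
scorer passes the match length, end - start, as 'distance'):

class SloppyFreqDemo {
  // DefaultSimilarity.sloppyFreq, copied from the message above.
  static float sloppyFreq(int distance) {
    return 1.0f / (distance + 1);
  }

  public static void main(String[] args) {
    System.out.println(sloppyFreq(1)); // single-term span (end - start = 1): 0.5
    System.out.println(sloppyFreq(3)); // exact 3-term phrase span:           0.25
  }
}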

