Re: docMap array in SegmentMergeInfo

2005-10-11 Thread Peter Keegan
On a multi-cpu system, this loop to build the docMap array can cause severe thread thrashing because of the synchronized method 'isDeleted'. I have observed this on an index with over 1 million documents (which contains a few thousand deleted docs) when multiple threads perform a search with
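For readers unfamiliar with the structure being discussed, here is a minimal sketch of a docMap-style loop in plain Java. It is not the Lucene source, and the contents of the patch mentioned later in this thread are not shown in these snippets. The class name DocMapSketch and the use of java.util.BitSet as a private snapshot of the deleted bits are assumptions made for illustration. The point is that the loop reads deletion state from a local snapshot rather than calling a synchronized method once per document.

```java
import java.util.Arrays;
import java.util.BitSet;

public class DocMapSketch {
    // Maps old doc numbers to new ones; -1 marks a deleted document.
    // deletedSnapshot is a thread-private copy (an assumption of this sketch),
    // so no monitor is acquired inside the loop.
    public static int[] buildDocMap(BitSet deletedSnapshot, int maxDoc) {
        int[] docMap = new int[maxDoc];
        int next = 0;
        for (int i = 0; i < maxDoc; i++) {
            docMap[i] = deletedSnapshot.get(i) ? -1 : next++;
        }
        return docMap;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(1);
        deleted.set(3);
        // prints [0, -1, 1, -1, 2]
        System.out.println(Arrays.toString(buildDocMap(deleted, 5)));
    }
}
```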

Re: docMap array in SegmentMergeInfo

2005-10-12 Thread Peter Keegan
Here is one stack trace: Full thread dump Java HotSpot(TM) Client VM (1.5.0_03-b07 mixed mode): Thread-6 prio=5 tid=0x6cf7a7f0 nid=0x59e50 waiting for monitor entry [0x6d2cf000..0x6d2cfd6c] at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:241) - waiting to lock 0x04e40278 (a

Re: docMap array in SegmentMergeInfo

2005-10-13 Thread Peter Keegan
Hi Yonik, Your patch has corrected the thread thrashing problem on multi-cpu systems. I've tested it with both 1.4.3 and 1.9. I haven't seen 100X performance gain, but that's because I'm caching QueryFilters and Lucene is caching the sort fields. Thanks for the fast response! btw, I had

Re: Throughput doesn't increase when using more concurrent threads

2006-01-25 Thread Peter Keegan
This is just fyi - in my stress tests on an 8-cpu box (that's 8 real cpus), the maximum throughput occurred with just 4 query threads. The query throughput decreased with fewer than 4 or greater than 4 query threads. The entire index was most likely in the file system cache, too. Periodic

Re: Throughput doesn't increase when using more concurrent threads

2006-01-25 Thread Peter Keegan
It's a 3GHz Intel box with Xeon processors, 64GB ram :) Peter On 1/25/06, Yonik Seeley [EMAIL PROTECTED] wrote: Thanks Peter, that's useful info. Just out of curiosity, what kind of box is this? what CPUs? -Yonik On 1/25/06, Peter Keegan [EMAIL PROTECTED] wrote: This is just fyi

Re: Throughput doesn't increase when using more concurrent threads

2006-01-25 Thread Peter Keegan
Yes, it's hyperthreaded (16 cpus show up in task manager - the box is running 2003). I plan to turn off hyperthreading to see if it has any effect. Peter On 1/25/06, Yonik Seeley [EMAIL PROTECTED] wrote: On 1/25/06, Peter Keegan [EMAIL PROTECTED] wrote: It's a 3GHz Intel box with Xeon

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
PROTECTED] wrote: On Wednesday 25 January 2006 20:51, Peter Keegan wrote: The index is non-compound format and optimized. Yes, I did try MMapDirectory, but the index is too big - 3.5 GB (1.3GB is term vectors) Peter You could also give this a try: http://issues.apache.org/jira/browse

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Ray, The throughput is worse with NioFSDirectory than with the FSDirectory (patched and unpatched). The bottleneck still seems to be synchronization, this time in NioFile.getChannel (7 of the 8 threads were blocked there during one snapshot). I tried this with 4 and 8 channels. The throughput

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Java 1.5) -Yonik On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote: Paul, I tried this but it ran out of memory trying to read the 500Mb .fdt file. I tried various values for MAX_BBUF, but it still ran out of memory (I'm using -Xmx1600M, which is the jvm's maximum value (v1.5)) I'll

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
speedup! The extra registers in 64 bit mode may have helped a little too. -Yonik On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote: Correction: make that 285 qps :)

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
engines, but I'm obviously still learning thanks to this group. Peter On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote: Peter, Wow, the speed up is impressive! But may I ask what did you do to achieve 135 queries/sec prior to the JVM switch? ray, On 1/27/06, Peter Keegan [EMAIL PROTECTED

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
? Thanks! ray, On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote: Ray, The short answer is that you can make Lucene blazingly fast by using advice and design principles mentioned in this forum and of course reading 'Lucene in Action'. For example, use a 'content' field for searching all

Re: Throughput doesn't increase when using more concurrent threads

2006-01-30 Thread Peter Keegan
: Peter Keegan wrote: I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now getting 250 queries/sec and excellent cpu utilization (equal concurrency on all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't aware of it. Wow. That's fast. Out

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
to figure out what pages to swap in and which to swap out, esp of the memory mapped files. You could also try a profiler on both platforms to try and see where the difference is. -Yonik On 2/22/06, Peter Keegan [EMAIL PROTECTED] wrote: I am doing a performance comparison of Lucene on Linux vs

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
PROTECTED] wrote: Peter, Have you given JRockit JVM a try? I've seen it help throughput compared to Sun's JVM on a dual xeon/linux machine, especially with concurrency (up to 6 concurrent searches happening). I'm curious to see if it makes a difference for you. -chris On 2/23/06, Peter Keegan

Re: Throughput doesn't increase when using more concurrent threads

2006-02-23 Thread Peter Keegan
On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote: Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system next (32 with hyperthreading), on LinTel. I may give JRockit another go around then. Thanks, Peter

Re: Throughput doesn't increase when using more concurrent threads

2006-03-07 Thread Peter Keegan
MMapDirectory, does this retrieval need to be synchronized? Peter On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote: Yonik, We're investigating both approaches. Yes, the resources (and permutations) are dizzying! Peter On 2/23/06, Yonik Seeley [EMAIL PROTECTED] wrote: Wow, some resources

Re: Throughput doesn't increase when using more concurrent threads

2006-03-10 Thread Peter Keegan
) at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65) at org.apache.lucene.search.Hits.init(Hits.java:52) at org.apache.lucene.search.Searcher.search(Searcher.java:62) On 3/7/06, Doug Cutting [EMAIL PROTECTED] wrote: Peter Keegan wrote: I ran a query performance tester against 8-cpu and 16-cpu Xeon servers

Re: Throughput doesn't increase when using more concurrent threads

2006-03-13 Thread Peter Keegan
/06, Peter Keegan [EMAIL PROTECTED] wrote: 3. Use the ThreadLocal's FieldReader in the document() method. As I understand it, this means that the document method no longer needs to be synchronized, right? I've made these changes and it does appear to improve performance. Random

Re: Throughput doesn't increase when using more concurrent threads

2006-03-13 Thread Peter Keegan
Chris, My apologies - this error was apparently caused by a file format mismatch (probably line endings). Thanks, Peter On 3/13/06, Peter Keegan [EMAIL PROTECTED] wrote: Chris, Should this patch work against the current code base? I'm getting this error: D:\lucene-1.9>patch -b -p0 -i nio

Re: Good MMapDirectory performance

2006-03-14 Thread Peter Keegan
- I read from Peter Keegan's recent postings: - The Lucene server is using MMapDirectory. I'm running - the jvm with -Xmx16000M. Peak memory usage of the jvm - on Linux is about 6GB and 7.8GB on windows. - We don't have nearly as much memory as Peter but I - wonder whether he is gaining anything

Re: Throughput doesn't increase when using more concurrent threads

2006-03-17 Thread Peter Keegan
handily at 400 qps. Peter On 3/13/06, Peter Keegan [EMAIL PROTECTED] wrote: Chris, My apologies - this error was apparently caused by a file format mismatch (probably line endings). Thanks, Peter On 3/13/06, Peter Keegan [EMAIL PROTECTED] wrote: Chris, Should this patch work

Re: Non scoring search

2006-03-17 Thread Peter Keegan
I experimented with this by using a Similarity class that returns a constant (1) for all values and found that had no noticeable effect on query performance. Peter On 12/6/05, Chris Hostetter [EMAIL PROTECTED] wrote: : I was wondering if there is a standard way to retrieve documents WITHOUT :

Re: Throughput doesn't increase when using more concurrent threads

2006-04-05 Thread Peter Keegan
the segments to disk with 'addIndexes'. This resulted in a speed improvement of 27%. Peter On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote: Peter Keegan wrote: I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now getting 250 queries/sec and excellent cpu utilization (equal

Re: MultiReader and MultiSearcher

2006-04-11 Thread Peter Keegan
Yonik, Could you explain why an IndexSearcher constructed from multiple readers is faster than a MultiSearcher constructed from same readers? Thanks, Peter On 4/10/06, Yonik Seeley [EMAIL PROTECTED] wrote: On 4/10/06, oramas martín [EMAIL PROTECTED] wrote: Is there any performance (or

Re: MultiReader and MultiSearcher

2006-04-11 Thread Peter Keegan
Does this mean that MultiReader doesn't merge the search results and sort the results as if there was only one index? If not, does it simply concatenate the results? Peter On 4/11/06, Yonik Seeley [EMAIL PROTECTED] wrote: On 4/11/06, Peter Keegan [EMAIL PROTECTED] wrote: Could you explain

Re: MultiReader and MultiSearcher

2006-04-12 Thread Peter Keegan
IndexSearcher(indexStoreB); searchers[1] = new IndexSearcher(indexStoreA); Sorry about that, Peter On 4/11/06, Doug Cutting [EMAIL PROTECTED] wrote: Peter Keegan wrote: Oops. I meant to say: Does this mean that an IndexSearcher constructed from a MultiReader doesn't merge the search

Re: question about custom sort method

2006-05-17 Thread Peter Keegan
Suppose I have a custom sorting 'DocScoreComparator' for computing distances on each search hit from a specified coordinate (similar to the DistanceComparatorSource example in LIA). Assume that the 'specified coordinate' is different for each query. This means a new custom comparator must be

Re: MMapDirectory vs RAMDirectory

2006-06-07 Thread Peter Keegan
, a reference to the '.tis' file remains. Peter On 6/5/06, Daniel Noll [EMAIL PROTECTED] wrote: Peter Keegan wrote: There is no 'unmap' method, so my understanding is that the file mapping is valid until the underlying buffer is garbage-collected. However, forcing the gc doesn't help. You're half

Re: Aggregating category hits

2006-06-09 Thread Peter Keegan
I compared Solr's DocSetHitCollector and counting bitset intersections to get facet counts with a different approach that uses a custom hit collector that tests each docid hit (bit) with each facets' bitset and increments a count in a histogram. My assumption was that for queries with few hits,
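The two facet-counting approaches being compared can be sketched in plain Java as follows. This is only an illustration: it uses java.util.BitSet rather than Solr's DocSet classes or a real Lucene HitCollector, and the class and method names (FacetHistogramSketch, collect, countByIntersection) are hypothetical.

```java
import java.util.BitSet;

public class FacetHistogramSketch {
    private final BitSet[] facetBits; // facetBits[f].get(doc) is true if doc belongs to facet f
    private final int[] counts;       // histogram: one counter per facet

    public FacetHistogramSketch(BitSet[] facetBits) {
        this.facetBits = facetBits;
        this.counts = new int[facetBits.length];
    }

    // Per-hit approach: called once for each matching doc id,
    // testing that doc against every facet's bitset.
    public void collect(int docId) {
        for (int f = 0; f < facetBits.length; f++) {
            if (facetBits[f].get(docId)) {
                counts[f]++;
            }
        }
    }

    public int[] counts() {
        return counts;
    }

    // Intersection approach: AND the whole hit set with each facet's bitset
    // and count the surviving bits.
    public static int[] countByIntersection(BitSet hits, BitSet[] facetBits) {
        int[] out = new int[facetBits.length];
        for (int f = 0; f < facetBits.length; f++) {
            BitSet tmp = (BitSet) hits.clone();
            tmp.and(facetBits[f]);
            out[f] = tmp.cardinality();
        }
        return out;
    }
}
```

Both methods produce the same counts. The per-hit version does work proportional to hits times facets, while the intersection version does work proportional to index size times facets, which is why their relative speed can depend on how many hits a query returns.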

Re: Aggregating category hits

2006-06-12 Thread Peter Keegan
) no. facets: 100 on every query I'm not using the Solr server as we have already developed an infrastructure. Peter On 6/10/06, Yonik Seeley [EMAIL PROTECTED] wrote: On 6/9/06, Peter Keegan [EMAIL PROTECTED] wrote: However, my throughput testing shows that the Solr method is at least 50

Re: Does more memory help Lucene?

2006-06-12 Thread Peter Keegan
See my note about overlapping indexing documents with merging: http://www.gossamer-threads.com/lists/lucene/java-user/34188?search_string=%2Bkeegan%20%2Baddindexes;#34188 Peter On 6/12/06, Michael D. Curtin [EMAIL PROTECTED] wrote: Nadav Har'El wrote: Otis Gospodnetic [EMAIL PROTECTED]

Re: Aggregating category hits

2006-06-14 Thread Peter Keegan
qps. This is great stuff Solr guys! I'd love to see the DocSet and DocList features added to Lucene's IndexSearcher. Peter On 6/12/06, Peter Keegan [EMAIL PROTECTED] wrote: I'm seeing query throughput of approx. 290 qps with OpenBitSet vs. 270 with BitSet. I had to reduce the max. HashDocSet

Re: Lucene 2.0.1 release date

2006-10-18 Thread Peter Keegan
This makes it relatively safe for people to grab a snapshot of the trunk with less concern about latent bugs. I think the concern is that if we start doing this stuff on trunk now, people that are accustomed to snapping from the trunk might be surprised, and not in a good way. +1 on this.

Announcement: Lucene powering Monster job search index (Beta)

2006-10-27 Thread Peter Keegan
I am pleased to announce the launch of Monster's new job search Beta web site, powered by Lucene, at: http://jobsearch.beta.monster.com (notice the Lucene logo at the bottom of the page!). The jobs index is implemented with Java Lucene 2.0 on 64-bit Windows (AMD and Intel processors) Here are

Re: Announcement: Lucene powering Monster job search index (Beta)

2006-10-27 Thread Peter Keegan
be accomplished with Solr's FunctionQuery, but I haven't tried that yet. Peter -- Chris Lu - Instant Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http://search.dbsight.com On 10/27/06, Peter Keegan [EMAIL PROTECTED] wrote: I am pleased

Re: Announcement: Lucene powering Monster job search index (Beta)

2006-10-30 Thread Peter Keegan
Gospodnetic [EMAIL PROTECTED] wrote: Hi, --- Peter Keegan [EMAIL PROTECTED] wrote: On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote: Hi, Peter, Really great job! Thanks. (I'll tell the team) If it's not a secret, can you tell us a bit more about what's behind the search in terms of hardware

Re: Announcement: Lucene powering Monster job search index (Beta)

2006-10-30 Thread Peter Keegan
/06, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi, --- Peter Keegan [EMAIL PROTECTED] wrote: On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote: Hi, Peter, Really great job! Thanks. (I'll tell the team) If it's not a secret, can you tell us a bit more about what's behind

Re: Announcement: Lucene powering Monster job search index (Beta)

2006-10-30 Thread Peter Keegan
that aren't in the requested range(s). A goal was to do this without having to modify Lucene. Our scheme is pretty efficient, but not very general purpose in its current form, though. Peter On 10/30/06, Joe Shaw [EMAIL PROTECTED] wrote: Hi Peter, On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote

Re: Announcement: Lucene powering Monster job search index (Beta)

2006-10-30 Thread Peter Keegan
distance by miles part of the relevancy of the search results? Could you comment or confirm my assertion ? Thanks :) On 10/28/06, Peter Keegan [EMAIL PROTECTED] wrote: On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote: Hi, Peter, Really great job! Thanks. (I'll tell the team) I am

Re: Announcement: Lucene powering Monster job search index (Beta)

2006-11-03 Thread Peter Keegan
If possible give some code snippet for custom HitCollector. TIA Sri Peter Keegan [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Joe, Fields with numeric values are stored in a separate file as binary values in an internal format. Lucene is unaware of this file and unaware of the range

Re: Announcement: Lucene powering Monster job search index (Beta)

2006-11-03 Thread Peter Keegan
current form, though. Peter On 10/30/06, Joe Shaw [EMAIL PROTECTED] wrote: Hi Peter, On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote: Numeric range search is one of Lucene's weak points (performance-wise) so we have implemented this with a custom HitCollector and an extension

Re: Announcement: Lucene powering Monster job search index (Beta)

2007-01-28 Thread Peter Keegan
(post hit collector). I don't have any performance numbers with the double vs single distance calc. I'm still working out the sort by radius myself. Mark On 11/3/06, Peter Keegan [EMAIL PROTECTED] wrote: Daniel, Yes, this is correct if you happen to be doing a radius search and sorting

Re: Announcement: Lucene powering Monster job search index (Beta)

2007-01-30 Thread Peter Keegan
tried to check your search it was down. We were talking the other day at work how job search was lacking among the big boards. I'm excited to check out your new page. Mark On 1/28/07, Peter Keegan [EMAIL PROTECTED] wrote: Correction: We only do the Euclidean computation during sorting

bad queryparser bug

2007-02-01 Thread Peter Keegan
I have discovered a serious bug in QueryParser. The following query: contents:sales && contents:marketing || contents:industrial && contents:sales is parsed as: +contents:sales +contents:marketing +contents:industrial +contents:sales The same parsed query occurs even with parentheses:

Re: bad queryparser bug

2007-02-01 Thread Peter Keegan
Correction: The query parser produces the correct query with the parentheses. But, I'm still looking for a fix for this. I could use some advice on where to look in QueryParser to fix this. Thanks, Peter On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote: I have discovered a serious bug

Re: bad queryparser bug

2007-02-01 Thread Peter Keegan
, Peter Keegan [EMAIL PROTECTED] wrote: Correction: The query parser produces the correct query with the parentheses. But, I'm still looking for a fix for this. I could use some advice on where to look in QueryParser to fix this. Thanks, Peter On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote: I

Re: bad queryparser bug

2007-02-02 Thread Peter Keegan
(If i could go back in time and stop the AND/OR/NOT/&&/|| aliases from being added to the QueryParser -- i would) Yes, this is the cause of the confusion. Our users are accustomed to the boolean logic syntax from a legacy search engine (also common to many other engines). We'll have to convert

Re: relevancy buckets and secondary searching

2007-02-05 Thread Peter Keegan
Hi Erick, The timing of your posting is ironic because I'm currently working on the same issue. Here's a solution that I'm going to try: Use a HitCollector with a PriorityQueue to sort all hits by raw Lucene score, ignoring the secondary sort field. After the search, re-sort just the hits from

Re: Sorting by Score

2007-02-27 Thread Peter Keegan
Suppose one wanted to use this custom rounding score comparator on all fields and all queries. How would you get it plugged in most efficiently, given that SortField requires a non-null field name? Peter On 2/1/06, Chris Hostetter [EMAIL PROTECTED] wrote: : I've not used the sorting code

Re: Sorting by Score

2007-02-27 Thread Peter Keegan
I'm building up the Sort object for the search with 2 SortFields - first is for the custom rounded scoring, second is for date. This Sort object is used to construct a FieldSortedHitQueue which is used with a custom HitCollector. And yes, this comparator ignores the field name. hmmm, actually i

Re: Sorting by Score

2007-02-28 Thread Peter Keegan
can't you pick any arbitrary marker field name (that's not a real field name) and use that? Yes, I could. I guess you're saying that the field name doesn't matter, except that it's used for caching the comparator, right? ... he wants the bucketing to happen as part of the scoring so that the

Re: Sorting by Score

2007-02-28 Thread Peter Keegan
Erick, Yes, this seems to be the simplest way to implement score 'bucketization', but wouldn't it be more efficient to do this with a custom ScoreComparator? That way, you'd do the bucketizing and sorting in one 'step' (compare()). Maybe the savings isn't measurable, though. A comparator might

Re: Sorting by Score

2007-03-01 Thread Peter Keegan
so I didn't pursue it. One of my pet peeves is spending time making things more efficient when there's no need, and my index isn't going to grow enough larger to worry about that now G... Erick On 2/28/07, Peter Keegan [EMAIL PROTECTED] wrote: Erick, Yes, this seems to be the simplest

Re: Lucene Ranking/scoring

2007-03-08 Thread Peter Keegan
I'm looking at how ReciprocalFloatFunction and ReverseOrdFieldSource can be used to rank documents by score and date (solr.search.function contains great stuff!). The values in the date field that are used for the ValueSource are not actually used as 'floats', but rather their ordinal term values

Re: Announcement: Lucene powering Monster job search index (Beta)

2007-03-16 Thread Peter Keegan
as well though, otherwise you will obtain perhaps highly relevant hits reported to the user outside the range they specified? Particularly as the search radius gets larger. Cheers, Dan On 1/28/07, Peter Keegan [EMAIL PROTECTED] wrote: Correction: We only do the Euclidean computation during sorting

Re: Announcement: Lucene powering Monster job search index (Beta)

2007-03-16 Thread Peter Keegan
Note: this is a reply to a posting to java-dev --Peter Eric, Now that it is live, is performance pretty good? Performance is outstanding. Each server can easily handle well over 100 qps on an index of over 800K documents. There are several servers (4 dual core (8 CPU) Opteron) supporting

Re: Lucene search performance: linear?

2007-03-21 Thread Peter Keegan
On a similar topic, has anybody measured query performance as a function of index size? Well, I did and the results surprised me. I measured query throughput on 8 indexes that varied in size from 55,000 to 4.4 million documents. When plotted on a graph, there is a distinct hyperbolic curve (1/x).

Re: FieldSortedHitQueue enhancement

2007-03-29 Thread Peter Keegan
: Peter Keegan [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, March 29, 2007 9:39:13 AM Subject: FieldSortedHitQueue enhancement This is a request for an enhancement to FieldSortedHitQueue/PriorityQueue that would prevent duplicate documents from being inserted, or alternatively

Re: FieldSortedHitQueue enhancement

2007-03-29 Thread Peter Keegan
Yes, my custom query processor can sometimes make 2 Lucene search calls which may result in duplicate docs being inserted on the same PQ. The simplest solution is to make lessThan public. I'm curious to know if anyone else is performing multiple searches under the covers. Peter On 3/29/07,

Re: FieldSortedHitQueue enhancement

2007-03-29 Thread Peter Keegan
(). Peter, how did you achieve 'last wins' as you must presumably remove first from the PQ? Antony Peter Keegan wrote: The duplicate check would just be on the doc ID. I'm using TreeSet to detect duplicates with no noticeable effect on performance. The PQ only has to be checked
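A duplicate check on doc ID in front of a priority queue could look roughly like the sketch below. It uses java.util.PriorityQueue and a hypothetical Hit class instead of Lucene's FieldSortedHitQueue, and it implements the simpler 'first wins' policy: a later hit for an already-seen doc ID is dropped. How 'last wins' was achieved is not visible in this snippet, so it is not attempted here.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeSet;

public class DedupingHitQueue {
    // Hypothetical hit record; not a Lucene class.
    public static final class Hit {
        public final int doc;
        public final float score;

        public Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private static final Comparator<Hit> BY_SCORE =
            (a, b) -> Float.compare(a.score, b.score);

    private final TreeSet<Integer> seenDocs = new TreeSet<Integer>();
    private final PriorityQueue<Hit> queue = new PriorityQueue<Hit>(BY_SCORE);

    // Returns false (and leaves the queue unchanged) if this doc ID was already inserted.
    public boolean insert(Hit hit) {
        if (!seenDocs.add(hit.doc)) {
            return false;
        }
        queue.add(hit);
        return true;
    }

    public int size() {
        return queue.size();
    }
}
```

This is useful when two searches feed the same queue, as described earlier in the thread: the second search can insert freely and duplicates are filtered by the set lookup.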

Re: Sorting on a field that can have null values

2007-04-13 Thread Peter Keegan
excluding them completely is a slightly different task, you don't need to index a special marker value, you can just use a RangeFilter (or ConstantScoreRangeQuery) to ensure you only get docs with a value for that field (ie: field:[* TO *]) Excellent, this is a much better solution. BTW, adding

Re: optimization behaviour

2007-05-10 Thread Peter Keegan
Of course, that doesn't have to be the case. It would be a trivial change to merge segments and not remove the deleted docs. That usecase could be useful in conjunction with ParallelReader. If the behavior of deleted docs during merging or optimization ever changes, please make this

Payloads and PhraseQuery

2007-06-27 Thread Peter Keegan
I'm looking at the new Payload api and would like to use it in the following manner. Meta-data is indexed as a special phrase (all terms at same position) and a payload is stored with the first term of each phrase. I would like to create a custom query class that extends PhraseQuery and uses its

Re: Payloads and PhraseQuery

2007-06-29 Thread Peter Keegan
and pass the payload to the Scorer as well is a possibility. - Mark Peter Keegan wrote: I'm looking at the new Payload api and would like to use it in the following manner. Meta-data is indexed as a special phrase (all terms at same position) and a payload is stored with the first term of each phrase

Re: Payloads and PhraseQuery

2007-07-11 Thread Peter Keegan
to produce a score? Just guessing here.. At some point, I would like to see more Query classes around the payload stuff, so please submit patches/feedback if and when you get a solution On Jun 27, 2007, at 10:45 AM, Peter Keegan wrote: I'm looking at the new Payload api and would like to use

Re: Payloads and PhraseQuery

2007-07-12 Thread Peter Keegan
I'm looking for Spans.getPositions(), as shown in BoostingTermQuery, but neither NearSpansOrdered nor NearSpansUnordered (which are the Spans provided by SpanNearQuery) provide this method and it's not clear to me how to add it. Peter On 7/11/07, Chris Hostetter [EMAIL PROTECTED] wrote: :

Re: Payloads and PhraseQuery

2007-07-12 Thread Peter Keegan
for the payloads, there many be more than one for a single Span. Regards, Paul Elschot Cheers, Grant On Jul 12, 2007, at 8:20 AM, Peter Keegan wrote: I'm looking for Spans.getPositions(), as shown in BoostingTermQuery, but neither NearSpansOrdered nor NearSpansUnordered (which

Re: encoding question.

2007-07-19 Thread Peter Keegan
The source data for my index is already in standard UTF-8 and available as a simple byte array. I need to do some simple tokenization of the data (check for whitespace and special characters that control position increment). What is the most efficient way to index this data and avoid unnecessary

Re: Payloads and PhraseQuery

2007-07-27 Thread Peter Keegan
I guess this also ties in with 'getPositionIncrementGap', which is relevant to fields with multiple occurrences. Peter On 7/27/07, Peter Keegan [EMAIL PROTECTED] wrote: I have a question about the way fields are analyzed and inverted by the index writer. Currently, if a field has multiple

Re: LUCENE-843 Release

2007-07-30 Thread Peter Keegan
I've built a production index with this patch and done some query stress testing with no problems. I'd give it a thumbs up. Peter On 7/30/07, testn [EMAIL PROTECTED] wrote: Hi guys, Do you think LUCENE-843 is stable enough? If so, do you think it's worth to release it with probably LUCENE

Mixing SpanQuery and BooleanQuery

2007-08-06 Thread Peter Keegan
I'm trying to create a fairly complex SpanQuery from a binary parse tree. I create SpanOrQueries from SpanTermQueries and combine SpanOrQueries into BooleanQueries. So far, so good. The problem is that I don't see how to create a SpanNotQuery from a BooleanQuery and a SpanTermQuery. I want the

Re: Mixing SpanQuery and BooleanQuery

2007-08-06 Thread Peter Keegan
with interesting slops.. Erick On 8/6/07, Peter Keegan [EMAIL PROTECTED] wrote: I'm trying to create a fairly complex SpanQuery from a binary parse tree. I create SpanOrQueries from SpanTermQueries and combine SpanOrQueries into BooleanQueries. So far, so good. The problem

SpanQuery and database join

2007-08-13 Thread Peter Keegan
I've been experimenting with using SpanQuery to perform what is essentially a limited type of database 'join'. Each document in the index contains 1 or more 'rows' of meta data from another 'table'. The meta data are simple tokens representing a column name/value pair ( e.g. color$red or

Re: SpanQuery and database join

2007-08-13 Thread Peter Keegan
I suppose it could go under performance or HowTo/Interesting uses of SpanQuery. Peter On 8/13/07, Erick Erickson [EMAIL PROTECTED] wrote: Thanks for writing this up. Do you think this is an appropriate subject for the Wiki performance page? Erick On 8/13/07, Peter Keegan [EMAIL PROTECTED

Re: SpanQuery and database join

2007-08-14 Thread Peter Keegan
I added this under Use Cases. Thanks for the suggestion. Peter On 8/13/07, Grant Ingersoll [EMAIL PROTECTED] wrote: There is also a Use Cases item on the Wiki... On Aug 13, 2007, at 3:26 PM, Peter Keegan wrote: I suppose it could go under performance or HowTo/Interesting uses

Re: Scoring results?!

2007-08-30 Thread Peter Keegan
If I use BoostingTermQuery on a query containing terms without payloads, I get very different results than doing the same query with TermQuery. Presumably, this is because the BoostingSpanScorer/SpanScorer compute scores differently than TermScorer. Is there a way to make BoostingTermQuery behave

BoostingTermQuery.explain() bugs

2007-08-30 Thread Peter Keegan
There are a couple of minor bugs in BoostingTermQuery.explain(). 1. The computation of average payload score produces NaN if no payloads were found. It should probably be: float avgPayloadScore = super.score() * (payloadsSeen > 0 ? (payloadScore / payloadsSeen) : 1); 2. If the average payload
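The NaN in point 1 comes from float division of zero by zero. The standalone sketch below reproduces that arithmetic and the proposed guard outside of Lucene. The class and method names are hypothetical, and baseScore stands in for the value of super.score().

```java
public class AvgPayloadScoreSketch {
    // Guarded form: falls back to a factor of 1 when no payloads were seen.
    public static float avgPayloadScore(float baseScore, float payloadScore, int payloadsSeen) {
        return baseScore * (payloadsSeen > 0 ? (payloadScore / payloadsSeen) : 1);
    }

    // Unguarded form, shown only to demonstrate the failure mode.
    public static float unguarded(float baseScore, float payloadScore, int payloadsSeen) {
        return baseScore * (payloadScore / payloadsSeen);
    }

    public static void main(String[] args) {
        System.out.println(unguarded(2f, 0f, 0));       // prints NaN
        System.out.println(avgPayloadScore(2f, 0f, 0)); // prints 2.0
        System.out.println(avgPayloadScore(2f, 6f, 3)); // prints 4.0
    }
}
```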

BoostingTermQuery performance

2007-10-02 Thread Peter Keegan
I have been experimenting with payloads and BoostingTermQuery, which I think are excellent additions to Lucene core. Currently, BoostingTermQuery extends SpanQuery. I would suggest changing this class to extend TermQuery and refactor the current version to something like 'BoostingSpanQuery'. The

Re: Can I do boosting based on term postions?

2007-12-18 Thread Peter Keegan
This is a nice alternative to using payloads and BoostingTermQuery. Is there any reason not to make this change to SpanFirstQuery, in particular: This modification to SpanFirstQuery would be that the Spans returned by SpanFirstQuery.getSpans() must always return 0 from its start() method. Should

Re: FieldSortedHitQueue rise in memory

2008-02-19 Thread Peter Keegan
Hi Brian, I ran into something similar a long time ago. My custom sort objects were being cached by Lucene, but there were too many of them because each one had different 'reference values' for different queries. So, I changed the equals and hashCode methods to NOT use any instance data, thus
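The effect of making equals and hashCode independent of instance data can be shown with an ordinary HashMap standing in for Lucene's internal comparator cache. Everything here is an assumption for illustration: DistanceSort is a hypothetical sort object carrying per-query reference values, and the map is not the actual cache.

```java
import java.util.HashMap;
import java.util.Map;

public class ComparatorCacheSketch {
    // Hypothetical custom sort object with per-query reference values.
    public static final class DistanceSort {
        final double refLat;
        final double refLon;

        public DistanceSort(double refLat, double refLon) {
            this.refLat = refLat;
            this.refLon = refLon;
        }

        // Deliberately ignores refLat/refLon: all instances are equal as cache keys.
        @Override
        public boolean equals(Object o) {
            return o instanceof DistanceSort;
        }

        @Override
        public int hashCode() {
            return DistanceSort.class.hashCode();
        }
    }

    // Simulates one cache insertion per query and reports how many entries remain.
    public static int cacheSizeAfter(int queries) {
        Map<DistanceSort, String> cache = new HashMap<DistanceSort, String>();
        for (int i = 0; i < queries; i++) {
            cache.put(new DistanceSort(i, -i), "cached comparator state");
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(cacheSizeAfter(1000)); // prints 1
    }
}
```

The trade-off is that whatever is cached under such a key must not depend on the per-query values, since every query now maps to the same entry.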

Re: Swapping between indexes

2008-03-06 Thread Peter Keegan
Sridhar, We have been using approach 2 in our production system with good results. We have separate processes for indexing and searching. The main issue that came up was in deleting old indexes (see: http://tinyurl.com/32q8c4). Most of our production problems occur during indexing, and we are

theoretical maximum score

2008-05-09 Thread Peter Keegan
Is it possible to compute a theoretical maximum score for a given query if constraints are placed on 'tf' and 'lengthNorm'? If so, scores could be compared to a 'perfect score' (a feature request from our customers) Here are some related threads on this: In this thread:

Payloads and SpanScorer

2008-07-09 Thread Peter Keegan
If a SpanQuery is constructed from one or more BoostingTermQuery(s), the payloads on the terms are never processed by the SpanScorer. It seems to me that you would want the SpanScorer to score the document both on the spans distance and the payload score. So, either the SpanScorer would have to

Re: Payloads and SpanScorer

2008-07-10 Thread Peter Keegan
Ingersoll [EMAIL PROTECTED] wrote: I'm not fully following what you want. Can you explain a bit more? Thanks, Grant On Jul 9, 2008, at 2:55 PM, Peter Keegan wrote: If a SpanQuery is constructed from one or more BoostingTermQuery(s), the payloads on the terms are never processed

Re: Payloads and SpanScorer

2008-07-10 Thread Peter Keegan
PayloadNearQuery, see http://wiki.apache.org/lucene-java/Payload_Planning I think it would make sense to develop these and I would be happy to help shepherd a patch through, but am not in a position to generate said patch at this moment in time. On Jul 10, 2008, at 9:59 AM, Peter Keegan wrote

Re: Payloads and SpanScorer

2008-07-19 Thread Peter Keegan
at it :) Peter On Thu, Jul 10, 2008 at 2:09 PM, Peter Keegan [EMAIL PROTECTED] wrote: I may take a crack at this. Any more thoughts you may have on the implementation are welcome, but I don't want to distract you too much. Thanks, Peter On Thu, Jul 10, 2008 at 1:30 PM, Grant Ingersoll [EMAIL

BoostingTermQuery scoring

2008-11-04 Thread Peter Keegan
I'm using BoostingTermQuery to boost the score of documents with terms containing payloads (boost value > 1). I'd like to change the scoring behavior such that if a query contains multiple BoostingTermQuery terms (either required or optional), documents containing more matching terms with payloads

Re: BoostingTermQuery scoring

2008-11-06 Thread Peter Keegan
: Not sure, but it sounds like you are interested in a higher level Query, kind of like the BooleanQuery, but then part of it sounds like it is per document, right? Is it that you want to deal with multiple payloads in a document, or multiple BTQs in a bigger query? On Nov 4, 2008, at 9:42 AM, Peter

Re: BoostingTermQuery scoring

2008-11-06 Thread Peter Keegan
that doc. Yet another reason to use BoostingTermQuery. Peter On Thu, Nov 6, 2008 at 1:08 PM, Peter Keegan [EMAIL PROTECTED] wrote: Let me give some background on the problem behind my question. Our index contains many fields (title, body, date, city, etc). Most queries search all fields

Re: Boosting results

2008-11-07 Thread Peter Keegan
If you sort first by score, keep in mind that the raw scores are very precise and you could see many unique values in the result set. The secondary sort field would only be used to break equal scores. We had to use a custom comparator to 'smooth out' the scores to allow the second field to take
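The 'smoothing' idea in this message can be sketched in plain Java, without any Lucene classes. This is only an illustration of one way to do it, not the comparator used in the original post: the Hit class, the bucket size and the choice of a date as the secondary field are all assumptions made for the example. Scores are grouped into fixed-width buckets so that nearly equal scores tie and the secondary field gets to decide the order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SmoothedScoreSort {
    // Hypothetical hit holder for the example; not a Lucene class.
    public static final class Hit {
        public final String id;
        public final float score;
        public final long date;

        public Hit(String id, float score, long date) {
            this.id = id;
            this.score = score;
            this.date = date;
        }
    }

    // Map a raw score to a coarse bucket so that close scores compare as equal.
    static int bucket(float score, float bucketSize) {
        return (int) Math.floor(score / bucketSize);
    }

    // Highest bucket first; within a bucket, newest date first (assumed secondary field).
    public static Comparator<Hit> smoothed(float bucketSize) {
        Comparator<Hit> byBucketDesc =
                Comparator.comparingInt((Hit h) -> bucket(h.score, bucketSize)).reversed();
        Comparator<Hit> byDateDesc =
                Comparator.comparingLong((Hit h) -> h.date).reversed();
        return byBucketDesc.thenComparing(byDateDesc);
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a", 0.93f, 1L));
        hits.add(new Hit("b", 0.91f, 5L));
        hits.add(new Hit("c", 0.75f, 9L));
        hits.sort(smoothed(0.1f));
        for (Hit h : hits) {
            System.out.println(h.id);
        }
    }
}
```

With a bucket size of 0.1, hits "a" and "b" fall into the same bucket, so the newer "b" sorts ahead of "a" even though its raw score is slightly lower. Sorting on the raw scores would never consult the date here, because the two scores are not exactly equal.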

Re: BoostingTermQuery scoring

2008-11-07 Thread Peter Keegan
performance? (I haven't tried it yet). Thanks, Peter On Thu, Nov 6, 2008 at 6:56 PM, Steven A Rowe [EMAIL PROTECTED] wrote: Hi Peter, On 11/06/2008 at 4:25 PM, Peter Keegan wrote: I've discovered another flaw in using this technique: (+contents:petroleum +contents:engineer +contents:refinery

Re: Payloads

2008-12-29 Thread Peter Keegan
Hi Karl, I use payloads for weight only, too, with BoostingTermQuery (see: http://www.nabble.com/BoostingTermQuery-scoring-td20323615.html#a20323615) A custom tokenizer looks for the reserved character '\b' followed by a 2 byte 'boost' value. It then creates a special Token type for a custom

queryNorm affect on score

2009-02-20 Thread Peter Keegan
The explanation of scores from the same document returned from 2 similar queries differ in an unexpected way. There are 2 fields involved, 'contents' and 'literals'. The 'literals' field has setBoost = 0. As you can see from the explanations below, the total weight of the matching terms from the

Re: queryNorm affect on score

2009-02-27 Thread Peter Keegan
Any comments about this? Is this just the way queryNorm works or is this a bug? Thanks, Peter On Fri, Feb 20, 2009 at 4:03 PM, Peter Keegan peterlkee...@gmail.com wrote: The explanation of scores from the same document returned from 2 similar queries differ in an unexpected way. There are 2

Re: queryNorm affect on score

2009-02-27 Thread Peter Keegan
Got it. This is another example of why scores can't be compared between (even similar) queries. (we don't) Thanks. On Fri, Feb 27, 2009 at 11:39 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Fri, Feb 27, 2009 at 9:15 AM, Peter Keegan peterlkee...@gmail.com wrote: Any comments about

Re: queryNorm affect on score

2009-02-28 Thread Peter Keegan
in situations where you deal with simple query types, and matching query structures, the queryNorm *can* be used to make scores semi-comparable. Hmm. My example used matching query structures. The only difference was a single term in a field with zero weight that didn't exist in the matching

Re: queryNorm affect on score

2009-03-01 Thread Peter Keegan
no effect on the score, when combined with the above. This seems ok in this example since the matching terms had boost = 0. Thanks Yonik, Peter On Sat, Feb 28, 2009 at 6:02 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Sat, Feb 28, 2009 at 3:02 PM, Peter Keegan peterlkee...@gmail.com

Re: queryNorm affect on score

2009-03-02 Thread Peter Keegan
On Sun, Mar 1, 2009 at 8:57 PM, Peter Keegan peterlkee...@gmail.com wrote: As suggested, I added a query-time boost of 0.0f to the 'literals' field (with index-time boost still there) and I did get the same scores for both queries :) (there is a subtlety between index-time and query-time
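A small sketch of why a query-time boost of 0.0f can make the two queries score alike. It assumes the classic DefaultSimilarity behaviour, where queryNorm is 1/sqrt(sumOfSquaredWeights) and each term contributes (idf * query-time boost) squared to that sum; the top-level query boost is taken as 1. The idf values are made up for illustration and are not taken from the thread. Index-time boosts do not enter this sum, which is one reading of the subtlety mentioned above.

```java
public class QueryNormDemo {
    // queryNorm = 1 / sqrt(sum over terms of (idf * queryBoost)^2),
    // assuming a top-level query boost of 1.
    public static float queryNorm(float[] idf, float[] queryBoost) {
        float sumOfSquaredWeights = 0f;
        for (int i = 0; i < idf.length; i++) {
            float weight = idf[i] * queryBoost[i];
            sumOfSquaredWeights += weight * weight;
        }
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // Two 'contents' terms only (hypothetical idf values).
        float base = queryNorm(new float[] {2f, 3f}, new float[] {1f, 1f});
        // Same query plus a 'literals' term with the default query-time boost of 1.
        float extraTerm = queryNorm(new float[] {2f, 3f, 5f}, new float[] {1f, 1f, 1f});
        // Same query plus a 'literals' term with a query-time boost of 0.
        float zeroBoost = queryNorm(new float[] {2f, 3f, 5f}, new float[] {1f, 1f, 0f});

        System.out.println("base      = " + base);
        System.out.println("extraTerm = " + extraTerm);
        System.out.println("zeroBoost = " + zeroBoost);
    }
}
```

Under these assumptions the extra term with boost 1 shrinks queryNorm, while the same term with a query-time boost of 0 leaves it unchanged.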

sloppyFreq question

2009-03-03 Thread Peter Keegan
The DefaultSimilarity class defines sloppyFreq as: public float sloppyFreq(int distance) { return 1.0f / (distance + 1); } For a 'SpanNearQuery', this reduces the effect of the term frequency on the score as the number of terms in the span increases. So, for a simple phrase query (using
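To make the formula quoted above concrete, here is a standalone sketch that copies it into a plain Java class (SloppyFreqDemo is a name invented for this example and does not depend on Lucene) and prints the value for a few distances.

```java
public class SloppyFreqDemo {
    // Same formula as the DefaultSimilarity.sloppyFreq quoted above.
    public static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // Show how the contribution falls off as the match distance grows.
        for (int distance = 0; distance <= 4; distance++) {
            System.out.println("distance=" + distance
                    + " sloppyFreq=" + sloppyFreq(distance));
        }
    }
}
```

A distance of 0 yields 1.0, a distance of 1 yields 0.5, and a distance of 3 yields 0.25, so each additional position of distance reduces the frequency contribution.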
