On a multi-cpu system, this loop to build the docMap array can cause severe
thread thrashing because of the synchronized method 'isDeleted'. I have
observed this on an index with over 1 million documents (which contains a
few thousand deleted docs) when multiple threads perform a search with
Here is one stack trace:
Full thread dump Java HotSpot(TM) Client VM (1.5.0_03-b07 mixed mode):
Thread-6 prio=5 tid=0x6cf7a7f0 nid=0x59e50 waiting for monitor entry
[0x6d2cf000..0x6d2cfd6c]
at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:241)
- waiting to lock 0x04e40278 (a
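For illustration, here is a pure-Java sketch (not Lucene's actual code) of why this loop thrashes: every document probes a synchronized method, so all searcher threads serialize on one monitor. Snapshotting the deletions once under the lock removes the per-document contention. Class and method names here are hypothetical stand-ins.

```java
import java.util.BitSet;

// Hypothetical sketch, not Lucene source: building a docMap by calling a
// synchronized isDeleted(doc) once per document takes the monitor maxDoc
// times; cloning the deletions into a local BitSet takes it once.
public class DocMapSketch {
    private final BitSet deletedDocs;   // stands in for SegmentReader's deletions
    private final int maxDoc;

    public DocMapSketch(BitSet deletedDocs, int maxDoc) {
        this.deletedDocs = deletedDocs;
        this.maxDoc = maxDoc;
    }

    // The contended pattern: one monitor acquisition per document.
    public synchronized boolean isDeleted(int doc) {
        return deletedDocs.get(doc);
    }

    public int[] buildDocMapContended() {
        int[] docMap = new int[maxDoc];
        int j = 0;
        for (int i = 0; i < maxDoc; i++) {
            docMap[i] = isDeleted(i) ? -1 : j++;   // lock taken maxDoc times
        }
        return docMap;
    }

    // The fix: snapshot the deletions under one lock, then loop lock-free.
    public int[] buildDocMapSnapshot() {
        BitSet snapshot;
        synchronized (this) {
            snapshot = (BitSet) deletedDocs.clone();  // single lock acquisition
        }
        int[] docMap = new int[maxDoc];
        int j = 0;
        for (int i = 0; i < maxDoc; i++) {
            docMap[i] = snapshot.get(i) ? -1 : j++;
        }
        return docMap;
    }
}
```

Both methods produce the same mapping; only the locking pattern differs.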
Hi Yonik,
Your patch has corrected the thread thrashing problem on multi-cpu systems.
I've tested it with both 1.4.3 and 1.9. I haven't seen 100X performance
gain, but that's because I'm caching QueryFilters and Lucene is caching the
sort fields.
Thanks for the fast response!
btw, I had
This is just fyi - in my stress tests on a 8-cpu box (that's 8 real cpus),
the maximum throughput occurred with just 4 query threads. The query
throughput decreased with fewer than 4 or greater than 4 query threads. The
entire index was most likely in the file system cache, too. Periodic
It's a 3GHz Intel box with Xeon processors, 64GB ram :)
Peter
On 1/25/06, Yonik Seeley [EMAIL PROTECTED] wrote:
Thanks Peter, that's useful info.
Just out of curiosity, what kind of box is this? what CPUs?
-Yonik
On 1/25/06, Peter Keegan [EMAIL PROTECTED] wrote:
This is just fyi
Yes, it's hyperthreaded (16 cpus show up in task manager - the box is
running Windows 2003). I plan to turn off hyperthreading to see if it has any
effect.
Peter
On 1/25/06, Yonik Seeley [EMAIL PROTECTED] wrote:
On 1/25/06, Peter Keegan [EMAIL PROTECTED] wrote:
It's a 3GHz Intel box with Xeon
PROTECTED] wrote:
On Wednesday 25 January 2006 20:51, Peter Keegan wrote:
The index is non-compound format and optimized. Yes, I did try
MMapDirectory, but the index is too big - 3.5 GB (1.3GB is term vectors)
Peter
You could also give this a try:
http://issues.apache.org/jira/browse
Ray,
The throughput is worse with NioFSDirectory than with FSDirectory
(patched and unpatched). The bottleneck still seems to be synchronization,
this time in NioFile.getChannel (7 of the 8 threads were blocked there
during one snapshot). I tried this with 4 and 8 channels.
The throughput
Java 1.5)
-Yonik
On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
Paul,
I tried this but it ran out of memory trying to read the 500Mb .fdt
file. I
tried various values for MAX_BBUF, but it still ran out of memory (I'm
using
-Xmx1600M, which is the jvm's maximum value (v1.5)) I'll
speedup! The extra registers in 64-bit mode may have helped a little
too.
-Yonik
On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote:
Correction: make that 285 qps :)
engines, but I'm obviously still
learning thanks to this group.
Peter
On 1/26/06, Ray Tsang [EMAIL PROTECTED] wrote:
Peter,
Wow, the speedup is impressive! But may I ask what you did to
achieve 135 queries/sec prior to the JVM switch?
ray,
On 1/27/06, Peter Keegan [EMAIL PROTECTED
?
Thanks!
ray,
On 1/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
Ray,
The short answer is that you can make Lucene blazingly fast by using
advice
and design principles mentioned in this forum and of course reading
'Lucene
in Action'. For example, use a 'content' field for searching all
:
Peter Keegan wrote:
I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now
getting 250 queries/sec and excellent cpu utilization (equal concurrency
on
all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't
aware
of it.
Wow. That's fast.
Out
to figure out what pages to swap in
and which to swap out, esp of the memory mapped files.
You could also try a profiler on both platforms to try and see where
the difference is.
-Yonik
On 2/22/06, Peter Keegan [EMAIL PROTECTED] wrote:
I am doing a performance comparison of Lucene on Linux vs
PROTECTED] wrote:
Peter,
Have you given JRockit JVM a try? I've seen it help throughput
compared to Sun's JVM on a dual xeon/linux machine, especially with
concurrency (up to 6 concurrent searches happening). I'm curious to
see if it makes a difference for you.
-chris
On 2/23/06, Peter Keegan
On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:
Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system
next
(32 with hyperthreading), on LinTel. I may give JRockit another go
around
then.
Thanks,
Peter
MMapDirectory, does this retrieval need to be synchronized?
Peter
On 2/23/06, Peter Keegan [EMAIL PROTECTED] wrote:
Yonik,
We're investigating both approaches.
Yes, the resources (and permutations) are dizzying!
Peter
On 2/23/06, Yonik Seeley [EMAIL PROTECTED] wrote:
Wow, some resources
)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
at org.apache.lucene.search.Hits.init(Hits.java:52)
at org.apache.lucene.search.Searcher.search(Searcher.java:62)
On 3/7/06, Doug Cutting [EMAIL PROTECTED] wrote:
Peter Keegan wrote:
I ran a query performance tester against 8-cpu and 16-cpu Xeon servers
/06, Peter Keegan [EMAIL PROTECTED] wrote:
3. Use the ThreadLocal's FieldReader in the document() method.
As I understand it, this means that the document method no longer needs
to
be synchronized, right?
I've made these changes and it does appear to improve performance.
Random
Chris,
My apologies - this error was apparently caused by a file format mismatch
(probably line endings).
Thanks,
Peter
On 3/13/06, Peter Keegan [EMAIL PROTECTED] wrote:
Chris,
Should this patch work against the current code base? I'm getting this
error:
D:\lucene-1.9>patch -b -p0 -i nio
- I read from Peter Keegan's recent postings:
- The Lucene server is using MMapDirectory. I'm running
- the jvm with -Xmx16000M. Peak memory usage of the jvm
- on Linux is about 6GB and 7.8GB on windows.
- We don't have nearly as much memory as Peter but I
- wonder whether he is gaining anything
handily at 400 qps.
Peter
I experimented with this by using a Similarity class that returns a
constant (1) for all values and found that it had no noticeable effect on query
performance.
Peter
On 12/6/05, Chris Hostetter [EMAIL PROTECTED] wrote:
: I was wondering if there is a standard way to retrive documents WITHOUT
:
the segments to disk with 'addIndexes'. This resulted in a speed
improvement of 27%.
Peter
On 1/29/06, Daniel Noll [EMAIL PROTECTED] wrote:
Peter Keegan wrote:
I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now
getting 250 queries/sec and excellent cpu utilization (equal
Yonik,
Could you explain why an IndexSearcher constructed from multiple readers is
faster than a MultiSearcher constructed from same readers?
Thanks,
Peter
On 4/10/06, Yonik Seeley [EMAIL PROTECTED] wrote:
On 4/10/06, oramas martín [EMAIL PROTECTED] wrote:
Is there any performance (or
Does this mean that MultiReader doesn't merge the search results and sort
the results as if there was only one index? If not, does it simply
concatenate the results?
Peter
On 4/11/06, Yonik Seeley [EMAIL PROTECTED] wrote:
On 4/11/06, Peter Keegan [EMAIL PROTECTED] wrote:
Could you explain
IndexSearcher(indexStoreB);
searchers[1] = new IndexSearcher(indexStoreA);
Sorry about that,
Peter
On 4/11/06, Doug Cutting [EMAIL PROTECTED] wrote:
Peter Keegan wrote:
Oops. I meant to say: Does this mean that an IndexSearcher constructed
from
a MultiReader doesn't merge the search
Suppose I have a custom sorting 'DocScoreComparator' for computing distances
on each search hit from a specified coordinate (similar to the
DistanceComparatorSource example in LIA). Assume that the 'specified
coordinate' is different for each query. This means a new custom comparator
must be
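As a sketch of the per-query comparator idea above (names are hypothetical, and this is plain Java rather than the LIA DistanceComparatorSource), the comparator captures the query-time coordinate at construction, which is why a fresh instance is needed per search:

```java
import java.util.Comparator;

// Illustrative stand-in for a distance-based sort comparator: orders hits
// (here modeled as [x, y] points) by distance from the query's coordinate.
public class DistanceComparator implements Comparator<double[]> {
    private final double qx, qy;  // the query's 'specified coordinate'

    public DistanceComparator(double qx, double qy) {
        this.qx = qx;
        this.qy = qy;
    }

    // Squared euclidean distance is enough for ordering; sqrt is not needed.
    public double distanceSq(double[] point) {
        double dx = point[0] - qx, dy = point[1] - qy;
        return dx * dx + dy * dy;
    }

    @Override
    public int compare(double[] a, double[] b) {
        return Double.compare(distanceSq(a), distanceSq(b));
    }
}
```

Because the coordinate is baked into the instance, caching such comparators by field name alone (as Lucene's sort cache does) is exactly where the trouble described in this thread comes from.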
, a
reference to the '.tis' file remains.
Peter
On 6/5/06, Daniel Noll [EMAIL PROTECTED] wrote:
Peter Keegan wrote:
There is no 'unmap' method, so my understanding is that the file mapping
is
valid until the underlying buffer is garbage-collected. However, forcing
the gc doesn't help.
You're half
I compared Solr's DocSetHitCollector and counting bitset intersections to
get facet counts with a different approach that uses a custom hit collector
that tests each docid hit (bit) with each facets' bitset and increments a
count in a histogram. My assumption was that for queries with few hits,
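A minimal sketch of the histogram approach described above (pure Java with java.util.BitSet, not Solr's code): test each hit docid against every facet's bitset and bump a counter, which for queries with few hits does O(hits x facets) bit probes instead of full bitset intersections.

```java
import java.util.BitSet;

// Sketch of the custom-hit-collector counting scheme: one counter per
// facet, incremented for every hit doc whose bit is set in that facet.
public class FacetHistogram {
    public static int[] count(int[] hitDocs, BitSet[] facetBitSets) {
        int[] histogram = new int[facetBitSets.length];
        for (int doc : hitDocs) {
            for (int f = 0; f < facetBitSets.length; f++) {
                if (facetBitSets[f].get(doc)) {
                    histogram[f]++;
                }
            }
        }
        return histogram;
    }
}
```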
)
no. facets: 100 on every query
I'm not using the Solr server as we have already developed an
infrastructure.
Peter
On 6/10/06, Yonik Seeley [EMAIL PROTECTED] wrote:
On 6/9/06, Peter Keegan [EMAIL PROTECTED] wrote:
However, my throughput testing shows that the Solr method is at least
50
See my note about overlapping indexing documents with merging:
http://www.gossamer-threads.com/lists/lucene/java-user/34188?search_string=%2Bkeegan%20%2Baddindexes;#34188
Peter
On 6/12/06, Michael D. Curtin [EMAIL PROTECTED] wrote:
Nadav Har'El wrote:
Otis Gospodnetic [EMAIL PROTECTED]
qps.
This is great stuff Solr guys! I'd love to see the DocSet and DocList
features added to Lucene's IndexSearcher.
Peter
On 6/12/06, Peter Keegan [EMAIL PROTECTED] wrote:
I'm seeing query throughput of approx. 290 qps with OpenBitSet vs. 270
with BitSet. I had to reduce the max. HashDocSet
This makes it relatively safe for people to grab a snapshot of the trunk
with less concern about latent bugs.
I think the concern is that if we start doing this stuff on trunk now,
people that are accustomed to snapping from the trunk might be surprised,
and not in a good way.
+1 on this.
I am pleased to announce the launch of Monster's new job search Beta web
site, powered by Lucene, at: http://jobsearch.beta.monster.com (notice the
Lucene logo at the bottom of the page!).
The jobs index is implemented with Java Lucene 2.0 on 64-bit Windows (AMD
and Intel processors)
Here are
be accomplished with Solr's FunctionQuery, but I
haven't tried that yet.
Peter
--
Chris Lu
-
Instant Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
On 10/27/06, Peter Keegan [EMAIL PROTECTED] wrote:
I am pleased
Gospodnetic [EMAIL PROTECTED] wrote:
Hi,
--- Peter Keegan [EMAIL PROTECTED] wrote:
On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote:
Hi, Peter,
Really great job!
Thanks. (I'll tell the team)
If it's not a secret, can you tell us a bit more about what's behind
the search in terms of hardware
/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:
Hi,
--- Peter Keegan [EMAIL PROTECTED] wrote:
On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote:
Hi, Peter,
Really great job!
Thanks. (I'll tell the team)
If it's not a secret, can you tell us a bit more about what's behind
that aren't in the
requested range(s). A goal was to do this without having to modify Lucene.
Our scheme is pretty efficient, but not very general purpose in its current
form, though.
Peter
On 10/30/06, Joe Shaw [EMAIL PROTECTED] wrote:
Hi Peter,
On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote
distance by
miles part of the relevancy of the search results?
Could you comment or confirm my assertion ? Thanks :)
On 10/28/06, Peter Keegan [EMAIL PROTECTED] wrote:
On 10/27/06, Chris Lu [EMAIL PROTECTED] wrote:
Hi, Peter,
Really great job!
Thanks. (I'll tell the team)
I am
If possible give some code snippet for custom
HitCollector.
TIA
Sri
Peter Keegan [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
Joe,
Fields with numeric values are stored in a separate file as binary
values
in
an internal format. Lucene is unaware of this file and unaware of the
range
current
form, though.
Peter
On 10/30/06, Joe Shaw [EMAIL PROTECTED] wrote:
Hi Peter,
On Fri, 2006-10-27 at 15:29 -0400, Peter Keegan wrote:
Numeric range search is one of Lucene's weak points
(performance-wise)
so we
have implemented this with a custom HitCollector and an extension
(post hit collector). I don't have any
performance numbers with the double vs single distance calc.
I'm still working out the sort by radius myself.
Mark
On 11/3/06, Peter Keegan [EMAIL PROTECTED] wrote:
Daniel,
Yes, this is correct if you happen to be doing a radius search and
sorting
tried to check your search it was down. We were talking the
other day at work how job search was lacking among the big boards. I'm
excited to check out your new page.
Mark
On 1/28/07, Peter Keegan [EMAIL PROTECTED] wrote:
Correction:
We only do the euclidean computation during sorting
I have discovered a serious bug in QueryParser. The following query:
contents:sales && contents:marketing || contents:industrial &&
contents:sales
is parsed as:
+contents:sales +contents:marketing +contents:industrial +contents:sales
The same parsed query occurs even with parentheses:
Correction:
The query parser produces the correct query with parentheses.
But, I'm still looking for a fix for this. I could use some advice on where
to look in QueryParser to fix this.
Thanks,
Peter
On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote:
I have discovered a serious bug
, Peter Keegan [EMAIL PROTECTED] wrote:
Correction:
The query parser produces the correct query with parentheses.
But, I'm still looking for a fix for this. I could use some advice on
where to look in QueryParser to fix this.
Thanks,
Peter
On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote:
I
(If i could go back in time and stop the AND/OR/NOT/&&/|| aliases from
being added to the QueryParser -- i would)
Yes, this is the cause of the confusion. Our users are accustomed to the
boolean logic syntax from a legacy search engine (also common to many other
engines). We'll have to convert
Hi Erick,
The timing of your posting is ironic because I'm currently working on the
same issue. Here's a solution that I'm going to try:
Use a HitCollector with a PriorityQueue to sort all hits by raw Lucene
score, ignoring the secondary sort field.
After the search, re-sort just the hits from
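The message is truncated, so the details below are assumptions, but the two-pass idea can be sketched in plain Java: pass one keeps the top-N hits in a priority queue ordered only by raw score, and pass two re-sorts just those N survivors by score and then the secondary field.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the two-pass sort: collect by raw score, then re-sort only the
// survivors on (score, secondary key), avoiding a full two-key sort of all hits.
public class TwoPassSort {
    public static class Hit {
        public final int doc;
        public final float score;
        public final long secondary;   // e.g. a date field

        public Hit(int doc, float score, long secondary) {
            this.doc = doc;
            this.score = score;
            this.secondary = secondary;
        }
    }

    public static List<Hit> topN(List<Hit> hits, int n) {
        // Min-heap of size n ordered by raw score only: cheapest hit evicted first.
        PriorityQueue<Hit> pq =
                new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (Hit h : hits) {
            pq.offer(h);
            if (pq.size() > n) {
                pq.poll();
            }
        }
        List<Hit> top = new ArrayList<>(pq);
        // Re-sort just the survivors: score descending, then secondary descending.
        top.sort(Comparator.<Hit>comparingDouble(h -> -h.score)
                .thenComparingLong(h -> -h.secondary));
        return top;
    }
}
```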
Suppose one wanted to use this custom rounding score comparator on all
fields and all queries. How would you get it plugged in most efficiently,
given that SortField requires a non-null field name?
Peter
On 2/1/06, Chris Hostetter [EMAIL PROTECTED] wrote:
: I've not used the sorting code
I'm building up the Sort object for the search with 2 SortFields - first is
for the custom rounded scoring, second is for date. This Sort object is used
to construct a FieldSortedHitQueue which is used with a custom HitCollector.
And yes, this comparator ignores the field name.
hmmm, actually i
can't you pick any arbitrary marker field name (that's not a real field
name) and use that?
Yes, I could. I guess you're saying that the field name doesn't matter,
except that it's used for caching the comparator, right?
... he wants the bucketing to happen as part of the scoring so that the
Erick,
Yes, this seems to be the simplest way to implement score 'bucketization',
but wouldn't it be more efficient to do this with a custom ScoreComparator?
That way, you'd do the bucketizing and sorting in one 'step' (compare()).
Maybe the savings isn't measurable, though. A comparator might
so I didn't pursue it. One of my pet peeves is spending time making
things more efficient when there's no need, and my index isn't
going to grow enough larger to worry about that now G...
Erick
On 2/28/07, Peter Keegan [EMAIL PROTECTED] wrote:
Erick,
Yes, this seems to be the simplest
I'm looking at how ReciprocalFloatFunction and ReverseOrdFieldSource can be
used to rank documents by score and date (solr.search.function contains
great stuff!). The values in the date field that are used for the
ValueSource are not actually used as 'floats', but rather their ordinal term
values
as
well
though, otherwise you will obtain perhaps highly relevant hits reported to
the user outside the range they specified? Particularly as the search
radius
gets larger.
Cheers,
Dan
On 1/28/07, Peter Keegan [EMAIL PROTECTED] wrote:
Correction:
We only do the euclidean computation during sorting
Note: this is a reply to a posting to java-dev --Peter
Eric,
Now that it is live, is performance pretty good?
Performance is outstanding. Each server can easily handle well over 100 qps
on an index of over 800K documents. There are several servers (4 dual core
(8 CPU) Opteron) supporting
On a similar topic, has anybody measured query performance as a function of
index size?
Well, I did and the results surprised me. I measured query throughput on 8
indexes that varied in size from 55,000 to 4.4 million documents. When
plotted on a graph, there is a distinct hyperbolic curve (1/x).
: Peter Keegan [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, March 29, 2007 9:39:13 AM
Subject: FieldSortedHitQueue enhancement
This is request for an enhancement to FieldSortedHitQueue/PriorityQueue
that
would prevent duplicate documents from being inserted, or alternatively
Yes, my custom query processor can sometimes make 2 Lucene search calls
which may result in duplicate docs being inserted on the same PQ. The
simplest solution is to make lessThan public. I'm curious to know if anyone
else is performing multiple searches under the covers.
Peter
On 3/29/07,
().
Peter, how did you achieve 'last wins' as you must presumably remove first
from
the PQ?
Antony
Peter Keegan wrote:
The duplicate check would just be on the doc ID. I'm using TreeSet to
detect
duplicates with no noticeable effect on performance. The PQ only has to
be
checked
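The exact wiring is an assumption (the messages are truncated), but the TreeSet-guarded queue can be sketched in plain Java: the set tracks doc IDs currently in the queue, so a doc inserted by a second search call is rejected rather than queued twice.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeSet;

// Sketch of duplicate detection for a top-N priority queue: a TreeSet of
// doc IDs guards insertion; an evicted doc is untracked so it may re-enter.
public class DedupQueue {
    private final TreeSet<Integer> seenDocs = new TreeSet<>();
    private final PriorityQueue<int[]> pq;   // entries are [doc, score] pairs
    private final int size;

    public DedupQueue(int size) {
        this.size = size;
        this.pq = new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[1]));
    }

    // Returns true only if the doc was actually inserted.
    public boolean insert(int doc, int score) {
        if (!seenDocs.add(doc)) {
            return false;                    // duplicate doc ID: reject
        }
        pq.offer(new int[]{doc, score});
        if (pq.size() > size) {
            seenDocs.remove(pq.poll()[0]);   // evicted doc may reappear later
        }
        return true;
    }

    public int count() {
        return pq.size();
    }
}
```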
excluding them completely is a slightly different task, you don't need to
index a special marker value, you can just use a
RangeFilter (or ConstantScoreRangeQuery) to ensure you only get docs with
a value for that field (ie: field:[* TO *])
Excellent, this is a much better solution. BTW, adding
Of course, that doesn't have to be the case. It would be a trivial
change to merge segments and not remove the deleted docs. That
usecase could be useful in conjunction with ParallelReader.
If the behavior of deleted docs during merging or optimization ever changes,
please make this
I'm looking at the new Payload api and would like to use it in the following
manner. Meta-data is indexed as a special phrase (all terms at same
position) and a payload is stored with the first term of each phrase. I
would like to create a custom query class that extends PhraseQuery and uses
its
and pass the payload to the Scorer as well is a possibility.
- Mark
Peter Keegan wrote:
I'm looking at the new Payload api and would like to use it in the
following
manner. Meta-data is indexed as a special phrase (all terms at same
position) and a payload is stored with the first term of each phrase
to produce a score? Just guessing here..
At some point, I would like to see more Query classes around the
payload stuff, so please submit patches/feedback if and when you get
a solution
On Jun 27, 2007, at 10:45 AM, Peter Keegan wrote:
I'm looking at the new Payload api and would like to use
I'm looking for Spans.getPositions(), as shown in BoostingTermQuery, but
neither NearSpansOrdered nor NearSpansUnordered (which are the Spans
provided by SpanNearQuery) provide this method and it's not clear to me how
to add it.
Peter
On 7/11/07, Chris Hostetter [EMAIL PROTECTED] wrote:
:
for the payloads, there may be more than one
for a single Span.
Regards,
Paul Elschot
Cheers,
Grant
On Jul 12, 2007, at 8:20 AM, Peter Keegan wrote:
I'm looking for Spans.getPositions(), as shown in
BoostingTermQuery, but
neither NearSpansOrdered nor NearSpansUnordered (which
The source data for my index is already in standard UTF-8 and available as a
simple byte array. I need to do some simple tokenization of the data (check
for whitespace and special characters that control position increment). What
is the most efficient way to index this data and avoid unnecessary
I guess this also ties in with 'getPositionIncrementGap', which is relevant
to fields with multiple occurrences.
Peter
On 7/27/07, Peter Keegan [EMAIL PROTECTED] wrote:
I have a question about the way fields are analyzed and inverted by the
index writer. Currently, if a field has multiple
I've built a production index with this patch and done some query stress
testing with no problems.
I'd give it a thumbs up.
Peter
On 7/30/07, testn [EMAIL PROTECTED] wrote:
Hi guys,
Do you think LUCENE-843 is stable enough? If so, do you think it's worth
to
release it with probably LUCENE
I'm trying to create a fairly complex SpanQuery from a binary parse tree.
I create SpanOrQueries from SpanTermQueries and combine SpanOrQueries into
BooleanQueries. So far, so good.
The problem is that I don't see how to create a SpanNotQuery from a
BooleanQuery and a SpanTermQuery. I want the
with interesting slops..
Erick
On 8/6/07, Peter Keegan [EMAIL PROTECTED] wrote:
I'm trying to create a fairly complex SpanQuery from a binary parse
tree.
I create SpanOrQueries from SpanTermQueries and combine SpanOrQueries
into
BooleanQueries. So far, so good.
The problem
I've been experimenting with using SpanQuery to perform what is essentially
a limited type of database 'join'. Each document in the index contains 1 or
more 'rows' of meta data from another 'table'. The meta data are simple
tokens representing a column name/value pair ( e.g. color$red or
I suppose it could go under performance or HowTo/Interesting uses of
SpanQuery.
Peter
On 8/13/07, Erick Erickson [EMAIL PROTECTED] wrote:
Thanks for writing this up. Do you think this is an appropriate subject
for the Wiki performance page?
Erick
On 8/13/07, Peter Keegan [EMAIL PROTECTED
I added this under Use Cases. Thanks for the suggestion.
Peter
On 8/13/07, Grant Ingersoll [EMAIL PROTECTED] wrote:
There is also a Use Cases item on the Wiki...
On Aug 13, 2007, at 3:26 PM, Peter Keegan wrote:
I suppose it could go under performance or HowTo/Interesting uses
If I use BoostingTermQuery on a query containing terms without payloads, I
get very different results than doing the same query with TermQuery.
Presumably, this is because the BoostingSpanScorer/SpanScorer compute scores
differently than TermScorer. Is there a way to make BoostingTermQuery behave
There are a couple of minor bugs in BoostingTermQuery.explain().
1. The computation of average payload score produces NaN if no payloads were
found. It should probably be:
float avgPayloadScore = super.score() * (payloadsSeen > 0 ? (payloadScore /
payloadsSeen) : 1);
2. If the average payload
I have been experimenting with payloads and BoostingTermQuery, which I think
are excellent additions to Lucene core. Currently, BoostingTermQuery extends
SpanQuery. I would suggest changing this class to extend TermQuery and
refactor the current version to something like 'BoostingSpanQuery'.
The
This is a nice alternative to using payloads and BoostingTermQuery. Is there
any reason not to make this change to SpanFirstQuery, in particular:
This modification to SpanFirstQuery would be that the Spans
returned by SpanFirstQuery.getSpans() must always return 0
from its start() method.
Should
Hi Brian,
I ran into something similar a long time ago. My custom sort objects were
being cached by Lucene, but there were too many of them because each one had
different 'reference values' for different queries. So, I changed the equals
and hashcode methods to NOT use any instance data, thus
Sridhar,
We have been using approach 2 in our production system with good results. We
have separate processes for indexing and searching. The main issue that came
up was in deleting old indexes (see: http://tinyurl.com/32q8c4). Most of
our production problems occur during indexing, and we are
Is it possible to compute a theoretical maximum score for a given query if
constraints are placed on 'tf' and 'lengthNorm'? If so, scores could be
compared to a 'perfect score' (a feature request from our customers)
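A hedged sketch of the 'perfect score' idea, assuming the classic Lucene scoring formula (coord x sum over terms of tf x idf^2 x queryNorm x norm, with tf(f) = sqrt(f) in DefaultSimilarity): if tf is capped at maxTf and lengthNorm at maxNorm, each term's contribution is bounded, so the whole query has a computable ceiling that real scores can be divided by to report a percentage. All names below are illustrative.

```java
// Upper bound on a disjunction's score under capped tf and lengthNorm.
// The coord factor is at most 1, so it can be dropped from the bound.
public class MaxScoreBound {
    public static float upperBound(float[] idfs, float queryNorm,
                                   float maxTf, float maxNorm) {
        float bound = 0f;
        for (float idf : idfs) {
            // DefaultSimilarity: tf(maxTf) = sqrt(maxTf)
            bound += (float) Math.sqrt(maxTf) * idf * idf * queryNorm * maxNorm;
        }
        return bound;
    }
}
```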
Here are some related threads on this:
In this thread:
If a SpanQuery is constructed from one or more BoostingTermQuery(s), the
payloads on the terms are never processed by the SpanScorer. It seems to me
that you would want the SpanScorer to score the document both on the spans
distance and the payload score. So, either the SpanScorer would have to
Ingersoll [EMAIL PROTECTED]
wrote:
I'm not fully following what you want. Can you explain a bit more?
Thanks,
Grant
On Jul 9, 2008, at 2:55 PM, Peter Keegan wrote:
If a SpanQuery is constructed from one or more BoostingTermQuery(s), the
payloads on the terms are never processed
PayloadNearQuery, see http://wiki.apache.org/lucene-java/Payload_Planning
I think it would make sense to develop these and I would be happy to help
shepherd a patch through, but am not in a position to generate said patch at
this moment in time.
On Jul 10, 2008, at 9:59 AM, Peter Keegan wrote
at it :)
Peter
On Thu, Jul 10, 2008 at 2:09 PM, Peter Keegan [EMAIL PROTECTED]
wrote:
I may take a crack at this. Any more thoughts you may have on the
implementation are welcome, but I don't want to distract you too much.
Thanks,
Peter
On Thu, Jul 10, 2008 at 1:30 PM, Grant Ingersoll [EMAIL
I'm using BoostingTermQuery to boost the score of documents with terms
containing payloads (boost value > 1). I'd like to change the scoring
behavior such that if a query contains multiple BoostingTermQuery terms
(either required or optional), documents containing more matching terms with
payloads
:
Not sure, but it sounds like you are interested in a higher level Query,
kind of like the BooleanQuery, but then part of it sounds like it is per
document, right? Is it that you want to deal with multiple payloads in a
document, or multiple BTQs in a bigger query?
On Nov 4, 2008, at 9:42 AM, Peter
that doc. Yet another
reason to use BoostingTermQuery.
Peter
On Thu, Nov 6, 2008 at 1:08 PM, Peter Keegan [EMAIL PROTECTED] wrote:
Let me give some background on the problem behind my question.
Our index contains many fields (title, body, date, city, etc). Most queries
search all fields
If you sort first by score, keep in mind that the raw scores are very
precise and you could see many unique values in the result set. The
secondary sort field would only be used to break equal scores. We had to use
a custom comparator to 'smooth out' the scores to allow the second field to
take
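The 'smoothing' can be sketched as a bucketing comparator (the granularity here is an assumption): quantize each raw score so that hits differing only in low-order digits compare as equal, letting the secondary field break the tie.

```java
import java.util.Comparator;

// Sketch of score smoothing: entries are [score, secondaryKey] pairs,
// ordered by score bucket descending, then secondary key descending.
public class BucketedScoreComparator implements Comparator<float[]> {
    private final float bucketSize;   // e.g. 0.05f collapses nearby scores

    public BucketedScoreComparator(float bucketSize) {
        this.bucketSize = bucketSize;
    }

    public int bucket(float score) {
        return (int) Math.floor(score / bucketSize);
    }

    @Override
    public int compare(float[] a, float[] b) {
        int byBucket = Integer.compare(bucket(b[0]), bucket(a[0]));
        return byBucket != 0 ? byBucket : Float.compare(b[1], a[1]);
    }
}
```

Note that if such a comparator were plugged into Lucene's sort cache, its equals/hashCode would need care (as discussed elsewhere in this thread) since the bucket size is per-instance state.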
performance? (I haven't
tried it yet).
Thanks,
Peter
On Thu, Nov 6, 2008 at 6:56 PM, Steven A Rowe [EMAIL PROTECTED] wrote:
Hi Peter,
On 11/06/2008 at 4:25 PM, Peter Keegan wrote:
I've discovered another flaw in using this technique:
(+contents:petroleum +contents:engineer +contents:refinery
Hi Karl,
I use payloads for weight only, too, with BoostingTermQuery (see:
http://www.nabble.com/BoostingTermQuery-scoring-td20323615.html#a20323615)
A custom tokenizer looks for the reserved character '\b' followed by a 2
byte 'boost' value. It then creates a special Token type for a custom
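The in-band marker described above can be sketched in plain Java (the 2-byte big-endian encoding is an assumption; the real tokenizer's format may differ): split the token at the reserved '\b' character and decode the boost that follows.

```java
// Sketch of decoding a token like "engineer\b<hi><lo>": the bare term
// precedes the '\b' marker; the two bytes after it encode the boost,
// which a custom tokenizer would turn into a payload.
public class BoostMarker {
    // Returns the bare term before the '\b' marker, or the input if none.
    public static String term(String token) {
        int i = token.indexOf('\b');
        return i < 0 ? token : token.substring(0, i);
    }

    // Decodes the 2-byte boost after '\b' as an unsigned big-endian short;
    // returns -1 when the token carries no marker.
    public static int boost(String token) {
        int i = token.indexOf('\b');
        if (i < 0 || token.length() < i + 3) {
            return -1;
        }
        return ((token.charAt(i + 1) & 0xFF) << 8) | (token.charAt(i + 2) & 0xFF);
    }
}
```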
The explanation of scores from the same document returned from 2 similar
queries differ in an unexpected way. There are 2 fields involved, 'contents'
and 'literals'. The 'literals' field has setBoost = 0. As you can see from
the explanations below, the total weight of the matching terms from the
Any comments about this? Is this just the way queryNorm works or is this a
bug?
Thanks,
Peter
On Fri, Feb 20, 2009 at 4:03 PM, Peter Keegan peterlkee...@gmail.com wrote:
The explanation of scores from the same document returned from 2 similar
queries differ in an unexpected way. There are 2
Got it. This is another example of why scores can't be compared between
(even similar) queries.
(we don't)
Thanks.
On Fri, Feb 27, 2009 at 11:39 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
On Fri, Feb 27, 2009 at 9:15 AM, Peter Keegan peterlkee...@gmail.com
wrote:
Any comments about
in situations where you deal with simple query types, and matching query
structures, the queryNorm
*can* be used to make scores semi-comparable.
Hmm. My example used matching query structures. The only difference was a
single term in a field with zero weight that didn't exist in the matching
no effect on the
score, when combined with the above. This seems ok in this example since
the matching terms had boost = 0.
Thanks Yonik,
Peter
On Sat, Feb 28, 2009 at 6:02 PM, Yonik Seeley yo...@lucidimagination.com wrote:
On Sat, Feb 28, 2009 at 3:02 PM, Peter Keegan peterlkee...@gmail.com
On Sun, Mar 1, 2009 at 8:57 PM, Peter Keegan peterlkee...@gmail.com
wrote:
As suggested, I added a query-time boost of 0.0f to the 'literals' field
(with index-time boost still there) and I did get the same scores for
both
queries :) (there is a subtlety between index-time and query-time
The DefaultSimilarity class defines sloppyFreq as:
public float sloppyFreq(int distance) {
return 1.0f / (distance + 1);
}
For a 'SpanNearQuery', this reduces the effect of the term frequency on the
score as the number of terms in the span increases. So, for a simple phrase
query (using
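To make the formula concrete, here is the method stand-alone with its values at a few distances: an exact match contributes 1.0, and each extra position of slop shrinks the contribution hyperbolically.

```java
// DefaultSimilarity's sloppyFreq, reproduced verbatim for illustration:
// the wider the matched span, the smaller each occurrence counts.
public class SloppyFreqDemo {
    public static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }
}
```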