Re: SpanQuery for Terms at same position
On Wednesday 25 November 2009 21:20:33, Christopher Tignor wrote: It's worth noting, however, that this -1 slop doesn't seem to work for cases where you want to discover instances of more than two terms at the same position. It would be nice to be able to set this explicitly in the query construction.

I think requiring n terms at the same position would need a slop of 1-n, and I'd like to have some test cases added for that. Now if I only had some time...

Regards,
Paul Elschot

On Tue, Nov 24, 2009, Christopher Tignor wrote: Yes, that indeed works for me. Thanks, CT

On Mon, Nov 23, 2009, Paul Elschot wrote: On Monday 23 November 2009 20:07:58, Christopher Tignor wrote: Also, I noticed that with the above edit to NearSpansOrdered I am getting erroneous results for normal ordered searches, using searches like "_n followed by work": because _n and work are at the same position, the code changes accept their pairing as a valid in-order result now that the equal-to clause has been added to the inequality.

Thanks for trying this. Indeed, the followed-by semantics are broken for the ordered case when spans at the same positions are considered ordered. Did I understand correctly that the unordered case with a slop of -1 and without the edit works to match terms at the same position? In that case it may be worthwhile to add that to the javadocs, and also to add a few test cases. Regards, Paul Elschot

On Mon, Nov 23, 2009, Christopher Tignor wrote: Thanks so much for this. Using an unordered query, the -1 slop indeed returns the correct results, matching tokens at the same position. I tried the same query but ordered, both after and before rebuilding the source with Paul's changes to NearSpansOrdered, but the query was still failing, returning no results. CT

On Mon, Nov 23, 2009, Mark Miller wrote: You're trying -1 with ordered, right? Try it with non-ordered.

Christopher Tignor wrote: A slop of -1 doesn't work either. I get no results returned. This would be a *really* helpful feature for me if someone might suggest an implementation, as I would really like to be able to do arbitrary span searches where tokens may be at the same position, and also in other positions where the ordering of subsequent terms may be restricted as per the normal span API. Thanks, CT

On Sun, Nov 22, 2009, Paul Elschot wrote: On Sunday 22 November 2009 04:47:50, Adriano Crestani wrote: Hi, I didn't test, but you might want to try SpanNearQuery and set slop to zero. Give it a try and let me know if it worked.

The slop is the number of positions in between, so zero would still be too much to only match at the same position. SpanNearQuery may or may not work for a slop of -1, but one could try that for both the ordered and unordered cases. One way to do that is to start from the existing test cases. Regards, Paul Elschot

On Thu, Nov 19, 2009, Christopher Tignor wrote: Hello, I would like to search for all documents that contain both "plan" and "_v" (my part-of-speech token for verb) at the same position. I have tokenized the documents accordingly, so these tokens exist at the same location. I can achieve this programmatically using PhraseQueries by adding the Terms explicitly at the same position, but I need to be able to recover the Payload data for each term found within the matched instance of my query. Unfortunately the PayloadSpanUtil doesn't seem to return the same results as the PhraseQuery, possibly because it is converting it into Spans first, which do not support searching for Terms at the same document position? Any help appreciated.

Thanks, CT

--
TH!NKMAP Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012 p.212-285-8600 x385 f.212-285-8999

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
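Paul's "slop of 1-n" can be checked with a little arithmetic. As I read the unordered case (NearSpansUnordered), a set of candidate sub-spans matches when maxEnd - minStart - (total length of the sub-spans) is at most the slop. The sketch below is plain Java with no Lucene dependency; the class and method names are illustrative, not Lucene API:

```java
// A sketch, outside Lucene, of the unordered slop arithmetic: a candidate
// match needs (maxEnd - minStart - totalSpanLength) <= slop. Each sub-span
// is an int[]{start, end} with end exclusive.
public class SameSpanSlop {
    static int matchSlop(int[][] spans) {
        int minStart = Integer.MAX_VALUE, maxEnd = Integer.MIN_VALUE, totalLength = 0;
        for (int[] s : spans) {
            minStart = Math.min(minStart, s[0]);
            maxEnd = Math.max(maxEnd, s[1]);
            totalLength += s[1] - s[0];
        }
        // positions "in between"; goes negative when the spans overlap
        return maxEnd - minStart - totalLength;
    }

    public static void main(String[] args) {
        // two single terms at position 5: slop -1 is enough
        System.out.println(matchSlop(new int[][] {{5, 6}, {5, 6}}));         // -1
        // n terms at the same position need a slop of 1 - n
        System.out.println(matchSlop(new int[][] {{5, 6}, {5, 6}, {5, 6}})); // -2
        // adjacent terms: the usual slop 0
        System.out.println(matchSlop(new int[][] {{5, 6}, {6, 7}}));         // 0
    }
}
```

With n single terms at one position the left side is 1 - n, which is why two terms need slop -1 and three would need -2.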
Re: SpanQuery for Terms at same position
On Monday 23 November 2009 17:27:56, Christopher Tignor wrote: A slop of -1 doesn't work either. I get no results returned.

I think the problem is in the NearSpansOrdered.docSpansOrdered methods. Could you replace the < by <= in there (4 times) and try again? That will allow spans at the same position to be considered ordered. From a quick reading of the code, both the unordered and ordered cases might work for a slop of -1 with that modification.

Christopher Tignor wrote: This would be a *really* helpful feature for me if someone might suggest an implementation, as I would really like to be able to do arbitrary span searches where tokens may be at the same position, and also in other positions where the ordering of subsequent terms may be restricted as per the normal span API.

My pleasure,
Paul Elschot
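For reference, the predicate under discussion can be modeled in a few lines. This is a simplified sketch of NearSpansOrdered.docSpansOrdered (reduced to ints; not the actual Lucene source), showing why the < to <= edit helps the same-position case but, as reported later in the thread, breaks strict "followed by" ordering:

```java
// A simplified model of NearSpansOrdered.docSpansOrdered (a sketch, not
// the actual Lucene source): span 1 must come before span 2.
public class SpanOrderSketch {
    // original predicate: strictly ordered
    static boolean ordered(int start1, int end1, int start2, int end2) {
        return (start1 == start2) ? (end1 < end2) : (start1 < start2);
    }

    // with "<" replaced by "<=": equal spans now count as ordered
    static boolean orderedWithEdit(int start1, int end1, int start2, int end2) {
        return (start1 == start2) ? (end1 <= end2) : (start1 <= start2);
    }

    public static void main(String[] args) {
        // two single terms at the same position, spans [5,6) and [5,6):
        System.out.println(ordered(5, 6, 5, 6));         // false: no match
        System.out.println(orderedWithEdit(5, 6, 5, 6)); // true: same position matches,
        // but a strict "followed by" query now wrongly accepts them too,
        // which is the erroneous in-order result reported in this thread.
    }
}
```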
Re: SpanQuery for Terms at same position
Op maandag 23 november 2009 20:07:58 schreef Christopher Tignor: Also, I noticed that with the above edit to NearSpansOrdered I am getting erroneous results fo normal ordered searches using searches like: _n followed by work where because _n and work are at the same position the code changes accept their pairing as a valid in-order result now that the eqaul to clause has been added to the inequality. Thanks for trying this. Indeed the followed by semantics is broken for the ordered case when spans at the same positions are considered ordered. Did I understand correctly that the unordered case with a slop of -1 and without the edit works to match terms at the same position? In that case it may be worthwhile to add that to the javadocs, and also add a few testcases. Regards, Paul Elschot CT On Mon, Nov 23, 2009 at 12:26 PM, Christopher Tignor ctig...@thinkmap.comwrote: Thanks so much for this. Using an un-ordered query, the -1 slop indeed returns the correct results, matching tokens at the same position. I tried the same query but ordered both after and before rebuilding the source with Paul's changes to NearSpansOrdered but the query was still failing, returning no results. CT On Mon, Nov 23, 2009 at 11:59 AM, Mark Miller markrmil...@gmail.comwrote: Your trying -1 with ordered right? Try it with non ordered. Christopher Tignor wrote: A slop of -1 doesn't work either. I get no results returned. this would be a *really* helpful feature for me if someone might suggest an implementation as I would really like to be able to do arbitrary span searches where tokens may be at the same position and also in other positions where the ordering of subsequent terms may be restricted as per the normal span API. thanks, CT On Sun, Nov 22, 2009 at 7:50 AM, Paul Elschot paul.elsc...@xs4all.nl wrote: Op zondag 22 november 2009 04:47:50 schreef Adriano Crestani: Hi, I didn't test, but you might want to try SpanNearQuery and set slop to zero. Give it a try and let me know if it worked. 
The slop is the number of positions in between, so zero would still be too much to only match at the same position. SpanNearQuery may or may not work for a slop of -1, but one could try that for both the ordered and unordered cases. One way to do that is to start from the existing test cases. Regards, Paul Elschot Regards, Adriano Crestani On Thu, Nov 19, 2009 at 7:28 PM, Christopher Tignor ctig...@thinkmap.comwrote: Hello, I would like to search for all documents that contain both plan and _v (my part of speech token for verb) at the same position. I have tokenized the documents accordingly so these tokens exists at the same location. I can achieve programaticaly using PhraseQueries by adding the Terms explicitly at the same position but I need to be able to recover the Payload data for each term found within the matched instance of my query. Unfortunately the PayloadSpanUtil doesn't seem to return the same results as the PhraseQuery, possibly becuase it is converting it inoto Spans first which do not support searching for Terms at the same document position? Any help appreciated. thanks, CT -- TH!NKMAP Christopher Tignor | Senior Software Architect 155 Spring Street NY, NY 10012 p.212-285-8600 x385 f.212-285-8999 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- TH!NKMAP Christopher Tignor | Senior Software Architect 155 Spring Street NY, NY 10012 p.212-285-8600 x385 f.212-285-8999 -- TH!NKMAP Christopher Tignor | Senior Software Architect 155 Spring Street NY, NY 10012 p.212-285-8600 x385 f.212-285-8999 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: SpanQuery for Terms at same position
Op zondag 22 november 2009 04:47:50 schreef Adriano Crestani: Hi, I didn't test, but you might want to try SpanNearQuery and set slop to zero. Give it a try and let me know if it worked. The slop is the number of positions in between, so zero would still be too much to only match at the same position. SpanNearQuery may or may not work for a slop of -1, but one could try that for both the ordered and unordered cases. One way to do that is to start from the existing test cases. Regards, Paul Elschot Regards, Adriano Crestani On Thu, Nov 19, 2009 at 7:28 PM, Christopher Tignor ctig...@thinkmap.comwrote: Hello, I would like to search for all documents that contain both plan and _v (my part of speech token for verb) at the same position. I have tokenized the documents accordingly so these tokens exists at the same location. I can achieve programaticaly using PhraseQueries by adding the Terms explicitly at the same position but I need to be able to recover the Payload data for each term found within the matched instance of my query. Unfortunately the PayloadSpanUtil doesn't seem to return the same results as the PhraseQuery, possibly becuase it is converting it inoto Spans first which do not support searching for Terms at the same document position? Any help appreciated. thanks, CT -- TH!NKMAP Christopher Tignor | Senior Software Architect 155 Spring Street NY, NY 10012 p.212-285-8600 x385 f.212-285-8999 - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Efficient filtering advice
Try a MultiTermQueryWrapperFilter instead of the QueryFilter; I'd expect a modest gain in performance. In case it is possible to form a few groups of terms that are reused, it could be even more efficient to also use a CachingWrapperFilter for each of these groups.

Regards,
Paul Elschot

On Sunday 22 November 2009 15:48:39, Eran Sevi wrote: Hi, I need to filter my queries using a rather large subset of terms (can be 10K or even 50K). All these terms are sure to exist in the index, so the number of results can be about the same as the number of terms in the filter. The terms are numbers, but they are not consecutive and are drawn from a large set of possible values (so range queries are probably not good for me). The index itself is about 1M docs, and running even a simple query with such a large filter takes a lot of time, even if the number of results is only a few hundred docs.

It seems like the speed is affected by the length of the filter even if the number of results remains more or less the same. That is logical, but not by such a large loss of performance as I'm experiencing: running the query with a 10K-term filter takes an average of 1s 187ms with 600 results, while running it with a 50K-term filter takes an average of 5s 207ms with 1000 results.

Currently I'm using a QueryFilter with a BooleanQuery in which I OR the different terms together. I also can't use a cached filter efficiently, since the terms to filter on change almost every query. I was wondering if there's a better way to filter my queries so they won't take a few seconds to run?

Thanks in advance for any advice,
Eran.
Re: Efficient filtering advice
On Sunday 22 November 2009 17:23:53, Eran Sevi wrote: Thanks for the tips. I'm still using version 2.4, so I can't use MultiTermQueryWrapperFilter, but I'll definitely try to re-group the terms that are not changing in order to cache them. How can I join several such filters together?

There are various ways. OpenBitSet and OpenBitSetDISI can do this, and there are also BooleanFilter and ChainedFilter in contrib.

Eran Sevi wrote: Using FieldCacheTermsFilter sounds promising. Fortunately it is a single-value field (our unique doc id).

Regards,
Paul Elschot

Eran Sevi wrote: I'll consider very seriously moving to 2.9.1 in order to try it out and see if I can get some real gain from using it, or maybe using TermsFilter from contrib.

On Sun, Nov 22, 2009 at 6:10 PM, Uwe Schindler wrote: Maybe this helps you, but read the docs, it will work only with single-value fields:
http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/FieldCacheTermsFilter.html

Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de eMail: u...@thetaphi.de
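Joining several cached group filters comes down to ANDing their document bit sets. The sketch below models that with java.util.BitSet standing in for Lucene's OpenBitSet (illustrative names, no Lucene dependency): a document survives only if it is in the query's set and in every filter's set.

```java
import java.util.BitSet;

// A sketch of joining filters, assuming each filter has already been
// materialized as a bit set over docIds (java.util.BitSet stands in for
// Lucene's OpenBitSet here; the names are illustrative).
public class FilterJoin {
    // AND the cached group filters into the query's doc set.
    static BitSet intersect(BitSet queryDocs, BitSet... filters) {
        BitSet result = (BitSet) queryDocs.clone();
        for (BitSet f : filters) {
            result.and(f); // doc survives only if every filter accepts it
        }
        return result;
    }

    // helper to build a bit set from docIds
    static BitSet bits(int... docs) {
        BitSet b = new BitSet();
        for (int d : docs) b.set(d);
        return b;
    }

    public static void main(String[] args) {
        BitSet query  = bits(1, 2, 5, 8);
        BitSet groupA = bits(2, 5, 8, 9);
        BitSet groupB = bits(0, 2, 8);
        System.out.println(intersect(query, groupA, groupB)); // {2, 8}
    }
}
```

A CachingWrapperFilter around each stable group would let the per-group bit sets be reused across queries, so only the intersection is paid per query.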
Re: Proposal for changing Lucene's backwards-compatibility policy
On Friday 16 October 2009 08:57:37, Michael Busch wrote: Hello Lucene users,

In the past we have discussed our backwards-compatibility policy frequently on the Lucene developer mailing list, and we are thinking about making some significant changes. In this mail I'd like to outline the proposed changes to get some feedback from the user community.

Our current backwards-compatibility policy regarding API changes states that we can only make changes that break backwards compatibility in major releases (3.0, 4.0, etc.); the next major release is the upcoming 3.0. Given how infrequently we have made major releases of Lucene in the past, this means that deprecated APIs need to stay in Lucene for a very long time. E.g. if we deprecate an API in 3.1, we'll have to wait until 4.0 before we can remove it. This means that the code gets very cluttered, and adding new features gets somewhat more difficult, as attention has to be paid to properly supporting the old *and* new APIs for quite a long time. The current policy also leads to delaying the last minor release before a major release (e.g. 2.9), because the developers consider it the last chance for a long time to introduce new APIs and deprecate old ones.

The proposal now is to change this policy so that an API can only be removed if it was deprecated in at least one release, which can be a major *or* minor release. E.g. if we deprecate an API and release it with 3.1, we can remove it with the 3.2 release. The obvious downside of this proposal is that a simple jar drop-in replacement will not be possible anymore with almost every Lucene release (excluding bugfix releases, e.g. 2.9.0 to 2.9.1). However, you can be sure that if you're using a non-deprecated API, it will be in the next release.

Note that of course these proposed changes do not affect backwards compatibility with old index formats; i.e. it will still be possible to read all 3.X indexes with any Lucene 4.X version.
Our main goal is to find the right balance between backwards-compatibility support for all the Lucene users out there and fast, productive development of new features. The developers haven't come to an agreement on this proposal yet. Potentially giving up the drop-in replacement promise that Lucene could make in the past is the main reason for the struggle the developers are in, and why we'd like to ask the user community for feedback to help us make a decision. After we have gathered some feedback here, we will call a vote on the development mailing list, where the committers have to officially decide whether to make these changes or not.

So please tell us which you prefer as a back-compatibility policy for Lucene:

A) best-effort drop-in back compatibility for minor version numbers (e.g. v3.5 will be compatible with v3.2)

B) best-effort drop-in back compatibility for the next minor version number only, where deprecations may be removed after one minor release (e.g. v3.3 will be compatible with v3.2, but not v3.4)

I'd prefer B), with a minimum period of about two months to the next release in case it removes deprecations.

Regards,
Paul Elschot
Re: faceted search performance
On Monday 12 October 2009 23:29:07, Christoph Boosz wrote: Hi Paul, thanks for your suggestion. I will test it within the next few days. However, due to memory limitations, it will only work if the number of hits is small enough, am I right?

One can load a single term vector at a time, so in this case the memory limitation is only in the possibly large map of doc counters per term. For best performance, try to load the term vectors in docId order, after the original query has completed. In any case it would be good to somehow limit the number of documents considered, for example by using the ones with the best query score. Limiting the number of terms would also be good, but that is less easy.

Regards,
Paul Elschot

2009/10/12 Paul Elschot: Chris, you could also store term vectors for all docs at indexing time, and add the term vectors for the matching docs into a (large) map of terms in RAM.

On Monday 12 October 2009 21:30:48, Christoph Boosz wrote: Hi Jake, thanks for your helpful explanation. In fact, my initial solution was to traverse each document in the result once and count the contained terms. As you mentioned, this process took a lot of memory. Trying to confine the memory usage with the facet approach, I was surprised by the decline in performance. Now I know it's nothing abnormal, at least. Chris

2009/10/12 Jake Mannix: Hey Chris,

On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz wrote: Thanks for your reply. Yes, it's likely that many terms occur in few documents. If I understand you right, I should do the following:
- Write a HitCollector that simply increments a counter
- Get the filter for the user query once: new CachingWrapperFilter(new QueryWrapperFilter(userQuery));
- Create a TermQuery for each term
- Perform the search and read the counter of the HitCollector
I did that, but it didn't get faster. Any ideas why?

The killer is the "TermQuery for each term" part; this is huge. You need to invert this process and use your query as is, but while walking in the HitCollector, on each doc which matches your query, increment counters for each of the terms in that document. This means you need an in-memory forward lookup for your documents, like a multivalued FieldCache, and if you've got roughly the same number of terms as documents, this cache is likely to be as large as your entire index: a pretty hefty RAM cost.

But a good thing to keep in mind is that doing this kind of faceting (massively multivalued on a huge term set) requires a lot of computation, even if you have all the proper structures living in memory: for each document you look at (which matches your query), you need to look at all of the terms in that document and increment a counter for each term. So however much time it would normally take you to do the driving query, it can take as much as that multiplied by the average number of terms per document in your index. If your documents are big, this could be a pretty huge latency penalty.

-jake
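Paul's term-vector suggestion can be sketched without Lucene: after the driving query has produced its matching docIds, walk them in docId order, load each document's term vector (modeled here as a String[] per docId), and add the terms into one map of counters. The names are illustrative, not Lucene API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// A sketch of facet counting from stored term vectors: one pass over the
// matching docs, one shared map of counters (the "large map of terms in
// RAM" mentioned above).
public class TermVectorFacets {
    static Map<String, Integer> countTerms(int[] matchingDocs,
                                           Map<Integer, String[]> termVectors) {
        int[] docs = matchingDocs.clone();
        Arrays.sort(docs); // docId order keeps term-vector reads sequential
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int doc : docs) {
            for (String term : termVectors.get(doc)) {
                Integer c = counts.get(term);
                counts.put(term, c == null ? 1 : c + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, String[]> tv = new HashMap<Integer, String[]>();
        tv.put(0, new String[] {"plan", "work"});
        tv.put(1, new String[] {"plan"});
        tv.put(2, new String[] {"work"});
        // suppose only docs 1 and 2 matched the driving query
        System.out.println(countTerms(new int[] {2, 1}, tv).get("plan")); // 1
    }
}
```

Limiting the matching docs (e.g. to the best-scoring ones, as suggested above) directly bounds both the work per query and the size of the counter map.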
Re: faceted search performance
On Monday 12 October 2009 14:53:45, Christoph Boosz wrote: Hi, I have a question related to faceted search. My index contains more than 1 million documents and nearly 1 million terms. My aim is to get a DocIdSet for each term occurring in the result of a query. I use the approach described at http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html, where a BitSet is built out of a QueryFilter for each term and intersected with the BitSet representing the user query. However, performance could be better. I guess it's because the term filter considers each document in the index, even if it's not in the result. My attempt to use a ChainedFilter, where the first filter (cached) is for the user query and the second one for the term (done for all terms), didn't speed things up, though. Am I missing something? Is there a better way to get the DocIdSets for a huge number of terms in a limited set of documents?

Assuming you only need the number of documents within the original query that contain each term, one thing that can be saved is the allocation of the resulting BitSet for each term. To do this, use the cached BitSet (or the OpenBitSet in current Lucene) for the original Query as a filter for a TermQuery per term, and then count the matching documents by using a counting HitCollector on the IndexSearcher.

Regards,
Paul Elschot
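The counting idea above can be sketched in plain Java: keep the original query's cached bit set, and for each term count how many of that term's documents fall inside it, without allocating a result BitSet per term. java.util.BitSet stands in for Lucene's (Open)BitSet, and the increment is what a counting HitCollector would do per collected doc; names are illustrative.

```java
import java.util.BitSet;

// A sketch of per-term counting within a cached query filter: no per-term
// result set is allocated, only a counter.
public class CountingCollectorSketch {
    static int countWithinQuery(BitSet queryDocs, int[] termDocs) {
        int count = 0;
        for (int doc : termDocs) {       // docs where the term occurs
            if (queryDocs.get(doc)) {    // doc also matches the original query
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        BitSet query = new BitSet();
        query.set(1);
        query.set(4);
        query.set(7);
        System.out.println(countWithinQuery(query, new int[] {0, 4, 7, 9})); // 2
    }
}
```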
Re: faceted search performance
Chris,

You could also store term vectors for all docs at indexing time, and add the term vectors for the matching docs into a (large) map of terms in RAM.

Regards,
Paul Elschot

On Monday 12 October 2009 21:30:48, Christoph Boosz wrote: Hi Jake, thanks for your helpful explanation. In fact, my initial solution was to traverse each document in the result once and count the contained terms. As you mentioned, this process took a lot of memory. Trying to confine the memory usage with the facet approach, I was surprised by the decline in performance. Now I know it's nothing abnormal, at least. Chris
Re: speed of BooleanQueries on 2.9
On Wednesday 15 July 2009 17:16:23, Michael McCandless wrote: So now I'm confused. Since your query has required (+) clauses, the setAllowDocsOutOfOrder should have no effect, on either 2.4 or trunk.

Probably the top-level BQ is using BS2 because of the required clauses, but the nested BQs are using BS because the docs are allowed out of order. In that case BS2 will use skipTo() on BS, and the BS.skipTo() implementation could well be the culprit for performance. A long time ago BS.skipTo() used to throw an unsupported operation exception, but that does not seem to be happening. Eks, could you try a toString() on the top-level scorer for one of the affected queries, to see whether it shows BS2 on the top level and BS for the inner scorers?

Regards,
Paul Elschot

BooleanQuery only uses BooleanScorer when there are no required terms and allowDocsOutOfOrder is true. So I can't explain why you see this setting changing anything on this query...

Mike

On Tue, Jul 14, 2009 at 7:04 PM, eks dev wrote: I do not know exactly why, but with BooleanQuery.setAllowDocsOutOfOrder(true) I have the problem, and with setAllowDocsOutOfOrder(false) no problems whatsoever. Not really a scientific method to find such a bug, but it does the job and makes me happy. Empirically, deprecated methods are not to be taken as thoroughly tested, as they have a short life expectancy.

- Original Message -
From: eks dev
To: java-user@lucene.apache.org
Sent: Wednesday, 15 July, 2009 0:24:43
Subject: Re: speed of BooleanQueries on 2.9

Mike, we are definitely hitting something with this one! We had a report from our QA chaps that our servers got stuck (the limit is 180 seconds per request)... We average 14 requests per second. It has nothing to do with gc(), as we can repeat it with a freshly restarted searcher.

It happens on less than 0.1% of queries, not much of a pattern, repeatable on our index... It is always a combination of two expanded tokens (we use minimumNumberShouldMatch):

(+(t1 [up to 40 expansions]) +(t2 [up to 40 expansions of t2]))

All tokens have a boost set, and minNumShouldMatch is set to two. I cannot provide a self-contained test, nor the index (it contains sensitive data and is rather big, ~5G). I can repeat this test on t1 and t2 with 40 expansions each. Even if I take the most frequent tokens in the collection, it runs well under one second... but these two particular tokens with their expansions make it run forever. And yes, if I run t1 plus expansions only, it runs super fast; the same for t2.

Java 1.4U14, tried with 1.6U6, no changes... Will report if I dig something out.

Partial stack trace while stuck, cpu is at max:
org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(Unknown Source)
org.apache.lucene.search.BooleanScorer.score(Unknown Source)
org.apache.lucene.search.BooleanScorer.score(Unknown Source)
org.apache.lucene.search.IndexSearcher.search(Unknown Source)
org.apache.lucene.search.IndexSearcher.search(Unknown Source)
org.apache.lucene.search.Searcher.search(Unknown Source)

- Original Message -
From: eks dev
To: java-user@lucene.apache.org
Sent: Monday, 13 July, 2009 13:28:45
Subject: Re: speed of BooleanQueries on 2.9

Hi Mike, getMaxNumOfCandidates() in the test was 200; the index is optimised and read-only. We found (due to an error in our warm-up code, funny) that only this query runs slower on 2.9. A hint where to look could be that this query contains the two most frequent tokens in two particular fields, NAME:hans and ZIPS:berlin (the index has ca. 80M very short documents, 3M unique terms). But all of this *could be just wrong measurement*; I just could not spend more time to get to the bottom of this. We moved forward as we got better overall average performance (a sweet 10% on average) on a much bigger real query log from our regression test. Anyhow, I just wanted to throw it out, maybe it triggers some synapses :) If false alarm, sorry.

- Original Message -
From: Michael McCandless
To: java-user@lucene.apache.org
Sent: Monday, 13 July, 2009 11:50:48
Subject: Re: speed of BooleanQueries on 2.9

This is not expected; 2.9 has had a number of changes that ought to reduce the CPU cost of searching. If this holds up, we definitely need to get to the root cause. Did your test exclude the warmup query for both 2.4.1 and 2.9? How many segments are in the index? What is the actual value of getMaxNumOfCandidates()? If you simplify the query down (e.g. just do the NAME clause or the ZIPS clause alone), are those also 4X slower?

Mike

On Sun, Jul 12, 2009 at 12:53 PM, eks dev wrote: Is it possible
Re: speed of BooleanQueries on 2.9
As long as next(), skipTo(), doc() and score() on a Scorer work, the search will be done. I hope the results are correct in this case, but I'm not sure. Regards, Paul Elschot On Wednesday 15 July 2009 19:08:00 Michael McCandless wrote: I don't think a toplevel BS2 is able to use BS as sub-scorers? BS2 needs to do doc-at-once, for all sub-scorers, but BS can't do that. I think? Mike On Wed, Jul 15, 2009 at 12:10 PM, Paul Elschotpaul.elsc...@xs4all.nl wrote: On Wednesday 15 July 2009 17:16:23 Michael McCandless wrote: So now I'm confused. Since your query has required (+) clauses, the setAllowDocsOutOfOrder should have no effect, on either 2.4 or trunk. Probably the top level BQ is using BS2 because of the required clauses, but the nested BQ's are using BS because the docs are allowed out of order. In that case BS2 will use skipTo() on BS, and the BS.skipTo() implementation could well be the culprit for performance. A long time ago BS.skipTo() used to throw an unsupported operation exception, but that does not seem to be happening. Eks, could you try a toString() on the top level scorer for one of the affected queries to see whether it shows BS2 on top level and BS for the inner scorers? Regards, Paul Elschot BooleanQuery only uses BooleanScorer when there are no required terms, and allowDocsOutOfOrder is true. So I can't explain why you see this setting changing anything on this query... Mike On Tue, Jul 14, 2009 at 7:04 PM, eks deveks...@yahoo.co.uk wrote: I do not know exactly why, but when I BooleanQuery.setAllowDocsOutOfOrder(true); I have the problem, but with setAllowDocsOutOfOrder(false); no problems whatsoever not really scientific method to find such bug, but does the job and makes me happy. 
Empirical, deprecated methods are not to be taken as thoroughly tested, as they have a short life expectancy. - Original Message From: eks dev eks...@yahoo.co.uk To: java-user@lucene.apache.org Sent: Wednesday, 15 July, 2009 0:24:43 Subject: Re: speed of BooleanQueries on 2.9 Mike, we are definitely hitting something with this one! We had a report from our QA chaps that our servers got stuck (the limit is 180 seconds per request)... We average 14 requests per second. It has nothing to do with gc() as we can repeat it with a freshly restarted searcher. - it happens on less than 0.1% of queries, not much of a pattern, repeatable on our index... it is always a combination of two expanded tokens (we use minimumNumberShouldMatch)... (+(t1 [up to 40 expansions]) +(t2 [up to 40 expansions of t2])) all tokens have a boost set, and minNumShouldMatch is set to two. I cannot provide a self-contained test, nor the index (it contains sensitive data and is rather big, ~5G). I can repeat this test on t1 and t2 with 40 expansions each. Even if I take the most frequent tokens in the collection it runs well under one second... but these two particular tokens with their expansions make it run forever... and yes, if I run t1 plus expansions only, it runs super fast, the same for t2. Java 1.4U14, tried with 1.6U6, no changes...
Will report if I dig something out. Partial stack trace while stuck, cpu at max: org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(Unknown Source) org.apache.lucene.search.BooleanScorer.score(Unknown Source) org.apache.lucene.search.BooleanScorer.score(Unknown Source) org.apache.lucene.search.IndexSearcher.search(Unknown Source) org.apache.lucene.search.IndexSearcher.search(Unknown Source) org.apache.lucene.search.Searcher.search(Unknown Source) - Original Message From: eks dev To: java-user@lucene.apache.org Sent: Monday, 13 July, 2009 13:28:45 Subject: Re: speed of BooleanQueries on 2.9 Hi Mike, getMaxNumOfCandidates() in the test was 200, the index is optimised and read-only. We found (due to an error in our warm-up code, funny) that only this query runs slower on 2.9. A hint where to look could be that this query contains the two most frequent tokens in two particular fields, NAME:hans and ZIPS:berlin (the index has ca. 80 million very short documents, 3 million unique terms). But all of this *could be just wrong measurement*, I just could not spend more time to get to the bottom of this. We moved forward as we got overall better average performance (a sweet 10% on average) on a much bigger real query log from our regression test. Anyhow, I just wanted to throw it out, maybe it triggers some synapses :) If it's a false alarm, sorry. - Original Message From: Michael McCandless To: java-user@lucene.apache.org
Re: Boolean retrieval
It is also possible to use the HitCollector api and simply ignore the score values. Regards, Paul Elschot On Saturday 04 July 2009 21:14:41 Mark Harwood wrote: Check out BooleanFilter in contrib/queries. It can be wrapped in a ConstantScoreQuery. On 4 Jul 2009, at 17:37, Lukas Michelbacher miche...@ims.uni-stuttgart.de wrote: This is about an experiment comparing plain Boolean retrieval with vector-space-based retrieval. I would like to disable all of Lucene's scoring mechanisms and just run a true Boolean query that returns exactly the documents that match a query specified in Boolean syntax (OR, AND, NOT). No scoring or sorting required. As far as I can see, this is not supported out of the box. Which classes would I have to modify? Would it be enough to create a subclass of Similarity, ignore all terms but one (coord, say) and make this term return 1 if the query matches the document and 0 otherwise? Lukas -- Lukas Michelbacher Institute for Natural Language Processing Universität Stuttgart email: miche...@ims.uni-stuttgart.de
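The pure Boolean retrieval Lukas asks about — keep the Boolean matching, drop the scoring — can be illustrated with a toy model: plain set operations over term-to-doc-id postings. This is a sketch of the semantics only, not the Lucene API; in Lucene itself the closest equivalents are the contrib BooleanFilter wrapped in a ConstantScoreQuery, or a collector that ignores the score argument.

```java
import java.util.*;

// Toy model of pure Boolean retrieval (no scoring): each term maps to the
// set of documents containing it; OR/AND/NOT are plain set operations.
public class BooleanRetrieval {
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a); r.addAll(b); return r;
    }
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a); r.retainAll(b); return r;
    }
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a); r.removeAll(b); return r;
    }
    public static void main(String[] args) {
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("lucene", new TreeSet<>(Arrays.asList(0, 1, 3)));
        index.put("scoring", new TreeSet<>(Arrays.asList(1, 2)));
        index.put("boolean", new TreeSet<>(Arrays.asList(0, 2, 3)));
        // (lucene OR scoring) AND boolean NOT scoring
        Set<Integer> hits = not(and(or(index.get("lucene"), index.get("scoring")),
                                    index.get("boolean")), index.get("scoring"));
        System.out.println(hits);  // [0, 3]
    }
}
```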
Re: Need help : SpanNearQuery
To avoid passing all combinations to a NearSpansQuery some non-trivial changes would be needed in the spans package. NearSpansUnordered (and maybe also NearSpansOrdered) would have to be extended to provide matching Spans when not all terms/subqueries (i.e. their Spans) match. Also, quite likely, it will be necessary to add a float getWeight() method to the Spans interface. This value could indicate how many terms/subqueries actually matched, and then be used in SpanScorer to provide a score for the matching document. This weight value would also be useful in other cases, for example to allow different weights in SpanTermQuery. Regards, Paul Elschot On Friday 17 April 2009 12:18:46 Radhalakshmi Sreedharan wrote: To make the question simple, what I need is the following: if my document field is (ab,bc,cd,ef) and the search tokens are (ab,bc,cd), then: I should get a hit even if not all of the search tokens are present. If the tokens are found, they should be found within a distance x of each other (proximity search). I need the percentage match of the search tokens with the document field. Currently this is my query: 1) I form all possible permutations of the search tokens 2) I do a SpanNearQuery for each permutation 3) I do a DisjunctionMaxQuery on the SpanNearQueries. This is how I compute % match: % match = (score from running the query on the document field) / (score from running the query on a document field created out of the search tokens). The numerator gives me the actual score with the search tokens run on the field. The denominator gives me the best possible or maximum possible score with the current search tokens. For this example, if my document field is (ab,bc,cd,ef) and the search tokens are (ab,bc,cd), I expect a % match of around 90%. However I get a match of only around 50% without a boost. Using a boost in fact reduces my percentage. I even overrode the queryNorm method to return one, and still the percentage did not increase. Any suggestions?
-Original Message- From: Radhalakshmi Sreedharan [mailto:radhalakshm...@infosys.com] Sent: Friday, April 17, 2009 12:37 PM To: java-user@lucene.apache.org Subject: RE: Need help : SpanNearQuery Hi Steven, Thanks for your reply. I tried out your approach and the problem got solved to an extent, but still it remains. The problem is that the score still goes down quite a bit, as bc is not found in the combinations (bc,cd), (bc,ef), (ab,bc,cd,ef), etc. The boosting in fact has a negative impact and reduces the score further :( The factor affected by boosting is the queryNorm. With a boost of 6 - 0.015559823 = (MATCH) max of: 0.015559823 = (MATCH) weight(spanNear([SearchField:cd, SearchField:ef], 10, false)^6.0 in 0), product of: 0.07606166 = queryWeight(spanNear([SearchField:cd, SearchField:ef], 10, false)^6.0), product of: 6.0 = boost 0.61370564 = idf(SearchField: cd=1 ef=1) 0.02065639 = queryNorm 0.20456855 = (MATCH) fieldWeight(SearchField:spanNear([cd, ef], 10, false)^6.0 in 0), product of: 0.3334 = tf(phraseFreq=0.3334) 0.61370564 = idf(SearchField: cd=1 ef=1) 1.0 = fieldNorm(field=SearchField, doc=0) Without a boost - 0.07779912 = (MATCH) max of: 0.07779912 = (MATCH) weight(spanNear([SearchField:cd, SearchField:ef], 10, false) in 0), product of: 0.3803083 = queryWeight(spanNear([SearchField:cd, SearchField:ef], 10, false)), product of: 0.61370564 = idf(SearchField: cd=1 ef=1) 0.6196917 = queryNorm 0.20456855 = (MATCH) fieldWeight(SearchField:spanNear([cd, ef], 10, false) in 0), product of: 0.3334 = tf(phraseFreq=0.3334) 0.61370564 = idf(SearchField: cd=1 ef=1) 1.0 = fieldNorm(field=SearchField, doc=0) Regards, Radha -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Thursday, April 16, 2009 10:35 PM To: java-user@lucene.apache.org Subject: RE: Need help : SpanNearQuery Hi Radha, On 4/16/2009 at 8:35 AM, Radhalakshmi Sreedharan wrote: I have a question related to SpanNearQuery.
I need a hit even if only 2 of 3 terms are found, with the span being applied to those 2 terms. Is there any custom implementation in place for this? I checked SrndQuery but that also doesn't work. This is my workaround currently: 1) For a list of terms (ab,bc,cd,ef), make a set of combinations like (ab,bc), (bc,cd), (ab,cd), (bc,ef), (ab,bc,cd), (ab,bc,cd,ef), and so on. 2) Create a SpanNearQuery for each of these combinations. 3) Add them to a BooleanQuery with SHOULD clauses. However this approach gives me puzzling scores, e.g. if my document has only (ab,bc,cd) the penalty for the missing ef is very high and my score comes down quite a bit. Do you know about the scoring documentation on the Lucene site: http
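Step 1 of the workaround above — enumerating the token combinations that each become one SpanNearQuery SHOULD clause — amounts to generating every order-preserving subset of at least two tokens. A minimal sketch with illustrative names (the subsets would then be fed to the query construction in steps 2 and 3):

```java
import java.util.*;

// Enumerate every order-preserving subset of at least minSize tokens,
// using a bitmask over the token positions.
public class TokenSubsets {
    static List<List<String>> subsets(List<String> tokens, int minSize) {
        List<List<String>> result = new ArrayList<>();
        int n = tokens.size();
        for (int mask = 1; mask < (1 << n); mask++) {
            if (Integer.bitCount(mask) < minSize) continue;
            List<String> subset = new ArrayList<>();
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0) subset.add(tokens.get(i));
            result.add(subset);
        }
        return result;
    }
    public static void main(String[] args) {
        System.out.println(subsets(Arrays.asList("ab", "bc", "cd"), 2));
        // [[ab, bc], [ab, cd], [bc, cd], [ab, bc, cd]]
    }
}
```

Note the combinatorial blow-up: four tokens already yield 11 subsets of size two or more, which is part of why the resulting disjunction scores so oddly.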
Re: Need help : SpanNearQuery
On Friday 17 April 2009 16:33:27 Radhalakshmi Sreedharan wrote: Thanks Paul. Is there any alternative way of implementing this requirement? Start from scratch perhaps? Anyway, spans can be really tricky, so in case you're writing code for this, I have only four pieces of advice: test, test, test and test. As a side note, will the ShingleFilter help me get all possible combinations of the input tokens? I don't know. Regards, Paul Elschot
Re: Index in text format
On Thursday 09 April 2009 21:56:44 Andy wrote: Is there a way to have Lucene write its index in a text file? No. You could try a hexdump of the index file(s), but that isn't really human readable. Instead of that you may want to try Luke: http://www.getopt.org/luke/ Regards, Paul Elschot
Re: Internals question: BooleanQuery with many TermQuery children
On Tuesday 07 April 2009 05:04:44 Daniel Noll wrote: Hi all. This is something I have been wondering for a while but can't find a good answer to by reading the code myself. If you have a query like this: (field:Value1 OR field:Value2 OR field:Value3 OR ...) how many TermEnum / TermDocs scans should this execute? (a) One per clause, or (b) one for the entire boolean query? One per clause. I wonder because we use a lot of queries of this nature, and I can't find any direct evidence that they get logically merged, leading me to believe that it's one per clause at present (and thus this becomes a potential optimisation.) The problem is not only in the scanning of the TermDocs, but also in the merging by docId (on a heap) that has to take place when more of them are used at the same time during the query search. Some optimisations are already in place: - By allowing docs to be scored out of order, most top-level OR queries can be merged with a faster algorithm (a distributive sort over docId ranges) using the term frequencies (see BooleanQuery.setAllowDocsOutOfOrder()). - Various Filters merge into a bitset, using a single TermDocs and ignoring term frequencies (see MultiTermQuery.getFilter()). - The new TrieRangeFilter premerges ranges at indexing time, also ignoring term frequencies. Using the TermDocs one by one has another advantage in that it reduces disk seek distances in the index. This is noticeable when disks have heads that take more time to move longer distances. SSDs don't have moving heads, so they have smaller performance differences between merging into a bitset, by distributive sort, and by a heap. For the time being, Lucene does not have a low-level facility for key values that occur at most once per document field, so for these it normally helps to use a Filter. Regards, Paul Elschot
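The heap-based merging by docId that Paul describes can be shown in miniature: each clause contributes a sorted doc-id list, and the disjunction repeatedly visits the smallest current doc id. This is an illustration of the algorithm only, not the actual Lucene scorer code:

```java
import java.util.*;

// Union of several sorted doc-id lists via a min-heap, the way an OR over
// many clauses merges its TermDocs streams. Heap entries are
// {docId, listIndex, positionInList}.
public class HeapUnion {
    static List<Integer> union(List<int[]> postings) {
        PriorityQueue<int[]> heap = new PriorityQueue<>((x, y) -> x[0] - y[0]);
        for (int i = 0; i < postings.size(); i++)
            if (postings.get(i).length > 0)
                heap.add(new int[]{postings.get(i)[0], i, 0});
        List<Integer> docs = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            // emit each doc id once, even when several clauses match it
            if (docs.isEmpty() || docs.get(docs.size() - 1) != top[0]) docs.add(top[0]);
            int[] list = postings.get(top[1]);
            int next = top[2] + 1;
            if (next < list.length) heap.add(new int[]{list[next], top[1], next});
        }
        return docs;
    }
    public static void main(String[] args) {
        System.out.println(union(Arrays.asList(
            new int[]{1, 4, 7}, new int[]{2, 4}, new int[]{7, 9})));
        // [1, 2, 4, 7, 9]
    }
}
```

Every emitted doc costs a heap operation per matching clause, which is why the out-of-order distributive sort mentioned above can beat it for wide OR queries.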
Re: Using SpanNearQuery.getSpans() in a Search Result
On Thursday 02 April 2009 15:36:44 David Seltzer wrote: Hi all, I'm trying to figure out how to use SpanNearQuery.getSpans(IndexReader) when working with a result set from a query. Maybe I have a fundamental misunderstanding of what an IndexReader is - I'm under the impression that it's a mechanism for sequentially accessing the documents in an index. So I'm not really sure how that helps me find the spans inside a search result. My problem is compounded by the fact that I'm using ParallelMultiSearcher so I'm not even 100% sure that I know what index each Hit is located in. It's the other way around: for span queries a search result is created (internally, by SpanScorer) from the spans resulting from the getSpans() method above. Does that help? Regards, Paul Elschot All of the examples I find (in LIA and from CNLP) demonstrate on an in-memory index created for the sake of the example. Can anyone give me any guidance on this? Thanks! -Dave
Re: number of hits of pages containing two terms
You may want to try Filters (starting from TermFilter) for this, especially those based on the default OpenBitSet (see the intersection count method) because of your interest in stop words. 10k OpenBitSets for 39M docs will probably not fit in memory in one go, but that can be worked around by keeping fewer of them in memory. For non stop words, you could also try using SortedVIntList instead of OpenBitSet to reduce memory usage. In that case there is no direct intersection count, but a counting iteration over the intersection can still be done without actually forming the resulting filter. Regards, Paul Elschot On Tuesday 17 March 2009 12:35:19 Adrian Dimulescu wrote: Ian Lea wrote: Adrian - have you looked any further into why your original two-term query was too slow? My experience is that simple queries are usually extremely fast. Let me first point out that it is not too slow in absolute terms, it is only slow for my particular need of computing the number of co-occurrences between ideally all non-noise terms (I plan about 10k x 10k = 100 million calculations). How large is the index? I indexed Wikipedia (the 8GB XML dump you can download). The index size is 4.4 GB. I have 39 million documents. The particularity is that I cut Wikipedia into paragraphs and I consider each paragraph as a Document (not one page per Document as usual), which makes for a lot of short documents. Each document has a stored id and a non-stored analyzed body: doc.add(new Field(id, id, Store.YES, Index.NO)); doc.add(new Field(text, p, Store.NO, Index.ANALYZED)); How many occurrences of your first or second terms? I do have in my index some words that are usually qualified as stop words. My first two terms are and (13M hits) and s (4M hits). I use the SnowballAnalyzer in order to lemmatize words. My intuition is that the large number of short documents and the fact that I am interested in the stop words do not help performance. Thank you, Adrian.
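The two counting strategies Paul suggests can be sketched with java.util.BitSet standing in for Lucene's OpenBitSet (whose intersectionCount avoids materializing the result set) and a sorted int array standing in for SortedVIntList (where a counting iteration over the intersection suffices). Illustrative sketch only, not the Lucene classes:

```java
import java.util.BitSet;

public class CoOccurrenceCount {
    // Bitset route: count docs containing both terms; BitSet has no direct
    // intersectionCount, so we clone-and-and (OpenBitSet counts in place).
    static int intersectionCount(BitSet a, BitSet b) {
        BitSet tmp = (BitSet) a.clone();
        tmp.and(b);
        return tmp.cardinality();
    }
    // Sorted-list route: two-pointer counting iteration, nothing materialized.
    static int intersectionCount(int[] a, int[] b) {
        int i = 0, j = 0, count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { count++; i++; j++; }
        }
        return count;
    }
    public static void main(String[] args) {
        BitSet x = new BitSet(); x.set(1); x.set(5); x.set(9);
        BitSet y = new BitSet(); y.set(5); y.set(9); y.set(12);
        System.out.println(intersectionCount(x, y));                 // 2
        System.out.println(intersectionCount(new int[]{1, 5, 9},
                                             new int[]{5, 9, 12}));  // 2
    }
}
```

The bitset route wins for dense terms like stop words; the sorted-list route uses memory proportional to the hit count, which is why it suits the non-stop-word terms.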
Re: Speeding up RangeQueries?
On Saturday 14 March 2009 13:38:16 Niels Ott wrote: Hi all, I'm working on my prototype system and it turns out that RangeQueries are quite slow. In a first test I have about 80,000 documents in my index and I combine two range queries with a normal text query using a BooleanQuery. In the long run I will need to enhance my index at indexing time so that the range queries are substituted by simple keywords. Perhaps that is avoidable, see the reference below. For now, I'm interested in a possibility to speed up range queries. Does the performance of a range query depend on the length of the contents of the field in question? Performance mostly depends on the number of terms indexed within the queried range. To limit the number of terms used during a range search, have a look here for more info on the new TrieRangeQuery: http://wiki.apache.org/lucene-java/SearchNumericalFields Regards, Paul Elschot
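The cost driver Paul points at — a classic range query effectively touches one term per indexed value inside the range — can be made concrete by counting the terms a range covers in a sorted term dictionary. A hypothetical sketch (the upper bound is treated as exclusive here, and the names are illustrative):

```java
import java.util.Arrays;

// Count how many dictionary terms fall inside [lo, hi): this number, not the
// field length, is what drives classic range-query cost. TrieRangeQuery's
// trick is to precompute coarser terms so far fewer of them span the range.
public class RangeTermCount {
    static int termsInRange(String[] sortedTerms, String lo, String hi) {
        return lowerBound(sortedTerms, hi) - lowerBound(sortedTerms, lo);
    }
    // index of the first term >= key
    static int lowerBound(String[] a, String key) {
        int i = Arrays.binarySearch(a, key);
        return i >= 0 ? i : -i - 1;
    }
    public static void main(String[] args) {
        String[] terms = {"001", "005", "010", "042", "100"};
        System.out.println(termsInRange(terms, "005", "100"));  // 3
    }
}
```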
Re: Faceted search with OpenBitSet/SortedVIntList
On Tuesday 17 February 2009 10:12:12 Raffaella Ventaglio wrote: Thanks for sharing this info. In any case, this is not a problem for me since I have only used the idea of choosing between OpenBitSet and SortedVIntList from the contrib BooleanFilter; I have implemented it in my own facets manager structure, so I do not use the removed finalResult method. It would be possible to build a similar choice criterion between OpenBitSet and SortedVIntList into CachingWrapperFilter to choose the data structure to be used for caching. For example, when using the same criterion as in the removed methods there, your original problem might not have occurred at all. In the CachingWrapperFilter in trunk the choice is left to an overridable method. Regards, Paul Elschot Regards, Raf On Sun, Feb 15, 2009 at 2:39 PM, Paul Elschot paul.elsc...@xs4all.nl wrote: Meanwhile the choice between SortedVIntList and OpenBitSet has been removed from the trunk (development version), which now uses OpenBitSet only: https://issues.apache.org/jira/browse/LUCENE-1296 In case there is a preference to have SortedVIntList used in the next Lucene version (i.e. in cases when it is smaller than OpenBitSet), please comment at LUCENE-1296. Regards, Paul Elschot
Re: Faceted search with OpenBitSet/SortedVIntList
Meanwhile the choice between SortedVIntList and OpenBitSet has been removed from the trunk (development version), which now uses OpenBitSet only: https://issues.apache.org/jira/browse/LUCENE-1296 In case there is a preference to have SortedVIntList used in the next Lucene version (i.e. in cases when it is smaller than OpenBitSet), please comment at LUCENE-1296. Regards, Paul Elschot On Sunday 08 February 2009 09:47:24 Raffaella Ventaglio wrote: Hi Paul, One way to implement that would be to use one of the boolean combination filters in contrib, BooleanFilter or ChainedFilter, and simply count the number of times next() returns true on the result. I am sorry, but I cannot understand: how can I create a BooleanFilter or a ChainedFilter starting from two SortedVIntList objects? I have not found any filter that takes an existing DocIdSet in its constructor... However I have seen that the Filter interface is very easy to implement. Should I create a custom Filter that wraps my SortedVIntList and then use these filters to create a BooleanFilter? Thanks, Raf
Re: Faceted search with OpenBitSet/SortedVIntList
John, On Sunday 08 February 2009 00:35:10 John Wang wrote: Our implementation of facet search can handle this. Using bitsets for intersection is not scalable performance-wise when the index is large. We are using a compact forward index representation in memory for the counting. Could you describe how this compact forward index works? Similar to the FieldCache idea but more compact. Does this also use FieldCacheRangeFilter and/or FieldCacheTermsFilter? Regards, Paul Elschot
Re: Faceted search with OpenBitSet/SortedVIntList
On Sunday 08 February 2009 09:53:00 Uwe Schindler wrote: I would do so, it's really simple, you can even do it in an anonymous inner class. It is indeed simple, but it might also help to take a look at the source code of the Lucene classes involved. Regards, Paul Elschot - UWE SCHINDLER Webserver/Middleware Development PANGAEA - Publishing Network for Geoscientific and Environmental Data MARUM - University of Bremen Room 2500, Leobener Str., D-28359 Bremen Tel.: +49 421 218 65595 Fax: +49 421 218 65505 http://www.pangaea.de/ E-mail: uschind...@pangaea.de -Original Message- From: Raffaella Ventaglio [mailto:r.ventag...@gmail.com] Sent: Sunday, February 08, 2009 9:47 AM To: java-user@lucene.apache.org Subject: Re: Faceted search with OpenBitSet/SortedVIntList Hi Paul, One way to implement that would be to use one of the boolean combination filters in contrib, BooleanFilter or ChainedFilter, and simply count the number of times next() returns true on the result. I am sorry, but I cannot understand: how can I create a BooleanFilter or a ChainedFilter starting from two SortedVIntList objects? I have not found any filter that takes an existing DocIdSet in its constructor... However I have seen that the Filter interface is very easy to implement. Should I create a custom Filter that wraps my SortedVIntList and then use these filters to create a BooleanFilter? Thanks, Raf
Re: Faceted search with OpenBitSet/SortedVIntList
On Saturday 07 February 2009 19:57:19 Raffaella Ventaglio wrote: Hi, I am trying to implement a kind of faceted search using Lucene 2.4.0. I have a list of configuration rules that tell me how to generate these facets and the corresponding queries (which can range from simple term queries to complex boolean queries). When my application starts, it creates the whole set of facet objects and initializes them. For each facet: - I create the query according to the configured rule; - I ask the reader for the bitset corresponding to that query and I store it in the Facet object; - I get the cardinality of the bitset and I save it in the Facet object as its initial count. When the user does a search I have to update the counts associated with each Facet: - I get the bitset corresponding to the query + filter generated by the user search; - I get the cardinality of (search bitset AND facet bitset) and I save it as the updated count. In my first solution, I used only OpenBitSetDISI objects, both for the facet bitsets and for the search bitset, so I could use the intersectionCount method to get updated counts after a user search. This works very well and is very fast, but when the number of documents in the index and the number of facets grow it consumes too much memory. So I tried a different solution: when I create the facet bitsets I use the same rule applied in ChainedFilter/BooleanFilter to decide whether to store an OpenBitSet or a SortedVIntList. When I have to calculate updated counts: - if the facet has an OpenBitSet, I use the intersectionCount method directly; - if the facet has a SortedVIntList, I first create a new OpenBitSetDISI using the SortedVIntList.iterator and then I use the intersectionCount method. In this way, I use a smaller amount of memory at initialization time, but for each user search I create a large number of objects (that I immediately throw away) and this affects application performance because it wastes a lot of time doing GC.
So my question is: is there a better way to accomplish this task? I think it would be fine if I could calculate the intersection count directly on SortedVIntList objects, but I have not found anything like that in the Lucene 2.4 JavaDoc. Am I missing something? You are not missing anything. OpenBitSet has an optimized implementation for intersection count, and there is no counterpart of that in SortedVIntList because until now there has been no need for it. One way to implement that would be to use one of the boolean combination filters in contrib, BooleanFilter or ChainedFilter, and simply count the number of times next() returns true on the result. In case the performance of that is not good enough, another way would be to directly add an intersection count method to SortedVIntList. However, SortedVIntList does not allow for an efficient iterator implementation of skipTo(), and skipTo() is used intensively by intersections. As a reference, right now my index contains more than 500,000 documents and I have to create/manage up to 50,000 facets. Using the second solution, at initialization time my facets structure requires more or less 120MB (and this is good enough), while updating counts it uses as much as 2GB of memory (and this is very bad). 50,000 facets? Well, in case the performance of the last suggestion is not good enough, one could try to implement a better data structure than OpenBitSet and SortedVIntList to provide a DocIdSetIterator, preferably with a fast skipTo() and possibly with a fast intersection count. In that case, you may want to ask further on the java-dev list. Regards, Paul Elschot
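One way to avoid the per-search OpenBitSetDISI allocations Raffaella describes is to count a sparse facet directly against the dense search bitset, probing doc id by doc id, so no temporary bitset is built at all. A minimal sketch with java.util.BitSet and a plain sorted int array standing in for the Lucene classes:

```java
import java.util.BitSet;

public class FacetCounter {
    // Count |facet ∩ searchResult| without allocating: each doc id of the
    // sparse facet is probed against the dense search bitset. Cost is
    // O(facet cardinality), independent of the index size.
    static int count(int[] sparseFacetDocs, BitSet searchResult) {
        int c = 0;
        for (int doc : sparseFacetDocs)
            if (searchResult.get(doc)) c++;
        return c;
    }
    public static void main(String[] args) {
        BitSet search = new BitSet();
        search.set(2); search.set(40); search.set(41);
        System.out.println(count(new int[]{2, 7, 41}, search));  // 2
    }
}
```

Dense facets would keep using the bitset-vs-bitset intersection count; only the sparse ones need the probing loop, which is exactly where the SortedVIntList representation is chosen anyway.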
Re: TermScorer default buffer size
John, Continuing, see below. On Wednesday 07 January 2009 14:24:15 Paul Elschot wrote: On Wednesday 07 January 2009 07:25:17 John Wang wrote: Hi: The default buffer size (for docid,score etc) is 32 in TermScorer. We have a large index with some terms to have very dense doc sets. By increasing the buffer size we see very dramatic performance improvements. With our index (may not be typical), here are some numbers with buffer size w.r.t. performance in our query (a large OR query): Buffer-size improvement 2042 - 22.0 % 4084 - 39.1 % 8172 - 51.1 % I understand this may not be suitable for every application, so do you think it makes sense to make this buffer size configurable? Ideally the TermScorer buffer size could be set to a size depending on the query structure, but there is no facility for this yet. For OR queries larger buffers help, but not for AND queries. See also LUCENE-430 on reducing buffer sizes for the underlying TermDocs for very sparse doc sets. It may be possible to change the TermScorer buffer size dynamically. For OR queries TermScorer.next() is used, and for AND queries TermScorer.skipTo() is used. That means that when the buffer runs out during TermScorer.next(), it could be enlarged, for example by doubling (or quadrupling) the size to a configurable maximum of 8K or even 16K, see above. When TermScorer.skipTo() runs out of the buffer it could leave the buffer size unchanged. This involves some memory allocation during search. That is unusual, but it could be worthwhile given the performance improvement. Regards, Paul Elschot
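The dynamic-growth idea sketched above — enlarge the buffer when next() exhausts it, leave it alone on skipTo(), cap it at 8K or 16K — might look roughly like this. All names are illustrative; this is not the actual TermScorer code:

```java
// Sketch of a doc-id buffer that starts at TermScorer's default of 32 and
// doubles whenever sequential reading (next()) exhausts it, up to a cap.
// skipTo() would simply not call the growth hook.
public class GrowingBuffer {
    static final int MAX_SIZE = 8 * 1024;  // configurable cap suggested above
    int[] docs = new int[32];

    int size() { return docs.length; }

    // Called when next() drains the buffer: double it, preserving contents.
    void onBufferExhaustedByNext() {
        if (docs.length < MAX_SIZE) {
            int[] bigger = new int[Math.min(docs.length * 2, MAX_SIZE)];
            System.arraycopy(docs, 0, bigger, 0, docs.length);
            docs = bigger;
        }
    }

    public static void main(String[] args) {
        GrowingBuffer b = new GrowingBuffer();
        for (int i = 0; i < 20; i++) b.onBufferExhaustedByNext();
        System.out.println(b.size());  // 8192: growth stops at the cap
    }
}
```

This matches the observation in the thread: OR queries (which drive next()) would organically reach the large sizes that helped John's benchmarks, while AND queries (driven by skipTo()) would stay at the small default.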
Re: TermScorer default buffer size
On Friday 09 January 2009 05:29:15 John Wang wrote: Makes sense. I didn't think 32 was the empirically determined magic number ;) That number does have a history, but I don't know the details. Are you planning to do a patch for this? No, but could you open an issue and mention the performance improvements? Regards, Paul Elschot -John On Thu, Jan 8, 2009 at 1:27 AM, Paul Elschot paul.elsc...@xs4all.nl wrote: John, Continuing, see below. On Wednesday 07 January 2009 14:24:15 Paul Elschot wrote: On Wednesday 07 January 2009 07:25:17 John Wang wrote: Hi: The default buffer size (for docid,score etc) is 32 in TermScorer. We have a large index with some terms to have very dense doc sets. By increasing the buffer size we see very dramatic performance improvements. With our index (may not be typical), here are some numbers with buffer size w.r.t. performance in our query (a large OR query): Buffer-size improvement 2042 - 22.0 % 4084 - 39.1 % 8172 - 51.1 % I understand this may not be suitable for every application, so do you think it makes sense to make this buffer size configurable? Ideally the TermScorer buffer size could be set to a size depending on the query structure, but there is no facility for this yet. For OR queries larger buffers help, but not for AND queries. See also LUCENE-430 on reducing buffer sizes for the underlying TermDocs for very sparse doc sets. It may be possible to change the TermScorer buffer size dynamically. For OR queries TermScorer.next() is used, and for AND queries TermScorer.skipTo() is used. That means that when the buffer runs out during TermScorer.next(), it could be enlarged, for example by doubling (or quadrupling) the size to a configurable maximum of 8K or even 16K, see above. When TermScorer.skipTo() runs out of the buffer it could leave the buffer size unchanged. This involves some memory allocation during search. That is unusual, but it could be worthwhile given the performance improvement. Regards, Paul Elschot
Re: TermScorer default buffer size
On Wednesday 07 January 2009 07:25:17 John Wang wrote: Hi: The default buffer size (for docid,score etc) is 32 in TermScorer. We have a large index with some terms to have very dense doc sets. By increasing the buffer size we see very dramatic performance improvements. With our index (may not be typical), here are some numbers with buffer size w.r.t. performance in our query (a large OR query): Buffer-size improvement 2042 - 22.0 % 4084 - 39.1 % 8172 - 51.1 % I understand this may not be suitable for every application, so do you think it makes sense to make this buffer size configurable? Ideally the TermScorer buffer size could be set to a size depending on the query structure, but there is no facility for this yet. For OR queries larger buffers help, but not for AND queries. See also LUCENE-430 on reducing buffer sizes for the underlying TermDocs for very sparse doc sets. Regards, Paul Elschot
Re: Lucene retrieval model
On Tuesday 30 December 2008 10:03:03 Claudia Santos wrote: Hello, I would like to know more about Lucene's retrieval model, more specifically about the boolean model. Is that a standard model or an extended model? I mean, does it return just the documents that match the boolean expression, or does it include in the search result all documents which correspond to the given conditions, regardless of the boolean connectors AND, OR, NOT, and calculate a weight between 0 and 1 for all search results that contain at least one of the terms? The extended model evaluates documents with only one of the terms with a smaller value than one that contains both. On the Apache Lucene - Scoring page I found only this: Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. It uses the Boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the Query specification. Lucene also adds some capabilities and refinements onto this model to support boolean and fuzzy searching, but it essentially remains a VSM based system at the heart. A somewhat refined Boolean model is used to determine a set of documents, and only for documents in that set is a score value calculated according to the Lucene VSM model. The Boolean model in Lucene does not directly use the standard boolean connectors. Instead of that, each clause (term, subquery) is either required, optional or prohibited. The required and prohibited clauses determine a set of documents to be scored in the normal Boolean AND/NOT way.
The refinement in the Boolean model is for the optional clauses: a minimum number of optional clauses may be required for documents to be part of the set that is scored. The normal Boolean OR operator has 1 as that minimum number, and in Lucene this minimum defaults to 1 when no required clauses are present. The required clauses and the optional clauses contribute to the score. One might consider the scoring of the optional clauses to be an implementation of the extended Boolean model. Fuzzy searching is implemented by constructing a Boolean query with optional (and actually present) terms that are similar enough to the fuzzy query term. Regards, Paul Elschot
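The matching rules Paul describes — required and prohibited clauses select the document set, and a minimum number of optional clauses (cf. BooleanQuery.setMinimumNumberShouldMatch, with the OR default of 1 when no required clauses are present) must also be satisfied — can be modeled in a few lines. A toy evaluation of the Boolean model, not Lucene code:

```java
import java.util.*;

public class BooleanModel {
    // Does a document (represented by its term set) enter the scored set?
    static boolean matches(Set<String> docTerms,
                           Set<String> required, Set<String> optional,
                           Set<String> prohibited, int minShouldMatch) {
        if (!docTerms.containsAll(required)) return false;          // Boolean AND
        for (String t : prohibited)
            if (docTerms.contains(t)) return false;                 // Boolean NOT
        int optionalHits = 0;
        for (String t : optional)
            if (docTerms.contains(t)) optionalHits++;
        int min = minShouldMatch;
        // the OR default: with no required clauses, at least one
        // optional clause must match even when no minimum was set
        if (min == 0 && required.isEmpty() && !optional.isEmpty()) min = 1;
        return optionalHits >= min;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("lucene", "boolean"));
        Set<String> opt = new HashSet<>(Arrays.asList("lucene", "fuzzy"));
        // pure OR: one optional term present suffices
        System.out.println(matches(doc, Collections.emptySet(), opt,
                                   Collections.emptySet(), 0));  // true
        // demanding both optional terms: no longer matches
        System.out.println(matches(doc, Collections.emptySet(), opt,
                                   Collections.emptySet(), 2));  // false
    }
}
```

Only documents for which this returns true would then get a VSM score from the required and optional clauses that matched.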
Re: BooleanQuery Performance Help
On Saturday 20 December 2008 15:23:43 Prafulla Kiran wrote: Hi Everyone, I have an index of relatively small size (400MB), containing roughly 0.7 million documents. The index is actually a copy of an existing database table. Hence, most of my queries are of the form +field1:value1 +field2:value2 +field3:value3 ... (~20 fields). I have been running performance tests using this query. Strangely, I noticed that if I remove some specific clauses I get a performance improvement of at least 5 times. Here are the numbers and examples, so that I can be more precise: 1) Complete query: 90 requests per second using 10 threads. 2) If I remove a few specific clauses: 500 requests per second using 10 threads. 3) If I form a new query using only 2 clauses from the set of removed clauses: 100 requests per second using 10 threads. Now, some of these specific clauses are such that they match around half of the entire document set. Also, note that I need all the query terms to be present in the documents retrieved. My target is to obtain 300 requests per second with the given query (20 clauses). It includes 2 range queries. However, I am unable to get 300 rps unless I remove some of the clauses (which include these range queries). I have tried using filters without any significant improvement in performance. Also, I have more than enough RAM, so I am using a RAMDirectory to read the index. I have optimized my index before searching. All the tests have been warmed for 5 seconds (the test duration is 10 seconds). My first question is: is this kind of decrease in performance expected as the number of clauses shoots up? Using a single clause out of these 20, I was able to get 2000 requests per second! Could someone please guide me if there are any other ways in which I can obtain an improvement in performance?
You might try and add brackets and a + around a group of the less frequently occurring terms, like this: +field1:frequentValue1 +field2:frequentValue2 +(+field3:inFrequentValue3 +field4:inFrequentValue4) This may help, and at least it should not degrade performance much. Also, it will affect score values somewhat. Particularly, I am interested to know more about what further caching could be done apart from the default caching which Lucene does. More caching is probably not going to help. Regards, Paul Elschot
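[Editorial note] The suggested regrouping can be expressed programmatically with a nested BooleanQuery; a sketch assuming the Lucene 2.x API, with the field and value names taken from the example query string above:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class GroupedClausesExample {
    public static BooleanQuery build() {
        // Inner group: the less frequently occurring terms, all required.
        BooleanQuery rareGroup = new BooleanQuery();
        rareGroup.add(new TermQuery(new Term("field3", "inFrequentValue3")),
                      BooleanClause.Occur.MUST);
        rareGroup.add(new TermQuery(new Term("field4", "inFrequentValue4")),
                      BooleanClause.Occur.MUST);

        // Outer query: the frequent terms, plus the inner group as one required clause.
        BooleanQuery bq = new BooleanQuery();
        bq.add(new TermQuery(new Term("field1", "frequentValue1")), BooleanClause.Occur.MUST);
        bq.add(new TermQuery(new Term("field2", "frequentValue2")), BooleanClause.Occur.MUST);
        bq.add(rareGroup, BooleanClause.Occur.MUST);
        return bq;
    }
}
```

The inner group lets the conjunction over the rare terms skip ahead together, so the frequent terms are consulted for fewer candidate docs.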
Re: RESOLVED: help: java.lang.ArrayIndexOutOfBoundsException ScorerDocQueue.downHeap
On Wednesday 17 December 2008 22:49:08, 1world1love wrote: Just an FYI in case anyone runs into something similar. Essentially I had indexes that I have been searching from a Java stored procedure in Oracle without issue for a while. All of a sudden, I started getting the error I alluded to above when there were more than a certain number of terms (4, 5, or more depending on the terms or index). The error did not happen when I ran a query from a local server with the same filesystem mounted. In that case the root cause of the error could be in the JVM running the stored procedure. In any case, all of my indexes checked out OK. I read through all the other issues related to my issue but none of the fixes did anything. However, setting BooleanQuery.setAllowDocsOutOfOrder(true); did in fact make the error go away. Although I understand the idea behind the setting, I am not sure why it made a difference in my case. That option chooses another algorithm to search these queries; it will only affect queries without required terms. (The change in search algorithm is from BooleanScorer2 to BooleanScorer.) Regards, Paul Elschot
Re: Issue upgrading from lucene 2.3.2 to 2.4 (moving from bitset to docidset)
Michael, The change from BitSet to DocIdSetIterator implies that you'll need to choose an underlying data structure yourself. A minimal approach would be to use DocIdBitSet around BitSet, but there are better ways. For your application you might consider replacing Java's BitSet by Lucene's OpenBitSet. Also have a look at earlier discussions on the subject: you might find a good use for OpenBitSetDISI and contrib/**/{BooleanFilter,ChainedFilter}. Regards, Paul Elschot On Tuesday 09 December 2008 07:44:20, Michael Stoppelman wrote: Hi all, I'm working on upgrading to Lucene 2.4.0 from 2.3.2 and was trying to integrate the new DocIdSet changes, since the o.a.l.search.Filter#bits() method is now deprecated. For our app we actually heavily rely on bits from the Filter to do post-query filtering (I explain why below). For example, someone searches for product: ipod and then filters on type: nano (e.g. mini/nano/regular) AND color: red (e.g. red/yellow/blue). In our current model the results are gathered in the following way: 1) ipod w/o attributes is run and the results are stored in a hit collector. 2) The ipod results are then filtered for color=red AND type=mini using the Lucene Filters. 3) The filtered results are returned to the user. The reason that the attributes are filtered post-query is so that we can return the other types and colors the user can filter by in the future. Meaning the UI would be able to show blue, green, pink, etc.; if we pre-filtered results by color and type beforehand we wouldn't know what the other filter options would be for a broader result set. Does anyone else have this use case? I'd imagine other folks are probably doing similar things to accomplish this. M
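[Editorial note] A minimal sketch of the migration under discussion, assuming the Lucene 2.4 API: a Filter that overrides getDocIdSet() and returns an OpenBitSet (which is itself a DocIdSet), here filled from a TermDocs enumeration. The class name is invented for illustration.

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.OpenBitSet;

public class SingleTermFilter extends Filter {
    private final Term term;

    public SingleTermFilter(Term term) {
        this.term = term;
    }

    // Replaces the deprecated bits(): return a DocIdSet instead of a java.util.BitSet.
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs td = reader.termDocs(term);
        try {
            while (td.next()) {
                bits.set(td.doc()); // mark every doc containing the term
            }
        } finally {
            td.close();
        }
        return bits;
    }
}
```

Two such OpenBitSet-backed sets can then be combined with and()/or() for the kind of post-query attribute filtering described above.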
Re: 2.4 Performance
On Wednesday 19 November 2008 03:39:01, [EMAIL PROTECTED] wrote: ... Our design is roughly as follows: we have some pre-query filters, queries typically involving around 25 clauses, and some post-processing of hits. We collect counts and filter post-query using a hit collector, which uses the (now deprecated) bits() method of Filters. I looked at converting us to use the new DocIdSet infrastructure (to gain the supposed 30% speed bump), but this seems to be somewhat problematic as there is no guarantee for whether we will get back a set we can do binary operations on. For example, if we get back a SortedVIntList, we're pretty much out of luck: the cardinality of the set is large (as it's a SortedVIntList), so we can't coerce it into another type, and it doesn't have the set operations we need to use it directly. Is this part of the problem https://issues.apache.org/jira/browse/LUCENE-1296 ? Also consider o.a.l.util.OpenBitSetDISI, and how that is used in contrib/queries/**/BooleanFilter. Regards, Paul Elschot
Re: Term numbering and range filtering
Tim, On Wednesday 19 November 2008 02:32:40, Tim Sturge wrote: ... This is less than 2x slower than the dedicated bitset and more than 50x faster than the range boolean query. Mike, Paul, I'm happy to contribute this (ugly but working) code if there is interest. Let me know and I'll open a JIRA issue for it. In case you think more performance improvements based on this are possible... I think this is generally useful for range and set queries on non-text based fields (dates, location data, prices, general enumerations). These all have the required property that there is only one value (term) per document. I've opened LUCENE-1461. I finally got the point, see my comments there. Thanks a lot, if only to show an unexpected tradeoff possibility opened by the new Filter API. I don't know whether you followed LUCENE-584 (Decouple Filter from BitSet), but a contribution like this multi range filter makes it all worthwhile. Regards, Paul Elschot
Re: Term numbering and range filtering
On Wednesday 19 November 2008 00:43:56, Tim Sturge wrote: I've finished a query time implementation of a column stride filter, which implements DocIdSetIterator. This just builds the filter at process start and uses it for each subsequent query. The index itself is unchanged. The results are very impressive. Here are the results on a 45M document index.
Firstly, without an age constraint as a baseline:
Query +name:tim (startup: 0, Hits: 15089, first query: 1004, 100 queries: 132, i.e. 1.32 msec per query)
Now with a cached filter. This is ideal from a speed standpoint, but there are too many possible start/end combinations to cache all the filters:
Query +name:tim age:[18 TO 35] (ConstantScoreQuery on cached RangeFilter; startup: 3, Hits: 11156, first query: 1830, 100 queries: 287, i.e. 2.87 msec per query)
Now with an uncached filter. This is awful:
Query +name:tim age:[18 TO 35] (uncached ConstantScoreRangeQuery; startup: 3, Hits: 11156, first query: 1665, 100 queries: 51862; yes, 518 msec per query, 200x slower)
A RangeQuery is slightly better but still bad (and has a different result set):
Query +name:tim age:[18 TO 35] (uncached RangeQuery; startup: 0, Hits: 10147, first query: 1517, 100 queries: 27157; 271 msec is 100x slower than the filter)
Now with the prebuilt column stride filter:
Query +name:tim age:[18 TO 35] (ConstantScoreQuery on prebuilt column stride filter)
With Allow Filter as clause to BooleanQuery: https://issues.apache.org/jira/browse/LUCENE-1345 one could even skip the ConstantScoreQuery with this. Unfortunately, 1345 is unfinished for now.
(startup: 2811, Hits: 11156, first query: 1395, 100 queries: 441; back down to 4.41 msec per query)
This is less than 2x slower than the dedicated bitset and more than 50x faster than the range boolean query. Mike, Paul, I'm happy to contribute this (ugly but working) code if there is interest. Let me know and I'll open a JIRA issue for it. In case you think more performance improvements based on this are possible...
Regards, Paul Elschot.
Re: Term numbering and range filtering
On Tuesday 11 November 2008 11:29:27, Michael McCandless wrote: The other part of your proposal was to somehow number term text such that term range comparisons can be implemented as fast int comparisons. ... http://fontoura.org/papers/paramsearch.pdf However, that'd be quite a bit deeper change to Lucene. The cheap version is hierarchical prefixing, here: http://wiki.apache.org/jakarta-lucene/DateRangeQueries Regards, Paul Elschot
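[Editorial note] The hierarchical prefixing trick indexes each date at several granularities (year, month, day), so a range can be covered by a few coarse terms plus a few fine ones. A small self-contained sketch of the term expansion; the yyyyMMdd layout is an illustration, not necessarily the wiki page's exact scheme:

```java
import java.util.Arrays;
import java.util.List;

public class DatePrefixExample {
    // Expand a yyyyMMdd date into the terms indexed for it,
    // one term per granularity level.
    public static List<String> expand(String yyyymmdd) {
        return Arrays.asList(
            yyyymmdd.substring(0, 4),  // year,  e.g. "2008"
            yyyymmdd.substring(0, 6),  // month, e.g. "200811"
            yyyymmdd);                 // day,   e.g. "20081119"
    }
}
```

A query for [20080101 TO 20081231] can then match the single year term "2008" instead of enumerating hundreds of day terms.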
Re: Term numbering and range filtering
On Tuesday 11 November 2008 21:55:45, Michael McCandless wrote: Also, one nice optimization we could do with the term number column-stride array is do bit packing (borrowing from the PFOR code) dynamically. I.e., since we know there are X unique terms in this segment, when populating the array that maps docID to term number we could use exactly the right number of bits. Enumerated fields with not many unique values (eg, country, state) would take relatively little RAM. With LUCENE-1231, where the fields are stored column stride on disk, we could do this packing during indexing such that loading at search time is very fast. Perhaps we'd better continue this at LUCENE-1231 or LUCENE-1410. I think what you're referring to is PDICT, which has frame exceptions for values that occur infrequently. Regards, Paul Elschot
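[Editorial note] The packing idea above: with X unique terms in a segment, each docID-to-termNumber entry needs only ceil(log2(X)) bits. A self-contained sketch of the arithmetic (not Lucene code):

```java
public class TermNumberPacking {
    // Bits needed to store any term number in [0, numUniqueTerms).
    public static int bitsPerValue(int numUniqueTerms) {
        if (numUniqueTerms <= 1) {
            return 1; // at least one bit, even for a constant field
        }
        // Position of the highest set bit of the largest value.
        return 32 - Integer.numberOfLeadingZeros(numUniqueTerms - 1);
    }

    // Packed size in bytes for a whole segment, rounded up.
    public static long packedBytes(int maxDoc, int numUniqueTerms) {
        long bits = (long) maxDoc * bitsPerValue(numUniqueTerms);
        return (bits + 7) / 8;
    }
}
```

For a country field with around 250 distinct values, a 45M-doc segment needs 8 bits per doc, roughly 45 MB, instead of the 180 MB an int[] would take.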
Re: Term numbering and range filtering
Tim, I didn't follow all the details, so this may be somewhat off, but did you consider using TermVectors? Regards, Paul Elschot On Monday 10 November 2008 19:18:38, Tim Sturge wrote: Yes, that is a significant issue. What I'm coming to realize is that either I will end up with something like class MultiFilter { String field; private int[] termInDoc; Map<Term,Integer> termToInt; ... } which can be entirely built on the current Lucene APIs but has significantly more overhead (the termToInt mapping in particular, and the need to construct the mapping and array on startup). Or I can go deep into the guts and add a data file per segment with a format something like: int version; int numFields; (int fieldNum, long offset) ^ numFields; (int termForDoc) ^ (maxDocs * numFields) and add something to FieldInfo like boolean storeMultiFilter; and to FieldInfos something like STORE_MULTIFILTER = 0x40; I'd need to add an int termNum to the .tis file as well. This is clearly a lot more work than the first solution, but it is a lot nicer to deal with as well. Is this interesting to anyone other than me? Tim On 11/9/08 12:23 PM, Michael McCandless [EMAIL PROTECTED] wrote: Conceivably, TermInfosReader could track the sequence number of each term. A seek/skipTo would know which sequence number it just jumped to, because the index is regular (every 128 terms by default), and then each next() call could increment that. Then retrieving this number would be as costly as calling e.g. IndexReader.docFreq(Term) is now. But I'm not sure how a multi-segment index would work, ie how would MultiSegmentReader compute this for its terms? Or maybe you'd just do this per-segment?
Mike Tim Sturge wrote: Hi, I'm wondering if there is any easy technique to number the terms in an index. (By number I mean map a sequence of terms to a contiguous range of integers, and map terms to these numbers efficiently.) Looking at the Term class and the .tis/.tii index format, it appears that the terms are stored in an ordered and prefix-compressed format, but while there are pointers from a term to the .frq and .prx files, neither is really suitable as a sequence number. The reason I have this question is that I am writing a multi-filter for single term fields. My index contains many fields for which each document contains a single term (e.g. date, zipcode, country) and I need to perform range queries or set matches over these fields, many of which are very inclusive (they match 10% of the total documents). A cached RangeFilter works well when there are a small number of potential options (e.g. for countries), but when there are many options (consider a date range or a set of zipcodes) there are too many potential choices to cache each possibility, and it is too inefficient to build a filter on the fly for each query (as you have to visit 10% of documents to build the filter despite the query itself matching 0.1%). Therefore I was considering building an int[reader.maxDocs()] array for each field and putting into it the term number for each document. This relies on the fact that each document contains only a single term for this field, but with it I should be able to quickly construct a "multi-filter" (that is, something that iterates the array and checks that the term is in the range or set). Right now it looks like I can do some very ugly surgery and perhaps use the offset to the .prx file even though it is not contiguous. But I'm hoping there is a better technique that I'm just not seeing right now.
Thanks, Tim
Re: Term numbering and range filtering
On Monday 10 November 2008 22:21:20, Tim Sturge wrote: Hmmm -- I hadn't thought about that, so I took a quick look at the term vector support. What I'm really looking for is a compact but performant representation of a set of filters on the same (one term) field. Using term vectors would mean an algorithm similar to: String myfield; String myterm; TermFreqVector tv; for (int i = 0; i < maxDoc; i++) { tv = reader.getTermFreqVector(i, "country"); if (tv.indexOf(myterm) != -1) { // include this doc... } } The key thing I am looking to achieve here is performance comparable to filters. I suspect getTermFreqVector() is not efficient enough, but I'll give it a try. Better use a TermDocs on myterm for this; have a look at the code of RangeFilter. Filters are normally created from a slower query by setting a bit in an OpenBitSet at include this doc. Then they are reused for their speed. Filter caching could help. In case memory becomes a problem and the filters are sparse enough, try and use SortedVIntList as the underlying data structure in the cache. (Sparse enough means less than 1 in 8 of all docs available in the index reader.) See also LUCENE-1296 for caching another data structure than the one used to collect the filtered docs. Regards, Paul Elschot
Re: How to combine filter in Lucene 2.4?
On Sunday 09 November 2008 11:56:37, markharw00d wrote: this can't be nearly as fast as OpenBitSet.intersect() or union, respectively, can it? I had a similar concern but it doesn't seem that bad: https://issues.apache.org/jira/browse/LUCENE-1187?focusedCommentId=12596546#action_12596546 The above test showed a slight improvement using bitset.or when it was recognised both docidsets were OpenBitSets. This optimisation is now in BooleanFilter. Further to that, the current implementation of OpenBitSetDISI.inPlaceAnd() is not optimal, although it should work just fine. A patch for a performance improvement will follow. Regards, Paul Elschot Cheers Mark
Re: How to combine filter in Lucene 2.4?
Timo, You may be looking for class OpenBitSetDISI in the util package; it was made for boolean filter operations on OpenBitSets. Also have a look at the contrib modules; OpenBitSetDISI is used there in two classes that do (precisely?) what you need: contrib/miscellaneous/**/ChainedFilter contrib/queries/**/BooleanFilter Regards, Paul Elschot On Saturday 08 November 2008 19:06:15, Timo Nentwig wrote: Hi! Since Filter.bits() is deprecated and replaced by getDocIdSet() now, I wonder how I am supposed to combine (AND) filters (for facets). I worked around this issue by extending Filter and letting getDocIdSet() return an OpenBitSet to ensure that this implementation is used everywhere and casting to OpenBitSet will work, but this is really not clean code. Thanks Timo
Re: Sorting posting lists before intersection
On Monday 13 October 2008 17:00:06, Andrzej Bialecki wrote: Renaud Delbru wrote: Hi Andrzej, sorry for the late reply. I have looked at the code. As far as I understand, you sort the posting lists based on the first doc skip. The first posting list will be the one that has the first biggest document skip. Is the sparseness of posting lists a good predictor for sampling and ordering posting lists? Do you know of evaluations of such a technique? It is _some_ predictor ... :) whether it's a good one is another question. It's certainly very inexpensive: we don't do any additional IO except what we have to do anyway, which is scorer.skipTo(). In the general case it's costly to calculate the frequency (or sparseness) of matches in a scorer without actually running the scorer through all its matches. In order to implement sorting based on frequency, we need the document frequency of each term. This information should be propagated through the Scorer classes (from TermScorer to a higher level class such as ConjunctiveScorer). This will require a call to IndexReader.docFreq(term) for each of the term queries. Does a docFreq call mean another IO access? It sounds like you plan to order scorers by term frequency ... but in the general case they won't all be TermScorers, so the frequency of documents matching a scorer won't have any particular connection to a single term freq. This could be done, but since not all scorers will be TermScorers it will be necessary to add a method to Scorer (or perhaps even to its DocIdSetIterator superclass): public abstract int estimatedDocFreq(); and implement this for all existing instances. TermScorer could implement it without estimating. For AND/OR/NOT such an estimation is straightforward, but for proximity queries it would be more of a guess. Regards, Paul Elschot
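[Editorial note] A sketch of how the estimate proposed above could propagate through a scorer tree. These classes and the estimatedDocFreq() method are hypothetical (they follow the proposal, not any existing Lucene API); for an AND, the rarest clause bounds the result:

```java
// Hypothetical scorer hierarchy carrying a doc-frequency estimate.
abstract class EstimatingScorer {
    public abstract int estimatedDocFreq();
}

// Leaf: a TermScorer-like node knows its term's docFreq exactly.
class FixedDocFreq extends EstimatingScorer {
    private final int docFreq;

    FixedDocFreq(int docFreq) {
        this.docFreq = docFreq;
    }

    public int estimatedDocFreq() {
        return docFreq;
    }
}

// Conjunction: an AND matches at most as many docs as its rarest clause.
class EstimatingConjunction extends EstimatingScorer {
    private final EstimatingScorer[] clauses;

    EstimatingConjunction(EstimatingScorer[] clauses) {
        this.clauses = clauses;
    }

    public int estimatedDocFreq() {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < clauses.length; i++) {
            min = Math.min(min, clauses[i].estimatedDocFreq());
        }
        return min;
    }
}
```

Such estimates could then drive the posting-list ordering discussed in the thread, cheapest (rarest) clause first.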
Re: PhraseQuery issues - differences with SpanNearQuery
On Friday 05 September 2008 16:57:34, Mark Miller wrote: Paul Elschot wrote: On Thursday 04 September 2008 20:39:13, Mark Miller wrote: Sounds like it's more in line with what you are looking for. If I remember correctly, the phrase query factors the edit distance into scoring, but the SpanNearQuery will just use the combined idf for each of the terms in it, so distance shouldn't matter with spans (I'm sure Paul will correct me if I am wrong). SpanScorer will use the similarity slop factor for each matching span size to adjust the effective frequency. The span size is the difference in position between the first and last matching term, and idf is not used for scoring Spans. The reason why idf is not used could be that there is no basic score value associated with inner spans; only top level spans are scored by SpanScorer. For more details, please consult the SpanScorer code. Regards, Paul Elschot Right, my fault, it's the query normalization in the weight which uses idf (by pulling from each clause in the span). So it's kind of factored into the score, but not in the way I implied. Sorry, my bad on the info. Well, I had missed the phrase idf over all the SpanQuery terms as used from the SpanWeight. Regards, Paul Elschot
Re: PhraseQuery issues - differences with SpanNearQuery
On Thursday 04 September 2008 20:39:13, Mark Miller wrote: Sounds like it's more in line with what you are looking for. If I remember correctly, the phrase query factors the edit distance into scoring, but the SpanNearQuery will just use the combined idf for each of the terms in it, so distance shouldn't matter with spans (I'm sure Paul will correct me if I am wrong). SpanScorer will use the similarity slop factor for each matching span size to adjust the effective frequency. The span size is the difference in position between the first and last matching term, and idf is not used for scoring Spans. The reason why idf is not used could be that there is no basic score value associated with inner spans; only top level spans are scored by SpanScorer. For more details, please consult the SpanScorer code. Regards, Paul Elschot - Mark Yannis Pavlidis wrote: Hi, I am having an issue when using the PhraseQuery which is best illustrated with this example: I have created 2 documents to emulate URLs. One with a URL of http://www.airballoon.com and title "air balloon", and the second one with URL http://www.balloonair.com and title "balloon air".
Test1 (PhraseQuery) == Now when I use the phrase query title:"air balloon"~2 I get back:
url: http://www.airballoon.com - score: 1.0
url: http://www.balloonair.com - score: 0.57
Test2 (PhraseQuery) == Now when I use the phrase query title:"balloon air"~2 I get back:
url: http://www.balloonair.com - score: 1.0
url: http://www.airballoon.com - score: 0.57
Test3 (PhraseQuery) == Now when I use the phrase query title:"air balloon"~2 title:"balloon air"~2 I get back:
url: http://www.airballoon.com - score: 1.0
url: http://www.balloonair.com - score: 1.0
Test4 (SpanNearQuery) === spanNear([title:air, title:balloon], 2, false) I get back:
url: http://www.airballoon.com - score: 1.0
url: http://www.balloonair.com - score: 1.0
I would have expected that Test1 and Test2 would actually return both URLs with a score of 1.0, since I am setting the slop to 2. It seems though that Lucene really favors an absolute exact match. Is it safe to assume that for what I am looking for (basically score the docs the same regardless of whether someone is searching for "air balloon" or "balloon air") it would be better to use the SpanNearQuery rather than the PhraseQuery? Any input would be appreciated. Thanks in advance, Yannis.
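[Editorial note] The Test4 query above can be built as follows, assuming the Lucene 2.x spans package; inOrder=false is what makes "air balloon" and "balloon air" match alike:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class UnorderedNearExample {
    public static SpanNearQuery build() {
        SpanQuery[] clauses = new SpanQuery[] {
            new SpanTermQuery(new Term("title", "air")),
            new SpanTermQuery(new Term("title", "balloon"))
        };
        // slop 2, unordered: both word orders match on the same basis,
        // unlike PhraseQuery, which penalizes the reversed order via edit distance.
        return new SpanNearQuery(clauses, 2, false);
    }
}
```
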
Re: Pre-filtering for expensive query
On Wednesday 03 September 2008 18:06:57, Matt Ronge wrote: On Aug 30, 2008, at 3:01 PM, Paul Elschot wrote: On Saturday 30 August 2008 18:19:09, Matt Ronge wrote: On Aug 30, 2008, at 4:43 AM, Karl Wettin wrote: Can you tell us a bit more about what your custom query does? Perhaps you can build the candidate filter and reuse it over and over again? I cannot reuse it. The candidate filter would be constructed by first running a boolean query with a number of SHOULD clauses. So then I know what docs at least contain the terms I'm looking for. Once I have this set, I will look at the ordering of the matches (it's a bit more sophisticated than just a phrase query) and find the final matches. Sounds like you may want to take a look at SpanNearQuery. I'm going to take a second look at SpanNearQuery. I need it to support optional tokens, so I'm guessing I'll need to create a subclass to do that. SpanNearQuery was not designed for optional tokens. This can be tricky, so make sure your specs are good. I know only of this article for optional tokens and proximity: Kunihiko Sadakane and Hiroshi Imai. Fast algorithms for k-word proximity search. IEICE Trans. Fundamentals, E84-A(9), September 2001. Since my boolean clauses are different for each query I can't reuse the filter. With (a variation of) SpanNearQuery you may end up not needing any filtering at all, because it already uses skipTo() where possible. In case you are looking for documents that contain partial phrases from an input query that has more than 2 words, have a look at Nutch. I poked around in the Nutch docs and Javadocs; what should I look at in Nutch? What does it do exactly? Is it the trick that Doug Cutting mentioned where you concat neighboring terms together, like "Hello world" becomes the token "hello.world"? That is an optimization for combinations of high frequency terms, which is built into Nutch iirc. But I don't know the details, so please ask on a Nutch list.
Regards, Paul Elschot
Re: Pre-filtering for expensive query
On Saturday 30 August 2008 18:22:50, Matt Ronge wrote: On Aug 30, 2008, at 6:13 AM, Paul Elschot wrote: On Saturday 30 August 2008 03:34:01, Matt Ronge wrote: Hi all, I am working on implementing a new Query, Weight and Scorer that is expensive to run. I'd like to limit the number of documents I run this query on by first building a candidate set of documents with a boolean query. Once I have that candidate set, I was hoping I could build a filter off of it, and issue that along with my expensive query. However, after reading the code I see that filtering is done during the search, and not beforehand. Correct. I suppose you mean the filtering code in IndexSearcher? Yes, that's exactly what I mean. As Grant pointed out, this code was recently changed by LUCENE-584. I was referring to the (current trunk) code including this change that uses skipTo() on a DocIdSetIterator obtained from the Filter. Sorry for any confusion on this. Regards, Paul Elschot
Re: Pre-filtering for expensive query
On Saturday 30 August 2008 03:34:01, Matt Ronge wrote: Hi all, I am working on implementing a new Query, Weight and Scorer that is expensive to run. I'd like to limit the number of documents I run this query on by first building a candidate set of documents with a boolean query. Once I have that candidate set, I was hoping I could build a filter off of it, and issue that along with my expensive query. However, after reading the code I see that filtering is done during the search, and not beforehand. Correct. I suppose you mean the filtering code in IndexSearcher? So my initial boolean query won't help in limiting the number of documents scored by my expensive query. The trick of filtering is the use of skipTo() on both the filter and the scorer to skip superfluous work as much as possible. So when you make your scorer implement skipTo() efficiently, filtering it should reduce the amount of scoring done. Implementing skipTo() efficiently is normally done by using TermScorer.skipTo() on the leafs of a scorer structure. So, in case you implement your own TermScorer, take a serious look at TermScorer.skipTo(). Normally, score value computations are not the bottleneck, but accessing the index is, and this is where skipTo() does the real work. At the moment avoiding score value computations is a nice extra. Has anyone done any work on restricting the set of docs that a query operates on? Yes, Filters. Or should I just implement something myself in a custom scorer? In case you have a better way than skipTo(), or something to improve on this issue to allow a Filter as a clause to BooleanQuery: https://issues.apache.org/jira/browse/LUCENE-1345 let us know. Regards, Paul Elschot
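[Editorial note] The skipTo() leapfrog between a filter and a scorer can be modeled on two sorted doc-id lists: each side jumps directly to the other side's current candidate instead of stepping one doc at a time. A self-contained sketch, with plain arrays standing in for the DocIdSetIterators:

```java
import java.util.ArrayList;
import java.util.List;

public class LeapfrogExample {
    // Intersect two sorted doc-id arrays the way filtered search does:
    // alternately skip each iterator forward to the other's current doc.
    public static List<Integer> intersect(int[] filterDocs, int[] scorerDocs) {
        List<Integer> hits = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < filterDocs.length && j < scorerDocs.length) {
            if (filterDocs[i] == scorerDocs[j]) {
                hits.add(Integer.valueOf(filterDocs[i])); // doc passes both sides
                i++;
                j++;
            } else if (filterDocs[i] < scorerDocs[j]) {
                // models filter.skipTo(scorerDocs[j])
                while (i < filterDocs.length && filterDocs[i] < scorerDocs[j]) i++;
            } else {
                // models scorer.skipTo(filterDocs[i])
                while (j < scorerDocs.length && scorerDocs[j] < filterDocs[i]) j++;
            }
        }
        return hits;
    }
}
```

In Lucene the inner loops are the iterators' skipTo() implementations, which use the index skip data rather than linear stepping, and score values are only computed for docs that survive the intersection.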
Re: Pre-filtering for expensive query
On Saturday 30 August 2008 18:19:09, Matt Ronge wrote: On Aug 30, 2008, at 4:43 AM, Karl Wettin wrote: Can you tell us a bit more about what your custom query does? Perhaps you can build the candidate filter and reuse it over and over again? I cannot reuse it. The candidate filter would be constructed by first running a boolean query with a number of SHOULD clauses. So then I know what docs at least contain the terms I'm looking for. Once I have this set, I will look at the ordering of the matches (it's a bit more sophisticated than just a phrase query) and find the final matches. Sounds like you may want to take a look at SpanNearQuery. Since my boolean clauses are different for each query I can't reuse the filter. With (a variation of) SpanNearQuery you may end up not needing any filtering at all, because it already uses skipTo() where possible. In case you are looking for documents that contain partial phrases from an input query that has more than 2 words, have a look at Nutch. Regards, Paul Elschot -- Matt Hi all, I am working on implementing a new Query, Weight and Scorer that is expensive to run. I'd like to limit the number of documents I run this query on by first building a candidate set of documents with a boolean query. Once I have that candidate set, I was hoping I could build a filter off of it, and issue that along with my expensive query. However, after reading the code I see that filtering is done during the search, and not beforehand. So my initial boolean query won't help in limiting the number of documents scored by my expensive query. Has anyone done any work on restricting the set of docs that a query operates on? Or should I just implement something myself in a custom scorer? Thanks in advance, -- Matt Ronge
Re: Pre-filtering for expensive query
On Saturday 30 August 2008 18:22:50, Matt Ronge wrote: On Aug 30, 2008, at 6:13 AM, Paul Elschot wrote: On Saturday 30 August 2008 03:34:01, Matt Ronge wrote: Hi all, I am working on implementing a new Query, Weight and Scorer that is expensive to run. I'd like to limit the number of documents I run this query on by first building a candidate set of documents with a boolean query. Once I have that candidate set, I was hoping I could build a filter off of it, and issue that along with my expensive query. However, after reading the code I see that filtering is done during the search, and not beforehand. Correct. I suppose you mean the filtering code in IndexSearcher? Yes, that's exactly what I mean. So my initial boolean query won't help in limiting the number of documents scored by my expensive query. The trick of filtering is the use of skipTo() on both the filter and the scorer to skip superfluous work as much as possible. So when you make your scorer implement skipTo() efficiently, filtering it should reduce the amount of scoring done. Implementing skipTo() efficiently is normally done by using TermScorer.skipTo() on the leafs of a scorer structure. So, in case you implement your own TermScorer, take a serious look at TermScorer.skipTo(). Normally, score value computations are not the bottleneck, but accessing the index is, and this is where skipTo() does the real work. At the moment avoiding score value computations is a nice extra. I was not aware of this. Where can I find the code that uses the filter to determine what values to feed to skipTo (I'm trying to get a better understanding of the Lucene source)? It's the same code in IndexSearcher. ConjunctionScorer.skipTo() does much the same thing for any number of scorers. Or should I just implement something myself in a custom scorer?
In case you have a better way than skipTo(), or something to improve on this issue to allow a Filter as a clause to BooleanQuery: https://issues.apache.org/jira/browse/LUCENE-1345 let us know. Thanks, if the skipTo approach doesn't work, I'll take a look at this. For the moment, Andrzej's suggestion to use FilteredQuery as a clause could well be good enough. Btw. FilteredQuery also contains a filtering scorer under the hood, you could take a look there, too. Regards, Paul Elschot
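For readers following along, the combination discussed in this thread can be sketched roughly as below against the Lucene 2.x API. This is only a sketch of the wiring, not code from the thread; `candidateQuery`, `expensiveQuery` and `searcher` are placeholders.

```java
// Sketch (Lucene 2.x era API, untested): wrap the cheap boolean query as a
// filter, so the expensive scorer only has to skipTo() candidate documents.
Filter candidates = new QueryWrapperFilter(candidateQuery);

// Option 1: pass the filter directly to the searcher.
Hits hits = searcher.search(expensiveQuery, candidates);

// Option 2 (Andrzej's suggestion): use FilteredQuery, e.g. as a clause
// inside a larger BooleanQuery.
Query filtered = new FilteredQuery(expensiveQuery, candidates);
```

Either way, the benefit depends on the expensive scorer implementing skipTo() efficiently, as discussed above.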
Re: Fastest way to get just the bits of matching documents
On Thursday 24 July 2008 23:00:33 Robert Stewart wrote: Queries are very complex in our case, some have up to 100 or more clauses (over several fields), including disjunctions and prohibited clauses. Other than the earlier advice, did you try setAllowDocsOutOfOrder()? Regards, Paul Elschot
Re: Scoring filters
On Wednesday 11 June 2008 01:41:38 Karl Wettin wrote: Each of my filters represents a single boosting term query. But when using the filter instead of the boosting term query I lose the score (not sure this is true) and payload boost (if any), both essential for the quality of my results. If I was to add payloads to the bits that are set, what is the best or simplest way to get the score back in? How about wrapping each filter in a query? Are there any obvious problems with this strategy that I've missed? Why not add the boosting term queries as required to a BooleanQuery? This has the advantage that it uses the index data and the various caches built into Lucene and the underlying OS. In case you have the memory available, it is also possible to keep the score values of any Query with the Filter and implement a Scorer using the filter docs and these score values. Then use this as the scorer for a new Query, via a Weight. Once this new Query is available, just add it as required to a BooleanQuery. Regards, Paul Elschot
Re: SpanNearQuery: how to get the intra-span matching positions?
On Friday 30 May 2008 12:10 Claudio Corsi wrote: Hi all, I'm querying my index with a SpanNearQuery built on top of some SpanOrQuery. Now, the Spans object I get from the SpanNearQuery instance returns the sequence of text spans, each defined by their starting/ending positions. I'm wondering if there is a simple way to get not only the start/end positions of the entire span, but the single matching positions inside such a span. For example, suppose that a SpanNearQuery composed of 3 SpanTermQuery (with a slop of K) produces as resulting span the terms sequence: t0 t1 t2 t3 ... t100 (so start() == 0, end() == 100). I know that for sure t0 and t100 have generated a match, since the span is minimal (right?). Right. But make sure to test, some less than straightforward situations are possible when matching spans. For example, the subqueries may be SpanNearQuery's themselves instead of SpanTermQuery's. But I also know that there is a third match somewhere in the span (I have 3 SpanTermQuery that have to match). Is there a way to discover it? To get this information, you'll have to extend NearSpansOrdered and NearSpansUnordered (package private classes in o.a.l.search.spans) to also provide for example an int[] with the actual matching 'positions', or subspans each with their own begin and end. This is fairly straightforward, but to actually use such positions SpanScorer will also need to be extended or even replaced. In case you want to continue this discussion, please do so on java-dev. Regards, Paul Elschot.
Re: SpanNearQuery scoring
On Friday 23 May 2008 15:19:03 Karl Wettin wrote: Everything (scores, explanations and not hitting breakpoints while debugging) seems to indicate that SpanNearQuery doesn't use the scoring of the inner spans. Is this true? Yes. If so, is it intentional? I don't know. The Spans interface does not contain a weight() or score() method, so there is no way to pass such information to SpanScorer. Regards, Paul Elschot
Re: multi word synonyms
On Sunday 18 May 2008 16:30:26 Karl Wettin wrote: On 18 May 2008 at 00.01, Paul Elschot wrote: On Saturday 17 May 2008 20:28:40 Karl Wettin wrote: As far as I know Lucene only handles single word synonyms at index time. My life would be much simpler if it was possible to add synonyms that spanned over multiple tokens, such as lucene in action=lia. I have a couple of workarounds that are OK but it really isn't the same thing when it comes down to the scoring. The simplest solution is to index such synonyms at the first or last or middle position of the source tokens, using a zero position increment for the synonym. Was this one of the workarounds? I get sloppyFreq problems with that. The advantage of the zero position increment is that the original token positions are not affected, so at least there is no influence on scoring because of changes in the original token positions. I copy a number of fields to a single one. Each such field can be represented in a number of languages or aliases in the same language. [a, b, c, d, e, f], [g, h, i], [j, k, l, m] [o, p] [u, v] [q, r, s, t] It would be great if the phrase query on [f, o, p, u, v] could yield a 0 distance. If I'd been using the same synonyms for the same phrases in all documents at all times the edit distance would be static when scoring, but I don't. The terms of these synonyms are not really compatible with each other. For instance [f, g, s, t, j] should not be allowed or at least be heavily penalised compared to [f, o, p, j]. Searching a combination of languages should be allowed but preferably only one per field copied to the big field. (Disjunction is not applicable.) It is OK the way I have it running now, but more dimensions as described above really increase the score quality. I confirmed that using permutations of documents and filtering out the duplicates. Now I'm thinking it could be solved using token payloads and a brand new MultiDimensionalSpanQuery.
Not too different from what you suggested way back in http://www.nabble.com/Using-Lucene-for-searching-tokens%2C-not-storing-them.-to3918462.html#a3944016 That would mean a term extending tag to indicate that a term is on an alternative path? There are some other issues too, but I'm not at liberty to disclose too much. I hope it still makes sense? Yes. I suppose the payload would indicate how much the alternative path length differs from the original path? In case you can't disclose more, no answer would of course be OK, too. Regards, Paul Elschot
Re: MultiTerm Or Query with per-term boost. Does it exist?
See below. On Sunday 18 May 2008 21:03:19 John Jensen wrote: The only problem is that I'm thinking that a special purpose Query subclass might be faster, but I was wondering if others have run into similar situations, and whether they saw a performance win by replacing complex BooleanQueries with a special purpose Query subclass. Unfortunately the boosts are query specific and can't be done at index time. Thanks, John On Sun, May 18, 2008 at 9:30 AM, Karl Wettin [EMAIL PROTECTED] wrote: On 18 May 2008 at 02.25, John Jensen wrote: Hi, I have an application where I need to issue queries with a large number of or-terms with individual boosts. Currently I just construct a BooleanQuery with a large number (often 1000) of constituent TermQueries. I'm wondering if there is a better way to do this? I'm open to implementing my own Query subclass if I can expect significant performance improvements from doing this. Does BooleanQuery.setAllowDocsOutOfOrder() make a difference? Regards, Paul Elschot What is the general problem with your approach? And what do all these boosted term queries represent? Would it perhaps be possible for you to add the boost at index time instead of at query time? karl
Re: theoretical maximum score
On Saturday 17 May 2008 00:04:31 Chris Hostetter wrote: : Is it possible to compute a theoretical maximum score for a given : query if constraints are placed on 'tf' and 'lengthNorm'? If so, : scores could be compared to a 'perfect score' (a feature request : from our customers) I think a theoretical maximum score is only going to work when that maximum applies to queries of any structure. So, start with the simplest query, associate it with a theoretical maximum score, and then for each possible combination of subqueries ((weighted) and/or/phrase/span) make sure that the subscore values are combined into another value that has the same theoretical maximum. Have a look here to start: https://issues.apache.org/jira/browse/LUCENE-293 Regards, Paul Elschot
Re: multi word synonyms
On Saturday 17 May 2008 20:28:40 Karl Wettin wrote: As far as I know Lucene only handles single word synonyms at index time. My life would be much simpler if it was possible to add synonyms that spanned over multiple tokens, such as lucene in action=lia. I have a couple of workarounds that are OK but it really isn't the same thing when it comes down to the scoring. The thing that does the best job at scoring was to assemble several permutations of the same document. But it doesn't feel good. I have cases where that means several hundred documents, and I have to do post processing to filter out the duplicate hits. It can turn out to be rather expensive. And I'm sure it messes with the scoring in several ways I did not notice yet. I've also considered creating some multi dimensional term position space, but I'd say that could take a lot of time to implement. Are there any good solutions to this? The simplest solution is to index such synonyms at the first or last or middle position of the source tokens, using a zero position increment for the synonym. Was this one of the workarounds? The advantage of the zero position increment is that the original token positions are not affected, so at least there is no influence on scoring because of changes in the original token positions. Regards, Paul Elschot
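In a custom token filter, the zero position increment trick amounts to a single call on the synonym token. A minimal sketch against the 2.x Token API, where `orig` is the current token and `synonymText` is a hypothetical synonym string:

```java
// Sketch: emit the synonym at the same position as the original token,
// so the original token positions (and scoring) are unaffected.
Token syn = new Token(synonymText, orig.startOffset(), orig.endOffset());
syn.setPositionIncrement(0); // zero increment: same position as orig
```

The filter would return `orig` first and `syn` on the following call to next().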
Re: Filtering a SpanQuery
On Wednesday 07 May 2008 10:18:38 Eran Sevi wrote: Thanks Paul for your reply, Since my index contains a couple of million documents and the filter is supposed to limit the search space to a few thousand I was hoping I won't have to do the filtering myself after running the query on all the index. The code I gave earlier effectively does a filtered query search on the index. It visits the resulting Spans, and does not provide a score value per document as SpanScorer would do. Please make sure to test that code thoroughly for reliable results. Maybe this is the case anyway and behind the scenes the filter does exactly what you suggested. Yes, a filtered query search would use skipTo() on the Spans via SpanScorer. But the difference between the normal case and your case is that you don't need SpanScorer. From what I tested the number of results of the SpanQuery greatly affects the running speed so if I'm going to use about 0.1% of the results I'm losing a lot of time and memory for gathering and storing the spans I'm not going to use. I don't know how SpanQuery works internally but I guess that if the filter is known beforehand, A Filter needs to make a BitSet available before the query search. it could speed things up quite a bit. I would expect a substantial speedup from using skipTo() on the Spans when only 0.1% of the results passes the filter. Regards, Paul Elschot Eran. On Wed, May 7, 2008 at 10:34 AM, Paul Elschot [EMAIL PROTECTED] wrote: On Tuesday 06 May 2008 17:39:38 Paul Elschot wrote: Eran, On Tuesday 06 May 2008 10:15:10 Eran Sevi wrote: Hi, I am looking for a way to filter a SpanQuery according to some other query (on another field from the one used for the SpanQuery). I need to get access to the spans themselves of course. I don't care about the scoring of the filter results and just need the positions of hits found in the documents that match the filter. I think you'll have to implement the filtering on the Spans yourself.
That's not really difficult, just use Spans.skipTo(). The code to do that could look something like this (untested):

  Spans spans = yourSpanQuery.getSpans(reader);
  BitSet bits = yourFilter.bits(reader);
  int filterDoc = bits.nextSetBit(0);
  while ((filterDoc >= 0) && spans.skipTo(filterDoc)) {
    boolean more = true;
    while (more && (spans.doc() == filterDoc)) {
      // use spans.start() and spans.end() here
      // ...
      more = spans.next();
    }
    if (! more) { break; }
    filterDoc = bits.nextSetBit(spans.doc());
  }

At this point, no skipping on the spans should be done when filterDoc equals spans.doc(), so this code still needs some work. But I think you get the idea. Regards, Paul Elschot Please check the javadocs of java.util.BitSet, there may be an off-by-one error in the arguments to nextSetBit(). Regards, Paul Elschot I tried looking through the archives and found some reference to a SpanQueryFilter patch, however I don't see how it can help me achieve what I want to do. This class receives only one query parameter (which I guess is the actual query) and not a query and a filter for example. Any help about how I can achieve this will be appreciated. Thanks, Eran.
Re: Filtering a SpanQuery
Eran, On Tuesday 06 May 2008 10:15:10 Eran Sevi wrote: Hi, I am looking for a way to filter a SpanQuery according to some other query (on another field from the one used for the SpanQuery). I need to get access to the spans themselves of course. I don't care about the scoring of the filter results and just need the positions of hits found in the documents that match the filter. I think you'll have to implement the filtering on the Spans yourself. That's not really difficult, just use Spans.skipTo(). The code to do that could look something like this (untested):

  Spans spans = yourSpanQuery.getSpans(reader);
  BitSet bits = yourFilter.bits(reader);
  int filterDoc = bits.nextSetBit(0);
  while ((filterDoc >= 0) && spans.skipTo(filterDoc)) {
    boolean more = true;
    while (more && (spans.doc() == filterDoc)) {
      // use spans.start() and spans.end() here
      // ...
      more = spans.next();
    }
    if (! more) { break; }
    filterDoc = bits.nextSetBit(spans.doc());
  }

Please check the javadocs of java.util.BitSet, there may be an off-by-one error in the arguments to nextSetBit(). Regards, Paul Elschot I tried looking through the archives and found some reference to a SpanQueryFilter patch, however I don't see how it can help me achieve what I want to do. This class receives only one query parameter (which I guess is the actual query) and not a query and a filter for example. Any help about how I can achieve this will be appreciated. Thanks, Eran.
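To make the control flow concrete, here is a self-contained sketch with a minimal stand-in for the Spans interface (the real one lives in org.apache.lucene.search.spans; this mock is not Lucene code). Note one deliberate change from the snippet in the thread: the inner loop tests the filter bit directly instead of comparing against filterDoc, which avoids the extra-skip problem Paul mentions and also handles consecutive filtered documents:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class SpanFilterDemo {

    /** Minimal stand-in for org.apache.lucene.search.spans.Spans. */
    interface Spans {
        boolean next();              // advance to the next span
        boolean skipTo(int target);  // advance to the first span with doc() >= target
        int doc();
        int start();
        int end();
    }

    /** A Spans over a fixed array of {doc, start, end} triples, sorted by doc. */
    static class ListSpans implements Spans {
        private final int[][] spans;
        private int i = -1;
        ListSpans(int[][] spans) { this.spans = spans; }
        public boolean next() { return ++i < spans.length; }
        public boolean skipTo(int target) {
            do { if (!next()) return false; } while (doc() < target);
            return true;
        }
        public int doc()   { return spans[i][0]; }
        public int start() { return spans[i][1]; }
        public int end()   { return spans[i][2]; }
    }

    /** Visit only the spans whose document passes the filter. */
    static List<int[]> filteredSpans(Spans spans, BitSet bits) {
        List<int[]> result = new ArrayList<int[]>();
        int filterDoc = bits.nextSetBit(0);
        while (filterDoc >= 0 && spans.skipTo(filterDoc)) {
            boolean more = true;
            // Test the bit directly, so no extra skipTo() happens while
            // the current document still passes the filter.
            while (more && bits.get(spans.doc())) {
                result.add(new int[] { spans.doc(), spans.start(), spans.end() });
                more = spans.next();
            }
            if (!more) break;
            filterDoc = bits.nextSetBit(spans.doc());
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] data = { {1, 0, 2}, {3, 1, 4}, {3, 5, 7}, {4, 0, 1}, {7, 2, 3} };
        BitSet bits = new BitSet();
        bits.set(3); bits.set(4); bits.set(6);
        for (int[] s : filteredSpans(new ListSpans(data), bits)) {
            System.out.println("doc=" + s[0] + " start=" + s[1] + " end=" + s[2]);
        }
    }
}
```

Running main visits only the two spans in doc 3 and the one in doc 4; docs 1 and 7 are skipped by the filter.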
Re: Lucene Proximity Searches
Ana, On Friday 18 April 2008 12:41:38 Ana Rabade wrote: I am using ngrams and I need to force that a group of them are together, but if any of them fails, I need the document to be scored as well. Perhaps you could help me to find the solution or give me a reference for the changes I must make. I am using SpanNearQuery, because the ngrams must be in order. Thanks for your answer. - Ana Maria Freire Veiga - Assuming that K terms are involved and K-1 of them need to match in order as ngrams, there are the following options:
- create K SpanNearQuery's on K-1 ordered terms with appropriate slop, add these to a BooleanQuery using Occur.SHOULD, and search this BooleanQuery.
- starting from the same K SpanNearQuery's on K-1 terms, search each of these separately and use your own HitCollector to combine the scores.
For these two options, one could also use the K terms SpanNearQuery to influence the scoring somewhat. The problem with these options is that the number of terms in the query is quadratic in K, possibly giving performance problems for higher values of K. In that case, try the third option:
- modify the code of the NearSpansOrdered class in the org.apache.lucene.search.spans package to allow a match for fewer than all subqueries. This is not going to be straightforward, but it is possible.
In case you choose this last option, please continue on the java-dev list. Regards, Paul Elschot On Fri, Apr 4, 2008 at 12:38 PM, Ana Rabade [EMAIL PROTECTED] wrote: I am using ngrams and I need to force that a group of them are together, but if any of them fails, I need the document to be scored as well. Perhaps you could help me to find the solution or give me a reference for the changes I must make. I am using SpanNearQuery, because the ngrams must be in order. Thanks for your answer. - Ana Maria Freire Veiga - On Thu, Apr 3, 2008 at 7:56 PM, Erick Erickson [EMAIL PROTECTED] wrote: Could you explain your use case?
Because to say that you want to score documents that don't have all the terms with a *phrase query* is contradictory. The point of a phrase query is exactly that all the terms are there and within some proximity. Best Erick On Thu, Apr 3, 2008 at 12:17 PM, Ana Rábade [EMAIL PROTECTED] wrote: Hi! I'm using Lucene Proximity Searches, but I've seen Lucene only scores documents which contain all the terms in the phrase. I also need to score documents although they don't contain all those terms. Is it possible with Lucene PhraseQueries or SpanNearQuery? If not, could you tell me a way to find my solution? Thank you very much. - Ana M. Freire -
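Paul's first option can be sketched as follows with the spans API (untested; the field name "f", the `terms` array, and `slop` are placeholders):

```java
// Sketch: K ordered SpanNearQuery's, each on K-1 terms (one term left out),
// combined as optional clauses so a near-miss still matches and scores.
BooleanQuery combined = new BooleanQuery();
for (int leaveOut = 0; leaveOut < terms.length; leaveOut++) {
    SpanQuery[] clauses = new SpanQuery[terms.length - 1];
    for (int i = 0, j = 0; i < terms.length; i++) {
        if (i != leaveOut) {
            clauses[j++] = new SpanTermQuery(new Term("f", terms[i]));
        }
    }
    combined.add(new SpanNearQuery(clauses, slop, true), BooleanClause.Occur.SHOULD);
}
```

As noted above, this builds K queries of K-1 terms each, so the total term count grows quadratically in K.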
Re: QueryWrapperFilter question...
On Thursday 17 April 2008 06:37:18 Michael Stoppelman wrote: Actually, I screwed up the timing info. I wasn't including the time for the QueryWrapperFilter#bits(IndexReader) call. Sadly, it actually takes longer than the original query that had both terms included. Bummer. I had really convinced myself till the thought came to me at lunch :). For a single query, adding a filter of course has a cost. But when the location part can be reused in later queries, give CachingWrapperFilter a try. Regards, Paul Elschot -M On Wed, Apr 16, 2008 at 6:43 PM, Karl Wettin [EMAIL PROTECTED] wrote: Michael Stoppelman wrote: Hi all, I've been doing some performance testing and found that using a QueryWrapperFilter for a location field restriction I have to do allows my search results to approach 5-10ms. This was surprising. Before, the performance was between 50ms-100ms. The queries from before the optimization look like the following: +(+(text:cats) +(loc:1 loc:2 loc:3 ...)) The QueryWrapperFilter does do a search itself. Why would performance be so drastically different when the QueryWrapperFilter needs to do a search? Does Lucene just not have the statistics to optimize this query so it can decide which terms to filter by first? Do you wonder why a QueryWrapperFilter is faster than a Query? Then the answer is that the filter uses a bitset to know whether a document matches or not. For each document that matches text:cats it checks the flag in the bitset for that document number instead of seeking in the index to find out if it also matches loc:1, loc:2 or loc:3. karl
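When the location restriction recurs across queries, the caching variant Paul suggests looks roughly like this (2.x API sketch; `locationQuery`, `textQuery` and `searcher` are placeholders):

```java
// Sketch: compute the location bits once, then reuse them for later
// searches against the same IndexReader.
Filter cached = new CachingWrapperFilter(new QueryWrapperFilter(locationQuery));
Hits hits = searcher.search(textQuery, cached);
```

The first search pays the bits() cost; subsequent searches with the same filter and reader reuse the cached bitset.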
Re: Using Lucene partly as DB and 'joining' search results.
On Saturday 12 April 2008 00:03:13 Antony Bowesman wrote: Paul Elschot wrote: On Friday 11 April 2008 13:49:59 Mathieu Lecarme wrote: Use Filter and BitSet. From the personal data, you build a Filter (http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Filter.html) which is used in the main index. With 1 billion mails, and possibly a Filter per user, you may want to use more compact filters than BitSets, which is currently possible in the development trunk of Lucene. Thanks for the pointers. I've already used Solr's DocSet interface in my implementation, which I think is where the ideas for the current Lucene enhancements came from. The ideas came from quite a few sources. They can be traced starting from changes.txt in the sources. They work well to reduce the filter's footprint. I'm also caching filters. The intention is that there is a user data index and the mail index(es). The search against the user data index will return a set of mail Ids, which is the common key between the two. Doc Ids are no good between the indexes, so that means a potentially large boolean OR query to create the filter of labelled mails in the mail indexes. I know it's a theoretical question, but will this perform? The normal way to collect doc ids for a filter is into a bitset, iterating over the indexed ids (mail ids in your case). A bitset has random access, so there is no need to do this in doc id order. An OR query has to work in doc id order so it can compute a score per doc id, and the ordering loses some performance. Regards, Paul Elschot
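Collecting the filter bits by iterating over the indexed mail ids, rather than through an OR query, could look something like this (2.x API sketch, untested; the "mailId" field name and the `mailIds` collection are placeholders):

```java
// Sketch: set a bit per matching document, in whatever order the ids
// arrive; the bitset's random access means no doc-id ordering is needed.
BitSet bits = new BitSet(reader.maxDoc());
TermDocs termDocs = reader.termDocs();
for (Iterator it = mailIds.iterator(); it.hasNext();) {
    termDocs.seek(new Term("mailId", (String) it.next()));
    while (termDocs.next()) {
        bits.set(termDocs.doc());
    }
}
termDocs.close();
```

This skips the per-document score computation an OR query would perform.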
Re: Using Lucene partly as DB and 'joining' search results.
On Friday 11 April 2008 13:49:59 Mathieu Lecarme wrote: Antony Bowesman wrote: We're planning to archive email over many years and have been looking at using a DB to store mail meta data and Lucene for the indexed mail data, or just Lucene on its own with email data and structure stored as XML and the raw message stored in the file system. For some customers, the volumes are likely to be well over 1 billion mails over 10 years, so some partitioning of data is needed. At the moment the thoughts are moving away from using a DB + Lucene to just Lucene along with a file system representation of the complete message. All searches will be against the index, then the XML mail meta data is loaded from the file system. The archive is read only apart from bulk deletes, but one of the requirements is for users to be able to label their own mail. Given that a Lucene Document cannot be updated, I have thought about having a separate Lucene index that has just the 3 terms (or some combination of) userId + mailId + label. That of course would mean joining searches from the main mail data index and the label index. Does anyone have any experience of using Lucene this way and is it a realistic option of avoiding the DB at all? I'd rather have the headache of scaling just Lucene, which is a simple beast, than the whole bundle of 'stuff' that comes with the database as well. Use Filter and BitSet. From the personal data, you build a Filter (http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Filter.html) which is used in the main index. With 1 billion mails, and possibly a Filter per user, you may want to use more compact filters than BitSets, which is currently possible in the development trunk of Lucene. Regards, Paul Elschot
Re: Why Lucene has to rewrite queries prior to actual searching?
On Tuesday 08 April 2008 15:18:34 Itamar Syn-Hershko wrote: Paul, I don't see how this answers the question. Towards the end, the page describes when a Scorer is called and roughly what it does. I was asking why Lucene has to access the index with exact terms, and not use RegEx or simpler wildcard support internally? If Lucene were able to look for w?rd or wor* and treat the wildcards as wildcards, this would greatly improve the speed of searches and eliminate the need for Query rewriting. When it is known in advance that w?rd and wor* will be used in queries a lot, one can write a tokenizer that indexes them so that they can be searched directly. The problem is to know that in advance, that is at indexing time. Since some people may want to index chars like those used in wildcards, these could be escaped (or those people will use the standard search classes available today instead). I'm not entirely sure what part of Lucene does the actual access to the terms and position vectors, but if it could be subclassed or cloned, and then modified to honor wildcards or even RegEx, that would bring Lucene to new heights. There are regular expression queries in the regex contrib module, however these work by rewriting to actually indexed terms. Unless, again, there is a specific reason why this can't be done. There is no specific reason why it cannot be done, one only needs to provide the corresponding tokenizer to be used at indexing time. Kind regards, Paul Elschot Itamar. -Original Message- From: Paul Elschot [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 08, 2008 1:56 AM To: java-user@lucene.apache.org Subject: Re: Why Lucene has to rewrite queries prior to actual searching? Itamar, Have a look here: http://lucene.apache.org/java/2_3_1/scoring.html Regards, Paul Elschot On Tuesday 08 April 2008 00:34:48 Itamar Syn-Hershko wrote: Paul and John, Thanks for your quick reply. The problem with query rewriting is the aforementioned MaxClauseException.
Instead of inflating the query and passing a deterministic list of terms to the actual search routine, Lucene could have accessed the vectors in the index using some sort of filter. So, for example, if it knows to access Foobar by its name in the index, why can't it take Foo* and just get all the vectors until Fop is met (for example)? Why does it have to get a deterministic list of terms? I will take a look at the Scorer - can you describe in short what exactly it does and where and when it is being called? I don't get John's comment though - Query::rewrite is being called prior to the actual searching (through QueryParser), how come it can use information gathered from the IndexReader at search time? Itamar. -Original Message- From: Paul Elschot [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 08, 2008 12:57 AM To: java-user@lucene.apache.org Subject: Re: Why Lucene has to rewrite queries prior to actual searching? Itamar, Query rewrite replaces wildcards with terms available from the index. Usually that involves replacing a wildcard with a BooleanQuery that is an effective OR over the available terms while using a flat coordination factor, i.e. it does not matter how many of the available terms actually match a document, as long as at least one matches. For the required query parts (AND-like), Scorer.skipTo() is used, and that could well be the filter mechanism you are referring to; have a look at the javadocs of Scorer, and, if necessary, at the actual code of ConjunctionScorer. Regards, Paul Elschot On Monday 07 April 2008 23:13:09 Itamar Syn-Hershko wrote: Hi all, Can someone from the experts here explain why Lucene has to get a rewritten query for the Searcher - so Phrase or Wildcard queries have to rewrite themselves into a primitive query, that is then passed to Lucene to look for? I'm probably not too familiar with the internals of Lucene, but I'd imagine that if you can inflate a query using wildcards via Query subclassing, you could as easily (?)
have some sort of Filter mechanism during the search, so that Lucene retrieves the position vectors for all the terms that pass that filter, instead of retrieving only the position data for deterministic terms (with no wildcards etc.). If that were possible to do somehow, it could greatly increase the searchability of Lucene indices by using RegEx (without re-writing and getting the dreaded MaxClauseCount error) and similar. Would love to hear some insights on this one. Itamar.
Re: Why Lucene has to rewrite queries prior to actual searching?
Itamar, Query rewrite replaces wildcards with terms available from the index. Usually that involves replacing a wildcard with a BooleanQuery that is an effective OR over the available terms while using a flat coordination factor, i.e. it does not matter how many of the available terms actually match a document, as long as at least one matches. For the required query parts (AND-like), Scorer.skipTo() is used, and that could well be the filter mechanism you are referring to; have a look at the javadocs of Scorer, and, if necessary, at the actual code of ConjunctionScorer. Regards, Paul Elschot On Monday 07 April 2008 23:13:09 Itamar Syn-Hershko wrote: Hi all, Can someone from the experts here explain why Lucene has to get a rewritten query for the Searcher - so Phrase or Wildcard queries have to rewrite themselves into a primitive query, that is then passed to Lucene to look for? I'm probably not too familiar with the internals of Lucene, but I'd imagine that if you can inflate a query using wildcards via Query subclassing, you could as easily (?) have some sort of Filter mechanism during the search, so that Lucene retrieves the position vectors for all the terms that pass that filter, instead of retrieving only the position data for deterministic terms (with no wildcards etc.). If that were possible to do somehow, it could greatly increase the searchability of Lucene indices by using RegEx (without re-writing and getting the dreaded MaxClauseCount error) and similar. Would love to hear some insights on this one. Itamar.
Re: Why Lucene has to rewrite queries prior to actual searching?
Itamar, Have a look here: http://lucene.apache.org/java/2_3_1/scoring.html Regards, Paul Elschot On Tuesday 08 April 2008 00:34:48 Itamar Syn-Hershko wrote: Paul and John, Thanks for your quick reply. The problem with query rewriting is the aforementioned MaxClauseException. Instead of inflating the query and passing a deterministic list of terms to the actual search routine, Lucene could have accessed the vectors in the index using some sort of filter. So, for example, if it knows to access Foobar by its name in the index, why can't it take Foo* and just get all the vectors until Fop is met (for example)? Why does it have to get a deterministic list of terms? I will take a look at the Scorer - can you describe in short what exactly it does and where and when it is being called? I don't get John's comment though - Query::rewrite is being called prior to the actual searching (through QueryParser), how come it can use information gathered from the IndexReader at search time? Itamar. -Original Message- From: Paul Elschot [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 08, 2008 12:57 AM To: java-user@lucene.apache.org Subject: Re: Why Lucene has to rewrite queries prior to actual searching? Itamar, Query rewrite replaces wildcards with terms available from the index. Usually that involves replacing a wildcard with a BooleanQuery that is an effective OR over the available terms while using a flat coordination factor, i.e. it does not matter how many of the available terms actually match a document, as long as at least one matches. For the required query parts (AND-like), Scorer.skipTo() is used, and that could well be the filter mechanism you are referring to; have a look at the javadocs of Scorer, and, if necessary, at the actual code of ConjunctionScorer.
Re: Improving Index Search Performance
Since you're using all the results for a query, and ignoring the score value, you might try to do the same thing with a relational database. But I would not expect that to be much faster, especially when using a field cache. Other than that, you could also go the other way, and try to add more data to the Lucene index that can be used to reduce the number of results to be fetched. Regards, Paul Elschot On Wednesday 26 March 2008 13:51:24, Shailendra Mudgal wrote: The bottom line is that reading fields from docs is expensive. FieldCache will, I believe, load fields for all documents but only once - so the second and subsequent times it will be fast. Even without using a cache it is likely that things will speed up because of caching by the OS. As I mentioned in my previous mail, the companyId is a multivalued field, so caching it will consume a lot of memory. And this way we'll have to keep the document vs field mapping also in memory. If you've got plenty of memory vs index size you could look at RAMDirectory or MMapDirectory. Or how about some solid state disks? Someone recently posted some very impressive performance stats. The index size is around 20G and the available memory is 4G, so keeping the entire index in memory is not possible. But as I mentioned earlier, it is using only 1G out of 4G, so is there a way to tell Lucene to cache more documents, say use 2G for caching the index? I'll appreciate more suggestions on the same problem. Regards, Vipin
Re: Improving Index Search Performance
Shailendra, Have a look at the javadocs of HitCollector: http://lucene.apache.org/java/2_3_0/api/core/org/apache/lucene/search/HitCollector.html The problem is with the use of the disk head: when retrieving the documents during collecting, the disk head has to move between the inverted index and the stored documents; see also the file formats. To avoid such excessive disk head movement, you need to collect all (or at least many more than 1 of) your document ids during collect(), for example into an int[]. After collecting, retrieve all the docs with Searcher.doc(). Also, for the same reason, retrieving docs is best done in doc id order, but that is unlikely to go wrong as doc ids are normally collected in increasing order. Regards, Paul Elschot On Tuesday 25 March 2008 13:43:18, Shailendra Mudgal wrote: Hi Everyone, We are using Lucene to search on an index of around 20G size with around 3 million documents. We are facing performance issues loading large results from the index. Based on the various posts on the forum and documentation, we have made the following code changes to improve the performance: i. Modified the code to use HitCollector instead of Hits since we will be loading all the documents in the index based on keyword matching ii. Added MapFieldSelector to load only selected fields (2 fields only) instead of all the 14 After all these changes, it seems to be taking around 90 secs to load 17k documents. After profiling, we found that the max time is spent in searcher.doc(id, selector). Here is the code: public void collect(int id, float score) { try { MapFieldSelector selector = new MapFieldSelector(new String[] {COMPANY_ID, ID}); doc = searcher.doc(id, selector); mappedCompanies = doc.getValues(COMPANY_ID); } catch (IOException e) { logger.debug("inside IDCollector.collect(): " + e.getMessage()); } } We also read in one of the posts that we should use bitSet.set(doc) instead of calling searcher.doc(id).
But we are unable to understand how this might help in our case since we will anyway have to load the document to get the other required field (company_id). Also we observed that the searcher is actually using only 1G RAM though we have 4G allocated to it. Can someone suggest if there is any other optimization that can be done to improve the search performance on MultiSearcher? Any help would be appreciated. Thanks, Vipin
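Paul's advice above, collect only doc ids during collect() and fetch stored fields afterwards in increasing doc-id order, can be sketched without Lucene as follows (the class and method names are illustrative, not Lucene's API):

```java
import java.util.*;

// Illustrative sketch (not Lucene's API): gather doc ids during hit
// collection into a growable int[], then fetch stored documents
// afterwards in increasing doc-id order, so the disk head does not
// bounce between the inverted index and the stored fields.
public class IdCollectorSketch {
    private int[] ids = new int[4];
    private int size = 0;

    // Called once per matching document; cheap, just records the id.
    public void collect(int docId) {
        if (size == ids.length) ids = Arrays.copyOf(ids, ids.length * 2);
        ids[size++] = docId;
    }

    // After collecting, return the ids sorted for sequential retrieval.
    public int[] sortedIds() {
        int[] result = Arrays.copyOf(ids, size);
        Arrays.sort(result); // normally already increasing, but be safe
        return result;
    }

    public static void main(String[] args) {
        IdCollectorSketch c = new IdCollectorSketch();
        for (int id : new int[] {7, 3, 11, 5, 2}) c.collect(id);
        // Retrieval would then loop over these ids calling Searcher.doc().
        System.out.println(Arrays.toString(c.sortedIds()));
    }
}
```

The point is that collect() stays trivial and all stored-field I/O happens in one ordered pass after the search.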
Re: Call Lucene default command line Search from PHP script
Milu, This is a PHP problem, not a Lucene one, so you might get a better response on a PHP mailing list. The easy way around your problem is probably by invoking a shell script from PHP that exports the class path as you indicated, so that java can see the correct classes. Having said that, you'll probably want to use the PHP/Java extension to avoid initializing a JVM for each call to Lucene. Try this: http://www.google.nl/search?q=php+java+org+apache+lucene&ie=UTF-8&oe=UTF-8 This was one of the results: http://www.idimmu.net/index.php?blog%5Bpagenum%5D=3 Regards, Paul Elschot On Friday 21 March 2008 21:24:37, milu07 wrote: Hello, My machine is Ubuntu 7.10. I am working with Apache Lucene. I am done with the indexer and tried the command line Searcher (the default command line included in the Lucene package: http://lucene.apache.org/java/2_3_1/demo2.html). When I use this at the command line: java Searcher -query algorithm it works and returns a list of results to me. Here 'algorithm' is the keyword to search. However, I want to have a web search interface written in PHP, so I use PHP exec() to call this Searcher from my PHP script: exec("java Searcher -query algorithm", $arr, $retVal); [I also tried: exec("java Searcher -query 'algorithm'", $arr, $retVal)] It does not work. I print the value of $retVal, it is 1. I come back and try: exec("java Searcher -query algorithm 2>&1", $arr, $retVal); I receive: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer and $retVal is 1. In the command line Searcher.java of Lucene, it imports many libraries; is this the problem? import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.standard.StandardAnalyzer; I guess this is a path problem. However, I do not know how to fix it because it works on the command line ($CLASSPATH points to the .jar file of the Lucene library). Maybe PHP does not know $CLASSPATH.
So, I add the Lucene lib to $PATH: export PATH=$PATH:/usr/lib/lucene-core-2.3.1.jar:/usr/lib However, I get the same error message when I try: exec("java Searcher -query algorithm 2>&1", $arr, $retVal); Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer Could you please help? Thank you,
Re: Call Lucene default command line Search from PHP script
On Saturday 22 March 2008 00:32:32, Paul Elschot wrote: Milu, This is a PHP problem, not a Lucene one, so you might get a better response on a PHP mailing list. The easy way around your problem is probably by invoking a shell script from PHP that exports the class path as you indicated, so that java can see the correct classes. I meant a shell script that exports the class path, and invokes java from the same shell. Regards, Paul Elschot
Re: HELP: how to list term score inside some document?
On Friday 14 March 2008 17:28:17, Rao WeiXiong wrote: Dear: Is it possible to list all term scores inside some document by some simple method? Now I just use each term as the query to search the whole index to get the score. This seems very cumbersome; is there any simpler approach? Have a look at Searcher.explain() Regards, Paul Elschot
Re: MultiFieldQueryParser - BooleanClause.Occur
On Friday 29 February 2008 18:04:47, Donna L Gresh wrote: I believe something like the following will do what you want: QueryParser parserTitle = new QueryParser("title", analyzer); QueryParser parserAuthor = new QueryParser("author", analyzer); BooleanQuery overallquery = new BooleanQuery(); BooleanQuery firstQuery = new BooleanQuery(); Query q1 = parserTitle.parse(term1); Query q2 = parserAuthor.parse(term1); firstQuery.add(q1, BooleanClause.Occur.SHOULD); // SHOULD is like an OR firstQuery.add(q2, BooleanClause.Occur.SHOULD); BooleanQuery secondQuery = new BooleanQuery(); Query q3 = parserTitle.parse(term2); Query q4 = parserAuthor.parse(term2); secondQuery.add(q3, BooleanClause.Occur.SHOULD); secondQuery.add(q4, BooleanClause.Occur.SHOULD); overallquery.add(firstQuery, BooleanClause.Occur.MUST); // MUST is like an AND overallquery.add(secondQuery, BooleanClause.Occur.MUST); There is no need for a QueryParser in this case when using a TermQuery instead of a Query for q1, q2, q3 and q4: TermQuery q1 = new TermQuery(new Term("title", term1)); Regards, Paul Elschot Donna Gresh JensBurkhardt [EMAIL PROTECTED] wrote on 02/29/2008 10:46:51 AM: Hey everybody, I read that it's possible to generate a query like: (title:term1 OR author:term1) AND (title:term2 OR author:term2) and so on. I also read that BooleanClause.Occur should help me handle this problem. But I have to admit that I totally don't understand how to use it. If someone can explain or has a link to an explanation, this would be terrific. Thanks and best regards Jens Burkhardt -- View this message in context: http://www.nabble.com/MultiFieldQueryParser---BooleanClause.Occur-tp15761243p15761243.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: How to pass additional information into Similarity.scorePayload(...)
Hi Cedric, I think I'm beginning to get the point of the [10/5/2], and why you called that requirement a bit strange, see below. To use both normal position info and paragraph position info you'll need two separate fields, one normal, and one for the paragraphs. To use the normal field to determine the matches, and the paragraph field to determine the weightings of these matches, the TermPositions of both fields will have to be advanced completely in sync. That is possible, but not really nice to do. If Lucene had multiple positions for an indexed term, it would be straightforward. But as long as that is not the case, you'll either have to advance the two TermPositions in sync, or use payloads with the paragraph numbers. Or you could relax the paragraph numbering requirement into a positional requirement, and use the modified SpanFirstQuery. That could be done by using an average paragraph length to determine the weight at the matching position. As this is easy to implement, I'd first implement this and try to sell it to the users :) At that marketing moment you might as well ask the users what they think of matches that cross paragraph borders. Do you already have a firm requirement for that case? SpanNotQuery can be used to prevent matches over paragraph borders when these are indexed as such, but I would not expect that you would need those, given the fuzziness of the [10/5/2]. Regards, Paul Elschot On Friday 15 February 2008 09:45:58, Cedric Ho wrote: Hi Paul, Do you mean the following? e.g. to index this: first second third paragraphBorder forth fifth six originally it would be indexed as: (first,0) (second,1) (third,2) (forth,3) (fifth,4) (six,5) now it will be: (first,0) (second,0) (third,0) (forth,1) (fifth,1) (six,1) Then those Query classes that depend on the positional information (PhraseQuery, SpanQueries) won't work then? Unfortunately I'll need those Query classes as well.
Cedric
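The scheme discussed in this thread, an extra field in which each token's position equals its paragraph number, obtained by only allowing a position increment at a paragraph border, can be sketched in plain Java (this is an illustration of the position bookkeeping, not Lucene's TokenStream API):

```java
import java.util.*;

// Illustrative sketch: assign each token a position equal to its
// paragraph number, by using a position increment of 1 only at a
// paragraph border and 0 everywhere else. "PARA" marks a border.
public class ParagraphPositions {
    static Map<String, List<Integer>> positions(List<String> tokens) {
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        int position = 0; // paragraph number, starting at 0
        for (String token : tokens) {
            if (token.equals("PARA")) { position++; continue; } // increment only here
            index.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "first", "second", "third", "PARA", "forth", "fifth", "six");
        // first, second, third land at paragraph 0; forth, fifth, six at 1,
        // matching the (first,0) ... (six,1) example from this thread.
        System.out.println(positions(tokens));
    }
}
```

In Lucene terms, the paragraph field's tokens would carry these positions while the normal field keeps ordinary incrementing positions for PhraseQuery and SpanQuery use.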
Re: How to pass additional information into Similarity.scorePayload(...)
I have no idea what the [10/5/2] means, so I can't comment on that. In case I have missed it previously I'm sorry. My point was that payloads need not be used for different position info. It's possible to do that, and it may be good for performance in some cases, but one can revert to using another field for different position info. Regards, Paul Elschot On Thursday 14 February 2008 09:44:40, Cedric Ho wrote: Hi Paul, Sorry, I am not sure I understand your solution, because I would need to apply this scoring logic to all the different types of Queries. A search may consist of something like: +(term1 phrase2 wildcard*) +spanNear(term3 term4) [10/5/2] And this [10/5/2] ratio has to be applied to the whole search query before it. So I am not sure how using just SpanFirstQuery with a separate field would work in this situation. Anyway, I know my requirement is a bit strange, so it's ok if I can't do this in Lucene. I'll settle for using a ThreadLocal to store the [10/5/2] weighting and retrieve it in the Similarity.scorePayload(...) function. BTW, this problem I am facing now is different from the last one I asked here, for which you proposed the modified SpanFirstQuery solution =) But I am really grateful for all the help I get here. Keep up the good work! Cheers, Cedric On Thu, Feb 14, 2008 at 2:58 PM, Paul Elschot [EMAIL PROTECTED] wrote: On Thursday 14 February 2008 02:11:24, Cedric Ho wrote: I am using Lucene's built-in query classes: TermQuery, PhraseQuery, WildcardQuery, BooleanQuery and many of the SpanQueries. The info I am going to pass in is just some weightings for different parts of the indexed contents. For example if the payload indicates that a term is in the 2nd paragraph, then I'll take the weighting for the 2nd paragraph and multiply it by the score. So it seems without writing my own query there's no way to do it? In case it is only positional information that is stored in the payload (i.e.
some integer number that does not decrease when tokenizing the document), it is also possible to use an extra field and make sure the position increment for that field is only positive when the number (currently your payload) increases. A SpanFirstQuery on this extra field would almost do, and you will probably need https://issues.apache.org/jira/browse/LUCENE-1093 . This will be somewhat slower than using a payload, because the search will be done in two separate fields, but it will work. Regards, Paul Elschot
Re: How to pass additional information into Similarity.scorePayload(...)
On Friday 15 February 2008 02:47:14, Cedric Ho wrote: Sorry that I didn't make myself clear. [10/5/2] means for terms found in the 1st paragraph, give it score*10, for terms in the 2nd, give it score*5, etc. So I don't know how to do this scoring if the position (paragraph) information is in a separate field. For each word in the input stream make sure that the position at which it is indexed in an extra field is the same as the paragraph number. That will involve only allowing a position increment at a paragraph border during indexing. Call this extra field the paragraph field if you will. Then, during search, search for a Term in the paragraph field, and use the position from that field, i.e. the paragraph number, to find a weight for the found term. Have a look at PhraseQuery on how to use term positions during search. It computes relative positions, but it works on the absolute positions that it gets from the index. SpanFirstQuery also allows to do that; it's a bit more involved, but in the end it works from the same absolute positions from the index. The version at the jira issue will even allow the use of the length of the matching spans as the absolute paragraph number, which, in turn, allows the use of a Similarity for the paragraph weights [10/5/2]. There is nothing special about indexed term positions; any term can be indexed at any position in a field. Lucene will take advantage of the incremental nature of positions by storing only compressed differences of positions in the index, but during search the original positions are directly available. You can do the same with payloads, but why reimplement something that is already available? Payloads have better uses than positional info, for one they are great to avoid disjunctions. For example for verbs, one could index only the stem and use a payload for the actual inflected form (singular/plural, past/present, first/second/third person, etc).
Regards, Paul Elschot
Re: How to pass additional information into Similarity.scorePayload(...)
On Thursday 14 February 2008 02:11:24, Cedric Ho wrote: I am using Lucene's built-in query classes: TermQuery, PhraseQuery, WildcardQuery, BooleanQuery and many of the SpanQueries. The info I am going to pass in is just some weightings for different parts of the indexed contents. For example if the payload indicates that a term is in the 2nd paragraph, then I'll take the weighting for the 2nd paragraph and multiply it by the score. So it seems without writing my own query there's no way to do it? In case it is only positional information that is stored in the payload (i.e. some integer number that does not decrease when tokenizing the document), it is also possible to use an extra field and make sure the position increment for that field is only positive when the number (currently your payload) increases. A SpanFirstQuery on this extra field would almost do, and you will probably need https://issues.apache.org/jira/browse/LUCENE-1093 . This will be somewhat slower than using a payload, because the search will be done in two separate fields, but it will work. Regards, Paul Elschot
Re: recall/precision with lucene
On Saturday 09 February 2008 01:59:12, Panos Konstantinidis wrote: Hello, I am a new Lucene user. I am trying to calculate the recall/precision of a query and I was wondering if Lucene provides an easy way to do it. Currently I have a number of documents that match a given query. Then I am doing a search and I am getting back all the Hits. I then divide the number of documents that came back from Lucene (the Hits size) by the number of documents that it should have got. This is how I calculate the recall. Since you're going to use all hits for the query, it is normally better to avoid Hits and use a HitCollector or a TopDocs. For precision I just get the hits.score() of each relevant document. I am not sure if I am on the right track or if there is an easier/better way to do it. I would appreciate any insight into this. To use the score value for precision one could define a cut-off value for the score value, but then the calculation for recall would also need to be adapted. For this a HitCollector would be good. In case you want the results sorted by decreasing score value, have a look at the search methods that return TopDocs. From this one can make a precision/recall graph for the query by considering the total results higher than a given score. When a lot of such computations are needed, you may also want to cache the values of a unique identifier field for all indexed docs; have a look at FieldCache for this. Regards, Paul Elschot
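For reference, the standard precision and recall definitions under discussion can be computed from a retrieved set and a relevant set with a few lines of plain Java, independent of Lucene:

```java
import java.util.*;

// Precision = |retrieved AND relevant| / |retrieved|
// Recall    = |retrieved AND relevant| / |relevant|
public class PrecisionRecall {
    static double[] evaluate(Set<String> retrieved, Set<String> relevant) {
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant); // the true positives
        double precision = retrieved.isEmpty() ? 0 : (double) hits.size() / retrieved.size();
        double recall = relevant.isEmpty() ? 0 : (double) hits.size() / relevant.size();
        return new double[] { precision, recall };
    }

    public static void main(String[] args) {
        Set<String> retrieved = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        Set<String> relevant = new HashSet<>(Arrays.asList("d2", "d4", "d5"));
        double[] pr = evaluate(retrieved, relevant);
        // 2 of the 4 retrieved are relevant (precision 0.5);
        // 2 of the 3 relevant were retrieved (recall ~0.667).
        System.out.println("precision=" + pr[0] + " recall=" + pr[1]);
    }
}
```

With a score cut-off as Paul suggests, "retrieved" becomes the set of docs scoring above the cut-off, and varying the cut-off traces out the precision/recall graph.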
Re: Lucene syntax query matched against a string content
Without using a RAMDirectory index it would be necessary to implement all Scorers used by the query directly on top of the token stream that normally goes into the index. This is possible, but Lucene is not designed to do this, so it won't be easy. But especially for matching many preparsed queries against a small set of new documents, this might be nice to have. Still, even for that case, it would only gain performance over using RAMDirectory when the queries can be evaluated from the ground up, sharing as many subqueries as possible. And that is just the opposite of the top-down way query search is currently implemented on a prebuilt index. The basic design for this would be to start from a set of queries to be 'analyzed' to make them share as many subqueries as possible, building a query graph. Then this query graph would be fed the new documents one by one, resulting in a score for each matching query that was added to the query graph. It is possible, but it would be quite a bit of work. And then someone will come along with the requirement to match an existing index against such a query graph, which is not a bad idea either, but it might need yet another way of collecting the results. Regards, Paul Elschot On Friday 08 February 2008 05:48:08, Nilesh Bansal wrote: Hi, I want to create a function which takes in a query string (in Lucene syntax) and a string as content, and returns whether the query matches the content or not. This would mean query = +(apache) +(lucene OR httpd) will match content = "HTTPD by Apache foundation is one of the most popular open source projects" and will not match content = "Lucene and httpd are projects from same open source foundation" Basically, I need to fill in the contents of the following Java function. This should be easy to do, but I don't know how. I obviously don't want to create a dummy Lucene index in memory with a single document and then search for the query against that (for performance reasons).
public static boolean isRelevant(String luceneQuery, String contents) { // TODO fill in } Instead of boolean, it could return a relevance score, which will be zero if the query is not relevant to the document. Any help will be appreciated. thanks Nilesh
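The boolean semantics of Nilesh's example, required groups where at least one optional term per group must appear, can be illustrated with a tiny self-contained matcher. This only demonstrates the AND-of-ORs logic; real query parsing and scoring are, as Paul notes, considerably more involved.

```java
import java.util.*;

// Toy matcher for queries of the form +(a) +(b OR c): every required
// group must have at least one of its terms present in the content.
// Illustrates only the boolean semantics, not Lucene's parsing/scoring.
public class ToyBooleanMatch {
    static boolean matches(List<List<String>> requiredGroups, String content) {
        Set<String> tokens = new HashSet<>(
            Arrays.asList(content.toLowerCase().split("\\W+")));
        for (List<String> group : requiredGroups) {
            boolean any = false;
            for (String term : group) {
                if (tokens.contains(term.toLowerCase())) { any = true; break; }
            }
            if (!any) return false; // a required group matched nothing
        }
        return true;
    }

    public static void main(String[] args) {
        // +(apache) +(lucene OR httpd)
        List<List<String>> query = Arrays.asList(
            Arrays.asList("apache"),
            Arrays.asList("lucene", "httpd"));
        System.out.println(matches(query,
            "HTTPD by Apache foundation is one of the most popular open source projects"));
        System.out.println(matches(query,
            "Lucene and httpd are projects from same open source foundation"));
    }
}
```

The first content matches (both required groups are satisfied); the second does not, since "apache" is absent.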
Re: Lucene to index OCR text
On Tuesday 29 January 2008 03:32:08, Daniel Noll wrote: On Friday 25 January 2008 19:26:44, Paul Elschot wrote: There is no way to do exact phrase matching on OCR data, because no correction of OCR data will be perfect. Otherwise the OCR would have made the correction... snip suggestion to use fuzzy query The problem I see with a fuzzy query is that if you have the fuzziness set to 1, then "fat" will match "mat". But in reality, "f" and "m" don't get confused with OCR. What you really want is for a given term to expand to a boolean query of all possible misidentified alternatives. For that you would first need to figure out which characters are often misidentified as others, which can probably be achieved by going over a certain number of documents and manually checking which letters are wrong. This should provide slightly more comprehensive matching without matching terms which are obviously different to the naked eye. It's also possible to select the fuzzy terms by their document frequency, and reject all that have a (quite a bit) higher doc frequency than the given term. Combined with proximity to another similarly queried term this can work reasonably well. For query search performance, selecting only low frequency terms is nice, as it avoids searching for high frequency terms. Btw, this use of a worse spelling is more or less the opposite of suggesting a better spelling from terms with a higher doc frequency. What would be ideal is if an analyser could do this job (a "looks like" analyser, like how SoundEx is a "sounds like" analyser.) But I get the feeling that this would be very difficult. Shame the OCR software can't store this information, e.g. 80% odds that this character is a "t" but 20% odds that it's an "f". If you had that for every character it would be very useful... Ah yes, the ideal world. Is there OCR software that provides such details? Regards, Paul Elschot
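The "looks like" expansion Daniel describes can be sketched as a per-character confusion table applied to a query term. The table entries below are made-up examples for illustration, not measured OCR statistics:

```java
import java.util.*;

// Sketch of "looks like" term expansion: for each character that has
// known OCR confusions, also generate the variant spellings, which
// would then form a boolean OR of alternatives. The confusion table
// here is illustrative, not derived from real OCR error data.
public class OcrVariants {
    static final Map<Character, char[]> CONFUSIONS = new HashMap<>();
    static {
        CONFUSIONS.put('t', new char[] {'f'});      // 't' sometimes read as 'f'
        CONFUSIONS.put('l', new char[] {'1', 'i'}); // 'l' as '1' or 'i'
    }

    // Expand a term into itself plus all single-substitution variants.
    static Set<String> variants(String term) {
        Set<String> result = new LinkedHashSet<>();
        result.add(term);
        for (int i = 0; i < term.length(); i++) {
            char[] subs = CONFUSIONS.get(term.charAt(i));
            if (subs == null) continue;
            for (char s : subs) {
                result.add(term.substring(0, i) + s + term.substring(i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(variants("felt"));
    }
}
```

Per the doc-frequency suggestion above, each generated variant would then be kept only if it actually occurs in the index with a suitably low document frequency.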
Re: Lucene to index OCR text
On Friday 25 January 2008 03:46:23, Kyle Maxwell wrote: I've been poking around the list archives and didn't really come up against anything interesting. Anyone using Lucene to index OCR text? Any strategies/algorithms/packages you recommend? I have a large collection (10^7 docs) that's mostly the result of OCR. We index/search/etc. with Lucene without any trouble, but OCR errors are a problem, when doing exact phrase matches in particular. I'm looking for ideas on how to deal with this thorny problem. How about letter-by-letter ngrams coupled with SpanQueries (or more likely, a custom query utilizing the TermPositions iterator)? There is no way to do exact phrase matching on OCR data, because no correction of OCR data will be perfect. Otherwise the OCR would have made the correction... What you'll need is something like a fuzzy query as the leaves of a phrase query. Also, there may be missing word boundaries, and in that case you'll have to use a truncation query instead of a phrase query. The more fuzziness introduced in the query, the higher the chance of false matches, so there really is no single answer to this. It depends on how many false matches the users will accept and on how many OCR errors there are. One could start by adding some fuzzy term matching to phrase query, and see what the users think of that. They will lose some performance, and that is another factor in the fuzziness tradeoff. SpanQueries could be used too; for these a fuzzy term match would need to be added, as well as a query parser. When adding fuzzy term matching to a phrase query looks to be a bit daunting, have a look at the surround query parser in the contrib area. It has truncation and proximity based on span queries, but no fuzzy term matching, so it could also be a start for investigating. It all depends on how good the OCR was, but in some cases (think old paper) it's just not possible to do good OCR.
Regards, Paul Elschot
Re: Lucene Performance
On Friday 18 January 2008 17:52:27, Thibaut Britz wrote: Hi, ... Another thing I noticed is that we append a lot of queries, so we have a lot of duplicate phrases like (A and B or C) and ... and (A and B or C) (more nested than that). Is Lucene doing any internal query optimization (like Karnaugh maps) by removing the last (A and B or C), as it is not needed, or do I have to do that myself? Query optimization like Karnaugh maps is not available in Lucene. For each level of 'and' and 'or' in the (rewritten) query, as well as for all terms in the query, a separate scorer will be used during query search. The query rewrite could in principle do this, but it might affect the score values. Regards, Paul Elschot
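Since Lucene does not do this optimization itself, exactly repeated subclauses can be dropped before the query is built. A minimal sketch, using the clause's string form as the deduplication key (illustrative; with real Lucene Query objects one would rely on their equals()/hashCode() instead, and note Paul's caveat that removing clauses can change score values):

```java
import java.util.*;

// Sketch: drop repeated identical subclauses, such as the second
// "(A AND B OR C)", before handing the query to the search engine.
// Clauses are plain strings here purely for illustration.
public class DedupClauses {
    static List<String> dedup(List<String> clauses) {
        // LinkedHashSet keeps the first occurrence and preserves order
        return new ArrayList<>(new LinkedHashSet<>(clauses));
    }

    public static void main(String[] args) {
        List<String> clauses = Arrays.asList(
            "(A AND B OR C)", "(D)", "(A AND B OR C)");
        System.out.println(dedup(clauses));
    }
}
```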
Re: Self Join Query
Sachin, As the merging of the results is the issue, I'll assume that you don't have clear user requirements for that. The simplest way out of that is to allow the users to search the B's first, and once they have determined which B's they'd like to use, use those B's to limit the results of user searches in A. That would normally be done by filtering on B, much like RangeFilter. Caching that filter allows for quick repeated searches in A. Is that what the users want? For each normalization a filter can be used to search across it. One feature of filters is that the original score is lost. Would you have user requirements related to this? As the texts of A and B are the problem for reindexing, you may want to index these separately: one index for Aid+Atext, and one for Bid+Btext. That leaves the A-B 1-n association: one more index for Aid+Bids. In this last one you could also put a small text field of A. Denormalizing the Btext into Aid+Bids as Aid+Bids+Btexts can make it difficult for the users to explicitly select the B's. OTOH it makes it easy to implicitly select the B's. What do the users want? Each id field will have to be indexed to allow filtering, and stored to allow retrieval for filtering in another index. Retrieving stored fields is normally a performance bottleneck, so a FieldCache might be handy. Regards, Paul Elschot On Thursday 10 January 2008 12:58:44 sachin wrote: Here are more details about my issue. I have two tables in the database. A row in table 1 can have multiple rows associated with it in table 2. It is a one to many mapping. Let's say a row in table 1 is A and it has multiple rows B1, B2 and B3 associated with it in table 2. I need to search on both A and B types and the result should have A and all the Bs associated with it. Also for your information, A and Bs are long text in the database. I could have two approaches for indexing/searching. The first approach is to create the index in denormalized form. 
In this case the document would be like A, B1, B2, B3. The issue with this approach is that any modification to any row would require me to re-index the document again and fetch A and all Bs again from the database. This is a heavy process. The other approach is to index A, B1, B2 and B3 in different documents and after search merge the results. This makes my re-indexing lighter but I need to put extra logic to merge the results. For this type of index I would require a self join kind of query from Lucene. The query can be written by using a boolean query but merging of two types of documents is an issue. If I go by this approach for indexing, what is the best way to fetch the results? I hope I have made myself clear. Thanks Sachin On Tue, 2008-01-08 at 20:13 +0530, Developer Developer wrote: Provide more details please. Can you not use boolean query and filters if need be ? On Jan 8, 2008 7:23 AM, sachin [EMAIL PROTECTED] wrote: I need to write a Lucene query something similar to SQL self joins. My current implementation is very primitive. I fire the first query, get the results, based on the result of the first query I fire a second query and then merge the results from both the queries. The whole processing is very expensive. Doing this is very easy with an SQL query as we need to just write a self join query and the database does the rest for you. What is the best way of implementing the above functionality in Lucene? Regards Sachin
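The application-side merge for the second (normalized) approach can be sketched as follows. This is hypothetical Python, assuming the searches against the A index and the B index have already produced id/document pairs:

```python
def join_results(a_hits, b_hits):
    """Merge hits from an A index and a B index on the shared Aid.
    a_hits: {aid: a_doc}; b_hits: list of (aid, b_doc) pairs."""
    merged = {aid: {"a": doc, "bs": []} for aid, doc in a_hits.items()}
    for aid, b_doc in b_hits:
        if aid in merged:          # keep only B's whose parent A matched
            merged[aid]["bs"].append(b_doc)
    return merged
```

This is the "extra logic to merge the results" the thread talks about; filtering A by cached B filters, as suggested in the reply, pushes the same join into the search itself.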
Re: Query processing with Lucene
On Tuesday 08 January 2008 22:49:18 Doron Cohen wrote: This is done by Lucene's scorers. You should however start in http://lucene.apache.org/java/docs/scoring.html, - scorers are described in the Algorithm section. Offsets are used by Phrase Scorers and by Span Scorer. That is for the case that offsets were meant to be positions within a document. It is also possible that offsets were meant in the sense of using skipTo(doc) instead of next() on a Scorer. This is done during query search when at least one term is required. Regards, Paul Elschot Doron On Jan 8, 2008 11:24 PM, Marjan Celikik [EMAIL PROTECTED] wrote: Doron Cohen wrote: Hi Marjan, Lucene processes the query in what can be called one-doc-at-a-time fashion. For the example query - x y - (not the phrase query x y) - all documents containing either x or y are considered a match. When processing the query - x y - the posting lists of these two index terms are traversed, and for each document met on the way, a score is computed (taking into account both terms), and collected. At the end of the traversal, usually the best N collected docs are returned as the search result. So, this is an exhaustive computation creating a union of the two postings. For the query - +x +y - an intersection rather than a union is required, and the way Lucene does it is again to traverse the two posting lists, just that only documents seen in both lists are scored and collected. This makes it possible to optimize the search, skipping large chunks of the posting lists, especially when one term is rarer than the other. Thank you for your answer. I am having trouble finding the function which traverses the documents such that they get scored. Can you please tell me where the posting lists (for a +x +y query) get intersected after they get read (by next() I guess) from the index? In particular, I am interested in how Lucene gets the new positions (offsets) of the documents seen in both posting lists, i.e. 
positions (in a document) for the query word x, and positions for the query word y. Thank you in advance! Marjan.
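The +x +y intersection described above can be illustrated with plain sorted lists of doc ids. This is a sketch only, not Lucene's actual ConjunctionScorer (which advances whole scorers, using skipTo() to leap over chunks of a posting list rather than stepping an index):

```python
def intersect(postings_x, postings_y):
    """Doc-at-a-time intersection of two sorted posting lists,
    leapfrogging the way a conjunction of scorers does with skipTo()."""
    hits, i, j = [], 0, 0
    while i < len(postings_x) and j < len(postings_y):
        dx, dy = postings_x[i], postings_y[j]
        if dx == dy:
            hits.append(dx)        # document seen in both lists: a match
            i += 1
            j += 1
        elif dx < dy:
            i += 1                 # skipTo(dy) would jump here in big steps
        else:
            j += 1
    return hits
```

The "especially when one term is rarer than the other" remark corresponds to the rare list forcing large skips in the frequent list.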
Re: Can I do boosting based on term postions?
On Tuesday 18 December 2007 14:59:45 Peter Keegan wrote: Should I open a Jira issue? What shall I say? http://www.apache.org/foundation/how-it-works.html Regards, Paul Elschot
Re: Field weights
Karl, This might work for you: https://issues.apache.org/jira/browse/LUCENE-293 Regards, Paul Elschot On Friday 14 December 2007 18:06:01 Karl Wettin wrote: I have an index that contains three sorts of documents: Car brand Tire brand Tire pressure (Please bear with me, the real index has nothing to do with cars. I just try to explain the problem in an alternative domain to avoid NDA conflicts.) There is a hierarchical composite relationship between these sorts of documents. A document describing tire pressure also contains tire brand and car brand. A document describing tire brand also contains information about car brand. A document describing car brand contains only that. The requirement is that the consumer of the API should not have to specify what fields they are searching in. There is no time (nor training data) to implement a hidden Markov model (HMM) tokenizer or something along that path in order to extract possible attributes from the query string. Instead the query string is tokenized once per field and assembled into one huge query. Normally this works fairly well. Here are some example documents:
Volvo
Volvo, Michelin
Volvo, Nokian
Volvo, Nokian, 2.2 bars
Volvo, Firestone, 2.4 bars
Saab
Saab, Michelin
Saab, Nokian
Saab, Nokian, 2.1 bars
Saab, Firestone
Saab, Firestone, 2.4 bars
Saab, Firestone, 2.5 bars
If I search for Saab the top result will be the document representing the car brand Saab. The query would look like this: car:saab tire:saab pressure:saab But let's say Saab starts manufacturing tires too:
Saab
Saab, Saab tires
Saab, Saab tires, 1.9 bars
Saab, Saab tires, 1.8 bars
If I search for Saab I still want the top result to be Saab the car brand. But it no longer is; the match for Saab, Saab tires now has a greater score than Saab, of course. My idea is to work along the line of indexing Saab in the tire brand and tire pressure field too. Now searching for Saab will yield a result where the car brand Saab is the top result. 
However, this will not work as I have different tokenization strategies for each field (stemming and what not). Tokenizing the query string Saab for the field tire brand in Swedish might end up as saa and will thus not find the token Saab inserted for the document describing the car brand Saab. I have a couple of experiments in my head I need to try out, starting with tokenizing query strings per field and using the tokens generated for the field car brand as the query in the tire brand and tire pressure fields too. And vice versa. Any brilliant ideas that might work? Hacky solutions are OK.
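The tokenize-once-per-field assembly described in this thread can be sketched like this. A hypothetical illustration; the analyzer functions here are stand-ins for real per-field Lucene analyzers (the second one mimics the Swedish stemmer that turns "Saab" into "saa"):

```python
def build_query(query_string, analyzers):
    """Tokenize the same query string once per field with that field's
    analyzer and OR the per-field clauses into one big query string."""
    clauses = []
    for field, analyze in analyzers.items():
        for token in analyze(query_string):
            clauses.append(f"{field}:{token}")
    return " ".join(clauses)
```

The mismatch Karl describes is visible here: a token indexed verbatim in one field will not match a query token produced by a different field's analyzer, which is why the query tokens must be generated with each field's own tokenization.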
Re: Scoring for all the documents in the index relative to a query
Gentlefolk, Well, the javadocs as patched at LUCENE-584 try to change all the cases of zero scoring to 'non matching'. I'm happily bracing for a minor conflict with that patch. In case someone wants to take another look at the javadocs as patched there, don't let me stop you... Regards, Paul Elschot On Monday 19 November 2007 23:35:07 Yonik Seeley wrote: On Nov 19, 2007 5:03 PM, Chris Hostetter [EMAIL PROTECTED] wrote: (I'm not actually sure how the Hits class treats negative values All Lucene search methods except the ones that take a HitCollector filter out final scores <= 0. Solr does allow scores <= 0 through, since it had different collection methods to avoid score normalization (back when Lucene still did it for TopDocs). -Yonik
Re: Search performance using BooleanQueries in BooleanQueries
On Tuesday 06 November 2007 23:14:01 Mike Klaas wrote: On 29-Oct-07, at 9:43 AM, Paul Elschot wrote: On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote: +prop1:a +prop2:b +prop3:c +prop4:d +prop5:e is much faster than (+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +prop5:e) where the second one is a result from BooleanQuery in BooleanQuery, and all have Occur.MUST. Simplifying boolean queries like this is not available in Lucene, but it would have a positive effect on search performance, especially when prop1:a and prop2:b have a high document frequency. Wait--shouldn't the outer-most BooleanQuery provide most of this speedup already (since it should be skipTo'ing between the nested BooleanQueries and the outermost). Is it the indirection and sub-query management that is causing the performance difference, or differences in skipTo behaviour? The usual Lucene answer to performance questions: it depends. After every hit, next() needs to be called on a subquery before skipTo() can be used to find the next hit. It is currently not defined which subquery will be used for this first next(). The structure of the scorers normally follows the structure of the BooleanQueries, so the indirection over the deep subquery scorers could well be relevant to performance, too. Which of these factors actually dominates performance is hard to predict in advance. The point of skipTo() is that it tries to avoid disk I/O as much as possible for the first time that the query is executed. Later executions are much more likely to hit the OS cache, and then the indirections will be more relevant to performance. I'd like to have a good way to do a performance test on a first query execution, in the sense that it does not hit the OS cache for its skipTo() executions, but I have not found a good way yet. Regards, Paul Elschot
Re: 2/3 of terms matched + coverage filter
On Wednesday 31 October 2007 14:51:12 Tobias Hill wrote: My documents all have a field with a variable number of terms (but rather few): Doc1.field = foo bar gro Doc2.field = foo bar gro mot slu Now I would like to search using the terms foo bar gro Problem 1: I'd like to express that at least any two of the three terms must match. Do I have to construct this clause myself - i.e. (foo bar) | (foo gro) | (bar gro), or is there some clever way to do this? BooleanQuery.setMinimumNumberShouldMatch(int) does this, have a look at the javadocs for the details. Problem 2: I'd like to express that if the doc.field has too many terms that weren't matched it should not be included at all in the result. In the example above Doc2 might have too many terms that were not matched to be included in the result. Is this kind of query possible, and how? The general case: I want to find those docs that have X% of the search terms matched and where the actual match covers at least Y% of the available terms on the document. This Y% is not directly possible, but I would expect the default document score to correlate reasonably well with coverage. In case you want an exact Y% cutoff, you'll run into the fact that the field norm (the inverse square root of the field length) is encoded in only 8 bits, which is rather coarse. Regards, Paul Elschot
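The two requirements can be sketched together as a post-filter on term sets. Illustrative Python only: setMinimumNumberShouldMatch handles the first requirement inside Lucene, while the Y% coverage test below is the part that, as noted above, Lucene does not offer directly:

```python
def matches(doc_terms, query_terms, min_match=2, min_coverage=0.5):
    """Require at least min_match query terms in the field, and require
    the matched terms to cover at least min_coverage of the field."""
    matched = sum(1 for t in query_terms if t in doc_terms)
    coverage = matched / len(doc_terms) if doc_terms else 0.0
    return matched >= min_match and coverage >= min_coverage
```

With the thread's example, Doc1 (foo bar gro) always passes, while Doc2 (foo bar gro mot slu) passes or fails depending on the coverage cutoff, since only 3 of its 5 terms are matched.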
Re: Looking for Exact match but no other terms... how to express it?
On Tuesday 30 October 2007 16:58:09 Tobias Hill wrote: I want to match on the exact phrase foo bar dot on a specific field on my set of documents. I only want results where that field has exactly foo bar dot and no more terms. I.e. a document with foo bar dot alu should not match. A phrase query with slop 0 seems reasonable, but how do I express but nothing more than these terms? Another way to do this is by indexing a special begin and end token before and after the tokens of the field, and by extending your queries with these special tokens, for example: =begin= foo bar dot =end= . Regards, Paul Elschot
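The begin/end token trick can be sketched with plain token lists. Illustrative Python; in Lucene the sentinels would be injected by the analyzer at indexing time, and the comparison below corresponds to a slop-0 phrase query that includes the sentinels:

```python
BEGIN, END = "=begin=", "=end="

def index_field(tokens):
    """Surround the field's tokens with sentinel tokens at indexing time."""
    return [BEGIN, *tokens, END]

def exact_field_match(indexed, phrase):
    """An exact whole-field match is a slop-0 phrase including sentinels:
    any extra term pushes =end= out of position and the phrase fails."""
    return indexed == [BEGIN, *phrase, END]
```

This is why "foo bar dot alu" no longer matches: the sentinel after "dot" is displaced by "alu".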
Re: Search performance using BooleanQueries in BooleanQueries
On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote: Hello, I am seeing that a query with boolean queries in boolean queries takes much longer than just a single boolean query when the number of hits is fairly large. For example +prop1:a +prop2:b +prop3:c +prop4:d +prop5:e is much faster than (+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +prop5:e) where the second one is a result from BooleanQuery in BooleanQuery, and all have Occur.MUST. Is there a way to detect and rewrite the second inefficient query? query.rewrite() does not change the query AFAICS. Simplifying boolean queries like this is not available in Lucene, but it would have a positive effect on search performance, especially when prop1:a and prop2:b have a high document frequency. You could write this yourself, for example by overriding BooleanQuery.rewrite(). Take care about query weights, though. Regards, Paul Elschot thanks for any help, Regards Ard
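The rewrite suggested above (overriding BooleanQuery.rewrite() to collapse nested required clauses) amounts to collecting the leaves of an AND chain. A hypothetical sketch with queries as nested tuples, ignoring the query-weight caveat Paul mentions:

```python
def flatten_must(query):
    """Collect the required leaf clauses of a nested AND chain, so that
    (AND, (AND, (AND, p1, p2), p3), p4) becomes (AND, p1, p2, p3, p4)."""
    def leaves(q):
        if isinstance(q, tuple) and q[0] == "AND":
            out = []
            for c in q[1:]:
                out.extend(leaves(c))   # recurse into nested conjunctions
            return out
        return [q]                      # anything else is kept as a clause
    return ("AND", *leaves(query))
```

The flat form corresponds to the faster +prop1:a +prop2:b +prop3:c +prop4:d +prop5:e query: one conjunction scorer over five terms instead of a chain of nested scorers.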
Re: Cache BitSet or doc number?
Have a look at decoupling Filter from BitSet: http://issues.apache.org/jira/browse/LUCENE-584 There also is a SortedVIntList there that stores document numbers more compactly than BitSet, and an implementation of CachingFilterQuery (iirc) that chooses the more compact representation of BitSet and SortedVIntList. Regards, Paul Elschot On Saturday 27 October 2007 02:15:48 Yonik Seeley wrote: On 10/26/07, John Patterson [EMAIL PROTECTED] wrote: Thom Nelson wrote: Check out the HashDocSet from Solr, this is the best way to cache small sets of search results. In general, the Solr BitSet/DocSet classes are more efficient than using the standard java.util.BitSet. You can use these independent of the rest of Solr (though I recommend checking out Solr if you want to do complex caching). I imagine the fastest way to combine cached results is to store them in an array ordered by doc number so that the ConjunctionQuery can use them directly. The Javadoc for HashDocSet says that they are stored out of order which would make this impossible. You're speaking at quite an abstract level... it really depends on what specific issue you are seeing that you're trying to solve. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Adding support for NOT NEAR construct?
Dave, One can use SpanNotQuery to get NOT NEAR by using this generalized structure: SpanNot(foo, SpanNear(foo, bar, distance)) This also allows for example: SpanNot(two, SpanNear(one, three, distance)) Btw. I don't know of any query language that has this second form. AND NOT normally does not work for this because it works on doc level and not within the matching text of a field. Regards, Paul Elschot On Wednesday 17 October 2007 17:57:21 Dave Golombek wrote: We've run into a situation where having NOT NEAR queries would really help. I haven't been able to find any discussion of adding this to Lucene in the past, so wanted to ask if people had any comments about it before I started trying to make the change. I've looked at NearSpansUnordered and it seems that reversing the logic in atMatch() would go a long way towards implementation; NearSpansOrdered is a bit harder, depending upon the exact semantics of NOT NEAR that we want to implement. For queries, I was thinking that either foo bar~-10 or foo bar!~10 might be reasonable; the former should be pretty easy to parse. Does this sound reasonable? Something for contrib? Thanks, Dave Golombek Senior Software Engineer Black Duck Software, Inc. [EMAIL PROTECTED] T +1.781.810.2079 F +1.781.891.5145 C +1.617.230.5634 http://www.blackducksoftware.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
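The SpanNot(foo, SpanNear(foo, bar, distance)) structure can be illustrated on term positions directly. A simplified sketch (single-position spans, symmetric distance, unordered), not Lucene's span machinery:

```python
def span_not_near(foo_positions, bar_positions, distance):
    """Keep occurrences of foo that have no bar within `distance`
    positions: the NOT NEAR described as SpanNot(foo, SpanNear(foo, bar, d))."""
    return [p for p in foo_positions
            if all(abs(p - b) > distance for b in bar_positions)]
```

Unlike a doc-level AND NOT, a document can still match as long as at least one occurrence of foo is far enough from every bar.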
Re: Scoring a single document from a corpus based on a given query
On Wednesday 10 October 2007 18:44, lucene_user wrote: I would like to score a single document from a corpus based on a given query. The formula score(q,d) is basically what I am looking for. Pseudo Code of Something Close to what I am looking for: indexReader.score(query, documentId); The formula score(q,d) is used throughout the documentation to describe similarity but there does not seem to be a corresponding java method. I could work around the issue by applying a search filter to only consider the particular document I am looking for. I was hoping for a cleaner approach. You can try this: Explanation e = indexSearcher.explain(query, documentId); and get the score value from the explanation. Have a look at the code of any Scorer.explain() method on how to get the score value only. There really is no need to filter in this case. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Scorer skipTo() expectations?
Dan, In Scorers, when skipTo() or next() returns true for the second or later time, the result of doc() will be increased. When Scorer.skipTo() does not have document order, documents will be lost, which means that not all matching documents will be found by the search. For disjunctions (OR), one needs to merge the documents of two Scorers using next() to iterate over the documents. The merging is normally done on the fly using a specialized priority queue on the doc() values in DisjunctionSumScorer. No sorting of complete document lists is done at search time, that is done at indexing time. And since TermScorer uses the index directly, it will always return documents in order. The only exception to document ordering is BooleanScorer.next(), which is used by BooleanQuery for some cases of top level disjunctions, and then only when documents are allowed to be scored out of order. The reason for that is performance, BooleanScorer uses a faster data structure than a priority queue, but BooleanScorer does not implement skipTo(). Regards, Paul Elschot On Thursday 04 October 2007 09:12, Dan Rich wrote: Hi, I have a custom Query class that provides a long list of lucene docIds (not for filtering purposes), which is one clause in a standard BooleanQuery (which also contains TermQuery instances). I have a custom Scorer that goes along with the custom Query class. What (if any) document ordering requirements does the Scorer class have for its skipTo(int docId) method? In particular, currently I'm sorting/returning the docIds in ascending order from my custom Query class. That can be expensive for large docId lists; is sorting necessary? It looks like skipTo() might expect the documents it gets to be in ascending order to behave correctly as part of a BooleanQuery, but I can't tell for sure from the doc. If the document list from my custom Scorer class does not have its document list in ascending order (e.g. 
10, 80, 40, 60, 50) will whatever uses skipTo() potentially lose hits? If not, is there any performance concern with having the docIds unordered?
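The priority-queue merge that a disjunction performs on doc() values can be sketched with heapq. Illustrative only; the real DisjunctionSumScorer also sums the per-clause scores while merging, and this sketch shows why each input list must be in ascending order:

```python
import heapq

def disjunction_docs(posting_lists):
    """Merge several sorted posting lists into one ascending stream of
    unique doc ids, the way a disjunction's priority queue does."""
    merged, last = [], None
    for doc in heapq.merge(*posting_lists):
        if doc != last:            # a doc matching several clauses appears once
            merged.append(doc)
            last = doc
    return merged
```

If one input list were out of order (e.g. 10, 80, 40, ...), heapq.merge's output would no longer be sorted and the duplicate suppression would silently drop or repeat hits, which is the losing-documents behaviour described above.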
Re: a query for a special AND?
As for suggestions on how to do this, I have no other than to make sure that you can create the queries necessary to obtain the required output. Regards, Paul Elschot On Sunday 30 September 2007 09:20, Mohammad Norouzi wrote: Hi Paul, thanks, I dot your idea, now I am planing to implement this, de-normalization, now I just need your suggestion on this issue and tell me which one is the best I am considering to put a Field as follows: my_de_normalized_field service_name : service_value if there are more than one service and value then I need to separate with a character such as comma service_name1:value1 , service_name2: value2, apart from that, each value might be a range value say, 10 - 20 or a single value say, -2 do you have any suggestion on this? thank you very much On 9/20/07, Paul Elschot [EMAIL PROTECTED] wrote: On Thursday 20 September 2007 09:19, Mohammad Norouzi wrote: well, you mean we should separate documents just like relational tables in databases ? Quite the contrary, it's called _de_normalization. This means that the documents in lucene normally contain more information than is present in a single relational entity. if yes, how to make the relationship between those documents Lucene has no facilities to maintain relational relationships among its documents. A lucene index allows free format documents, i.e. any document may have any field or not. In practice you will need at least a primary key, but even that you will need to program yourself. Regards, Paul Elschot thank you so much Paul On 9/20/07, Paul Elschot [EMAIL PROTECTED] wrote: On Thursday 20 September 2007 07:29, Mohammad Norouzi wrote: Sorry Paul I just hurried in replying ;) I read the documents of Lucene about query syntax and I figured out the what is the difference but my problem is different, this is preoccupied my mind and I am under pressure to solve this problem, after analyzing the results I get, now I think we need a group by in our query. 
let me tell you an example: we need a list of patients that have been examined by certain services specified by the user , say service one and service two. in this case here is the correct result: patient-id service_name patient_result 1 s112 1 s213 2 s1 41 2 s222 but for example, following is incorrect because patient 1 has no service with name service2: patient-id service_name patient_result 1 s112 1 s313 That depends on what you put in your lucene documents. You can only get complete lucene documents as query results. For the above example a patient with all service names should be indexed in a single lucene doc. The rows above suggest that the relation between patient and service forms the relational result. However, for a text search engine it is usual to denormalize the relational records into indexed documents, depending on the required output. Regards, Paul Elschot On 9/20/07, Mohammad Norouzi [EMAIL PROTECTED] wrote: Hi Paul, would you tell me what is the difference between AND and + ? I tried both but get different result with AND I get 1777 documents and with + I get nearly 25000 ? On 9/17/07, Paul Elschot [EMAIL PROTECTED] wrote: On Monday 17 September 2007 11:40, Mohammad Norouzi wrote: Hi I have a problem in getting correct result from Lucene, consider we have an index containing documents with fields field1 and field2 etc. now I want to have documents in which their field1 are equal one by one and their field2 with two different value to clarify consider I have this query: field1:val* (field2:myValue1 XOR field2:myValue2) Did you try this: +field1:val* +field2:myValue1 +field2:myValue2 Regards, Paul Elschot now I want this result: field1 field2 val1myValue1 val1myValue2 val2myValue1 val2myValue2 this result is not acceptable: val3 myValue1 or val4 myValue1 val4 myValue3 I put XOR as operator
Re: Translating Lucene Query Syntax to Traditional Boolean Syntax
On Tuesday 25 September 2007 03:05, Martin Bayly wrote: We have an application that performs searches against a Lucene based index and also against a Windows Desktop Search based index. For simple queries we'd like to offer our users a consistent interface that allows them to build basic Lucene style queries using the 'MUST HAVE' (+), 'MUST NOT HAVE' (-) and 'SHOULD HAVE' style of operators as this is probably more intuitive for non 'Boolean Logic' literate users. We would not allow them to use any grouping (parenthesis). Clearly we can pass this directly to Lucene, but for the Windows Desktop Search we need to translate the Lucene style query into a more traditional Boolean query. So this is the opposite of the much discussed Boolean Query to Lucene Query conversion. I'm wondering if anyone has ever done this or whether there is a concept mismatch in there somewhere that will make it difficult to do? My thought was that you could take the standard Lucene operators and simply group them together as follows: e.g. (assuming the Lucene default OR operator) Lucene: +a +b -c -d e f would translate to: (a AND b NOT c NOT d) OR (a AND b NOT c NOT d AND (e OR f)) If I put this back into Lucene (actually Lucene.NET but hopefully its the same) I get back: (+a +b -c -d)(+a +b -c -d +(e f)) which I think is equivalent but not as concise! But I have not tested this against a big index to see if it's equivalent and I have a suspicion that Lucene might score the two versions of the Lucene representation differently. But that's probably not an issue provided the Boolean representation is semantically equivalent to the first Lucene representation. Anyone ever tried this before or have any comments on whether my 'logic' is flawed! Under the hood, the Scorer for a BooleanQuery, BooleanScorer2, does the conversion from + and - to boolean operators in slightly different, but more concise way. It basically maps the boolean query syntax to four operators: AND, OR, ANDNOT, ANDoptional. 
The mapping is only basic because, as a Scorer, it needs to map to other Scorers, and these are: ConjunctionScorer, DisjunctionSumScorer, ReqExclScorer and ReqOptSumScorer, respectively. For ANDoptional the first subquery is required, and the second one is optional. This is a bit like ANDNOT, in which the first query is required, and the second one is prohibited. The addition of ANDoptional/ReqOptSumScorer allows the conciseness, while keeping equivalent semantics. So I think your logic is not flawed, it's just that the traditional set of boolean operators is somehow incomplete for optional subqueries. Mapping + and - to these four operators is not really straightforward, among others because of the coordination factor, and because of the many different possible situations. You're invited to have a look at the source code of BooleanScorer2. There are also test cases for the equivalence of the semantics, see the TestBoolean* classes. Regards, Paul Elschot P.S. When documents may be scored out of order, for some disjunctions (OR), BooleanScorer is used for performance.
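The four-operator mapping can be sketched as a string rewrite. Hypothetical Python mirroring how BooleanScorer2 nests ConjunctionScorer, ReqExclScorer and ReqOptSumScorer (coordination factors ignored):

```python
def to_boolean(required, prohibited, optional):
    """Translate Lucene-style +/- clause lists into the operators the
    thread names: AND, ANDNOT, and ANDoptional (required plus optional)."""
    expr = " AND ".join(required)            # + clauses: plain conjunction
    for p in prohibited:
        expr = f"({expr}) ANDNOT {p}"        # - clauses: required-excluded
    if optional:                             # bare clauses: affect score only
        expr = f"({expr}) ANDoptional ({' OR '.join(optional)})"
    return expr
```

Applied to the thread's example +a +b -c -d e f, this yields a single nested expression rather than the duplicated (a AND b NOT c NOT d) form, which is exactly the conciseness ANDoptional buys.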
Re: a query for a special AND?
On Thursday 20 September 2007 07:29, Mohammad Norouzi wrote: Sorry Paul I just hurried in replying ;) I read the documents of Lucene about query syntax and I figured out what the difference is, but my problem is different; this has preoccupied my mind and I am under pressure to solve this problem. After analyzing the results I get, now I think we need a group by in our query. let me tell you an example: we need a list of patients that have been examined by certain services specified by the user, say service one and service two. in this case here is the correct result:
patient-id  service_name  patient_result
1           s1            12
1           s2            13
2           s1            41
2           s2            22
but for example, the following is incorrect because patient 1 has no service with name service2:
patient-id  service_name  patient_result
1           s1            12
1           s3            13
That depends on what you put in your lucene documents. You can only get complete lucene documents as query results. For the above example a patient with all service names should be indexed in a single lucene doc. The rows above suggest that the relation between patient and service forms the relational result. However, for a text search engine it is usual to denormalize the relational records into indexed documents, depending on the required output. 
now I want to have documents in which their field1 are equal one by one and their field2 with two different value to clarify consider I have this query: field1:val* (field2:myValue1 XOR field2:myValue2) Did you try this: +field1:val* +field2:myValue1 +field2:myValue2 Regards, Paul Elschot now I want this result: field1 field2 val1myValue1 val1myValue2 val2myValue1 val2myValue2 this result is not acceptable: val3 myValue1 or val4 myValue1 val4 myValue3 I put XOR as operator because this is not a typical OR, it's different, it means documents that contains both myValue1 and myValue2 for the field field2 how to build a query to get such result? thanks in advance -- Regards, Mohammad -- see my blog: http://brainable.blogspot.com/ another in Persian: http://fekre-motefavet.blogspot.com/ Sun Certified Java Programmer ExpertsExchange Certified, Master: http://www.experts-exchange.com/M_1938796.html - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Regards, Mohammad -- see my blog: http://brainable.blogspot.com/ another in Persian: http://fekre-motefavet.blogspot.com/ Sun Certified Java Programmer ExpertsExchange Certified, Master: http://www.experts-exchange.com/M_1938796.html -- Regards, Mohammad -- see my blog: http://brainable.blogspot.com/ another in Persian: http://fekre-motefavet.blogspot.com/ Sun Certified Java Programmer ExpertsExchange Certified, Master: http://www.experts-exchange.com/M_1938796.html - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: a query for a special AND?
On Thursday 20 September 2007 09:19, Mohammad Norouzi wrote: well, you mean we should separate documents just like relational tables in databases ? Quite the contrary, it's called _de_normalization. This means that the documents in lucene normally contain more information than is present in a single relational entity. if yes, how to make the relationship between those documents Lucene has no facilities to maintain relational relationships among its documents. A lucene index allows free format documents, i.e. any document may have any field or not. In practice you will need at least a primary key, but even that you will need to program yourself. Regards, Paul Elschot thank you so much Paul On 9/20/07, Paul Elschot [EMAIL PROTECTED] wrote: On Thursday 20 September 2007 07:29, Mohammad Norouzi wrote: Sorry Paul I just hurried in replying ;) I read the documents of Lucene about query syntax and I figured out the what is the difference but my problem is different, this is preoccupied my mind and I am under pressure to solve this problem, after analyzing the results I get, now I think we need a group by in our query. let me tell you an example: we need a list of patients that have been examined by certain services specified by the user , say service one and service two. in this case here is the correct result: patient-id service_name patient_result 1 s112 1 s213 2 s1 41 2 s222 but for example, following is incorrect because patient 1 has no service with name service2: patient-id service_name patient_result 1 s112 1 s313 That depends on what you put in your lucene documents. You can only get complete lucene documents as query results. For the above example a patient with all service names should be indexed in a single lucene doc. The rows above suggest that the relation between patient and service forms the relational result. However, for a text search engine it is usual to denormalize the relational records into indexed documents, depending on the required output. 
Regards,
Paul Elschot

> On 9/20/07, Mohammad Norouzi [EMAIL PROTECTED] wrote:
>> Hi Paul, would you tell me what the difference is between AND and + ? I tried both but got different results: with AND I get 1777 documents, and with + I get nearly 25000.
>>
>> On 9/17/07, Paul Elschot [EMAIL PROTECTED] wrote:
>>> On Monday 17 September 2007 11:40, Mohammad Norouzi wrote:
>>>> Hi, I have a problem getting the correct result from Lucene. Consider an index containing documents with fields field1, field2, etc. Now I want the documents in which the field1 values are equal one by one and field2 has two different values. To clarify, consider this query:
>>>>
>>>>   field1:val* (field2:myValue1 XOR field2:myValue2)
>>>
>>> Did you try this:
>>>
>>>   +field1:val* +field2:myValue1 +field2:myValue2
>>>
>>> Regards,
>>> Paul Elschot
>>>
>>>> Now I want this result:
>>>>
>>>>   field1  field2
>>>>   val1    myValue1
>>>>   val1    myValue2
>>>>   val2    myValue1
>>>>   val2    myValue2
>>>>
>>>> This result is not acceptable:
>>>>
>>>>   val3    myValue1
>>>>
>>>> or:
>>>>
>>>>   val4    myValue1
>>>>   val4    myValue3
>>>>
>>>> I put XOR as the operator because this is not a typical OR; it's different: it means documents that contain both myValue1 and myValue2 for the field field2. How do I build a query to get such a result? Thanks in advance.
>>>>
>>>> --
>>>> Regards,
>>>> Mohammad
>>>> see my blog: http://brainable.blogspot.com/
>>>> another in Persian: http://fekre-motefavet.blogspot.com/
>>>> Sun Certified Java Programmer
>>>> ExpertsExchange Certified, Master: http://www.experts-exchange.com/M_1938796.html
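The semantics of Paul's suggested query — each `+` marks a required clause, so a document matches only when every required clause matches — can be sketched without Lucene. A conceptual Python illustration (the documents and matching function are illustrative, not Lucene API):

```python
# Conjunction ("+") semantics sketch: a document matches only when
# every required clause matches. field1 uses a prefix test (val*);
# field2 is multi-valued and must contain BOTH required terms,
# mirroring: +field1:val* +field2:myValue1 +field2:myValue2
docs = [
    {"field1": "val1", "field2": ["myValue1", "myValue2"]},  # acceptable
    {"field1": "val3", "field2": ["myValue1"]},              # missing myValue2
    {"field1": "val4", "field2": ["myValue1", "myValue3"]},  # missing myValue2
]

def matches(doc):
    return (doc["field1"].startswith("val")      # +field1:val*
            and "myValue1" in doc["field2"]      # +field2:myValue1
            and "myValue2" in doc["field2"])     # +field2:myValue2

hits = [d["field1"] for d in docs if matches(d)]
```

This also shows why the denormalized single-document-per-entity layout matters: the "both values in one field" requirement is only expressible when both values live in the same document.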
Re: a query for a special AND?
On Monday 17 September 2007 11:40, Mohammad Norouzi wrote:

> Hi, I have a problem getting the correct result from Lucene. Consider an index containing documents with fields field1, field2, etc. Now I want the documents in which the field1 values are equal one by one and field2 has two different values. To clarify, consider this query:
>
>   field1:val* (field2:myValue1 XOR field2:myValue2)

Did you try this:

  +field1:val* +field2:myValue1 +field2:myValue2

Regards,
Paul Elschot

> Now I want this result:
>
>   field1  field2
>   val1    myValue1
>   val1    myValue2
>   val2    myValue1
>   val2    myValue2
>
> This result is not acceptable:
>
>   val3    myValue1
>
> or:
>
>   val4    myValue1
>   val4    myValue3
>
> I put XOR as the operator because this is not a typical OR; it's different: it means documents that contain both myValue1 and myValue2 for the field field2. How do I build a query to get such a result? Thanks in advance.
>
> --
> Regards,
> Mohammad
Re: Span queries and complex scoring
Cedric,

In case your requirements allow this, try using a subclass of Spans that has a score() method returning a value that is used, together with the other span info, to provide a score value to your own SpanScorer at the top level. This score value can summarize the influence of the individual span scores of the subqueries. For this you will need to change the whole span package, but it is somewhat simpler than using a complete Scorer for each SpanQuery in the query tree.

With a lot of nested SpanOrQueries, merging the Spans can become a performance bottleneck. The current situation can be improved by creating a specialized PriorityQueue for Spans, much like the ScorerDocQueue that is used by DisjunctionSumScorer. With this, it is possible to avoid SpanOrQuery by using term payloads to compute the score value for the Spans of a SpanTermQuery, but iirc the payloads are not yet in the trunk.

Regards,
Paul Elschot

On Tuesday 11 September 2007 16:17, melix wrote:

> Hi,
>
> I'm working on an application which requires complex scoring (based on semantic analysis). The scoring must be highly configurable, and I've found ways to do that, but I'm facing a discrete but annoying problem. All my queries are, basically, complex span queries: for example, a SpanNearQuery which embeds a SpanOrQuery which itself may embed another SpanNearQuery, etc. I've followed the instructions at http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/package-summary.html#changingScoring about changing scoring.
>
> The problem is that a document's score is highly dependent on *what* matched, and the getSpans() method on span queries does not provide that kind of information. I created my own SpanQuery subclasses which override the createWeight method so that the scorer used is my own too. It basically replaces the SpanScorer, and should recurse the spans tree to compose a score based on the type of subqueries (near, and, or, not) and what matched.
> The problem is that the getSpans() methods that exist in Lucene are either anonymous classes which I cannot browse, or I do not have access to the required information. Basically, in a SpanOrQuery, I am not able to find out what matched. Have any of you faced that kind of problem and found an elegant way to do it without having to completely rewrite each getSpans() method for all types of queries (this is basically what was done in a previous version of the application)?
>
> Thanks,
> Cedric
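The merge that Paul identifies as the SpanOrQuery bottleneck is essentially a k-way merge of sorted (doc, start, end) streams, which is exactly what a priority queue does well. A conceptual sketch in Python using `heapq` (the span tuples and function names are illustrative, not the Lucene span package):

```python
import heapq

# SpanOrQuery-style merge sketch: each sub-query enumerates its spans
# as (doc, start, end) tuples in sorted order; a heap-based merge
# yields the union in global (doc, start, end) order, which is what a
# specialized PriorityQueue for Spans would do inside Lucene.
def merge_spans(*span_streams):
    # heapq.merge lazily merges already-sorted iterables via a heap.
    return list(heapq.merge(*span_streams))

a = [(1, 0, 2), (3, 5, 7)]   # spans from one sub-query
b = [(1, 1, 3), (2, 0, 1)]   # spans from another sub-query
merged = merge_spans(a, b)
```

Keeping only one comparison per emitted span (the heap's sift) instead of re-scanning all sub-spans is the efficiency Paul attributes to ScorerDocQueue in DisjunctionSumScorer.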