Re: Searching Ranges

Alex Winston Tue, 12 Nov 2002 11:21:58 -0800

unless i am mistaken, this break will only occur if the current term
within the field is greater than (or equal to, when exclusive) the
upperTerm.  so once a matching term has been found within the range, the
loop will still continue to iterate until a term meets this criterion or
the while loop ends.  perhaps this is the intended behavior and i am
overlooking something.
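
The loop behavior described above can be modeled with a minimal Java sketch. It is a simplified stand-in built only on the standard library, not the actual RangeQuery source: the class RangeScanSketch, the method collect, and the stopAfterFirstMatch flag are hypothetical names chosen for illustration. The flag models the one-line "break; //ADDED!" change shown in the quoted diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Simplified model of a range scan over sorted terms; NOT Lucene's RangeQuery. */
final class RangeScanSketch {

    /**
     * Walks the sorted terms starting at lowerTerm and collects those in range.
     * stopAfterFirstMatch (hypothetical) models the proposed extra break.
     */
    static List<String> collect(SortedSet<String> terms, String lowerTerm,
                                String upperTerm, boolean inclusive,
                                boolean stopAfterFirstMatch) {
        List<String> matches = new ArrayList<>();
        // tailSet returns every term >= lowerTerm, in sorted order.
        for (String term : terms.tailSet(lowerTerm)) {
            int cmp = term.compareTo(upperTerm);
            // Existing exit: past the upper bound, or equal to it when exclusive.
            if (cmp > 0 || (!inclusive && cmp == 0)) {
                break;
            }
            // An exclusive range also skips the lower bound itself.
            if (!inclusive && term.equals(lowerTerm)) {
                continue;
            }
            // Stands in for building a TermQuery and adding it to the query.
            matches.add(term);
            if (stopAfterFirstMatch) {
                break; // models the one-line change under discussion
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> terms =
            new TreeSet<>(Arrays.asList("a", "c", "e", "g", "i", "k"));
        System.out.println(collect(terms, "c", "i", true, false));  // [c, e, g, i]
        System.out.println(collect(terms, "c", "i", false, false)); // [e, g]
        System.out.println(collect(terms, "c", "i", true, true));   // [c]
    }
}
```

In this model, without the extra break every term inside the range is collected, so the work grows with the number of terms in the range. With the extra break only the first in-range term is kept, which is faster but produces a different set of matches.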


any thoughts?

thanks
alex


On Tue, 2002-11-12 at 13:25, Doug Cutting wrote:
> Isn't the break on line 162 of RangeQuery.java supposed to achieve this?
> 
> Alex Winston wrote:
> > otis,
> > 
> > i was able to fix the junit build problems that the lucene unit tests
> > hit with the newest versions of ant.  it appears that junit.jar must
> > be in the $ANT_HOME/lib dir in order to run optional taskdefs such
> > as JUnitTask.
> > 
> > the following link was very helpful.
> > http://barracuda.enhydra.org/project/mailingLists/barracuda/msg04810.html
> > 
> > additionally i was able to unit test lucene with the one line change
> > that i suggested with success, although i have not looked into how
> > thorough the unit tests are for cases like this.
> > 
> > the diff follows from a cvs snapshot from yesterday (note the added
> > break;):
> > *** RangeQuery.java     Sat Nov  9 09:54:05 2002
> > --- RangeQuery.java.old Sat Nov  9 09:53:37 2002
> > ***************
> > *** 164,170 ****
> >                               TermQuery tq = new TermQuery(term);         // found a match
> >                               tq.setBoost(boost);               // set the boost
> >                               q.add(tq, false, false);            // add to q
> > -                             break; //ADDED!
> >                           }
> >                       } 
> >                       else
> > --- 164,169 ----
> > 
> > 
> > i also pondered the ramifications of such a change, and have a few
> > thoughts.  it appears that this is successful because it eliminates the
> > massive overhead of the byte[] built by the TermScorer when there are
> > thousands of terms, but a side-effect may be that it will not accurately
> > return a valid score.  i have yet to test this, and my understanding of
> > the code is still very limited.  although i do not have a firm grasp of
> > what is involved in scoring, is there not a possibility to score based
> > on the number of results matched for this particular field as opposed to
> > the current implementation?
> > 
> > any thoughts?
> > 
> > as i look through the code some more i will offer my thoughts on a
> > possible reimplementation of RangeQuery to alleviate the overhead when
> > there are thousands of terms as opposed to this simple one line change
> > which may have hidden side-effects.
> > 
> > i can also send a copy of some simple tests to show how to create this
> > situation with profiling results if that would be helpful.
> > 
> > 
> > thanks
> > alex
> > 
> > 
> > 
> > On Fri, 2002-11-08 at 17:40, Alex Winston wrote:
> > 
> >>actually i was mistaken, i thought the tests ran successfully but after
> >>looking again i merely got a BUILD SUCCESSFUL.  apparently lucene's build
> >>cannot find JUnitTask out of the box with ant 1.5.1.  i have not had any
> >>time to work through the problem.  i will look into it tomorrow, if you
> >>have any thoughts in the meantime let me know.
> >>
> >>thanks
> >>alex
> >>
> >>
> >>
> >>On Fri, 2002-11-08 at 16:46, Otis Gospodnetic wrote:
> >>
> >>>Hello,
> >>>
> >>>Did you say that you run 'ant test-unit' and that all tests still pass?
> >>>If so, could you attach a cvs diff -ucN RangeQuery.java?
> >>>
> >>>Thanks,
> >>>Otis
> >>>
> >>>
> >>>--- Alex Winston <[EMAIL PROTECTED]> wrote:
> >>>
> >>>>apologies for replying to myself, but another nice side-effect of
> >>>>this fix is that it virtually eliminates the potential for an
> >>>>OutOfMemoryError, which was a problem i encountered on extremely
> >>>>large fields, over 10000 terms, while i was profiling the RangeQuery
> >>>>class.
> >>>>
> >>>>i can get into specifics if need be, any thoughts?
> >>>>
> >>>>alex
> >>>>
> >>>>
> >>>> On Fri, 2002-11-08 at 15:54, Alex Winston wrote:
> >>>>
> >>>>>thanks for the reply, my apologies for not explaining myself very
> >>>>>clearly, it has been a long day.
> >>>>>
> >>>>>you expressed exactly our situation, unfortunately this is not an
> >>>>>option because we want to have multiple ranges for each document as
> >>>>>well.  there is a possible extension of what you suggested but that
> >>>>>is a last resort.  kinda crazy i know, but you have to meet
> >>>>>requirements :).
> >>>>>
> >>>>>but i also had a thought while i was looking through the lucene
> >>>>>code, and any comments are welcome.
> >>>>>
> >>>>>i may be very mistaken because it has been a long day but if you
> >>>>>look at the current cvs version of RangeQuery it appears that even
> >>>>>if a match is found it will continue to iterate over terms within a
> >>>>>field, and in my case it is on the order of thousands.  if i add a
> >>>>>break after a match has been found it appears as though the search
> >>>>>is improved on avg an order of magnitude, my math has left me so i
> >>>>>cannot be theoretical at the moment.  i have unit tested the change
> >>>>>on my side and on the lucene side and it works.  note: one hard
> >>>>>example is that a query went from 20 seconds to .5 seconds.  any
> >>>>>initial thoughts as to whether there is a case where this would not
> >>>>>work?
> >>>>>
> >>>>>beginning line 164:
> >>>>>TermQuery tq = new TermQuery(term);        // found a match
> >>>>>tq.setBoost(boost);                         // set the boost
> >>>>>q.add(tq, false, false);           // add to q
> >>>>>break;  // ADDED!
> >>>>>
> >>>>>
> >>>>>On Fri, 2002-11-08 at 15:09, Mike Barry wrote:
> >>>>>
> >>>>>>Alex,
> >>>>>>
> >>>>>>It is rather confusing. It sounds like you've indexed
> >>>>>>a field that can be between two values (let's say
> >>>>>>E-J) and then when you have a search term such as G
> >>>>>>you want the docs containing E-J (or A-H or F-K, but not
> >>>>>>A-C nor J-Z)
> >>>>>>
> >>>>>>Just off the top of my head, but could you index the upper and
> >>>>>>lower bounds as separate fields, then when you search do a
> >>>>>>compound query:
> >>>>>>
> >>>>>>     lower_bound:{ - search_term } AND upper_bound:{ search_term - }
> >>>>>>
> >>>>>>just a thought.
> >>>>>>
> >>>>>>-MikeB.
> >>>>>>
> >>>>>>
> >>>>>>Alex Winston wrote:
> >>>>>>
> >>>>>>
> >>>>>>>i was hoping that someone could briefly review my current
> >>>>>>>solution to a problem that we have encountered to see if anyone
> >>>>>>>could suggest a possible alternative, because as it stands we have
> >>>>>>>pushed lucene past its current limits.
> >>>>>>>
> >>>>>>>PROBLEM:
> >>>>>>>
> >>>>>>>we were wanting to represent a range of values for a particular
> >>>>>>>field that is searchable over a particular range.
> >>>>>>>
> >>>>>>>an example follows for clarification:
> >>>>>>>we were wanting to store a range of chapters and verses of a book
> >>>>>>>for a particular document, and in turn search to see if a query
> >>>>>>>range includes the range that is represented in the index.
> >>>>>>>
> >>>>>>>if this is unclear please ask for clarification
> >>>>>>>
> >>>>>>>IMPRACTICAL SOLUTION:
> >>>>>>>
> >>>>>>>although this solution seems somewhat impractical it is all we
> >>>>>>>could come up with.
> >>>>>>>
> >>>>>>>our solution involved storing each possible range value within the
> >>>>>>>term which would allow for RangeQuerys to be performed on this
> >>>>>>>particular field.  for very small ranges this seems somewhat
> >>>>>>>practical after profiling.  although once the field ranges began
> >>>>>>>to span multiple chapters and verses, the search times became
> >>>>>>>unreasonable because we were storing thousands of entries for each
> >>>>>>>representative range.
> >>>>>>>
> >>>>>>>i can elaborate on anything that is unclear,
> >>>>>>>but any thoughts on a possible alternative solution within lucene
> >>>>>>>that we overlooked would be extremely helpful.
> >>>>>>>       
> >>>>>>>
> >>>>>>>alex
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>--
> >>>>>>To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@;jakarta.apache.org>
> >>>>>>For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>
> >>>>
> > 
> 
> 
> 
-- 
Alex Winston <[EMAIL PROTECTED]>
