Re: Searching Ranges

Doug Cutting Tue, 12 Nov 2002 11:05:27 -0800

Isn't the break on line 162 of RangeQuery.java supposed to achieve this?

Alex Winston wrote:

otis,

i was able to fix the junit build problems that show up when running the
lucene unit tests with the newest versions of ant. it appears that
junit.jar must be in the $ANT_HOME/lib dir in order to run optional
taskdefs such as JUnitTask.

the following link was very helpful.
http://barracuda.enhydra.org/project/mailingLists/barracuda/msg04810.html

additionally, with the one line change that i suggested, i was able to
run the lucene unit tests successfully, although i have not looked into
how thorough the unit tests are for cases like this.

the diff follows from a cvs snapshot from yesterday (note the added
break;):
*** RangeQuery.java Sat Nov 9 09:54:05 2002
--- RangeQuery.java.old Sat Nov 9 09:53:37 2002
***************
*** 164,170 ****
  TermQuery tq = new TermQuery(term); // found a match
  tq.setBoost(boost); // set the boost
  q.add(tq, false, false); // add to q
- break; //ADDED!
  }
  } else
--- 164,169 ----

i also pondered the ramifications of such a change, and have a few
thoughts. it appears that this is successful because it eliminates the
massive overhead of the byte[] built by the TermScorer when there are
thousands of terms, but a side-effect may be that it will not return an
accurate score. i have yet to test this, and my understanding of the
code is still very limited. although i do not have a firm grasp of what
is involved in scoring, would it be possible to score based on the
number of results matched for this particular field, as opposed to the
current implementation?

any thoughts?

as i look through the code some more i will offer my thoughts on a
possible reimplementation of RangeQuery to alleviate the overhead when
there are thousands of terms as opposed to this simple one line change
which may have hidden side-effects.

i can also send a copy of some simple tests to show how to create this
situation with profiling results if that would be helpful.

thanks
alex

On Fri, 2002-11-08 at 17:40, Alex Winston wrote:
actually i was mistaken. i thought the tests ran successfully, but after
looking again i merely got a BUILD SUCCESSFUL; apparently lucene's build
cannot find JUnitTask out of the box with ant 1.5.1. i have not had any
time to work through the problem. i will look into it tomorrow; if you
have any thoughts in the meantime let me know.
thanks
alex



On Fri, 2002-11-08 at 16:46, Otis Gospodnetic wrote:
Hello,

Did you say that you run 'ant test-unit' and that all tests still pass?
If so, could you attach a cvs diff -ucN RangeQuery.java?

Thanks,
Otis


--- Alex Winston <[EMAIL PROTECTED]> wrote:
apologies for replying to myself, but another nice side-effect of this
fix is that it virtually eliminates the potential for an
OutOfMemoryError, which was a problem i encountered on extremely large
fields, over 10000 terms, while i was profiling the RangeQuery class.

i can get into specifics if need be, any thoughts?

alex


On Fri, 2002-11-08 at 15:54, Alex Winston wrote:
thanks for the reply, and my apologies for not explaining myself very
clearly, it has been a long day.

you expressed exactly our situation. unfortunately this is not an option
because we want to have multiple ranges for each document as well. there
is a possible extension of what you suggested, but that is a last
resort. kinda crazy i know, but you have to meet requirements :).
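A minimal sketch of one plausible reason the two-field idea breaks down with multiple ranges per document, assuming each field simply collects all of its values so that the pairing between a lower and an upper bound is lost. The class and method names are hypothetical, and plain Java collections stand in for the index; this is not Lucene code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; plain collections stand in for indexed fields.
class MultiRangeSketch {

    // Two-field test: some lower bound <= term AND some upper bound >= term.
    // The bounds are checked independently, so their pairing is lost.
    static boolean matchesUnpaired(List<String> lowers, List<String> uppers, String term) {
        boolean lowerOk = lowers.stream().anyMatch(l -> l.compareTo(term) <= 0);
        boolean upperOk = uppers.stream().anyMatch(u -> u.compareTo(term) >= 0);
        return lowerOk && upperOk;
    }

    // Paired test: the term must fall inside one specific range.
    static boolean matchesPaired(List<String[]> ranges, String term) {
        return ranges.stream()
                .anyMatch(r -> r[0].compareTo(term) <= 0 && r[1].compareTo(term) >= 0);
    }

    public static void main(String[] args) {
        // One document holding the ranges A-C and J-Z; G lies in neither.
        List<String> lowers = Arrays.asList("A", "J");
        List<String> uppers = Arrays.asList("C", "Z");
        List<String[]> ranges =
                Arrays.asList(new String[] {"A", "C"}, new String[] {"J", "Z"});

        System.out.println(matchesUnpaired(lowers, uppers, "G")); // true (a false positive)
        System.out.println(matchesPaired(ranges, "G"));           // false
    }
}
```

Under these assumptions the unpaired check accepts G because A satisfies the lower test and Z satisfies the upper test, even though they belong to different ranges.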

but i also had a thought while i was looking through the lucene code,
and any comments are welcome.
i may be very mistaken because it has been a long day, but if you look
at the current cvs version of RangeQuery it appears that even if a match
is found it will continue to iterate over terms within a field, and in
my case that is on the order of thousands. if i add a break after a
match has been found, it appears as though the search is improved on
average by an order of magnitude; my math has left me so i cannot be
theoretical at the moment. i have unit tested the change on my side and
on the lucene side and it works. note: one hard example is that a query
went from 20 seconds to .5 seconds. any initial thoughts as to whether
there is a case where this would not work?

beginning line 164:
TermQuery tq = new TermQuery(term);  // found a match
tq.setBoost(boost);                  // set the boost
q.add(tq, false, false);             // add to q
break;                               // ADDED!
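A minimal sketch of what a break after the first match might do, assuming the surrounding loop walks the field's terms in sorted order and adds one clause per term inside the range. The class, the method, and the `stopAtFirstMatch` flag are hypothetical, and a `TreeSet` stands in for the term enumeration; this is not the RangeQuery source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for a sorted term dictionary; not Lucene's API.
class RangeExpansionSketch {

    // Collect the terms in [lower, upper] from a sorted term set.
    // stopAtFirstMatch mimics adding "break;" right after the first match.
    static List<String> expand(SortedSet<String> terms, String lower, String upper,
                               boolean stopAtFirstMatch) {
        List<String> matched = new ArrayList<>();
        for (String term : terms.tailSet(lower)) {  // start at the first term >= lower
            if (term.compareTo(upper) > 0) {
                break;                              // past the upper bound: stop
            }
            matched.add(term);                      // one clause per matching term
            if (stopAtFirstMatch) {
                break;                              // the one-line change under discussion
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(Arrays.asList("a", "c", "e", "g", "k"));
        System.out.println(expand(terms, "b", "h", false)); // [c, e, g]
        System.out.println(expand(terms, "b", "h", true));  // [c]
    }
}
```

If the real loop has this shape, the early break keeps only the first term in range, so a document containing only a later term (e or g here) would no longer match. That would explain both the speedup and the concern about hidden side-effects.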


On Fri, 2002-11-08 at 15:09, Mike Barry wrote:
Alex,

It is rather confusing. It sounds like you've indexed
a field that can be between two values (let's say
E-J) and then when you have a search term such as G
you want the docs containing E-J (or A-H or F-K but not A-H
nor A-C nor J-Z)

Just off the top of my head, but could you index the upper and
lower bounds as separate fields, then when you search do a
compound query:

    lower_bound:{ - search_term } AND upper_bound:{ search_term - }

just a thought.
-MikeB.
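A minimal sketch of the containment test that the two-field query above expresses for a single range per document. It assumes the bounds compare as plain strings and treats both bounds as inclusive, which is a choice made here for illustration; the class and method names are hypothetical and this is not Lucene code.

```java
// Hypothetical illustration only; not Lucene code.
class BoundsSketch {

    // True when lowerBound <= term <= upperBound under String ordering.
    static boolean covers(String lowerBound, String upperBound, String term) {
        return lowerBound.compareTo(term) <= 0 && upperBound.compareTo(term) >= 0;
    }

    public static void main(String[] args) {
        String term = "G";
        System.out.println(covers("E", "J", term)); // true
        System.out.println(covers("F", "K", term)); // true
        System.out.println(covers("A", "C", term)); // false
        System.out.println(covers("J", "Z", term)); // false
    }
}
```

Each document then needs only two stored values per range, rather than one term for every value inside the range.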
Alex Winston wrote:
i was hoping that someone could briefly review my current solution to a
problem that we have encountered and perhaps suggest a possible
alternative, because as it stands we have pushed lucene past its current
limits.

PROBLEM:

we want to represent a range of values for a particular field in a way
that is searchable over a particular range.

an example follows for clarification:
we want to store a range of chapters and verses of a book for a
particular document, and in turn search to see if a query range includes
the range that is represented in the index.

if this is unclear please ask for clarification

IMPRACTICAL SOLUTION:

although this solution seems somewhat impractical it is all we could
come up with.

our solution involved storing each possible range value as a term in the
field, which allows RangeQuery searches to be performed on this
particular field. for very small ranges this seems somewhat practical
after profiling. once the field ranges began to span multiple chapters
and verses, though, the search times became unreasonable because we were
storing thousands of entries for each representative range.
i can elaborate on anything that is unclear, but any thoughts on a
possible alternative solution within lucene that we overlooked would be
extremely helpful.
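A minimal sketch of how storing every value of a range as its own term makes the term count grow. It assumes a zero-padded chapter/verse key so that string order matches numeric order; the key format, the class name, and the verse counts are made up for illustration and are not taken from the original setup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; the key format and verse counts are assumptions.
class VerseTermsSketch {

    // Zero-padded key so that String order matches (chapter, verse) order.
    static String key(int chapter, int verse) {
        return String.format(Locale.ROOT, "%03d%03d", chapter, verse);
    }

    // Every key from (fromCh, fromVs) through (toCh, toVs), inclusive.
    // versesPerChapter[c - 1] is the verse count of chapter c, supplied by the caller.
    static List<String> expand(int fromCh, int fromVs, int toCh, int toVs,
                               int[] versesPerChapter) {
        List<String> terms = new ArrayList<>();
        for (int ch = fromCh; ch <= toCh; ch++) {
            int first = (ch == fromCh) ? fromVs : 1;
            int last = (ch == toCh) ? toVs : versesPerChapter[ch - 1];
            for (int vs = first; vs <= last; vs++) {
                terms.add(key(ch, vs));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        int[] versesPerChapter = {3, 4, 2}; // made-up verse counts
        List<String> terms = expand(1, 2, 3, 1, versesPerChapter);
        System.out.println(terms.size());                          // 7
        System.out.println(key(2, 10).compareTo(key(10, 1)) < 0);  // true
        System.out.println("2:10".compareTo("10:1") < 0);          // false
    }
}
```

The last two lines show why padding matters when ranges are compared as strings: without it, chapter 10 sorts before chapter 2.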
	

alex
--
To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@;jakarta.apache.org>