Thanks for the feedback, Shai.
So I guess you're suggesting breaking this out into a general utility
class e.g. something like:
class TimeLimitedThreadActivity
{
//called by client
public static void startTimeLimitedActivity(long
maxTimePermitted).
public static void endTimeLimitedActivity()
//called by resources (reader/writers) that need to be shared
fairly by threads
public static void checkActivityNotElapsed(); //throws some
form of runtime exception
}
A downside of breaking it out into static methods like this is that a
thread cannot run >1 time-limited activity simultaneously but I guess
that might be a reasonable restriction.
>>Aside, how about using a PQ for the threads' times, or a TreeMap.
That will save looping over the collection to find the next candidate.
Just an implementation detail though.
Yep, that was one of the rough edges - I just wanted to get raw
timings first for the all the "is timed out?" checks we're injecting
into reader calls.
Cheers
Mark
On 27 Jun 2009, at 11:37, Shai Erera wrote:
I like the overall approach. However it's very local to an
IndexReader. I.e., if someone wanted to limit other operations (say
indexing), or does not use an IndexReader (for a Scorer impl maybe),
one cannot reuse it.
What if we factor out the timeout logic to a Timeout class (I think
it can be a static class, with the way you implemented it) and use
it in TimeLimitingIndexReader? That class can offer a method check()
which will do the internal logic (the 'if' check and throw
exception). It will be similar to the current ensureOpen() followed
by an operation.
It might be considered more expensive since it won't check a
boolean, but instead call a check() method, but it will be more
reusable. Also, ensureOpen today is also a method call, so I don't
think Timeout.check() is that bad. We can even later create a
TimeLimitingIndexWriter and document Timeout class for other usage
by external code.
Aside, how about using a PQ for the threads' times, or a TreeMap.
That will save looping over the collection to find the next
candidate. Just an implementation detail though.
Shai
On Sat, Jun 27, 2009 at 3:31 AM, Mark Harwood
<markharw...@yahoo.co.uk> wrote:
Going back to my post re TimeLimitedIndexReaders - here's an
incomplete but functional prototype:
http://www.inperspective.com/lucene/TimeLimitedIndexReader.java
http://www.inperspective.com/lucene/TestTimeLimitedIndexReader.java
The principle is that all reader accesses check a volatile variable
indicating something may have timed out (no need to check thread
locals etc.) If and only if a time out has been noted threadlocals
are checked to see which thread should throw a timeout exception.
All time-limited use of reader must be wrapped in try...finally
calls to indicate the start and stop of a timed set of activities. A
background thread maintains the next anticipated timeout deadline
and simply waits until this is reached or the list of planned
activities changes with new deadlines.
Performance seems reasonable on my Wikipedia index:
//some tests for heavy use of termenum/term docs
Read term docs for 200000 terms in 4755 ms using no timeout limit
(warm up)
Read term docs for 200000 terms in 4320 ms using no timeout limit
(warm up)
Read term docs for 200000 terms in 4320 ms using no timeout limit
Read term docs for 200000 terms in 4388 ms using reader with time-
limited access
//Example query with heavy use of termEnum/termDocs
+text:f* +text:a* +text:b* no time limit matched 1090041 docs in
2000 ms
+text:f* +text:a* +text:b* time limited collector matched 1090041
docs in 1963 ms
+text:f* +text:a* +text:b* time limited reader matched 1090041 docs
in 2121 ms
//Example fuzzy match burning CPU reading TermEnum
text:accomodation~0.5 no time limit matched 192084 docs in 6428 ms
text:accomodation~0.5 time limited collector matched 192084 docs in
5923 ms
text:accomodation~0.5 time limited reader matched 192084 docs in
5945 ms
The reader approach to limiting time is slower but has these
advantages :
1) Multiple reader activities can be time-limited rather than just
single searches
2) No code changes required to scorers/queries/filters etc
3) Tasks that spend plenty of time burning CPU before collection
happens can be killed earlier
I'm sure there's some thread safety issues to work through in my
code and not all reader classes are wrapped (e.g. TermPositions) but
the basics are there and seem to be functioning
Thoughts?