I've verified that for a large pull from Hits, the logic described below makes it 
*significantly* faster to request the last desired hit [which may still be an index 
far smaller than hits.length()] before iterating through the hits, e.g. via the 
hits.id line in the quoted snippet below.  Here are relative timings from two 
runs retrieving 1000 Documents:

Ready to roll with query: Abstract:infant Abstract:learning
Scenario 1: hits.length()=131629, numhits=1000, time=2058ms
Scenario 2: hits.length()=131629, numhits=1000, time=588ms

Ready to roll with query: Abstract:infant Abstract:learning
Scenario 1: hits.length()=131629, numhits=1000, time=2134ms
Scenario 2: hits.length()=131629, numhits=1000, time=623ms

(These timings include nothing but Document retrieval from Hits; the search itself is excluded)
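
For reference, here is a minimal sketch of what the two scenarios measure 
(searcher, query, and numhits are stand-ins for my actual test harness, and the 
usual Lucene imports are assumed; treat this as an illustration, not the exact 
code I timed):

  // Scenario 1: plain linear iteration.  Hits grows its internal cache in
  // small steps, re-running the search each time the loop walks past it.
  Hits hits = searcher.search(query);
  for (int i = 0; i < numhits; i++) {
    Document doc = hits.doc(i);
  }

  // Scenario 2: touch the last desired hit first, so Hits extends its cache
  // out to numhits in one pass, then iterate exactly as before.
  Hits hits2 = searcher.search(query);
  hits2.id(numhits - 1);
  for (int i = 0; i < numhits; i++) {
    Document doc = hits2.doc(i);
  }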

Conclusion: it would make sense for ANY application that is going to iterate over 
more than 100 Documents from Hits to do this.  Either that, or use a non-Hits API 
such as TopDocs (next on my own agenda).
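
For what it's worth, the TopDocs route looks roughly like this (a sketch against 
the lower-level 1.4.x search API; numhits is again a placeholder and error 
handling is omitted):

  // Ask the searcher directly for the top numhits id/score pairs, bypassing
  // the Hits cache entirely, then load only the stored documents we need.
  TopDocs top = searcher.search(query, null, numhits);
  for (int i = 0; i < top.scoreDocs.length; i++) {
    Document doc = searcher.doc(top.scoreDocs[i].doc);
  }

As Hoss notes below, a HitCollector is the other option when you truly need every 
result rather than the top N.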

This makes me wonder whether the caching logic of Hits, which is optimized for 
random rather than linear access and is neither tunable nor controllable in 1.4.3, 
should be reviewed for a subsequent release, at the latest for the API-breaking 2.0.  
I'll wager that a majority of applications do nothing other than a one-time linear 
retrieval of Documents from Hits, with the potential for a lot of wasted cycles in 
those that retrieve more than a small number.

- J.J. Larrea


At 4:19 PM -0700 9/8/05, Chris Hostetter wrote:
>As Yonik pointed out in his reply, the batching/caching done by Hits is
>worse than I remembered.  It's not just batching up the retrieval of
>stored fields -- it's re-executing the underlying search to pull back the
>id,score pairs for docs 0->N*2 anytime you ask for any information about
>result N if N is not already in its cache.
>
>Compare the timing info you have now with this...
>
>  int start = ...
>  Hits hits = searcher.search(query);
>  int queryTime = ...
>  int trash = hits.id(hits.length() - 1);
>  int prefetchTime = ...
>  for (int i = 0; i < hits.length(); i++) {
>     int id = hits.id(i);
>  }
>  int loopTime = ...
>
>...and I think you'll see what I mean.
>
>Yonik is probably right: if you really need to loop over all the results, I
>would use one of the more expert methods (a HitCollector would probably be
>best).
>
>: Date: Thu, 8 Sep 2005 17:05:18 -0600
>: From: Richard Krenek <[EMAIL PROTECTED]>
>: Reply-To: java-user@lucene.apache.org
>: To: java-user@lucene.apache.org
>: Subject: Re: Weird time results doing wildcard queries
>:
>: I did the change and here are the results:
>:
>: Query (default field is COMP_PART_NUMBER): 2444*
>: Query: COMP_PART_NUMBER:2444*
>: Query Time: 328 ms - time for query to run.
>: 383 total matching documents.
>: Cycle Time: 141 ms - time to run through hits.
>:
>:
>: Query (default field is COMP_PART_NUMBER): *91822*
>: Query: COMP_PART_NUMBER:*91822*
>: Query Time: 9375 ms
>: 251 total matching documents.
>: Cycle Time: 20094 ms
>:
>: On 9/8/05, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>: >
>: > : is if the query starts with a wildcard. In the case where it starts with a
>: > : wildcard, lucene has no option but to linearly go over every term in the
>: > : index to see if it matches your pattern. It must visit every single term in
>: >
>: > That would explain why the search itself takes a while, but not why
>: > accessing the hits after the call to search would take a while. Note
>: > where the timing code is in his example.
>: >
>: > There are two possible explanations I can think of...
>: >
>: > : >>It seems when I have a wildcard query like *abcd* vs weqrew*, the *abcd*
>: > : query will always take longer to retrieve the documents even if they are of
>: > : similar result sizes. We are talking a big difference, 1 second vs 16. It is
>: >
>: > 1) How similar, and how many? ... If I remember correctly, the Hits
>: > constructor does some work to pre-fetch the first 100 results. So if you
>: > are iterating over all of the results, the first 100 are free. On the
>: > 101st iteration the prefetching method is called again to fetch N more (I
>: > don't remember what N is off the top of my head).
>: >
>: > What this means is that if you are only timing the method calls on Hits,
>: > then the first 100 documents are free -- if one wildcard search returns 99
>: > results, and the other returns 105 results, those numbers may not seem that
>: > different, but in the first case the code you are timing is accessing
>: > nothing but memory, and in the second case it has to read from disk.
>: >
>: > 2) The second idea also requires you to answer a question: the number of
>: > results returned for each query might be identical, but are the
>: > results themselves identical?
>: >
>: > I'm guessing that either the documents from the "slow" case are
>: > much bigger (ie: larger stored fields) or the results from the fast case
>: > are all documents that are "near" each other on disk, so fetching back all
>: > of the stored fields would require less IO than if the results are stored
>: > farther apart. If I remember correctly, the stored fields of documents
>: > are kept in the order the documents are added, so hypothetically, if the
>: > query you did was on a "name" field, and the documents were added to the
>: > index in alphabetical order by "name", then by definition the results for
>: > "weqrew*" will all be close together, while the results for "*abcd*" will
>: > be spread out throughout the index.
>: >
>: > An easy way to disprove that 2nd theory would be to change your timing
>: > code to this and see what happens...
>: >
>: >
>: > Hits hits = searcher.search(query);
>: > long startTime = System.currentTimeMillis();
>: > for (int i = 0; i < hits.length(); i++) {
>: >   int id = hits.id(i);
>: > }
>: >
>: >
>: >
>: > -Hoss
>: >
>: >
>:
>
>
>
>-Hoss
>
>

