I've verified that for a large pull from Hits, the logic as described makes it *significantly* faster to request the last desired hit [which could still be far fewer than hits.length()] before iterating through the hits, e.g. the hits.id line in the quoted snippet below. Here are relative timings from two runs of asking for 1000 Documents:
Ready to roll with query: Abstract:infant Abstract:learning Scenario 1: hits.length()=131629, numhits=1000, time=2058ms Scenario 2: hits.length()=131629, numhits=1000, time=588ms Ready to roll with query: Abstract:infant Abstract:learning Scenario 1: hits.length()=131629, numhits=1000, time=2134ms Scenario 2: hits.length()=131629, numhits=1000, time=623ms (These timings do not include anything but Document retrieval from Hits) Conclusion: It would make sense for ANY application which is going to iterate more than 100 Documents from Hits to do this. Either that or utilize the non-Hits API such as TopDocs (next on my own agenda). Which makes me wonder whether the caching logic of Hits, optimized for random- rather than linear-access, and not tuneable or controllable in 1.4.3, should be reviewed for a subsequent release, at least the API-breaking 2.0. I'll wager that a majority of applications do nothing other than a one-time linear retrieval of Documents from Hits, with the potential for a lot of wasted cycles for those that retrieve more than a small number. - J.J. Larrea At 4:19 PM -0700 9/8/05, Chris Hostetter wrote: >As Yonik pointed out in his reply, the batching/caching done by Hits is >worse then i remembered. It's not just batching up the retrieval of >stored fields -- it's re-executing the underlying search to pull back the >id,score pairs for docs 0->N*2 anytime you ask for any information about >result N if N is not already in it's cache. > >Compare the timing info you have now with this... > > int start = ... > Hits hits = searcher.search(query); > int queryTime = ... > int trash = hits.id(hits.length()); > int prefetchTime = ... > for (int i = 0; i < hits.length(); i++) { > int id = hits.id(i); > } > int loopTime = ... > >...and i think you'll see what i mean. > >Yonik is probably right: if you really need to loop over allthe results, I >would use one of hte more expert methods (a HitCollector would probably be >best) > > > > > > >: Date: Thu, 8 Sep 2005 17:05:18 -0600 >: From: Richard Krenek <[EMAIL PROTECTED]> >: Reply-To: java-user@lucene.apache.org >: To: java-user@lucene.apache.org >: Subject: Re: Weird time results doing wildcard queries >: >: I did the change and here are the results: >: >: Query (default field is COMP_PART_NUMBER): 2444* >: Query: COMP_PART_NUMBER:2444* >: Query Time: 328 ms - time for query to run. >: 383 total matching documents. >: Cycle Time: 141 ms - time to run through hits. >: >: >: Query (default field is COMP_PART_NUMBER): *91822* >: Query: COMP_PART_NUMBER:*91822* >: Query Time: 9375 ms >: 251 total matching documents. >: Cycle Time: 20094 ms >: >: On 9/8/05, Chris Hostetter <[EMAIL PROTECTED]> wrote: >: > >: > : is if the query starts with a wildcard. In the case where it starts with >: > a >: > : wildcard, lucene has no option but to linearly go over every term in the >: > : index to see if it matches your pattern. It must visit every singe term >: > in >: > >: > That would explain why the search itself takes a while, but not why >: > accessing the hits after the call to search would take a while. note >: > where the timing code is in his example. >: > >: > There are two possible explanations i can think of... >: > >: > : >>It seems when I have a wilcard query like *abcd* vs weqrew*, the >: > *abcd* >: > : query will always take longer to retrieve the documents even if they are >: > of >: > : simular result sizes. We are talking a big difference 1 second vs 16. It >: > is >: > >: > 1) How similar, and how many? ... If i remember correctly, the Hits >: > constructor does some work to pre-fetch the first 100 results. So if you >: > are iterating over all of the results, the first 100 are free. On the >: > 101st iteration the prefetching method is called again to fetch N more (i >: > don't remember what N is off the top of my head. >: > >: > what this means is that if you are only timing the method calls on Hits, >: > then the first 100 documents are free -- if one wildcard search returns 99 >: > results, and the other returns 105 results, those numbers may not seemthat >: > different, but in the first case the code you are timing is accessing >: > nothing but memory, and in the second case it has to read from disk. >: > >: > 2) The second idea also requires you to answer a question" the number of >: > results returned for each query might be identicle, but are the >: > results themselves identical? >: > >: > I'm guessing that either the documents from the "slow" case are either >: > much bigger (ie: larger stored fields) or the results from the fast case >: > are all documents that are "near" eachother on disk, so fetching back all >: > of hte stored fields would require less IO then if the results are stored >: > farther apart. If i remember correctly, the stored fields of documents >: > are kept in order that the documents are added, so hypothetically, the >: > query you did was on a "name" field, and the documents were added to the >: > index in alphabetical order by "name" then by definition the results for >: > "weqrew*' will all be close together, while the results for "*abcd*" will >: > be spread out throughout the index. >: > >: > an easy way to disprove that 2nd theory would be to change your timing >: > code to this and see what happens... >: > >: > >: > Hits hits = searcher.search(query); >: > long startTime = System.currentTimeMillis(); >: > for (int i = 0; i < hits.length(); i++) { >: > int id = hits.id(i); >: > } >: > >: > >: > >: > -Hoss >: > >: > >: > --------------------------------------------------------------------- >: > To unsubscribe, e-mail: [EMAIL PROTECTED] >: > For additional commands, e-mail: [EMAIL PROTECTED] >: > >: > >: > > > >-Hoss > > >--------------------------------------------------------------------- >To unsubscribe, e-mail: [EMAIL PROTECTED] >For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]