Hi,

This code is buggy! The collect() call of the Collector does not get a document ID relative to the top-level IndexSearcher; it only gets a document ID relative to the reader passed to setNextReader (an atomic reader responsible for a single Lucene index segment).
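To make the mismatch concrete, here is a standalone sketch of the docID arithmetic (plain Java, no Lucene at runtime; the Segment class below is a hypothetical stand-in for AtomicReaderContext, which exposes the real docBase):

```java
// Sketch of why the Collector bug happens: Lucene numbers documents per
// segment, and the top-level docID is the segment-relative docID plus the
// segment's docBase offset. "Segment" is hypothetical; it only mirrors how a
// composite reader assigns docBase values.
public class DocBaseSketch {

    static final class Segment {
        final int docBase; // position of this segment's doc 0 in the top-level view
        final int maxDoc;  // number of documents in this segment

        Segment(int docBase, int maxDoc) {
            this.docBase = docBase;
            this.maxDoc = maxDoc;
        }
    }

    // A composite reader assigns docBase as the running total of maxDoc.
    static Segment[] buildSegments(int... maxDocs) {
        Segment[] segments = new Segment[maxDocs.length];
        int base = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            segments[i] = new Segment(base, maxDocs[i]);
            base += maxDocs[i];
        }
        return segments;
    }

    // collect() receives the segment-relative ID; the top-level ID needs docBase.
    static int toTopLevel(Segment segment, int segmentDocID) {
        return segment.docBase + segmentDocID;
    }
}
```

So if three segments hold 10, 5, and 20 documents, segment-relative docID 3 of the third segment is top-level document 18; passing the bare 3 to IndexSearcher.doc() would load an unrelated document from the first segment instead.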
In setNextReader, save a reference to the "current" reader, and use this "current" reader to get the stored fields:

    indexSearcher.search(query, queryFilter, new Collector() {

        AtomicReader current;

        @Override
        public void setScorer(Scorer scorer) throws IOException { }

        @Override
        public void setNextReader(AtomicReaderContext ctx) throws IOException {
            current = ctx.reader();
        }

        @Override
        public void collect(int docID) throws IOException {
            // docID is relative to "current", so load the document from it
            Document doc = current.document(docID, loadFields);
            found.found(doc);
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true;
        }
    });

Otherwise you get wrong document IDs reported!

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Valentin Popov [mailto:valentin...@gmail.com]
> Sent: Saturday, November 14, 2015 1:04 PM
> To: java-user@lucene.apache.org
> Subject: Re: 500 millions document for loop.
>
> Hi, Uwe.
>
> Thanks for your advice.
>
> After implementing your suggestion, our calculation time dropped from ~20
> days to 3.5 hours.
>
> /**
>  * DocumentFound - callback function for each document
>  */
> public void iterate(SearchOptions options, final DocumentFound found,
>         final Set<String> loadFields) throws Exception {
>     Query query = options.getQuery();
>     Filter queryFilter = options.getQueryFilter();
>     final IndexSearcher indexSearcher = new VolumeSearcher(options)
>             .newIndexSearcher(Executors.newSingleThreadExecutor());
>
>     indexSearcher.search(query, queryFilter, new Collector() {
>
>         @Override
>         public void setScorer(Scorer scorer) throws IOException { }
>
>         @Override
>         public void setNextReader(AtomicReaderContext ctx) throws IOException { }
>
>         @Override
>         public void collect(int docID) throws IOException {
>             // BUG: docID here is segment-relative, but
>             // indexSearcher.doc() expects a top-level document ID
>             Document doc = indexSearcher.doc(docID, loadFields);
>             found.found(doc);
>         }
>
>         @Override
>         public boolean acceptsDocsOutOfOrder() {
>             return true;
>         }
>     });
> }
>
> On 12 Nov
> 2015, at 21:15, Uwe Schindler <u...@thetaphi.de> wrote:
> >
> > Hi,
> >
> >>> The big question is: Do you need the results paged at all?
> >>
> >> Yup, because if we return all results, we get an OOME.
> >
> > You get the OOME because the paging collector cannot handle that, so this is
> > an XY problem. Wouldn't it be better if your application just got the results
> > as a stream and processed them one after another? If this is the case (and
> > most statistics need it like that), you are much better off NOT using TopDocs!
> > Your requirement is diametrically opposed to getting top-scoring documents:
> > you want to get ALL results as a sequence.
> >
> >>> Do you need them sorted?
> >>
> >> Nope.
> >
> > OK, so unsorted streaming is the right approach.
> >
> >>> If not, the easiest approach is to use a custom Collector that does no
> >>> sorting and just consumes the results.
> >>
> >> The main bottleneck, as I see it, comes from the next-page search, which
> >> takes ~2-4 seconds.
> >
> > This is because when paging, the collector has to re-execute the whole
> > query and sort all results again, just with a larger window. So if you have
> > result pages of 50,000 results and you want the second page, it will
> > internally sort 100,000 results, because the first page needs to be
> > calculated, too. As you go forward in the results, the window gets larger
> > and larger, until it finally collects all results.
> >
> > So getting the results as a stream by implementing the Collector API is the
> > right way to do this.
> >
> >>> Uwe
> >>>
> >>> -----
> >>> Uwe Schindler
> >>> H.-H.-Meier-Allee 63, D-28213 Bremen
> >>> http://www.thetaphi.de
> >>> eMail: u...@thetaphi.de
> >>>
> >>>> -----Original Message-----
> >>>> From: Valentin Popov [mailto:valentin...@gmail.com]
> >>>> Sent: Thursday, November 12, 2015 6:48 PM
> >>>> To: java-user@lucene.apache.org
> >>>> Subject: Re: 500 millions document for loop.
> >>>>
> >>>> Toke, thanks!
> >>>>
> >>>> We will look at this solution; it looks like this is what we need.
> >>>>
> >>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> >>>>>
> >>>>> Valentin Popov <valentin...@gmail.com> wrote:
> >>>>>
> >>>>>> We have ~10 indexes for 500M documents; each document
> >>>>>> has an "archive date" and a "to" address. One of our tasks is to
> >>>>>> calculate statistics on "to" for the last year. Right now we
> >>>>>> search archive_date:(current_date - 1 year) and paginate the
> >>>>>> results at 50k records per page. The bottleneck of that approach
> >>>>>> is that pagination takes too long: even on a powerful server it
> >>>>>> takes ~20 days to execute, which is far too long.
> >>>>>
> >>>>> Lucene does not like deep page requests due to the way the internal
> >>>>> Priority Queue works. Solr has CursorMark, which should be fairly
> >>>>> simple to emulate in your Lucene handling code:
> >>>>>
> >>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>>>>
> >>>>> - Toke Eskildsen
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>>
> >>>> Regards,
> >>>> Valentin Popov
> >>
> >> Best regards,
> >> Valentin Popov
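For reference, the cost difference Uwe and Toke describe can be sketched without Lucene at all. This is plain Java with hypothetical method names, and the model is deliberately simplified to counting how many hits each strategy must collect:

```java
// Sketch of why deep offset paging is slow and why cursor-style iteration
// (Solr's CursorMark) or a streaming Collector fixes it. Offset paging must
// re-collect every earlier page's hits to reach page k; cursoring and
// streaming touch each hit exactly once.
public class PagingCostSketch {

    // Offset paging: producing page k means collecting the first k*pageSize
    // hits, so total work grows quadratically with the number of pages.
    static long hitsTouchedByOffsetPaging(long totalHits, long pageSize) {
        long pages = (totalHits + pageSize - 1) / pageSize;
        long touched = 0;
        for (long k = 1; k <= pages; k++) {
            touched += Math.min(k * pageSize, totalHits);
        }
        return touched;
    }

    // Cursor/streaming: every hit is touched exactly once.
    static long hitsTouchedByCursor(long totalHits) {
        return totalHits;
    }
}
```

With 500M hits in 50k-per-page windows, offset paging collects about 2.5×10^12 hits in total versus 5×10^8 for a single streaming or cursored pass; this asymmetry is behind the reported drop from ~20 days to 3.5 hours.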
> Best regards,
> Valentin Popov