Uwe, hello. Is it possible to use the same fast iterator, but apply sorting by date?
Regards, Valentin.

> On 14 Nov. 2015, at 15:54, Uwe Schindler <u...@thetaphi.de> wrote:
>
> For performance reasons, I would also return "false" for "out of order"
> documents. This allows stored fields to be accessed more efficiently
> (otherwise it seeks too much). For this type of collector, the I/O cost
> is higher than the small computational gain from out-of-order documents.
>
> Kind regards,
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>> -----Original Message-----
>> From: Valentin Popov [mailto:valentin...@gmail.com]
>> Sent: Saturday, November 14, 2015 1:51 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: 500 millions document for loop.
>>
>> Thank you very much!
>>
>>> On 14 Nov. 2015, at 15:49, Uwe Schindler <u...@thetaphi.de> wrote:
>>>
>>> Hi,
>>>
>>> This code is buggy! The collect() call of the collector does not get a
>>> document ID relative to the top-level IndexSearcher; it only gets a
>>> document ID relative to the reader reported in setNextReader (which is
>>> an atomic reader responsible for a single Lucene index segment).
>>>
>>> In setNextReader, save a reference to the "current" reader, and use
>>> this "current" reader to get the stored fields:
>>>
>>> indexSearcher.search(query, queryFilter, new Collector() {
>>>     AtomicReader current;
>>>
>>>     @Override
>>>     public void setScorer(Scorer arg0) throws IOException { }
>>>
>>>     @Override
>>>     public void setNextReader(AtomicReaderContext ctx) throws IOException {
>>>         current = ctx.reader();
>>>     }
>>>
>>>     @Override
>>>     public void collect(int docID) throws IOException {
>>>         Document doc = current.document(docID, loadFields);
>>>         found.found(doc);
>>>     }
>>>
>>>     @Override
>>>     public boolean acceptsDocsOutOfOrder() {
>>>         return true;
>>>     }
>>> });
>>>
>>> Otherwise you get wrong document IDs reported!
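Uwe's point about per-segment IDs can be illustrated with a small self-contained sketch. This is plain Java, not the Lucene API: the segment sizes here are made up, and in real Lucene the offset of a segment is provided as `AtomicReaderContext.docBase` rather than computed by hand.

```java
// Simplified model of Lucene's per-segment document numbering.
// Each segment numbers its documents from 0; the global (top-level)
// ID is the segment's starting offset (docBase) plus the local ID.
public class DocIdDemo {
    // Hypothetical segment sizes for illustration.
    static final int[] SEGMENT_SIZES = {100, 50, 75};

    // Global ID = docBase of the segment + segment-relative ID.
    // In real Lucene this offset comes from AtomicReaderContext.docBase.
    static int globalDocId(int segment, int segmentDocId) {
        int docBase = 0;
        for (int s = 0; s < segment; s++) {
            docBase += SEGMENT_SIZES[s];
        }
        return docBase + segmentDocId;
    }

    public static void main(String[] args) {
        // The same segment-relative ID names a different document
        // in every segment.
        System.out.println(globalDocId(0, 7)); // prints 7
        System.out.println(globalDocId(1, 7)); // prints 107
    }
}
```

This is why passing the segment-relative ID from collect() to the top-level searcher, as the earlier version of the code did, fetches the wrong stored document for every segment after the first; resolving the ID against the per-segment reader avoids the mismatch.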
>>>
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>>
>>>> -----Original Message-----
>>>> From: Valentin Popov [mailto:valentin...@gmail.com]
>>>> Sent: Saturday, November 14, 2015 1:04 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: 500 millions document for loop.
>>>>
>>>> Hi, Uwe.
>>>>
>>>> Thanks for your advice.
>>>>
>>>> After implementing your suggestion, our calculation time dropped from
>>>> ~20 days to 3.5 hours.
>>>>
>>>> /**
>>>>  * DocumentFound - callback invoked for each found document.
>>>>  */
>>>> public void iterate(SearchOptions options, final DocumentFound found,
>>>>         final Set<String> loadFields) throws Exception {
>>>>     Query query = options.getQuery();
>>>>     Filter queryFilter = options.getQueryFilter();
>>>>     final IndexSearcher indexSearcher = new VolumeSearcher(options)
>>>>             .newIndexSearcher(Executors.newSingleThreadExecutor());
>>>>
>>>>     indexSearcher.search(query, queryFilter, new Collector() {
>>>>
>>>>         @Override
>>>>         public void setScorer(Scorer arg0) throws IOException { }
>>>>
>>>>         @Override
>>>>         public void setNextReader(AtomicReaderContext arg0) throws IOException { }
>>>>
>>>>         @Override
>>>>         public void collect(int docID) throws IOException {
>>>>             Document doc = indexSearcher.doc(docID, loadFields);
>>>>             found.found(doc);
>>>>         }
>>>>
>>>>         @Override
>>>>         public boolean acceptsDocsOutOfOrder() {
>>>>             return true;
>>>>         }
>>>>     });
>>>> }
>>>>
>>>>> On 12 Nov. 2015, at 21:15, Uwe Schindler <u...@thetaphi.de> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>>> The big question is: Do you need the results paged at all?
>>>>>>
>>>>>> Yup, because if we return all results, we get an OME.
>>>>>
>>>>> You get the OME because the paging collector cannot handle that, so
>>>>> this is an XY problem.
>>>>> Would it not be better if your application just got the results as a
>>>>> stream and processed them one after another? If that is the case (and
>>>>> most statistics need it like that), you are much better off NOT USING
>>>>> TOPDOCS! Your requirement is diametrically opposed to getting
>>>>> top-scoring documents: you want to get ALL results as a sequence.
>>>>>
>>>>>>> Do you need them sorted?
>>>>>>
>>>>>> Nope.
>>>>>
>>>>> OK, so unsorted streaming is the right approach.
>>>>>
>>>>>>> If not, the easiest approach is to use a custom Collector that does
>>>>>>> no sorting and just consumes the results.
>>>>>>
>>>>>> The main bottleneck, as I see it, comes from the next-page search,
>>>>>> which takes ~2-4 seconds.
>>>>>
>>>>> This is because, when paging, the collector has to re-execute the
>>>>> whole query and sort all results again, just with a larger window. So
>>>>> if you have result pages of 50,000 results and you want the second
>>>>> page, it will internally sort 100,000 results, because the first page
>>>>> has to be calculated, too. As you go forward in the results, the
>>>>> window gets larger and larger, until it finally collects all results.
>>>>>
>>>>> So getting the results as a stream by implementing the Collector API
>>>>> is the right way to do this.
>>>>>
>>>>>>>
>>>>>>> Uwe
>>>>>>>
>>>>>>> -----
>>>>>>> Uwe Schindler
>>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>>> http://www.thetaphi.de
>>>>>>> eMail: u...@thetaphi.de
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Valentin Popov [mailto:valentin...@gmail.com]
>>>>>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>>>>>> To: java-user@lucene.apache.org
>>>>>>>> Subject: Re: 500 millions document for loop.
>>>>>>>>
>>>>>>>> Toke, thanks!
>>>>>>>>
>>>>>>>> We will look at this solution; it looks like this is what we need.
>>>>>>>>
>>>>>>>>> On 12 Nov. 2015, at 20:42, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>>>>>>>>>
>>>>>>>>> Valentin Popov <valentin...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> We have ~10 indexes for 500M documents; each document has an
>>>>>>>>>> "archive date" and a "to" address. One of our tasks is to
>>>>>>>>>> calculate statistics on "to" for the last year. Right now we
>>>>>>>>>> search archive_date:(current_date - 1 year) and paginate the
>>>>>>>>>> results at 50k records per page. The bottleneck of that approach
>>>>>>>>>> is that pagination takes too long: even on a powerful server it
>>>>>>>>>> takes ~20 days to execute, which is far too long.
>>>>>>>>>
>>>>>>>>> Lucene does not like deep page requests due to the way the
>>>>>>>>> internal priority queue works. Solr has CursorMark, which should
>>>>>>>>> be fairly simple to emulate in your Lucene handling code:
>>>>>>>>>
>>>>>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>>>>>>
>>>>>>>>> - Toke Eskildsen
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Valentin Popov
>>>>>>
>>>>>> Regards,
>>>>>> Valentin Popov
>>>> Regards,
>>>> Valentin Popov
>>
>> Regards,
>> Valentin Popov
>
> Regards, Valentin Popov

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
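Returning to the question at the top of the thread (keeping the fast streaming iterator but ordering by date): one possible approach, sketched below in plain Java with hypothetical types, is to have the collector record only a lightweight (date, docID) key per hit during the single unsorted pass, sort those keys in memory afterwards, and only then fetch stored fields in date order. This is an assumption, not something from the thread; for 500M hits the key list itself is several gigabytes, so it may need chunking or an on-disk sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class StreamThenSortDemo {
    // Hypothetical stand-in for what the Collector would record per hit:
    // just the sort key (archive date as epoch millis) and the global doc ID.
    record Hit(long dateMillis, int globalDocId) {}

    // Sort the lightweight keys after the single streaming pass; stored
    // fields can then be loaded in date order, one document at a time.
    static List<Hit> sortByDate(List<Hit> collected) {
        List<Hit> sorted = new ArrayList<>(collected);
        sorted.sort(Comparator.comparingLong(Hit::dateMillis));
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(new Hit(300, 0), new Hit(100, 1), new Hit(200, 2));
        // Earliest date (100) belongs to doc 1.
        System.out.println(sortByDate(hits).get(0).globalDocId()); // prints 1
    }
}
```

This keeps the single-pass I/O pattern that made the streaming collector fast, and defers the expensive stored-field reads until the order in which to read them is known.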