Hi,

This code is buggy! The collect() call of the Collector does not get a document 
ID relative to the top-level IndexSearcher; it only gets a document ID relative 
to the reader passed to setNextReader (which is an atomic reader responsible 
for a single Lucene index segment).

In setNextReader, save a reference to the "current" reader, and use that 
"current" reader to load the stored fields:

                indexSearcher.search(query, queryFilter, new Collector() {
                        // reader of the segment currently being collected
                        AtomicReader current;

                        @Override
                        public void setScorer(Scorer scorer) throws IOException { }

                        @Override
                        public void setNextReader(AtomicReaderContext ctx) throws IOException {
                                current = ctx.reader();
                        }

                        @Override
                        public void collect(int docID) throws IOException {
                                // docID is relative to "current", so load the stored fields from it
                                Document doc = current.document(docID, loadFields);
                                found.found(doc);
                        }

                        @Override
                        public boolean acceptsDocsOutOfOrder() {
                                return true;
                        }
                });

Otherwise you get wrong document IDs reported!!!
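
Alternatively, if you want to keep resolving documents through the top-level IndexSearcher, 
here is a minimal sketch of the same collector (assuming the Lucene 4.x Collector API and the 
query, queryFilter, loadFields, and found variables from your iterate() method) that remembers 
the segment's docBase from the AtomicReaderContext and adds it to the per-segment docID:

                indexSearcher.search(query, queryFilter, new Collector() {
                        // offset of the current segment in the top-level docID space
                        int docBase;

                        @Override
                        public void setScorer(Scorer scorer) throws IOException { }

                        @Override
                        public void setNextReader(AtomicReaderContext ctx) throws IOException {
                                docBase = ctx.docBase;
                        }

                        @Override
                        public void collect(int docID) throws IOException {
                                // translate the per-segment docID to a top-level docID
                                Document doc = indexSearcher.doc(docBase + docID, loadFields);
                                found.found(doc);
                        }

                        @Override
                        public boolean acceptsDocsOutOfOrder() {
                                return true;
                        }
                });

Reading from the per-segment reader (as in the first snippet) is slightly cheaper, because it 
skips the sub-reader lookup the composite reader does for every doc() call, but both variants 
return the correct documents.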

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Valentin Popov [mailto:valentin...@gmail.com]
> Sent: Saturday, November 14, 2015 1:04 PM
> To: java-user@lucene.apache.org
> Subject: Re: 500 millions document for loop.
> 
> Hi, Uwe.
> 
> Thanks for your advice.
> 
> After implementing your suggestion, our calculation time dropped from ~20
> days to 3.5 hours.
> 
> /**
> *
> * DocumentFound - callback function for each document
> */
> public void iterate(SearchOptions options, final DocumentFound found,
>               final Set<String> loadFields) throws Exception {
>       Query query = options.getQuery();
>       Filter queryFilter = options.getQueryFilter();
>       final IndexSearcher indexSearcher = new VolumeSearcher(options)
>                       .newIndexSearcher(Executors.newSingleThreadExecutor());
> 
>       indexSearcher.search(query, queryFilter, new Collector() {
> 
>               @Override
>               public void setScorer(Scorer arg0) throws IOException { }
> 
>               @Override
>               public void setNextReader(AtomicReaderContext arg0) throws IOException { }
> 
>               @Override
>               public void collect(int docID) throws IOException {
>                       Document doc = indexSearcher.doc(docID, loadFields);
>                       found.found(doc);
>               }
> 
>               @Override
>               public boolean acceptsDocsOutOfOrder() {
>                       return true;
>               }
>       });
> }
> 
> 
> > On 12 Nov. 2015, at 21:15, Uwe Schindler <u...@thetaphi.de> wrote:
> >
> > Hi,
> >
> >>> The big question is: Do you need the results paged at all?
> >>
> >> Yup, because if we return all results, we get OME.
> >
> > You get the OME because the paging collector cannot handle that, so this is
> > an XY problem. Would it not be better if your application just got the results
> > as a stream and processed them one after another? If that is the case (and
> > most statistics need it like that), you are much better off NOT USING TOPDOCS!!!!
> > Your requirement is diametrically opposed to getting top-scoring documents! You
> > want to get ALL results as a sequence.
> >
> >>> Do you need them sorted?
> >>
> >> Nope.
> >
> > OK, so unsorted streaming is the right approach.
> >
> >>> If not, the easiest approach is to use a custom Collector that does no
> >> sorting and just consumes the results.
> >>
> >> Main bottleneck as I see come from next page search, that took ~2-4
> >> seconds.
> >
> > This is because, when paging, the collector has to re-execute the whole
> > query and sort all results again, just with a larger window. So if you have
> > result pages of 50000 results and you want the second page, it will
> > internally sort 100000 results, because the first page needs to be calculated,
> > too. The further you go forward in the results, the larger the window gets,
> > until it finally collects all results.
> >
> > So just getting the results as a stream by implementing the Collector API is
> > the right way to do this.
> >
> >>>
> >>> Uwe
> >>>
> >>> -----
> >>> Uwe Schindler
> >>> H.-H.-Meier-Allee 63, D-28213 Bremen
> >>> http://www.thetaphi.de
> >>> eMail: u...@thetaphi.de
> >>>
> >>>> -----Original Message-----
> >>>> From: Valentin Popov [mailto:valentin...@gmail.com]
> >>>> Sent: Thursday, November 12, 2015 6:48 PM
> >>>> To: java-user@lucene.apache.org
> >>>> Subject: Re: 500 millions document for loop.
> >>>>
> >>>> Toke, thanks!
> >>>>
> >>>> We will look at this solution, looks like this is that what we need.
> >>>>
> >>>>
> >>>>> On 12 Nov. 2015, at 20:42, Toke Eskildsen <t...@statsbiblioteket.dk>
> >>>> wrote:
> >>>>>
> >>>>> Valentin Popov <valentin...@gmail.com> wrote:
> >>>>>
> >>>>>> We have ~10 indexes for 500M documents; each document
> >>>>>> has an «archive date» and a «to» address, and one of our tasks is
> >>>>>> to calculate statistics of «to» for the last year. Right now we are
> >>>>>> using the search archive_date:(current_date - 1 year) and paginating
> >>>>>> the results at 50k records per page. The bottleneck of that approach is
> >>>>>> that pagination takes too long: even on a powerful server it takes
> >>>>>> ~20 days to execute, which is far too long.
> >>>>>
> >>>>> Lucene does not like deep page requests due to the way the internal
> >>>>> Priority Queue works. Solr has CursorMark, which should be fairly simple
> >>>>> to emulate in your Lucene handling code:
> >>>>>
> >>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>>>>
> >>>>> - Toke Eskildsen
> >>>>>
> >>>>>
> >>>>
> >>>> Regards,
> >>>> Valentin Popov
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >> Best regards,
> >> Valentin Popov
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
> 
> 
> Best regards,
> Valentin Popov
> 
> 
> 
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
