Hello, Uwe.

Is it possible to use the same fast iterator, but apply sorting by date?

Regards,
Valentin. 

> On 14 Nov 2015, at 15:54, Uwe Schindler <u...@thetaphi.de> wrote:
> 
> For performance reasons, I would also return "false" for "out of order"
> documents. This allows stored fields to be accessed more efficiently
> (otherwise the reader seeks too much). For this type of collector the I/O
> cost is higher than the small compute saving gained from out-of-order
> scoring.
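> 
> A minimal sketch of that change, against the collector code quoted further
> down in this thread (Lucene 4.x Collector API):
> 
>     @Override
>     public boolean acceptsDocsOutOfOrder() {
>         // false: docIDs arrive in increasing order per segment, so
>         // stored-field reads seek mostly forward instead of jumping around
>         return false;
>     }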
> 
> Kind regards,
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
>> -----Original Message-----
>> From: Valentin Popov [mailto:valentin...@gmail.com]
>> Sent: Saturday, November 14, 2015 1:51 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: 500 millions document for loop.
>> 
>> Thank you very much!
>> 
>> 
>>> On 14 Nov 2015, at 15:49, Uwe Schindler <u...@thetaphi.de> wrote:
>>> 
>>> Hi,
>>> 
>>> This code is buggy! The collect() call of the collector does not get a
>>> document ID relative to the top-level IndexSearcher; it only gets a
>>> document ID relative to the reader reported in setNextReader (which is an
>>> atomic reader responsible for a single Lucene index segment).
>>> 
>>> In setNextReader, save a reference to the "current" reader, and use this
>>> "current" reader to get the stored fields:
>>> 
>>> indexSearcher.search(query, queryFilter, new Collector() {
>>>     AtomicReader current;
>>> 
>>>     @Override
>>>     public void setScorer(Scorer arg0) throws IOException { }
>>> 
>>>     @Override
>>>     public void setNextReader(AtomicReaderContext ctx) throws IOException {
>>>         // remember the per-segment reader; collect() gets docIDs relative to it
>>>         current = ctx.reader();
>>>     }
>>> 
>>>     @Override
>>>     public void collect(int docID) throws IOException {
>>>         // load stored fields through the segment's own reader
>>>         Document doc = current.document(docID, loadFields);
>>>         found.found(doc);
>>>     }
>>> 
>>>     @Override
>>>     public boolean acceptsDocsOutOfOrder() {
>>>         return true;
>>>     }
>>> });
>>> 
>>> Otherwise you get wrong document ids reported!!!
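>>> 
>>> (A hedged side note: if a top-level docID were ever needed, it can be
>>> derived by adding the segment's start offset, which Lucene exposes as the
>>> public docBase field on the reader context:
>>> 
>>>     // e.g. also saved in setNextReader: this.docBase = ctx.docBase;
>>>     int globalDocID = docBase + docID;
>>> 
>>> For loading stored fields, the per-segment reader above is all you need.)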
>>> 
>>> Uwe
>>> 
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>> 
>>>> -----Original Message-----
>>>> From: Valentin Popov [mailto:valentin...@gmail.com]
>>>> Sent: Saturday, November 14, 2015 1:04 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: 500 millions document for loop.
>>>> 
>>>> Hi, Uwe.
>>>> 
>>>> Thanks for your advice.
>>>> 
>>>> After implementing your suggestion, our calculation time dropped from
>>>> ~20 days to 3.5 hours.
>>>> 
>>>> /**
>>>>  * DocumentFound - callback function for each document
>>>>  */
>>>> public void iterate(SearchOptions options, final DocumentFound found,
>>>>         final Set<String> loadFields) throws Exception {
>>>>     Query query = options.getQuery();
>>>>     Filter queryFilter = options.getQueryFilter();
>>>>     final IndexSearcher indexSearcher = new VolumeSearcher(options)
>>>>             .newIndexSearcher(Executors.newSingleThreadExecutor());
>>>> 
>>>>     indexSearcher.search(query, queryFilter, new Collector() {
>>>> 
>>>>         @Override
>>>>         public void setScorer(Scorer arg0) throws IOException { }
>>>> 
>>>>         @Override
>>>>         public void setNextReader(AtomicReaderContext arg0) throws IOException { }
>>>> 
>>>>         @Override
>>>>         public void collect(int docID) throws IOException {
>>>>             Document doc = indexSearcher.doc(docID, loadFields);
>>>>             found.found(doc);
>>>>         }
>>>> 
>>>>         @Override
>>>>         public boolean acceptsDocsOutOfOrder() {
>>>>             return true;
>>>>         }
>>>>     });
>>>> }
>>>> 
>>>> 
>>>>> On 12 Nov 2015, at 21:15, Uwe Schindler <u...@thetaphi.de> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>>>> The big question is: Do you need the results paged at all?
>>>>>> 
>>>>>> Yup, because if we return all results, we get an OOM (OutOfMemoryError).
>>>>> 
>>>>> You get the OOM because the paging collector cannot handle that many
>>>>> results, so this is an XY problem. Would it not be better if your
>>>>> application just got the results as a stream and processed them one
>>>>> after another? If that is the case (and most statistics jobs work like
>>>>> that), you are much better off NOT USING TOPDOCS!!!! Your requirement
>>>>> is diametrically opposed to getting top-scoring documents! You want to
>>>>> get ALL results as a sequence.
>>>>> 
>>>>>>> Do you need them sorted?
>>>>>> 
>>>>>> Nope.
>>>>> 
>>>>> OK, so unsorted streaming is the right approach.
>>>>> 
>>>>>>> If not, the easiest approach is to use a custom Collector that does
>>>>>>> no sorting and just consumes the results.
>>>>>> 
>>>>>> The main bottleneck, as I see it, comes from the next-page search,
>>>>>> which takes ~2-4 seconds.
>>>>> 
>>>>> This is because, when paging, the collector has to re-execute the whole
>>>>> query and sort all results again, just with a larger window. So if you
>>>>> have result pages of 50000 results and you want the second page, it
>>>>> internally sorts 100000 results, because the first page has to be
>>>>> calculated, too. As you go forward in the results, the window gets
>>>>> larger and larger, until it finally collects all results.
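>>>>> 
>>>>> A quick worked example of that cost (illustrative numbers, not from
>>>>> this thread): with 50,000 results per page, reading all 200 pages of a
>>>>> 10M-result set collects 50k + 100k + ... + 10M documents, i.e. about
>>>>> pageSize * P * (P + 1) / 2 = 50,000 * 200 * 201 / 2, roughly one
>>>>> billion collected hits for only ten million actual results.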
>>>>> 
>>>>> So just getting the results as a stream by implementing the Collector
>>>>> API is the right way to do this.
>>>>> 
>>>>>>> 
>>>>>>> Uwe
>>>>>>> 
>>>>>>> -----
>>>>>>> Uwe Schindler
>>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>>> http://www.thetaphi.de
>>>>>>> eMail: u...@thetaphi.de
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Valentin Popov [mailto:valentin...@gmail.com]
>>>>>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>>>>>> To: java-user@lucene.apache.org
>>>>>>>> Subject: Re: 500 millions document for loop.
>>>>>>>> 
>>>>>>>> Toke, thanks!
>>>>>>>> 
>>>>>>>> We will look at this solution; it looks like exactly what we need.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>>>>>>>>> 
>>>>>>>>> Valentin Popov <valentin...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> We have ~10 indexes for 500M documents; each document
>>>>>>>>>> has an «archive date» and a «to» address. One of our tasks is
>>>>>>>>>> to calculate statistics on «to» for the last year. Right now we
>>>>>>>>>> search archive_date:(current_date - 1 year) and paginate the
>>>>>>>>>> results at 50k records per page. The bottleneck of that approach
>>>>>>>>>> is that pagination takes too long: even on a powerful server it
>>>>>>>>>> takes ~20 days to execute.
>>>>>>>>> 
>>>>>>>>> Lucene does not like deep page requests due to the way the
>>>>>>>>> internal priority queue works. Solr has CursorMark, which should
>>>>>>>>> be fairly simple to emulate in your Lucene handling code:
>>>>>>>>> 
>>>>>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
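>>>>>>>>> 
>>>>>>>>> As a hedged sketch of that emulation in plain Lucene (searchAfter
>>>>>>>>> is the IndexSearcher method such a cursor builds on; searcher,
>>>>>>>>> query and PAGE_SIZE are placeholder names, not from this thread):
>>>>>>>>> 
>>>>>>>>>     ScoreDoc after = null;  // cursor: last hit of the previous page
>>>>>>>>>     while (true) {
>>>>>>>>>         TopDocs page = (after == null)
>>>>>>>>>                 ? searcher.search(query, PAGE_SIZE)
>>>>>>>>>                 : searcher.searchAfter(after, query, PAGE_SIZE);
>>>>>>>>>         if (page.scoreDocs.length == 0) break;  // no more results
>>>>>>>>>         for (ScoreDoc sd : page.scoreDocs) {
>>>>>>>>>             // process sd.doc here
>>>>>>>>>         }
>>>>>>>>>         // advance the cursor; earlier pages are never re-sorted
>>>>>>>>>         after = page.scoreDocs[page.scoreDocs.length - 1];
>>>>>>>>>     }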
>>>>>>>>> 
>>>>>>>>> - Toke Eskildsen
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Valentin Popov
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Valentin Popov
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Valentin Popov
>>>> 
>>> 
>> 
>> 
>> Regards,
>> Valentin Popov
>> 
> 

Regards,
Valentin Popov

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
