Re: 500 millions document for loop.

Valentin Popov Thu, 12 Nov 2015 10:24:28 -0800

Hi, 
> On 12 Nov 2015, at 21:15, Uwe Schindler <u...@thetaphi.de> wrote:
> 
> Hi,
> 
>>> The big question is: Do you need the results paged at all?
>> 
>> Yup, because if we return all results, we get an OutOfMemoryError.
> 
> You get the OutOfMemoryError because the paging collector cannot handle that, so this is 
> an XY problem. Would it not be better if your application just got the 
> results as a stream and processed them one after another? If this is the 
> case (and most statistics need it like that), you're much better off NOT USING 
> TopDocs! Your requirement is diametrically opposed to getting top-scoring documents. 
> You want to get ALL results as a sequence.


Yes, thanks.

If possible, could you provide a code example?

> 
>>> Do you need them sorted?
>> 
>> Nope.
> 
> OK, so unsorted streaming is the right approach.
> 
>>> If not, the easiest approach is to use a custom Collector that does no
>>> sorting and just consumes the results.
>> 
>> The main bottleneck, as I see it, comes from the next-page search, which takes ~2-4
>> seconds.
> 
> This is because, when paging, the collector has to re-execute the whole query 
> and sort all results again, just with a larger window. So if you have result 
> pages of 50000 results and you want to get the second page, it will 
> internally sort 100000 results, because the first page needs to be 
> calculated, too. As you go forward in the results, the window gets larger and 
> larger, until it finally collects all results.
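
To put rough numbers on the growing window described above, here is a minimal
sketch. It assumes that the request for page k has to collect and sort
k * pageSize hits, as in the 50000/100000 example. The class and method names
are illustrative only and are not part of Lucene; the page count is an assumed
figure, not one taken from this thread.

```java
public class PagingCost {
    // Total hits collected when fetching pages 1..pages one after another,
    // assuming the request for page k collects k * pageSize hits.
    static long totalHitsCollected(long pageSize, long pages) {
        long total = 0;
        for (long k = 1; k <= pages; k++) {
            total += k * pageSize;
        }
        return total;
    }

    // Hits collected when every match is consumed exactly once, as a stream.
    static long totalHitsStreamed(long pageSize, long pages) {
        return pageSize * pages;
    }

    public static void main(String[] args) {
        long pageSize = 50_000;
        long pages = 1_000; // assumption: 50 million matching documents
        System.out.println("paged:    " + totalHitsCollected(pageSize, pages));
        System.out.println("streamed: " + totalHitsStreamed(pageSize, pages));
    }
}
```

Under these assumptions the paged loop collects about 25 billion hits, against
50 million for a single streaming pass, so the work grows quadratically with
the number of pages.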

Does this mean we are not using cursor-based iteration?
> 
> So just getting the results as a stream by implementing the Collector API is the 
> right way to do this.

Thanks!
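
The shape of the streaming approach can be sketched in plain Java. This is not
Lucene code: HitConsumer is a hypothetical stand-in for a collector's per-hit
callback, and the String array is an assumed lookup from document ID to «to»
address. It only shows that each hit is visited once and counted immediately,
with no sorting and no paging window.

```java
import java.util.HashMap;
import java.util.Map;

public class ToAddressTally {
    /** Hypothetical stand-in for a collector's per-hit callback; not a Lucene type. */
    interface HitConsumer {
        void collect(int docId);
    }

    /** Counts «to» addresses as hits arrive; memory grows with distinct addresses, not with hits. */
    static final class CountingConsumer implements HitConsumer {
        private final String[] toByDocId; // assumed lookup: doc ID -> «to» address
        final Map<String, Long> counts = new HashMap<>();

        CountingConsumer(String[] toByDocId) {
            this.toByDocId = toByDocId;
        }

        @Override
        public void collect(int docId) {
            counts.merge(toByDocId[docId], 1L, Long::sum);
        }
    }

    /** Feeds every doc ID to the consumer once, in index order, with no sorting or paging. */
    static Map<String, Long> tally(String[] toByDocId) {
        CountingConsumer consumer = new CountingConsumer(toByDocId);
        for (int docId = 0; docId < toByDocId.length; docId++) {
            consumer.collect(docId);
        }
        return consumer.counts;
    }

    public static void main(String[] args) {
        String[] toByDocId = {"a@example.org", "b@example.org", "a@example.org"};
        System.out.println(tally(toByDocId));
    }
}
```

In Lucene itself the callback would be a Collector implementation handed to the
searcher, and the «to» value would be read from the index rather than from an
array.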

> 
>>> 
>>> Uwe
>>> 
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>> 
>>>> -----Original Message-----
>>>> From: Valentin Popov [mailto:valentin...@gmail.com]
>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: 500 millions document for loop.
>>>> 
>>>> Toke, thanks!
>>>> 
>>>> We will look at this solution; it looks like exactly what we need.
>>>> 
>>>> 
>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>>>>> 
>>>>> Valentin Popov <valentin...@gmail.com> wrote:
>>>>> 
>>>>>> We have ~10 indexes holding 500M documents. Each document
>>>>>> has an «archive date» and a «to» address, and one of our tasks is to
>>>>>> calculate statistics on «to» for the last year. Right now we
>>>>>> search for archive_date:(current_date - 1 year) and paginate the
>>>>>> results at 50k records per page. The bottleneck of that approach is that
>>>>>> pagination takes too long: on a powerful server it takes
>>>>>> ~20 days to execute, which is far too long.
>>>>> 
>>>>> Lucene does not like deep page requests due to the way the internal
>>>>> Priority Queue works. Solr has CursorMark, which should be fairly simple to
>>>>> emulate in your Lucene handling code:
>>>>> 
>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>> 
>>>>> - Toke Eskildsen
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>> 
>>>> 
>>>> Regards,
>>>> Valentin Popov
>>>> 
>>> 
>> 
>> 
>> Regards,
>> Valentin Popov
>> 
> 

Regards,
Valentin Popov





