Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-28 Thread Erick Erickson
Shawn had an interesting idea on another thread. It depends on having basically an identity field (which I see how to do manually, but don't see how to make work as a new field type in a distributed environment). And it's brilliantly simple, just a range query identity:{ TO *]&sort=identity asc
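
A minimal Python sketch of how the range-query idea could be driven from a client. This is one reading of the suggestion, not Shawn's or Erick's code. It assumes `identity` is a unique, sortable field, that the lower bound of the range is the last identity value already seen, and that identity values need no query escaping. `fetch` is a hypothetical callable standing in for whatever client actually sends the parameters to Solr; `next_page_params` and `iterate_all` are made-up helper names.

```python
from typing import Callable, Dict, Iterator, List, Optional

Doc = Dict[str, str]


def next_page_params(last_id: Optional[str], rows: int) -> Dict[str, str]:
    """Build query parameters for the page that follows last_id."""
    if last_id is None:
        # First page: no lower bound yet (assumption: every doc has the field).
        query = "identity:[* TO *]"
    else:
        # '{' makes the lower bound exclusive, so the last doc seen is not repeated.
        query = f"identity:{{{last_id} TO *]"
    # start stays at 0 on every page; only the range moves forward.
    return {"q": query, "sort": "identity asc", "rows": str(rows), "start": "0"}


def iterate_all(fetch: Callable[[Dict[str, str]], List[Doc]],
                rows: int = 1000) -> Iterator[Doc]:
    """Yield every document by repeatedly asking for the next range."""
    last_id: Optional[str] = None
    while True:
        docs = fetch(next_page_params(last_id, rows))
        if not docs:
            return
        yield from docs
        # Results are sorted by identity, so the last doc carries the new bound.
        last_id = docs[-1]["identity"]
```

Because `start` never grows, each request only has to rank `rows` hits, which is what makes this attractive compared with increasing `start` on every page.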

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
On Sun, Jul 28, 2013 at 1:25 AM, Yonik Seeley wrote: > > Which part is problematic... the creation of the DocList (the search), > Literally, DocList is a copy of TopDocs. Creating TopDocs is not a search but ranking, and the ranking cost is log(rows+start), on top of numFound, which the search takes.
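
A small Python sketch of why the ranking cost depends on rows+start. It is an illustration only, not Lucene's or Solr's actual collector code: it keeps a bounded min-heap of the best start+rows hits while visiting every match, so each heap update costs about log(rows+start) and the number of matches visited is numFound. The function name `top_docs` and the (doc_id, score) tuple layout are assumptions made for this example.

```python
import heapq
from typing import Iterable, List, Tuple


def top_docs(scored_hits: Iterable[Tuple[int, float]],
             start: int, rows: int) -> Tuple[int, List[Tuple[int, float]]]:
    """Return (num_found, page), where page holds ranks start..start+rows-1."""
    k = start + rows
    heap: List[Tuple[float, int]] = []  # min-heap; weakest kept hit sits at heap[0]
    num_found = 0
    for doc_id, score in scored_hits:
        num_found += 1  # the search visits every matching document
        # Negating doc_id makes the lower doc id win when scores are equal.
        entry = (score, -doc_id)
        if len(heap) < k:
            heapq.heappush(heap, entry)      # O(log k)
        elif heap and entry > heap[0]:
            heapq.heapreplace(heap, entry)   # O(log k)
    ranked = sorted(heap, reverse=True)      # best first
    page = [(-neg_id, score) for score, neg_id in ranked[start:]]
    return num_found, page
```

With a large `start` the heap has to hold every hit ranked ahead of the requested page, which is why deep paging gets expensive even though only `rows` documents are returned.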

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Yonik Seeley
On Sat, Jul 27, 2013 at 5:05 PM, Mikhail Khludnev wrote: > anyway, even if the writer pulls docs one by one, it doesn't allow streaming a > billion of them. Solr writes out DocList, which is really problematic even > in deep-paging scenarios. Which part is problematic... the creation of the DocList (

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Hello, Please find below > Let me just explain better what I found when I dug inside solr: documents > (results of the query) are loaded before they are passed into a writer - so > the writers are expecting to encounter the solr documents, but these > documents were loaded by one of the componen

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Yonik Seeley
On Sat, Jul 27, 2013 at 4:30 PM, Roman Chyla wrote: > Let me just explain better what I found when I dug inside solr: documents > (results of the query) are loaded before they are passed into a writer - so > the writers are expecting to encounter the solr documents, but these > documents were load

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Hi Mikhail, I can see it is lazy-loading, but I can't judge how complex it becomes (presumably, the filter dispatching mechanism is also doing other things - it is there not only for streaming). Let me just explain better what I found when I dug inside solr: documents (results of the query)

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Roman, Let me briefly explain the design: a special RequestParser stores the servlet output stream in the context https://github.com/m-khl/solr-patches/compare/streaming#L7R22 then a special component injects a special PostFilter/DelegatingCollector which writes right into the output https://github.com/m-kh

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Mikhail, If your solution gives lazy loading of solr docs /and thus streaming of huge result lists/ it should be a big YES! Roman On 27 Jul 2013 07:55, "Mikhail Khludnev" wrote: > Otis, > You gave links to 'deep paging' when I asked about response streaming. > Let me understand. From my POV, deep p

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Otis, You gave links to 'deep paging' when I asked about response streaming. Let me understand. From my POV, deep paging is a special case for regular search scenarios. We definitely need it in Solr. However, if we are talking about data-analytics-like problems, when we need to select an "endless" s

Re: Processing a lot of results in Solr

2013-07-25 Thread Otis Gospodnetic
Mikhail, Yes, +1. This question comes up a few times a year. Grant created a JIRA issue for this many moons ago. https://issues.apache.org/jira/browse/LUCENE-2127 https://issues.apache.org/jira/browse/SOLR-1726 Otis -- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring

Re: Processing a lot of results in Solr

2013-07-24 Thread Chris Hostetter
: Subject: Processing a lot of results in Solr : Message-ID: : In-Reply-To: <1374612243070-4079869.p...@n3.nabble.com> https://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an ex

Re: Processing a lot of results in Solr

2013-07-24 Thread Mikhail Khludnev
fwiw, I did a prototype with the following differences: - it streams straight to the socket output stream - it streams continuously during collecting, without needing to store a bitset. It might have some limited extreme usage. Is there anyone interested? On Wed, Jul 24, 2013 at 7:19 PM, Roman C

Re: Processing a lot of results in Solr

2013-07-24 Thread Roman Chyla
On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber wrote: > That sounds like a satisfactory solution for the time being - > I am assuming you dump the data from Solr in a csv format? > JSON > How did you implement the streaming processor ? (what tool did you use for > this? Not familiar with that)

Re: Processing a lot of results in Solr

2013-07-24 Thread Roman Chyla
Mikhail, It is a slightly hacked JSONWriter - actually, while poking around, I have discovered that dumping big hitsets would be possible - the main hurdle right now is that the writer is expecting to receive documents with fields loaded, but if it received something that loads docs lazily, you could

Re: Processing a lot of results in Solr

2013-07-24 Thread Mikhail Khludnev
Roman, Can you disclose how that streaming writer works? What does it stream: docList or docSet? Thanks On Wed, Jul 24, 2013 at 5:57 AM, Roman Chyla wrote: > Hello Matt, > > You can consider writing a batch processing handler, which receives a query > and instead of sending results back, it

Re: Processing a lot of results in Solr

2013-07-23 Thread Matt Lieber
That sounds like a satisfactory solution for the time being - I am assuming you dump the data from Solr in a csv format? How did you implement the streaming processor ? (what tool did you use for this? Not familiar with that) You say it takes a few minutes only to dump the data - how long does it t

Re: Processing a lot of results in Solr

2013-07-23 Thread Roman Chyla
Hello Matt, You can consider writing a batch processing handler, which receives a query and instead of sending results back, it writes them into a file which is then available for streaming (it has its own UUID). I am dumping many GBs of data from solr in a few minutes - your query + streaming write

Re: Processing a lot of results in Solr

2013-07-23 Thread Timothy Potter
Hi Matt, This feature is commonly known as deep paging and Lucene and Solr have issues with it ... take a look at http://solr.pl/en/2011/07/18/deep-paging-problem/ as a potential starting point using filters to bucketize a result set into sets of sub result sets. Cheers, Tim On Tue, Jul 23, 2013

Processing a lot of results in Solr

2013-07-23 Thread Matt Lieber
Hello Solr users, Question regarding processing a lot of docs returned from a query; I potentially have millions of documents returned from a query. What is the common design to deal with this? 2 ideas I have are: - create a client service that is multithreaded to handle this - Use the Sol