I'm hoping I'm just not using the streaming API correctly. I have about 30M
docs (~ 15 collections) in production right now that work well with just 4GB
of heap (no streaming). I can't believe streaming would choke on my test
data.

I guess there are 2 primary requirements. Reindexing an entire collection
must be done over night (so 8 hrs or so). And search must perform reasonably
well with grouped results, stats per group, and the grouping based on a
runtime dynamic set of fields. I have a plugin analytics query that does the
searching now that is passable (couple secs for 10M results, 30 secs for 40M
results), however it doesn't support sharded collections. The reindexing
(with atomic updates) takes ~5 hrs for 10M docs, or 20 hrs for 40M docs. The
problem is that I will soon have customers with 100M docs so my solution
will not scale.

To speed up reindexing I'm trying to normalize the data. The key fields that
get updated represent 1/30th the number of documents. Placing those fields
in a separate collection (unique values only) would allow reindexing to
finish much quicker. Search results must include data from both collections
(there could be a 3rd collection too but I'm trying to simplify here).
Streaming requires like sorts for grouping and merging, and these sorts are
different. Then there's the user's sort preference to consider. So what is
the right way to stream (for example) grouped results (on User, Date,
Vendor, Manufacturer) with merged Vendor data, then apply a final sort on
Manufacturer?




Joel Bernstein wrote
> It's likely that the SortStream is the issue. With the sort function you
> need enough memory to sort all the tuples coming from the underlying
> stream. The sort stream can also be done in parallel so you can split the
> tuples from col1 across N worker nodes. This will give you faster sorting
> and apply more memory to the sort.
> 
> Can you describe your exact use case? Perhaps we can think about a
> different Streaming flow that would work for you.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Specify-sorting-of-merged-streams-tp4285026p4288116.html
Sent from the Solr - User mailing list archive at Nabble.com.

Reply via email to