I'm hoping I'm just not using the streaming API correctly. I have about 30M docs (~ 15 collections) in production right now that work well with just 4GB of heap (no streaming). I can't believe streaming would choke on my test data.
I guess there are 2 primary requirements. Reindexing an entire collection must be done over night (so 8 hrs or so). And search must perform reasonably well with grouped results, stats per group, and the grouping based on a runtime dynamic set of fields. I have a plugin analytics query that does the searching now that is passable (couple secs for 10M results, 30 secs for 40M results), however it doesn't support sharded collections. The reindexing (with atomic updates) takes ~5 hrs for 10M docs, or 20 hrs for 40M docs. The problem is that I will soon have customers with 100M docs so my solution will not scale. To speed up reindexing I'm trying to normalize the data. The key fields that get updated represent 1/30th the number of documents. Placing those fields in a separate collection (unique values only) would allow reindexing to finish much quicker. Search results must include data from both collections (there could be a 3rd collection too but I'm trying to simplify here). Streaming requires like sorts for grouping and merging, and these sorts are different. Then there's the user's sort preference to consider. So what is the right way to stream (for example) grouped results (on User, Date, Vendor, Manufacturer) with merged Vendor data, then apply a final sort on Manufacturer? Joel Bernstein wrote > It's likely that the SortStream is the issue. With the sort function you > need enough memory to sort all the tuples coming from the underlying > stream. The sort stream can also be done in parallel so you can split the > tuples from col1 across N worker nodes. This will give you faster sorting > and apply more memory to the sort. > > Can you describe your exact use case? Perhaps we can think about a > different Streaming flow that would work for you. -- View this message in context: http://lucene.472066.n3.nabble.com/Specify-sorting-of-merged-streams-tp4285026p4288116.html Sent from the Solr - User mailing list archive at Nabble.com.