I can see I may need to rethink some things. I have two joins: one is 1-to-1 (very large) and one is 1-to-0.03. A hashJoin may work for the smaller one, but the large join looks like it may not be feasible. I might get away with treating it as a filter, since I don't need any fields from the joined documents: include the col1 document (id=123) only if col2 contains a document with id=123.
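To make the "join as a filter" idea concrete, here is a rough sketch as a Solr streaming expression (Solr 6.x syntax). The small side is hashed but exposes only its join key, so no fields from it survive into the output and it acts purely as an existence filter (a semi-join). The collection and field names (col1, col2, id, title) are placeholders from my description, not a real schema:

```
hashJoin(
  search(col1, q="*:*", fl="id,title", sort="id asc", qt="/export"),
  hashed=search(col2, q="*:*", fl="id",  sort="id asc", qt="/export"),
  on="id"
)
```

Because fl on the hashed side is just "id", the emitted tuples carry only col1 fields. The caveat is that hashJoin buffers the entire hashed stream in memory, so this only works if the col2 key set fits on the worker node.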
This whole chain is a real-time user search. A 1-2 second response would be ideal, but I'm sacrificing speed in order to make the reindexing run much faster. Concurrency is low: about a dozen simultaneous queries. Have you read any blog posts on balancing the number of shards vs. the number of replicas? Any guidelines for estimating how many VMs this may require would be great.

Joel Bernstein wrote
> A few other things for you to consider:
>
> 1) How big are the joins?
> 2) How fast do they need to go?
> 3) How many queries need to run concurrently?
>
> #1 and #2 will dictate how many shards, replicas and parallel workers are
> needed to perform the join. #3 needs to be carefully considered because
> MapReduce distributed joins are not going to scale like traditional Solr
> queries.

--
View this message in context: http://lucene.472066.n3.nabble.com/Specify-sorting-of-merged-streams-tp4285026p4288194.html
Sent from the Solr - User mailing list archive at Nabble.com.
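For reference, the "parallel workers" knob Joel mentions would look roughly like this in a streaming expression: wrap the join in a parallel() decorator and add partitionKeys to the inner searches so each worker receives a disjoint slice of the join key space. This is a sketch under assumed Solr 6.x syntax; workers_collection and the other names are placeholders, and the worker count would be tuned against shard and replica counts:

```
parallel(workers_collection,
  hashJoin(
    search(col1, q="*:*", fl="id,title", sort="id asc", qt="/export", partitionKeys="id"),
    hashed=search(col2, q="*:*", fl="id",  sort="id asc", qt="/export", partitionKeys="id"),
    on="id"
  ),
  workers="4", sort="id asc"
)
```

Each worker then hashes only its partition of col2, which also shrinks the per-node memory footprint of the hashed side.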