I can see I may need to rethink some things. I have two joins: one is 1-to-1
(very large) and one is 1-to-0.03. A hashJoin may work for the smaller one.
The large join looks like it may not be feasible as a full join, but I might
get away with treating it as a filter instead, since I don't need any fields
from the joined documents. For example: include the col1 document (id=123)
only if col2 contains a document with id=123. A rough sketch of both ideas is
below.
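
To make that concrete, here's roughly what I'm picturing. The collection and
field names (col1, col2, joinKey) are made up, and I haven't actually tried
these expressions yet. For the smaller join, a hashJoin with the smaller side
as the hashed= stream, since that's the one read into memory:

  hashJoin(
    search(col1, q="*:*", fl="id,joinKey", sort="joinKey asc", qt="/export"),
    hashed=search(col2, q="*:*", fl="joinKey,extraField", sort="joinKey asc", qt="/export"),
    on="joinKey"
  )

For the big join, since I only care about existence, an intersect might do
the filtering: it emits only the tuples from the first stream whose join key
also appears in the second stream (both streams sorted on that key), and no
fields from col2 come back:

  intersect(
    search(col1, q="*:*", fl="id,joinKey", sort="joinKey asc", qt="/export"),
    search(col2, q="*:*", fl="joinKey", sort="joinKey asc", qt="/export"),
    on="joinKey"
  )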

This whole chain backs a real-time user search. A 1-2 second response would
be ideal, but I'm willing to sacrifice some query speed in order to make
reindexing run much faster.

Concurrency is low - maybe a dozen concurrent queries. Have you read any
blogs on balancing the number of shards vs. the number of replicas? Any
guidelines for estimating how many VMs this might require would be great.


Joel Bernstein wrote
> A few other things for you to consider:
> 
> 1) How big are the joins?
> 2) How fast do they need to go?
> 3) How many queries need to run concurrently?
> 
> #1 and #2 will dictate how many shards, replicas and parallel workers are
> needed to perform the join. #3 needs to be carefully considered because
> MapReduce distributed joins are not going to scale like traditional Solr
> queries.




