Hi, I have two related questions about routing write requests. Thanks in advance for answering!
*Question 1:* I saw this code in the EsOutputFormat class and was wondering why it's there:

- https://github.com/elasticsearch/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/mr/EsOutputFormat.java#L221
- https://github.com/elasticsearch/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/rest/RestRepository.java#L239

```java
Map<Shard, Node> targetShards = repository.getWriteTargetPrimaryShards();
repository.close();

List<Shard> orderedShards = new ArrayList<Shard>(targetShards.keySet());
// make sure the order is strict
Collections.sort(orderedShards);

// if there's no task info, just pick a random bucket
if (currentInstance <= 0) {
    currentInstance = new Random().nextInt(targetShards.size()) + 1;
}
int bucket = currentInstance % targetShards.size();
Shard chosenShard = orderedShards.get(bucket);
Node targetNode = targetShards.get(chosenShard);

// override the global settings to communicate directly with the target node
settings.setHosts(targetNode.getIpAddress()).setPort(targetNode.getHttpPort());
repository = new RestRepository(settings);
uri = SettingsUtils.nodes(settings).get(0);
```

I was trying to understand why this is being done. My understanding of ES writes was:

- You can send a write request to any node in the cluster.
- The receiving node hashes the document id to determine which shard, and therefore which node, holds the primary.
- The write is then re-routed to the node holding the primary for that document id/hash.

If that's the case (as in data grids like Hazelcast or Coherence, or NoSQL stores like HBase, Cassandra, etc.), why the special logic to find the primaries? (See http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/distrib-write.html.) The primary chosen here may not even be the one that handles the write for my document. Is this being done simply to balance writes uniformly across the cluster? In that case, why not a simple round robin?
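For context, my mental model of the default routing is something like the sketch below. This is a toy illustration only, not Elasticsearch's actual implementation (which, as I understand it, hashes the routing value with Murmur3); `shardFor` and the use of `String.hashCode` are my own assumptions for the example.

```java
import java.util.List;

public class RoutingSketch {
    // Toy stand-in for ES routing: hash the routing value (the document id
    // by default) and take it modulo the number of primary shards. The real
    // implementation uses Murmur3, not String.hashCode.
    static int shardFor(String docId, int numPrimaries) {
        return Math.floorMod(docId.hashCode(), numPrimaries);
    }

    public static void main(String[] args) {
        int numPrimaries = 5;
        for (String id : List.of("doc-1", "doc-2", "doc-3")) {
            System.out.println(id + " -> shard " + shardFor(id, numPrimaries));
        }
    }
}
```

The point being: whichever node first receives the write, the target shard is fully determined by the document id, which is why I'd expect the coordinating node to forward the write regardless of which node es-hadoop picked.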
(See also https://github.com/elasticsearch/elasticsearch-hadoop/edit/1.3/docs/src/reference/asciidoc/core/arch.adoc.)

*Question 2:* Is there a way to hook into the cluster's routing logic? Instead of picking up documents in bulk and sending them to some node in ES, which then resends them to the primaries for the different documents:

- Is there a way to find out which node holds the primary for a given document?
- Could all documents destined for a specific primary be collected and sent directly to that node? (Like Hazelcast's or Coherence's partition awareness: http://www.hazelcast.org/docs/3.0/manual/html-single/#DataAffinity)
- One problem with this is that if the primary dies, the sender has to detect the topology change and resend the data to the new primary. So we're not talking simple clients but smart-client libraries. Perhaps I should be asking: does the Java client library already do this? (I didn't think so.)
- Overall, wouldn't this save a lot of data re-routing over the network?

Thanks.
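To make the idea concrete, here is a hedged sketch of what I mean by partition-aware bulking: group documents by their computed target shard, then send one batch per shard directly to that shard's primary node. The shard calculation is again a toy stand-in for the real Murmur3 routing, and the shard-to-node lookup (here just printed) is assumed to come from somewhere like the cluster state API; none of this is an existing client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionAwareBulk {
    // Toy routing function, standing in for Elasticsearch's Murmur3 routing.
    static int shardFor(String docId, int numPrimaries) {
        return Math.floorMod(docId.hashCode(), numPrimaries);
    }

    // Group document ids by target shard so each group can be sent as one
    // bulk request to the node that holds that shard's primary.
    static Map<Integer, List<String>> groupByShard(List<String> docIds, int numPrimaries) {
        Map<Integer, List<String>> batches = new HashMap<>();
        for (String id : docIds) {
            batches.computeIfAbsent(shardFor(id, numPrimaries), s -> new ArrayList<>())
                   .add(id);
        }
        return batches;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> batches =
            groupByShard(List.of("a", "b", "c", "d", "e"), 3);
        // In the real scenario, each batch would become one bulk request
        // addressed to the primary node for that shard.
        batches.forEach((shard, docs) ->
            System.out.println("shard " + shard + " <- " + docs));
    }
}
```

The open question is whether anything in the ES client stack already does this grouping, and how it would react to shard relocation mid-job.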
