Hi, I have two related questions about routing write requests. Thanks in advance for answering!
*Question 1:* I saw this code in the EsOutputFormat class and was wondering why it's there:

- https://github.com/elasticsearch/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/mr/EsOutputFormat.java#L221
- https://github.com/elasticsearch/elasticsearch-hadoop/blob/master/mr/src/main/java/org/elasticsearch/hadoop/rest/RestRepository.java#L239

```java
Map<Shard, Node> targetShards = repository.getWriteTargetPrimaryShards();
repository.close();

List<Shard> orderedShards = new ArrayList<Shard>(targetShards.keySet());
// make sure the order is strict
Collections.sort(orderedShards);

// if there's no task info, just pick a random bucket
if (currentInstance <= 0) {
    currentInstance = new Random().nextInt(targetShards.size()) + 1;
}
int bucket = currentInstance % targetShards.size();
Shard chosenShard = orderedShards.get(bucket);
Node targetNode = targetShards.get(chosenShard);

// override the global settings to communicate directly with the target node
settings.setHosts(targetNode.getIpAddress()).setPort(targetNode.getHttpPort());
repository = new RestRepository(settings);
uri = SettingsUtils.nodes(settings).get(0);
```

I was trying to understand why this is being done. My understanding of ES writes was:

- You can send a write request to any node in the cluster.
- The receiving node hashes the document id to determine which shard, and therefore which node, holds the primary.
- The write is then re-routed to the node holding the primary for that document id/hash.

If that's the case (as in data grids like Hazelcast or Coherence, or NoSQL stores like HBase, Cassandra, etc.), why the special logic to find the primaries? (See http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/distrib-write.html.) The primary chosen here may not even be the one that handles the write for my document. Is this being done simply to balance writes uniformly across the cluster? In that case, why not a simple round robin?
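For context, my mental model of the default routing is something like the sketch below. This is a toy illustration only, not Elasticsearch's actual implementation (which, as I understand it, hashes the routing value with Murmur3); `shardFor` and the use of `String.hashCode` are my own assumptions for the example.

```java
import java.util.List;

public class RoutingSketch {
    // Toy stand-in for ES routing: hash the routing value (the document id
    // by default) and take it modulo the number of primary shards. The real
    // implementation uses Murmur3, not String.hashCode.
    static int shardFor(String docId, int numPrimaries) {
        return Math.floorMod(docId.hashCode(), numPrimaries);
    }

    public static void main(String[] args) {
        int numPrimaries = 5;
        for (String id : List.of("doc-1", "doc-2", "doc-3")) {
            System.out.println(id + " -> shard " + shardFor(id, numPrimaries));
        }
    }
}
```

The point being: whichever node first receives the write, the target shard is fully determined by the document id, which is why I'd expect the coordinating node to forward the write regardless of which node es-hadoop picked.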
(See also https://github.com/elasticsearch/elasticsearch-hadoop/edit/1.3/docs/src/reference/asciidoc/core/arch.adoc.)

*Question 2:* Is there a way to hook into the cluster's routing logic? Instead of picking up documents in bulk and sending them to some node in ES, which then resends them to the primaries for the different documents:

- Is there a way to find out which node holds the primary for a given document?
- Could all documents destined for a specific primary be collected and sent directly to that node? (Like Hazelcast's or Coherence's partition awareness: http://www.hazelcast.org/docs/3.0/manual/html-single/#DataAffinity)
- One problem with this is that if the primary dies, the sender has to detect the topology change and resend the data to the new primary. So we're not talking simple clients but smart-client libraries. Perhaps I should be asking: does the Java client library already do this? (I didn't think so.)
- Overall, wouldn't this save a lot of data re-routing over the network?

Thanks.
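To make the idea concrete, here is a hedged sketch of what I mean by partition-aware bulking: group documents by their computed target shard, then send one batch per shard directly to that shard's primary node. The shard calculation is again a toy stand-in for the real Murmur3 routing, and the shard-to-node lookup (here just printed) is assumed to come from somewhere like the cluster state API; none of this is an existing client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionAwareBulk {
    // Toy routing function, standing in for Elasticsearch's Murmur3 routing.
    static int shardFor(String docId, int numPrimaries) {
        return Math.floorMod(docId.hashCode(), numPrimaries);
    }

    // Group document ids by target shard so each group can be sent as one
    // bulk request to the node that holds that shard's primary.
    static Map<Integer, List<String>> groupByShard(List<String> docIds, int numPrimaries) {
        Map<Integer, List<String>> batches = new HashMap<>();
        for (String id : docIds) {
            batches.computeIfAbsent(shardFor(id, numPrimaries), s -> new ArrayList<>())
                   .add(id);
        }
        return batches;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> batches =
            groupByShard(List.of("a", "b", "c", "d", "e"), 3);
        // In the real scenario, each batch would become one bulk request
        // addressed to the primary node for that shard.
        batches.forEach((shard, docs) ->
            System.out.println("shard " + shard + " <- " + docs));
    }
}
```

The open question is whether anything in the ES client stack already does this grouping, and how it would react to shard relocation mid-job.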
