[ https://issues.apache.org/jira/browse/CASSANDRA-9603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Carl Yeksigian reassigned CASSANDRA-9603:
-----------------------------------------

    Assignee: Carl Yeksigian  (was: Aleksey Yeschenko)

> Expose private listen_address in system.local
> ---------------------------------------------
>
>                 Key: CASSANDRA-9603
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9603
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Piotr Kołaczkowski
>            Assignee: Carl Yeksigian
>             Fix For: 2.1.x
>
> We had hoped CASSANDRA-9436 would add it, yet it added rpc_address instead
> of both rpc_address *and* listen_address. We really need listen_address
> here, because we need the private IP that C* binds to. Knowing it, we could
> better match Spark nodes to C* nodes and process data locally in
> environments where rpc_address != listen_address, such as EC2.
> Spark does not know rpc addresses, nor does it have a concept of a
> broadcast address. It only knows the hostname / IP its workers bind to. In
> cloud environments, these are private IPs. So if we give Spark a set of C*
> nodes identified by rpc_addresses, Spark doesn't recognize them as
> belonging to the same cluster. It treats them as "remote" nodes and has no
> idea where to send tasks optimally.
> Current situation (example):
> Spark worker nodes: [10.0.0.1, 10.0.0.2, 10.0.0.3]
> C* nodes: [10.0.0.1 / node1.blah.ec2.com, 10.0.0.2 / node2.blah.ec2.com,
> 10.0.0.3 / node3.blah.ec2.com]
> What the application knows about the cluster: [node1.blah.ec2.com,
> node2.blah.ec2.com, node3.blah.ec2.com]
> What the application sends to Spark for execution:
> Task1 - please execute on node1.blah.ec2.com
> Task2 - please execute on node2.blah.ec2.com
> Task3 - please execute on node3.blah.ec2.com
> How Spark understands it: "I have no idea what node1.blah.ec2.com is, let's
> assign Task1 to a *random* node" :(
>
> Expected:
> Spark worker nodes: [10.0.0.1, 10.0.0.2, 10.0.0.3]
> C* nodes: [10.0.0.1 / node1.blah.ec2.com, 10.0.0.2 / node2.blah.ec2.com,
> 10.0.0.3 / node3.blah.ec2.com]
> What the application knows about the cluster: [10.0.0.1 /
> node1.blah.ec2.com, 10.0.0.2 / node2.blah.ec2.com, 10.0.0.3 /
> node3.blah.ec2.com]
> What the application sends to Spark for execution:
> Task1 - please execute on node1.blah.ec2.com or 10.0.0.1
> Task2 - please execute on node2.blah.ec2.com or 10.0.0.2
> Task3 - please execute on node3.blah.ec2.com or 10.0.0.3
> How Spark understands it: "10.0.0.1? I have a worker on that node, let's
> put Task1 there"

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
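The matching logic the example above argues for can be sketched in a few lines. This is a minimal illustration, not connector code: the hostnames and IPs are the hypothetical values from the example, and the `listen_address` column name is assumed (it is what this ticket requests, not something system.local exposed at the time).

```python
# Sketch: why advertising the private listen_address lets Spark place tasks
# locally. Addresses/hostnames are the hypothetical ones from the example.
# Once exposed, a client could read it with something like
#   SELECT rpc_address, listen_address FROM system.local;   (hypothetical)

# IPs the Spark workers bind to (private IPs in a cloud environment).
spark_workers = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}

# Per-node addresses as a driver could see them if listen_address were
# exposed alongside rpc_address.
cassandra_nodes = [
    {"rpc_address": "node1.blah.ec2.com", "listen_address": "10.0.0.1"},
    {"rpc_address": "node2.blah.ec2.com", "listen_address": "10.0.0.2"},
    {"rpc_address": "node3.blah.ec2.com", "listen_address": "10.0.0.3"},
]

def preferred_locations(node):
    """Advertise both addresses so Spark can match either one
    against the IP its worker is bound to."""
    return [node["rpc_address"], node["listen_address"]]

# With only rpc_address, no location matches any worker's bind IP,
# so every task falls back to a random node.
rpc_only_hits = [n["rpc_address"] in spark_workers for n in cassandra_nodes]

# With listen_address included, every task finds a co-located worker.
both_hits = [
    any(loc in spark_workers for loc in preferred_locations(n))
    for n in cassandra_nodes
]
```

Under these assumptions, `rpc_only_hits` is all `False` (random placement) while `both_hits` is all `True` (local placement), which is exactly the "Current" vs. "Expected" contrast in the example.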