[jira] [Commented] (CASSANDRA-7431) Hadoop integration does not perform reverse DNS lookup correctly on EC2

2014-11-26 Thread Olivier Michallat (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226452#comment-14226452
 ] 

Olivier Michallat commented on CASSANDRA-7431:
--

Just wanted to mention that there is a third option coming soon: Netty 4.1 will 
ship with a built-in DNS client, which also allows reverse lookups (I've tested 
with a nightly build).

In the driver, I'm using the JNDI approach for now, but will switch to Netty 
when we upgrade to 4.1.

 Hadoop integration does not perform reverse DNS lookup correctly on EC2
 ---

 Key: CASSANDRA-7431
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7431
 Project: Cassandra
  Issue Type: Bug
  Components: Hadoop
Reporter: Paulo Motta
Assignee: Paulo Motta
 Attachments: 2.0-CASSANDRA-7431.txt


 The split assignment on AbstractColumnFamilyInputFormat:247 performs a reverse 
 DNS lookup of Cassandra IPs in order to preserve locality in Hadoop (task 
 trackers are identified by hostnames).
 However, the reverse lookup of an EC2 IP does not yield the EC2 hostname of 
 that endpoint when running from an EC2 instance due to the use of 
 InetAddress.getHostName().
 In order to show this, consider the following piece of code:
 {code:title=DnsResolver.java|borderStyle=solid}
 import java.net.InetAddress;

 public class DnsResolver {
     public static void main(String[] args) throws Exception {
         // args[0] is the IP address to resolve; getHostName() performs the reverse lookup.
         InetAddress namenodePublicAddress = InetAddress.getByName(args[0]);
         System.out.println("getHostAddress: " + namenodePublicAddress.getHostAddress());
         System.out.println("getHostName: " + namenodePublicAddress.getHostName());
     }
 }
 {code}
 When this code is run from my machine to perform reverse lookup of an EC2 IP, 
 the output is:
 {code:none}
 ➜  java DnsResolver 54.201.254.99
 getHostAddress: 54.201.254.99
 getHostName: ec2-54-201-254-99.compute-1.amazonaws.com
 {code}
 When this code is executed from inside an EC2 machine, the output is:
 {code:none}
 ➜  java DnsResolver 54.201.254.99
 getHostAddress: 54.201.254.99
 getHostName: 54.201.254.99
 {code}
 However, when using Linux tools such as host or dig, the EC2 hostname is 
 properly resolved from the EC2 instance, so there's some problem with Java's 
 InetAddress.getHostName() and EC2.
 Two consequences of this bug during AbstractColumnFamilyInputFormat split 
 definition are:
 1) If the Hadoop cluster is configured to use EC2 public DNS, the locality 
 will be lost, because Hadoop will try to match the CFIF split location 
 (public IP) with the task tracker location (public DNS), so no matches will 
 be found.
 2) If the Cassandra nodes' broadcast_address is set to public IPs, all Hadoop 
 communication will be done via the public IP, which will incur additional 
 transfer charges. If the public IP is mapped to the EC2 DNS during split 
 definition, when the task is executed, ColumnFamilyRecordReader will resolve 
 the public DNS to the private IP of the instance, so there will be no 
 additional charges.
 A similar bug was filed in the WHIRR project: 
 https://issues.apache.org/jira/browse/WHIRR-128



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7431) Hadoop integration does not perform reverse DNS lookup correctly on EC2

2014-11-25 Thread Olivier Michallat (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224527#comment-14224527
 ] 

Olivier Michallat commented on CASSANDRA-7431:
--

I'm running into this issue with the Java driver as well (context: if both 
client and C* are deployed on EC2, the client should use private addresses for 
C* nodes in the same region, and public addresses for C* nodes in another 
region -- EC2's DNS resolves the right address automatically if you look up 
the node's public hostname; in order to get this public hostname, we do a 
reverse DNS lookup on the IP exposed in {{system.peers.rpc_address}}).

When run from an EC2 instance, a reverse lookup with 
{{InetAddress.getByName(publicIp).getHostName()}} works correctly for instances _in another 
EC2 region_, but fails for instances in the same region. As mentioned by Paulo, 
it returns the unresolved IP, whereas command-line tools like {{host}} or 
{{dig}} correctly resolve to the public hostname. I have no explanation as to 
why it fails with Java, but it appears to be a JDK bug.
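
One possible explanation, offered here as an assumption rather than something confirmed in this thread: the JDK's reverse lookup is forward-confirmed, meaning the name returned by the PTR query is resolved again and is only accepted if one of its addresses equals the original IP. Inside an EC2 region the public hostname resolves to the private IP, so that check would fail and the IP literal would be returned. The sketch below only emulates that check on plain strings; the class name, method name, and the private address {{10.0.0.12}} are hypothetical, and no DNS query is made.

```java
import java.util.Arrays;
import java.util.List;

public class ForwardConfirmSketch {

    /**
     * Emulates a forward-confirmed reverse lookup: the PTR name is accepted
     * only if one of its forward addresses equals the original IP; otherwise
     * the IP literal is returned unchanged.
     */
    static String confirmedHostName(String ip, String ptrName, List<String> forwardAddresses) {
        return forwardAddresses.contains(ip) ? ptrName : ip;
    }

    public static void main(String[] args) {
        String ip = "54.201.254.99";
        String ptr = "ec2-54-201-254-99.compute-1.amazonaws.com";

        // Outside EC2 (or from another region): the name resolves back to the public IP.
        System.out.println(confirmedHostName(ip, ptr, Arrays.asList("54.201.254.99")));

        // Same region (assumed): the name resolves to a private IP, so the check fails.
        System.out.println(confirmedHostName(ip, ptr, Arrays.asList("10.0.0.12")));
    }
}
```

If this is indeed the cause, it would also be consistent with {{host}} and {{dig}} succeeding, since those tools report the PTR record without a forward check.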

The lookup via JNDI (as done in Paulo's patch) works, but the fact that we 
initialize the factory with {{com.sun.jndi.dns.DnsContextFactory}} makes me 
wonder if this is portable to other JDK implementations. Another approach is to 
use [dnsjava|http://www.xbill.org/dnsjava/] (that's what they did in Whirr).



[jira] [Commented] (CASSANDRA-7431) Hadoop integration does not perform reverse DNS lookup correctly on EC2

2014-08-15 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099012#comment-14099012
 ] 

Brandon Williams commented on CASSANDRA-7431:
-

This still doesn't make sense to me:

bq. However, when using Linux tools such as host or dig, the EC2 hostname 
is properly resolved from the EC2 instance, so there's some problem with Java's 
InetAddress.getHostName() and EC2.

Without patching the tools or the system, any program is going to ask the system 
to resolve it, and it's always going to follow the rules in /etc/nsswitch.conf 
and proceed from there (usually files, then dns for hosts).  Before adding this 
I'd like to understand exactly what's different about EC2 here, or if this is 
just a resolution issue.



[jira] [Commented] (CASSANDRA-7431) Hadoop integration does not perform reverse DNS lookup correctly on EC2

2014-06-24 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042883#comment-14042883
 ] 

Paulo Motta commented on CASSANDRA-7431:


The simplest and least intrusive solution (though not the most elegant) is to add a new 
ConfigHelper option cassandra.input.ec2_hostname_resolution (couldn't find a 
better name). When this property is set, the reverse DNS lookup is done 
directly against the DNS server, bypassing InetAddress.getHostName(), which is broken 
on EC2.

In order not to add new dependencies to Cassandra, I implemented a simple DNS 
lookup helper (org.apache.cassandra.utils.DnsUtil) that performs the DNS 
lookup via JNDI. I also added a simple test to make sure reverse DNS lookups 
work as expected.

The DnsUtil lookup helper is based on code found at the following links:
* https://www.captechconsulting.com/blog/david-tiller/accessing-the-dusty-corners-dns-java
* http://stackoverflow.com/questions/7097623/need-to-perform-a-reverse-dns-lookup-of-a-particular-ip-address-in-java
* http://www.codingforums.com/java-jsp/182959-java-reverse-dns-lookup.html
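
For readers without the attachment at hand, a reverse lookup via JNDI can be sketched roughly as follows. This is not the DnsUtil code from the patch: the class and method names are hypothetical, only IPv4 is handled, and it assumes a JDK that ships the {{com.sun.jndi.dns.DnsContextFactory}} provider and falls back to the platform's DNS configuration when no provider URL is set.

```java
import java.util.Hashtable;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class ReverseDnsSketch {

    /** Builds the in-addr.arpa name that is queried for a dotted IPv4 address. */
    static String toReverseName(String ipv4) {
        String[] octets = ipv4.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("Expected a dotted IPv4 address: " + ipv4);
        }
        return octets[3] + "." + octets[2] + "." + octets[1] + "." + octets[0] + ".in-addr.arpa";
    }

    /** Queries the PTR record directly; returns the IP unchanged if there is none. */
    static String reverseLookup(String ipv4) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        // JDK-specific DNS provider; may not exist on other JDK implementations.
        env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
        DirContext ctx = new InitialDirContext(env);
        try {
            Attributes attrs = ctx.getAttributes(toReverseName(ipv4), new String[] {"PTR"});
            Attribute ptr = attrs.get("PTR");
            if (ptr == null) {
                return ipv4;
            }
            String name = ptr.get().toString();
            // PTR values are fully qualified and usually end with a trailing dot.
            return name.endsWith(".") ? name.substring(0, name.length() - 1) : name;
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) throws NamingException {
        // Requires network access to a DNS server.
        System.out.println(reverseLookup(args[0]));
    }
}
```

Because the PTR record is read directly, no forward lookup of the returned name takes place, which is what distinguishes this path from InetAddress.getHostName().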



[jira] [Commented] (CASSANDRA-7431) Hadoop integration does not perform reverse DNS lookup correctly on EC2

2014-06-22 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040220#comment-14040220
 ] 

Paulo Motta commented on CASSANDRA-7431:


I guess Hadoop's InputSplit.getLocations() method (implemented by 
ColumnFamilySplit) expects a list of hostnames to be able to schedule local 
tasks, since task trackers are identified by hostnames, not IPs.

Using only private IPs in Hadoop is not feasible because you may want to access 
task tracker web interfaces from outside EC2, so it's handy to use EC2 public 
DNS (ec2-*.compute-1.amazonaws.com) to identify Hadoop task trackers, since this DNS 
is resolved internally to private IPs and externally to public IPs.

Another issue when the C* cluster uses public IPs as broadcast_address (such as 
with the Ec2MultiRegionSnitch) is that Hadoop tasks will access 
ColumnFamilySplits of non-local tasks via the public IP, which costs $0.01 per 
GB. If the ColumnFamilySplit's locations are EC2 hostnames instead 
(ec2-*.compute-1.amazonaws.com), then they will be internally resolved by 
Amazon to the private IP, lowering transfer costs for non-local tasks.
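
To illustrate why the form of the location string matters, here is a deliberately simplified sketch. It assumes locality is decided by comparing each split location against the task tracker hostnames as plain strings; Hadoop's real scheduler is more involved, and the class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LocalityMatchSketch {

    /** Returns true if any split location is the name of a known task tracker. */
    static boolean hasLocalTracker(String[] splitLocations, Set<String> trackerHosts) {
        for (String location : splitLocations) {
            if (trackerHosts.contains(location)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Task trackers registered under their EC2 public DNS names.
        Set<String> trackers = new HashSet<>(
                Arrays.asList("ec2-54-201-254-99.compute-1.amazonaws.com"));

        // Location left as an unresolved public IP: no tracker matches.
        System.out.println(hasLocalTracker(new String[] {"54.201.254.99"}, trackers));

        // Location mapped to the EC2 hostname: the tracker matches.
        System.out.println(hasLocalTracker(
                new String[] {"ec2-54-201-254-99.compute-1.amazonaws.com"}, trackers));
    }
}
```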



[jira] [Commented] (CASSANDRA-7431) Hadoop integration does not perform reverse DNS lookup correctly on EC2

2014-06-21 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039879#comment-14039879
 ] 

Brandon Williams commented on CASSANDRA-7431:
-

Hmm, why do we even use hostnames at all?
