"VOTE FOR MODI" or teach me how not to get mails -----Original Message----- From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com] On Behalf Of Vinod Kumar Vavilapalli Sent: Sunday, March 23, 2014 12:20 AM To: common-user@hadoop.apache.org Subject: Re: Data Locality Importance
Like you said, it depends on both the kind of network you have and the type of your workload. Given your point about S3, I'd guess your input files/blocks are not large enough for moving code to the data to beat moving the data itself to the code. When that balance tilts a lot, especially with large input files/blocks, data locality will improve performance significantly. The same holds when the read throughput from a remote disk is much lower than (<<) the read throughput from a local disk.

HTH
+Vinod

On Mar 21, 2014, at 7:06 PM, Mike Sam <mikesam...@gmail.com> wrote:

> How important is Data Locality to Hadoop? I mean, if we prefer to
> separate the HDFS cluster from the MR cluster, we will lose data
> locality, but my question is how bad is this, assuming we provide a
> reasonable network connection between the two clusters? EMR kills data
> locality when using S3 as storage, but we do not see a significant job
> time difference running the same job from the HDFS cluster of the same
> setup. So, I am wondering how important is Data Locality to Hadoop in practice?
>
> Thanks,
> Mike
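
For what it's worth, the balance described in the reply above can be put into a back-of-the-envelope sketch. This is only an illustration, not anything Hadoop itself computes: the function names and every number below are hypothetical, and it assumes a task reads its whole block first and then computes, with no overlap between the two.

```python
def read_seconds(block_mb: float, throughput_mb_s: float) -> float:
    """Time to stream one block at a given sustained throughput."""
    return block_mb / throughput_mb_s


def locality_speedup(block_mb: float, local_mb_s: float,
                     remote_mb_s: float, compute_s: float) -> float:
    """Ratio of task time with a remote read to task time with a local read.

    Assumes read and compute do not overlap (a simplification).
    A value close to 1.0 means data locality buys very little.
    """
    local_task = read_seconds(block_mb, local_mb_s) + compute_s
    remote_task = read_seconds(block_mb, remote_mb_s) + compute_s
    return remote_task / local_task


# Hypothetical case 1: fast network, compute-heavy task.
# Remote reads are only slightly slower, so locality hardly matters.
print(locality_speedup(block_mb=128, local_mb_s=100, remote_mb_s=80, compute_s=30))

# Hypothetical case 2: remote throughput << local throughput, I/O-bound task.
# Here the remote read dominates and locality matters a great deal.
print(locality_speedup(block_mb=128, local_mb_s=100, remote_mb_s=10, compute_s=1))
```

With numbers like the first case, a separate storage cluster costs almost nothing, which would match what Mike reports for EMR on S3. With numbers like the second, the same separation makes each task several times slower.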